### Scaling Features

### Why so hard?

$$ \widehat{\text{price}} = 8 \cdot \text{num\_of\_bathrooms} + 12 \cdot \text{sq\_feet} $$


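Given a model like this, which feature matters more? It is tempting to compare the coefficients 8 and 12 directly, but each coefficient depends on the units its feature is measured in, so the comparison can mislead.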

### Pulling our data

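Let's start by loading our Airbnb data again.

In [2]:
import pandas as pd
df_bathrooms_sq_feet_price = pd.read_csv('b_room_sq_feet_price.csv', index_col=0)

In [4]:
# Features are every column except the last; the target (price) is the last column
X = df_bathrooms_sq_feet_price.iloc[:, :-1]
y = df_bathrooms_sq_feet_price.iloc[:, -1]
X[:2]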

| Unnamed: 0 | bathrooms | square_feet |
|---|---|---|
| 2 | 1.0 | 720.0 |
| 3 | 1.0 | 0.0 |


Now, we fit our model with the following:

In [5]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
coefs = model.coef_
# each additional bathroom raises the predicted price by about 45.09455
float(coefs[0])

# each additional square foot raises the predicted price by about .037
float(coefs[1])

0.037932016223360364

### Introducing a problem

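Now let's convert our area data from square feet to square meters.

In [6]:
# 1 square meter is about 10.764 square feet
X_meters = X.assign(square_meters=X['square_feet'] / 10.764).drop(columns='square_feet')


And now let's refit our model.

In [3]:
model_meters = LinearRegression().fit(X_meters, y)
model_meters.coef_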
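
To see the problem in isolation, here is a minimal sketch that fits the same data twice, once with area in square feet and once in square meters. The data is synthetic and hypothetical (it is not the Airbnb dataset), and the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

SQ_FEET_PER_SQ_METER = 10.764

# Synthetic stand-in data (hypothetical, not the Airbnb dataset)
rng = np.random.default_rng(0)
bathrooms = rng.integers(1, 4, size=200).astype(float)
square_feet = rng.uniform(300, 2000, size=200)
price = 45 * bathrooms + 0.04 * square_feet + rng.normal(0, 5, size=200)

# Same listings, with area expressed in two different units
X_feet = np.column_stack([bathrooms, square_feet])
X_meters = np.column_stack([bathrooms, square_feet / SQ_FEET_PER_SQ_METER])

model_feet = LinearRegression().fit(X_feet, price)
model_meters = LinearRegression().fit(X_meters, price)

# The area coefficient grows by the conversion factor...
ratio = model_meters.coef_[1] / model_feet.coef_[1]
# ...while the predictions themselves do not change.
same_predictions = np.allclose(model_feet.predict(X_feet),
                               model_meters.predict(X_meters))
print(ratio, same_predictions)
```

The area coefficient becomes roughly 10.764 times larger even though nothing about the listings changed, so the raw size of a coefficient cannot tell us how important a feature is.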

### Our fix

What we really want to know when judging feature importance is the following:

* How much does our dependent variable change given an expected amount of movement in the feature?
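
One way to make this concrete is to treat one standard deviation as the "expected amount of movement" and multiply it by the coefficient. The sketch below uses synthetic, hypothetical data and a helper named `effect_of_one_std` that is introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, not the Airbnb dataset)
rng = np.random.default_rng(1)
square_feet = rng.uniform(300, 2000, size=200)
price = 0.04 * square_feet + rng.normal(0, 5, size=200)


def effect_of_one_std(feature, target):
    """Predicted change in target when feature moves by one standard deviation."""
    model = LinearRegression().fit(feature.reshape(-1, 1), target)
    return model.coef_[0] * feature.std()


# The same feature measured in two different units
effect_feet = effect_of_one_std(square_feet, price)
effect_meters = effect_of_one_std(square_feet / 10.764, price)
print(effect_feet, effect_meters)
```

Both calls print the same number: unlike the raw coefficient, the effect of a one standard deviation move does not depend on the units of the feature.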



### Reviewing the Z-score

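Here's the formula for translating each value of a feature into the number of standard deviations it lies from that feature's mean.

$z = \frac{X - \bar{X}}{\sigma}$

Here $\bar{X}$ is the mean of the feature and $\sigma$ is its standard deviation.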

In [44]:
from scipy.stats import zscore

array([ 0.50623365, -0.92538518, -0.92538518])
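
As a quick check of the formula, the sketch below computes z-scores by hand and compares them with `scipy.stats.zscore`. The input values are hypothetical and are not the values that produced the output above.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical square-footage values, for illustration only
values = np.array([720.0, 0.0, 1100.0, 450.0, 980.0])

# z = (X - mean) / standard deviation, using the population standard deviation
manual = (values - values.mean()) / values.std()

# zscore uses the population standard deviation (ddof=0) by default
from_scipy = zscore(values)
print(manual)
print(from_scipy)
```

After the transformation the values have a mean of 0 and a standard deviation of 1, whatever units they started in.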

### Using SKLearn

The scikit-learn library has its own transformer, StandardScaler, for converting each of our variables into its respective z-scores.

In [11]:
from sklearn.preprocessing import StandardScaler

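In [4]:
# fit learns each column's mean and standard deviation;
# fit_transform learns them and converts every column to z-scores in one step
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_bathrooms_sq_feet_price)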
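
The difference between `fit` and `fit_transform` can be sketched on a small array. The numbers below are hypothetical stand-ins for bathrooms and square feet, not rows from the Airbnb data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of (bathrooms, square_feet), for illustration only
data = np.array([[1.0, 720.0],
                 [1.0, 0.0],
                 [2.0, 1100.0],
                 [3.0, 1500.0]])

# Two steps: fit learns each column's mean and standard deviation,
# transform then applies them.
scaler = StandardScaler()
scaler.fit(data)
two_step = scaler.transform(data)

# One step: fit_transform does both at once.
one_step = StandardScaler().fit_transform(data)

print(scaler.mean_)   # per-column means learned by fit
print(scaler.scale_)  # per-column standard deviations learned by fit
print(one_step)
```

Keeping the fitted `scaler` object around is useful because the same learned means and standard deviations can later be applied to new data or reversed.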

Now we can model with our scaled data.

In [16]:
scaled_X = scaled_data[:, :-1]
scaled_y = scaled_data[:, -1]

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, scaled_y, test_size=0.33, random_state=42)

In [19]:
scaled_model = LinearRegression()
scaled_model.fit(X_train, y_train)
scaled_model.coef_

array([0.2977407 , 0.36505035])

### Undoing our changes

In [5]:
# scaler.inverse_transform()
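
A minimal sketch of undoing the scaling, assuming the scaler was fitted on an array whose last column is the price. The data is hypothetical, and the manual conversion of a single column relies on the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of (bathrooms, square_feet, price), for illustration only
data = np.array([[1.0, 720.0, 100.0],
                 [1.0, 0.0, 60.0],
                 [2.0, 1100.0, 180.0],
                 [3.0, 1500.0, 240.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# inverse_transform maps every column back to its original units
restored = scaler.inverse_transform(scaled)

# A single column (here the last one, price) can be converted back by hand:
# original = z * standard deviation + mean
price_restored = scaled[:, -1] * scaler.scale_[-1] + scaler.mean_[-1]
print(restored)
print(price_restored)
```

The manual form is handy for predictions from a model trained on a scaled target, since those predictions have only one column and cannot be passed directly to a scaler fitted on three.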