### Scaling Features

### Why so hard?

$$ \hat{price} = 30 \cdot num\_of\_bathrooms + 12 \cdot sq\_feet $$


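What's so hard about judging which feature matters more? Each coefficient is tied to the units of its feature, so comparing 30 with 12 directly tells us little.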

### Pulling our data

Let's start by retrieving our Airbnb data again.

In [4]:
import pandas as pd
df_bathrooms_sq_feet_price = pd.read_csv('b_room_sq_feet_price.csv', index_col=0)
X = df_bathrooms_sq_feet_price.iloc[:, :-1]
y = df_bathrooms_sq_feet_price.iloc[:, -1]

In [5]:
X[:2]

Unnamed: 0,bathrooms,square_feet
2,1.0,720.0
3,1.0,0.0


Now, we fit our model with the following:

In [6]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
coefs = model.coef_
float(coefs[0])
# 45.09455

# price increases by about .038 for each additional square foot
float(coefs[1])

0.037932016223360364

### Introducing a problem

Now let's convert our area data from square feet to square meters.

In [13]:
# 1 square meter is about 10.764 square feet
square_meters = df_bathrooms_sq_feet_price.square_feet/10.764
df_bathrooms_sq_feet_price['square_meters'] = square_meters

And now let's refit our model.

In [16]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X_sq_meters = df_bathrooms_sq_feet_price[['bathrooms', 'square_meters']]
model.fit(X_sq_meters, y)
coefs = model.coef_
float(coefs[0])
# 45.09455

# price now increases by about .408 for each additional square meter
float(coefs[1])

0.408300222628251
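
The jump from roughly .038 to roughly .408 is just the conversion factor: $0.0379 \cdot 10.764 \approx 0.408$. The relationship between area and price has not changed, only the units the coefficient is expressed in. A minimal sketch of the same effect, using synthetic data as a stand-in (the arrays below are made up, not the Airbnb data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption for illustration, not the Airbnb set)
rng = np.random.default_rng(0)
bathrooms = rng.integers(1, 4, size=200).astype(float)
square_feet = rng.uniform(200, 2000, size=200)
price = 45 * bathrooms + 0.04 * square_feet + rng.normal(0, 5, size=200)

# Same features, with area expressed in two different units
X_feet = np.column_stack([bathrooms, square_feet])
X_meters = np.column_stack([bathrooms, square_feet / 10.764])

coef_feet = LinearRegression().fit(X_feet, price).coef_
coef_meters = LinearRegression().fit(X_meters, price).coef_

# The bathrooms coefficient is unchanged; the area coefficient
# is multiplied by the conversion factor
ratio = coef_meters[1] / coef_feet[1]
print(coef_feet, coef_meters, ratio)
```

So a raw coefficient can be made larger or smaller simply by changing units, which is why it is a poor measure of feature importance.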

### Our fix

What we really want to know when judging feature importance is the following:

* How much does our dependent variable change given a typical amount of movement in the feature (for example, one standard deviation)?



### Reviewing the Z-score

Here's the formula for translating each value of a feature into the number of standard deviations it sits from that feature's mean.

$z = \frac{X - \bar{X}}{\sigma}$
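
As a quick sketch of the formula, the Z-score can be computed by hand with NumPy and compared against `scipy.stats.zscore`. The sample values are made up for illustration. Both `np.std` and `zscore` use the population standard deviation (`ddof=0`) by default, so the two should agree.

```python
import numpy as np
from scipy.stats import zscore

# Made-up square footage values, for illustration only
values = np.array([720.0, 0.0, 0.0, 1023.0, 108.0])

# z = (X - mean) / standard deviation
manual_z = (values - values.mean()) / values.std()

# scipy's zscore should agree with the manual calculation
scipy_z = zscore(values)
print(np.allclose(manual_z, scipy_z))
```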

In [18]:
from scipy.stats import zscore

zscore(df_bathrooms_sq_feet_price.square_feet)[0:3]

array([ 0.50623365, -0.92538518, -0.92538518])

In [19]:
zscore(df_bathrooms_sq_feet_price.square_meters)[0:3]

array([ 0.50623365, -0.92538518, -0.92538518])
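
The two outputs match because a change of units cancels out of the Z-score. As a short derivation, write $\bar{X}$ for the mean of $X$ and $c = 10.764$ for the conversion factor, so that $X' = X / c$ is the feature in square meters:

$$ \bar{X'} = \frac{\bar{X}}{c}, \qquad \sigma' = \frac{\sigma}{c} $$

$$ z' = \frac{X' - \bar{X'}}{\sigma'} = \frac{X/c - \bar{X}/c}{\sigma/c} = \frac{X - \bar{X}}{\sigma} = z $$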

### Using scikit-learn

The scikit-learn library has its own tool, `StandardScaler`, for converting each of our feature variables into its respective Z-scores.

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
scaler = StandardScaler()


In [22]:
scaler.fit(df_bathrooms_sq_feet_price)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [24]:
df_bathrooms_sq_feet_price[0:3]

Unnamed: 0,bathrooms,square_feet,price,square_meters
2,1.0,720.0,90.0,66.889632
3,1.0,0.0,26.0,0.0
8,1.0,0.0,90.0,0.0


In [35]:
scaled_data = scaler.transform(df_bathrooms_sq_feet_price)

In [51]:
# fit_transform combines the separate fit and transform steps into one call
scaled_data = scaler.fit_transform(df_bathrooms_sq_feet_price)
scaled_data

array([[-0.35443162,  0.50623365,  0.37080234,  0.50623365],
       [-0.35443162, -0.92538518, -0.92121207, -0.92538518],
       [-0.35443162, -0.92538518,  0.37080234, -0.92538518],
       ...,
       [ 2.49402797,  1.10870657, -0.84046117,  1.10870657],
       [-0.35443162, -0.71064236, -0.88083662, -0.71064236],
       [-0.35443162, -0.71064236, -0.73952254, -0.71064236]])
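
As a sanity check, each column produced by `StandardScaler` should match `scipy.stats.zscore` applied to that column, and the fitted scaler exposes the per-column means and standard deviations as `mean_` and `scale_`. A sketch with a small made-up frame (hypothetical values, not the lesson's CSV):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Small made-up frame standing in for the lesson's data
df = pd.DataFrame({
    'bathrooms': [1.0, 1.0, 1.0, 2.0, 1.5],
    'square_feet': [720.0, 0.0, 0.0, 1023.0, 108.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Column-wise z-scores computed with scipy for comparison
expected = zscore(df.to_numpy(), axis=0)

print(np.allclose(scaled, expected))
# Per-column mean and (population) standard deviation learned during fit
print(scaler.mean_, scaler.scale_)
```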

Now we can model with our scaled data.

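In [36]:
# drop the last column (square_meters), which duplicates square_feet once scaled
scaled_data_without_dup = scaled_data[:, :-1]

In [44]:
# remaining columns are bathrooms, square_feet, price: the last one is our target
scaled_X = scaled_data_without_dup[:, :-1]
scaled_y = scaled_data_without_dup[:, -1]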

In [45]:
from sklearn.model_selection import train_test_split

In [46]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, scaled_y, test_size=0.33, random_state=42)

In [47]:
scaled_model = LinearRegression()
scaled_model.fit(X_train, y_train)
scaled_model.coef_

array([0.2977407 , 0.36505035])
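
Because both features are now measured in standard deviations, these coefficients can be compared directly: a one-standard-deviation increase in square feet goes with about a 0.37 standard-deviation increase in price, versus about 0.30 for bathrooms.

The cells above scale the whole dataset, target included, before splitting. A common alternative is to scale only the features and fit the scaler on the training split, by wrapping both steps in a pipeline. The sketch below uses synthetic data and is an assumption about how one might do this, not the lesson's method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Airbnb set)
rng = np.random.default_rng(42)
bathrooms = rng.integers(1, 4, size=300).astype(float)
square_feet = rng.uniform(200, 2000, size=300)
X = np.column_stack([bathrooms, square_feet])
y = 45 * bathrooms + 0.04 * square_feet + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# The pipeline fits the scaler on the training features only,
# then fits the regression on the scaled training features
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# Coefficients are in price units per one standard deviation of each feature
scaled_coefs = pipeline.named_steps['linearregression'].coef_
print(scaled_coefs, pipeline.score(X_test, y_test))
```

Since the target is left unscaled here, the coefficients read as the change in price per one standard deviation of each feature, rather than in standard deviations of price.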

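### Undoing our changes

The fitted scaler stores each column's mean and standard deviation, so `inverse_transform` can convert the scaled values back to their original units.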

In [49]:
scaler.inverse_transform(scaled_data)

array([[1.00000000e+00, 7.20000000e+02, 9.00000000e+01, 6.68896321e+01],
       [1.00000000e+00, 0.00000000e+00, 2.60000000e+01, 0.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 9.00000000e+01, 0.00000000e+00],
       ...,
       [2.00000000e+00, 1.02300000e+03, 3.00000000e+01, 9.50390190e+01],
       [1.00000000e+00, 1.08000000e+02, 2.80000000e+01, 1.00334448e+01],
       [1.00000000e+00, 1.08000000e+02, 3.50000000e+01, 1.00334448e+01]])
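
A quick way to confirm the round trip, sketched on a small made-up array (hypothetical values), is to compare the restored data with the original using `np.allclose`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: bathrooms, square_feet, price
original = np.array([
    [1.0, 720.0, 90.0],
    [1.0, 0.0, 26.0],
    [2.0, 1023.0, 30.0],
    [1.0, 108.0, 28.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(original)

# inverse_transform multiplies by each column's std and adds back its mean
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, original))
```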