## Scikit Learn Preprocessing

In this notebook, we'll use `sklearn.preprocessing` to impute missing values and scale a numeric column. If you need to prepare data for machine learning or feature extraction, the [sklearn.preprocessing documentation](http://scikit-learn.org/stable/modules/preprocessing.html) has great examples.

In [1]:
from sklearn import preprocessing
import pandas as pd
from datetime import datetime

In [59]:
hvac = pd.read_csv('../data/HVAC_with_nulls.csv')

## Checking Data Quality

In [60]:
hvac.dtypes

Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object

In [61]:
hvac.shape

(8000, 8)

In [62]:
hvac.head()

     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID   10
0  6/1/13  0:00:01        66.0          58      13       20.0           4  NaN
1  6/2/13  1:00:01         NaN          68       3       20.0          17  NaN
2  6/3/13  2:00:01        70.0          73      17       20.0          18  NaN
3  6/4/13  3:00:01        67.0          63       2        NaN          15  NaN
4  6/5/13  4:00:01        68.0          74      16        9.0           3  NaN
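
Before imputing, it helps to count how many values are missing in each column. The sketch below retypes a few columns of the five rows shown above so that it runs without the CSV file; the names `sample`, `null_counts` and `all_null_columns` are introduced here for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# A few columns of the five rows printed by hvac.head(), retyped by hand
sample = pd.DataFrame({
    'TargetTemp': [66.0, np.nan, 70.0, 67.0, 68.0],
    'ActualTemp': [58, 68, 73, 63, 74],
    'SystemAge': [20.0, 20.0, 20.0, np.nan, 9.0],
    '10': [np.nan] * 5,
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Columns with no observed values at all (in these five rows)
all_null_columns = [col for col in sample.columns if sample[col].isnull().all()]
print(all_null_columns)
```

On the full dataset the same calls would be `hvac.isnull().sum()`. A column that is empty in every row has no mean to impute from, so it is usually a candidate for dropping rather than filling.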


## Impute missing values with the mean

In [63]:
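from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, removed in scikit-learn 0.22
imp = SimpleImputer(strategy='mean')  # missing_values defaults to NaN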

In [64]:
hvac_numeric = hvac[['TargetTemp', 'SystemAge']]
hvac_numeric.head()

   TargetTemp  SystemAge
0        66.0       20.0
1         NaN       20.0
2        70.0       20.0
3        67.0        NaN
4        68.0        9.0


In [65]:
# First ten rows (labels 0 through 9)
hvac_numeric.head(10)

   TargetTemp  SystemAge
0        66.0       20.0
1         NaN       20.0
2        70.0       20.0
3        67.0        NaN
4        68.0        9.0
5        67.0       28.0
6        70.0       24.0
7         NaN       26.0
8        66.0        9.0
9        65.0        5.0


In [66]:
#imp = imp.fit(hvac_numeric.loc[:10])


In [67]:
transformed = imp.fit_transform(hvac_numeric)

In [68]:
transformed

array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])

In [69]:
hvac['TargetTemp'], hvac['SystemAge'] = transformed[:,0], transformed[:,1]

In [70]:
hvac.head()

     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID   10
0  6/1/13  0:00:01        66.0          58      13       20.0           4  NaN
1  6/2/13  1:00:01   67.507735          68       3       20.0          17  NaN
2  6/3/13  2:00:01        70.0          73      17       20.0          18  NaN
3  6/4/13  3:00:01        67.0          63       2  15.386643          15  NaN
4  6/5/13  4:00:01        68.0          74      16        9.0           3  NaN
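
To see what mean imputation does, the sketch below runs it on just the ten rows of `TargetTemp` and `SystemAge` listed earlier. It uses `SimpleImputer` from `sklearn.impute`, which is the current home of this functionality in scikit-learn 0.22 and later. The names `rows`, `imputer` and `filled` are illustrative. Because only ten rows are used, the fill values here (67.375 and about 17.89) differ from the full-data means of 67.507735 and 15.386643 shown above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The ten rows of TargetTemp and SystemAge listed earlier in the notebook
rows = pd.DataFrame({
    'TargetTemp': [66.0, np.nan, 70.0, 67.0, 68.0, 67.0, 70.0, np.nan, 66.0, 65.0],
    'SystemAge': [20.0, 20.0, 20.0, np.nan, 9.0, 28.0, 24.0, 26.0, 9.0, 5.0],
})

imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(rows)

# statistics_ holds the per-column means learned during fit (NaNs are skipped)
print(imputer.statistics_)

# Each NaN is replaced by its column mean; observed values are left untouched
print(filled[1, 0], filled[3, 1])
```

Calling `fit` on one set of rows and `transform` on another applies the first set's means to the second, which is the usual way to avoid leaking test-set statistics into training.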


## Scale temperature values

In [71]:
hvac['ScaledTemp'] = preprocessing.scale(hvac['ActualTemp'])



In [72]:
hvac['ScaledTemp'].head()

0   -1.293272
1    0.048732
2    0.719733
3   -0.622270
4    0.853934
Name: ScaledTemp, dtype: float64
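
`preprocessing.scale` standardizes a column: it subtracts the column mean and divides by the standard deviation, so the result has zero mean and unit variance. The sketch below checks this by hand on the five `ActualTemp` values from the first rows of the data. The names `temps`, `scaled` and `manual` are illustrative, and the numbers differ from the `ScaledTemp` output above because the mean and standard deviation are computed from five rows rather than the whole column.

```python
import numpy as np
from sklearn import preprocessing

# ActualTemp values from the first five rows of the notebook's data
temps = np.array([58, 68, 73, 63, 74], dtype=float)

scaled = preprocessing.scale(temps)

# Same computation by hand; np.std defaults to the population formula (ddof=0)
manual = (temps - temps.mean()) / temps.std()

print(scaled)
print(np.allclose(scaled, manual))
```

Note that pandas' `Series.std()` defaults to `ddof=1`, so reproducing the result with pandas requires `std(ddof=0)`.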

## Scale using a min-max scaler

In [73]:
min_max_scaler = preprocessing.MinMaxScaler()

In [74]:
temp_minmax = min_max_scaler.fit_transform(hvac[['ActualTemp']])

In [75]:
temp_minmax

array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])
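
With its default `feature_range` of (0, 1), `MinMaxScaler` maps each value `x` to `(x - min) / (max - min)`. The first three outputs above (0.12, 0.52, 0.72) for temperatures 58, 68 and 73 are consistent with a column minimum of 55 and a maximum of 80. Those two endpoints are inferred from the output, not stated in the notebook, so the sketch below adds them to the sample as an assumption. The names `temps`, `scaler`, `scaled`, `manual` and `restored` are illustrative.

```python
import numpy as np
from sklearn import preprocessing

# 58, 68, 73, 63, 74 are ActualTemp values from the notebook's first rows.
# 55 and 80 are ASSUMED column extremes, inferred from the scaled outputs.
temps = np.array([[58], [68], [73], [63], [74], [55], [80]], dtype=float)

scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(temps)

# Same computation by hand
manual = (temps - temps.min()) / (temps.max() - temps.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, temps))
```

Unlike standardization, min-max scaling is determined entirely by the two extreme values, so a single outlier can compress every other value into a narrow band.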

### Exercise: add `temp_minmax` back to the DataFrame as a new column

In [77]:
# %load ../solutions/preprocessing.py
hvac['MinMaxScaledTemp'] = temp_minmax[:,0]
hvac['MinMaxScaledTemp'].head()


0    0.12
1    0.52
2    0.72
3    0.32
4    0.76
Name: MinMaxScaledTemp, dtype: float64