### Iowa Housing Lab -- Data Encoding

Welcome! This lab is a more advanced version of last class, where we built a regression model to predict housing prices. This time the dataset has a more interesting mix: numeric and categorical columns, plus some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1): Load in your data set**

In [14]:
# your code here
import pandas as pd
import numpy as np
df = pd.read_csv('../data/iowa_housing/train.csv')

**Step 2): There are missing values throughout this dataset. Fill them in appropriately**

We already covered this in class, but as a reminder:

 - Decide whether the values are missing at random or for a structural reason.
 - Where possible, record the missingness itself as a feature before filling the gaps in.
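One way to approach the first question is to summarize how much is missing per column and check whether related columns are missing on the same rows. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not the real housing data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count and fraction of missing values per column, keeping only affected columns.
missing_count = toy.isnull().sum()
missing_summary = pd.DataFrame({
    "n_missing": missing_count,
    "frac_missing": missing_count / len(toy),
})
missing_summary = missing_summary[missing_summary["n_missing"] > 0]
print(missing_summary)

# If two columns are always missing on the same rows, the gaps are probably
# structural (for example, the house has no garage) rather than random.
same_rows = (toy["GarageType"].isnull() == toy["GarageYrBlt"].isnull()).all()
print(same_rows)
```

On the real data you would run the same checks on `df` and compare the result against the column descriptions linked above.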

In [15]:
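# we'll first mark the missing values as such
def denote_null_values(df):
    """Add a boolean `<col>_missing` column for every column that has nulls."""
    # boolean Series indexed by column name: True where the column has any null
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        # True on the rows where the original value was missing
        df[col_name] = pd.isnull(df[col])
    return df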

df = denote_null_values(df)

In [16]:
# your code here
# the missing values are not random -- fill text columns with 'None' and numeric columns with 0
missing_cols_query = df.isnull().sum() > 0
missing_cols_num = df.loc[:, missing_cols_query].select_dtypes(include=np.number).columns.tolist()
# np.object was removed from NumPy; the string 'object' selects the same columns
missing_cols_cat = df.loc[:, missing_cols_query].select_dtypes(include='object').columns.tolist()
df[missing_cols_num] = df[missing_cols_num].fillna(0)
df[missing_cols_cat] = df[missing_cols_cat].fillna('None')

**Step 3): Encode Your Categorical Data**

For now, you can choose which encoding technique you want to use. Later on you'll go back and check whether the choice made a large difference.
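To make the choice concrete, here is a minimal sketch of two common options, integer category codes and one-hot encoding, applied to a made-up column. The frame `toy` is an assumption for illustration; only the column name is borrowed from the dataset.

```python
import pandas as pd

# Toy column standing in for a categorical feature (values chosen for illustration).
toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL", "FV"]})

# Option 1: integer category codes. One column, but the integers imply an
# ordering (here alphabetical) that may not mean anything.
codes = toy["MSZoning"].astype("category").cat.codes

# Option 2: one-hot encoding. One indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["MSZoning"])

print(codes.tolist())
print(one_hot.columns.tolist())
```

Category codes keep the frame narrow, which tree-based models usually tolerate; one-hot encoding avoids the artificial ordering at the cost of more columns.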

In [17]:
# your answer here -- we'll use integer category codes
# np.object was removed from NumPy; the string 'object' selects the same columns
cat_cols = df.select_dtypes(include='object').columns.tolist()
df[cat_cols] = df[cat_cols].astype('category')
for col in cat_cols:
    df[col] = df[col].cat.codes

In [18]:
df.head()

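| Unnamed: 0 | Id | MSSubClass | MSZoning | LotArea | Neighborhood | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | ... | FullBath | HalfBath | GarageType | GarageYrBlt | GarageFinish | GarageCars | SalePrice | GarageType_missing | GarageYrBlt_missing | GarageFinish_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 3 | 8450 | 5 | 7 | 5 | 2003 | 1710 | 856 | ... | 2 | 1 | 1 | 2003.0 | 2 | 2 | 208500 | False | False | False |
| 1 | 2 | 20 | 3 | 9600 | 24 | 6 | 8 | 1976 | 1262 | 1262 | ... | 2 | 0 | 1 | 1976.0 | 2 | 2 | 181500 | False | False | False |
| 2 | 3 | 60 | 3 | 11250 | 5 | 7 | 5 | 2001 | 1786 | 920 | ... | 2 | 1 | 1 | 2001.0 | 2 | 2 | 223500 | False | False | False |
| 3 | 4 | 70 | 3 | 9550 | 6 | 7 | 5 | 1915 | 1717 | 961 | ... | 1 | 0 | 5 | 1998.0 | 3 | 3 | 140000 | False | False | False |
| 4 | 5 | 60 | 3 | 14260 | 15 | 8 | 5 | 2000 | 2198 | 1145 | ... | 2 | 1 | 1 | 2000.0 | 2 | 3 | 250000 | False | False | False |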


**Step 4):  Declare X & y, and fit your model**

In [20]:
# your code here
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor()

**Step 5):  Score your model, and look at your feature importances** 

In [21]:
# your code here
gbm.fit(X, y)
# R^2 on the same data the model was trained on
gbm.score(X, y)

0.9429450175233424
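The step also asks for feature importances. A minimal sketch of how they can be read off a fitted gradient boosting model is shown here on synthetic data; the names `X_toy`, `y_toy`, `signal`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)

# feature_importances_ is aligned with the column order of the training frame,
# so pairing it with the column names gives a readable, sorted ranking.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the lab, the same pattern should work with `gbm.feature_importances_` and `X.columns` once `gbm` has been fit.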

**Step 6):  (Time Permitting) Re-encode your categorical variables using the other technique (for example, one-hot encoding instead of integer category codes), and observe whether it made a difference**

In [None]:
# see class discussion about this

If you've made it this far, you can stop. We'll discuss Step 7 together to wrap up the class and lead into the next session.

**Step 7):  Score your model on your validation set**

How much did your results change?

In [22]:
# your answer here
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1985)

In [23]:
gbm.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [24]:
# the validation score is a bit lower than the training score from Step 5
gbm.score(X_val, y_val)

0.885057286892258
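To see the gap directly, it can help to score the same fitted model on both splits side by side. The sketch below does this on synthetic data from `make_regression`, so the numbers will not match the housing results; all variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X_toy, y_toy = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1985)

# Fit on the training split only, then score both splits with the same model.
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)
val_r2 = model.score(X_va, y_va)

print(f"train R^2: {train_r2:.3f}")
print(f"val R^2:   {val_r2:.3f}")
print(f"gap:       {train_r2 - val_r2:.3f}")
```

A large gap between the two scores suggests the model is fitting noise in the training data; the validation score is the better estimate of how it will do on unseen houses.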