### Iowa Housing Lab -- Data Encoding

Welcome! This lab picks up where we left off last class. We will again build a regression model, but this time with new features added, cross-validation to evaluate our scores, and our encoding steps built into pipelines.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in your data set**

In [11]:
# your code here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
import category_encoders as ce
from sklearn.pipeline import make_pipeline
df = pd.read_csv('../../data/iowa_train2.csv')

**Step 2).  There are missing values throughout this dataset.  Fill them in appropriately**

We covered this in class, but as a reminder:

 - Are the missing values random or not?
 - Encode them as missing if possible
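
One way to approach the first question is to compare the target between rows that have a value and rows that do not. The sketch below uses a small made-up frame (the `toy` data is an assumption, not the real housing file) to show the idea:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "SalePrice": [208500, 120000, 181500, 110000],
})

# Share of missing values per column, keeping only columns with gaps.
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0]

# If the target differs noticeably between rows with and without a value,
# the gap probably carries information and is worth flagging.
price_by_missing = toy.groupby(toy["GarageType"].isnull())["SalePrice"].mean()
```

A large gap between the two group means suggests the values are not missing at random, which is the case where a `_missing` indicator column tends to help.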

In [21]:
# we'll first mark the missing values as such
def denote_null_values(df):
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

df = denote_null_values(df)

In [22]:
# your code here
# they are not random -- will fill with 'None' and 0
missing_cols_query = df.isnull().sum() > 0
missing_cols_num = df.loc[:, missing_cols_query].select_dtypes(include=np.number).columns.tolist()
# `np.object` was removed in NumPy 1.24; the string 'object' selects the same columns
missing_cols_cat = df.loc[:, missing_cols_query].select_dtypes(include='object').columns.tolist()
df[missing_cols_num] = df[missing_cols_num].fillna(0)
df[missing_cols_cat] = df[missing_cols_cat].fillna('None')

**Step 3).  Create A Pipeline With Your Model And The Column Encoder of Your Choice**

For now, choose whichever encoding technique you prefer. Later on you'll go back and check whether it made a large difference.
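
To get a feel for what the choice means, here is a minimal sketch of the two most common options, written in plain pandas rather than `category_encoders` so it runs on its own. The `zoning` series is made up, and the numbering only approximates what an ordinal encoder produces inside the pipeline:

```python
import pandas as pd

# A made-up categorical column.
zoning = pd.Series(["RL", "RM", "RL", "FV"], name="MSZoning")

# Ordinal-style: one integer per category, in order of first appearance.
codes, categories = pd.factorize(zoning)
ordinal = pd.Series(codes + 1, name="MSZoning")

# One-hot: one indicator column per category.
one_hot = pd.get_dummies(zoning, prefix="MSZoning")
```

Ordinal encoding keeps a single column, which tree-based models usually handle well. One-hot encoding adds a column per category, which can get wide for columns like `Neighborhood`.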

In [12]:
# your answer here
pipe = make_pipeline(ce.OrdinalEncoder(), GradientBoostingRegressor())

**Step 4).  Create A Training & Test Set**

Reuse the same settings we used previously in class.

In [25]:
# your answer here
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1985)

**Step 5).  Get An Initial 10-Fold Cross-Validation Score**

This will be your baseline for improving your score (for a regressor, `cross_val_score` reports R² by default). Use your pipeline in this step.
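
As a rough illustration of what comes back from `cross_val_score`, the sketch below runs it on synthetic data (the `X_demo` and `y_demo` arrays are made up for this example). It returns one score per fold, and leaving `scoring` unset for a regressor gives the same numbers as asking for `"r2"` explicitly:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# One score per fold; the default scorer for regressors is R^2.
default_scores = cross_val_score(model, X_demo, y_demo, cv=10)
r2_scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="r2")

# The spread across folds is worth checking alongside the mean.
mean_score, spread = default_scores.mean(), default_scores.std()
```

Looking at the spread as well as the mean helps you spot a single fold that scores far below the rest.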

In [26]:
# your answer here
cv_scores = cross_val_score(estimator=pipe, X=X_train, y=y_train, cv=10)

In [27]:
# our scores
cv_scores

array([0.89485054, 0.88229815, 0.88887342, 0.84339908, 0.88340465,
       0.87471986, 0.93623092, 0.60320311, 0.82593033, 0.87250408])

In [28]:
# average
cv_scores.mean()

0.8505414144791285

**Step 6).  Do Parameter Exploration With Your Model To Find the Best Combination On Your Validation Set**

Use pipelines here to make processing easier.

Parameters to explore:

 - `n_estimators` (do not go above 1000 for now)
 - `max_depth` (up to 5 levels deep is usually enough)
 - `learning_rate` (0.001 to 0.1 is a good range)
 
It's a good idea to refer to previous lab exercises to see how best to do this.

Use 5 folds to get your validation score (this keeps the run time manageable).

**Hint:** Use the `steps` attribute in the pipeline to grab the `GradientBoostingRegressor()` in your pipeline and set its params.
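
A minimal sketch of the hint, using a stand-in pipeline (`demo_pipe` with a `StandardScaler` is an assumption so the example runs without the lab data). `steps` is a list of `(name, estimator)` tuples, and `make_pipeline` names each step after its lowercased class name:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; the lab pipeline has an encoder as its first step instead.
demo_pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor())

# Option 1: index into `steps` and set params on the estimator directly.
demo_pipe.steps[-1][1].set_params(n_estimators=500, max_depth=3)

# Option 2: address the step by name with the `<step>__<param>` syntax.
demo_pipe.set_params(gradientboostingregressor__learning_rate=0.01)

# Both routes change the same underlying estimator.
gbr = demo_pipe.named_steps["gradientboostingregressor"]
```

Either option works; the name-based form has the advantage that it does not depend on the position of the step in the pipeline.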


In [30]:
# your answer here
num_trees = [100, 500, 1000]
max_depth = [2, 3, 4]
learning_rate = [.001, .01, .1]
cv_scores = []

for tree in num_trees:
    for depth in max_depth:
        for rate in learning_rate:
            pipe.steps[1][1].set_params(n_estimators=tree, max_depth=depth, learning_rate=rate)
            score = cross_val_score(estimator=pipe, X=X_train, y=y_train, cv=5)
            cv_scores.append((score.mean(), tree, depth, rate))
            
max(cv_scores)

(0.8685628826441191, 100, 4, 0.1)
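
The nested loops above can also be expressed with `GridSearchCV`, which runs the same kind of exhaustive search and keeps track of the best combination. This is a sketch on synthetic data with a deliberately small grid (the data, the stand-in `StandardScaler` step, and the grid values are all assumptions made so the example runs quickly):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=0))

# Keys follow the `<step>__<param>` naming used by pipelines.
param_grid = {
    "gradientboostingregressor__n_estimators": [20, 50],
    "gradientboostingregressor__max_depth": [2, 3],
    "gradientboostingregressor__learning_rate": [0.01, 0.1],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(demo_pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
```

By default `GridSearchCV` also refits the best combination on all the data passed to `fit`, so `search.best_estimator_` is ready to score afterwards.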

**Step 7).  Take the *best* parameter values and fit the pipeline on your *entire* training set**

In [31]:
# your answer here
pipe.steps[1][1].set_params(n_estimators=100, max_depth=4, learning_rate=0.1)
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['MSZoning', 'Neighborhood', 'GarageType',
                                      'GarageFinish'],
                                drop_invariant=False, handle_missing='value',
                                handle_unknown='value',
                                mapping=[{'col': 'MSZoning',
                                          'data_type': dtype('O'),
                                          'mapping': RL         1
RM         2
RH         3
FV         4
C (all)    5
NaN       -2
dtype: int64},
                                         {'col': 'Neighborhood',
                                          'data_type': dtype('O'),
                                          'mapping': SWISU       1
NAm...
                                           learning_rate=0.1, loss='ls',
                                           max_depth=4, max_features=None,
                                           max

**Step 8).  Score the model on your test set**

How does the test score compare with your best cross-validation score?

In [32]:
# your answer here
pipe.score(X_test, y_test)

0.8718705792693213