### Gradient Boosting & Scikit-Learn Intro

This lab is designed to give everyone their first introduction to the Scikit-Learn API, and Gradient Boosting, one of the most powerful techniques in predictive modeling.

During this lab you'll see if you can build a model, understand its working parts, and make improvements to your results!  

The great thing about `Scikit Learn` is that its API is almost identical from one algorithm to another, so once you get the hang of how to use it, using different methods is fairly seamless.

**Step 1:** Load in the `housing.csv` file

In [59]:
# your answer here
import pandas as pd
import numpy as np
pd.options.plotting.backend = "plotly"
df = pd.read_csv('/Users/harleyhoffmann/dat-02-22/ClassMaterial/Unit3/data/housing.csv')

**Step 2:** Randomly shuffle your dataset using the `sample` method, using a `random_state` of 42.  Use the entire dataset as your sample size.

In [60]:
# your answer here
df = df.sample(df.shape[0], random_state=42)

**Step 3:** Declare your `X` & `y` variables -- We'll be predicting price.

In [61]:
# your answer here
y = df['PRICE']
X = df.drop('PRICE', axis=1)

**Step 4:** Create a training and test set.

The training set will be the first 80% of the dataset (for both `X` & `y`), and test set will be the last 20%.  Do this for both `X` & `y`, using your shuffled data.

Subsequent questions will refer to the variables you created in this step as `X_train`, `X_test`, `y_train`, `y_test`.

In [62]:
# your answer here
cutoff = int(X.shape[0]*.8)

X_train, X_test = X[:cutoff].copy(), X[cutoff:].copy()
y_train, y_test  = y[:cutoff].copy(), y[cutoff:].copy()



**Step 5:** Import `GradientBoostingRegressor` and initialize it.

In [63]:
# your answer here
from sklearn.ensemble import GradientBoostingRegressor
gbm = GradientBoostingRegressor()

**Step 6:** Call the `fit()` method on `X_train` & `y_train`

In [64]:
# your answer here
gbm.fit(X_train,y_train)

GradientBoostingRegressor()

**Step 7:** Make a column that represents the predictions your model made for each sample in your original dataset.

In [75]:
# your answer here
X_train['Prediction'] = gbm.predict(X_train)
#always add the training prediction as the column

In [66]:
X_train.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,Prediction
173,0.09178,0.0,4.05,0,0.51,6.416,84.1,2.6463,5,296,16.6,395.5,9.04,24.261651
274,0.05644,40.0,6.41,1,0.447,6.758,32.9,4.0776,4,254,17.6,396.9,3.53,32.59896
491,0.10574,0.0,27.74,0,0.609,5.983,98.8,1.8681,4,711,20.1,390.11,18.07,14.488047
72,0.09164,0.0,10.81,0,0.413,6.065,7.8,5.2873,4,305,19.2,390.91,5.52,23.323096
452,5.09017,0.0,18.1,0,0.713,6.297,91.8,2.3682,24,666,20.2,385.09,17.27,17.008758


In [67]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,PRICE
173,0.09178,0.0,4.05,0,0.51,6.416,84.1,2.6463,5,296,16.6,395.5,9.04,23.6
274,0.05644,40.0,6.41,1,0.447,6.758,32.9,4.0776,4,254,17.6,396.9,3.53,32.4
491,0.10574,0.0,27.74,0,0.609,5.983,98.8,1.8681,4,711,20.1,390.11,18.07,13.6
72,0.09164,0.0,10.81,0,0.413,6.065,7.8,5.2873,4,305,19.2,390.91,5.52,22.8
452,5.09017,0.0,18.1,0,0.713,6.297,91.8,2.3682,24,666,20.2,385.09,17.27,16.1


**Step 8:** Check the score of your model using the `score()` method.  Compare your score on both the training set and test set.

In [68]:
# your answer here
#import scikit
#set your parameters
#declare variables
#fit 
#predict, make a nice new column
gbm.score(X_test,y_test)

#score your model on the test

0.8901305326440893

**Step 9:** Take a look at the values returned from the `feature_importances_` attribute

In [70]:
# your answer here
gbm.feature_importances_

array([2.33020480e-02, 2.22380485e-04, 5.75462604e-03, 5.42716264e-04,
       2.85448647e-02, 3.56984370e-01, 6.09738474e-03, 6.49195875e-02,
       1.31615697e-03, 1.69630633e-02, 2.36600060e-02, 9.63408495e-03,
       4.62058712e-01])

**Step 10:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a 2 column dataframe using `X.columns` and the values from `feature_importances_` (they should correspond to one another).

In [72]:
# your answer here
feats = pd.DataFrame({'columns':X.columns, 'Importance':gbm.feature_importances_})
feats

Unnamed: 0,columns,Importance
0,CRIM,0.023302
1,ZN,0.000222
2,INDUS,0.005755
3,CHAS,0.000543
4,NOX,0.028545
5,RM,0.356984
6,AGE,0.006097
7,DIS,0.06492
8,RAD,0.001316
9,TAX,0.016963


In [46]:
feats.sort_values(by='Importance',ascending=False)

Unnamed: 0,columns,Importance
13,Prediction,0.9999243
5,RM,1.216571e-05
7,DIS,1.194809e-05
12,LSTAT,1.165498e-05
0,CRIM,1.094262e-05
6,AGE,9.506166e-06
11,B,8.746531e-06
4,NOX,3.213456e-06
10,PTRATIO,2.615811e-06
9,TAX,2.184687e-06


**Step 11:** Can you improve your results?  For now, toy around a little bit with a few different options for getting different results.  These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate
 - removing columns that have lower feature importance, or very low correlation with the target variable
 - change your tree depth
 
**Important:** To determine if your model improved or not, make sure to fit it on your training set, and score it on your test set.

In [49]:
# your answer here
gbm.get_params()

{'alpha': 0.9,
 'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'ls',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 500,
 'n_iter_no_change': None,
 'random_state': None,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}

In [73]:
n_estimators = [50, 100, 250]
learning_rate = [.05, .01, .15]
tree_depth = [3, 4, 5]
cv_scores = []

for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"fitting model for: rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train,y_train)
            score = gbm.score(X_test, y_test)
            print(f"Model score: {score}")
            cv_scores.append(score, estimator_num, rate, depth)

fitting model for: rounds: 50, learning_rate: 0.05, depth: 3


ValueError: X has 13 features, but DecisionTreeRegressor is expecting 14 features as input.

In [53]:

#increasing the nestimators got much closer

In [56]:
cv_scores

[]

In [58]:
max(cv_scores)

ValueError: max() arg is an empty sequence