### Gradient Boosting & Scikit-Learn Intro

This lab introduces the Scikit-Learn API and Gradient Boosting, one of the most powerful techniques in predictive modeling.

During this lab you'll build a model, examine its working parts, and try to improve your results!

The great thing about Scikit-Learn is that its API is almost identical from one algorithm to another, so once you get the hang of it, switching between methods is fairly seamless.
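
As a rough illustration of that shared interface, the sketch below fits two different regressors with the same three calls. The synthetic data from `make_regression` and the choice of `LinearRegression` as the second model are assumptions made for this example; they are not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration; the lab itself uses housing.csv)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_scores = {}
for model in (LinearRegression(), GradientBoostingRegressor(random_state=42)):
    model.fit(X_demo, y_demo)       # same call for every estimator
    preds = model.predict(X_demo)   # same call for every estimator
    # score() returns R^2 for regressors
    demo_scores[type(model).__name__] = model.score(X_demo, y_demo)

print(demo_scores)
```

Only the constructor changes between the two models; everything after it is identical.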

**Step 1:** Load in the `housing.csv` file

In [1]:
# your answer here
import pandas as pd
df = pd.read_csv('/Users/harleyhoffmann/dat-02-22/ClassMaterial/Unit3/data/housing.csv')

**Step 2:** Randomly shuffle your dataset using the `sample` method with a `random_state` of 42.

In [2]:
# your answer here
df = df.sample(df.shape[0], random_state=42)
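
A common alternative spelling of the same shuffle is `frac=1`, which asks for 100% of the rows without needing the row count. The small frame below is a hypothetical stand-in used only to show that the two calls agree.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({"a": range(6), "b": list("uvwxyz")})

by_count = toy.sample(toy.shape[0], random_state=42)  # form used in this lab
by_frac = toy.sample(frac=1, random_state=42)         # equivalent shorthand

print(by_count.equals(by_frac))
```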

**Step 3:** Declare your `X` & `y` variables. We'll be predicting `PRICE`.

In [3]:
# your answer here
X = df.drop('PRICE', axis=1)
y = df['PRICE']

**Step 4:** Create a training and test set.

The training set will be the first 80% of the shuffled dataset, and the test set will be the last 20%. Apply the same split to both `X` & `y`.

Subsequent questions will refer to the variables you created in this step as `X_train`, `X_test`, `y_train`, `y_test`.

In [4]:
# your answer here
cutoff = int(df.shape[0]*.8)

X_train, X_test = X[:cutoff].copy(), X[cutoff:].copy()
y_train, y_test  = y[:cutoff].copy(), y[cutoff:].copy()
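
For reference, Scikit-Learn's `train_test_split` combines the shuffle and the 80/20 cut in one call. This is a sketch on a hypothetical toy frame, not a replacement for the manual split above; even with the same seed, the rows that land in each set will generally differ from the manual version.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy stand-in for the housing data
toy = pd.DataFrame({"RM": range(10), "LSTAT": range(10, 20), "PRICE": range(20, 30)})
X_toy = toy.drop("PRICE", axis=1)
y_toy = toy["PRICE"]

# Rows are shuffled by default; test_size=0.2 holds out 20% of them
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```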

**Step 5:** Import `GradientBoostingRegressor` and initialize it.

In [5]:
# your answer here
from sklearn.ensemble import GradientBoostingRegressor
gbm = GradientBoostingRegressor()

**Step 6:** Call the `fit()` method on `X_train` & `y_train`

In [6]:
# your answer here
gbm.fit(X_train, y_train)

GradientBoostingRegressor()

**Step 7:** Make a column that represents the predictions your model made for each sample in your original dataset.

In [7]:
# your answer here
df['Prediction'] = gbm.predict(X)

**Step 8:** Check the score of your model using the `score()` method.  Compare your score on both the training set and test set.

In [8]:
# your answer here
print(f"Train score: {gbm.score(X_train, y_train)}, Test Score: {gbm.score(X_test, y_test)}")

Train score: 0.9790704008575339, Test Score: 0.894124241944279
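
For regressors, `score()` returns the coefficient of determination, R². The gap between the training score (about 0.98) and the test score (about 0.89) above suggests the model fits the training rows somewhat better than it generalizes. The sketch below, which assumes synthetic data from `make_regression`, computes R² by hand to show what the number means.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)
preds = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - preds) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, model.score(X_demo, y_demo), r2_score(y_demo, preds))
```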


**Step 9:** Take a look at the values returned from the `feature_importances_` attribute

In [9]:
# your answer here
gbm.feature_importances_

array([2.39067720e-02, 1.50314007e-04, 2.25165211e-03, 5.42716264e-04,
       2.72501130e-02, 3.54050235e-01, 6.17230537e-03, 6.62399463e-02,
       1.17434598e-03, 1.65955214e-02, 2.78054787e-02, 1.19160685e-02,
       4.61944531e-01])

**Step 10:** To make a bit more sense out of these, let's put these values into a more readable format.  

Try making a two-column DataFrame from `X.columns` and the values in `feature_importances_` (the two are in the same order, so they line up one-to-one).

In [10]:
# your answer here
importances = pd.DataFrame({
    'Column': X.columns,
    'Importance': gbm.feature_importances_
}).sort_values(by='Importance', ascending=False)

importances

| | Column | Importance |
|---|---|---|
| 12 | LSTAT | 0.461945 |
| 5 | RM | 0.35405 |
| 7 | DIS | 0.06624 |
| 10 | PTRATIO | 0.027805 |
| 4 | NOX | 0.02725 |
| 0 | CRIM | 0.023907 |
| 9 | TAX | 0.016596 |
| 11 | B | 0.011916 |
| 6 | AGE | 0.006172 |
| 2 | INDUS | 0.002252 |
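
Impurity-based importances in Scikit-Learn's tree ensembles are normalized to sum to 1, so each value can be read as a share of the total. A running total makes it easier to see how concentrated the importance is. The sketch below copies the top five rows from the table above into a separate frame (`importances_demo` is a name introduced here for illustration).

```python
import pandas as pd

# Top five rows copied from the importances table above
importances_demo = pd.DataFrame({
    "Column": ["LSTAT", "RM", "DIS", "PTRATIO", "NOX"],
    "Importance": [0.461945, 0.35405, 0.06624, 0.027805, 0.02725],
})

# Running total of importance, from most to least important
importances_demo["Cumulative"] = importances_demo["Importance"].cumsum()

print(importances_demo)
```

`LSTAT` and `RM` together account for roughly 82% of the total importance, which is worth keeping in mind when deciding which columns to drop in the next step.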


**Step 11:** Can you improve your results?  For now, experiment with a few options and see how the score changes.  These could be any of the following:

 - changing the number of boosting rounds used via `n_estimators`
 - changing the learning rate via `learning_rate`
 - removing columns that have lower feature importance, or very low correlation with the target variable
 - changing the tree depth via `max_depth`
 
**Important:** To determine if your model improved or not, make sure to fit it on your training set, and score it on your test set.

In [18]:
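# Parameter sweep over boosting rounds, learning rate, and tree depth
n_estimators = [50, 100, 250]
learning_rate = [.05, .1, .15]
tree_depth = [3, 4, 5]
# Each entry is (score, n_estimators, learning_rate, max_depth); despite the name,
# these are test-set scores, not cross-validated ones
cv_scores = []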
for estimator_num in n_estimators:
    for rate in learning_rate:
        for depth in tree_depth:
            print(f"Fitting model for:  rounds: {estimator_num}, learning_rate: {rate}, depth: {depth}")
            gbm = GradientBoostingRegressor(n_estimators=estimator_num, learning_rate=rate, max_depth=depth)
            gbm.fit(X_train, y_train)
            score = gbm.score(X_test, y_test)
            print(f"Model score: {score}")
            cv_scores.append((score, estimator_num, rate, depth))

Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 3
Model score: 0.8601973753355013
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 4
Model score: 0.8717946076417042
Fitting model for:  rounds: 50, learning_rate: 0.05, depth: 5
Model score: 0.876216653710397
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 3
Model score: 0.8782476618399087
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 4
Model score: 0.8864033623769244
Fitting model for:  rounds: 50, learning_rate: 0.1, depth: 5
Model score: 0.901961398542325
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 3
Model score: 0.8662709009254597
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 4
Model score: 0.9013117091013039
Fitting model for:  rounds: 50, learning_rate: 0.15, depth: 5
Model score: 0.8962384209935698
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 3
Model score: 0.881162316326366
Fitting model for:  rounds: 100, learning_rate: 0.05, depth: 4
Mo
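
Because the sweep above picks parameters by their test-set score, the best score it reports is likely to be a little optimistic. One common alternative is to choose parameters by cross-validation on the training data only, and touch the test set once at the end. The sketch below uses `GridSearchCV` for that; the synthetic data and the reduced grid are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)

# Reduced version of the grid swept above
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With the lab's data you would call `search.fit(X_train, y_train)` and then score `search.best_estimator_` on `X_test` a single time.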

In [17]:
import category_encoders as ce

In [27]:
te = ce.TargetEncoder()
ore = ce.OrdinalEncoder()
ohe = ce.OneHotEncoder(use_cat_names=True)

In [20]:
df = pd.read_csv('/Users/harleyhoffmann/dat-02-22/ClassMaterial/Unit2/data/master.csv', parse_dates=['visit_date'])


In [21]:
df

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [22]:
ore.fit_transform(df)

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,1,2016-01-13,25,1,1,0,1,1,35.658068,139.751599,
1,1,2016-01-14,32,2,2,0,1,1,35.658068,139.751599,
2,1,2016-01-15,29,3,3,0,1,1,35.658068,139.751599,
3,1,2016-01-16,22,4,4,0,1,1,35.658068,139.751599,
4,1,2016-01-18,6,5,5,0,1,1,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252103,829,2017-04-21,49,390,3,0,4,10,34.695124,135.197852,6.0
252104,829,2017-04-22,60,391,4,0,4,10,34.695124,135.197852,37.0
252105,829,2017-03-26,69,444,7,0,4,10,34.695124,135.197852,35.0
252106,829,2017-03-20,31,467,5,1,4,10,34.695124,135.197852,3.0
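
In the output above, the integers appear to follow order of first appearance (`Wednesday` is 1, `Thursday` is 2, and so on). The sketch below reproduces that idea with `pd.factorize`; it is a pandas approximation for illustration, not how `category_encoders` implements `OrdinalEncoder`.

```python
import pandas as pd

# A few day names in the order they first appear in the data above
days = pd.Series(["Wednesday", "Thursday", "Friday", "Saturday", "Monday", "Wednesday"])

# factorize returns 0-based codes in order of first appearance
codes, uniques = pd.factorize(days)
ordinal = pd.Series(codes + 1, index=days.index)  # shift to 1-based codes

print(ordinal.tolist())
```

The integers carry no real ordering, so this encoding tends to suit tree-based models better than linear ones.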


In [29]:
X = df.drop('visitors', axis=1)
y = df['visitors']

In [31]:
df.groupby('day_of_week')['visitors'].mean()

day_of_week
Friday       23.072737
Monday       17.177009
Saturday     26.313688
Sunday       23.873362
Thursday     18.922702
Tuesday      17.672137
Wednesday    19.230121
Name: visitors, dtype: float64

In [30]:
te.fit_transform(X,y)


Unnamed: 0,id,visit_date,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,22.782609,2016-01-13,18.433460,19.230121,0,18.723532,19.609418,35.658068,139.751599,
1,22.782609,2016-01-14,19.229927,18.922702,0,18.723532,19.609418,35.658068,139.751599,
2,22.782609,2016-01-15,23.506897,23.072737,0,18.723532,19.609418,35.658068,139.751599,
3,22.782609,2016-01-16,26.780142,26.313688,0,18.723532,19.609418,35.658068,139.751599,
4,22.782609,2016-01-18,14.486726,17.177009,0,18.723532,19.609418,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...
252103,44.595745,2017-04-21,25.030612,23.072737,0,22.582953,20.466463,34.695124,135.197852,6.0
252104,44.595745,2017-04-22,27.448320,26.313688,0,22.582953,20.466463,34.695124,135.197852,37.0
252105,44.595745,2017-03-26,24.098333,23.873362,0,22.582953,20.466463,34.695124,135.197852,35.0
252106,44.595745,2017-03-20,24.043400,17.177009,1,22.582953,20.466463,34.695124,135.197852,3.0
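
The encoded `day_of_week` values match the group means computed earlier (for example, `Wednesday` becomes 19.230121). The sketch below shows that core idea, replacing each category with the mean of the target for that category, on a hypothetical toy frame. It leaves out the smoothing toward the overall mean that `TargetEncoder` also applies, so values for rare categories may differ from the library's output.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "day_of_week": ["Friday", "Friday", "Monday", "Monday", "Saturday"],
    "visitors": [20, 30, 10, 14, 40],
})

# Mean of the target for each category
category_means = toy.groupby("day_of_week")["visitors"].mean()

# Replace each category with its mean target value
toy["day_of_week_encoded"] = toy["day_of_week"].map(category_means)

print(toy)
```

Since the encoding is computed from the target, it is safer to fit the encoder on the training rows only and then apply it to the test rows.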
