# Random Forest and Boosting Lab

In this lab we will practice using a Random Forest Regressor and a Boosted Trees Regressor on the Project 6 data.

> Instructor Notes:
- This lab walks students through a sample dataset; they should run it on the full dataset they created as part of Project 6.
- The code for this lab is shorter than usual to give students time to practice with Tableau.

## 1. Load and inspect the data

As part of your work on Project 6 you should have retrieved the top 250 movies from IMDB. Conduct this lab on the data you have retrieved.

In the [asset folder](../../assets/datasets/imdb_p6_sample.csv) you can find a subset of the movies, in case you have not yet completed Project 6.

1. Load the dataset and inspect it
- Assign the rating to a y vector and the binary columns to an X feature matrix
- What would you do with the year variable?
> Answer: normalize it and use it as a feature

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
%matplotlib inline
os.getcwd()


'/Users/HudsonCavanagh/GA_dsi-projects/weekly_work/week06'

In [7]:
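# Relative path to the sample file in the asset folder; point it at your own Project 6 data if you have it
# read_csv already returns a DataFrame, so no extra conversion is needed
movies = pd.read_csv('../../assets/datasets/imdb_p6_sample.csv')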
movies.head()

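| Unnamed: 0 | HA | rating | tconst | title | year | excellent | great | love | beautiful | best | hope | groundbreaking | amazing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1633889 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1118799 | 9.2 | tt0068646 | The Godfather | 1972 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 762879 | 9.0 | tt0071562 | The Godfather: Part II | 1974 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1616346 | 9.0 | tt0468569 | The Dark Knight | 2008 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 4 | 835155 | 8.9 | tt0108052 | Schindler's List | 1993 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |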


In [34]:
from sklearn.preprocessing import StandardScaler

# Age of each movie in years, as a float (2016 is the reference year used in this lab)
movies['year_since'] = (2016 - movies['year']).astype(float)
# StandardScaler expects a 2D input, so select the column with double brackets
movies['years_scaled'] = StandardScaler().fit_transform(movies[['year_since']]).ravel()
# Flag titles containing the word "the" (case-insensitive, whole word only,
# so that "Godfather" does not match on its own)
movies['the'] = movies['title'].apply(lambda x: 1 if 'the' in x.lower().split() else 0)

X = movies[['excellent','great', 'love', 'beautiful', 'best','hope','groundbreaking','amazing', 'years_scaled', 'the']]
y = movies['rating'].values
print(y)

[ 9.3  9.2  9.   9.   8.9  8.9  8.9  8.9  8.9  8.9  8.8  8.8  8.8  8.8  8.7
  8.7  8.7  8.7  8.7  8.7  8.7  8.6  8.6  8.6  8.6  8.6]




## 2. Decision Tree Regressor


1. Train a decision tree regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Do a scatter plot of the predicted vs actual scores for each of the 3 folds, do they match?
    - They should align to a diagonal line.
- Add some text to the plot indicating the average $R^2$ coefficient
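
One possible way to approach the steps above is sketched here. To stay self-contained it runs on a small synthetic dataset (an assumption: random binary "keyword" columns and a made-up rating), not on the IMDB sample; swap in your own `X` and `y`. The names `fold_r2` and `mean_r2` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the lab data (an assumption, not the IMDB sample):
# 8 binary keyword columns and a rating that depends on the first three.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(120, 8)).astype(float)
y = 8.0 + 0.15 * X[:, :3].sum(axis=1) + rng.normal(0, 0.1, size=120)

# 3-fold shuffled cross validation
cv = KFold(n_splits=3, shuffle=True, random_state=21)

fig, ax = plt.subplots(figsize=(5, 5))
fold_r2 = []
for i, (train_idx, test_idx) in enumerate(cv.split(X)):
    # Fit on the training part of the fold, predict the held-out part
    model = DecisionTreeRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_r2.append(r2_score(y[test_idx], pred))
    ax.scatter(y[test_idx], pred, alpha=0.6, label="fold {}".format(i + 1))

# Dashed diagonal marks perfect predictions
lims = [y.min(), y.max()]
ax.plot(lims, lims, "k--", linewidth=1)

# Annotate the plot with the average R^2 over the three folds
mean_r2 = float(np.mean(fold_r2))
ax.text(0.05, 0.92, "mean $R^2$ = {:.2f}".format(mean_r2), transform=ax.transAxes)
ax.set_xlabel("actual rating")
ax.set_ylabel("predicted rating")
ax.legend(loc="lower right")
```

In a notebook with `%matplotlib inline` the figure renders at the end of the cell; in a script you would call `plt.show()`.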

In [37]:
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# 3-fold shuffled CV. KFold is used because StratifiedKFold needs class labels
# and does not apply to a continuous target such as the rating.
cv = KFold(n_splits=3, shuffle=True, random_state=21)

# Instantiate each estimator: calling fit on the class itself raises a TypeError
clf = GradientBoostingRegressor()
bc = BaggingRegressor(n_estimators=1000)
ab_bag = AdaBoostRegressor(BaggingRegressor(n_estimators=1000), n_estimators=100)

# cross_val_score fits a fresh clone of the model on each fold, so no explicit fit is needed
# Original author's note: .973 - .98 after
s10 = cross_val_score(ab_bag, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("AdaBoosted Bagging", s10.mean().round(3), s10.std().round(3)))

s11 = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Gradient Boosting", s11.mean().round(3), s11.std().round(3)))

s12 = cross_val_score(bc, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Bagging", s12.mean().round(3), s12.std().round(3)))

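In [None]:
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()
bdt = BaggingRegressor(DecisionTreeRegressor())  # a regressor, not BaggingClassifier
# class_weight exists only on classifiers, so it is not passed to the regressors
rf = RandomForestRegressor(n_jobs=-1, n_estimators=100)
et = ExtraTreesRegressor(n_jobs=-1)

# 3-fold shuffled CV, as the exercise asks; KFold because the target is continuous
cv = KFold(n_splits=3, shuffle=True, random_state=21)

def score(model, name):
    # Default scoring for regressors is R^2
    s = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
    print("{} Score:\t{:0.3} ± {:0.3}".format(name, s.mean().round(3), s.std().round(3)))

score(dt, "Decision Tree")
score(bdt, "Bagging DT")
score(rf, "Random Forest")
score(et, "Extra Trees")
score(ab_bag, "Ada Boosted_Bagging")  # ab_bag is defined in the previous cell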



## 3. Random Forest Regressor


1. Train a random forest regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Do a scatter plot of the predicted vs actual scores for each of the 3 folds, do they match?
- How does this plot compare with the previous one?
> Answer: points are tighter now, indicating a better fit
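
To make the comparison with the single tree concrete, the sketch below draws both scatter plots side by side. It is only an illustration under stated assumptions: the data is synthetic (random binary columns and a made-up rating), and `cross_val_predict` is used to collect one out-of-fold prediction per row, so the $R^2$ in each title is computed on the pooled predictions and is not the per-fold average.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for the lab data (an assumption, not the IMDB sample)
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(120, 8)).astype(float)
y = 8.0 + 0.15 * X[:, :3].sum(axis=1) + rng.normal(0, 0.1, size=120)

cv = KFold(n_splits=3, shuffle=True, random_state=21)
models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
r2 = {}
for ax, (name, model) in zip(axes, models.items()):
    # Each row is predicted by a model that did not see it during training
    pred = cross_val_predict(model, X, y, cv=cv)
    r2[name] = r2_score(y, pred)
    ax.scatter(y, pred, alpha=0.6)
    ax.plot([y.min(), y.max()], [y.min(), y.max()], "k--", linewidth=1)
    ax.set_title("{} ($R^2$ = {:.2f})".format(name, r2[name]))
    ax.set_xlabel("actual rating")
axes[0].set_ylabel("predicted rating")
```

Sharing the axes between the two panels keeps the point clouds directly comparable, so a tighter cloud around the diagonal is visible at a glance.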

## 4. AdaBoost Regressor


1. Train an AdaBoost regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Do a scatter plot of the predicted vs actual scores for each of the 3 folds, do they match?
- Compare with previous score
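
A minimal sketch of the score comparison, assuming synthetic data in place of the IMDB sample and default hyperparameters apart from `n_estimators`. The helper `report` is a hypothetical name introduced here; `AdaBoostRegressor` is left with its default base learner (a shallow decision tree).

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-in for the lab data (an assumption, not the IMDB sample)
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(120, 8)).astype(float)
y = 8.0 + 0.15 * X[:, :3].sum(axis=1) + rng.normal(0, 0.1, size=120)

cv = KFold(n_splits=3, shuffle=True, random_state=21)

def report(model, name):
    """Print mean and spread of the per-fold R^2 and return the raw scores."""
    scores = cross_val_score(model, X, y, cv=cv)  # default scoring for regressors is R^2
    print("{}:\t{:.3f} ± {:.3f}".format(name, scores.mean(), scores.std()))
    return scores

rf_scores = report(RandomForestRegressor(n_estimators=100, random_state=0), "Random Forest")
ab_scores = report(AdaBoostRegressor(n_estimators=100, random_state=0), "AdaBoost")
```

With only three folds the standard deviation is a rough guide at best, so small differences between the two means should not be over-interpreted.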

## 5. Gradient Boosting Trees Regressor


1. Train a Gradient Boosting Trees regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Do a scatter plot of the predicted vs actual scores for each of the 3 folds, do they match?
- Compare with previous score

## 6. Tableau Practice

Practice using Tableau to inspect the data and also to plot the results.


## Bonus

Take the best model and try to improve it using grid search.
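
As a starting point, here is a hedged sketch of a grid search. It assumes, purely for illustration, that gradient boosting turned out to be the best model, and it runs on synthetic data; the values in `param_grid` are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Toy stand-in for the lab data (an assumption, not the IMDB sample)
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(120, 8)).astype(float)
y = 8.0 + 0.15 * X[:, :3].sum(axis=1) + rng.normal(0, 0.1, size=120)

# Illustrative grid: 2 * 3 * 2 = 12 parameter combinations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=21),
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print("best mean CV R^2: {:.3f}".format(search.best_score_))
```

Because the best parameters are chosen on the same folds that produce `best_score_`, that score is somewhat optimistic; holding out a separate test set gives a fairer estimate.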