# Random Forest and Boosting Lab

In this lab we will practice using Random Forest Regressor and Boosted Trees Regressor on the Project 6 Data.

> Instructor Notes:
- This lab walks students through a sample dataset; they should actually work with the full dataset they created as part of Project 6.
- The code for this lab is shorter than usual in order to give the students time to practice with Tableau.

## 1. Load and inspect the data

As part of your work on Project 6, you should have retrieved the top 250 movies from IMDB. Conduct this lab on the data you retrieved.

If you have not yet completed Project 6, the [asset folder](../../assets/datasets/imdb_p6_sample.csv) contains a subset of the movies.

1. Load the dataset and inspect it
- Assign the rating to a y vector and the binary columns to an X feature matrix
- What would you do with the year variable?
> Answer: normalize it and use it as a feature
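
The steps above could be sketched roughly as follows. This is a minimal illustration, not the reference solution: the small DataFrame reproduces only a few columns of the first five rows of the sample file (as displayed by `df.head()` in this lab), and the names `binary_cols` and `year_scaled` are assumptions introduced here. With the real data you would load the CSV instead and list all of the binary columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of the sample file, restricted to a few columns for brevity.
df = pd.DataFrame({
    "rating": [9.3, 9.2, 9.0, 9.0, 8.9],
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Godfather: Part II", "The Dark Knight", "Schindler's List"],
    "year": [1994, 1972, 1974, 2008, 1993],
    "excellent": [0, 1, 1, 1, 1],
    "great": [1, 1, 1, 1, 1],
    "love": [0, 0, 0, 1, 1],
})

# Hypothetical list of the 0/1 keyword columns (extend it for the full dataset)
binary_cols = ["excellent", "great", "love"]

# Target vector: the continuous rating
y = df["rating"]

# Feature matrix: binary columns plus the standardized year
X = df[binary_cols].copy()
X["year_scaled"] = StandardScaler().fit_transform(df[["year"]]).ravel()

print(X)
```

Tree-based models are insensitive to monotonic rescaling of a feature, so standardizing `year` mainly matters if you later compare against models that are scale-sensitive.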

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.preprocessing import LabelEncoder

In [29]:
# 3-fold stratified splitter; the labels are supplied later, when the folds are generated
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)



In [13]:
# Path relative to the lab folder, so the notebook runs on any machine
df = pd.read_csv('../../assets/datasets/imdb_p6_sample.csv')

In [14]:
cdf = df.copy()

In [15]:
df.head()

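| Unnamed: 0 | HA | rating | tconst | title | year | excellent | great | love | beautiful | best | hope | groundbreaking | amazing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1633889 | 9.3 | tt0111161 | The Shawshank Redemption | 1994 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1118799 | 9.2 | tt0068646 | The Godfather | 1972 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 762879 | 9.0 | tt0071562 | The Godfather: Part II | 1974 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1616346 | 9.0 | tt0468569 | The Dark Knight | 2008 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 |
| 4 | 835155 | 8.9 | tt0108052 | Schindler's List | 1993 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |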


## 2. Decision Tree Regressor


1. Train a decision tree regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Make a scatter plot of the predicted vs. actual ratings for each of the 3 folds. Do they match?
    - If they match, the points should fall along the diagonal line.
- Add some text to the plot indicating the average $R^2$ coefficient
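
One possible way to approach these steps is sketched below. It is an illustration under stated assumptions, not the expected solution: the data is a synthetic stand-in (random binary features with a rating-like target), and `cv_scatter` is a hypothetical helper name. Swap in your own `X` and `y` as NumPy arrays.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the IMDB ratings): 8 binary features
rng = np.random.RandomState(21)
X = rng.randint(0, 2, size=(120, 8))
y = 8.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=120)


def cv_scatter(model, X, y, title):
    """Fit on 3 shuffled folds, scatter predicted vs. actual, return (mean R^2, figure)."""
    cv = KFold(n_splits=3, shuffle=True, random_state=21)
    fig, ax = plt.subplots()
    scores = []
    for i, (train, test) in enumerate(cv.split(X)):
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        scores.append(r2_score(y[test], pred))
        ax.scatter(y[test], pred, alpha=0.6, label="fold %d" % (i + 1))
    # Diagonal reference line: perfect predictions would fall on it
    lims = [y.min(), y.max()]
    ax.plot(lims, lims, "k--")
    mean_r2 = float(np.mean(scores))
    # Annotate the plot with the average R^2 across the folds
    ax.text(0.05, 0.9, "mean $R^2$ = %.2f" % mean_r2, transform=ax.transAxes)
    ax.set_xlabel("actual rating")
    ax.set_ylabel("predicted rating")
    ax.set_title(title)
    ax.legend(loc="lower right")
    return mean_r2, fig


mean_r2, fig = cv_scatter(DecisionTreeRegressor(random_state=21), X, y, "Decision tree")
print("mean R^2:", round(mean_r2, 3))
```

Because ratings are continuous, a plain `KFold` with `shuffle=True` is used here; stratified splitting is meant for class labels.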

In [18]:
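# Note: these are classifiers, so each distinct rating is treated as a class label
# rather than as the continuous target that the regressor exercises ask for.
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
grad = GradientBoostingClassifier()

# Look up a model by its short name
model_dict = {
    'dt': DecisionTreeClassifier(),
    'rf': RandomForestClassifier(),
    'ada': AdaBoostClassifier(),
    'grad': GradientBoostingClassifier(),
}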

In [40]:
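from sklearn.model_selection import StratifiedKFold

# Encode each distinct rating as an integer class label
y = LabelEncoder().fit_transform(df['rating'])
# One-hot encode the string columns. Caution: X still contains the rating column,
# so the target leaks into the features and the scores are optimistic.
X = pd.get_dummies(df.drop('excellent', axis=1))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)

def model_score(X, y, model):
    """Print the 3-fold cross-validated accuracy of the named model."""
    score = cross_val_score(model_dict[model], X, y, cv=cv)
    print(score)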

In [45]:
model_dict['dt'].fit(X, y)
# predict() expects the feature matrix, not the target vector
y_pred = model_dict['dt'].predict(X)

In [41]:
model_score(X,y,'dt')

[ 1.          0.7         0.85714286]


## 3. Random Forest Regressor


1. Train a random forest regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Make a scatter plot of the predicted vs. actual ratings for each of the 3 folds. Do they match?
- How does this plot compare with the previous one?
> Answer: points are tighter now, indicating a better fit
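
A side-by-side comparison along these lines might look like the sketch below. It again uses synthetic stand-in data rather than the IMDB sample, and it scores the pooled out-of-fold predictions from `cross_val_predict`, which is a slightly different number from the average of the per-fold $R^2$ values.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the IMDB ratings): 8 binary features
rng = np.random.RandomState(21)
X = rng.randint(0, 2, size=(120, 8))
y = 8.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=120)

cv = KFold(n_splits=3, shuffle=True, random_state=21)
models = {
    "decision tree": DecisionTreeRegressor(random_state=21),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=21),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
r2 = {}
for ax, (name, model) in zip(axes, models.items()):
    # Out-of-fold predictions: each point is predicted by a model that never saw it
    pred = cross_val_predict(model, X, y, cv=cv)
    r2[name] = r2_score(y, pred)
    ax.scatter(y, pred, alpha=0.6)
    ax.plot([y.min(), y.max()], [y.min(), y.max()], "k--")
    ax.set_title("%s ($R^2$ = %.2f)" % (name, r2[name]))
    ax.set_xlabel("actual rating")
axes[0].set_ylabel("predicted rating")

print(r2)
```

Sharing the axes between the two panels makes it easier to judge by eye whether the forest's points sit closer to the diagonal than the single tree's.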

## 4. AdaBoost Regressor


1. Train an AdaBoost regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Make a scatter plot of the predicted vs. actual ratings for each of the 3 folds. Do they match?
- Compare with previous score
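
As a rough sketch of the scoring step, assuming synthetic stand-in data in place of the IMDB sample, `AdaBoostRegressor` can be evaluated with `cross_val_score` directly. The hyperparameter values are illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the IMDB ratings): 8 binary features
rng = np.random.RandomState(21)
X = rng.randint(0, 2, size=(120, 8))
y = 8.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=120)

# 3-fold shuffled cross validation, scored with R^2
cv = KFold(n_splits=3, shuffle=True, random_state=21)
ada = AdaBoostRegressor(n_estimators=50, random_state=21)
scores = cross_val_score(ada, X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:", round(float(scores.mean()), 3))
```

Reusing the same `random_state` for the splitter across models keeps the folds identical, so the scores are comparable with those from the earlier sections.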

## 5. Gradient Boosting Trees Regressor


1. Train a Gradient Boosting Trees regressor on the data and estimate the rating
- Evaluate the score with a 3-fold shuffled cross validation
- Make a scatter plot of the predicted vs. actual ratings for each of the 3 folds. Do they match?
- Compare with previous score
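
To compare scores across all four models, one option is to loop over a dictionary of regressors, much like the `model_dict` used earlier in this lab but with regressors instead of classifiers. The sketch below assumes synthetic stand-in data; `models` and `results` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the IMDB ratings): 8 binary features
rng = np.random.RandomState(21)
X = rng.randint(0, 2, size=(120, 8))
y = 8.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=120)

cv = KFold(n_splits=3, shuffle=True, random_state=21)
models = {
    "dt": DecisionTreeRegressor(random_state=21),
    "rf": RandomForestRegressor(n_estimators=100, random_state=21),
    "ada": AdaBoostRegressor(random_state=21),
    "grad": GradientBoostingRegressor(random_state=21),
}

# Mean 3-fold R^2 for each model, evaluated on identical folds
results = {}
for name, model in models.items():
    results[name] = float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print("%-5s mean R^2 = %.3f" % (name, score))
```

On a dataset as small as the sample file, differences between models can be within fold-to-fold noise, so it may be worth looking at the per-fold scores before declaring a winner.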

## 6. Tableau Practice

Practice using Tableau to inspect the data and also to plot the results.
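
Tableau can connect to plain CSV files, so one simple way to get model results into it is to write the predictions next to the actual ratings. The sketch below is only an illustration: `export_predictions` and the output file name are hypothetical, and the "predictions" are just the mean rating (a baseline), standing in for the output of whichever model you trained.

```python
import os
import tempfile

import pandas as pd


def export_predictions(df, predictions, path):
    """Write actual ratings, predictions and residuals to a CSV file for Tableau."""
    out = df.copy()
    out["predicted_rating"] = predictions
    out["residual"] = out["rating"] - out["predicted_rating"]
    out.to_csv(path, index=False)
    return out


# Titles and ratings of the first five rows of the sample file
movies = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Godfather: Part II", "The Dark Knight", "Schindler's List"],
    "rating": [9.3, 9.2, 9.0, 9.0, 8.9],
})

# Baseline stand-in for real model output: predict the mean rating for every movie
baseline = [movies["rating"].mean()] * len(movies)

# Hypothetical output location; point this wherever Tableau can read it
path = os.path.join(tempfile.gettempdir(), "predictions_for_tableau.csv")
out = export_predictions(movies, baseline, path)
print(out)
```

In Tableau, plotting `rating` against `predicted_rating` reproduces the predicted-vs-actual scatter, and `residual` can be used as a color or size encoding.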


## Bonus

Take the best model and try to improve it using grid search.
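
A grid search could be set up roughly as follows. Gradient boosting is used here purely as an example, since which model is best depends on your data; the parameter grid is an arbitrary small illustration, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (NOT the IMDB ratings): 8 binary features
rng = np.random.RandomState(21)
X = rng.randint(0, 2, size=(120, 8))
y = 8.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=120)

cv = KFold(n_splits=3, shuffle=True, random_state=21)

# Illustrative grid: 2 x 2 x 2 = 8 combinations, each fit on 3 folds
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=21),
    param_grid,
    cv=cv,
    scoring="r2",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean R^2:", round(float(search.best_score_), 3))
```

After fitting, `search.best_estimator_` holds the model refit on all of the data with the best parameters, ready for prediction.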