# Bagging and Boosting Lab

In this lab we will practice using Bagging and Boosting models on a dataset of your choice.

---

For this lab, choose datasets that have given you trouble with other models in the past, or that you are otherwise interested in exploring with bagging and boosting. The goal is to get a sense of how these models perform compared to more conventional regressors and classifiers.

## 1. Load and inspect the data

Load one or more datasets that have been difficult to get good results on in the past, either from labs or projects.

---

### 1.1 Inspect the data and create X, y for classification and for regression


In [17]:
import pandas as pd
import numpy as np
from sklearn.linear_model import (LinearRegression, LogisticRegression, 
                                  Lasso, Ridge,
                                  SGDRegressor, SGDClassifier)
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score, StratifiedKFold
import scipy.stats as stats
from sklearn.metrics import roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
%matplotlib inline

In [2]:
service_reviews = pd.read_csv('/home/llevin/Desktop/DSI-SF-2-llevin16/Week 6 Notes & Code/Datasets/service_reviews.csv')

In [3]:
#Same process for service data
y = service_reviews['target'].values
X = service_reviews[[col for col in service_reviews.columns if col not in\
                  ['target','Class','stars','likes','business_id','Reviews']]]

scale = StandardScaler()
Xn = scale.fit_transform(X)

print Xn.shape,y.shape

(1362, 160) (1362,)


In [4]:
y

array([ 3.        ,  3.        ,  4.31818182, ...,  3.4       ,
        3.80487805,  4.35      ])

In [5]:
# Set up the search grid: 200 alpha values between 0.0001 and 1.0
search_parameters = {
    "alpha": np.linspace(0.0001, 1.0, 200)
}

lasso = Lasso()

# Grid search over alpha with 5-fold cross-validation
estimator = GridSearchCV(lasso, search_parameters, cv=5, verbose=1)
estimator = estimator.fit(Xn,y)
# Keep the Lasso refit with the best alpha
best_lasso = estimator.best_estimator_

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    1.1s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    1.9s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    3.5s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:    5.4s
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    6.4s finished


In [6]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

lasso_scores = []

for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_lasso.fit(x_train,y_train)
    
    lasso_scores.append(best_lasso.score(x_test,y_test))

print lasso_scores
print 'Lasso: ',np.mean(lasso_scores)

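# Baseline: a model fit on a column of ones predicts the training mean,
# so its R^2 on the training data is 0 by construction
baseline_X = np.ones(Xn.shape[0])[:,np.newaxis]
print 'Baseline: ',LinearRegression(fit_intercept=False).fit(baseline_X, y).score(baseline_X, y)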

[0.10748432438327793, 0.087043499801870539, 0.12398975236066567, 0.10057690932885133, 0.11841059674652155]
Lasso:  0.107501016524
Baseline:  0.0
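
The cross-validation loop above stratifies on a continuous target, although stratification is only defined for discrete class labels. A minimal sketch of the same evaluation with plain K-fold splitting follows. It assumes Python 3 and a recent scikit-learn, where these utilities live in `sklearn.model_selection`, and it uses synthetic data (`X_demo`, `y_demo`, and the `alpha` value are placeholders, not the lab's `Xn`, `y`, or tuned parameters).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (Xn, y); the lab data is not reproduced here.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

# Plain shuffled K-fold, which suits a continuous regression target.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For a regressor, cross_val_score uses the estimator's .score method, i.e. R^2.
lasso_cv_scores = cross_val_score(Lasso(alpha=0.1), X_demo, y_demo, cv=cv)

print(lasso_cv_scores)
print("Lasso mean R^2:", lasso_cv_scores.mean())
```

Fixing `random_state` on the splitter makes the fold assignment repeatable, so scores from different models can be compared on identical folds.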




---

## 2. Decision Tree Regressor

1. Train a decision tree regressor on the regression problem
- Evaluate the score with a 5-fold cross-validation
- How does this compare to the model you fit on this data previously?


In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV

## Define your DecisionTreeClassifier
dctc = DecisionTreeClassifier()

## Search parameters
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,'log2','sqrt','auto', 2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

## Gridsearch    
dtc_gs = GridSearchCV(dctc, dtc_params, cv=5, verbose=1)

In [8]:
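# Cast the continuous target to 6-character strings so that the classifiers
# below treat each distinct (truncated) value as its own class label
y = np.asarray(y, dtype="|S6")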
y

array(['3.0', '3.0', '4.3181', ..., '3.4', '3.8048', '4.35'], 
      dtype='|S6')

In [13]:
best_dt = dtc_gs.fit(Xn, y)

Fitting 5 folds for each of 440 candidates, totalling 2200 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    3.4s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:    4.9s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:    6.6s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:    7.9s
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:    9.9s
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:   12.8s
[Parallel(n_jobs=1)]: Done 2200 out of 2200 | elapsed:   14.6s finished


In [14]:
best_dt = best_dt.best_estimator_

In [15]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

dt_scores = []

for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_dt.fit(x_train,y_train)
    
    dt_scores.append(best_dt.score(x_test,y_test))

print dt_scores
print 'DT: ',np.mean(dt_scores)

[0.099315068493150679, 0.098976109215017066, 0.083636363636363634, 0.086956521739130432, 0.096385542168674704]
DT:  0.0930539210505
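
The task for this section asks for a regressor, while the cells above fit a `DecisionTreeClassifier` on string-cast targets, so the scores printed above are classification accuracies rather than R^2 values. A sketch of the regression variant is shown below, assuming Python 3, a recent scikit-learn, and synthetic data standing in for `Xn` and `y`. The grid `dtr_params` is a hypothetical, deliberately small example rather than the grid used above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (Xn, y).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# Small illustrative grid (an assumption, not the lab's grid).
dtr_params = {
    "max_depth": [None, 2, 3, 4],
    "min_samples_split": [2, 5, 10, 20],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Tune the tree, then score the selected configuration with 5-fold CV.
dtr_gs = GridSearchCV(DecisionTreeRegressor(random_state=0), dtr_params, cv=cv)
dtr_gs.fit(X_demo, y_demo)

dtr_scores = cross_val_score(dtr_gs.best_estimator_, X_demo, y_demo, cv=cv)

print(dtr_gs.best_params_)
print("DT regressor mean R^2:", dtr_scores.mean())
```

Because the regressor's `score` is R^2, these numbers are directly comparable to the Lasso scores, which the accuracy values above are not.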


---

## 3. Random Forest Regressor

1. Train a random forest regressor on the regression problem and predict your dependent variable.
- Evaluate the score with a 5-fold cross-validation
- How does this compare to the model you fit on this data previously? How does it compare to the single decision tree?

In [18]:
## Define a Random Forest Classifier
rfc = RandomForestClassifier()

rf_params = {
    'max_features':[None,'log2','sqrt', 2,3,4,5],
    'max_depth':[1,2,3,None],
    'min_samples_leaf':np.linspace(1,101,20).astype(int),  # 20 integer leaf sizes from 1 to 101
    'n_estimators':[100]
}

## gridsearch parameters, and cv =5
rf_gs = GridSearchCV(rfc, rf_params, cv=5, verbose=1)

In [19]:
best_rf = rf_gs.fit(Xn,y)

Fitting 5 folds for each of 560 candidates, totalling 2800 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   17.5s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   56.1s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:  3.5min
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  5.1min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:  7.7min
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed: 12.1min
[Parallel(n_jobs=1)]: Done 2800 out of 2800 | elapsed: 13.5min finished


In [20]:
best_rf = best_rf.best_estimator_

In [21]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

rf_scores = []

for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_rf.fit(x_train,y_train)
    
    rf_scores.append(best_rf.score(x_test,y_test))

print rf_scores
print 'RF: ',np.mean(rf_scores)

[0.089403973509933773, 0.087248322147651006, 0.11152416356877323, 0.108, 0.13168724279835392]
RF:  0.105572740405
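
To answer the comparison question with a regressor rather than a classifier, one option is to score a single tree and a forest on the same folds. The sketch below assumes Python 3, a recent scikit-learn, and synthetic data in place of `Xn` and `y`. The hyperparameters are library defaults apart from `n_estimators`, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (Xn, y).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# Reuse one splitter so both models see exactly the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                              X_demo, y_demo, cv=cv)
rfr_scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                             X_demo, y_demo, cv=cv)

print("single tree mean R^2:  ", tree_scores.mean())
print("random forest mean R^2:", rfr_scores.mean())
```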


---

## 4. Extra Trees Regressor

1. Train an extra trees regressor on the regression problem and predict your dependent variable.
- Evaluate the score with a 5-fold cross-validation
- How does this model compare with the others?

In [28]:
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier()

et.fit(Xn,y)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [29]:
best_et = et.fit(Xn,y)

In [30]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

et_scores = []

for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_et.fit(x_train,y_train)
    
    et_scores.append(best_et.score(x_test,y_test))

print et_scores
print 'ET: ',np.mean(et_scores)

[0.076666666666666661, 0.076388888888888895, 0.062271062271062272, 0.0728744939271255, 0.05905511811023622]
ET:  0.0694512459728
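
A regression counterpart of the cells above might look like the following sketch, which assumes Python 3, a recent scikit-learn, and synthetic stand-in data. Extra trees draw split thresholds at random and, by default, train every tree on the full sample instead of a bootstrap resample, which the last line checks.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (Xn, y).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

etr = ExtraTreesRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score clones the estimator for each fold, so etr itself stays unfitted.
etr_scores = cross_val_score(etr, X_demo, y_demo, cv=cv)

print(etr_scores)
print("ET regressor mean R^2:", etr_scores.mean())
print("bootstrap resampling enabled:", etr.bootstrap)
```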


---

## 5. AdaBoost Classifier

1. Train an AdaBoost classifier on your chosen classification problem.
- Evaluate the classifier performance with a 5-fold cross-validation.

In [31]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()

best_ada = ada.fit(Xn,y)

In [32]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

ada_scores = []

for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_ada.fit(x_train,y_train)
    
    ada_scores.append(best_ada.score(x_test,y_test))

print ada_scores
print 'ADA: ',np.mean(ada_scores)

[0.08191126279863481, 0.11583011583011583, 0.095890410958904104, 0.093283582089552244, 0.124]
ADA:  0.102183074335
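
The cells above treat every distinct target value as its own class. If a proper classification problem is wanted, one possible approach (an assumption here, not something the lab prescribes) is to bin the continuous target, for example with a median split. The sketch below assumes Python 3, a recent scikit-learn, and synthetic data. `y_class` is a hypothetical binary label derived that way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for (Xn, y) with a continuous target.
X_demo, y_cont = make_regression(n_samples=300, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# Hypothetical binary label: 1 if the target is above its median, else 0.
y_class = (y_cont > np.median(y_cont)).astype(int)

# Stratified folds are appropriate now that the target is a class label.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

ada_cv_scores = cross_val_score(AdaBoostClassifier(n_estimators=50, random_state=0),
                                X_demo, y_class, cv=cv)

print(ada_cv_scores)
print("AdaBoost mean accuracy:", ada_cv_scores.mean())
```

With two balanced classes, chance-level accuracy is about 0.5, which gives the scores a meaningful reference point.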


---

## 6. Gradient Boosted Trees Classifier


1. Train a Gradient Boosting Trees classifier on your chosen classification problem.
- Evaluate the score with a 5-fold cross-validation.
- Compare with the AdaBoost score.

In [34]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(verbose=1)

best_gbc = gbc.fit(Xn,y)

      Iter       Train Loss   Remaining Time 
         1 477186735516371394580422328469528586785871701924042652699748206487659869458565672510032397330776041298855195703724998656.0000           12.72m
         2 84842572795123807943517361051740657509094285276718651431138946819715415542502444237767395997842634179239436268859605771566577575912576650272909845730462492589502972616377252109731741029691950280163926736896.0000           10.49m
         3 84842572795123807943517361051740657509094285276718651431138946819715415542502444237767395997842634179239436268859605771566577575912576650272909845730462492589502972616377252109731741029691950280163926736896.0000            9.63m
         4 84842572795123807943517361051740657509094285276718651431138946819715415542502444237767395997842634179239436268859605771566577575912576650272909845730462492589502972616377252109731741029691950280163926736896.0000            9.15m
         5 848425727951238079435173610517406575090942852767186514311389468197154

In [35]:
cv_indices = StratifiedKFold(y, n_folds=5,shuffle=True)

gbc_scores = []
 
for train,test in cv_indices:
    x_train = Xn[train,:]
    x_test = Xn[test,:]
    y_train = y[train]
    y_test = y[test]
    
    best_gbc.fit(x_train,y_train)
    
    gbc_scores.append(best_gbc.score(x_test,y_test))

print gbc_scores
print 'GBC: ',np.mean(gbc_scores)

      Iter       Train Loss   Remaining Time 
         1 113989998926572849741758178141384910843009408553002271617043885796992405544528711393954531351162200957507308666890486391830592455625768818309795154576702358757587354171893169801572827879249003796391819544741515697951239008453992637964178799639823853553729549755267113216453614204092416.0000            7.74m
         2 150609738716817617085085294664412163386436191423256179712631436345959360429796092188244216615459726554228315125402870742170334531478960946935266334535704296601411226692238465541537063661181948068061758290748445702088422070438663302794247470751459761769801261008915822493898802114592768.0000            6.79m
         3 150609738716817617085085294664412163386436191423256179712631436345959360429796092188244216615459726554228315125402870742170334531478960946935266334535704296601411226692238465541537063661181948068061758290748445702088422070438663302794247470751459761769801261008915822493898802114592768.0000            6.1

KeyboardInterrupt: 
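
One likely contributor to the long run time above is that, for multiclass problems, scikit-learn's gradient boosting fits one regression tree per class at every boosting stage, and here every distinct string-cast target value counts as a class (`len(np.unique(y))` shows how many). The sketch below illustrates the per-class trees on a small synthetic three-class problem. It assumes Python 3 and a recent scikit-learn, and none of the names or settings come from the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small synthetic three-class problem (a stand-in, not the lab data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

gbc_demo = GradientBoostingClassifier(n_estimators=20, random_state=0)
gbc_demo.fit(X_demo, y_demo)

# One row per boosting stage, one column per class: 20 stages x 3 classes.
print(gbc_demo.estimators_.shape)

# 5-fold stratified cross-validation, as the task asks.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
gbc_cv_scores = cross_val_score(GradientBoostingClassifier(n_estimators=20, random_state=0),
                                X_demo, y_demo, cv=cv)
print("GBC mean accuracy:", gbc_cv_scores.mean())
```

Reducing the number of classes (for instance by binning the target) or lowering `n_estimators` shrinks the total number of trees to fit.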

## 7. [BONUS] Use gridsearch to fine-tune models of your choice.

1. What are the best parameters found with the gridsearch?
2. How does the best score compare to the model(s) without cross-validation?

**Be careful with many parameters! It can go slow.**
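
When the full grid is too slow, a randomized search can cap the number of fits. This swaps exhaustive grid search for `RandomizedSearchCV`, which samples a fixed number of parameter combinations. The sketch assumes Python 3, a recent scikit-learn, and synthetic data. The parameter lists and `n_iter=5` are illustrative choices only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for (Xn, y).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# 3 * 4 * 3 = 36 possible combinations; only n_iter of them are tried.
param_distributions = {
    "max_depth": [2, 3, None],
    "min_samples_leaf": [1, 5, 10, 20],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_distributions,
    n_iter=5,                      # 5 sampled settings x 5 folds = 25 fits
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV R^2:", search.best_score_)
```

`best_score_` is the mean cross-validated score of the best sampled setting, so it answers the first bonus question together with `best_params_`.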