Using the same car data, predict acceptability with random forest, gradient boosting, and AdaBoost.

Grid search each model to find its best hyperparameters. How much does performance improve over the default hyperparameters?

In [1]:
import pandas as pd

In [2]:
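# Note: car.data has no header row, so read_csv treats the first record as the header
# and that record is lost (1727 rows remain). Passing header=None would keep it.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data')
df.columns = ['buy_price', 'maint_cost', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability']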
df.head()

  buy_price maint_cost doors persons lug_boot safety acceptability
0     vhigh      vhigh     2       2    small    med         unacc
1     vhigh      vhigh     2       2    small   high         unacc
2     vhigh      vhigh     2       2      med    low         unacc
3     vhigh      vhigh     2       2      med    med         unacc
4     vhigh      vhigh     2       2      med   high         unacc


In [3]:
from sklearn.preprocessing import LabelEncoder

y = pd.Series(LabelEncoder().fit_transform(df['acceptability']))
X = pd.get_dummies(df.drop('acceptability', axis=1))
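To see what the two encoding steps above produce, here is a minimal sketch on a made-up four-row frame (the rows are invented for illustration; only the column names come from the notebook). It shows that `LabelEncoder` numbers the classes in sorted order and that `get_dummies` creates one indicator column per category level.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented rows; column names mirror the ones used above.
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "med"],
    "lug_boot": ["small", "big", "med", "small"],
    "acceptability": ["unacc", "acc", "vgood", "good"],
})

# LabelEncoder sorts the class names and numbers them in that order.
encoder = LabelEncoder()
y_toy = encoder.fit_transform(toy["acceptability"])
label_map = {name: code for code, name in enumerate(encoder.classes_)}
print(label_map)

# get_dummies expands each categorical column into one indicator column per level.
X_toy = pd.get_dummies(toy.drop("acceptability", axis=1))
print(list(X_toy.columns))
```

With this ordering, label 2 in the encoded target corresponds to `unacc`.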

What is the baseline accuracy for this data set?

In [4]:
y.value_counts()/len(y)
# Baseline: always predicting the majority class (label 2, 'unacc') is right about 70% of the time

2    0.700058
0    0.222351
1    0.039954
3    0.037638
dtype: float64
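The baseline is simply the share of the most frequent class. A small sketch of that arithmetic follows; the class counts are inferred from the proportions printed above (they sum to 1727 rows) and should be read as an assumption, not a fresh count of the file.

```python
from collections import Counter

# Class counts inferred from the printed proportions (assumption): 1727 rows in total.
labels = [2] * 1209 + [0] * 384 + [1] * 69 + [3] * 65

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class scores this accuracy.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 6))
```

Any model worth keeping should beat this number comfortably.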

Using cross-validation, compare the performance of the three models with their default hyperparameters. To ensure an apples-to-apples comparison, create a single KFold cross-validation object with a fixed random seed and reuse it for every model.

In [5]:
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)

cv = StratifiedKFold(n_splits=3, random_state=21, shuffle=True)
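The point of seeding the splitter is that every model is scored on identical folds. The sketch below checks this on made-up labels (a 70/30 class split chosen for illustration): the same seeded `StratifiedKFold` yields the same test folds on repeated calls, and each fold keeps roughly the overall class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels (70% / 30%); the features are placeholders.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))

cv_demo = StratifiedKFold(n_splits=3, random_state=21, shuffle=True)

# Splitting twice with the same seeded object yields the same test folds.
first = [test for _, test in cv_demo.split(X_demo, y_demo)]
second = [test for _, test in cv_demo.split(X_demo, y_demo)]
same_folds = all(np.array_equal(a, b) for a, b in zip(first, second))

# Each test fold keeps roughly the overall 30% share of class 1.
fold_shares = [float(y_demo[test].mean()) for test in first]
print(same_folds, fold_shares)
```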

In [6]:
rf = RandomForestClassifier()
cross_val_score(rf, X, y, cv=cv).mean()

0.93456119162640905

In [8]:
grad = GradientBoostingClassifier()
cross_val_score(grad, X, y, cv=cv).mean()

0.97625905797101442

In [7]:
ada = AdaBoostClassifier()
cross_val_score(ada, X, y, cv=cv).mean()

0.81935789049919483
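The three cells above can also be written as one loop, which makes it harder to pass a different splitter to one model by accident. This is a sketch on synthetic stand-in data rather than the car data, and the `random_state` arguments on the models are an addition (the cells above leave them unseeded, so their scores can shift slightly between runs).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); swap in the encoded car features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# One seeded splitter shared by every model, so each sees identical folds.
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

# random_state on the models is an addition here, for run-to-run repeatability.
models = {
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

scores = {
    name: float(cross_val_score(model, X_demo, y_demo, cv=cv_demo).mean())
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```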

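Use grid search to optimize the hyperparameters. For AdaBoost this is slightly harder, because the tree settings (such as maximum depth) belong to its base estimator rather than to AdaBoost itself; look [here](https://stackoverflow.com/questions/32210569/using-gridsearchcv-with-adaboost-and-decisiontreeclassifier) for clues.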

If you want a starting point for your gridsearch, highlight the empty space below (hints included in white text)...

<font color='white'>For random forest, try number of estimators [10, 25, 50], maximum depth [3, 5, 10, None], minimum samples per split [2, 5, 10]<br>
For gradient boost, try maximum depth [3, 5, 10, 25, None], minimum samples per split [2, 5, 10], number of estimators [10, 25, 50], learning rate [0.5, 1]<br>
For adaboost, try max depth [3, 5, 10, 25, None], min_samples_split [2, 5, 10], number of estimators [10, 25, 50], learning rate [0.5, 1] </font>


In [9]:
rf_params = {'n_estimators':[10, 25, 50], 'max_depth': [3, 5, 10, None], 'min_samples_split': [2, 5, 10]}

In [10]:
grid_rf = GridSearchCV(rf, param_grid=rf_params, verbose=1, cv=cv)
grid_rf.fit(X,y)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=1)]: Done 108 out of 108 | elapsed:   14.0s finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=21, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 25, 50], 'min_samples_split': [2, 5, 10], 'max_depth': [3, 5, 10, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
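The "36 candidates, totalling 108 fits" message follows directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A quick sketch of that arithmetic, using the random forest grid above and the gradient boosting grid from this notebook (`count_fits` is a hypothetical helper, not a scikit-learn function):

```python
from itertools import product

rf_grid = {
    "n_estimators": [10, 25, 50],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}
grad_grid = {
    "max_depth": [3, 5, 10, 25, None],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [10, 25, 50],
    "learning_rate": [0.5, 1],
}

def count_fits(param_grid, n_splits=3):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * n_splits

print(count_fits(rf_grid))
print(count_fits(grad_grid))
```

Counting fits before launching a search is a cheap way to estimate how long it will run.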

In [11]:
grid_rf.best_score_

0.9559930515344528

In [12]:
grid_rf.best_params_

{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}
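Beyond `best_score_` and `best_params_`, a fitted search keeps the score of every candidate in `cv_results_`, which helps judge whether the winner is clearly better or only marginally so. A minimal sketch on synthetic data with a deliberately small grid (both are assumptions made to keep the example quick):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data and a reduced grid, chosen only to keep this fast.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=21)

grid_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 25], "max_depth": [3, None]},
    cv=cv_demo,
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate; sort by rank to see how close the runners-up are.
results = pd.DataFrame(grid_demo.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked)
```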

In [13]:
grad_params = {"max_depth" : [3, 5, 10, 25, None],
              'min_samples_split': [2, 5, 10],
              "n_estimators": [10, 25, 50], 'learning_rate':[0.5, 1]
             }

In [14]:
grad = GradientBoostingClassifier()

grid_grad = GridSearchCV(grad, param_grid=grad_params, verbose=1, cv=cv)
grid_grad.fit(X,y)

Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:   49.0s finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=21, shuffle=True),
       error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'min_samples_split': [2, 5, 10], 'n_estimators': [10, 25, 50], 'learning_rate': [0.5, 1], 'max_depth': [3, 5, 10, 25, None]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [15]:
grid_grad.best_score_

0.99015634047481182

In [16]:
grid_grad.best_params_

{'learning_rate': 1,
 'max_depth': 3,
 'min_samples_split': 2,
 'n_estimators': 50}

In [17]:
ada_params = {"base_estimator__max_depth" : [3, 5, 10, 25, None],
              'base_estimator__min_samples_split': [2, 5, 10],
              "n_estimators": [10, 25, 50], 'learning_rate':[0.5, 1]
             }
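The `base_estimator__` prefix matches the scikit-learn version this notebook was run on. To my knowledge, scikit-learn 1.2 renamed that argument to `estimator` and later releases dropped the old name, so the same grid may need the `estimator__` prefix on a newer install. The sketch below picks the prefix by asking the estimator which nested parameters it exposes; `ada_demo` and `ada_demo_params` are illustrative names.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_demo = AdaBoostClassifier(DecisionTreeClassifier())

# Pick whichever nested prefix this scikit-learn version exposes.
available = ada_demo.get_params()
prefix = "estimator" if "estimator__max_depth" in available else "base_estimator"

ada_demo_params = {
    f"{prefix}__max_depth": [3, 5, 10, 25, None],
    f"{prefix}__min_samples_split": [2, 5, 10],
    "n_estimators": [10, 25, 50],
    "learning_rate": [0.5, 1],
}

# GridSearchCV rejects unknown parameter names, so check them up front.
missing = [key for key in ada_demo_params if key not in available]
print(prefix, missing)
```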

In [18]:
dt = DecisionTreeClassifier()
ada = AdaBoostClassifier(dt)

grid_ada = GridSearchCV(ada, param_grid=ada_params, verbose=1, cv=cv)
grid_ada.fit(X,y)

Fitting 3 folds for each of 90 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:   26.1s finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=21, shuffle=True),
       error_score='raise',
       estimator=AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=1.0, n_estimators=50, random_state=None),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [10, 25, 50], 'base_estimator__min_samples_split': [2, 5, 10], 'base_estimator__max_depth': [3, 5, 10, 25, None], 'learning_rate': [0.5, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [19]:
grid_ada.best_score_

0.98147075854082222

In [20]:
grid_ada.best_params_

{'base_estimator__max_depth': 25,
 'base_estimator__min_samples_split': 10,
 'learning_rate': 0.5,
 'n_estimators': 50}

In [22]:
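# Improvements across the board (mean CV accuracy, default -> tuned):
# random forest 0.935 -> 0.956, gradient boosting 0.976 -> 0.990, AdaBoost 0.819 -> 0.981.
# AdaBoost gains the most: its default base estimator is a depth-1 stump,
# whereas the tuned model uses much deeper trees.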
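The size of each improvement can be read off the scores reported in the cells above. This sketch copies those numbers in and computes the gain per model.

```python
# Mean cross-validated accuracies copied from the cell outputs above.
default_scores = {
    "random forest": 0.93456119162640905,
    "gradient boosting": 0.97625905797101442,
    "adaboost": 0.81935789049919483,
}
tuned_scores = {
    "random forest": 0.9559930515344528,
    "gradient boosting": 0.99015634047481182,
    "adaboost": 0.98147075854082222,
}

# Absolute gain in accuracy from tuning, largest first.
gains = {name: tuned_scores[name] - default_scores[name] for name in default_scores}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{gain:.4f}")

biggest = max(gains, key=gains.get)
```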