In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

In [0]:
df = pd.read_csv("/tmp/wine-binary.csv")

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("isGood", axis=1), df["isGood"],
    random_state=42
)

# Simple decision tree

In [0]:
from sklearn.tree import DecisionTreeClassifier

Next, try initializing the model and doing a fit on the wine dataset.

In [0]:
# YOUR CODE HERE
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Take a look at the train & test scores. How do they compare to the Logistic Regression results?

In [0]:
model.score(X_train, y_train)

1.0

In [0]:
#model.score(... # YOUR CODE HERE
model.score(X_test, y_test)

0.72

How would you explain the gap between the training set and test set performance?
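
One way to look into this gap is to check how large an unrestricted tree grows. The sketch below is only an illustration: it uses a synthetic dataset from `make_classification` as a stand-in for the wine data (an assumption, since the CSV is not bundled here), and the names `X_demo`, `full_tree` and so on are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
# flip_y adds label noise, which an unrestricted tree will happily memorize.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Default settings: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)

train_acc = full_tree.score(Xd_train, yd_train)
test_acc = full_tree.score(Xd_test, yd_test)

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
print("tree depth:    ", full_tree.tree_.max_depth)
print("node count:    ", full_tree.tree_.node_count)
```

A tree with hundreds of nodes for a few hundred training rows has essentially memorized the training set, noise included, which is why the training score is perfect and the test score is not.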

Let's see if we can reduce the gap a bit... check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) and see if there's any way to reduce the complexity of the model to bring train and test closer together (ideally, the test score should go up and, as a side effect, the train score goes down).

In [0]:
# YOUR CODE HERE: Initialize DecisionTreeClassifier with one or more parameters, fit and evaluate.
model_reg = DecisionTreeClassifier(max_depth=7, min_samples_split=4, min_samples_leaf=6)
model_reg.fit(X_train, y_train)
print(model_reg.score(X_train, y_train))
print(model_reg.score(X_test, y_test))

0.8498748957464554
0.7275
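
If you want to see how a single complexity parameter affects the gap, a small sweep can help. This is a rough sketch on synthetic data (again an assumption, standing in for the wine dataset); the depth values are arbitrary picks, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# None means "no depth limit", i.e. the default behaviour.
depths = [1, 2, 3, 5, 8, None]
results = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    results.append(
        (depth, tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    )

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth!s:>4}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Typically the train score climbs steadily with depth while the test score flattens out or drops, but the exact pattern depends on the data.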


During the wrap-up, we'll see who managed to get the highest score and which parameter values were used (I'm trying to appeal to your competitive spirit here ;-))

You might notice that it's pretty difficult to solve the overfitting problem with a single Decision Tree.

Luckily, Random Forests help out in this respect.

# Random Forest

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
rf = RandomForestClassifier()

Try fitting a first baseline with default parameters.

In [0]:
# YOUR CODE HERE
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train))
print(rf.score(X_test, y_test))

0.9933277731442869
0.785


What do you notice, compared to the decision tree performance?

Try running the model multiple times. Why do you get different scores in each try?
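
A hint for the question above: a Random Forest draws bootstrap samples and random feature subsets, so each fit depends on the random number generator. The sketch below (synthetic data as a stand-in for the wine set, which is an assumption) shows that fixing `random_state` makes two fits agree exactly, while different seeds may give different scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Same seed twice: both forests are built from the same random draws.
rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xd_train, yd_train)
rf_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xd_train, yd_train)
pred_a = rf_a.predict(Xd_test)
pred_b = rf_b.predict(Xd_test)
print("identical predictions with the same seed:", (pred_a == pred_b).all())

# Different seeds: the scores are allowed to differ from run to run.
seed_scores = [
    RandomForestClassifier(n_estimators=10, random_state=seed)
    .fit(Xd_train, yd_train)
    .score(Xd_test, yd_test)
    for seed in range(5)
]
print("test scores for seeds 0-4:", seed_scores)
```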

Ok, let's try playing with the parameters. Some things you can do:
- Bumping `n_estimators` to 20, 50, 100
- Limiting `max_depth` to 5, 10, 15

In [0]:
# YOUR CODE HERE
rf = RandomForestClassifier(n_estimators=100, max_depth=15)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train))
print(rf.score(X_test, y_test))

1.0
0.805


# Grid Search
Luckily, we can automate this parameter-finding process.

In [0]:
from sklearn.model_selection import GridSearchCV

Read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Complete the grid below using the options on the RandomForestClassifier documentation page. Pick a couple you find interesting (not too many!)

In [0]:
param_grid = {
    "n_estimators": [10, 50, 250, 500, 1000],
    # YOUR PARAMS HERE
    "max_depth": [10, 15, 20, 25],
    "min_samples_split": [2, 3],
    "min_samples_leaf": [1, 2, 3]
} 
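
Why "not too many"? Every extra parameter multiplies the number of models to train. As a quick check, `ParameterGrid` from `sklearn.model_selection` enumerates the combinations a grid search would try. The grid below copies the values from the cell above; `cv_folds = 3` is taken from the `cv=3` setting used later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the param_grid defined in the notebook cell.
demo_grid = {
    "n_estimators": [10, 50, 250, 500, 1000],
    "max_depth": [10, 15, 20, 25],
    "min_samples_split": [2, 3],
    "min_samples_leaf": [1, 2, 3],
}

n_combinations = len(ParameterGrid(demo_grid))  # 5 * 4 * 2 * 3
cv_folds = 3
n_fits = n_combinations * cv_folds  # one fit per combination per fold

print("parameter combinations:", n_combinations)  # -> 120
print("cross-validation fits: ", n_fits)          # -> 360
```

With `refit=True` (the default), the search trains one more model on the full training set after the cross-validation fits are done.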

In [0]:
gcv = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=8)

In [0]:
gcv.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid='warn', n_jobs=8,
       param_grid={'n_estimators': [10, 50, 250, 500, 1000], 'max_depth': [10, 15, 20, 25], 'min_samples_split': [2, 3], 'min_samples_leaf': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

What are your best parameters?

In [0]:
# YOUR CODE HERE
gcv.best_params_

{'max_depth': 15,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 250}

And what is the associated Cross Validation score?

In [0]:
# YOUR CODE HERE
gcv.best_score_

0.7939949958298582

And the score on the test set, with the best model according to the grid search?

In [0]:
# YOUR CODE HERE
print(gcv.score(X_test, y_test))
# or, equivalently:
gcv.best_estimator_.score(X_test, y_test)

0.8


0.8

# Open-ended bonus assignments
- Try using RandomizedSearchCV instead of GridSearchCV
- Try inspecting the RandomForestClassifier model to see if you can get a better understanding of what the model is doing. Hint: look up `feature_importances_`.
- Instead of using Accuracy, see if you can get Precision, Recall and F1 score metrics. The scikit-learn documentation should help. 
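
As a possible starting point for the last two bonus items, here is a minimal sketch using `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` plus the fitted forest's `feature_importances_` attribute. It runs on synthetic data (an assumption, standing in for the wine dataset), so the feature indices mean nothing here; on the real data you would pair the importances with `X_train.columns`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(Xd_train, yd_train)
yd_pred = forest.predict(Xd_test)

# Metrics for the positive class (label 1).
precision = precision_score(yd_test, yd_pred)
recall = recall_score(yd_test, yd_pred)
f1 = f1_score(yd_test, yd_pred)
print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = forest.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, value in ranked[:5]:
    print(f"feature {index}: {value:.3f}")
```

`classification_report` from the same module prints all three metrics per class in one table, which may be more convenient for a quick look.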