# Model Selection Lab Solution
## Grid Search for *k*-NN

To get us started, we have an example that fits a *k*-NN model to the `HotelRevHelpfulness` dataset. It assesses three options:
- whether to use a `StandardScaler`, a `MinMaxScaler` or no scaler
- what *k* to use for *k*-NN
- what weighting policy to use

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
import pandas as pd

In [2]:
hotel_rev = pd.read_csv('data/HotelRevHelpfulness.csv')
hotel_rev.head()

Unnamed: 0,hotelId,aveHelpfulnessRatioUser,stdevHelpfulnessRatioUser,pcReviewsExceedMinHelpfulnessSupport,numReviewsUser,numReviewsHotel,ratingUser,numberSubRatingsUser,subRatingMeanUser,subRatingStdevUser,...,completeness_2,completeness_3,numberTermsEntry,percentageAlphaCharsEntry,fractionUpperCaseCharsEntry,fractionYouVsIEntry,numberTermsSummaryQuote,percentageAlphaCharsSummaryQuote,fractionUpperCaseCharsSummaryQuote,reviewHelpfulness
0,17420,1.0,0.0,0.666667,3,16,5,4,4.0,0.0,...,0,1,182,0.788474,0.025703,0.5,6,0.815789,0.096774,1
1,1397,0.772487,0.377321,0.5,12,233,5,0,0.0,0.0,...,0,0,158,0.791888,0.012594,0.5,1,1.0,0.083333,1
2,1348,0.715473,0.300437,0.833333,12,302,4,7,3.714286,0.755929,...,0,3,59,0.799639,0.024831,0.333333,4,0.828571,0.034483,0
3,5940,0.52125,0.481675,0.222222,36,6,1,4,1.0,0.0,...,0,0,95,0.782212,0.029155,0.5,2,0.8,0.0625,0
4,38,0.603175,0.246926,1.0,2,271,3,0,0.0,0.0,...,0,0,43,0.805128,0.028662,0.0,1,1.0,0.142857,0


In [19]:
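# Separate the target and drop the ID feature without mutating hotel_rev,
# so the cell can be re-run safely (pop() raises KeyError on a second run)
y = hotel_rev['reviewHelpfulness'].values
X = hotel_rev.drop(columns=['hotelId', 'reviewHelpfulness']).values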

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=1/2,
                                                    random_state=42)
X_train.shape, X_test.shape

In [None]:
kNNpipe  = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier())])

# Parameters for kNN are prefixed with kNN__
param_grid = {'scaler':[StandardScaler(), MinMaxScaler(),'passthrough'], 
              'kNN__n_neighbors':[1,3,5,7],
              'kNN__weights':['uniform','distance']
             }
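
The `kNN__` prefix follows scikit-learn's `<step name>__<parameter>` convention for nested parameters. As a minimal sketch, the names a grid can refer to can be listed with `get_params()` and set by hand with `set_params()`; the variable names `pipe`, `param_names` and `knn_params` are only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two-step pipeline as in the lab cell
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier())])

# get_params() lists the step names themselves plus <step>__<parameter> entries
param_names = sorted(pipe.get_params().keys())
knn_params = [name for name in param_names if name.startswith('kNN__')]
print(knn_params)

# set_params() accepts the same names that GridSearchCV reads from param_grid:
# a nested parameter, or a whole step replaced by 'passthrough'
pipe.set_params(kNN__n_neighbors=7, scaler='passthrough')
print(pipe.named_steps['kNN'].n_neighbors)
```

This is why `'scaler'` (a whole step) and `'kNN__n_neighbors'` (a parameter inside a step) can sit side by side in the same `param_grid`.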

In [None]:
grid_search = GridSearchCV(kNNpipe, param_grid=param_grid, verbose = 1)
grid_search = grid_search.fit(X_train,y_train)

In [None]:
grid_search.best_params_
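
The held-out `X_test` and `y_test` can be used to check the selected configuration on data the search never saw. The sketch below is not part of the original lab: it uses synthetic data from `make_classification` as a stand-in for the hotel review CSV, and the names `X_demo`, `y_demo`, `search`, `test_acc` and `cm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: synthetic two-class data standing in for HotelRevHelpfulness
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=1/2, random_state=42)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier())])
grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
        'kNN__n_neighbors': [1, 3, 5, 7],
        'kNN__weights': ['uniform', 'distance']}

search = GridSearchCV(pipe, param_grid=grid).fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
y_pred = search.best_estimator_.predict(X_te)
test_acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)

print(search.best_params_)
print(f"CV accuracy: {search.best_score_:.3f}, test accuracy: {test_acc:.3f}")
print(cm)
```

The cross-validated score is used to pick the parameters, so the test accuracy is the less optimistic of the two numbers to report.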

### All grid search results
The attribute `cv_results_` gives us access to the results for all options tested.  
We store the results in a data frame and print the important information. 

In [None]:
scores_df = pd.DataFrame(grid_search.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_kNN__n_neighbors',
           'param_kNN__weights', 'param_scaler']]

## Grid Search for Naive Bayes
**Q1**  
Repeat the exercise above to fit a Naive Bayes model.  
Consider the same scaling options and `GaussianNB` and `BernoulliNB` as classifier options. 

In [20]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler

The pipeline is similar to the one above, but with `naive_bayes` as the classifier step.  
In the `param_grid`, `GaussianNB` and `BernoulliNB` are the two options to consider for the `naive_bayes` step.  

In [10]:
NBpipe  = Pipeline(steps=[
    ('scaler','passthrough'),
    ('naive_bayes', GaussianNB())])

param_grid = {'scaler':[StandardScaler(), MinMaxScaler(),'passthrough'], 
              'naive_bayes':[GaussianNB(), BernoulliNB()]
             }

In [11]:
grid_search = GridSearchCV(NBpipe, param_grid=param_grid, cv = 10, verbose = 1)
grid_search = grid_search.fit(X_train,y_train)

Fitting 10 folds for each of 6 candidates, totalling 60 fits
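
The candidate count in this message is the size of the Cartesian product of the grid. A short sketch with `ParameterGrid` reproduces the arithmetic; the names `nb_grid`, `knn_grid` and the counters are illustrative only.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scalers = [StandardScaler(), MinMaxScaler(), 'passthrough']

# Naive Bayes grid: 3 scalers x 2 classifiers
nb_grid = {'scaler': scalers,
           'naive_bayes': [GaussianNB(), BernoulliNB()]}
nb_candidates = len(ParameterGrid(nb_grid))
nb_fits = nb_candidates * 10          # cv = 10 folds per candidate

# k-NN grid from the first section: 3 scalers x 4 values of k x 2 weightings
knn_grid = {'scaler': scalers,
            'kNN__n_neighbors': [1, 3, 5, 7],
            'kNN__weights': ['uniform', 'distance']}
knn_candidates = len(ParameterGrid(knn_grid))

print(nb_candidates, nb_fits, knn_candidates)
```

Each extra option multiplies the number of fits, which is worth keeping in mind before adding more values to a grid.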


And the winners are...

In [12]:
grid_search.best_params_

{'naive_bayes': BernoulliNB(), 'scaler': StandardScaler()}

In [13]:
scores_df = pd.DataFrame(grid_search.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score','param_naive_bayes','mean_test_score','param_scaler']]

Unnamed: 0,rank_test_score,param_naive_bayes,mean_test_score,param_scaler
0,1,BernoulliNB(),0.655167,StandardScaler()
1,2,BernoulliNB(),0.6465,MinMaxScaler()
2,3,GaussianNB(),0.6425,StandardScaler()
3,3,GaussianNB(),0.6425,MinMaxScaler()
4,5,BernoulliNB(),0.6385,passthrough
5,6,GaussianNB(),0.6385,passthrough
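
One possible reason the scaler matters for `BernoulliNB` (an interpretation, not something the lab states) is that it binarizes every feature at `binarize=0.0` by default, so the scaler decides which values end up as 0 and which as 1. The toy column below is made up purely for illustration and uses the `binarize` helper from `sklearn.preprocessing` to mimic that thresholding.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, binarize

# Hypothetical single feature with strictly positive values
col = np.array([[1.0], [2.0], [3.0], [4.0]])

# Values strictly greater than the threshold become 1, the rest 0
raw_bits = binarize(col, threshold=0.0).ravel()
std_bits = binarize(StandardScaler().fit_transform(col), threshold=0.0).ravel()
mm_bits = binarize(MinMaxScaler().fit_transform(col), threshold=0.0).ravel()

print("no scaler:     ", raw_bits)   # every positive value maps to 1
print("StandardScaler:", std_bits)   # split at the column mean
print("MinMaxScaler:  ", mm_bits)    # only the column minimum maps to 0
```

After standardisation the threshold sits at the feature mean, which keeps more information per feature than a split at the raw zero point.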


## Grid Search for Decision Trees
**Q2**  
Find the best decision tree model for the `HotelRevHelpfulness` dataset, considering `max_leaf_nodes` and the splitting `criterion`. The splitting `criterion` can be either 'gini' or 'entropy'; you can select your own options for `max_leaf_nodes`.

In [14]:
from sklearn.tree import DecisionTreeClassifier

There are no preprocessing steps so there is no need for a pipeline.   
We go with [3,5,10,20,50] as the options for `max_leaf_nodes`.

In [15]:
tree_grid = {'criterion':['gini','entropy'],
             'max_leaf_nodes':[3,5,10,20,50],
            }
tree = DecisionTreeClassifier()
tree_search = GridSearchCV(tree, param_grid=tree_grid, cv = 10, verbose = 1)
tree_search = tree_search.fit(X_train,y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


The winning parameters are...

In [16]:
tree_search.best_params_

{'criterion': 'gini', 'max_leaf_nodes': 5}

The main message from looking at all the results is that trees with fewer leaf nodes tend to do better: the top three configurations use 3 or 5 leaves, while those with 20 or 50 rank at the bottom.   
This suggests that bushier trees tend to overfit.  

In [17]:
scores_df = pd.DataFrame(tree_search.cv_results_)
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score','param_criterion','mean_test_score','param_max_leaf_nodes']]

Unnamed: 0,rank_test_score,param_criterion,mean_test_score,param_max_leaf_nodes
0,1,gini,0.695167,5
1,2,entropy,0.6795,5
2,3,gini,0.678833,3
3,4,gini,0.670667,10
4,5,entropy,0.654167,10
5,6,entropy,0.650833,3
6,7,entropy,0.646833,20
7,8,gini,0.642,50
8,9,gini,0.641833,20
9,10,entropy,0.634333,50
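
The suggestion that bushier trees overfit can be checked more directly by asking `GridSearchCV` for training scores as well. The sketch below is an illustration on synthetic, deliberately noisy data (an assumption standing in for the hotel review data); `X_demo`, `y_demo`, `search` and `res` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with 20% label noise, not the lab dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     flip_y=0.2, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_leaf_nodes': [3, 5, 10, 20, 50]},
                      cv=10,
                      return_train_score=True)   # adds mean_train_score
search.fit(X_demo, y_demo)

res = pd.DataFrame(search.cv_results_)
# A growing gap between training and validation accuracy signals overfitting
res['gap'] = res['mean_train_score'] - res['mean_test_score']
print(res[['param_max_leaf_nodes', 'mean_train_score',
           'mean_test_score', 'gap']])
```

Running the same idea on `X_train` and `y_train` from the lab would show whether the gap there widens as `max_leaf_nodes` increases.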


## Model selection
**Q3**  
Which model would you recommend for this dataset?

It's a toss-up between a decision tree using the gini splitting criterion with `max_leaf_nodes` = 5 and a *k*-NN classifier with *k* = 5, uniform weighting and no scaling.
