# Problem Session 10
## Classifying Pumpkin Seeds III:  Ensemble Learning

In this notebook you will finish your work with the pumpkin seed data from <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load and prepare the data

Run the code below in order to:

- Load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `Data` folder,
- Create a column `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik` and
- Make a train test split setting $10\%$ of the data aside as a test set.

In [2]:
seeds = pd.read_excel("../../Data/Pumpkin_Seeds_Dataset.xlsx")

seeds['y'] = 0

seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                              shuffle=True,
                                              random_state=123,
                                              test_size=.1,
                                              stratify=seeds.y.values)

#### 2. Refresh your memory

If you need to refresh your memory on these data and the problem, you may want to look at a small subset of the data, look back on `Problem Session 8` and `Problem Session 9`, and/or browse Figure 5 and Table 1 of the paper, <a href="pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>.

`C=20`, a bit under $89\%$.

#### 3. Tuning a random forest with `GridSearchCV`

In this problem you will tune the `max_depth` and `n_estimators` hyperparameters of a random forest model. First you will use a `for` loop for the cross-validation. Then you will see how much easier your life could be with `GridSearchCV`.

##### a.
In this problem we are removing the usual scaffolding that we supply in the problem sessions!

Write your own code below to find the values of `max_depth` and `n_estimators` for either a Random Forest Classifier or an Extra Trees Classifier with the highest average cross-validation accuracy.

Your code should accomplish the following:
1. Make a stratified 5-fold split of the training data.
2. Select `max_depth` from `range(1,11)`  and `n_estimators` from the two choices `[100,500]` to maximize cross validation accuracy.

Try not to copy/paste from earlier problem sessions:  talk through the logic and write your own cross-validation loop.

Some further questions to consider when training a random forest classifier:

* Is scaling necessary as a preprocessing step?  Why or why not?
* Does collinearity of features matter?
* Should you do feature selection first?
* What bias/variance impact do you think increasing `n_estimators` or `max_depth` will have on the model?
* How might you decide whether to use Random Forest or Extra Trees?

##### Sample Solution

In [5]:
## import random forest classifier
from sklearn.ensemble import RandomForestClassifier

## import kfold
from sklearn.model_selection import StratifiedKFold

## import accuracy_score
from sklearn.metrics import accuracy_score

In [6]:
## this will isolate the feature columns
features = seeds_train.columns[:-2]

In [7]:
## set the number of CV folds
n_splits = 5

## Make the kfold object
kfold = StratifiedKFold(n_splits, 
                        random_state=216, 
                        shuffle=True)

In [8]:
## note this will take about 2 minutes to run

max_depths = range(1, 11)
n_trees = [100, 500]

rf_accs = np.zeros((n_splits, len(max_depths), len(n_trees)))



for i,(train_index, test_index) in enumerate(kfold.split(seeds_train, seeds_train.y)):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]

    for j, max_depth in enumerate(max_depths):
        for k, n_estimators in enumerate(n_trees):
            print(i,j,k)
            rf = RandomForestClassifier(n_estimators = n_estimators,
                                           max_depth = max_depth,
                                           max_samples = 0.8,
                                           random_state = 216)
                                           
            rf.fit(seeds_tt[features], seeds_tt.y)
            
            pred = rf.predict(seeds_ho[features])
            
            rf_accs[i,j,k] = accuracy_score(seeds_ho.y,  pred)

0 0 0
0 0 1
0 1 0
0 1 1
0 2 0
0 2 1
0 3 0
0 3 1
0 4 0
0 4 1
0 5 0
0 5 1
0 6 0
0 6 1
0 7 0
0 7 1
0 8 0
0 8 1
0 9 0
0 9 1
1 0 0
1 0 1
1 1 0
1 1 1
1 2 0
1 2 1
1 3 0
1 3 1
1 4 0
1 4 1
1 5 0
1 5 1
1 6 0
1 6 1
1 7 0
1 7 1
1 8 0
1 8 1
1 9 0
1 9 1
2 0 0
2 0 1
2 1 0
2 1 1
2 2 0
2 2 1
2 3 0
2 3 1
2 4 0
2 4 1
2 5 0
2 5 1
2 6 0
2 6 1
2 7 0
2 7 1
2 8 0
2 8 1
2 9 0
2 9 1
3 0 0
3 0 1
3 1 0
3 1 1
3 2 0
3 2 1
3 3 0
3 3 1
3 4 0
3 4 1
3 5 0
3 5 1
3 6 0
3 6 1
3 7 0
3 7 1
3 8 0
3 8 1
3 9 0
3 9 1
4 0 0
4 0 1
4 1 0
4 1 1
4 2 0
4 2 1
4 3 0
4 3 1
4 4 0
4 4 1
4 5 0
4 5 1
4 6 0
4 6 1
4 7 0
4 7 1
4 8 0
4 8 1
4 9 0
4 9 1


In [9]:
## average over the folds once, then locate the best grid cell
mean_accs = np.mean(rf_accs, axis=0)
max_index = np.unravel_index(np.argmax(mean_accs), mean_accs.shape)


print(max_depths[max_index[0]],n_trees[max_index[1]])

6 500
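
The `argmax`/`unravel_index` step can be easier to follow on a tiny array. The sketch below uses made-up accuracies (the names `toy_accs`, `toy_depths` and `toy_trees` are illustrative only, not part of the notebook) to show how averaging over the fold axis and then unraveling the flat index recovers the best grid cell.

```python
import numpy as np

# Toy accuracies with shape (folds, depths, tree counts); the values are made up.
toy_accs = np.array([
    [[0.80, 0.81], [0.85, 0.88], [0.84, 0.86]],
    [[0.78, 0.79], [0.86, 0.90], [0.85, 0.85]],
])

# Average over the fold axis, leaving a (depths, tree counts) grid.
mean_accs = toy_accs.mean(axis=0)

# argmax returns a position in the flattened grid;
# unravel_index converts it back into (row, column) indices.
flat_index = np.argmax(mean_accs)
best_depth_idx, best_trees_idx = np.unravel_index(flat_index, mean_accs.shape)

toy_depths = [1, 2, 3]
toy_trees = [100, 500]
print(toy_depths[best_depth_idx], toy_trees[best_trees_idx])  # prints: 2 500
```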


##### c.

In this problem you will learn about `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, a handy class from `sklearn` that makes hyperparameter tuning through a grid search and cross-validation quicker to code up than writing out a series of `for` loops.


Read through the code chunks below and fill in the missing code to run the same grid search cross-validation you did above with `GridSearchCV`.

##### Sample Solution

In [10]:
## first import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [11]:
## This will also take about two minutes
grid_cv = GridSearchCV(RandomForestClassifier(), # first put the model object here
                          param_grid = {'max_depth':max_depths, # place the grid values for max_depth here
                                        'n_estimators':n_trees}, # and for n_estimators here
                          scoring = 'accuracy', # put the metric we are trying to optimize here as a string, "accuracy"
                          cv = 5) # put the number of cv splits here

## you fit it just like a model
grid_cv.fit(seeds_train[features], seeds_train.y)
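
Note that the call above uses a default `RandomForestClassifier()` and `cv = 5`, which for a classifier means an unshuffled stratified split, so it is not exactly the same search as the `for` loop. That may be part of why the two approaches can report different best hyperparameters. The following is a minimal sketch, on synthetic data standing in for the seeds (an assumption made so the block runs on its own), of how passing the same `StratifiedKFold` object and the same estimator settings should reproduce the loop's mean accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the seed data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_depths = [1, 2, 3]
demo_trees = [5, 10]
demo_kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=216)

# Manual cross-validation loop, mirroring the structure used above.
manual_accs = np.zeros((3, len(demo_depths), len(demo_trees)))
for i, (train_idx, test_idx) in enumerate(demo_kfold.split(X_demo, y_demo)):
    for j, depth in enumerate(demo_depths):
        for k, n_est in enumerate(demo_trees):
            rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                        max_samples=0.8, random_state=216)
            rf.fit(X_demo[train_idx], y_demo[train_idx])
            manual_accs[i, j, k] = accuracy_score(y_demo[test_idx],
                                                  rf.predict(X_demo[test_idx]))

# Same search with GridSearchCV: same estimator settings, same splitter object.
demo_grid = GridSearchCV(RandomForestClassifier(max_samples=0.8, random_state=216),
                         param_grid={"max_depth": demo_depths,
                                     "n_estimators": demo_trees},
                         scoring="accuracy",
                         cv=demo_kfold)
demo_grid.fit(X_demo, y_demo)

# Candidates are ordered with max_depth varying slowest, so reshape to (depths, trees).
grid_means = demo_grid.cv_results_["mean_test_score"].reshape(len(demo_depths),
                                                              len(demo_trees))
print(np.allclose(grid_means, manual_accs.mean(axis=0)))
```

Because the splitter has an integer `random_state`, it yields the same folds each time it is asked to split, which is what makes the comparison meaningful.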

Once a `GridSearchCV` is fit you are easily able to find what hyperparameter combinations were best, what the optimal score was as well as get access to the best model.

In [12]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

{'max_depth': 10, 'n_estimators': 500}

In [13]:
## You can find the best score like so
## .best_score_
grid_cv.best_score_

0.8942222222222223

In [14]:
## Calling best_estimator_ returns the model with the 
## best avg cv performance after it has been refit on the
## entire data set
grid_cv.best_estimator_

The `best_estimator_` is a model with the optimal hyperparameters that has been fit on the entire training set. Try and predict the pumpkin seed class on the training set with the `best_estimator_` below.

In [15]:
grid_cv.best_estimator_.predict(seeds_train[features])

array([0, 1, 1, ..., 0, 1, 0])

If you want to look at all of the results, you can do that as well with `.cv_results_`. Try that below.

In [16]:
## You can get all of the results with cv_results_
grid_cv.cv_results_

{'mean_fit_time': array([0.08774109, 0.42847228, 0.11546416, 0.57200108, 0.14318261,
        0.70664454, 0.16799951, 0.8398716 , 0.19289546, 0.96691165,
        0.21226001, 1.05761242, 0.23034906, 1.14560099, 0.24504271,
        1.22279983, 0.25823731, 1.28555164, 0.26694407, 1.34207749]),
 'std_fit_time': array([0.00080343, 0.00163796, 0.0005584 , 0.00165412, 0.00192945,
        0.00105942, 0.00036378, 0.004618  , 0.00074657, 0.02124811,
        0.0011755 , 0.00547485, 0.00233812, 0.00637969, 0.00155197,
        0.00718532, 0.00065147, 0.00833651, 0.00129142, 0.01374401]),
 'mean_score_time': array([0.00293498, 0.01020603, 0.00324345, 0.01197758, 0.00362153,
        0.01406369, 0.00385723, 0.01548834, 0.00418825, 0.01693158,
        0.00447502, 0.01854496, 0.0046845 , 0.01949487, 0.00490222,
        0.02094922, 0.00518665, 0.02239914, 0.00528708, 0.02287455]),
 'std_score_time': array([6.90423980e-05, 4.26585142e-05, 6.83653791e-05, 2.88675971e-05,
        3.03529980e-05, 1.11756389e-
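
The raw `cv_results_` dictionary is hard to read. One common way to inspect it is to load it into a `pandas` `DataFrame` and sort by rank. The sketch below does this on synthetic data with a small grid (both are assumptions chosen so the block runs quickly on its own); the same pattern should apply to `grid_cv.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the seed data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = GridSearchCV(RandomForestClassifier(random_state=0),
                         param_grid={"max_depth": [1, 2, 3],
                                     "n_estimators": [5, 10]},
                         scoring="accuracy",
                         cv=3)
demo_grid.fit(X_demo, y_demo)

# Each key of cv_results_ becomes a column; each grid point becomes a row.
results_df = pd.DataFrame(demo_grid.cv_results_)

# Keep the most useful columns and put the best-ranked grid point first.
summary = (results_df[["param_max_depth", "param_n_estimators",
                       "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(summary)
```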

##### d.

Using either the `best_estimator_` fitted model or a model refit according to your results from the `for` loop cross-validation, find the feature importance scores. Refer back to your notes from `Problem Session 8`. How do the scores compare to your initial EDA?

##### Sample Solution

In [17]:
pd.DataFrame({'feature_importance_score':grid_cv.best_estimator_.feature_importances_},
                 index=features).sort_values('feature_importance_score',
                                                ascending=False)

| Feature | feature_importance_score |
|---|---|
| Eccentricity | 0.208256 |
| Aspect_Ration | 0.207656 |
| Compactness | 0.176741 |
| Roundness | 0.100275 |
| Major_Axis_Length | 0.069315 |
| Minor_Axis_Length | 0.05261 |
| Solidity | 0.046681 |
| Perimeter | 0.038283 |
| Extent | 0.02778 |
| Area | 0.025153 |


Copying from my `Problem Session 8` EDA notes, I thought these features separated the data the most (in no particular order):

- `Major_Axis_Length`
- `Eccentricity`
- `Roundness`
- `Aspect_Ration`
- `Compactness`

These happen to be the features with the highest feature importance scores from the random forest model.

#### 5. Bagging one of the previous models

In the classification section of this course we covered the following algorithms:

* kNN
* Logistic Regression
* LDA/QDA/NB
* Support Vector Machines

Which of these are likely to benefit from bagging, if any?  Does hyperparameter selection impact your answer (for example:  low vs. high $k$ in kNN)?

Choose one algorithm which you think could be improved by bagging and implement hyperparameter tuning for both the single model and bagged model.  Did it improve performance? 

Note:  if you need to "reach inside" of a pipeline during your GridSearchCV, see 

https://scikit-learn.org/stable/modules/grid_search.html#composite-estimators-and-parameter-spaces

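Essentially you need to "name mangle" in a particular way: join the pipeline step name and the parameter name with a double underscore.  For instance if you have a 

`pipe = Pipeline([('scale', StandardScaler()),('svm', SVC())])`

then your `param_grid` for `GridSearchCV(pipe, ...)` would look like 

`param_grid = {'svm__gamma': [1,2,3], 'svm__C': [0.1,1,2]}`

If the pipeline is itself wrapped in another estimator, such as a `BaggingClassifier`, prefix each name with that estimator's parameter name as well, for example `'estimator__svm__C'`.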

Also note:  don't be surprised if it doesn't help much!  We have already seen that "all the models" give performance of around 86 to 88 percent accuracy.  This is just for learning how to use the tools!


##### Sample Solution

Bagging reduces variance, so it helps most with models that overfit.  If the algorithm has decision boundaries which are relatively stable with respect to resampling, bagging will not accomplish much.  This should be true of kNN with large values of $k$, logistic regression, LDA/QDA/NB, and linear SVM.  Kernel SVM can (for certain kernels and hyperparameters) be prone to overfitting, as can kNN for small values of $k$.  So bagging could help to improve such models.

I will use bagging for kNN with $k=1,2,3$ and compare the performance to the "unbagged" kNN.

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate


In [20]:
pipe = Pipeline([('scale', StandardScaler()),('knn',KNeighborsClassifier())])
bag_pipe = BaggingClassifier(pipe, bootstrap = True, max_samples = 0.90)
bag_cv = GridSearchCV(bag_pipe, 
                          param_grid = {'estimator__knn__n_neighbors':[1,2,3], 
                                        'n_estimators':np.arange(1,100,10)}, 
                          scoring = 'accuracy', 
                          cv = 5)
bag_cv.fit(seeds_train[features], seeds_train.y)

In [21]:
print(f"The best mean cv accuracy of {bag_cv.best_score_:.3f} was achieved using k = {bag_cv.best_estimator_.estimator['knn'].n_neighbors} and {bag_cv.best_estimator_.n_estimators} estimators")

The best mean cv accuracy of 0.877 was achieved using k = 3 and 91 estimators


In [22]:
single_pipe = Pipeline([('scale', StandardScaler()),('knn',KNeighborsClassifier(n_neighbors=3))])
single_cv = cross_validate(single_pipe, seeds_train[features], seeds_train.y, cv = 5, scoring = 'accuracy')

In [23]:
print(f"The mean cv accuracy of a single kNN model with k=3 is {single_cv['test_score'].mean():.3f}")

The mean cv accuracy of a single kNN model with k=3 is 0.865
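
The comparison above fixes $k=3$ for the single model. For a like-for-like comparison, the single pipeline could also be tuned over the same values of $k$. A minimal sketch, using synthetic data in place of the seeds so that the block runs on its own (an assumption, as are the `demo_` names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seed data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

demo_knn_pipe = Pipeline([("scale", StandardScaler()),
                          ("knn", KNeighborsClassifier())])

# No bagging wrapper here, so the parameter name has a single prefix.
demo_knn_cv = GridSearchCV(demo_knn_pipe,
                           param_grid={"knn__n_neighbors": [1, 2, 3]},
                           scoring="accuracy",
                           cv=5)
demo_knn_cv.fit(X_demo, y_demo)

print(demo_knn_cv.best_params_, round(demo_knn_cv.best_score_, 3))
```

On the real data you would pass `seeds_train[features]` and `seeds_train.y` to `fit` and compare `best_score_` with `bag_cv.best_score_`.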


#### 6. Selecting a final model

##### a.

Refer back to your work in `Problem Session 8` and `Problem Session 9`. Across all three of the Pumpkin Seed classification notebooks, which model had the best average CV accuracy?

##### Sample Solution

For me it was the random forest model found by `GridSearchCV`.

##### b.

Retrain the best model on the entire training set.

##### Sample Solution

In [24]:
model = grid_cv.best_estimator_

model.fit(seeds_train[features], seeds_train.y)

##### c.

Find the training and test accuracies for this model. Does there appear to be overfitting?

##### Sample Solution

In [25]:
## accuracy_score expects (y_true, y_pred)
accuracy_score(seeds_train.y, model.predict(seeds_train[features]))

0.9653333333333334

In [26]:
## accuracy_score expects (y_true, y_pred)
accuracy_score(seeds_test.y, model.predict(seeds_test[features]))

0.876

There is some overfitting, yes: the training accuracy is well above the test accuracy. Still, this was the model with the best average CV score, and the test accuracy is comparable to that average.
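
One rough way to judge whether the test accuracy is "comparable" to the CV average is to look at the binomial standard error of the test accuracy. The sketch below assumes a test set of 250 seeds (10% of 2,500, which is an assumption about the data set size) and treats the test predictions as independent; it is a back-of-the-envelope check, not a formal test.

```python
import math

test_accuracy = 0.876   # test accuracy reported above
cv_accuracy = 0.8942    # best mean CV accuracy reported above (rounded)
n_test = 250            # assumption: 10% of 2,500 seeds

# Standard error of a proportion estimated from n_test independent trials.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# How many standard errors separate the CV average from the test accuracy.
gap_in_se = (cv_accuracy - test_accuracy) / standard_error

print(f"standard error: {standard_error:.3f}, gap: {gap_in_se:.2f} standard errors")
```

Under these assumptions the gap is less than one standard error, which is consistent with the two numbers being comparable.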

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).