# Problem Session 10
## Classifying Pumpkin Seeds III: Ensemble Learning

In this notebook you will finish your work with the pumpkin seed data from <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load and prepare the data

Run the code below in order to:

- Load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `Data` folder,
- Create a column `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik` and
- Make a train test split setting $10\%$ of the data aside as a test set.

In [2]:
seeds = pd.read_excel("../../Data/Pumpkin_Seeds_Dataset.xlsx")

seeds['y'] = 0

seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                              shuffle=True,
                                              random_state=123,
                                              test_size=.1,
                                              stratify=seeds.y.values)
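
To see what `stratify` buys you, here is a minimal sketch on a made-up data frame. The `toy` frame and its class counts are assumptions for illustration, not the seeds data. The split uses the same arguments as above, and the fraction of `y=1` rows is compared between the two pieces.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the seeds data: 1000 rows, 480 of them with y=1
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=1000),
                    "y": np.repeat([0, 1], [520, 480])})

toy_train, toy_test = train_test_split(toy,
                                       shuffle=True,
                                       random_state=123,
                                       test_size=.1,
                                       stratify=toy.y.values)

# Fraction of y=1 in each piece; stratification keeps these nearly equal
train_frac = toy_train.y.mean()
test_frac = toy_test.y.mean()
print(len(toy_train), len(toy_test))
print(train_frac, test_frac)
```

Running the same check on `seeds_train.y` and `seeds_test.y` should show similarly matched class fractions.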

#### 2. Refresh your memory

If you need to refresh your memory on these data and the problem, you may want to look at a small subset of the data, look back on `Problem Session 8` and `Problem Session 9`, and/or browse Figure 5 and Table 1 of the paper, <a href="pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>.

`C=20`, a bit under $89\%$.

#### 3. Tuning a random forest with `GridSearchCV`

In this problem you will tune the `max_depth` and `n_estimators` hyperparameters of a random forest model. First you will use a `for` loop for the cross-validation. Then you will see how much easier your life could be with `GridSearchCV`.

##### a. 

Import `sklearn`'s random forest model for classification.

##### Sample Solution

In [5]:
from sklearn.ensemble import RandomForestClassifier

##### b.

Fill in the code below to find the values of `max_depth` and `n_estimators` with the highest average cross-validation accuracy.

##### Sample Solution

In [6]:
## import kfold
from sklearn.model_selection import StratifiedKFold

## import Pipeline and StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

## import accuracy_score
from sklearn.metrics import accuracy_score

In [7]:
## this will isolate the feature columns
features = seeds_train.columns[:-2]

In [8]:
## set the number of CV folds
n_splits = 5

## Make the kfold object
kfold = StratifiedKFold(n_splits, 
                        random_state=2013, 
                        shuffle=True)
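
As a quick illustration of what the stratified splitter does, the sketch below builds a made-up label vector (an assumption, not the seeds labels) and reports the fraction of ones in each holdout fold. The feature array is a placeholder because only the labels drive the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 60 zeros and 40 ones
y_toy = np.repeat([0, 1], [60, 40])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, unused by the splitter

toy_kfold = StratifiedKFold(5, random_state=2013, shuffle=True)

# Fraction of ones in each holdout fold
holdout_fracs = [y_toy[test_index].mean()
                 for _, test_index in toy_kfold.split(X_toy, y_toy)]
print(holdout_fracs)
```

Each fold should carry roughly the same class mix as the full label vector, which is why a stratified splitter is a reasonable choice for accuracy comparisons.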

In [9]:
## note this will take about 2 minutes to run

max_depths = range(1, 11)
n_trees = [100, 500]

rf_accs = np.zeros((n_splits, len(max_depths), len(n_trees)))


## i indexes the CV fold, j the max_depth value, k the n_estimators value
for i, (train_index, test_index) in enumerate(kfold.split(seeds_train, seeds_train.y)):
    ## the training and holdout sets for this fold
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    for j, max_depth in enumerate(max_depths):
        for k, n_estimators in enumerate(n_trees):
            ## print the indices to track progress
            print(i,j,k)
            rf = RandomForestClassifier(n_estimators = n_estimators,
                                           max_depth = max_depth,
                                           max_samples = int(.8*len(seeds_tt)),
                                           random_state = 403)
            
            rf.fit(seeds_tt[features], seeds_tt.y)
            
            pred = rf.predict(seeds_ho[features])
            
            ## record the holdout accuracy for this fold and grid point
            rf_accs[i,j,k] = accuracy_score(seeds_ho.y, pred)

0 0 0
0 0 1
0 1 0
...
4 9 0
4 9 1


In [10]:
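## average the accuracies over the CV folds
mean_accs = np.mean(rf_accs, axis=0)

## locate the (max_depth, n_estimators) grid point with the highest mean accuracy
max_index = np.unravel_index(np.argmax(mean_accs, axis=None), 
                             mean_accs.shape)


print(max_depths[max_index[0]],n_trees[max_index[1]])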

9 500


In [11]:
np.mean(rf_accs, axis=0)[max_index]

0.8924444444444445
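
The `np.argmax` and `np.unravel_index` step above can be hard to read at first. Here is a small worked example on a made-up accuracy array (the numbers are invented for illustration): average over the folds, find the flat position of the maximum, then convert that position back into grid coordinates.

```python
import numpy as np

# Hypothetical accuracies: 2 folds x 3 max_depth values x 2 n_estimators values
toy_accs = np.array([[[0.80, 0.82], [0.85, 0.90], [0.84, 0.86]],
                     [[0.78, 0.80], [0.87, 0.88], [0.86, 0.84]]])

# Average over the folds (axis 0), leaving a 3 x 2 grid of mean accuracies
toy_means = toy_accs.mean(axis=0)

# argmax with no axis returns a position in the flattened grid;
# unravel_index converts it back into (row, column) coordinates
flat_position = np.argmax(toy_means)
grid_position = np.unravel_index(flat_position, toy_means.shape)

print(toy_means)
print(flat_position, grid_position)
```

Here the best mean accuracy sits in the second row and second column of the grid, so `grid_position` picks out the second `max_depth` value and the second `n_estimators` value.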

##### c.

In this problem you will learn about `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, a handy class from `sklearn` that makes hyperparameter tuning through a grid search and cross-validation quicker to code up than writing out a series of `for` loops.


Read through the code chunks below and fill in the missing code to run the same grid search cross-validation you did above with `GridSearchCV`.

##### Sample Solution

In [12]:
## first import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [13]:
## This will also take about two minutes
grid_cv = GridSearchCV(RandomForestClassifier(), # first put the model object here
                          param_grid = {'max_depth':max_depths, # place the grid values for max_depth and
                                        'n_estimators':n_trees}, # and n_estimators here
                          scoring = 'accuracy', # put the metric we are trying to optimize here as a string, "accuracy"
                          cv = 5) # put the number of cv splits here

## you fit it just like a model
grid_cv.fit(seeds_train[features], seeds_train.y)
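
Note that the call above uses a default `RandomForestClassifier()` and `cv = 5`, so its forests and folds are not seeded the way the `for` loop's were, and the two searches can pick slightly different grid points. A minimal sketch of one way to bring them closer is shown below on made-up data from `make_classification` (an assumption, not the seeds data). It passes a seeded forest and a splitter object; `max_samples=.8` is used as a rough float analogue of the loop's `int(.8*len(seeds_tt))`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# The same kind of splitter used in the for loop
toy_kfold = StratifiedKFold(5, random_state=2013, shuffle=True)

toy_grid = GridSearchCV(RandomForestClassifier(max_samples=.8,     # fraction of rows per tree
                                               random_state=403),  # seeded forest
                        param_grid={'max_depth': [1, 2, 3],
                                    'n_estimators': [10, 25]},
                        scoring='accuracy',
                        cv=toy_kfold)  # a splitter object instead of an integer

toy_grid.fit(X_toy, y_toy)
print(toy_grid.best_params_, toy_grid.best_score_)
```

With the seeds fixed, rerunning the cell should return the same best grid point each time.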

Once a `GridSearchCV` object is fit, you can easily find which hyperparameter combination was best, what the best score was, and get access to the best model.

In [14]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

{'max_depth': 10, 'n_estimators': 500}

In [15]:
## You can find the best score like so
## .best_score_
grid_cv.best_score_

0.8933333333333333

In [16]:
## Calling best_estimator_ returns the model with the 
## best avg cv performance after it has been refit on the
## entire training set that was passed to .fit
grid_cv.best_estimator_

The `best_estimator_` is a model with the optimal hyperparameters that has been fit on the entire training set. Try to predict the pumpkin seed class on the training set with the `best_estimator_` below.

In [17]:
grid_cv.best_estimator_.predict(seeds_train[features])

array([0, 1, 1, ..., 0, 1, 0])

If you want to look at all of the results, you can do that as well with `.cv_results_`. Try that below.

In [18]:
## You can get all of the results with cv_results_
grid_cv.cv_results_

{'mean_fit_time': array([0.0560782 , 0.26416121, 0.07314363, 0.36555667, 0.09736295,
        0.46733723, 0.11305366, 0.56173539, 0.13054786, 0.64812379,
        0.1434937 , 0.73052917, 0.16864858, 0.79994445, 0.16937399,
        0.85337906, 0.18166289, 0.89137568, 0.19660411, 0.93134222]),
 'std_fit_time': array([0.00238646, 0.00052851, 0.00024437, 0.00555432, 0.00022649,
        0.00039053, 0.00100042, 0.00459884, 0.00039933, 0.01265075,
        0.00326833, 0.01576784, 0.01222119, 0.00896888, 0.0034818 ,
        0.01486816, 0.00755567, 0.02073201, 0.0085036 , 0.0199403 ]),
 'mean_score_time': array([0.00290575, 0.01149626, 0.00292664, 0.01265416, 0.0032166 ,
        0.01412101, 0.00345984, 0.01526675, 0.00368543, 0.01646886,
        0.00390592, 0.01791344, 0.00457168, 0.01854954, 0.00426073,
        0.01969166, 0.0044364 , 0.0200151 , 0.00477052, 0.02033296]),
 'std_score_time': array([1.76124153e-04, 1.50965876e-04, 6.09517272e-05, 2.30143250e-04,
        2.74608528e-05, 1.44076282e-
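
The raw `cv_results_` dictionary is hard to scan. One common way to read it, sketched here on made-up data (the data and the small grid are assumptions for illustration), is to wrap it in a `pandas` data frame and keep only the parameter and score columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data and a small grid
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_grid = GridSearchCV(RandomForestClassifier(random_state=403),
                        param_grid={'max_depth': [1, 2, 3],
                                    'n_estimators': [10, 25]},
                        scoring='accuracy',
                        cv=5)
toy_grid.fit(X_toy, y_toy)

# Each key of cv_results_ becomes a column, each grid point a row
results = pd.DataFrame(toy_grid.cv_results_)

summary = results[['param_max_depth', 'param_n_estimators',
                   'mean_test_score', 'std_test_score',
                   'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

The top row of `summary` corresponds to `best_params_`, and its `mean_test_score` matches `best_score_`.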

##### d.

Using either the `best_estimator_` fitted model or a model refit according to your results from the `for` loop cross-validation, find the feature importance scores. Refer back to your notes from `Problem Session 8`. How do the scores compare to your initial EDA?

##### Sample Solution

In [19]:
pd.DataFrame({'feature_importance_score':grid_cv.best_estimator_.feature_importances_},
                 index=features).sort_values('feature_importance_score',
                                                ascending=False)

| feature | feature_importance_score |
|---|---|
| Aspect_Ration | 0.221758 |
| Eccentricity | 0.184379 |
| Compactness | 0.181197 |
| Roundness | 0.106093 |
| Major_Axis_Length | 0.074367 |
| Minor_Axis_Length | 0.049387 |
| Solidity | 0.047371 |
| Perimeter | 0.032563 |
| Extent | 0.030083 |
| Area | 0.025067 |


Copying from my `Problem Session 8` EDA notes, I thought these features separated the two classes the most, in no particular order:

- `Major_Axis_Length`
- `Eccentricity`
- `Roundness`
- `Aspect_Ration`
- `Compactness`

These happen to be the features with the highest feature importance scores from the random forest model.
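
When reading these scores, it helps to remember that `feature_importances_` for a random forest are normalized to sum to one, so each score is a share of the total. A minimal sketch on made-up data (the data and the feature names `f0` through `f4` are hypothetical) is below. With `shuffle=False`, `make_classification` places the informative columns first, so `f0` and `f1` would typically rank highest.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: columns 0 and 1 are informative, the rest are noise
X_toy, y_toy = make_classification(n_samples=300, n_features=5,
                                   n_informative=2, n_redundant=0,
                                   shuffle=False, random_state=0)
toy_features = ['f0', 'f1', 'f2', 'f3', 'f4']  # made-up names

toy_rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=403)
toy_rf.fit(X_toy, y_toy)

# Label the scores by feature name and sort from most to least important
toy_scores = pd.Series(toy_rf.feature_importances_,
                       index=toy_features).sort_values(ascending=False)

print(toy_scores)
print(toy_scores.sum())
```

Because the scores are shares of a fixed total, a large score for one feature necessarily pushes the others down, which is worth keeping in mind when several features are strongly correlated.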

#### 5. Selecting a final model

##### a.

Refer back to your work in `Problem Session 8` and `Problem Session 9`. Across all three of the Pumpkin Seed classification notebooks, which model had the best average CV accuracy?

##### Sample Solution

For me it was the random forest model found by `GridSearchCV`.

##### b.

Retrain the best model on the entire training set.

##### Sample Solution

In [20]:
model = grid_cv.best_estimator_

model.fit(seeds_train[features], seeds_train.y)
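
A note on the refit: `GridSearchCV` uses `refit=True` by default, so `best_estimator_` has already been fit on all of the data passed to `.fit`. The sketch below, on made-up data (an assumption, not the seeds data), checks that refitting a fresh copy gives the same predictions when the forest has a fixed `random_state`. Without a fixed seed, as in the search above, a second fit would generally grow a slightly different forest.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_grid = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=403),
                        param_grid={'max_depth': [2, 4]},
                        scoring='accuracy',
                        cv=5)  # refit=True is the default
toy_grid.fit(X_toy, y_toy)

# best_estimator_ is already fit on all of X_toy
grid_pred = toy_grid.best_estimator_.predict(X_toy)

# clone makes an unfitted copy with the same hyperparameters and seed
manual_model = clone(toy_grid.best_estimator_).fit(X_toy, y_toy)
manual_pred = manual_model.predict(X_toy)

same_predictions = np.array_equal(grid_pred, manual_pred)
print(same_predictions)
```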

##### c.

Find the training and test accuracies for this model. Does there appear to be overfitting?

##### Sample Solution

In [21]:
## accuracy_score expects the true labels first, then the predictions
accuracy_score(seeds_train.y, model.predict(seeds_train[features]))

0.9653333333333334

In [22]:
## accuracy_score expects the true labels first, then the predictions
accuracy_score(seeds_test.y, model.predict(seeds_test[features]))

0.872

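Yes, there is some overfitting: the training accuracy (about $96.5\%$) is well above the test accuracy ($87.2\%$). Still, this was the model with the best average CV accuracy, and the test accuracy is comparable to that average (about $89.3\%$).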
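
To get a feel for how tree depth drives the gap between training and test accuracy, here is a minimal sketch on made-up data (an assumption, not the seeds data) that compares a forest of depth-one trees with a forest whose trees have no depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and a 10% test split
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=.1,
                                          random_state=123,
                                          stratify=y_toy)

gaps = {}
for depth in [1, None]:  # None grows each tree without a depth limit
    toy_rf = RandomForestClassifier(n_estimators=100,
                                    max_depth=depth,
                                    random_state=403)
    toy_rf.fit(X_tr, y_tr)

    train_acc = accuracy_score(y_tr, toy_rf.predict(X_tr))
    test_acc = accuracy_score(y_te, toy_rf.predict(X_te))

    # store (training accuracy, test accuracy, gap) for this depth
    gaps[depth] = (train_acc, test_acc, train_acc - test_acc)
    print(depth, gaps[depth])
```

The unrestricted forest should fit the training data much more closely than the shallow one. Whether that translates into a better or worse test accuracy depends on the data, which is why the choice is made with cross-validation rather than training accuracy.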

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).