# Problem Session 10
## Classifying Pumpkin Seeds III:  Ensemble Learning

In this notebook you will finish your work with the pumpkin seed data from <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load and prepare the data

Run the code below in order to:

- Load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `Data` folder,
- Create a column `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik` and
- Make a train test split setting $10\%$ of the data aside as a test set.

In [None]:
seeds = pd.read_excel("../../Data/Pumpkin_Seeds_Dataset.xlsx")

seeds['y'] = 0

seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
seeds_train, seeds_test = train_test_split(seeds.copy(),
                                              shuffle=True,
                                              random_state=123,
                                              test_size=.1,
                                              stratify=seeds.y.values)
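
To see what the `stratify` argument does, here is a minimal sketch on a made-up data frame. The frame `toy`, its column `x` and its class balance are assumptions for illustration only; they are not the pumpkin seed data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows with a binary label y
rng = np.random.default_rng(123)
toy = pd.DataFrame({"x": rng.normal(size=1000),
                    "y": (rng.random(1000) < 0.48).astype(int)})

toy_train, toy_test = train_test_split(toy.copy(),
                                       shuffle=True,
                                       random_state=123,
                                       test_size=.1,
                                       stratify=toy.y.values)

# With stratify, the share of y=1 should nearly match across the splits
full_rate = toy.y.mean()
train_rate = toy_train.y.mean()
test_rate = toy_test.y.mean()
print(f"full: {full_rate:.3f}, train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Running the same comparison on `seeds`, `seeds_train` and `seeds_test` is a quick sanity check before modeling.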

#### 2. Refresh your memory

If you need to refresh your memory on these data and the problem, you may want to look at a small subset of the data, look back on `Problem Session 8` and `Problem Session 9`, and/or browse Figure 5 and Table 1 of the paper, <a href="pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>.

##### a.
In this problem we are removing the usual scaffolding that we supply in the problem sessions!

Write your own code below to find the values of `max_depth` and `n_estimators` for either a Random Forest Classifier or an Extra Trees Classifier that give the highest average cross-validation accuracy.

Your code should accomplish the following:
1. Make a stratified 5-fold split of the training data.
2. Select `max_depth` from `range(1,11)` and `n_estimators` from the two choices `[100,500]` to maximize the average cross-validation accuracy.

Try not to copy/paste from earlier problem sessions:  talk through the logic and write your own cross-validation loop.
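
If you get stuck, the overall shape of such a loop can be sketched as follows. This is one possible layout, not the intended solution: it runs on synthetic data from `make_classification` rather than the seed data, and the names `X_toy`, `y_toy`, `depths`, `n_trees` and `cv_accs`, along with the reduced grids, are assumptions chosen to keep the sketch fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training features and labels (not the seed data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Smaller grids than the exercise asks for, to keep the sketch quick
depths = [1, 3, 5]
n_trees = [10, 25]

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# cv_accs[i, j, k] = accuracy on fold i with depths[j] and n_trees[k]
cv_accs = np.zeros((5, len(depths), len(n_trees)))

for i, (train_idx, holdout_idx) in enumerate(kfold.split(X_toy, y_toy)):
    for j, depth in enumerate(depths):
        for k, n in enumerate(n_trees):
            rf = RandomForestClassifier(max_depth=depth,
                                        n_estimators=n,
                                        random_state=0)
            rf.fit(X_toy[train_idx], y_toy[train_idx])
            pred = rf.predict(X_toy[holdout_idx])
            cv_accs[i, j, k] = accuracy_score(y_toy[holdout_idx], pred)

# Average over the folds, then locate the best grid point
mean_accs = cv_accs.mean(axis=0)
best_j, best_k = np.unravel_index(np.argmax(mean_accs), mean_accs.shape)

print(f"best max_depth = {depths[best_j]}, "
      f"best n_estimators = {n_trees[best_k]}, "
      f"mean CV accuracy = {mean_accs[best_j, best_k]:.3f}")
```

Storing one accuracy per fold and grid point, and averaging only at the end, keeps the per-fold values available if you later want to look at their spread.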

Some further questions to consider when training a random forest classifier:

* Is scaling necessary as a preprocessing step?  Why or why not?
* Does collinearity of features matter?
* Should you do feature selection first?
* What bias/variance impact do you think increasing `n_estimators` or `max_depth` will have on the model?
* How might you decide whether to use Random Forest or Extra Trees?
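
One way to probe the scaling question empirically, rather than take an answer on faith, is to fit the same tree on raw and standardized features and compare predictions. The sketch below does this for a single decision tree on synthetic data; the data, the column scales and all variable names are assumptions, and a single toy experiment is evidence rather than proof.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the seed data)
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)

# Put the features on very different scales
X_toy = X_toy * np.array([1, 10, 100, 1000, 5, 50])

X_tr, X_ho, y_tr, y_ho = train_test_split(X_toy, y_toy,
                                          test_size=.25,
                                          random_state=0,
                                          stratify=y_toy)

# Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_tr)

# Same tree settings, one fit on raw features and one on standardized features
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

# Fraction of holdout rows on which the two trees make the same prediction
agreement = np.mean(raw_tree.predict(X_ho)
                    == scaled_tree.predict(scaler.transform(X_ho)))
print(f"Fraction of holdout predictions that agree: {agreement:.3f}")
```

Think about why the split rule of a decision tree would or would not care about a monotone rescaling of a single feature, then check whether the output matches your reasoning.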


In [None]:
# your code here

In [None]:
# Hint:  at some point you will want to know both the maximum cross validation accuracy and the parameter values used to obtain that
# If you want the index of the maximal element of a numpy array `A`, the following code will give you the index

# Making a toy array first
np.random.seed(215)
A = np.random.randint(10, size = (2,4,3))

# Here is how to find the index of the maximal value
max_index = np.unravel_index(np.argmax(A), A.shape)

print(f"A = \n{A} \n \n The maximum value of A is A[{max_index}] = {A[max_index]}")

##### b.

In this problem you will learn about `GridSearchCV`, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</a>, a handy class from `sklearn` that makes hyperparameter tuning through a grid search and cross-validation quicker to code up than writing out a series of `for` loops.


Read through the code chunks below and fill in the missing code to run the same grid search cross-validation you did above with `GridSearchCV`.

In [None]:
## first import GridSearchCV
from sklearn.model_selection import GridSearchCV

In [None]:
## This will also take 1-2 minutes to run
grid_cv = GridSearchCV(, # first put the model object here
                          param_grid = {'max_depth':, # place the grid values for max_depth
                                        'n_estimators':}, # and n_estimators here
                          scoring = , # put the metric we are trying to optimize here as a string, "accuracy"
                          cv = ) # put the number of cv splits here

## you fit it just like a model


Once a `GridSearchCV` is fit you are easily able to find what hyperparameter combinations were best, what the optimal score was as well as get access to the best model.
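
For reference, a complete fit-then-inspect cycle might look like the sketch below. It uses synthetic data and a deliberately small grid, so `X_toy`, `y_toy`, `toy_grid` and the grid values are assumptions, not the answer to the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the seed data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

toy_grid = GridSearchCV(RandomForestClassifier(random_state=0),  # the model object
                        param_grid={'max_depth': [1, 3, 5],      # small toy grid
                                    'n_estimators': [10, 25]},
                        scoring="accuracy",                      # metric to optimize
                        cv=5)                                    # number of cv splits

# Fit it just like a model
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)
print(round(toy_grid.best_score_, 3))

# With the default refit=True, predict uses the refit best model
toy_preds = toy_grid.predict(X_toy)
```

When `cv` is an integer and the estimator is a classifier, `GridSearchCV` uses stratified folds without shuffling, so its scores can differ slightly from a hand-written loop that shuffles.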

In [None]:
## You can find the hyperparameter grid point that
## gave the best performance like so
## .best_params_
grid_cv.best_params_

In [None]:
## You can find the best score like so
## .best_score_
grid_cv.best_score_

In [None]:
## Calling best_estimator_ returns the model with the 
## best avg cv performance after it has been refit on the
## entire data set
grid_cv.best_estimator_

The `best_estimator_` is a model with the optimal hyperparameters that has been fit on the entire training set. Try to predict the pumpkin seed class on the training set with the `best_estimator_` below.

In [None]:
## code here



If you want to look at all of the results, you can do that as well with `.cv_results_`. Try that below.

In [None]:
## You can get all of the results with cv_results_
grid_cv.cv_results_
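
The raw `cv_results_` dictionary is hard to read. One common approach, sketched here on synthetic data with assumed names (`toy_grid`, `results`, `summary`), is to wrap it in a `DataFrame` and sort by rank.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small toy grid (not the seed data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
toy_grid = GridSearchCV(ExtraTreesClassifier(random_state=0),
                        param_grid={'max_depth': [1, 3, 5],
                                    'n_estimators': [10, 25]},
                        scoring="accuracy",
                        cv=5).fit(X_toy, y_toy)

# cv_results_ is a dict of equal-length arrays, one entry per grid point
results = pd.DataFrame(toy_grid.cv_results_)

# Keep the most useful columns and list the best grid points first
summary = (results[['param_max_depth', 'param_n_estimators',
                    'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(summary)
```

The `std_test_score` column helps judge whether the gap between the top few grid points is larger than the fold-to-fold variation.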

##### c.

Using either the fitted `best_estimator_` model or a model refit according to your results from the `for` loop cross-validation, find the feature importance scores. Refer back to your notes from `Problem Session 8`: how do the scores compare to your initial EDA?
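
As a rough sketch of the mechanics, fitted forests in `sklearn` expose a `feature_importances_` attribute that can be paired with the column names. The example uses synthetic data, and the feature names `feature_0`, ..., `feature_5` and the hyperparameter values are placeholders, not the seed features or your tuned values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names (not the seed data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_toy.shape[1])]

rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X_toy, y_toy)

# Pair each score with its feature name and sort from most to least important
importances = (pd.DataFrame({'feature': feature_names,
                             'importance_score': rf.feature_importances_})
               .sort_values('importance_score', ascending=False))
print(importances)
```

These impurity-based scores are normalized to sum to one, so they rank features relative to each other rather than measure an absolute effect.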

In [None]:
## code here



##### Write here



#### 5. Bagging one of the previous models

In the classification section of this course we covered the following algorithms:

* kNN
* Logistic Regression
* LDA/QDA/NB
* Support Vector Machines

Of these, which are likely to benefit from bagging, if any?  Does hyperparameter selection impact your answer (for example:  low vs. high $k$ in kNN)?

Choose one algorithm which you think could be improved by bagging and implement hyperparameter tuning for both the single model and bagged model.  Did it improve performance?  
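
As a starting point, the comparison could be set up along these lines. The sketch picks kNN with $k=1$ purely for illustration, runs on synthetic data, and fixes the bagging settings (`n_estimators=25`, `max_samples=0.5`) instead of tuning them; all of these choices are assumptions, and the sketch makes no claim about which model wins on the seed data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the seed data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# A single 1-nearest-neighbor model, with scaling since kNN is distance based
single = Pipeline([('scale', StandardScaler()),
                   ('knn', KNeighborsClassifier(n_neighbors=1))])

# The same base model, bagged over 25 bootstrap samples of half the rows each
bagged = Pipeline([('scale', StandardScaler()),
                   ('bag', BaggingClassifier(KNeighborsClassifier(n_neighbors=1),
                                             n_estimators=25,
                                             max_samples=0.5,
                                             random_state=0))])

single_scores = cross_val_score(single, X_toy, y_toy, cv=5, scoring="accuracy")
bagged_scores = cross_val_score(bagged, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"single kNN mean CV accuracy: {single_scores.mean():.3f}")
print(f"bagged kNN mean CV accuracy: {bagged_scores.mean():.3f}")
```

For the exercise itself, both pipelines can be passed to `GridSearchCV` so that $k$ (and, for the bagged model, the bagging settings) are tuned rather than fixed.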


#### 6. Selecting a final model

##### a.

Refer back to your work in `Problem Session 8` and `Problem Session 9`. Across all three of the Pumpkin Seed classification notebooks, which model had the best average CV accuracy?

##### Write here

##### b.

Retrain the best model on the entire training set.

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### c.

Find the training and test accuracies for this model. Does there appear to be overfitting?
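
The comparison can be sketched as follows. The model, its hyperparameters and the synthetic data are placeholders; substitute your selected model and the actual `seeds_train` and `seeds_test` splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 10% test split (not the seed data)
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=.1,
                                          random_state=123,
                                          stratify=y_toy)

# Placeholder "final" model, fit on the entire training set
final_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
final_model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, final_model.predict(X_tr))
test_acc = accuracy_score(y_te, final_model.predict(X_te))

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
print(f"gap (train - test): {train_acc - test_acc:.3f}")
```

A large positive gap is one sign of overfitting, but with a small test set the test accuracy is itself noisy, so compare it with your average CV accuracy as well.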

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### Write here



--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).