# Problem Session 8
## Classifying Pumpkin Seeds I

In the next few notebooks you will work to build models to classify types of pumpkin seeds using features engineered from photographs of the seeds. Here we will introduce the data set, perform some exploratory data analysis and build some simple models.

The problems in this notebook cover content from our `Classification` notebooks, including:
- `Adjustments for Classification`,
- `k Nearest Neighbors`,
- `The Confusion Matrix` and
- `Logistic Regression`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load the data

##### a.

First load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `data` folder.

Note that you will want to use the `read_excel` function from `pandas`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel">https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel</a>. Then print a random sample of five rows.
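As a minimal sketch of the "random sample" step, the block below builds a small made-up `DataFrame` (the name `toy` and its values are illustrative, not the pumpkin seed data) and draws five random rows with `DataFrame.sample`. The commented-out `read_excel` line assumes the notebook sits one level below the folder that contains `data`; adjust the relative path to match your setup. Reading `.xlsx` files also requires an Excel engine such as `openpyxl` to be installed.

```python
import numpy as np
import pandas as pd

# Loading an Excel file returns a DataFrame. The relative path below is an
# assumption about where this notebook lives relative to the data folder.
# seeds = pd.read_excel("../data/Pumpkin_Seeds_Dataset.xlsx")

# Toy stand-in for the loaded data (illustrative values only)
toy = pd.DataFrame({"Area": np.arange(20) * 1000,
                    "Class": ["A", "B"] * 10})

# Draw five rows at random; random_state makes the draw reproducible
toy_sample = toy.sample(5, random_state=440)
print(toy_sample)
```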

In [None]:
## load the data here



In [None]:
## look at the sample here



##### b.

Create a new column of the `DataFrame` called `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik`.
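One possible way to build a 0/1 column from string labels is `np.where`, sketched here on a four-row toy `DataFrame` (the frame `toy` is made up for illustration). This assumes the labels in your loaded data match the strings exactly, including the accented characters.

```python
import numpy as np
import pandas as pd

# Toy frame with the two class labels used in this data set
toy = pd.DataFrame({"Class": ["Çerçevelik", "Ürgüp Sivrisi",
                              "Ürgüp Sivrisi", "Çerçevelik"]})

# np.where returns 1 where the condition holds and 0 elsewhere
toy["y"] = np.where(toy["Class"] == "Ürgüp Sivrisi", 1, 0)
print(toy)
```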

In [None]:
## code here




#### 2. Learn about the data

##### a.

These data represent various measurements of pumpkin seeds that come from high quality photos of the seeds. The data was provided as supplementary material to <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In this work the researchers demonstrated how various algorithms could be used to predict whether a pumpkin seed was a Ürgüp Sivrisi seed or a Çerçevelik seed. These data were generated by engineering features from special photos of seeds like so:
<br>
<br>
<img src="problem_session_8_assets/pumpkin_seeds.jpg" width="55%"></img>

As you can see, these two types of seed can be quite difficult to tell apart by eye, hence the appeal to machine learning algorithms.

A PDF of this paper is provided here, <a href="problem_session_8_assets/pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>. Scroll down to Figure 5 and Table 1 and read about the features of this data set.

#### 3. Train test split

##### a.

Look at how the data is split between the two classes. Does this appear to be imbalanced data? <i>Recall that we say data is imbalanced if one of the classes has a very small presence in the data set.</i>
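A quick way to inspect a class split is `value_counts`, shown here on a hypothetical label column (`toy_y` is made up for illustration). Passing `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical label column: 13 zeros and 12 ones
toy_y = pd.Series([0] * 13 + [1] * 12, name="y")

# Raw counts per class
print(toy_y.value_counts())

# Class proportions; a proportion near 0 would signal imbalance
class_props = toy_y.value_counts(normalize=True)
print(class_props)
```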

In [None]:
## code here



This data set seems pretty well balanced.

##### b.

Make a train test split, setting aside $10\%$ of the data as the test set (note that we are using $10\%$ because this was the split they used in the paper).
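The following sketch shows a $10\%$ hold-out split with `train_test_split` on toy data. The `stratify` and `random_state` arguments are choices made for this illustration and are not specified by the problem; the frame `toy` and the seed value are likewise made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a balanced binary target (illustrative only)
toy = pd.DataFrame({"x": np.arange(100), "y": [0, 1] * 50})

# Hold out 10% as a test set. Stratifying on y keeps the class split the
# same in both pieces, and random_state makes the split reproducible.
toy_train, toy_test = train_test_split(toy.copy(),
                                       test_size=0.1,
                                       shuffle=True,
                                       random_state=440,
                                       stratify=toy["y"])
print(len(toy_train), len(toy_test))
```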

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
## Make your train test split here



#### 4. Exploratory data analysis (EDA)

Before building any models you will do some EDA.

##### a. 

One way to try to identify key features for classification algorithms is to plot histograms of the feature values for each of the classes.

Below is an example of such a histogram for the `Area` column made using `plt.hist`.

In [None]:
plt.figure(figsize=(9,5))


plt.hist(seeds_train.loc[seeds_train.y==0].Area.values,
            color='blue',
            alpha=.8,
            label="$y=0$")

plt.hist(seeds_train.loc[seeds_train.y==1].Area.values,
            color='red',
            alpha=.4,
            hatch = '\\',
            edgecolor='black',
            label="$y=1$")

plt.xlabel("Area", fontsize=12)
plt.legend(fontsize=12)

plt.show()

In this plot we can see that the two histograms are right on top of one another, indicating that the two classes of pumpkin seeds tend to have similar areas. This suggests that `Area` may not be a useful variable for discerning the seed class.

Use a `for` loop or some comparable method to produce similar histograms for each of the features. Write down the features that look like they may be useful for classification.
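As a rough sketch of the looping pattern, the block below draws one overlaid histogram per feature on synthetic data. The frame `toy_train` and the column names `feat_1` and `feat_2` are hypothetical; in the synthetic data `feat_2` is deliberately shifted for one class so that its histograms separate while those of `feat_1` overlap.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the training set (hypothetical feature names)
rng = np.random.default_rng(440)
toy_train = pd.DataFrame({"feat_1": rng.normal(0, 1, 200),
                          "feat_2": rng.normal(5, 2, 200),
                          "y": np.repeat([0, 1], 100)})

# Shift feat_2 for class 1 so its histograms visibly separate
toy_train.loc[toy_train.y == 1, "feat_2"] += 4

# Every column except the target is treated as a feature
feature_cols = [c for c in toy_train.columns if c != "y"]

# One figure per feature, with both classes overlaid
for col in feature_cols:
    plt.figure(figsize=(9, 5))
    plt.hist(toy_train.loc[toy_train.y == 0, col], color="blue",
             alpha=.8, label="$y=0$")
    plt.hist(toy_train.loc[toy_train.y == 1, col], color="red",
             alpha=.4, hatch="\\", edgecolor="black", label="$y=1$")
    plt.xlabel(col, fontsize=12)
    plt.legend(fontsize=12)
    plt.show()
```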

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### Keep track of your selected variables here



##### b.

Now try making a `seaborn` `pairplot` using the variables you identified in part <i>a.</i> as the arguments for `x_vars` and `y_vars`. Use `y` as the argument to `hue`. The main goal with this question is to see if you can identify any pairs of variables that seem to separate the two classes. You will use these plots later in the notebook.

In [None]:
## Fill in the missing code
sns.pairplot(data = ,
                x_vars = [],
                y_vars = [],
                hue = )

plt.show()

#### 5. Metric selection

Before building any models you will decide how to measure their performance.

##### a.

Now that you have read about the data and looked at the split between the two classes, what seems like a reasonable performance metric for this problem? Explain your answer.

##### Write here



##### b.

Recalling that `y=1` implies that the seed is of the Ürgüp Sivrisi class and `y=0` implies that the seed is of the Çerçevelik class, what do the following metrics measure in the context of this classification problem:
- recall
- precision
- false positive rate.
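As a reminder of how these three quantities are computed, here is a small worked example on made-up labels and predictions, with `1` as the positive class. The arrays `y_true` and `y_pred` are invented for illustration; interpreting the numbers in terms of seed types is left to you.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, with 1 as the positive class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of actual 1s that were predicted 1
precision = tp / (tp + fp)   # share of predicted 1s that are actually 1
fpr = fp / (fp + tn)         # share of actual 0s that were predicted 1

print(recall, precision, fpr)

# sklearn's helper functions agree with the hand computation
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```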

##### Write here

- recall:
- precision:
- false positive rate:

#### 6. Initial modeling attempts

In the remainder of this notebook you will make some initial models.

##### a.

Think of a baseline model for these data. Some common approaches are:
- A random coin flip whose probability for heads is the same as the probability of drawing the more present class,
- Classifying any observation as the majority class.

For whichever baseline you choose, project its generalization accuracy using the training data.
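The arithmetic behind both baselines can be sketched on hypothetical labels as follows. The array `y_train` is made up, and the projection assumes the class proportions seen in the training data carry over to new data.

```python
import numpy as np

# Hypothetical training labels: 52 of class 0 and 48 of class 1
y_train = np.array([0] * 52 + [1] * 48)

# Proportion of the more common class
p = max(np.mean(y_train == 0), np.mean(y_train == 1))

# Always predicting the majority class is right with probability p
majority_acc = p

# A coin that says "majority" with probability p is right when coin and
# label agree: both majority (p * p) or both minority ((1 - p) * (1 - p))
coin_acc = p**2 + (1 - p)**2

print(majority_acc, coin_acc)
```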

In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



In [None]:
## code here



##### b.

Fill in the code below to perform 5-fold cross-validation in order to compare logistic regression models regressing `y` on each of the useful features you identified in your EDA above.
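For reference, the overall cross-validation pattern might look like the following sketch on synthetic data. Everything here is an assumption made for illustration: the data come from `make_classification`, the column names `f0`, `f1`, `f2` are hypothetical, the splitter is `StratifiedKFold`, and `LogisticRegression` is left at its defaults (which include $\ell_2$ regularization). Your own choices may differ.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training set (hypothetical column names)
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=440)
toy_train = pd.DataFrame(X, columns=["f0", "f1", "f2"])
toy_train["y"] = y

features = ["f0", "f1", "f2"]
n_splits = 5
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=440)

# Rows index the CV splits, columns index the single-feature models
accs = np.zeros((n_splits, len(features)))

# kfold.split yields positional indices for each train/holdout pair
splits = kfold.split(toy_train, toy_train.y)
for i, (train_index, test_index) in enumerate(splits):
    toy_tt = toy_train.iloc[train_index]
    toy_ho = toy_train.iloc[test_index]

    for j, feature in enumerate(features):
        log_reg = LogisticRegression()
        # sklearn expects a 2D feature array, hence the double brackets
        log_reg.fit(toy_tt[[feature]], toy_tt.y)
        pred = log_reg.predict(toy_ho[[feature]])
        accs[i, j] = accuracy_score(toy_ho.y, pred)

# Average over the splits to get one CV accuracy per feature
print(dict(zip(features, accs.mean(axis=0))))
```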

In [None]:
## Import what you will need
from sklearn.linear_model import 
from sklearn.metrics import 
from sklearn.model_selection import 

In [None]:
## Make your kfold object
n_splits = 

kfold = 

In [None]:
## Fill in your list of features
features = []

## Make your array of zeros to hold the accuracies
log_reg_accs = np.zeros()

## Loop through the cv splits
i = 0
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    ## get the training and holdout sets
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    ## loop through your features
    j = 0
    for feature in features:
        ## Define the model
        log_reg = 
        
        ## fit the model
        log_reg
        
        ## Make the prediction
        pred = 
        
        ## Record the accuracy on the holdout set
        log_reg_accs[i,j] = 
        
        j = j + 1
    i = i + 1

In [None]:
## Print out the average cv accuracies here


##### c.

Compare these models to the logistic regression model that incorporates all of the features you identified with your histogram exploration.

In [None]:
## fill in the missing code below


i = 0
for train_index, test_index in :
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    ## Define the model, fit the model, then record the accuracies
    
    
    
    i = i + 1

In [None]:
## What is the avg. cv. accuracy?


##### Make any notes you want here



##### d.

Fill in the code to find the optimal $k$ for a $k$ nearest neighbors model incorporating all of the features.
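One way to pick $k$ is to average the holdout accuracy over the splits for each candidate and take the $k$ with the highest average. The sketch below does this on synthetic data; the data, the range of $k$ values, the `StratifiedKFold` splitter and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the features and labels
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=440)

ks = range(1, 21)
n_splits = 5
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=440)

# Rows index the CV splits, columns index the values of k
accs = np.zeros((n_splits, len(ks)))

for i, (train_index, test_index) in enumerate(kfold.split(X, y)):
    for j, k in enumerate(ks):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X[train_index], y[train_index])
        pred = knn.predict(X[test_index])
        accs[i, j] = accuracy_score(y[test_index], pred)

# The k with the highest average CV accuracy (first one in case of ties)
mean_accs = accs.mean(axis=0)
best_k = ks[int(np.argmax(mean_accs))]
print(best_k, mean_accs.max())
```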

In [None]:
## Import the model class
from sklearn.neighbors import 

In [None]:
## Fill in the range you want to try for k
ks = range()

## This will give you a list of all feature column names
all_features = seeds_train.columns[:-2]

## Make an array to hold the accuracies
k_all_accs = np.zeros()

i = 0
for train_index, test_index in :
    ## Get the train and holdout sets
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    ## Loop through the different ks
    j = 0
    for k in ks:
        ## Make the model object
        knn = 
        
        ## Fit the model
        
        
        ## Make your prediction
        pred = 
        
        ## Record the accuracy on the holdout set
        k_all_accs[i,j] = accuracy_score(seeds_ho.y.values, pred)
        
        j = j + 1
    i = i + 1

In [None]:
## Plots the accuracies as a function of k
plt.figure(figsize=(7,5))


plt.plot(ks, 
         np.mean(k_all_accs, axis=0),
         '-o')


plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.xlabel("$k$", fontsize=12)
plt.ylabel("Avg. CV Accuracy", fontsize=12)

plt.show()

##### e. 

Now see if you can improve the accuracy by using just the features you chose as a result of your histogram explorations. Did the best accuracy change? Did the optimal value of $k$ change?

In [None]:
k_select_accs = np.zeros((n_splits, len(ks)))

i = 0
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    j = 0
    for k in ks:
        ## Make your model object
        knn = 
        
        ## Fit your model object
        knn
        
        ## Make your prediction on the holdout set
        pred = 
        
        ## Record the accuracies
        k_select_accs[i,j] = accuracy_score(seeds_ho.y.values, pred)
        
        j = j + 1
    i = i + 1

In [None]:
## This will plot the avg cv accuracies as a function of k
plt.figure(figsize=(7,5))


plt.plot(ks, 
         np.mean(k_select_accs, axis=0),
         '-o')


plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.xlabel("$k$", fontsize=12)
plt.ylabel("Avg. CV Accuracy", fontsize=12)

plt.show()

##### f.

As a final check see if you can improve the cross-validation accuracy further by only considering a pair of features from your `pairplot` exploration earlier.

In [None]:
k_final_accs = np.zeros((n_splits, len(ks)))

i = 0
for train_index, test_index in kfold.split(seeds_train, seeds_train.y):
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    j = 0
    for k in ks:
        ## Make the model
        knn = 

        
        ## Fit the model
        knn.fit()
        
        ## Make the prediction on the holdout set
        pred = 
        
        ## record the accuracy on the holdout set
        k_final_accs[i,j] = accuracy_score(seeds_ho.y.values, pred)
        
        j = j + 1
    i = i + 1

In [None]:
## This plots the accuracies as a function of k
plt.figure(figsize=(7,5))


plt.plot(ks, 
         np.mean(k_final_accs, axis=0),
         '-o')


plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.xlabel("$k$", fontsize=12)
plt.ylabel("Avg. CV Accuracy", fontsize=12)

plt.show()

#### 7. Summarizing the current results

Consider the best average CV accuracies of all of the models you built. Which one performed the best?

##### Write your answer here

--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph. D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)