# Problem Session 9
## Classifying Pumpkin Seeds II

In this notebook you continue to work with the pumpkin seed data from <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load then prepare the data


- Load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `Data` folder,
- Create a column `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik` and
- Make a train test split setting $10\%$ of the data aside as a test set.
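
As a rough illustration of the steps listed above, the sketch below encodes a binary target and makes a stratified 90/10 split. The data frame `toy` and its values are synthetic stand-ins (only the `Class` labels and the `y` column mirror this notebook), and the `random_state` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the seed data: the values here are synthetic.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Area": rng.normal(80000, 13000, size=200),
    "Class": ["Çerçevelik"] * 100 + ["Ürgüp Sivrisi"] * 100,
})

# Encode the target: 1 for Ürgüp Sivrisi, 0 for Çerçevelik
toy["y"] = (toy["Class"] == "Ürgüp Sivrisi").astype(int)

# Hold out 10% as a test set, shuffling and stratifying on the class
toy_train, toy_test = train_test_split(
    toy.copy(),
    test_size=0.1,
    shuffle=True,
    random_state=123,
    stratify=toy["y"],
)

print(len(toy_train), len(toy_test))
```

Stratifying on `y` keeps the class proportions of the test set close to those of the full data, which matters when comparing accuracies later.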

##### Sample Solution

In [None]:
seeds = pd.read_excel("../../Data/Pumpkin_Seeds_Dataset.xlsx")

seeds['y'] = 0

seeds.loc[seeds.Class=='Ürgüp Sivrisi', 'y']=1

In [None]:
from sklearn.model_selection import 

In [None]:
seeds_train, seeds_test = 

#### 2. Refresh your memory

If you need to refresh your memory on these data and the problem, you may want to look at a small subset of the data, look back on `Problem Session 8` and/or browse Figure 5 and Table 1 of this paper, <a href="pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>.

We will store our different model accuracies in a dictionary for easy comparison at the end of the notebook. I am starting it off with the two best models from problem session 8.

In [None]:
model_accs = {
         "knn" : 0.886,
         "log_reg": 0.867
            }

In [None]:
seeds_train.sample(5)

#### 3. Principal components analysis (PCA)

One way you may use PCA is as a data preprocessing step for supervised learning tasks. In this problem you will try it as a preprocessing step for the pumpkin seed data and see if this preprocessing step helps your model outperform the models from `Problem Session 8`.

##### a. 

Run the training data through PCA with two components and then plot the resulting principal values. Color each point by its class.

<i>Hint: Remember to scale the data before running it through PCA</i>.
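
One possible shape for a scale-then-PCA pipeline is sketched here on synthetic data from `make_classification`; the names `X_toy`, `pca_sketch` and `pca_vals` are hypothetical and not part of this notebook. The columns are deliberately put on very different scales to show why the hint matters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 12-feature data (an assumption, not the seed data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Exaggerate the scale differences between columns
X_toy = X_toy * np.logspace(0, 4, 12)

# Standardize first, then project onto two principal components
pca_sketch = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

pca_vals = pca_sketch.fit_transform(X_toy)

# Fraction of the (scaled) variance captured by each component
explained = pca_sketch.named_steps["pca"].explained_variance_ratio_
print(pca_vals.shape, explained)
```

Without the scaling step, the columns with the largest raw variance would dominate the leading components regardless of how informative they are.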

In [None]:
from sklearn.decomposition import 
from sklearn.preprocessing import
from sklearn.pipeline import 

In [None]:
features = seeds_train.columns[:-2]

pca = # make scale/pca pipe

# fit model object

pca_values = # transformed values here

In [None]:
plt.figure(figsize=(7,5))

plt.scatter(pca_values[seeds_train.y==0, 0], 
            pca_values[seeds_train.y==0, 1],
            color = 'b',
            label="$y=0$",
            alpha=.6)

plt.scatter(pca_values[seeds_train.y==1, 0], 
            pca_values[seeds_train.y==1, 1],
            color='r',
            marker='x',
            label="$y=1$",
            alpha=.6)

plt.legend(fontsize=10)

plt.xlabel("First PCA Value", fontsize=12)
plt.ylabel("Second PCA Value", fontsize=12)

plt.show()

##### b.

How does the PCA with only two components appear to separate the data?

##### c.

Run 5-fold cross-validation below to find the optimal value of $k$ for a $k$ nearest neighbors model fit on the first and second PCA values. What is the optimal $k$ and the associated average cross-validation accuracy? How does this compare to the accuracies from `Problem Session 8`?
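
The following is a minimal sketch of this kind of cross-validation loop, run on synthetic data rather than the seed data. The variable names (`X_toy`, `ks`, `accs`), the smaller range of $k$ and the `random_state` values are all assumptions made for illustration. Note that the scaler and PCA are fit on the training fold only and then applied to the holdout fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=1
)

n_splits = 5
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
ks = range(1, 21)
accs = np.zeros((n_splits, len(ks)))

for i, (tt_idx, ho_idx) in enumerate(kfold.split(X_toy, y_toy)):
    # Fit the scaler and PCA once per fold, on the training part only
    pca_pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=2)),
    ])
    pca_tt = pca_pipe.fit_transform(X_toy[tt_idx])
    pca_ho = pca_pipe.transform(X_toy[ho_idx])

    for j, k in enumerate(ks):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(pca_tt, y_toy[tt_idx])
        accs[i, j] = accuracy_score(y_toy[ho_idx], knn.predict(pca_ho))

# Average over folds, then pick the k with the highest mean accuracy
mean_accs = accs.mean(axis=0)
best_k = ks[int(np.argmax(mean_accs))]
print(best_k, round(float(mean_accs.max()), 3))
```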

##### Sample Solution

In [None]:
from sklearn.neighbors import 
from sklearn.model_selection import 
from sklearn.metrics import 

In [None]:
n_splits = 5
kfold = 

In [None]:
neighbors = range(1, 51)

pca_2_accs = np.zeros((n_splits, len(neighbors)))

# Note:  switching to using "enumerate" from this point in the bootcamp forward.
for i,(train_index, test_index) in enumerate(kfold.split(seeds_train, seeds_train.y)):
    print("CV Split", i)
    seeds_tt = 
    seeds_ho = 
    
    ## Note, putting the PCA here speeds up the loop
    pca_pipe = 
    
    pca_tt = 
    pca_ho = 
    
    for j, n_neighbors in enumerate(neighbors):
        # No need to scale knn first since PCA is handling that
        knn =
        
        knn.fit()
        
        pred = knn.predict()
        
        pca_2_accs[i,j] = accuracy_score()

In [None]:
plt.figure(figsize=(7,5))

plt.plot(neighbors, 
         np.mean(pca_2_accs, axis=0),
         '-o')

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.xlabel("$k$", fontsize=12)
plt.ylabel("Avg. CV Accuracy", fontsize=12)

plt.show()

In [None]:
print(f'The best mean CV accuracy of {np.max(np.mean(pca_2_accs, axis=0)):.3f} was achieved with k = {neighbors[np.argmax(np.mean(pca_2_accs, axis=0))]}' )

##### d.

We can think of the number of components used in PCA as another hyperparameter we can tune.

Fill in the missing code below to find the optimal number of components and $k$ pairing for this problem. What is the best average cross-validation accuracy?
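
As an alternative to hand-written nested loops, scikit-learn's `GridSearchCV` can tune both hyperparameters at once. This is a different technique from the loop in this problem, shown here only as a sketch on synthetic data; the grid values and the names `X_toy`, `pipe` and `search` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=2
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Step names are joined to parameter names with a double underscore
param_grid = {
    "pca__n_components": [2, 3, 4, 5],
    "knn__n_neighbors": [1, 5, 9, 13, 17],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Because the whole pipeline is refit inside every fold, the scaler and PCA never see the holdout data, at the cost of refitting PCA for every value of `n_neighbors`.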

##### Sample Solution

In [None]:
neighbors = range(1, 51)
comps = range(2,6)

pca_accs = np.zeros((n_splits, len(comps), len(neighbors)))

for i,(train_index, test_index) in enumerate():
    print("CV Split", i)
    seeds_tt = 
    seeds_ho =
    
    for j, n_comps in enumerate(comps):
        pca_pipe = 
        pca_pipe.fit()

        pca_tt = 
        pca_ho = 
        
        for k, n_neighbors in enumerate(neighbors):
            knn = 
            
            knn.fit()

            pred = knn.predict()

            pca_accs[i,j,k] = accuracy_score()

In [None]:
max_index = np.unravel_index(np.argmax(np.mean(pca_accs, axis=0), axis=None), 
                                       np.mean(pca_accs, axis=0).shape)


print(f"The pair with the highest AVG CV Accuracy was k = {neighbors[max_index[1]]} and number of components = {comps[max_index[0]]}")
print(f"The highest AVG CV Accuracy was {np.max(np.mean(pca_accs, axis=0)):.3f}")

In [None]:
# Add this best model to our dict of model accuracies
model_accs['pca_knn'] = 

#### 4. Trying Bayes based classifiers

Build LDA, QDA and naive Bayes models on these data by filling in the missing code for the cross-validation below.

Do these outperform your PCA-$k$-NN model from above?
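
For comparison with the explicit loop in this problem, here is a compact sketch that scores the three classifiers with `cross_val_score` on synthetic data. The data, the dictionary keys and the `random_state` values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data without redundant (collinear) features
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=3
)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "lda": LinearDiscriminantAnalysis(),
    "qda": QuadraticDiscriminantAnalysis(),
    "nb": GaussianNB(),
}

# Mean 5-fold CV accuracy for each model
bayes_means = {
    name: cross_val_score(model, X_toy, y_toy, cv=kfold, scoring="accuracy").mean()
    for name, model in models.items()
}

print(bayes_means)
```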

##### Sample Solution

In [None]:
from sklearn.naive_bayes import 
from sklearn.discriminant_analysis import 

In [None]:
bayes_accs = np.zeros((n_splits, 3))

for i, (train_index, test_index) in enumerate():
    seeds_tt = seeds_train.iloc[train_index]
    seeds_ho = seeds_train.iloc[test_index]
    
    ## Linear Discriminant Analysis
    lda = 
    
    lda.fit()
    lda_pred = 
    
    bayes_accs[i, 0] = accuracy_score()
    
    ## Quadratic Discriminant Analysis
    qda = 
    
    qda.fit()
    
    qda_pred = 
    
    bayes_accs[i, 1] = accuracy_score()
    
    
    ## Gaussian Naive Bayes
    nb = 
    
    nb.fit()
    
    nb_pred = nb.predict()
    
    bayes_accs[i, 2] = accuracy_score()

In [None]:
np.mean(bayes_accs, axis=0)

In [None]:
# Come up with a reasonable short name for your best model and store the accuracy rounded to the nearest thousandth place here.
model_accs[''] = 

In [None]:
model_accs

#### 5. A support vector machine classifier

In this problem you will work to build a support vector classifier on these data.

##### a.

Start by importing the support vector classifier from `sklearn`.  We will use the default kernel which is `rbf`.

In [None]:
from sklearn.svm import 

##### b.

You will now perform hyperparameter tuning on the `C` parameter of the support vector classifier. Fill in the missing pieces of the code below to perform 5-fold cross-validation for different values of `C`.
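
A hedged sketch of tuning `C` for an RBF support vector classifier is given below. It uses synthetic data, a shorter list of `C` values than the one in this problem, and `cross_val_score` in place of an explicit loop; all names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=5
)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
Cs_sketch = [0.01, 0.1, 1, 10, 100]

mean_accs = []
for C in Cs_sketch:
    # Scale inside the pipeline so each fold is scaled on its own training part
    svc_pipe = Pipeline([
        ("scale", StandardScaler()),
        ("svc", SVC(C=C)),  # the default kernel is 'rbf'
    ])
    scores = cross_val_score(svc_pipe, X_toy, y_toy, cv=kfold, scoring="accuracy")
    mean_accs.append(scores.mean())

best_C = Cs_sketch[int(np.argmax(mean_accs))]
print(best_C, round(max(mean_accs), 3))
```

Small values of `C` regularize more strongly, so accuracy typically rises and then levels off or falls as `C` grows; the location of the peak depends on the data.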

In [None]:
## set the number of CV folds
n_splits = 5

## Make the kfold object
kfold = StratifiedKFold()

## the values of C you will try
Cs = [.01, .1, 1, 10, 25, 50, 75, 100, 125, 150]

## this will hold the CV accuracies
C_accs1 = np.zeros((n_splits, len(Cs)))


## the cross-validation
for i,(train_index, test_index) in enumerate():
    seeds_tt = 
    seeds_ho = 
    
    for j,C in enumerate(Cs):
        pipe = 
    
        pipe.fit()
    
        pred = 

        C_accs1[i, j] = accuracy_score()

##### c.

Plot the average cross-validation accuracy against the base-10 logarithm of `C`.

In [None]:
plt.figure(figsize = (8,6))

plt.plot(np.log10(np.array(Cs)), 
         np.mean(C_accs1, axis=0), 
         '-o')

plt.xlabel(r"$\log_{10}(C)$", fontsize=12)
plt.ylabel("Avg. CV Accuracy", fontsize=12)
plt.xticks(np.arange(-2,3,.5),fontsize=10)
plt.yticks(fontsize=10)

plt.show()

##### d.

What was the optimal value of `C`, and what was the average cross-validation accuracy for this value of `C`?

In [None]:
mean_cv_accuracy = np.mean(C_accs1, axis=0)
optimal_index = np.argmax(mean_cv_accuracy)
optimal_C = Cs[optimal_index]
optimal_accuracy = mean_cv_accuracy[optimal_index]

print(f"The optimal C was {optimal_C} which gave a mean CV accuracy of {optimal_accuracy:.3f}")

In [None]:
model_accs['svc'] = np.round(optimal_accuracy,3)

In [None]:
model_accs

These models all perform quite similarly! It is very possible that we just don't have a set of features which are sufficiently discriminating to do any better. Let's look for two training samples from different classes that are very close to each other.

In [None]:
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler

# Scale the feature data
scaler = StandardScaler()
scaled_X = scaler.fit_transform(seeds_train[features])

# Construct a KDTree for fast nearest-neighbor search
kdt = KDTree(scaled_X, leaf_size=30, metric='euclidean')

# Find the two nearest neighbors for each point
distances, indices = kdt.query(scaled_X, k=2, return_distance=True)

# Sort indices by nearest neighbor distance (excluding self-distance)
sorted_indices = np.argsort(distances[:, 1])  # Only the 2nd column matters

# Reorder neighbor pairs accordingly
sorted_pairs = indices[sorted_indices]

# Identify the first pair with different class labels
labels = seeds_train.y.values
mismatch_index = np.argmax(labels[sorted_pairs[:, 0]] != labels[sorted_pairs[:, 1]])

# Retrieve the mismatched pair
index_1, index_2 = sorted_pairs[mismatch_index]

# Display the mismatched data points
seeds_train.iloc[[index_1, index_2], :]

We can see that these are extremely close to each other despite having different classes.  If there are many such examples, it may be impossible for us to get a better classification accuracy.

It is possible that these two cultivars could be very different genetically, very different morphologically as adult plants, and yet have seeds which are similar enough that some of them cannot be distinguished from each other based exclusively on their geometry.  

It is also possible that the seeds **are** distinct, but that some of our samples have been mislabeled.  This would also spell doom for any improved accuracy on this dataset.
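
The nearest-neighbor search above inspects only each point's single nearest neighbor, so in principle it can miss the globally closest pair of points from different classes (for example, when both points have an even closer neighbor of their own class). For a training set of modest size, a brute-force check over all cross-class pairs is feasible. The sketch below does this on synthetic data; `X0` and `X1` are hypothetical arrays holding the features of each class.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler

# Synthetic features for two overlapping classes (an assumption)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(50, 3))
X1 = rng.normal(loc=1.5, size=(60, 3))

# Scale using all the data, as in the cell above
scaler = StandardScaler().fit(np.vstack([X0, X1]))
Z0, Z1 = scaler.transform(X0), scaler.transform(X1)

# All pairwise Euclidean distances between the two classes
D = cdist(Z0, Z1)

# Row index into class 0 and column index into class 1 of the closest pair
i0, i1 = np.unravel_index(np.argmin(D), D.shape)
min_dist = D[i0, i1]

print(i0, i1, round(float(min_dist), 4))
```

The distance matrix has one entry per cross-class pair, so memory grows with the product of the two class sizes.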

#### 6. (OPTIONAL) LDA for supervised dimensionality reduction

Only do this section if you have time.

While we introduced linear discriminant analysis (LDA) as a classification algorithm, it was originally proposed by Fisher as a supervised dimension reduction technique, <a href="https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf">https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf</a>. In particular, the initial goal was to project the features, $X$, corresponding to a binary output, $y$, onto a single dimension which best separates the possible classes. This single dimension has come to be known as <i>Fisher's discriminant</i>.

In the case of two classes, we are projecting onto the line connecting the class sample means.  However we are **not** projecting orthogonally with respect to the Euclidean metric!  As discussed in the week 8 math hour, we will end up doing orthogonal projection with respect to the Mahalanobis metric of the learned LDA covariance matrix.
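
To make the two-class case concrete, the sketch below computes the direction $w = S_w^{-1}(\mu_1 - \mu_0)$, where $S_w$ is the pooled within-class covariance, and checks that the scores $Xw$ agree with scikit-learn's LDA `transform` up to shift and scale. The data are synthetic and the names (`S_w`, `w`, `fisher_manual`) are hypothetical; this is an illustration under those assumptions, not a derivation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data with no redundant features
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=4
)

# Class sample means
mu0 = X_toy[y_toy == 0].mean(axis=0)
mu1 = X_toy[y_toy == 1].mean(axis=0)

# Pooled within-class covariance matrix
R0 = X_toy[y_toy == 0] - mu0
R1 = X_toy[y_toy == 1] - mu1
S_w = (R0.T @ R0 + R1.T @ R1) / (len(X_toy) - 2)

# Fisher direction: solve S_w w = mu1 - mu0
w = np.linalg.solve(S_w, mu1 - mu0)
fisher_manual = X_toy @ w

# scikit-learn's one-dimensional LDA projection
lda = LinearDiscriminantAnalysis().fit(X_toy, y_toy)
fisher_sklearn = lda.transform(X_toy).ravel()

# The two sets of scores should be perfectly correlated (sign may differ)
corr = np.corrcoef(fisher_manual, fisher_sklearn)[0, 1]
print(round(abs(float(corr)), 6))
```

The score $w^\top x = (\mu_1 - \mu_0)^\top S_w^{-1} x$ is the inner product of $x$ with the difference of class means in the Mahalanobis geometry defined by $S_w$, which is why the projection is not orthogonal in the Euclidean sense.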

Walk through the code below to perform this supervised dimension reduction technique on these data.

##### a.

First make a validation set from the training set for demonstration purposes.

##### Sample Solution

In [None]:
## First we make a validation set for demonstration purposes
seeds_tt, seeds_val = 

##### b.

Now make a pipeline that first scales the data and ends with linear discriminant analysis. Then fit the pipeline.

##### Sample Solution

In [None]:
pipe =

In [None]:
pipe.fit()

##### c. 

Now calculate the Fisher discriminant by using `transform` with the pipeline you fit in <i>b.</i>

##### Sample Solution

In [None]:
fish = 

##### d. 

To visualize how LDA separated the two classes while projecting the 12-dimensional data onto a one-dimensional subspace, you can plot a histogram of the Fisher discriminant colored by the pumpkin seed class of the observation.

##### Sample Solution

In [None]:
plt.figure(figsize=(10,6))

plt.hist(fish[seeds_tt.y==0], 
         color='blue',
         edgecolor="black",
         label="$y=0$")

plt.hist(fish[seeds_tt.y==1], 
         color='orange', 
         hatch='/', 
         alpha=.6,
         edgecolor="black",
         label="$y=1$")

plt.xlabel("Fisher Discriminant", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.legend(fontsize=12)

plt.show()

##### e.

While there is some separation between the two classes, it is not perfect. This should be expected based on the exploratory data analysis you did in `Problem Session 8`.

We could use this discriminant in order to make classifications, for example by setting a simple cutoff value or as input into a different classification algorithm.

However, it is important to note that the LDA algorithm maximizes the separation of the two classes among observations of the training set. It is possible that separation would not be as good for data the algorithm was not trained on.

In this example we can visually inspect by plotting a histogram of the Fisher discriminant values for the validation set we created. Does the separation seem as pronounced on the validation data?
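
Alongside the visual check, a single number can summarize the separation on each split. The sketch below fits a scale-then-LDA pipeline on a training split of synthetic data, transforms both splits, and reports the gap between class means in units of pooled standard deviation. The `separation` helper and this choice of summary are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_toy, y_toy = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=6
)

X_tt, X_val, y_tt, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=5
)

lda_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=1)),
])
lda_pipe.fit(X_tt, y_tt)

# Fisher discriminant values on the training and validation splits
fish_tt_sketch = lda_pipe.transform(X_tt).ravel()
fish_val_sketch = lda_pipe.transform(X_val).ravel()


def separation(values, labels):
    """Gap between class means divided by the pooled standard deviation."""
    a, b = values[labels == 0], values[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd


sep_tt = separation(fish_tt_sketch, y_tt)
sep_val = separation(fish_val_sketch, y_val)
print(round(float(sep_tt), 3), round(float(sep_val), 3))
```

A validation value well below the training value would suggest that the discriminant is fitting noise in the training split.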

##### Sample Solution

In [None]:
fish_val = 

In [None]:
plt.figure(figsize=(10,6))

plt.hist(fish_val[seeds_val.y==0], 
         color='blue',
         edgecolor="black",
         label="$y=0$")

plt.hist(fish_val[seeds_val.y==1], 
         color='orange', 
         hatch='/', 
         alpha=.6,
         edgecolor="black",
         label="$y=1$")

plt.xlabel("Fisher Discriminant", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.legend(fontsize=12)

plt.show()

There appears to be a little more overlap, but overall the separation appears similar on the validation set.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023. Modified by Steven Gubkin 2024.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)