## FINAL PROJECT

In this final project, we compare the tools we have learned by using them to classify the Fashion MNIST data set. A description of the data set is available here:

https://github.com/zalandoresearch/fashion-mnist

This project is roughly divided into four parts:
* Use PCA to reduce the dimension of each image. 
* kNN on the original data and on the PCA-reduced data
* LDA/QDA and SVM on the PCA-reduced data
* Neural network / convolutional neural network using Keras

If you are interested, you can try other learning methods that we talked about, or methods that you know but that we did not cover in class.

* For each method, write one to two paragraphs explaining the fundamental ideas behind it. For example, explain the mechanism of the algorithm, what the key parameters are and how to choose them, and the pros and cons of the method.

* When you see a question, create a Markdown cell below the question block (or the code block, whichever makes more sense) and write your answer in it.

* For the write-up, please explain as much of your code as possible, and avoid large blocks of code (split them into separate cells instead). Keep all of the intermediate plots, if any.

* Set the random seed to 42 for reproducibility. 
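
One way to follow the seeding instruction is sketched below. The helper name `set_global_seed` is hypothetical (it is not part of the assignment). It seeds Python's `random` module and NumPy, and also TensorFlow if it can be imported. scikit-learn functions such as `train_test_split` take the seed through their own `random_state` argument instead.

```python
import random
import numpy as np

SEED = 42  # the seed requested above


def set_global_seed(seed: int = SEED) -> None:
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)      # Python's built-in RNG
    np.random.seed(seed)   # NumPy's legacy global RNG
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow's global seed
    except ImportError:
        pass  # TensorFlow is not installed, so there is nothing to seed


# Re-seeding reproduces the same random draws
set_global_seed()
first_draw = np.random.rand(3)
set_global_seed()
second_draw = np.random.rand(3)
print(np.allclose(first_draw, second_draw))
```

Calling the helper once at the top of the notebook, before any data splitting or model training, is usually enough.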

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.neighbors import KNeighborsClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

from sklearn import svm

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score


If any of the modules above is missing, you can use the following command to install it:

python3 -m pip install MODULE_NAME

If you like, you are also welcome to use PyTorch for the neural network part. 

In [None]:
## Load the data
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

In [None]:
## Preprocess the data
train_images = train_images.reshape((-1, 28 * 28)).astype('float32') / 255.0
test_images = test_images.reshape((-1, 28 * 28)).astype('float32') / 255.0



In [None]:
# Split data into training and validation sets using an 80/20 ratio
train_images, val_images, train_labels, val_labels = train_test_split(train_images, train_labels, 
                                                                      test_size=0.2, 
                                                                      random_state=42)


In [None]:
# Standardize the features: fit the scaler on the training set only,
# then apply the same transform to the validation and test sets
scaler = StandardScaler()
train_images_scaled = scaler.fit_transform(train_images)
val_images_scaled = scaler.transform(val_images)
test_images_scaled = scaler.transform(test_images)


## Part 1: Principal Component Analysis

A question to ask ourselves: Should the data really live in 784 dimensions?
* The following block of code estimates the number of components needed to account for 90% of the variance in the data. How many components are needed?
* Modify the code and plot the number of components against the explained variance over the range 60-95%, in 1% increments.
* To speed things up, you may use the first 1000 training images instead of all of them. This applies to the rest of the PCA section.

In [None]:
# Apply PCA, keeping all components (n_components=None)
pca = PCA(n_components=None)  

# Fit PCA to the training images
pca.fit(train_images_scaled)

# Calculate the cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Find the number of components required to explain 90% of the variance
n_components = np.argmax(cumulative_variance_ratio >= 0.90) + 1

pca = PCA(n_components=n_components)
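
The threshold sweep asked for above can be sketched as follows. This is only an illustration under stated assumptions: the data `X` is synthetic (not Fashion MNIST), and the explained variance ratio is computed with a NumPy SVD so the block is self-contained. In the notebook, `pca.explained_variance_ratio_` from the fitted `pca` would take the place of that step.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled training images: 1000 samples, 50 features
# whose spread decays, so that a few directions carry most of the variance.
X = rng.normal(size=(1000, 50)) * np.linspace(5.0, 0.1, 50)

# PCA via SVD of the centered data: squared singular values are proportional
# to the variance explained by each principal component.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Thresholds 60%, 61%, ..., 95%
thresholds = np.arange(60, 96) / 100.0

# For each threshold: first index where the cumulative ratio reaches it, plus one
n_needed = np.searchsorted(cumulative, thresholds, side="left") + 1

for t, n in zip(thresholds[::5], n_needed[::5]):
    print(f"{t:.0%} of variance -> {n} components")
```

Plotting `thresholds` against `n_needed` (for example with `plt.plot`) then gives the requested curve. `np.searchsorted` returns the same index as the `np.argmax(cumulative >= t)` pattern used in the cell above, but handles all thresholds in one call.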


In [None]:
# You can now apply the updated PCA to the data
# (depending on the number of components you end up using)

train_images_pca = pca.fit_transform(train_images_scaled)
val_images_pca = pca.transform(val_images_scaled)
test_images_pca = pca.transform(test_images_scaled)

What is the advantage of projecting the data into this reduced-dimensional space? We can
expect the algorithms to run much faster on the reduced-dimension data, but will we
sacrifice accuracy for this speed boost? We investigate this in the following part.


## Part 2: k-Nearest Neighbors

Task:
* Give a description of the mechanism of the kNN algorithm.
* Run kNN on both the original data AND the reduced-dimension data from PCA (90% of total variance explained on the training set).
* Consider k = [1, 3, 5, 7, 9, 11]. Use 10-fold cross-validation to choose the best k.
    - Please include a running time evaluation.
    - Compare the performance across the two data sets based on both accuracy (on the test data) and running time.
    - Name and save the kNN model (for the best tuned k), re-trained on the full PCA training set, as knn_best (you will need this for later comparisons).
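
To make the mechanism concrete, here is a minimal from-scratch sketch of kNN classification in NumPy. It is an illustration only, not how `KNeighborsClassifier` is implemented internally. The function name `knn_predict` and the toy data are made up for this example, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    # Squared Euclidean distance from every query point to every training point
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs**2, axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote; on a tie, argmax picks the smallest label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])


# Two well-separated toy clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [5.0, 5.1]])

print(knn_predict(X_train, y_train, X_query, k=3))  # prints [0 1]
```

Note that there is no training step beyond storing the data: all the work happens at prediction time, which is why the running time of kNN depends heavily on the number of training samples and the dimension of each sample.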


In [None]:
# Some potentially useful code for kNN, 
# replace X_train/X_test/y_train/y_test with actual data


# Initialize the KNN classifier with 1-nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = knn.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Perform 10-fold cross-validation on the original data set (X,y)
scores = cross_val_score(knn, X, y, cv=10)
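
The tuning loop described in the task can be sketched as below. The data here is synthetic (generated with `make_classification`) and stands in for `train_images_pca` and `train_labels`; the variable names `X_demo`, `y_demo`, `mean_scores`, and `cv_times` are made up for this example. Treat it as one possible layout, not the required solution.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the PCA-reduced training data and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, n_redundant=0,
                                     n_classes=3, random_state=42)

k_values = [1, 3, 5, 7, 9, 11]
mean_scores, cv_times = [], []
for k in k_values:
    start = time.perf_counter()
    # 10-fold cross-validation accuracy for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_demo, y_demo, cv=10)
    cv_times.append(time.perf_counter() - start)
    mean_scores.append(scores.mean())

# Pick the k with the highest mean cross-validation accuracy
best_k = k_values[int(np.argmax(mean_scores))]

# Re-train on the full training set with the selected k
knn_best = KNeighborsClassifier(n_neighbors=best_k).fit(X_demo, y_demo)
print(f"best k = {best_k}, total CV time = {sum(cv_times):.2f}s")
```

Running the same loop once on the original data and once on the PCA-reduced data gives the accuracy and running-time comparison the task asks for.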

## Part 3: LDA/QDA and SVM

We will now consider other classifiers trained on the PCA dataset.  

Tasks:
* Use sklearn.svm to train
    - one model with a linear kernel
    - one model with an RBF kernel
* Use the sklearn.discriminant_analysis to train
    - one LDA model
    - one QDA model
* For each model (including the kNN), compare test accuracy for the tuned model as well as runtimes
* Based on the time and accuracy, which model would you choose out of these?

In [None]:
# The basic building blocks for SVM

# Initialize the SVM classifier (pick one kernel; train the other the same way)
clf = svm.SVC(kernel='linear')  # Linear kernel
# clf = svm.SVC(kernel='rbf')   # RBF kernel

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test data
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

In [None]:
# Building block for LDA

# Initialize the LDA classifier
lda = LinearDiscriminantAnalysis()

# Fit the LDA classifier to the training data
lda.fit(X_train, y_train)

# Predict the labels of the test data using LDA
y_pred_lda = lda.predict(X_test)

# Calculate the accuracy of the LDA model
accuracy_lda = accuracy_score(y_test, y_pred_lda)

In [None]:
# Building block for QDA

# Initialize the QDA classifier
qda = QuadraticDiscriminantAnalysis()

# Fit the QDA classifier to the training data
qda.fit(X_train, y_train)

# Predict the labels of the test data using QDA
y_pred_qda = qda.predict(X_test)

# Calculate the accuracy of the QDA model
accuracy_qda = accuracy_score(y_test, y_pred_qda)
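
For the accuracy and runtime comparison, a small timing helper keeps the measurements consistent across models. The sketch below uses synthetic data and a hypothetical helper named `fit_and_time`; in the notebook, the PCA-reduced training and test sets would replace `X_tr`, `X_te`, `y_tr`, and `y_te`.

```python
import time
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_time(model, X_train, y_train, X_test, y_test):
    """Return (test accuracy, fit seconds, predict seconds) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return accuracy_score(y_test, y_pred), fit_seconds, predict_seconds


# Synthetic stand-in for the PCA-reduced data
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=10, n_redundant=0,
                                     n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42)

models = {
    "SVM (linear)": svm.SVC(kernel="linear"),
    "SVM (RBF)": svm.SVC(kernel="rbf"),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
results = {name: fit_and_time(model, X_tr, y_tr, X_te, y_te)
           for name, model in models.items()}

for name, (acc, fit_s, pred_s) in results.items():
    print(f"{name:13s} accuracy={acc:.3f} fit={fit_s:.3f}s predict={pred_s:.3f}s")
```

Timing fit and predict separately matters here: kNN has almost no fit cost but a slow predict step, while the SVM models spend most of their time in fit.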

## Part 4: (Deep) Neural Network

This section relies heavily on Keras (if you are sticking with TensorFlow).

Tasks:
* You can use the same CNN architecture considered for the MNIST dataset discussed in class (or feel free to try other architectures).
* Train using categorical cross-entropy loss and the Adam optimizer, and track the training and validation accuracy.
* Train for 30 epochs with a batch size of 128 and a validation split of 0.2. Remember to time the training.
* In your write-up, include the history plots for the training and validation sets. By how much does the test accuracy improve compared to the previous classification methods?
* Re-train the DNN on a training set with only 1000 samples (remember to add a channel dimension to x in order to use 2D Conv layers). With fewer training samples, does the DNN still perform better than the previous classifiers? Explain.
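
Preparing the 1000-sample training set with a channel dimension could look like the sketch below. The arrays here are random placeholders with the same shapes as the flattened Fashion MNIST data, and the names `images_flat`, `labels`, `X_small`, and `y_small` are made up for this example. Drawing a random subset is one option; taking the first 1000 samples would work as well.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder arrays shaped like the flattened training images and labels
images_flat = rng.random((5000, 28 * 28)).astype("float32")
labels = rng.integers(0, 10, size=5000)

# Draw 1000 distinct training indices
subset_idx = rng.choice(len(images_flat), size=1000, replace=False)

# Restore the 28x28 grid and add a trailing channel axis for Conv2D
X_small = images_flat[subset_idx].reshape(-1, 28, 28, 1)
y_small = labels[subset_idx]

print(X_small.shape, y_small.shape)  # prints (1000, 28, 28, 1) (1000,)
```

The trailing axis of size 1 is the single grayscale channel that `Conv2D` expects when `input_shape=(28, 28, 1)`.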

In [None]:
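# Some building blocks for CNN
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.utils import to_categorical

# Reshape data to fit the model: (n_samples, 28, 28, 1).
# This also works if the images were flattened to 784 features earlier.
X_train = X_train.reshape(-1, 28, 28, 1)
X_test = X_test.reshape(-1, 28, 28, 1)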

# Convert labels to categorical one-hot encoding
y_train = to_categorical(y_train, num_classes=10)
y_test = to_categorical(y_test, num_classes=10)

# Define the CNN model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model, holding out 20% of the training data for validation
# (the test set is kept aside for the final evaluation)
history = model.fit(X_train, y_train, epochs=30, 
                    batch_size=128, validation_split=0.2, 
                    verbose=0)

# Plot training and validation loss over epochs
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over 30 Epochs')
plt.legend()
plt.show()
