# Loading Data

1. Load mnist_test.csv from https://www.kaggle.com/datasets/oddrationale/mnist-in-csv?select=mnist_test.csv as data.

2. Split data into X and y. X should have shape (10000, 784) and y should have shape (10000, 1).

3. Split X and y into a train set (80%) and a test set (20%). The train set is for fitting your model, and the test set is for evaluating it. As a result, X_train.shape should be (8000, 784), y_train.shape (8000, 1), X_test.shape (2000, 784), and y_test.shape (2000, 1). (Hint: use sklearn.model_selection.train_test_split.)

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

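# Load the CSV; column 0 is the digit label, columns 1-784 are the pixel values.
data_pd = pd.read_csv('/content/mnist_test.csv')
data = np.array(data_pd)

X = data[:, 1:]
print(X.shape)
# y is kept 1-D, shape (10000,), which scikit-learn classifiers accept directly.
# Use data[:, [0]] or y.reshape(-1, 1) if a (10000, 1) column is required.
y = data[:, 0]

# Hold out 20% of the rows for testing. No random_state is set,
# so the split (and every accuracy below) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)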

(10000, 784)
(8000, 784)
(2000, 784)
(8000,)
(2000,)
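The instructions ask for label arrays of shape (n, 1), while the cell above produces 1-D arrays such as (8000,). The following is a minimal sketch of one way to get the column shape and a repeatable split. It uses a small random array as a hypothetical stand-in for the MNIST data (`demo` and the other names are illustrative, not part of the assignment), and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the MNIST array: 100 rows, label in column 0,
# followed by 784 pixel values.
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(100, 785))
demo[:, 0] = np.arange(100) % 10  # digit labels 0-9, ten of each

X_demo = demo[:, 1:]   # shape (100, 784)
y_demo = demo[:, [0]]  # indexing with a list keeps the column axis: (100, 1)

# random_state makes the split repeatable; stratify keeps class proportions equal.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo.ravel()
)
print(Xd_train.shape, yd_train.shape, Xd_test.shape, yd_test.shape)
# (80, 784) (80, 1) (20, 784) (20, 1)

# scikit-learn classifiers expect 1-D labels, so flatten before calling fit.
yd_train_flat = yd_train.ravel()
print(yd_train_flat.shape)  # (80,)
```

Passing a (n, 1) label array straight to `fit` usually triggers a `DataConversionWarning`, which is why the sketch flattens it again with `ravel()`.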


## Fitting and Evaluating Your Model
1. Use sklearn.linear_model.RidgeClassifier to fit X_train and y_train to get a multi-class classification model. 
2. Test your model on (X_test, y_test) and get testing accuracy by using clf.score(X_test, y_test) assuming your model is named as "clf".

In [3]:
from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier()
clf.fit(X_train, y_train)

acc = clf.score(X_test, y_test)
print(acc)

0.8515
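For a classifier, `clf.score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand. It assumes scikit-learn's small built-in digits dataset (8x8 images) as a stand-in for MNIST so that it runs without the CSV file; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 1797 digit images with 64 features each (not the MNIST CSV).
Xs, ys = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=0
)

demo_clf = RidgeClassifier().fit(Xs_train, ys_train)

# Accuracy computed by hand: share of predictions equal to the true labels.
manual_acc = np.mean(demo_clf.predict(Xs_test) == ys_test)
score_acc = demo_clf.score(Xs_test, ys_test)
print(manual_acc, score_acc)

# For multi-class problems the model holds one weight vector per class.
print(demo_clf.coef_.shape)  # (10, 64)
```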


## Optimizing the RidgeClassifier
1. In sklearn.linear_model.RidgeClassifier, there is an argument called "alpha" that sets the regularization strength. By default, alpha is equal to 1. A large alpha has both benefits and drawbacks. The larger alpha is, the more likely your model is to underfit, so a higher alpha does not necessarily mean better results. On the other hand, a low alpha may lead to overfitting and a more complicated model.

  More information: https://towardsdatascience.com/preventing-overfitting-with-lasso-ridge-and-elastic-net-regularization-in-machine-learning-d1799b05d382

  Please try different values of alpha to train your model and evaluate its test accuracy. Note: do not try narrow ranges such as 1-10 or 1-50; those values are too similar. (Hint: try alphas that are different powers of 10.) Of the values you chose, which is the best choice of alpha for MNIST classification?

2. Instead of fitting the full-dimensional data (784 features) with the RidgeClassifier, you can apply PCA to the data (PCA over X with shape (10000, 784)) to reduce the dimension from 784 to, for example, 100, and train another RidgeClassifier on the 100-dimensional features.

  Typically, we want the retained components to explain 95–99% of the variance, and that fraction is the value we pass as n_components. With alpha=1, iterate through the array [0.95, 0.96, ..., 0.99] and set n_components to the current value. Each time, print the shape of X_reduced; the second value in the tuple is the number of components kept. For example, (10000, 168) means 168 components were kept. For further explanation, see the scikit-learn PCA documentation:
  > if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
 
  Then, set n_components to 784 and run again. Your answer should show the results for 0.95-0.99 explained variance as well as the result for n_components = 784. Then answer this question: which reduced number of components gives the highest test accuracy? (Hint: after applying PCA to the 10000 samples, remember to split into train and test sets.)
  
  More information on variance and PCA:  https://stackoverflow.com/questions/32857029/python-scikit-learn-pca-explained-variance-ratio-cutoff
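To make the fractional `n_components` behaviour described in item 2 concrete, here is a minimal sketch. It assumes scikit-learn's built-in digits dataset (64 features) as a stand-in for MNIST; `Xp` and `kept` are illustrative names. After fitting, `n_components_` holds the number of components actually kept, and the cumulative sum of `explained_variance_ratio_` shows the fraction of variance they explain.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data with 64 features (not the 784-feature MNIST CSV).
Xp, _ = load_digits(return_X_y=True)

kept = {}  # target variance fraction -> (components kept, variance explained)
for target in (0.95, 0.99):
    # A float in (0, 1) asks PCA to keep just enough components
    # to explain at least that fraction of the total variance.
    pca = PCA(n_components=target, svd_solver="full").fit(Xp)
    explained = float(np.cumsum(pca.explained_variance_ratio_)[-1])
    kept[target] = (pca.n_components_, explained)
    print(f"target={target}: kept {pca.n_components_} components, "
          f"explaining {explained:.4f} of the variance")
```

A higher target can only keep the same number of components or more, which matches the growing column counts in the assignment's example shapes.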


In [20]:
# 1
# Sweep alpha over powers of 10 and keep the value with the highest test accuracy.
acc_best = 0
alpha_best = None
for alpha in [.0001, .001, .01, .1, 1, 10, 100]:
    clf = RidgeClassifier(alpha=alpha)
    clf.fit(X_train, y_train)

    acc = clf.score(X_test, y_test)
    print(f'{alpha} has accuracy score of: {acc}')

    # Strict ">" means that on a tie the smallest alpha reaching the best score is kept.
    if acc > acc_best:
        acc_best = acc
        alpha_best = alpha

print(f"\n\nbest accuracy is {acc_best} with alpha: {alpha_best}\n\n")


# 2
from sklearn.decomposition import PCA

y = data[:, 0]  # labels are unchanged by PCA
acc_best_PCA = 0

# A float n_components in (0, 1) keeps enough components to explain that
# fraction of the variance (requires svd_solver='full').
for i in np.linspace(.95, .99, 5):
    print(i)
    my_model = PCA(n_components=i, svd_solver='full')
    x_reduced = my_model.fit_transform(X)
    print(x_reduced.shape)

    # Each iteration draws a new random split, so part of the difference
    # between accuracies comes from the split rather than from PCA.
    X_train, X_test, y_train, y_test = train_test_split(x_reduced, y, test_size=0.2)
    clf = RidgeClassifier()
    clf.fit(X_train, y_train)

    acc_with_PCA_reduced = clf.score(X_test, y_test)
    print(f'Accuracy is {acc_with_PCA_reduced} with {x_reduced.shape[1]} columns\n')

    if acc_with_PCA_reduced > acc_best_PCA:
        acc_best_PCA = acc_with_PCA_reduced
        PCA_best = x_reduced.shape[1]

# Baseline: keep all 784 components (a rotation of the data, no reduction).
my_model = PCA(n_components=784, svd_solver='full')
X_PCA = my_model.fit_transform(X)
print(my_model.n_components)
print(X_PCA.shape)
X_train, X_test, y_train, y_test = train_test_split(X_PCA, y, test_size=0.2)
clf = RidgeClassifier()
clf.fit(X_train, y_train)
acc_with_PCA_784 = clf.score(X_test, y_test)
print(f'Best accuracy is {acc_with_PCA_784} with {X_PCA.shape[1]} columns')


print(f'\n\n\n\nThe best reduced dimension number of components to get the highest test accuracy is {PCA_best} components returning a test accuracy of {acc_best_PCA}')

 

  


0.0001 has accuracy score of: 0.8385
0.001 has accuracy score of: 0.8385
0.01 has accuracy score of: 0.8385
0.1 has accuracy score of: 0.839
1 has accuracy score of: 0.839
10 has accuracy score of: 0.839
100 has accuracy score of: 0.839


best accuracy is 0.839 with alpha: 0.1


0.95
(10000, 149)
Accuracy is 0.861 with 149 columns

0.96
(10000, 174)
Accuracy is 0.854 with 174 columns

0.97
(10000, 207)
Accuracy is 0.8745 with 207 columns

0.98
(10000, 253)
Accuracy is 0.861 with 253 columns

0.99
(10000, 323)
Accuracy is 0.858 with 323 columns

784
(10000, 784)
Best accuracy is 0.839 with 784 columns




The best reduced dimension number of components to get the highest test accuracy is 207 components returning a test accuracy of 0.8745
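In the output above, accuracy barely moves between alpha = 0.0001 and alpha = 100, and each PCA run was scored on a different random split. One possible reading, offered as a guess rather than a conclusion, is that raw pixel values up to 255 make these alphas small relative to the scale of the data. The sketch below shows one way to sweep a wider logarithmic range against a single fixed split, so that every alpha is compared on the same test rows. It assumes scikit-learn's built-in digits dataset as a stand-in for MNIST; the range passed to `np.logspace`, `random_state=0`, and all variable names are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (not the MNIST CSV); one fixed split shared by every alpha.
Xa, ya = load_digits(return_X_y=True)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    Xa, ya, test_size=0.2, random_state=0
)

# Six alphas spaced evenly on a log scale: 1e-4, 1e-2, 1, 1e2, 1e4, 1e6.
alphas = np.logspace(-4, 6, 6)

scores = {}  # alpha -> test accuracy
for a in alphas:
    model = RidgeClassifier(alpha=a).fit(Xa_train, ya_train)
    scores[float(a)] = model.score(Xa_test, ya_test)
    print(f"alpha={a:g}: accuracy={scores[float(a)]:.4f}")

# max() returns the first alpha that reaches the highest accuracy.
best_alpha = max(scores, key=scores.get)
print(f"best alpha on this split: {best_alpha:g}")
```

Choosing alpha by test accuracy, as the assignment asks, reuses the test set for tuning; scikit-learn also provides `RidgeClassifierCV`, which selects alpha by cross-validation on the training data instead.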


## Confusion Matrix and Classification Report
1. Read https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html to understand how to use a confusion matrix. Based on what you learned there, can you plot the confusion matrix for your model? (Hint: use clf.predict(X_test) to get the predicted labels for X_test.)

2. Read https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report to understand how to use a classification report. Based on what you learned there, can you output the classification report for your model? Which label has the lowest precision?
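The two steps above are linked: in scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so the precision of class k is the diagonal entry divided by the sum of column k. The sketch below illustrates this and picks out the lowest-precision label programmatically instead of reading it off the report. It assumes scikit-learn's built-in digits dataset as a stand-in for MNIST, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

# Stand-in data (not the MNIST CSV).
Xm, ym = load_digits(return_X_y=True)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    Xm, ym, test_size=0.2, random_state=0
)
m_clf = RidgeClassifier().fit(Xm_train, ym_train)
ym_pred = m_clf.predict(Xm_test)

# cm[i, j] counts samples whose true label is i and predicted label is j.
cm = confusion_matrix(ym_test, ym_pred)

# Precision per class from the library ...
per_class_precision = precision_score(ym_test, ym_pred, average=None, zero_division=0)

# ... and the same quantity by hand: correct predictions / all predictions of that class.
col_sums = cm.sum(axis=0)
precision_from_cm = np.diag(cm) / np.where(col_sums == 0, 1, col_sums)

worst_label = m_clf.classes_[np.argmin(per_class_precision)]
print(f"lowest precision: label {worst_label} "
      f"({per_class_precision.min():.2f})")
```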

In [29]:
from sklearn.metrics import confusion_matrix, classification_report

# Add your code here
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))


print(classification_report(y_test, y_pred))

[[220   0   0   1   1   3   3   0   2   0]
 [  0 199   2   0   0   1   2   0   4   0]
 [  4   3 147   2   5   0   8   6   8   4]
 [  1   2   6 188   1   4   1   4   5   3]
 [  0   2   2   1 156   0   2   0   0  14]
 [  2   4   1  17  10 131   1   2  11   1]
 [  3   2   4   0   5   1 195   0   4   0]
 [  2   7   4   0   7   0   0 169   1  15]
 [  2   6   7  10   4   6   3   1 152   4]
 [  4   1   2   5  15   0   0  12   3 147]]
              precision    recall  f1-score   support

           0       0.92      0.96      0.94       230
           1       0.88      0.96      0.92       208
           2       0.84      0.79      0.81       187
           3       0.84      0.87      0.86       215
           4       0.76      0.88      0.82       177
           5       0.90      0.73      0.80       180
           6       0.91      0.91      0.91       214
           7       0.87      0.82      0.85       205
           8       0.80      0.78      0.79       195
           9       0.78     