# Dimensionality Reduction Exercise

In this exercise, you will be asked to build several Machine Learning models, while understanding the value of PCA dimensionality reduction. Make sure your code is readable, functional, documented and that you give elaborate explanations and some plots to go with your code.

In [1]:
import pandas as pd

## Load the MNIST dataset attached to this exercise (it is already divided into train and test sets, load both)

In [2]:
df_train = pd.read_csv('mnist_train.csv')
df_test = pd.read_csv('mnist_test.csv')
df_train.head()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 1. Build a classifier of your choice on the given data (your features are the pixels), and evaluate it. Elaborate on the performance of your model.

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, y_train = df_train.drop(columns=['label']), df_train['label']
X_test, y_test = df_test.drop(columns=['label']), df_test['label']

rfc_model = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, rfc_model.predict(X_test))
print("Accuracy of the Random Forest Classifier:", accuracy)

Accuracy of the Random Forest Classifier: 0.9705
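A single accuracy number does not show which digits get confused with each other. A possible way to elaborate on the performance is a confusion matrix and a per-class report. The sketch below is only an illustration: it assumes scikit-learn's small bundled 8x8 digits dataset (`load_digits`) as a stand-in for the MNIST CSV files, and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `model` and `cm` are made up for this example. Its numbers are not the MNIST results.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): the 8x8 digits bundled with scikit-learn,
# not the mnist_train.csv / mnist_test.csv files used in this exercise.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Same kind of model as in the cell above.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are the true digits, columns are the predicted digits.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 for each digit.
print(classification_report(y_te, y_pred))
```

With the exercise data, the same two calls could be applied to `y_test` and `rfc_model.predict(X_test)`.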


## 2. Perform a PCA dimensionality reduction on the data, and re-train the same model on the new top k PCA-ed features. Evaluate the new model and elaborate on the performance of your model, and compare it to the performance of model without PCA.
## The value of k is for you to choose, but it must be pretty small.  Try some different numbers, and explain why you chose that number.

In [4]:
from sklearn.decomposition import PCA

k_values = [2, 3, 4, 5, 6, 10, 20]

for k in k_values:
    # Fit PCA on the training set only, then apply the same projection to the test set
    pca = PCA(n_components=k)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    rfc_model_pca = RandomForestClassifier(n_estimators=100, random_state=42)
    rfc_model_pca.fit(X_train_pca, y_train)
    
    accuracy_pca = accuracy_score(y_test, rfc_model_pca.predict(X_test_pca))
    print(f"Accuracy of the Random Forest Classifier with PCA (k={k}):", accuracy_pca)


Accuracy of the Random Forest Classifier with PCA (k=2): 0.4192
Accuracy of the Random Forest Classifier with PCA (k=3): 0.5082
Accuracy of the Random Forest Classifier with PCA (k=4): 0.6514
Accuracy of the Random Forest Classifier with PCA (k=5): 0.758
Accuracy of the Random Forest Classifier with PCA (k=6): 0.8361
Accuracy of the Random Forest Classifier with PCA (k=10): 0.9129
Accuracy of the Random Forest Classifier with PCA (k=20): 0.9484


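We can see that, over the values we tried, accuracy increases steadily with k: from 0.4192 at k=2 to 0.9484 at k=20. That is still below the 0.9705 we obtained without PCA, but the k=20 model uses only 20 features instead of 784. Let's go with k = 20, the best-scoring value among the small ones we tried.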
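Another common way to motivate a choice of k is the cumulative explained variance of the principal components. The sketch below is a minimal illustration under assumptions: it uses scikit-learn's bundled 8x8 digits dataset (`load_digits`) instead of the MNIST CSV files, and the 90% threshold and the names `pca_full`, `cumulative` and `k_90` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, _ = load_digits(return_X_y=True)

# Keep all components so we can see how the variance accumulates.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Share of the total variance captured by the first k components.
for k in [2, 3, 4, 5, 6, 10, 20]:
    print(f"k={k:2d}: cumulative explained variance = {cumulative[k - 1]:.3f}")

# Smallest k whose components explain at least 90% of the variance
# (the 90% threshold is an arbitrary choice for this illustration).
k_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("Smallest k reaching 90% explained variance:", k_90)
```

Explained variance is only a proxy: it says how much of the pixel variation is kept, not how well the classes separate, so it complements the accuracy values above rather than replacing them.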

## 3. Compare the model metrics that you got from question 2, to a model with random subset of regular features:
- Use the same number of features k as you used in question 2.
- The features used are the regular pixel features, without PCA.
- But instead of using all such 784 features, use a random subset of size k of features from question 2.

Elaborate on your findings.

In [5]:
import random

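# Sample 20 distinct column positions. No seed is set, so the chosen
# pixels (and the resulting accuracy) change from run to run.
random_features_indices = random.sample(range(784), k=20)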

X_train_random = X_train.iloc[:, random_features_indices]
X_test_random = X_test.iloc[:, random_features_indices]

rfc_model_random = RandomForestClassifier(n_estimators=100, random_state=42)
rfc_model_random.fit(X_train_random, y_train)

accuracy_random = accuracy_score(y_test, rfc_model_random.predict(X_test_random))

# accuracy_pca still holds the value from the last loop iteration in question 2 (k=20)
print("Accuracy of the Random Forest Classifier with PCA (k=20):", accuracy_pca)
print("Accuracy of the Random Forest Classifier with Random Subset of Regular Features (k=20):", accuracy_random)

Accuracy of the Random Forest Classifier with PCA (k=20): 0.9484
Accuracy of the Random Forest Classifier with Random Subset of Regular Features (k=20): 0.5718


As expected, PCA with 20 components scores far better (0.9484) than 20 randomly chosen original pixels (0.5718). The number of features is the same in both cases, but each principal component is a linear combination of all the pixels, chosen to capture as much of the data's variance as possible, whereas 20 random pixels cover only a small part of the image. So it is no surprise that the PCA features give better model performance.
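One way to make this comparison more concrete is to measure how much of the total pixel variance each set of 20 features keeps. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled 8x8 digits dataset (`load_digits`) rather than the MNIST CSV files, a fixed seed of 0 for the random pixels, and made-up names (`pca_share`, `random_share`, `random_pixels`). Variance is not the same thing as class information, but it gives a rough idea of how much of the image content survives.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, _ = load_digits(return_X_y=True)
k = 20

# Total variance of the data = sum of the per-pixel variances.
total_variance = X.var(axis=0, ddof=1).sum()

# Share of the total variance kept by the top-k principal components.
pca = PCA(n_components=k, svd_solver="full").fit(X)
pca_share = pca.explained_variance_ratio_.sum()

# Share of the total variance kept by k randomly chosen original pixels.
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
random_pixels = rng.choice(X.shape[1], size=k, replace=False)
random_share = X[:, random_pixels].var(axis=0, ddof=1).sum() / total_variance

print(f"Variance kept by top-{k} PCA components: {pca_share:.3f}")
print(f"Variance kept by {k} random pixels:      {random_share:.3f}")
```

The top k principal components always keep at least as much variance as any k original pixels, because PCA picks the k orthogonal directions that maximize the variance kept, and the pixel axes are just one possible set of orthogonal directions.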
