## **HW4 Principal Component Analysis**

# 1. Introduction
Congratulations on reaching the final assignment! In this assignment, you will use Principal Component Analysis (PCA) to reduce the dimensionality of high-dimensional data. You will also compare the original high-dimensional data with the transformed data obtained through PCA.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
! pip install import_ipynb

In [None]:
'''
You are not allowed to import other packages

If you cannot import the following ipynb files, please run each ipynb file first and then restart HW4.ipynb.
'''

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
import math
from tqdm import tqdm
import time

import import_ipynb
from PCA import MY_PCA, MY_SparsePCA
from Model import *
from Loss import *
from Utils import *
from Data_preprocess import *
from Trainer import *
from Config import *

## Model & Data preprocess

This assignment builds directly on Assignment 3. Please follow the data preprocessing and model implementation steps from Assignment 3. Note that this time there are additional constraints on how layers are stacked in the model, so be sure to follow the prompts when designing it.

This time, we'll organize different functionalities into separate files for better code readability. For the model and data preprocessing, please implement them in the following files: Loss.ipynb, Model.ipynb, and Data_preprocess.ipynb.

In [None]:
X_train, Y_train, X_test = load_data('basic_data.npz')
x_train, y_train, x_val, y_val = data_preprocess(X_train, Y_train)

# 2. Basic Part

## PCA Implementation
In this section, you are required to implement PCA by completing the following steps in the PCA.ipynb file.
>* Step1. Centering --> in HW4.ipynb
>* Step2. Covariance matrix computation --> in PCA.ipynb
>* Step3. Eigenvectors and eigenvalues computation --> in PCA.ipynb
>* Step4. Projection --> in PCA.ipynb

After implementing PCA, you need to reduce the data to two dimensions, observe the two-dimensional scatter plot of the data, and include it in the report.
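The four steps above can be sketched with plain NumPy. This is a minimal illustration on random data, not the `MY_PCA` class expected in PCA.ipynb. The helper name `pca_fit_transform`, the `N - 1` divisor in the covariance, and the use of `np.linalg.eigh` are assumptions made for this example only.

```python
import numpy as np

def pca_fit_transform(x, n_components):
    """Hypothetical helper: returns (projected data, mean, components, eigenvalues)."""
    # Step 1: center the data with the per-feature mean
    mean = x.mean(axis=0)
    x_cent = x - mean
    # Step 2: covariance matrix, shape (n_features, n_features)
    # (assumption: unbiased estimate with N - 1 in the denominator)
    cov = x_cent.T @ x_cent / (x.shape[0] - 1)
    # Step 3: eigen-decomposition; eigh targets symmetric matrices and
    # returns eigenvalues in ascending order, so reorder to descending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: project onto the leading eigenvectors
    components = eigvecs[:, :n_components]
    return x_cent @ components, mean, components, eigvals

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
z_demo, mean_demo, comps_demo, eigvals_demo = pca_fit_transform(x_demo, n_components=2)
print(z_demo.shape)
```

In the usual setup, validation and test data are projected with the mean and components estimated from the training data, which is what the hint about choosing the appropriate mean in Step 1 refers to.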

In [None]:
# GRADED CODE: Implement centering function. (5%)
### START CODE HERE ###
'''
PCA Step1
HINT: It is important to choose the appropriate mean for data centering.

x_train_cent -> Centered training data
x_val_cent -> Centered validation data
X_test_cent -> Centered testing data
'''

x_train_cent = None
x_val_cent = None
X_test_cent = None
### END CODE HERE ###

In [None]:
# GRADED CODE:
# Reduce the dimensions to two and generate scatter plots.
# (Training dataset  5%, Validation dataset 5%)

### START CODE HERE ###
'''
x_train_pca -> PCA of training data
x_val_pca -> PCA of validation data
x_test_pca -> PCA of testing data
Please use pca.function(data) to generate PCA of these datasets

Parameters:
MY_PCA:
n_components = Number of components to do the transformation.

pca.PCA_visualization:
data_pca -> The dataset you want to visualize.
label -> The corresponding labels of data_pca.
text -> True if you want to plot the numbers on the scatter plot.
tag -> You can set a different tag for each figure.
'''
pca = MY_PCA(n_components=2)
x_train_pca = None
x_val_pca = None
x_test_pca = None

pca.PCA_visualization(data_pca=None, label=None, text=None, n_components=2, tag=None)
### END CODE HERE ###

An example PCA visualization of the Iris dataset:

![figure](./iris_pca.png "IRIS PCA dataset")

In [None]:
# For grading, please put basic_cov into your output.npy.
basic_cov = pca.covariance_matrix
print('covariance_matrix: ', (basic_cov[100][12:16]*10000).round(3))
# The values are multiplied by 10,000 because the original values are too small to read easily.
print('x_train_pca:', x_train_pca[0].round(3))
print('x_val_pca: ', x_val_pca[0].round(3))

**Expected Output**
$$ covariance\_matrix:\  [0.015\ \  0.322\ \ 0.322\ \ 0.013]$$
$$ x\_train\_pca:\ [0.707\ \ 3.321]$$
$$ x\_val\_pca:\ [-2.295\  -2.88]$$
$$or$$
$$ x\_train\_pca:\ [-0.707\ \ -3.321]$$
$$ x\_val\_pca:\ [2.295\  2.88]$$
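The two accepted outputs differ only in sign because an eigenvector is defined up to sign: if `v` is an eigenvector of the covariance matrix, so is `-v`. The sketch below checks this on random data; the variable names are made up for the illustration and nothing here depends on `MY_PCA`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
x_cent = x - x.mean(axis=0)
cov = np.cov(x_cent, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)
v = eigvecs[:, -1]   # eigenvector of the largest eigenvalue
lam = eigvals[-1]

# Both v and -v satisfy cov @ v = lam * v
same_pos = np.allclose(cov @ v, lam * v)
same_neg = np.allclose(cov @ (-v), lam * (-v))

# The projections onto v and -v differ by sign only
proj_pos = x_cent @ v
proj_neg = x_cent @ (-v)
print(same_pos, same_neg, np.allclose(proj_pos, -proj_neg))
```

Each component can flip independently of the others, so which sign you get depends on the eigen-solver and is not an error in itself.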

In [None]:
# GRADED CODE: TRAINING MODEL WITH PCA DATA (PCA for Training, Validation and Test Dataset)
### START CODE HERE ###
'''
n_components -> Number of PCA components; it will be the input dimension of your basic model.
pca -> Define your MY_PCA class here.

x_train_pca -> PCA of training data
x_val_pca -> PCA of validation data
x_test_pca -> PCA of testing data
Please use pca.xxx(data) to generate PCA of these datasets
'''

n_components = None
pca = None
x_train_pca = None
x_val_pca = None
x_test_pca = None
### END CODE HERE ###



In [None]:
# GRADED CODE: PCA INFORMATION REMAIN RATIO PLOT
# Calculate the minimum number of principal components to cover 80% variance (5%)
### START CODE HERE ###
'''
num_PC -> Minimum number of principal components to cover 80% variance
var_ratio -> The variance ratio of each component
'''

num_PC, var_ratio = pca.components_remain_ratio(0.80)
### END CODE HERE ###


plt.plot(var_ratio)
plt.axvline(x=num_PC, color='r', linestyle='--')

plt.text(x=num_PC+3, y=0.02, s='80%', color='r')
plt.title('Information Remain Ratio')
plt.xlabel('# of Principal Components')
plt.ylabel('Remain Ratio of Data')
plt.savefig('Information Remaining Ratio.png')
plt.show()
plt.close()
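One way to find the minimum number of components that covers a given share of the variance is to accumulate the sorted eigenvalue ratios. The sketch below is a hedged illustration: `min_components_for_variance` is a hypothetical helper, and whether `pca.components_remain_ratio` returns per-component or cumulative ratios is not specified here, so the sketch returns both.

```python
import numpy as np

def min_components_for_variance(eigvals, threshold):
    """Hypothetical helper: smallest k whose leading k eigenvalues reach `threshold` of the total variance."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    ratio = eigvals / eigvals.sum()        # variance ratio of each component
    cumulative = np.cumsum(ratio)          # variance covered by the first k components
    # First index where the cumulative ratio reaches the threshold, as a 1-based count
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, ratio, cumulative

eigvals_demo = np.array([6.0, 2.5, 1.0, 0.5])
k_demo, ratio_demo, cum_demo = min_components_for_variance(eigvals_demo, 0.80)
print(k_demo, cum_demo)
```

Here the first component covers 60% of the variance and the first two cover 85%, so two components are enough for an 80% threshold.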

## Data Reconstruction & Eigenvector Visualization

In [None]:


# GRADED CODE:
# Reconstruct an image using K components and compare it with the original image (Training data 5%, Validation data 5%)
# Visualize at least one eigenvector
# For grading, please put reconstruct_data_train and reconstruct_data_val in output.npy
reconstruct_data_train, z_train = pca.reconstructData(x_train[0], np.mean(x_train, axis=0), k=4)
reconstruct_data_val, z_val = pca.reconstructData(x_val[0], np.mean(x_train, axis=0), k=4)

### START CODE HERE ###
'''
reconstruct_img -> The reconstructed image of x_train[0].
eigenvector_img -> The image of an eigenvector.
'''

reconstruct_img = None
eigenvector_img = None
### END CODE HERE ###

plt.imshow(reconstruct_img, cmap='binary')

### Please put the reconstructed image in your report
plt.savefig('reconstruct_img.png')
plt.close()

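# Note: np.sum(...)**2 squares the summed residual (operator precedence); it is not a sum of squared residuals.
train_squared_reconstruct_error = np.sum(x_train[0] - reconstruct_data_train)**2/reconstruct_data_train.shape[0]
val_squared_reconstruct_error = np.sum(x_val[0] - reconstruct_data_val)**2/reconstruct_data_val.shape[0]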

In [None]:
print('k principal components:', num_PC)
print('Train Squared Reconstruct Error: ', train_squared_reconstruct_error.round(3))
print('Validation Squared Reconstruct Error: ',val_squared_reconstruct_error.round(3))
print('z: ', ['%.3f' %(z) for z in z_train])

**Expected output:**

$$ k\ principal\ components:  40 $$
$$ Train\ Squared\ Reconstruct\ Error:  0.043 $$
$$ Validation\ Squared\ Reconstruct\ Error:  0.077 $$
$$ z:  [0.707, 3.321, 0.928, 0.603] $$
$$or$$
$$ z:  [-0.707, -3.321, 0.928, -0.603] $$
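Reconstruction from `k` components amounts to projecting the centered sample onto the first `k` eigenvectors and mapping the result back to the original space. The following sketch shows the idea on random data. The helper `reconstruct` and its argument order are assumptions for this example and need not match `pca.reconstructData`.

```python
import numpy as np

def reconstruct(x, mean, eigvecs, k):
    """Hypothetical helper: project x onto the first k eigenvectors and map back."""
    v_k = eigvecs[:, :k]          # (n_features, k), columns in descending eigenvalue order
    z = (x - mean) @ v_k          # k-dimensional code
    x_hat = z @ v_k.T + mean      # back in the original feature space
    return x_hat, z

rng = np.random.default_rng(2)
x_demo = rng.normal(size=(50, 6))
mean_demo = x_demo.mean(axis=0)
cov = np.cov(x_demo - mean_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvecs = eigvecs[:, ::-1]        # eigh is ascending, so reverse to descending

errors = []
for k in (1, 3, 6):
    x_hat, z = reconstruct(x_demo[0], mean_demo, eigvecs, k)
    errors.append(np.mean((x_demo[0] - x_hat) ** 2))  # mean of squared residuals
print(errors)
```

The error shrinks as `k` grows and is numerically zero once all six components are used. Note that this sketch averages the squared residuals, whereas the notebook cell computes `np.sum(x - x_hat)**2 / n`, so its numbers are not directly comparable with the expected output.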

### Model
In this part, you will train your model on the low-dimensional data (after PCA) and on the original data, then compare the results.

In [None]:
config = Config([x_train.shape[1], 128, 10], 'focal_loss')

#CODE: TRAINING MODEL WITHOUT PCA DATA (MODEL SETTING AND TRAINING)

# Call Model.ipynb with config to define 'model'
# Use Trainer.ipynb to train your model.
### START CODE HERE ###
None
### END CODE HERE ###

In [None]:
pred_train = predict(x_train, y_train, model)

In [None]:
pred_val = predict(x_val, y_val, model)

In [None]:
config = Config([n_components, 128, 10], 'focal_loss')

# GRADED CODE: TRAINING MODEL WITH PCA DATA (MODEL SETTING AND TRAINING)
# Use PCA and the model from HW3 (advanced part) to train models on the imbalanced MNIST dataset. (10%)

# Call Model.ipynb with config to define 'model'
# Use Trainer.ipynb to train your model.
### START CODE HERE ###
None
### END CODE HERE ###

In [None]:
pred_train = predict(x_train_pca, y_train, model)

In [None]:
pred_val = predict(x_val_pca, y_val, model)

In [None]:
pred_test = predict(x_test_pca, None, model)
outputs = {}

### for grading
outputs["basic_pred_test"] = pred_test
outputs["basic_layers_dims"] = config.layers_dims
outputs["basic_activation_fn"] = config.activation_fn
outputs["basic_loss_function"] = config.loss_function
outputs["basic_alpha"] = config.alpha
outputs["basic_gamma"] = config.gamma
outputs["basic_reconstruct_data_train"] = reconstruct_data_train
outputs["basic_reconstruct_data_val"] = reconstruct_data_val
outputs["basic_covariance_matrix"] = basic_cov
outputs["basic_var_ratio"] = var_ratio
basic_model_parameters = []
for basic_linear in model.linear:
    basic_model_parameters.append(basic_linear.parameters)
outputs["basic_model_parameters"] = basic_model_parameters

# 3. Advanced Part

In the advanced section, you will implement Sparse PCA, a variant of PCA that encourages sparse component loadings.
Please complete the PCA.ipynb file for this purpose.
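To give a feel for how sparsity can enter PCA, here is a minimal sketch of one common formulation: a rank-1 approximation refined by alternating updates, with a soft-thresholding (L1 shrinkage) step on the loading vector. This is an assumption-laden illustration, not the `MY_SparsePCA` algorithm required in PCA.ipynb. The names `soft_threshold` and `sparse_pc1`, and the parameters `lam` and `n_iter`, are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries with |a| <= lam become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_pc1(x_cent, lam, n_iter=100):
    """Hypothetical rank-1 sparse PCA: alternate score and loading updates with L1 shrinkage."""
    # Start from the ordinary first principal direction (first right singular vector)
    _, _, vt = np.linalg.svd(x_cent, full_matrices=False)
    v = vt[0]
    for _ in range(n_iter):
        u = x_cent @ v                       # scores for the current loading vector
        u_norm = np.linalg.norm(u)
        if u_norm == 0:
            break
        u = u / u_norm
        v = soft_threshold(x_cent.T @ u, lam)  # shrink small loadings to exactly zero
        v_norm = np.linalg.norm(v)
        if v_norm == 0:
            break                            # lam was too large: every loading was removed
        v = v / v_norm
    return v

rng = np.random.default_rng(3)
latent = rng.normal(size=200)
x_demo = 0.1 * rng.normal(size=(200, 6))     # six features of small noise
x_demo[:, 0] += 3 * latent                   # only the first two features carry signal
x_demo[:, 1] += 3 * latent
x_demo = x_demo - x_demo.mean(axis=0)

v_dense = np.linalg.svd(x_demo, full_matrices=False)[2][0]
v_sparse = sparse_pc1(x_demo, lam=1.0)
print(np.count_nonzero(v_dense), np.count_nonzero(v_sparse))
```

In this toy data only the first two features share a signal, so the shrinkage step zeroes the loadings of the four noise features while ordinary PCA leaves small nonzero values on all of them.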

In [None]:
X_noise_train, Y_noise_train, X_noise_test = load_data('advanced_data.npz')
x_noise_train, y_noise_train, x_noise_val, y_noise_val = data_preprocess(X_noise_train, Y_noise_train)

In [None]:
# GRADED CODE: DATA CENTRALIZATION
### START CODE HERE ###
'''
x_noise_train_cent -> Centered training data
x_noise_val_cent -> Centered validation data
X_noise_test_cent -> Centered testing data
'''
x_noise_train_cent = None
x_noise_val_cent = None
X_noise_test_cent = None
### END CODE HERE ###

In [None]:
# YOU CAN DO PCA HERE TO COMPARE THE PERFORMANCE WITH SPARSE PCA (NOT FOR GRADING)
### START CODE HERE ###
# PCA PART
None

#TRAIN WITH ORIGINAL DATA
None

#TRAIN WITH PCA DATA
None

### END CODE HERE ###

### SparsePCA

In [None]:
n_components = 2
sparse_pca = MY_SparsePCA(n_components, 0.001, 1000)

# GRADED CODE: SPARSE PCA IMPLEMENT
### START CODE HERE ###
'''
x_train_spca -> Sparse PCA of training data
x_val_spca -> Sparse PCA of validation data
x_test_spca -> Sparse PCA of testing data
'''
x_train_spca = None
x_val_spca = None
x_test_spca = None
### END CODE HERE ###

In [None]:
### For grading, please put sparse_pca_check and sparse_Vt in output.npy
sparse_pca_check = x_train_spca
sparse_Vt = sparse_pca.Vt[0][0]

print('Sparse_pca init Vt: ', sparse_Vt)
print('x_train_spca: ', sparse_pca_check[0].round(3))

**Expected Output**
$$ Sparse\_pca\ init\ Vt:\ 1.74160428 \times 10^{-20} $$
$$ x\_train\_spca:\ [-0.229\ \ -0.762]$$

## Sparse PCA Implementation

In [None]:
# GRADED CODE: SPARSE PCA IMPLEMENT
### START CODE HERE ###
'''
n_components -> Number of Sparse PCA components; it will be the input dimension of your basic model.
sparse_pca -> Define your MY_SparsePCA instance here.
x_train_spca -> Sparse PCA of training data
x_val_spca -> Sparse PCA of validation data
x_test_spca -> Sparse PCA of testing data
'''
n_components = None
sparse_pca = None
x_train_spca = None
x_val_spca = None
x_test_spca = None
### END CODE HERE ###

## Model

In [None]:
config = Config([n_components, 64, 10], 'focal_loss')

# GRADED CODE: SPARSE PCA IMPLEMENT
# Please call Model.ipynb and Trainer.ipynb to define and train your model.
### START CODE HERE ###
None
### END CODE HERE ###

In [None]:
sparse_pred_train = predict(x_train_spca, y_noise_train, model)

In [None]:
sparse_pred_val = predict(x_val_spca, y_noise_val, model)

In [None]:
sparse_pred_test = predict(x_test_spca, None, model)

### Advanced Ranking
In the advanced ranking section, you are allowed to integrate PCA with additional data preprocessing. However, please note that you are not permitted to use existing data preprocessing and PCA libraries, modify the model's architecture, or alter the predetermined configuration.
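As one example of preprocessing that can be written without any library beyond NumPy, the sketch below standardizes each feature using statistics from the training split only. The helper `standardize` and its `eps` guard are assumptions for this illustration. Whether this step improves the ranking score is not guaranteed and has to be checked on the validation data.

```python
import numpy as np

def standardize(train, *others, eps=1e-8):
    """Hypothetical helper: z-score every split using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Constant features (for example, pixels that are always zero) keep a scale of 1
    std = np.where(std < eps, 1.0, std)
    return tuple((split - mean) / std for split in (train, *others))

rng = np.random.default_rng(4)
train_demo = rng.normal(loc=3.0, scale=2.0, size=(100, 4))
train_demo[:, 3] = 0.0                     # a constant column
val_demo = rng.normal(loc=3.0, scale=2.0, size=(20, 4))
val_demo[:, 3] = 0.0

train_std, val_std = standardize(train_demo, val_demo)
print(train_std.mean(axis=0).round(3), train_std.std(axis=0).round(3))
```

Reusing the training mean and standard deviation for the validation and test splits keeps all splits in the same coordinate system before PCA is applied.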

In [None]:
# GRADED CODE: RANKING PART DO YOUR DATA PREPROCESS HERE (10%)
## input_dim comment, e.g. number of principal components or image dimension
## loss function = 'focal_loss' or 'crossentropy'
### START CODE HERE ###
'''
input_dim -> The first input dimension of your model
e.g. It could be the number of principal components or the original data dimension. It depends on your data preprocessing.

loss_function -> You can choose the loss function from HW3. (e.g. 'focal_loss')
'''
input_dim = None
loss_function = None

### END CODE HERE ###

In [None]:
adv_config = Config([input_dim, 128, 10], loss_function)
adv_model = Model(adv_config)

# GRADED CODE: RANKING PART DO YOUR TRAINING WORK HERE
# Please call Trainer.ipynb to train the adv_model.
### START CODE HERE ###
None
adv_pred_test = predict(None, None, adv_model)
### END CODE HERE ###

In [None]:
# for grading
outputs["sparse_Vt"] = sparse_Vt
outputs["sparse_pca"] = sparse_pca_check
outputs["sparse_pred_train"] = sparse_pred_train
outputs["sparse_pred_val"] = sparse_pred_val
outputs["sparse_pred_test"] = sparse_pred_test

outputs["advanced_pred_test"] = adv_pred_test
outputs["advanced_layers_dims"] = adv_config.layers_dims
outputs["advanced_activation_fn"] = adv_config.activation_fn
outputs["advanced_loss_function"] = adv_config.loss_function
outputs["advanced_alpha"] = adv_config.alpha
outputs["advanced_gamma"] = adv_config.gamma


advanced_model_parameters = []
for advanced_linear in adv_model.linear:
    advanced_model_parameters.append(advanced_linear.parameters)
outputs["advanced_model_parameters"] = advanced_model_parameters

In [None]:
# sanity check
assert list(outputs.keys()) == [
    'basic_pred_test',\
    'basic_layers_dims',\
    'basic_activation_fn',\
    'basic_loss_function',\
    'basic_alpha',\
    'basic_gamma',\
    'basic_reconstruct_data_train',\
    'basic_reconstruct_data_val',\
    'basic_covariance_matrix',\
    'basic_var_ratio',\
    'basic_model_parameters',\
    'sparse_Vt',\
    'sparse_pca',\
    'sparse_pred_train',\
    'sparse_pred_val',\
    'sparse_pred_test',\
    'advanced_pred_test',\
    'advanced_layers_dims',\
    'advanced_activation_fn',\
    'advanced_loss_function',\
    'advanced_alpha',\
    'advanced_gamma',\
    'advanced_model_parameters'],\
"You're missing something, please restart the kernel and run the code from beginning to end. If the same error occurs, maybe you deleted some outputs; check the template to find the missing parts!"

In [None]:
np.save("output.npy", outputs)

In [None]:
# sanity check
submit = np.load("output.npy", allow_pickle=True).item()
for key, value in submit.items():
    print(str(key) + ": " + str(type(value)))