## Lab 2: Principal Component Analysis
You can use external libraries for linear algebra operations but you are expected to write your own algorithms.

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

# Exercise 1
Use the ```Dry_Bean_Dataset.xlsx``` file available on the ```github``` page of the labs.
- Divide your dataset into a train and a test set.
- Preprocess the data by centering the variables and dividing them by their standard deviation.

In [29]:
df = pd.read_excel("../Datasets/Dry_Bean_Dataset.xlsx")

In [None]:
df.head()

In [31]:
y = df['Class']
X = df.drop('Class', axis=1)

In [None]:
encoder = OrdinalEncoder()
y=np.array(y)
encoder.fit(y.reshape(-1,1))
y = encoder.transform(y.reshape(-1, 1))
print(y)  

In [33]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

In [34]:
X_mean = Xtrain.mean()
X_std = Xtrain.std()

In [35]:
Xtrain = (Xtrain-X_mean )/X_std

In [None]:
Xtrain.head()

- Write your own algorithm to perform PCA on the variables.

In [37]:
#Since the intent of these laboratories is for YOU to learn and test the algorithms, we will not provide a "hand-made" version of the algorithm in these solutions (as your code will be commented during the exam).
#We will use instead the sklearn version of PCA. You can check if your results match the provided solution.

In [38]:
from sklearn.decomposition import PCA

In [39]:
pca = PCA() #this will keep all the components
#alternatively you can specify the number of components you want to keep by writing 
# pca = PCA(n_components =3)

In [None]:
pca.fit(Xtrain) #this is just a fit of the model to the training set

- Using the training set, obtain and plot the eigenvalue spectrum using the log-scale for the y-axis. What number of principal components would you select?

In [None]:
plt.figure(figsize=(10,8))
plt.plot(pca.explained_variance_, "o-") #eigenvalues of the covariance matrix (singular_values_ are not the eigenvalues)
plt.title("Eigenvalue spectrum")
plt.show()

In [None]:
plt.figure(figsize=(10,8))
plt.semilogy(pca.explained_variance_, "o-") #log-scale on the y-axis
plt.title("Eigenvalue spectrum (log-scale y-axis)")
plt.show()
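
sklearn exposes both `singular_values_` and `explained_variance_`. The eigenvalues of the sample covariance matrix are the latter, and the two are related by $\lambda_i = s_i^2/(n-1)$. A minimal check of this relation, run on synthetic data (the matrix `A` below is a made-up stand-in, not the bean dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a training matrix (assumption: 200 samples, 5 correlated features)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca_check = PCA().fit(A)
n_samples = A.shape[0]

# Eigenvalues recovered from the singular values of the centered data
eigvals_from_sv = pca_check.singular_values_ ** 2 / (n_samples - 1)

# Eigenvalues of the sample covariance matrix, sorted in decreasing order
eigvals_direct = np.sort(np.linalg.eigvalsh(np.cov(A, rowvar=False)))[::-1]

print(np.allclose(eigvals_from_sv, pca_check.explained_variance_))
print(np.allclose(eigvals_direct, pca_check.explained_variance_))
```

The same comparison can be used to check the eigenvalues produced by your own implementation.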

In [None]:
plt.figure(figsize=(10,8))
plt.plot(pca.explained_variance_ratio_,"o-")
plt.title("Amount of variance explained by each PC")
plt.show()
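
One common way to answer "how many components?" is to look at the cumulative explained variance and keep the smallest number of PCs that reaches a chosen threshold. The sketch below uses synthetic data, and the 95% threshold is an arbitrary assumption; an elbow in the spectrum is an equally valid criterion.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with decaying spread per direction (assumption, not the bean data)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.1, 0.05])
B = rng.normal(size=(500, 6)) * scales

pca_cum = PCA().fit(B)
cumulative = np.cumsum(pca_cum.explained_variance_ratio_)

threshold = 0.95  # hypothetical cut-off; choose according to the application
# First index where the cumulative ratio reaches the threshold, converted to a count
n_keep = int(np.searchsorted(cumulative, threshold)) + 1
print(n_keep, cumulative.round(3))
```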

- Project the data (training set) onto the first two principal components and color by class. Do the same for three principal components.

In [None]:
X_new = pca.transform(Xtrain)
print(X_new.shape)

#you can also fit the model and apply the dimensionality reduction to the same set by writing
#pca.fit_transform(Xtrain)

In [45]:
X_new_3 = X_new[:,:3] #we need at most 3 PCs

In [46]:
ytrain = np.array(ytrain)

In [None]:
data = np.column_stack((X_new_3, ytrain))
print(data.shape)

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(data[:,0], data[:,1], c=data[:,3])
plt.show()

In [None]:
plt.figure(figsize=(10,8))
axes = plt.axes(projection='3d')
axes.scatter3D(data[:,0], data[:,1], data[:,2], c=data[:,3])
axes.view_init(30,15)
axes.set_xlabel('1st PC')
axes.set_ylabel('2nd PC')
axes.set_zlabel('3rd PC')
plt.show()

- For an increasing number of principal components (1 to 16):
- - Apply a multinomial logistic regression to learn a model on the training set (use ```sklearn.linear_model.LogisticRegression```).
- - Transform the test set with the matrix learned from the training set. Make a prediction with the learned logistic model.
- - Assess the quality of the predictions and comment on the results.
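
The key point of the steps above is that everything applied to the test set (mean, standard deviation, projection matrix) is learned on the training set only. A minimal sketch of the whole loop on synthetic data follows; `make_classification` and all variable names are stand-ins and not part of the lab. With the default `lbfgs` solver, recent scikit-learn versions fit multiclass problems with a multinomial loss.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the bean dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Standardize both sets with statistics computed on the training set only
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0, ddof=1)
X_tr_s = (X_tr - mu) / sigma
X_te_s = (X_te - mu) / sigma

# Fit PCA on the training set, then project both sets with the same matrix
pca_tr = PCA().fit(X_tr_s)
Z_tr, Z_te = pca_tr.transform(X_tr_s), pca_tr.transform(X_te_s)

scores = []
for k in range(1, Z_tr.shape[1] + 1):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(Z_tr[:, :k], y_tr)                   # train on the first k PCs
    scores.append(clf.score(Z_te[:, :k], y_te))  # accuracy on the test set

best_k = int(np.argmax(scores)) + 1  # argmax is 0-based, the number of PCs is 1-based
print(best_k, max(scores))
```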

In [51]:
Xtest = (Xtest - X_mean)/X_std #preprocessing of the test set with the mean and std of the training set

In [None]:
score = []
for i in range(X_new.shape[1]): #X_new is the PCA transformation of the training set with all components kept
    lr = LogisticRegression(multi_class='multinomial', max_iter=1000)
    lr.fit(X_new[:,:i+1], ytrain.ravel())

    x_PC = pca.transform(Xtest)
    x_PC = x_PC[:,:i+1] #keep only the needed PCs
    
    yhat = lr.predict(x_PC)
    print(i, ": ", lr.score(x_PC, ytest.ravel()))
    score.append(lr.score(x_PC, ytest.ravel()))

In [None]:
plt.plot(score, "o-")
plt.show()

In [None]:
print(f"The maximum accuracy score is reached with {np.argmax(score) + 1} PCs and it is equal to {np.max(score)}")

# Exercise 2
Try to apply PCA to the Swiss Roll dataset ($n=1000$) from Lab 1 and plot the projection on the first two principal components. Choose an appropriate color scheme for visualization and comment on your results. 

In [58]:
def swiss_roll(n): #from lab 1
    """
    Parameters:
    n: int
        Number of points to generate"""
    
    data = np.zeros((n,3))
    phi = np.random.uniform(low=1.5*np.pi, high=4.5*np.pi, size=n)
    psi = np.random.uniform(0,10,n)
               
    data[:,0]=phi*np.cos(phi) #x coordinate
    data[:,1]=phi*np.sin(phi) #y coordinate
    data[:,2]=psi #z coordinate
    return data

In [59]:
X = swiss_roll(1000)

In [60]:
#X = X - X.mean(axis=0) #or
X = (X - X.mean(axis=0))/X.std(axis=0) #per-feature statistics; in practice, centering the data is enough

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(X[:,0], X[:,1])
plt.title("Swiss roll")
plt.show()

In [63]:
pca_swiss = PCA(n_components=2)
X_transformed = pca_swiss.fit_transform(X)

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(X_transformed[:,0], X_transformed[:,1])
plt.title("Swiss roll - 2 PCs")
plt.show()
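
To choose a color scheme, one option is to color each point by its position along the spiral, i.e. the angle `phi`. Since `swiss_roll` does not return `phi`, the sketch below uses a hypothetical variant, `swiss_roll_with_angle`, that returns it alongside the data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def swiss_roll_with_angle(n, seed=0):
    """Hypothetical variant of swiss_roll that also returns the angle phi."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)
    psi = rng.uniform(0, 10, size=n)
    data = np.column_stack((phi * np.cos(phi), phi * np.sin(phi), psi))
    return data, phi

roll, angle = swiss_roll_with_angle(1000)
roll_2d = PCA(n_components=2).fit_transform(roll)  # PCA centers the data internally

plt.figure(figsize=(8, 8))
plt.scatter(roll_2d[:, 0], roll_2d[:, 1], c=angle, cmap="viridis")
plt.colorbar(label="phi (position along the spiral)")
plt.title("Swiss roll - 2 PCs, colored by phi")
plt.show()
```

With this coloring it becomes easy to see whether points that are far apart along the roll end up close together in the projection. Since PCA is a linear method, it can only rotate and project the data, so we would expect the spiral to stay rolled up rather than be unfolded.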