1\. **PCA on 3D dataset**

* Generate a dataset simulating 3 features, each with N entries (N being ${\cal O}(1000)$). Each feature is made by random numbers generated according the normal distribution $N(\mu,\sigma)$ with mean $\mu_i$ and standard deviation $\sigma_i$, with $i=1, 2, 3$. Generate the 3 variables $x_{i}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix
* Find the eigenvectors and eigenvalues using the SVD. Check that the two procedures yield to same result
* What percent of the total dataset's variability is explained by the principal components? Given how the dataset was constructed, do these make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained
* Redefine the data according to the new basis from the PCA
* Plot the data, in both the original and the new basis. The figure should have 2 rows (the original and the new basis) and 3 columns (the $[x_0, x_1]$, $[x_0, x_2]$ and $[x_1, x_2]$ projections) of scatter plots.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as la

# Generate the dataset
N = 10000
x1 = np.random.normal(0, 1, N)
x2 = x1 + np.random.normal(0, 3, N)
x3 = 2*x1 + x2
X = np.array([x1, x2, x3])

# Compute the eigenvectors and eigenvalues using the eigendecomposition of the covariance matrix
cov = np.cov(X)
l, V = la.eig(cov)
l= np.real_if_close(l)

print("Eigenvalues:\n", l, '\n')
print("Eigenvectors:\n", V, '\n')

# Compute the eigenvectors and eigenvalues using the SVD
U, S, Vt = la.svd(X)
l_svd = s**2/(N-1)
print("Eigenvalues:\n", l_svd, '\n')
print("Eigenvectors:\n", U, '\n')


#reducing dimensions + estimate nof retained variability
Lambda=np.diag(np.sort(l)[::-1])
p = Lambda[0][0] / Lambda.trace()
p2 = (Lambda[0][0] + Lambda[1][1])/Lambda.trace()

print(f"variability retained choosing the first component: {p*100}")
print(f"variability retained chooding the 2 largest eigenvalues: {p2*100}")
      

Eigenvalues:
 [ 2.81346400e+01 -2.11607874e-15  2.02362998e+00] 

Eigenvectors:
 [[-0.1145129  -0.81649658  0.56587996]
 [-0.57854567 -0.40824829 -0.70612905]
 [-0.80757148  0.40824829  0.42563087]] 

Eigenvalues:
 [2.74252474e+01 1.99457801e+00 1.55271721e-31] 

Eigenvectors:
 [[-0.1145169   0.56587915 -0.81649658]
 [-0.57854069 -0.70613313 -0.40824829]
 [-0.80757449  0.42562517  0.40824829]] 

variability retained choosing the first component: 93.2899666505502
variability retained chooding the 2 largest eigenvalues: 100.00000000000003


2\. **PCA on a nD dataset**

* Start from the dataset you have genereted in the previous exercise and add uncorrelated random noise. Such noise should be represented by other 10 uncorrelated variables normally distributed, with a standard deviation much smaller (e.g. a factor 20) than those used to generate the $x_1$ and $x_2$. Repeat the PCA procedure and compare the results with what you have obtained before.

In [42]:
for n in range(3):
    for e in range(10):
        X[n] = X[n] + np.random.normal(0, 1/20, N)

# covariance matrix
covn = np.cov(X)
# find the eigenvectors of the covariance matrix
ln, Vn = la.eig(covn)
# take only the real component
ln = np.real_if_close(ln)

print('Covariance Matrix:')
print(covn,'\n')

print("Eigendecomposition:")
print("Eigenvalues:\n", ln)
print("Eigenvectors:\n", Vn)

Lambda2 = np.diag(np.sort(ln)[::-1]) #sort the eigenvalues in descending order
p2 = Lambda2[0][0]/Lambda2.trace()
p22 = (Lambda2[0][0]+Lambda2[1][1])/Lambda2.trace()
print()
print(f"variability retained choosing the first component: {p2*100}")
print(f"variability retained chooding the 2 largest eigenvalues: {p22*100}")

Covariance Matrix:
[[ 1.30834128  1.20242191  3.24907005]
 [ 1.20242191 10.76528468 12.71726984]
 [ 3.24907005 12.71726984 19.08004388]] 

Eigendecomposition:
Eigenvalues:
 [28.70890042  0.24151083  2.20325859]
Eigenvectors:
 [[-0.12100129 -0.80870633  0.57563248]
 [-0.57937149 -0.41333579 -0.70248289]
 [-0.80603187  0.41850638  0.41852723]]

variability retained choosing the first component: 92.15254757964783
variability retained chooding the 2 largest eigenvalues: 99.22477566617893


3\. **Optional**: **PCA on the MAGIC dataset**

Perform a PCA on the magic04.data dataset.

In [None]:
# get the dataset and its description on the proper data directory
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P data/
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P data/ 