## Due date: Tuesday 02/25 at 11:59 pm ##

# Lab 3: Principal Component Analysis

In this lab assignment, we will walk through an example of using Principal Component Analysis (PCA) on a dataset involving [iris plants](https://en.wikipedia.org/wiki/Iris_(plant)).


In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

To begin, run the following cell to load the dataset into this notebook. 
* `iris_features` will contain a numpy array of 4 attributes for 150 different plants (shape 150 x 4). This is the matrix to which we will apply PCA. To follow the lecture notes, we will transpose it so that it has shape 4 x 150.
* `iris_target` will contain the class of each plant. There are 3 classes of plants in the dataset: Iris-Setosa, Iris-Versicolour, and Iris-Virginica. The class names will be stored in `iris_target_names`.
* `iris_feature_names` will be a list of 4 names, one for each attribute in `iris_features`. 

Additional information on the dataset will be included in the description printed at the end of the following cell.

In [None]:
iris_data = load_iris() # Loading the dataset

# Unpacking the data into arrays
iris_features = iris_data['data']
iris_target = iris_data['target']
iris_feature_names = iris_data['feature_names']
iris_target_names = iris_data['target_names']

# Convert iris_target from int labels (0, 1, 2) to string labels for the classes
iris_target = iris_target_names[iris_target]

# Transpose iris_features to have shape m x N (features x samples)
iris_features=iris_features.T
iris_features.shape

Let's explore the data by creating a scatter matrix of our iris features. To do this, we'll create 2D scatter plots for every possible pair of our four features. This should result in six total scatter plots in our scatter matrix.
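As a quick check on the count of six, here is a minimal sketch that enumerates the unordered pairs of four features. The `feature_names` list is a hypothetical stand-in, not the actual `iris_feature_names`.

```python
from itertools import combinations

# Hypothetical stand-in for the four feature names
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

# Each unordered pair of distinct features corresponds to one scatter plot
pairs = list(combinations(range(len(feature_names)), 2))
for j, i in pairs:
    print(feature_names[i], "vs.", feature_names[j])

print(len(pairs))  # 4 choose 2 = 6
```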

In [None]:
plt.figure(figsize=(14, 10))
plt.suptitle("Scatter Matrix of Iris Features")
plt.subplots_adjust(wspace=0.3, hspace=0.3)
for i in range(1, 4):
    for j in range(i):
        plt.subplot(3, 3, i+3*j)
        # Pass x and y as keyword arguments; recent seaborn versions reject them positionally
        sns.scatterplot(x=iris_features[i, :], y=iris_features[j, :], hue=iris_target) # SOLUTION
        plt.xlabel(iris_feature_names[i])
        plt.ylabel(iris_feature_names[j])

## Question 1a

To apply PCA, we will first need to "center" the data so that the mean of each feature is 0. Additionally, we will need to scale the centered data by $\frac{1}{\sqrt n}$, where $n$ is the number of samples we have in our dataset (the columns of `iris_features` after the transpose).

Compute the row-wise mean of `iris_features` in the cell below and store it in `iris_mean` (should be a numpy array of 4 means, 1 for each attribute). Then, subtract `iris_mean` from `iris_features`, divide the result by $\sqrt n$, and save the result in `normalized_features`.

**Hints:** 
* Use `np.mean` or `np.average` to compute `iris_mean`, and pay attention to the `axis` argument.
* If you are confused about how numpy deals with arithmetic operations between arrays of different shapes, see this note about [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) for explanations/examples.
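To illustrate the broadcasting hint, here is a small sketch on a toy matrix (not the iris data). The names `X`, `row_means`, and `centered` are illustrative assumptions. Note that subtracting the shape `(2,)` mean directly from the shape `(2, 3)` matrix would raise an error, which is why the mean is reshaped into a column first.

```python
import numpy as np

# Toy matrix: 2 features (rows) x 3 samples (columns)
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

row_means = np.mean(X, axis=1)          # shape (2,): one mean per row
col_vector = row_means.reshape(-1, 1)   # shape (2, 1): a column of means

# (2, 3) - (2, 1): the column of means is broadcast across all 3 columns
centered = X - col_vector

print(centered.mean(axis=1))  # each row now has mean 0
```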


In [None]:
#student
n = ... # should be 150
iris_mean = ...
iris_mean.shape=(iris_mean.size,1)  # reshape into a two-dimensional column vector so the next calculation broadcasts correctly
normalized_features = ...

In [None]:
#solution
n = iris_features.shape[1] # should be 150
iris_mean = np.mean(iris_features, axis=1) 
iris_mean.shape=(iris_mean.size,1)
normalized_features = (iris_features - iris_mean) / np.sqrt(n) 

## Question 1b

As you may recall from lecture, PCA is a specific application of the singular value decomposition (SVD) for matrices. In the following cell, let's use the [`np.linalg.svd`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) function to compute the SVD of our `normalized_features`. Store the left singular vectors, singular values, and right singular vectors in `u`, `s`, and `vt` respectively.

**Hint:** Set the `full_matrices` argument of `np.linalg.svd` to `False`.
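To see what `full_matrices` changes, the sketch below compares the output shapes on a random toy matrix (an assumption for illustration, not the iris data). For a wide matrix, the reduced form keeps only as many rows of `vt` as there are singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 7))  # toy "wide" matrix: 3 rows, 7 columns

u_full, s_full, vt_full = np.linalg.svd(A, full_matrices=True)
u_thin, s_thin, vt_thin = np.linalg.svd(A, full_matrices=False)

print(u_full.shape, s_full.shape, vt_full.shape)  # (3, 3) (3,) (7, 7)
print(u_thin.shape, s_thin.shape, vt_thin.shape)  # (3, 3) (3,) (3, 7)

# The reduced factors still reconstruct A exactly (up to floating point)
A_rebuilt = u_thin @ np.diag(s_thin) @ vt_thin
print(np.allclose(A_rebuilt, A))
```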

In [None]:
#student
u, s, vt = ...

In [None]:
#solution
u, s, vt = np.linalg.svd(normalized_features, full_matrices=False) # SOLUTION
u.shape, s, vt.shape

## Question 1c

What can we learn from the singular values in `s`? First, we can compute the total variance of the data by summing the squared singular values. We will later be able to use this value to determine the variance captured by a subset of our principal components.

Compute the total variance below and store the result in the variable `total_variance`.
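One way to see why the squared singular values sum to the total variance: write $\tilde{X}$ for `normalized_features`, $\sigma_i$ for the entries of `s`, $x_{jk}$ for the entries of `iris_features`, and $\bar{x}_j$ for the mean of row $j$. This notation is an assumption made here and may differ from the lecture notes. Since the squared Frobenius norm of a matrix equals the sum of its squared singular values,

$$\sum_i \sigma_i^2 = \|\tilde{X}\|_F^2 = \sum_{j=1}^{4} \sum_{k=1}^{n} \left(\frac{x_{jk} - \bar{x}_j}{\sqrt{n}}\right)^2 = \sum_{j=1}^{4} \frac{1}{n} \sum_{k=1}^{n} \left(x_{jk} - \bar{x}_j\right)^2 = \sum_{j=1}^{4} \operatorname{Var}(x_j)$$

Here $\operatorname{Var}$ is the population variance (dividing by $n$), which is also the default behavior of `np.var`.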


In [None]:
#student
total_variance = ...
print("total_variance: {:.3f} should approximately equal the sum of feature variances: {:.3f}"
      .format(..., ))

In [None]:
#solution
total_variance = np.sum(np.square(s)) # SOLUTION
print("total_variance: {:.3f} should approximately equal the sum of feature variances: {:.3f}"
      .format(total_variance, np.sum(np.var(iris_features, axis=1))))

## Question 2a

Let's now use only the first two principal components to see what a 2D version of our iris data looks like.
Print the first two principal components:

In [None]:
#student

In [None]:
#solution
u[:, :2]  # first two columns of u

What, then, is the best 2-dimensional affine subspace? Remember that to define an affine subspace you need to specify the translation vector ($\mu$ in your lecture notes) and a matrix $Q$ whose columns form a basis for the subspace.
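As a reminder of how $\mu$ and $Q$ are used, here is a minimal sketch on a toy example in three dimensions. The values of `mu`, `Q`, and `x` are hypothetical and unrelated to the iris data. It assumes the columns of `Q` are orthonormal, so the coordinates of a point within the subspace are `Q.T @ (x - mu)`.

```python
import numpy as np

mu = np.array([[1.0], [1.0], [0.0]])   # hypothetical translation vector
Q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])              # orthonormal basis of the xy-plane
x = np.array([[3.0], [5.0], [2.0]])    # a point outside the subspace

beta = Q.T @ (x - mu)   # coordinates of x within the affine subspace
x_hat = mu + Q @ beta   # closest point to x in the affine subspace

print(beta.ravel())
print(x_hat.ravel())
```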

In [None]:
#student
mu=...
Q=...
print(mu,Q)

In [None]:
#solution
mu=iris_mean
Q=u[:, :2]  # first two left singular vectors as columns
print(mu,Q)

Find the rank-2 principal coordinates of the data.

In [None]:
#student
beta_2d = ...

In [None]:
#solution
beta_2d = np.dot(Q.T,np.sqrt(n)*normalized_features) 

Find the rank-2 principal mean of the data.

In [None]:
#student
beta_bar=...

In [None]:
#solution
beta_bar=np.dot(Q.T,iris_mean)

Now construct the 2D version of the iris data by shifting the rank-2 principal coordinates by the rank-2 principal mean.

In [None]:
#student
iris_2d=...

In [None]:
#solution
iris_2d=beta_2d+beta_bar

Now, run the cell below to create the scatter plot of our 2D version of the iris data, `iris_2d`.

In [None]:
plt.figure(figsize=(9, 6))
plt.title("PC2 vs. PC1 for Iris Data")
plt.xlabel("Iris PC1")
plt.ylabel("Iris PC2")
sns.scatterplot(x=iris_2d[0, :], y=iris_2d[1, :], hue=iris_target);

## Question 2b

What do you observe about the plot above? If you were given a point in the subspace defined by PC1 and PC2, how well would you be able to classify the point as one of the three iris types?


#student

**SOLUTION:** The setosa class is still separated from the other two classes, and there is some slight overlap between the other two classes. Perfect classification between the versicolor and virginica classes would be hard using this representation.

## Question 2c

What proportion of the total variance is accounted for when we project the iris data down to two dimensions? Compute this quantity in the cell below.
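The arithmetic can be sketched on made-up singular values. The array `sing_vals` below is hypothetical, not the output of the SVD above. Each squared singular value is the variance along one principal component, so dividing by the sum gives proportions.

```python
import numpy as np

sing_vals = np.array([3.0, 2.0, 1.0])    # hypothetical singular values

var_per_component = sing_vals ** 2        # variances: 9, 4, 1
proportion = var_per_component / var_per_component.sum()
cumulative = np.cumsum(proportion)        # variance captured by the first k components

print(proportion)
print(cumulative[1])  # share captured by the first two components: 13/14
```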

In [None]:
#student

In [None]:
#solution
two_dim_variance = np.sum(np.square(s[:2])) / total_variance 
two_dim_variance

## Question 3

As a last step, let's create a [scree plot](https://en.wikipedia.org/wiki/Scree_plot) to visualize the weight of each principal component. In the cell below, create a scree plot by plotting a line plot of the square of the singular values in `s` vs. the principal component number (1st, 2nd, 3rd, or 4th).


In [None]:
#student
plt.xticks([1, 2, 3, 4])
plt.xlabel("Principal Component")
plt.ylabel("Variance (Component Scores)")
plt.title("Scree Plot of Iris Principal Components")
plt.plot([1, 2, 3, 4], ...);

In [None]:
# solution
plt.xticks([1, 2, 3, 4])
plt.xlabel("Principal Component")
plt.ylabel("Variance (Component Scores)")
plt.title("Scree Plot of Iris Principal Components")
plt.plot([1, 2, 3, 4], np.square(s));

## Submission Instructions ##

Many assignments throughout the course will have a written portion and a code portion. Please follow the directions below to properly submit both portions.

### Written Portion ###
*  Scan all the pages into a PDF. You can use any scanner or a phone using applications such as CamScanner. Please **DO NOT** simply take pictures using your phone. 
* **Please start a new page for each PART**. If you have already written multiple questions on the same page, you can crop the image in CamScanner or fold your page over (the old-fashioned way). This helps expedite grading.
* It is your responsibility to check that all the work on all the scanned pages is legible.

### Code Portion ###
* Save your notebook using File > Save and Checkpoint.
* Use File > Download as > PDF via LaTeX.
* Download the PDF file and confirm that none of your work is missing or cut off.

### Submitting ###
* Combine the PDFs from the written and code portions into one PDF.  [Here](https://smallpdf.com/merge-pdf) is a useful tool for doing so.  
* Submit the assignment to Lab3 on Gradescope. 
* **Make sure to assign each page of your pdf to the correct question.**

