In [None]:
import pandas as pd
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz as gp
from sklearn import decomposition as dcp

# Part 1: How does PCA work?

We use here a subset of the McDonald's dataset seen in class: we have retained only three of the features [Vitamin C, Total Fat, Cholesterol] to better understand how PCA works.


In [None]:
menu=pd.read_csv("McDonaldsMenu.csv")

1. Visualize the header of the dataframe and make sure there are no empty values.

In [None]:
menu.head()

In [None]:
menu.isna().sum()

2. What is the variance of each feature? Use `menu.cov()` to find this. Which variable has the highest variance? Which has the lowest?

In [None]:
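menu.cov(numeric_only=True) #numeric_only=True skips the text columns (Item, Category), which pandas >= 2.0 no longer drops automatically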

3. If we add the variance of each feature, how much do we get? What is the ratio variance/total dataset variance for each feature? How low does this go?

In [None]:
694.087+478.961925+846.324265

In [None]:
694.087600/(694.087+478.961925+846.324265)

In [None]:
478.961925/2019.37319

In [None]:
846.324265/(694.087+478.961925+846.324265)
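The ratios typed by hand above can also be computed directly from the dataframe, which avoids copying rounded numbers. This is a minimal sketch on a small made-up dataframe (`toy` and its values are invented for illustration, they are not the McDonald's data):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not the McDonald's data.
toy = pd.DataFrame({
    "Vitamin C": [0.0, 2.0, 15.0, 8.0, 30.0],
    "Total Fat": [20.0, 35.0, 12.0, 40.0, 5.0],
    "Cholesterol": [10.0, 25.0, 8.0, 50.0, 0.0],
})

variances = toy.var()                      # same values as the diagonal of toy.cov()
total_variance = variances.sum()           # total dataset variance
variance_share = variances / total_variance  # ratio variance / total variance per feature
print(variance_share)
```

With the real `menu`, the text columns would first have to be set aside, for example with `menu.select_dtypes("number")`.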

4. We now consider a linear transformation of the variables that we give below. We obtain a new dataframe with three features `z1`, `z2`, and `z3`. Visualize the header of this matrix and obtain its covariance matrix. What is the variance of each feature? What is the sum of all three variances?

In [None]:
df=pd.DataFrame()

#creating centered variables
Vit_C_centered=menu["Vitamin C"]-menu["Vitamin C"].mean()
Total_Fat_centered=menu["Total Fat"]-menu["Total Fat"].mean()
Cholesterol_centered=menu["Cholesterol"]-menu["Cholesterol"].mean()

df["Category"]=menu["Category"]
df["Item"]=menu["Item"]
df["z1"]=-0.17718634*Vit_C_centered+0.54454878*Total_Fat_centered+0.81979975*Cholesterol_centered
df["z2"]=0.98405437*Vit_C_centered+0.08485918*Total_Fat_centered+0.15631992*Cholesterol_centered
df["z3"]=0.01555629*Vit_C_centered+0.83442528*Total_Fat_centered-0.5509149*Cholesterol_centered

In [None]:
df.head()

In [None]:
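df.cov(numeric_only=True) #numeric_only=True skips the text columns (Item, Category), which pandas >= 2.0 no longer drops automatically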

In [None]:
1147.864+679.5013+192.011090

5. What is the variance of `z3` relative to the total variance? Would you feel comfortable dropping `z3`?

In [None]:
192.011090/2019.37639
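The two totals computed above agree up to rounding. That is expected whenever the loadings form an orthonormal set, as PCA loadings do: the transformation redistributes the variance among the new features without changing the total. The sketch below checks this on synthetic data (the data and variable names are assumptions for illustration), using the eigenvectors of the covariance matrix as loadings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 observations of 3 correlated features (illustrative only).
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 1.0, 0.5],
                                          [0.0, 2.0, 1.0],
                                          [0.0, 0.0, 1.0]])
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # columns are orthonormal
order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

Z = X_centered @ eigenvectors                     # new features z1, z2, z3
cov_Z = np.cov(Z, rowvar=False)

print("original variances:", np.diag(cov))
print("new variances:     ", np.diag(cov_Z))
print("totals:", np.trace(cov), np.trace(cov_Z))
```

The new features are also uncorrelated: the off-diagonal entries of `cov_Z` are zero up to rounding.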

# Part 2: Understanding what Python outputs when using PCA

We are going to see what commands lead to what output in Python. We use the scikit-learn decomposition library for PCA.

In [None]:
from sklearn import decomposition as dcp

We also have to drop the Item and Category columns of `menu`, as PCA only works on numerical data.

In [None]:
menu=pd.read_csv("McDonaldsMenu.csv")
menu_num=menu.drop(columns=["Item", "Category"])

We now fit the PCA transform to the data. Note that we have 3 "old" variables. Here, we are specifying that we want 3 new variables from the three old ones.

In [None]:
pca=dcp.PCA(n_components=3)
pca.fit(menu_num)

1. Run the code below. Do you recognize its output? Where have you seen these numbers before during this class?

In [None]:
pca.components_

2. Run the code below. Again, do you recognize its output?

In [None]:
pca.explained_variance_

3. Run the code below. Again, do you recognize its output? What does it correspond to?

In [None]:
pca.explained_variance_ratio_

In [None]:
pca.explained_variance_ratio_.sum()

4. Run the code below. What do you obtain here?

In [None]:
data_pca = pca.fit_transform(menu_num)
data_pca
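To tie the four outputs together, the sketch below recomputes them by hand on synthetic data (the data and the variable names are assumptions for illustration). It relies only on the fitted attributes `components_`, `mean_`, `explained_variance_` and `explained_variance_ratio_`:

```python
import numpy as np
from sklearn import decomposition as dcp

rng = np.random.default_rng(1)
# Synthetic data: 150 observations of 3 features with different spreads.
X = rng.normal(size=(150, 3)) * np.array([5.0, 2.0, 0.5])

pca = dcp.PCA(n_components=3)
scores = pca.fit_transform(X)

# Scores are the centered data projected onto the loadings (rows of components_).
manual_scores = (X - pca.mean_) @ pca.components_.T

# explained_variance_ is the sample variance (ddof=1) of each score column.
score_variances = scores.var(axis=0, ddof=1)

# The ratio divides each of these by the total variance of the original features.
total_variance = X.var(axis=0, ddof=1).sum()
manual_ratio = pca.explained_variance_ / total_variance

print(np.allclose(scores, manual_scores))
print(np.allclose(score_variances, pca.explained_variance_))
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
```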

# Part 3: Considering a much larger dataset with more features

This is the dataset from above with additional features: each new feature gives the quantity of a nutrient in the item as a percentage of the recommended daily intake.

In [None]:
menu=pd.read_csv("McDonaldsMenu_morefeatures.csv")
menu.head()

1. Drop the columns "Item" and "Category" using `.drop(columns=["Item","Category"])` to obtain a purely numerical dataset.

In [None]:
menu_num=menu.drop(columns=["Item","Category"])

In [None]:
menu_num

2. As above, run PCA on the resulting dataset with `n_components=10`.

In [None]:
from sklearn import decomposition as dcp
pca=dcp.PCA(n_components=10) #number of components we keep: here we ask for 10 new variables built from the old variables
pca.fit(menu_num)

3. What are the explained variance ratios?

In [None]:
pca.explained_variance_ratio_

4. Using the code below, plot the **cumulative** explained variance ratio. One rule of thumb is to take enough components such that 50% of the variance is explained:

In [None]:
explained_variance_ratio_cumul_sum=np.cumsum(pca.explained_variance_ratio_) #compute the cumulative sum
explained_variance_ratio_cumul_sum

In [None]:
plt.title("Cumulative Explained Variance Ratio by Component")
plt.plot(np.arange(1,11),explained_variance_ratio_cumul_sum) #so that the first component is at 1, not 0
plt.plot([1,10],[0.5,0.5]) #horizontal line at the 50% threshold
plt.xlabel("Number of components")
plt.ylabel("Cumulative variance ratio")
plt.show()

5. How many components would you use based on this?

 Based on the 50% rule of thumb, we would pick 2 components here. If we take an 80% rule of thumb, we would pick 4.
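Reading the threshold off the plot can also be delegated to scikit-learn: when `n_components` is a float between 0 and 1 and `svd_solver="full"`, `PCA` keeps the smallest number of components whose cumulative explained variance ratio exceeds that value. A minimal sketch on synthetic data (the data and the name `pca_auto` are assumptions for illustration):

```python
import numpy as np
from sklearn import decomposition as dcp

rng = np.random.default_rng(2)
# Synthetic data with 10 features of decreasing scale (illustrative only).
X = rng.normal(size=(300, 10)) * np.linspace(10.0, 1.0, 10)

for threshold in (0.5, 0.8):
    # A float in (0, 1) asks for enough components to exceed that variance share.
    pca_auto = dcp.PCA(n_components=threshold, svd_solver="full").fit(X)
    print(threshold, pca_auto.n_components_,
          pca_auto.explained_variance_ratio_.sum())
```

The number of components actually kept is available afterwards as `n_components_`.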

6. Look at the explained variance using `.explained_variance_`. Is there a large drop at some point? This can also be a good rule of thumb for picking the number of components (if the data is normalized, a variance cut-off of 1 would be a more precise rule of thumb).

In [None]:
pca.explained_variance_

There is a large drop between one component and two, as well as between 5 components and 6.
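The variance cut-off of 1 mentioned in the question applies to normalized data. The sketch below shows one way this could look, assuming `StandardScaler` from scikit-learn is used for the normalization and using synthetic data in which two hidden factors drive six features (all names and values are illustrative):

```python
import numpy as np
from sklearn import decomposition as dcp
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic data: two hidden factors drive six observed features on very different scales.
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
noise = 0.3 * rng.normal(size=(500, 6))
X = (factors @ mixing + noise) * np.array([1, 10, 100, 1, 10, 100])

X_std = StandardScaler().fit_transform(X)   # each feature now has mean 0 and variance 1
pca_std = dcp.PCA().fit(X_std)

# Keep the components that carry more variance than a single standardized feature.
n_keep = int((pca_std.explained_variance_ > 1).sum())
print(pca_std.explained_variance_)
print("components with variance above 1:", n_keep)
```

Without the scaling step, the features measured on the largest scale would dominate the first components regardless of how informative they are.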

7. Use the code below to draw an "elbow plot". This allows the task above to be done more systematically: the idea of the plot is to show you directly how much you still gain by adding more components. If there is a significant kink, that gives you an idea that you want to stop adding components.

In [None]:
plt.plot(np.arange(1,11),pca.explained_variance_)
plt.xlabel("Number of components")
plt.ylabel("Explained variance")
plt.show()

8. We use 2 components moving forward (i.e., we are going to replace the 10 features with just 2). Use the code below to look at the loadings for the two new variables. Are they easy to interpret?

In [None]:
loadings=pd.DataFrame(pca.components_[0:2,:].T).set_index(np.arange(1,11))
loadings.columns = ['z1','z2']
loadings.index = menu_num.columns
loadings

9. When there are many features, it can be very difficult to see what each component represents. To resolve this issue, we can instead use `dcp.SparsePCA`, which attempts to find a compromise between staying close to the "true" PCA components and using only a few features for each component (the remaining loadings are set to exactly zero). This trade-off is controlled by a parameter `alpha` that can be tuned: larger values give sparser components. Run the code below for `alpha=0`, `alpha=5`, and `alpha=20`. Can you see what is happening?

In [None]:
pca_sparse=dcp.SparsePCA(alpha=20,n_components=2)
pca_sparse.fit(menu_num)
loadings_sparse=pd.DataFrame(pca_sparse.components_.T).set_index(np.arange(1,11))
loadings_sparse.columns = ['z1','z2']
loadings_sparse.index = menu_num.columns
loadings_sparse
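One way to summarize the effect of `alpha` is to count how many loadings are exactly zero for each value. The sketch below does this on synthetic data; the data, the `alpha` values and `random_state=0` are assumptions chosen for illustration, and the counts will differ on the menu data:

```python
import numpy as np
from sklearn import decomposition as dcp

rng = np.random.default_rng(4)
# Synthetic data: 100 observations of 6 features with different spreads.
X = rng.normal(size=(100, 6)) * np.array([6.0, 5.0, 1.0, 1.0, 0.5, 0.5])

for alpha in (0.1, 5, 20):
    sparse = dcp.SparsePCA(n_components=2, alpha=alpha, random_state=0).fit(X)
    n_zero = int((sparse.components_ == 0).sum())
    print(f"alpha={alpha}: {n_zero} of {sparse.components_.size} loadings are exactly zero")
```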

10. Finally, we match our components to the menu items. The code below generates a dataframe with four columns: Item, Category, and the first two components obtained via the "standard" PCA:

In [None]:
data_pca = pca.fit_transform(menu_num)

df=pd.DataFrame()
df["Item"]=menu["Item"]
df["Category"]=menu["Category"]
df["z1"]=data_pca[:, 0].reshape(-1)
df["z2"]=data_pca[:,1].reshape(-1)
df

11. Plot a scatter plot of z2 as a function of z1, with the hue of the points being determined by the category.

In [None]:
sns.relplot(data=df,x="z1",y="z2",hue="Category")
plt.show()

# Exercise: other applications of PCA - understanding what it means to compress an image

Please install the `pillow` library (for image reading and editing) before starting this exercise. Then restart the kernel and proceed with the exercise.

In [None]:
%conda install -c anaconda pillow

In [None]:
from PIL import Image, ImageOps

We'll be using a high-resolution photo of Nice, France, a favorite among British tourists:

Picture credit: Photo by <a href="https://unsplash.com/@florielaure?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Caudron Florie Laure</a> on <a href="https://unsplash.com/s/photos/france-nice?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>

In [None]:
img = Image.open('Nice.jpg')

1. Take a look at the image by running the code below --- isn't Nice gorgeous??

In [None]:
img

If you know a little about photography, you will know that an image is made up of *pixels*, and that each pixel is stored as a color triple (R,G,B) giving the amounts of red, green, and blue in the pixel. Three numbers per pixel is more than we need for this example, so we convert the image to grayscale, which leaves only one number per pixel (its gray level). To do this, we use `ImageOps.grayscale`.

In [None]:
img_gray=ImageOps.grayscale(img)

In [None]:
img_gray

2. We can thus represent the image as an array of numbers: the array has one entry per pixel, and each entry is an integer from 0 to 255, with 0 being black and 255 being white. Run the code below to obtain this array. What is the shape of the array? Look at the properties of the picture: what do these numbers correspond to? How many numbers does the computer have to store to store this image?

In [None]:
array_gray=np.array(img_gray)

In [None]:
array_gray.shape

In [None]:
array_gray.size

This corresponds to the size of the picture (3778 pixels by 3024). The computer has to store 3778 * 3024 numbers, i.e., 11,424,672 numbers.
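Two conventions are easy to mix up here: Pillow reports an image size as (width, height), whereas the NumPy array has shape (rows, columns), i.e. (height, width). The sketch below checks this, along with the gray-level scale, on tiny synthetic images, so no file is needed (the image sizes are arbitrary):

```python
import numpy as np
from PIL import Image, ImageOps

# Tiny synthetic RGB images: 4 pixels wide, 3 pixels tall.
white = Image.new("RGB", (4, 3), (255, 255, 255))
black = Image.new("RGB", (4, 3), (0, 0, 0))

white_gray = np.array(ImageOps.grayscale(white))
black_gray = np.array(ImageOps.grayscale(black))

print(white.size)                          # Pillow: (width, height)
print(white_gray.shape)                    # NumPy: (rows, columns) = (height, width)
print(white_gray.min(), black_gray.max())  # gray level of white and of black
```

For PCA, each row of the array plays the role of an observation and each column the role of a feature.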

3. Our goal is to apply PCA to the picture to understand how image compression works. Fit the PCA transform with no limits on the number of components to the data `array_gray` (this may take a couple of seconds to compute).

In [None]:
pca=dcp.PCA()
pca.fit(array_gray)

4. What is the size of the array returned by `pca.explained_variance_ratio_`? If we were to view `array_gray` as a dataframe, how many observations would there be? How many features? Does the size of the array then make sense?

In [None]:
pca.explained_variance_ratio_.size

5. Build an array containing the cumulative sum of the explained variance ratio. Plot this array as a function of the number of components. What do you observe?

In [None]:
explained_variance_ratio_cumul_sum=np.cumsum(pca.explained_variance_ratio_) #compute the cumulative sum
plt.title("Cumulative Explained Variance Ratio by Component")
plt.plot(np.arange(1,3025),explained_variance_ratio_cumul_sum) #so that the first component is at 1, not 0
plt.xlabel("Number of components")
plt.ylabel("Cumulative variance ratio")
plt.show()

6. If you wanted to explain 95% of the variance, how many components should you keep? What if you wanted to keep 99% of the variance? Use `np.where`.

In [None]:
index95=np.min(np.where(explained_variance_ratio_cumul_sum>=0.95))

In [None]:
index95

In [None]:
index99=np.min(np.where(explained_variance_ratio_cumul_sum>=0.99))

In [None]:
index99

We would keep the first 105 components for 95% and 434 for 99%.
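Note that `np.where` returns zero-based positions, so it is worth checking whether a value such as `index95` is a position in the array or a number of components: the count is the position plus one. A small sketch with made-up cumulative ratios (the values in `cumulative` are invented for illustration):

```python
import numpy as np

# Made-up cumulative explained variance ratios for five components.
cumulative = np.array([0.60, 0.85, 0.96, 0.99, 1.00])

first_index_95 = int(np.min(np.where(cumulative >= 0.95)))  # zero-based position
n_components_95 = first_index_95 + 1                        # number of components to keep

# np.argmax on the boolean mask gives the same position.
same_index = int(np.argmax(cumulative >= 0.95))
print(first_index_95, n_components_95, same_index)
```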

7. Rerun PCA with a number of components equal to 105. Obtain the array corresponding to the scores and set it equal to `scores`. What is the shape of `scores`?

In [None]:
pca=dcp.PCA(n_components=105)
scores=pca.fit_transform(array_gray)
scores.shape

8. We know from lecture that scores contains 105 new features (with each feature containing 3778 observations). For one given observation (i.e., if we look at a row of `scores`), we have for example that $z_1$, the first new feature (i.e., the first column of `scores`), is equal to a linear combination of the old features (which have been centered):
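$$z_1=a_{11}\,(x_1-\mathrm{mean}_{\mathrm{feature}\ 1})+\dots+a_{1n}\,(x_n-\mathrm{mean}_{\mathrm{feature}\ n})$$
It is possible to go in the other direction as well: rewrite all these equalities in such a way that $x_1,...,x_n$ are written as a linear combination of the new features $z_1,...,z_n$ and the loadings $a_{11},...,a_{1n},...., a_{nn}$. When only some of the new features are kept, as is the case here, this gives an approximation of the original values rather than the exact values.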
This is what the `.inverse_transform` function does when applied to `scores`. Give it a try and let `new_image` be the result. What is the shape of `new_image`?

In [None]:
new_image=pca.inverse_transform(scores)

In [None]:
new_image.shape
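Under the hood, and assuming the default `whiten=False`, `inverse_transform` multiplies the scores by the loadings and adds the feature means back. The sketch below checks this on a small synthetic array standing in for the image (the array and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn import decomposition as dcp

rng = np.random.default_rng(5)
# Synthetic stand-in for the image: 50 rows (observations), 8 columns (features).
X = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))

pca_small = dcp.PCA(n_components=3)
scores_small = pca_small.fit_transform(X)

reconstructed = pca_small.inverse_transform(scores_small)
manual = scores_small @ pca_small.components_ + pca_small.mean_

# With all components kept, the reconstruction is exact up to rounding.
pca_full = dcp.PCA()
exact = pca_full.inverse_transform(pca_full.fit_transform(X))

print(reconstructed.shape, np.allclose(reconstructed, manual))
print(np.allclose(exact, X), np.allclose(reconstructed, X))
```

The reconstruction has the same shape as the original array whatever the number of components; only its accuracy changes.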

9. To obtain this array (which has exactly the same size as the original image), we only need to know the scores, the loadings, and the means. How many numbers does this constitute? Compare this to the number of numbers we had to retain for `array_gray`.

In [None]:
pca.components_.size+scores.size+3024

In [None]:
array_gray.size

In [None]:
717234/(11424672)

We are only retaining 6% of the total number of numbers!
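The count above is the sum of three pieces: the scores (one row per image row, one column per component), the loadings (one row per component, one column per image column), and the column means. The sketch below wraps this in a small helper; `stored_numbers` is a hypothetical name introduced here, not a library function:

```python
def stored_numbers(n_rows, n_cols, n_components):
    """Count the values needed to rebuild the image: scores, loadings, and column means."""
    scores_size = n_rows * n_components
    loadings_size = n_components * n_cols
    means_size = n_cols
    return scores_size + loadings_size + means_size

n_rows, n_cols = 3778, 3024   # shape of array_gray reported above
for k in (105, 434):
    kept = stored_numbers(n_rows, n_cols, k)
    print(k, kept, round(kept / (n_rows * n_cols), 3))
# prints:
# 105 717234 0.063
# 434 2955092 0.259
```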

10. We now plot `new_image`: this requires a couple of manipulations. Indeed, a quick check shows us that `new_image` contains numbers that are not integers between 0 and 255. We need to rescale the matrix to obtain only numbers between 0 and 255. Run the following code and look at the image. Can you tell the difference with the first image?

In [None]:
print(new_image.max())
print(new_image.min())

In [None]:
new_image=(new_image-new_image.min())/(new_image.max()-new_image.min())*255

In [None]:
new_image= Image.fromarray(new_image.astype("uint8"))

In [None]:
new_image
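The min-max rescaling used above stretches all values so that the extremes land on 0 and 255, which can slightly change the brightness and contrast of the picture. An alternative, shown here only as a sketch and not as the approach used in this exercise, is to round and clip the values, which leaves in-range gray levels unchanged. The array `approx` below is made up for illustration:

```python
import numpy as np
from PIL import Image

# Made-up reconstruction with a few values outside [0, 255].
approx = np.array([[-12.4, 0.0, 100.6],
                   [255.0, 270.3, 128.2]])

# Round to the nearest integer, then force values into the valid range.
clipped = np.clip(np.rint(approx), 0, 255).astype("uint8")
image = Image.fromarray(clipped)   # 2-D uint8 array -> grayscale ("L") image
print(clipped)
```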

11. If you are not satisfied with the quality of the picture, rerun the same process with a number of components equal to 434. How many numbers are you storing here?

In [None]:
pca=dcp.PCA(n_components=434)
scores1=pca.fit_transform(array_gray)

In [None]:
new_image1=pca.inverse_transform(scores1)

In [None]:
pca.components_.size+scores1.size+3024 #use scores1 (434 components), not scores (105 components)

In [None]:
2955092/(11424672)

In [None]:
new_image1=(new_image1-new_image1.min())/(new_image1.max()-new_image1.min())*255

In [None]:
new_image1= Image.fromarray(new_image1.astype("uint8"))

In [None]:
new_image1

12. Do you understand why this is called image compression? Explain why companies such as Facebook or Instagram would find image compression very valuable.

Image compression lets us store images on a server using far less space (here, about one quarter of the original numbers with 434 components, or about 6% with 105 components, depending on the quality of the photo we want to store). When we keep, e.g., 99% of the variance explained, this comes with little or no visible change to the stored photograph.