In [None]:
from plotnine import *
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split # simple TT split cv
import numpy as np
import seaborn as sb

# 0. Together

Principal component analysis (PCA) is a method of *dimensionality reduction* that takes the information in multiple variables/predictors and presents that information (or at least MOST of it) in a smaller number of features. These features, called components, are all linear combinations of the original variables, and they are created in a way (eigendecomposition, if you're interested!) that makes the 1st component contain the MOST variability in the data, the 2nd component the second most, and so on.

This allows us to choose only a handful of features (usually the first N components) that contain *most* of the information from the original data. This is helpful because fewer features often means faster models.
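
If you're curious about the eigendecomposition part, here is a minimal sketch on made-up data (the array `X` and the noise level are assumptions for illustration, not one of the class datasets). It eigendecomposes the covariance matrix of the z scored data by hand and checks the result against sklearn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# made-up data: 200 rows, 3 columns that all share one underlying signal
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(3)])

# z score, then eigendecompose the covariance matrix
Z = StandardScaler().fit_transform(X)
cov = np.cov(Z, rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# each component is a linear combination of the z scored columns
scores = Z @ eig_vecs

# share of the total variance carried by each component
manual_ratio = eig_vals / eig_vals.sum()

# sklearn's PCA should report the same shares
pca = PCA().fit(Z)
print(manual_ratio)
print(pca.explained_variance_ratio_)
```

The two printed arrays should match, and the variance of each column of `scores` equals the corresponding eigenvalue, which is why sorting by eigenvalue sorts the components by how much variability they hold.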

We discussed 2 ways to choose the number of components that you retain:


* a) **The Elbow Method**: create a scree plot, and find the "elbow" of the graph. Retain all the components at and before the elbow.
* b) **The Percentage Method**: specify a specific % of variance that's acceptable to retain (e.g. 95%), and retain enough components to achieve that %.
    
    
    
<img src="https://drive.google.com/uc?export=view&id=1crCW8BAFVEu50th9VhdJMYakZho03kp0" width = 500px />
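
The Percentage Method above can be sketched in a few lines. This is only an illustration on made-up data (the `latent`/`weights` construction and the 95% threshold are assumptions); the Elbow Method is a visual judgment call, so it is not coded here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# made-up data: 11 columns driven by 3 hidden factors plus noise
latent = rng.normal(size=(300, 3))
weights = rng.normal(size=(3, 11))
X = latent @ weights + 0.5 * rng.normal(size=(300, 11))

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)
cum_var = pca.explained_variance_ratio_.cumsum()

# smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95
n_keep = int(np.argmax(cum_var >= threshold)) + 1

# sklearn applies the same rule when n_components is a float between 0 and 1
pca_95 = PCA(n_components=threshold).fit(Z)

print(n_keep, pca_95.n_components_)
```

Both numbers should agree: `n_keep` comes from the cumulative sum by hand, and `pca_95.n_components_` is what sklearn keeps when given the variance target directly.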

# PCA Visually
<img src="https://drive.google.com/uc?export=view&id=1wBXkUp2MNZSSrnXhHsqNYmdYhC9Bs6x1" width = 500px />
<img src="https://drive.google.com/uc?export=view&id=1ZHMOB5oDdogQgBBqTTrKfEnexXjPwEWS" width = 500px />


# 1. Using PCs as predictors

Using `data`, build a logistic regression model using *all* the variables (V1-V11; make sure you z score) that predicts the variable `Outcome`. Then run PCA on the predictors (V1-V11). Now build four more models that use the first 1, 3, 5, and 10 components as predictors.

* How much variance do the first 1, 3, 5, and 10 components cover?
* How does each model do on train/test accuracy? How do they do compared to the full model?
* What patterns do you observe, and why do you think those patterns exist?
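
One possible shape for this workflow, sketched on made-up data rather than on `data` (the `make_classification` settings, the 80/20 split, and all variable names here are assumptions). The scaler and the PCA are fit on the training rows only and then applied to the test rows.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# made-up stand-in for `data`: 11 predictors and a binary outcome
X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           n_redundant=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# z score using the training rows, then apply the same scaling to the test rows
scaler = StandardScaler().fit(X_train)
Z_train, Z_test = scaler.transform(X_train), scaler.transform(X_test)

# fit PCA on the training rows, then project both sets onto the components
pca = PCA().fit(Z_train)
pc_train, pc_test = pca.transform(Z_train), pca.transform(Z_test)
cum_var = pca.explained_variance_ratio_.cumsum()

rows = []
for k in [1, 3, 5, 10]:
    # model that only sees the first k components
    model = LogisticRegression().fit(pc_train[:, :k], y_train)
    rows.append({"model": f"first {k} PCs",
                 "cum_var": cum_var[k - 1],
                 "train_acc": model.score(pc_train[:, :k], y_train),
                 "test_acc": model.score(pc_test[:, :k], y_test)})

# full model on all 11 z scored predictors, for comparison
full = LogisticRegression().fit(Z_train, y_train)
rows.append({"model": "all predictors",
             "cum_var": 1.0,
             "train_acc": full.score(Z_train, y_train),
             "test_acc": full.score(Z_test, y_test)})

results = pd.DataFrame(rows)
print(results)
```

The numbers you get on `data` will differ from this made-up example, so treat the table as a template for the comparison rather than as the answer.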

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pcaLogit.csv")

data.head()
### YOUR CODE HERE ###

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,Outcome
0,-0.149337,0.442119,0.576677,1.153935,2.200429,-3.263321,-1.168562,-0.061753,-1.826888,0.152121,-0.780083,1
1,0.814597,-0.071904,2.436344,1.441524,-0.262304,1.088893,0.839062,0.783944,0.398013,-0.243652,2.141202,0
2,0.367023,0.65514,-0.431427,-1.283513,-1.331135,-0.422708,0.399706,-0.376874,0.751164,1.427323,-1.289143,1
3,0.032182,-1.268488,-0.327227,1.398891,-0.871737,0.211705,0.239753,3.127677,-1.364123,-2.663003,0.437166,0
4,-1.09272,-0.093706,-0.744864,0.891042,-0.212384,0.086259,0.382121,1.322998,-0.629167,-1.198671,0.699633,0


# 2. PCA with different variable correlations

You can grab all pairwise correlations between the variables/features in your dataframe with [`df.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html), where `df` is your dataframe, and visualize them with `sb.heatmap(df.corr(), cmap="Blues", annot=True)`.
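
As a small illustration of what `df.corr()` returns, here is a sketch on a made-up dataframe (the columns `A`, `B`, and `C` are assumptions: `A` and `B` share a signal, `C` is independent noise). It also computes one rough summary number, the average absolute off-diagonal correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
shared = rng.normal(size=500)

# made-up dataframe: A and B share a signal, C is unrelated
df = pd.DataFrame({
    "A": shared + 0.2 * rng.normal(size=500),
    "B": shared + 0.2 * rng.normal(size=500),
    "C": rng.normal(size=500),
})

# square table of pairwise Pearson correlations (1s on the diagonal)
corr = df.corr()

# average absolute correlation, ignoring the diagonal
off_diag = corr.values[~np.eye(len(corr), dtype=bool)]
avg_abs_corr = np.abs(off_diag).mean()

print(corr.round(2))
print(round(avg_abs_corr, 2))
```

The `corr` table is exactly what gets passed to `sb.heatmap`, so the heatmap is just a colored version of this printout.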

For the following datasets:

1. Z score the data.
2. Look at the correlations between all the variables in the dataframe. Are they high? low?
3. Perform PCA
4. Make a scree plot (be sure to add `+ ylim(0,1)`). What do you notice about the patterns in the scree plot? How do those relate to the correlations you saw?
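
To get a feel for steps 1-4 before running them on the real files, here is a hedged sketch on two made-up datasets: one with independent columns and one where every column shares a single signal. The helper `scree_table` and both datasets are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n, p = 500, 11

def scree_table(X):
    """Z score X, run PCA, and return the values needed for a scree plot."""
    Z = StandardScaler().fit_transform(X)
    pca = PCA().fit(Z)
    return pd.DataFrame({"pc": range(1, X.shape[1] + 1),
                         "expl_var": pca.explained_variance_ratio_,
                         "cum_var": pca.explained_variance_ratio_.cumsum()})

# made-up data 1: columns are independent of each other
independent = rng.normal(size=(n, p))

# made-up data 2: every column is one shared signal plus a little noise
shared = rng.normal(size=(n, 1))
correlated = shared + 0.3 * rng.normal(size=(n, p))

scree_ind = scree_table(independent)
scree_cor = scree_table(correlated)

print(scree_ind["expl_var"].round(2).tolist())
print(scree_cor["expl_var"].round(2).tolist())
```

Either table can be plotted with plotnine, for example `ggplot(scree_cor, aes(x="pc", y="expl_var")) + geom_line() + geom_point() + ylim(0,1)`. On this made-up data the independent columns spread the variance fairly evenly across components, while the highly correlated columns put most of it in the first component; check whether the class datasets follow the same pattern.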

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />


TIP: to make the scree plot more clear, it can help to add the row [0,0,0] to your dataframe of the explained/cumulative variance. You can do this with:

In [None]:
# THIS CODE WON'T RUN, it's just an example of how to do this
pcaDF = pd.DataFrame({"expl_var" : pca.explained_variance_ratio_, "pc": range(1,12), "cum_var": pca.explained_variance_ratio_.cumsum()})

# add zeros (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pcaDF = pd.concat([pcaDF, pd.DataFrame({"expl_var" : [0], "pc": [0], "cum_var": [0]})], ignore_index=True)

In [None]:
d1 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca0.csv")
d1.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,-0.589325,-0.962855,-0.409353,-0.481657,0.39122,-0.261497,0.845311,-0.630942,0.105109,-1.114231,0.819604
1,-0.585161,0.30328,-0.754514,1.478001,-0.230236,-1.620006,2.303447,-2.430098,-0.908377,0.547499,0.352563
2,-0.040111,0.726222,-0.927827,-0.988774,-1.140503,-0.63683,1.886706,-1.167584,0.298881,1.509339,0.769915
3,-1.552609,-0.006793,0.680347,-0.758141,-1.189817,-0.409761,-1.161692,-1.803187,-0.201474,-0.239718,-1.85761
4,-0.238883,0.16102,0.452826,-1.205539,-0.264308,0.587101,-0.335472,-2.392979,-0.826877,1.14807,-0.900304


In [None]:
d2 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca5.csv")
d2.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0.039554,0.566799,1.46038,0.551369,1.353559,1.030651,0.140741,0.810873,1.343591,2.26832,-0.887089
1,-1.209627,0.350666,-0.340243,-0.390595,-2.466304,-1.346314,-0.926472,-0.280642,-0.640843,0.055363,-1.034537
2,-0.12936,-0.157916,-0.72989,-0.681124,0.528222,0.643489,1.002502,0.505844,-0.286316,-0.342354,0.815404
3,0.83215,-0.276044,-1.001197,-0.831879,0.625231,0.364713,0.444775,-0.151327,0.841734,0.311781,-0.129617
4,-0.773324,0.423212,0.26214,2.088551,0.291388,0.590524,-0.120924,1.045262,0.812294,-0.706862,0.419602


In [None]:
d3 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca9.csv")
d3.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,1.243909,1.182441,1.245221,1.283435,1.379725,1.231338,1.230737,1.102804,1.146487,1.1768,1.208353
1,-0.150383,-0.33787,-0.380147,-0.245898,-0.323991,-0.337898,-0.457001,-0.362397,-0.330366,-0.460161,-0.288632
2,-0.384532,-0.173443,-0.25187,-0.373778,-0.232588,-0.341743,-0.252138,-0.069091,-0.216869,-0.356769,-0.379533
3,1.956517,1.799477,1.918521,1.741209,1.873199,1.929961,2.025591,1.901917,1.86195,1.689937,1.870567
4,-0.391128,-0.239054,-0.34174,-0.204021,-0.196644,-0.244263,-0.062199,-0.24713,-0.303675,-0.200597,-0.239926


In [None]:
d4 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca10.csv")
d4.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,-1.247081,-1.349485,-0.971545,-0.209888,-1.744218,-1.143955,-0.122581,-0.557669,-0.076955,-2.072527,-0.246774
1,0.837563,1.292048,-0.900321,0.313482,0.076978,0.071709,-1.077348,-1.180627,-0.079229,0.594995,0.43794
2,-0.593868,-2.244652,-0.672393,-0.90671,1.01561,0.713235,1.135665,0.926989,-0.801986,-0.512415,-1.586492
3,0.76407,1.93597,-1.345882,-0.154152,-0.329562,-0.077087,0.462192,1.834738,-1.61359,2.502433,-1.652905
4,0.008114,-1.599102,-1.025038,1.723897,0.884198,-0.065314,-2.177886,-0.10233,0.497632,-1.015951,1.868559


In [None]:
d5 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca11.csv")
d5.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
0,0.469553,-1.492115,-0.667141,-0.348074,1.875072,-0.273789,-0.249197,-1.510814,0.646058,1.18223,0.879918
1,-0.33036,-1.395627,1.051138,1.013503,-0.325805,1.598626,-0.210261,-0.988165,1.070902,1.040934,0.821657
2,0.997758,0.825768,-1.03369,-0.22855,0.120673,-1.724621,1.288422,-0.489717,1.85678,-1.413949,0.240121
3,0.239127,1.292628,-0.388788,1.50939,-1.637108,-0.969115,1.69837,1.869058,0.305677,-2.424403,-2.378896
4,-0.984851,-1.142951,-0.221873,-0.282117,1.247947,-1.381092,-0.398786,-0.40897,1.157696,-1.154791,-0.257376


In [None]:
d6 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/pca12.csv")
d6.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10
0,-1.199573,-0.88642,-0.767398,-0.749321,-0.550645,-1.098962,-2.521482,-1.370433,-1.656022,-0.626147
1,-0.517583,-1.619886,-1.132602,-0.843867,-1.259561,-2.952811,-1.709159,-1.772834,-1.631508,-2.737225
2,0.662478,1.893808,1.256534,1.034133,0.868403,0.503153,0.607494,0.365996,1.209286,1.229888
3,2.000224,1.720538,1.872528,1.082032,1.675314,-0.470523,-0.482561,-0.23563,-0.984606,-0.984836
4,-0.188601,0.433413,0.346252,-0.131676,-0.18761,-0.307936,0.11065,1.344622,0.871384,1.465256


# PCA and clustering

Using the McDonald's Nutritional Data found here: https://github.com/cmparlettpelleriti/CPSC392ParlettPelleriti/blob/master/Data/McMenu.csv (see [Kaggle](https://www.kaggle.com/mcdonalds/nutrition-facts/version/1) for more info), use k-means to cluster the foods using all the variables except category, item, and serving size (make sure to z score variables first!).

Next, perform PCA on all the variables except category, item, and serving size. Calculate and grab the first 2 PCs, and add them to your dataframe.

Normally (like in your project) when we have more than 2 variables, we have to make MULTIPLE plots. But, one other option is to plot the first 2 PCs, and then color by cluster. Even though we are losing *some* information, we're still roughly able to see how cohesive/dense/separate our clusters are! Try making a scatterplot using the first 2 PCs, then coloring by cluster. What can you tell about your clusters?
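
Here is a minimal sketch of the whole idea on made-up data (the `make_blobs` data, the column names `x1`-`x6`, and `k = 3` are assumptions; the McDonald's data will need the same steps applied to its own columns).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# made-up stand-in data: 300 rows, 6 numeric columns, 3 underlying groups
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)
cols = [f"x{i}" for i in range(1, 7)]
demo = pd.DataFrame(X, columns=cols)

# z score the variables
Z = StandardScaler().fit_transform(demo[cols])

# k-means with k = 3, fit on the z scored data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
demo["cluster"] = km.labels_.astype(str)  # strings so plots treat clusters as categories

# PCA on the same z scored data, keeping the first 2 components
pcs = PCA(n_components=2).fit_transform(Z)
demo["pc1"] = pcs[:, 0]
demo["pc2"] = pcs[:, 1]

# silhouette score uses ALL the variables, not just the 2 PCs
sil = silhouette_score(Z, km.labels_)
print(sil)
print(demo[["pc1", "pc2", "cluster"]].head())
```

From here, a plotnine scatterplot such as `ggplot(demo, aes(x="pc1", y="pc2", color="cluster")) + geom_point()` shows the clusters in 2 dimensions. Keep in mind that the clusters were found using every variable, so groups that overlap in the PC plot may still be separate along directions the first 2 PCs don't show.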

In [None]:
from sklearn.preprocessing import StandardScaler #Z-score variables

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

from sklearn.metrics import silhouette_score
d = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/McMenu.csv")
d.head()

Unnamed: 0,Category,Item,Serving Size,Calories,Calories from Fat,Total Fat,Total Fat (% Daily Value),Saturated Fat,Saturated Fat (% Daily Value),Trans Fat,...,Carbohydrates,Carbohydrates (% Daily Value),Dietary Fiber,Dietary Fiber (% Daily Value),Sugars,Protein,Vitamin A (% Daily Value),Vitamin C (% Daily Value),Calcium (% Daily Value),Iron (% Daily Value)
0,Breakfast,Egg McMuffin,4.8 oz (136 g),300,120,13.0,20,5.0,25,0.0,...,31,10,4,17,3,17,10,0,25,15
1,Breakfast,Egg White Delight,4.8 oz (135 g),250,70,8.0,12,3.0,15,0.0,...,30,10,4,17,3,18,6,0,25,8
2,Breakfast,Sausage McMuffin,3.9 oz (111 g),370,200,23.0,35,8.0,42,0.0,...,29,10,4,17,2,14,8,0,25,10
3,Breakfast,Sausage McMuffin with Egg,5.7 oz (161 g),450,250,28.0,43,10.0,52,0.0,...,30,10,4,17,2,21,15,0,30,15
4,Breakfast,Sausage McMuffin with Egg Whites,5.7 oz (161 g),400,210,23.0,35,8.0,42,0.0,...,30,10,4,17,2,21,6,0,25,10


In [None]:
### YOUR CODE HERE ####
# grab columns we want to use
names = [n for n in d.columns if n not in ["Category", "Item", "Serving Size"]]

# z score data

# fit km model with k = 3


# grab cluster assignments


In [None]:
# fit PCA model
pca = ???

# grab the first 2 components and add them to d
# (assumes d[names] holds the z scored values that pca was fit on)
components = pca.transform(d[names])
d["pc1"] = components[:, 0]
d["pc2"] = components[:, 1]

In [None]:
# plot



### /YOUR CODE HERE ####