<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# PCA Lab: Speed Dating

_Authors: Kiefer Katovich (SF)_

---

Let's practice principal component analysis (PCA) using a data set from [Kaggle](https://www.kaggle.com/). 
- PCA is often used to simplify data, reduce noise, and find unmeasured latent variables, so let's take a look at an example of each and try to understand what's going on.

### Learning Objectives
- Explore how PCA relates to correlation.
- Use PCA to perform dimensionality reduction.
- Predict whether or not a speed dater likes reading based on the dater's other likes.

### Data Set
The data set we're using for this lab is a subset of this [much more detailed speed dating data set](https://www.kaggle.com/annavictoria/speed-dating-experiment). 
- In particular, this contains no information on the actual speed dating itself (i.e., successes with or opinions of other individuals).

It also contains no follow-up information where individuals are asked the same questions about themselves again. It only contains information about what an individual enjoys doing, their self-ratings on how desirable they are, and how they think others rate them based on desirability.

The columns present in the data are outlined below:

FieldName|Description
---------|-----------
    subject_id                   |   Unique individual identifier
    wave                         |   Meetup ID
    like_sports                  |   Enjoyment of participating in sports
    like_tvsports                |   Enjoyment of watching sports on TV
    like_exercise                |   Enjoyment of exercise
    like_food                    |   Enjoyment of food
    like_museums                 |   Enjoyment of museums
    like_art                     |   Enjoyment of art
    like_hiking                  |   Enjoyment of hiking
    like_gaming                  |   Enjoyment of playing games
    like_clubbing                |   Enjoyment of going clubbing/partying
    like_reading                 |   Enjoyment of reading
    like_tv                      |   Enjoyment of TV in general
    like_theater                 |   Enjoyment of the theater (plays, musicals, etc.)
    like_movies                  |   Enjoyment of movies
    like_concerts                |   Enjoyment of concerts
    like_music                   |   Enjoyment of music
    like_shopping                |   Enjoyment of shopping
    like_yoga                    |   Enjoyment of yoga
    subjective_attractiveness    |   How attractive they rate themselves
    subjective_sincerity         |   How sincere they rate themselves
    subjective_intelligence      |   How intelligent they rate themselves
    subjective_fun               |   How fun they rate themselves
    subjective_ambition          |   How ambitious they rate themselves
    objective_attractiveness     |   Perceived rating others would give them on how attractive they are
    objective_sincerity          |   Perceived rating others would give them on how sincere they are
    objective_intelligence       |   Perceived rating others would give them on how intelligent they are
    objective_fun                |   Perceived rating others would give them on how fun they are
    objective_ambition           |   Perceived rating others would give them on how ambitious they are
    
There are 551 subjects in total.

In [None]:
# load packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

sns.set_style("whitegrid")

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---

### 2) Load and clean the speed dating data

- First, remove columns with more than 200 missing values.
- Then, remove rows with missing values.
- Verify that no rows contain NaNs.
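The cleaning steps listed above can also be done programmatically instead of naming the columns by hand. The following is a minimal sketch on a made-up toy frame; the data, the `toy` name, and the missing-value pattern are assumptions for illustration, not the real speed dating file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the speed dating data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "like_art": rng.integers(1, 11, size=500).astype(float),
    "like_yoga": rng.integers(1, 11, size=500).astype(float),
    "objective_fun": rng.integers(1, 11, size=500).astype(float),
})
toy.loc[:249, "objective_fun"] = np.nan   # 250 missing values (label slicing is inclusive)
toy.loc[[3, 7], "like_art"] = np.nan      # 2 missing values

# Step 1: drop columns with more than 200 missing values
missing_counts = toy.isnull().sum()
cols_to_drop = missing_counts[missing_counts > 200].index.tolist()

# Step 2: drop the remaining rows that contain any missing value
cleaned = toy.drop(columns=cols_to_drop).dropna()

# Step 3: verify that no NaNs remain
print(cols_to_drop)
print(cleaned.isnull().sum().sum())
```

Computing the column list from `isnull().sum()` keeps the threshold visible in the code, which helps if the threshold or the data changes later.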

In [None]:
sd = pd.read_csv('../datasets/speed_dating.csv')

In [None]:
sd.columns

In [None]:
sd.head(3)

In [None]:
sd.info()

In [None]:
sd.isnull().sum()

In [None]:
sd.drop(['objective_attractiveness','objective_sincerity',
         'objective_intelligence','objective_fun','objective_ambition'],
        axis=1, inplace=True)

In [None]:
sd.dropna(inplace=True)

In [None]:
sd.info()

---

### 3) Example: Are the `subjective` columns correlated?

Here, we'll explore how the `subjective` columns are correlated.

- Find the z scores of each `subjective` column.
- Visualize correlation using PairGrid.
- Visualize correlation using a heat map.

#### 3.A) Find the z scores of each column. This allows the columns to more easily be directly compared.

In [None]:
subjective_cols = [col for col in sd.columns if col.startswith('subjective')]
print(subjective_cols)
subjective = sd[subjective_cols]
subjective = (subjective - subjective.mean()) / subjective.std() # transform to z-score
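As a sanity check on the standardization formula used above, here is a small sketch on made-up ratings (the column names and values are hypothetical). It also illustrates that z-scoring leaves the correlations between columns unchanged.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical ratings on a 1-10 scale; column names are illustrative only
toy = pd.DataFrame({
    "rating_a": rng.integers(1, 11, size=300).astype(float),
    "rating_b": rng.integers(1, 11, size=300).astype(float),
})

# Same formula as in the lab; pandas .std() defaults to the sample
# standard deviation (ddof=1)
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z.mean().round(6))   # each column mean is approximately 0
print(toy_z.std().round(6))    # each column standard deviation is 1

# Standardizing rescales each column but does not change correlations
corr_before = toy.corr()
corr_after = toy_z.corr()
print(np.allclose(corr_before, corr_after))
```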

#### 3.B) Use a PairGrid to visualize correlation.

In [None]:
g = sns.PairGrid(subjective)
g = g.map_lower(sns.regplot)    # regression plots in lower triangle
g = g.map_upper(sns.kdeplot, cmap="Blues", fill=True)  # filled KDE plots in upper triangle (seaborn >= 0.11; older versions use shade=True)
g = g.map_diag(plt.hist)        # histograms along diagonal

plt.show();

#### 3.C) Use a heat map to visualize correlation.

In [None]:
subj_corr = subjective.corr()      # correlation DataFrame — very useful!

In [None]:
# Generate a mask for the upper triangle (taken from Seaborn example gallery)
mask = np.zeros_like(subj_corr, dtype=bool)     # built-in bool; the np.bool alias was removed from NumPy
mask[np.triu_indices_from(mask)] = True         # triu: TRIangle upper

fig, ax = plt.subplots(figsize=(8,7))

# Plot the heat map with Seaborn
# Assign the Matplotlib axis the function returns. This will let us resize the labels
ax = sns.heatmap(subj_corr, mask=mask)
ax.set_ylim(subj_corr.shape[0],0)

# Resize the labels
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14, rotation=0)

# If you put plt.show() at the bottom, it prevents useless printouts from Matplotlib
plt.show()
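Because a correlation matrix is symmetric, the heat map mask hides the diagonal and the upper triangle so each pair appears only once. A minimal sketch of what such a mask looks like, using a made-up 3x3 correlation matrix:

```python
import numpy as np

# Made-up symmetric correlation values, for illustration only
corr = np.array([[1.0, 0.3, 0.1],
                 [0.3, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])

# True marks the cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The cells that stay visible are the strictly lower triangle
print(corr[~mask])
```

This builds the same kind of mask as the `np.triu_indices_from` approach, in one step.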

In [None]:
# Understand how this visualization can be "seen" just by looking at the correlation scores

subj_corr

---

**Important: Did you ensure the results make sense intuitively?** If not, look at the results again. You should **always** interpret your results and ensure they make sense based on what you expected. If they don’t, investigate why — often your analysis or data are wrong.

> For example, the results show that believing you are attractive and believing you are fun are correlated. Would you expect believing you are intelligent and believing you are fun to have a higher or lower correlation? What do the results say?

---


### 4) Visualize some preference columns.

Next, we’ll explore how some preference ratings are correlated. You saw an example — now try it on the `preference_cols` below.

- Find the z scores of each column in `preference_cols` ([example](https://stackoverflow.com/a/41713622/6293191)).
- Visualize correlation using PairGrid.
- Visualize correlation using a heat map.
- Do these results make sense intuitively? 

In [None]:
preference_cols = ['like_tvsports', 'like_sports', 'like_museums', 'like_theater', 'like_shopping']
sd_like = sd[preference_cols]

#### 4.A) Find the z scores of each column in `preference_cols`.

In [None]:
sd_like = (sd_like - sd_like.mean()) / sd_like.std()

#### 4.B) Visualize correlation using PairGrid.

In [None]:
g = sns.PairGrid(sd_like)
g = g.map_lower(sns.regplot)
g = g.map_upper(sns.kdeplot, cmap="Blues", fill=True)  # seaborn >= 0.11; older versions use shade=True
g = g.map_diag(plt.hist)

plt.show();

#### 4.C) Visualize correlation using a heat map.

In [None]:
pref_corr = sd_like.corr()

mask = np.zeros_like(pref_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(8,7))

ax = sns.heatmap(pref_corr, mask=mask)
ax.set_ylim(pref_corr.shape[0],0)

ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=14, rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=14, rotation=0)

plt.show();

---

### 5) Example: Fit PCA on the `subjective` ratings.

In [None]:
# instantiate PCA model (specify number of components)
subjective_pca = PCA(n_components=5)

# fit PCA model ('learn' the data)
subjective_pca.fit(subjective.values)

---

#### 5.A) Look at principal component weighting vectors (eigenvectors).

The principal components, or eigenvectors, can be thought of as weightings on the original variables to transform them into the new feature space.

In [None]:
subj_components = subjective_pca.components_

In [None]:
print(subjective_cols, '\n')
print('-------------------------------------\n')

for i, pc in enumerate(['PC1','PC2','PC3','PC4','PC5']):
    print(pc, 'weighting vector:', subj_components[i])
    print( '-------------------------------------\n')
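To make the link between weighting vectors and eigenvectors concrete, the sketch below computes the eigenvectors of a covariance matrix directly with NumPy on made-up data. It is an illustration under stated assumptions, not scikit-learn's implementation: scikit-learn arrives at the same directions through a singular value decomposition, and individual vectors may differ in sign.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data: 200 samples, 3 features sharing a common signal
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.5 * rng.normal(size=(200, 1)) for _ in range(3)])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score, as in the lab

cov = np.cov(X, rowvar=False)            # 3x3 covariance (n - 1 denominator)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the largest comes first
eigvals = eigvals[order]
components = eigvecs[:, order].T         # rows are weighting vectors, like components_

# Each row has unit length and the rows are mutually orthogonal,
# so this product is (numerically) the identity matrix
print(np.round(components @ components.T, 6))
```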

---

#### 5.B) Look at the eigenvalues and the explained variance ratio.

The eigenvalues are ordered such that the first components have the largest eigenvalues. The eigenvalues (`explained_variance_`) and their normalized equivalent (`explained_variance_ratio_`) tell you how much of the variance in the original data each new component captures.

In [None]:
subj_exp_var_eigenvals = subjective_pca.explained_variance_
subj_exp_var_pct = subjective_pca.explained_variance_ratio_

print('eigenvalues:', subj_exp_var_eigenvals, '\n')
print('explained variance pct:', subj_exp_var_pct)
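The relationship between the two printed quantities is simple: each ratio is an eigenvalue divided by the sum of all eigenvalues. A small NumPy sketch on made-up standardized data (the data is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated features: 150 samples, 4 columns
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-score each column

# Eigenvalues of the covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
ratio = eigvals / eigvals.sum()

print(eigvals.sum())   # close to 4: each standardized column contributes variance 1
print(ratio.sum())     # the ratios sum to 1
```

For z-scored data the eigenvalues sum to the number of columns, so an eigenvalue above 1 means a component carries more variance than any single original column.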

---

#### 5.C) Transform the subjective data into the principal component space.

The PCA object's `transform()` method creates your new component variable matrix.

In [None]:
# transform values using weights of PCA object
subj_to_pcs = subjective_pca.transform(subjective.values)

This transformed our five-dimensional data set into vectors along its five principal components (with zero loss).
- Using these, we can now reduce the dimensionality of our data while minimizing loss.
- For example, keeping only the first three principal components accounts for $0.431 + 0.178 + 0.147 = 0.756$, or about 75.6%, of the variance.
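One way to see what "loss" means here is to rebuild the data from only the first few components and measure how much variance survives. The sketch below does this with a NumPy SVD on made-up five-feature data; the data and the choice `k = 3` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 5-feature data, z-scored like the lab data
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the centered data: rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                       # all five principal component scores

k = 3
X_approx = scores[:, :k] @ Vt[:k]       # rebuild the data from the first k components

explained = (S ** 2) / np.sum(S ** 2)   # explained variance ratio per component
retained = 1 - np.sum((X - X_approx) ** 2) / np.sum(X ** 2)

# The cumulative explained variance ratio equals the share of variance
# that survives the reconstruction
print(np.cumsum(explained)[k - 1], retained)
```

Keeping all five components reconstructs the data exactly, which is why the full transform has zero loss.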

In [None]:
## This transforms our original five-dimensional data into three-dimensional data
# The first row is the first person's subjective.values transformed

subj_to_pcs[:,:3]

---

#### 5.D) PCA-transformed features are not correlated.

- Keep in mind that the columns of the transformed data are no longer correlated with one another.
- Compare this to the exploration above, where many columns were correlated.

In [None]:
sns.pairplot(pd.DataFrame(subj_to_pcs, columns=['PC1','PC2','PC3','PC4','PC5']), kind='reg');
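The pair plot above can be complemented with a numeric check. This sketch uses made-up correlated data and a NumPy SVD (not the lab's fitted PCA object) to show that the off-diagonal correlations of the component scores vanish.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical correlated data: 300 samples, 4 features
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                   # center each column

_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                        # principal component scores

corr_original = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

# Largest absolute correlation between two different columns
off_diag = ~np.eye(4, dtype=bool)
print(np.abs(corr_original[off_diag]).max())   # original features are correlated
print(np.abs(corr_scores[off_diag]).max())     # component scores: numerically zero
```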

---

### 6) Optional: How were the data transformed?

To demonstrate how the new principal component matrix is created from the original variable columns and the eigenvector weighting matrix, we'll create the first component (PC1) manually.

#### 6.A) Pull out the eigenvector for PC1.

In [None]:
confidence_weights = subj_components[0]

person1_original_ratings = subjective.iloc[0,:]
person1_pcas = subj_to_pcs[0,:]

#### 6.B) Create a DataFrame showing the original values for the subjective variables for `person1`.

In [None]:
person1_original_ratings = subjective.iloc[0,:]

how_to_make_pc1 = pd.DataFrame({'person1_original': person1_original_ratings.values},
                               index=subjective.columns)
how_to_make_pc1

#### 6.C) Add the eigenvector for PC1: the weights by which to multiply each original variable.

Recall that each component is a linear combination of the original variables, multiplied by a "weight" defined in the eigenvector of that component.

In [None]:
how_to_make_pc1['weights_to_make_pc1'] = confidence_weights
how_to_make_pc1

#### 6.D) Multiply the original variable values by the eigenvector values.

These are the "pieces" of PC1 that will be added together to create the new value for that person.

In [None]:
how_to_make_pc1['pieces_of_pc1_value'] = how_to_make_pc1.person1_original * how_to_make_pc1.weights_to_make_pc1
how_to_make_pc1

#### 6.E) Sum the original values multiplied by the eigenvector weights to get `person1`’s value for PC1.

In [None]:
print('sum of linear combinations of weights * original values for PC1:', np.sum(how_to_make_pc1.pieces_of_pc1_value))
print('person 1s pca variables:', person1_pcas)
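The manual calculation above extends to every person and every component at once as a single matrix product. The sketch below checks this against `PCA.transform` on made-up data. Note that `transform` subtracts the fitted column means (`mean_`) first; in the lab the data is already z-scored, so those means are approximately zero and the plain weighted sum matches.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Hypothetical data: 100 samples, 5 correlated features (not centered)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5)) + 3.0

pca = PCA(n_components=5).fit(X)

# Center with the fitted means, then weight by each eigenvector (row of components_)
manual = (X - pca.mean_) @ pca.components_.T
auto = pca.transform(X)

print(np.allclose(manual, auto))
```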

---

### 7) Fit PCA on the preference data.

Now that you've seen how it's done, try it yourself!

- Find PCA eigenvalues and eigenvectors for the five `sd_like` columns.
- Transform the original `sd_like` columns into the principal component space.
- Verify that these columns are uncorrelated.

In [None]:
sd_like.columns

#### 7.A) Find PCA eigenvalues and eigenvectors for the five `sd_like` columns.

In [None]:
pref_pca = PCA(n_components=5)
pref_pca.fit(sd_like)

In [None]:
pref_comp = pref_pca.components_

In [None]:
print(pref_pca.explained_variance_ratio_)
print('-------------------------------------\n')

print(sd_like.columns.values)
print('-------------------------------------\n')

for i, pc in enumerate(['PC1','PC2','PC3','PC4','PC5']):
    print(pc, 'weighting vector:', pref_comp[i])
    print('-------------------------------------\n')

#### 7.B) Transform the original `sd_like` columns into the principal component space.

In [None]:
pref_pcs = pref_pca.transform(sd_like)

In [None]:
pref_pcs[0:3]

#### 7.C) Verify that these columns are uncorrelated.

In [None]:
sns.pairplot(pd.DataFrame(pref_pcs, columns=['PC1','PC2','PC3','PC4','PC5']), kind='reg')

---

### 8) Use PCA for dimensionality reduction.

Using linear regression, let's predict how much a user likes reading (the `like_reading` rating).

**The key question:** Can we get the same prediction accuracy using only the first three principal components as features versus using all five original values as features?

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reading = sd['like_reading'].values

#### 8.A) Linear regression cross-validated on original variables (`sd_like.values`).

- What is the mean cross-validation score?
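- Keep in mind that `cross_val_score` defaults to the estimator's own `score` method, which for `LinearRegression` is $R^2$. A score of 1 is ideal, 0 is no better than always predicting the mean, and negative scores are worse than that.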
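To see what the numbers returned by `cross_val_score` mean, here is a sketch on synthetic data (the features, coefficients, and noise level are assumptions for illustration). It compares the default score with an explicit error metric requested through the `scoring` parameter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
# Hypothetical features and a target that depends linearly on them plus noise
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Default scoring calls model.score, which is R^2 for LinearRegression
r2_scores = cross_val_score(model, X, y, cv=5)

# An error metric must be requested explicitly; scikit-learn negates it
# so that larger values are always better
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

print(r2_scores.mean())    # closer to 1 is better
print(-neg_mse.mean())     # mean squared error; closer to 0 is better
```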

In [None]:
linreg = LinearRegression()
original_scores = cross_val_score(linreg, sd_like.values, reading, cv=10)
print(sd_like.columns.values)
print(original_scores)
print(np.mean(original_scores))

In [None]:
linreg.fit(sd_like.values, reading)
for coef, var in zip(linreg.coef_, sd_like.columns):
    print(var, coef)

#### 8.B) Linear regression on the first principal component.

- What is the mean cross-validation score?

In [None]:
pref_pcs[:,0:1].shape

In [None]:
pca_linreg = LinearRegression()
pca_scores = cross_val_score(pca_linreg, pref_pcs[:,0:1], reading, cv=10)
print(pca_scores)
print(np.mean(pca_scores))

#### 8.C) Linear regression on first three principal components.

- What is the mean cross-validation score?

In [None]:
pca_linreg = LinearRegression()
pca_scores = cross_val_score(pca_linreg, pref_pcs[:,0:3], reading, cv=10)
print(pca_scores)
print(np.mean(pca_scores))

---

**Checkity-check yo'self**. The mean cross-validation score should be nearly the same for the first three principal components as it was on the original five-feature data.
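Why should the scores be close? The principal components are just a rotation of the centered features, so a linear regression on all five components fits exactly as well as one on the original five columns, and dropping components can only remove information. The sketch below illustrates this with in-sample $R^2$ on made-up data; it uses plain NumPy least squares rather than the lab's cross-validation, and the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(features, target):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1 - residual @ residual / np.sum((target - target.mean()) ** 2)

rng = np.random.default_rng(8)
# Hypothetical z-scored features and a noisy linear target
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = X @ rng.normal(size=5) + rng.normal(size=300)

# Principal component scores via SVD of the centered data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pcs = X @ Vt.T

r2_original = r_squared(X, y)
r2_all_pcs = r_squared(pcs, y)            # same fit as the original features
r2_three_pcs = r_squared(pcs[:, :3], y)   # can only match or fall below the full fit

print(r2_original, r2_all_pcs, r2_three_pcs)
```

How much the three-component score falls short depends on how much of the target's signal lies along the dropped components, not only on how much feature variance they carry.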