In [None]:
%matplotlib inline
low_memory=False
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns; sns.set()
from scipy import stats
import math
import os
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA

## 9.1 Introduction & Motivation

We've seen quite a few regression, classification, and clustering methods. That's great! We can already train many different models and gain insights from them. However, now it's time to take a step back and look at the data we are feeding our models. Instead of just throwing everything we have at them, we are going to reduce our data by performing **Principal Component Analysis (PCA)**.

Why do we do this? When working with large datasets, training a model becomes more complex and computationally expensive. For example, think of the training process of a GPT model. The more data being used, the longer training takes and the more computing power it consumes, requiring expensive hardware and driving up electricity bills significantly. However, we still want to retain most of the information contained in our data. That's where PCA comes into play. PCA transforms the data into a smaller set of uncorrelated components, each a linear combination of the original features, while preserving as much of the variance as possible.

**Key Benefits of PCA:**
- Reduces computational complexity
- Speeds up training time
- Reduces storage requirements
- Helps visualize high-dimensional data
- Can reduce noise in the data

## 9.2 Problem Setting

The faces dataset (Labeled Faces in the Wild, loaded with `fetch_lfw_people`) is useful for exploring the differences between models and the effects of dimensionality reduction techniques. This makes it a prime candidate for exploring PCA!

Similar to the digits dataset, it contains a collection of images and their corresponding labels. This time, however, the images are not handwritten digits but faces of well-known politicians from several countries. Let's explore the dataset and discover what we can achieve with PCA!

**Dataset Overview:**
- Contains facial images of politicians
- Each pixel of an image is one feature, so every image has many features
- High-dimensional data perfect for demonstrating PCA
- Real-world application of dimensionality reduction

## 9.3 Model

First, let's have a look at the data.

**Hint:** Pay attention to the shape of the data. This will help you understand how many features (dimensions) we're working with and how PCA can reduce this complexity.

In [None]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

##### Question 1: By now you should be familiar with the digits dataset. Visualize the faces dataset in a similar way. Show only the first 5 faces. Can you display the correct name as labels?

**Hints:**
- Use `plt.subplot()` to create multiple plots in one figure
- The image data needs to be reshaped to display properly (check the image shape from above)
- Use `target_names[label]` to get the actual person's name
- Consider using a grayscale colormap for better visualization
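
The hints above can be sketched on a stand-in dataset. This is a minimal illustration, not the solution: it assumes the bundled digits dataset in place of the faces (so it runs without a download), uses `plt.subplots()` rather than repeated `plt.subplot()` calls, and the helper name `plot_first_images` is hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits


def plot_first_images(images, labels, names, n=5):
    """Show the first n images in one row, titled with their class names."""
    # squeeze=False keeps `axes` two-dimensional even when n == 1
    fig, axes = plt.subplots(1, n, figsize=(2 * n, 2.5), squeeze=False)
    for ax, image, label in zip(axes[0], images[:n], labels[:n]):
        ax.imshow(image, cmap="gray")    # grayscale colormap
        ax.set_title(str(names[label]))  # map the numeric label to its name
        ax.axis("off")
    return fig


# Stand-in data (assumption): digits instead of faces
digits = load_digits()
fig = plot_first_images(digits.images, digits.target, digits.target_names)
```

For the faces, the same helper should work with `faces.images`, `faces.target`, and `faces.target_names`, since `faces.images` already has one 2D array per face.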

In [None]:
# Your code here


##### Question 2: Train a classification method of your choice to get some predictions. Use cross-validation to get the best results. Remember to select the best parameters!

**Hints:**
- KNN is a simple baseline that works reasonably well on small image datasets
- Try different values of k (number of neighbors) to find the optimal parameter
- Use a train-test split or, preferably, cross-validation to evaluate different parameter values
- Plot the accuracy vs. parameter values to visualize the best choice
- This baseline performance will be important for comparing with PCA results later
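
One possible shape for this parameter search is sketched below. It assumes the bundled digits dataset as a stand-in for the faces, and the list of candidate `k_values` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): digits instead of faces
X, y = load_digits(return_X_y=True)

k_values = [1, 3, 5, 7, 9, 11]  # hypothetical candidate values
mean_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation; each score is the accuracy on one fold
    scores = cross_val_score(knn, X, y, cv=5)
    mean_scores.append(scores.mean())

best_k = k_values[int(np.argmax(mean_scores))]
best_score = max(mean_scores)
print(f"best k = {best_k}, mean CV accuracy = {best_score:.3f}")
```

Plotting `k_values` against `mean_scores` gives the accuracy curve mentioned in the hints.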

In [None]:
# Your code here


## 9.4 Model Evaluation

##### Question 3: It's time to reduce our data using PCA! Figure out the best amount of principal components and reduce the data.

**Hints:**
- Start by fitting PCA without specifying the number of components to see all eigenvalues
- Plot the explained variance ratio (each eigenvalue divided by the sum of all eigenvalues) to visualize the "elbow"
- The elbow method helps identify where additional components provide diminishing returns
- Remember: each eigenvalue is the variance of the data along that component
- Look for the point where the slope significantly decreases
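
As a rough sketch of these hints, again assuming the digits dataset as a stand-in: the 1% cut-off used for `n_elbow` is an arbitrary heuristic chosen for illustration, not a rule, and reading the elbow from the plot is usually the better guide.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): digits instead of faces
X, _ = load_digits(return_X_y=True)

# Without n_components, PCA keeps min(n_samples, n_features) components
pca_full = PCA().fit(X)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Scree plot: look for the point where the curve flattens out
fig, ax = plt.subplots()
ax.plot(np.arange(1, len(ratios) + 1), ratios, marker="o")
ax.set_xlabel("principal component")
ax.set_ylabel("explained variance ratio")

# Crude numeric stand-in for the elbow (assumption): count the components
# that each explain at least 1% of the total variance
n_elbow = int(np.sum(ratios >= 0.01))
print(n_elbow, cumulative[n_elbow - 1])
```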

In [None]:
# Your code here


##### Question 4: Do you notice something special about the eigenvalues? What is the total sum of all eigenvalues? Play around with models with different amounts of components. How does this change? Is this expected? Elaborate based on the meaning of eigenvalue.

**Hints:**
- Calculate the sum of explained variance ratios for different numbers of components
- Think about what 100% variance means in the context of the original data
- Remember: each explained variance ratio is an eigenvalue divided by the sum of all eigenvalues, i.e. the proportion of total variance explained by that component
- Consider what happens when you include ALL components vs. just a subset
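
The link between eigenvalues and explained variance can be checked numerically. The sketch below uses small synthetic data (an assumption, chosen so the numbers are easy to inspect) and compares the eigenvalues of the sample covariance matrix with what scikit-learn reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 300 samples with 6 correlated features
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

pca = PCA().fit(X)

# Eigenvalues of the sample covariance matrix, largest first
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Dividing by the total turns eigenvalues into proportions of variance
ratios = eigenvalues / eigenvalues.sum()

# Variance retained when only the first n components are kept
partial = {
    n: PCA(n_components=n).fit(X).explained_variance_ratio_.sum()
    for n in (1, 3, 6)
}
print(partial)
```

Here `eigenvalues` should match `pca.explained_variance_` and `ratios` should match `pca.explained_variance_ratio_`, which is why the ratios of all components together add up to 1.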

In [None]:
# Your code here


##### Question 5: Use PCA and the best number of components you found earlier to reduce your data. Retrain your model using k-fold cross-validation and compare the accuracy. What do you notice?

**Steps to follow:**
1. Apply PCA with your chosen number of components to transform your data
2. Split the transformed data for training and testing
3. Find the optimal k value for KNN on the reduced data
4. Use k-fold cross-validation to get a reliable accuracy estimate
5. Compare this accuracy with your baseline from Question 2
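
The steps above might look roughly like this. It is a sketch under assumptions: the digits dataset stands in for the faces, `n_components = 20` is a placeholder for whatever you found in Question 3, the helper `best_knn_score` is hypothetical, and the separate train-test split is folded into the k-fold loop for brevity.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): digits instead of faces
X, y = load_digits(return_X_y=True)

n_components = 20  # placeholder; use your own choice from Question 3
X_reduced = PCA(n_components=n_components, random_state=0).fit_transform(X)

# The same folds are used for both feature sets so the comparison is fair
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def best_knn_score(features, labels, k_values=(1, 3, 5, 7, 9)):
    """Return (best k, mean CV accuracy) over the candidate k values."""
    results = {
        k: cross_val_score(
            KNeighborsClassifier(n_neighbors=k), features, labels, cv=cv
        ).mean()
        for k in k_values
    }
    best_k = max(results, key=results.get)
    return best_k, results[best_k]


baseline_k, baseline_acc = best_knn_score(X, y)
reduced_k, reduced_acc = best_knn_score(X_reduced, y)
print(f"original: k={baseline_k}, accuracy={baseline_acc:.3f}")
print(f"reduced:  k={reduced_k}, accuracy={reduced_acc:.3f}")
```

Note that fitting PCA on all samples before cross-validating lets the test folds influence the components. Wrapping PCA and KNN in a scikit-learn `Pipeline` would refit PCA inside each fold instead.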

In [None]:
# Your code here


##### Question 6: Your department just got granted some extra budget. You are able to use some more processing power, but still not enough to use the entire dataset. Your boss wants you to create a model that retains 90% of all variance. Create this model and calculate the accuracy as before. By how much did you reduce the size of your dataset?

**Hints:**
- You can pass the variance ratio directly as the number of components: `PCA(n_components=0.9)`
- This will automatically determine how many components are needed to retain 90% of variance
- Compare the number of components used with the original features
- Calculate the percentage reduction: `(original_features - new_features) / original_features * 100`
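
A small sketch of these hints, once more assuming the digits dataset as a stand-in for the faces:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): digits instead of faces
X, _ = load_digits(return_X_y=True)

# A float between 0 and 1 asks PCA for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca_90 = PCA(n_components=0.9).fit(X)

original_features = X.shape[1]
new_features = pca_90.n_components_  # number of components actually kept
reduction_pct = (original_features - new_features) / original_features * 100
retained = pca_90.explained_variance_ratio_.sum()

print(f"{original_features} -> {new_features} features "
      f"({reduction_pct:.1f}% fewer), {retained:.1%} variance retained")
```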

In [None]:
# Your code here


## 9.5 Exercises

##### Question 1: See section 9.3
##### Question 2: See section 9.3
##### Question 3: See section 9.4
##### Question 4: See section 9.4
##### Question 5: See section 9.4
##### Question 6: See section 9.4

##### Question 7: PCA reduces the number of variables by creating new variables, each a linear combination of the original ones. Because the retained components capture most of the variance, reversing the projection should give a close approximation of the original data. Transform your projected data from Question 6 back into the original number of dimensions and compare the data by looking at the data points of the first face.

**Hints:**
- Use `pca.inverse_transform()` to reconstruct the original data
- Compare the original `faces.data[0]` with the reconstructed version
- The reconstructed data won't be identical but should be very similar
- This process demonstrates that PCA preserves the most important information
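
The round trip can be sketched as follows, assuming the digits dataset as a stand-in. The variable `lost_fraction` is introduced here for illustration: it measures how much of the total variance the reconstruction fails to recover.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): digits instead of faces
X, _ = load_digits(return_X_y=True)

pca_90 = PCA(n_components=0.9)
X_projected = pca_90.fit_transform(X)               # fewer columns than X
X_restored = pca_90.inverse_transform(X_projected)  # same shape as X again

# Compare the first sample before and after the round trip
first_original = X[0]
first_restored = X_restored[0]
max_abs_diff = np.abs(first_original - first_restored).max()

# Fraction of the total variance that the reconstruction loses
residual = np.sum((X - X_restored) ** 2)
total = np.sum((X - X.mean(axis=0)) ** 2)
lost_fraction = residual / total

print(X_projected.shape, X_restored.shape, max_abs_diff, lost_fraction)
```

`lost_fraction` should equal one minus the sum of `pca_90.explained_variance_ratio_`, so the reconstruction error is exactly the variance that was discarded.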

In [None]:
# Your code here


##### Question 8: Now that you have figured out the process of reversing PCA, visualize the reconstructed data. Compare the reconstructed faces with the original ones and behold the true power of PCA!

**Instructions:**
- Create side-by-side visualizations: original faces vs. reconstructed faces
- Use the same visualization code from Question 1
- Apply PCA with 90% variance, then use inverse_transform
- Compare how well the faces are preserved despite the massive dimensionality reduction

**What to look for:**
- Overall facial structure should be well preserved
- Key facial features should remain recognizable
- Some fine details might be slightly blurred (this is the 10% variance we discarded)
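
One way to lay out such a comparison is a two-row grid, with originals on top and reconstructions below. The sketch assumes the digits dataset as a stand-in and puts both rows in a single figure instead of two separate cells.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): digits instead of faces
digits = load_digits()

pca_90 = PCA(n_components=0.9)
restored = pca_90.inverse_transform(pca_90.fit_transform(digits.data))

# Flat rows must be reshaped to the image height and width for plotting
image_shape = digits.images.shape[1:]

n = 5
fig, axes = plt.subplots(2, n, figsize=(2 * n, 4.5))
for col in range(n):
    axes[0, col].imshow(digits.images[col], cmap="gray")
    axes[0, col].set_title("original")
    axes[1, col].imshow(restored[col].reshape(image_shape), cmap="gray")
    axes[1, col].set_title("restored")
for ax in axes.ravel():
    ax.axis("off")
```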

In [None]:
# Your code here for original faces


In [None]:
# Your code here for reconstructed faces


##### Question 9: We talked briefly about how PCA can be used to reduce noise. Use the `noisy` data defined below. Plot the faces as you did before to see the noise. What's the best result you can achieve when using PCA to reduce the noise?

**Experiment Design:**
1. First, visualize the noisy data to see the effect of added noise
2. Fit PCA on the original clean data (not the noisy data)
3. Project the noisy data onto the fitted components with `transform`, then use `inverse_transform` to reconstruct it
4. Compare: Original → Noisy → PCA-reconstructed

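**Key Insight:** PCA learns the dominant patterns from the clean data. Random noise is spread roughly evenly across all components, whereas the signal is concentrated in the first few, so discarding the low-variance components removes much of the noise while keeping most of the facial structure.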

**Try different variance thresholds:** Test 90%, 95%, 99% to see which gives the best noise reduction while preserving facial features.
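
The experiment can be sketched on a stand-in dataset. The assumptions here are the digits dataset in place of the faces, a noise scale of `4.0` chosen to suit the digits' 0 to 16 pixel range (the faces cell below uses `0.1`), and mean squared error against the clean data as the quality measure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): digits instead of faces
X, _ = load_digits(return_X_y=True)

rng = np.random.default_rng(42)
noisy = rng.normal(X, 4.0)  # Gaussian noise centered on each pixel value


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


noisy_error = mse(noisy, X)

errors = {}
for threshold in (0.90, 0.95, 0.99):
    pca = PCA(n_components=threshold).fit(X)  # fit on the clean data
    # Project the noisy data, then map it back to pixel space
    denoised = pca.inverse_transform(pca.transform(noisy))
    errors[threshold] = mse(denoised, X)

print(f"noisy: {noisy_error:.2f}")
for threshold, error in errors.items():
    print(f"PCA {threshold:.0%}: {error:.2f}")
```

A lower error than `noisy_error` means the reconstruction is closer to the clean images than the noisy input was. A visual check is still worthwhile, because the lowest error is not always the most recognizable face.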

In [None]:
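# Fix the seed so everyone gets the same noisy faces
np.random.seed(42)
# Draw each pixel from a normal distribution centered on its clean value (std 0.1)
noisy = np.random.normal(faces.data, 0.1)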

In [None]:
# Your code here to visualize noisy data


In [None]:
# Your code here to apply PCA noise reduction
