# PCA: Principal Component Analysis

What is the dimensionality of data?
- y = x is 1-dimensional. We can argue it is 1D even if it has small deviations (think of those as noise).
- But a cubic is 2D as far as PCA is concerned. (PCA only shifts and rotates the coordinate system. It does not apply any other feature transformation, so it cannot straighten a curve.)
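
A small sketch of the line-versus-cubic point, using synthetic data invented for illustration (the ranges and noise level are assumptions). It compares how much variance is left for the second component in each case.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)

# Noisy line y = x: nearly all of the variance lies along one direction.
line = np.column_stack([x, x + rng.normal(scale=0.05, size=x.size)])
# Cubic y = x^3: the curve bends, so no single straight axis follows it exactly.
cubic = np.column_stack([x, x ** 3])

line_ratio = PCA(n_components=2).fit(line).explained_variance_ratio_
cubic_ratio = PCA(n_components=2).fit(cubic).explained_variance_ratio_
print("line :", line_ratio)
print("cubic:", cubic_ratio)
```

For the line, the second component only holds the noise. For the cubic, the second component holds systematic structure (the bend), which a rotation cannot remove.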


PCA: Given data of any shape, PCA finds a new coordinate system, obtained from the old one by translation and rotation only, such that
- it moves the centre of the coordinate system to the centre of the data.
- it moves the x axis onto the principal axis of variation (the direction along which the data is most spread out).
- it makes the further axes orthogonal to the principal axis, following the remaining directions of variation in decreasing order.
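
The translate-then-rotate description can be sketched directly in NumPy. This is a minimal illustration on a synthetic 2D cloud (the data, the 30 degree tilt and the offset are assumptions), not how any particular library implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cloud: elongated, tilted by 30 degrees, shifted away from the origin.
raw = rng.normal(size=(500, 2)) * [3.0, 0.5]
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
points = raw @ rotation.T + [10.0, -4.0]

# Translation: move the origin to the centre of the data.
centre = points.mean(axis=0)
centred = points - centre

# Rotation: the eigenvectors of the covariance matrix are the new axes.
cov = np.cov(centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order
order = np.argsort(eigenvalues)[::-1]             # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Coordinates of every point in the new system.
new_coords = centred @ eigenvectors
print("centre:", centre)
print("first axis (up to sign):", eigenvectors[:, 0])
print("variance along new axes:", new_coords.var(axis=0, ddof=1))
```

The variance along each new axis equals the corresponding eigenvalue, and the first axis should come out close to the 30 degree tilt used to build the data.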

So some data is 'PCA-ready', some is not (e.g. if it's a cubic). PCA can deal with vertical lines because it works with vectors (directions), whereas regression fits a function y = f(x).

Questions
- Is the data PCA-ready?
- Does the major axis dominate? (i.e. once the major axis has captured the spread, there is not much left in the minor axis or axes.)
    - e.g. circle -> no. Both eigenvalues have the same magnitude, so we haven't gained much by running PCA.
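
To check whether the major axis dominates, one option is to look at the explained variance ratios. The sketch below uses two synthetic clouds (both invented for illustration): one round, one stretched.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
round_cloud = rng.normal(size=(1000, 2))                    # same spread in every direction
stretched_cloud = rng.normal(size=(1000, 2)) * [5.0, 1.0]   # one direction dominates

round_ratio = PCA().fit(round_cloud).explained_variance_ratio_
stretched_ratio = PCA().fit(stretched_cloud).explained_variance_ratio_
print("round    :", round_ratio)
print("stretched:", stretched_ratio)
```

For the round cloud the two ratios are close to equal, so dropping either axis loses about half of the variance. For the stretched cloud the first ratio is close to 1.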

## Measurable vs Latent Features

Q: Given the features of a house, what is its price?

Measurable variables
- Square footage
- No. of rooms
- School ranking
- Neighbourhood safety

-> Probing **latent variables**
- Size
- Neighbourhood

### Preserving information: How best to condense our measurable features to k features (where there are e.g. 2 latent variables)? 

- Feature selection tools
    - Select k best (good if unknown no. of features)
    - Select percentile
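
As a rough sketch, the two tools above correspond to scikit-learn's `SelectKBest` and `SelectPercentile`. The regression data here is generated with `make_regression` purely for illustration, and `f_regression` is one possible scoring function, chosen as an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Keep the 2 highest-scoring features.
k_best = SelectKBest(score_func=f_regression, k=2).fit(X, y)
# Keep the top 30% of features by score.
top_fraction = SelectPercentile(score_func=f_regression, percentile=30).fit(X, y)

X_k = k_best.transform(X)
X_p = top_fraction.transform(X)
print(X_k.shape, X_p.shape)
print("kept by SelectKBest:", k_best.get_support(indices=True))
```

Both tools keep a subset of the original columns unchanged, which is different from PCA, where every new feature is a mix of the originals.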

Process:
- Have many features, but I hypothesise a smaller number of features actually drive the patterns.
- Try to make a **composite feature** (principal component) that more directly probes the underlying phenomenon.
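
A minimal sketch of a composite feature, assuming a hypothetical latent variable `size` that drives both square footage and number of rooms. All numbers are made up for illustration. The features are standardised first because they are on very different scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical latent variable; in real data this is never observed directly.
size = rng.normal(size=300)
square_footage = 1500 + 400 * size + rng.normal(scale=60, size=300)
rooms = 5 + 1.5 * size + rng.normal(scale=0.4, size=300)
measured = np.column_stack([square_footage, rooms])

scaled = StandardScaler().fit_transform(measured)
pca = PCA(n_components=1).fit(scaled)
composite = pca.transform(scaled)[:, 0]   # one composite feature per house

# How well does the composite feature track the latent variable?
corr = np.corrcoef(composite, size)[0, 1]
print("correlation with latent size:", corr)
```

The sign of a principal component is arbitrary, so the correlation may come out strongly negative rather than strongly positive.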

PCA is a tool for dimensionality reduction, and also a good unsupervised learning tool in its own right.

PCA vs Regression:
- Regression: Predicting an output from the inputs
- PCA: Trying to find a direction we can project our data onto so that we lose the least amount of info.

## How to determine the principal component

**Variance (stats)**: The spread of a data distribution (vs variance in ML: the willingness or flexibility of an alg to learn)

The **principal component** of a dataset is the direction that has the **largest variance**, because projecting onto this direction **retains the maximum amount of info in the original data**.

(This is a compression algorithm)

### Maximal variance and information loss
Information loss: the perpendicular distance between a point and the line we're projecting the point onto.

Projecting onto the direction of maximal variance minimises the distance from each old (higher-dimensional) point to its new transformed value -> Minimises information loss
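
The link between maximal variance and minimal information loss can be checked numerically. The sketch below (synthetic 2D data, an assumption) sweeps candidate directions and measures both the variance along each direction and the mean squared perpendicular distance to it.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(400, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
centred = points - points.mean(axis=0)

# Candidate directions: unit vectors at 1 degree steps over half a turn.
angles = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Coordinate of every point along every direction, shape (400, 181).
along = centred @ directions.T
projected_variance = (along ** 2).mean(axis=0)

# Perpendicular residual of every point for every direction, shape (400, 181, 2).
residual = centred[:, None, :] - along[:, :, None] * directions[None, :, :]
squared_distance = (residual ** 2).sum(axis=2).mean(axis=0)

total = (centred ** 2).sum(axis=1).mean()
best_by_variance = int(np.argmax(projected_variance))
best_by_distance = int(np.argmin(squared_distance))
print("angle maximising variance :", np.rad2deg(angles[best_by_variance]))
print("angle minimising distance :", np.rad2deg(angles[best_by_distance]))
```

By Pythagoras, projected variance plus mean squared perpendicular distance equals the total spread for every direction, so maximising one is the same as minimising the other.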

## PCA as a general algorithm for feature transformation
- So far, separating or grouping features by hand (square footage, no. of rooms -> size). But this is not scalable.

- Instead, put all features into PCA and ask PCA to pick first, second PCs. 
    - They'll likely be a mix of the intuitive latent variables, but it's a useful unsupervised learning technique.

Max number of PCs allowed by sklearn: min of no. of features and no. of training points
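
A quick way to see this limit, sketched on random data with more features than training points (the shapes are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
few_points = rng.normal(size=(5, 20))    # 5 training points, 20 features

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca = PCA().fit(few_points)
print("components kept:", pca.n_components_)

# Asking for more than that is rejected.
try:
    PCA(n_components=10).fit(few_points)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("error:", error_message)
```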


## Working definition of PCA
- PCA is a systematised way to transform input features into principal components
- use principal components as new features
- PCs are directions in data that maximise variance (min info loss) when you project or compress down onto them
- The more variance of data along a PC, the higher that PC is ranked.
- Each PC is orthogonal to (and hence linearly independent of) every other PC, so there is no overlap.
- Max no. of PCs = min of no. of input features and no. of training points.
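
Two of the properties above (ranking by variance and no overlap between PCs) can be verified on a fitted model. The data below is synthetic and only serves as an illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated synthetic features: random data pushed through a random linear mix.
data = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

pca = PCA(n_components=3).fit(data)

# Rows of components_ are the PCs; their pairwise dot products form the identity.
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))

# PCs are ranked by the variance they capture, largest first.
print(pca.explained_variance_)

# Used as new features, the PCs are uncorrelated with each other.
new_features = pca.transform(data)
new_cov = np.cov(new_features, rowvar=False)
print(np.round(new_cov, 6))
```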

In [None]:
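from sklearn.decomposition import PCA

# data: array of shape (n_samples, n_features), assumed to be defined earlier
pca = PCA(n_components=2)
pca.fit(data)

# Print the fraction of total variance explained by each PC
# (the eigenvalues themselves are in pca.explained_variance_)
print(pca.explained_variance_ratio_)
first_pc = pca.components_[0]
second_pc = pca.components_[1]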

## When to use PCA
- Figure out latent features driving the patterns in data
- Dimensionality reduction
    - Visualise high-dimensional data (a scatterplot only has 2 dimensions available) -> Can visualise e.g. k-means clustering
    - Reduce noise (Hope 1st and 2nd PCs capture info and other minor ones capture noise)
    - Preprocessing (reduce dim): Make other algs (regression, classification) work better b/c fewer inputs (e.g. Eigenfaces for facial identification -> feed into SVM)
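
A sketch of the noise-reduction use, assuming synthetic data in which 10 observed features are driven by 2 latent signals plus noise. Projecting onto 2 PCs and mapping back with `inverse_transform` should discard most of the noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))          # 2 underlying signals (synthetic)
mixing = rng.normal(size=(2, 10))           # how they show up in 10 observed features
clean = latent @ mixing
noisy = clean + rng.normal(scale=0.3, size=clean.shape)

pca = PCA(n_components=2).fit(noisy)
reduced = pca.transform(noisy)              # shape (500, 2)
denoised = pca.inverse_transform(reduced)   # back in the original 10-D space

error_before = np.mean((noisy - clean) ** 2)
error_after = np.mean((denoised - clean) ** 2)
print("mean squared error before:", error_before)
print("mean squared error after :", error_after)
```

This only works when the signal really does live in a few directions. If the minor PCs carry signal too, dropping them loses information.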

### PCA for Facial Recognition
Good for PCA because
- Pictures of faces generally have high input dimensionality (many pixels)
- Faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on bottom)

### Selecting a number of PCs
- Train on different numbers of PCs and choose the optimal one
- Be very careful about throwing out features before you do PCA. Sometimes you might do it because PCA is computationally expensive, but be careful when you do.
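
One way to try different numbers of PCs is a cross-validated loop. The sketch below uses scikit-learn's bundled digits dataset as a stand-in for faces, and an SVM with default settings as the downstream classifier. Both choices, and the candidate list, are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 1797 images, 64 pixel features each

candidates = [2, 5, 10, 20, 40]
scores = {}
for n in candidates:
    # The pipeline refits PCA inside each fold, so the held-out fold stays unseen.
    model = make_pipeline(PCA(n_components=n, random_state=0), SVC())
    scores[n] = cross_val_score(model, X, y, cv=3).mean()

best_n = max(scores, key=scores.get)
for n, score in scores.items():
    print(f"{n:>2} PCs: mean CV accuracy {score:.3f}")
print("best:", best_n)
```

Accuracy typically rises quickly with the first few PCs and then flattens, which is the point at which adding more PCs stops paying off.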

In [None]:
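from sklearn.decomposition import PCA

# Train-test split (X_train, X_test and the image height h and width w
# are assumed to be defined by this point)

# from 1850 features to 150
n_components = 150

# Extracting the top 150 eigenfaces from >1200 faces
# (RandomizedPCA was removed from scikit-learn; PCA with svd_solver="randomized" replaces it)
pca = PCA(n_components=n_components, svd_solver="randomized", whiten=True).fit(X_train)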

eigenfaces = pca.components_.reshape((n_components, h, w))
# Transform into PCA representation
# i.e. project input data on the eigenfaces orthonormal basis
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

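# Grid search over the classifier's hyperparameters, trained on the PCA features
from sklearn.model_selection import GridSearchCV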
clf = GridSearchCV(...)
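
For reference, here is a self-contained sketch of the same pipeline that runs without downloading a face dataset. It substitutes scikit-learn's bundled digits images for faces, uses 30 components instead of 150, and picks an RBF SVM with a small hypothetical parameter grid. None of these choices come from the cell above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = load_digits()
h, w = digits.images.shape[1:]   # each image is 8 x 8 pixels

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

# From 64 pixel features down to 30 principal components
n_components = 30
pca = PCA(n_components=n_components, svd_solver="randomized",
          whiten=True, random_state=42).fit(X_train)

# Each PC can be viewed as an image (the analogue of an eigenface).
eigen_digits = pca.components_.reshape((n_components, h, w))

# Project the data onto the principal components.
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Hypothetical grid; the values are illustrative, not tuned.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3).fit(X_train_pca, y_train)

test_accuracy = clf.score(X_test_pca, y_test)
print("best params:", clf.best_params_)
print("test accuracy:", test_accuracy)
```

Note that PCA is fitted on the training split only and then applied to the test split, so the test data does not influence the components.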