# Lecture 16: Dimensionality Reduction

## Dimensionality Reduction

A mathematical process to reduce the number of random variables to consider

* Reduce the dimension of quantitative data to a more manageable set of variables
* The reduced set can then be used to reveal underlying patterns in the data and/or as input to a model (regression, classification, etc.)

## Dimensionality Reduction - Synergies

* How do you control redundant degrees of freedom in a useful way?
    * **Synergies** - Coordinated movements that couple a system's degrees of freedom together to reduce control complexity
* **Importance** - The human body has massive redundancy for any given task, and because it is compliant, the entire body must be actuated to perform even simple movements

## More Use Cases for Dimensionality Reduction

* Thousands of sensors used to monitor an industrial process
    * By reducing the data from these thousands of sensors to a few features, we can build an interpretable model
    * Goal: Predict process failure from sensors
* Understanding diet around the world
    * Amount of foods eaten among populations across the world
    * Goal: Identify diet similarity among populations
* Identify genetic diversity
    * Determine ancestral origins based on genetic variation
    * Goal: Learn more about our genetic history

## As An Extension of EDA

* Gain insight into a set of data
* Understand how different variables relate to one another

Note: Dimensionality reduction (PCA, for example) can also be used for modeling and prediction

## Methods For Dimensionality Reduction

* Project high-dimensional data into a lower-dimensional space
* Methods
    * PCA - Principal Component Analysis
    * ICA - Independent Component Analysis
    * CCA - Canonical Correlation Analysis
    * Clustering
    * FA - Factor Analysis
    * ... and many others!

## Major Example Methods

* **PCA** - (Linear) Find projections of the data into a lower-dimensional space that capture most of the variation in the data
* **ICA** - (Linear) Separate an additive mixture of independent signals into its individual sources
* **CCA** - (Linear) Look for relationships between two multivariate data sets
* **Clustering** - (Nonlinear) Use machine learning to extract features from the data

## The Big Picture

* Multivariate data usually occupies a lower-dimensional subspace, or a slice that captures most of the features of the data
* The question is, how do we find that slice?
* Typically some sort of multidimensional rotation

## Principal Component Analysis (PCA)

Key Terms:

* **Principal Component (PC)** - A linear combination of the predictor variables
* **Loadings** - The weights that transform the predictors into components
* **Screeplot** - A plot of the variance of each component, showing the components' relative importance

Goal: Combine multiple numeric predictor variables into a smaller set of variables. Each variable in the smaller set is a weighted linear combination of the original set.

This smaller set of variables - the ***principal components (PCs)*** - "explains" most of the variability of the full set of variables... but uses many fewer dimensions to do so.

The **weights (loadings)** used to form the PCs explain the relative contributions of the original variables to the new PCs.

### Simple PCA: Two Predictor Variables ($X_1$ and $X_2$)

For two variables $X_{1}$ and $X_{2}$, there are two principal components $Z_{i}$, with $i = 1$ or $2$:

$$Z_{i} = w_{i,1}X_{1} + w_{i,2}X_{2}$$

$w_{i,1}$ and $w_{i,2}$: weightings (*loadings*)

* Transform the original variables into principal components

$Z_{1}$: The first principal component (PC1)

* The linear combination that best explains the total variance
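The two-variable case can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the data are synthetic stand-ins for two correlated predictors (not the CVX/XOM returns), and the names `W` and `Z` are chosen here for the loading matrix and the component scores.

```python
import numpy as np

# Synthetic stand-in for two correlated predictors X1 and X2 (assumed data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Center each variable so the components describe variation around the mean.
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the 2x2 covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so PC1 (largest variance) comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
W = eigvecs[:, order]  # column i holds the loadings (w_{i,1}, w_{i,2}) of PC i

# Z_i = w_{i,1} X_1 + w_{i,2} X_2 for every observation.
Z = Xc @ W

print("PC1 loadings:", W[:, 0])
print("share of variance per PC:", eigvals / eigvals.sum())
```

The sign of each loading vector is arbitrary, so flipping all signs of a column of `W` describes the same component.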

### Case Study: Stock Price Returns for Chevron (CVX) and ExxonMobil (XOM)

* PC1 and PC2 are the dotted lines on the plot
* Each principal component is orthogonal to the other one

**CLICKER QUESTION**

If you have a dataset of 500 observations and 10,000 variables, how many PCs will be calculated?

A) 2

B) 10

C) 500

**D) 10,000**

**CLICKER QUESTION**

In a dataset with 10,000 variables, which PC explains the most variance?

**A) PC 1**

B) PC 10,000

C) Depends on the analysis

D) All explain the same amount of variance

But... PCA shines when you're dealing with high-dimensional data. So we have to move beyond two predictors to many predictors.

Step 1: Combine all predictors in linear combination.

Step 2: Assign weights so that the first PC ($Z_{1}$) captures as much of the covariation among the predictors as possible (this maximizes the % of total variance explained).

Step 3: Repeat Step 2 to generate a new predictor $Z_{2}$ (the second PC) with different weights. By definition, $Z_{1}$ and $Z_{2}$ are uncorrelated. Continue until you have as many new variables (PCs) as original predictors.

Step 4: Retain as many components as needed to account for *most* of the variance.
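One way to carry out these steps numerically is with a singular value decomposition of the centered data. The sketch below assumes synthetic data (six predictors driven by two hidden factors plus noise); the variable names are illustrative and nothing here comes from the S&P 500 example.

```python
import numpy as np

# Hypothetical data: 6 predictors generated from 2 hidden factors plus noise.
rng = np.random.default_rng(1)
n, p = 300, 6
factors = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = factors @ mixing + 0.1 * rng.normal(size=(n, p))

# Step 1: work with centered predictors; every PC is a linear combination of them.
Xc = X - X.mean(axis=0)

# Steps 2-3: the SVD yields all weight vectors at once. Row i of Vt holds the
# loadings of PC i+1, ordered from most to least variance explained.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T  # one column of scores per PC, as many PCs as predictors

# Share of total variance explained by each PC.
explained = s**2 / np.sum(s**2)

# The PCs are mutually uncorrelated: their covariance matrix is diagonal.
pc_cov = np.cov(Z, rowvar=False)

print("variance explained:", np.round(explained, 3))
print("cumulative:", np.round(np.cumsum(explained), 3))
```

Step 4 then amounts to reading the cumulative line and keeping the leading PCs that account for most of the variance.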

## S&P 500 Data: 5648 Days (1993-2015) x 517 Stocks

In this example, we'll focus on the 16 top companies.

### Screeplot

The vernacular definition of "scree" is an accumulation of loose stones or rocky debris lying on a slope or at the base of a hill or cliff.

In a screeplot, "it is desirable to find a sharp reduction in the size of the eigenvalues (like a cliff), with the rest of the smaller eigenvalues constituting rubble. When the eigenvalues drop dramatically in size, an additional factor would add relatively little to the information already extracted."

### Loading of PCs 1-5

* PC1: Overall stock market trend
* PC2: Price change of energy stocks
* PC3: Movements of Apple and Costco
* PC4: Movements of Schlumberger relative to other stocks
* PC5: Financial companies

### How Many PCs to Select?

* Option 1: Visually through the screeplot
* Option 2: % Variance explained (e.g., 80% variance explained)
* Option 3: Inspect loadings for an intuitive interpretation
* Option 4: Cross-validation
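Option 2 is easy to automate. The helper below is a hypothetical sketch (the name `n_components_for` and the example variances are made up for illustration): it returns the smallest number of leading PCs whose cumulative share of variance reaches a chosen threshold, and prints a crude text screeplot.

```python
import numpy as np

def n_components_for(variances, threshold=0.80):
    """Smallest number of leading PCs whose cumulative variance share reaches threshold."""
    variances = np.sort(np.asarray(variances, dtype=float))[::-1]
    cumulative = np.cumsum(variances) / variances.sum()
    # Index of the first cumulative share >= threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(variances))

# Made-up PC variances (eigenvalues), largest first.
pc_variances = [6.0, 2.5, 1.0, 0.3, 0.2]

# Text screeplot: one bar per PC, with bar length proportional to its variance.
total = sum(pc_variances)
for i, v in enumerate(pc_variances, start=1):
    print(f"PC{i}: {'#' * int(v * 5)} {v / total:.0%}")

print("PCs needed for 80%:", n_components_for(pc_variances, 0.80))
print("PCs needed for 90%:", n_components_for(pc_variances, 0.90))
```

The threshold itself remains a judgment call; 80% is only the example value used above.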

## PCA: Key Ideas

1. PCs are linear combinations of the predictor variables (numeric data only)
2. Calculated to minimize correlation between components (minimizes redundancy)
3. A limited number of components will typically explain most of the variance in the original variables
4. Limited set of PCs can be used in place of original predictors (dimensionality reduction)

* Plotting the top principal components can reveal groupings within your data
* A screeplot can help identify how many PCs to consider; look for the elbow in the plot
* Loadings plot: Shows how much weight each original variable has on each PC
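Key idea 4 can be illustrated by keeping only the top `k` PCs and checking how much of the original data is lost. This is a minimal sketch on synthetic data (six predictors built from two hidden factors, an assumption made purely for the demonstration); `W_k`, `Z_k`, and `lost` are names introduced here.

```python
import numpy as np

# Hypothetical data: 6 correlated predictors that really vary along ~2 directions.
rng = np.random.default_rng(2)
n, p, k = 300, 6, 2
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))

mean = X.mean(axis=0)
Xc = X - mean
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Keep only the loadings of the first k PCs.
W_k = Vt[:k].T   # shape (p, k)
Z_k = Xc @ W_k   # reduced data: k columns instead of p

# Map the reduced data back to the original space to see what was lost.
X_hat = Z_k @ W_k.T + mean
lost = np.sum((X - X_hat) ** 2) / np.sum(Xc ** 2)

print("reduced shape:", Z_k.shape)
print("fraction of variance lost:", round(float(lost), 4))
```

The fraction lost equals the variance share of the discarded PCs, which is why a screeplot or cumulative variance table is enough to judge the cost of dropping components.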

## Case Study: Diet in the UK

If we look back at the raw data from Northern Ireland, the population eats way more fresh potatoes and way fewer fresh fruits, cheese, and fish. This reflects real world geography...

**CLICKER QUESTION**

Which of the following likely explains the fact that Northern Ireland is so far from the other countries in the first principal component?

A) Amount of cereals consumed

**B) Geography**

C) Amount of liquids consumed

D) Genetic differences

## Case Study: Genetics and Geography

Novembre, John, et al. “Genes Mirror Geography within Europe.” Nature, vol. 456, no. 7218, Nov. 2008, pp. 98–101. www.nature.com, https://doi.org/10.1038/nature07331.

### SNP (Single Nucleotide Polymorphism)

* Reminder: Your DNA is made up of four bases: A, T, C, and G
* A SNP is a position in one's DNA that varies between individuals (appears in at least 1% of the population)
    * This results from normal human variation
    * Some contribute to disease, but many are just differences between humans
    * These are used by companies like 23andMe and Ancestry.com

### The Data: 1,387 Europeans x 500,000 SNPs

* Step 1: Measure genotype at 500,000 positions (SNPs) along the genome in 1387 European individuals
* Step 2: Calculate PCs from 500,000 SNPs
* Step 3: Plot PC1 and PC2 (each point is an individual)
* Step 4: Compare to the map of Europe

* PCA on SNP data for European samples reflects geographic location of where samples came from
* PC1 runs roughly North-South, PC2 roughly East-West

**CLICKER QUESTION**

This analysis used 500,000 SNPs from 1,387 individuals. How many PCs would have been calculated?

A) 2

B) 10

C) 1,387

**D) 500,000**

**CLICKER QUESTION**

This analysis used 500,000 SNPs from 1,387 individuals. How many PCs explain geographic differences across Europe by genetic ancestry?

**A) 2**

B) 10

C) 1,387

D) 500,000

**CLICKER QUESTION**

Which of the following is not true?

A) PC1 explains geographic differences from North to South

B) PC2 explains geographic differences from East to West

**C) The French (FR) are not genetically related to the Scottish (SCT)**

D) The French (FR) are more closely related genetically to Germans (DE) than they are to the Finns (GL)

E) The Spanish (ES) and Portuguese (PT) are genetically similar

## Dimensionality Reduction with PCA: Pros and Cons

Pros:

* Helps compress data; reduced storage space
* Reduces computation time
* Helps remove redundant features (if any)
* Identifies outliers in the data

Cons:

* May lead to some amount of data loss
* Tends to find linear correlations between variables, which is sometimes undesirable
* Fails in cases where mean and covariance are not enough to define datasets
* May not know how many principal components to keep
* Highly affected by outliers in the data
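The last point is easy to demonstrate. In the sketch below (synthetic data, with a helper `first_pc` defined here for illustration), the clean data vary mostly along the first variable, yet a single extreme observation is enough to swing PC1 toward the second variable.

```python
import numpy as np

def first_pc(X):
    """Loading vector of PC1, computed from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Clean data: large spread along the first variable, small spread along the second.
rng = np.random.default_rng(3)
clean = np.column_stack([
    rng.normal(scale=3.0, size=200),
    rng.normal(scale=0.5, size=200),
])

# Same data plus one extreme observation along the second variable.
with_outlier = np.vstack([clean, [0.0, 100.0]])

pc1_clean = first_pc(clean)
pc1_outlier = first_pc(with_outlier)

print("PC1 without outlier:", np.round(pc1_clean, 3))
print("PC1 with one outlier:", np.round(pc1_outlier, 3))
```

Because PCA is built on variances, one point far from the rest can dominate the covariance matrix; inspecting or trimming extreme values before running PCA is a common precaution.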