# 3. Dimensionality Reduction - PCA

It can be very challenging to generate hypotheses regarding either single neurons or the population when looking at high-dimensional population activity. Dimensionality reduction techniques can help by giving a low-dimensional summary of the high-dimensional population activity, and thus provide an efficient way to explore and visualise the data.

The goal of this exercise is to learn how to apply PCA to neural data and how to interpret the results. 
We will start by analyzing a relatively simple dataset. 

The dataset was collected by [Graf *et al*, 2011](http://www.nature.com/neuro/journal/v14/n2/full/nn.2733.html).

Details about the dataset:
- Neural activity recorded from 65 V1 neurons using multi-electrode arrays
- The subject was an anesthetized monkey. 
- Stimuli were drifting sinusoidal gratings of 0 and 90 degrees, randomly interleaved.
- Each stimulus lasted 2560 ms. The first 1280 ms consisted of a grating, the second 1280 ms of a blank screen.
- The dataset contains 100 stimulus repetitions.
- The neural activity is quantified by counting the number of spikes in 40 ms time bins. Each stimulus therefore spans 64 time bins (2560/40).
- The dataset you will work with is a small subset of the original dataset.
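
The bookkeeping implied by these numbers can be checked with a short sketch. It uses a synthetic stand-in matrix (`X_fake`, a hypothetical name) rather than the real recording, and it assumes that rows are ordered stimulus by stimulus, as suggested by the exercise below where the first 5 stimuli correspond to the first 320 rows.

```python
import numpy as np

# Numbers taken from the dataset description above.
stim_duration_ms = 2560
grating_duration_ms = 1280
bin_width_ms = 40
n_repeats = 100
n_neurons = 65

bins_per_stim = stim_duration_ms // bin_width_ms    # time bins per stimulus
grating_bins = grating_duration_ms // bin_width_ms  # bins with the grating on
n_rows = n_repeats * bins_per_stim                  # total number of time bins

# Synthetic stand-in for the spike-count matrix (assumption: Poisson counts).
rng = np.random.default_rng(0)
X_fake = rng.poisson(lam=2.0, size=(n_rows, n_neurons))

# Assumption: consecutive blocks of `bins_per_stim` rows belong to one stimulus,
# so the matrix can be viewed as (stimulus, time bin, neuron).
X_trials = X_fake.reshape(n_repeats, bins_per_stim, n_neurons)

print(bins_per_stim, grating_bins, n_rows, X_trials.shape)
```

Viewing the data as a (stimulus, time bin, neuron) array makes it easy to average over repetitions or to separate the grating period from the blank period.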


If there is time left, we will try our hand at the Neuropixels dataset. This tutorial is inspired by exercises from Jonathan Pillow (see homework 1 of the course http://pillowlab.princeton.edu/teaching/statneuro2018/).

In [10]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.io import loadmat 
from sklearn.decomposition import PCA 
from numpy.linalg import eig

### 3.1 Visualize the data

The data consist of a (6400, 65) matrix of binned spike counts. Each column contains the spike counts of one neuron, each row contains the spike counts in one time bin.

**a.**
Plot the population response during the first 5 stimuli (the first 320 rows of `X`). Tip: use `plt.imshow()` to visualise the population response. The responses should be clearly stimulus-locked.
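
As a rough illustration of the plotting call, here is a minimal sketch on synthetic data. The matrix `X_demo` and its firing rates are made up for the example (higher rate while a hypothetical grating is on, lower during the blank), so the picture only shows what stimulus-locking can look like, not what the real recording contains.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_bins, n_neurons, bins_per_stim = 320, 65, 64

# Hypothetical rates: grating on during the first half of each stimulus.
t = np.arange(n_bins)
grating_on = (t % bins_per_stim) < bins_per_stim // 2
rate = np.where(grating_on, 3.0, 0.5)                # shape (320,)
X_demo = rng.poisson(lam=rate[:, None] * np.ones((1, n_neurons)))

# Transpose so that time runs along the x-axis and neurons along the y-axis.
fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(X_demo.T, aspect='auto', origin='lower',
               interpolation='nearest', cmap='gray_r')
ax.set_xlabel('Time bin (40 ms)')
ax.set_ylabel('Neuron')
fig.colorbar(im, ax=ax, label='Spike count')
# With %matplotlib inline the figure renders automatically in a notebook.
```

Setting `aspect='auto'` matters here: without it, `imshow` draws square pixels and a 65 x 320 matrix becomes a thin strip.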


In [11]:
data = loadmat('v1data_Graf2011.mat')
X = data['Msp']
print('Dimensions of X:',X.shape)

# Your code goes here:



Dimensions of X: (6400, 65)


**b.** Plot the responses of neurons 8 and 32 (columns 8 and 32) over the first 5 stimuli. 

Question: What is the main difference between the response properties of neurons 8 and 32?

Answer: Their responses are anti-correlated.
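
Anti-correlation can also be quantified rather than judged by eye. The sketch below builds two hypothetical neurons that respond to different stimuli (one possible reason for anti-correlated responses, assumed here purely for illustration) and computes their Pearson correlation with `np.corrcoef`. On the real data, the same call could be applied to two columns of `X`, for example `X[:320, 8]` and `X[:320, 32]`; note that the exercise does not say whether "neuron 8" is meant in 0-based or 1-based indexing.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
bins_per_stim, n_stim = 64, 5
t = np.arange(bins_per_stim * n_stim)
grating_on = (t % bins_per_stim) < bins_per_stim // 2

# Hypothetical stimulus sequence: alternate between two orientations.
is_orientation_a = (t // bins_per_stim) % 2 == 0

# Each made-up neuron fires strongly for only one of the two orientations.
rate_a = np.where(grating_on & is_orientation_a, 10.0, 0.5)
rate_b = np.where(grating_on & ~is_orientation_a, 10.0, 0.5)
counts_a = rng.poisson(rate_a)
counts_b = rng.poisson(rate_b)

# Pearson correlation between the two spike-count traces.
r = np.corrcoef(counts_a, counts_b)[0, 1]
print('Correlation:', r)

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(t, counts_a, label='neuron A (synthetic)')
ax.plot(t, counts_b, label='neuron B (synthetic)')
ax.set_xlabel('Time bin (40 ms)')
ax.set_ylabel('Spike count')
ax.legend()
```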

### 3.2 Investigate the dimensionality of the data using PCA

Recall that PCA finds a set of activity patterns (principal components), ordered by the amount of variance in the data that they explain. Mathematically, the principal components are the eigenvectors of the covariance matrix $X^T X/(n-1)$, where $X$ is the mean-centered data matrix (each neuron's mean spike count subtracted) and $n$ is the number of time bins. The variance captured by each component is given by the corresponding eigenvalue. In practice, we don't have to work with eigenvectors directly but can use the class `sklearn.decomposition.PCA`. Use the method `fit` and the attribute `pca.explained_variance_ratio_` to answer the following questions.
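
The link between the eigendecomposition and `sklearn.decomposition.PCA` can be checked numerically. The following is a minimal sketch on synthetic data (`X_demo` is a made-up low-rank matrix plus noise). It uses `numpy.linalg.eigh` instead of `eig`, because the covariance matrix is symmetric and `eigh` returns real eigenvalues; that choice is ours, not part of the exercise.

```python
import numpy as np
from numpy.linalg import eigh
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Synthetic data: 3 latent signals mixed into 10 "neurons", plus a little noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Covariance matrix of the mean-centered data.
Xc = X_demo - X_demo.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)

# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(n_components=None).fit(X_demo)

# The eigenvalues equal the variance explained by each component.
same_variance = np.allclose(pca.explained_variance_, eigvals)

# Eigenvectors are only defined up to sign, so compare absolute dot products.
dots = np.abs(np.sum(pca.components_ * eigvecs.T, axis=1))
print(same_variance, dots[:3])
```

The sign ambiguity is worth remembering when interpreting a PC: flipping the sign of a component (and of the corresponding projection) describes exactly the same structure.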

**a.**
Fit PCA to the spike count data. Next, visualize the dimensionality of the data by making two figures.
The first figure should show the fraction of variance explained by each PC. The second figure should show the cumulative sum of the fraction of variance explained. Note that the x-axis of both figures should read 'PCs'.
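
A possible shape for these two figures is sketched below on synthetic data, so it does not reveal the result for the real recording. The matrix `X_demo` and the helper `n_components_for` are hypothetical names introduced for this example; the helper reads off how many PCs are needed to reach a given cumulative fraction of variance.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Synthetic data: 20 independent "neurons" with decreasing variance.
scales = 1.0 / np.arange(1, 21)
X_demo = rng.normal(size=(1000, 20)) * scales

pca = PCA(n_components=None).fit(X_demo)
ratio = pca.explained_variance_ratio_       # fraction of variance per PC
cumulative = np.cumsum(ratio)               # running total over PCs
pcs = np.arange(1, len(ratio) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(pcs, ratio, 'o-')
ax1.set_xlabel('PCs')
ax1.set_ylabel('Fraction of variance explained')
ax2.plot(pcs, cumulative, 'o-')
ax2.set_xlabel('PCs')
ax2.set_ylabel('Cumulative fraction of variance')


def n_components_for(cumulative, threshold):
    """Smallest number of PCs whose cumulative variance reaches `threshold`."""
    return int(np.argmax(cumulative >= threshold) + 1)


n50 = n_components_for(cumulative, 0.5)
n90 = n_components_for(cumulative, 0.9)
print('PCs for 50%:', n50, ' PCs for 90%:', n90)
```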

In [12]:
from sklearn.decomposition import PCA
# Create a PCA object.
# With n_components=None all components are kept, so the data are not reduced yet.
pca = PCA(n_components=None) 

# Your code goes here:


Question: How many components are needed to account for 50% of the variance in the data? And for 90%?

Answer:

**3.**
Each principal component (PC) is a vector of length equal to the number of neurons. A PC can therefore be interpreted as an activity pattern, where the $i$th element of the PC describes the deviation of neuron $i$ from its mean rate (PCA explains variance, that is, average squared deviation from the mean).

Plot the first PC (the PCs are stored as the rows of `pca.components_`). By definition, this is the single activity pattern that explains the most variance in the data.
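
The sketch below shows one way to plot a PC as a bar chart with one bar per neuron. It uses synthetic data in which all neurons are driven by one shared signal with positive weights (an assumption made up for this example), so the first PC ends up with elements of a single sign. Whether the real data behave this way is for you to check.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_bins, n_neurons = 1000, 20

# Synthetic data: one shared signal scaled by a positive weight per neuron.
shared = rng.normal(size=n_bins)
weights = rng.uniform(0.5, 1.5, size=n_neurons)
X_demo = shared[:, None] * weights[None, :] \
    + 0.3 * rng.normal(size=(n_bins, n_neurons))

pca = PCA(n_components=None).fit(X_demo)
pc1 = pca.components_[0]                    # first row = first PC

fig, ax = plt.subplots(figsize=(8, 3))
ax.bar(np.arange(n_neurons), pc1)
ax.axhline(0, color='k', linewidth=0.5)
ax.set_xlabel('Neuron')
ax.set_ylabel('Weight in PC 1')

# Each PC has unit length; its overall sign is arbitrary.
print('Norm of PC 1:', np.linalg.norm(pc1))
```

Because the overall sign of a PC is arbitrary, what matters is whether the elements share a sign with each other, not whether they are positive or negative.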

Question:
 What do you notice about the sign of its elements? What does this tell you about the dominant activity pattern?

**4.** Plot the second PC. How do the values for neurons 8 and 32 (the neurons you previously looked at) compare?

**5.** Use the method `pca.transform` to transform the data. The result is again a (6400, 65) matrix. The first column contains the projection of the neural activity onto the first PC. This vector of length 6400 measures how strongly the population activity expresses the first PC in each time bin. Next, make a scatter plot of the projection onto the first PC against the projection onto the second PC.
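
To make concrete what `pca.transform` computes, the following sketch (on a made-up matrix `X_demo`) checks that it is the mean-centered data multiplied by the transposed PCs, and then draws the scatter plot of the first two projections. It assumes the default `whiten=False`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)

# Synthetic data: 2 latent signals mixed into 8 "neurons", plus noise.
latent = rng.normal(size=(400, 2))
mixing = rng.normal(size=(2, 8))
X_demo = latent @ mixing + 0.2 * rng.normal(size=(400, 8))

pca = PCA(n_components=None).fit(X_demo)
Z = pca.transform(X_demo)                   # shape (400, 8), one column per PC

# transform = subtract the fitted mean, then project onto each PC.
Z_manual = (X_demo - pca.mean_) @ pca.components_.T
print('Same result:', np.allclose(Z, Z_manual))

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(Z[:, 0], Z[:, 1], s=5)
ax.set_xlabel('Projection onto PC 1')
ax.set_ylabel('Projection onto PC 2')
```

Each point in the scatter plot is one time bin, so clusters or separate branches in this plot correspond to distinct population states.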

Question:
Can you speculate on what is going on here?

**6.**
Plot the projections onto PC 1 and PC 2 for the first 320 time bins (the first 5 stimuli) as a function of time to work out what the first two PCs could represent.