# Homework 08 - PCA and Clustering (60 pts)

---
### EEG dataset

The following cell will load and parse an EEG dataset for you.

The dataset is located in the time series lecture folder. If you saved it elsewhere, change the path below so that it points to your copy of the sampleEEGdata.mat file.

In [2]:
# this cell loads a bunch of EEG data for you
# you don't have to edit it, just run it to get the data
import numpy as np
from scipy.io import loadmat

# data file is in the time series lecture folder
data = loadmat('lecture_25_time_series/sampleEEGdata.mat')

# grab relevant data, EEG units are microvolts (uV)
# each channel is an electrode, each trial is a separate EEG recording from that electrode
eeg_uV = data["EEG"][0,0]["data"]  # [channel, time, trial]
time_ms = data["EEG"][0,0]["times"][0]
samplefreq_Hz = float(data["EEG"][0,0]["srate"][0, 0])  # index down to a scalar before converting
dt_ms = time_ms[1] - time_ms[0]
n_channels = eeg_uV.shape[0]
n_pts = eeg_uV.shape[1]
n_trials = eeg_uV.shape[2]

# put all EEG waveforms from all channels and all trials into one big matrix
# rows are waveforms, columns are time points
n_eegs = n_channels * n_trials
all_eegs_uV = np.zeros((n_eegs, n_pts))
avg_eegs_uV = np.zeros((n_channels, n_pts))  # average EEG per channel
channels = np.zeros((n_eegs,), dtype=int)
i = 0
for c in range(n_channels):
    avg_eegs_uV[c, :] = eeg_uV[c, :, :].mean(axis=1)
    for t in range(n_trials):
        all_eegs_uV[i, :] = eeg_uV[c, :, t]
        channels[i] = c
        i += 1

all_eegs_uV.shape, avg_eegs_uV.shape, eeg_uV.shape, time_ms.shape, samplefreq_Hz, dt_ms

((6336, 640), (64, 640), (64, 640, 99), (640,), 256.0, 3.90625)
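
As a side note, the nested loop above can also be written without loops. The following is a minimal sketch on a small synthetic array (the sizes and the `eeg_demo` name are made up for illustration, not taken from the dataset); it checks that a transpose followed by a reshape produces the same row ordering as the loop.

```python
import numpy as np

# Synthetic stand-in for eeg_uV with a small [channel, time, trial] shape (assumed sizes).
rng = np.random.default_rng(0)
n_channels, n_pts, n_trials = 4, 10, 3
eeg_demo = rng.normal(size=(n_channels, n_pts, n_trials))

# Loop version, following the same logic as the loading cell.
n_eegs = n_channels * n_trials
loop_eegs = np.zeros((n_eegs, n_pts))
loop_channels = np.zeros((n_eegs,), dtype=int)
i = 0
for c in range(n_channels):
    for t in range(n_trials):
        loop_eegs[i, :] = eeg_demo[c, :, t]
        loop_channels[i] = c
        i += 1

# Vectorized version: reorder to [channel, trial, time], then merge the first two axes.
vec_eegs = eeg_demo.transpose(0, 2, 1).reshape(n_eegs, n_pts)
vec_channels = np.repeat(np.arange(n_channels), n_trials)
vec_avg = eeg_demo.mean(axis=2)  # average over trials, one row per channel

print(np.array_equal(loop_eegs, vec_eegs), np.array_equal(loop_channels, vec_channels))
```

Either form is fine here; the loop is arguably easier to read, while the vectorized form avoids index bookkeeping.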

In [3]:
import pickle

with open("avg_eegs_uV.dat", "rb") as f:
    time_ms, avg_eegs_uV = pickle.load(f)

---
### 1. (5 pts) Plot the average EEG recording from the first and last electrode channel. Use the **avg_eegs_uV** variable.

Each row of the variable **avg_eegs_uV** is the average EEG recording from one electrode, so the recordings for the first and last electrodes are in the first and last rows of this variable.

---
### 2. (5 pts) Perform Principal Component Analysis on the average EEG recordings from all electrodes such that 95% of the variability in the data is explained. Report the number of principal components required.

Note that each EEG recording consists of 640 time points and is therefore treated as a single point in a 640-dimensional space.
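
To illustrate what "explaining 95% of the variability" means, here is a minimal sketch that computes PCA from scratch with a singular value decomposition. It runs on hypothetical synthetic data (three made-up latent waveforms plus small noise), not on the EEG recordings, and all variable names are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 64 rows of 640 samples built from 3 latent waveforms plus small noise.
n_rows, n_cols, n_latent = 64, 640, 3
latent = rng.normal(size=(n_latent, n_cols))
weights = rng.normal(size=(n_rows, n_latent))
X = weights @ latent + 0.01 * rng.normal(size=(n_rows, n_cols))

# Center each column (time point), then take the SVD of the centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, as a fraction of the total variance.
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative fraction reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, cumulative[:n_keep])
```

The rows of `Vt` are the principal components, ordered from largest to smallest explained variance, so keeping the first `n_keep` rows gives the reduced basis.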

---
### 3. (5 pts) Transform the average EEGs into the lower dimensional PCA coordinates. Report the shape of the transformed EEG matrix and describe in your own words what the matrix represents.

---
### 4. (5 pts) Plot the fraction of the explained variance versus the number of principal components. How much of the variability in the data is explained by the first two principal components?

---
### 5. (5 pts) Make a scatter plot of the PCA transformed EEGs projected onto the first two principal components (i.e. the scatter plot axes should be the first two principal components). Describe in your own words what each point on the scatter plot represents.

---
### 6. (5 pts) Plot the first three principal components. Make sure to label the $x$-axis of the plot. Describe in your own words what the principal components represent.

---
### 7. (5 pts) Transform back from PCA coordinates to the original EEG coordinates. Report the shape of the inverse transformed matrix and describe in your own words what the matrix represents.
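
The relationship between the forward and inverse transforms can be sketched as two matrix products. The example below uses random data with the same shape as **avg_eegs_uV** and an arbitrary choice of `k = 5` components; both are assumptions for illustration only, and the SVD-based PCA stands in for whatever PCA routine you use.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 640))  # hypothetical data, same shape as avg_eegs_uV
k = 5                           # arbitrary number of components for illustration

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:k]                    # (k, 640): each row is one principal component

scores = (X - mean) @ components.T     # forward transform, shape (64, k)
X_approx = scores @ components + mean  # inverse transform, shape (64, 640)

# Fraction of the centered data's variance lost by keeping only k components.
lost = np.sum((X - X_approx) ** 2) / np.sum((X - mean) ** 2)
print(scores.shape, X_approx.shape, lost)
```

The inverse transform returns to the original number of columns, but the result is only an approximation unless all components are kept.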

---
### 8. (5 pts) Plot the average EEG recording for the first electrode overlaid with its approximation from the PCA.

---
### 9. (10 pts) Use a Gaussian Mixture Model (GMM) to cluster the average EEG recordings based on their PCA transformed representations. Repeat for 1 to 64 clusters and plot the Bayesian Information Criterion (BIC) versus number of clusters. Report the optimal number of clusters.
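
As a reminder of how BIC-based model selection works, here is a minimal sketch that assumes scikit-learn is available and uses its `GaussianMixture` class. The data are hypothetical 2-D blobs, not the EEG recordings, and the range of candidate cluster counts is shortened for illustration. A lower BIC indicates a better trade-off between fit and model complexity.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data: three well-separated blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

candidate_counts = np.arange(1, 7)
bic = []
for n in candidate_counts:
    # Fit one GMM per candidate cluster count and record its BIC on the same data.
    gmm = GaussianMixture(n_components=int(n), random_state=0)
    gmm.fit(points)
    bic.append(gmm.bic(points))

# The candidate with the lowest BIC is taken as the preferred number of clusters.
best_n = int(candidate_counts[np.argmin(bic)])
print(best_n)
```

Fixing `random_state` makes the fits reproducible, since GMM results can depend on initialization.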

---
### 10. (5 pts) Use the optimal number of clusters above to recluster using a GMM. Make a scatter plot of the PCA transformed EEGs projected onto the first two principal components (i.e. the scatter plot axes should be the first two principal components). Color the points by their cluster index.

---
### 11. (5 pts) For each of the first two clusters, make a plot that overlays all of the PCA-approximated EEGs belonging to that cluster.

Note that EEGs within clusters share similar features, whereas EEGs from different clusters have different features.