# Week 8 - Dimensionality Reduction

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
In this lab, you'll get a chance to experiment with library implementations of the dimensionality reduction techniques discussed in this week's lectures.

We will be performing dimensionality reduction on gene-expression data from different tissues of fruit flies.
</div>

## Setup

### Installation notes

To run this notebook you will need to install several packages. These can be installed via pip.
```bash
pip install --upgrade pip
pip install pandas scikit-learn umap-learn matplotlib
```

> **Note** <br>
> Ensure pip is updated to 23.2.1 or later, otherwise you may experience errors. 

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
import umap.umap_ as umap  # You can ignore the warnings about Numba
import matplotlib.pyplot as plt

# Set default figure size to a larger size
plt.rcParams['figure.figsize'] = [8, 8]

## Data

We will be using microarray gene-expression data from [FlyAtlas](http://flyatlas.org/atlas.cgi), part of [NCBI's Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/). The dataset includes gene expression information from a variety of cell types.

If it is not already present in your repository, the following code will download the raw data. It may take a few moments to download.

```bash
# create data directory & cd into folder
mkdir -p data
cd data

# download via wget
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE7nnn/GSE7763/matrix/GSE7763_series_matrix.txt.gz -O flydata.txt.gz

# or download via curl
curl ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE7nnn/GSE7763/matrix/GSE7763_series_matrix.txt.gz -o flydata.txt.gz

# decompress 
gunzip flydata.txt.gz
```

The following cell loads the `flydata.txt` file you downloaded from GEO into a `pandas` data frame, ignoring any header lines which start with '!'. <br>
It then reads the `!Sample_title` header line to assign meaningful names to the data frame columns.

In [2]:
# Import data to Pandas dataframe
expression = pd.read_csv("data/flydata.txt", sep="\t", comment="!", index_col=0)

# Add Column names
with open("data/flydata.txt", 'r') as fp:
    for line in fp:
        if line.startswith("!Sample_title"):
            # Drop the trailing newline first, otherwise the closing quote
            # on the last sample title is not stripped
            fields = line.rstrip("\n").split("\t")[1:]
            header = [x.strip('"') for x in fields]
            expression.columns = header
            break


### Inspect the data

Output the data frame and see what we have. It should have microarray probe IDs as row names (these can be mapped to gene names, but we will skip that for today) and sample names as column names.

The data frame has 18952 rows (measurements) and 136 columns (samples) so it is certainly high dimensional.

In [None]:
expression.head()

These 136 columns represent 4 replicates each from 34 different tissue types.

### Get Cell Type Labels

The following code snippet uses regular expressions to remove the replicate name from each sample, so we can use these labels as categories for plotting later.

In [None]:
sample_categories = [
    re.match(r'(.+?)(( biological)? rep\d+)', c).group(1) 
    for c in expression.columns
]
sample_categories[:10]
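
To see what the regular expression above does, here is a minimal sketch applied to a few sample titles. The titles in `example_titles` are made up for illustration and are not taken from the dataset; only the pattern is the one used above.

```python
import re

# Hypothetical sample titles, invented for illustration only
example_titles = [
    "Brain biological rep1",
    "Brain biological rep2",
    "Midgut rep3",
]

# Group 1 lazily captures everything before the replicate suffix,
# which is either " biological repN" or just " repN"
pattern = re.compile(r'(.+?)(( biological)? rep\d+)')
labels = [pattern.match(title).group(1) for title in example_titles]
print(labels)  # ['Brain', 'Brain', 'Midgut']
```

Because the first group is lazy (`.+?`), it stops as soon as the replicate suffix can match, so the optional " biological" ends up in the suffix rather than in the label.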

### Transforming Expression Data

Before plotting the expression data, we will take the log of the expression values. <br>
**Taking logs** of count data is very common practice in data science, especially in biology. **Let's see why:**


In [None]:
# Plot original transcript counts
_ = plt.hist(expression.values.flatten(), bins=200)

In [None]:
# Plot log counts
# (add 1 first so that any zero values do not produce -inf)
_ = plt.hist(np.log10(expression + 1).values.flatten(), bins=200)

There are a small number of very highly expressed genes in our dataset. 

Variation in these genes may have a disproportionately large effect on attempts to find structure in our data (highly expressed genes will have higher variances).

By log transforming our data we can map expression values to an approximately normal distribution.  
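
The effect of the log transform can be illustrated on synthetic data. The sketch below draws right-skewed values from a log-normal distribution (an assumption chosen purely for illustration, not a claim about the FlyAtlas data) and compares the skewness before and after taking logs. The `skewness` helper is hypothetical and defined here only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed "expression" values; not real data
values = rng.lognormal(mean=5.0, sigma=1.5, size=10_000)
log_values = np.log10(values + 1)

def skewness(x):
    """Sample skewness: the third standardised moment."""
    centred = x - x.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

print(f"skewness before log: {skewness(values):.2f}")
print(f"skewness after log:  {skewness(log_values):.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, which is what we would like to see after the transform.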

### Plotting Expression Data

The following function will render a two-dimensional scatterplot which is coloured by the list of categories. We will use it for PCA, MDS, tSNE, and UMAP visualizations.

In [None]:
def plot_two_dimensions(data, categories):
    """Scatter the first two columns of `data`, one colour per category."""
    categories = pd.Series(categories)
    fig, ax = plt.subplots()
    for category in categories.unique():
        # Boolean mask selecting the rows that belong to this category
        mask = (categories == category).to_numpy()
        ax.scatter(data[mask, 0], data[mask, 1], label=category)
    # Place the legend outside the plot, at x=1.05
    # (where the plot runs from 0 to 1)
    ax.legend(loc=(1.05, 0))

### Dimensionality Reduction with PCA

Here is an intro video on PCA: [StatQuest: Principal Component Analysis (PCA)](https://www.youtube.com/watch?v=FgakZw6K1QQ)

The following code performs PCA on the dataset. `log_expression.values` extracts the values in the data frame as a matrix, and `.T` takes the transpose of that matrix (swaps rows and columns) so that each row is a sample and each column is a probe, which is the layout scikit-learn expects.

In [None]:
# Create log transformed expression values
log_expression = np.log(expression + 1)

In [None]:
pca = PCA()
expression_pca = pca.fit_transform(log_expression.values.T)
plot_two_dimensions(expression_pca, sample_categories)
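
To make the role of the transpose concrete, here is a small sketch on a synthetic matrix with the same orientation as our expression table (probes as rows, samples as columns). The array `toy` and its dimensions are assumptions made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the expression table: 50 probes (rows) x 6 samples (columns)
toy = rng.normal(size=(50, 6))

pca_toy = PCA(n_components=2)
# Transpose so that each row is one sample, as scikit-learn expects
embedding = pca_toy.fit_transform(toy.T)

print(toy.T.shape)      # (6, 50)
print(embedding.shape)  # (6, 2)
```

Each of the 6 samples ends up as one point with two coordinates, which is the shape `plot_two_dimensions` works with.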

This code prints out the proportion of variance explained by each of the first five components, followed by the total number of components.

In [None]:
print(pca.explained_variance_ratio_[:5])
len(pca.explained_variance_ratio_)

This code makes a plot of the explained variance by component, like we saw on one of the lecture slides.

In [None]:
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, 'o-')
plt.xlabel("principal component")
plt.ylabel("variance explained")

## Exercise 1: PCA
We've seen that the amount of additional explained variance diminishes as we add more components.

How many components do you think are worth keeping for our analysis?

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Calculate the number of components required to explain at least 90% of the variance in our data.

A common rule of thumb is to keep as many components as are needed to cover 90% of the sample variance.

To find that number, take the cumulative sum of the explained variance until it reaches at least 90%, and report how many components were needed.
    
Hint: You can find the variance explained by each component with `pca.explained_variance_ratio_`
</div>
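
As a rough illustration of the cumulative-sum idea, the sketch below works on a small array of invented variance ratios rather than on `pca.explained_variance_ratio_`. The values in `ratios` are hypothetical; applying the same idea to the real PCA result is left for the exercise.

```python
import numpy as np

# Hypothetical explained-variance ratios, invented for illustration
ratios = np.array([0.50, 0.25, 0.10, 0.08, 0.04, 0.03])

cumulative = np.cumsum(ratios)
# argmax returns the index of the first True; add 1 to turn it into a count
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative)
print(n_needed)  # 4
```

Here the first three components cover 85% of the variance, so a fourth is needed to pass the 90% threshold.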

In [None]:
# Calculate the number of components required to explain at least 90% of the variance in our data.

# YOUR CODE HERE
raise NotImplementedError


## Note: The fit_transform() syntax 

In the remaining exercises we will perform dimensionality reduction using MDS, tSNE, and UMAP.

Similar to PCA, the packages we will use all follow the <small>`.fit_transform(matrix)`</small> syntax. <br>
The general syntax can be seen below. 

<div style='font-size: 18px'>

```python
reducer = PCA(n_components=2)  # or MDS(...), TSNE(...), umap.UMAP(...)
result = reducer.fit_transform(log_expression.values.T)
plot_two_dimensions(result, sample_categories)
```

</div>

Each reduction technique has a different set of parameters aside from <small>`n_components`</small>. <br>
For example, you can supply a value for the <small>`perplexity`</small> parameter to TSNE. 
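
As a minimal sketch of passing an extra parameter, the example below runs `TSNE` with an explicit `perplexity` on synthetic data. The array `toy_samples` and the chosen values are assumptions for illustration only; they are not recommended settings for the fly dataset.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic data: 40 samples x 20 features, standing in for log_expression.values.T
toy_samples = rng.normal(size=(40, 20))

# perplexity must be smaller than the number of samples
reducer = TSNE(n_components=2, perplexity=5, random_state=0)
toy_result = reducer.fit_transform(toy_samples)

print(toy_result.shape)  # (40, 2)
```

Setting `random_state` makes the embedding reproducible between runs, which helps when comparing the effect of different parameter values.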

## Exercise 2: Multidimensional Scaling (MDS)


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating an MDS plot using `MDS()`, which we imported from `sklearn.manifold`. All scikit-learn models use a consistent syntax, so the syntax is extremely similar to that for `PCA()`.

This should produce a plot similar to the PCA one (possibly rotated or reflected), as we are not changing the MDS distance metric from the default, Euclidean distance.

Examine the documentation either online or just using `help(MDS)` in the notebook.

</div>

StatQuest intro to [MDS and PCoA](https://youtu.be/GEn-_dAyYME)

In [None]:
# YOUR CODE HERE
raise NotImplementedError


## Exercise 3: tSNE

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating a tSNE plot using `TSNE()`, which we imported from `sklearn.manifold`. All scikit-learn models use a consistent syntax, so the syntax is extremely similar to that for `PCA()`.

Examine the documentation either online or just using `help(TSNE)` in the notebook.

`TSNE()` takes several parameters: the most important is `perplexity`. Lower values of perplexity try hard to preserve local structure at the cost of global structure, and vice versa. From the documentation, what is the default value of `perplexity`? What happens if you redo your plot with it set to a much lower or much higher value?
</div>

Helpful resources for learning tSNE:

[StatQuest: t-SNE, Clearly Explained](https://www.youtube.com/watch?v=NEaUSP4YerM)

[t-SNE interactive settings](https://distill.pub/2016/misread-tsne/)

[Datacamp: t-SNE Tutorial](https://www.datacamp.com/community/tutorials/introduction-t-sne)

In [None]:
# YOUR CODE HERE
raise NotImplementedError


## Exercise 4: UMAP

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating a UMAP plot using `umap.UMAP()`, which is available through the `umap` module we imported at the top of the notebook. `umap` is not part of scikit-learn, but it deliberately uses a similar syntax.

Examine the documentation either online or just using `help(umap.UMAP)` in the notebook.

Look at the available parameters in the documentation. Try varying `n_neighbors` (which has a conceptual similarity to tSNE's `perplexity`) and `min_dist`.
</div>

[UMAP Uniform Manifold Approximation and Projection for Dimension Reduction](https://www.youtube.com/watch?v=nq6iPZVUxZU) video.

In [None]:
# YOUR CODE HERE
raise NotImplementedError
