# Week 9 - Dimensionality Reduction

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
In this lab, you'll get a chance to experiment with library implementations of the dimensionality reduction techniques discussed in this week's lectures.

We will be performing dimensionality reduction on gene-expression data from different tissues of fruit flies.
</div>

## Setup

### Installation notes

To run this notebook you will need to install several packages.

These can be installed via Conda:
```bash
conda install pandas scikit-learn
conda install -c conda-forge umap-learn altair vega
```

or with pip:

```bash
pip install pandas scikit-learn umap-learn altair vega
```

In [None]:
import os
import requests
from IPython.display import HTML

In [None]:
# Load stylesheet
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90014/main/data/2023/style/custom.css').text)

In [None]:
%matplotlib inline

In [None]:
import os.path
import pandas as pd
import numpy as np
import altair
import gzip
import re
from urllib.request import urlretrieve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, MDS
from umap import UMAP # You can ignore the warnings about Numba

In [None]:
import matplotlib.pyplot as plt
# Set default figure size to a larger size
plt.rcParams['figure.figsize'] = [10, 10]

## Data

We will be using microarray gene-expression data from [FlyAtlas](http://flyatlas.org/atlas.cgi), available through [NCBI's Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/) as series GSE7763. The dataset includes gene expression information from a variety of tissue types.


If it is not already present in your repository, the following code will download the raw data. It may take a few moments to download.

In [None]:
if not os.path.exists('flydata.txt.gz'):
    urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE7nnn/GSE7763/matrix/GSE7763_series_matrix.txt.gz",
                filename="flydata.txt.gz")

The following two cells will open the compressed file you downloaded from GEO into a `pandas` data frame, then reopen the file and parse out the sample title line in order to use this as the column names for the data frame.
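
To make the header parsing concrete, here is a minimal sketch run on a made-up `!Sample_title` line (the sample names are invented for illustration and are not taken from the real file). It splits on tabs, drops the first field, and removes the surrounding quotes.

```python
# A hypothetical line in the style of a GEO series matrix header
line = '!Sample_title\t"Brain rep1"\t"Brain rep2"\n'

# Drop the trailing newline first so the last title is cleaned properly,
# then split on tabs, skip the "!Sample_title" field, and strip the quotes
header = [x.strip('"') for x in line.rstrip("\n").split("\t")[1:]]
print(header)
```

Removing the newline before splitting matters because `strip('"')` only removes quote characters, so a newline left on the final field would keep its closing quote in place.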

In [None]:
# Import data to Pandas dataframe
with gzip.open("flydata.txt.gz") as handle:
    expression = pd.read_csv(handle, sep="\t", comment="!", index_col=0)

In [None]:
# Add Column names
with gzip.open("flydata.txt.gz") as handle:
    for line in handle:
        line = line.decode("utf-8")
        if line.startswith("!Sample_title"):
            # Strip the trailing newline first so the last title is cleaned too
            header = [x.strip('"') for x in line.rstrip().split("\t")[1:]]
            expression.columns = header
            break

### Inspect the data

Display the data frame and see what we have. It should have microarray probe IDs as row names (these can be mapped to gene names, but we will skip that for today) and sample names as column names.

The data frame has 18,952 rows (probes) and 136 columns (samples), so it is certainly high-dimensional.

In [None]:
expression.head()

These 136 columns represent 4 replicates each from 34 different tissue types.

### Get Tissue Type Labels

The following code snippet removes the replicate name from each sample, so we can use these labels as categories for plotting later.
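
As a rough illustration of how the regular expression works, the sketch below applies the same pattern to a few made-up sample titles (these names are assumptions, not the real column names). The lazy group `(.+?)` captures everything before the optional ` biological` and the ` repN` suffix.

```python
import re

# Hypothetical sample titles, invented for illustration
titles = ["Brain biological rep1", "Brain biological rep2", "Head rep1"]

pattern = re.compile(r"(.+?)(( biological)? rep\d+)")
labels = [pattern.match(t).group(1) for t in titles]
print(labels)  # ['Brain', 'Brain', 'Head']
```

Note that `match` returns `None` for a title that does not follow this naming scheme, so `.group(1)` would raise an `AttributeError` in that case.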

In [None]:
# Use a raw string so that \d is not treated as an invalid escape sequence
sample_categories = [re.match(r'(.+?)(( biological)? rep\d+)', c).group(1)
                     for c in expression.columns]

In [None]:
sample_categories

### Transforming Expression Data

Before plotting the expression data, it is common practice to log-transform the expression values.
**Taking logs** of skewed measurement data is very common in data science, especially in biology. **Let's see why:**


In [None]:
# Plot original (untransformed) expression values
_ = plt.hist(expression.values.flatten(), bins=200)

In [None]:
# Plot log-transformed expression values (adding 1 avoids log(0))
_ = plt.hist(np.log10(expression + 1).values.flatten(), bins=200)

There are a small number of very highly expressed genes in our dataset. Variation in these genes may have a disproportionately large effect on attempts to find structure in our data, because highly expressed genes tend to have higher variances. Log-transforming the data compresses this long right tail and maps the expression values to a more symmetric, approximately normal distribution.
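
The effect can be checked numerically on synthetic data. The sketch below is only an illustration: it draws made-up, log-normally distributed values (an assumption standing in for real expression data) and compares their skewness before and after the transform, using a small hand-written `skewness` helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for expression measurements
values = rng.lognormal(mean=5.0, sigma=1.5, size=10_000)
logged = np.log10(values + 1)

def skewness(x):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centred = x - x.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

print("skewness before:", skewness(values))
print("skewness after: ", skewness(logged))
```

A skewness close to zero indicates a roughly symmetric distribution, which is what we expect to see after the log transform.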

### Plotting Expression Data

The following functions render a two-dimensional scatterplot whose points are coloured according to a list of categories. We will use them for PCA, MDS, tSNE, and UMAP visualizations.

The first function uses the Altair plotting library, which is interactive and lets us mouse over the points to see the sample names. To use it, you must install Altair as described in the installation notes.

If you have trouble with Altair, you can use the second function below instead, which only requires matplotlib.

In [None]:
# Here is a plotting function that uses Altair
# You can interact with the plot by mousing over
# the data points
def plot_two_dimensions(data, categories, reps):
    df = pd.DataFrame(data)
    df.columns = ['Dim{}'.format(n) for n in range(1,data.shape[1]+1)]
    df['Category'] = categories
    df['Sample'] = reps
    chart = altair.Chart(df).mark_circle().\
                encode(x='Dim1',y='Dim2',color='Category',tooltip='Sample')
    return chart

In [None]:
# Here is a plotting function that uses just matplotlib

########################################################
### Use this instead if you have trouble with Altair ###
########################################################

def plot_two_dimensions_mpl(data, categories):
    categories = pd.Series(categories)
    fig, ax = plt.subplots()
    for category in categories.unique():
        # Boolean mask (as a NumPy array) selecting this category's samples
        mask = (categories == category).values
        ax.scatter(data[mask, 0], data[mask, 1], label=category)
    # Place the legend outside the plot, at x=1.05
    # (where the plot runs from 0 to 1)
    ax.legend(loc=(1.05, 0))

### Dimensionality Reduction with PCA

Here is an intro video on PCA: [StatQuest: Principal Component Analysis (PCA)](https://www.youtube.com/watch?v=FgakZw6K1QQ)

The following code performs PCA on the dataset. `log_expression.values` extracts the values in the data frame as a NumPy array. `.T` takes the transpose of that array (swaps rows and columns), so that each row is a sample and each column is a probe, which is the layout scikit-learn expects.
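
To see why the transpose is needed, here is a small sketch on a made-up matrix (random numbers, with hypothetical sample names) laid out like our data, with probes as rows and samples as columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy matrix: 50 probes (rows) x 6 samples (columns)
toy = pd.DataFrame(rng.normal(size=(50, 6)),
                   columns=[f"sample{i}" for i in range(1, 7)])

# scikit-learn treats rows as observations, so put samples on the rows
samples_as_rows = toy.values.T
toy_pca = PCA(n_components=2).fit_transform(samples_as_rows)

print(samples_as_rows.shape)  # (6, 50)
print(toy_pca.shape)          # (6, 2)
```

Without the transpose, PCA would instead produce one point per probe rather than one point per sample.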

In [None]:
# Create log transformed expression values
log_expression = np.log(expression + 1)

In [None]:
pca = PCA(n_components=4)
expression_pca = pca.fit_transform(log_expression.values.T)
plot_two_dimensions(expression_pca, sample_categories, expression.columns)

This code prints the proportion of variance explained by each component.

In [None]:
print(pca.explained_variance_ratio_)
len(pca.explained_variance_ratio_)

This code makes a plot of the explained variance by component, like we saw on one of the lecture slides.

In [None]:
# Number the components from 1 for the x-axis
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, 'o-')
plt.xlabel("principal component")
plt.ylabel("variance explained")

Now re-run the PCA with a higher number of dimensions (try 100) and see how the plot of variance explained changes. 

## Exercise 1: PCA
We've seen that the amount of additional explained variance diminishes as we add more components.

How many components do you think are worth keeping for our analysis?

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Calculate the number of components required to explain at least 90% of the variance in our data.

A common rule of thumb is to keep as many components as are needed to cover 90% of the variance in the data.

We will just take the cumulative sum until we have over 90% explained variance, and report how many components we needed.
    
Hint: You can find the variance explained by each component with `pca.explained_variance_ratio_`
</div>
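
As a hint, the cumulative-sum idea can be sketched on a short, made-up list of ratios (these numbers are invented and are not the values for our data). You will need to adapt it to the real `pca.explained_variance_ratio_` from a PCA fitted with enough components.

```python
import numpy as np

# Made-up explained-variance ratios, for illustration only
ratios = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

cumulative = np.cumsum(ratios)
# argmax returns the index of the first True; add 1 to turn it into a count
n_needed = int(np.argmax(cumulative >= 0.9)) + 1
print(n_needed)  # 4
```

Be aware that `np.argmax` returns 0 when no entry reaches the threshold, so this only gives a sensible answer if the fitted components explain at least 90% of the variance in total.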

In [None]:
# Calculate the number of components required to explain at least 90% of the variance in our data.

# YOUR CODE HERE
raise NotImplementedError

## Exercise 2: Multidimensional Scaling (MDS)

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating an MDS plot using `MDS()`, which we imported from `sklearn.manifold`. All scikit-learn models use a consistent syntax, so the syntax is extremely similar to that for `PCA()`.

Examine the documentation either online or just using `help(MDS)` in the notebook.

</div>

StatQuest intro to [MDS and PCoA](https://youtu.be/GEn-_dAyYME)
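
If the syntax is unclear, the following minimal sketch shows the general call pattern on random toy data (not our expression data). The `random_state` argument is only there to make the result repeatable, and other parameters are left at their defaults.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Toy data: 12 observations (rows) with 20 features each
toy = rng.normal(size=(12, 20))

mds = MDS(n_components=2, random_state=0)
toy_mds = mds.fit_transform(toy)
print(toy_mds.shape)  # (12, 2)
```

Depending on your scikit-learn version, you may see warnings about changing defaults; these can be ignored for this sketch.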

In [None]:
# YOUR CODE HERE
raise NotImplementedError

## Exercise 3: tSNE

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating a tSNE plot using `TSNE()`, which we imported from `sklearn.manifold`. All scikit-learn models use a consistent syntax, so the syntax is extremely similar to that for `PCA()`.

Examine the documentation either online or just using `help(TSNE)` in the notebook.

`TSNE()` takes several parameters: the most important is `perplexity`. Lower values of perplexity try hard to preserve local structure at the cost of global structure, and vice versa. From the documentation, what is the default value of `perplexity`? What happens if you redo your plot with it set to a much lower or much higher value?
</div>

Helpful resources for learning tSNE:

[StatQuest: t-SNE, Clearly Explained](https://www.youtube.com/watch?v=NEaUSP4YerM)

[t-SNE interactive settings](https://distill.pub/2016/misread-tsne/)

[Datacamp: t-SNE Tutorial](https://www.datacamp.com/community/tutorials/introduction-t-sne)
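
As a starting point for experimenting with `perplexity`, here is a sketch on synthetic data made of two well-separated groups (an assumption chosen purely for illustration). The perplexity values used here are arbitrary; note that scikit-learn requires `perplexity` to be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in data: two groups of 20 samples with 10 features each
group_a = rng.normal(loc=0.0, size=(20, 10))
group_b = rng.normal(loc=8.0, size=(20, 10))
toy = np.vstack([group_a, group_b])

embeddings = {}
for perplexity in (5, 35):  # both are below the 40 samples in the toy data
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(toy)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can be passed to one of the plotting functions above to compare how the layout changes.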

In [None]:
# YOUR CODE HERE
raise NotImplementedError

## Exercise 4: UMAP

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Try creating a UMAP plot using `UMAP()`, which we imported from the `umap` library. `umap` is not part of scikit-learn, but it deliberately uses a similar syntax.

Examine the documentation either online or just using `help(UMAP)` in the notebook.

Look at the available parameters in the documentation. Try varying `n_neighbors` (which has a conceptual similarity to tSNE's `perplexity`) and `min_dist`.
</div>

[UMAP Uniform Manifold Approximation and Projection for Dimension Reduction](https://www.youtube.com/watch?v=nq6iPZVUxZU) video.

In [None]:
# YOUR CODE HERE
raise NotImplementedError