## CMM 262 Winter 2019 Midterm Single Cell Analysis Module

**To upload:**
Do this analysis in a Jupyter notebook, export the notebook as an HTML file, then save that file as a PDF.


For this section of the midterm, we are going to analyze a 10X dataset of 931 brain cells from an E18 mouse using [scanpy](https://scanpy.readthedocs.io/en/latest/basic_usage.html).

The data that you will need to analyze is located in: `/oasis/tscc/scratch/biom200/cmm262/Module_4/midterm_data`

This directory contains `.mtx` and `.tsv` files.

### Importing necessary packages below:
Below are the Python packages that you may want to use in your analysis. Remember that each package is called by the alias you give it at import. For example, with the import statements as written below, numpy is called as `np`.

In [None]:
import scanpy as sc  # the `scanpy.api` module is deprecated; import the top-level package instead
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mode
import seaborn as sns
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=170, color_map='viridis')  # low dpi (dots per inch) yields small inline figures
sc.logging.print_versions()

### The goal of this midterm is to first filter our data, determine which cells are most similar to each other (cluster the cells), and finally find some genes that are indicative of each cluster of cells.

### Loading data below:

**Remember, data is in the path below:** 
`/oasis/tscc/scratch/biom200/cmm262/Module_4/midterm_data`

In [None]:
adata = sc.read_10x_mtx('/oasis/tscc/scratch/biom200/cmm262/Module_4/midterm_data',  # the directory with the `.mtx` file
    var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
    cache=True) # write a cache file for faster subsequent reading

---

**Question 1 (2 pts):** Take a quick look at your data using violin plots. Plot both the number of counts per cell and genes per cell. 

*Note:* Use scanpy's violin plotting function

**Question 2 (3 pts):** Now filter your dataset so that you are left with only:   
    (A) cells that have at least 100 genes expressed in them, **AND**   
    (B) genes that are expressed in at least 5 cells, **AND**   
    (C) genes with a count of at least 15 (filter out any gene whose count is below 15).  
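
As a rough illustration of how these three thresholds act on a count matrix, here is a minimal NumPy sketch on a made-up toy matrix. The matrix, the variable names, and the scaled-down thresholds are all assumptions for illustration; the midterm itself should be done with scanpy's filtering functions on `adata`.

```python
import numpy as np

# Hypothetical toy count matrix: rows are cells, columns are genes.
counts = np.array([
    [5, 0, 3, 0],
    [2, 1, 0, 0],
    [0, 0, 0, 1],
    [4, 2, 6, 0],
])

# Toy thresholds, scaled down from the 100 / 5 / 15 used in the question.
min_genes_per_cell = 2
min_cells_per_gene = 2
min_counts_per_gene = 5

# (A) keep cells with enough detected (non-zero) genes
genes_per_cell = (counts > 0).sum(axis=1)
counts = counts[genes_per_cell >= min_genes_per_cell, :]

# (B) and (C) keep genes seen in enough cells and with enough total counts
cells_per_gene = (counts > 0).sum(axis=0)
total_per_gene = counts.sum(axis=0)
keep_genes = (cells_per_gene >= min_cells_per_gene) & (total_per_gene >= min_counts_per_gene)
filtered = counts[:, keep_genes]

print(filtered.shape)  # (cells kept, genes kept)
```

Note that the cell filter is applied first in this sketch, so the gene statistics are computed only on the cells that survive it.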

**Question 3 (1 pt):** How many cells and genes are left in your dataset (`adata`)?


**Question 4 (3 pts):** 
  
A) Normalize (library-size correct) the data matrix by total counts per cell to CPM, or "counts per million", so that counts become comparable among cells.

B) Then log-transform the normalized counts.

C) Why is it important to normalize your data by total counts per cell?
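
To make the arithmetic behind CPM and the log transform concrete, here is a minimal NumPy sketch on a hypothetical three-cell matrix. The data and variable names are assumptions for illustration, and the log step is assumed to be log(1 + x); in your notebook, use scanpy's preprocessing functions on `adata` instead.

```python
import numpy as np

# Hypothetical toy counts: rows are cells, columns are genes.
counts = np.array([
    [10.0, 0.0, 30.0],
    [1.0, 1.0, 2.0],
    [0.0, 5.0, 5.0],
])

# Library size = total counts per cell (one value per row).
totals = counts.sum(axis=1, keepdims=True)

# Scale every cell so that its counts sum to one million.
cpm = counts / totals * 1e6

# log(1 + x) keeps zeros at zero and compresses large values.
log_cpm = np.log1p(cpm)

print(cpm.sum(axis=1))
```

After scaling, every cell has the same total, so differences between cells reflect relative expression and not sequencing depth.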

**Question 5 (3 pts):**  Filter your dataset for only the variable genes. 

Scanpy uses a normalized dispersion, obtained by scaling each gene's dispersion by the mean and standard deviation of the dispersions of the genes that fall into the same bin of mean expression. This means that highly variable genes are selected within each bin of mean expression.

A) Filter genes using a minimum mean of 0.01, a maximum mean of 5, and a minimum dispersion of 0.2, and show a plot of your data to visualize which genes you are filtering out and which you are keeping.

B) How many genes do you have left in your dataset?

C) Why do we only care about the genes that differ between the cells?
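
The binned normalization described in Question 5 can be sketched in plain NumPy as follows. This is a simplified illustration under stated assumptions (simulated Poisson counts, dispersion taken as variance over mean, five quantile bins, hypothetical variable names), not scanpy's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 cells x 50 genes with different expression levels.
gene_rates = rng.uniform(0.1, 5.0, size=50)
X = rng.poisson(lam=gene_rates, size=(200, 50)).astype(float)

# Per-gene mean and dispersion (variance / mean, an assumed definition).
mean = np.maximum(X.mean(axis=0), 1e-12)  # guard against division by zero
dispersion = X.var(axis=0, ddof=1) / mean

# Assign each gene to one of n_bins bins of mean expression.
n_bins = 5
edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1))
bin_idx = np.clip(np.searchsorted(edges, mean, side="right") - 1, 0, n_bins - 1)

# Z-score the dispersion within each bin.
norm_disp = np.zeros_like(dispersion)
for b in range(n_bins):
    in_bin = bin_idx == b
    d = dispersion[in_bin]
    norm_disp[in_bin] = (d - d.mean()) / d.std(ddof=1)

# Apply the thresholds from part A to pick the genes to keep.
keep = (mean >= 0.01) & (mean <= 5) & (norm_disp >= 0.2)
print(keep.sum(), "of", keep.size, "genes kept")
```

Because the z-score is taken within each bin, a gene is judged against genes of similar mean expression, which avoids favoring genes only because they are highly or lowly expressed.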

**Question 6 (1 pt):** Why do we filter our dataset for highly variable genes?

**Question 7 (2 pts):** Reduce the dimensionality of your dataset using PCA and produce a scatterplot showing the first two principal components, colored by your favorite neuronal gene or any other gene that you didn't filter out.

**Question 8 (2 pts):**   
(A) Inspect the contribution of single PCs to the total variance in the data. Provide a graph.   
This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells.  
(B) How will you decide how many PCs to use in your neighborhood graph?
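
To show what "contribution of single PCs to the total variance" means numerically, here is a minimal sketch that computes PCA through an SVD in NumPy. The simulated data (three latent factors plus noise) and all variable names are assumptions for illustration; your answer should use scanpy's PCA tools and plots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 cells x 20 genes driven by 3 latent factors plus noise.
latent = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 20))

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pcs = U * S  # coordinates of each cell in PC space
explained_variance = S ** 2 / (X.shape[0] - 1)
variance_ratio = explained_variance / explained_variance.sum()

print(np.round(variance_ratio[:5], 3))
```

In this toy case the first three ratios dominate and the rest are close to zero, which is the kind of "elbow" you look for when deciding how many PCs to keep.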

**Question 9 (1 pt):** Compute the neighborhood graph using 30 PCs and 10 neighbors.
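
For intuition about what a neighborhood graph is, here is a brute-force k-nearest-neighbor sketch in NumPy. The random PC coordinates and variable names are hypothetical, and this plain Euclidean search is a simplification for illustration, not scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical PC coordinates: 50 cells x 30 PCs.
n_cells, n_pcs, n_neighbors = 50, 30, 10
pcs = rng.normal(size=(n_cells, n_pcs))

# Pairwise Euclidean distances between all cells.
diff = pcs[:, None, :] - pcs[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # a cell is not counted as its own neighbor here

# Indices of the n_neighbors closest cells for every cell.
neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]

# Boolean adjacency matrix of the (directed) kNN graph.
adjacency = np.zeros((n_cells, n_cells), dtype=bool)
rows = np.repeat(np.arange(n_cells), n_neighbors)
adjacency[rows, neighbors.ravel()] = True

print(adjacency.sum(axis=1)[:5])
```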

**Question 10 (2 pts):** Compute and plot a t-SNE embedding using 30 PCs, colored by your favorite gene.

**Question 11 (2 pts):** Cluster your data using the Louvain clustering algorithm with the default parameters. How many clusters are there in your dataset?

**Question 12 (3 pts):** Find marker genes in each of the clusters. Test each cluster against all of the other clusters combined. Show the top ten differentially expressed genes in each cluster (ranked by score, which is the default ranking) in either a graph or a table.
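
The idea of testing one cluster against all other clusters combined can be sketched as below. This example uses simulated data with planted marker genes and a Welch t-statistic as the score; the data, the function name `one_vs_rest_scores`, and the choice of statistic are assumptions for illustration, not necessarily what scanpy computes by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 60 cells x 8 genes, three clusters of 20 cells.
n_cells, n_genes = 60, 8
X = rng.normal(size=(n_cells, n_genes))
labels = np.repeat([0, 1, 2], 20)

# Plant one marker gene per cluster.
X[labels == 0, 2] += 3.0
X[labels == 1, 5] += 3.0
X[labels == 2, 7] += 3.0

def one_vs_rest_scores(X, labels, cluster):
    """Welch t-statistic per gene: cells in `cluster` versus all other cells."""
    in_cluster = labels == cluster
    a, b = X[in_cluster], X[~in_cluster]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

# Rank genes by score (highest first) and keep the top three per cluster.
top_genes = {
    int(c): np.argsort(one_vs_rest_scores(X, labels, c))[::-1][:3]
    for c in np.unique(labels)
}

for cluster, genes in top_genes.items():
    print(cluster, genes)
```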