[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nunososorio/SingleCellGenomics2024/blob/main/3_Wednesday_April10th/SessionIV_p3_students.ipynb)

<img src="https://github.com/nunososorio/SingleCellGenomics2024/blob/main/logo.png?raw=true" alt="AnnData" style="width:600px; height:auto;"/>

## Practical Session IV - Part 3


## 0. Set up the environment and load your data

In [None]:
# Install scanpy and loompy if you don't have them already or if you are running on colab
# In this notebook we will use the Louvain and Leiden clustering algorithms; you will need the corresponding packages
! pip install scanpy loompy louvain leidenalg > _

In [None]:
# Load the libraries we will use
import numpy as np
import pandas as pd
import scanpy as sc
import loompy
import matplotlib.pyplot as plt

In [None]:
# Adjust the output for the figures
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=100, facecolor='white')
plt.rcParams['figure.figsize'] = (6, 6)
plt.rcParams['font.size'] = 16
sc.logging.print_header()

In [None]:
# If you did NOT generate the normalized data, you can download it from the repository:
#!wget https://figshare.com/ndownloader/files/44904205 -O Data1_DimRed.h5ad

In [None]:
adata = sc.read_h5ad(#YOUR CODE HERE#) # enter loom/h5ad file name here
#adata = sc.read_loom(#YOUR CODE HERE#, var_names='var_names', obs_names='obs_names')


In [None]:
# Some scanpy versions might be asking for this
adata.uns['log1p']['base']=None

In [None]:
# If you used .loom format, you have to rerun the neighbors calculation
#sc.pp.neighbors(adata, n_neighbors=15, n_pcs=40) # specify the number of neighbors and number of PCs you wish to use

## 6. Clustering

Clustering the data helps to identify cells with similar gene expression properties that may belong to the same cell type or cell state. There are two popular clustering methods, both available in scanpy: Louvain and Leiden clustering.

### **Exercise 1**:

Run both the Louvain and Leiden clustering algorithms. Visualize both sets of clusters on your UMAP representation. Are the clusters different from each method? Visualize the clusters again, this time on the tSNE embedding instead of the UMAP embedding. Are there differences in which clusters are grouped together?

In [None]:
# your code here
sc.tl.louvain(#YOUR CODE HERE#, resolution=0.2)

In [None]:
# your code here
sc.tl.leiden(#YOUR CODE HERE#, resolution=0.2)

Next, you can visualize your UMAP and tSNE representations of the scRNA-seq data and color them by various metadata attributes (including the Louvain or Leiden clusters) from the prior steps. For example:

In [None]:
sc.pl.umap(#YOUR CODE HERE#, use_raw=False, color=#YOUR CODE HERE#, wspace=0.3, ncols=2) # color by louvain and leiden

In [None]:
sc.pl.tsne(#YOUR CODE HERE#) # color by louvain and leiden

### **Exercise 2**:

How many cells do you have per cluster? Assess this using the value_counts() function from pandas.

Hint: remember that adata.obs is *just* a pandas data frame!

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
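
As a minimal sketch of the counting step, the snippet below uses a toy data frame standing in for `adata.obs`. The barcodes and cluster labels are invented for illustration; only `value_counts()` itself comes from the exercise.

```python
import pandas as pd

# Toy stand-in for adata.obs: one row per cell, one column of cluster labels.
# The barcodes and labels are invented for illustration only.
obs = pd.DataFrame(
    {"louvain": pd.Categorical(["0", "0", "1", "2", "0", "1"])},
    index=[f"cell_{i}" for i in range(6)],
)

# Number of cells per cluster, largest cluster first
cells_per_cluster = obs["louvain"].value_counts()
print(cells_per_cluster)

# Same counts expressed as a fraction of all cells
cluster_fractions = obs["louvain"].value_counts(normalize=True)
print(cluster_fractions)
```

The `normalize=True` variant can be easier to read when comparing clusterings of datasets with different numbers of cells.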

In [None]:
# your code here
#YOUR CODE HERE#

### **Exercise 3**:

Visualize some of the other metadata on the UMAP or tSNE embedding, including the n_counts, n_genes, percent_mito, and phase metadata found in adata.obs. Do any clusters seem to have an obvious bias towards particular attributes?

This might be a sign that we want to optimize prior steps of the analysis, such as adjusting the number of principal components used to compute the neighborhood graph or regressing out particular variables. As with any pandas data frame, you can also examine the frequency of various attributes using a command such as: adata.obs["phase"].value_counts().
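
One possible way to check for such biases numerically is to summarize the metadata per cluster. The sketch below assumes a toy data frame in place of `adata.obs`; all values are invented, and the choice of the median as a summary is only a suggestion.

```python
import pandas as pd

# Toy stand-in for adata.obs; values are invented for illustration only.
obs = pd.DataFrame(
    {
        "louvain": ["0", "0", "0", "1", "1", "1"],
        "n_counts": [1000, 1200, 1100, 5000, 5200, 4800],
        "phase": ["G1", "G1", "S", "G2M", "G2M", "G2M"],
    }
)

# Median of a numeric attribute per cluster: a large gap between clusters
# may indicate that sequencing depth is driving the clustering.
median_counts = obs.groupby("louvain")["n_counts"].median()
print(median_counts)

# Fraction of each cell-cycle phase within each cluster (rows sum to 1)
phase_by_cluster = pd.crosstab(obs["louvain"], obs["phase"], normalize="index")
print(phase_by_cluster)
```

A cluster made up almost entirely of one phase, as cluster "1" is in this toy example, would be a candidate for closer inspection.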


In [None]:
# your code here
#YOUR CODE HERE#

### **Exercise 4**:

Let’s proceed with Louvain clustering and UMAP embeddings for the time being.
- Create a new metadata attribute for your current clusters, i.e. adata.obs["louvain_res1"] = adata.obs["louvain"].
- Repeat louvain clustering using different values for the resolution parameter: 0.5 and 1.5.
- Save the clusters in a new metadata column and visualize them on the UMAP representation.
- How does the number of clusters change with adjustments to the resolution parameter? Using the resolution=1 as a basis, do any clusters divide into two smaller clusters upon changing the resolution parameter? Do any clusters merge together?
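
Splits and merges between two resolutions can be read off a contingency table. The following is a sketch on invented labels; the column names mirror the naming used in this exercise, but the data are hypothetical.

```python
import pandas as pd

# Toy cluster labels for eight cells at two resolutions (invented values).
obs = pd.DataFrame(
    {
        "louvain_res1": ["0", "0", "0", "0", "1", "1", "2", "2"],
        "louvain_res1.5": ["0", "0", "3", "3", "1", "1", "2", "2"],
    }
)

# Number of clusters found at each resolution
n_clusters = obs.nunique()
print(n_clusters)

# Contingency table: rows are clusters at resolution 1, columns at 1.5.
# A row with more than one non-zero entry is a cluster that split;
# a column with more than one non-zero entry is a merge.
table = pd.crosstab(obs["louvain_res1"], obs["louvain_res1.5"])
print(table)

mask = ((table > 0).sum(axis=1) > 1).to_numpy()
split_clusters = table.index[mask].tolist()
print(split_clusters)
```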

In [None]:
adata.obs["louvain_res1"] = adata.obs["louvain"].copy()




In [None]:

sc.tl.louvain(#YOUR CODE HERE#) # you must complete
adata.obs["louvain_res0.5"] = #YOUR CODE HERE#

In [None]:
# Repeat for Louvain resolution 1.5

In [None]:
# visualize the umap colored by the different resolution clustering. What changes?
# your code here
sc.pl.umap(#YOUR CODE HERE#)

### **Exercise 5**:

Let’s take a few steps back to understand the previous steps a little bit better! For example, the number of principal components used in computing the neighborhood graph will greatly impact the visualizations.

Rerun previous code using the following number of PCs and visualize the different UMAPs and number of clusters: 4 PCs, 8 PCs, 15 PCs, 30 PCs. What changes with the different number of PCs used?

Choose an “optimal” number of PCs by examining the contribution of each PC to the total variance with the command: sc.pl.pca_variance_ratio(adata, log=True).
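
In addition to inspecting the plot, the variance ratios can be examined numerically. This is a sketch on invented numbers; in scanpy the real values are typically found in `adata.uns["pca"]["variance_ratio"]` after running PCA, and the 2% cutoff used here is a hypothetical rule of thumb, not a recommendation from this course.

```python
import numpy as np

# Invented variance ratios for ten PCs; in scanpy the real values are
# typically stored in adata.uns["pca"]["variance_ratio"] after sc.tl.pca.
variance_ratio = np.array(
    [0.30, 0.20, 0.10, 0.05, 0.02, 0.01, 0.01, 0.005, 0.005, 0.005]
)

# Cumulative share of the variance captured by the first k PCs
cumulative = np.cumsum(variance_ratio)

# Hypothetical rule of thumb: keep PCs that each explain at least 2% of the variance
min_ratio = 0.02
n_pcs = int(np.sum(variance_ratio >= min_ratio))
print(n_pcs, cumulative[n_pcs - 1])
```

Whatever cutoff you choose, it is worth checking that the clustering is reasonably stable for nearby numbers of PCs.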

In [None]:
# Repeat the clustering and UMAP projections using only 4 PCs
sc.pp.neighbors(#YOUR CODE HERE#)


In [None]:
sc.tl.louvain(#YOUR CODE HERE#)


In [None]:
sc.tl.umap(#YOUR CODE HERE#)


In [None]:
# Now visualize the results of the new clustering in the UMAP projection
sc.pl.umap(#YOUR CODE HERE#)

In [None]:
# Apply for 8 PCs

In [None]:
# Apply for 15 PCs

In [None]:
# Apply for 30 PCs

## 7. Identifying marker genes and cell types

Let’s use a simple method implemented by scanpy to find marker genes for each Louvain cluster.

In [None]:
sc.tl.rank_genes_groups(#YOUR CODE HERE# , #YOUR CODE HERE# ) # read the function description to complete this function

In [None]:
sc.get.rank_genes_groups_df(#YOUR CODE HERE#)

In [None]:
marker_genes = pd.DataFrame(adata.uns["rank_genes_groups"]["names"])

In [None]:
marker_genes.head(10)

In [None]:
sc.pl.rank_genes_groups_heatmap(adata, groupby=#YOUR CODE HERE#, n_genes=#YOUR CODE HERE#)

### **Exercise 6**:

Visualize marker genes on the UMAP or tSNE representation. Try to find 3-4 marker genes that are indeed specific to a particular cluster. Are there any clusters that do not seem to have unique marker genes?

Are there any clusters containing markers that are only specific to a portion of the cluster?

Marker genes should uniformly define cells "everywhere" in a cluster in UMAP space, otherwise the cluster might actually be two!
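
A rough numerical complement to the visual check is to compare the fraction of cells expressing a gene inside and outside a cluster. The sketch below uses invented expression values and labels; treating any value above zero as "expressed" is an assumption made for simplicity.

```python
import numpy as np

# Toy expression values for one gene across eight cells (invented numbers)
expression = np.array([3.1, 2.8, 0.0, 3.5, 0.0, 0.2, 0.0, 0.0])
# Toy cluster label for each cell
clusters = np.array(["0", "0", "0", "0", "1", "1", "2", "2"])

in_cluster = clusters == "0"
expressed = expression > 0  # assumption: any non-zero value counts as expressed

# Fraction of cells expressing the gene inside and outside cluster "0"
frac_in = expressed[in_cluster].mean()
frac_out = expressed[~in_cluster].mean()
print(frac_in, frac_out)
```

A good marker would have a high fraction inside the cluster and a low fraction outside; a fraction well below 1 inside the cluster suggests the gene marks only part of it.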

In [None]:
# your code here
sc.pl.umap(adata, color=#YOUR CODE HERE#)

### **Exercise 7**:

Let’s take a few steps back to understand all of the previous steps a little bit better!

The number of genes selected by the highly_variable_genes function can significantly impact your ability to cluster. Too few genes and you cannot discriminate between different cell types, too many genes and you capture lots of noisy clusters!

Try repeating the previous analysis with either 500 or 5000 highly variable genes, naming the AnnData object differently (e.g. adata_500genes) to avoid overwriting your previous results.

Transfer the metadata for the new cluster labels to the original AnnData object's metadata at adata.obs and compare on the UMAP. Are the clusters different?
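
Because `adata.obs` is a pandas data frame indexed by cell barcode, labels can be transferred by index alignment. This is a sketch with toy data frames standing in for the two `.obs` tables; the barcodes, labels, and the column name `louvain_500genes` are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for adata.obs and adata_500genes.obs, indexed by cell barcode.
# The second object is assumed to contain a subset of the same barcodes.
obs = pd.DataFrame(index=["AAAC", "AAAG", "AACT", "AAGT"])
obs_500genes = pd.DataFrame(
    {"louvain": ["1", "0", "1"]},
    index=["AAGT", "AAAC", "AACT"],  # different order, one cell missing
)

# Assigning a Series aligns on the index, so row order does not matter;
# barcodes absent from the second object become NaN.
obs["louvain_500genes"] = obs_500genes["louvain"]
print(obs)
```

Relying on index alignment rather than position avoids mislabeling cells when the two objects were filtered or ordered differently.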

In [None]:
# your code here


### **Exercise 8**:

Once you have settled on the parameters for the dimensionality reduction and clustering steps, it is time to begin annotating your clusters with cell types. This is normally a challenging step!

When you are not too familiar with the marker genes for a particular cluster, a good starting point is simply to Google a strong marker gene and understand its function. Other tools that might be useful include EnrichR and GSEAPy.
- https://maayanlab.cloud/Enrichr/
- https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example

Fortunately in our case, this dataset comes from a publication with an extensive web browser that allows you to search for cell types by marker gene expression: http://mousebrain.org/adolescent/celltypes.html

This should help narrow down the search but might not be enough for distinguishing two very similar cell types or clusters.

Justify your cell type choices with marker genes from the literature!

### **Exercise 9**:

Create a new metadata attribute to annotate clusters with corresponding cell types. This can be done as shown below. Illustrate the final results on the UMAP or tSNE.
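
One possible variant of this mapping uses `Series.map` and checks that every cluster has a label before assigning. The sketch runs on toy labels; the cell-type names are placeholders, not annotations for this dataset.

```python
import pandas as pd

# Toy cluster labels and a placeholder cluster-to-cell-type mapping
louvain = pd.Series(["0", "1", "0", "2", "1"], dtype="category")
cluster2type = {"0": "CellTypeA", "1": "CellTypeB", "2": "CellTypeC"}

# Fail early if a cluster has no assigned cell type
missing = set(louvain.cat.categories) - set(cluster2type)
assert not missing, f"clusters without a cell type: {missing}"

# Map every cell's cluster label to its cell type
cell_type = louvain.map(cluster2type).astype("category")
print(cell_type.value_counts())
```

The early check is useful after re-clustering, when the number of clusters may no longer match the dictionary.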

In [None]:
cluster2type_dict = {"0":"CellType1", "1": "CellType2", ... } # update for the number of clusters/cell types you have!

adata.obs["cell_type"] = np.array([cluster2type_dict[i] for i in adata.obs["louvain"]])

In [None]:
# your code here to visualize result on the UMAP or tSNE
sc.pl.umap(adata, color= )

### **Exercise 10**:

There are many excellent plotting functions to visualize marker genes for particular cell types in your data. Explore the documentation below and create some visualizations of your results (such as a heatmap, dot plot, or violin plot).

https://scanpy-tutorials.readthedocs.io/en/latest/plotting/core.html


In [None]:
sc.pl.heatmap(adata, ["Gene1", "Gene2", ..., "GeneN"], groupby='louvain',
              cmap='viridis', dendrogram=False)

In [None]:
sc.pl.dotplot(adata, ["Gene1", "Gene2", ..., "GeneN"], groupby='louvain',
              cmap='viridis', dendrogram=False)

In [None]:
# sc.pl.violin does not take the cmap or dendrogram arguments used by heatmap and dotplot
sc.pl.violin(adata, ["Gene1", "Gene2", ..., "GeneN"], groupby='louvain')

## 8. Compare to the annotated results from the study

Fortunately, these data are from a completed study, so we have the annotations created by the authors for the various cell types! When you reach this step, let us know and we will provide you with the “solutions.” Load these into a new AnnData object, named ref_adata.

Once you have done this, visualize the cell types provided by the authors. Some good questions to investigate: Do the authors' results overlap with the clusters and/or cell types you annotated? Did the authors overgeneralize, or did you miss any clusters? How many of your cells were excluded by the authors?
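
These questions can be approached with a contingency table and a comparison of barcodes. The following is a sketch on toy annotations; the barcodes and label names are invented and stand in for your own `adata.obs` column and the authors' annotations.

```python
import pandas as pd

# Toy stand-ins: your annotations and the reference annotations, by barcode.
own = pd.Series(
    ["CellTypeA", "CellTypeA", "CellTypeB", "CellTypeB", "CellTypeB"],
    index=["c1", "c2", "c3", "c4", "c5"],
    name="cell_type",
)
reference = pd.Series(
    ["RefX", "RefX", "RefY", "RefZ"],
    index=["c1", "c2", "c3", "c4"],  # c5 is absent from the reference
    name="reference_cell_type",
)

# Cells present in your data but absent from the reference
excluded = own.index.difference(reference.index)
print(len(excluded))

# Contingency table of your labels against the reference on the shared cells
shared = own.index.intersection(reference.index)
overlap = pd.crosstab(own.loc[shared], reference.loc[shared])
print(overlap)
```

In this toy table, one of your labels maps onto two reference labels, which is the pattern you would see if the authors had split a cluster that you annotated as a single cell type.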


In [None]:
# Download the reference data:
!wget https://figshare.com/ndownloader/files/34551920 -O ref_data.h5ad




### **Exercise 11**:

Compare your results with those from the published study. Some suggestions are below:

In [None]:
ref_adata = sc.read_h5ad("ref_data.h5ad") # load the reference file downloaded above

In [None]:
ref_adata

In [None]:
# compare the number of cells in your AnnData object and the number of cells in the reference
# YOUR CODE HERE

In [None]:
# The cell-type labels in the reference data are stored in the 'Class' variable of the obs.
ref_adata.obs['Class'].head(3)

In [None]:
# Transfer them to your data by creating a dictionary of barcode : cell type for the reference
ref_types = ref_adata.obs['Class'].to_dict()
# Now add these labels to your data in a new metadata attribute "reference_cell_type";
# mapping by barcode leaves cells that are absent from the reference as NaN
adata.obs["reference_cell_type"] = adata.obs_names.to_series().map(ref_types)

In [None]:
# Compare the reference annotations with your own
sc.pl.umap(adata, color=[#YOUR CODE HERE])

In [None]:
# Do your clusters / cell types correspond directly to cell types from the authors?
# Do you have multiple clusters that the authors annotated together as a single cell type?
# Or do you have one cluster that the authors actually annotated as two different cell types?

In [None]:
# What are the marker genes for the authors' cell types? Does this assist with annotation of your clusters?
# Compute the differential expression between the groups
sc.tl.rank_genes_groups(#YOUR CODE HERE)

In [None]:
# Store the results in a dataframe and analyze them
marker_genes = pd.DataFrame(adata.uns["rank_genes_groups"]["names"])

In [None]:
marker_genes.head()

In [None]:
# don't forget to save your final AnnData object