[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nunososorio/bhs/edit/main/Data_structures/NB2_Intro_scanpy.ipynb)

# Notebook 2 - [AnnData](https://anndata.readthedocs.io/en/latest/) and [Scanpy](https://scanpy.readthedocs.io/en/stable/)
In this notebook we will explore the Annotated Data (AnnData) data structure and the [Scanpy](https://scanpy.readthedocs.io/en/stable/) library.

This data structure lets you keep (biological) data together with annotations about its observations (patients/cells/tumors...) and its variables.

# Set up the environment

The *basic* libraries (Numpy, Pandas, Matplotlib, Seaborn...) are already installed in Google Colab. To run this notebook you will also need to install scanpy and anndata.

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install scanpy anndata

In [None]:
# Import all the libraries we will use
import os
import numpy as np
from scipy.stats import ttest_ind
from scipy.stats import wilcoxon
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import anndata
import random
import scanpy as sc

In [None]:
# Some details for the plots
plt.rcParams.update({'font.size':18, 'figure.figsize':(8,8)})

# Load some data

We will use the simulated data we generated previously. If you didn't store it, you can just get it from the course repository.



In [None]:
if not os.path.exists('My_data.csv'):
    !wget https://raw.githubusercontent.com/Leo-GG/bhs/main/Data_structures/My_data.csv

In [None]:
df=pd.read_csv('My_data.csv',index_col=0)

If you remember, this data was a table where we had stored:
1. One row per observation
2. One column per "gene", with a numerical value for each observation
3. One column with an attribute of the observations (diseased -D- or healthy -H-)

In [None]:
# Have a look and see that it looks alright
df.head(5)

In [None]:
# Like before, we should have 100 observations and 81 columns/variables
df.shape

In [None]:
# We saw that this is very useful, but here we have labels mixed up with values

#df*df #This will give an error!

In [None]:
# We have to remember to separate labels from data!
df.iloc[:,:-1]*df.iloc[:,:-1]
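
Slicing by position works here because the label happens to be the last column. As a minimal sketch of an alternative, assuming a small hypothetical table (`toy`, with made-up column names), pandas can also separate numeric columns from labels by data type:

```python
import pandas as pd

# Hypothetical toy table: two numeric "gene" columns plus a label column
toy = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [0.5, 0.0, 2.0], "Condition": ["D", "H", "D"]}
)

# Keep only the numeric columns, wherever the label column happens to sit
values = toy.select_dtypes(include="number")
labels = toy.select_dtypes(exclude="number")

squared = values * values  # the element-wise product now works
print(squared)
print(labels.columns.tolist())
```

This still leaves us managing two separate tables by hand, which is the problem AnnData addresses.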

# From DataFrame to AnnData

The anndata library supports the use of Annotated Data (AnnData) objects. In these objects:
- The numerical data is kept in a main matrix (X)
- Annotations about the observations and the variables are kept in accessory tables, named "obs" and "var"
<br/><br/>
<br/><br/>

<img src="https://falexwolf.de/img/scanpy/anndata.svg" alt="AnnData" style="width:600px; height:auto;"/>





<br/><br/>
We will build an AnnData object using our table:
- The "gene" data will go on the X matrix
- The information we have about the Condition will go to the "obs" DataFrame
- The "gene" names will go to the "var" DataFrame






In [None]:
import anndata

# First we create the object using the table, keeping all the columns except the last one
adata=anndata.AnnData(X=df.iloc[:,:-1].copy())
adata


In [None]:
# Then we copy the Condition to the "obs" DataFrame
adata.obs=df[['Condition']].copy()
adata

In [None]:
# Let's read the warning and fix the issue...
adata.obs.index=adata.obs.index.astype(str)
adata

## Let's check each element of our AnnData object

In [None]:
adata.X

In [None]:
adata.obs.head(3)

In [None]:
# The "var" DataFrame will be empty for now, but we can check that its index is made of "gene" names
adata.var.head(3)

# OK, but why?

## 1. Efficient data storage
One very important feature of the AnnData object is that it separates numerical values from annotations

The numerical data stored in the X matrix can now be stored using memory-efficient formats, such as sparse matrices

This makes it possible to store and access *very* large datasets.
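
To get a feeling for the savings, here is a rough sketch comparing a dense NumPy array with a SciPy sparse matrix. The matrix size and the 5% fraction of non-zero values are assumptions chosen for illustration, not properties of our dataset:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical matrix: 1,000 observations x 2,000 variables, about 5% non-zero
dense = np.zeros((1000, 2000), dtype=np.float32)
mask = rng.random(dense.shape) < 0.05
dense[mask] = rng.random(int(mask.sum())).astype(np.float32)

# Compressed Sparse Row format stores only the non-zero entries and their positions
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

The sparser the data, the larger the gain. For a matrix with few zeros, like our simulated table, a dense array is usually the better choice.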

## 2. Annotations

Now we can add all the information we want about the observations/subjects/samples and about the features!

Categorical values and strings do not get mixed up with the numerical values in the matrix

In [None]:
# Maybe we want to include the age of our subjects
adata.obs['Age']=[ random.randint(20, 100) for i in range(adata.n_obs)]

# Add their sex
adata.obs['Sex']=np.random.choice(['F','M'],adata.n_obs)

# And some other variable
adata.obs['Hospitalized']=np.random.choice(['Y','N'],adata.n_obs)

In [None]:
adata.obs.head(3)

In [None]:
# Same with the genes, we can add some annotations to them
# Describe what the variable represents
adata.var['Type']='Not-a-real-gene'

# Add a random clinical annotation
adata.var['Clinical_annotation']=np.random.choice(['Pathogenic','NA','Unknown','Developmental'],adata.n_vars)

In [None]:
adata.var.head(3)

In [None]:
# Now the AnnData object is storing all the information in a single object
adata


### We now have a rather complex object!
We have
- High-dimensional biological data
- Information about our observations
- Information about the variables

All in one place!

## 3. Data analysis!

The [scanpy](https://scanpy.readthedocs.io/en/stable/) library, designed for single-cell RNA-seq data analysis, offers *many* methods that can be used with data stored in AnnData objects

<br/><br/>

<img src="https://scanpy.readthedocs.io/en/stable/_static/Scanpy_Logo_BrightFG.svg" alt="Scanpy" style="width:100px; height:auto;"/>

<br/><br/>


### Plotting
Scanpy has great plotting functions!
You can plot your data referring to any of the annotations stored in ".var" or in ".obs"

In [None]:
# For example, let's pick some genes to plot
adata.var.iloc[[3,8,22, 54, 12]]

In [None]:
# Now let's use scanpy to plot their values on each of the two condition groups

sc.pl.heatmap(adata=adata, var_names=['SOLB','EGFRE','MTE','PORT_NC','TACD'], groupby='Condition', swap_axes=True,figsize=[8,6])

In [None]:
# It would be the same if we want to plot by another label
sc.pl.heatmap(adata=adata, var_names=['SOLB','EGFRE','MTE','PORT_NC','TACD'], groupby='Hospitalized', swap_axes=True,figsize=[8,6])

### Statistical tests
Scanpy has functions to easily compare groups of observations


In [None]:
# For example, rank_genes_groups() allows you to make comparisons between groups of observations

sc.tl.rank_genes_groups(adata=adata, groupby='Condition', method='wilcoxon')

In [None]:
# The results are stored in the "unstructured" annotation slot (uns)

adata

In [None]:
#adata.uns['rank_genes_groups']['pvals']
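
Scanpy's `method='wilcoxon'` refers to the Wilcoxon rank-sum test, which compares two independent groups. To see what such a comparison does for each variable, here is a conceptual sketch with SciPy on hypothetical simulated data. It is not Scanpy's implementation, which differs in its details (for example, it also adjusts the p-values for multiple testing):

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations per group, 3 "genes"; only G1 differs
group_d = pd.DataFrame(rng.normal(0, 1, (50, 3)), columns=["G1", "G2", "G3"])
group_h = pd.DataFrame(rng.normal(0, 1, (50, 3)), columns=["G1", "G2", "G3"])
group_d["G1"] += 3.0

# One rank-sum (Mann-Whitney U) test per variable, comparing the two groups
pvals = pd.Series(
    {g: mannwhitneyu(group_d[g], group_h[g], alternative="two-sided").pvalue
     for g in group_d.columns}
)
print(pvals.sort_values())
```

Sorting the p-values gives a ranking of the variables, which is the idea behind `rank_genes_groups()`.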

In [None]:
# We can use these results to plot the features that differ the most in each group
sc.pl.rank_genes_groups(adata)

In [None]:
# We can plot the actual values of the top-ranking variables - i.e. the ones with most differences across groups

sc.pl.rank_genes_groups_heatmap(adata=adata,n_genes=20,groupby='Condition',swap_axes=True,figsize=[10,12])

In [None]:
# It also allows you to operate on your data, e.g. log-transforming it

sc.pp.log1p(adata)
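
By default, `sc.pp.log1p` replaces each value x with the natural logarithm of (1 + x). The same transform on a plain NumPy array (the values are made up) looks like this:

```python
import numpy as np

values = np.array([0.0, 1.0, 9.0, 99.0])

logged = np.log1p(values)    # natural log of (1 + x); zero stays zero
restored = np.expm1(logged)  # inverse transform, exp(x) - 1

print(logged)
print(restored)
```

Adding 1 before taking the logarithm keeps zeros defined. Keep in mind that the Scanpy call changes the object in place, so running the cell twice transforms the data twice.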


### Feature selection
Scanpy allows you to **easily** do a simple feature selection:
the highly_variable_genes() function uses different methods to label variables as "highly variable" or not, based on their values across the dataset
<br/><br/>

In [None]:
# Let's label our "genes"

sc.pp.highly_variable_genes(adata=adata,min_disp=0)


In [None]:
adata.var

In [None]:
# You can directly call a plotting function to visualize these results

sc.pl.highly_variable_genes(adata_or_result=adata)
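
The idea behind this kind of feature selection can be sketched with pandas. This is a deliberately simplified illustration on hypothetical data: dispersion is taken as variance divided by mean, and the "above the median" cutoff is an assumption made for the example. Scanpy's actual methods are more elaborate (for instance, normalizing dispersion within bins of similar mean).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical table: 100 observations x 4 variables with different spreads
data = pd.DataFrame({
    "flat": rng.normal(10, 0.1, 100),
    "mild": rng.normal(10, 1.0, 100),
    "wide": rng.normal(10, 3.0, 100),
    "wild": rng.normal(10, 6.0, 100),
})

# Per-variable summary statistics
stats = pd.DataFrame({"mean": data.mean(), "variance": data.var()})
stats["dispersion"] = stats["variance"] / stats["mean"]

# Simplified rule (an assumption for illustration): flag variables above the median dispersion
stats["highly_variable"] = stats["dispersion"] > stats["dispersion"].median()
print(stats)
```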

### Dimensionality reduction

Scanpy was designed to work with single-cell sequencing data. In this type of data, each observation is described by thousands of variables.

A very useful/important step in the analysis of **high-dimensional** data is dimensionality reduction. These techniques summarize the information spread across all variables in fewer dimensions.

For example, this allows us to make 2d or 3d visualizations of our 80-gene data


In [None]:
# You can apply PCA with a single function call
# (newer Scanpy releases deprecate use_highly_variable in favor of mask_var)
sc.pp.pca(data=adata, n_comps=20, use_highly_variable=True)

# Results are stored in the obsm (X_pca), varm (PCs) and uns (pca) slots
adata

In [None]:
# Visualize the fraction of variance explained by each PC
sc.pl.pca_variance_ratio(adata=adata, log=True, n_pcs=20)

In [None]:
# Visualize the data projected on the PCs, color by variable or observation attribute
sc.pl.pca(adata=adata, color=['SOLB','EGFRE','Condition'])

#sc.pl.pca(adata=adata, color=['SOLB','EGFRE','Condition'], components=['4,5'])
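
To make the PCA results less of a black box, here is a minimal NumPy sketch of the underlying computation on hypothetical random data. It is not the code Scanpy runs, but it produces the same kinds of output: scores per observation (like `obsm['X_pca']`), loadings per variable (like `varm['PCs']`) and the variance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations x 5 variables, two of them correlated
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# Center each variable, then decompose with SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                        # observations projected on the PCs
loadings = Vt.T                       # weight of each variable in each PC
variance_ratio = S**2 / np.sum(S**2)  # fraction of variance explained per PC

print(scores[:, :2].shape)
print(variance_ratio.round(3))
```

The first two columns of `scores` are what a 2d PCA plot shows.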

# You are not bound to Scanpy!

Remember that the AnnData object is still "just" a collection of data entries: a central matrix connected to dataframes with annotations

You can read, write and transform this data using any library with tools specific to your problem


In [None]:
# For example, let's use sklearn to cluster the data and write the results back to the anndata object

# First import a clustering tool from sklearn
from sklearn.cluster import KMeans
# Instantiate it
kmeans = KMeans(n_clusters=4, random_state=42)

# Then we apply it to our data
adata.obs['kmeans_cluster'] = kmeans.fit_predict(adata.X)


In [None]:
adata.obs

In [None]:
# Make sure the results are kept as categories and not numbers, so the plots will be easier to understand
adata.obs['kmeans_cluster']=adata.obs['kmeans_cluster'].astype('category')

In [None]:
# Now let's use again the scanpy function to visualize the results on the PCA space
sc.pl.pca(adata, color=['kmeans_cluster'])

### Cluster on the PCA space instead...if we have time

In [None]:
# Cluster the data again, but use the first 2 principal components only
adata.obsm['X_pca'].shape # The PCA data is stored here, it is n_obs x the number of PCs that we calculated

In [None]:
# The first 2 PCs are the first 2 columns
adata.obsm['X_pca'][:,0:2]

# We can use k-means again
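adata.obs['kmeans_PCA'] = kmeans.fit_predict(adata.obsm['X_pca'][:,0:2])
# As before, keep the cluster labels as categories so the plot uses discrete colors
adata.obs['kmeans_PCA'] = adata.obs['kmeans_PCA'].astype('category')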

In [None]:
# See the results
sc.pl.pca(adata, color=['kmeans_PCA'])
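
With two sets of cluster labels stored in ".obs", a simple way to compare them is a contingency table. The labels below are made up for illustration. Note that cluster numbers are arbitrary, so cluster 0 in one run does not need to match cluster 0 in the other:

```python
import pandas as pd

# Hypothetical cluster labels for six observations from two clustering runs
labels = pd.DataFrame({
    "kmeans_cluster": [0, 0, 1, 1, 2, 2],
    "kmeans_PCA":     [1, 1, 0, 0, 2, 0],
})

# Rows: first clustering; columns: second clustering; cells: number of shared observations
agreement = pd.crosstab(labels["kmeans_cluster"], labels["kmeans_PCA"])
print(agreement)
```

On our object, the same call could be applied to the corresponding columns of `adata.obs`.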

# Into the multi-ome: tables of tables

Current *-omics* techniques tend more and more to be run in parallel, so we can have information on different **modalities** for the same subjects/cells/tissue...

For example, we can have RNA-sequencing data and proteomics data. In our example, we would have two data tables and one table with subject information/metadata.

<img src="https://github.com/Leo-GG/bhs/blob/main/Data_structures/Illustrations/premise3.png?raw=true" alt="AnnData" style="width:100px; height:auto;"/>

<br/><br/>


**Luckily** for us, there is a Python library designed for multiomics data: [Muon](https://muon.readthedocs.io/en/latest/)

This framework implements a Multiomics Data structure - **MuData**


<img src="https://github.com/scverse/muon/raw/master/docs/img/muon_header.png" alt="Muon" style="width:100px; height:auto;"/>

<br/><br/>




In Muon we basically pile up multiple AnnData objects, one for each **modality**, and we keep sets of annotations that are common to the observations (subjects/cells..) in all modalities

<img src="https://github.com/Leo-GG/bhs/blob/main/Data_structures/Illustrations/muon_paper.png?raw=true" alt="AnnData" style="width:100px; height:auto;"/>
source: Bredikhin et al. "MUON: multimodal omics analysis framework", Genome Biology (2022)
<br/><br/>

<img src="https://github.com/Leo-GG/bhs/blob/main/Data_structures/Illustrations/muon_diagram.png?raw=true" alt="AnnData" style="width:100px; height:auto;"/>
<br/><br/>
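
The "tables of tables" idea can be sketched with plain pandas. This is only a conceptual stand-in with hypothetical names and values, not the MuData API: one table per modality, plus one table of annotations shared by all of them.

```python
import pandas as pd

subjects = ["s1", "s2", "s3"]

# Hypothetical modalities measured on the same subjects
modalities = {
    "rna": pd.DataFrame({"G1": [1.0, 2.0, 3.0], "G2": [0.0, 1.0, 0.5]}, index=subjects),
    "protein": pd.DataFrame({"P1": [10.0, 12.0, 9.0]}, index=subjects),
}

# Annotations shared by every modality
shared_obs = pd.DataFrame({"Condition": ["D", "H", "D"]}, index=subjects)

# The shared annotations select the same subjects in every modality
diseased = shared_obs.index[shared_obs["Condition"] == "D"]
subset = {name: table.loc[diseased] for name, table in modalities.items()}

for name, table in subset.items():
    print(name, table.shape)
```

A dedicated structure saves us from keeping these tables aligned by hand.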

The main advantage of the Muon framework is that it allows the use of **multimodal integration** methods, such as [MOFA](https://biofam.github.io/MOFA2/)

<br/><br/>


<img src="https://biofam.github.io/MOFA2/images/mofa_overview.png" alt="MOFA overview" style="width:100px; height:auto;"/>
<br/><br/>

But that's a topic for another time...
<br/><br/>