# Single Cell Multi-Omics Integration

Many single-cell sequencing technologies are now available, and it is increasingly common to have different types of measurements performed on the same underlying system. The growing diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a wide variety of biological conditions. However, it is challenging to analyze them together, particularly when datasets are assayed with different technologies, because biological and technical differences are intertwined.

Here we will implement manifold alignment to integrate single-cell RNA sequencing (scRNA-seq) data. In this case there are two data sets, and our goal is to integrate them. It would be even better if you can also pair the cells across the two data sets.

In [None]:
####### Install necessary packages ###########
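# Note: the PyPI package behind `import umap` is umap-learn; leidenalg is needed for Leiden clustering in scanpy.
# %pip install numpy scanpy pandas seaborn umap-learn scikit-learn leidenalg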

In [None]:
import numpy as np
import scanpy as sc
import pandas as pd
import seaborn as sns
import umap
import sklearn

## Preprocessing

The following pipeline is based on the tutorial at https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html. In this section, most steps need only one line of code. You can skip the bonus questions if you are short on time.

In [None]:
## load rna_data.csv
rna_data = sc.read_csv("data set/rna_data.csv").transpose()
# each cell is one observation
print(rna_data.obs.index[0:5])
# each gene is one variable
print(rna_data.var.index[0:5])

Do basic filtering: keep only cells that express at least 100 genes, and keep only genes that are expressed in at least 5 cells.

In [None]:
## Do Basic filtering.


Use the function pp.calculate_qc_metrics to compute the fraction of mitochondrial genes and additional measures.

In [None]:
## Compute the fraction of mitochondrial genes


Total-count normalize (library-size correct) the data matrix X to 10,000 reads per cell, so that counts become comparable among cells.

In [None]:
## Total-count normalize 

## Logarithmize the data.


In [None]:
## Identify highly-variable genes.

## Scale the data to unit variance.


### Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

In [None]:
## Running principal component analysis (PCA)


**Bonus question:** Write your own PCA function.
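One possible starting point is PCA via the singular value decomposition of the centered data matrix. The function name `pca` and the random test matrix are illustrative choices, not part of the exercise.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD. X is cells x features.

    Returns (scores, components, explained_variance).
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # projected cells
    components = Vt[:n_components]                    # principal axes (rows)
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

rng = np.random.default_rng(0)
# Synthetic data with decreasing variance along each feature.
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
scores, components, var = pca(X, n_components=2)
print(scores.shape, var)
```

The scores equal the centered data projected onto the components, so they can be compared against `sc.tl.pca` output up to a sign flip per component.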

In [None]:
## This is your own PCA.


In [None]:
## make a scatter plot in the PCA coordinates.



### Computing the neighborhood graph

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix.
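In scanpy this step is a call to `sc.pp.neighbors`. To show what such a graph contains, here is a plain k-nearest-neighbor graph built with scikit-learn on synthetic PCA coordinates. It is a simplification: scanpy's default connectivities are weighted differently.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# Synthetic "PCA coordinates" (assumption): two well-separated groups of 30 cells.
pcs = np.vstack([rng.normal(0, 0.3, (30, 10)),
                 rng.normal(5, 0.3, (30, 10))])

# Sparse matrix holding the distance from each cell to its 10 nearest neighbors.
knn = kneighbors_graph(pcs, n_neighbors=10, mode="distance", include_self=False)

# Symmetrize: connect i and j if either one is among the other's neighbors.
connectivities = ((knn + knn.T) > 0).astype(float)
print(connectivities.shape, connectivities.nnz)
```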

In [None]:
## compute the neighborhood graph


### Embedding the neighborhood graph 

We suggest embedding the graph in 2 dimensions using UMAP.

In [None]:
## Run UMAP

## Plot your results.


### Clustering the neighborhood graph

Here we recommend the Leiden graph-clustering method (community detection based on optimizing modularity). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

In [None]:
## Do cluster

## Plot the clusters


**Bonus question:** Try basic clustering methods, for example k-means and hierarchical clustering.
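A small sketch of both methods using scikit-learn, run on synthetic PCA coordinates. With real data you would pass your own PCA matrix and choose the number of clusters yourself; the value 2 here matches the toy data only.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
# Synthetic "PCA coordinates" (assumption): two well-separated groups of 30 cells.
pcs = np.vstack([rng.normal(0, 0.3, (30, 10)),
                 rng.normal(5, 0.3, (30, 10))])

# k-means: partitions cells around 2 centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pcs)

# Hierarchical (agglomerative) clustering with Ward linkage, cut at 2 clusters.
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(pcs)

print(kmeans_labels)
print(hier_labels)
```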

In [None]:
## k-means

## hierarchical cluster.


### Go through pipeline for another dataset

In [None]:
## load atac_data.csv
atac_data = sc.read_csv("data set/atac_data.csv").transpose()
# each cell is one observation
print(atac_data.obs.index[0:5])
# each gene is one variable
print(atac_data.var.index[0:5])

## Do Basic filtering.

## Compute the fraction of mitochondrial genes

## Total-count normalize 

## Logarithmize the data.

## Identify highly-variable genes.

## Scale the data to unit variance.

## Running principal component analysis (PCA)

## compute the neighborhood graph

## Run UMAP

## Plot your results.

## Do cluster

## Plot the clusters


### Direct assembly

Combine the two datasets directly and go through the above pipeline. In this case you don't need to cluster; instead, color the plots by data set.
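One way to build the combined table with pandas, assuming both CSV files are laid out as features x cells (which is why they are transposed after loading) and share at least some feature names. The tiny frames and all names below are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy tables (assumption): rows are features, columns are cells.
rna = pd.DataFrame(rng.poisson(1.0, (4, 3)),
                   index=["g1", "g2", "g3", "g4"],
                   columns=["rna_c1", "rna_c2", "rna_c3"])
atac = pd.DataFrame(rng.poisson(1.0, (4, 2)),
                    index=["g2", "g3", "g4", "g5"],
                    columns=["atac_c1", "atac_c2"])

# Keep only features present in both tables and place the cells side by side.
combined = pd.concat([rna, atac], axis=1, join="inner")

# One label per cell, recording which data set it came from.
labels = pd.Series(["rna"] * rna.shape[1] + ["atac"] * atac.shape[1],
                   index=combined.columns, name="dataset")
print(combined.shape)
print(labels.value_counts())
```

After loading the combined table into an AnnData object, the labels can be stored in an `obs` column (for example `sc_data.obs["dataset"] = labels.values`) and used to color the plots.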

In [None]:
## load your data
sc_data = sc.read_csv("your own csv combined two csv file").transpose()

## Do Basic filtering.

## Compute the fraction of mitochondrial genes

## Total-count normalize 

## Logarithmize the data.

## Identify highly-variable genes.

## Scale the data to unit variance.

## Running principal component analysis (PCA)

## compute the neighborhood graph

## Run UMAP

## Plot your results. Labeled by data set.


## Supervised general manifold alignment 

In this case we know the cells are paired one to one. Implement the method from https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/viewFile/4903/5535.
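To illustrate the general idea of alignment with known pairs, here is a sketch of a joint graph-Laplacian embedding: each data set contributes its own k-nearest-neighbor graph, paired cells are linked by an edge of weight `mu`, and the joint graph is embedded with its smallest nontrivial generalized eigenvectors. This is a generic textbook-style construction and is not claimed to be the exact algorithm of the linked paper. The function names, `k`, `mu`, and the toy curves are all assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def knn_affinity(X, k=5):
    """Symmetric 0/1 k-nearest-neighbor adjacency matrix (dense)."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    return np.maximum(A, A.T)

def align_paired(X, Y, k=5, mu=1.0, dim=2):
    """Embed X and Y jointly, assuming row i of X is paired with row i of Y."""
    n = X.shape[0]
    Wx, Wy = knn_affinity(X, k), knn_affinity(Y, k)
    C = mu * np.eye(n)                      # correspondence edges
    W = np.block([[Wx, C], [C, Wy]])        # joint affinity matrix
    D = np.diag(W.sum(axis=1))
    L = D - W                               # joint graph Laplacian
    vals, vecs = eigh(L, D)                 # solves L v = lambda D v, ascending
    emb = vecs[:, 1:dim + 1]                # skip the constant eigenvector
    return emb[:n], emb[n:]

# Toy example: the same 1-D parameter t seen through two different maps.
t = np.linspace(0, 3, 100)
X = np.c_[np.cos(t), np.sin(t), t]
Y = np.c_[t, t ** 2, np.sin(2 * t), np.exp(-t)]
ex, ey = align_paired(X, Y, k=5, mu=1.0, dim=2)
print(np.linalg.norm(ex - ey, axis=1).mean())
```

The sketch assumes the joint graph is connected; otherwise more than one eigenvector is trivial and should be skipped. A larger `mu` pulls paired cells closer together at the cost of distorting each data set's own geometry.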

In [None]:
## Supervised general manifold alignment function.


## Manifold Alignment without Correspondence

Suppose we don't know which cells are paired. Implement the method from https://www.ijcai.org/Proceedings/09/Papers/214.pdf.

In [None]:
## Manifold Alignment without Correspondence function.


## Generalized Unsupervised Manifold Alignment

Implement the method from https://papers.nips.cc/paper/5620-geeralized-unsupervised-manifold-alignment.pdf.

In [None]:
## Generalized Unsupervised Manifold Alignment function.


## Bonus question: Unsupervised Topological Alignment for Single-Cell Multi-Omics Integration

Implement the method from https://www.biorxiv.org/content/10.1101/2020.02.02.931394v2.full.

In [None]:
## Unsupervised Topological Alignment for Single-Cell Multi-Omics Integration function.


## Check alignment method

Please apply the three methods above to our data sets “rna_data” and “atac_data”. After obtaining the aligned data, go through the same pipeline as in the direct-assembly section above and check whether your method works.
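Besides inspecting the plots, a simple numeric check is possible when the true pairing is known: count how often a cell's nearest neighbor in the other data set is its true partner. The sketch below assumes row `i` of both aligned embeddings refers to the same cell; the function name `pairing_accuracy` and the toy embeddings are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pairing_accuracy(emb_a, emb_b):
    """Fraction of rows in emb_a whose nearest neighbor in emb_b has the same row index."""
    nn = NearestNeighbors(n_neighbors=1).fit(emb_b)
    _, idx = nn.kneighbors(emb_a)
    return float(np.mean(idx[:, 0] == np.arange(emb_a.shape[0])))

rng = np.random.default_rng(0)
# Toy aligned embeddings (assumption): emb_b is emb_a plus a little noise.
emb_a = rng.normal(size=(50, 2))
emb_b = emb_a + rng.normal(scale=1e-3, size=emb_a.shape)

print(pairing_accuracy(emb_a, emb_b))        # well-aligned pairs
print(pairing_accuracy(emb_a, emb_b[::-1]))  # deliberately mismatched rows
```

A value near 1 suggests the alignment recovers the pairing; a value near the chance level of roughly `1 / n_cells` suggests it does not.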