# RNA velocity analysis using scVelo

* __Notebook version__: `v0.0.2`
* __Created by:__ `Imperial BRC Genomics Facility`
* __Maintained by:__ `Imperial BRC Genomics Facility`
* __Docker image:__ `imperialgenomicsfacility/scanpy-notebook-image:release-v0.0.4`
* __Github repository:__ [imperial-genomics-facility/scanpy-notebook-image](https://github.com/imperial-genomics-facility/scanpy-notebook-image)
* __Created on:__ {{ DATE_TAG }}
* __Contact us:__ [Imperial BRC Genomics Facility](https://www.imperial.ac.uk/medicine/research-and-impact/facilities/genomics-facility/contact-us/)
* __License:__ [Apache License 2.0](https://github.com/imperial-genomics-facility/scanpy-notebook-image/blob/master/LICENSE)
* __Project name:__ {{ PROJECT_IGF_ID }}
{% if SAMPLE_IGF_ID %}* __Sample name:__ {{ SAMPLE_IGF_ID }}{% endif %}

## Table of contents

* [Introduction](#Introduction)
* [Tools required](#Tools-required)
* [Loading required libraries](#Loading-required-libraries)
* [Input parameters](#Input-parameters)
* [Reading data from Cellranger output](#Reading-data-from-Cellranger-output)
  * [Reading output of Scanpy](#Reading-output-of-Scanpy)
  * [Reading output of Velocyto](#Reading-output-of-Velocyto)
* [Estimate RNA velocity](#Estimate-RNA-velocity)
  * [Dynamical Model](#Dynamical-Model)
* [Project the velocities](#Project-the-velocities)
* [Interpret the velocities](#Interprete-the-velocities)
* [Identify important genes](#Identify-important-genes)
* [Kinetic rate parameters](#Kinetic-rate-paramters)
* [Latent time](#Latent-time)
* [Top-likelihood genes](#Top-likelihood-genes)
* [Cluster-specific top-likelihood genes](#Cluster-specific-top-likelihood-genes)
* [Velocities in cycling progenitors](#Velocities-in-cycling-progenitors)
* [Speed and coherence](#Speed-and-coherence)
* [Velocity graph and pseudotime](#Velocity-graph-and-pseudotime)
* [PAGA velocity graph](#PAGA-velocity-graph)
* [References](#References)
* [Acknowledgement](#Acknowledgement)


## Introduction
This notebook runs RNA velocity analysis (for a single sample) using the [scVelo](https://scvelo.readthedocs.io/) package. Most of the code and documentation used in this notebook have been copied from the following sources:

* [RNA Velocity Basics](https://scvelo.readthedocs.io/VelocityBasics/)
* [Dynamical Modeling](https://scvelo.readthedocs.io/DynamicalModeling/)

## Tools required
* [scVelo](https://scvelo.readthedocs.io/)

## Loading required libraries

We need to load all the required libraries into the environment before we can run any of the analysis steps. We also print the version information for the major packages used in the analysis.

In [None]:
%matplotlib inline
import logging
import pandas as pd
import scvelo as scv
scv.logging.print_version()

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Input parameters

In [None]:
scanpy_h5ad = '{{ SCANPY_H5AD }}'
loom_file = '{{ VELOCYTO_LOOM }}'
threads = {{ CPU_THREADS }}
genome_build = '{{ GENOME_BUILD }}'

In [None]:
s_genes = {{ CUSTOM_S_GENES_LIST }}
g2m_genes = {{ CUSTOM_G2M_GENES_LIST }}

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Reading data from Cellranger output
### Reading output of Scanpy
We have already processed the count data using [Scanpy](https://scanpy.readthedocs.io/en/stable/). Now we are loading the h5ad file using scVelo.

In [None]:
adata = scv.read(scanpy_h5ad, cache=True)

### Reading output of Velocyto
We have already generated a loom file using [Velocyto](http://velocyto.org/velocyto.py/). Now we are loading the loom file into scVelo.

In [None]:
ldata = scv.read(loom_file, cache=True)

In [None]:
if genome_build == 'TAIR10':
    ldata.var_names = ldata.var['Accession']

In [None]:
ldata.var_names_make_unique()

In [None]:
adata = scv.utils.merge(adata, ldata)

The following plot displays the proportions of spliced/unspliced counts per cluster.

In [None]:
scv.pl.proportions(adata, groupby='leiden', dpi=150)

Further, we need the first and second order moments (means and uncentered variances) computed among nearest neighbors in PCA space, summarized in `scv.pp.moments`. First order is needed for deterministic velocity estimation, while stochastic estimation also requires second order moments.

In [None]:
scv.pp.moments(adata, n_neighbors=30, n_pcs=20, use_highly_variable=True)
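
As a rough illustration of what these moments are, the sketch below computes the first-order moment (mean) and the second-order moment (mean of squares, i.e. the uncentered variance) of a toy count vector over each cell's neighbors. The counts and the neighbor lists are made-up values, and `neighbor_moments` is a hypothetical helper, not part of scVelo; `scv.pp.moments` itself works on a nearest-neighbor graph in PCA space and on whole matrices.

```python
# Toy illustration only: neighbor_moments is a hypothetical helper, not scVelo API.
def neighbor_moments(values, neighbors):
    """Return (first, second) order moments of `values` over each cell's neighbors.

    values    : list of floats, one value per cell (e.g. spliced counts of one gene)
    neighbors : list of lists; neighbors[i] holds the indices of the cells
                (including i itself) that are averaged for cell i
    """
    first, second = [], []
    for idx in neighbors:
        vals = [values[j] for j in idx]
        first.append(sum(vals) / len(vals))                  # mean
        second.append(sum(v * v for v in vals) / len(vals))  # uncentered variance
    return first, second


# Made-up spliced counts for four cells and a made-up neighbor graph
spliced = [0.0, 2.0, 4.0, 6.0]
knn = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
ms, mss = neighbor_moments(spliced, knn)
print(ms)
print(mss)
```

Averaging over neighbors smooths out the sparsity of single-cell counts, which is why the velocity estimation operates on these moments rather than on raw counts.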

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Estimate RNA velocity

Velocities are vectors in gene expression space and represent the direction and speed of movement of the individual cells. The velocities are obtained by modeling transcriptional dynamics of splicing kinetics, either stochastically (default) or deterministically (by setting `mode='deterministic'`). For each gene, a steady-state ratio of pre-mature (unspliced) and mature (spliced) mRNA counts is fitted, which constitutes a constant transcriptional state. Velocities are then obtained as residuals from this ratio. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.
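
The steady-state idea can be sketched for a single gene as follows. This is a simplified illustration on made-up numbers and not scVelo's implementation: the ratio `gamma` is fitted here by ordinary least squares through the origin over all cells, and scVelo's actual fitting procedure differs. `steady_state_velocity` is a hypothetical helper name.

```python
# Toy illustration only: steady_state_velocity is a hypothetical helper, not scVelo API.
def steady_state_velocity(unspliced, spliced):
    """Fit u = gamma * s through the origin and return (gamma, residuals)."""
    gamma = (sum(u * s for u, s in zip(unspliced, spliced))
             / sum(s * s for s in spliced))
    velocity = [u - gamma * s for u, s in zip(unspliced, spliced)]
    return gamma, velocity


# Made-up counts of one gene in four cells
u = [1.0, 3.0, 1.0, 2.0]
s = [2.0, 4.0, 4.0, 4.0]
gamma, v = steady_state_velocity(u, s)
print(gamma)  # fitted steady-state ratio
print(v)      # > 0: more unspliced than expected; < 0: less unspliced than expected
```

Cells with a positive residual lie above the steady-state line in the phase portrait, cells with a negative residual lie below it.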

### Dynamical Model

We run the dynamical model to learn the full transcriptional dynamics of splicing kinetics.

It is solved in a likelihood-based expectation-maximization framework, by iteratively estimating the parameters of reaction rates and latent cell-specific variables, i.e. transcriptional state and cell-internal latent time. It thereby aims to learn the unspliced/spliced phase trajectory for each gene.

In [None]:
scv.tl.recover_dynamics(adata, n_jobs=threads)

In [None]:
scv.tl.velocity(adata, mode='dynamical')

The computed velocities are stored in `adata.layers` just like the count matrices.

The combination of velocities across genes can then be used to estimate the future state of an individual cell. In order to project the velocities into a lower-dimensional embedding, transition probabilities of cell-to-cell transitions are estimated. That is, for each velocity vector we find the likely cell transitions that are in accordance with that direction. The transition probabilities are computed using cosine correlation between the potential cell-to-cell transitions and the velocity vector, and are stored in a matrix denoted as velocity graph. The resulting velocity graph has dimension $n_{obs} \times n_{obs}$ and summarizes the possible cell state changes that are well explained through the velocity vectors (for runtime speedup it can also be computed on reduced PCA space by setting `approx=True`).

In [None]:
scv.tl.velocity_graph(adata)
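
The scoring of candidate transitions described above can be sketched for a single cell. This is a toy example under stated assumptions: it uses plain cosine similarity on made-up two-gene vectors, and the exponential kernel with a `scale` parameter is just one simple way to turn scores into probabilities. `cosine` and `transition_probabilities` are hypothetical helpers, not scVelo functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def transition_probabilities(cell, velocity, neighbors, scale=10.0):
    """Score each neighbor by how well the step towards it matches `velocity`."""
    scores = []
    for nb in neighbors:
        displacement = [n - c for n, c in zip(nb, cell)]
        scores.append(cosine(velocity, displacement))
    weights = [math.exp(scale * sc) for sc in scores]  # sharpen the scores
    total = sum(weights)
    return scores, [w / total for w in weights]


# Made-up 2-gene example: the velocity points along the first gene axis
cell = [1.0, 1.0]
velocity = [1.0, 0.0]
neighbors = [[2.0, 1.0], [1.0, 2.0], [0.0, 1.0]]  # ahead, sideways, behind
scores, probs = transition_probabilities(cell, velocity, neighbors)
print(scores)
print(probs)
```

The neighbor lying in the direction of the velocity vector receives almost all of the probability mass, while the neighbor in the opposite direction receives almost none.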

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Project the velocities

Finally, the velocities are projected onto any embedding, specified by basis, and visualized in one of these ways:

* on cellular level with `scv.pl.velocity_embedding`
* as gridlines with `scv.pl.velocity_embedding_grid`
* or as streamlines with `scv.pl.velocity_embedding_stream`

In [None]:
scv.pl.velocity_embedding(
    adata,
    basis='umap',
    color='leiden',
    arrow_size=2,
    arrow_length=2,
    legend_loc='center right',
    figsize=(9,7),
    dpi=150)

In [None]:
scv.pl.velocity_embedding_grid(
    adata,
    basis='umap',
    color='leiden',
    arrow_size=1,
    arrow_length=2,
    legend_loc='center right',
    figsize=(9,7),
    dpi=150)

In [None]:
scv.pl.velocity_embedding_stream(
    adata,
    basis='umap',
    color='leiden',
    linewidth=0.5,
    figsize=(9,7),
    dpi=150)

The velocity vector field displayed as streamlines yields fine-grained insights into the developmental processes. In the pancreas dataset used in the scVelo tutorial, for example, it delineates the cycling population of ductal cells and endocrine progenitors, and it illuminates cell states of lineage commitment, cell-cycle exit, and endocrine cell differentiation.

We get the most fine-grained resolution of the velocity vector field at the single-cell level, with each arrow showing the direction and speed of movement of an individual cell.


<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Interprete the velocities
 
We will examine the phase portraits of some marker genes, visualized with `scv.pl.velocity(adata, gene_names)` and `scv.pl.scatter(adata, gene_names)`.

Gene activity is orchestrated by transcriptional regulation.

Transcriptional induction for a particular gene results in an increase of (newly transcribed) precursor unspliced mRNAs while, conversely, repression or absence of transcription results in a decrease of unspliced mRNAs. Spliced mRNA is produced from unspliced mRNA and follows the same trend with a time lag. Time is a hidden/latent variable.

Thus, the dynamics need to be inferred from what is actually measured: spliced and unspliced mRNAs as displayed in the phase portrait.
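
The time lag between unspliced and spliced mRNA can be illustrated by integrating the standard splicing kinetics $du/dt = \alpha - \beta u$ and $ds/dt = \beta u - \gamma s$ with a simple Euler scheme. The rate values, the step size and the function name `simulate_splicing` are made up for this sketch; it is not how scVelo fits the model.

```python
# Toy illustration only: simulate_splicing is a hypothetical helper, not scVelo API.
def simulate_splicing(alpha, beta, gamma, t_end=20.0, dt=0.01):
    """Euler integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s."""
    u, s = 0.0, 0.0
    trajectory = [(0.0, u, s)]
    steps = int(round(t_end / dt))
    for i in range(1, steps + 1):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u, s = u + dt * du, s + dt * ds
        trajectory.append((i * dt, u, s))
    return trajectory


# Made-up rates: induction starts at t = 0 with no mRNA present
traj = simulate_splicing(alpha=2.0, beta=1.0, gamma=0.5)
t_early, u_early, s_early = traj[50]  # t = 0.5
t_late, u_late, s_late = traj[-1]     # t = 20
print(u_early, s_early)  # unspliced rises first, spliced lags behind
print(u_late, s_late)    # both approach alpha/beta and alpha/gamma
```

Plotting `u` against `s` along such a trajectory gives the curved induction branch that is visible in the phase portraits.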


We collect the top marker gene for each cluster from the Scanpy output.

In [None]:
top_marker_genes = \
    pd.DataFrame(
        adata.uns['rank_genes_groups']['names']).\
    head(1).\
    values.\
    tolist()[0]
pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(1)

Now plotting phase and velocity plot for top marker genes.

The phase plot shows spliced against unspliced expression with the steady-state fit. Further, the embedding is shown colored by velocity and expression.


In [None]:
scv.pl.velocity(adata, top_marker_genes, ncols=1, figsize=(9,7), dpi=150)

The black line corresponds to the estimated 'steady-state' ratio, i.e. the ratio of unspliced to spliced mRNA abundance which is in a constant transcriptional state. RNA velocity for a particular gene is determined as the residual, i.e. how much an observation deviates from that steady-state line. Positive velocity indicates that a gene is up-regulated, which occurs for cells that show higher abundance of unspliced mRNA for that gene than expected in steady state. Conversely, negative velocity indicates that a gene is down-regulated.

In [None]:
scv.pl.scatter(
    adata,
    top_marker_genes,
    add_outline=True,
    color='leiden',
    ncols=2,
    dpi=150)

In [None]:
scv.pl.scatter(
    adata,
    top_marker_genes,
    add_outline=True,
    color='velocity',
    ncols=2,
    dpi=150)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Identify important genes

We need a systematic way to identify genes that may help explain the resulting vector field and inferred lineages. To do so, we can test which genes have cluster-specific differential velocity expression, being significantly higher/lower compared to the remaining population. The module `scv.tl.rank_velocity_genes` runs a differential velocity t-test and outputs a gene ranking for each cluster. Thresholds can be set (e.g. `min_corr`) to restrict the test to a selection of gene candidates.

In [None]:
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=.3)
df = \
    scv.DataFrame(
        adata.uns['rank_velocity_genes']['names'])
df.head()
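
To give an idea of what a differential velocity test compares, the sketch below computes Welch's t-statistic for the velocities of one gene in a cluster against the remaining cells. The numbers are made up, `welch_t` is a hypothetical helper, and the exact statistic and filtering used by `scv.tl.rank_velocity_genes` may differ in detail.

```python
import math
import statistics


def welch_t(group, rest):
    """Welch's t-statistic for the difference in means between two samples."""
    mean_g, mean_r = statistics.mean(group), statistics.mean(rest)
    var_g, var_r = statistics.variance(group), statistics.variance(rest)
    return (mean_g - mean_r) / math.sqrt(var_g / len(group) + var_r / len(rest))


# Made-up velocities of one gene, split by cluster membership
velocity = [2.0, 3.0, 4.0, 0.0, 1.0, -1.0]
cluster = ['0', '0', '0', '1', '1', '1']
in_cluster = [v for v, c in zip(velocity, cluster) if c == '0']
others = [v for v, c in zip(velocity, cluster) if c != '0']
t_stat = welch_t(in_cluster, others)
print(t_stat)
```

A large positive statistic means the gene's velocity is higher in the cluster than in the rest of the cells; ranking genes by such a statistic gives a per-cluster gene list.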

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Kinetic rate paramters
The rates of RNA transcription, splicing and degradation are estimated without the need of any experimental data.

They can be useful to better understand the cell identity and phenotypic heterogeneity.

In [None]:
df = adata.var
df = df[(df['fit_likelihood'] > .1) & (df['velocity_genes'] == True)]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(
        df['fit_alpha'],
        xlabel='transcription rate',
        **kwargs)
    pl.hist(
        df['fit_beta'] * df['fit_scaling'],
        xlabel='splicing rate',
        xticks=[.1, .4, 1], **kwargs)
    pl.hist(
        df['fit_gamma'],
        xlabel='degradation rate',
        xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata, 'fit*', dropna=True).head()

The estimated gene-specific parameters comprise rates of transcription (`fit_alpha`), splicing (`fit_beta`), degradation (`fit_gamma`), switching time point (`fit_t_`), a scaling parameter to adjust for under-represented unspliced reads (`fit_scaling`), standard deviation of unspliced and spliced reads (`fit_std_u`, `fit_std_s`), the gene likelihood (`fit_likelihood`), inferred steady-state levels (`fit_steady_u`, `fit_steady_s`) with their corresponding p-values (`fit_pval_steady_u`, `fit_pval_steady_s`), the overall model variance (`fit_variance`), and a scaling factor to align the gene-wise latent times to a universal, gene-shared latent time (`fit_alignment_scaling`).

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Latent time

The dynamical model recovers the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and approximates the real time experienced by cells as they differentiate, based only on its transcriptional dynamics.

In [None]:
scv.tl.latent_time(adata)
scv.pl.scatter(
    adata,
    color='latent_time',
    color_map='gnuplot',
    size=80, dpi=150)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Top-likelihood genes

Driver genes display pronounced dynamic behavior and are systematically detected via their characterization by high likelihoods in the dynamic model.

In [None]:
top_genes = \
    adata.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(
    adata,
    basis=top_genes[:15],
    color='leiden',
    ncols=3,
    frameon=False,
    dpi=150)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Cluster-specific top-likelihood genes

Moreover, partial gene likelihoods can be computed for each cluster of cells to enable cluster-specific identification of potential drivers.

In [None]:
scv.tl.rank_dynamical_genes(adata, groupby='leiden')
df = scv.DataFrame(adata.uns['rank_dynamical_genes']['names'])
df.head(5)

In [None]:
adata.obs['leiden'].drop_duplicates().sort_values().values.tolist()

In [None]:
for cluster in adata.obs['leiden'].drop_duplicates().sort_values().values.tolist():
    scv.pl.scatter(
        adata,
        df[cluster][:3],
        ylabel=cluster,
        color='leiden',
        frameon=False)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Velocities in cycling progenitors

The cell cycle detected by RNA velocity is biologically affirmed by cell cycle scores (standardized scores of mean expression levels of phase marker genes).

Unless gene lists are provided for the S and G2M phases, scVelo calculates scores and assigns a cell cycle phase (G1, S, G2M) using the list of cell cycle genes defined in _Tirosh et al, 2015_ (https://doi.org/10.1126/science.aad0501).

In [None]:
score_cell_cycle_genes = False
if s_genes is not None and g2m_genes is not None and \
   isinstance(s_genes, list) and isinstance(g2m_genes, list) and \
   len(s_genes) > 0 and len(g2m_genes) > 0:
    print('Using custom cell cycle genes')
    scv.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
    score_cell_cycle_genes = True
else:
    if genome_build in ('HG38', 'MM10', 'MM39'):
        print('Using predefined cell cycle genes')
        scv.tl.score_genes_cell_cycle(adata, s_genes=None, g2m_genes=None)
        score_cell_cycle_genes = True
    else:
        logging.warning("Skipping step for cell cycle genes scoring")
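
How a phase can be derived from the two scores is sketched below. The rule shown (G1 if both scores are negative, otherwise the phase with the higher score) is a commonly used convention and is stated here as an assumption about the scoring step; `assign_phase` is a hypothetical helper and the score values are made up.

```python
# Toy illustration only: assign_phase is a hypothetical helper, not scVelo API.
def assign_phase(s_score, g2m_score):
    """Assign a cell cycle phase from the two scores of a single cell."""
    if s_score < 0 and g2m_score < 0:
        return 'G1'   # neither the S nor the G2M program is active
    if g2m_score > s_score:
        return 'G2M'
    return 'S'


# Made-up (S_score, G2M_score) pairs for four cells
scores = [(-0.2, -0.1), (0.4, 0.1), (0.1, 0.6), (0.3, -0.5)]
phases = [assign_phase(s, g2m) for s, g2m in scores]
print(phases)
```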

In [None]:
if score_cell_cycle_genes:
    scv.pl.scatter(
        adata,
        color_gradients=['S_score', 'G2M_score'],
        smooth=True,
        perc=[5, 95],
        dpi=150)

The previous module also computed a Spearman's correlation score, which we can use to rank/sort the phase marker genes and then display their phase portraits.

In [None]:
if score_cell_cycle_genes:
    s_genes, g2m_genes = \
        scv.utils.get_phase_marker_genes(adata)
    s_genes = \
        scv.get_df(
            adata[:, s_genes],
            'spearmans_score',
            sort_values=True).index
    g2m_genes = \
        scv.get_df(
            adata[:, g2m_genes],
            'spearmans_score',
            sort_values=True).index

    kwargs = \
        dict(
            frameon=False,
            ylabel='cell cycle genes',
            color='leiden',
            ncols=3,
            dpi=150)
    scv.pl.scatter(adata, list(s_genes[:5]) + list(g2m_genes[:5]), **kwargs)

In [None]:
if score_cell_cycle_genes:
    scv.pl.velocity(
        adata,
        list(s_genes[:5]) + list(g2m_genes[:5]),
        ncols=1,
        add_outline=True,
        dpi=150)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Speed and coherence

Two more useful stats:

* The speed or rate of differentiation is given by the length of the velocity vector.
* The coherence of the vector field (i.e., how a velocity vector correlates with its neighboring velocities) provides a measure of confidence.
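
Both quantities can be sketched on toy data. The example assumes made-up two-gene velocity vectors and a made-up neighbor list, and it uses plain cosine similarity as a stand-in for the correlation computed by scVelo; `vector_length`, `cosine` and `coherence` are hypothetical helpers.

```python
import math


def vector_length(v):
    """Euclidean length of a velocity vector (the 'speed')."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = vector_length(a) * vector_length(b)
    return dot / norm if norm > 0 else 0.0


def coherence(velocities, neighbors, i):
    """Mean cosine similarity between cell i's velocity and its neighbors' velocities."""
    sims = [cosine(velocities[i], velocities[j]) for j in neighbors[i]]
    return sum(sims) / len(sims)


# Made-up velocities: cells 0 and 1 agree, cell 2 points the opposite way
velocities = [[3.0, 4.0], [6.0, 8.0], [-3.0, -4.0]]
neighbors = [[1], [0], [0, 1]]
print(vector_length(velocities[0]))         # speed of cell 0
print(coherence(velocities, neighbors, 0))  # agrees with its neighbor
print(coherence(velocities, neighbors, 2))  # opposes its neighbors
```

A cell whose velocity points the same way as its neighbors gets a coherence close to 1, while a cell moving against its neighborhood gets a value close to -1.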

In [None]:
scv.tl.velocity_confidence(adata)

In [None]:
scv.pl.scatter(adata, c='velocity_length', cmap='coolwarm', perc=[5, 95], figsize=(9,7), dpi=150)

In [None]:
scv.pl.scatter(adata, c='velocity_confidence', cmap='coolwarm', perc=[5, 95], figsize=(9,7), dpi=150)

These provide insights into where cells differentiate at a slower/faster pace, and where the direction is un-/determined.

In [None]:
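# Select both columns with a list; tuple-style selection on a groupby is not supported by recent pandas
df = adata.obs.groupby('leiden')[['velocity_length', 'velocity_confidence']].mean().T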
df.style.background_gradient(cmap='coolwarm', axis=1)

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## Velocity graph and pseudotime

We can visualize the velocity graph to portray all velocity-inferred cell-to-cell connections/transitions. It can be confined to high-probability transitions by setting a `threshold`. In the pancreas dataset of the scVelo tutorial, for instance, the graph indicates two phases of Epsilon cell production, coming from early and late Pre-endocrine cells.

In [None]:
scv.pl.velocity_graph(adata, threshold=.1, color='leiden', figsize=(9,7), dpi=150)

Further, the graph can be used to draw descendants/ancestors coming from a specified cell. Here, the cell with index 70 is traced to its potential fate.

In [None]:
x, y = \
    scv.utils.get_cell_transitions(
        adata,
        basis='umap',
        starting_cell=70)
ax = \
    scv.pl.velocity_graph(
        adata,
        c='lightgrey',
        edge_width=.05,
        show=False,
        dpi=150)
ax = \
    scv.pl.scatter(
        adata,
        x=x,
        y=y,
        s=120,
        c='ascending',
        cmap='gnuplot',
        ax=ax,
        figsize=(9,7),
        dpi=150)

Finally, based on the velocity graph, a velocity pseudotime can be computed. After inferring a distribution over root cells from the graph, it measures the average number of steps it takes to reach a cell after walking along the graph starting from the root cells.

In contrast to diffusion pseudotime, it implicitly infers the root cells and is based on the directed velocity graph instead of the similarity-based diffusion kernel.


In [None]:
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot', figsize=(9,7), dpi=150)
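
The notion of counting steps along a directed graph can be illustrated with a deliberately simplified sketch. It assumes a single, known root cell and uses the shortest number of directed steps found by breadth-first search, whereas the velocity pseudotime infers the root cells and averages over walks on the velocity graph. The graph and the helper `steps_from_root` are made up for illustration.

```python
from collections import deque


def steps_from_root(graph, root):
    """Breadth-first search: minimum number of directed steps from `root` to each cell."""
    steps = {root: 0}
    queue = deque([root])
    while queue:
        cell = queue.popleft()
        for nxt in graph.get(cell, []):
            if nxt not in steps:
                steps[nxt] = steps[cell] + 1
                queue.append(nxt)
    return steps


# Made-up directed transitions between five cells; cell 'a' is assumed to be the root
graph = dict(a=['b'], b=['c', 'd'], c=['e'], d=['e'])
steps = steps_from_root(graph, 'a')
max_steps = max(steps.values())
# Rescale the step counts to the range [0, 1]
pseudotime = dict((cell, n / max_steps) for cell, n in steps.items())
print(pseudotime)
```

Because the edges are directed, cells can only be reached by following the inferred direction of differentiation, which is the key difference from a similarity-based ordering.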

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## PAGA velocity graph

[PAGA](https://doi.org/10.1186/s13059-019-1663-x) graph abstraction has been benchmarked as a top-performing method for trajectory inference. It provides a graph-like map of the data topology with weighted edges corresponding to the connectivity between two clusters. Here, PAGA is extended by velocity-inferred directionality.

In [None]:
scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')

In [None]:
scv.pl.paga(
    adata,
    basis='umap',
    size=50,
    alpha=.1,
    dpi=150,
    figsize=(9,7),
    min_edge_width=2,
    node_size_scale=1.2)
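
The idea of abstracting cell-level transitions to directed cluster-level edges can be sketched as below. This is only an illustration on made-up data: it sums transition weights between clusters, whereas PAGA uses a statistical model of connectivity, and `cluster_transitions` is a hypothetical helper.

```python
from collections import defaultdict


def cluster_transitions(edges, labels):
    """Sum cell-to-cell transition weights per (source cluster, target cluster) pair."""
    totals = defaultdict(float)
    for source, target, weight in edges:
        a, b = labels[source], labels[target]
        if a != b:  # ignore transitions within a cluster
            totals[(a, b)] += weight
    return dict(totals)


# Made-up directed cell-to-cell transitions: (source cell, target cell, weight)
edges = [(0, 1, 0.5), (0, 2, 0.25), (1, 2, 0.5),
         (1, 3, 0.25), (2, 3, 0.75), (2, 1, 0.125)]
labels = ['A', 'A', 'B', 'B']  # cluster of each cell
cluster_edges = cluster_transitions(edges, labels)
print(cluster_edges)
```

In this toy example the summed weight from cluster A to cluster B is much larger than in the reverse direction, which would correspond to a directed edge from A to B in the abstracted graph.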

<div align="right"><a href="#Table-of-contents">Go to TOC</a></div>

## References
* [scVelo](https://scvelo.readthedocs.io/)
* [RNA Velocity Basics](https://scvelo.readthedocs.io/VelocityBasics/)
* [Dynamical Modeling](https://scvelo.readthedocs.io/DynamicalModeling/)


## Acknowledgement
The Imperial BRC Genomics Facility is supported by NIHR funding to the Imperial Biomedical Research Centre.