# Pancreatic Endocrinogenesis

## Overview

### Pancreas anatomy

<img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Blausen_0699_PancreasAnatomy2.png" width=500 />

### Physiology

- mixed or heterocrine gland, i.e. it has both an endocrine (1%) and a digestive exocrine (99%) function
- endocrine:
  -  insulin, glucagon, somatostatin, and pancreatic polypeptide
- exocrine (pancreatic juice):
  - bicarbonate (neutralizes acid entering the duodenum from the stomach)
  - digestive enzymes, which break down carbohydrates, proteins, and fats
 
### Cellular development

<img src="https://upload.wikimedia.org/wikipedia/commons/4/43/Panc.png"  width = 600/>

Pancreatic progenitor cells are precursor cells that differentiate into the functional pancreatic cells, including exocrine acinar cells, endocrine islet cells, and ductal cells.[17] These progenitor cells are characterised by the co-expression of the transcription factors PDX1 and NKX6-1.[17]

The cells of the exocrine pancreas differentiate through molecules that induce differentiation including follistatin, fibroblast growth factors, and activation of the Notch receptor system.[17] Development of the exocrine acini progresses through three successive stages. These are the predifferentiated, protodifferentiated, and differentiated stages, which correspond to undetectable, low, and high levels of digestive enzyme activity, respectively.[17]

Pancreatic progenitor cells differentiate into endocrine islet cells under the influence of neurogenin-3 and ISL1, but only in the absence of Notch receptor signaling. Under the direction of a Pax gene, the endocrine precursor cells differentiate to form alpha and gamma cells. Under the direction of Pax-6, the endocrine precursor cells differentiate to form beta and delta cells.[17] The pancreatic islets form as the endocrine cells migrate from the duct system to form small clusters around capillaries.[9] This occurs around the third month of development,[11] and insulin and glucagon can be detected in the human fetal circulation by the fourth or fifth month of development.[17]



### From http://www.vivo.colostate.edu/hbooks/pathphys/endocrine/pancreas/anatomy.html

Pancreatic islets house three major cell types, each of which produces a different endocrine product:

- Alpha cells (A cells) secrete the hormone glucagon.
- Beta cells (B cells) produce insulin and are the most abundant of the islet cells.
- Delta cells (D cells) secrete the hormone somatostatin, which is also produced by a number of other endocrine cells in the body.

Interestingly, the different cell types within an islet are not randomly distributed: beta cells occupy the central portion of the islet and are surrounded by a "rind" of alpha and delta cells. Aside from insulin, glucagon and somatostatin, a number of other "minor" hormones have been identified as products of pancreatic islet cells.

Islets are richly vascularized, allowing their secreted hormones ready access to the circulation. Although islets comprise only 1-2% of the mass of the pancreas, they receive about 10 to 15% of the pancreatic blood flow. Additionally, they are innervated by parasympathetic and sympathetic neurons, and nervous signals clearly modulate secretion of insulin and glucagon.

## Pancreatic progenitor cell (source https://en.wikipedia.org/wiki/Pancreatic_progenitor_cell)

- pancreatic progenitors
  - Pdx1 : earliest marker for pancreatic differentiation
  - Mnx1/Hlxb1 ?
- exocrine cells
  - acinar cells
    - amylase, lipase, peptidase
  - ductal cells
- endocrine cells
  - beta cells -> insulin
  - alpha cells -> glucagon
  - delta cells -> somatostatin
  - PP cells -> pancreatic polypeptide
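
The outline above can be written down as a small lookup table, which may be handy later when cluster labels need to be mapped to their secreted products. This is only a sketch: the names `PANCREATIC_LINEAGE` and `products_of` are illustrative and not part of any library.

```python
# Illustrative lookup table built from the outline above (all names are hypothetical).
PANCREATIC_LINEAGE = {
    "exocrine": {
        "acinar cells": ["amylase", "lipase", "peptidase"],
        "ductal cells": [],  # the outline lists no secreted product here
    },
    "endocrine": {
        "beta cells": ["insulin"],
        "alpha cells": ["glucagon"],
        "delta cells": ["somatostatin"],
        "PP cells": ["pancreatic polypeptide"],
    },
}


def products_of(cell_type):
    """Return (compartment, products) for a cell type, or raise KeyError."""
    for compartment, cells in PANCREATIC_LINEAGE.items():
        if cell_type in cells:
            return compartment, cells[cell_type]
    raise KeyError(cell_type)


print(products_of("beta cells"))
```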
  
## [YouTube: Pancreas Clinical Anatomy and Physiology](https://youtu.be/9TSt9IuozMg)

## [Genetic programming of liver and pancreas progenitors: lessons for stem-cell differentiation](https://www.nature.com/articles/nrg2318)

## Data description

### Data source

We will reanalyze data from the paper [Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis](https://journals.biologists.com/dev/article/146/12/dev173849/19483/Comprehensive-single-cell-mRNA-profiling-reveals-a)
- Code: https://github.com/theislab/pancreatic-endocrinogenesis
- Data: [GSE132188](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132188)

There is another paper from the same authors: [Generalizing RNA velocity to transient cell states through dynamical modeling](https://www.nature.com/articles/s41587-020-0591-3).
The paper states:
  > Endocrine cells are derived from endocrine progenitors located in the pancreatic epithelium, marked by transient expression of the transcription factor Ngn3. Endocrine commitment terminates in four major fates: glucagon-producing α-cells, insulin-producing β-cells, somatostatin-producing δ-cells and ghrelin-producing ε-cells.


## Prepare the environment

In [None]:
import numpy as np
import pandas as pd
import scanpy as sc
import scanpy.logging as logg
import scvelo as scv
import matplotlib.pyplot as plt
import seaborn as sns
import gseapy as gp
import plotly.graph_objects as go
scv.set_figure_params()
scv.settings.presenter_view = True  # set max width size for presenter view
scv.settings.set_figure_params('scvelo')  # for beautified visualization
plt.rcParams['figure.figsize']=(8,8) #rescale figures
sc.settings.verbosity = 3
# Matplotlib backwards compatibility hack
import matplotlib
matplotlib.cbook.iterable = np.iterable

### Loading data from various DBs

In [None]:
# PanglaoDB Gene Markers
panglao_gene_markers = pd.read_csv('../../external/Panglao/PanglaoDB_markers_27_Mar_2020.tsv', sep='\t')
panglao_gene_markers = panglao_gene_markers[(panglao_gene_markers.species == 'Mm') | (panglao_gene_markers.species == 'Mm Hs')]
panglao_gene_markers = panglao_gene_markers.rename(columns = {'official gene symbol': 'gene', 'cell type': 'cell_type', 'ubiquitousness index': 'ui'})
panglao_gene_markers = panglao_gene_markers.loc[:, ['gene', 'cell_type', 'organ', 'ui']]
print("Markers: ",  len(panglao_gene_markers["gene"].unique()))
print("Cell types: ", len(panglao_gene_markers["cell_type"].unique()))
print("Organs: ", len(panglao_gene_markers["organ"].unique()))
panglao_gene_markers.loc[:, 'gene'] = panglao_gene_markers.loc[:, 'gene'].str.lower().str.capitalize()
cell_type_groups = panglao_gene_markers.loc[:, ['gene', 'cell_type']].groupby('cell_type')
gene_markers = {}
for name, data in cell_type_groups:
    gene_markers[name] = data.gene.values.tolist()
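
The symbol normalization and grouping done in the cell above can be illustrated without pandas. The sketch below uses made-up rows in place of the PanglaoDB table, and the helper name `to_mouse_symbol` is hypothetical. It also shows a caveat of `lower().capitalize()`: symbols with an internal capital, such as the mitochondrial genes, do not come out in their mouse form.

```python
from collections import defaultdict


def to_mouse_symbol(symbol):
    """Approximate mouse-style capitalisation, e.g. 'INS1' -> 'Ins1'."""
    return symbol.lower().capitalize()


# Hypothetical (gene, cell type) rows standing in for the marker table.
rows = [("INS1", "Beta cells"), ("GCG", "Alpha cells"), ("INS2", "Beta cells")]

# Group normalised symbols by cell type, as the groupby loop above does.
markers_by_cell_type = defaultdict(list)
for gene, cell_type in rows:
    markers_by_cell_type[cell_type].append(to_mouse_symbol(gene))

print(dict(markers_by_cell_type))
# Caveat: this yields 'Mt-co1', whereas the mouse symbol is written 'mt-Co1'.
print(to_mouse_symbol("MT-CO1"))
```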

In [None]:
# Cell Marker DB
cell_marker_gene_markers = pd.read_csv('../../external/CellMarker/Mouse_cell_markers.txt', sep='\t')
cell_marker_gene_markers = cell_marker_gene_markers[cell_marker_gene_markers.cellType == 'Normal cell']
cell_marker_gene_markers = cell_marker_gene_markers.rename(columns = {'geneSymbol' : 'gene', 'tissueType': 'organ', 'cellName': 'cell_type' })
cell_marker_gene_markers = cell_marker_gene_markers.loc[:, ['organ', 'cell_type', 'gene']]
#display(cell_marker_gene_markers)
cell_marker_gene_markers = cell_marker_gene_markers.set_index(['organ', 'cell_type']) \
    .apply(lambda x: x.str.split(',').explode().str.strip()) \
    .reset_index()
print("Markers: ",  len(cell_marker_gene_markers["gene"].unique()))
print("Cell types: ", len(cell_marker_gene_markers["cell_type"].unique()))
print("Organs: ", len(cell_marker_gene_markers["organ"].unique()))
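
The `split(',').explode()` chain above turns one row with a comma-separated gene list into one row per gene. A plain-Python sketch of the same reshaping, on made-up rows, is shown below. Skipping empty entries is an extra assumption of this sketch and is not done by the pandas chain.

```python
# Hypothetical rows: (organ, cell type, comma-separated gene symbols).
rows = [
    ("Pancreas", "Beta cell", "Ins1, Ins2,Pdx1"),
    ("Pancreas", "Alpha cell", "Gcg"),
]


def explode_genes(rows):
    """Yield one (organ, cell_type, gene) tuple per listed gene."""
    for organ, cell_type, genes in rows:
        for gene in genes.split(","):
            gene = gene.strip()
            if gene:  # assumption of this sketch: drop empty entries
                yield organ, cell_type, gene


long_rows = list(explode_genes(rows))
for row in long_rows:
    print(row)
```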

In [None]:
common_markers = cell_marker_gene_markers.set_index(['organ', 'gene']) \
    .join(panglao_gene_markers.set_index(['organ', 'gene']), lsuffix='_cell_marker', rsuffix='_panglao', how = 'inner') \
    .reset_index()
common_markers.to_csv('CellTypes.csv')

In [None]:
# # Get transcription factors from TRRUST
# tf_mouse_interactions = pd.read_csv("../../external/trrustv2/trrust_rawdata.mouse.tsv", sep = "\t", header = None, names = ["TF1", "TF2", "Effect", "Reference"])
# tf_mouse = set(tf_mouse_interactions["TF1"]) #.union(set(tf_mouse_interactions["TF2"]))
# tf_mouse = list(tf_mouse)
# print(len(tf_mouse))

# Get transcription factors from TFDB3
tf_mouse_table = pd.read_csv("../../external/AnimalTFDB3/Mus_musculus_TF.txt", sep = "\t")
tf_mouse = tf_mouse_table.Symbol
print("Number of transcription factors: ", (len(tf_mouse)))

In [None]:
# TF activities From Saez-Rodriguez' Lab 'Dorothea'
tf_interactions_sr = pd.read_csv("../../external/Dorothea/Supplemental_Tables/GarciaAlonso_Supplemental_Tables/RegulonsNormal.csv")
tf_interactions_sr

## Helper functions

In [None]:
def gene_tf_stats(expr_data, tf_list):
    gene_max = expr_data.X.max(axis = 0).toarray().flatten()
    tf_ind = np.in1d(expr_data.var.index.values, tf_list)
    tf_max = gene_max[tf_ind]
    tf_names = expr_data.var.index.values[tf_ind]
    fig, axes = plt.subplots(1, 2, figsize = (15, 5))
    axes[0].hist(gene_max, bins = 50, log = True, range = (0, 500))
    axes[0].set_title("Max cell count distribution per gene")
    axes[1].hist(tf_max, bins = 50, log = True, range = (0, 100))
    axes[1].set_title("Max cell count distribution per TF")
    # Report how many genes and TFs exceed each maximum-count threshold
    for cnt_threshold in (0, 10, 20, 40, 100, 200):
        print(f"Gene Count > {cnt_threshold}: {(gene_max > cnt_threshold).sum()}, TF count > {cnt_threshold}: {(tf_max > cnt_threshold).sum()}")
        if cnt_threshold == 20:
            print("TF: ", tf_names[tf_max >= 20])
    print("Genes: ", expr_data.var.index.values[gene_max > 200])

## Using processed expression matrices

In [None]:
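# Read the raw gene matrices extracted from the GEO archives, e.g. for the E14.5 sample:
# tar -xvzf GSM3852754_E14_5_counts.tar.gz; mv mm10 E14_5
# The E15.5 sample loaded below is assumed to have been extracted the same way into E15_5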
expr_data = sc.read_10x_mtx("../../external/GSE132188_RAW/E15_5", cache = True)
expr_data

### QC

In [None]:
# Counts per cell (sparse sums are flattened to 1-D so they can be stored as obs columns)
expr_data.obs['n_counts'] = np.asarray(expr_data.X.sum(axis = 1)).ravel()
# Log-counts per cell
expr_data.obs['log_counts'] = np.log10(expr_data.obs['n_counts'])
# Genes per cell
expr_data.obs['n_genes'] = np.asarray((expr_data.X > 0).sum(axis = 1)).ravel()
# Fraction of counts per cell coming from mitochondrial genes (mouse symbols start with 'mt-')
mt_gene_mask = [gene.startswith('mt-') for gene in expr_data.var_names]
mt_gene_index = np.where(mt_gene_mask)[0]
mt_counts = np.asarray(expr_data.X[:, mt_gene_index].sum(axis = 1)).ravel()
expr_data.obs['mt_frac'] = mt_counts / expr_data.obs['n_counts'].values
# Show sample info
display(expr_data.obs)
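
To make the three QC quantities concrete, here is a minimal sketch on a toy count matrix, written in plain Python rather than with scanpy. The gene names and counts are invented for illustration, and `qc_metrics` is a hypothetical helper.

```python
# Toy count matrix: rows are cells, columns are genes (values are made up).
genes = ["Ins1", "Gcg", "mt-Co1", "mt-Nd1"]
counts = [
    [5, 0, 1, 0],  # cell 0
    [0, 3, 4, 3],  # cell 1
]

# Columns holding mitochondrial genes, identified by the 'mt-' prefix.
mt_columns = [j for j, g in enumerate(genes) if g.startswith("mt-")]


def qc_metrics(row):
    """Total counts, number of detected genes and mitochondrial fraction of one cell."""
    n_counts = sum(row)
    n_genes = sum(1 for x in row if x > 0)
    mt_frac = sum(row[j] for j in mt_columns) / n_counts if n_counts else 0.0
    return {"n_counts": n_counts, "n_genes": n_genes, "mt_frac": mt_frac}


qc = [qc_metrics(row) for row in counts]
for cell in qc:
    print(cell)
```

A high `mt_frac` combined with few detected genes is the pattern that the filters in the next cells are meant to remove.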

In [None]:
# Plot quality metrics
fig, axes = plt.subplots(1, 3, figsize = (8, 10))
sc.pl.violin(expr_data, 'n_counts', cut = 0, ax = axes[0], show = False)
sc.pl.violin(expr_data, 'n_genes', cut = 0, ax = axes[1], show = False)
sc.pl.violin(expr_data, 'mt_frac', cut = 0, ax = axes[2], show = True)

#sc.pl.scatter(expr_data, 'n_counts', 'n_genes', color = 'mt_frac')
sc.pl.scatter(expr_data, 'n_counts', 'n_genes', color='mt_frac')
# sc.pl.scatter(expr_data, 'n_counts', 'mt_frac', ax = axes[1, 1], show = False)
# sc.pl.scatter(expr_data, 'n_genes', 'mt_frac', ax = axes[1, 2], show = False)

In [None]:
print('Total number of cells: {:d}'.format(expr_data.n_obs))
# Copy the subset so the in-place filters below act on real data rather than on a view
expr_data = expr_data[expr_data.obs['mt_frac'] < 0.2].copy()
print('Number of cells after MT filter: {:d}'.format(expr_data.n_obs))

sc.pp.filter_cells(expr_data, min_genes = 1200)
print('Number of cells after gene filter: {:d}'.format(expr_data.n_obs))
#Filter genes:
print('Total number of genes: {:d}'.format(expr_data.n_vars))

# Min 20 cells - filters out 0 count genes
sc.pp.filter_genes(expr_data, min_cells=20)
print('Number of genes after cell filter: {:d}'.format(expr_data.n_vars))
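
The order of the filters matters: the gene filter is evaluated only on the cells that survived the cell filters. The sketch below reproduces that sequence on a toy matrix. The data are invented, and the two `MIN_*` thresholds are scaled down for illustration (the notebook uses 1200 genes and 20 cells).

```python
MAX_MT_FRAC = 0.2  # same threshold as in the cell above
MIN_GENES = 2      # toy value; the notebook uses 1200
MIN_CELLS = 2      # toy value; the notebook uses 20

genes = ["Ins1", "Gcg", "mt-Co1"]
counts = [
    [4, 1, 0],
    [3, 0, 2],
    [0, 0, 5],
    [6, 2, 1],
]


def mt_fraction(row):
    """Fraction of a cell's counts that come from 'mt-' genes."""
    total = sum(row)
    mt = sum(x for x, g in zip(row, genes) if g.startswith("mt-"))
    return mt / total if total else 0.0


# 1) Cell filters: mitochondrial fraction, then number of detected genes.
kept_cells = [
    row for row in counts
    if mt_fraction(row) < MAX_MT_FRAC and sum(x > 0 for x in row) >= MIN_GENES
]

# 2) Gene filter, evaluated on the remaining cells only.
kept_genes = [
    g for j, g in enumerate(genes)
    if sum(row[j] > 0 for row in kept_cells) >= MIN_CELLS
]

print(len(kept_cells), kept_genes)
```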

### Normalization

In [None]:
# Inspect raw count statistics, then normalize each cell to 10,000 total counts and apply log1p (x -> log(1 + x))
gene_tf_stats(expr_data, tf_mouse)
sc.pp.normalize_total(expr_data, target_sum = 1e4)
sc.pp.log1p(expr_data)
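
The effect of the two calls above is, per cell, `x -> log(1 + x * target_sum / total_counts)`. The sketch below applies that formula by hand to two invented cells with the same composition but different sequencing depth. `normalize_log1p` is a hypothetical helper, not a scanpy function.

```python
import math

TARGET_SUM = 1e4  # same target as in the call above


def normalize_log1p(row, target_sum=TARGET_SUM):
    """Scale a cell to `target_sum` total counts, then apply log(1 + x)."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    return [math.log1p(x * target_sum / total) for x in row]


cell_a = [1, 3, 0]    # 4 counts in total
cell_b = [10, 30, 0]  # same composition, sequenced 10x deeper

norm_a = normalize_log1p(cell_a)
norm_b = normalize_log1p(cell_b)
print(norm_a)
print(norm_b)
```

After normalization the two cells have the same profile, which is the point of removing the depth effect before comparing expression.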

### Highly variable genes

In [None]:
sc.pp.highly_variable_genes(expr_data, flavor = 'cell_ranger', n_top_genes = 5000)
print('\n','Number of highly variable genes: {:d}'.format(np.sum(expr_data.var['highly_variable'])))
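
Highly variable gene selection ranks genes by how much their expression varies relative to their mean. The sketch below uses the plain dispersion (variance divided by mean) on invented values. The flavor used above additionally normalizes dispersions within bins of similar mean expression, which this simplified example leaves out.

```python
from statistics import mean, pvariance


def dispersion(values):
    """Variance-to-mean ratio; 0 for genes that are never expressed."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


# Hypothetical expression values of three genes across four cells.
expression = {
    "GeneA": [1, 1, 1, 1],  # constant, so not informative
    "GeneB": [0, 0, 4, 0],  # expressed in a single cell
    "GeneC": [2, 3, 2, 3],  # mild variation
}

ranked = sorted(expression, key=lambda g: dispersion(expression[g]), reverse=True)
top_genes = ranked[:2]
print(top_genes)
```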

In [None]:
expr_data.var

In [None]:
sc.pl.highly_variable_genes(expr_data)

In [None]:
sc.pl.highest_expr_genes(expr_data, n_top=20)

In [None]:
# Keep only the highly variable genes; copy so that later in-place steps do not act on a view
expr_data_hvg = expr_data[:, expr_data.var['highly_variable']].copy()
expr_data_hvg

### Embedding

In [None]:
sc.pp.pca(expr_data_hvg)
print("...done PCA")
sc.pp.neighbors(expr_data_hvg)
print("...done finding neighbours")
# sc.tl.diffmap(expr_data_hvg)
# print("...done Diffmap")
sc.tl.umap(expr_data_hvg, n_components = 2)
print('...done UMAP')


In [None]:
#Transfer UMAP and other coordinates to full data objects
expr_data.obsm['X_umap'] = expr_data_hvg.obsm['X_umap']
# expr_data.obsm['X_tsne'] = expr_data_hvg.obsm['X_tsne']
#expr_data.obsm['X_diffmap'] = expr_data_hvg.obsm['X_diffmap']
#expr_data.uns['diffmap_evals'] = expr_data_hvg.uns['diffmap_evals']

In [None]:
sc.pl.umap(expr_data)

### Cell cycle

In [None]:
#Score cell cycle and visualize the effect:
cc_genes_file = "../../external/scanpy/Macosko_cell_cycle_genes.txt"
cc_genes = pd.read_table(cc_genes_file, delimiter='\t')
s_genes = cc_genes['S'].dropna()
g2m_genes = cc_genes['G2.M'].dropna()

s_genes_mm = [gene.lower().capitalize() for gene in s_genes]
g2m_genes_mm = [gene.lower().capitalize() for gene in g2m_genes]

s_genes_mm_ens = expr_data.var_names[np.in1d(expr_data.var_names, s_genes_mm)]
g2m_genes_mm_ens = expr_data.var_names[np.in1d(expr_data.var_names, g2m_genes_mm)]

sc.tl.score_genes_cell_cycle(expr_data, s_genes=s_genes_mm_ens, g2m_genes=g2m_genes_mm_ens)
sc.pl.umap(expr_data, color='phase', use_raw=False)
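
Once each cell has an S score and a G2/M score, the phase label follows from a simple decision rule. The sketch below shows the rule commonly used for this step: a cell is called G1 when both scores are negative, and otherwise takes the phase with the larger score. `assign_phase` is a hypothetical helper and the scores are invented.

```python
def assign_phase(s_score, g2m_score):
    """G1 if both scores are negative, otherwise the phase with the larger score."""
    if s_score < 0 and g2m_score < 0:
        return "G1"
    return "G2M" if g2m_score > s_score else "S"


# Hypothetical (S score, G2/M score) pairs for three cells.
scores = [(-0.2, -0.1), (0.4, 0.1), (0.1, 0.6)]
phases = [assign_phase(s, g2m) for s, g2m in scores]
print(phases)
```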

### Clustering

In [None]:
#expr_data_hvg.obs['Ngn3+'] = pd.Categorical(list(map(str,list(expr_data[:,'Neurog3'].X > 0))))
#expr_data_hvg.obs
# Perform clustering - using highly variable genes
sc.tl.leiden(expr_data_hvg, resolution = 1.0, key_added='leiden_r1')
sc.tl.leiden(expr_data_hvg, resolution = 0.5, key_added='leiden_r0.5')
sc.tl.paga(expr_data_hvg, groups = 'leiden_r1')

In [None]:
#Visualize the clustering and how this is reflected by different technical covariates
sc.pl.umap(expr_data_hvg, color=['leiden_r1', 'leiden_r0.5'], legend_loc = 'on data', palette=sc.pl.palettes.default_28) # default_102

In [None]:
# PDX1, Fox A2 (HNF3β), Onecut1 (HNF6), HB9, Isl1, Ptf1a (p48), neurogenin 3, Beta2/NeuroD1, Nkx2.2, PAX4, PAX6, and Nkx6.1
#expr_data.var[expr_data.var.index.str.match('Pdx1|Foxa2|Onecut1|Ptf1a|Mnx1|Isl1|Hnf1b')]
sc.pl.paga(expr_data_hvg, color = 'leiden_r0.5')

In [None]:
sc.tl.rank_genes_groups(expr_data_hvg, groupby='leiden_r1', key_added='rank_genes_clusters')
sc.pl.rank_genes_groups(expr_data_hvg, key='rank_genes_clusters', fontsize=12, ncols = 3)

In [None]:
 #gp.get_library_name(database = 'Mouse')

In [None]:
# with open('pancreas/gsea/gene_markers.gmt', 'w') as f:
#     for k, v in gene_markers.items():
#         f.write(f"{k} {' '.join(v)}\n")
#gene_list = expr_data_hvg.uns['rank_genes_clusters']['names']['6']
# gene_list = pd.DataFrame({'gene': gene_list})
# gene_list.to_csv("pancreas/gsea/6.txt", header = False, index = False)
#"pancreas/gsea/6.txt"
enr_res = {}
rank_gene_clusters = pd.DataFrame.from_records(expr_data_hvg.uns['rank_genes_clusters']['names'])

for cl in rank_gene_clusters.columns.values:
    print("Processing cluster ", cl)
    enr_res[cl] = gp.enrichr(
        gene_list = rank_gene_clusters[cl][:50].tolist(),
        gene_sets = gene_markers, # alternatively a .gmt file such as "pancreas/gsea/gene_markers.gmt", or a library name such as 'GeneSigDB'
        #organism = 'Mouse', # set the organism when using an Enrichr library, e.g. Yeast
        description='pancreas_mouse',
        outdir='pancreas/gsea/' + cl,
        background = expr_data.var.shape[0],
        # no_plot=True,
        cutoff=0.5, # test dataset, use lower value from range(0,1)
        verbose = True
    )
    display(enr_res[cl].res2d.sort_values('Adjusted P-value').head(n = 5))
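
The enrichment step asks whether the top-ranked genes of a cluster overlap a marker set more than expected by chance, given the size of the background. A minimal sketch of such an over-representation test, using the hypergeometric upper tail, is shown below. It is meant to illustrate the idea and is not necessarily the exact statistic that gseapy computes. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K belong to the marker set."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: background of 10 genes, 4 of them markers, 3 top genes drawn,
# 2 of which are markers.
p_value = hypergeom_upper_tail(k=2, n=3, K=4, N=10)
print(p_value)
```

With a realistic background (the number of genes in `expr_data`) and 50 top genes per cluster, the same function applies unchanged, because the binomial coefficients are computed with exact integers.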

## Using velocity data

In [None]:
data_15_velo = scv.datasets.pancreas()
display(data_15_velo)

In [None]:
scv.pl.proportions(data_15_velo)

In [None]:
scv.pp.filter_genes(data_15_velo, min_shared_counts = 20)
scv.pp.normalize_per_cell(data_15_velo)
scv.pp.filter_genes_dispersion(data_15_velo, n_top_genes = 3000)
scv.pp.log1p(data_15_velo)
sc.pp.highly_variable_genes(data_15_velo, min_mean = 0.0125, max_mean = 3, min_disp = 0.5)

### Clustering

In [None]:
#sc.pl.umap(data_15_velo, color = 'clusters')

In [None]:
sc.pp.neighbors(data_15_velo, n_neighbors=10, n_pcs=40)
sc.tl.leiden(data_15_velo, resolution = 1.0, key_added = 'leiden_1')
sc.tl.leiden(data_15_velo, resolution = 0.5, key_added = 'leiden_0.5')

In [None]:
fig, axes = plt.subplots(1, 4, figsize = (15, 5))
sc.tl.paga(data_15_velo, groups = 'leiden_1')
sc.pl.paga(data_15_velo, ax = axes[0], plot = False, show = False)  # remove `plot=False` if you want to see the coarse-grained graph
sc.tl.umap(data_15_velo, init_pos='paga')
sc.pl.umap(data_15_velo, color = 'clusters', legend_loc='on data', ax = axes[0], show = False)
sc.pl.umap(data_15_velo, color = 'leiden_1', legend_loc='on data', ax = axes[1], show = False)
sc.pl.paga(data_15_velo, color = 'leiden_1', ax = axes[2], show = False)
sc.pl.umap(data_15_velo, color = 'leiden_0.5', legend_loc='on data', ax = axes[3], show = False)


In [None]:
def umap_3d(adata, key_added = 'X_umap_3d'):
    umap3d = sc.tl.umap(adata, n_components = 3, copy = True)
    adata.obsm[key_added] = umap3d.obsm['X_umap']
    logg.info(f"Added {key_added} UMAP coordinates (adata.obsm)")
    
umap_3d(data_15_velo)

In [None]:
#Show in 3D
def plot_clust_3d(expr_data):
    # Colours from the active matplotlib colour cycle, converted to hex strings for plotly
    colors = [matplotlib.colors.to_hex(c) for c in matplotlib.rcParams["axes.prop_cycle"].by_key()["color"]]
    # One distinct colour index per cluster
    groups = {'Ductal' : 0, 'Ngn3 low EP': 1, 'Ngn3 high EP': 2, 'Pre-endocrine': 3, 'Alpha': 4, 'Beta': 5, 'Delta': 6, 'Epsilon': 7}
    expr_umap = pd.DataFrame(expr_data.obsm['X_umap_3d'], columns = ['x', 'y', 'z'])
    expr_umap['group'] = expr_data.obs['clusters'].values
    fig = go.Figure(layout = dict(width = 1200, height = 800))
    for g, ind in groups.items():
        series_data = expr_umap[expr_umap.group == g]
        fig.add_trace(go.Scatter3d(x = series_data.x, y = series_data.y, z = series_data.z, mode = 'markers', name = g,
                                  marker = dict(size = 4, color = colors[ind % len(colors)])))
    fig.show()
plot_clust_3d(data_15_velo)
# Subclustering possible
# sc.tl.leiden(adata, restrict_to=('leiden_r0.5', ['Enterocyte mature']), resolution=0.25, key_added='louvain_r0.5_entero_mat_sub')


In [None]:
sc.tl.rank_genes_groups(data_15_velo, groupby='clusters', key_added='rank_genes_clusters')
sc.pl.rank_genes_groups(data_15_velo, key='rank_genes_clusters', fontsize=12, ncols = 3)

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (15, 5))
sc.pl.rank_genes_groups_violin(data_15_velo, use_raw = True, key = 'rank_genes_clusters', groups = ['Ductal'],
                               gene_names = data_15_velo.uns['rank_genes_clusters']['names']['Ductal'][:10],
                               ax = axes[0], show = False)
sc.pl.rank_genes_groups_violin(data_15_velo, use_raw = True, key = 'rank_genes_clusters', groups = ['Beta'],
                               gene_names = data_15_velo.uns['rank_genes_clusters']['names']['Beta'][:10], 
                               ax = axes[1])

In [None]:
sc.pl.rank_genes_groups_dotplot(data_15_velo, n_genes = 4, key='rank_genes_clusters')

In [None]:
ax = plt.figure(figsize = (15, 5)).gca()
gene_mask = np.in1d(
    data_15_velo.uns['rank_genes_clusters']['names']['Ductal'],
    panglao_gene_markers.loc[(panglao_gene_markers.cell_type == 'Ductal cells') & (panglao_gene_markers.organ == 'Pancreas'), 'gene'].str.lower().str.capitalize().values
)
sc.pl.rank_genes_groups_violin(data_15_velo, use_raw = True, key = 'rank_genes_clusters', groups = ['Ductal'],
                               gene_names = data_15_velo.uns['rank_genes_clusters']['names']['Ductal'][gene_mask], ax = ax)
#display(panglao_gene_markers[(panglao_gene_markers.cell_type == 'Ductal cells') & (panglao_gene_markers.organ == 'Pancreas')])

In [None]:
ax = plt.figure(figsize = (15, 5)).gca()
gene_mask = np.in1d(
    data_15_velo.uns['rank_genes_clusters']['names']['Beta'],
    panglao_gene_markers.loc[(panglao_gene_markers.cell_type == 'Beta cells') & (panglao_gene_markers.organ == 'Pancreas'), 'gene'].str.lower().str.capitalize().values
)
sc.pl.rank_genes_groups_violin(data_15_velo, use_raw = True, key = 'rank_genes_clusters', groups = ['Beta'],
                               gene_names = data_15_velo.uns['rank_genes_clusters']['names']['Beta'][gene_mask], ax = ax)

In [None]:
marker_genes_dict = {'Ductal': ['Spp1', 'Dbi'], 'Ngn3 low EP' : ['Sparc', 'Mgst1', 'Sox9'], 'Ngn3 high EP' : ['Neurog3', 'Btbd17', 'Mdk'], 
                     'Pre-endocrine': ['Map1b', 'Fev'], 'Alpha':['Cpe', 'Tmem27'], 'Beta' : ['Pcsk2',  'Pdx1', 'Ins1', 'Ins2'], 'Delta': ['Rbp4', 'Pyy'], 'Epsilon': ['Ghrl', 'Isl1']}
ax = sc.pl.heatmap(data_15_velo, marker_genes_dict, groupby='clusters', cmap='viridis', dendrogram=True) #, standard_scale = 'obs'

In [None]:
ax = sc.pl.tracksplot(data_15_velo, marker_genes_dict, groupby='clusters', dendrogram=True)

In [None]:
#Visualize some markers
sc.pl.umap(data_15_velo, color = ['Ins1'])

### Compute velocity (fast)

In [None]:
scv.pp.moments(data_15_velo, n_pcs = 30, n_neighbors = 30)
scv.tl.velocity(data_15_velo)
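
The idea behind velocity estimation can be illustrated with the deterministic steady-state picture: for one gene, unspliced counts `u` are regressed on spliced counts `s` to obtain a degradation ratio `gamma`, and the velocity of each cell is the residual `u - gamma * s`. The sketch below fits `gamma` by least squares through the origin on invented values. scVelo's actual fitting procedure is more elaborate and is not reproduced here, and both function names are hypothetical.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin for u ~ gamma * s."""
    num = sum(s * u for s, u in zip(spliced, unspliced))
    den = sum(s * s for s in spliced)
    return num / den if den else 0.0


def velocities(spliced, unspliced):
    """Per-cell residuals u - gamma * s; positive means more unspliced than expected."""
    gamma = fit_gamma(spliced, unspliced)
    return [u - gamma * s for s, u in zip(spliced, unspliced)]


# Hypothetical counts of one gene in three cells.
spliced = [1.0, 2.0, 4.0]
unspliced = [1.0, 1.0, 2.0]

v = velocities(spliced, unspliced)
print(fit_gamma(spliced, unspliced), v)
```

A positive residual suggests the gene is being induced in that cell, and a negative one suggests repression.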

In [None]:
scv.tl.velocity_graph(data_15_velo)

In [None]:
scv.pl.velocity_embedding_stream(data_15_velo, basis='umap')

In [None]:
scv.pl.velocity_embedding(data_15_velo, arrow_length = 10, arrow_size = 2, dpi = 150)

### Compute velocity (dynamical)

In [None]:
scv.pp.moments(data_15_velo, n_pcs = 30, n_neighbors = 30)
scv.tl.recover_dynamics(data_15_velo, n_jobs = 8)
scv.tl.velocity(data_15_velo, mode='dynamical')
scv.tl.velocity_graph(data_15_velo)

In [None]:
#data_15_velo.write('data/pancreas.h5ad', compression='gzip')
#data_15_velo = scv.read('data/pancreas.h5ad')

In [None]:
scv.pl.velocity_embedding_stream(data_15_velo, basis='umap')

In [None]:
scv.pl.velocity_embedding(data_15_velo, arrow_length = 10, arrow_size = 2, dpi = 200)

In [None]:
umap_3d(data_15_velo)

In [None]:
#print(data_15_velo.obsm_keys())
scv.tl.velocity_embedding(data_15_velo, basis = 'umap_3d')
display(data_15_velo.obsm['velocity_umap_3d'])


In [None]:
#Show in 3D
def plot_clust_3d(expr_data):
    # Colours from the active matplotlib colour cycle, converted to hex strings for plotly
    colors = [matplotlib.colors.to_hex(c) for c in matplotlib.rcParams["axes.prop_cycle"].by_key()["color"]]
    # One distinct colour index per cluster
    groups = {'Ductal' : 0, 'Ngn3 low EP': 1, 'Ngn3 high EP': 2, 'Pre-endocrine': 3, 'Alpha': 4, 'Beta': 5, 'Delta': 6, 'Epsilon': 7}
    expr_umap = pd.DataFrame(expr_data.obsm['X_umap_3d'], columns = ['x', 'y', 'z'])
    expr_umap['group'] = expr_data.obs['clusters'].values
    # Append the 3D velocity vectors so each cell carries a position and a direction
    expr_umap_vel = pd.DataFrame(expr_data.obsm['velocity_umap_3d'], columns = ['u', 'v', 'w'])
    expr_umap = pd.concat([expr_umap, expr_umap_vel], axis = 1)
#     display(expr_umap)
    fig = go.Figure(layout = dict(width = 1200, height = 800))
    for g, ind in groups.items():
        series_data = expr_umap[expr_umap.group == g]
        fig.add_trace(go.Scatter3d(x = series_data.x, y = series_data.y, z = series_data.z, mode = 'markers', name = g,
                                  marker = dict(size = 4, color = colors[ind % len(colors)])))
        fig.add_trace(go.Cone(x = series_data.x, y = series_data.y, z = series_data.z, u = series_data.u, v = series_data.v, w = series_data.w,
                                   sizemode = "absolute", sizeref = 0.1))
    fig.show()
plot_clust_3d(data_15_velo)

In [None]:
scv.pl.velocity(data_15_velo, ['Cpe',  'Gnao1', 'Ins2', 'Adk'], ncols = 2)

In [None]:
scv.tl.score_genes_cell_cycle(data_15_velo)
scv.pl.scatter(data_15_velo, color_gradients=['S_score', 'G2M_score'], smooth=True, perc=[5, 95])

In [None]:
scv.tl.velocity_confidence(data_15_velo)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(data_15_velo, c=keys, cmap='coolwarm', perc=[5, 95])

In [None]:
x, y = scv.utils.get_cell_transitions(data_15_velo, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(data_15_velo, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(data_15_velo, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)

In [None]:
x, y = scv.utils.get_cell_transitions(data_15_velo, basis='umap', starting_cell = 70, n_steps = 200)
ax = scv.pl.velocity_graph(data_15_velo, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(data_15_velo, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)

In [None]:
scv.tl.latent_time(data_15_velo)
scv.pl.scatter(data_15_velo, color='latent_time', color_map='gnuplot', size=80)

## Imputation & Denoising

In [None]:
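# Reload the dataset from scratch; this discards the clustering, velocity and latent-time results computed above
data_15_velo = scv.datasets.pancreas()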
scv.pp.filter_genes(data_15_velo, min_shared_counts = 20)
data_15_velo.raw = data_15_velo


In [None]:
# fig, axes = plt.subplots(1, 2, figsize = (15, 5))
# axes[0].hist(gene_max, bins = 50, log = True, range = (0, 500))
# axes[0].set_title("Max cell count distribution per gene")
# axes[1].hist(tf_max, bins = 50, log = True, range = (0, 100))
# axes[1].set_title("Max cell count distribution per TF")
# cnt_threshold = 0 
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# cnt_threshold = 10
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# cnt_threshold = 20
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# print("TF: ", tf_names[tf_max >= 20])
# cnt_threshold = 40
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# cnt_threshold = 100
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# cnt_threshold = 200
# print(f"Gene Count > {cnt_threshold}: {gene_max[gene_max > cnt_threshold].shape[0]}, TF count > {cnt_threshold}: {tf_max[tf_max > cnt_threshold].shape[0]}", )
# print("Genes: ", data_15_velo.var.index.values[gene_max > 200])

In [None]:
#sc.external.pp.dca(data_15_velo, optimizer = 'Adam', verbose = True)

In [None]:
# Keep only the transcription factors present in the filtered dataset
# tf_mouse = data_15_velo.var.index[np.in1d(data_15_velo.var.index, tf_mouse)].values
# print(len(tf_mouse))

In [None]:
# fig, axes = plt.subplots(25, 4, figsize = (15, 100))
# for i in range(100):
#     #tf1 = tf_mouse[np.random.randint(len(tf_mouse))]
#     tf1 = tf_mouse[i]
#     ax = axes.flatten()[i]
#     ax.plot(data_15_velo.raw[traj1, [tf1]].X.toarray().flatten(), 'r', label = 'Raw')
#     ax.plot(data_15_velo[traj1, tf1].X, 'g', label = 'Smoothed')
#     ax.set_title(tf1)
#     ax.legend()


In [None]:
from scipy.sparse import csr_matrix
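# Keep only the transcription factors that are present in the filtered dataset
data_tf = data_15_velo[:, data_15_velo.var_names.isin(tf_mouse)].copy()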
data_tf.X = csr_matrix(data_tf.X)
sc.tl.rank_genes_groups(data_tf, groupby='clusters', use_raw = False, key_added='rank_tf_clusters')

In [None]:
fig, axes = plt.subplots(4, 2, figsize = (15, 40))

groups = ['Ductal', 'Ngn3 low EP', 'Ngn3 high EP', 'Pre-endocrine', 'Alpha', 'Beta', 'Delta', 'Epsilon']
for i in range(8):
    ax = axes.flatten()[i]
    group = groups[i]
    sc.pl.rank_genes_groups_violin(data_tf, key = 'rank_tf_clusters', groups = group,
                                   gene_names = data_tf.uns['rank_tf_clusters']['names'][group][:20],
                                   ax = ax, show = False)

In [None]:
groups = ['Ductal', 'Ngn3 low EP', 'Ngn3 high EP', 'Pre-endocrine', 'Alpha', 'Beta', 'Delta', 'Epsilon']
top_tf = []
for g in groups: 
    top_tf = top_tf + data_tf.uns['rank_tf_clusters']['names'][g][:10].tolist()
sc.pl.heatmap(data_15_velo, var_names = top_tf, groupby = 'clusters', swap_axes = True, figsize = (10, 15), show_gene_labels=True, log = True)

In [None]:
#data_tf[data_tf.obs.clusters == 'Ngn3 low EP', 'Neurog3'].X.max()
data_tf.X.max()

In [None]:
top_genes = data_15_velo.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(data_15_velo, var_names = top_genes, sortby = 'latent_time', 
               col_color = 'clusters', n_convolve=100 , figsize = (7, 20))

In [None]:
data_15_velo

## Transcription factor analysis

In [None]:
# #Visualize the clustering and how this is reflected by different technical covariates
# sc.pl.umap(adata, color=['louvain_r1', 'louvain_r0.5'], palette=sc.pl.palettes.default_64)


In [None]:
# Get transcription factors from TRRUST
tf_mouse_interactions = pd.read_csv("../../external/trrustv2/trrust_rawdata.mouse.tsv", sep = "\t", header = None, names = ["TF1", "TF2", "Effect", "Reference"])
tf_mouse = set(tf_mouse_interactions["TF1"]) #.union(set(tf_mouse_interactions["TF2"]))
tf_mouse = list(tf_mouse)
tf_mouse.sort()
#print(tf_mouse)
# Copy the subset so that columns can be added to obs without modifying a view
expr_tf = data_15_velo[:, np.in1d(data_15_velo.var.index, tf_mouse)].copy()
expr_tf.obs['latent_time'] = data_15_velo.obs['latent_time']
expr_tf

In [None]:
data_15_velo

In [None]:
# #sc.pl.umap(data_15_velo, color=['Neurog3', 'Sox9'])
# #sc.pl.umap(data_15_velo, color=['Hnf1b'])
# # sc.pl.umap(data_15_velo, color=['Mnx1', 'Isl1'])
# # sc.pl.umap(data_15_velo, color=['Cpa1', 'Neurog3', 'Ins1'])
# X = scv.utils.get_cell_transitions(data_15_velo, starting_cell = 70, n_steps = 200)
# print(X)
# tf_names = ['Neurog3', 'Cpa1', 'Nkx2.2']
# tf_indices = data_15_velo.var.index.get_indexer(tf_names)
# #data_15_velo.X
# tf_levels = data_15_velo.X[X, :][:, tf_indices].toarray()
# #print(tf_levels)
# #plt.plot()
# fig, axes = plt.subplots(1, 2, figsize = (15, 5))
# axes[0].plot(tf_levels)
# scv.pl.velocity_graph(data_15_velo, c='lightgrey', edge_width=.05, show=False, ax = axes[1])
# scv.pl.scatter(data_15_velo, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax = axes[1])
top_genes = expr_tf.var['fit_likelihood'].sort_values(ascending=False)
scv.pl.heatmap(data_15_velo, var_names = top_genes.index[:300], sortby = 'latent_time', 
               col_color = 'clusters', figsize = (7, 40), colorbar = True)
# tg = data_15_velo.var['fit_likelihood'].sort_values(ascending=False)
# tg = tg[~np.isnan(tg)]
# tg[:300]

In [None]:
def show_points_umap(ax, data, ids, color):
    cells_umap = data.obsm['X_umap'][ids, :]
    scv.pl.scatter(data, x = cells_umap[:, 0], y = cells_umap[:, 1], s = 120, ax = ax, color = color, show = False)
    
def follow_trajectories(expr_data, t_start, t_end):
    # Select cells whose latent time lies in a window of width (t_end - t_start) / 10 centred on t_start
    stp = (t_end - t_start) / 10
    latent_time = expr_data.obs['latent_time'].values
    start_sel = (latent_time >= t_start - stp/2) & (latent_time < t_start + stp/2)
    start_cells = np.flatnonzero(start_sel)
    next_cells = np.zeros(start_cells.shape[0], dtype = "int32")
    print(start_cells)
    # For each selected cell, take a single step along the velocity-derived transitions
    for i in range(len(start_cells)):
        next_cells[i] = scv.utils.get_cell_transitions(expr_data, starting_cell = start_cells[i], n_steps = 1)[1]
    print(next_cells)
    
    fig, axes = plt.subplots(1, 2, figsize = (15, 5))
#     axes[0].plot(tf_levels)
    #scv.pl.velocity_graph(data_15_velo, c = 'lightgrey', edge_width=.05, show=False, ax = axes[1])
    scv.pl.velocity_embedding(expr_data, arrow_length = 10, arrow_size = 2, ax = axes[1], show = False)
    show_points_umap(axes[1], expr_data, start_cells, 'red')
    show_points_umap(axes[1], expr_data, next_cells, 'black')
#     scv.pl.scatter(data_15_velo, x = start_cells_umap[:, 0], y = start_cells_umap[:, 1], s = 120, ax = axes[1], color = 'red', cmap='gnuplot')
#     scv.pl.scatter(data_15_velo, x = start_cells_umap[:, 0], y = start_cells_umap[:, 1], s = 120, ax = axes[1], color = 'green', cmap='gnuplot')
    
    
follow_trajectories(data_15_velo, t_start = 0.3, t_end = 0.7)

In [None]:
# Get transcription factors from 
top_genes.index.str.match('Onecut1')

In [None]:
def plot_distrib():
    rnd_gene = np.random.randint(expr_data.X.shape[1])
    data = expr_data.X[:, rnd_gene].toarray()
    data = data[data > 0]
    print(data)
    sns.histplot(data).set_title(expr_data.var.index[rnd_gene])
plot_distrib()

## Further steps

### Representation learning of RNA velocity reveals robust cell transitions

- Paper: https://www.biorxiv.org/content/10.1101/2021.03.19.436127v1.full
- Code: https://github.com/qiaochen/VeloRep
- Network architecture: https://github.com/qiaochen/VeloRep/blob/e7840bad413a3cc171c1039d29bd70db96274438/veloproj/util.py

<img src="https://www.biorxiv.org/content/biorxiv/early/2021/03/20/2021.03.19.436127/F1.large.jpg?width=800&height=600&carousel=1" width=700 />

- Encoder:
  - conventional encoder
  - multi-layer perceptrons with 1 hidden layer
  - graph convolutional network module (cohort aggregation, KNN)
- Decoder:
  - Attentive combination module

### Single cell RNA-seq denoising using a deep count autoencoder

- Paper: https://www.biorxiv.org/content/10.1101/300681v1

In [None]:
import scanpy.external as sce
import scvelo as scv
data_15_velo = scv.datasets.pancreas()
display(data_15_velo)

In [None]:
# Must install dca via: pip install dca
sce.pp.dca(data_15_velo)