# Treg Cell Atlas

## Introduction
This tutorial shows how to visualize an atlas of multiple single-cell RNA samples, including the distribution of cell types within each sample and some of the marker gene characteristics.

Cell atlases should describe the basic cell types found, how they are distributed in individual samples, and the key defining marker genes for each cell type. The goal is generally to help future researchers navigate the cell types and expressed genes in similar types of samples.

Though these plots use a regulatory T cell (Treg) dataset, the `.h5ad` file should be easy to swap for another atlas from CZI cellxgene or other sources.

In this tutorial we will look at data from [Glasner et al. 2023, "Conserved transcriptional connectivity of regulatory T cells in the tumor microenvironment informs new combination cancer therapy strategies"](https://www.nature.com/articles/s41590-023-01504-2), which is [available from cellxgene](https://cellxgene.cziscience.com/collections/efd94500-1fdc-4e28-9e9f-a309d0154e21).

<img 
    src="./assets/01_Combined_Figure.png" 
    alt="Atlas Overview Figure"
    align="center" 
    style="border: 2px solid #ccc; border-radius: 8px; padding: 5px; width: 100%; box-shadow: 0px 4px 8px rgba(0,0,0,0.1);">
    
## Workflow Steps
1. Visualize UMAP features of cell clustering to determine if cell types are well-separated.
2. Visualize cell type distribution within each sample to detect sample-associated differences.
3. Visualize marker genes for each cell type to validate cell type assignments.

## Workflow Input Data
* Pre-processed AnnData atlas with:
  * UMAP embeddings in `.obsm`
  * a cell type label for each cell in `.obs`
  * a sample name for each cell in `.obs`
* Mapping of cell type --> list of marker genes. This could come from a subject matter expert, or be generated automatically by a statistical tool or a scverse tool.
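
Before plotting, it can help to confirm that an atlas actually carries these fields. The sketch below is a hypothetical helper (`missing_atlas_fields` is not part of anndata or scanpy). It assumes the key names used later in this tutorial: `X_umap` in `.obsm`, which is where `sc.pl.umap` looks for coordinates, and `cell_type` and `donor_id` in `.obs`. Other atlases may use different names.

```python
# Hypothetical helper, not part of anndata or scanpy.
# The key names below are the ones this tutorial uses; adjust them for other atlases.
REQUIRED_OBS = ("cell_type", "donor_id")   # per-cell annotations
REQUIRED_OBSM = ("X_umap",)                # scanpy's default key for UMAP coordinates


def missing_atlas_fields(obs_columns, obsm_keys):
    """Return the required field names that are absent, grouped by slot."""
    obs_columns, obsm_keys = set(obs_columns), set(obsm_keys)
    return {
        "obs": [c for c in REQUIRED_OBS if c not in obs_columns],
        "obsm": [k for k in REQUIRED_OBSM if k not in obsm_keys],
    }


# Toy example: an atlas that has cell types and a UMAP but no donor column.
example = missing_atlas_fields(["cell_type", "n_genes"], ["X_pca", "X_umap"])
print(example)
```

With a loaded atlas, the call would look like `missing_atlas_fields(adata.obs.columns, adata.obsm.keys())`.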

# Setup and data download (~600MB)

In [None]:
import requests

import anndata as ad
import holoviews as hv
import scanpy as sc
import pooch

from hv_anndata import Dotmap

hv.extension("bokeh")

# download the data
anndata_file_path = pooch.retrieve(
    url="https://datasets.cellxgene.cziscience.com/32149a2b-b637-481b-8e04-b7c4c2dd68db.h5ad",
    known_hash="md5:be84940746cfb3e25fcb0432e55ddfde",
    fname="32149a2b-b637-481b-8e04-b7c4c2dd68db.h5ad",
    path="data-download"
)

In [None]:
%%time
adata = ad.read_h5ad(anndata_file_path)

## Figure A: UMAP of Cell Type and Sample Source

### Questions:
- How are different cell types distributed in our UMAP embedding, and do they form distinct clusters?
- Are there any unexpected mixing patterns between cell types?

### Features:
- Switch the cell point coloring between assigned cell type and sample source of each cell.

### Inputs:
- UMAP coordinates for each cell
- Cell type annotations
- Sample source identifiers

### Expected Output/Evaluation:
Well-defined cell type clusters with minimal batch effects appear as distinct color regions in the cell type view and as evenly mixed sample colors in the sample view.
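
Visual inspection of sample mixing can be complemented with a rough number. The sketch below is one possible heuristic, not a method from the source paper: for each cell type it computes the normalized Shannon entropy of the sample composition. The function name `donor_mixing` is hypothetical, and the column names `cell_type` and `donor_id` are assumed to match this dataset. It also assumes at least two samples, and it does not correct for samples that contribute very different numbers of cells.

```python
import numpy as np
import pandas as pd


def donor_mixing(obs, group_col="cell_type", sample_col="donor_id"):
    """Normalized Shannon entropy of sample composition within each group.

    1.0 means the group's cells are spread evenly over all samples;
    0.0 means a single sample contributes every cell in the group.
    """
    counts = pd.crosstab(obs[group_col], obs[sample_col])
    probs = counts.div(counts.sum(axis=1), axis=0).to_numpy(dtype=float)
    # Treat 0 * log(0) as 0 by only taking the log where the probability is positive.
    logs = np.log(probs, where=probs > 0, out=np.zeros_like(probs))
    entropy = -(probs * logs).sum(axis=1)
    n_samples = counts.shape[1]
    return pd.Series(entropy / np.log(n_samples), index=counts.index, name="mixing")


# Toy data: "T" cells come equally from both donors, "B" cells from one donor only.
toy_obs = pd.DataFrame({
    "cell_type": ["T", "T", "T", "T", "B", "B"],
    "donor_id": ["d1", "d2", "d1", "d2", "d1", "d1"],
})
mixing = donor_mixing(toy_obs)
print(mixing)
```

On the real atlas this would be called as `donor_mixing(adata.obs)`. Low values flag cell types dominated by one sample, which may reflect biology as much as batch effects.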

In [None]:
sc.pl.umap(adata, color="cell_type")

In [None]:
sc.pl.umap(adata, color="donor_id")

## Figure B: Cell Type Distribution Per Sample

### Questions:
- What is the relative abundance of each cell type across samples?
- Are there any notable sample-specific variations?

### Features:
- Linked selection with Figure A UMAP plot to provide context for where specific sample populations exist in UMAP feature space.
    - [TODO: Add an interaction to click on a sample in the bar plot to highlight the corresponding UMAP points.]
- Linked selection with Figure C DotMap plot to provide context for marker gene specificity of selected sample.
- Toggle between absolute and percentage-based cell counts to make samples comparable regardless of differences in sample size.

### Inputs:
- Cell type annotations
- Sample identifiers
- Cell counts per type per sample

### Expected Output/Evaluation:
* While allowing for biological variation, look for highly inconsistent cell type proportions in samples.
* Mark outlier samples or cell types for more detailed investigation.
  * For example: if immune cells in particular are off, look for additional marker genes to be added to the set.
* [TODO: Maybe add a metric (chi-square test?) to quantify the consistency of cell type proportions across samples and maybe sort the samples by this metric. There may be metrics in existing atlas papers, otherwise this could be a good quick pub]
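
As a starting point for the metric mentioned in the TODO, the following sketch computes the Pearson chi-square statistic and Cramér's V on the sample by cell type contingency table. It is only an illustration: the function name `composition_association` is hypothetical, the column names are assumed to match this dataset, and every sample and cell type is assumed to contain at least one cell. Because single-cell datasets contain many cells, an effect size such as Cramér's V is likely more informative than a p-value, which tends to be tiny at these sample sizes.

```python
import numpy as np
import pandas as pd


def composition_association(obs, sample_col="donor_id", type_col="cell_type"):
    """Pearson chi-square statistic and Cramér's V for a sample x cell type table."""
    observed = pd.crosstab(obs[sample_col], obs[type_col]).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if cell type proportions were identical in every sample.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Cramér's V rescales chi-square to the range [0, 1].
    cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
    return chi2, cramers_v


# Toy data 1: both donors have the same cell type proportions.
toy_same = pd.DataFrame({
    "donor_id": ["d1"] * 4 + ["d2"] * 4,
    "cell_type": ["A", "A", "B", "B"] * 2,
})
# Toy data 2: each donor contains only one cell type.
toy_split = pd.DataFrame({
    "donor_id": ["d1"] * 3 + ["d2"] * 3,
    "cell_type": ["A"] * 3 + ["B"] * 3,
})
chi2_same, v_same = composition_association(toy_same)
chi2_split, v_split = composition_association(toy_split)
print(v_same, v_split)
```

A value near 0 indicates that samples share similar cell type proportions, while a value near 1 indicates that composition depends strongly on the sample.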

In [None]:
cell_type_counts = adata.obs.groupby(['donor_id', 'cell_type'], observed=False).size().reset_index(name='count')

bars = hv.Bars(cell_type_counts, kdims=['donor_id', 'cell_type'], vdims=['count'])
bars.opts(
    hv.opts.Bars(
        stacked=True,              # Enable stacking
        color='cell_type',          # Color by celltype
        width=650,                 # Width of the plot
        height=400,                # Height of the plot
        tools=['hover'],           # Add hover tool
        xrotation=45,              # Rotate x-axis labels
        legend_position='right',   # Position of the legend
        title='Cell Count by Donor and Cell Type',
        ylabel='Count',
        xlabel='Donor',
        cmap="Category20",
    )
)
bars

In [None]:
cell_type_counts['donor_total'] = cell_type_counts.groupby('donor_id', observed=False)['count'].transform('sum')
cell_type_counts['proportion'] = cell_type_counts['count'] / cell_type_counts['donor_total'] * 100
bars = hv.Bars(cell_type_counts, kdims=['donor_id', 'cell_type'], vdims=['proportion'])
bars.opts(
    hv.opts.Bars(
        stacked=True,              # Enable stacking
        color='cell_type',          # Color by celltype
        width=650,                 # Width of the plot
        height=400,                # Height of the plot
        tools=['hover'],           # Add hover tool
        xrotation=45,              # Rotate x-axis labels
        legend_position='right',   # Position of the legend
        title='Cell Type Proportion by Donor',
        ylabel='Proportion (%)',
        xlabel='Donor',
        cmap="Category20",
    )
)
bars

## Figure C: Marker Gene Distribution

### Questions:
- Do canonical markers adequately define our cell types?
- Are there any marker genes showing unexpected expression patterns?

### Features:
- Toggle heatmap to collapse the fraction of cells in each group.
- Toggle/Tab tracksplot to expand view to include expression level of every cell in assigned cell-type cluster.

### Inputs:
- Gene expression matrix
- Cell type annotations
- Marker gene list per cell type

### Expected Output/Evaluation:
Expect high expression of markers in their assigned cell types with minimal expression elsewhere.
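
Dot plots of this kind usually summarize two numbers per gene and group: the mean expression and the fraction of cells in which the gene is detected. The sketch below computes both with pandas on a toy matrix so the expectation above can be checked numerically. The function name `dotplot_stats`, the detection threshold of 0, and the toy gene and cell names are assumptions for illustration; this is not necessarily how `Dotmap` computes its values internally.

```python
import pandas as pd


def dotplot_stats(expr, groups, threshold=0.0):
    """Mean expression and fraction of expressing cells per group.

    expr   : DataFrame of shape cells x genes
    groups : Series of group labels sharing expr's index
    """
    mean_expr = expr.groupby(groups).mean()
    # A cell counts as "expressing" a gene when its value exceeds the threshold.
    frac_expr = (expr > threshold).groupby(groups).mean()
    return mean_expr, frac_expr


# Toy data: GeneA marks "T" cells, GeneB marks "B" cells.
cells = ["c1", "c2", "c3", "c4"]
expr = pd.DataFrame(
    {"GeneA": [2.0, 4.0, 1.0, 0.0], "GeneB": [0.0, 0.0, 1.0, 3.0]},
    index=cells,
)
groups = pd.Series(["T", "T", "B", "B"], index=cells)

mean_expr, frac_expr = dotplot_stats(expr, groups)
print(mean_expr)
print(frac_expr)
```

A well-behaved marker has a high mean and a high fraction in its own cell type and low values elsewhere, as `GeneB` does for the `B` group in the toy data.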

In [None]:
signature_gene_symbols = {
    "ActivatedVEC": ["Bcl3", "Noct", "Relb", "Tnf", "Ccrl2", "Cd40", "Irf5", "Csf1", "Nfkb2", "Icosl", "Egr2", "Dll1", "Pim1", "Irf1", "Icam1", "Fgf2", "Tank", "Il6", "Tgif1", "Ninj1", "Tnip1"],
    "Angiogenesis": ["Lpl", "Cd36", "Miga2", "Tap1", "Wars1", "Cd74", "Ly6e", "Gbp6", "Ido1", "Ciita", "Oas2", "Vegfa", "Thbd", "Slco2a1", "Jup", "Icam2", "Lima1", "Cldn5", "Pard6g", "Cd47", "Fmo1", "Alas1", "Bmpr2", "Sptbn1", "Smad6", "Sema3c"],
    "Hypoxia": ["Klf6", "Nfil", "Bhlhe40", "Maff", "Serpine1", "Plaur", "Tnfaip3", "Icam1", "Nfkbia", "Junb", "Hbegf", "Rel", "Relb", "Fosl2", "Hmox1", "Timp3", "Irf8", "Batf3", "Nfkbiz", "Pvr", "Ccr7", "Stat3"],
    "EndMT": ["Emp3", "Serpina3", "Psmg4", "Cd63", "Il1r1", "Lgmn", "Csrp2", "Lcn2", "Cfb", "Lgals4", "Npm3", "Traf4", "Kpnb1", "Timp1", "Gda", "Ch25", "Tgm2", "Prkca", "Csrp2", "Ngf", "Ammecr1"]
}

In [None]:
MYGENE_QUERY_URI = "https://mygene.info/v3/query?fields=ensembl.gene&dotfield=true&size=1&from=0&fetch_all=false&facet_size=10&entrezonly=false&ensemblonly=false"
MAX_GENES_PER_GROUP = 8

adata_gene_set = set(adata.var_names)
signature_ensembl = {}
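for signature, symbols in signature_gene_symbols.items():
    # Batch query: look up every symbol of this signature in one POST request.
    query = {
        "q": symbols,
        "scopes": "symbol",
        "species": ["human"],
        "fields": "ensembl.gene",
    }
    response = requests.post(MYGENE_QUERY_URI, json=query)
    response.raise_for_status()  # fail early on HTTP errors

    ens_genes = []
    for gene in response.json():
        # A single Ensembl ID comes back as a string, several as a list.
        eg = gene.get("ensembl.gene", [])
        if isinstance(eg, str):
            eg = [eg]
        # Keep only the IDs that are present in the atlas.
        ens_genes += list(set(eg) & adata_gene_set)

    signature_ensembl[signature] = ens_genes[:MAX_GENES_PER_GROUP]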
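
The loop above silently drops symbols that MyGene.info cannot resolve. A batch query returns one entry per input term, and unresolved terms are marked with `"notfound": true`, so they can be collected and reported. The sketch below shows one way to do this on a hand-written stand-in for `response.json()`. The helper name `split_hits` is hypothetical, and the gene names and IDs in the example are placeholders, not real identifiers.

```python
def split_hits(hits):
    """Separate a MyGene.info batch response into matched and unmatched symbols."""
    found, not_found = {}, []
    for hit in hits:
        symbol = hit["query"]
        if hit.get("notfound"):
            not_found.append(symbol)
            continue
        ids = hit.get("ensembl.gene", [])
        if isinstance(ids, str):  # a single ID comes back as a plain string
            ids = [ids]
        found.setdefault(symbol, []).extend(ids)
    return found, not_found


# Hand-written stand-in for `response.json()`; names and IDs are placeholders.
example_hits = [
    {"query": "GENE1", "ensembl.gene": "ENSG_PLACEHOLDER_1"},
    {"query": "GENE2", "ensembl.gene": ["ENSG_PLACEHOLDER_2", "ENSG_PLACEHOLDER_3"]},
    {"query": "NotAGene", "notfound": True},
]
found, not_found = split_hits(example_hits)
print(found)
print(not_found)
```

Printing `not_found` for each signature makes misspelled or species-mismatched symbols visible before the dot plot is drawn.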

In [None]:
Dotmap(adata=adata, marker_genes=signature_ensembl, groupby="cell_type")