# Preliminary work
This session shows how to infer transcription factor (TF) activities from single-cell RNA-sequencing (scRNA-seq) data and spatial transcriptomics (ST) data using three methods:
- decoupleR (scRNA-seq based)
- pySCENIC (scRNA-seq based)
- STAN (ST-based)

Please follow this notebook after you have [set up the environment](https://github.com/osmanbeyoglulab/Tutorials-on-ISMB-2024?tab=readme-ov-file#environment-set-up).

In [1]:
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")

## Downloading datasets
The following scRNA-seq dataset consists of 3k peripheral blood mononuclear cells (PBMCs) from a healthy donor and is freely available from [10x Genomics](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k). This command downloads 5.9 MB of data upon the first call and stores it in `data/pbmc3k_raw.h5ad`.

In [2]:
sc.datasets.pbmc3k()

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

The following dataset contains the same 3k PBMCs after processing with [the basic Scanpy tutorial](https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html). This command downloads 24.7 MB of data upon the first call and stores it in `data/pbmc3k_processed.h5ad`.

In [3]:
sc.datasets.pbmc3k_processed()

AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

We will also download a Visium spatial transcriptomics dataset of a human lymph node, freely available from [10x Genomics](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Human_Lymph_Node). This command downloads 47.4 MB of data upon the first call and stores it in `data/V1_Human_Lymph_Node/`, including the high-resolution tissue image.

In [4]:
sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")

AnnData object with n_obs × n_vars = 4035 × 36601
    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'

## decoupleR (scRNA-seq based)
[decoupler](https://doi.org/10.1093/bioadv/vbac016) [1] is a package that collects different statistical methods for extracting biological activities from omics data within a unified framework, including pathway activity inference and transcription factor activity inference. We follow [the installation instructions](https://decoupler-py.readthedocs.io/en/latest/installation.html) to install decoupler.

[1] Badia-i-Mompel, Pau, et al. "decoupleR: ensemble of computational methods to infer biological activities from omics data." _Bioinformatics Advances_ 2.1 (2022): vbac016.

In [None]:
%pip install decoupler

In [None]:
import decoupler
decoupler.__version__

## pySCENIC (scRNA-seq based)
[pySCENIC](https://www.nature.com/articles/s41596-020-0336-2) [2] is a Python implementation of the SCENIC workflow for single-cell RNA-seq data, covering gene regulatory network inference and transcription factor activity inference. We follow [the installation instructions](https://pyscenic.readthedocs.io/en/latest/installation.html) to install pySCENIC. pySCENIC depends on other packages, such as arboreto and ctxcore.

**IMPORTANT:** To install and use pySCENIC, we recommend creating a new conda environment as instructed.

[2] Van de Sande, Bram, et al. "A scalable SCENIC workflow for single-cell gene regulatory network analysis." _Nature Protocols_ 15.7 (2020): 2247-2276.


In [None]:
%pip install pyscenic

In [None]:
import pyscenic
pyscenic.__version__

### Downloading resources and databases

Download:
- the list of human transcription factors;
- the human motif annotations;
- the gene-based [cisTarget databases](https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/) generated using the 2022 motif collection.

**IMPORTANT:** The cisTarget database files are large (on the order of gigabytes each). To avoid corrupt or incomplete downloads, the files can alternatively be downloaded with [zsync_curl](https://resources.aertslab.org/cistarget/), which resumes partially downloaded databases and fetches only missing or corrupted chunks.


In [None]:
!mkdir -p resources_pyscenic
!curl https://resources.aertslab.org/cistarget/tf_lists/allTFs_hg38.txt \
    -o resources_pyscenic/allTFs_hg38.txt
!curl https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl \
    -o resources_pyscenic/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
!curl https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather \
    -o resources_pyscenic/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
!curl https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc_v10_clust/gene_based/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather \
    -o resources_pyscenic/hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
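Because incomplete downloads are a real risk with files of this size, a quick sanity check after the `curl` calls can help. The sketch below only flags files that are missing or implausibly small (for example, an error page saved in place of the database); it cannot detect a truncated multi-gigabyte file. The helper name `find_suspect_downloads` and the `min_bytes` threshold are illustrative assumptions, not part of pySCENIC.

```python
from pathlib import Path

# File names taken from the curl commands above.
EXPECTED_FILES = [
    "allTFs_hg38.txt",
    "motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl",
    "hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather",
    "hg38_500bp_up_100bp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather",
]


def find_suspect_downloads(directory, filenames, min_bytes=1024):
    """Return (filename, reason) pairs for files that are missing or
    smaller than min_bytes (an arbitrary, assumed threshold)."""
    suspects = []
    for name in filenames:
        path = Path(directory) / name
        if not path.is_file():
            suspects.append((name, "missing"))
        elif path.stat().st_size < min_bytes:
            suspects.append((name, f"only {path.stat().st_size} bytes"))
    return suspects


# Report anything that looks wrong in the download directory used above.
for name, reason in find_suspect_downloads("resources_pyscenic", EXPECTED_FILES):
    print(f"Check {name}: {reason}")
```

If a file is reported, rerunning the corresponding `curl` command, or switching to zsync_curl as suggested above, is the simplest remedy.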

## STAN (ST-based)
[STAN](https://www.biorxiv.org/content/10.1101/2024.06.26.600782v1) [3] is a computational framework for inferring **spatially informed** transcription factor activity across cellular contexts. Specifically, STAN is a linear mixed-effects computational method that predicts spot-specific, spatially informed TF activities by integrating curated TF-target gene priors, mRNA expression, spatial coordinates, and morphological features from corresponding imaging data. 

[3] Zhang, Linan, et al. "STAN, a computational framework for inferring spatially informed transcription factor activity across cellular contexts." _bioRxiv_ (2024): 2024-06.
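To build intuition for how a TF-target prior and an expression matrix can be combined into per-spot TF activities, here is a minimal sketch using plain ridge regression on synthetic data. This is deliberately not STAN's model: it ignores spatial coordinates and image features entirely, and the function name `ridge_tf_activity`, the matrix shapes, and the penalty `alpha` are all assumptions made for illustration.

```python
import numpy as np


def ridge_tf_activity(Y, D, alpha=1.0):
    """Solve min_W ||Y - D W||_F^2 + alpha * ||W||_F^2 in closed form.

    Y: genes x spots expression matrix.
    D: genes x TFs prior matrix (nonzero where a gene is a TF target).
    Returns W: TFs x spots matrix of inferred activities.
    """
    n_tfs = D.shape[1]
    gram = D.T @ D + alpha * np.eye(n_tfs)
    return np.linalg.solve(gram, D.T @ Y)


# Synthetic example: a random binary prior and known activities.
rng = np.random.default_rng(0)
n_genes, n_tfs, n_spots = 200, 5, 30
D = (rng.random((n_genes, n_tfs)) < 0.2).astype(float)
W_true = rng.normal(size=(n_tfs, n_spots))
Y = D @ W_true + 0.01 * rng.normal(size=(n_genes, n_spots))

# With little noise and a small penalty, the estimate tracks W_true closely.
W_hat = ridge_tf_activity(Y, D, alpha=1e-3)
print(W_hat.shape)
```

According to the description above, STAN goes further by making these activities spatially informed through spot coordinates and morphological features, which this sketch omits.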

### Downloading supporting files

In [None]:
!mkdir -p resources_stan
!curl https://raw.githubusercontent.com/vitkl/cell2location_paper/1c645a0519f8f27ecef18468cf339d35d99f42e7/notebooks/selected_results/lymph_nodes_analysis/CoLocationModelNB4V2_34clusters_4039locations_10241genes_input_inferred_V4_batch1024_l2_0001_n_comb50_5_cps5_fpc3_alpha001/W_cell_density.csv \
    -o resources_stan/W_cell_density.csv
!curl https://raw.githubusercontent.com/vitkl/cell2location_paper/1c645a0519f8f27ecef18468cf339d35d99f42e7/notebooks/selected_results/lymph_nodes_analysis/CoLocationModelNB4V2_34clusters_4039locations_10241genes_input_inferred_V4_batch1024_l2_0001_n_comb50_5_cps5_fpc3_alpha001/manual_GC_annot.csv \
    -o resources_stan/manual_GC_annot.csv

In [6]:
import session_info
session_info.show()