## Stage 3: Regulatory inference

In this tutorial, we will show how to conduct regulatory inference using GLUE feature embeddings. We continue with the previous example of scRNA-seq and scATAC-seq data integration.

In this example, GLUE-based regulatory inference is used to identify significant cis-regulatory regions (ATAC peaks) for each gene. We will also demonstrate how to build a TF-target gene regulatory graph from the GLUE-inferred cis-regulatory regions, using additional information about TF binding sites.

## Inputs

In [5]:
import anndata as ad
import networkx as nx
import numpy as np
import pandas as pd
import scglue
import seaborn as sns
from IPython import display
from matplotlib import rcParams
from networkx.algorithms.bipartite import biadjacency_matrix
from networkx.drawing.nx_agraph import graphviz_layout
work_dir = '../../../output'

In [13]:
scglue.plot.set_publication_params()
rcParams['figure.figsize'] = (4, 4)

## Read intermediate results

First, read the intermediate results containing cell and feature embeddings from [stage 2](training.ipynb).

In [2]:
rna = ad.read_h5ad(f"{work_dir}/infer/scglue/rna-emb.h5ad")
atac = ad.read_h5ad(f"{work_dir}/infer/scglue/atac-emb.h5ad")
guidance = nx.read_graphml(f"{work_dir}/infer/scglue/guidance.graphml.gz")

We will be working extensively with genomic coordinates in `BED` format, so it is convenient to store the variable names in a "name" column.

In [3]:
rna.var["name"] = rna.var_names
atac.var["name"] = atac.var_names

Given that the GLUE model was trained on highly-variable features, regulatory inference will also be limited to these features. So, we extract the list of highly-variable features for future convenience.

In [4]:
genes = rna.var.index
peaks = atac.var.index

In [5]:
peaks.shape

(135358,)

## Cis-regulatory inference with GLUE feature embeddings

> (Estimated time: negligible)

We first concatenate the feature indices and embeddings of the two modalities.

In [15]:
features = pd.Index(np.concatenate([rna.var_names, atac.var_names]))
feature_embeddings = np.concatenate([rna.varm["X_glue"], atac.varm["X_glue"]])

We also need to extract a "skeleton" graph on which to conduct regulatory inference. The "skeleton" limits the search space of potential regulatory pairs, which helps reduce false positives caused by spurious correlations.

* The regulatory scores are defined as the cosine similarity between feature embeddings. As cosine similarities are symmetric, we only choose one direction between RNA genes and ATAC peaks to avoid repeated computation.
* Self-loops are also ignored as self-regulation is not meaningfully modeled with the current model.

In [16]:
skeleton = guidance.edge_subgraph(
    e for e, attr in dict(guidance.edges).items()
    if attr["type"] == "fwd"
).copy()
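
As a minimal sketch of the filtering step above, the toy graph below keeps only the edges whose `type` attribute is `"fwd"`, which drops both the reverse edges and the self-loops. The node names and edges are made up for illustration and are not taken from the tutorial data.

```python
import networkx as nx

# Toy guidance graph; node names and edges are illustrative assumptions
toy = nx.DiGraph()
toy.add_edge("GeneA", "peak1", type="fwd")
toy.add_edge("peak1", "GeneA", type="rev")
toy.add_edge("GeneA", "GeneA", type="loop")
toy.add_edge("GeneB", "peak2", type="fwd")
toy.add_edge("peak2", "GeneB", type="rev")

# Keep gene -> peak edges only, using the same pattern as the cell above
toy_skeleton = toy.edge_subgraph(
    e for e, attr in dict(toy.edges).items()
    if attr["type"] == "fwd"
).copy()

kept = sorted(toy_skeleton.edges)
print(kept)
```

Calling `.copy()` turns the read-only subgraph view into an independent graph, so later changes to the guidance graph do not affect the skeleton.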

Regulatory inference can be conducted by using the [scglue.genomics.regulatory_inference](api/scglue.genomics.regulatory_inference.rst) function. The function takes feature indices and embeddings as input, together with the skeleton graph generated above.

The resulting object is also a graph, with additional edge attributes:

* `"score"`: Regulatory score between genomic features, defined as the cosine similarity between feature embeddings;
* `"pval"`: *P*-value of the regulatory scores, obtained by comparing with a null distribution from shuffled feature embeddings;
* `"qval"`: *Q*-value of the regulatory scores, obtained by FDR correction of the *P*-values.

In [17]:
reginf = scglue.genomics.regulatory_inference(
    features, feature_embeddings,
    skeleton=skeleton, random_state=0
)

regulatory_inference: 100%|██████████| 95021/95021 [00:00<00:00, 115014.96it/s]
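
To make the three edge attributes more concrete, here is a rough NumPy sketch of how a score, an empirical *P*-value and a *Q*-value could be computed on toy data. It is not the implementation used by `scglue.genomics.regulatory_inference`: the shuffling scheme, the pooled null, the one-sided test and the helper names `cosine` and `bh_fdr` are all assumptions made for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted P-values (Q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(scaled, 1.0)
    return q

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))      # toy embeddings: rows 0-2 genes, rows 3-5 peaks
edges = [(0, 3), (1, 4), (2, 5)]   # toy skeleton edges (gene row, peak row)

score = np.array([cosine(emb[i], emb[j]) for i, j in edges])

# Null scores: same edges, with the dimensions of every embedding shuffled
n_null = 200
null = np.empty((n_null, len(edges)))
for k in range(n_null):
    shuffled = np.stack([rng.permutation(row) for row in emb])
    null[k] = [cosine(shuffled[i], shuffled[j]) for i, j in edges]

# One-sided empirical P-value against the pooled null, then FDR correction
pval = np.array([(1 + np.sum(null >= s)) / (1 + null.size) for s in score])
qval = bh_fdr(pval)
print(np.round(score, 3), np.round(pval, 3), np.round(qval, 3))
```

Because the toy embeddings are random, no edge is expected to be significant here; the point is only the relationship between the three quantities.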


Significant regulatory connections can be extracted based on edge attribute (Q-value < 0.05).

In [18]:
gene2peak = reginf.edge_subgraph(
    e for e, attr in dict(reginf.edges).items()
    if attr["qval"] < 0.05
)

## Visualize the inferred cis-regulatory regions

> (Estimated time: negligible)

The inferred cis-regulatory connections can be visualized using [pyGenomeTracks](https://pygenometracks.readthedocs.io/en/latest/). You can install it via:

```sh
conda install -c bioconda pygenometracks
```

Before making the plot, we need to prepare input files for the `pygenometracks` CLI.

Specifically, we save the ATAC peaks in BED format, and the inferred gene-peak connections in "links" format:

In [None]:
scglue.genomics.Bed(atac.var).write_bed(f"{work_dir}/infer/scglue/peaks.bed", ncols=3)
scglue.genomics.write_links(
    gene2peak,
    scglue.genomics.Bed(rna.var).strand_specific_start_site(),
    scglue.genomics.Bed(atac.var),
    f"{work_dir}/infer/scglue/gene2peak.links", keep_attrs=["score"]
)

In [None]:
df = pd.read_csv(
    f"{work_dir}/infer/scglue/gene2peak.links", sep="\t", header=None
)
# Columns 0-2 hold the gene TSS coordinates; columns 3-5 hold the peak coordinates
df["gene"] = df.apply(lambda row: "-".join(map(str, row.iloc[0:3])), axis=1)
df["peak"] = df.apply(lambda row: "-".join(map(str, row.iloc[3:6])), axis=1)
df = df[["peak", "gene"]]

In [None]:
df

Unnamed: 0,peak,gene
0,chr1-778292-779204,chr1-778746-778747
1,chr1-822873-823635,chr1-825137-825138
2,chr1-825292-826033,chr1-825137-825138
3,chr1-827076-827959,chr1-825137-825138
4,chr1-837612-838149,chr1-825137-825138
...,...,...
94453,chrY-20558402-20559294,chrY-20575518-20575519
94454,chrY-20573112-20573918,chrY-20575518-20575519
94455,chrY-20575222-20576136,chrY-20575518-20575519
94456,chrY-20573112-20573918,chrY-20575775-20575776
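
For large link files, a vectorized version of the same conversion may be faster than row-wise `apply`. The sketch below uses two made-up rows and assumes a 7-column layout (gene TSS coordinates, peak coordinates, score); the column names and the helper `region_name` are introduced here for readability only.

```python
import io
import pandas as pd

# Two made-up rows in the assumed layout: TSS (3 cols), peak (3 cols), score
toy_links = io.StringIO(
    "chr1\t100\t101\tchr1\t50\t150\t0.8\n"
    "chr2\t900\t901\tchr2\t700\t800\t0.6\n"
)
links = pd.read_csv(
    toy_links, sep="\t", header=None,
    names=["chrom1", "start1", "end1", "chrom2", "start2", "end2", "score"],
)

def region_name(frame, chrom, start, end):
    """Join three coordinate columns into 'chrom-start-end' strings."""
    return frame[chrom] + "-" + frame[start].astype(str) + "-" + frame[end].astype(str)

pairs = pd.DataFrame({
    "peak": region_name(links, "chrom2", "start2", "end2"),
    "gene": region_name(links, "chrom1", "start1", "end1"),
})
print(pairs)
```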


Then prepare a track configuration file like the one below (see their [documentation](https://pygenometracks.readthedocs.io/en/latest/content/all_tracks.html) for more details):

In [None]:
# %%writefile tracks.ini

# [Score]
# file = gene2peak.links
# title = Score
# height = 2
# color = YlGnBu
# compact_arcs_level = 2
# use_middle = True
# file_type = links

# [ATAC]
# file = peaks.bed
# title = ATAC
# display = collapsed
# border_color = none
# labels = False
# file_type = bed

# [Genes]
# file = gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz
# title = Genes
# prefered_name = gene_name
# height = 4
# merge_transcripts = True
# labels = True
# max_labels = 100
# all_labels_inside = True
# style = UCSC
# file_type = gtf

# [x-axis]
# fontsize = 12

Finally, we can call the `pygenometracks` CLI to visualize the inferred cis-regulatory connections within a proper genomic range (e.g., an area surrounding the *Gad2* gene):

In [None]:
# loc = rna.var.loc["Gad2"]
# chrom = loc["chrom"]
# chromLen = loc["chromEnd"] - loc["chromStart"]
# chromStart = loc["chromStart"] - chromLen
# chromEnd = loc["chromEnd"] + chromLen
# !pyGenomeTracks --tracks tracks.ini \
#     --region {chrom}:{chromStart}-{chromEnd} \
#     --outFileName tracks.png 2> /dev/null
# display.Image("tracks.png")
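
The region arithmetic in the cell above pads the gene by its own length on each side. A small helper such as the following (the function name is hypothetical, and the clipping at position 0 is an addition not present in the cell above) makes that explicit:

```python
def padded_region(chrom, chrom_start, chrom_end):
    """Extend a gene interval by its own length on both sides, clipped at 0."""
    length = chrom_end - chrom_start
    start = max(chrom_start - length, 0)
    end = chrom_end + length
    return f"{chrom}:{start}-{end}"

# Made-up coordinates for illustration
print(padded_region("chr2", 1000, 1500))
print(padded_region("chr1", 100, 400))
```

The returned string has the `chrom:start-end` form passed to `--region`.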

Note that in the tutorials, the guidance graph was constructed using only genomic overlap (see [Stage 1](preprocessing.ipynb#Graph-construction)), so the inferred regulatory connections are limited to the proximal promoter and gene body regions.

In real-world analyses, it would be beneficial to extend the genomic range (e.g., 150kb around TSS with distance-decaying weight) or incorporate additional information like Hi-C and eQTL (see our [case study](https://github.com/gao-lab/GLUE/tree/master/experiments/RegInf/s01_preprocessing.py) for an example).
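
As an illustration of what a distance-decaying weight could look like, the sketch below uses a power-law decay that equals 1 at the TSS and is cut off beyond 150 kb. The function name, the `scale` and `power` values and the hard cutoff are assumptions chosen for this example, not settings prescribed by the tutorial.

```python
def decay_weight(distance, max_distance=150_000, scale=1_000, power=0.75):
    """Power-law weight: 1 at the TSS, decreasing with distance, 0 beyond max_distance."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance > max_distance:
        return 0.0
    return ((distance + scale) / scale) ** (-power)

for d in (0, 1_000, 10_000, 150_000, 200_000):
    print(d, round(decay_weight(d), 4))
```

Edges in an extended guidance graph could then carry such weights, so that distal peaks contribute less than peaks near the promoter.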

## Construct TF-gene regulatory network from inferred cis-regulatory regions

Next, we demonstrate how to construct a TF-gene regulatory graph by combining the GLUE-inferred regulatory regions and TF motif/ChIP-seq information. Specifically, the [SCENIC pipeline](https://doi.org/10.1038/nmeth.4463) is adopted with the following 3 steps:

1. Generate a coexpression-based draft network using `GRNBoost2`;
2. Generate gene-wise TF cis-regulatory ranking by combining cis-regulatory regions and TF motif/ChIP-seq data;
3. Prune the coexpression-based draft network using the above cis-regulatory ranking with `cisTarget`.

To install pyscenic, use the following commands:

```sh
conda install -c conda-forge pyarrow cytoolz
pip install pyscenic
```

For human and mouse, the TF motif/ChIP-seq data used in the second step can be downloaded from here:

JASPAR motif hits:

* [http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-hg19.bed.gz](http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-hg19.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-hg38.bed.gz](http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-hg38.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-mm9.bed.gz](http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-mm9.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-mm10.bed.gz](http://download.gao-lab.org/GLUE/cisreg/JASPAR2022-mm10.bed.gz)

ENCODE TF ChIP-seq:

* [http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-hg38.bed.gz](http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-hg38.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-hg19.bed.gz](http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-hg19.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-mm10.bed.gz](http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-mm10.bed.gz)
* [http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-mm9.bed.gz](http://download.gao-lab.org/GLUE/cisreg/ENCODE-TF-ChIP-mm9.bed.gz)


Also see [pySCENIC](https://pyscenic.readthedocs.io/en/latest/index.html) for the original pipeline.

### Draft a coexpression-based network

> (Estimated time: ~5 min)

First, generate a list of eligible TFs. We use TFs covered in both the scRNA-seq dataset and TF motif/ChIP-seq data.

As ENCODE ChIP-seq covers a very limited number of mouse TFs, we will be using JASPAR motif hits in this tutorial:

In [8]:
motif_bed = scglue.genomics.read_bed(f"{work_dir}/JASPAR2022-hg38.bed.gz")
motif_bed.head()

Unnamed: 0,chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd,itemRgb,blockCount,blockSizes,blockStarts
0,GL000008.2,38,48,SOX2,.,.,.,.,.,.,.,.
1,GL000008.2,327,344,ZNF684,.,.,.,.,.,.,.,.
2,GL000008.2,332,344,TEAD1,.,.,.,.,.,.,.,.
3,GL000008.2,332,344,TEAD2,.,.,.,.,.,.,.,.
4,GL000008.2,672,689,ZNF684,.,.,.,.,.,.,.,.


In [None]:
motif_bed.name.unique().shape

(634,)

In [9]:
tfs = pd.Index(motif_bed["name"]).intersection(rna.var_names)
tfs.size

441

Since pySCENIC CLI uses `loom` files as input, we need to save the scRNA-seq data as a `loom` file (with only highly-variable genes and TFs). We also need to save the list of TFs as a separate `txt` file.

In [None]:
rna[:, np.union1d(genes, tfs)].write_loom(f"{work_dir}/infer/scglue/rna.loom")
np.savetxt(f"{work_dir}/infer/scglue/tfs.txt", tfs, fmt="%s")

The loom file will lack these fields:
{'PCs', 'X_glue', 'X_pca', 'X_umap'}
Use write_obsm_varm=True to export multi-dimensional annotations


In [None]:
import loompy

# Inspect the loom file to confirm the attribute names passed to pySCENIC below
with loompy.connect(f"{work_dir}/infer/scglue/rna.loom") as ds:
    print(ds.ra.keys())  # row (gene) attributes
    print(ds.ca.keys())  # column (cell) attributes
    gene_names = ds.ra["name"]

['artif_dupl', 'blockCount', 'blockSizes', 'blockStarts', 'chrom', 'chromEnd', 'chromStart', 'gene_id', 'gene_type', 'havana_gene', 'hgnc_id', 'highly_variable', 'highly_variable_rank', 'itemRgb', 'location', 'mean', 'means', 'name', 'score', 'std', 'strand', 'tag', 'thickEnd', 'thickStart', 'variances', 'variances_norm']
['balancing_weight', 'cell_type', 'obs_id']


In [None]:
aa # run the following from terminal in a seperate env 

NameError: name 'aa' is not defined

Now use the command `pyscenic grn` to build a coexpression-based draft network.

In [None]:
!pyscenic grn ../../../output/infer/scglue/rna.loom ../../../output/infer/scglue/tfs.txt \
    -o ../../../output/infer/scglue/draft_grn.csv --seed 0 --num_workers 20 \
    --cell_id_attribute obs_id --gene_attribute name

In [None]:
pd.read_csv(f"{work_dir}/infer/scglue/draft_grn.csv")

Unnamed: 0,TF,target,importance
0,EBF1,AFF3,3.091653e+01
1,MITF,CYBB,2.830844e+01
2,RORA,CD96,2.702984e+01
3,TCF7,ANK3,2.698522e+01
4,LEF1,CAMK4,2.541232e+01
...,...,...,...
1265951,NR2C2,VWC2,1.645437e-19
1265952,ELF2,SGIP1,1.584049e-19
1265953,ZBTB26,SYNPR,1.557584e-19
1265954,TCF7,CDH9,1.548399e-19


### Generate TF cis-regulatory ranking bridged by ATAC peaks

> (Estimated time: ~1 h)

We scan the genome with the [scglue.genomics.window_graph](api/scglue.genomics.window_graph.rst) function to connect ATAC peaks with TF motif hits based on genomic overlap. This will take some time (~1 hour).

In [10]:
peak_bed = scglue.genomics.Bed(atac.var.loc[peaks])
peak2tf = scglue.genomics.window_graph(peak_bed, motif_bed, 0, right_sorted=True)
peak2tf = peak2tf.edge_subgraph(e for e in peak2tf.edges if e[1] in tfs)

window_graph: 100%|██████████| 135358/135358 [47:30<00:00, 47.48it/s]  
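
Conceptually, this step is an interval-overlap join between peaks and motif hits, followed by a filter on eligible TFs. The brute-force sketch below shows the idea on made-up intervals; it assumes half-open BED coordinates, and the function `overlap_edges` is hypothetical and far slower than what one would use on real data.

```python
from collections import defaultdict

def overlap_edges(peaks, hits):
    """Return (peak, TF) pairs whose intervals overlap on the same chromosome.

    Both inputs are lists of (chrom, start, end, name) tuples.
    """
    hits_by_chrom = defaultdict(list)
    for chrom, start, end, tf in hits:
        hits_by_chrom[chrom].append((start, end, tf))
    edges = set()
    for chrom, start, end, peak in peaks:
        for h_start, h_end, tf in hits_by_chrom.get(chrom, []):
            # Half-open intervals overlap when each starts before the other ends
            if h_start < end and start < h_end:
                edges.add((peak, tf))
    return edges

toy_peaks = [("chr1", 100, 200, "peak1"), ("chr1", 300, 400, "peak2"), ("chr2", 100, 200, "peak3")]
toy_hits = [("chr1", 150, 160, "TF1"), ("chr1", 200, 210, "TF2"),
            ("chr1", 390, 410, "TF1"), ("chr2", 50, 100, "TF3")]

toy_edges = overlap_edges(toy_peaks, toy_hits)
eligible = {"TF1"}  # stand-in for the list of eligible TFs
toy_peak2tf = {(p, t) for p, t in toy_edges if t in eligible}
print(sorted(toy_peak2tf))
```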


In [None]:
adj_matrix = nx.to_pandas_edgelist(peak2tf)
adj_matrix.to_csv(f"{work_dir}/infer/scglue/adj_matrix.csv")

Given the GLUE-inferred gene-peak connections and motif-supported peak-TF connections, the ATAC peaks can serve as a bridge to help deduce gene-TF connections.

Specifically, we can use the function [scglue.genomics.cis_regulatory_ranking](api/scglue.genomics.cis_regulatory_ranking.rst) to combine gene-peak and peak-TF connections into a gene-TF cis-regulatory ranking. Given that each gene can connect to a varying number of ATAC peaks with different lengths, the combined gene-TF connections are not directly comparable. As such, the function compares the observed connections with randomly sampled ones (stratified by peak length) to evaluate their levels of enrichment, which are then used to rank genes for each TF.

In [19]:
gene2tf_rank_glue = scglue.genomics.cis_regulatory_ranking(
    gene2peak, peak2tf, genes, peaks, tfs,
    region_lens=atac.var.loc[peaks, "chromEnd"] - atac.var.loc[peaks, "chromStart"],
    random_state=0
)
gene2tf_rank_glue.iloc[:5, :5]

cis_reg_ranking.sampling: 100%|██████████| 17069/17069 [00:03<00:00, 4318.28it/s]
cis_reg_ranking.mapping: 100%|██████████| 1000/1000 [02:09<00:00,  7.70it/s]


location,ZNF684,TEAD1,TEAD2,KLF15,ZNF140
A1BG,9957.5,9738.0,9522.0,3523.0,10221.5
A1BG-AS1,9957.5,9738.0,9522.0,3061.5,10221.5
A2M,9957.5,611.0,254.0,15513.5,10221.5
A2M-AS1,9957.5,9738.0,9522.0,3751.0,10221.5
A2ML1,9957.5,9738.0,9522.0,15513.5,10221.5


In [23]:
gene2tf_rank_glue.to_csv(f"{work_dir}/infer/scglue/gene2tf_rank_glue.csv")
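
The idea behind the ranking can be sketched on toy matrices: count, for each gene-TF pair, how many linked peaks carry a motif hit, compare that with counts obtained after replacing each linked peak by a random peak from the same length stratum, and rank genes per TF by the resulting enrichment. This is a simplified illustration with made-up data; the actual function may differ in how it bins lengths, measures enrichment and handles ties.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs (all hypothetical): 3 genes, 6 peaks, 2 TFs
gene2peak_toy = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
])
peak2tf_toy = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 0],
    [0, 1],
    [0, 0],
])
peak_len_bin = np.array([0, 0, 0, 1, 1, 1])  # length stratum of each peak

# Observed counts (genes x TFs): linked peaks that carry a hit for the TF
observed = gene2peak_toy @ peak2tf_toy

def sample_null(n_samples):
    """Counts after swapping every linked peak for a random peak of the same stratum."""
    null = np.zeros((n_samples,) + observed.shape)
    for s in range(n_samples):
        sampled = np.zeros_like(gene2peak_toy)
        for g, row in enumerate(gene2peak_toy):
            for p in np.flatnonzero(row):
                pool = np.flatnonzero(peak_len_bin == peak_len_bin[p])
                sampled[g, rng.choice(pool)] += 1
        null[s] = sampled @ peak2tf_toy
    return null

enrichment = observed - sample_null(200).mean(axis=0)

# Rank genes for each TF: rank 1 is the most enriched gene
rank = (-enrichment).argsort(axis=0).argsort(axis=0) + 1
print(rank)
```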

### Generate TF cis-regulatory ranking with proximal promoters (optional)

> (Estimated time: ~1 h)

One potential limitation of the above approach is that genes with no regulatory ATAC peaks identified would be left out. As a supplement, we can also use proximal promoter regions flanking the TSS to generate cis-regulatory ranking, as in the original pySCENIC pipeline.

To do that, we scan the genome again with the [scglue.genomics.window_graph](api/scglue.genomics.window_graph.rst) function to connect TSS flanking regions (-500 to +500bp) with TF motif hits based on genomic overlap. Here the flanking regions will be named after the corresponding genes in the resulting graph.

In [22]:
flank_bed = scglue.genomics.Bed(rna.var.loc[genes]).strand_specific_start_site().expand(500, 500)
flank2tf = scglue.genomics.window_graph(flank_bed, motif_bed, 0, right_sorted=True)

window_graph: 100%|██████████| 17069/17069 [46:11<00:00,  6.16it/s]  
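
The flanking regions are derived from the strand-specific TSS. As a rough sketch of that coordinate logic (assuming half-open BED coordinates, a 1-bp TSS interval and clipping at position 0; the function `tss_flank` is hypothetical and not part of `scglue`):

```python
def tss_flank(chrom_start, chrom_end, strand, upstream=500, downstream=500):
    """Return (start, end) of the region flanking the strand-specific TSS."""
    if strand == "+":
        # TSS is the first base of the gene interval
        tss_start, tss_end = chrom_start, chrom_start + 1
        start, end = tss_start - upstream, tss_end + downstream
    elif strand == "-":
        # TSS is the last base of the gene interval; upstream points to higher coordinates
        tss_start, tss_end = chrom_end - 1, chrom_end
        start, end = tss_start - downstream, tss_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end

# Made-up gene coordinates
print(tss_flank(1000, 5000, "+"))
print(tss_flank(1000, 5000, "-"))
```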


Similar to the previous section, we use the [scglue.genomics.cis_regulatory_ranking](api/scglue.genomics.cis_regulatory_ranking.rst) function to generate a supplementary cis-regulatory ranking, with the following differences:

* The gene-peak connection is replaced with a gene-flank connection, which is just a self-loop graph since each TSS flanking region has the same name as its corresponding gene;
* Since each gene has exactly one flanking region with the same length, it is unnecessary to evaluate TF enrichment with stratified random sampling, so we set `n_samples=0` to disable the sampling process.

In [24]:
gene2flank = nx.Graph([(g, g) for g in genes])
gene2tf_rank_supp = scglue.genomics.cis_regulatory_ranking(
    gene2flank, flank2tf, genes, genes, tfs,
    n_samples=0
)
gene2tf_rank_supp.iloc[:5, :5]

location,ZNF684,TEAD1,TEAD2,KLF15,ZNF140
A1BG,8962.5,8763.0,8755.5,7619.0,8866.0
A1BG-AS1,8962.5,8763.0,8755.5,5108.5,8866.0
A2M,8962.5,8763.0,8755.5,14787.5,8866.0
A2M-AS1,8962.5,8763.0,8755.5,5108.5,8866.0
A2ML1,8962.5,8763.0,8755.5,14787.5,8866.0


In [25]:
gene2tf_rank_supp.to_csv(f"{work_dir}/infer/scglue/gene2tf_rank_supp.csv")

### Prune coexpression network using cis-regulatory ranking

> (Estimated time: ~5 min)

For the final step, we will prune the coexpression-based draft network with these cis-regulatory rankings to preserve TF-gene connections with cis-regulatory evidence. To do that, we need to prepare the following files:

* One or more `feather` files containing the cis-regulatory ranking;
* A `tsv` annotation file mapping column names in the ranking files to TF names.

Here we have two separate rankings (ATAC-based and promoter-based) with identical column names. We need to differentiate them by appending their information source.

In [27]:
gene2tf_rank_glue.columns = gene2tf_rank_glue.columns + "_glue"
gene2tf_rank_supp.columns = gene2tf_rank_supp.columns + "_supp"

We can use the [scglue.genomics.write_scenic_feather](api/scglue.genomics.write_scenic_feather.rst) function to save the cis-regulatory rankings as `feather` files compatible with `pySCENIC`.

In [28]:
scglue.genomics.write_scenic_feather(gene2tf_rank_glue, f"{work_dir}/infer/scglue/glue.genes_vs_tracks.rankings.feather")
scglue.genomics.write_scenic_feather(gene2tf_rank_supp, f"{work_dir}/infer/scglue/supp.genes_vs_tracks.rankings.feather")

In [40]:
gene2tf_rank_glue

location,ZNF684_glue,TEAD1_glue,TEAD2_glue,KLF15_glue,ZNF140_glue,CEBPA_glue,ZNF530_glue,NFATC2_glue,NFATC1_glue,ATF4_glue,...,ZFP57_glue,VENTX_glue,CAMTA2_glue,OVOL2_glue,CEBPE_glue,HOXA9_glue,ARF4_glue,GMEB1_glue,RUNX2_glue,MYBL2_glue
A1BG,9957.5,9738.0,9522.0,3523.0,10221.5,9274.0,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
A1BG-AS1,9957.5,9738.0,9522.0,3061.5,10221.5,9274.0,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
A2M,9957.5,611.0,254.0,15513.5,10221.5,202.5,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
A2M-AS1,9957.5,9738.0,9522.0,3751.0,10221.5,9274.0,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
A2ML1,9957.5,9738.0,9522.0,15513.5,10221.5,9274.0,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZXDC,9957.5,9738.0,9522.0,5235.0,10221.5,9274.0,8385.0,998.5,9766.5,9304.5,...,615.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
ZYG11A,9957.5,9738.0,9522.0,2214.5,10221.5,9274.0,13759.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
ZYG11B,9957.5,9738.0,9522.0,4538.0,10221.5,9274.0,13759.5,9790.5,9766.5,212.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0
ZYX,9957.5,9738.0,9522.0,108.5,10221.5,9274.0,5806.5,9790.5,9766.5,9304.5,...,9118.0,8592.0,8664.5,8624.0,8606.5,8601.0,8700.5,8709.5,8704.0,8629.0


Then use the following format for the annotation file:

In [35]:
pd.concat([
    pd.DataFrame({
        "#motif_id": tfs + "_glue",
        "gene_name": tfs
    }),
    pd.DataFrame({
        "#motif_id": tfs + "_supp",
        "gene_name": tfs
    })
])

Unnamed: 0,#motif_id,gene_name
0,ZNF684_glue,ZNF684
1,TEAD1_glue,TEAD1
2,TEAD2_glue,TEAD2
3,KLF15_glue,KLF15
4,ZNF140_glue,ZNF140
...,...,...
436,HOXA9_supp,HOXA9
437,ARF4_supp,ARF4
438,GMEB1_supp,GMEB1
439,RUNX2_supp,RUNX2


In [29]:
pd.concat([
    pd.DataFrame({
        "#motif_id": tfs + "_glue",
        "gene_name": tfs
    }),
    pd.DataFrame({
        "#motif_id": tfs + "_supp",
        "gene_name": tfs
    })
]).assign(
    motif_similarity_qvalue=0.0,
    orthologous_identity=1.0,
    description="placeholder"
).to_csv(f"{work_dir}/infer/scglue/ctx_annotation.tsv", sep="\t", index=False)
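
Before running the pruning step, it can be worth checking that every column in the ranking tables has a matching `#motif_id` in the annotation. The toy example below shows one way to do this with pandas; the TF and gene names are placeholders, and the check itself is a suggestion rather than part of the pipeline.

```python
import pandas as pd

toy_tfs = pd.Index(["TF1", "TF2"])  # hypothetical TF names

# Toy rankings with identical column names, suffixed by information source
rank_glue = pd.DataFrame([[1, 2], [2, 1]], index=["GeneA", "GeneB"], columns=toy_tfs)
rank_supp = pd.DataFrame([[2, 1], [1, 2]], index=["GeneA", "GeneB"], columns=toy_tfs)
rank_glue.columns = rank_glue.columns + "_glue"
rank_supp.columns = rank_supp.columns + "_supp"

# Annotation table in the same two-block layout as above
annotation = pd.concat([
    pd.DataFrame({"#motif_id": toy_tfs + suffix, "gene_name": toy_tfs})
    for suffix in ("_glue", "_supp")
], ignore_index=True)

# Every ranking column should appear exactly once among the motif IDs
ranking_columns = set(rank_glue.columns) | set(rank_supp.columns)
missing = ranking_columns - set(annotation["#motif_id"])
print("missing annotations:", sorted(missing))
```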

In [30]:
adj_matrix = pd.read_csv(f"{work_dir}/infer/scglue/adj_matrix.csv")

In [31]:
adj_matrix.source.unique().shape

(135308,)

In [32]:
pd.read_csv(f"{work_dir}/infer/scglue/ctx_annotation.tsv", sep='\t')

Unnamed: 0,#motif_id,gene_name,motif_similarity_qvalue,orthologous_identity,description
0,ZNF684_glue,ZNF684,0.0,1.0,placeholder
1,TEAD1_glue,TEAD1,0.0,1.0,placeholder
2,TEAD2_glue,TEAD2,0.0,1.0,placeholder
3,KLF15_glue,KLF15,0.0,1.0,placeholder
4,ZNF140_glue,ZNF140,0.0,1.0,placeholder
...,...,...,...,...,...
877,HOXA9_supp,HOXA9,0.0,1.0,placeholder
878,ARF4_supp,ARF4,0.0,1.0,placeholder
879,GMEB1_supp,GMEB1,0.0,1.0,placeholder
880,RUNX2_supp,RUNX2,0.0,1.0,placeholder


We are now ready to prune the coexpression network. This can be achieved using the pySCENIC command `pyscenic ctx` (here `rank_threshold` was scaled down according to the number of highly-variable genes):

In [None]:
!pyscenic ctx ../../../output/infer/scglue/draft_grn.csv \
    ../../../output/infer/scglue/glue.genes_vs_tracks.rankings.feather \
    ../../../output/infer/scglue/supp.genes_vs_tracks.rankings.feather \
    --annotations_fname ../../../output/infer/scglue/ctx_annotation.tsv \
    --expression_mtx_fname ../../../output/infer/scglue/rna.loom \
    --output ../../../output/infer/scglue/pruned_grn.csv \
    --rank_threshold 5000 --min_genes 1 \
    --num_workers 4 --no_pruning  \
    --cell_id_attribute obs_id --gene_attribute name 

[########################################] | 100% Completed | 35.40 s
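
One simple way to read "scaled down" is proportional scaling of the rank threshold to the number of ranked genes. The sketch below assumes a reference threshold of 5000 for a ranking of about 20,000 genes; both reference values, and the function name, are assumptions of this example and should be adjusted to your own setup.

```python
def scaled_rank_threshold(n_ranked_genes, reference_threshold=5000, reference_genes=20000):
    """Scale a rank threshold in proportion to the number of ranked genes."""
    if n_ranked_genes <= 0:
        raise ValueError("n_ranked_genes must be positive")
    return max(1, round(reference_threshold * n_ranked_genes / reference_genes))

# Example: a ranking restricted to 2000 genes
print(scaled_rank_threshold(2000))
```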