# Infer GRN from Baron-Human Dataset

### Necessary Imports

This notebook uses [pySCENIC](https://github.com/aertslab/pySCENIC), which relies on the Arboreto package (here, its GRNBoost2 algorithm) for network inference.

Dask is used for distributed computing.

As of today (June 10, 2021), installing pySCENIC runs into a problem with Dask. See the issue I filed [here](https://github.com/aertslab/pySCENIC/issues/295). The solution:

#### I re-installed pySCENIC at version 0.11.1 and downgraded to dask==2.30.0 and distributed==2.30.0, which works for me.
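To confirm that the environment matches the pins above before running anything expensive, a small check along these lines may help. It is only a sketch: `check_versions` and `EXPECTED` are hypothetical names, and it assumes Python 3.8 or newer so that `importlib.metadata` is available.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the workaround described above.
EXPECTED = {"pyscenic": "0.11.1", "dask": "2.30.0", "distributed": "2.30.0"}


def check_versions(expected):
    """Map each package to (installed version or None, matches expected?)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[package] = (installed, installed == wanted)
    return report


for package, (installed, ok) in check_versions(EXPECTED).items():
    print(f"{package}: installed={installed} expected={EXPECTED[package]} ok={ok}")
```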

In [1]:
import os
import glob
import pickle
import pandas as pd
import numpy as np

from dask.diagnostics import ProgressBar

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

from ctxcore.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

import seaborn as sns




### Define some paths

Here we define some paths as constants so they can be reused throughout the notebook.

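The structure of the folders (note that the cells below use `data-baron-human/` as the root folder and read `Filtered_Baron_HumanPancreas_data.csv` and `hs_hgnc_tfs.txt` from `resources/`):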
```
data/
├─ resources/
│  ├─ GSE60361_C1-3005-Expression.txt
│  ├─ metadata.txt
│  ├─ mm_mgi_tfs.txt
│  ├─ motifs-v9-nr.mgi-m0.001-o0.0.tbl
├─ databases/
│  ├─ mm9-500bp-upstream-10species.mc9nr.feather
│  ├─ mm9-500bp-upstream-7species.mc9nr.feather
│  ├─ mm9-tss-centered-10kb-10species.mc9nr.feather
│  ├─ mm9-tss-centered-10kb-7species.mc9nr.feather
│  ├─ mm9-tss-centered-5kb-10species.mc9nr.feather
│  ├─ mm9-tss-centered-5kb-7species.mc9nr.feather
├─ obj/
```

Download the databases and motif annotations [here](https://resources.aertslab.org/cistarget/).

Download the expression data [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361).

Download the metadata [here](http://linnarssonlab.org/cortex/).

`obj/` will be used to store pickled objects.
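The folders need to exist before anything is written to them, and `obj/` in particular is easy to forget. A minimal sketch for creating the layout is shown here. `ensure_folders` is a hypothetical helper, and the demo runs in a temporary directory rather than in the real data folder.

```python
import os
import tempfile


def ensure_folders(root, subfolders=("resources", "databases", "obj")):
    """Create root/<subfolder> for each entry and return the resulting paths."""
    paths = []
    for sub in subfolders:
        path = os.path.join(root, sub)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


# Demo in a throwaway directory; point `root` at your own data folder instead.
demo_root = tempfile.mkdtemp()
folders = ensure_folders(demo_root)
print(folders)
```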

In [2]:
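# Root folder for this analysis and its sub-folders.
DATA_FOLDER = "data-baron-human/"
RESOURCES_FOLDER = os.path.join(DATA_FOLDER, "resources")
DATABASE_FOLDER = os.path.join(DATA_FOLDER, "databases")

# cisTarget ranking databases and motif annotations.
# NOTE: mm9 and MGI are mouse resources; human data would normally call for the hg19/hg38 and HGNC equivalents.
DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.mc9nr.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")

# Transcription factor list (human HGNC symbols, despite the MM_ prefix) and expression matrix.
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, "hs_hgnc_tfs.txt")
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "Filtered_Baron_HumanPancreas_data.csv")

# Output files.
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")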

## Load up expression matrix

In [3]:
ex_matrix = pd.read_csv(SC_EXP_FNAME, header=0, index_col=0)


In [4]:
ex_matrix.columns

Index(['A1BG', 'A1CF', 'A2M', 'A4GALT', 'AAAS', 'AACS', 'AACSP1', 'AADAC',
       'AADACL2', 'AADACP1',
       ...
       'ZWILCH', 'ZWINT', 'ZXDA', 'ZXDB', 'ZXDC', 'ZYG11B', 'ZYX', 'ZZEF1',
       'ZZZ3', 'pk'],
      dtype='object', length=17499)

In [5]:
tf_names = load_tf_names(MM_TFS_FNAME)
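GRNBoost2 expects a matrix with observations (cells) as rows and genes as columns, and it only uses the transcription factors that actually appear among those columns. Before starting a long run, a quick sanity check like the one below can catch a transposed matrix, a stray non-numeric column, or a TF list that does not match the gene symbols. `summarize_matrix` is a hypothetical helper and the data frame is a toy stand-in, not the real expression matrix.

```python
import pandas as pd


def summarize_matrix(ex_matrix, tf_names):
    """Return basic sanity-check figures for a cells x genes matrix."""
    columns = set(ex_matrix.columns)
    non_numeric = [
        col for col in ex_matrix.columns
        if not pd.api.types.is_numeric_dtype(ex_matrix[col])
    ]
    return {
        "n_cells": ex_matrix.shape[0],
        "n_genes": ex_matrix.shape[1],
        "non_numeric_columns": non_numeric,
        "n_tfs_in_matrix": sum(tf in columns for tf in tf_names),
    }


# Toy data: two cells, two gene columns and one label-like text column.
toy = pd.DataFrame(
    {"A1BG": [0, 3], "ZYX": [1, 0], "label": ["alpha", "beta"]},
    index=["cell_1", "cell_2"],
)
summary = summarize_matrix(toy, ["ZYX", "NOT_A_GENE"])
print(summary)
```

In the notebook, the same call would be `summarize_matrix(ex_matrix, tf_names)`.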

## Load up the ranking databases

In [6]:
db_fnames = glob.glob(DATABASES_GLOB)


def name(fname):
    # Database name = file name without its folder and final extension.
    return os.path.splitext(os.path.basename(fname))[0]


dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs

[FeatherRankingDatabase(name="mm9-500bp-upstream-7species.mc9nr"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species.mc9nr"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species.mc9nr"),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species.mc9nr"),
 FeatherRankingDatabase(name="mm9-500bp-upstream-10species.mc9nr"),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species.mc9nr")]
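The database names in the output come from the `name` helper: `os.path.splitext` removes only the last extension, so `.feather` is dropped and `.mc9nr` stays. The sketch below reproduces this on two hand-written paths. Sorting the file names is an addition of this example, assuming a stable order is wanted, since `glob.glob` returns files in arbitrary order.

```python
import os


def name(fname):
    """Strip the folder and the final extension from a database path."""
    return os.path.splitext(os.path.basename(fname))[0]


# Paths written out by hand; in the notebook they come from glob.glob(DATABASES_GLOB).
example_fnames = [
    "data-baron-human/databases/mm9-tss-centered-5kb-7species.mc9nr.feather",
    "data-baron-human/databases/mm9-500bp-upstream-7species.mc9nr.feather",
]
example_names = [name(fname) for fname in sorted(example_fnames)]
print(example_names)
```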

## Now, the fun part. Let's infer a GRN

This may take a while, so grab a snack or a cup of tea.

In [7]:
adjacencies = grnboost2(ex_matrix, gene_names=ex_matrix.columns, tf_names=tf_names, verbose=True)

preparing dask client
parsing input
creating dask graph
6 partitions
computing dask graph
shutting down client and local cluster
finished
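The log above shows `grnboost2` creating and shutting down its own local Dask cluster. If you would rather control the number of workers, or fix the random seed so that repeated runs are comparable, `grnboost2` also accepts a `client_or_address` and a `seed` argument. The wrapper below is a sketch only: `infer_network` is a hypothetical name, and the defaults for `n_workers` and `seed` are arbitrary values chosen for illustration.

```python
def infer_network(ex_matrix, tf_names, n_workers=4, seed=777):
    """Run GRNBoost2 on a user-managed local Dask cluster."""
    # Imported inside the function so that defining it does not require
    # Dask or Arboreto to be importable.
    from arboreto.algo import grnboost2
    from distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    client = Client(cluster)
    try:
        return grnboost2(
            expression_data=ex_matrix,
            tf_names=tf_names,
            client_or_address=client,
            seed=seed,
            verbose=True,
        )
    finally:
        # Always release the workers, even if inference fails.
        client.close()
        cluster.close()
```

In the notebook this would be called as `adjacencies = infer_network(ex_matrix, tf_names)`.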


In [None]:
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

In [8]:
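def save_obj(obj, name):
    """Pickle `obj` to data/obj/<name>.pkl (the folder must already exist)."""
    with open('data/obj/' + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)


def load_obj(name):
    """Load the object previously pickled to data/obj/<name>.pkl."""
    with open('data/obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)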
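The helpers above write to a fixed `data/obj/` folder and fail if it does not exist. A variant that takes the folder as an argument and creates it on demand could look like the sketch below. `save_obj_to` and `load_obj_from` are hypothetical names, and the round trip is demonstrated in a temporary directory.

```python
import os
import pickle
import tempfile


def save_obj_to(obj, folder, name):
    """Pickle `obj` to <folder>/<name>.pkl, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return path


def load_obj_from(folder, name):
    """Load the object stored at <folder>/<name>.pkl."""
    with open(os.path.join(folder, name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Round trip in a throwaway folder.
demo_folder = os.path.join(tempfile.mkdtemp(), "obj")
example = {"TF": ["IRX2"], "target": ["TTR"]}
saved_path = save_obj_to(example, demo_folder, "example")
restored = load_obj_from(demo_folder, "example")
```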

In [9]:
save_obj(adjacencies, "adjacenciesBaronHuman")

In [None]:
save_obj(modules, "modulesBaronHuman")

In [10]:
adjacencies.head(40)

| | TF | target | importance |
|---|---|---|---|
| 470 | IRX2 | TTR | 276.361904 |
| 114 | CKMT1B | CKMT1A | 261.817782 |
| 470 | IRX2 | CLU | 252.919963 |
| 391 | HMGA1 | MFSD2B | 224.693691 |
| 444 | HSPA5 | SDF2L1 | 221.625993 |
| 370 | HHEX | SST | 205.507621 |
| 855 | RPS4X | RPL3 | 205.365172 |
| 265 | FOXD2 | PCDHB4 | 203.737695 |
| 444 | HSPA5 | MANF | 181.967444 |
| 444 | HSPA5 | HERPUD1 | 181.643999 |
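Each row of the adjacency table is a TF-to-target link with an importance score, so ordinary pandas operations are enough to explore it. The sketch below rebuilds the ten rows shown above as a toy data frame and lists the strongest targets of one TF. `top_targets` is a hypothetical helper, not part of pySCENIC or Arboreto.

```python
import pandas as pd

# The ten links displayed above, re-entered by hand as toy data.
rows = [
    ("IRX2", "TTR", 276.361904),
    ("CKMT1B", "CKMT1A", 261.817782),
    ("IRX2", "CLU", 252.919963),
    ("HMGA1", "MFSD2B", 224.693691),
    ("HSPA5", "SDF2L1", 221.625993),
    ("HHEX", "SST", 205.507621),
    ("RPS4X", "RPL3", 205.365172),
    ("FOXD2", "PCDHB4", 203.737695),
    ("HSPA5", "MANF", 181.967444),
    ("HSPA5", "HERPUD1", 181.643999),
]
toy_adjacencies = pd.DataFrame(rows, columns=["TF", "target", "importance"])


def top_targets(adjacencies, tf, n=5):
    """Return up to `n` targets of `tf`, strongest link first."""
    links = adjacencies[adjacencies["TF"] == tf]
    return links.nlargest(n, "importance")["target"].tolist()


# Number of links per TF within this toy table.
targets_per_tf = toy_adjacencies.groupby("TF")["target"].count()

print(top_targets(toy_adjacencies, "HSPA5"))
print(targets_per_tf.sort_values(ascending=False))
```

On the full result, the same calls would take `adjacencies` in place of `toy_adjacencies`.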


In [11]:
adjacencies.to_csv('data-baron-human/adjacenciesBaronHuman.tsv', sep='\t', header=True, index=False)
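Once the adjacencies are on disk as a TSV, a later session can reload them with pandas instead of repeating the inference. The sketch below shows the round trip on a two-row toy table written to a temporary file. The file name and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"TF": ["IRX2", "HHEX"], "target": ["TTR", "SST"], "importance": [2.5, 1.25]}
)

# Write with the same options as the cell above, then read the file back.
tsv_path = os.path.join(tempfile.mkdtemp(), "adjacencies_toy.tsv")
toy.to_csv(tsv_path, sep="\t", header=True, index=False)
reloaded = pd.read_csv(tsv_path, sep="\t")
print(reloaded)
```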