__Author:__ Bram Van de Sande
    
__Date:__ 14 JUN 2019

__Outline:__ This notebook generates lists of transcription factors (TFs) for human and mouse. These lists can be used as input for the network inference step of SCENIC (step 1: GENIE3/GRNBoost2).

__DATA ACQUISITION:__
1. Download motif annotations for _H. sapiens_ - HGNC symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl`
2. Download motif annotations for _M. musculus_ - MGI symbols: `wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl`
3. Download the list of curated human transcription factors from Lambert SA et al., The Human Transcription Factors, Cell 2018 (https://dx.doi.org/10.1016/j.cell.2018.01.029). The code below expects it at `./resources/lambert2018.txt`, with one gene symbol per line.
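
The two motif annotation downloads above can also be scripted from Python so that the files land in the folder the notebook later reads from. The following is a minimal sketch, not part of the original workflow: the helper name `download_if_missing` is hypothetical, and the target folder `./resources/` is assumed from the paths used further down.

```python
import os
import urllib.request

MOTIF2TF_URLS = [
    "https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl",
    "https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl",
]


def download_if_missing(url, folder):
    """Download url into folder unless a file with that name already exists.

    Hypothetical helper; returns the local file path.
    """
    os.makedirs(folder, exist_ok=True)
    local_path = os.path.join(folder, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path


# Usage (commented out because each file is roughly 100 MB):
# for url in MOTIF2TF_URLS:
#     download_if_missing(url, "./resources/")
```

Skipping files that already exist avoids the numbered duplicates (`.tbl.1`, `.tbl.2`) that repeated `wget` calls create.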

In [3]:
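# Download the motif annotation tables into ./resources/, where the cells below read them.
# -P sets the target directory; -nc skips files that already exist.
import os
import pandas as pd
! wget -nc -P ./resources/ https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl
! wget -nc -P ./resources/ https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl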


--2022-03-03 13:21:13--  https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.hgnc-m0.001-o0.0.tbl
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.65.132, 2a02:2c40:0:80::80:1284
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.65.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103568514 (99M)
Saving to: ‘motifs-v9-nr.hgnc-m0.001-o0.0.tbl.2’


2022-03-03 13:21:27 (6.91 MB/s) - ‘motifs-v9-nr.hgnc-m0.001-o0.0.tbl.2’ saved [103568514/103568514]

--2022-03-03 13:21:28--  https://resources.aertslab.org/cistarget/motif2tf/motifs-v9-nr.mgi-m0.001-o0.0.tbl
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.65.132, 2a02:2c40:0:80::80:1284
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.65.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112121859 (107M)
Saving to: ‘motifs-v9-nr.mgi-m0.001-o0.0.tbl.1’


2022-03-03 13:21:44 (6.83 MB/s) - ‘motifs-v9-nr.

In [4]:
BASEFOLDER_NAME = './resources/'

MOTIFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.hgnc-m0.001-o0.0.tbl')
MOTIFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'motifs-v9-nr.mgi-m0.001-o0.0.tbl')
CURATED_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'lambert2018.txt')

OUT_TFS_HGNC_ALL_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_tfs.txt')  # all human TFs with a motif
OUT_TFS_HGNC_FNAME = os.path.join(BASEFOLDER_NAME, 'hs_hgnc_curated_tfs.txt')  # curated human TFs with a motif
OUT_TFS_MGI_FNAME = os.path.join(BASEFOLDER_NAME, 'mm_mgi_tfs.txt')

__MUS MUSCULUS__

In [5]:
df_motifs_mgi = pd.read_csv(MOTIFS_MGI_FNAME, sep='\t')
mm_tfs = df_motifs_mgi.gene_name.unique()
with open(OUT_TFS_MGI_FNAME, 'wt') as f:
    f.write('\n'.join(mm_tfs) + '\n')
len(mm_tfs)


1721
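
The same extraction can be done without loading the whole table into a DataFrame. Below is a minimal sketch using only the standard library, assuming the table is tab-separated with a header row that contains a `gene_name` column (which the pandas code above relies on). The function name `unique_gene_names` and the toy column `motif_id` are illustrative only. The symbols are returned sorted, which keeps the output file stable if the row order of the table changes.

```python
import csv
import io


def unique_gene_names(handle, column="gene_name"):
    """Return the sorted unique, non-empty values of one column of a TSV file."""
    reader = csv.DictReader(handle, delimiter="\t")
    names = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            names.add(value)
    return sorted(names)


# Toy table standing in for the real motif annotation file.
toy_table = io.StringIO(
    "motif_id\tgene_name\n"
    "m1\tSox2\n"
    "m2\tPou5f1\n"
    "m3\tSox2\n"
)
print(unique_gene_names(toy_table))  # ['Pou5f1', 'Sox2']
```

For the real file, the handle would come from something like `open(MOTIFS_MGI_FNAME, newline='')`.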

__HOMO SAPIENS__

List of TFs based on motif collection.

In [6]:
df_motifs_hgnc = pd.read_csv(MOTIFS_HGNC_FNAME, sep='\t')
hs_tfs = df_motifs_hgnc.gene_name.unique()
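# OUT_TFS_HGNC_FNAME points at the curated output file, so write the full list to its own file.
with open(os.path.join(BASEFOLDER_NAME, 'hs_hgnc_tfs.txt'), 'wt') as f: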
    f.write('\n'.join(hs_tfs) + '\n')
len(hs_tfs)

1839

List of TFs based on Lambert SA et al.

In [9]:
with open(CURATED_TFS_HGNC_FNAME, 'rt') as f:
    hs_curated_tfs = [line.strip() for line in f]
len(hs_curated_tfs)

815

List of human curated TFs for which a motif can be assigned based on our current version of the motif collection.

In [10]:
# Sort the intersection so the output file does not depend on set iteration order.
hs_curated_tfs_with_motif = sorted(set(hs_tfs).intersection(hs_curated_tfs))
len(hs_curated_tfs_with_motif)

17
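
If the overlap is smaller than expected, it can help to look at which curated TFs did not match before writing the file. The sketch below is one way to do that, using made-up symbols: it reports which curated TFs have a motif and which do not, with both lists sorted. The function name `split_by_motif` is hypothetical, and stripping whitespace is only an assumption about how the two input files might differ.

```python
def split_by_motif(curated_tfs, motif_tfs):
    """Split curated TF symbols into those with and without a motif.

    Symbols are stripped of surrounding whitespace and empty entries dropped.
    Returns (with_motif, without_motif), both sorted.
    """
    curated = {s.strip() for s in curated_tfs if s.strip()}
    motifs = {s.strip() for s in motif_tfs if s.strip()}
    with_motif = sorted(curated & motifs)
    without_motif = sorted(curated - motifs)
    return with_motif, without_motif


# Toy example with made-up inputs (ZNF000 is not a real symbol).
with_motif, without_motif = split_by_motif(
    ["TP53", "MYC ", "ZNF000", ""],
    ["MYC", "TP53", "SOX2"],
)
print(with_motif)     # ['MYC', 'TP53']
print(without_motif)  # ['ZNF000']
```

In the notebook this would be called as `split_by_motif(hs_curated_tfs, hs_tfs)`; inspecting the first few entries of the second list shows which curated symbols failed to match.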

In [11]:
with open(OUT_TFS_HGNC_FNAME, 'wt') as f:
    f.write('\n'.join(hs_curated_tfs_with_motif) + '\n')

__DATABASES__

Download the cisTarget databases (`.feather`) for human (hg38) and mouse (mm9).

In [21]:
! wget -P ./databases/ https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather

--2021-11-23 11:37:34--  https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.65.132, 2a02:2c40:0:80::80:1284
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.65.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1327758248 (1.2G)
Saving to: ‘hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather’


2021-11-23 11:40:37 (6.92 MB/s) - ‘hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather’ saved [1327758248/1327758248]

FINISHED --2021-11-23 11:40:37--
Total wall clock time: 3m 3s
Downloaded: 1 files, 1.2G in 3m 3s (6.92 MB/s)


In [22]:
! wget -P ./databases/ https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-tss-centered-10kb-7species.mc9nr.feather

--2021-11-23 11:41:24--  https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-tss-centered-10kb-7species.mc9nr.feather
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.65.132, 2a02:2c40:0:80::80:1284
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.65.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1081248976 (1.0G)
Saving to: ‘mm9-tss-centered-10kb-7species.mc9nr.feather’


2021-11-23 11:43:52 (6.96 MB/s) - ‘mm9-tss-centered-10kb-7species.mc9nr.feather’ saved [1081248976/1081248976]

FINISHED --2021-11-23 11:43:52--
Total wall clock time: 2m 28s
Downloaded: 1 files, 1.0G in 2m 28s (6.96 MB/s)
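
Because these files are large, a truncated download is easy to miss. A minimal sketch of a size check follows, using the byte counts that wget reported in the `Length:` lines above. The helper names `has_expected_size` and `check_downloads` are hypothetical, and the `./databases/` location is an assumption based on the target directory named in the download commands.

```python
import os

# Byte counts copied from the wget "Length:" lines above.
EXPECTED_SIZES = {
    "hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather": 1327758248,
    "mm9-tss-centered-10kb-7species.mc9nr.feather": 1081248976,
}


def has_expected_size(path, expected_bytes):
    """Return True if path is an existing file whose size equals expected_bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


def check_downloads(folder, expected=EXPECTED_SIZES):
    """Map each expected file name to the result of its size check."""
    return {
        name: has_expected_size(os.path.join(folder, name), size)
        for name, size in expected.items()
    }


# Usage: check_downloads("./databases/")
```

A matching size only rules out truncation; it does not verify the file contents.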
