## Table of Contents
- [Intro](#Intro)
- [0.x Setup](#0.x-Setup)
  - [0.1 Download tools](#0.1-Download-tools)
  - [0.2 Download data](#0.2-Download-data)
  - [0.3 Merge and index](#0.3-Merge-and-index)
  - [0.4 Create experimental data](#0.4-Create-experimental-data)
  - [0.5 Create control data](#0.5-Create-control-data)
  - [0.6 Download peaks BED file](#0.6-Download-peaks-BED-file)
- [1.x Run BPNet](#1.x-Run-BPNet)
  - [1.0 Download bpnet-lite](#1.0-Download-bpnet-lite)
  - [1.1 Download hg38](#1.1-Download-hg38)
  - [1.2 Create bpnet-lite fit JSON file](#1.2-Create-bpnet-lite-fit-JSON-file)
  - [1.3 Train BPNet](#1.3-Train-BPNet)
  - [1.4 Create bpnet-lite interpret JSON file](#1.4-Create-bpnet-lite-interpret-JSON-file)
  - [1.5 Retrieve one-hot encoding & corresponding attribution scores using DeepSHAP](#1.5-Retrieve-one-hot-encoding-&-corresponding-attribution-scores-using-DeepSHAP)
- [2.x Motif discovery with tfmodisco-lite](#2.x-Motif-discovery-with-tfmodisco-lite)
  - [2.1 Install tfmodisco-lite](#2.1-Install-tfmodisco-lite)
  - [2.2 Running tfmodisco-lite on DeepSHAP output](#2.2-Running-tfmodisco-lite-on-DeepSHAP-output)
- [3.x Displaying the motifs with tfmodisco-lite report](#3.x-Displaying-the-motifs-with-tfmodisco-lite-report)
  - [3.1 Run modisco report](#3.1-Run-modisco-report)
  - [3.2 Run modisco report with TOMTOM comparison](#3.2-Run-modisco-report-with-TOMTOM-comparison)

# Intro
In this notebook we demonstrate how to use TF-MoDISco to discover motifs from a neural network trained to predict transcription factor binding (we'll use `bpnet-lite`). The rough order of execution is as follows:

0. Obtain the tools and the data to process (from the ENCODE project).
1. Train `bpnet-lite` on ChIP-seq data for a particular transcription factor.
2. Obtain the one-hot encoded sequence file and its corresponding attribution score file (computed with `DeepSHAP`).
3. Run `tfmodisco-lite` on the one-hot encoding file and the attribution score file to obtain the motifs, represented as contribution scores, in an `.h5` file. The file also contains the aggregated sequence data and the aggregated hypothetical contribution score data.

# 0.x Setup

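First let's create a Conda environment to install packages into. Note that `conda activate` in a `%%bash` cell only affects that cell's shell, so start Jupyter from inside the environment if later cells should use it.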

In [2]:
%%bash
conda create --name tfmodisco_example python='3.9.*'
conda activate tfmodisco_example

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done




  current version: 23.3.1
  latest version: 23.5.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.5.0





## Package Plan ##

  environment location: /users/airanman/anaconda3/envs/tfmodisco_example

  added / updated specs:
    - python=3.9


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge 
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu 
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h7f98852_4 
  ca-certificates    conda-forge/linux-64::ca-certificates-2023.5.7-hbcca054_0 
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-h41732ed_0 
  libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5 
  libgcc-ng          conda-forge/linux-64::libgcc-ng-12.2.0-h65d4601_19 
  libgomp            conda-forge/linux-64::libgomp-12.2.0-h65d4601_19 
  libnsl             conda-forge/linux-64::libnsl-2.0.0-h7f98852_0 
  libsqlite          conda-forge/linux-64::libsqlite-3.42.0-h2797004_0 
  libuuid            conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 
  libzlib            conda

## 0.1 Download tools

In [3]:
!conda install -y -c bioconda bamtools bedtools samtools

Collecting package metadata (current_repodata.json): done
Solving environment: done


# All requested packages already installed.



In [33]:
%%sh
wget --quiet http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedGraphToBigWig -O bedGraphToBigWig
chmod a+x bedGraphToBigWig

## 0.2 Download data

In [32]:
%%sh
mkdir -p ENCSR000EGM/data
cd ENCSR000EGM/data
wget --quiet https://www.encodeproject.org/files/ENCFF198CVB/@@download/ENCFF198CVB.bam -O rep1.bam
wget --quiet https://www.encodeproject.org/files/ENCFF488CXC/@@download/ENCFF488CXC.bam -O rep2.bam
wget --quiet https://www.encodeproject.org/files/ENCFF023NGN/@@download/ENCFF023NGN.bam -O control.bam
wget --quiet https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/@@download/GRCh38_EBV.chrom.sizes.tsv -O hg38.chrom.sizes
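Because `wget --quiet` suppresses its messages, a failed download can go unnoticed until a later step breaks. A minimal sketch of a sanity check, assuming the file names used in the cell above (the helper `find_missing_or_empty` is hypothetical and not part of any tool used in this notebook):

```python
from pathlib import Path


def find_missing_or_empty(paths):
    """Return the paths that do not exist or have zero size."""
    problems = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            problems.append(str(p))
    return problems


data_dir = Path("ENCSR000EGM/data")
expected = [data_dir / name for name in
            ("rep1.bam", "rep2.bam", "control.bam", "hg38.chrom.sizes")]
bad = find_missing_or_empty(expected)
print("All downloads present." if not bad else f"Check these files: {bad}")
```

An empty result means every file exists and is non-empty; it does not verify that the contents are valid BAM data.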

## 0.3 Merge and index

In [9]:
%%sh
cd ENCSR000EGM/data
samtools merge -f merged.bam rep1.bam rep2.bam
samtools index merged.bam

## 0.4 Create experimental data



In [None]:
%%sh
cd ENCSR000EGM/data
# Get coverage of 5’ positions of the plus strand
bedtools genomecov -5 -bg -strand + \
        -g hg38.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > plus.bedGraph

# Get coverage of 5’ positions of the minus strand
bedtools genomecov -5 -bg -strand - \
        -g hg38.chrom.sizes -ibam merged.bam \
        | sort -k1,1 -k2,2n > minus.bedGraph

# Convert bedGraph files to bigWig files
../../bedGraphToBigWig plus.bedGraph hg38.chrom.sizes plus.bw
../../bedGraphToBigWig minus.bedGraph hg38.chrom.sizes minus.bw
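To make concrete what a per-strand coverage of 5' positions means, here is a minimal pure-Python sketch. It illustrates the idea only and is not how bedtools is implemented: the function name and the toy reads are hypothetical, coordinates are assumed to be 0-based and half-open, and unlike `-bg` output it does not merge adjacent positions that share the same count.

```python
from collections import Counter


def five_prime_coverage(reads, strand):
    """Count read 5' ends per position for one strand.

    `reads` is an iterable of (chrom, start, end, strand) tuples with
    0-based, half-open coordinates (BED convention). Returns bedGraph-style
    rows (chrom, start, end, count), sorted by chromosome and position.
    """
    counts = Counter()
    for chrom, start, end, read_strand in reads:
        if read_strand != strand:
            continue
        # The 5' end is the first base on '+' and the last base on '-'.
        pos = start if strand == "+" else end - 1
        counts[(chrom, pos)] += 1
    return [(chrom, pos, pos + 1, n)
            for (chrom, pos), n in sorted(counts.items())]


# Toy reads, invented for illustration.
reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 136, "+"),
    ("chr1", 120, 170, "-"),
    ("chr1", 90, 140, "+"),
]
plus_rows = five_prime_coverage(reads, "+")
minus_rows = five_prime_coverage(reads, "-")
```

Here `plus_rows` is `[("chr1", 90, 91, 1), ("chr1", 100, 101, 2)]` and `minus_rows` is `[("chr1", 169, 170, 1)]`: only the first sequenced base of each read contributes, which is the base-resolution signal the model is trained on.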



## 0.5 Create control data

In [12]:
%%sh
cd ENCSR000EGM/data
# Get coverage of 5’ positions of the plus strand
bedtools genomecov -5 -bg -strand + \
        -g hg38.chrom.sizes -ibam control.bam \
        | sort -k1,1 -k2,2n > control_plus.bedGraph

# Get coverage of 5’ positions of the minus strand
bedtools genomecov -5 -bg -strand - \
        -g hg38.chrom.sizes -ibam control.bam \
        | sort -k1,1 -k2,2n > control_minus.bedGraph

# Convert bedGraph files to bigWig files
../../bedGraphToBigWig control_plus.bedGraph hg38.chrom.sizes control_plus.bw
../../bedGraphToBigWig control_minus.bedGraph hg38.chrom.sizes control_minus.bw




## 0.6 Download peaks BED file

In [13]:
%%sh
cd ENCSR000EGM/data
wget -q https://www.encodeproject.org/files/ENCFF396BZQ/@@download/ENCFF396BZQ.bed.gz -O peaks.bed.gz
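Since the peaks are later split into training and validation sets by chromosome, it can help to look at how many peaks each chromosome contributes. A small sketch, assuming a gzip-compressed BED file whose first column is the chromosome name (the helper `count_peaks_by_chrom` is hypothetical):

```python
import gzip
from collections import Counter


def count_peaks_by_chrom(bed_gz_path):
    """Count BED records per chromosome in a gzip-compressed BED file."""
    counts = Counter()
    with gzip.open(bed_gz_path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            counts[line.split()[0]] += 1
    return counts
```

For example, `count_peaks_by_chrom("ENCSR000EGM/data/peaks.bed.gz").most_common(5)` would list the five chromosomes with the most peaks.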

# 1.x Run BPNet

## 1.0 Download bpnet-lite

In [14]:
!pip install -Uqq bpnet-lite

## 1.1 Download hg38

In [15]:
%%sh
wget -q https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz -O ENCSR000EGM/data/hg38.fa.gz
gunzip -f ENCSR000EGM/data/hg38.fa.gz

## 1.2 Create bpnet-lite `fit` JSON file
We'll set a low number of epochs (`max_epochs` is 5) to keep training short for this example.

In [16]:
%%sh
mkdir -p ENCSR000EGM/bpnet
# Write to file
cat <<EOF > ENCSR000EGM/bpnet/bpnet_fit.json
{
   "n_filters": 64,
   "n_layers": 8,
   "profile_output_bias": true,
   "count_output_bias": true,
   "name": "example",
   "batch_size": 64,
   "in_window": 2114,
   "out_window": 1000,
   "max_jitter": 128,
   "reverse_complement": true,
   "max_epochs": 5,
   "validation_iter": 100,
   "lr": 0.001,
   "alpha": 1,
   "verbose": true,

   "min_counts": 0,
   "max_counts": 99999999,

   "training_chroms": ["chr2", "chr3", "chr4", "chr5", "chr6", "chr7", 
      "chr9", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", 
      "chr18", "chr19", "chr20", "chr21", "chr22", "chrX"],
   "validation_chroms": ["chr8", "chr10"],

   "sequences":"../data/hg38.fa",
   "loci":"../data/peaks.bed.gz",
   "signals":[
      "../data/plus.bw", 
      "../data/minus.bw"
   ],
   "controls":[
      "../data/control_plus.bw", 
      "../data/control_minus.bw"
   ],
   "random_state": 0
}
EOF
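Before training, a few properties of this configuration are worth checking by hand. The sketch below copies the relevant values from the JSON above into a dictionary and tests them; `check_fit_params` is a hypothetical helper, and the flank computation assumes the output window is centered within the input window, which is an assumption about the model rather than something stated in this notebook.

```python
def check_fit_params(params):
    """Return a list of human-readable problems found in a fit config."""
    problems = []
    overlap = set(params["training_chroms"]) & set(params["validation_chroms"])
    if overlap:
        problems.append(f"chromosomes in both splits: {sorted(overlap)}")
    if params["out_window"] > params["in_window"]:
        problems.append("out_window is larger than in_window")
    return problems


# Values copied from bpnet_fit.json above.
params = {
    "in_window": 2114,
    "out_window": 1000,
    "training_chroms": ["chr2", "chr3", "chr4", "chr5", "chr6", "chr7",
                        "chr9", "chr11", "chr12", "chr13", "chr14", "chr15",
                        "chr16", "chr17", "chr18", "chr19", "chr20", "chr21",
                        "chr22", "chrX"],
    "validation_chroms": ["chr8", "chr10"],
}

problems = check_fit_params(params)
# Extra input sequence on each side of the output window, if it is centered.
flank = (params["in_window"] - params["out_window"]) // 2
print(problems, flank)
```

With the values above there are no problems and `flank` is 557. Note that `chr1` appears in neither list, so it is left out of both training and validation.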


## 1.3 Train BPNet

In [17]:
%%sh
cd ENCSR000EGM/bpnet/
bpnet fit -p bpnet_fit.json

Training Set Size:  45969
Validation Set Size:  5148
Epoch	Iteration	Training Time	Validation Time	Training MNLL	Training Count MSE	Validation MNLL	Validation Profile Pearson	Validation Count Pearson	Validation Count MSE	Saved?
0	0	4.6008	4.0908	559.0572	27.3077	572.4651	0.006506731	-0.27866527	16.0063	True
0	100	10.8667	3.695	551.1813	4.1651	311.0575	0.07813424	0.10038188	1.6533	True
0	200	10.4749	3.6794	478.5148	0.5436	284.4794	0.23859434	0.13161778	1.1935	True
0	300	10.454	4.0222	417.9756	0.5297	259.9124	0.3404014	0.10733794	0.7302	True
0	400	10.8238	3.6792	413.8036	0.5095	253.5911	0.36618096	0.37802696	0.954	True
0	500	10.4698	3.6873	392.5204	0.4287	250.3475	0.37777773	0.44328076	0.8554	True
0	600	10.4805	3.6637	383.9519	0.4203	249.6323	0.3829707	0.46428233	0.8709	True
0	700	10.478	3.6763	410.6848	0.4174	247.821	0.3890268	0.49638587	0.9113	True
1	800	5.5563	3.672	425.1518	0.4097	247.0981	0.39271638	0.49701935	0.8789	True
1	900	10.4793	3.6649	459.5408	0.4332	246.524	0.39510086	0.510

## 1.4 Create bpnet-lite `interpret` JSON file

In [18]:
%%sh
cat <<EOF > ENCSR000EGM/bpnet/bpnet_interpret.json 
{
   "batch_size": 64,
   "in_window": 2114,
   "out_window": 1000,
   "verbose": true,
   "chroms": ["chr8", "chr10"],

   "sequences":"../data/hg38.fa",
   "loci":"../data/peaks.bed.gz",
   "model":"example.torch",

   "output": "profile",
   "ohe_filename": "ohe.npz",
   "attr_filename": "attr.npz",
   "n_shuffles":20,
   "random_state":0
}
EOF
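Hand-written parameter files are easy to get subtly wrong. In particular, if a JSON file is read with Python's `json` module, a key that appears twice in the same object is accepted silently and the last value wins. A minimal sketch of a stricter loader (the names `reject_duplicate_keys` and `load_strict` are hypothetical):

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key in JSON object: {key!r}")
        seen[key] = value
    return seen


def load_strict(text):
    """Parse JSON text, refusing objects with repeated keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


# The default parser keeps the last value without complaint.
lenient = json.loads('{"output": "count", "output": "profile"}')
print(lenient["output"])  # profile
```

Calling `load_strict` on the same text raises a `ValueError` naming the repeated key, which makes the ambiguity visible before a long interpretation run starts.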

## 1.5 Retrieve one-hot encoding & corresponding attribution scores using DeepSHAP
The following step writes the one-hot encoded sequences to `ohe.npz` and their DeepSHAP attribution scores to `attr.npz`.

In [19]:
%%sh
cd ENCSR000EGM/bpnet
bpnet interpret -p bpnet_interpret.json

Loading Loci: 100%|██████████| 5148/5148 [00:07<00:00, 658.88it/s]
100%|██████████| 5148/5148 [04:22<00:00, 19.59it/s]


The shapes of these tensors are as follows:

In [21]:
import numpy as np

# Load the .npz files
attr = np.load('ENCSR000EGM/bpnet/attr.npz')
ohe = np.load('ENCSR000EGM/bpnet/ohe.npz')

for key in attr.files:
    print(f"The shape of the attribution scores is {attr[key].shape}")

for key in ohe.files:
    print(f"The shape of the one-hot encoding is {ohe[key].shape}")
    

The shape of the attribution scores is (5148, 4, 2114)
The shape of the one-hot encoding is (5148, 4, 2114)


These two files will be used as input to `tfmodisco-lite`.
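To build intuition for how the two arrays relate, one common way to read them is to keep, at each position, only the attribution of the base that is actually present in the sequence. The sketch below does this on a tiny synthetic example; the function name is hypothetical, the numbers are invented, and the row order A, C, G, T is an assumption.

```python
import numpy as np


def project_attributions(attr, ohe):
    """Keep only the attribution of the base present at each position.

    Both arrays have shape (n_sequences, 4, length). Returns an array of
    shape (n_sequences, length).
    """
    if attr.shape != ohe.shape:
        raise ValueError(f"shape mismatch: {attr.shape} vs {ohe.shape}")
    return (attr * ohe).sum(axis=1)


# Synthetic example: 1 sequence of length 3 ("ACG"), rows assumed A, C, G, T.
ohe = np.array([[[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [0, 0, 0]]], dtype=float)
attr = np.array([[[0.5, 0.1, 0.0],
                  [0.2, -0.3, 0.1],
                  [0.0, 0.4, 0.9],
                  [0.1, 0.0, -0.2]]])
per_position = project_attributions(attr, ohe)  # values 0.5, -0.3, 0.9
```

Applied to the real arrays, the same call would return one score per position for each of the 5148 sequences, that is, an array of shape `(5148, 2114)`.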

# 2.x Motif discovery with `tfmodisco-lite`

## 2.1 Install `tfmodisco-lite`

In [1]:
%%sh
pip install -Uqq modisco-lite

## 2.2 Running `tfmodisco-lite` on DeepSHAP output

In [23]:
%%sh
mkdir -p ENCSR000EGM/modisco
cd ENCSR000EGM/modisco
modisco motifs -s ../bpnet/ohe.npz -a ../bpnet/attr.npz -n 2000 -o modisco_results.h5

Now we have the results! We can interpret them with the `report` subcommand.

# 3.x Displaying the motifs with `tfmodisco-lite report`

## 3.1 Run `modisco report`

In [45]:
%%sh
cd ENCSR000EGM/modisco
modisco report -i modisco_results.h5 -o report/

The report looks like this:

In [46]:
from IPython.display import HTML
HTML('ENCSR000EGM/modisco/report/motifs.html')

| pattern | num_seqlets | modisco_cwm_fwd | modisco_cwm_rev |
| --- | --- | --- | --- |
| pos_patterns.pattern_0 | 982 | | |
| neg_patterns.pattern_0 | 82 | | |
| neg_patterns.pattern_1 | 80 | | |


You can also compare the motifs against a database of known motifs using [TOMTOM](https://meme-suite.org/meme/tools/tomtom).

## 3.2 Run `modisco report` with TOMTOM comparison

First let's download the database file to compare our results against.

In [28]:
%%sh
wget --quiet \
https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt \
-O ENCSR000EGM/data/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt

We'll also need the `tomtom` executable on our `$PATH` to run `modisco report` with the TOMTOM comparison. For this notebook, we'll install it through the `meme` package from `conda`.

In [34]:
%%bash
conda install --quiet -c bioconda meme

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /users/airanman/anaconda3/envs/bpnet-refactored.3.9

  added / updated specs:
    - meme


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-3.1.1              |       hd590300_1         2.5 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following packages will be UPDATED:

  openssl                                  3.1.1-hd590300_0 --> 3.1.1-hd590300_1 


Proceed ([y]/n)? 
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done


Now we can run `modisco report` again, passing the database file with `-m`:

In [43]:
%%sh
cd ENCSR000EGM/modisco
modisco report -i modisco_results.h5 -o report/ -s ENCSR000EGM/modisco/report/ -m ../data/JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt

In [44]:
from IPython.display import HTML
HTML('ENCSR000EGM/modisco/report/motifs.html')

| pattern | num_seqlets | modisco_cwm_fwd | modisco_cwm_rev | match0 | qval0 | match0_logo | match1 | qval1 | match1_logo | match2 | qval2 | match2_logo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pos_patterns.pattern_0 | 982 | | | MA0139.1 | 1.33455e-14 | | MA1929.1 | 1.52615e-13 | | MA1930.1 | 6.47034e-11 | |
| neg_patterns.pattern_0 | 82 | | | MA1930.1 | 2.12841e-08 | | MA1929.1 | 3.11419e-08 | | MA0139.1 | 3.11419e-08 | |
| neg_patterns.pattern_1 | 80 | | | MA1102.2 | 2.35251e-06 | | MA0139.1 | 5.52814e-06 | | MA1929.1 | 2.25744e-05 | |


# 4.x `tfmodisco-lite` extras

## 4.1 Using `modisco meme` to generate MEME files

Using the `meme` subcommand, we can output various scores from the motifs into a MEME file.

In [3]:
!modisco meme -h

usage: modisco meme [-h] -i H5PY -t {PFM,CWM,hCWM,CWM-PFM,hCWM-PFM}
                    [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i H5PY, --h5py H5PY  An HDF5 file containing the output from modiscolite.
  -t {PFM,CWM,hCWM,CWM-PFM,hCWM-PFM}, --datatype {PFM,CWM,hCWM,CWM-PFM,hCWM-PFM}
                        A case-sensitive string specifying the desired data of the output file.,
                        The options are as follows:
                        - 'PFM':      The position-frequency matrix.
                        - 'CWM':      The contribution-weight matrix.
                        - 'hCWM':     The hypothetical contribution-weight matrix; hypothetical
                                      contribution scores are the contributions of nucleotides not encoded
                                      by the one-hot encoding sequence. 
                        - 'CWM-PFM':  The softmax of the contribution-weight matrix.
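The help text describes the `CWM-PFM` option as a softmax of the contribution-weight matrix. As a rough illustration of what a per-position softmax does, here is a pure-Python sketch; the function name and the toy matrix are hypothetical, and any scaling `modisco meme` applies before or after the softmax is not reproduced here.

```python
import math


def softmax_rows(matrix):
    """Apply a softmax across the bases at each motif position.

    `matrix` is a list of rows, one per position, each holding one score
    per base. Every returned row is non-negative and sums to 1.
    """
    result = []
    for row in matrix:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Toy contribution scores, invented for illustration.
cwm = [
    [0.0, 0.0, 0.0, 0.0],  # no preference: every base gets 0.25
    [0.0, 2.0, 0.0, 0.0],  # the second base dominates
]
pfm_like = softmax_rows(cwm)
```

Because the softmax of equal scores is uniform, positions whose contribution scores are all close to zero end up with values close to 0.25 for every base.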
 

Here we'll use it to create a MEME file with the hCWM-PFM scores:

In [14]:
%%sh
cd ENCSR000EGM/modisco
modisco meme -i modisco_results.h5 -t hCWM-PFM -o modisco_results.hCWM-PFM.meme

The resulting file looks like the following:

In [31]:
%%sh
cd ENCSR000EGM/modisco
head --lines=39 modisco_results.hCWM-PFM.meme

MEME version 5

ALPHABET= ACGT

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF pattern_0
letter-probability matrix: alength= 4 w= 30 nsites= 1
0.248826 0.251804 0.250520 0.248850
0.248163 0.251302 0.250202 0.250334
0.247805 0.251909 0.251251 0.249035
0.244951 0.255496 0.248984 0.250569
0.245002 0.253824 0.254852 0.246321
0.249467 0.246864 0.258725 0.244944
0.246277 0.265931 0.247717 0.240074
0.241209 0.277550 0.246266 0.234976
0.250526 0.246673 0.260665 0.242136
0.240527 0.265773 0.257526 0.236174
0.241154 0.259174 0.247957 0.251715
0.256594 0.250926 0.251941 0.240539
0.236031 0.249419 0.275758 0.238793
0.253191 0.242835 0.270393 0.233581
0.236852 0.251471 0.262782 0.248895
0.238141 0.245176 0.277373 0.239310
0.237160 0.246426 0.269815 0.246599
0.243103 0.271323 0.245214 0.240360
0.248603 0.244783 0.271999 0.234615
0.241764 0.259794 0.256034 0.242408
0.243188 0.256461 0.249168 0.251183
0.252519 0.253488 0.254545 0.239449
0.243827 0.251015 0.253445 0.251713
0.249577 0.
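Since the MEME file is plain text, it is straightforward to read back into Python for further analysis. The sketch below is a minimal parser written for files shaped like the one above; `parse_meme` is a hypothetical helper rather than part of `tfmodisco-lite`, it handles only `MOTIF` lines and `letter-probability matrix` blocks, and the example text is shortened to the first two rows of `pattern_0` (so `w= 2` instead of `w= 30`).

```python
def parse_meme(text):
    """Parse motifs from minimal MEME-format text.

    Returns a dict mapping motif name to a list of rows, each row a list
    of per-letter probabilities.
    """
    motifs = {}
    current = None
    rows_left = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("MOTIF"):
            current = line.split()[1]
            motifs[current] = []
        elif line.startswith("letter-probability matrix") and current:
            # The header looks like "... alength= 4 w= 30 nsites= 1".
            fields = line.split()
            rows_left = int(fields[fields.index("w=") + 1])
        elif rows_left and line:
            motifs[current].append([float(x) for x in line.split()])
            rows_left -= 1
    return motifs


example = """MEME version 5

ALPHABET= ACGT

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF pattern_0
letter-probability matrix: alength= 4 w= 2 nsites= 1
0.248826 0.251804 0.250520 0.248850
0.248163 0.251302 0.250202 0.250334
"""
motifs = parse_meme(example)
print(list(motifs), len(motifs["pattern_0"]))
```

Each motif comes back as a list of four-element rows in `ACGT` order, matching the `ALPHABET` line of the file.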