## This notebook demonstrates how to get importance scores for use with TF-MoDISco using GkmExplain

It relies on a gkm-SVM model pretrained on Nanog H1ESC ChIP-seq data.

In [1]:
from __future__ import print_function, division
%matplotlib inline

try:
    reload  # Python 2.7
except NameError:
    try:
        from importlib import reload  # Python 3.4+
    except ImportError:
        from imp import reload  # Python 3.0 - 3.3
        
# Install lsgkm from the kundajelab repo
!rm -rf lsgkm
!git clone https://github.com/kundajelab/lsgkm
%cd lsgkm/src
!make
%cd ../..

Cloning into 'lsgkm'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 293 (delta 0), reused 1 (delta 0), pack-reused 289
Receiving objects: 100% (293/293), 489.52 KiB | 0 bytes/s, done.
Resolving deltas: 100% (196/196), done.
Checking connectivity... done.
/Users/avantishrikumar/Research/modisco/examples/H1ESC_Nanog_gkmsvm/lsgkm/src
c++ -Wall -Wconversion -O3 -fPIC -c libsvm.cpp
c++ -Wall -Wconversion -O3 -fPIC -c libsvm_gkm.c
c++ -Wall -Wconversion -O3 -fPIC gkmtrain.c libsvm.o libsvm_gkm.o -o gkmtrain -lm -lpthread
c++ -Wall -Wconversion -O3 -fPIC gkmpredict.c libsvm.o libsvm_gkm.o -o gkmpredict -lm -lpthread
c++ -Wall -Wconversion -O3 -fPIC gkmexplain.c libsvm.o libsvm_gkm.o -o gkmexplain -lm -lpthread
/Users/avantishrikumar/Research/modisco/examples/H1ESC_Nanog_gkmsvm


## Grab the data and pretrained model

In [2]:
# Download the test sequences and the pretrained model, then decompress them
# (-f lets gunzip overwrite files left over from a previous run)
!wget https://raw.githubusercontent.com/AvantiShri/model_storage/2e603c/modisco/gkmexplain_scores/positives_test.fa.gz -O positives_test.fa.gz
!gunzip -f positives_test.fa.gz
!wget https://raw.githubusercontent.com/AvantiShri/model_storage/2e603c/modisco/gkmexplain_scores/lsgkm_defaultsettings_t2.model.txt.gz -O lsgkm_defaultsettings_t2.model.txt.gz
!gunzip -f lsgkm_defaultsettings_t2.model.txt.gz

--2019-02-11 15:05:27--  https://raw.githubusercontent.com/AvantiShri/model_storage/2e603c/modisco/gkmexplain_scores/positives_test.fa.gz
Resolving raw.githubusercontent.com... 151.101.188.133
Connecting to raw.githubusercontent.com|151.101.188.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75038 (73K) [application/octet-stream]
Saving to: 'positives_test.fa.gz'


2019-02-11 15:05:27 (5.50 MB/s) - 'positives_test.fa.gz' saved [75038/75038]

--2019-02-11 15:05:27--  https://raw.githubusercontent.com/AvantiShri/model_storage/2e603c/modisco/gkmexplain_scores/lsgkm_defaultsettings_t2.model.txt.gz
Resolving raw.githubusercontent.com... 151.101.188.133
Connecting to raw.githubusercontent.com|151.101.188.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655201 (640K) [application/octet-stream]
Saving to: 'lsgkm_defaultsettings_t2.model.txt.gz'


2019-02-11 15:05:27 (5.37 MB/s) - 'lsgkm_defaultsettings_t2.model.txt.gz' saved [655201/65

## Compute the gkmexplain importance scores on positives

In [3]:
# Importance scores on the positives.
# Computing these is not strictly necessary: they can be recovered from the
# hypothetical importance scores by elementwise multiplication with the one-hot
# encoding of each sequence, as demonstrated in the TF-MoDISco notebook.
!lsgkm/src/gkmexplain positives_test.fa lsgkm_defaultsettings_t2.model.txt gkmexplain_positives_impscores.txt

# Hypothetical importance scores on the positives
!lsgkm/src/gkmexplain -m 1 positives_test.fa lsgkm_defaultsettings_t2.model.txt gkmexplain_positives_hypimpscores.txt

INFO 2019-02-11 15:05:28: Number of threads is set to 1
INFO 2019-02-11 15:05:28: load model lsgkm_defaultsettings_t2.model.txt
INFO 2019-02-11 15:05:28: reading... 1000/8873
INFO 2019-02-11 15:05:28: reading... 2000/8873
INFO 2019-02-11 15:05:28: reading... 3000/8873
INFO 2019-02-11 15:05:28: reading... 4000/8873
INFO 2019-02-11 15:05:29: reading... 5000/8873
INFO 2019-02-11 15:05:29: reading... 6000/8873
INFO 2019-02-11 15:05:29: reading... 7000/8873
INFO 2019-02-11 15:05:29: reading... 8000/8873
INFO 2019-02-11 15:05:29: write prediction result to gkmexplain_positives_impscores.txt
INFO 2019-02-11 15:06:07: 100 scored
INFO 2019-02-11 15:06:44: 200 scored
INFO 2019-02-11 15:07:22: 300 scored
INFO 2019-02-11 15:08:01: 400 scored
INFO 2019-02-11 15:08:38: 500 scored
INFO 2019-02-11 15:09:16: 600 scored
INFO 2019-02-11 15:09:55: 700 scored
INFO 2019-02-11 15:10:31: 800 scored
INFO 2019-02-11 15:11:09: 900 scored
INFO 2019-02-11 15:11:31: 960 scored
INFO 2019-02-11 15:11:31: Number of th
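
The note in the cell above says the actual importance scores can be recovered from the hypothetical ones by elementwise multiplication with the one-hot encoding. The following is a minimal sketch of that operation on toy data. The helper `one_hot_encode`, the A, C, G, T column order, and the toy score values are assumptions made for illustration; they are not read from the gkmexplain output files.

```python
import numpy as np

def one_hot_encode(seq):
    """One-hot encode a DNA string as a (length, 4) array.

    Column order A, C, G, T is an assumption for this sketch;
    any other character (e.g. N) is left as an all-zero row.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            onehot[pos, index[base]] = 1.0
    return onehot

# Toy hypothetical scores for a 3 bp sequence: one row per position,
# one column per possible base (illustrative values only).
hyp_scores = np.array([[0.1, -0.2, 0.3, 0.0],
                       [0.5, 0.4, -0.1, 0.2],
                       [-0.3, 0.0, 0.1, 0.6]])
seq = "GAT"

# Elementwise multiplication keeps only the score of the base that is
# actually present at each position and zeroes out the other three.
actual_scores = one_hot_encode(seq) * hyp_scores
print(actual_scores)
```

Each row of `actual_scores` has at most one nonzero entry, in the column of the observed base.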

## Generate an empirical null distribution of importance scores

For identifying regions of high importance, it is helpful (but not strictly necessary) to supply a null distribution of per-position scores to TF-MoDISco. We will do this by dinucleotide-shuffling the sequences and scoring the shuffled versions.

In [6]:
#install the deeplift package to use the dinucleotide-shuffling code
!pip install deeplift
from deeplift.dinuc_shuffle import dinuc_shuffle

import numpy as np
import random
np.random.seed(1234)
random.seed(1234)

num_dinuc_shuffled_seqs = 500
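# Generate the dinucleotide-shuffled sequences and write them to a file.
# This assumes each FASTA record spans exactly two lines (header, then sequence),
# so odd-indexed lines are sequences; sequences containing 'N' are skipped.
with open("positives_test.fa") as fasta_file:
    fasta_seqs_no_N = [x.rstrip() for (i, x) in enumerate(fasta_file)
                       if (i % 2 == 1 and ('N' not in x))]
with open("dnshuff_seqs.fa", 'w') as out_file:
    out_file.write(
        "\n".join([">seq" + str(i) + "\n" + dinuc_shuffle(np.random.choice(fasta_seqs_no_N))
                   for i in range(num_dinuc_shuffled_seqs)]))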

#Score the dinucleotide shuffled sequences
!lsgkm/src/gkmexplain dnshuff_seqs.fa lsgkm_defaultsettings_t2.model.txt gkmexplain_dnshuff_impscores.txt

You are using pip version 18.0, however version 19.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
INFO 2019-02-11 15:28:35: Number of threads is set to 1
INFO 2019-02-11 15:28:35: load model lsgkm_defaultsettings_t2.model.txt
INFO 2019-02-11 15:28:35: reading... 1000/8873
INFO 2019-02-11 15:28:35: reading... 2000/8873
INFO 2019-02-11 15:28:35: reading... 3000/8873
INFO 2019-02-11 15:28:36: reading... 4000/8873
INFO 2019-02-11 15:28:36: reading... 5000/8873
INFO 2019-02-11 15:28:36: reading... 6000/8873
INFO 2019-02-11 15:28:36: reading... 7000/8873
INFO 2019-02-11 15:28:36: reading... 8000/8873
INFO 2019-02-11 15:28:37: write prediction result to gkmexplain_dnshuff_impscores.txt
INFO 2019-02-11 15:29:12: 100 scored
INFO 2019-02-11 15:29:48: 200 scored
INFO 2019-02-11 15:30:23: 300 scored
INFO 2019-02-11 15:30:59: 400 scored
INFO 2019-02-11 15:31:35: 500 scored
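
Once the shuffled sequences are scored, the per-base scores still need to be reduced to one value per position to form a per-position null distribution. The sketch below shows one way this could be done, assuming the scores for each sequence have already been loaded as a `(length, 4)` array. The random toy arrays, the variable names, and the percentile cutoffs are illustrative assumptions; parsing of the gkmexplain output file and the exact form TF-MoDISco expects are not shown here.

```python
import numpy as np

# Toy stand-ins for the scored shuffled sequences: a list of (length, 4)
# arrays. Real values would come from gkmexplain_dnshuff_impscores.txt.
rng = np.random.RandomState(1234)
shuffled_scores = [rng.normal(scale=0.01, size=(20, 4)) for _ in range(5)]

# Sum over the base axis to get one score per position, then pool the
# positions of all shuffled sequences into a single 1-D null sample.
null_per_pos = np.concatenate([s.sum(axis=-1) for s in shuffled_scores])

# Empirical percentiles of the pooled null give a rough sense of which
# per-position scores would be unusual under shuffling.
lower, upper = np.percentile(null_per_pos, [2.5, 97.5])
print(null_per_pos.shape, lower, upper)
```

Pooling across sequences treats every position as exchangeable, which is a simplification chosen for this sketch.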
