<a href="https://colab.research.google.com/github/kundajelab/tfmodisco/blob/master/examples/H1ESC_Nanog_gkmsvm/Nanog_GkmExplain_Generate_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download all the requisite data

In [1]:
!apt-get install bedtools
!git clone https://github.com/kundajelab/lsgkm.git lsgkm
%cd lsgkm/src
!make
%cd ../..

#Download ENCODE-processed peak files to get the foreground and background
! [[ -e conservative_peaks.bed.gz ]] || wget https://www.encodeproject.org/files/ENCFF148PBJ/@@download/ENCFF148PBJ.bed.gz -O conservative_peaks.bed.gz
! [[ -e optimal_peaks.bed.gz ]] || wget https://www.encodeproject.org/files/ENCFF379EPK/@@download/ENCFF379EPK.bed.gz -O optimal_peaks.bed.gz
# This DNase dataset was obtained from ENCODE (accessions: ENCSR000EMU, ENCSR000EMU_ENCSR794OFW)
# and reprocessed using the Kundaje lab's ATAC/DNase processing pipeline (https://github.com/kundajelab/atac_dnase_pipelines)
# by Daniel Kim.
! [[ -e bg_dnase.bed.gz ]] || wget https://raw.githubusercontent.com/AvantiShri/model_storage/8947701/gkmexplain/ENCSR000EMU_ENCSR794OFW.H1_Cells.UW_Stam.DNase-seq_rep1-pr.IDR0.1.narrowPeak.gz -O bg_dnase.bed.gz
  
#Get the hg19 fasta by downloading the 2bit file and then converting it to fa
! [[ -f hg19.2bit ]] || wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit -O hg19.2bit  
! [[ -f twoBitToFa ]] || wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa -O twoBitToFa
!chmod a+x twoBitToFa
! [[ -f hg19.genome.fa ]] || ./twoBitToFa hg19.2bit hg19.genome.fa

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  bedtools
0 upgraded, 1 newly installed, 0 to remove and 13 not upgraded.
Need to get 577 kB of archives.
After this operation, 2,040 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 bedtools amd64 2.26.0+dfsg-5 [577 kB]
Fetched 577 kB in 1s (488 kB/s)
Selecting previously unselected package bedtools.
(Reading database ... 149406 files and directories currently installed.)
Preparing to unpack .../bedtools_2.26.0+dfsg-5_amd64.deb ...
Unpacking bedtools (2.26.0+dfsg-5) ...
Setting up bedtools (2.26.0+dfsg-5) ...
Cloning into 'lsgkm'...
remote: Enumerating objects: 59, done.
remote: Counting objects: 100% (59/59), done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 382 (delta 34), reused 32 (delta 15), pack-reused 323
Receiving objects: 100% (382/382), 585.86 KiB | 2

### Process the downloaded data to get the fasta sequences for pos and neg set

In [2]:
#positive set is 200bp around the summits of the conservative IDR peaks
# (column 10 of the narrowPeak file is the summit offset from the peak start)
! zcat conservative_peaks.bed.gz | perl -lane 'print($F[0]."\t".($F[1]+$F[9]-100)."\t".($F[1]+$F[9]+100))' | gzip -c > positive_set_full.bed.gz
#negative set is 200bp around the summits of H1 accessible (DNase) peaks that do not overlap
# the 1kb window around the summit of any optimal or conservative peak
! zcat conservative_peaks.bed.gz optimal_peaks.bed.gz | perl -lane 'print($F[0]."\t".($F[1]+$F[9]-500)."\t".($F[1]+$F[9]+500))' | gzip -c > 1kb_around_optimal_or_conservative_peaks.bed.gz
! zcat bg_dnase.bed.gz | perl -lane 'print($F[0]."\t".($F[1]+$F[9]-100)."\t".($F[1]+$F[9]+100))' | gzip -c > prefiltering_neg_set.bed.gz
! bedtools intersect -a prefiltering_neg_set.bed.gz -b 1kb_around_optimal_or_conservative_peaks.bed.gz -v -wa | gzip -c > neg_set_full.bed.gz
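
The perl one-liners above all apply the same summit-centered window arithmetic. A minimal Python sketch of that step is shown below; the function name `summit_window` and the example record are hypothetical and only illustrate the calculation. Like the perl commands, the sketch does not clamp windows that would extend past the start of a chromosome.

```python
def summit_window(narrowpeak_line, flank=100):
    """Return (chrom, start, end) for a 2*flank bp window centered on the summit."""
    fields = narrowpeak_line.rstrip("\n").split("\t")
    chrom = fields[0]
    peak_start = int(fields[1])
    # 10th narrowPeak column: summit offset relative to the peak start
    summit_offset = int(fields[9])
    summit = peak_start + summit_offset
    return chrom, summit - flank, summit + flank


# Hypothetical narrowPeak record, made up for illustration only
example = "chr1\t1000\t1400\t.\t0\t.\t5.0\t-1\t2.0\t150"
print(summit_window(example))             # ('chr1', 1050, 1250)
print(summit_window(example, flank=500))  # ('chr1', 650, 1650)
```

With `flank=100` this corresponds to the 200bp positive and pre-filtering negative windows, and with `flank=500` to the 1kb exclusion windows.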

In [3]:
#subsample the negative set to have approx. the same number of regions as the pos set
! zcat neg_set_full.bed.gz | perl -lane 'if ($.%20==1) {print $_}' | gzip -c > subsampled_neg_set.bed.gz
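
The perl filter above keeps one line in every 20, starting from the first (`$.` is perl's 1-based line counter). A rough Python equivalent, with the hypothetical helper name `every_nth` and toy input, might look like this:

```python
def every_nth(lines, n=20):
    """Keep lines 1, n+1, 2n+1, ... (1-based), mirroring perl's `$. % n == 1` for n >= 2."""
    return [line for i, line in enumerate(lines, start=1) if i % n == 1]


# Toy input standing in for the lines of neg_set_full.bed.gz
toy = ["region%d" % i for i in range(1, 46)]
print(every_nth(toy))  # ['region1', 'region21', 'region41']
```

This is a deterministic, evenly spaced subsample rather than a random one, so rerunning it on the same file yields the same regions.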

In [4]:
#use chr1 and chr2 for the test set; all other chromosomes form the training set
! zcat positive_set_full.bed.gz | egrep -w 'chr1|chr2' | gzip -c > positives_test_set.bed.gz
! zcat positive_set_full.bed.gz | egrep -w -v 'chr1|chr2' | gzip -c > positives_train_set.bed.gz
! zcat subsampled_neg_set.bed.gz | egrep -w 'chr1|chr2' | gzip -c > negatives_test_set.bed.gz
! zcat subsampled_neg_set.bed.gz | egrep -w -v 'chr1|chr2' | gzip -c > negatives_train_set.bed.gz
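
The chromosome hold-out above can be sketched in Python as follows. The helper name `split_by_chrom` and the toy BED lines are hypothetical. The sketch compares the chromosome column exactly, which plays the role of the `-w` flag in the `egrep` calls: `chr10` must not be treated as `chr1`.

```python
def split_by_chrom(bed_lines, test_chroms=("chr1", "chr2")):
    """Split BED lines into (train, test) according to the chromosome column."""
    train, test = [], []
    for line in bed_lines:
        chrom = line.split("\t", 1)[0]
        # Exact comparison on the first column, so chr10 or chr21 never match chr1/chr2
        (test if chrom in test_chroms else train).append(line)
    return train, test


# Toy BED lines, made up for illustration only
toy_bed = ["chr1\t10\t210", "chr10\t5\t205", "chr2\t7\t207", "chrX\t1\t201"]
train, test = split_by_chrom(toy_bed)
print(test)   # ['chr1\t10\t210', 'chr2\t7\t207']
print(train)  # ['chr10\t5\t205', 'chrX\t1\t201']
```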

In [5]:
#Extract the underlying fasta regions
! bedtools getfasta -fi hg19.genome.fa -bed positives_train_set.bed.gz > positives_train.fa
! bedtools getfasta -fi hg19.genome.fa -bed positives_test_set.bed.gz > positives_test.fa
! bedtools getfasta -fi hg19.genome.fa -bed negatives_train_set.bed.gz > negatives_train.fa
! bedtools getfasta -fi hg19.genome.fa -bed negatives_test_set.bed.gz > negatives_test.fa

index file hg19.genome.fa.fai not found, generating...


### Train the model

In [6]:
#To save time, we can download the pre-trained model
! [[ -e lsgkm_defaultsettings_t2.model.txt.gz ]] || wget https://raw.githubusercontent.com/AvantiShri/model_storage/5dcfc2b/gkmexplain/lsgkm_defaultsettings_t2.model.txt.gz -O lsgkm_defaultsettings_t2.model.txt.gz
! zcat lsgkm_defaultsettings_t2.model.txt.gz > lsgkm_defaultsettings_t2.model.txt

#To train the model from scratch, run the lines below:
##Model is trained using kernel=2 (the -t 2 flag), which is the standard gkm kernel (no position weighting)
## I used the standard gkm kernel so that the method from Ghandi et al. (2014)
## would be applicable
! [[ -e lsgkm_defaultsettings_t2.model.txt ]] || lsgkm/src/gkmtrain -T 16 -t 2 positives_train.fa negatives_train.fa lsgkm_defaultsettings_t2

#Make predictions to assess performance
! [[ -e preds_test_positives.txt ]] || lsgkm/src/gkmpredict -T 16 positives_test.fa lsgkm_defaultsettings_t2.model.txt preds_test_positives.txt
! [[ -e preds_test_negatives.txt ]] || lsgkm/src/gkmpredict -T 16 negatives_test.fa lsgkm_defaultsettings_t2.model.txt preds_test_negatives.txt

from sklearn.metrics import roc_auc_score
pos_preds = [float(x.rstrip().split("\t")[1])
             for x in open("preds_test_positives.txt")]
neg_preds = [float(x.rstrip().split("\t")[1])
             for x in open("preds_test_negatives.txt")]
print(roc_auc_score(y_true=[1 for x in pos_preds]+[0 for x in neg_preds],
                    y_score=pos_preds+neg_preds))

--2021-02-27 10:54:51--  https://raw.githubusercontent.com/AvantiShri/model_storage/5dcfc2b/gkmexplain/lsgkm_defaultsettings_t2.model.txt.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655201 (640K) [application/octet-stream]
Saving to: ‘lsgkm_defaultsettings_t2.model.txt.gz’


2021-02-27 10:54:51 (18.9 MB/s) - ‘lsgkm_defaultsettings_t2.model.txt.gz’ saved [655201/655201]

INFO 2021-02-27 10:54:51: Number of threads is set to 16
INFO 2021-02-27 10:54:51: load model lsgkm_defaultsettings_t2.model.txt
INFO 2021-02-27 10:54:52: reading... 1000/8873
INFO 2021-02-27 10:54:53: reading... 2000/8873
INFO 2021-02-27 10:54:53: reading... 3000/8873
INFO 2021-02-27 10:54:54: reading... 4000/8873
INFO 2021-02-27 10:54:54: reading... 5000/8873
INFO 2021-02-27 10:54:55
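
For intuition about the number reported by `roc_auc_score` in the cell above: the AUROC equals the fraction of (positive, negative) pairs in which the positive sequence receives the higher score, with ties counted as one half. The pure-Python sketch below illustrates that definition on made-up scores. The function name `auroc` is hypothetical, and the quadratic loop is meant for illustration only, not as a replacement for scikit-learn.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores, not taken from the model above
toy_pos = [0.9, 0.4, 0.7]
toy_neg = [0.1, 0.5, -0.3]
print(auroc(toy_pos, toy_neg))  # 8 of the 9 pairs are ranked correctly
```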

### Generate dinuc shuffled sequences for computing null distribution of importance scores

In [7]:
!pip install deeplift
from deeplift.dinuc_shuffle import dinuc_shuffle

import numpy as np
import random
np.random.seed(1234)
random.seed(1234)

num_dinuc_shuffled_seqs = 500
#Generate the dinucleotide shuffled sequences and write to a file
fasta_seqs_no_N = [x.rstrip() for (i,x) in enumerate(open("positives_test.fa"))
                   if (i%2==1 and ('N' not in x))]
#Use a context manager so the file is flushed and closed after writing
with open("dnshuff_seqs.fa", 'w') as f:
    f.write("\n".join([">seq"+str(i)+"\n"+dinuc_shuffle(
            str(np.random.choice(fasta_seqs_no_N)))
            for i in range(num_dinuc_shuffled_seqs)]))

#Alternatively, download the pre-generated file (this overwrites any existing dnshuff_seqs.fa)
! [[ -e dnshuff_seqs.fa.gz ]] || wget https://raw.githubusercontent.com/AvantiShri/model_storage/aae0902/gkmexplain/dnshuff_seqs.fa.gz -O dnshuff_seqs.fa.gz
! zcat dnshuff_seqs.fa.gz > dnshuff_seqs.fa

Collecting deeplift
  Downloading https://files.pythonhosted.org/packages/d2/48/e8c4a331664c32682d6f7f55f1148f59224e32cbf4f22c90f3f961eb5a40/deeplift-0.6.13.0.tar.gz
Building wheels for collected packages: deeplift
  Building wheel for deeplift (setup.py) ... done
  Created wheel for deeplift: filename=deeplift-0.6.13.0-cp37-none-any.whl size=36447 sha256=c56e49565a3d5f89ba24a9e5adb3f8fff57b4c834df85a1da36aeb0ec907eb0e
  Stored in directory: /root/.cache/pip/wheels/39/a2/1b/a2eac3afbfedc4fb40e213ec4f8d97d26598325f187ae0dc5a
Successfully built deeplift
Installing collected packages: deeplift
Successfully installed deeplift-0.6.13.0
--2021-02-27 10:56:32--  https://raw.githubusercontent.com/AvantiShri/model_storage/aae0902/gkmexplain/dnshuff_seqs.fa.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HT

### Compute hypothetical importance scores

In [8]:
#The actual importance scores can be derived from the hypothetical importance
# scores by doing an elementwise multiplication of the hypothetical importance
# with the one-hot encoded sequence.
! [[ -e gkmexplain_positives_hypimpscores.txt ]] || lsgkm/src/gkmexplain -m 1 positives_test.fa lsgkm_defaultsettings_t2.model.txt gkmexplain_positives_hypimpscores.txt
! [[ -e gkmexplain_dnshuff_hypimpscores.txt ]] || lsgkm/src/gkmexplain -m 1 dnshuff_seqs.fa lsgkm_defaultsettings_t2.model.txt gkmexplain_dnshuff_hypimpscores.txt

INFO 2021-02-27 10:56:33: Number of threads is set to 1
INFO 2021-02-27 10:56:33: load model lsgkm_defaultsettings_t2.model.txt
INFO 2021-02-27 10:56:33: reading... 1000/8873
INFO 2021-02-27 10:56:33: reading... 2000/8873
INFO 2021-02-27 10:56:33: reading... 3000/8873
INFO 2021-02-27 10:56:33: reading... 4000/8873
INFO 2021-02-27 10:56:34: reading... 5000/8873
INFO 2021-02-27 10:56:34: reading... 6000/8873
INFO 2021-02-27 10:56:34: reading... 7000/8873
INFO 2021-02-27 10:56:34: reading... 8000/8873
INFO 2021-02-27 10:56:34: write prediction result to gkmexplain_positives_hypimpscores.txt
INFO 2021-02-27 10:57:47: 100 scored
INFO 2021-02-27 10:58:58: 200 scored
INFO 2021-02-27 11:00:10: 300 scored
INFO 2021-02-27 11:01:21: 400 scored
INFO 2021-02-27 11:02:32: 500 scored
INFO 2021-02-27 11:03:41: 600 scored
INFO 2021-02-27 11:04:50: 700 scored
INFO 2021-02-27 11:05:59: 800 scored
INFO 2021-02-27 11:07:07: 900 scored
INFO 2021-02-27 11:07:48: 960 scored
INFO 2021-02-27 11:07:48: Number of
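
As noted in the cell comment above, the actual importance scores follow from the hypothetical scores by elementwise multiplication with the one-hot encoded sequence. The sketch below illustrates that step on toy values in plain Python. The helper names, the toy scores, and the A, C, G, T column order are assumptions made for illustration and should be checked against the gkmexplain output before use.

```python
BASES = "ACGT"  # assumed column order of the per-position scores


def one_hot(seq):
    """One-hot encode a DNA sequence; bases outside ACGT (e.g. N) give an all-zero row."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def actual_from_hypothetical(hyp_scores, seq):
    """Elementwise product of hypothetical scores (length x 4) with the one-hot sequence."""
    return [[h * o for h, o in zip(hyp_row, oh_row)]
            for hyp_row, oh_row in zip(hyp_scores, one_hot(seq))]


# Toy hypothetical scores for a 3bp sequence, made up for illustration only
toy_hyp = [[0.2, -0.1, 0.0, 0.3],
           [-0.4, 0.5, 0.1, 0.0],
           [0.0, 0.2, 0.6, -0.2]]
toy_actual = actual_from_hypothetical(toy_hyp, "ACG")
# Only the entry for the observed base survives at each position
print([max(row, key=abs) for row in toy_actual])  # [0.2, 0.5, 0.6]
```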