### Kipoi usage tutorial: Per-Variant Predictions

This tutorial shows how to apply the model to a vcf file in order to predict the *individual* effect of each variant in that file.

For this you need a vcf with all the variants of interest.

In [1]:
# Imports
import kipoi
from kipoi.pipeline import Pipeline
from kipoiseq.dataloaders import SingleVariantUTRDataLoader
import os
import numpy as np
import pandas as pd

### Load the Model

In [2]:
# Source model directly from directory
model = kipoi.get_model("../Framepool", source="dir")

Using downloaded and verified file: /data/nasif12/home_if12/karollus/projects/models/Framepool/downloaded/model_files/weights/d1e9656725e730d509a09d5371e51bd2


### Optional: download example files and hg19 fasta

In [3]:
import urllib.request
import gzip
import shutil

In [None]:
# Create the ExampleFiles directory if it does not exist
os.makedirs("ExampleFiles", exist_ok=True)

In [4]:
# Download vcf
urllib.request.urlretrieve("https://zenodo.org/record/3584238/files/patho.vcf.gz?download=1", 'ExampleFiles/patho.vcf.gz')
# Download vcf tabix
urllib.request.urlretrieve("https://zenodo.org/record/3584238/files/patho.vcf.gz.tbi?download=1", 'ExampleFiles/patho.vcf.gz.tbi')
# Download GTF
urllib.request.urlretrieve("https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ensGene.gtf.gz", 'ExampleFiles/hg19.gtf.gz')
# Download id mapping file
urllib.request.urlretrieve("https://zenodo.org/record/3584238/files/hg19_idmap.tsv?download=1", 'ExampleFiles/hg19_idmap.tsv')

('ExampleFiles/hg19_idmap.tsv', <http.client.HTTPMessage at 0x2b3564049b70>)

In [18]:
# Download gzipped hg19 fasta (warning: about 900 MB)
urllib.request.urlretrieve("https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz", 'ExampleFiles/hg19.fa.gz')
# unzip
with gzip.open('ExampleFiles/hg19.fa.gz', 'rb') as f_in:
    with open('ExampleFiles/hg19.fa', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

### Provide the Parameters

This dataloader requires the following input files:
1. A GTF file that provides 5'UTR as a feature type (e.g. 5UTR in gencode, five_prime_utr in Ensembl)
2. A fasta file that provides the reference genome
3. A bgzip-compressed (single-sample) vcf that provides the variants

The vcf must be sorted by position, and a tabix index must be present: it must lie in the same directory and have the same name plus `.tbi` (e.g. `patho.vcf.gz.tbi`).
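The two vcf requirements above can be checked before running the dataloader. The following is a minimal sketch using only the standard library; `check_vcf_ready` is a hypothetical helper name, not part of kipoi or kipoiseq, and it only checks that the index file exists, not that it is valid.

```python
import gzip
import os


def check_vcf_ready(vcf_path):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper for illustration, not part of kipoi or kipoiseq.
    """
    problems = []
    # The tabix index is expected next to the vcf, as <name>.tbi
    if not os.path.exists(vcf_path + ".tbi"):
        problems.append("missing tabix index: " + vcf_path + ".tbi")

    last_chrom, last_pos = None, 0
    seen_chroms = set()
    # bgzip output is gzip-compatible, so gzip.open can read it
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            if chrom != last_chrom:
                # A chromosome that reappears later means the file is not sorted
                if chrom in seen_chroms:
                    problems.append("records of " + chrom + " are not contiguous")
                    break
                seen_chroms.add(chrom)
                last_chrom, last_pos = chrom, pos
            elif pos < last_pos:
                problems.append("positions on " + chrom + " are not sorted")
                break
            else:
                last_pos = pos
    return problems
```

Calling `check_vcf_ready("ExampleFiles/patho.vcf.gz")` should return an empty list if the example files were downloaded as shown above.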

All files must agree on whether there is a chr prefix (e.g. is it chr1 or just 1).
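A quick way to compare the chr prefix across the three inputs is to look at the first contig name in each file. The sketch below assumes plain or gzip-compressed text files; `uses_chr_prefix` and the `kind` labels are hypothetical names introduced for illustration, and only the first record of each file is inspected.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def uses_chr_prefix(path, kind):
    """Return True if the first contig name in the file starts with 'chr'.

    kind is "fasta", "gtf" or "vcf" (labels used only in this sketch).
    Returns None if no contig name is found.
    """
    with open_text(path) as handle:
        for line in handle:
            if kind == "fasta":
                # fasta contig names follow the '>' of a header line
                if line.startswith(">"):
                    return line[1:].startswith("chr")
            elif line.strip() and not line.startswith("#"):
                # gtf and vcf both keep the contig name in the first column
                return line.split("\t", 1)[0].startswith("chr")
    return None


# Example usage with the paths defined in this tutorial:
# files = {"fasta": fasta_path, "gtf": gtf_path, "vcf": vcf_path}
# flags = {kind: uses_chr_prefix(path, kind) for kind, path in files.items()}
# if len(set(flags.values())) > 1:
#     print("Files disagree on the chr prefix:", flags)
```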

**NB: Some common issues**
1. The files do not agree with respect to the ordering or the chr prefix. This creates all kinds of problems.
2. The vcf includes non-DNA characters, e.g. * or /. The model cannot currently account for these.
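To spot the second issue in advance, the vcf can be scanned for alleles that are not plain DNA. This is a minimal sketch, assuming that only A, C, G and T (in either case) count as valid; that alphabet is an assumption made here, not something the dataloader documents, and `non_dna_records` is a hypothetical helper name.

```python
import gzip
import re

# Assumption: only A, C, G and T are treated as valid allele characters
DNA_ALLELE = re.compile(r"[ACGT]+", re.IGNORECASE)


def non_dna_records(vcf_path):
    """Yield (chrom, pos, ref, alt) for records whose REF or ALT is not plain DNA."""
    # bgzip output is gzip-compatible, so gzip.open can read it
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            # ALT may list several comma-separated alleles; check each one
            alleles = [ref] + alt.split(",")
            if not all(DNA_ALLELE.fullmatch(a) for a in alleles):
                yield chrom, pos, ref, alt
```

For example, `list(non_dna_records(vcf_path))` collects the offending records so they can be removed from the vcf before prediction.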

In [5]:
# Path of the vcf file
vcf_path = "ExampleFiles/patho.vcf.gz"

# Path of the fasta file
fasta_path = "ExampleFiles/hg19.fa"

# Path of the gtf file
gtf_path = "ExampleFiles/hg19.gtf.gz"
# Name of the feature type used for 5'UTRs in your gtf
feature_type = "5UTR"

# output file path
output_file_path = "patho.tsv"

### Filter your gtf (if necessary)

If your vcf is small, or covers only small parts of the human genome, it is advisable to first subset your gtf to the corresponding regions to speed up the dataloader.

Note: The dataloader was designed to handle very large vcf files (e.g. the gnomAD VCF) that do not fit easily into memory. For such very large vcf files, filtering the gtf will not be beneficial.

In [6]:
import pyranges as pr
from cyvcf2 import VCF

In [7]:
# Import gtf with pyranges
gr = pr.read_gtf(gtf_path)

In [8]:
# Collect the gene ids of all gtf features that overlap a variant
id_set = set()
for var in VCF(vcf_path):
    chrom = var.CHROM
    pos = var.POS
    # vcf positions are 1-based, pyranges intervals are 0-based and half-open
    id_set |= set(gr[chrom, pos-1:pos].df.gene_id)

In [9]:
# Keep only the gtf entries of genes that overlap at least one variant
gr_subset = gr[gr.gene_id.isin(id_set)]

In [10]:
# Write the reduced gtf and use it in place of the full gtf from here on
gtf_path = "ExampleFiles/reduced.gtf"
gr_subset.to_gtf(gtf_path)

### Run Prediction

In [12]:
pipeline = Pipeline(model, SingleVariantUTRDataLoader)

In [13]:
pipeline.predict_to_file(
    output_file_path,
    {"gtf_file": gtf_path,
     "fasta_file": fasta_path,
     "vcf_file": vcf_path,
     "feature_type": feature_type},
    batch_size=64,
);

1it [00:01,  1.10s/it]


### Load results

In [14]:
# Load data as dataframe and shorten the column names
df = pd.read_csv(output_file_path, sep="\t")
df = df.rename(index=str, columns={
    "metadata/variant/chr": "chr",
    "metadata/exon_positions": "exon_positions",
    "metadata/transcript_id": "transcript_id",
    "metadata/variant/str": "variants",
    "metadata/variant/ref": "ref",
    "metadata/variant/alt": "alt",
    "metadata/variant/chrom": "chrom",
    "metadata/variant/id": "id",
    "metadata/variant/pos": "pos",
    "preds/mrl_fold_change": "mrl_fold_change",
    "preds/shift_1": "shift_1",
    "preds/shift_2": "shift_2",
})

### Optional: id map to get a clearer output

In [15]:
# Path of the id mapping file; providing it gives a richer output
id_map_path = "ExampleFiles/hg19_idmap.tsv"

# Merge the id map into the results by transcript id
df_map = pd.read_csv(id_map_path, sep="\t")
df = df.merge(df_map, on="transcript_id")

In [16]:
df

| | transcript_id | alt | chrom | id | pos | ref | variants | mrl_fold_change | shift_1 | shift_2 | gene_id | gene_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENST00000370321 | A | chr1 | rs376208311 | 93297626 | C | chr1:93297626:C:A | -0.799757 | -0.670084 | 0.025305 | ENSG00000122406 | RPL5 |
| 1 | ENST00000470843 | A | chr1 | rs376208311 | 93297626 | C | chr1:93297626:C:A | -0.107161 | -0.189486 | -0.054052 | ENSG00000122406 | RPL5 |
| 2 | ENST00000367021 | A | chr1 | CR022509 | 209975361 | T | chr1:209975361:T:A | -1.068 | -0.861996 | 0.055964 | ENSG00000117595 | IRF6 |
| 3 | ENST00000456314 | A | chr1 | CR022509 | 209975361 | T | chr1:209975361:T:A | -0.675417 | -0.53175 | -0.228016 | ENSG00000117595 | IRF6 |
| 4 | ENST00000335295 | T | chr11 | rs34704828 | 5248280 | C | chr11:5248280:C:T | -0.828528 | 0.00658 | -0.919977 | ENSG00000244734 | HBB |
| 5 | ENST00000392711 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -1.02586 | 0.015113 | -1.124092 | ENSG00000108946 | PRKAR1A |
| 6 | ENST00000585427 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -1.088306 | -0.048745 | -1.154857 | ENSG00000108946 | PRKAR1A |
| 7 | ENST00000585608 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -1.141302 | -0.975095 | 0.025098 | ENSG00000108946 | PRKAR1A |
| 8 | ENST00000589228 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -1.129289 | -0.975691 | 0.034666 | ENSG00000108946 | PRKAR1A |
| 9 | ENST00000536854 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -0.142158 | -0.277853 | -0.063634 | ENSG00000108946 | PRKAR1A |


### Filter results
1. Choose the strongest result per gene, i.e. the transcript with the largest absolute `mrl_fold_change` (requires the id mapping)
2. Keep only results with an absolute effect > 0.5

In [17]:
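# Absolute predicted effect per transcript
df["abs_mrl_fc"] = np.abs(df["mrl_fold_change"])
# Keep, for each gene, the row(s) with the largest absolute effect
idx = df.groupby(['gene_name'])['abs_mrl_fc'].transform("max") == df['abs_mrl_fc']
df_max = df[idx]
# Keep only effects with an absolute value above 0.5
df_max[df_max["abs_mrl_fc"] > 0.5]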

| | transcript_id | alt | chrom | id | pos | ref | variants | mrl_fold_change | shift_1 | shift_2 | gene_id | gene_name | abs_mrl_fc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENST00000370321 | A | chr1 | rs376208311 | 93297626 | C | chr1:93297626:C:A | -0.799757 | -0.670084 | 0.025305 | ENSG00000122406 | RPL5 | 0.799757 |
| 2 | ENST00000367021 | A | chr1 | CR022509 | 209975361 | T | chr1:209975361:T:A | -1.068 | -0.861996 | 0.055964 | ENSG00000117595 | IRF6 | 1.068 |
| 4 | ENST00000335295 | T | chr11 | rs34704828 | 5248280 | C | chr11:5248280:C:T | -0.828528 | 0.00658 | -0.919977 | ENSG00000244734 | HBB | 0.828528 |
| 10 | ENST00000392710 | A | chr17 | CR023224 | 66508599 | G | chr17:66508599:G:A | -1.244144 | -1.060489 | -0.135118 | ENSG00000108946 | PRKAR1A | 1.244144 |
| 11 | ENST00000258439 | A | chr2 | rs121908813 | 96931137 | G | chr2:96931137:G:A | -1.285575 | -1.012779 | 0.028387 | ENSG00000135956 | TMEM127 | 1.285575 |
| 13 | ENST00000264193 | T | chr3 | rs867711777 | 98312358 | C | chr3:98312358:C:T | -1.006342 | -0.834495 | 0.085337 | ENSG00000080819 | CPOX | 1.006342 |
| 16 | ENST00000510027 | A | chr5 | rs367798627 | 147211193 | G | chr5:147211193:G:A | -0.818441 | -0.669264 | -0.730059 | ENSG00000164266 | SPINK1 | 0.818441 |
| 17 | ENST00000242261 | T | chr7 | 30040876_1 | 19157207 | G | chr7:19157207:G:T | -0.790121 | -0.396197 | -0.773752 | ENSG00000122691 | TWIST1 | 0.790121 |
| 20 | ENST00000498124 | A | chr9 | rs1800586 | 21974860 | C | chr9:21974860:C:A | -1.020021 | -0.02678 | -1.134101 | ENSG00000147889 | CDKN2A | 1.020021 |
