# Variant effect prediction
Variant effect prediction offers a simple way to predict the effects of SNVs using any model that takes DNA sequence as input. Several scoring methods are available, and all of them rely on in-silico mutagenesis. The default input is a VCF file and the default output is a TSV file annotated with the predicted variant effects.

This IPython notebook goes through the basic programmatic steps needed to perform variant effect prediction.

## Variant centered effect prediction
Models that use `kipoiseq.dataloaders.SeqIntervalDl` as their default dataloader can make use of variant-centered effect prediction. This procedure starts from the query VCF and generates genomic regions of the model's input length, each centered on an individual variant in the VCF. The sequences of these regions are then mutated according to the alleles in the VCF. The model's batch prediction function is run on all mutated sequence sets, and finally the scoring method is applied.
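
The region generation and mutation steps described above can be illustrated with a small, self-contained sketch. This is not the kipoi-veff2 implementation: the helper names `centered_window` and `ref_alt_sequences` are hypothetical, the toy chromosome is made up, and the exact centering convention for even input lengths is an assumption.

```python
def centered_window(pos, seq_len):
    """Return a 0-based, half-open [start, end) window of length seq_len
    around the 1-based VCF position pos (centering convention is assumed)."""
    start = (pos - 1) - seq_len // 2
    return start, start + seq_len


def ref_alt_sequences(chrom_seq, pos, ref, alt, seq_len):
    """Build the reference and alternative model inputs for one SNV."""
    start, end = centered_window(pos, seq_len)
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window runs off the end of the chromosome")
    offset = (pos - 1) - start  # index of the variant inside the window
    ref_seq = chrom_seq[start:end]
    if ref_seq[offset] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    # In-silico mutagenesis: swap the reference base for the alternative one
    alt_seq = ref_seq[:offset] + alt + ref_seq[offset + 1:]
    return ref_seq, alt_seq


chrom_seq = "ACGTACGTACGT"  # toy chromosome, for illustration only
ref_seq, alt_seq = ref_alt_sequences(chrom_seq, pos=6, ref="C", alt="T", seq_len=5)
print(ref_seq, alt_seq)  # prints: TACGT TATGT
```

Both sequences would then be passed to the model, and the two predictions compared by the chosen scoring method.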

The selected scoring methods compare model predictions for sequences carrying the reference or the alternative allele. Available scoring methods are `Diff` for simple subtraction of predictions, `Logit` for subtraction of logit-transformed model predictions, and `DeepSEA_effect`, a combination of `Diff` and `Logit` that was published by Zhou and Troyanskaya (2015).
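
As a rough illustration of how these scores relate to each other, the sketch below computes them for a single pair of predictions. It rests on several assumptions that are not stated in this notebook: that scores are computed as alternative minus reference, that `DeepSEA_effect` is the product of the absolute `Diff` and absolute `Logit` scores, and that probabilities are clipped by a small `eps` before the logit transform. The function names are hypothetical and do not refer to the kipoi-veff2 API.

```python
import math


def logit(p, eps=1e-7):
    """Logit transform with clipping (eps is an assumed safeguard)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def diff(ref_pred, alt_pred):
    """Simple difference of predictions (assumed direction: alt - ref)."""
    return alt_pred - ref_pred


def logit_diff(ref_pred, alt_pred):
    """Difference of logit-transformed predictions."""
    return logit(alt_pred) - logit(ref_pred)


def deepsea_effect(ref_pred, alt_pred):
    """Assumed combination: |logit difference| * |difference|."""
    return abs(logit_diff(ref_pred, alt_pred)) * abs(diff(ref_pred, alt_pred))


ref_pred, alt_pred = 0.2, 0.5  # made-up model outputs for one target
print(diff(ref_pred, alt_pred))
print(logit_diff(ref_pred, alt_pred))
print(deepsea_effect(ref_pred, alt_pred))
```

The logit-based score stretches differences near 0 and 1, so the same absolute change in probability counts for more when the prediction is close to saturation.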

This IPython notebook assumes that it is executed in an environment in which kipoi-veff2 is installed. For more information, see https://github.com/kipoi/kipoi-veff2#install-the-conda-environment

In [9]:
from kipoi_veff2 import variant_centered

vcf_file = "example_data/clinvar_donor_acceptor_chr22.vcf"
fasta_file = "example_data/hg19_chr22.fa"
output_file = "output.tsv"
model_name = "DeepSEA/variantEffects"

model_group = model_name.split("/")[0]
model_group_config_dict = (
    variant_centered.VARIANT_CENTERED_MODEL_GROUP_CONFIGS.get(
        model_group, {}
    )
)

model_config = variant_centered.get_model_config(
    model_name, **model_group_config_dict
)

variant_centered.score_variants(
    model_config=model_config,
    vcf_file=vcf_file,
    fasta_file=fasta_file,
    output_file=output_file,
)


Using downloaded and verified file: /Users/b260/.kipoi/models/DeepSEA/variantEffects/downloaded/model_files/weights/35956ab9c28960b5a3693f470fe980c1


Let's have a look at the annotated output TSV:

In [12]:
import pandas as pd

output_dataframe = pd.read_csv("output.tsv", sep='\t')
print(output_dataframe.iloc[: 5, : 10])

  #CHROM       POS   ID REF ALT  DeepSEA/variantEffects/8988T_DNase_None/diff  \
0  chr22  41320486    4   G   T                                     -0.001468   
1  chr22  31009031    9   T   G                                     -0.038191   
2  chr22  43024150   15   C   G                                      0.013784   
3  chr22  43027392   16   A   G                                     -0.060475   
4  chr22  37469571  122   C   T                                     -0.015216   

   DeepSEA/variantEffects/AoSMC_DNase_None/diff  \
0                                      0.001205   
1                                     -0.019323   
2                                      0.001041   
3                                     -0.186859   
4                                      0.012377   

   DeepSEA/variantEffects/Chorion_DNase_None/diff  \
0                                       -0.001497   
1                                       -0.009417   
2                                        0.0072
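
Judging from the header printed above, each score column appears to be named `<model name>/<target>/<scoring method>`. The sketch below splits such a name into its parts. The helper `split_score_column` is hypothetical, and the approach assumes that target names never contain a `/`.

```python
def split_score_column(column):
    """Split a score column name into (model, target, score).

    Assumes the layout '<model name>/<target>/<score>' seen in the
    output above, with no '/' inside the target name.
    """
    prefix, score = column.rsplit("/", 1)
    model, target = prefix.rsplit("/", 1)
    return model, target, score


columns = [
    "DeepSEA/variantEffects/8988T_DNase_None/diff",
    "DeepSEA/variantEffects/AoSMC_DNase_None/diff",
    "DeepSEA/variantEffects/Chorion_DNase_None/diff",
]
for column in columns:
    print(split_score_column(column))
```

Splitting from the right keeps the model name intact even though it contains a `/` itself.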

## Interval based effect prediction

The same API also extends to models that already perform variant effect prediction themselves, for example the models in the MMSplice group. Here we use the Kipoi dataloader directly, since it already takes a VCF, a FASTA and a GTF file. We then extract the variant information from the metadata of the dataloader output and use it to annotate the output predictions.

In [3]:
from kipoi_veff2 import interval_based

vcf_file = "example_data/test.vcf"
fasta_file = "example_data/test.fa"
gtf_file = "example_data/test.gtf"
output_file = "output_MMSplice_mtsplice.tsv"
model_name = "MMSplice/mtsplice"

model_config = interval_based.INTERVAL_BASED_MODEL_CONFIGS[model_name]
interval_based.score_variants(
    model_config=model_config,
    vcf_file=vcf_file,
    fasta_file=fasta_file,
    gtf_file=gtf_file,
    output_file=output_file,
)


Using TensorFlow backend.


Let's have a look at the annotated TSV:

In [4]:
import pandas as pd

output_dataframe = pd.read_csv("output_MMSplice_mtsplice.tsv", sep='\t')
print(output_dataframe.iloc[: 5, : 10])

   #CHROM       POS                                       ID        REF ALT  \
0      17  41197805  17:41197805:ACATCTGCC>A:ENSE00001814242  ACATCTGCC   A   
1      17  41197805  17:41197805:ACATCTGCC>A:ENSE00001312675  ACATCTGCC   A   
2      17  41197805  17:41197805:ACATCTGCC>A:ENSE00001831829  ACATCTGCC   A   
3      17  41197805  17:41197805:ACATCTGCC>A:ENSE00002914501  ACATCTGCC   A   
4      17  41197805  17:41197805:ACATCTGCC>A:ENSE00001937547  ACATCTGCC   A   

   MMSplice/mtsplice/Retina_Eye  MMSplice/mtsplice/RPE_Choroid_Sclera_Eye  \
0                      0.245053                                 -0.105601   
1                      0.244987                                 -0.106446   
2                      0.269683                                 -0.102571   
3                      0.398744                                 -0.087164   
4                      0.106025                                 -0.007285   

   MMSplice/mtsplice/Subcutaneous_Adipose  \
0                
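
In the output above, the `ID` column appears to combine the variant with an exon identifier in the form `chrom:pos:ref>alt:exon`. A minimal sketch for unpacking it is shown below. The layout is inferred only from the printed rows, and `VariantExon` and `parse_variant_exon_id` are hypothetical names, not part of kipoi-veff2.

```python
from typing import NamedTuple


class VariantExon(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    exon_id: str


def parse_variant_exon_id(identifier):
    """Parse an ID of the assumed form 'chrom:pos:ref>alt:exon'."""
    chrom, pos, alleles, exon_id = identifier.split(":")
    ref, alt = alleles.split(">")
    return VariantExon(chrom, int(pos), ref, alt, exon_id)


record = parse_variant_exon_id("17:41197805:ACATCTGCC>A:ENSE00001814242")
print(record)
```

This makes it easy to see that the first five rows describe the same deletion scored against five different exons.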

## Variant effect prediction at scale

A typical use case for a variant effect prediction pipeline is to predict across many models and many VCF files or VCF/FASTA pairs. To make use of high-performance clusters and score variants at scale, we use Snakemake. Below is an example Snakefile for running kipoi-veff2 across 24 VCF/FASTA pairs from the 1000 Genomes Project. The VCF files are available at http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/. We assume Snakemake is run from a conda environment in which kipoi-veff2 is installed. In this specific case, the total number of submitted scoring jobs is 24*(600+102) = 16848: 600 models in the pwm_HOCOMOCO model group and 102 in DeepBind/Homo_sapiens/RBP for each of the 24 VCF/FASTA pairs.
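
The job count can be reproduced with a few lines of Python. The model counts are the ones quoted above; the number of merge jobs is derived from the `merge_per_group` rule in the Snakefile below, which writes one merged file per group and VCF, and is an inference rather than a figure from the original text.

```python
n_pairs = 24  # VCF/FASTA pairs
models_per_group = {
    "pwm_HOCOMOCO": 600,
    "DeepBind/Homo_sapiens/RBP": 102,
}

# One scoring job per (model, VCF/FASTA pair)
scoring_jobs = n_pairs * sum(models_per_group.values())
# One merge job per (model group, VCF/FASTA pair)
merge_jobs = n_pairs * len(models_per_group)

print(scoring_jobs)  # prints: 16848
print(merge_jobs)    # prints: 48
```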

In [None]:
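from pathlib import Path

data_dir = "/data/1000genomevcfs/"

def get_vcf_fasta_pair():
    """Map each VCF id (file name without .vcf.gz) to its per-chromosome fasta file"""
    # Match on the full ".vcf.gz" ending so that index files such as ".vcf.gz.tbi" are skipped
    vcf_files = [
        p.name[: -len(".vcf.gz")]
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.name.endswith(".vcf.gz")
    ]
    # The second dot-separated field is expected to be the chromosome, e.g. "chr22"
    vcf_fasta_pair = {vf: f'chr{vf.split(".")[1][3:]}_standardized.fa' for vf in vcf_files}
    return vcf_fasta_pair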

def get_args(wildcards):
    """Return the extra command-line arguments for the model being scored.
    Snakemake calls params functions with the wildcards of the job.
    """
    if "MMSplice" in wildcards.model:
        return "-g input/test.gtf"
    else:
        return "-s diff"

groups = ["pwm_HOCOMOCO", "DeepBind/Homo_sapiens/RBP"]

def get_list_of_models():
    """Return a dict mapping each model group
    to the sorted list of models that belong to it"""
    from kipoi import list_models
    all_models = list_models().model
    group_to_models = {group: sorted(all_models[all_models.str.contains(group)]) for group in groups}
    return group_to_models

group_to_models = get_list_of_models()
vcf_fasta_pair = get_vcf_fasta_pair()

rule all:
    input: 
        expand("merged__{group}__{id}.tsv", group=groups, id=vcf_fasta_pair.keys())

rule run_vep:
    input: 
        vcf = data_dir+"{id}.vcf.gz",
        fasta = lambda wildcards: data_dir+vcf_fasta_pair[wildcards.id], 
    output: 
        "output__{model}__{id}.tsv"
    params: 
        model_args = get_args
    shell: 
        "kipoi_veff2_predict {input.vcf} {input.fasta} {params.model_args} {output} -m {wildcards.model}"

rule merge_per_group:
    input:
        lambda wildcards: expand("output__{model}__{{id}}.tsv", model=group_to_models[wildcards.group])
    output:
        "merged__{group}__{id}.tsv"
    resources:
        mem_mb=28000
    shell:
        "kipoi_veff2_merge {input} {output}"

I highly recommend making a profile for your cluster type using https://github.com/Snakemake-Profiles/. Run the above Snakefile with `snakemake --profile <name_of_your_profile>`.
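
Before submitting thousands of jobs, it can help to check the file-name convention that `get_vcf_fasta_pair` in the Snakefile relies on. The sketch below applies the same string logic to a single name without touching the file system. The helper `fasta_for_vcf` is hypothetical, and the example file name is only assumed to follow the naming pattern of the phase 3 VCF files.

```python
from pathlib import PurePosixPath


def fasta_for_vcf(vcf_name):
    """Derive the VCF id and the expected fasta name from a VCF file name.

    Mirrors the string handling in get_vcf_fasta_pair: drop ".gz" and
    ".vcf", then read the chromosome from the second dot-separated field.
    """
    vcf_id = PurePosixPath(vcf_name).stem.replace(".vcf", "")
    chrom = vcf_id.split(".")[1][3:]  # "chr22" -> "22"
    return vcf_id, f"chr{chrom}_standardized.fa"


# Illustrative file name (assumed pattern, not taken from this notebook)
example = "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
vcf_id, fasta_name = fasta_for_vcf(example)
print(vcf_id)
print(fasta_name)  # prints: chr22_standardized.fa
```

If the VCF files are named differently, the chromosome lookup in the second dot-separated field will need to be adapted.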

For more information and examples please check https://github.com/kipoi/kipoi-veff2