# Variant effect prediction
Variant effect prediction offers a simple way to predict the effects of single-nucleotide variants (SNVs) with any model that takes DNA sequence as input. Several scoring methods are available, all of which rely on in-silico mutagenesis. By default the input is a VCF file and the output is a copy of that VCF annotated with the predicted variant effects.

For details, including how this is implemented programmatically, please refer to the documentation in Postprocessing/Variant effect prediction. This IPython notebook goes through the basic programmatic steps that are needed to perform variant effect prediction. First a variant-centered approach is taken, and then overlap-based variant effect prediction is presented.

## Variant centered effect prediction
Models that accept `.bed` files as input can make use of variant-centered effect prediction. This procedure starts from the query VCF and generates genomic regions of the length of the model input, each centered on an individual variant in the VCF. The model dataloader is then used to produce the model input samples for those regions, which are then mutated according to the alleles in the VCF.
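
The region generation and mutation steps described above can be sketched in plain Python. This is only an illustration, not Kipoi's implementation: the helper names `centered_region` and `apply_allele` are hypothetical, and the choice of where to put the extra base for even window lengths is an assumption. Chromosome boundaries are not handled.

```python
def centered_region(chrom, pos, seq_len):
    """Return a BED-style (chrom, start, end) window around a 1-based VCF position.

    The window is 0-based and half-open, as in BED files.
    Assumption: for even seq_len the extra base goes to the right of the variant.
    """
    left_flank = (seq_len - 1) // 2
    start = (pos - 1) - left_flank
    return chrom, start, start + seq_len


def apply_allele(window_seq, region_start, pos, ref, alt):
    """Replace the reference allele inside the window with the alternative allele."""
    offset = (pos - 1) - region_start
    observed = window_seq[offset:offset + len(ref)]
    if observed.upper() != ref.upper():
        raise ValueError("Reference allele %r does not match sequence %r" % (ref, observed))
    return window_seq[:offset] + alt + window_seq[offset + len(ref):]


# Toy example with an 11 bp model input and a made-up sequence
chrom, start, end = centered_region("chr22", 100, 11)
ref_seq = "ACGTACGTACG"
alt_seq = apply_allele(ref_seq, start, 100, "C", "T")
print(chrom, start, end, ref_seq, alt_seq)
```

The model would then be run once on `ref_seq` and once on `alt_seq`, and the two predictions passed to a scoring method.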

The selected scoring methods compare model predictions for sequences carrying the reference allele with those for sequences carrying the alternative allele. Available scoring methods include `Diff`, a simple subtraction of predictions; `Logit`, a subtraction of logit-transformed model predictions; and `DeepSEA_effect`, a combination of `Diff` and `Logit` that was published in Troyanskaya et al. (2015).
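
To make the scoring methods concrete, here is a minimal stdlib sketch operating on single prediction values. The function names are hypothetical and are not the Kipoi classes. Two things are assumptions: the direction of the subtraction (alternative minus reference) and the way the two scores are combined in `combined_score` (a product of absolute values); the text above only states that `DeepSEA_effect` combines `Diff` and `Logit`.

```python
import math


def logit(p, eps=1e-7):
    """Logit transform, with clipping to avoid log(0) for predictions of exactly 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def diff_score(ref_pred, alt_pred):
    # Assumption: effect is reported as alternative minus reference
    return alt_pred - ref_pred


def logit_score(ref_pred, alt_pred):
    return logit(alt_pred) - logit(ref_pred)


def combined_score(ref_pred, alt_pred):
    # Assumption: combine both scores as a product of absolute values
    return abs(diff_score(ref_pred, alt_pred)) * abs(logit_score(ref_pred, alt_pred))


for ref_pred, alt_pred in [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]:
    print(ref_pred, alt_pred,
          diff_score(ref_pred, alt_pred),
          logit_score(ref_pred, alt_pred),
          combined_score(ref_pred, alt_pred))
```

The logit-based score stretches differences near 0 and 1, so a change from 0.01 to 0.05 counts for more than the same absolute change around 0.5.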

Let's start out by loading the DeepSEA model and dataloader factory:

In [1]:
import kipoi
model_name = "DeepSEA/variantEffects"
kipoi.pipeline.install_model_requirements(model_name)
# get the model
model = kipoi.get_model(model_name)
# get the dataloader factory
Dataloader = kipoi.get_dataloader_factory(model_name)

Conda dependencies to be installed:
['h5py', 'pytorch::pytorch-cpu>=0.2.0']
Fetching package metadata .................
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi:
#
h5py                      2.7.1            py35ha1a889d_0  
pytorch-cpu               0.3.1                py35_cpu_2    pytorch
pip dependencies to be installed:
[]


Next we have to define the variants we want to look at. Let's look at a sample VCF for chromosome 22:

In [2]:
!head -n 40 example_data/clinvar_donor_acceptor_chr22.vcf

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##contig=<ID=chrMT,length=16569>

Now we will define path variables for the VCF input and output files and instantiate a `VcfWriter`, which will write out the annotated VCF:

In [3]:
from kipoi.postprocessing import VcfWriter
# The input vcf path
vcf_path = "example_data/clinvar_donor_acceptor_chr22.vcf"
# The output vcf path, based on the input file name    
out_vcf_fpath = vcf_path[:-4] + "%s.vcf"%model_name.replace("/", "_")
# The writer object that will output the annotated VCF
writer = VcfWriter(model, vcf_path, out_vcf_fpath)
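
The output file name above is built by slicing off the last four characters of the input path, which silently assumes a `.vcf` extension. A slightly more defensive variant is sketched below; `annotated_vcf_path` is a hypothetical helper name, not part of Kipoi.

```python
import os


def annotated_vcf_path(vcf_path, model_name):
    """Insert the model name (with '/' replaced by '_') before the file extension."""
    stem, ext = os.path.splitext(vcf_path)
    return stem + model_name.replace("/", "_") + ext


print(annotated_vcf_path("example_data/clinvar_donor_acceptor_chr22.vcf",
                         "DeepSEA/variantEffects"))
```

For a path ending in `.vcf` this produces the same name as the slicing expression used in the cell above.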

Then we need to instantiate an object that can generate variant-centered regions (`SnvCenteredRg` objects). This class needs information on the model input sequence length which is extracted automatically within `ModelInfoExtractor` objects:

In [4]:
# Information extraction from dataloader and model
model_info = kipoi.postprocessing.ModelInfoExtractor(model, Dataloader)
# vcf_to_region will generate a variant-centered region when presented with a VCF record.
vcf_to_region = kipoi.postprocessing.SnvCenteredRg(model_info)

Now we can define the required dataloader arguments, omitting the `intervals_file` as this will be replaced by the automatically generated bed file:

In [5]:
dataloader_arguments = {"fasta_file": "example_data/hg19_chr22.fa"}

This is the moment to run the variant effect prediction:

In [6]:
import kipoi.postprocessing.snv_predict as sp
from kipoi.postprocessing.variant_effects import Diff, DeepSEA_effect
sp.predict_snvs(model,
                Dataloader,
                vcf_path,
                batch_size = 32,
                dataloader_args=dataloader_arguments,
                vcf_to_region=vcf_to_region,
                evaluation_function_kwargs={'diff_types': {'diff': Diff("mean"), 'deepsea_scr': DeepSEA_effect("mean")}},
                sync_pred_writer=writer)
writer.close()

  0%|          | 0/14 [00:00<?, ?it/s]
  7%|▋         | 1/14 [00:02<00:38,  2.96s/it]
 14%|█▍        | 2/14 [00:05<00:35,  2.97s/it]
 21%|██▏       | 3/14 [00:08<00:32,  2.97s/it]
 29%|██▊       | 4/14 [00:11<00:29,  2.97s/it]
 36%|███▌      | 5/14 [00:14<00:26,  2.98s/it]
 43%|████▎     | 6/14 [00:17<00:23,  2.98s/it]
 50%|█████     | 7/14 [00:20<00:20,  2.99s/it]
 57%|█████▋    | 8/14 [00:23<00:17,  3.00s/it]
 64%|██████▍   | 9/14 [00:26<00:14,  3.00s/it]
 71%|███████▏  | 10/14 [00:30<00:12,  3.00s/it]
 79%|███████▊  | 11/14 [00:33<00:09,  3.01s/it]
 86%|████████▌ | 12/14 [00:36<00:06,  3.01s/it]
 93%|█████████▎| 13/14 [00:39<00:03,  3.02s/it]
100%|██████████| 14/14 [00:39<00:00,  

In the example above we used the variant scoring methods `Diff` and `DeepSEA_effect` from `kipoi.postprocessing.variant_effects`. As mentioned above, variant scoring methods calculate the difference between predictions for the reference and the alternative allele, but there is another dimension to this: models that have the `use_rc: true` flag set in their model.yaml file (DeepSEA/variantEffects does) are queried not only with the reference- and alternative-carrying input sequences, but also with the reverse complement of those sequences. To specify how predictions for forward and reverse sequences are combined, the variant scoring methods take an initialisation argument (here set to `"mean"`). With `"mean"`, the effect (e.g. the difference) is calculated for the forward and for the reverse-complement sequence, and the average of the two is returned. Setting `"mean"` matches what was used in the Troyanskaya et al. publication.
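
The `"mean"` merging of forward and reverse-complement effects can be written out as follows. This is a sketch under the description above, using hypothetical helper names (`reverse_complement`, `merged_diff`) and plain floats in place of model output arrays. The subtraction direction (alternative minus reference) is an assumption.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def merged_diff(ref, alt, ref_rc, alt_rc, mode="mean"):
    """Compute the effect on each strand, then merge the two effects."""
    effect_fwd = alt - ref        # effect on the forward sequence
    effect_rc = alt_rc - ref_rc   # effect on the reverse-complement sequence
    if mode == "mean":
        return (effect_fwd + effect_rc) / 2.0
    raise ValueError("Only 'mean' is implemented in this sketch")


print(reverse_complement("ACGTN"))
print(merged_diff(ref=0.2, alt=0.5, ref_rc=0.4, alt_rc=0.5))
```

Note that the averaging happens after the effect is computed per strand, not on the raw predictions.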

Now let's look at the output:

In [7]:
# Print the first 40 lines, truncated to at most 80 characters each
with open("example_data/clinvar_donor_acceptor_chr22DeepSEA_variantEffects.vcf") as ifh:
    for i, line in enumerate(ifh):
        if i >= 40:
            break
        # Mark lines that were cut off with a trailing "..."
        suffix = "..." if len(line) > 80 else ""
        print(line[:80].rstrip() + suffix)

##fileformat=VCFv4.0
##INFO=<ID=KPVEP_DeepSEA:0.94_DIFF,Number=.,Type=String,Description="DIFF SNV ef...
##INFO=<ID=KPVEP_DeepSEA:0.94_DEEPSEA_SCR,Number=.,Type=String,Description="DEEP...
##INFO=<ID=KPVEP_DeepSEA:0.94_rID,Number=.,Type=String,Description="Range or reg...
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,le

This shows that variants have been annotated with variant effect scores. The INFO tag name indicates which model was used and which version of it, followed by the scoring function label (`DIFF`), which is derived from the label defined in the `evaluation_function_kwargs` dictionary (`'diff'`).
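
Based only on the header lines printed above, the INFO tag names appear to follow the pattern `KPVEP_<model>:<version>_<LABEL>`. The following sketch splits such a tag into its parts. The pattern is inferred from this output rather than taken from a specification, `split_info_tag` is a hypothetical helper, and it assumes the version string contains no underscore.

```python
def split_info_tag(tag):
    """Split a tag like 'KPVEP_DeepSEA:0.94_DIFF' into (prefix, model, version, label)."""
    prefix, rest = tag.split("_", 1)
    model, rest = rest.split(":", 1)
    version, label = rest.split("_", 1)
    return prefix, model, version, label


for tag in ["KPVEP_DeepSEA:0.94_DIFF",
            "KPVEP_DeepSEA:0.94_DEEPSEA_SCR",
            "KPVEP_DeepSEA:0.94_rID"]:
    print(split_info_tag(tag))
```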

If you want to access the predictions programmatically you can either load the VCF or keep all the results in memory and receive them as a dictionary of pandas dataframes:


In [8]:
effects = sp.predict_snvs(model,
            Dataloader,
            vcf_path,
            batch_size = 32,
            dataloader_args=dataloader_arguments,
            vcf_to_region=vcf_to_region,
            evaluation_function_kwargs={'diff_types': {'diff': Diff("mean"), 'deepsea_scr': DeepSEA_effect("mean")}},
            return_predictions=True)

100%|██████████| 14/14 [00:38<00:00,  2.77s/it]


For every scoring function label defined in the `evaluation_function_kwargs` dictionary there is a key in `effects` (the equivalent of an additional INFO tag in the VCF). Now let's take a look at the results:

In [9]:
for k in effects:
    print(k)
    print(effects[k].head())
    print("-"*80)

diff
                        8988T_DNase_None  AoSMC_DNase_None  \
chr22:41320486:G:['T']         -0.002850         -0.000094   
chr22:31009031:T:['G']         -0.027333         -0.008740   
chr22:43024150:C:['G']          0.010773          0.000702   
chr22:43027392:A:['G']         -0.121747         -0.247321   
chr22:37469571:C:['T']         -0.006546          0.000784   

                        Chorion_DNase_None  CLL_DNase_None  \
chr22:41320486:G:['T']           -0.001533       -0.000353   
chr22:31009031:T:['G']           -0.003499       -0.008143   
chr22:43024150:C:['G']            0.004689       -0.000609   
chr22:43027392:A:['G']           -0.167689       -0.010695   
chr22:37469571:C:['T']           -0.000383       -0.000924   

                        Fibrobl_DNase_None  FibroP_DNase_None  \
chr22:41320486:G:['T']           -0.000856          -0.000132   
chr22:31009031:T:['G']           -0.017288          -0.018895   
chr22:43024150:C:['G']            0.002075           0

We see that for `diff` and `deepsea_scr` there is a dataframe with variant identifiers as rows and model output labels as columns. The DeepSEA model predicts 919 tasks simultaneously hence there are 919 columns in the dataframe.
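
The row labels in the dataframes shown above have the form `chr22:41320486:G:['T']`. If you need the individual fields, they can be recovered with a small parser such as the one below. This is a sketch based on the labels as printed in this notebook; `parse_variant_id` is a hypothetical helper and not part of Kipoi.

```python
import ast


def parse_variant_id(variant_id):
    """Split 'chrom:pos:ref:[alts]' into (chrom, pos, ref, list_of_alts)."""
    chrom, pos, ref, alts = variant_id.split(":", 3)
    # The alternative alleles are printed as a Python list literal, e.g. ['T']
    return chrom, int(pos), ref, ast.literal_eval(alts)


print(parse_variant_id("chr22:41320486:G:['T']"))
```

With a dataframe in hand, the same function could be applied to each entry of its index to build separate chromosome and position columns.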

## Overlap based prediction
Models that cannot predict on every region of the genome might not accept a `.bed` file as dataloader input. An example is a splicing model, which only works in certain regions of the genome. Here variant effect prediction can be executed based on overlaps between the regions generated by the dataloader and the variants defined in the VCF. The procedure is similar, except that the VCF must be tabixed so that regional lookups can be performed efficiently. This can be done with the `ensure_tabixed_vcf` function; the rest remains the same as before:

In [10]:
import kipoi
from kipoi.postprocessing import VcfWriter
from kipoi.postprocessing import ensure_tabixed_vcf
# Use a splicing model
model_name = "HAL"
# install dependencies
kipoi.pipeline.install_model_requirements(model_name)
# get the model
model = kipoi.get_model(model_name)
# get the dataloader factory
Dataloader = kipoi.get_dataloader_factory(model_name)
# The input vcf path
vcf_path = "example_data/clinvar_donor_acceptor_chr22.vcf"

# Make sure that the vcf is bgzipped and tabixed, if not then generate the compressed vcf in the same place
vcf_path_tbx = ensure_tabixed_vcf(vcf_path)

# The output vcf path, based on the input file name    
out_vcf_fpath = vcf_path[:-4] + "%s.vcf"%model_name.replace("/", "_")
# The writer object that will output the annotated VCF
writer = VcfWriter(model, vcf_path, out_vcf_fpath)

Conda dependencies to be installed:
['numpy']
Fetching package metadata ...............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi:
#
numpy                     1.14.1           py35h3dfced4_1  
pip dependencies to be installed:
[]
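
The regional lookup that a tabix index makes efficient can be pictured with a small in-memory stand-in: given the sorted variant positions of one chromosome, find those that fall inside a region produced by the dataloader. This sketch uses binary search from the standard library and does not read tabix files. `variants_in_region` is a hypothetical helper, and the positions are made up.

```python
from bisect import bisect_left, bisect_right


def variants_in_region(sorted_positions, start, end):
    """Return the 1-based variant positions inside a BED-style region.

    The region is 0-based and half-open, so it covers 1-based positions
    start + 1 through end (inclusive).
    """
    lo = bisect_left(sorted_positions, start + 1)
    hi = bisect_right(sorted_positions, end)
    return sorted_positions[lo:hi]


positions = [100, 150, 200, 250]  # made-up variant positions on one chromosome
print(variants_in_region(positions, 149, 200))
print(variants_in_region(positions, 150, 199))
```

Regions without any overlapping variant return an empty list and would simply produce no prediction.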


This time we don't need an object that generates regions, hence we can directly define the dataloader arguments and run the prediction:

In [11]:
import kipoi.postprocessing.snv_predict as sp
from kipoi.postprocessing.variant_effects import Diff
dataloader_arguments = {"gtf_file": "example_data/Homo_sapiens.GRCh37.75.filtered_chr22.gtf",
                        "fasta_file": "example_data/hg19_chr22.fa"}

effects = sp.predict_snvs(model,
                          Dataloader,
                          vcf_path_tbx,
                          batch_size=32,
                          dataloader_args=dataloader_arguments,
                          evaluation_function_kwargs={'diff_types': {'diff': Diff("mean")}},
                          sync_pred_writer=writer,
                          return_predictions=True)
# Close the writer, as in the first example, before reading the annotated VCF back
writer.close()

  0%|          | 0/709 [00:00<?, ?it/s]
  4%|▎         | 26/709 [00:00<00:03, 221.61it/s]
  6%|▌         | 44/709 [00:00<00:03, 190.23it/s]
  8%|▊         | 54/709 [00:00<00:04, 145.32it/s]
  9%|▉         | 63/709 [0

Let's have a look at the VCF:

In [12]:
# Print the first 40 lines, truncated to at most 80 characters each
with open("example_data/clinvar_donor_acceptor_chr22HAL.vcf") as ifh:
    for i, line in enumerate(ifh):
        if i >= 40:
            break
        # Mark lines that were cut off with a trailing "..."
        suffix = "..." if len(line) > 80 else ""
        print(line[:80].rstrip() + suffix)

##fileformat=VCFv4.0
##INFO=<ID=KPVEP_Model_from_Rose:0.1_DIFF,Number=.,Type=String,Description="DIFF...
##INFO=<ID=KPVEP_Model_from_Rose:0.1_rID,Number=.,Type=String,Description="Range...
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=4812989

This time the prediction output is less helpful, because the index shows the ids that the dataloader created rather than variant identifiers. In general it is advisable to use the output VCF for more detailed information on which variant was overlapped with which region to produce a prediction.

In [13]:
for k in effects:
    print(k)
    print(effects[k].head())
    print("-"*80)

diff
            0
290  0.105865
293  0.000000
299  0.000000
304  0.105865
307  0.000000
--------------------------------------------------------------------------------
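
Since the output VCF is the recommended place to look up which variant produced which prediction, it helps to be able to pull the INFO column apart. The sketch below parses the generic `key=value;key=value` INFO syntax of the VCF format. `parse_info` is a hypothetical helper, and the example record, including its values and region id, is made up for illustration and does not come from the files in this notebook.

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flag entries without '=' map to True."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


# Made-up record: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
record = "chr22\t100\t.\tG\tT\t.\t.\tKPVEP_Model:0.1_DIFF=0.1;KPVEP_Model:0.1_rID=region_7"
info = parse_info(record.split("\t")[7])
for key, value in info.items():
    print(key, "->", value)
```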


## Batch prediction
Since the syntax is essentially the same for different kinds of models, a simple for-loop can repeat what we just did for many models:

In [14]:
import kipoi
# Select the models for which to run effect prediction
models_df = kipoi.list_models()
models_substr = ["HAL", "MaxEntScan", "labranchor", "rbp"]
models_df_subsets = {ms: models_df.loc[models_df["model"].str.contains(ms)] for ms in models_substr}

Let's make sure that all the dependencies are installed:

In [15]:
for ms in models_substr:
    model_name = models_df_subsets[ms]["model"].iloc[0]
    kipoi.pipeline.install_model_requirements(model_name)

Conda dependencies to be installed:
['numpy']
Fetching package metadata ...............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi:
#
numpy                     1.14.1           py35h3dfced4_1  
pip dependencies to be installed:
[]
Conda dependencies to be installed:
['bioconda::maxentpy']
Fetching package metadata ...............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi:
#
maxentpy                  0.0.1                    py35_0    bioconda
pip dependencies to be installed:
[]
Conda dependencies to be installed:
[]
pip dependencies to be installed:
['tensorflow>=1.0.0', 'keras>=2.0.4']
Conda dependencies to be installed:
[]
pip dependencies to be installed:
['concise>=0.6.5', 'tensorflow==1.4.1', 'keras>=2.0.4']


Now we are good to go:

In [16]:
# Run variant effect prediction using a basic Diff

import kipoi
from kipoi.postprocessing import ensure_tabixed_vcf
import kipoi.postprocessing.snv_predict as sp
from kipoi.postprocessing import VcfWriter
from kipoi.postprocessing.variant_effects import Diff



splicing_dl_args = {"gtf_file": "example_data/Homo_sapiens.GRCh37.75.filtered_chr22.gtf",
                    "fasta_file": "example_data/hg19_chr22.fa"}
dataloader_args_dict = {"HAL": splicing_dl_args,
                        "labranchor": splicing_dl_args,
                        "MaxEntScan": splicing_dl_args,
                        "rbp": {"fasta_file": "example_data/hg19_chr22.fa",
                                "gtf_file": "example_data/Homo_sapiens.GRCh37.75_chr22.gtf"}
                        }

for ms in models_substr:
    model_name = models_df_subsets[ms]["model"].iloc[0]
    #kipoi.pipeline.install_model_requirements(model_name)
    model = kipoi.get_model(model_name)
    vcf_path = "example_data/clinvar_donor_acceptor_chr22.vcf"
    vcf_path_tbx = ensure_tabixed_vcf(vcf_path)
    
    out_vcf_fpath = vcf_path[:-4] + "%s.vcf"%model_name.replace("/", "_")

    writer = VcfWriter(model, vcf_path, out_vcf_fpath)
    
    print(model_name)
    
    Dataloader = kipoi.get_dataloader_factory(model_name)
    dataloader_arguments = dataloader_args_dict[ms]
    model_info = kipoi.postprocessing.ModelInfoExtractor(model, Dataloader)
    vcf_to_region = None
    if ms == "rbp":
        vcf_to_region = kipoi.postprocessing.SnvCenteredRg(model_info)
    sp.predict_snvs(model,
                    Dataloader,
                    vcf_path_tbx,
                    batch_size = 32,
                    dataloader_args=dataloader_arguments,
                    vcf_to_region=vcf_to_region,
                    evaluation_function_kwargs={'diff_types': {'diff': Diff("mean")}},
                    sync_pred_writer=writer)
    writer.close()

HAL


  0%|          | 0/709 [00:00<?, ?it/s]
  4%|▎         | 26/709 [00:00<00:03, 214.86it/s]
  6%|▌         | 44/709 [00:00<00:03, 185.67it/s]
  8%|▊         | 54/709 [00:00<00:04, 143.21it/s]
  9%|▉         | 63/709 [0

MaxEntScan/3prime


  0%|          | 0/709 [00:00<?, ?it/s]
  5%|▌         | 38/709 [00:00<00:01, 366.60it/s]
  7%|▋         | 51/709 [00:00<00:02, 250.34it/s]
  9%|▉         | 64/709 [00:00<00:03, 207.88it/s]
 17%|█▋        | 118/709 [00:00<00:02, 282.20it/s]
 24%|██▎       | 167/709 [00:00<00:01, 321.96it/s]

labranchor


  0%|          | 0/709 [00:00<?, ?it/s]
  4%|▎         | 26/709 [00:00<00:08, 81.36it/s]
  5%|▌         | 38/709 [00:00<00:07, 87.59it/s]
  6%|▌         | 43/709 [00:00<00:10, 64.09it/s]
 17%|█▋        | 118/709 [00:00<00:04, 147.05it/s]
 19%|█▉        | 138/709 [00:00<00:03, 147.75it/s]
 26%|██▌       | 183/709 [00:01<00:02, 176.95it/s]
 30%|██▉       | 210/709 [00:01<00:02, 180.57it/s]

rbp_eclip/AARS


2018-02-23 17:30:09,027 [INFO] Extracted GTF attributes: ['gene_id', 'gene_name', 'gene_source', 'gene_biotype', 'transcript_id', 'transcript_name', 'transcript_source', 'exon_number', 'exon_id', 'tag', 'protein_id', 'ccds_id']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:2018-02-23 17:30:09,346:genomelake] Running landmark extractors..
2018-02-23 17:30:09,346 [INFO] Running landmark extractors..
INFO:2018-02-23 17:30:09,515:genomelake] Done!
2018-02-23 17:30:09,515 [INFO] Done!
  0%|          | 0/14 [00:00<?, ?it/s]
  7%|▋         | 1/14 [00:00<00:06,  1.99it/s]
 14%|█▍        | 2/14 [00:00<00:04,  2.47it/s]
 21%|██▏       | 3/14 [00:01<00:04,  2.68it/s]
 29%|██▊       | 4/14 [00:01<00:03,  2.78i

Let's validate that things have worked by counting the lines of the input VCF and of the annotated output VCFs:

In [17]:
! wc -l example_data/clinvar_donor_acceptor_chr22*.vcf

    450 example_data/clinvar_donor_acceptor_chr22DeepSEA_variantEffects.vcf
   2035 example_data/clinvar_donor_acceptor_chr22HAL.vcf
    794 example_data/clinvar_donor_acceptor_chr22labranchor.vcf
   1176 example_data/clinvar_donor_acceptor_chr22MaxEntScan_3prime.vcf
    449 example_data/clinvar_donor_acceptor_chr22rbp_eclip_AARS.vcf
    447 example_data/clinvar_donor_acceptor_chr22.vcf
   5351 total
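
Line counts from `wc -l` include the header, so a slightly more informative check is to count header lines and records separately. A minimal sketch, assuming plain-text (uncompressed) VCFs; `count_vcf_lines` is a hypothetical helper and the toy lines below are made up.

```python
def count_vcf_lines(lines):
    """Return (n_header_lines, n_records) for an iterable of VCF lines."""
    n_header = 0
    n_records = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            n_header += 1
        else:
            n_records += 1
    return n_header, n_records


toy_vcf = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t100\t.\tG\tT\t.\t.\t.\n",
    "chr22\t200\t.\tA\tC\t.\t.\t.\n",
]
print(count_vcf_lines(toy_vcf))
```

To apply it to a file on disk, pass an open file handle, for example `with open(path) as fh: print(count_vcf_lines(fh))`.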
