# Kipoi Python API

## Quick start

There are three basic building blocks in Kipoi:

- **Source** - provides Models and DataLoaders.
- **Model** - makes predictions from numpy arrays.
- **DataLoader** - loads the data from raw files and transforms them into a form that is directly consumable by the Model.

![img](../docs/theme_dir/img/kipoi-workflow.png)

## List of main commands


- `kipoi.list_sources()`
- `kipoi.get_source()`


- `kipoi.list_models()`
- `kipoi.list_dataloaders()`


- `kipoi.get_model()`
- `kipoi.get_dataloader_factory()`



### Source

Available sources are specified in the config file located at: `~/.kipoi/config.yaml`. Here is an example config file:

```yaml
model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: git@github.com:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    gl:
        type: git-lfs  # custom model
        remote_url: https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git
        local_path: /s/project/model-zoo
```

Three model source types are supported:

- **`git-lfs`** - git repository in which source files are tracked normally by git, while binary files such as model weights (located in `files*` directories) are tracked by [git-lfs](https://git-lfs.github.com).
  - Requires `git-lfs` to be installed.
- **`git`** - git repository in which all files, including weights, are tracked by git (not recommended).
- **`local`** - local directory containing models defined in subdirectories.

For the **`git-lfs`** source type, the large files tracked by `git-lfs` are downloaded into the specified `local_path` directory only once the model has been requested (when invoking `kipoi.get_model()`).
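
As a rough illustration of the source types listed above, the sketch below sanity-checks a config that has already been parsed into a Python dict (for example by a YAML parser). The helper name `check_model_sources` and its validation rules are assumptions made for this example; they are not part of the kipoi API.

```python
# Hypothetical helper, not part of kipoi: sanity-check a parsed config dict.
ALLOWED_TYPES = {"git-lfs", "git", "local"}


def check_model_sources(config):
    """Return a list of problems found in config['model_sources']."""
    problems = []
    for name, spec in config.get("model_sources", {}).items():
        source_type = spec.get("type")
        if source_type not in ALLOWED_TYPES:
            problems.append(f"{name}: unknown type {source_type!r}")
        # Assumption: git-based sources need a remote, every source a local path.
        if source_type in {"git-lfs", "git"} and not spec.get("remote_url"):
            problems.append(f"{name}: missing remote_url")
        if not spec.get("local_path"):
            problems.append(f"{name}: missing local_path")
    return problems


# Dict equivalent of the example config file shown earlier.
config = {
    "model_sources": {
        "kipoi": {
            "type": "git-lfs",
            "remote_url": "git@github.com:kipoi/models.git",
            "local_path": "~/.kipoi/models/",
        },
        "gl": {
            "type": "git-lfs",
            "remote_url": "https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git",
            "local_path": "/s/project/model-zoo",
        },
    }
}

print(check_model_sources(config))
```

An empty list means that no problems were found under the assumed rules.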

#### Note

A particular model/dataloader is identified by its source (say `kipoi` or `gl`) and the path of the desired model directory relative to the model source root (say `rbp/`).

A directory is considered a model if it contains a `model.yaml` file.
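
The rule above can be sketched with the standard library alone. The function `find_model_dirs` is a hypothetical helper written for this example, not the way kipoi itself discovers models; it walks a source root and reports every directory that contains a `model.yaml`.

```python
import os
import tempfile


def find_model_dirs(source_root):
    """Return sorted paths (relative to source_root) of dirs holding a model.yaml."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        if "model.yaml" in filenames:
            rel = os.path.relpath(dirpath, source_root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Build a throwaway source layout to try the function on.
with tempfile.TemporaryDirectory() as root:
    for rel in ["rbp_eclip/UPF1", "DeepSEAKeras"]:
        os.makedirs(os.path.join(root, rel))
        open(os.path.join(root, rel, "model.yaml"), "w").close()
    # A directory without model.yaml is not reported as a model.
    os.makedirs(os.path.join(root, "rbp_eclip", "scratch"))
    models = find_model_dirs(root)

print(models)
```

The relative paths returned here play the role of the model names passed to `kipoi.get_model()`.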

In [1]:
import kipoi

In [54]:
import warnings
warnings.filterwarnings('ignore')

import logging
logging.disable(1000)

In [4]:
kipoi.list_sources()

|   | source | type | location | local_size | n_models | n_dataloaders |
|---|---|---|---|---|---|---|
| 0 | kipoi | git-lfs | /home/avsec/.kipoi/mo... | 8,6G | 779 | 779 |


In [5]:
s = kipoi.get_source("kipoi")

In [6]:
s

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/home/avsec/.kipoi/models/')

In [7]:
kipoi.list_models()

|   | source | model | version | authors | doc | type | inputs | targets | license | cite_as | trained_on | training_procedure | tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kipoi | DeepSEAKeras | 0.1 | [Author(name='Lara Ur... | This CNN is based on ... | keras | seq | TFBS_DHS_probs | MIT | https://doi.org/10.10... | ENCODE and Roadmap Ep... | https://www.nature.co... | [] |
| 1 | kipoi | extended_coda | 0.1 | [Author(name='Johnny ... | Single bp resolution ... | keras | [H3K27AC_subsampled] | [H3K27ac] | MIT | https://doi.org/10.10... | Described in https://... | Described in https://... | [] |
| 2 | kipoi | DeepCpG_DNA/Hou2016_m... | 1.0.4 | [Author(name='Christo... | This is the extractio... | keras | [dna] | [cpg/mESC1, cpg/mESC2... | MIT | https://doi.org/10.11... | scBS-seq and scRRBS-s... | Described in https://... | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 780 | kipoi | CpGenie/SK_N_SH_ENCSR... | 0.1 | [Author(name='Haoyang... | Abstract: DNA methyla... | keras | seq | methylation_prob | Apache License v2 | https://doi.org/10.10... | RRBS (restricted repr... | RMSprop | [] |
| 781 | kipoi | CpGenie/SK_N_SH_ENCSR... | 0.1 | [Author(name='Haoyang... | Abstract: DNA methyla... | keras | seq | methylation_prob | Apache License v2 | https://doi.org/10.10... | RRBS (restricted repr... | RMSprop | [] |
| 782 | kipoi | CpGenie/HEK293_ENCSR0... | 0.1 | [Author(name='Haoyang... | Abstract: DNA methyla... | keras | seq | methylation_prob | Apache License v2 | https://doi.org/10.10... | RRBS (restricted repr... | RMSprop | [] |


## Model

Let's use the `rbp_eclip/UPF1` model from Kipoi.

In [55]:
# Note. Install all the dependencies for that model using
# kipoi env install 
model = kipoi.get_model("rbp_eclip/UPF1")

### Available fields:

#### Model

- type
- args
- info
  - authors
  - name
  - version
  - tags
  - doc
- schema
  - inputs
  - targets
- default_dataloader - loaded dataloader class


- predict_on_batch()
- source
- source_dir
- pipeline
  - predict()
  - predict_example()
  - predict_generator()
  
#### Dataloader

- type
- defined_as
- args
- info (same as for the model)
- output_schema
  - inputs
  - targets
  - metadata


- source
- source_dir
- example_kwargs
- init_example()
- batch_iter()
- batch_train_iter()
- batch_predict_iter()
- load_all()

In [9]:
model

<kipoi.model.KerasModel at 0x7f60ef774dd8>

In [10]:
model.type

'keras'

### Info

In [11]:
model.info

ModelInfo(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='\'RBP binding model from Avsec et al: "Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks". \'\n', name=None, version='0.1', license='MIT', tags=[], cite_as='https://doi.org/10.1093/bioinformatics/btx727', trained_on='RBP occupancy peaks measured by eCLIP-seq (Van Nostrand et al., 2016 - https://doi.org/10.1038/nmeth.3810), https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017\n', training_procedure='Single task training with ADAM')

In [12]:
model.info.version

'0.1'

### Schema

In [13]:
dict(model.schema.inputs)

{'dist_exon_intron': ArraySchema(shape=(1, 10), doc='Distance the nearest exon_intron (splice donor) site transformed with B-splines', name='dist_exon_intron', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_end': ArraySchema(shape=(1, 10), doc='Distance the nearest gene end transformed with B-splines', name='dist_gene_end', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_start': ArraySchema(shape=(1, 10), doc='Distance the nearest gene start transformed with B-splines', name='dist_gene_start', special_type=None, associated_metadata=[], column_labels=None),
 'dist_intron_exon': ArraySchema(shape=(1, 10), doc='Distance the nearest intron_exon (splice acceptor) site transformed with B-splines', name='dist_intron_exon', special_type=None, associated_metadata=[], column_labels=None),
 'dist_polya': ArraySchema(shape=(1, 10), doc='Distance the nearest Poly-A site transformed with B-splines', name='dist_polya', special_type=None, associ

In [14]:
model.schema.targets

ArraySchema(shape=(1,), doc='Predicted binding strength', name=None, special_type=None, associated_metadata=[], column_labels=None)

### Default dataloader

The model already comes with its default dataloader, loaded from the model's source directory:

In [15]:
model.source_dir

'/home/avsec/.kipoi/models/rbp_eclip/UPF1'

In [16]:
model.default_dataloader

dataloader.SeqDistDataset

In [17]:
model.default_dataloader.info

Info(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='RBP binding prediction for UPF1 protein', name=None, version='0.1', license='MIT', tags=[])

### Predict_on_batch

In [18]:
model.predict_on_batch

<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x7f60ef774dd8>>

### Pipeline

The pipeline object takes the dataloader arguments and runs the whole pipeline:

```
dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
```
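
The flow in the diagram can be sketched with toy stand-ins. Everything in the block below (`ToyDataloader`, `ToyModel`, and the two free functions) is hypothetical and only mimics the shape of the idea: one function yields predictions batch by batch, the other collects them into a single result. It is not kipoi's implementation.

```python
class ToyDataloader:
    """Hypothetical stand-in for a dataloader: turns 'raw' arguments into batches."""

    def __init__(self, values):
        self.values = list(values)

    def batch_iter(self, batch_size=2):
        for start in range(0, len(self.values), batch_size):
            yield {"inputs": self.values[start:start + batch_size]}


class ToyModel:
    """Hypothetical stand-in for a model: maps a batch of inputs to predictions."""

    def predict_on_batch(self, inputs):
        return [2 * x for x in inputs]


def predict_generator(model, dataloader_cls, dataloader_kwargs, batch_size=2):
    """Yield predictions one batch at a time."""
    dl = dataloader_cls(**dataloader_kwargs)
    for batch in dl.batch_iter(batch_size=batch_size):
        yield model.predict_on_batch(batch["inputs"])


def predict(model, dataloader_cls, dataloader_kwargs, batch_size=2):
    """Run the whole pipeline and concatenate the per-batch predictions."""
    out = []
    for preds in predict_generator(model, dataloader_cls, dataloader_kwargs, batch_size):
        out.extend(preds)
    return out


result = predict(ToyModel(), ToyDataloader, {"values": [1, 2, 3]})
print(result)
```

The generator variant keeps only one batch in memory at a time, which matters when the raw files are large.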

In [19]:
#model.pipeline.predict

In [20]:
#model.pipeline.predict_generator

### Others

In [21]:
# Model source
model.source

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/home/avsec/.kipoi/models/')

In [22]:
# model location directory
model.source_dir

'/home/avsec/.kipoi/models/rbp_eclip/UPF1'

## DataLoader

In [23]:
DataLoader = kipoi.get_dataloader_factory("rbp_eclip/UPF1")

In [24]:
?DataLoader

Init signature: DataLoader(intervals_file, fasta_file, gtf_file, target_file=None, use_linecache=False)
Docstring:     
Args:
    intervals_file: file path; tsv file
        Assumes bed-like `chrom start end id score strand` format.
    fasta_file: file path; Genome sequence
    gtf_file: file path; Genome annotation GTF file.
    preproc_transformer: file path; tranformer used for pre-processing.
    target_file: file path; path to the targets
    batch_size: int
Type:           type


## Run dataloader on some examples

In [25]:
# each dataloader already provides the example files
DataLoader.example_kwargs

{'fasta_file': 'example_files/hg38_chr22.fa',
 'gtf_file': 'example_files/gencode.v24.annotation_chr22.gtf',
 'intervals_file': 'example_files/intervals.bed',
 'target_file': 'example_files/targets.tsv'}

In [26]:
import os

In [27]:
# cd into the source directory 
os.chdir(DataLoader.source_dir)

In [28]:
!tree

.
├── custom_keras_objects.py -> ../template/custom_keras_objects.py
├── dataloader_files
│   └── position_transformer.pkl
├── dataloader.py -> ../template/dataloader.py
├── dataloader.yaml -> ../template/dataloader.yaml
├── example_files -> ../template/example_files
├── model_files
│   └── model.h5
├── model.yaml -> ../template/model.yaml
└── __pycache__
    ├── custom_keras_objects.cpython-36.pyc
    └── dataloader.cpython-36.pyc

4 directories, 8 files


In [56]:
dl = DataLoader(**DataLoader.example_kwargs)
# could also be done with DataLoader.init_example()

In [30]:
# This particular dataloader is of type Dataset
# i.e. it implements the __getitem__ method:
dl[0].keys()

dict_keys(['inputs', 'targets', 'metadata'])

In [31]:
dl[0]["inputs"].keys()

dict_keys(['seq', 'dist_tss', 'dist_polya', 'dist_exon_intron', 'dist_intron_exon', 'dist_start_codon', 'dist_stop_codon', 'dist_gene_start', 'dist_gene_end'])

In [32]:
dl[0]["inputs"]["seq"][:5]

array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]], dtype=float32)

In [33]:
len(dl)

14

### Get the whole dataset

In [34]:
whole_data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00,  7.59it/s]


In [35]:
whole_data.keys()

dict_keys(['inputs', 'targets', 'metadata'])

In [36]:
whole_data["inputs"]["seq"].shape

(14, 101, 4)
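
To make the relationship between `dl[i]` and `dl.load_all()` concrete, here is a minimal sketch with plain Python lists instead of numpy arrays. `ToyDataset`, `collate`, and this `load_all` are hypothetical names written for the example, and the stacking behaviour is an assumption inferred from the output shapes above rather than kipoi's actual code.

```python
class ToyDataset:
    """Hypothetical map-style dataset returning one sample per index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        return {
            "inputs": {"seq": [idx, idx + 1]},
            "targets": idx % 2,
            "metadata": {"id": f"sample{idx}"},
        }


def collate(samples):
    """Stack a list of equally structured samples into one nested dict of lists."""
    first = samples[0]
    if isinstance(first, dict):
        return {key: collate([s[key] for s in samples]) for key in first}
    return list(samples)


def load_all(dataset):
    """Assumed behaviour: fetch every sample and stack them along a new first axis."""
    return collate([dataset[i] for i in range(len(dataset))])


whole = load_all(ToyDataset(3))
print(whole["inputs"]["seq"])
```

The nested dictionary keeps the same keys as a single sample; only the leaves gain a leading sample axis.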

### Get the iterator to run predictions

In [37]:
it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)

In [38]:
next(it)["inputs"]["seq"].shape

(1, 101, 4)

In [39]:
model.predict_on_batch(next(it)["inputs"])

array([[0.0005]], dtype=float32)
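
The `batch_size`, `shuffle`, and `drop_last` arguments follow the usual batching conventions. The sketch below shows those conventions on a plain list; this `batch_iter` function and its `seed` parameter are hypothetical and stand in for the dataloader method, whose internals are not shown here.

```python
import random


def batch_iter(samples, batch_size=1, shuffle=False, drop_last=False, seed=None):
    """Yield lists of samples; a sketch of the usual batching semantics."""
    order = list(range(len(samples)))
    if shuffle:
        # `seed` is an addition for reproducibility in this example only.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        if drop_last and len(idx) < batch_size:
            break  # skip the incomplete final batch
        yield [samples[i] for i in idx]


samples = list(range(14))  # same number of samples as the example dataset
full = list(batch_iter(samples, batch_size=4))
trimmed = list(batch_iter(samples, batch_size=4, drop_last=True))
print(len(full), len(trimmed))
```

With 14 samples and a batch size of 4, the last batch holds only two samples, so `drop_last=True` yields one batch fewer. Note also that each call to `next(it)` consumes a batch, so consecutive cells that call `next(it)` see different batches.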

### Train the Keras model

The Keras model is stored under the `.model` attribute.

In [40]:
model.model.compile("adam", "binary_crossentropy")

In [41]:
train_it = dl.batch_train_iter(batch_size=2)

In [42]:
# model.model.summary()

In [43]:
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x7f6067500240>

## Pipeline: `raw files -[dataloader]-> numpy arrays -[model]-> prediction`

In [57]:
example_kwargs = model.default_dataloader.example_kwargs

In [58]:
model.pipeline.predict(example_kwargs)

array([[0.1351],
       [0.0005],
       [0.0005],
       [0.1351],
       [0.1351],
       [0.1351],
       [0.0005],
       [0.1351],
       [0.1351],
       [0.1351],
       [0.1351],
       [0.1351],
       [0.1351],
       [0.1351]], dtype=float32)

In [53]:
next(model.pipeline.predict_generator(example_kwargs, batch_size=2))

array([[0.3588],
       [0.0004]], dtype=float32)