# `kipoi` python-sdk

### Quick start

There are three basic building blocks in kipoi:
- **Source** - provides Models and DataLoaders.
- **Model** - makes predictions from numpy arrays.
- **DataLoader** - loads data from raw files and transforms it into a form the Model can consume directly.

## List of main commands

- Proposed API:
  - `kipoi.list_sources()`
  - `kipoi.get_source()`
  - `kipoi.list_models()`
    - [ ] update the columns by adding flags:
      - `downloaded` - whether the model has already been downloaded
      - `size` - model size
  - `kipoi.get_model()`
  - `kipoi.get_dataloader_factory()`
- **TODO** `kipoi.list_dataloaders()`


### Source

Available sources are specified in the config file located at: `~/.kipoi/config.yaml`. Here is an example config file:

```yaml
model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: git@github.com:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    my_git_models: # custom model source
        type: git
        remote_url: git@github.com:myself/models.git
        local_path: ~/.kipoi/my_git_models/
    my_local_models: # custom model source
        type: local
        local_path: /mnt/local_models/
```

Three model source types are supported:
- **`git-lfs`** - git repository in which source files are tracked normally by git and binary files such as model weights (located in `files*` directories) are tracked by [git-lfs](https://git-lfs.github.com).
  - Requires `git-lfs` to be installed.
- **`git`** - git repository in which all files, including weights, are tracked by git (not recommended).
- **`local`** - local directory containing models defined in subdirectories.

For the **`git-lfs`** source type, the larger files tracked by `git-lfs` are downloaded into `local_path` only once the model is requested (by calling `kipoi.get_model()`).
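As a rough illustration of the rules above, the following sketch checks a parsed config for unknown source types and missing keys. It is not part of kipoi: `check_sources` and `VALID_TYPES` are hypothetical names, the config is written as a plain Python dict instead of being read from `~/.kipoi/config.yaml`, and the requirement that git-backed sources carry a `remote_url` is an assumption based on the example config.

```python
VALID_TYPES = {"git-lfs", "git", "local"}

# Same structure as the example config above, written as a Python dict.
config = {
    "model_sources": {
        "kipoi": {
            "type": "git-lfs",
            "remote_url": "git@github.com:kipoi/models.git",
            "local_path": "~/.kipoi/models/",
        },
        "my_local_models": {"type": "local", "local_path": "/mnt/local_models/"},
    }
}


def check_sources(config):
    """Return a list of human-readable problems found in `model_sources`."""
    problems = []
    for name, spec in config.get("model_sources", {}).items():
        source_type = spec.get("type")
        if source_type not in VALID_TYPES:
            problems.append("{}: unknown type {!r}".format(name, source_type))
        if "local_path" not in spec:
            problems.append("{}: missing local_path".format(name))
        # Assumption: git-backed sources need a remote to clone from.
        if source_type in {"git-lfs", "git"} and "remote_url" not in spec:
            problems.append("{}: missing remote_url".format(name))
    return problems


problems = check_sources(config)
bad_problems = check_sources({"model_sources": {"broken": {"type": "svn"}}})
print(problems)
print(bad_problems)
```

An empty list means the config passed these particular checks; the second call shows what the report looks like for an entry with an unsupported type and no `local_path`.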

#### Note

A particular model/dataloader is defined by its source (say `kipoi` or `my_git_models`) and the relative path of the desired model directory from the model source root (say `rbp/`).

A directory is considered a model if it contains a `model.yaml` file.
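The `model.yaml` rule can be sketched with the standard library alone. The helper below is a hypothetical illustration (`list_model_dirs` is not a kipoi function): it walks a source root and returns the relative path of every directory that contains a `model.yaml`, which is the path used to refer to a model within a source.

```python
import tempfile
from pathlib import Path


def list_model_dirs(source_root):
    """Return model paths, relative to the source root, that contain a model.yaml."""
    root = Path(source_root)
    return sorted(
        p.parent.relative_to(root).as_posix()
        for p in root.rglob("model.yaml")
    )


# Build a throwaway source layout to demonstrate the rule.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("rbp", "HAL"):
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "model.yaml").write_text("type: keras\n")
    (Path(tmp) / "notes").mkdir()  # no model.yaml, so not a model
    found = list_model_dirs(tmp)

print(found)
```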

In [141]:
import kipoi

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
from importlib import reload  # `reload` is not a builtin in Python 3
reload(kipoi)

<module 'kipoi' from '/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/__init__.py'>

In [4]:
# TODO - update
kipoi.list_sources()

INFO:2017-10-27 15:54:35,855:kipoi] Update /data/ouga/home/ag_gagneur/avsec/.kipoi/models/
INFO:2017-10-27 15:54:38,640:kipoi] Update /s/project/model-zoo/
INFO:2017-10-27 15:54:39,665:kipoi] model rbp loaded


| | source | type | location | local_size | n_models | n_dataloaders |
|---|---|---|---|---|---|---|
| 0 | kipoi | git-lfs | /data/ouga/home/ag_ga... | 205M | 3 | 3 |
| 1 | gl | git | /s/project/model-zoo/ | 112M | 1 | 1 |


In [6]:
s = kipoi.get_source("kipoi")

In [7]:
s

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [8]:
kipoi.list_models()

INFO:2017-10-27 15:54:48,614:kipoi] Update /data/ouga/home/ag_gagneur/avsec/.kipoi/models/


| | source | model | name | version | author | descr | type | inputs | targets | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kipoi | extended_coda | extended CODA | 0.1 | Johnny Israeli | Single bp resolution ... | keras | [H3K27ac_subsampled] | [H3K27ac] | [] |
| 1 | kipoi | rbp | rbp_eclip | 0.1 | Ziga Avsec | RBP binding prediction | keras | [seq, dist_polya_st] | [binding_site] | [] |
| 2 | kipoi | HAL | HAL | 0.1 | Jun Cheng, Ziga Avsec | Model from Rosenberg ... | custom | [seq] | [psi] | [] |


## Model

Let's use the `rbp` model.

In [9]:
model = kipoi.get_model("rbp")

INFO:2017-10-27 15:55:12,357:kipoi] git-lfs pull -I rbp/**
INFO:2017-10-27 15:55:12,404:kipoi] model rbp loaded
INFO:2017-10-27 15:55:12,412:kipoi] git-lfs pull -I rbp/./**
INFO:2017-10-27 15:55:12,457:kipoi] model rbp/. loaded
Using TensorFlow backend.
INFO:2017-10-27 15:55:25,952:kipoi] successfully loaded model architecture from <_io.TextIOWrapper name='model_files/model.json' mode='r' encoding='UTF-8'>
INFO:2017-10-27 15:55:25,973:kipoi] successfully loaded

### Available fields:

From the model.yaml file:
- type
- args
- info
  - author
  - name
  - version
  - tags
  - descr
- schema
  - inputs
  - targets
- default_dataloader - loaded dataloader class

Others:
- predict_on_batch()
- source
- source_dir
- pipeline
  - predict()
  - predict_generator()

In [18]:
model

<kipoi.model.KerasModel at 0x7fb732643b70>

In [19]:
model.type

'keras'

### Info

In [20]:
model.info

Info(author='Ziga Avsec', name='rbp_eclip', version='0.1', descr='RBP binding prediction')

In [29]:
model.info.version

'0.1'

### Schema

In [24]:
model.schema.inputs

OrderedDict([('seq', ArraySchema(shape=('(', '4', ',', ' ', '1', '0', '1', ')'), descr='One-hot encoded RNA sequence', name='seq', special_type=<ArraySpecialType.DNASeq: 'DNASeq'>)), ('dist_polya_st', ArraySchema(shape=('(', 'N', 'o', 'n', 'e', ',', ' ', '1', ',', ' ', '1', '0', ')'), descr='Distance to poly-a site transformed with B-splines', name='dist_polya_st', special_type=None))])

In [25]:
model.schema.targets

OrderedDict([('binding_site', ArraySchema(shape=('(', '1', ',', ' ', ')'), descr='Predicted binding strength', name='binding_site', special_type=None))])

### Default dataloader

The model already ships with a default dataloader. It is defined alongside the model in `model.source_dir` and is available as `model.default_dataloader`:

In [42]:
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

In [27]:
model.default_dataloader

dataloader.SeqDistDataset

In [28]:
model.default_dataloader.info

Info(author='Ziga Avsec', name='rbp_eclip', version='0.1', descr='RBP binding prediction')

### Predict_on_batch

In [32]:
model.predict_on_batch

<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x7fb732643b70>>

### Pipeline

The pipeline object takes the dataloader arguments and runs the whole pipeline:

```
dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
```
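The flow in the diagram can be mimicked with small stand-in classes. Everything in this sketch (`ToyDataLoader`, `ToyModel`, `ToyPipeline`) is hypothetical and uses plain lists in place of numpy arrays; it only mirrors the method names shown in this document and is not how kipoi implements its pipeline.

```python
class ToyDataLoader:
    """Stand-in dataloader: turns raw values into model-ready batches."""

    def __init__(self, values):
        self.values = list(values)

    def batch_iter(self, batch_size=32):
        for start in range(0, len(self.values), batch_size):
            yield {"inputs": self.values[start:start + batch_size]}


class ToyModel:
    """Stand-in model: 'predicts' by doubling every input."""

    def predict_on_batch(self, inputs):
        return [2 * x for x in inputs]


class ToyPipeline:
    """Chains dataloader arguments -> dataloader -> model -> predictions."""

    def __init__(self, model, dataloader_cls):
        self.model = model
        self.dataloader_cls = dataloader_cls

    def predict_generator(self, dataloader_kwargs, batch_size=32):
        dl = self.dataloader_cls(**dataloader_kwargs)
        for batch in dl.batch_iter(batch_size=batch_size):
            yield self.model.predict_on_batch(batch["inputs"])

    def predict(self, dataloader_kwargs, batch_size=32):
        predictions = []
        for batch_predictions in self.predict_generator(dataloader_kwargs, batch_size):
            predictions.extend(batch_predictions)
        return predictions


pipeline = ToyPipeline(ToyModel(), ToyDataLoader)
predictions = pipeline.predict({"values": [1, 2, 3, 4, 5]}, batch_size=2)
first_batch = next(pipeline.predict_generator({"values": [1, 2, 3]}, batch_size=2))
print(predictions)
print(first_batch)
```

`predict` collects the predictions for the whole dataset, while `predict_generator` yields them one batch at a time, which keeps memory use bounded for large inputs.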

In [38]:
?model.pipeline.predict

Signature: model.pipeline.predict(dataloader_kwargs, batch_size=32)
Docstring:
# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

:return: Predict the whole array
File:      /opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/pipeline.py
Type:      method


In [39]:
?model.pipeline.predict_generator

Signature: model.pipeline.predict_generator(dataloader_kwargs, batch_size=32)
Docstring:
Prediction generator

# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

# Yields
    model batch prediction
File:      /opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/pipeline.py
Type:      method


### Others

In [44]:
# Model source
model.source

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [46]:
# model location directory
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

## DataLoader

In [12]:
DataLoader = kipoi.get_dataloader_factory("rbp")

INFO:2017-10-27 15:55:39,598:kipoi] git-lfs pull -I rbp/**
INFO:2017-10-27 15:55:39,683:kipoi] model rbp loaded


In [13]:
?DataLoader

Init signature: DataLoader(intervals_file, fasta_file, gtf_file, preproc_transformer, target_file=None)
Docstring:
Args:
    intervals_file: file path; tsv file
        Assumes bed-like `chrom start end id score strand` format.
    fasta_file: file path; Genome sequence
    gtf_file: file path; Genome annotation GTF file pickled using pandas.
    preproc_transformer: file path; tranformer used for pre-processing.
    target_file: file path; path to the targets
    batch_size: int
Type:           type


## Run dataloader on some examples

In [14]:
cd $model.source_dir/test_files

/data/nasif12/home_if12/avsec/.kipoi/models/rbp/test_files


In [15]:
# example arguments
import yaml
with open("test.json", "r") as f:
    # YAML parsers can read JSON; safe_load avoids constructing arbitrary objects
    test_kwargs = yaml.safe_load(f)

In [16]:
test_kwargs

{'fasta_file': 'hg38_chr22.fa',
 'gtf_file': 'gencode_v25_chr22.gtf.pkl.gz',
 'intervals_file': 'intervals.tsv',
 'preproc_transformer': '../dataloader_files/encodeSplines.pkl',
 'target_file': 'targets.tsv'}

In [17]:
dl = DataLoader(**test_kwargs)

INFO:2017-10-27 15:55:47,104:genomelake] Running landmark extractors..
INFO:2017-10-27 15:55:47,111:genomelake] Done!


In [18]:
dl[0].keys()

dict_keys(['inputs', 'targets', 'metadata'])

In [19]:
dl[0]["inputs"].keys()

dict_keys(['seq', 'dist_polya_st'])

In [20]:
dl[0]["inputs"]["seq"][:5]

array([[ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25]], dtype=float32)

In [21]:
len(dl)

14

### Get the whole dataset

In [22]:
whole_data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00, 25.87it/s]


In [23]:
whole_data.keys()

dict_keys(['inputs', 'targets', 'metadata'])

In [24]:
whole_data["inputs"]["seq"].shape

(14, 101, 4)

### Get the iterator to run predictions

In [118]:
it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)

In [119]:
next(it)["inputs"]["seq"].shape

(1, 101, 4)

In [120]:
model.predict_on_batch(next(it)["inputs"])

array([[ 0.2263]], dtype=float32)
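Each call to `next(it)` advances the iterator, so the shape check and the prediction in the two cells above operate on different batches. The sketch below shows that behaviour with a toy generator; `toy_batch_iter` is a hypothetical stand-in that batches a plain sequence and only imitates the `batch_size` and `drop_last` arguments of `dl.batch_iter`.

```python
def toy_batch_iter(samples, batch_size=1, drop_last=False):
    """Yield consecutive lists of up to `batch_size` samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, smaller batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


it = toy_batch_iter(range(5), batch_size=2)
first = next(it)   # first batch
second = next(it)  # a different batch: the iterator has moved on

# Keep a reference when the same batch is needed more than once.
batch = next(it)
reused = (len(batch), sum(batch))

dropped = list(toy_batch_iter(range(5), batch_size=2, drop_last=True))
print(first, second, batch, dropped)
```

To inspect and predict on the same batch, store it once (`batch = next(it)`) and pass `batch["inputs"]` to both steps.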

### Train the Keras model

The underlying Keras model is stored under the `.model` attribute.

In [121]:
model.model.compile("adam", "binary_crossentropy")

In [122]:
train_it = map(lambda x: (x["inputs"], x["targets"]["binding_site"]), it)

In [123]:
model.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
seq (InputLayer)                 (None, 101, 4)        0                                            
____________________________________________________________________________________________________
conv1 (Conv1D)                   (None, 93, 10)        370         seq[0][0]                        
____________________________________________________________________________________________________
average_pooling1d_6 (AveragePool (None, 23, 10)        0           conv1[0][0]                      
____________________________________________________________________________________________________
dist_polya_st (InputLayer)       (None, 1, 10)         0                                            
___________________________________________________________________________________________

In [124]:
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x7f9990817240>
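The `map` call above turns each batch dictionary into an `(inputs, targets)` tuple, but the resulting iterator ends once the batches run out, whereas the Keras documentation for `fit_generator` expects a generator that loops over its data indefinitely. A minimal sketch of an adapter that restarts the batch iterator is shown below; `training_batches` and the toy batches are hypothetical and use plain lists so that the example runs without kipoi or Keras.

```python
def training_batches(make_iter, target_name):
    """Yield (inputs, targets) tuples forever, restarting the iterator each pass."""
    while True:
        for batch in make_iter():
            yield batch["inputs"], batch["targets"][target_name]


# Toy batches with the same nested layout as the dataloader output above.
toy_batches = [
    {"inputs": {"seq": [i]}, "targets": {"binding_site": [i % 2]}, "metadata": {}}
    for i in range(3)
]

gen = training_batches(lambda: iter(toy_batches), "binding_site")
first_four = [next(gen) for _ in range(4)]  # the fourth item wraps around
print(first_four)
```

With a real dataloader, `make_iter` could be something like `lambda: dl.batch_iter(batch_size=32)`, assuming `batch_iter` returns a fresh iterator on every call.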

## Pipeline: `raw files -> dataloader -> model -> prediction`

In [133]:
model.pipeline.predict(test_kwargs)

INFO:2017-10-27 17:20:58,817:kipoi] Initialized data generator. Running batches...
INFO:2017-10-27 17:20:58,914:genomelake] Running landmark extractors..
INFO:2017-10-27 17:20:58,924:genomelake] Done!


array([[ 0.2557],
       [ 0.2294],
       [ 0.2294],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557],
       [ 0.2294],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557],
       [ 0.2557]], dtype=float32)

In [134]:
next(model.pipeline.predict_generator(test_kwargs, batch_size=2))

INFO:2017-10-27 17:21:05,826:kipoi] Initialized data generator. Running batches...
INFO:2017-10-27 17:21:05,937:genomelake] Running landmark extractors..
INFO:2017-10-27 17:21:05,946:genomelake] Done!


array([[ 0.2557],
       [ 0.2294]], dtype=float32)