# Kipoi python-sdk

## Quick start

There are three basic building blocks in kipoi:
- **Source** - provides Models and DataLoaders.
- **Model** - makes predictions given numpy arrays.
- **DataLoader** - loads data from raw files and transforms it into a form that is directly consumable by the Model.

![img](../docs/img/kipoi-workflow.png)

## List of main commands


- `kipoi.list_sources()`
- `kipoi.get_source()`


- `kipoi.list_models()`
- `kipoi.list_dataloaders()`


- `kipoi.get_model()`
- `kipoi.get_dataloader_factory()`



### Source

Available sources are specified in the config file located at: `~/.kipoi/config.yaml`. Here is an example config file:

```yaml
model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: git@github.com:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    gl:
        type: git-lfs  # custom model
        remote_url: https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git
        local_path: /s/project/model-zoo
```

Three types of model source are available:
- **`git-lfs`** - git repository in which source files are tracked normally by git, while binary files such as model weights (located in `files*` directories) are tracked by [git-lfs](https://git-lfs.github.com).
  - Requires `git-lfs` to be installed.
- **`git`** - git repository in which all files, including the weights, are tracked by git (not recommended).
- **`local`** - local directory containing models defined in subdirectories.

For **`git-lfs`** source type, larger files tracked by `git-lfs` will be downloaded into the specified directory `local_path` only after the model has been requested (when invoking `kipoi.get_model()`).
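As a rough illustration of the three source types, the sketch below checks a `model_sources` mapping written as a plain Python dict. The helper `check_model_sources` and the rule that every source needs a `local_path` are assumptions made for this example; they are not part of kipoi.

```python
VALID_SOURCE_TYPES = {"git-lfs", "git", "local"}


def check_model_sources(model_sources):
    """Return a list of human-readable problems found in a model_sources mapping."""
    problems = []
    for name, spec in model_sources.items():
        if spec.get("type") not in VALID_SOURCE_TYPES:
            problems.append("{}: unknown type {!r}".format(name, spec.get("type")))
        # Assumption for this sketch: every source declares a local_path.
        if "local_path" not in spec:
            problems.append("{}: missing local_path".format(name))
    return problems


# Same structure as the YAML example, plus one deliberately broken entry.
model_sources = {
    "kipoi": {
        "type": "git-lfs",
        "remote_url": "git@github.com:kipoi/models.git",
        "local_path": "~/.kipoi/models/",
    },
    "broken": {"type": "svn", "local_path": "/tmp/models"},
}

problems = check_model_sources(model_sources)
print(problems)
```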

#### Note

A particular model/dataloader is defined by its source (say `kipoi` or `my_git_models`) and the relative path of the desired model directory from the model source root (say `rbp/`).

A directory is considered a model if it contains a `model.yaml` file.
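The `model.yaml` rule can be sketched with the standard library alone. The helper `find_model_dirs` below is hypothetical (kipoi has its own discovery logic); it only shows how a relative path from the source root identifies a model.

```python
import os
import tempfile


def find_model_dirs(source_root):
    """Return relative paths of all directories under source_root containing model.yaml."""
    models = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        if "model.yaml" in filenames:
            models.append(os.path.relpath(dirpath, source_root))
    return sorted(models)


# Build a throwaway source layout to try the helper on.
with tempfile.TemporaryDirectory() as root:
    for rel in ("rbp", os.path.join("group", "nested_model")):
        os.makedirs(os.path.join(root, rel))
        open(os.path.join(root, rel, "model.yaml"), "w").close()
    os.makedirs(os.path.join(root, "not_a_model"))  # no model.yaml here
    found = find_model_dirs(root)

print(found)
```

Each returned path plays the role of the model name passed together with a source, such as `rbp` in the `kipoi` source.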

In [2]:
import kipoi

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
kipoi.list_sources()

| | source | type | location | local_size | n_models | n_dataloaders |
|---|---|---|---|---|---|---|
| 0 | kipoi | git-lfs | /data/ouga/home/ag_ga... | 1.5G | 4 | 4 |
| 1 | gl | git | /s/project/model-zoo/ | 112M | 1 | 1 |
| 2 | dir | local | ./ | 188K | 0 | 0 |


In [6]:
s = kipoi.get_source("kipoi")

In [7]:
s

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [8]:
kipoi.list_models()

| | source | model | name | version | author | descr | type | inputs | targets | tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kipoi | extended_coda | extended CODA | 0.1 | Johnny Israeli | Single bp resolution ... | keras | [H3K27AC_subsampled] | [H3K27ac] | [] |
| 1 | kipoi | DeepSEA | DeepSEA | 0.1 | Lara Urban | This CNN is based on ... | keras | seq | epigen_mod | [] |
| 2 | kipoi | rbp | rbp_eclip | 0.1 | Ziga Avsec | RBP binding prediction | keras | [seq, dist_polya_st] | | [] |
| 3 | kipoi | HAL | HAL | 0.1 | Jun Cheng, Ziga Avsec | Model from Rosenberg ... | custom | [seq] | [psi] | [] |


## Model

Let's load the `rbp` model:

In [9]:
model = kipoi.get_model("rbp")

Using TensorFlow backend.
2017-11-13 17:23:23,976 [INFO] successfully loaded the dataloader from /data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp/dataloader.py::SeqDistDataset
2017-11-13 17:23:24,030 [INFO] successfully loaded model architecture from <_io.TextIOWrapper name='model_files/model.json' mode='r' encoding='UTF-8'>
2017-11-13 17:23:24,054 [INFO] successfully loaded model weights from model_files/weights.h5
2017-11-13 17:23:24,055 [INFO] dataloader.output_schema is compatible with model.schema


### Available fields:

#### Model

- type
- args
- info
  - author
  - name
  - version
  - tags
  - descr
- schema
  - inputs
  - targets
- default_dataloader - loaded dataloader class


- predict_on_batch()
- source
- source_dir
- pipeline
  - predict()
  - predict_example()
  - predict_generator()
  
#### Dataloader

- type
- defined_as
- args
- info (same as for the model)
- output_schema
  - inputs
  - targets
  - metadata


- source
- source_dir
- example_kwargs
- init_example()
- batch_iter()
- batch_train_iter()
- batch_predict_iter()
- load_all()

In [10]:
model

<kipoi.model.KerasModel at 0x7f3cafd8f710>

In [11]:
model.type

'keras'

### Info

In [12]:
model.info

Info(author='Ziga Avsec', name='rbp_eclip', version='0.1', descr='RBP binding prediction', tags=[])

In [13]:
model.info.version

'0.1'

### Schema

In [14]:
model.schema.inputs

OrderedDict([('seq',
              ArraySchema(shape=(101, 4), descr='One-hot encoded RNA sequence', name='seq', special_type=<ArraySpecialType.DNASeq: 'DNASeq'>, associated_metadata=[])),
             ('dist_polya_st',
              ArraySchema(shape=(1, 10), descr='Distance to poly-a site transformed with B-splines', name='dist_polya_st', special_type=None, associated_metadata=[]))])

In [15]:
model.schema.targets

ArraySchema(shape=(1,), descr='Predicted binding strength', name=None, special_type=None, associated_metadata=[])
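The shapes in the schema describe a single sample, so a batch carries one extra leading axis. The sketch below shows the kind of check this implies; `matches_schema` is a hypothetical helper, and kipoi's own compatibility check may well look at more than shapes.

```python
def matches_schema(batch_shape, schema_shape):
    """True if a batch of shape (batch, *dims) fits a per-sample schema shape."""
    # The first axis is the batch dimension, which the schema does not list.
    return (
        len(batch_shape) == len(schema_shape) + 1
        and tuple(batch_shape[1:]) == tuple(schema_shape)
    )


# Per-sample shapes copied from the schema output above.
input_schema = {"seq": (101, 4), "dist_polya_st": (1, 10)}

# Shapes of a hypothetical batch of 32 samples.
batch_shapes = {"seq": (32, 101, 4), "dist_polya_st": (32, 1, 10)}

ok = all(
    matches_schema(batch_shapes[name], shape)
    for name, shape in input_schema.items()
)
print(ok)
```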

### Default dataloader

The model already ships with its default dataloader, which is accessible through `model.default_dataloader`.

In [16]:
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

In [17]:
model.default_dataloader

dataloader.SeqDistDataset

In [18]:
model.default_dataloader.info

Info(author='Ziga Avsec', name='rbp_eclip', version='0.1', descr='RBP binding prediction', tags=[])

### Predict_on_batch

In [19]:
model.predict_on_batch

<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x7f3cafd8f710>>

### Pipeline

The pipeline object takes the dataloader arguments and runs the whole pipeline:

```
dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
```
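The flow above can be sketched with toy stand-ins. Nothing below is kipoi code: `toy_dataloader` and `toy_model_predict_on_batch` are hypothetical placeholders, and the two functions only mirror the `predict()` and `predict_generator()` entries of `model.pipeline` in spirit.

```python
def toy_dataloader(values, batch_size):
    """Stand-in dataloader: turns raw values into batches (lists of floats)."""
    for start in range(0, len(values), batch_size):
        yield [float(v) for v in values[start:start + batch_size]]


def toy_model_predict_on_batch(batch):
    """Stand-in model: doubles every input."""
    return [2.0 * x for x in batch]


def predict_generator(dataloader_kwargs, batch_size=32):
    """Yield one batch of predictions at a time."""
    for batch in toy_dataloader(batch_size=batch_size, **dataloader_kwargs):
        yield toy_model_predict_on_batch(batch)


def predict(dataloader_kwargs, batch_size=32):
    """Run all batches and concatenate the per-batch predictions."""
    predictions = []
    for batch_predictions in predict_generator(dataloader_kwargs, batch_size):
        predictions.extend(batch_predictions)
    return predictions


kwargs = {"values": [1, 2, 3, 4, 5]}
first_batch = next(predict_generator(kwargs, batch_size=2))
everything = predict(kwargs, batch_size=2)
print(first_batch, everything)
```

The generator variant keeps only one batch in memory, while `predict` collects everything into a single result.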

In [20]:
?model.pipeline.predict

Signature: model.pipeline.predict(dataloader_kwargs, batch_size=32)
Docstring:
# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

:return: Predict the whole array
File:      /data/nasif12/home_if12/avsec/projects-work/kipoi/kipoi/pipeline.py
Type:      method


In [21]:
?model.pipeline.predict_generator

Signature: model.pipeline.predict_generator(dataloader_kwargs, batch_size=32)
Docstring:
Prediction generator

# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

# Yields
    model batch prediction
File:      /data/nasif12/home_if12/avsec/projects-work/kipoi/kipoi/pipeline.py
Type:      method


### Others

In [22]:
# Model source
model.source

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [23]:
# model location directory
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

## DataLoader

In [24]:
DataLoader = kipoi.get_dataloader_factory("rbp")

2017-11-13 17:23:54,533 [INFO] git-lfs pull -I rbp/**
2017-11-13 17:23:54,616 [INFO] dataloader rbp loaded
2017-11-13 17:23:54,627 [INFO] successfully loaded the dataloader from /data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp/dataloader.py::SeqDistDataset


In [25]:
?DataLoader

Init signature: DataLoader(intervals_file, fasta_file, gtf_file, preproc_transformer, target_file=None)
Docstring:
Args:
    intervals_file: file path; tsv file
        Assumes bed-like `chrom start end id score strand` format.
    fasta_file: file path; Genome sequence
    gtf_file: file path; Genome annotation GTF file pickled using pandas.
    preproc_transformer: file path; tranformer used for pre-processing.
    target_file: file path; path to the targets
    batch_size: int
Type:           type


## Run dataloader on some examples

In [85]:
# each dataloader already provides the example files
DataLoader.example_kwargs

{'fasta_file': 'example_files/hg38_chr22.fa',
 'gtf_file': 'example_files/gencode_v25_chr22.gtf.pkl.gz',
 'intervals_file': 'example_files/intervals.tsv',
 'preproc_transformer': 'dataloader_files/encodeSplines.pkl',
 'target_file': 'example_files/targets.tsv'}

In [51]:
import os

In [52]:
# cd into the source directory 
os.chdir(DataLoader.source_dir)

In [96]:
!tree

.
├── custom_keras_objects.py
├── dataloader_files
│   └── encodeSplines.pkl
├── dataloader.py
├── dataloader.pyc
├── dataloader.yaml
├── example_files
│   ├── gencode_v25_chr22.gtf.pkl.gz
│   ├── hg38_chr22.fa
│   ├── hg38_chr22.fa.fai
│   ├── intervals.tsv
│   ├── predictions.h5
│   ├── predictions.tsv
│   └── targets.tsv
├── model_files
│   ├── model.json
│   └── weights.h5
├── model.yaml
├── readme.md
└── train_model.ipynb

3 directories, 17 files


In [54]:
dl = DataLoader(**DataLoader.example_kwargs)
# could also be done with DataLoader.init_example()

INFO:2017-11-13 17:28:08,231:genomelake] Running landmark extractors..
2017-11-13 17:28:08,231 [INFO] Running landmark extractors..
INFO:2017-11-13 17:28:08,239:genomelake] Done!
2017-11-13 17:28:08,239 [INFO] Done!


In [100]:
# This particular dataloader is of type Dataset
# i.e. it implements the __getitem__ method:
dl[0].keys()

dict_keys(['targets', 'metadata', 'inputs'])

In [56]:
dl[0]["inputs"].keys()

dict_keys(['dist_polya_st', 'seq'])

In [57]:
dl[0]["inputs"]["seq"][:5]

array([[ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25]], dtype=float32)

In [58]:
len(dl)

14
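A dataloader of type Dataset only needs `__getitem__` and `__len__`. The class below is a minimal sketch with made-up data, assuming the same `inputs` / `targets` / `metadata` layout seen in the outputs above; `ToyDataset` and its constructor arguments are hypothetical.

```python
class ToyDataset:
    """Minimal Dataset-style loader: one sample per index."""

    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets

    def __len__(self):
        # Number of samples available through indexing.
        return len(self.sequences)

    def __getitem__(self, idx):
        # Every sample is a dict with the three top-level keys.
        return {
            "inputs": {"seq": self.sequences[idx]},
            "targets": self.targets[idx],
            "metadata": {"index": idx},
        }


toy = ToyDataset(sequences=["ACGT", "TTGA", "CCAG"], targets=[0, 1, 0])
sample = toy[0]
print(len(toy), sorted(sample.keys()))
```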

### Get the whole dataset

In [60]:
whole_data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00, 20.56it/s]


In [61]:
whole_data.keys()

dict_keys(['targets', 'metadata', 'inputs'])

In [62]:
whole_data["inputs"]["seq"].shape

(14, 101, 4)
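The leading `14` in that shape is the number of samples: loading everything stacks the per-sample arrays along a new first axis. Here is a pure-Python sketch of that idea using nested lists instead of numpy arrays; `collate` and `shape` are hypothetical helpers, not kipoi functions.

```python
def collate(samples):
    """Recursively merge a list of equally structured samples into one batch."""
    first = samples[0]
    if isinstance(first, dict):
        # Merge key by key so the nested dict structure is preserved.
        return {key: collate([s[key] for s in samples]) for key in first}
    # Leaf values: stacking means putting them in a list along a new first axis.
    return list(samples)


def shape(nested):
    """Shape of a regular nested list, analogous to ndarray.shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# 14 made-up samples, each with a (101, 4) "seq" input and a (1,) target.
samples = [
    {"inputs": {"seq": [[0.25] * 4 for _ in range(101)]}, "targets": [0.0]}
    for _ in range(14)
]
whole = collate(samples)
print(shape(whole["inputs"]["seq"]), shape(whole["targets"]))
```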

### Get the iterator to run predictions

In [63]:
it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)

In [64]:
next(it)["inputs"]["seq"].shape

(1, 101, 4)

In [65]:
model.predict_on_batch(next(it)["inputs"])

array([[ 0.2135]], dtype=float32)
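Note that every `next(it)` call advances the iterator, so the prediction above is made on the batch following the one whose shape was inspected. The sketch below illustrates what `batch_size`, `shuffle` and `drop_last` usually mean for such an iterator. It works on sample indices only; `index_batches` and its `seed` parameter are hypothetical and not part of kipoi.

```python
import random


def index_batches(n_samples, batch_size=1, shuffle=False, drop_last=False, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        # Seeded shuffle so the sketch stays reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # skip the final, incomplete batch
        yield batch


batches = list(index_batches(14, batch_size=4))
batches_dropped = list(index_batches(14, batch_size=4, drop_last=True))
print(len(batches), len(batches_dropped))
```

With 14 samples and a batch size of 4, the last batch holds only two samples unless `drop_last=True` discards it.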

### Train the Keras model

The underlying Keras model is stored under the `.model` attribute.

In [66]:
model.model.compile("adam", "binary_crossentropy")

In [75]:
train_it = dl.batch_train_iter(batch_size=2)

In [76]:
model.model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
seq (InputLayer)                (None, 101, 4)       0                                            
__________________________________________________________________________________________________
conv1 (Conv1D)                  (None, 93, 10)       370         seq[0][0]                        
__________________________________________________________________________________________________
average_pooling1d_6 (AveragePoo (None, 23, 10)       0           conv1[0][0]                      
__________________________________________________________________________________________________
dist_polya_st (InputLayer)      (None, 1, 10)        0                                            
__________________________________________________________________________________________________
flatten_6 

In [77]:
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x7f3c563bb080>
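Keras' `fit_generator` expects a generator that yields `(inputs, targets)` tuples and loops over the data indefinitely, with `steps_per_epoch` deciding where an epoch ends. Presumably `batch_train_iter` provides such an iterator. The sketch below shows the pattern with plain lists; `cycling_train_iter` is a hypothetical stand-in.

```python
import itertools


def cycling_train_iter(inputs, targets, batch_size):
    """Yield (inputs, targets) batches forever, restarting after each pass."""
    n = len(inputs)
    while True:
        for start in range(0, n, batch_size):
            stop = start + batch_size
            yield inputs[start:stop], targets[start:stop]


toy_train_it = cycling_train_iter(list(range(5)), [0, 1, 0, 1, 0], batch_size=2)

# Take four batches: three cover the five samples, the fourth starts over.
first_four = list(itertools.islice(toy_train_it, 4))
print(first_four)
```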

## Pipeline: `raw files -[dataloader]-> numpy arrays -[model]-> prediction`

In [82]:
example_kwargs = model.default_dataloader.example_kwargs

In [83]:
model.pipeline.predict(example_kwargs)

2017-11-13 17:31:08,109 [INFO] Initialized data generator. Running batches...
INFO:2017-11-13 17:31:08,219:genomelake] Running landmark extractors..
2017-11-13 17:31:08,219 [INFO] Running landmark extractors..
INFO:2017-11-13 17:31:08,225:genomelake] Done!
2017-11-13 17:31:08,225 [INFO] Done!


array([[ 0.2645],
       [ 0.2181],
       [ 0.2181],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645],
       [ 0.2181],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645],
       [ 0.2645]], dtype=float32)

In [84]:
next(model.pipeline.predict_generator(example_kwargs, batch_size=2))

2017-11-13 17:31:08,868 [INFO] Initialized data generator. Running batches...
INFO:2017-11-13 17:31:08,997:genomelake] Running landmark extractors..
2017-11-13 17:31:08,997 [INFO] Running landmark extractors..
INFO:2017-11-13 17:31:09,006:genomelake] Done!
2017-11-13 17:31:09,006 [INFO] Done!


array([[ 0.2645],
       [ 0.2181]], dtype=float32)