# Kipoi python-sdk

### Quick start

Kipoi has three basic building blocks:
- **Source** - provides Models and DataLoaders.
- **Model** - makes predictions from numpy arrays.
- **DataLoader** - loads data from raw files and transforms it into a form that the Model can consume directly.

![img](../docs/img/kipoi-workflow.png)

## List of main commands


- `kipoi.list_sources()`
- `kipoi.get_source()`


- `kipoi.list_models()`
- `kipoi.list_dataloaders()`


- `kipoi.get_model()`
- `kipoi.get_dataloader_factory()`



### Source

Available sources are specified in the config file located at: `~/.kipoi/config.yaml`. Here is an example config file:

```yaml
model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: git@github.com:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    gl:
        type: git-lfs  # custom model
        remote_url: https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git
        local_path: /s/project/model-zoo
```

Three types of model sources are supported:
- **`git-lfs`** - git repository in which source files are tracked normally by git and all binary files such as model weights (located in `files*` directories) are tracked by [git-lfs](https://git-lfs.github.com).
  - Requires `git-lfs` to be installed.
- **`git`** - git repository in which all files, including weights, are tracked by git (not recommended).
- **`local`** - local directory containing models defined in subdirectories.
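
The source types above can be checked before a source is used. The following is a minimal sketch, assuming the YAML file has already been parsed into a plain dict and that a `local` source also stores its directory under `local_path`. The helper `resolve_sources` is hypothetical and not part of kipoi.

```python
import os

# Source types listed above
VALID_TYPES = {"git-lfs", "git", "local"}


def resolve_sources(config):
    """Validate source types and expand `~` in local paths.

    `config` mirrors the structure of ~/.kipoi/config.yaml after parsing.
    Hypothetical helper, for illustration only.
    """
    resolved = {}
    for name, spec in config["model_sources"].items():
        if spec["type"] not in VALID_TYPES:
            raise ValueError("unknown source type for %r: %r" % (name, spec["type"]))
        # Copy the entry and replace local_path with its expanded form
        resolved[name] = dict(spec, local_path=os.path.expanduser(spec["local_path"]))
    return resolved


config = {
    "model_sources": {
        "kipoi": {
            "type": "git-lfs",
            "remote_url": "git@github.com:kipoi/models.git",
            "local_path": "~/.kipoi/models/",
        },
        # Assumed layout for a local source
        "dir": {"type": "local", "local_path": "./"},
    }
}
sources = resolve_sources(config)
print(sources["kipoi"]["local_path"])
```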

For the **`git-lfs`** source type, the large files tracked by `git-lfs` are downloaded into the directory given by `local_path` only once the model is requested (when invoking `kipoi.get_model()`).
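
This lazy download can be pictured as a `git-lfs pull` restricted to the requested model directory, which is the form of the log messages printed by `kipoi.get_model()`. The sketch below only builds such a command and does not run it. `lfs_pull_command` is a hypothetical name, and kipoi's internals may differ.

```python
def lfs_pull_command(model_path):
    """Build a git-lfs command restricted to one model directory.

    Hypothetical helper: it mirrors the logged command format only.
    """
    # Include pattern matching every file below the model directory
    include = model_path.strip("/") + "/**"
    return ["git-lfs", "pull", "-I", include]


print(" ".join(lfs_pull_command("rbp")))
```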

#### Note

A particular model/dataloader is defined by its source (say `kipoi` or `my_git_models`) and the relative path of the desired model directory from the model source root (say `rbp/`).

A directory is considered a model if it contains a `model.yaml` file.
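
The `model.yaml` rule makes it easy to see which models a local directory would expose. Here is a minimal sketch using only the standard library. `find_model_dirs` is a hypothetical helper rather than kipoi's own discovery code, and the directory layout is made up for the demonstration.

```python
import os
import tempfile


def find_model_dirs(root):
    """Return relative paths of all directories below `root` containing model.yaml."""
    models = []
    for dirpath, dirnames, filenames in os.walk(root):
        if "model.yaml" in filenames:
            # Path relative to the source root, e.g. "rbp_eclip/FXR2"
            models.append(os.path.relpath(dirpath, root).replace(os.sep, "/"))
    return sorted(models)


# Build a throwaway source layout to try it on
with tempfile.TemporaryDirectory() as root:
    for rel in ["rbp", "rbp_eclip/FXR2", "not_a_model"]:
        os.makedirs(os.path.join(root, rel))
    for rel in ["rbp", "rbp_eclip/FXR2"]:
        open(os.path.join(root, rel, "model.yaml"), "w").close()
    found = find_model_dirs(root)

print(found)
```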

In [1]:
import kipoi

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
kipoi.list_sources()

| | source | type | location | local_size | n_models | n_dataloaders |
|---|---|---|---|---|---|---|
| 0 | kipoi | git-lfs | /data/ouga/home/ag_ga... | 1.7G | 116 | 116 |
| 1 | gl | git | /s/project/model-zoo/ | 112M | 1 | 1 |
| 2 | dir | local | ./ | 232K | 0 | 0 |


In [5]:
s = kipoi.get_source("kipoi")

In [6]:
s

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [7]:
kipoi.list_models()

| | source | model | version | authors | doc | type | inputs | targets | tags |
|---|---|---|---|---|---|---|---|---|---|
| 0 | kipoi | extended_coda | 0.1 | [Author(name='Johnny ... | Single bp resolution ... | keras | [H3K27AC_subsampled] | [H3K27ac] | [] |
| 1 | kipoi | DeepSEA | 0.1 | [Author(name='Lara Ur... | This CNN is based on ... | keras | seq | epigen_mod | [] |
| 2 | kipoi | rbp | 0.1 | [Author(name='Ziga Av... | RBP binding prediction | keras | [seq, dist_polya_st] | | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | kipoi | rbp_eclip/HNRNPK | 0.1 | [Author(name='Ziga Av... | RBP binding prediction | keras | [seq, dist_tss, dist_... | | [] |
| 114 | kipoi | rbp_eclip/FXR2 | 0.1 | [Author(name='Ziga Av... | RBP binding prediction | keras | [seq, dist_tss, dist_... | | [] |
| 115 | kipoi | rbp_eclip/GRSF1 | 0.1 | [Author(name='Ziga Av... | RBP binding prediction | keras | [seq, dist_tss, dist_... | | [] |


## Model

Let's use the `rbp` model from the `kipoi` source.

In [9]:
model = kipoi.get_model("rbp")

2017-11-29 17:21:50,172 [INFO] git-lfs pull -I rbp/**
2017-11-29 17:21:50,542 [INFO] model rbp loaded
2017-11-29 17:21:50,552 [INFO] git-lfs pull -I rbp/./**
2017-11-29 17:21:50,888 [INFO] dataloader rbp/. loaded
2017-11-29 17:21:50,916 [INFO] successfully loaded the dataloader from /data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp/dataloader.py::SeqDistDataset
2017-11-29 17:21:50,960 [INFO] successfully loaded model architecture from <_io.TextIOWrapper name='model_files/model.json' mode='r' encoding='UTF-8'>
2017-11-29 17:21:50,982 [INFO] successfully loaded model weights from model_files/weights.h5
2017-11-29 17:21:50,983 [INFO] dataloader.output_schema is compatible with model.schema


### Available fields:

#### Model

- type
- args
- info
  - authors
  - name
  - version
  - tags
  - doc
- schema
  - inputs
  - targets
- default_dataloader - loaded dataloader class


- predict_on_batch()
- source
- source_dir
- pipeline
  - predict()
  - predict_example()
  - predict_generator()
  
#### Dataloader

- type
- defined_as
- args
- info (same as for the model)
- output_schema
  - inputs
  - targets
  - metadata


- source
- source_dir
- example_kwargs
- init_example()
- batch_iter()
- batch_train_iter()
- batch_predict_iter()
- load_all()

In [10]:
model

<kipoi.model.KerasModel at 0x7ff6ab516978>

In [11]:
model.type

'keras'

### Info

In [12]:
model.info

Info(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='RBP binding prediction', name=None, version='0.1', tags=[])

In [13]:
model.info.version

'0.1'

### Schema

In [14]:
model.schema.inputs

OrderedDict([('seq',
              ArraySchema(shape=(101, 4), doc='One-hot encoded RNA sequence', name='seq', special_type=<ArraySpecialType.DNASeq: 'DNASeq'>, associated_metadata=[], column_labels=None)),
             ('dist_polya_st',
              ArraySchema(shape=(1, 10), doc='Distance to poly-a site transformed with B-splines', name='dist_polya_st', special_type=None, associated_metadata=[], column_labels=None))])

In [15]:
model.schema.targets

ArraySchema(shape=(1,), doc='Predicted binding strength', name=None, special_type=None, associated_metadata=[], column_labels=None)
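
The schema shapes describe a single example, so a batch of arrays carries one extra leading axis. The sketch below shows how a batch shape could be checked against such a schema shape. `batch_matches_schema` is a hypothetical helper, and treating `None` as "any size" is an assumption made here, not something the output above confirms.

```python
def batch_matches_schema(batch_shape, schema_shape):
    """True if a batch of arrays fits a per-example schema shape.

    The first batch axis is the number of examples; the remaining axes
    must equal the schema shape. `None` in the schema is treated as
    "any size" (an assumption of this sketch).
    """
    per_example = tuple(batch_shape[1:])
    if len(per_example) != len(schema_shape):
        return False
    return all(s is None or s == b for s, b in zip(schema_shape, per_example))


print(batch_matches_schema((14, 101, 4), (101, 4)))  # 'seq' input
print(batch_matches_schema((14, 1, 10), (1, 10)))    # 'dist_polya_st' input
print(batch_matches_schema((14, 100, 4), (101, 4)))  # wrong sequence length
```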

### Default dataloader

The model already comes with its default dataloader, available as `model.default_dataloader`.

In [16]:
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

In [17]:
model.default_dataloader

dataloader.SeqDistDataset

In [18]:
model.default_dataloader.info

Info(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='RBP binding prediction', name=None, version='0.1', tags=[])

### Predict_on_batch

In [19]:
model.predict_on_batch

<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x7ff6ab516978>>

### Pipeline

The pipeline object takes the dataloader arguments and runs the whole pipeline:

```
dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
```
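
The flow in the diagram can be sketched with toy stand-ins. All three classes below (`ToyDataloader`, `ToyModel`, `ToyPipeline`) are hypothetical and use plain lists instead of numpy arrays. They only illustrate how `predict` and `predict_generator` relate to each other and are not kipoi's implementation.

```python
class ToyDataloader:
    """Stand-in for a dataloader: turns keyword arguments into batches."""

    def __init__(self, values):
        self.values = list(values)

    def batch_iter(self, batch_size=32):
        for start in range(0, len(self.values), batch_size):
            yield {"inputs": self.values[start:start + batch_size]}


class ToyModel:
    """Stand-in for a model: doubles every input value."""

    def predict_on_batch(self, inputs):
        return [2 * x for x in inputs]


class ToyPipeline:
    def __init__(self, model, dataloader_cls):
        self.model = model
        self.dataloader_cls = dataloader_cls

    def predict_generator(self, dataloader_kwargs, batch_size=32):
        # dataloader arguments --Dataloader--> batches
        dl = self.dataloader_cls(**dataloader_kwargs)
        for batch in dl.batch_iter(batch_size=batch_size):
            # batches --Model--> predictions
            yield self.model.predict_on_batch(batch["inputs"])

    def predict(self, dataloader_kwargs, batch_size=32):
        # Concatenate the per-batch predictions into one result
        out = []
        for pred in self.predict_generator(dataloader_kwargs, batch_size):
            out.extend(pred)
        return out


pipeline = ToyPipeline(ToyModel(), ToyDataloader)
kwargs = {"values": [1, 2, 3, 4, 5]}
all_preds = pipeline.predict(kwargs, batch_size=2)
first_batch = next(pipeline.predict_generator(kwargs, batch_size=2))
print(all_preds, first_batch)
```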

In [20]:
?model.pipeline.predict

Signature: model.pipeline.predict(dataloader_kwargs, batch_size=32)
Docstring:
# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

:return: Predict the whole array
File:      /data/nasif12/home_if12/avsec/projects-work/kipoi/kipoi/pipeline.py
Type:      method


In [21]:
?model.pipeline.predict_generator

Signature: model.pipeline.predict_generator(dataloader_kwargs, batch_size=32)
Docstring:
Prediction generator

# Arguments
    preproc_kwargs: Keyword arguments passed to the pre-processor

# Yields
    model batch prediction
File:      /data/nasif12/home_if12/avsec/projects-work/kipoi/kipoi/pipeline.py
Type:      method


### Others

In [22]:
# Model source
model.source

GitLFSSource(remote_url='git@github.com:kipoi/models.git', local_path='/data/ouga/home/ag_gagneur/avsec/.kipoi/models/')

In [23]:
# model location directory
model.source_dir

'/data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp'

## DataLoader

In [24]:
DataLoader = kipoi.get_dataloader_factory("rbp")

2017-11-29 17:22:00,979 [INFO] git-lfs pull -I rbp/**
2017-11-29 17:22:01,299 [INFO] dataloader rbp loaded
2017-11-29 17:22:01,322 [INFO] successfully loaded the dataloader from /data/ouga/home/ag_gagneur/avsec/.kipoi/models/rbp/dataloader.py::SeqDistDataset


In [25]:
?DataLoader

Init signature: DataLoader(intervals_file, fasta_file, gtf_file, preproc_transformer, target_file=None)
Docstring:
Args:
    intervals_file: file path; tsv file
        Assumes bed-like `chrom start end id score strand` format.
    fasta_file: file path; Genome sequence
    gtf_file: file path; Genome annotation GTF file pickled using pandas.
    preproc_transformer: file path; tranformer used for pre-processing.
    target_file: file path; path to the targets
    batch_size: int
Type:           type


## Run dataloader on some examples

In [26]:
# each dataloader already provides the example files
DataLoader.example_kwargs

{'fasta_file': 'example_files/hg38_chr22.fa',
 'gtf_file': 'example_files/gencode_v25_chr22.gtf.pkl.gz',
 'intervals_file': 'example_files/intervals.bed',
 'preproc_transformer': 'dataloader_files/encodeSplines.pkl',
 'target_file': 'example_files/targets.tsv'}

In [27]:
import os

In [28]:
# cd into the source directory 
os.chdir(DataLoader.source_dir)

In [29]:
!tree

.
├── custom_keras_objects.py
├── dataloader_files
│   └── encodeSplines.pkl
├── dataloader.py
├── dataloader.yaml
├── example_files
│   ├── gencode_v25_chr22.gtf.pkl.gz
│   ├── hg38_chr22.fa
│   ├── hg38_chr22.fa.fai
│   ├── intervals.tsv
│   ├── predictions.h5
│   ├── predictions.tsv
│   └── targets.tsv
├── model_files
│   ├── model.json
│   └── weights.h5
├── model.yaml
├── __pycache__
│   └── dataloader.cpython-35.pyc
├── readme.md
└── train_model.ipynb

4 directories, 17 files


In [32]:
dl = DataLoader(**DataLoader.example_kwargs)
# could also be done with DataLoader.init_example()

INFO:2017-11-29 17:22:51,562:genomelake] Running landmark extractors..
2017-11-29 17:22:51,562 [INFO] Running landmark extractors..
INFO:2017-11-29 17:22:51,569:genomelake] Done!
2017-11-29 17:22:51,569 [INFO] Done!


In [33]:
# This particular dataloader is of type Dataset
# i.e. it implements the __getitem__ method:
dl[0].keys()

dict_keys(['inputs', 'metadata', 'targets'])

In [34]:
dl[0]["inputs"].keys()

dict_keys(['seq', 'dist_polya_st'])

In [35]:
dl[0]["inputs"]["seq"][:5]

array([[ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25],
       [ 0.25,  0.25,  0.25,  0.25]], dtype=float32)

In [36]:
len(dl)

14
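
A dataloader of type Dataset therefore only needs random access and a length. The sketch below shows that protocol with a toy class returning the same three keys as above. `ToyDataset` and its example data are hypothetical and unrelated to the `rbp` dataloader's real code.

```python
class ToyDataset:
    """Map-style dataset: random access via __getitem__, size via __len__."""

    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # One example, split into the three standard parts
        return {
            "inputs": {"seq": self.sequences[idx]},
            "targets": self.targets[idx],
            "metadata": {"index": idx},
        }


ds = ToyDataset(["ACGT", "TTGA"], [1, 0])
print(sorted(ds[0].keys()), len(ds))
```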

### Get the whole dataset

In [37]:
whole_data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00, 26.73it/s]


In [38]:
whole_data.keys()

dict_keys(['inputs', 'metadata', 'targets'])

In [39]:
whole_data["inputs"]["seq"].shape

(14, 101, 4)

### Get the iterator to run predictions

In [40]:
it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)

In [41]:
next(it)["inputs"]["seq"].shape

(1, 101, 4)

In [42]:
model.predict_on_batch(next(it)["inputs"])

array([[ 0.2135]], dtype=float32)
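
The `batch_size` and `drop_last` arguments control how examples are grouped into batches. The sketch below illustrates their usual meaning with a plain-Python iterator over a list. This `batch_iter` function is a hypothetical stand-in and ignores `shuffle` and `num_workers`.

```python
def batch_iter(samples, batch_size=32, drop_last=False):
    """Yield lists of at most `batch_size` consecutive samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The final batch may be smaller than batch_size
    if batch and not drop_last:
        yield batch


samples = list(range(14))  # same number of examples as len(dl) above
sizes = [len(b) for b in batch_iter(samples, batch_size=4)]
sizes_dropped = [len(b) for b in batch_iter(samples, batch_size=4, drop_last=True)]
print(sizes, sizes_dropped)
```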

### Train the Keras model

The underlying Keras model is stored under the `.model` attribute.

In [43]:
model.model.compile("adam", "binary_crossentropy")

In [44]:
train_it = dl.batch_train_iter(batch_size=2)

In [45]:
model.model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
seq (InputLayer)                (None, 101, 4)       0                                            
__________________________________________________________________________________________________
conv1 (Conv1D)                  (None, 93, 10)       370         seq[0][0]                        
__________________________________________________________________________________________________
average_pooling1d_6 (AveragePoo (None, 23, 10)       0           conv1[0][0]                      
__________________________________________________________________________________________________
dist_polya_st (InputLayer)      (None, 1, 10)        0                                            
__________________________________________________________________________________________________
flatten_6 

In [46]:
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x7ff6ab453ba8>
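
Keras' `fit_generator` expects a generator that yields `(inputs, targets)` pairs and never runs out, drawing `steps_per_epoch` batches per epoch. The sketch below shows a minimal iterator of that kind, assuming this is roughly what `batch_train_iter` provides (its implementation is not shown here). `train_iter` and `make_batches` are hypothetical names.

```python
import itertools


def train_iter(make_batches):
    """Endlessly yield (inputs, targets) pairs, restarting after each pass."""
    while True:
        for batch in make_batches():
            yield batch["inputs"], batch["targets"]


def make_batches():
    # Two toy batches standing in for one pass over the dataset
    yield {"inputs": [1, 2], "targets": [0, 1]}
    yield {"inputs": [3, 4], "targets": [1, 0]}


# Drawing three batches wraps around to the first batch again
first_three = list(itertools.islice(train_iter(make_batches), 3))
print(first_three)
```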

## Pipeline: `raw files -[dataloader]-> numpy arrays -[model]-> prediction`

In [47]:
example_kwargs = model.default_dataloader.example_kwargs

In [48]:
model.pipeline.predict(example_kwargs)

2017-11-29 17:23:03,508 [INFO] Initialized data generator. Running batches...
INFO:2017-11-29 17:23:03,631:genomelake] Running landmark extractors..
2017-11-29 17:23:03,631 [INFO] Running landmark extractors..
INFO:2017-11-29 17:23:03,637:genomelake] Done!
2017-11-29 17:23:03,637 [INFO] Done!


array([[ 0.2627],
       [ 0.2159],
       [ 0.2159],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627],
       [ 0.2159],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627],
       [ 0.2627]], dtype=float32)

In [49]:
next(model.pipeline.predict_generator(example_kwargs, batch_size=2))

2017-11-29 17:23:04,495 [INFO] Initialized data generator. Running batches...
INFO:2017-11-29 17:23:04,621:genomelake] Running landmark extractors..
2017-11-29 17:23:04,621 [INFO] Running landmark extractors..
INFO:2017-11-29 17:23:04,628:genomelake] Done!
2017-11-29 17:23:04,628 [INFO] Done!


array([[ 0.2627],
       [ 0.2159]], dtype=float32)