# Contributing a model to the Kipoi model repository

This notebook will show you how to contribute a model to the [Kipoi model repository](https://github.com/kipoi/models). 

## Kipoi basics

Contributing a model to Kipoi means adding a sub-folder with all the required files to the [Kipoi model repository](https://github.com/kipoi/models) via a pull request.

The two main components of the model repository are the **model** and the **dataloader**.

#### Model

A model takes numpy arrays as input and outputs numpy arrays. In practice, a model needs to implement the `predict_on_batch(x)` method, where `x` is a dictionary or list of numpy arrays. The model contributor needs to provide one of the following:

- Serialized Keras model
- Serialized Sklearn model
- Custom model inheriting from `kipoi.model.BaseModel`
  - all the required files (e.g. weights) need to be loaded in `__init__`
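
As a rough illustration of the `predict_on_batch` contract, the sketch below defines a custom model that loads its weights in `__init__` and maps a batch of numpy arrays to a numpy array. The class name `LinearModel`, the `weights_file` argument and the `.npz` weight format are all made up for this example. In a real contribution the class would inherit from Kipoi's `BaseModel` instead of `object`.

```python
import os
import tempfile

import numpy as np


class LinearModel(object):
    """Hypothetical custom model; a real one would subclass Kipoi's BaseModel."""

    def __init__(self, weights_file):
        # Load everything the model needs up front
        params = np.load(weights_file)
        self.W = params["W"]  # shape: (n_features, n_outputs)
        self.b = params["b"]  # shape: (n_outputs,)

    def predict_on_batch(self, x):
        # x: array of shape (batch_size, n_features)
        return np.dot(x, self.W) + self.b


# Write some toy weights to disk so the example is self-contained
weights_path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(weights_path, W=np.ones((4, 3)), b=np.zeros(3))

custom_model = LinearModel(weights_path)
batch_pred = custom_model.predict_on_batch(np.ones((2, 4)))
print(batch_pred.shape)
```

The point of the sketch is the interface: everything is loaded once in the constructor, and `predict_on_batch` only ever sees arrays whose first dimension is the batch.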

#### Dataloader

A dataloader takes raw file paths or other parameters as arguments and outputs modelling-ready numpy arrays. Data can be loaded through a generator (batch-by-batch or sample-by-sample) or by returning the whole dataset at once. The goal is to work directly with raw files (say fasta, bed or vcf in bioinformatics), as this makes it possible to run model predictions on new datasets without the burden of running custom pre-processing scripts.
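
The three loading styles mentioned above can be sketched in plain Python. The function names (`iter_samples`, `iter_batches`, `load_all`) and the toy CSV content are assumptions made for this illustration and are not part of the Kipoi API.

```python
import csv
import io

# Toy "raw file" standing in for a CSV on disk (made-up values)
RAW_CSV = "a,b\n1,2\n3,4\n5,6\n"


def iter_samples(fileobj):
    """Sample-by-sample generator: yields one parsed row at a time."""
    for row in csv.DictReader(fileobj):
        yield {key: float(value) for key, value in row.items()}


def iter_batches(fileobj, batch_size):
    """Batch-by-batch generator: groups samples into lists of batch_size."""
    batch = []
    for sample in iter_samples(fileobj):
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


def load_all(fileobj):
    """Whole-dataset loading: returns every sample at once."""
    return list(iter_samples(fileobj))


first_sample = next(iter_samples(io.StringIO(RAW_CSV)))
batch_sizes = [len(b) for b in iter_batches(io.StringIO(RAW_CSV), batch_size=2)]
all_samples = load_all(io.StringIO(RAW_CSV))
print(first_sample, batch_sizes, len(all_samples))
```

Generators keep memory use bounded for large raw files, while loading everything at once is simpler when the dataset is small.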

### Folder layout

Here is an example folder structure of a Kipoi model:

```
├── dataloader.py     # implements the dataloader
├── dataloader.yaml   # describes the dataloader
├── dataloader_files/ # files required by the dataloader
│   ├── x_transformer.pkl
│   └── y_transformer.pkl
├── model.yaml        # describes the model
├── model_files/      # files required by the model
│   ├── model.json
│   └── weights.h5
└── test_files/       # small test files
    ├── features.csv
    ├── targets.csv
    └── test.json
```    

The two most important files are `model.yaml` and `dataloader.yaml`. They provide a complete description of the model, the dataloader and the files they depend on.
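
Before going further, it can help to check that a model folder contains the top-level files from the example layout. The helper below is a minimal sketch: `missing_files` is a hypothetical name, not a Kipoi function, and `kipoi test` remains the authoritative check.

```python
import os
import tempfile

# File names taken from the example layout above
REQUIRED_FILES = ["model.yaml", "dataloader.yaml", "dataloader.py"]


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required files that are absent from model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demo on a temporary folder containing only model.yaml
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "model.yaml"), "w").close()
print(missing_files(demo_dir))
```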

## Contributing a simple Iris-classifier

Details about the individual files will be revealed throughout the tutorial below. A simple Keras model will be trained to predict the Iris plant class from the well-known [Iris](http://archive.ics.uci.edu/ml/datasets/Iris) dataset.



### Outline

1. Train the model
2. Generate `dataloader_files/`
3. Generate `model_files/`
4. Generate `test_files/`
5. Write `model.yaml`
6. Write `dataloader.yaml`
7. Write `dataloader.py`
8. Test the model with `$ kipoi test .`

### 1. Train the model

#### Load and pre-process the data

In [10]:
import pandas as pd
import os
from sklearn.preprocessing import LabelBinarizer, StandardScaler

from sklearn import datasets
iris = datasets.load_iris()

In [11]:
# view more info about the dataset
# print(iris["DESCR"])

In [12]:
# Data pre-processing
y_transformer = LabelBinarizer().fit(iris["target"])
x_transformer = StandardScaler().fit(iris["data"])

In [13]:
x = x_transformer.transform(iris["data"])
y = y_transformer.transform(iris["target"])

In [14]:
x[:3]

array([[-0.9007,  1.0321, -1.3413, -1.313 ],
       [-1.143 , -0.125 , -1.3413, -1.313 ],
       [-1.3854,  0.3378, -1.3981, -1.313 ]])

In [15]:
y[:3]

array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
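
To make the two pre-processing steps concrete, here is a small numpy sketch of what they compute. It uses made-up toy data rather than the Iris measurements, and it mirrors the behaviour of `StandardScaler` and `LabelBinarizer` only for this simple case (note that `LabelBinarizer` returns a single column when there are just two classes).

```python
import numpy as np

# Toy data (made-up values, not the Iris measurements)
toy_x = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
toy_labels = np.array([0, 2, 1])

# Standardization: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0)
toy_x_scaled = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0)

# One-hot encoding: one column per sorted class label
classes = np.unique(toy_labels)
toy_y_onehot = (toy_labels[:, np.newaxis] == classes).astype(int)

print(toy_x_scaled[:, 0])
print(toy_y_onehot)
```

After scaling, every column has zero mean and unit variance, which is why the first rows of `x` shown earlier are small values around zero.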

#### Train an example model

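Let's train a simple linear classifier (a single dense layer with a softmax output) using Keras.

In [17]:
from keras.models import Sequential
import keras.layers as kl

# The softmax activation turns the three outputs into class probabilities,
# as expected by the categorical cross-entropy loss
model = Sequential([kl.Dense(units=3, input_shape=(4, ), activation="softmax")])
model.compile("adam", "categorical_crossentropy")

model.fit(x, y, verbose=0)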

<keras.callbacks.History at 0x7f5c6a6b7a20>

### 2. Generate `dataloader_files/`

Now that we have everything we need, let's start writing the files to the model's directory (here `model_template/`).

In reality, you would need to:
1. Fork the [kipoi/models repository](https://github.com/kipoi/models)
2. Clone your repository fork, ignoring all the git-lfs files
    - `$ git lfs clone git@github.com:<your_username>/models.git '-I /'`
3. Create a new folder `<mynewmodel>` containing all the model files in the repository root
    - put all the non-code files (serialized models, test data) into a `*files/` directory, where `*` can be anything. These directories are tracked by `git-lfs` instead of `git`.
      - Examples: `model_files/`, `dataloader_files/`
4. Test your repository locally:
    - `$ kipoi test <mynewmodel_folder>`
5. Commit, push to your forked remote and submit a pull request to [github.com/kipoi/models](https://github.com/kipoi/models)

The dataloader can use trained transformers (here the `LabelBinarizer` and `StandardScaler` transformers from sklearn). These should be written to `dataloader_files/`.

In [18]:
cd ../examples/iris_model_template



In [20]:
os.makedirs("dataloader_files", exist_ok=True)

In [21]:
import pickle

In [22]:
with open("dataloader_files/y_transformer.pkl", "wb") as f:
    pickle.dump(y_transformer, f)

with open("dataloader_files/x_transformer.pkl", "wb") as f:
    pickle.dump(x_transformer, f)    

In [23]:
ls dataloader_files

x_transformer.pkl  y_transformer.pkl
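
A quick way to gain confidence in the written files is a pickle round trip: write an object, read it back, and compare. The sketch below uses a plain dictionary with made-up values as a stand-in for a fitted transformer, and a temporary directory instead of `dataloader_files/`, so that it runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted transformer (hypothetical values); the real
# files hold the fitted sklearn objects
stand_in = {"mean": [5.8, 3.1], "scale": [0.8, 0.4]}

pkl_path = os.path.join(tempfile.mkdtemp(), "x_transformer.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(stand_in, f)

# Read the file back, as the dataloader will do later
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in)
```

Keep in mind that pickled sklearn objects are generally only expected to load reliably with the library version that wrote them, which is one reason to list the dependencies carefully later on.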


### 3. Generate `model_files/`

The serialized model weights and architecture go to `model_files/`.

In [24]:
os.makedirs("model_files", exist_ok=True)

In [25]:
# Architecture
with open("model_files/model.json", "w") as f:
    f.write(model.to_json())

In [26]:
# Weights
model.save_weights("model_files/weights.h5")

### 4. Generate `test_files/`

`test_files/` should contain a small subset of the raw files the dataloader will read.

#### Numpy arrays -> pd.DataFrame

In [27]:
iris.keys()

dict_keys(['data', 'DESCR', 'target', 'target_names', 'feature_names'])

In [28]:
X = pd.DataFrame(iris["data"][:20], columns=iris["feature_names"])

In [29]:
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [30]:
y = pd.DataFrame({"class": iris["target"][:20]})

In [31]:
y.head()

|   | class |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


#### Save test files

In [32]:
os.makedirs("test_files", exist_ok=True)

In [33]:
X.to_csv("test_files/features.csv", index=False)

In [34]:
y.to_csv("test_files/targets.csv", index=False)

In [35]:
!head -n 2 test_files/targets.csv

class
0


In [36]:
!head -n 2 test_files/features.csv

sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
5.1,3.5,1.4,0.2
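
Since the dataloader will later assume that both files describe the same samples, a small consistency check is worthwhile. The following sketch works on in-memory copies of the first rows shown above rather than on the files themselves; the checks it performs (equal row counts, four feature columns) are just examples of what one might verify.

```python
import csv
import io

# In-memory copies of the first lines shown above
features_csv = io.StringIO(
    "sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)\n"
    "5.1,3.5,1.4,0.2\n"
    "4.9,3.0,1.4,0.2\n"
)
targets_csv = io.StringIO("class\n0\n0\n")

feature_rows = list(csv.DictReader(features_csv))
target_rows = list(csv.DictReader(targets_csv))

# Every feature row needs a matching target row, and four feature columns
assert len(feature_rows) == len(target_rows)
assert all(len(row) == 4 for row in feature_rows)
print(len(feature_rows), "rows,", len(feature_rows[0]), "feature columns")
```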


### 5. Write `model.yaml`

The `model.yaml` for this model should look like this:

```yaml
type: keras  # use `kipoi.model.KerasModel`
args:  # arguments of `kipoi.model.KerasModel`
    arch: model_files/model.json
    weights: model_files/weights.h5
default_dataloader: . # path to the dataloader directory. Here it's defined in the same directory
info: # General information about the model
    author: Your Name
    name: NameOfThisModel
    version: 0.1
    descr: Model predicting the Iris species
dependencies:
    conda: # install via conda
      - python=3.5
      - h5py
      # - soumith::pytorch  # specify packages from other channels via <channel>::<package>      
    pip:   # install via pip
      - keras>=2.0.4
      - tensorflow>=1.0
schema:  # Model schema
    inputs:
        features:
            shape: (4,)  # array shape of a single sample (omitting the batch dimension)
            descr: "Features in cm: sepal length, sepal width, petal length, petal width."
    targets:
        plant_class:
            shape: (3,)
            descr: "One-hot encoded array of classes: setosa, versicolor, virginica."
```

All file paths are relative to `model.yaml`.
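
The `shape` entries in the schema describe a single sample, so an actual batch carries one extra leading dimension. The sketch below illustrates that relationship; `matches_schema` is a hypothetical helper written for this example and is not how Kipoi itself validates shapes.

```python
import numpy as np

# Per-sample shapes copied from the schema in model.yaml
INPUT_SHAPE = (4,)
TARGET_SHAPE = (3,)


def matches_schema(batch, sample_shape):
    """True if `batch` has shape (batch_size,) + sample_shape."""
    return batch.ndim == len(sample_shape) + 1 and batch.shape[1:] == sample_shape


x_batch = np.zeros((8, 4))   # 8 samples, 4 features each
y_batch = np.zeros((8, 3))   # 8 samples, 3 class scores each
print(matches_schema(x_batch, INPUT_SHAPE), matches_schema(y_batch, TARGET_SHAPE))
```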

### 6. Write `dataloader.yaml`

```yaml
type: Dataset
defined_as: dataloader.py::MyDataset  # We need to implement MyDataset class inheriting from kipoi.data.Dataset in dataloader.py
args:
    features_file:
        # descr: > allows multi-line fields
        descr: >
          Csv file of the Iris Plants Database from
          http://archive.ics.uci.edu/ml/datasets/Iris features.
        type: str
    targets_file:
        descr: >
          Csv file of the Iris Plants Database targets.
          Not required for making the prediction.
        type: str
        optional: True  # if not present, the `targets` field will not be present in the dataloader output
info:
    author: Your Name
    name: NameOfThisModel
    version: 0.1
    descr: Model predicting the Iris species
dependencies:
    conda:
      - python=3.5
      - pandas
      - numpy
      - scikit-learn
output_schema:
    inputs:
        features:
            shape: (4,)
            descr: "Features in cm: sepal length, sepal width, petal length, petal width."
    targets:
        plant_class:
            shape: (3, )
            descr: "One-hot encoded array of classes: setosa, versicolor, virginica."
    metadata:  # field providing additional information to the samples (not directly required by the model)
        example_row_number:
            shape: (1,)
            descr: Just an example metadata column
```

### 7. Write `dataloader.py`

Finally, let's implement `MyDataset`. We need to implement two methods: `__len__` and `__getitem__`.

`__getitem__` returns one item of the dataset. In our case, this is a dictionary structured according to the `output_schema` described in `dataloader.yaml`.

For more information about writing such dataloaders, see the [Data Loading and Processing Tutorial from pytorch](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

In [37]:
from kipoi.data import Dataset
import pickle
import pandas as pd
import numpy as np

def read_pickle(path):
    """Load a pickled object (e.g. a fitted transformer) from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

class MyDataset(Dataset):

    def __init__(self, features_file, targets_file=None):
        self.features_file = features_file
        self.targets_file = targets_file
        
        self.y_transformer = read_pickle("dataloader_files/y_transformer.pkl")
        self.x_transformer = read_pickle("dataloader_files/x_transformer.pkl")

        self.features = pd.read_csv(features_file)
        if targets_file is not None:
            self.targets = pd.read_csv(targets_file)
            assert len(self.targets) == len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x_features = np.ravel(self.x_transformer.transform(self.features.iloc[idx].values[np.newaxis]))
        if self.targets_file is None:
            y_class = {}
        else:
            y_class = np.ravel(self.y_transformer.transform(self.targets.iloc[idx].values[np.newaxis]))
        return {
            "inputs": {
                "features": x_features
            },
            "targets": {
                "plant_class": y_class
            },
            "metadata": {
                "example_row_number": np.array([idx])
            }
        }

#### Example usage of the dataset

In [38]:
ds = MyDataset("test_files/features.csv", "test_files/targets.csv")

In [39]:
# call __getitem__
ds[5]

{'inputs': {'features': array([-0.5372,  1.9577, -1.1707, -1.05  ])},
 'metadata': {'example_row_number': array([5])},
 'targets': {'plant_class': array([1, 0, 0])}}

Since `MyDataset` inherits from `kipoi.data.Dataset`, it has some additional nice features. See [python-sdk.ipynb](python-sdk.ipynb) for more information.

In [40]:
# batch-iterator
it = ds.batch_iter(batch_size=3, shuffle=False, num_workers=2)
next(it)

{'inputs': {'features': array([[-0.9007,  1.0321, -1.3413, -1.313 ],
         [-1.143 , -0.125 , -1.3413, -1.313 ],
         [-1.3854,  0.3378, -1.3981, -1.313 ]])},
 'metadata': {'example_row_number': array([[0],
         [1],
         [2]])},
 'targets': {'plant_class': array([[1, 0, 0],
         [1, 0, 0],
         [1, 0, 0]])}}
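
Conceptually, a batch like the one above is obtained by stacking the outputs of several `__getitem__` calls, key by key. The sketch below shows one way to do this for nested dictionaries of numpy arrays. It is an illustration only, not Kipoi's actual implementation, and the `collate` name and toy samples are assumptions made for the example.

```python
import numpy as np


def collate(samples):
    """Stack a list of (possibly nested) dict samples into one batch."""
    first = samples[0]
    if isinstance(first, dict):
        return {key: collate([s[key] for s in samples]) for key in first}
    return np.stack(samples)


toy_samples = [
    {"inputs": {"features": np.array([0.0, 1.0])},
     "metadata": {"example_row_number": np.array([i])}}
    for i in range(3)
]
toy_batch = collate(toy_samples)
print(toy_batch["inputs"]["features"].shape)
print(toy_batch["metadata"]["example_row_number"].ravel())
```

This is why each array in the batch gains a leading dimension of size `batch_size` while the dictionary structure stays the same as for a single sample.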

In [41]:
# ds.load_all()  # load the whole dataset into memory

### 8. Test the model with `$ kipoi test .`

Before we contribute the model to the repository, let's run the test:

In [42]:
!kipoi test .

Traceback (most recent call last):
  File "/opt/modules/i12g/anaconda/3-4.1.1/bin/kipoi", line 11, in <module>
    load_entry_point('kipoi==0.0.1', 'console_scripts', 'kipoi')()
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/__main__.py", line 56, in main
    command_fn(args.command, sys.argv[2:])
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/pipeline.py", line 147, in cli_test
    mh = kipoi.get_model(args.model, args.source)
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/model.py", line 67, in get_model
    dl_source)
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/data.py", line 144, in get_dataloader_factory
    dl = DataLoaderDescription.load(yaml_path)
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/python3.5/site-packages/kipoi/components.py", line 49, in load
    return cls.from_config(parsed_dict)
  File "/opt/modules/i12g/anaconda/3-4.1.1/lib/pyt

## Accessing the model through kipoi 

In [43]:
import kipoi

In [44]:
from importlib import reload  # `reload` is not a builtin in Python 3
reload(kipoi)

<module 'kipoi' from '/opt/modules/i12g/anaconda/3-4.1.1/envs/dev-kipoi-py35/lib/python3.5/site-packages/kipoi/__init__.py'>

In [46]:
m = kipoi.get_model(".", source="dir")  # See also python-sdk.ipynb

TypeError: __init__() got an unexpected keyword argument 'output_schema'

## Recap

Congrats! You made it through the tutorial! Feel free to use this model as a template for your own model.