# Contributing a model to the Kipoi model repository

This notebook will show you how to contribute a model to the [Kipoi model repository](https://github.com/kipoi/models). 

## Kipoi basics

Contributing a model to Kipoi means adding a sub-folder with all the required files to the [Kipoi model repository](https://github.com/kipoi/models) via a pull request.

The two main components of the model repository are the **model** and the **dataloader**.

#### Model

A model takes numpy arrays as input and returns numpy arrays as output. In practice, a model needs to implement the `predict_on_batch(x)` method, where `x` is a dictionary or list of numpy arrays. The model contributor needs to provide one of the following:

- Serialized Keras model
- Serialized Sklearn model
- Custom model inheriting from `kipoi.model.BaseModel`.
  - all the required files (e.g. weights) need to be loaded in `__init__`
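
As a rough illustration of the `predict_on_batch(x)` contract, here is a minimal sketch of a custom model. It is written as a plain Python class so that it runs without Kipoi installed; in an actual contribution the class would inherit from Kipoi's base model class. The class name `LinearModel`, the weights file `weights.npz` and its array names `W` and `b` are hypothetical.

```python
import os
import tempfile

import numpy as np


class LinearModel:
    """Toy custom model (hypothetical): everything it needs is loaded in __init__."""

    def __init__(self, weights_file):
        # Load the weights once, at construction time
        with np.load(weights_file) as params:
            self.W = params["W"]
            self.b = params["b"]

    def predict_on_batch(self, x):
        # x: numpy array of shape (batch_size, n_features)
        # returns: numpy array of shape (batch_size, n_outputs)
        return x @ self.W + self.b


# Usage: write toy weights to a temporary file, then predict on two samples
tmp_dir = tempfile.mkdtemp()
weights_file = os.path.join(tmp_dir, "weights.npz")
np.savez(weights_file, W=np.ones((4, 3)), b=np.zeros(3))

toy_model = LinearModel(weights_file)
preds = toy_model.predict_on_batch(np.zeros((2, 4)))
print(preds.shape)  # (2, 3)
```

The point of the design is that the caller only ever passes arrays in and gets arrays out; file handling stays inside the constructor.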

#### Dataloader

The dataloader takes raw file paths or other parameters as arguments and outputs modelling-ready numpy arrays. Data can be loaded through a generator (batch-by-batch or sample-by-sample) or by returning the whole dataset at once. The goal is to work directly with raw files (e.g. FASTA, BED or VCF in bioinformatics), since this lets users make model predictions on new datasets without the burden of running custom pre-processing scripts.
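
To make the batch-by-batch option concrete, the following is a minimal sketch of a generator that reads a raw CSV file and yields model-ready numpy arrays. The function name `batch_generator`, the `batch_size` parameter and the `{"inputs": ...}` dictionary layout are illustrative assumptions, not the exact Kipoi dataloader API.

```python
import csv
import os
import tempfile

import numpy as np


def batch_generator(features_file, batch_size=2):
    """Yield batches of numeric features read from a CSV file with a header row."""
    with open(features_file, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        batch = []
        for row in reader:
            batch.append([float(v) for v in row])
            if len(batch) == batch_size:
                yield {"inputs": np.array(batch)}
                batch = []
        if batch:  # last, possibly smaller batch
            yield {"inputs": np.array(batch)}


# Usage: three samples with batch_size=2 give one full and one partial batch
tmp_dir = tempfile.mkdtemp()
features_file = os.path.join(tmp_dir, "features.csv")
with open(features_file, "w", newline="") as f:
    f.write("a,b\n1,2\n3,4\n5,6\n")

batches = list(batch_generator(features_file, batch_size=2))
print([b["inputs"].shape for b in batches])  # [(2, 2), (1, 2)]
```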

### Folder layout

Here is an example folder structure of a Kipoi model:

```
├── dataloader.py          # implements the dataloader
├── dataloader.yaml        # describes the dataloader
├── dataloader_files/      # files required by the dataloader
│   ├── x_transfomer.pkl
│   └── y_transfomer.pkl
├── model.yaml             # describes the model
├── model_files/           # files required by the model
│   ├── model.json
│   └── weights.h5
└── test_files/            # small test files
    ├── features.csv
    ├── targets.csv
    └── test.json
```    

The two most important files are `model.yaml` and `dataloader.yaml`. Together they provide a complete description of the model, the dataloader and the files they depend on.
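
Before opening a pull request it can help to check that a model folder has the expected top-level entries. The sketch below does this with the standard library only; the helper name `missing_entries` and the `EXPECTED` list are hypothetical and simply mirror the example layout above.

```python
import tempfile
from pathlib import Path

# Entries from the example layout; adjust to what your model actually needs
EXPECTED = ["dataloader.py", "dataloader.yaml", "model.yaml",
            "dataloader_files", "model_files", "test_files"]


def missing_entries(model_dir):
    """Return the expected files/directories that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED if not (model_dir / name).exists()]


# Usage on a throw-away directory that only has model.yaml and model_files/
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "model.yaml").write_text("# placeholder\n")
(demo_dir / "model_files").mkdir()

missing = missing_entries(demo_dir)
print(missing)
```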

## Contributing a simple Iris-classifier

Details about the individual files are covered throughout the tutorial below. A simple Keras model will be trained to predict the Iris plant class from the well-known [Iris](https://archive.ics.uci.edu/ml/datasets/Iris) dataset.



### Outline

1. Train the model
2. Generate `dataloader_files/`
3. Generate `model_files/`
4. Generate `test_files/`
5. Write `model.yaml`
6. Write `dataloader.yaml`
7. Write `dataloader.py`

### 1. Train the model

#### Load and pre-process the data

In [167]:
import pandas as pd
import os
from sklearn.preprocessing import LabelBinarizer, StandardScaler

from sklearn import datasets
iris = datasets.load_iris()

In [161]:
# view more info about the dataset
# print(iris["DESCR"])

In [172]:
# Data pre-processing
y_transformer = LabelBinarizer().fit(iris["target"])
x_transformer = StandardScaler().fit(iris["data"])

In [171]:
x = x_transformer.transform(iris["data"])
y = y_transformer.transform(iris["target"])

In [173]:
x[:3]

array([[-0.9007,  1.0321, -1.3413, -1.313 ],
       [-1.143 , -0.125 , -1.3413, -1.313 ],
       [-1.3854,  0.3378, -1.3981, -1.313 ]])

In [174]:
y[:3]

array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
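
For readers unfamiliar with the two transformers, the sketch below reproduces with plain numpy what they compute in this setting: `StandardScaler` subtracts each column's mean and divides by its (population) standard deviation, and `LabelBinarizer`, given more than two classes, turns each label into a one-hot row. The toy arrays are made up for illustration.

```python
import numpy as np

# Toy data standing in for iris["data"] and iris["target"]
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
target = np.array([0, 2, 1])

# StandardScaler: zero mean and unit variance per column (ddof=0)
x_scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# LabelBinarizer with three classes: one column per class
classes = np.unique(target)
y_onehot = (target[:, None] == classes[None, :]).astype(int)

print(x_scaled)
print(y_onehot)
```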

#### Train an example model

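Let's train a simple softmax-regression model (a single dense layer) using Keras.

In [175]:
from keras.models import Sequential
import keras.layers as kl

# A single dense layer with softmax output: one probability per Iris class
model = Sequential([kl.Dense(units=3, input_shape=(4, ), activation="softmax")])
model.compile("adam", "categorical_crossentropy")

# Fit on the standardized features `x` and the one-hot encoded labels `y`
model.fit(x, y, verbose=0)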

<keras.callbacks.History at 0x7fa71dab32b0>
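
To see what a single dense layer with three units computes, here is a numpy sketch of its forward pass, with a softmax applied to turn the three outputs into class probabilities. The weights are random stand-ins, not the values learned by the Keras model above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # 4 input features -> 3 classes
b = np.zeros(3)


def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


x_batch = rng.normal(size=(5, 4))   # a batch of 5 samples
probs = softmax(x_batch @ W + b)    # shape (5, 3); each row sums to 1
pred_class = probs.argmax(axis=1)   # predicted class index per sample
```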

### 2. Generate `dataloader_files/`

Now that we have everything we need, let's start writing the files to the model's directory (here `model_template/`).

In reality, you would need to:
1. Fork the [kipoi/models repository](https://github.com/kipoi/models)
2. Clone your repository fork, ignoring all the git-lfs files
    - `$ git lfs clone git@github.com:<your_username>/models.git '-I /'`
3. Create a new folder `<mynewmodel>` in the repository root containing all the model files
    - put all the non-code files (serialized models, test data) into a `*files/` directory, where `*` can be anything. These files will be tracked by `git-lfs` instead of `git`.
      - Examples: `model_files/`, `dataloader_files/`
4. Test your repository locally:
    - `$ kipoi test <mynewmodel_folder>`
5. Commit, push to your forked remote and submit a pull request to [github.com/kipoi/models](https://github.com/kipoi/models)

The dataloader can use trained transformers (here the `LabelBinarizer` and `StandardScaler` transformers from sklearn). These should be written to `dataloader_files/`.

In [207]:
OUTPUT_PATH = "../model_template"

In [208]:
os.makedirs(OUTPUT_PATH + "/dataloader_files", exist_ok=True)

In [209]:
import pickle

In [210]:
with open(os.path.join(OUTPUT_PATH, "dataloader_files/y_transfomer.pkl"), "wb") as f:
    pickle.dump(y_transformer, f)

with open(os.path.join(OUTPUT_PATH, "dataloader_files/x_transfomer.pkl"), "wb") as f:
    pickle.dump(x_transformer, f)    

In [211]:
ls {OUTPUT_PATH}/dataloader_files

x_transfomer.pkl  y_transfomer.pkl
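
A quick way to confirm that a pickled object is usable is to load it back and compare it with the original. The sketch below uses a plain dictionary of parameters as a stand-in for a fitted transformer so that it runs without scikit-learn; the file name `params.pkl` is hypothetical, and the same round trip applies to the files written above.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for a fitted transformer: just its learned parameters
params = {"mean": np.array([5.8, 3.1]), "scale": np.array([0.8, 0.4])}

out_dir = os.path.join(tempfile.mkdtemp(), "dataloader_files")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "params.pkl")

with open(path, "wb") as f:
    pickle.dump(params, f)

# Load the object back and check that nothing was lost
with open(path, "rb") as f:
    restored = pickle.load(f)

same = all(np.array_equal(params[k], restored[k]) for k in params)
print(same)  # True
```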


### 3. Generate `model_files/`

The serialized model weights and architecture go to `model_files/`.

In [218]:
model_output_path="../model_template/model_files"

In [219]:
os.makedirs(model_output_path, exist_ok=True)

In [220]:
# Architecture
with open(os.path.join(model_output_path, "model.json"), "w") as f:
    f.write(model.to_json())

In [221]:
# Weights
model.save_weights(os.path.join(model_output_path, "weights.h5"))

### 4. Generate `test_files/`

`test_files/` contains a small subset of the files that the dataloader will read.

#### Numpy arrays -> pd.DataFrame

In [5]:
iris.keys()

dict_keys(['data', 'feature_names', 'target_names', 'target', 'DESCR'])

In [78]:
X = pd.DataFrame(iris["data"], columns=iris["feature_names"])

In [80]:
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [120]:
y = pd.DataFrame({"class": iris["target"]})

In [121]:
y.head()

|   | class |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


#### Save test files

In [122]:
output_path="../model_template/test_files"

In [123]:
os.makedirs(output_path, exist_ok=True)

In [124]:
X.to_csv(os.path.join(output_path, "features.csv"), index=False)

In [125]:
y.to_csv(os.path.join(output_path, "targets.csv"), index=False)

In [126]:
!head -n 2 {os.path.join(output_path, "targets.csv")}

class
0


In [127]:
!head -n 2 {os.path.join(output_path, "features.csv")}

sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
5.1,3.5,1.4,0.2
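
Since the dataloader will later read both files together, it is worth checking that they line up row by row. The following sketch, using only the standard library, counts the data rows of two CSV files; the helper `count_rows` and the toy files are hypothetical.

```python
import csv
import os
import tempfile


def count_rows(csv_file):
    """Number of data rows in a CSV file, not counting the header."""
    with open(csv_file, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header if present
        return sum(1 for _ in reader)


# Toy files with the same columns as features.csv and targets.csv
tmp_dir = tempfile.mkdtemp()
features_file = os.path.join(tmp_dir, "features.csv")
targets_file = os.path.join(tmp_dir, "targets.csv")
with open(features_file, "w", newline="") as f:
    f.write("sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)\n"
            "5.1,3.5,1.4,0.2\n4.9,3.0,1.4,0.2\n")
with open(targets_file, "w", newline="") as f:
    f.write("class\n0\n0\n")

n_features, n_targets = count_rows(features_file), count_rows(targets_file)
print(n_features == n_targets)  # True
```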
