## Getting started

In [2]:
# install kipoiseq
!pip install kipoiseq


In [3]:
from kipoiseq.dataloaders import IntervalSeqDl

### Get the example files

`IntervalSeqDl` comes with some example files, which are downloaded on first access:

In [6]:
kwargs = IntervalSeqDl.example_kwargs
kwargs

Downloading data from https://raw.githubusercontent.com/kipoi/kipoiseq/master/tests/data/example_intervals.bed
Downloading data from https://raw.githubusercontent.com/kipoi/kipoiseq/master/tests/data/sample.5kb.fa


{'fasta_file': '/home/avsec/workspace/kipoi/kipoiseq/notebooks/downloaded/example_files/fasta_file',
 'intervals_file': '/home/avsec/workspace/kipoi/kipoiseq/notebooks/downloaded/example_files/intervals_file'}

In [13]:
!cat {kwargs['fasta_file']}

>chr1
ACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGTAACGT

In [14]:
!cat {kwargs['intervals_file']}

chr1	2	1000	1
chr1	2	5000	1
chr1	2	1002	1
chr1	602	604	1


The extra columns (here all ones) are binary labels for each interval.
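
To make the file layout concrete, here is a minimal sketch that parses one tab-separated line into coordinates and labels. The helper `parse_interval_line` and the `Interval` tuple are hypothetical and not part of kipoiseq. The sketch assumes BED-style coordinates (0-based start, exclusive end), which agrees with the length of 998 reported for the first example further below.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 0-based, inclusive (BED convention)
    end: int    # assumed exclusive (BED convention)
    labels: List[float]


def parse_interval_line(line: str) -> Interval:
    """Split one tab-separated line into coordinates and label columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Every column after the third is treated as a label (one per task).
    labels = [float(value) for value in fields[3:]]
    return Interval(chrom, start, end, labels)


interval = parse_interval_line("chr1\t2\t1000\t1")
print(interval.end - interval.start)  # 998
print(interval.labels)                # [1.0]
```

A line with several trailing columns would simply yield a longer `labels` list, one entry per task.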

### Set up the dataset

In [15]:
# set up the dataset
dl = IntervalSeqDl(**kwargs)

In [16]:
len(dl)

4

In [19]:
dl[0]

{'inputs': array([[0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        ...,
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0.

The sequence is one-hot encoded. The first example corresponds to the first interval in the intervals file, and the extra columns are parsed as targets.
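
The encoding can be reproduced by hand. The sketch below is an illustration, not kipoiseq's implementation: the `one_hot` helper is hypothetical, the column order `ACGT` is inferred from the output above, and leaving unknown bases as all-zero rows is an assumption of this sketch.

```python
ALPHABET = "ACGT"  # assumed column order, inferred from the output above


def one_hot(sequence):
    """Encode a DNA string as a list of rows, one 4-element row per base."""
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:  # unknown bases such as N stay all-zero in this sketch
            row[index] = 1.0
        rows.append(row)
    return rows


# Same repeating pattern as the example FASTA file, shortened for display.
fasta = "ACGTA" * 3
start, end = 2, 7  # first five bases of the first interval (starts at 2)
for row in one_hot(fasta[start:end]):
    print(row)
```

The five printed rows encode `GTAAC` and match the first five rows of `dl[0]['inputs']`.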

In [22]:
len(dl[0]['inputs'])

998

Since the intervals have different lengths, we have to resize them to a common length.

In [23]:
dl = IntervalSeqDl(auto_resize_len=10, **kwargs)

In [24]:
len(dl[0]['inputs'])

10
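
The `ranges` metadata shown further below suggests that each interval is resized around its center. The following sketch reproduces those coordinates. `resize_around_center` is a hypothetical helper, and the rounding used for odd-length cases is an assumption rather than documented kipoiseq behavior.

```python
def resize_around_center(start, end, width):
    """Return a (start, end) pair of the given width, centered on the interval."""
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


# The four intervals from the example intervals file, resized to 10 bases.
for start, end in [(2, 1000), (2, 5000), (2, 1002), (602, 604)]:
    print(resize_around_center(start, end, 10))
```

For the example file this yields `(496, 506)`, `(2496, 2506)`, `(497, 507)` and `(598, 608)`, the same `start` and `end` values as in the metadata table.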

You can load the whole dataset into memory:

In [25]:
data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00, 431.34it/s]


In [27]:
import pandas as pd

In [27]:
pd.DataFrame(data['metadata']['ranges'])

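|   | chr  | end  | id | start | strand |
|---|------|------|----|-------|--------|
| 0 | chr1 | 506  | 0  | 496   | *      |
| 1 | chr1 | 2506 | 1  | 2496  | *      |
| 2 | chr1 | 507  | 2  | 497   | *      |
| 3 | chr1 | 608  | 3  | 598   | *      |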


Alternatively, you can load it batch by batch, as the training and prediction sections below do.
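
Conceptually, batch-wise loading groups consecutive examples into fixed-size chunks. The generator below is a plain-Python sketch of that idea. `iter_batches` is a hypothetical helper, not the kipoiseq implementation, which also stacks the examples into arrays and can load them with several workers.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive lists of at most `batch_size` examples."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


# Four examples (as in the example dataset) grouped into batches of two.
for batch in iter_batches(range(4), batch_size=2):
    print(batch)
```

With four examples and a batch size of two this produces two batches, which is why the training call below uses `len(dl)//batch_size` steps per epoch.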

### Training a Keras model

In [28]:
# setup a simple model
import keras.layers as kl
from keras.models import Sequential

In [40]:
model = Sequential([kl.Conv1D(3, 2, activation='relu', input_shape=(10, 4)),
                    kl.GlobalMaxPool1D(),
                    kl.Dense(1, activation='sigmoid')])  # sigmoid output, since binary_crossentropy expects probabilities

In [41]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [42]:
batch_size = 2

In [43]:
# setup an iterator
iterator = dl.batch_train_iter(batch_size=batch_size, num_workers=4)  # use 4 workers in parallel to load the data

In [44]:
model.fit_generator(iterator, steps_per_epoch=len(dl)//batch_size, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd96f865668>

#### What does the iterator return?

In [49]:
x,y = next(iterator)

In [52]:
x  # one-hot encoded DNA sequence

array([[[0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]],

       [[0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 0., 1.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]]])

In [53]:
x.shape

(2, 10, 4)

In [56]:
y  # target labels

array([[1.],
       [1.]])

In [57]:
y.shape

(2, 1)

### Making predictions and writing them iteratively to an hdf5 file

Let's say you have a very large dataset and you want to save the predictions batch by batch into an HDF5 file together with the `metadata`. Here is how you can do that:

In [59]:
from kipoi.writers import HDF5BatchWriter

In [60]:
writer = HDF5BatchWriter("/tmp/preds.h5")

In [62]:
for batch in dl.batch_iter(batch_size=batch_size, num_workers=4):
    preds = model.predict_on_batch(batch['inputs'])
    to_write = {"preds": preds, "metadata": batch['metadata']}
    writer.batch_write(to_write)

In [63]:
writer.close()

Let's have a look at what we wrote:

In [64]:
from kipoi.readers import HDF5Reader

In [66]:
reader = HDF5Reader('/tmp/preds.h5')
reader.open()

In [68]:
# list all the arrays
reader.ls()

[('/metadata/ranges/chr', <HDF5 dataset "chr": shape (4,), type "|O">),
 ('/metadata/ranges/end', <HDF5 dataset "end": shape (4,), type "<i8">),
 ('/metadata/ranges/id', <HDF5 dataset "id": shape (4,), type "|O">),
 ('/metadata/ranges/start', <HDF5 dataset "start": shape (4,), type "<i8">),
 ('/metadata/ranges/strand', <HDF5 dataset "strand": shape (4,), type "|O">),
 ('/preds', <HDF5 dataset "preds": shape (4, 1), type "<f4">)]

In [69]:
data = reader.load_all()

In [70]:
data['preds']

array([[0.3115],
       [0.3115],
       [0.3115],
       [0.3115]], dtype=float32)

In [73]:
# handle to the h5py object
reader.f

<HDF5 file "preds.h5" (mode r)>

In [74]:
reader.f['preds'][:2]

array([[0.3115],
       [0.3115]], dtype=float32)

### Final remarks

- See the available arguments of `IntervalSeqDl`: http://kipoi.org/kipoiseq/dataloaders/sequence/#seqdataset
- Both the `intervals_file` and the `fasta_file` may be gzipped.
- You may have multiple additional columns in the `intervals_file` to train a multi-task model.
- If you are training on large datasets, choose an appropriate encoding for the labels (say `bool` for binary-only labels).