## Getting started

In [1]:
# install kipoiseq
# !pip install kipoiseq

In [2]:
from kipoiseq.dataloaders import SeqIntervalDl

### Get the example files

`SeqIntervalDl` comes with some example files that are downloaded automatically:

In [39]:
kwargs = SeqIntervalDl.example_kwargs
kwargs

{'fasta_file': '/home/avsec/workspace/kipoi/kipoiseq/notebooks/downloaded/example_files/fasta_file',
 'intervals_file': '/home/avsec/workspace/kipoi/kipoiseq/notebooks/downloaded/example_files/intervals_file'}

In [40]:
!head -c 100 {kwargs['fasta_file']}

>chr22
AAGTTCCGGGATACATGTGCTGAACATGCAGGTTTGTTACATAGGTATACATGTGGTATACATGTTACATAGGTATACATGTTACATAGTTAT


In [41]:
!cat {kwargs['intervals_file']}

chr22	136018	136069
chr22	136351	136402
chr22	137749	137800
chr22	134503	134554
chr22	139010	139061
chr22	139157	139208
chr22	134638	134689
chr22	138908	138959
chr22	139449	139500
chr22	139450	139501


Columns after the first three (chromosome, start, end), when present, are binary labels for the interval.
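To make the file layout concrete, here is a minimal sketch of how such an intervals file could be parsed by hand. It assumes tab-separated columns in the order chromosome, start, end, followed by zero or more numeric label columns. `parse_intervals` is a hypothetical helper written for illustration, not part of kipoiseq, and the label column in the sample rows is invented.

```python
import io


def parse_intervals(handle):
    """Parse tab-separated interval lines into dicts.

    Assumed column order: chrom, start, end, then zero or more label columns.
    """
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        labels = [float(value) for value in fields[3:]]
        records.append({"chr": chrom, "start": start, "end": end, "labels": labels})
    return records


# First two intervals from the example file, plus a hypothetical label column.
example = io.StringIO("chr22\t136018\t136069\t1\nchr22\t136351\t136402\t1\n")
records = parse_intervals(example)
print(records[0])
print(records[0]["end"] - records[0]["start"])  # interval width in bases
```

Adding further label columns to each row would simply lengthen the `labels` list, which is the shape a multi-task target takes.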

### Set up the dataset

In [42]:
# set up the dataset
dl = SeqIntervalDl(**kwargs)

In [43]:
len(dl)

10

In [44]:
dl[0]

{'inputs': array([[0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 0., 1.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.],
        [0., 0., 0., 1.],
        [0., 0., 0., 1.],
        ...,
        [0., 0., 0., 1.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.],
        [0., 0., 0., 1.],
        [0., 0., 0., 1.],
        [0., 0.

The sequence is one-hot encoded: each row is one base and each column is one nucleotide. The first item corresponds to the first interval in the intervals file, and the extra columns are parsed as targets.
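The encoding itself is simple to sketch. The version below uses plain Python lists and assumes the column order `ACGT`, which the output above does not confirm. Under that assumption, the first five rows shown above would spell `CACAG`. `one_hot_encode` is a hypothetical helper, and mapping unknown bases such as `N` to all-zero rows is a choice made for this sketch only.

```python
ALPHABET = "ACGT"  # assumed column order; not confirmed by the notebook output


def one_hot_encode(seq, alphabet=ALPHABET):
    """Return one row per base, with 1.0 in the column of the matching letter.

    Bases outside the alphabet (for example N) become all-zero rows here.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(alphabet)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


sequence = "CACAGN"
encoded = one_hot_encode(sequence)
for base, row in zip(sequence, encoded):
    print(base, row)
```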

In [9]:
len(dl[0]['inputs'])

51

Intervals can have different lengths, so we resize them to a fixed length with `auto_resize_len`.

In [10]:
dl = SeqIntervalDl(auto_resize_len=10, **kwargs)

In [11]:
len(dl[0]['inputs'])

10
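The resized coordinates reported in the `ranges` metadata later in this section are consistent with resizing around the interval center. A minimal sketch of that arithmetic follows. `resize_center` is a hypothetical helper, and the rounding convention (floor of the midpoint) is an assumption chosen because it reproduces the example ranges, not a statement of how kipoiseq implements it.

```python
def resize_center(start, end, width):
    """Return a (start, end) pair of the given width, anchored at the interval center."""
    center = (start + end) // 2  # floor of the midpoint
    new_start = center - width // 2
    return new_start, new_start + width


# First three intervals from the example intervals file.
intervals = [(136018, 136069), (136351, 136402), (137749, 137800)]
for start, end in intervals:
    print((start, end), "->", resize_center(start, end, 10))
```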

You can load the whole dataset into memory:

In [12]:
data = dl.load_all()

100%|██████████| 1/1 [00:00<00:00, 239.06it/s]


In [13]:
import pandas as pd

In [37]:
pd.DataFrame(data['metadata']['ranges'])

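|     | chr   | end    | id  | start  | strand |
|-----|-------|--------|-----|--------|--------|
| 0   | chr22 | 136048 | 0   | 136038 | *      |
| 1   | chr22 | 136381 | 1   | 136371 | *      |
| 2   | chr22 | 137779 | 2   | 137769 | *      |
| ... | ...   | ...    | ... | ...    | ...    |
| 7   | chr22 | 138938 | 7   | 138928 | *      |
| 8   | chr22 | 139479 | 8   | 139469 | *      |
| 9   | chr22 | 139480 | 9   | 139470 | *      |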


Or you can load it batch by batch
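The prediction loop later in this document does this with `dl.batch_iter(batch_size=..., num_workers=...)`. The underlying idea can be sketched without kipoiseq: walk over an indexable dataset in fixed-size steps and yield one chunk at a time. `iter_batches` below is a hypothetical stand-in that returns plain lists; it does not stack arrays or use worker processes.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive lists of up to `batch_size` samples from an indexable dataset."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy_dataset = list(range(10))  # stand-in for the 10 intervals in `dl`
batches = list(iter_batches(toy_dataset, batch_size=4))
print(batches)
```

Only one batch is held in memory at a time, and the last batch may be smaller than `batch_size`.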

### Training a Keras model

In [30]:
# setup a simple model
import keras.layers as kl
from keras.models import Sequential

In [31]:
model = Sequential([kl.Conv1D(3, 2, activation='relu', input_shape=(10, 4)),
                    kl.GlobalMaxPool1D(),
                    kl.Dense(1, activation='sigmoid')])  # sigmoid output, since binary_crossentropy expects probabilities

In [32]:
model.compile('adam', 'binary_crossentropy', ['acc'])

In [33]:
batch_size = 2

In [34]:
# set up an iterator
iterator = dl.batch_train_iter(batch_size=batch_size, num_workers=4)   # use 4 workers in parallel to load the data

In [35]:
model.fit_generator(iterator, steps_per_epoch=len(dl)//batch_size, epochs=10)

Epoch 1/10


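ValueError: Error when checking input: expected conv1d_2_input to have shape (10, 4) but got array with shape (51, 4)

The error arises because `dl` was last created without `auto_resize_len` (cell 42 was executed after cell 10), so it yields sequences of length 51 while the model expects `input_shape=(10, 4)`. Re-create the dataloader with `SeqIntervalDl(auto_resize_len=10, **kwargs)` before building the iterator.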

#### What does the iterator return?

In [None]:
x,y = next(iterator)

In [None]:
x  # one-hot encoded DNA sequence

In [None]:
x.shape

In [None]:
y  # target labels

In [None]:
y.shape
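The cells above unpack each item as an `(x, y)` pair, and `fit_generator` is given `steps_per_epoch`, which suggests a generator that never runs out. The sketch below mimics that behaviour with toy data. That the real `batch_train_iter` cycles indefinitely and builds the tuple from the `inputs` and `targets` keys is an assumption here, and `train_iter` is a hypothetical helper.

```python
import itertools


def train_iter(batches):
    """Cycle over batch dicts forever, yielding (inputs, targets) tuples."""
    for batch in itertools.cycle(batches):
        yield batch["inputs"], batch["targets"]


# Two toy batches with placeholder strings instead of one-hot arrays.
toy_batches = [
    {"inputs": ["seq0", "seq1"], "targets": [1, 1]},
    {"inputs": ["seq2", "seq3"], "targets": [1, 0]},
]
toy_iterator = train_iter(toy_batches)
first_three = [next(toy_iterator) for _ in range(3)]
print(first_three[2] == first_three[0])  # the third draw wraps around to the first batch
```

Because such a generator has no natural end, the training loop needs an explicit number of steps per epoch, here `len(dl)//batch_size`.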

### Making predictions and writing them iteratively to an HDF5 file

Suppose you have a very large dataset and want to save the predictions batch by batch into an HDF5 file together with the `metadata`. Here is how you can do that:

In [None]:
from kipoi.writers import HDF5BatchWriter

In [None]:
writer = HDF5BatchWriter("/tmp/preds.h5")

In [None]:
for batch in dl.batch_iter(batch_size=batch_size, num_workers=4):
    preds = model.predict_on_batch(batch['inputs'])
    to_write = {"preds": preds, "metadata": batch['metadata']}
    writer.batch_write(to_write)

In [None]:
writer.close()
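The dictionary passed to `batch_write` is nested (`metadata` contains further dictionaries such as `ranges`), while an HDF5 file addresses datasets by path. One common way to bridge the two is to flatten the nested keys into slash-separated paths. The sketch below shows that idea only; whether `HDF5BatchWriter` uses exactly this layout is an assumption, `flatten_batch` is a hypothetical helper, and the prediction values are invented.

```python
def flatten_batch(batch, prefix=""):
    """Flatten a nested dict into {"/path/to/leaf": value} pairs."""
    flat = {}
    for key, value in batch.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten_batch(value, prefix=path))
        else:
            flat[path] = value
    return flat


# Toy batch of two samples; the prediction values are made up.
toy_batch = {
    "preds": [0.7, 0.2],
    "metadata": {"ranges": {"chr": ["chr22", "chr22"], "start": [136038, 136371]}},
}
flat = flatten_batch(toy_batch)
print(sorted(flat))
```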

Let's have a look at what we wrote:

In [None]:
from kipoi.readers import HDF5Reader

In [None]:
reader = HDF5Reader('/tmp/preds.h5')
reader.open()

In [None]:
# list all the arrays
reader.ls()

In [None]:
data = reader.load_all()

In [None]:
data['preds']

In [None]:
# handle to the h5py object
reader.f

In [None]:
reader.f['preds'][:2]

### Final remarks

- See the available arguments of `SeqIntervalDl`: http://kipoi.org/kipoiseq/dataloaders/sequence/#seqdataset
- Both the `intervals_file` and the `fasta_file` may be gzipped.
- You may have multiple additional columns in the `intervals_file` to train a multi-task model.
- If you are training on large datasets, choose an appropriate encoding for the labels (say `bool` for binary-only labels).