# `seqp`: how to use sharded storage

## Introduction

This example uses DNA data to illustrate sharded storage with `seqp`.

We will:

1. Download a DNA data file in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format).
2. Parse the file with [biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11).
3. Store the sequences from the file in multiple HDF5 shards, annotating each DNA sequence with its identifier.
4. Read the DNA data back.

## Download the data and take a look at it

In [1]:
!wget -q 'ftp://ftp.ebi.ac.uk/pub/databases/ena/coding/release/con/fasta/rel_con_hum_r138.cds.fasta.gz'
!rm -f rel_con_hum_r138.cds.fasta
!gunzip 'rel_con_hum_r138.cds.fasta.gz'

In [2]:
from Bio import SeqIO

file_name = 'rel_con_hum_r138.cds.fasta'

# Print the first five records (k = 0..4) and stop.
for k, seq_record in enumerate(SeqIO.parse(file_name, "fasta")):
    print("ID: {}".format(seq_record.id))
    print("   - Sequence: {}...{}".format(seq_record.seq[:20], seq_record.seq[-10:]))
    print("   - Length: {}".format(len(seq_record)))
    if k > 3:
        break

ID: ENA|EAL24309|EAL24309.1
   - Sequence: atgaagcatgtgttgaacct...aagcatgtga
   - Length: 192
ID: ENA|EAL24310|EAL24310.1
   - Sequence: atggaggggccactcactcc...gctgtactga
   - Length: 1800
ID: ENA|EAL24311|EAL24311.1
   - Sequence: atggaggggccactcactcc...gctgtactga
   - Length: 1692
ID: ENA|EAL24312|EAL24312.1
   - Sequence: atggacccaaggacatccag...gacctcctga
   - Length: 276
ID: ENA|EAL24313|EAL24313.1
   - Sequence: atggccaggcatggctgtct...agacctgtga
   - Length: 2082


## Read the DNA data and store it in HDF5 with `seqp` as we go

We use `Hdf5RecordWriter` to write the DNA sequences to files. Because we want to spread the records over multiple HDF5 files, each containing at most a given number of records, we wrap it in a `ShardedWriter` decorator.
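To get a feel for what a per-shard record limit implies, here is a minimal arithmetic sketch. It assumes that each shard is filled completely before the next one is opened and that shards are numbered from zero; neither is confirmed here as `seqp`'s actual behaviour, and `shard_number` is a hypothetical helper, not part of the library. The record count is the one the progress bar reports for this file.

```python
import math

MAX_RECORDS_PER_SHARD = 5000  # same limit as used with the writer in this example
TOTAL_RECORDS = 65183         # record count reported for this FASTA file


def shard_number(position: int, max_records: int = MAX_RECORDS_PER_SHARD) -> int:
    """Zero-based shard that would hold the record written at `position`.

    Hypothetical helper: assumes shards are filled strictly in order.
    """
    return position // max_records


# Number of shards needed and how many records end up in the last one.
n_shards = math.ceil(TOTAL_RECORDS / MAX_RECORDS_PER_SHARD)
records_in_last_shard = TOTAL_RECORDS - (n_shards - 1) * MAX_RECORDS_PER_SHARD

print(n_shards)                                # 14
print(records_in_last_shard)                   # 183
print(shard_number(4999), shard_number(5000))  # 0 1

# How the file name template expands for a given shard number
# (whether seqp starts counting at 0 or 1 is an assumption here).
print("dna_example_{:02d}.hdf5".format(shard_number(12345)))  # dna_example_02.hdf5
```

Under these assumptions, the file would be split into 13 full shards and one final shard with the remaining 183 records.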

Once we have our writer, we iterate over the sequences in the FASTA file and write them with it. Each nucleotide is saved as a byte-sized integer, obtained by subtracting the ASCII code of 'a' from the ASCII code of the nucleotide letter.

Once all the sequences are written to the files, we add a piece of metadata: a dictionary that maps each protein name to the index of its sequence within the files.
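The nucleotide encoding described above can be tried out on its own before touching any files. The following is a small sketch in plain Python; the function names `nucleotide_to_code` and `code_to_nucleotide` are illustrative and do not come from `seqp`. The fragment is the start of the first sequence in the file.

```python
def nucleotide_to_code(letter: str) -> int:
    """Map a nucleotide letter to its offset from 'a' (case-insensitive)."""
    return ord(letter.lower()) - ord("a")


def code_to_nucleotide(code: int) -> str:
    """Inverse mapping: offset from 'a' back to a lowercase letter."""
    return chr(code + ord("a"))


# Codes of the four standard bases.
codes = {base: nucleotide_to_code(base) for base in "acgt"}
print(codes)  # {'a': 0, 'c': 2, 'g': 6, 't': 19}

# Round trip on the first ten nucleotides of the first record.
fragment = "atgaagcatg"
encoded = [nucleotide_to_code(base) for base in fragment]
decoded = "".join(code_to_nucleotide(code) for code in encoded)

print(encoded)  # [0, 19, 6, 0, 0, 6, 2, 0, 19, 6]
print(decoded)  # atgaagcatg
```

Every lowercase letter maps to a value between 0 and 25, so the codes fit comfortably in an unsigned byte. Because the input is lowercased first, the original letter case is not recoverable after decoding.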

In [3]:
from Bio import SeqIO
import json
import numpy as np
from seqp.hdf5 import Hdf5RecordWriter
from seqp.record import ShardedWriter
from tqdm import tqdm

def nucleotide2num(letter: str) -> int:
    """ Converts a nucleotide letter to an integer"""
    return ord(letter.lower()) - ord('a')

protein2idx = dict()
output_file_template = "dna_example_{:02d}.hdf5"

with ShardedWriter(Hdf5RecordWriter,
                   output_file_template,
                   max_records_per_shard=5000) as writer:

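    for idx, seq_record in enumerate(tqdm(SeqIO.parse(file_name, "fasta"))):
        # The record ID looks like 'ENA|EAL24309|EAL24309.1'; keep the last field.
        _, _, protein = seq_record.id.split('|')
        protein2idx[protein] = idx
        # Encode each nucleotide as a small integer and store it as uint8.
        sequence = [nucleotide2num(letter) for letter in seq_record.seq]
        writer.write(idx, np.array(sequence, dtype=np.uint8))

    # Store the protein -> index dictionary as JSON metadata.
    writer.add_metadata({'protein_idx': json.dumps(protein2idx)})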


65183it [02:19, 466.99it/s]


## Read the HDF5 records back

We open the HDF5 files with a `Hdf5RecordReader`. First, we read back the dictionary with the indexes of each protein sequence, and then we retrieve the sequences associated with some specific target proteins.
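The protein-to-index dictionary travels through the files as a JSON string, so it may help to see that round trip in isolation. This is a sketch using only the standard library and the record IDs printed earlier; it does not call `seqp`, and the variable names are illustrative.

```python
import json

# Record IDs as they appear in the FASTA file (first three records).
record_ids = [
    "ENA|EAL24309|EAL24309.1",
    "ENA|EAL24310|EAL24310.1",
    "ENA|EAL24311|EAL24311.1",
]

# Build the protein -> index dictionary from the last field of each ID.
protein_to_index = {}
for index, record_id in enumerate(record_ids):
    _, _, protein = record_id.split("|")
    protein_to_index[protein] = index

# Serialize to a JSON string (what gets stored as metadata) and parse it back.
serialized = json.dumps(protein_to_index)
restored = json.loads(serialized)

print(restored["EAL24311.1"])        # 2
print(restored == protein_to_index)  # True
```

The keys are strings and the values are integers, so the dictionary survives JSON serialization unchanged and the restored indexes can be used directly for lookups.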

In [4]:
from glob import glob
import json
from seqp.hdf5 import Hdf5RecordReader

target_proteins = ['EAL24309.1', 'EAL24312.1']

def num2nucleotide(num: int) -> str:
    """ Converts an integer to a nucleotide letter"""
    return chr(num + ord('a'))

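with Hdf5RecordReader(glob('dna_example_*.hdf5')) as reader:
    # Restore the protein -> index dictionary stored as JSON metadata.
    loaded_protein2idx = json.loads(reader.metadata('protein_idx'))
    # All record indexes available across the shards (not used below).
    indexes = set(reader.indexes())
    for protein in target_proteins:
        # Look the index up in the dictionary read back from the files.
        sequence = reader.retrieve(loaded_protein2idx[protein])
        sequence = "".join([num2nucleotide(n) for n in sequence.tolist()])
        print("{} : {}...{}".format(protein, sequence[:20], sequence[-10:]))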

EAL24309.1 : atgaagcatgtgttgaacct...aagcatgtga
EAL24312.1 : atggacccaaggacatccag...gacctcctga
