# Using fields associated with sequence records

In its simplest form of use, `seqp` stores data sequences. Sometimes, however, it is useful to attach extra pieces of data to each sequence.

For example, in a Neural Machine Translation scenario, we might want to store extra information about each sentence along with its token IDs, such as its dependency parse or the POS tags of its words.

In this notebook we illustrate such a setup: apart from tokenizing each sentence and storing its token IDs, we will use [spacy](https://spacy.io/) to get the dependency parse of the sentence and store it along with the token IDs.

This notebook overlaps a bit with the [basic read/write notebook](https://github.com/noe/seqp/blob/master/examples/basic_read_write.ipynb), so it might be helpful to review that one first if you are not familiar with `seqp`.

The idea in this notebook is to:
1. Retrieve a text file from the internet.
2. Segment the text into sentences.
3. Extract a word-level vocabulary from the sentences.
4. For each sentence:
    - encode it as token IDs
    - extract its dependency parse
    - store everything with `seqp`

## Text file download

First, let's download a text file to play with: the Universal Declaration of Human Rights (UDHR).

In [10]:
!wget -q http://research.ics.aalto.fi/cog/data/udhr/txt/eng.txt

## Vocabulary extraction

We read all the file contents...

In [11]:
from seqp.vocab import Vocabulary, VocabularyCollector

file_name = 'eng.txt'

with open(file_name) as f:
    lines = [line.strip() for line in f]

...then we segment each line into sentences with spacy...

In [12]:
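import spacy
from itertools import chain

# 'en' is a model shortcut from spaCy 2.x; on spaCy 3+ load 'en_core_web_sm' instead
nlp = spacy.load('en')

# flatten the per-line sentence iterators into a single list of sentences
sents_in_text = list(chain.from_iterable(nlp(line).sents for line in lines))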

...and now we extract the vocabulary from the sentences, keeping at most 5000 symbols:

In [13]:
tokens_in_text = [str(t) for sent in sents_in_text for t in sent]

collector = VocabularyCollector()
for token in tokens_in_text:
    collector.add_symbol(token)

vocab = collector.consolidate(max_num_symbols=5000)
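
As an illustration of what a capped vocabulary does, here is a minimal plain-Python sketch. It assumes the common convention of keeping the most frequent symbols and mapping everything else to an unknown-word symbol; it is not `seqp`'s implementation, and `build_vocab`, `encode_tokens` and `<unk>` are hypothetical names used only here.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-word symbol, not necessarily the one seqp uses


def build_vocab(tokens, max_num_symbols):
    """Map the most frequent tokens to IDs, reserving ID 0 for UNK."""
    counts = Counter(tokens)
    kept = [tok for tok, _ in counts.most_common(max_num_symbols - 1)]
    return {symbol: idx for idx, symbol in enumerate([UNK] + kept)}


def encode_tokens(tokens, symbol_to_id):
    """Replace each token by its ID, falling back to the UNK ID."""
    return [symbol_to_id.get(tok, symbol_to_id[UNK]) for tok in tokens]


toy_tokens = "the right to life and the right to liberty".split()
toy_vocab = build_vocab(toy_tokens, max_num_symbols=4)

# "right" occurs twice and is kept; "liberty" occurs once and becomes UNK
toy_ids = encode_tokens(["right", "liberty"], toy_vocab)
print(toy_vocab, toy_ids)
```

With a limit as large as 5000, a short text like the UDHR is unlikely to lose any token, so the cap only matters for larger corpora.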

## Store records with fields with `seqp`

For each sentence, we now encode its tokens as token IDs and store them with `seqp`, along with the dependencies (i.e. for each token, the index of its head in the sentence):

In [14]:
import numpy as np
from seqp.hdf5 import Hdf5RecordWriter

SEQ_FIELD = 'seq'
DEPS_FIELD = 'deps'
FIELDS = [SEQ_FIELD, DEPS_FIELD]

output_file = 'udhr_eng.hdf5'

with Hdf5RecordWriter(output_file, FIELDS, SEQ_FIELD) as writer:

    # save vocabulary along with the records
    writer.add_metadata({'vocab': vocab.to_json()})

    for sent in sents_in_text:
        tokens = [str(w) for w in sent]
        token_ids = vocab.encode(tokens, add_eos=False, use_unk=True)
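        # Token.i is the position in the Doc of the whole line; subtracting
        # sent.start makes each head index relative to the sentence
        head_indexes = [w.head.i - sent.start for w in sent]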
        record = {SEQ_FIELD: np.array(token_ids),
                  DEPS_FIELD: np.array(head_indexes)}
        writer.write(record)
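
A note on head positions: spaCy's `Token.i` is a position in the whole parsed document (here, one line of the file), while `Span.start` is the position at which a sentence begins. Head positions relative to the sentence are therefore `head.i - sent.start`. The plain-Python sketch below illustrates the shift with made-up numbers; the helper `to_sentence_relative` and the toy data are hypothetical.

```python
def to_sentence_relative(doc_level_heads, sent_start):
    """Shift document-level head positions so they index into the sentence."""
    return [head - sent_start for head in doc_level_heads]


# Hypothetical 5-token document with two sentences:
# sentence A covers positions 0-2, sentence B covers positions 3-4.
heads_a = [1, 1, 1]  # document-level heads of sentence A (starts at 0)
heads_b = [4, 4]     # document-level heads of sentence B (starts at 3)

relative_a = to_sentence_relative(heads_a, sent_start=0)
relative_b = to_sentence_relative(heads_b, sent_start=3)

# every relative head must be a valid index into its own sentence
assert all(0 <= h < len(relative_a) for h in relative_a)
assert all(0 <= h < len(relative_b) for h in relative_b)
```

For the first sentence of a line the two conventions coincide, because its start position is 0.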

## Read records back

Now we will read back a few records from the file we just wrote, to check that everything works properly.

In [15]:
from seqp.hdf5 import Hdf5RecordReader

MAX_LINES_TO_PRINT = 3

with Hdf5RecordReader(output_file) as reader:

    loaded_vocab = Vocabulary.from_json(reader.metadata('vocab'))

    for idx, length in reader.indexes_and_lengths():
        if idx >= MAX_LINES_TO_PRINT:
            break
        record = reader.retrieve(idx)
        tokens = loaded_vocab.decode(record[SEQ_FIELD])
        print("Sentence: " + " ".join(tokens))
        deps = record[DEPS_FIELD]
        print("Deps: " + str(deps) + '\n')
        

Sentence: Universal Declaration of Human Rights
Deps: [1 1 1 4 2]

Sentence: Preamble
Deps: [0]
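
Each entry of `Deps` is the position of the head of the token at the same position, and in spaCy the root of a sentence is its own head. As a sketch, the first record printed above can be unpacked into (token, head) pairs in plain Python; `head_pairs` is a hypothetical helper, not part of `seqp` or spaCy.

```python
def head_pairs(tokens, heads):
    """Pair each token with its head token; a self-headed token is the root."""
    pairs = []
    for position, (token, head) in enumerate(zip(tokens, heads)):
        head_token = "ROOT" if head == position else tokens[head]
        pairs.append((token, head_token))
    return pairs


# values copied from the first record printed above
example_tokens = ["Universal", "Declaration", "of", "Human", "Rights"]
example_heads = [1, 1, 1, 4, 2]

example_pairs = head_pairs(example_tokens, example_heads)
for token, head_token in example_pairs:
    print(f"{token} <- {head_token}")
```

Here "Declaration" is the root, and the single-token sentence "Preamble" is trivially its own root, which is why its dependencies are `[0]`.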

