# Introduction

This notebook illustrates the use of [`seqp`](https://github.com/noe/seqp) to write text to binary form as token IDs, and to later read the data back.

First, let's download some text to encode. We will use the Universal Declaration of Human Rights (UDHR):

In [1]:
!wget http://research.ics.aalto.fi/cog/data/udhr/txt/eng.txt

--2019-03-03 19:45:55--  http://research.ics.aalto.fi/cog/data/udhr/txt/eng.txt
Resolving research.ics.aalto.fi... 130.233.195.27
Connecting to research.ics.aalto.fi|130.233.195.27|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10746 (10K) [text/plain]
Saving to: ‘eng.txt’


2019-03-03 19:45:55 (1.89 MB/s) - ‘eng.txt’ saved [10746/10746]



Let's have a look at the first lines:

In [2]:
file_name = 'eng.txt'
max_lines_to_print = 2

with open(file_name) as f:
    # pair each line with a counter so we stop after max_lines_to_print lines
    for _, line in zip(range(max_lines_to_print), f):
        print(line.strip())

Universal Declaration of Human Rights
Preamble Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, Whereas it is essential to promote the development of friendly relations between nations, Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equ

# Vocabulary extraction

Next, we extract a vocabulary that maps each word to an integer ID.

Normally, this would involve a proper tokenization step, but since that is not the focus of this notebook, we simply split on whitespace and punctuation with a regular expression.
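
To make the effect of that regular expression concrete, here is a small standalone sketch that uses only the standard library. The helper name `simple_tokenize` and the sample sentence are illustrative and not part of `seqp`.

```python
import re

# Same pattern as the notebook: a run of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.strip().lower())


print(simple_tokenize("Freedom, justice and peace."))
# ['freedom', ',', 'justice', 'and', 'peace', '.']
```

Note that every punctuation character becomes its own token, so a contraction such as `don't` is split into `don`, `'` and `t`. That is acceptable for a demo but is one reason a real pipeline would use a proper tokenizer.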

In [3]:
import re
from seqp.vocab import Vocabulary, VocabularyCollector

collector = VocabularyCollector()

with open(file_name) as f:
    for line in f:
        line = line.strip().lower()
        # tokenize words (taken from https://stackoverflow.com/a/8930959/674487)
        tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
        for token in tokens:
            collector.add_symbol(token)

vocab = collector.consolidate(max_num_symbols=5000)
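
For intuition about what collecting and consolidating a vocabulary amounts to, the following plain-Python sketch counts tokens and keeps the most frequent ones. It is not `seqp`'s implementation: the reserved symbols `<pad>` and `<unk>`, their IDs, and the choice to leave them out of the `max_num_symbols` count are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical reserved symbols; seqp may use different names and IDs.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(token_lists, max_num_symbols):
    """Map the most frequent tokens to consecutive integer IDs."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    most_common = [tok for tok, _ in counts.most_common(max_num_symbols)]
    # Reserved symbols come first, followed by tokens in frequency order.
    symbols = [PAD, UNK] + most_common
    return {symbol: idx for idx, symbol in enumerate(symbols)}


corpus = [["all", "human", "beings"], ["human", "rights"]]
token_to_id = build_vocab(corpus, max_num_symbols=3)
print(token_to_id)
```

Here `human` is the most frequent token, so it gets the first ID after the reserved symbols. Tokens that do not make the cut have no ID of their own and would have to be represented by the unknown symbol.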

# Write text as HDF5 records

We now iterate over the lines of the file again and, using the vocabulary extracted above, turn each line into a list of integer token IDs and store it in an HDF5 file.

Each record stored in the file has an associated unique key (an integer). Although any number can serve as the key, a sensible choice for line-oriented storage is the line number, as we do here:

In [4]:
import numpy as np
from seqp.hdf5 import Hdf5RecordWriter

output_file = 'udhr_eng.hdf5'

with Hdf5RecordWriter(output_file) as writer, open(file_name) as f:

    # save vocabulary along with the records
    writer.add_metadata({'vocab': vocab.to_json()})

    for idx, line in enumerate(f):
        line = line.strip().lower()
        tokens = re.findall(r"\w+|[^\w\s]", line, re.UNICODE)
        token_ids = vocab.encode(tokens, add_eos=False, use_unk=True)
        writer.write(idx, np.array(token_ids))
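
The encoding step and the choice of line numbers as keys can be pictured with a pure-Python stand-in. The sketch below uses an in-memory dictionary instead of an HDF5 file, and it assumes that `use_unk=True` means out-of-vocabulary tokens are mapped to an unknown-symbol ID, which is what the parameter name suggests. The toy vocabulary and the `encode` helper are hypothetical.

```python
# Hypothetical toy vocabulary; in the notebook this role is played by `vocab`.
token_to_id = {"<unk>": 0, "human": 1, "rights": 2, "all": 3}


def encode(tokens, token_to_id, unk_symbol="<unk>"):
    """Map tokens to IDs, sending unseen tokens to the unknown-symbol ID."""
    unk_id = token_to_id[unk_symbol]
    return [token_to_id.get(token, unk_id) for token in tokens]


lines = ["all human rights", "human dignity"]

# Key each record by its line number, as the notebook does with `idx`.
records = {idx: encode(line.split(), token_to_id) for idx, line in enumerate(lines)}
print(records)  # {0: [3, 1, 2], 1: [1, 0]}
```

The word `dignity` is not in the toy vocabulary, so it is stored as ID `0`. Decoding that record later would yield the unknown symbol instead of the original word.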

# Read back HDF5 records

First, we extract the vocabulary back from the HDF5 file metadata, and then we iterate through the first HDF5 records and print them.

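`RecordReader` offers an iterator over tuples of sequence index and length (`indexes_and_lengths()` in the code below). Given a sequence index, `retrieve` returns the sequence itself.

With the vocabulary, we convert the token IDs back to tokens and print them to show that the text is preserved. The output is lowercased and has spaces around punctuation because of the tokenization applied before encoding.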

In [5]:
from seqp.hdf5 import Hdf5RecordReader

with Hdf5RecordReader(output_file) as reader:

    vocab = Vocabulary.from_json(reader.metadata('vocab'))

    for idx, length in reader.indexes_and_lengths():
        if idx >= max_lines_to_print:
            break
        token_ids = reader.retrieve(idx).tolist()
        tokens = vocab.decode(token_ids)
        print(" ".join(tokens))

universal declaration of human rights
preamble whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom , justice and peace in the world , whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind , and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people , whereas it is essential , if man is not to be compelled to have recourse , as a last resort , to rebellion against tyranny and oppression , that human rights should be protected by the rule of law , whereas it is essential to promote the development of friendly relations between nations , whereas the peoples of the united nations have in the charter reaffirmed their faith in fundamental human rights , in the dignity and worth of the human person and
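
The round trip preserves the token sequence, not the exact original characters: casing and the original spacing around punctuation are lost during tokenization. A small self-contained sketch of this, using a hypothetical toy vocabulary built from a single sentence instead of `seqp`:

```python
import re


def tokenize(text):
    """Same lowercasing and regex splitting as in the notebook."""
    return re.findall(r"\w+|[^\w\s]", text.strip().lower())


original = "Freedom, justice and peace."
tokens = tokenize(original)

# Toy vocabulary for illustration: one ID per distinct token, in order of appearance.
token_to_id = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
id_to_token = {i: tok for tok, i in token_to_id.items()}

token_ids = [token_to_id[tok] for tok in tokens]
decoded = [id_to_token[i] for i in token_ids]
restored = " ".join(decoded)

print(decoded == tokens)     # True
print(restored)              # freedom , justice and peace .
print(restored == original)  # False
```

If the exact original text matters, it has to be stored separately or the tokenization has to be reversible, which the simple regular expression used here is not.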