# Converting Raw Text into Sequence Data

Typical preprocessing for text data involves the following steps:
1. Load text as strings into memory
2. Split the strings into tokens (e.g. words or characters)
3. Build a dictionary, associating each token with a numerical index
4. Convert the text into a sequence of numerical indices
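
The four steps can be sketched end to end on a small in-memory string. This is only a minimal illustration, not the pipeline used in the rest of this section: the sample string and the names `sample`, `sample_tokens`, `token_to_idx` and `encoded` are made up for the example.

```python
# Step 1: the text is already in memory as a string (made-up sample)
sample = "the time machine"

# Step 2: split the string into tokens (characters here; words would also work)
sample_tokens = list(sample)

# Step 3: build a dictionary mapping each unique token to a numerical index
token_to_idx = {tok: idx for idx, tok in enumerate(sorted(set(sample_tokens)))}

# Step 4: convert the text into a sequence of numerical indices
encoded = [token_to_idx[tok] for tok in sample_tokens]

print(token_to_idx)
print(encoded)
```

The encoded sequence has one index per character, and the dictionary has one entry per distinct character.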

In [2]:
import collections
import random
import re
import torch
from d2l import torch as d2l

## Reading The Dataset

We will be working with H.G. Wells' "The Time Machine".

In [4]:
class TimeMachine(d2l.DataModule):

    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
            '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

In [5]:
data = TimeMachine()
raw_text = data._download()

Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...


In [6]:
raw_text[:69]

'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Traveller (f'

In [8]:
@d2l.add_to_class(TimeMachine)
def _preprocess(self, text):
    # Ignore punctuation and capitalisation for simplicity
    return re.sub('[^A-Za-z]+', ' ', text).lower()    

In [9]:
text = data._preprocess(raw_text)
text[:60]

'the time machine by h g wells i the time traveller for so it'
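
To see what the regular expression does in isolation, the following sketch applies the same substitution to a made-up string (the `sample` text is an assumption for illustration, not taken from the dataset). Every run of non-letter characters, including punctuation and newlines, collapses to a single space before lowercasing.

```python
import re

# Made-up input containing punctuation, brackets and newlines
sample = "The Time Traveller (for so it will be...)\n\nI"

# Same pattern as in _preprocess: replace each run of non-letters with one space
cleaned = re.sub('[^A-Za-z]+', ' ', sample).lower()

print(repr(cleaned))
```

Note that digits are removed as well, since the pattern keeps only the letters A to Z in either case.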

## Tokenisation

Tokens are the individual "steps" of a sequence. Exactly what those tokens are is a design choice. For example, we could represent a sentence as a sequence of words, giving a short sequence but an enormous vocabulary, or we could instead represent the same sentence as a sequence of characters. The latter results in a much longer sequence but requires a far smaller vocabulary: at most 256 tokens if each character is a single byte (plain ASCII defines only 128).
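
The trade-off can be checked on a single sentence. This is a rough sketch on a made-up example; the names `sentence`, `word_tokens` and `char_tokens` are assumptions for illustration.

```python
# Made-up sentence, already lowercased and free of punctuation
sentence = "the time traveller came back to the time machine"

word_tokens = sentence.split()   # word-level tokens
char_tokens = list(sentence)     # character-level tokens

print(f"words: {len(word_tokens)} tokens, {len(set(word_tokens))} unique")
print(f"chars: {len(char_tokens)} tokens, {len(set(char_tokens))} unique")
```

The character sequence is several times longer than the word sequence. On a sentence this short the character vocabulary is not yet the smaller of the two; that gap only opens on a full corpus, where the number of distinct words keeps growing while the character set stays bounded.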

In [11]:
# Tokenise to a list of characters

@d2l.add_to_class(TimeMachine)
def _tokenize(self, text):
    return list(text)

In [12]:
tokens = data._tokenize(text)

In [14]:
",".join(tokens[:30])

't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '

## Vocabulary

These tokens are still strings, but a model needs numerical inputs, so we map each token to a numerical index. This mapping is called a "_vocabulary_". First we compile a complete list of the tokens seen in the training _corpus_ and assign a unique numerical index to each. Often, rare tokens are dropped from the vocabulary and are represented by "`<unk>`" at training and inference time.
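
Before the full class, the idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the corpus, the `min_freq` threshold of 2, and the helper `encode` are made up for the example, and placing `<unk>` at index 0 is a choice made here for readability.

```python
import collections

corpus_tokens = list("the time machine")  # made-up corpus, character tokens
min_freq = 2                               # hypothetical frequency threshold

# Count how often each token occurs and keep only the frequent ones
counter = collections.Counter(corpus_tokens)
kept = sorted(tok for tok, freq in counter.items() if freq >= min_freq)

# Index 0 is reserved for the unknown token in this sketch
idx_to_token = ['<unk>'] + kept
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

def encode(tokens):
    """Map tokens to indices, falling back to <unk> for unseen or rare tokens."""
    unk = token_to_idx['<unk>']
    return [token_to_idx.get(tok, unk) for tok in tokens]

print(idx_to_token)
print(encode(list("the cat")))
```

Here `a`, `c` and `n` each occur only once in the corpus, so they are dropped and any later occurrence is encoded as the `<unk>` index.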


In [15]:
class Vocab:
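    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        # Use None defaults to avoid shared mutable default arguments
        tokens = tokens or []
        reserved_tokens = reserved_tokens or []

        # Flatten a list of token lists into a single list, if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]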

        # Count frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)

        # Sorted list of unique tokens; a token's position is its index
        self.idx_to_token = sorted(
            set(
                ['<unk>']
                + reserved_tokens
                + [token for token, freq in self.token_freqs if freq >= min_freq]
            )
        )

        self.token_to_idx = {
            token: idx for idx, token in enumerate(self.idx_to_token)
        }

    def __len__(self):
        return len(self.idx_to_token)  # idx_to_token is a list, not a callable

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)  # Return unk if specified key does not exist
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if hasattr(indices, "__len__") and len(indices) > 1:
            return [self.idx_to_token[idx] for idx in indices]  # index the list; it is not callable
        return self.idx_to_token[indices]

    # The unknown token
    @property
    def unk(self):
        return self.token_to_idx['<unk>']

In [16]:
vocab = Vocab(tokens=tokens)
indices = vocab[tokens[:10]]

print(f"Indices: {indices}")
print(f"Words: {vocab.indices_to_tok