# Converting Raw Text into Sequence Data

Preprocessing text data typically involves the following steps:
1. Load text as strings into memory
2. Split the strings into tokens (e.g. words or characters)
3. Build a dictionary, associating each token with a numerical index
4. Convert the text into a sequence of numerical indices
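
As a rough illustration of these four steps, the sketch below runs them on a short made-up string rather than on a real dataset. The variable names (`sample`, `token_to_idx`, `idx_to_token`) and the choice to order the dictionary by descending frequency are assumptions made for this example only.

```python
import collections

# Step 1: the text is held in memory as a string
# (a made-up sample used only for illustration)
sample = 'the time traveller and the time machine'

# Step 2: split the string into word tokens
tokens = sample.split()

# Step 3: build a dictionary mapping each unique token to an index.
# Ordering by descending frequency (ties broken alphabetically) is one
# common convention, assumed here
counter = collections.Counter(tokens)
idx_to_token = [tok for tok, _ in
                sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))]
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

# Step 4: convert the token sequence into a sequence of indices
indices = [token_to_idx[tok] for tok in tokens]
print(indices)  # [0, 1, 4, 2, 0, 1, 3]
```

Mapping the indices back through `idx_to_token` recovers the original token sequence, which is a quick sanity check on the dictionary.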

In [2]:
import collections
import random
import re
import torch
from d2l import torch as d2l

## Reading the Dataset

We will work with H. G. Wells' *The Time Machine*. The code below downloads the raw text and reads it into memory as a single string.

In [4]:
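class TimeMachine(d2l.DataModule):
    """The Time Machine dataset."""

    def _download(self):
        # Download the text file into self.root (the last argument is the
        # file's expected SHA-1 hash) and return its contents as one string
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()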

In [5]:
data = TimeMachine()
raw_text = data._download()

Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...


In [6]:
raw_text[:69]

'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Traveller (f'

In [8]:
@d2l.add_to_class(TimeMachine)
def _preprocess(self, text):
    # Ignore punctuation and capitalisation for simplicity: replace every run
    # of non-letter characters with a single space, then lowercase the result
    return re.sub('[^A-Za-z]+', ' ', text).lower()

In [9]:
text = data._preprocess(raw_text)
text[:60]

'the time machine by h g wells i the time traveller for so it'
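
To make the effect of the regular expression concrete, here is a small standalone sketch that applies the same substitution to short strings. The free function `preprocess` is a hypothetical stand-in that mirrors the `_preprocess` method above, and the sample strings are made up for illustration.

```python
import re

def preprocess(text):
    # Same substitution as _preprocess above: every run of characters
    # outside A-Z and a-z becomes a single space, then lowercase
    return re.sub('[^A-Za-z]+', ' ', text).lower()

# Punctuation, digits, brackets and newlines all collapse into spaces
cleaned = preprocess('The Time Machine, by H. G. Wells [1898]\n\nI')
print(repr(cleaned))  # 'the time machine by h g wells i'

# Trailing punctuation leaves a trailing space behind
print(repr(preprocess('Hello, World!')))  # 'hello world '

# Non-ASCII letters are treated as non-letters and dropped
print(repr(preprocess('café')))  # 'caf '
```

The last two cases show the cost of this simplification: the output may begin or end with a space, and accented letters are lost. That is acceptable for this English text, but other corpora may need a more careful pattern.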