# 06 - Text data with pytorch

The objective of this notebook is to get an introduction to text data and to using pretrained word embeddings, while getting familiar with saving and loading pytorch objects.

## Contents:

1. Reading `txt` files, tokenizing texts and building a vocabulary
2. Creating a "context/target" dataset
3. Using a pretrained embedding in a model
4. Predict next word's class

In [1]:
import torch
from torch import nn, optim
import torch.nn.functional as F
from datetime import datetime
from torch.utils.data import DataLoader, TensorDataset

import numpy as np
import torchtext
import os
import re
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from utils import train, set_device, compute_accuracy

seed = 265
torch.manual_seed(seed)

<torch._C.Generator at 0x7f88b1ff5210>

## 1. Reading `txt` files, tokenizing texts and building a vocabulary

There is nothing to do in this cell apart from running it. **You might want to read it carefully though, as it is an important step of the pytorch pipeline when dealing with text data. It is quite unlikely that you will need to drastically change this part of the code in your future tasks.**

In [2]:
# tokenizer will split a long text into a list of english words
TOKENIZER_EN = get_tokenizer('basic_english')
# Where we will store / load all our models, datasets, vocabulary, etc.
PATH_GENERATED = './generated/'
# Minimum number of occurrences of a word in the text to add it to the vocabulary
MIN_FREQ = 100

def read_files(datapath='./data_train/'):
    """
    Return a list of strings, one for each line in each .txt file in 'datapath'
    """
    # Find all txt files in directory 
    files = os.listdir(datapath)
    files = [datapath + f for f in files if f.endswith(".txt")]
    
    # Stores each line of each book in a list
    lines = []
    for f_name in files:
        with open(f_name) as f:
            lines += f.readlines()
    return lines

def tokenize(lines, tokenizer=TOKENIZER_EN):
    """
    Tokenize the list of lines
    """
    list_text = []
    for line in lines:
        list_text += tokenizer(line)
    return list_text

def yield_tokens(lines, tokenizer=TOKENIZER_EN):
    """
    Yield tokens, ignoring names and digits to build vocabulary
    """
    # Match any word containing a digit
    no_digits = r'\w*[0-9]+\w*'
    # Match any word containing an uppercase letter
    no_names = r'\w*[A-Z]+\w*'
    # Match any run of whitespace, to collapse it into a single space
    no_spaces = r'\s+'
    
    for line in lines:
        line = re.sub(no_digits, ' ', line)
        line = re.sub(no_names, ' ', line)
        line = re.sub(no_spaces, ' ', line)
        yield tokenizer(line)

def count_freqs(words, vocab):
    """
    Count occurrences of each word in vocabulary in the data
    
    Useful to get some insight on the data and to compute loss weights
    """
    freqs = torch.zeros(len(vocab), dtype=torch.int)
    for w in words:
        freqs[vocab[w]] += 1
    return freqs

def create_vocabulary(lines, min_freq=MIN_FREQ):
    """
    Create a vocabulary (list of known tokens) from a list of strings
    """
    # vocab contains the vocabulary found in the data, associating an index to each word
    vocab = build_vocab_from_iterator(yield_tokens(lines), min_freq=min_freq, specials=["<unk>"])
    # Since we removed all words with an uppercase when building the vocabulary, we skipped the word "I"
    vocab.append_token("i")
    # Value of default index. This index will be returned when OOV (Out Of Vocabulary) token is queried.
    vocab.set_default_index(vocab["<unk>"])
    return vocab
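
To see what the regular expressions in `yield_tokens` do, and how a minimum-frequency cutoff shapes a vocabulary, here is a minimal sketch in plain Python. It does not use torchtext: `clean`, `simple_tokenize`, `build_vocab` and the sample lines are hypothetical stand-ins made up for illustration, and lowercasing followed by `str.split` is only a rough approximation of the `basic_english` tokenizer.

```python
import re
from collections import Counter

# Same patterns as in yield_tokens, written as raw strings
NO_DIGITS = r'\w*[0-9]+\w*'
NO_NAMES = r'\w*[A-Z]+\w*'
SPACES = r'\s+'

def clean(line):
    """Drop words containing a digit or an uppercase letter."""
    line = re.sub(NO_DIGITS, ' ', line)
    line = re.sub(NO_NAMES, ' ', line)
    return re.sub(SPACES, ' ', line).strip()

def simple_tokenize(line):
    # Hypothetical stand-in for the 'basic_english' tokenizer
    return line.lower().split()

def build_vocab(lines, min_freq):
    """Map each token seen at least min_freq times to an index; 0 is <unk>."""
    counts = Counter(tok for line in lines for tok in simple_tokenize(clean(line)))
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(['<unk>'] + kept)}

# Made-up sample lines, not taken from the dataset
lines = ["Olaf was king in 1015", "the king said I was there", "the men went out"]
cleaned = [clean(line) for line in lines]
vocab_toy = build_vocab(lines, min_freq=2)

print(cleaned)
print(vocab_toy)
```

In this sketch, the name `Olaf`, the number `1015` and the pronoun `I` are all removed before counting, which is why `create_vocabulary` has to append `"i"` by hand.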

Now that the functions are defined we can use them.

In real life, the size of the datasets can be huge and just building the vocabulary can take some time. To avoid having to recompute it each time you need your vocabulary, it is wiser to save it. You can even save the tokenized version of the texts.

We also encourage you to take some time to analyse your vocabulary: the frequency of each word, the size of the vocabulary compared to the number of words seen in the dataset, and much more than what we can see in this notebook.

In [3]:
# ----------------------- Tokenize texts -------------------------------
# Load tokenized versions of texts if you have already generated it
# Otherwise, create it and save it
if os.path.isfile(PATH_GENERATED + "words_train.pt"):
    words_train = torch.load(PATH_GENERATED + "words_train.pt")
    words_val = torch.load(PATH_GENERATED + "words_val.pt")
    words_test = torch.load(PATH_GENERATED + "words_test.pt")
else:
    # Get lists of strings, one for each line in each .txt file in 'datapath'
    lines_books_train = read_files('./data_train/')
    lines_books_val = read_files('./data_val/')
    lines_books_test = read_files('./data_test/')

    # List of words contained in the dataset
    words_train = tokenize(lines_books_train)
    words_val = tokenize(lines_books_val)
    words_test = tokenize(lines_books_test)
    
    # Make sure the output directory exists before saving
    os.makedirs(PATH_GENERATED, exist_ok=True)
    torch.save(words_train , PATH_GENERATED + "words_train.pt")
    torch.save(words_val , PATH_GENERATED + "words_val.pt")
    torch.save(words_test , PATH_GENERATED + "words_test.pt")



# ----------------------- Create vocabulary ----------------------------
VOCAB_FNAME = "vocabulary.pt"
# Load vocabulary if you have already generated it
# Otherwise, create it and save it
if os.path.isfile(PATH_GENERATED + VOCAB_FNAME):
    vocab = torch.load(PATH_GENERATED + VOCAB_FNAME)
else:
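    # Create vocabulary based on the lines of the training dataset.
    # The files are re-read here because 'lines_books_train' is not defined
    # when the tokenized texts were loaded from disk above
    vocab = create_vocabulary(read_files('./data_train/'), min_freq=MIN_FREQ)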
    torch.save(vocab, PATH_GENERATED + VOCAB_FNAME)
    


# ------------------------ Quick analysis ------------------------------
VOCAB_SIZE = len(vocab)
print("Total number of words in the training dataset:     ", len(words_train))
print("Total number of words in the validation dataset:   ", len(words_val))
print("Total number of words in the test dataset:         ", len(words_test))
print("Number of distinct words in the training dataset:  ", len(set(words_train)))
print("Number of distinct words kept (vocabulary size):   ", VOCAB_SIZE)

freqs = count_freqs(words_train, vocab)
print("occurrences:\n", [(f.item(), w) for (f, w) in zip(freqs, vocab.lookup_tokens(range(VOCAB_SIZE)))])

Total number of words in the training dataset:      347870
Total number of words in the validation dataset:    49526
Total number of words in the test dataset:          124152
Number of distinct words in the training dataset:   11161
Number of distinct words kept (vocabulary size):    324
occurrences:
 [(88495, '<unk>'), (23451, ','), (20640, 'the'), (14589, 'and'), (13686, '.'), (9482, 'to'), (7226, 'of'), (5560, 'in'), (4799, 'was'), (5196, 'a'), (5274, 'he'), (4490, 'his'), (3729, 'that'), (6413, 'king'), (3262, 'with'), (2941, 'had'), (3110, 's'), (2661, 'him'), (2705, 'it'), (2619, 'they'), (1994, 'for'), (1969, 'all'), (1937, 'as'), (1763, 'men'), (1786, 'on'), (1620, 'were'), (1925, 'but'), (1569, 'who'), (1583, 'from'), (1433, 'be'), (1501, 'at'), (1323, 'not'), (1519, 'this'), (1313, 'them'), (1334, 'people'), (1318, 'their'), (1253, 'which'), (1225, 'came'), (1483, 'there'), (1204, 'is'), (1412, 'so'), (1116, 'went'), (1087, 'great'), (1583, 'when'), (931, 'out'), (918, 'said'
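
The figures printed above already allow a first, rough analysis. The sketch below simply copies them into hypothetical variables and computes two ratios. Note that the vocabulary size of 324 includes the `<unk>` token and the manually appended `"i"`, so the second ratio is only approximate.

```python
# Figures copied from the printed output above
n_train_words = 347870   # total number of tokens in the training set
n_distinct = 11161       # number of distinct tokens in the training set
vocab_size = 324         # vocabulary size obtained with MIN_FREQ = 100
n_unk = 88495            # training tokens mapped to <unk>

unk_share = n_unk / n_train_words
kept_share = vocab_size / n_distinct

print(f"Share of training tokens mapped to <unk>: {unk_share:.1%}")
print(f"Share of distinct tokens kept in the vocabulary: {kept_share:.1%}")
```

About a quarter of the training tokens end up as `<unk>`, even though fewer than 3% of the distinct tokens are kept. This is worth keeping in mind when interpreting the accuracy of any model trained on this data.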

## 2. Creating a "context/target" dataset

Creating a context/target dataset is a key part of any machine learning task involving text, and it has to be adapted to each task.

In this notebook, we define a ``(contexts, targets)`` toy dataset such that for each ``(c, t)`` context/target pair in the dataset,
- ``t = 0`` if the next word after the sequence ``c`` is the ``<unk>`` token
- ``t = 1`` if the next word after the sequence ``c`` is a punctuation symbol ``[',', '.', '(', ')', '?', '!']``
- ``t = 2`` if the next word after the sequence ``c`` is an actual word and present in our vocabulary
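
To make this labelling rule concrete, here is a minimal sketch on a made-up sentence, in plain Python. `label_of`, `make_pairs` and the toy vocabulary `known` are hypothetical names used only for illustration. Unlike in the notebook, contexts are kept as lists of strings rather than vocabulary indices, to keep the result readable.

```python
PUNCTUATION = [',', '.', '(', ')', '?', '!']

def label_of(token, known_tokens):
    """0 for an out-of-vocabulary token, 1 for punctuation, 2 for a known word."""
    if token not in known_tokens:
        return 0
    if token in PUNCTUATION:
        return 1
    return 2

def make_pairs(tokens, known_tokens, context_size=3):
    """Slide a window over the tokens and label the token that follows it."""
    pairs = []
    for i in range(len(tokens) - context_size):
        context = tokens[i:i + context_size]
        target = label_of(tokens[i + context_size], known_tokens)
        pairs.append((context, target))
    return pairs

# Hypothetical toy vocabulary and made-up token sequence
known = {'the', 'king', 'went', 'out', ',', '.'}
tokens = ['the', 'king', 'went', 'out', ',', 'harald', 'the']

pairs = make_pairs(tokens, known)
for context, target in pairs:
    print(context, '->', target)
```

With a context size of 3, the seven tokens give four pairs, whose targets are 2 (`out`), 1 (`,`), 0 (`harald`, which is not in the toy vocabulary) and 2 (`the`).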

For more realistic tasks, the function that creates the dataset can be much more complex. A non-exhaustive list of possible adaptations:

- having the context *around* the target and not just before it
- ignoring some context/target pairs when the target is not interesting
- ignoring some context/target pairs when there are too many `<unk>` tokens in the context
- ignoring some context/target pairs when the target is already too frequent in the dataset
- etc.
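
Some of these adaptations can be sketched as follows, in plain Python. This is only one possible variant under stated assumptions: `make_centered_pairs`, `half_window` and `max_unk` are hypothetical names, the target is the word itself rather than a class, and the token sequence is made up.

```python
UNK = '<unk>'

def make_centered_pairs(tokens, known_words, half_window=2, max_unk=1):
    """
    Build (context, target) pairs where the context surrounds the target.

    A pair is skipped when the target is unknown, or when the context
    holds more than max_unk unknown tokens.
    """
    # Replace out-of-vocabulary tokens by the <unk> marker
    mapped = [t if t in known_words else UNK for t in tokens]
    pairs = []
    for i in range(half_window, len(mapped) - half_window):
        target = mapped[i]
        # half_window tokens before the target, half_window tokens after it
        context = mapped[i - half_window:i] + mapped[i + 1:i + 1 + half_window]
        if target == UNK:
            continue  # target considered uninteresting
        if context.count(UNK) > max_unk:
            continue  # too many unknown tokens in the context
        pairs.append((context, target))
    return pairs

# Hypothetical toy vocabulary and made-up token sequence
known = {'the', 'king', 'went', 'out', 'with', 'his', 'men'}
tokens = ['the', 'king', 'harald', 'went', 'out',
          'with', 'his', 'brave', 'bold', 'men']

pairs = make_centered_pairs(tokens, known)
for context, target in pairs:
    print(context, '->', target)
```

Here only three pairs survive, with targets `went`, `out` and `with`. The candidates centered on `harald` and `brave` are dropped because the target is unknown, and the one centered on `his` is dropped because its context contains two `<unk>` tokens.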

**Make sure you understand this function, as you will have to implement a variant of it whenever you have a task involving text in pytorch.**

In [4]:
# ------------------------ Define targets ------------------------------
def compute_label(w):
    """
    helper function to define MAP_TARGET
    
    - 0 = 'unknown word' (i.e. the '<unk>' token)
    - 1 = 'punctuation'
    - 2 = 'is an actual word'
    """
    if w in ['<unk>']:
        return 0
    elif w in [',', '.', '(', ')', '?', '!']:
        return 1
    else:
        return 2

# true labels for this task:
MAP_TARGET = {
    vocab[w]:compute_label(w) for w in vocab.lookup_tokens(range(VOCAB_SIZE))
}

# context size for this task 
CONTEXT_SIZE = 3


# ---------------- Define context / target pairs -----------------------
def create_dataset(
    text, vocab, 
    context_size=CONTEXT_SIZE, map_target=MAP_TARGET
):
    """
    Create a pytorch dataset of context / target pairs from a text
    """
    
    n_text = len(text)
    n_vocab = len(vocab)
    
    # Use the given target mapping if there is one; otherwise, each word is
    # labelled with its own index in the vocabulary
    if map_target is None:
        map_target = {i:i for i in range(n_vocab)}
    
    # Transform the text as a list of integers.
    txt = [vocab[w] for w in text]

    # Start constructing the context / target pairs...
    contexts = []
    targets = []
    for i in range(n_text - context_size):
        
        # Word used to define target
        t = txt[i + context_size]
        
        # Context before the target
        c = txt[i:i + context_size]
        
        targets.append(map_target[t])
        contexts.append(torch.tensor(c))
            
    # contexts of shape (N_dataset, context_size)
    # targets of shape  (N_dataset)
    contexts = torch.stack(contexts)
    targets = torch.tensor(targets)
    # Create a pytorch dataset out of these context / target pairs
    return TensorDataset(contexts, targets)

Once again, now that the functions are defined we can use them.

And here again, it is wiser to save your datasets as they can also be quite costly to create. 

In [5]:
def load_dataset(words, vocab, fname):
    """
    Load dataset if already generated, otherwise, create it and save it
    """
    # If already generated
    if os.path.isfile(PATH_GENERATED + fname):
        dataset = torch.load(PATH_GENERATED + fname)
    else:
        # Create context / target dataset based on the list of strings
        dataset = create_dataset(words, vocab)
        torch.save(dataset, PATH_GENERATED + fname)
    return dataset

data_train = load_dataset(words_train, vocab, "data_train.pt")
data_val = load_dataset(words_val, vocab, "data_val.pt")
data_test = load_dataset(words_test, vocab, "data_test.pt")

## 3. Using a pretrained embedding in a model

You don't have to understand what a word embedding is for now, only to understand how to use a pretrained word embedding inside a neural network, and how to "freeze" it so that it doesn't get updated while the rest of the network is being trained.

In [6]:
class MyMLP(nn.Module):
    
    def __init__(self, embedding=None, context_size=CONTEXT_SIZE):
        super().__init__()
        
        (vocab_size, embedding_dim) = embedding.weight.shape
        # Instantiate an embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Load the pretrained weights
        self.embedding.load_state_dict(embedding.state_dict())
        # Freeze the layer
        for p in self.embedding.parameters():
            p.requires_grad = False
            
        # Regular MLP
        self.fc1 = nn.Linear(embedding_dim*context_size, 128)
        self.fc2 = nn.Linear(128, 3)

    def forward(self, x):
        # x is of shape (N, context_size) and contains integer token indices.
        # Looking an index up in the embedding is equivalent to multiplying
        # its one-hot encoding, of size vocab_size, by the embedding matrix
        out = self.embedding(x)
        # out is now of shape (N, context_size, embedding_dim)
        
        # Flatten to (N, context_size*embedding_dim) before the first layer
        out = F.relu(self.fc1(torch.flatten(out, 1)))
        # out is now of shape (N, 128)
        
        out = self.fc2(out)
        return out
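
As an alternative to copying the state dict and freezing the parameters by hand, `nn.Embedding.from_pretrained` builds a frozen embedding layer directly from a weight tensor. The sketch below is not the approach used in `MyMLP`: the random `pretrained` tensor and the single linear `head` are hypothetical stand-ins. It checks that a training step leaves the frozen weights untouched while the rest of the network is updated.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical "pretrained" weights: 10 tokens, 4 dimensions
pretrained = torch.randn(10, 4)

# freeze=True sets requires_grad=False on the embedding weight
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
head = nn.Linear(3 * 4, 3)  # context_size=3, three target classes

# Hand only the trainable parameters to the optimizer
params = [p for p in list(embedding.parameters()) + list(head.parameters())
          if p.requires_grad]
optimizer = optim.SGD(params, lr=0.1)

x = torch.randint(0, 10, (8, 3))  # batch of 8 contexts of 3 token indices
y = torch.randint(0, 3, (8,))     # batch of 8 class labels
emb_before = embedding.weight.clone()
head_before = head.weight.clone()

# One training step
logits = head(torch.flatten(embedding(x), 1))
loss = nn.functional.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

embedding_unchanged = torch.equal(emb_before, embedding.weight)
head_changed = not torch.equal(head_before, head.weight)
print("Embedding unchanged:", embedding_unchanged)
print("Linear layer updated:", head_changed)
```

Filtering on `requires_grad` is optional here: a frozen parameter never receives a gradient, so an optimizer built from `model.parameters()` simply leaves it untouched.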

## 4. Predict next word's class

We now train our dummy model for our dummy task and we don't forget to save it! For more information on saving pytorch objects, see the [documentation](https://pytorch.org/tutorials/beginner/basics/saveloadrun_tutorial.html) 
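
The cell below saves the whole model object. Another common option is to save only the model's `state_dict` and to load it into a freshly built instance. Here is a minimal sketch of that pattern, where `TinyNet` and the temporary file name are hypothetical stand-ins for the model and paths of this notebook.

```python
import os
import tempfile

import torch
from torch import nn

class TinyNet(nn.Module):
    # Hypothetical stand-in for the model of this notebook
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(x)

model = TinyNet()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_state.pt")
    # Save only the parameters, not the pickled Python object
    torch.save(model.state_dict(), path)

    # Rebuild the architecture, then load the saved parameters into it
    restored = TinyNet()
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
x = torch.ones(2, 4)
same_output = torch.equal(model(x), restored(x))
print("Same output after reloading:", same_output)
```

Depending on your PyTorch version, `torch.load` may default to `weights_only=True`. In that case loading a state dict works as shown, but loading a whole pickled object (a model, a dataset or a vocabulary) may require passing `weights_only=False`, which should only be done with files you trust.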

**Note that this is a dummy task, with many arbitrary choices such as the performance measure and without any proper model selection / evaluation nor analysis of the results. In a normal situation, much more would be expected there.**

In [7]:
torch.manual_seed(seed)
device = set_device()

# Load the pretrained embedding 
if os.path.isfile("embedding.pt"):
    embedding = torch.load("embedding.pt").to(device=device)
else:
    raise ValueError("Embedding not found at the given location")

MODEL_FNAME = "model.pt"

batch_size=512
train_loader = DataLoader(data_train, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(data_val, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(data_test, batch_size=batch_size, shuffle=True)

model = MyMLP(embedding)

if os.path.isfile(PATH_GENERATED + MODEL_FNAME):
    # Load the trained model
    model = torch.load(PATH_GENERATED + MODEL_FNAME)
    model.to(device)
else:
    # Or train the model...
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    n_epochs=30

    train(n_epochs, optimizer, model, loss_fn, train_loader)
    # ... and save it
    torch.save(model.to(device="cpu"), PATH_GENERATED + MODEL_FNAME)

acc_train = compute_accuracy(model, train_loader)
acc_val = compute_accuracy(model, val_loader)
print("Training Accuracy:     %.4f" %acc_train)
print("Validation Accuracy:   %.4f" %acc_val)

On device cpu.
On device cpu.
09:02:21.082854  |  Epoch 1  |  Training loss 0.81438
09:02:32.414066  |  Epoch 5  |  Training loss 0.77512
09:02:46.950664  |  Epoch 10  |  Training loss 0.76392
09:03:03.048266  |  Epoch 15  |  Training loss 0.75964
09:03:19.843292  |  Epoch 20  |  Training loss 0.75685
09:03:36.346039  |  Epoch 25  |  Training loss 0.75430
09:03:52.637025  |  Epoch 30  |  Training loss 0.75276
On device cpu.
On device cpu.
Training Accuracy:     0.6534
Validation Accuracy:   0.6160
