In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-21
# GitHub: https://github.com/jaaack-wang 

## Quick start

With the wrapped-up functions that we will gradually build throughout this tutorial, preprocessing the text data into a form that is ready for model training can be as simple as the following. Does it really work? Let's explore!

In [2]:
from utils import load_dataset, gather_text
from pytorch_utils import * 

train_set = load_dataset('train.tsv')

text = gather_text(train_set)
V = TextVectorizer()
V.build_vocab(text)

batchify_fn = get_batchify_fn(V, include_seq_len=False)
train_loader = create_dataloader(train_set, batchify_fn)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/w9/d_nplhzj4qx35xxlgljgdtjh0000gn/T/jieba.cache
Loading model cost 0.649 seconds.
Prefix dict has been built successfully.


Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.


In [3]:
for example in train_loader:
    print(example)
    break

(tensor([[1524,  107,  385,  ...,    0,    0,    0],
        [ 141,  191,   68,  ...,    0,    0,    0],
        [1034,    8,   25,  ...,    0,    0,    0],
        ...,
        [  76,  391,  109,  ...,    0,    0,    0],
        [ 293,   13, 1054,  ...,    0,    0,    0],
        [1247,   41,   46,  ...,    0,    0,    0]]), tensor([0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
        1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1,
        0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]))


## Overview

In this tutorial, we will use functions from `pytorch` to help us preprocess and numericalize datasets. Because these functions are native to `pytorch`, they work smoothly when training models built with `pytorch`, especially when the datasets are large. You will also need [`torchtext`](https://github.com/pytorch/text), an NLP package designed by the `pytorch` team, to get everything going. To install it, simply run `pip3 install torchtext` on the command line.

If you need more intuition about the ins and outs of this process, please refer to `2 - preprocess_data.ipynb` in the same folder.

Below is the structure of this tutorial:

- [Load dataset](#1)
- [Create vocab_to_idx mapping dictionary](#2)
- [Text encoder](#3)
- [Creating dataloader](#5)
    - [Transform the dataset into `Dataset` class](#5-1)
    - [Building a batchify method](#5-3)
    - [Now the dataloader](#5-4)
- [A quick test](#6)
- [Wrapped up functions](#7)
    - [TextVectorizer](#7-1)
    - [Get batchify_fn](#7-4)
    - [Create dataloader](#7-5)
- [More thorough tests](#8)
    - [Initializations](#8-1)
    - [Test One: CNN](#8-2)
    - [Test Two: RNN](#8-3)

<a name="1"></a>
## Load dataset

As usual, let's first use the `load_dataset` function compiled in the last two tutorials to load the datasets.

In [4]:
from utils import load_dataset

train_set = load_dataset('train.tsv')

# check the number of training examples (recall `1 - get_data.ipynb`)
len(train_set)

4000

<a name="2"></a>

## Create `vocab_to_idx` mapping dictionary

The purpose of creating a `vocab_to_idx` mapping dictionary is to encode (numericalize) the text data later for model training. In the `2.1 - wrapped_up_data_preprocessor` tutorial, we learned how to use `TextVectorizer` to do this job conveniently.

In this tutorial, we will use `torchtext.vocab.build_vocab_from_iterator` to do a similar job. For the tokenizer, we will still use `jieba` to tokenize Chinese. Alternatively, you can just use the `str.split` method to tokenize English or use the `tokenize` function from `utils.py`.

`torchtext.vocab.build_vocab_from_iterator` is admittedly a long path to import, and the essential functions that come with it can also be found in `TextVectorizer` under more intuitive names, but we will deal with it.
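
To make the idea of a mapping dictionary concrete, here is a minimal pure-Python sketch of a frequency-ordered vocabulary with special tokens placed first. It is only an illustration and not how `torchtext` implements `build_vocab_from_iterator`; the helper name `build_toy_vocab` and the toy sentences are assumptions made up for this example.

```python
from collections import Counter


def build_toy_vocab(tokenized_texts, specials=("[PAD]", "[UNK]")):
    """Hypothetical helper: map tokens to indices, special tokens first."""
    counts = Counter(tk for tokens in tokenized_texts for tk in tokens)
    # sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda tk: (-counts[tk], tk))
    idx_to_vocab = list(specials) + ordered
    vocab_to_idx = {tk: idx for idx, tk in enumerate(idx_to_vocab)}
    return vocab_to_idx, idx_to_vocab


toy_tokens = [["good", "hotel"], ["bad", "hotel"], ["good", "food"]]
vocab_to_idx, idx_to_vocab = build_toy_vocab(toy_tokens)

# unseen tokens fall back to the index of "[UNK]"
unk = vocab_to_idx["[UNK]"]
encoded = [vocab_to_idx.get(tk, unk) for tk in ["good", "pool"]]
print(vocab_to_idx)
print(encoded)
```

The order of the special tokens decides their indices, which is why `[PAD]` ends up at 0 and `[UNK]` at 1 in this sketch.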

In [5]:
import jieba
from torchtext.vocab import build_vocab_from_iterator

In [6]:
# First, we need a tokenize func
tokenize = jieba.lcut

# Then we need a list of tokenized texts
from utils import gather_text
text = gather_text(train_set) # ---> gather text from the train_set
tokens = list(map(tokenize, text)) # ---> a list of tokenized texts ([[w1, w2...], [w1, w2...]...])

# build the vocabulary which will give us the mapping dictionaries for encoding
# the order of the inputs for the "specials" matters. The first item will be indexed as 0, the second 1, and so on..
V = build_vocab_from_iterator(tokens, specials=['[PAD]', '[UNK]'])
V.set_default_index(V['[UNK]']) # ---> This must be set to represent all unseen tokens that may occur

In [7]:
# check

tmp = "{:20}{}"
print("The first 10 examples from the V.get_itos() LIST\n")

for idx, tk in enumerate(V.get_itos()[:10]): # "itos" --> index to str, a list of str is output
    print(tmp.format(tk, str(idx)))
    
    
print("\n\nThe first 10 examples from the V.get_stoi() DICTIONARY\n")

for tk, idx in list(V.get_stoi().items())[:10]: # "stoi" --> str to idx, a dictionary (str:idx) is output
    print(tmp.format(str(idx), tk))

The first 10 examples from the V.get_itos() LIST

[PAD]               0
[UNK]               1
，                   2
的                   3
。                   4
了                   5
                    6
是                   7
我                   8
,                   9


The first 10 examples from the V.get_stoi() DICTIONARY

22092               Ｋ
22091               Ｂ
22090               Ａ
22089               ＠
22086               龙应台
22085               龙城
22084               龙之梦
22082               龌龊
22079               鼻炎
22078               鼻涕


In [8]:
# To look up the indices of one or more tokens, use V.lookup_indices

today_idx = V.lookup_indices(['今天']) # the input must be in a list!
print("Index for 今天:", today_idx[0]) # the output is also a list!

unk_idx = V.lookup_indices(['This_Word_Does_Not_Exist'])
print("Index for \033[1mThis_Word_Does_Not_Exist\033[0m:", unk_idx[0])

Index for 今天: 640
Index for This_Word_Does_Not_Exist: 1


<a name="3"></a>
## Text encoder

With this `V.lookup_indices` method, we do not have to write a for loop ourselves.

In [9]:
def text_encoder(text, 
                 tokenize=tokenize, 
                 idx_lookup=V.lookup_indices):
    
    tokens = tokenize(text)
    return idx_lookup(tokens)

In [10]:
# check

print("Original text:", text[0])
print("Encoded text:", text_encoder(text[0]))

Original text: 选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般
Encoded text: [189, 9545, 1205, 3, 429, 37, 99, 2, 17, 18740, 16391, 348, 917, 1944, 2, 606, 2482, 27, 21892, 27, 2185, 27, 1209, 27, 16748, 5476, 4, 13, 291, 76, 2, 44, 21, 260, 1071, 4, 6, 2870, 14, 284, 3, 4818, 2, 1102, 427, 2, 96, 399, 497, 7, 59, 4, 6, 533, 3, 108, 7, 2920, 3, 2, 21, 260, 472, 4, 6, 43, 320, 2, 76]


<a name="5"></a>
## Creating dataloader

Now comes the most important part! **I figure that detailed explanations alone may not help you understand what is shown below; you may need to practice again and again, and compare with what we did previously, to build a solid intuition.** Let's simply treat a dataloader as a black box. All you need to know is what goes in and what comes out. Here are the key points:

- A dataloader is an iterable that works efficiently with models built in a deep learning framework, especially when training on GPUs, because it can load data asynchronously.


- Besides the dataset, which must be passed, a dataloader usually takes parameters such as a `sampler`, which yields the indices of the examples to draw from the dataset, and a `collate_fn`, which further preprocesses each batch of examples. Instead of passing a `sampler`, we can simply specify a `batch_size`. We will do the latter here. For more about dataloaders, please refer to [the documentation](https://pytorch.org/docs/stable/data.html).


- The dataset must be either a `Dataset` (map-style dataset) or an `IterableDataset` (iterable-style dataset) for everything to work.


Enough words. Let's see how this can be done.

<a name="5-1"></a>
### Transform the dataset into `Dataset` class

A key property of a map-style `Dataset` is that it can be iterated over with a for loop and accessed by index (just like a list!).

Here, we will use `torchtext.data.functional.to_map_style_dataset` to do the transformation. I know the import path sucks!
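
As a rough illustration of what "map-style" means, the sketch below implements only the two methods such a dataset is expected to provide, `__getitem__` and `__len__`. The class name `ToyMapDataset` and the two examples are made up for this illustration; unlike the object returned by `to_map_style_dataset`, this toy class does not subclass `torch.utils.data.Dataset`, so an `isinstance` check against `Dataset` would fail for it.

```python
class ToyMapDataset:
    """Hypothetical wrapper giving a list of examples the map-style interface."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def __getitem__(self, idx):
        return self._examples[idx]


toy = ToyMapDataset([("great room", 1), ("noisy street", 0)])
print(len(toy))   # number of examples
print(toy[0])     # access by index

# a for loop also works: Python falls back to __getitem__ with 0, 1, 2, ...
for text, label in toy:
    print(text, label)
```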

In [11]:
import torch
from torchtext.data.functional import to_map_style_dataset

train = to_map_style_dataset(train_set)

print("Type of train", type(train))
print("Is train's type a Dataset?", isinstance(train, torch.utils.data.Dataset))

Type of train <class 'torchtext.data.functional.to_map_style_dataset.<locals>._MapStyleDataset'>
Is train's type a Dataset? True


<a name="5-3"></a>
### Building a `batchify` method 

The purpose of the `batchify` method is to further preprocess the batched dataset so that it can be used for model training. More concretely, as the batched dataset is still raw text plus label in our case, we will need to do the following:

- first, we need to encode or numericalize the text data using the `text_encoder` we built above;
- second, we need to make sure that the text ids (numericalized text) within a batch and of the same kind (text_a versus text_b) have the same length/dimension (aligned with the max length in a batch or a `max_seq_len`);
- then, we want every batched element (e.g., text_a, text_b, label) to be separated;
- finally, for RNN models, we need to ensure that the outputs also include the "text_seq_len" info.

This built `batchify` method will be passed to the `collate_fn` argument in the dataloader. 
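
The padding and trimming step can be sketched in pure Python before looking at the full function. This is a simplified illustration under the assumption that the texts are already encoded as lists of ids; the helper name `pad_batch` and the toy ids are hypothetical and not part of the tutorial's code.

```python
def pad_batch(encoded_texts, pad_idx=0, max_seq_len=None):
    """Hypothetical helper: pad or trim every id list to one common length."""
    target = max_seq_len or max(len(ids) for ids in encoded_texts)
    padded, seq_lens = [], []
    for ids in encoded_texts:
        kept = ids[:target]            # trim if longer than the target
        seq_lens.append(len(kept))     # true length after trimming
        padded.append(kept + [pad_idx] * (target - len(kept)))
    return padded, seq_lens


toy_ids = [[5, 8, 2], [7], [4, 9, 3, 6, 1]]

# without max_seq_len, align with the longest text in the batch
padded, seq_lens = pad_batch(toy_ids)
print(padded)
print(seq_lens)

# with max_seq_len, longer texts are cut and their lengths are capped
trimmed, trimmed_lens = pad_batch(toy_ids, max_seq_len=2)
print(trimmed)
print(trimmed_lens)
```

Keeping the sequence lengths alongside the padded ids is what lets an RNN model ignore the padding positions later.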

In [12]:
def batchify_fn(batch, 
                text_encoder=text_encoder, 
                pad_idx=0,
                max_seq_len=None, 
                include_seq_len=False, 
                dtype=torch.int64):
    
    # ----- pad func for a list -----
    def _pad(lst, max_len):
        dif = max_len - len(lst)
        if dif > 0:
            return lst + [pad_idx] * dif
        if dif < 0:
            return lst[:max_len]
        return lst
    
    # ----- pad func for a batch of lists -----
    def pad(data):
        if max_seq_len:
            max_len = max_seq_len
        else:
            max_len = max(len(d) for d in data)
        
        for i, d in enumerate(data):
            data[i] = _pad(d, max_len)
        
        return data
    
    # ----- turn a list of int into tensor -----
    def to_tensor(x):
        return torch.tensor(x, dtype=dtype)
    
    
    # ----- start batchifying -----
    out_t, out_l = [], []
    
    for (text, label) in batch:
        out_t.append(text_encoder(text)) 
        out_l.append(label)
    
    if include_seq_len:
        # if max_seq_len is set, longer texts will be trimmed when padding,
        # hence their text_seq_len is capped at max_seq_len
        if max_seq_len:
            t_len = to_tensor([min(len(t), max_seq_len) for t in out_t])
        else:
            t_len = to_tensor([len(t) for t in out_t])
        
    # pad (or trim) the encoded texts to the same length, then convert to tensors
    out_t = to_tensor(pad(out_t))
    out_l = to_tensor(out_l)
    
    if include_seq_len:
        return out_t, t_len, out_l
    
    return out_t, out_l

In [13]:
# check. Note that the whole train_set is treated as one single batch here.

t, t_l, l = batchify_fn(train_set, include_seq_len=True)
print("Shape of text_ids preprocessed:", t.shape)
print("Shape of text_len preprocessed:", t_l.shape)
print("Shape of labels preprocessed:", l.shape)

Shape of text_ids preprocessed: torch.Size([4000, 928])
Shape of text_len preprocessed: torch.Size([4000])
Shape of labels preprocessed: torch.Size([4000])


<a name="5-4"></a>
### Now the dataloader

We will call `torch.utils.data.DataLoader` and set the `batch_size` as well as whether to shuffle the dataset passed to it. The resulting dataloader is not subscriptable, so batches cannot be retrieved by index; we iterate over it instead.

**Please note that** the `collate_fn` passed to the `DataLoader` is called with a single argument: one batch of examples. Our `batchify_fn` accepts more parameters, but all of them except `batch` have default values, so it still works. Nevertheless, in this scenario, if we want to set `include_seq_len=True` or change other parameters, we have to edit the default values in the function definition every time before creating the `DataLoader`. We will introduce a `get_batchify_fn` method later to eliminate this problem.
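
To open the black box a little, the sketch below mimics the visible behaviour of a dataloader in pure Python: optionally shuffle the indices, cut them into chunks of `batch_size`, and hand each chunk to a collate function that receives only the batch. It is a deliberately simplified stand-in, not PyTorch's implementation (no samplers, worker processes, or asynchronous loading); the names `toy_dataloader` and `split_columns` are hypothetical.

```python
import random


def toy_dataloader(dataset, batch_size, collate_fn, shuffle=False, seed=None):
    """Hypothetical stand-in for a dataloader: yields collated batches."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = [dataset[i] for i in indices[start:start + batch_size]]
        yield collate_fn(batch)   # called with one argument: the batch


def split_columns(batch):
    """Toy collate function: separate the texts from the labels."""
    texts, labels = zip(*batch)
    return list(texts), list(labels)


toy_data = [("text a", 1), ("text b", 0), ("text c", 1),
            ("text d", 0), ("text e", 1)]
batches = list(toy_dataloader(toy_data, batch_size=2, collate_fn=split_columns))
for texts, labels in batches:
    print(texts, labels)
```

Note that the last batch is smaller than `batch_size` when the dataset size is not a multiple of it.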

In [14]:
from torch.utils.data import DataLoader


dataloader = DataLoader(
    train,
    batch_size=64,
    shuffle=True,
    collate_fn=batchify_fn)

In [15]:
# check

for d in dataloader:
    print(d)
    break

(tensor([[   13,   305,    27,  ...,     0,     0,     0],
        [  401,    24,    35,  ...,     0,     0,     0],
        [21366,    60,   950,  ...,   780,    78,    61],
        ...,
        [    8,   153,  1803,  ...,     0,     0,     0],
        [  229,   113,   148,  ...,     0,     0,     0],
        [  750,     7,    43,  ...,     0,     0,     0]]), tensor([1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1,
        1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]))


<a name="6"></a>
## A quick test

It works!

In [16]:
from pytorch_utils import PyTorchUtils
import torch.optim as optim
import torch.nn as nn
from pytorch_models.BoW import BoW


model = BoW(len(V), 2)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=False)
%time PT.train(dataloader, epochs=2)

Epoch 1/2 {'Train loss': '0.92213', 'Train accu': '28.70'}

Epoch 2/2 {'Train loss': '0.63757', 'Train accu': '49.53'}

CPU times: user 7.05 s, sys: 298 ms, total: 7.35 s
Wall time: 5.34 s


<a name="7"></a>
## Wrapped up functions

Before heading to the next section, you can try out the following functions and class methods and see if you can use them to do a quick start yourself!

<a name="7-1"></a>
### TextVectorizer

The following wrapped-up class resembles the one that we built in `2.1 - wrapped_up_data_preprocessor.ipynb`, but here it relies on `torchtext.vocab.build_vocab_from_iterator` as much as possible. If you are interested, you can take a look back at the `TextVectorizer` inside `utils.py` and see if you can add some functions (such as saving the results to a json file for later re-loading).

In [17]:
import jieba
from torchtext.vocab import build_vocab_from_iterator
from collections import defaultdict
from collections.abc import Iterable


class TextVectorizer:
     
    def __init__(self, tokenizer=None):
        self.tokenize = tokenizer if tokenizer else jieba.lcut
        self.vocab_to_idx = {}
        self.idx_to_vocab = {}
        self._V = None
    
    def build_vocab(self, text):
        tokens = list(map(self.tokenize, text))
        
        self._V = build_vocab_from_iterator(tokens, specials=['[PAD]', '[UNK]'])
        for idx, tk in enumerate(self._V.get_itos()):
            self.vocab_to_idx[tk] = idx
            self.idx_to_vocab[idx] = tk
        
        self.vocab_to_idx = defaultdict(lambda: self.vocab_to_idx['[UNK]'], 
                                        self.vocab_to_idx)
        
        print('Two vocabulary dictionaries have been built!\n' \
             + 'Please call \033[1mX.vocab_to_idx | X.idx_to_vocab\033[0m to find out more' \
             + ' where [X] stands for the name you used for this TextVectorizer class.')
        
    def text_encoder(self, text):
        if isinstance(text, list):
            return [self(t) for t in text]
        
        tks = self.tokenize(text)
        out = [self.vocab_to_idx[tk] for tk in tks]
        return out
            
    def text_decoder(self, text_ids, sep=" "):
        if all(isinstance(ids, Iterable) for ids in text_ids):
            return [self.text_decoder(ids, sep) for ids in text_ids]
            
        out = []
        for text_id in text_ids:
            out.append(self.idx_to_vocab[text_id])
            
        return f'{sep}'.join(out)
    
    def __call__(self, text):
        if self.vocab_to_idx:
            return self.text_encoder(text)
        raise ValueError("No vocab is built!")

<a name="7-4"></a>
### Get batchify_fn

We will not change the `batchify_fn` we already built, apart from a slight change of name, but on top of it we will add a function that returns a ready-to-use `batchify_fn` for this series of tutorials!
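
The underlying idea, fixing every argument except the batch so that the result can serve as a `collate_fn`, can be illustrated with a toy function. The sketch uses `functools.partial` from the standard library, which is one possible way to do the binding; the cell below uses a lambda for the same purpose. The function `toy_collate` is a hypothetical stand-in and much simpler than `_batchify_fn`.

```python
from functools import partial


def toy_collate(batch, pad_idx=0, include_seq_len=False):
    """Hypothetical collate function with extra keyword arguments."""
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_idx] * (max_len - len(ids)) for ids in batch]
    if include_seq_len:
        return padded, [len(ids) for ids in batch]
    return padded


# Fix the keyword arguments once; the result takes only `batch`.
collate_with_len = partial(toy_collate, include_seq_len=True)

padded, lens = collate_with_len([[3, 4], [5]])
print(padded)
print(lens)
```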

In [18]:
import torch


def _batchify_fn(batch, 
                 text_encoder, 
                 pad_idx=0,
                 max_seq_len=None, 
                 include_seq_len=False, 
                 dtype=torch.int64):
    
    # ----- pad func for a list -----
    def _pad(lst, max_len):
        dif = max_len - len(lst)
        if dif > 0:
            return lst + [pad_idx] * dif
        if dif < 0:
            return lst[:max_len]
        return lst
    
    # ----- pad func for a batch of lists -----
    def pad(data):
        if max_seq_len:
            max_len = max_seq_len
        else:
            max_len = max(len(d) for d in data)
        
        for i, d in enumerate(data):
            data[i] = _pad(d, max_len)
        
        return data
    
    # ----- turn a list of int into tensor -----
    def to_tensor(x):
        return torch.tensor(x, dtype=dtype)
    
    
    # ----- start batchifying -----
    out_t, out_l = [], []
    
    for (text, label) in batch:
        out_t.append(text_encoder(text)) 
        out_l.append(label)
    
    if include_seq_len:
        # if max_seq_len is set, longer texts will be trimmed when padding,
        # hence their text_seq_len is capped at max_seq_len
        if max_seq_len:
            t_len = to_tensor([min(len(t), max_seq_len) for t in out_t])
        else:
            t_len = to_tensor([len(t) for t in out_t])
        
    # pad (or trim) the encoded texts to the same length, then convert to tensors
    out_t = to_tensor(pad(out_t))
    out_l = to_tensor(out_l)
    
    if include_seq_len:
        return out_t, t_len, out_l
    
    return out_t, out_l


def get_batchify_fn(text_encoder, 
                    pad_idx=0,
                    max_seq_len=None, 
                    include_seq_len=False, 
                    dtype=torch.int64):
    
    return lambda ex: _batchify_fn(ex, text_encoder, 
                                   pad_idx, max_seq_len, 
                                   include_seq_len, dtype)

<a name="7-5"></a>
### Create dataloader 

In [19]:
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import Dataset, DataLoader


def create_dataloader(dataset, 
                      batchify_fn, 
                      batch_size=64, 
                      shuffle=True):
    
    
    if not isinstance(dataset, Dataset):
        dataset = to_map_style_dataset(dataset)
        
    
    dataloader = DataLoader(dataset, 
                            batch_size=batch_size, 
                            shuffle=shuffle,
                            collate_fn=batchify_fn)
    
    return dataloader

<a name="8"></a>
## More thorough tests 

This time, we will include the dev_set for validation and the test_set for evaluation!

<a name="8-1"></a>
### Initializations 

In [20]:
from utils import load_dataset, gather_text

train_set, dev_set, test_set = load_dataset(['train.tsv', 'dev.tsv', 'test.tsv'])

text = gather_text(train_set)
V = TextVectorizer()
V.build_vocab(text)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.


<a name="8-2"></a>
### Test One: CNN

In [21]:
batchify_fn = get_batchify_fn(V, include_seq_len=False)
train_loader = create_dataloader(train_set, batchify_fn)
dev_loader = create_dataloader(dev_set, batchify_fn, shuffle=False)
test_loader = create_dataloader(test_set, batchify_fn, shuffle=False)

In [22]:
from pytorch_models.CNN import CNN

model = CNN(len(V.vocab_to_idx), 2)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=False)
%time PT.train(train_loader, dev_loader, epochs=5)

Epoch 1/5 {'Train loss': '0.68719', 'Train accu': '45.78'}
Validation... {'Dev loss': '0.65926', 'Dev accu': '55.62'}

Epoch 2/5 {'Train loss': '0.58850', 'Train accu': '70.54'}
Validation... {'Dev loss': '0.53728', 'Dev accu': '70.78'}

Epoch 3/5 {'Train loss': '0.45382', 'Train accu': '79.09'}
Validation... {'Dev loss': '0.47383', 'Dev accu': '75.25'}

Epoch 4/5 {'Train loss': '0.32277', 'Train accu': '88.94'}
Validation... {'Dev loss': '0.40815', 'Dev accu': '82.29'}

Epoch 5/5 {'Train loss': '0.19674', 'Train accu': '94.87'}
Validation... {'Dev loss': '0.36995', 'Dev accu': '83.99'}

CPU times: user 1min 22s, sys: 12 s, total: 1min 34s
Wall time: 1min 17s


In [23]:
PT.evaluate(test_loader)

{'Test loss': '0.33586', 'Test accu': '85.72'}

<a name="8-3"></a>
### Test Two: RNN

In [24]:
batchify_fn = get_batchify_fn(V, include_seq_len=True)
train_loader = create_dataloader(train_set, batchify_fn)
dev_loader = create_dataloader(dev_set, batchify_fn, shuffle=False)
test_loader = create_dataloader(test_set, batchify_fn, shuffle=False)

In [25]:
from pytorch_models.S_RNN import SimpleRNN

model = SimpleRNN(len(V.vocab_to_idx), 2)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=True)
%time PT.train(train_loader, dev_loader, epochs=5)

Epoch 1/5 {'Train loss': '0.68665', 'Train accu': '44.05'}
Validation... {'Dev loss': '0.67081', 'Dev accu': '53.87'}

Epoch 2/5 {'Train loss': '0.65342', 'Train accu': '59.90'}
Validation... {'Dev loss': '0.64783', 'Dev accu': '61.24'}

Epoch 3/5 {'Train loss': '0.60613', 'Train accu': '66.99'}
Validation... {'Dev loss': '0.58531', 'Dev accu': '69.98'}

Epoch 4/5 {'Train loss': '0.55404', 'Train accu': '71.38'}
Validation... {'Dev loss': '0.57183', 'Dev accu': '71.03'}

Epoch 5/5 {'Train loss': '0.49778', 'Train accu': '75.72'}
Validation... {'Dev loss': '0.60941', 'Dev accu': '67.79'}

CPU times: user 2min 2s, sys: 20.6 s, total: 2min 23s
Wall time: 1min 15s


In [26]:
PT.evaluate(test_loader)

{'Test loss': '0.61738', 'Test accu': '66.42'}