In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-20, modified on 2022-01-22
# GitHub: https://github.com/jaaack-wang 

## Purposes of data preprocessing

For natural language processing, the ultimate purpose of data preprocessing is to convert text data into numerical data, so that mathematical operations can be run on the input dataset. This typically involves two general steps:

- **Text tokenization**. Please note that by tokenization I do not simply mean splitting a sequence of text into words. The result of tokenization can be anything, depending on what works best for you: whitespace-separated tokens, words, stems (subwords or words), lemmas, or even characters (yes, I do mean characters, such as "a, b, c, d..." for English). Before tokenization, you may need to normalize or standardize the text, again depending on your needs. Text normalization may include lowercasing all characters, Americanizing the spelling, and removing extra spaces or stopwords.


- **Text representation or numericalization**. This step simply encodes the tokenized text into numerical data, where every token (remember that a token can be anything, see above) is mapped to something numerical, either a single number or an array of numbers. For machine learning purposes, we usually convert a token into an array of numbers, and there are two common approaches: one-hot encoding and word embedding. 
    <br><br>
    - **One-hot encoding**. As the name suggests, one-hot encoding turns every token into a sparse array that has as many elements as your vocabulary (unique token) size. Only the element whose index equals the index of the token in your vocabulary is 1; the rest are 0. For example, suppose we have a vocabulary of 10,000 tokens and the token "token" happens to be the 4th word in the vocabulary. By the one-hot encoding convention, "token" will be encoded as $(0, 0, 0, 1, 0, 0, \dots, 0)$, where another 9,996 zeros follow the 1.
    <br><br>
    - **Word embedding**. Word embedding is just a conventional name. In practice, we can embed any token the same way we embed words. The basic idea is to map every word (here "word" means token in the same sense as above) to a much shorter, much denser array. To make the array shorter, we need to use continuous floats instead of discrete integers, so that a small difference in the floats can mean something. More about word embedding can be found in the `README.md` file, where I explain the general architecture of deep learning models used for text classification. 
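To make the contrast between the two representations concrete, here is a minimal sketch with a hypothetical five-token vocabulary. The names `toy_vocab`, `one_hot`, and `embedding_table` are illustrative only, and the embedding values are random rather than learned.

```python
import random

# Hypothetical toy vocabulary; a real one would have thousands of tokens.
toy_vocab = ["[PAD]", "[UNK]", "good", "bad", "hotel"]
toy_vocab_to_idx = {tk: idx for idx, tk in enumerate(toy_vocab)}


def one_hot(token):
    """Return len(toy_vocab) zeros with a single 1 at the token's index."""
    vec = [0] * len(toy_vocab)
    vec[toy_vocab_to_idx[token]] = 1
    return vec


# An embedding table maps every index to a short dense vector of floats.
# Here the values are random; in a model they would be learned.
rng = random.Random(0)
embedding_dim = 3
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in toy_vocab]

print(one_hot("bad"))                            # [0, 0, 0, 1, 0]
print(embedding_table[toy_vocab_to_idx["bad"]])  # a list of three floats
```

The one-hot array grows with the vocabulary, while the embedding keeps a fixed, much smaller length (`embedding_dim`) no matter how many tokens there are.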

## What to do in the tutorial 

In this tutorial, we will apply both text tokenization and text numericalization to the corpus we have compiled (see `1 - get_data.ipynb`). In addition, we will learn to build mini batches on top of the numericalized text data, which is a common practice in deep learning applications. 

Although most deep learning frameworks provide handy functions for text data processing that work more efficiently when you train models with those frameworks, it is important to understand the whole process so that you are not confused about what those functions do for you. Therefore, we will do everything from scratch here.  

Specifically, we will 

- [first load the train set](#1)
- [tokenize the text in the train set](#2)
- [create dictionaries to map tokens to indices and vice versa](#3)
- [numericalize the text data based on the mapping dictionaries](#4)
- [and finally build mini batches for the numericalized text data](#5)

Let's do it!

<a name='1'></a>
## Load the train set

In [2]:
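def load_dataset(fpath, num_row_to_skip=1):
    # the file is assumed to be UTF-8 encoded (the corpus is Chinese text);
    # "with" closes the file once the generator is exhausted
    with open(fpath, encoding='utf-8') as data:
        # skip the header row(s)
        for _ in range(num_row_to_skip):
            next(data)
        for line in data:
            # each row is "<label>\t<text>"
            line = line.split('\t')
            yield line[1].rstrip(), int(line[0])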

In [3]:
train_set = list(load_dataset('train.tsv'))

# check
len(train_set), train_set[:3]

(4000,
 [('选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般',
   1),
  ('15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错', 1),
  ('房间太小。其他的都一般。。。。。。。。。', 0)])

<a name='2'></a>
## Tokenize the text in the train set

Unlike English words, Chinese words are written without spaces between them, which makes tokenization more difficult. Here we will use a popular Chinese text segmentation tool, [`jieba`](https://github.com/fxsjy/jieba), to tokenize Chinese.

In [4]:
import jieba

# define tokenize method
tokenize = jieba.lcut

In [5]:
# let's see how it works

text, _ = train_set[0]
print("Original: ", text)
print("Tokenized: ", tokenize(text))

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/w9/d_nplhzj4qx35xxlgljgdtjh0000gn/T/jieba.cache


Original:  选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般


Loading model cost 0.740 seconds.
Prefix dict has been built successfully.


Tokenized:  ['选择', '珠江', '花园', '的', '原因', '就是', '方便', '，', '有', '电动', '扶梯', '直接', '到达', '海边', '，', '周围', '餐馆', '、', '食廊', '、', '商场', '、', '超市', '、', '摊位', '一应俱全', '。', '酒店', '装修', '一般', '，', '但', '还', '算', '整洁', '。', ' ', '泳池', '在', '大堂', '的', '屋顶', '，', '因此', '很小', '，', '不过', '女儿', '倒', '是', '喜欢', '。', ' ', '包', '的', '早餐', '是', '西式', '的', '，', '还', '算', '丰富', '。', ' ', '服务', '吗', '，', '一般']
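A word segmenter is only one choice. As noted in the introduction, tokens can also be single characters, and the text can be normalized first. The following is a minimal sketch of that alternative; `normalize` and `char_tokenize` are hypothetical helpers that are not used in the rest of this tutorial.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace into one space."""
    return " ".join(text.lower().split())


def char_tokenize(text):
    """Split the normalized text into single characters, dropping spaces."""
    return [ch for ch in normalize(text) if ch != " "]


print(char_tokenize("房间  太小 OK"))
# ['房', '间', '太', '小', 'o', 'k']
```

A character-based vocabulary is much smaller than a word-based one, but if you switch tokenizers, the vocabulary must be rebuilt with the same tokenizer.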


<a name="3"></a>
## Create dictionaries to map tokens to indices and vice versa

This includes a dictionary that maps tokens to indices (encoding) and a dictionary that maps indices to tokens (decoding). Creating such dictionaries is not hard. All we need is a vocabulary list (unique tokens), from which we create a dictionary that assigns every token a unique number. Then we can reverse the vocabulary dictionary to get another dictionary where the number is the key and the token is the value. 

There are two ways to get a vocabulary list: either create one internally from the text in the train set (but not the dev set or the test set, because we want to leave them untouched), or obtain one externally. If you obtain a vocabulary list externally, make sure it is compatible with the way you tokenize text. Otherwise, most tokens produced by your tokenizer will not be found in the vocabulary list, which makes the mapping far less meaningful (for example, when the vocabulary list is based on words but your tokenizer splits texts into characters).

Additionally, unless the vocabulary is finite and can be exhaustively listed (e.g., character-based), we need a special token to which all unseen tokens, which may occur from time to time, are mapped. Usually, we denote this token as `[UNK]` or `<UNK>`. 

Furthermore, text length varies from text to text, whereas deep learning models prefer text sequences of the same length, because the underlying mathematical (specifically matrix) operations are particular about the dimensions of the data. For example, if two matrices have different dimensions, we cannot do element-wise addition or subtraction (in programming, however, there are certain cases where [`broadcasting`](https://numpy.org/doc/stable/user/basics.broadcasting.html) is allowed). More important is matrix multiplication, where two matrices must have compatible (not identical) dimensions for the multiplication to be carried out. A way to get around this is to pad all texts to a common length, so that every text (across the whole dataset or within a batch) is represented by an array of equal length. Typically, we use `[PAD]` or `<PAD>` to denote padded positions. 
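To illustrate what "compatible (not identical) dimensions" means, here is a small pure-Python sketch. The `matmul` helper is hypothetical and written only for this illustration; in practice a library such as numpy would do this.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    # the number of columns of a must equal the number of rows of b
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


a = [[1, 2, 3],
     [4, 5, 6]]          # shape (2, 3)
b = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]       # shape (3, 4)

print(matmul(a, b))      # shape (2, 4): [[1, 2, 3, 6], [4, 5, 6, 15]]

try:
    matmul(a, a)         # (2, 3) x (2, 3): inner dimensions 3 and 2 differ
except ValueError as err:
    print("Error:", err)
```

A batch of texts with unequal lengths cannot even be written as such a rectangular matrix, which is why padding is needed before any of these operations.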

### All tokens in the train set

To create a vocabulary dictionary, we need a list of unique tokens. Let's gather all tokens in the train set first.

In [6]:
tokens = []

for (text, _) in train_set:
    tokens += tokenize(text)

# check
len(tokens), tokens[:5]

(274665, ['选择', '珠江', '花园', '的', '原因'])

### Unique tokens

Then we need to gather all the unique tokens. They can be ordered either arbitrarily or by frequency of occurrence (typically in descending order). It is hard to tell which way is better (the difference is probably small), but when the vocabulary is overwhelmingly large, it is easier to leave out low-frequency tokens if the tokens are already sorted by frequency. A very large vocabulary requires additional computational resources, which we may sometimes want to avoid. 

However, in this tutorial, we will keep all the tokens in the vocabulary as there are only 22095 unique tokens from the train set.
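For reference, leaving out low-frequency tokens could look like the sketch below. The `build_vocab` helper, its `min_freq` parameter, and the toy token list are assumptions made for illustration; they are not part of this tutorial's pipeline.

```python
from collections import Counter

# Hypothetical toy token list standing in for the real `tokens` list.
toy_tokens = ["的", "好", "的", "房间", "好", "的", "泳池"]


def build_vocab(tokens, min_freq=1):
    """Return unique tokens in descending frequency, keeping only
    those that occur at least min_freq times."""
    counted = Counter(tokens).most_common()
    return [tk for tk, count in counted if count >= min_freq]


print(build_vocab(toy_tokens))              # ['的', '好', '房间', '泳池']
print(build_vocab(toy_tokens, min_freq=2))  # ['的', '好']
```

Tokens that are dropped this way are later treated like any other unseen token.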

In [7]:
# arbitrary order (a set is unordered)
unique_tokens = list(set(tokens))
print("Number of unique tokens:", len(unique_tokens))

Number of unique tokens: 22095


In [8]:
# based on occurrences
from collections import Counter

unique_tks_counted = Counter(tokens).most_common()
print("Number of unique tokens:", len(unique_tks_counted))

print("\nThe 10 most frequent tokens in the train set and their occurrences:\n")
tmp = "{:10}{}"
for tk, count in unique_tks_counted[:10]:
    print(tmp.format(tk, count))

Number of unique tokens: 22095

The 10 most frequent tokens in the train set and their occurrences:

，         22012
的         15152
。         8613
了         5654
          4092
是         3959
我         3445
,         3416
很         2536
！         2247


### Finalize the vocabulary list

You can add the two special tokens `[PAD]` and `[UNK]` here or later, depending on your preferences. In practice, the `[PAD]` token is typically assigned 0 and the `[UNK]` token is assigned 1. Hence, they are placed as the first two items in the vocabulary list. 

Similarly, it is up to you to decide how to order the remaining tokens, either arbitrarily or by frequency of occurrence. 

In [9]:
vocab = ['[PAD]', '[UNK]'] + [tk for (tk, _) in unique_tks_counted]

# check
len(vocab), vocab[:10]

(22097, ['[PAD]', '[UNK]', '，', '的', '。', '了', ' ', '是', '我', ','])

### `vocab_to_idx` and `idx_to_vocab` dictionaries

In [10]:
vocab_to_idx = {tk: idx for idx, tk in enumerate(vocab)}
idx_to_vocab = dict(map(reversed, vocab_to_idx.items()))

# check
tmp = "{}\t{}"
print("The first 10 items in the vocab_to_idx dictionary:\n")
for tk, idx in list(vocab_to_idx.items())[:10]:
    print(tmp.format(tk, idx))
    
print("\n\nThe first 10 items in the idx_to_vocab dictionary:\n")
for idx, tk in list(idx_to_vocab.items())[:10]:
    print(tmp.format(str(idx), tk))

The first 10 items in the vocab_to_idx dictionary:

[PAD]	0
[UNK]	1
，	2
的	3
。	4
了	5
 	6
是	7
我	8
,	9


The first 10 items in the idx_to_vocab dictionary:

0	[PAD]
1	[UNK]
2	，
3	的
4	。
5	了
6	 
7	是
8	我
9	,


### A note

One thing to note here is that, since we want to map all unseen tokens to the special token `[UNK]`, the `vocab_to_idx` dictionary needs to treat every out-of-vocabulary token as `[UNK]` and return the associated index (i.e., 1). We can do so either with the dictionary's `get` method or with a `defaultdict`. If you prefer the former, you need to wrap it in a lookup function. 

In this tutorial, we will use the `defaultdict` instead.

In [11]:
# A problem
print("Index for 字:", vocab_to_idx['字']) 
print("Index for nonexistent:", vocab_to_idx['nonexistent'])

Index for 字: 491


KeyError: 'nonexistent'

In [12]:
# get method and a look up function

vocab_to_idx_lookup = lambda key: vocab_to_idx.get(key, 1)

print("Index for 字:", vocab_to_idx_lookup('字')) 
print("Index for nonexistent:", vocab_to_idx_lookup('nonexistent'))

Index for 字: 491
Index for nonexistent: 1


In [13]:
# defaultdict
from collections import defaultdict

# "lambda: 1" lets the defaultdict return 1 when a key is not in the dict
# (note: looking up a missing key with [] also stores that key in the dict)
vocab_to_idx = defaultdict(lambda: 1, vocab_to_idx)

print("Index for 字:", vocab_to_idx['字']) 
print("Index for nonexistent:", vocab_to_idx['nonexistent'])

Index for 字: 491
Index for nonexistent: 1
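One caveat with `defaultdict` is that every lookup of a missing key with `[]` also stores that key, so the dictionary slowly grows as unseen tokens are encoded. The sketch below shows this on a toy mapping and, as one possible alternative, a hypothetical `UnkDict` class (not part of this tutorial) that relies on `dict.__missing__` and leaves the dictionary unchanged.

```python
from collections import defaultdict

# Hypothetical toy mapping standing in for `vocab_to_idx`.
toy = defaultdict(lambda: 1, {"[PAD]": 0, "[UNK]": 1, "好": 2})

print(len(toy))           # 3
_ = toy["nonexistent"]    # returns 1, but also stores the key
print(len(toy))           # 4


class UnkDict(dict):
    """A dict that returns unk_idx for missing keys without storing them."""

    def __init__(self, *args, unk_idx=1, **kwargs):
        super().__init__(*args, **kwargs)
        self.unk_idx = unk_idx

    def __missing__(self, key):
        # called by dict.__getitem__ when the key is absent
        return self.unk_idx


safe = UnkDict({"[PAD]": 0, "[UNK]": 1, "好": 2})
print(safe["nonexistent"], len(safe))   # 1 3
```

For this tutorial the growth is harmless, because the decoding dictionary `idx_to_vocab` is not affected and all the stored keys map to 1 anyway.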


<a name="4"></a>
## Numericalize the text 

To numericalize the text is to encode it into a list of numbers. We will use the `vocab_to_idx` dictionary to do so. We will also create a decoder, based on the `idx_to_vocab` dictionary, so that we can decode the encoded text back. 

As I mentioned earlier, the vocabulary list must be compatible with how our tokenizer tokenizes text. Only in this way can we use the `vocab_to_idx` and `idx_to_vocab` dictionaries to encode and decode text. 

In [14]:
def text_encoder(text, 
                 tokenize=tokenize, 
                 vocab_to_idx=vocab_to_idx):
    
    tokens = tokenize(text)
    out = []
    for tk in tokens:
        out.append(vocab_to_idx[tk])
    return out


def text_decoder(text_ids, 
                 tokenize=tokenize, 
                 idx_to_vocab=idx_to_vocab, 
                 sep="",
                 out_str=True):
    
    out = []
    for text_id in text_ids:
        out.append(idx_to_vocab[text_id])
    
    if out_str:
        return f"{sep}".join(out)

    return out

In [15]:
# An example

text = "这只是一个简单的例子，看你懂不懂我"
encoded_text = text_encoder(text)
decoded = text_decoder(encoded_text)

# note that punctuation is kept by the tokenizer (the comma is encoded as 2), 
# and "懂不懂" is not in the vocabulary, so it is encoded as 1, i.e., [UNK]
print("Original text:", text)
print("\nEncoded:", encoded_text)
print("\nDecoded back:", decoded)

Original text: 这只是一个简单的例子，看你懂不懂我

Encoded: [24, 210, 34, 289, 3, 1622, 2, 28, 67, 1, 8]

Decoded back: 这只是一个简单的例子，看你[UNK]我


### Encoding the train set

In [16]:
def encode_dataset(dataset):
    
    for text, label in dataset:
        text = text_encoder(text)
        
        yield [text, label]

In [17]:
train_set_encoded = list(encode_dataset(train_set))

assert len(train_set) == len(train_set_encoded)

tmp = "{}\t{}"
print("Three encoded train set examples:\n")
print(tmp.format("Text", "Label\n"))

for example in train_set_encoded[:3]:
    print(tmp.format(*example))

Three encoded train set examples:

Text	Label

[188, 6965, 1177, 3, 428, 37, 99, 2, 17, 10681, 10682, 343, 913, 1875, 2, 604, 2299, 27, 10683, 27, 2133, 27, 1178, 27, 10684, 5392, 4, 13, 290, 76, 2, 44, 21, 259, 1048, 4, 6, 2685, 14, 284, 3, 4400, 2, 1089, 422, 2, 96, 397, 497, 7, 59, 4, 6, 531, 3, 108, 7, 2686, 3, 2, 21, 259, 471, 4, 6, 43, 319, 2, 76]	1
[3285, 959, 414, 3, 175, 304, 2134, 2, 305, 239, 1876, 639, 5, 2, 672, 59, 2488, 3733, 2, 6966, 2488, 2300, 99, 2, 605, 12, 10, 2687, 2, 344, 12, 383, 26]	1
[22, 504, 4, 163, 3, 16, 76, 4, 4, 4, 4, 4, 4, 4, 4, 4]	0


### Decoding the shown examples back

In [18]:
print("Three decoded train set examples:\n")
print(tmp.format("Text", "Label\n"))

for example in train_set_encoded[:3]:
    t, l = example
    decoded = text_decoder(t)
    
    print(tmp.format(decoded, l))

Three decoded train set examples:

Text	Label

选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般	1
15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错	1
房间太小。其他的都一般。。。。。。。。。	0


<a name="5"></a>
## Build mini batches 

When building mini batches (more than one example per batch) for the numericalized data, we need to make sure that the arrays in a batch have identical dimensions so that we can do matrix operations during model training. Sometimes, for some tasks or models, we may need every text to be represented with one specified shape. For the neural network models presented in this repository, however, we only need the text length to be the same within a given batch. Alternatively, we can set a `max_seq_len` so that every text is represented by an array of `max_seq_len` items. A text with more tokens than `max_seq_len` will be truncated accordingly. 

### How to build batches 

First, let's see how we can create batches. Suppose we have a list of 50 items. If we want to create 10 batches, then each batch should have 5 items. Typically, we do it the other way around. That is, we specify how many items we want in each batch and then get the corresponding number of batches. Say we want 5 items per batch for the list; then we will need to create 10 batches. 

Logically, if the batch size does not evenly divide the number of items in the list, we need to make a judgment call as to whether to drop the remainder (spelled `reminder` in the code below). 
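The arithmetic behind that judgment call can be sketched as follows. The `num_batches` helper and its `drop_remainder` flag are hypothetical names used only for this illustration.

```python
import math


def num_batches(n_items, batch_size, drop_remainder=False):
    """Number of batches produced from n_items with the given batch_size."""
    if drop_remainder:
        return n_items // batch_size          # incomplete last batch is dropped
    return math.ceil(n_items / batch_size)    # incomplete last batch is kept


print(num_batches(50, 5))                        # 10
print(num_batches(50, 7))                        # 8 (7 full batches + 1 item)
print(num_batches(50, 7, drop_remainder=True))   # 7
print(num_batches(4000, 64))                     # 63
```

The last call uses the size of our train set (4000 examples): with a batch size of 64 there are 62 full batches plus one batch of 32 examples.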

In [19]:
def batch_creater(lst, 
                  batch_size, 
                  reminder_threshold=0):
    
    assert batch_size > 1, "batch_size must be greater than 1"
    
    lst_size = len(lst)
    # number of items left over after the last full batch
    reminder = lst_size % batch_size
    # if there is a reminder smaller than the threshold, we leave it out. 
    # (checking "reminder" first matters: lst[:-0] would be an empty list)
    if reminder and reminder_threshold > reminder:
        lst = lst[:-reminder]
        lst_size, reminder = len(lst), 0
    
    batch_num = lst_size // batch_size
    # if there is a reminder, we need to add 1 to the batch_num
    # so that the reminder can be included in the for loop
    end_idx = batch_num + 1 if reminder else batch_num
    
    out = []
    for i in range(0, end_idx):
        # slicing always returns a list, even for a one-item last batch
        out.append(lst[i * batch_size : (i+1) * batch_size])
    
    return out

In [20]:
lst = list(range(50))
print("Creating batches. List size: 50. Batch size: 10...\n")

batches = batch_creater(lst, 10)
print(f"{len(batches)} batches created!\n")
print(batches)

Creating batches. List size: 50. Batch size: 10...

5 batches created!

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]


In [21]:
lst = list(range(50))
print("Creating batches. List size: 50. Batch size: 7...\n")

batches = batch_creater(lst, 7)
print(f"{len(batches)} batches created!\n")
print(batches)

# see the last batch with one item is also a list!

Creating batches. List size: 50. Batch size: 7...

8 batches created!

[[0, 1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13], [14, 15, 16, 17, 18, 19, 20], [21, 22, 23, 24, 25, 26, 27], [28, 29, 30, 31, 32, 33, 34], [35, 36, 37, 38, 39, 40, 41], [42, 43, 44, 45, 46, 47, 48], [49]]


### How to pad 

The key to padding is to make sure that every array in a list has the same length, so we need to decide (1) when to pad and (2) what to pad with. For (1), we can pad whenever arrays in the list are shorter than the longest one, so that every array becomes as long as the longest one. Or we can set a `max_seq_len` and make every array exactly `max_seq_len` long no matter what. As for (2), we typically pad with 0.

In [22]:
def pad(lst, pad_idx=0, max_seq_len=None):
    # here let's assume that every item in the lst is a list
    
    if max_seq_len:
        max_len = max_seq_len
    else:
        max_len = max(len(l) for l in lst)
    
    lst_copy = lst.copy()
    for idx, l in enumerate(lst_copy):
        dif = max_len - len(l)
        
        # if the item is shorter than max_len, pad it; if longer, truncate it
        if dif > 0:
            lst_copy[idx] = lst_copy[idx] + [pad_idx] * dif
        elif dif < 0:
            lst_copy[idx] = lst_copy[idx][:max_len]
    
    return lst_copy

In [23]:
lst = [
    [1, 2, 3, 4, 5], 
    [1, 2, 3],
    [1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4]
]

pad(lst)

# looks perfect!

[[1, 2, 3, 4, 5, 0, 0],
 [1, 2, 3, 0, 0, 0, 0],
 [1, 2, 3, 4, 5, 6, 7],
 [1, 2, 3, 4, 0, 0, 0]]

In [24]:
pad(lst, max_seq_len=5)

# again, looks perfect!

[[1, 2, 3, 4, 5], [1, 2, 3, 0, 0], [1, 2, 3, 4, 5], [1, 2, 3, 4, 0]]

### Let's build batches for our dataset!

With everything put together, let's build batches for our dataset! 

Please note that: 

- in deep learning, we usually use a `tensor` to represent a list of numbers, which makes training models easier and more efficient, especially on (parallel) GPUs. Here, we will use numpy arrays instead for illustration (most deep learning frameworks can convert numpy arrays to tensors, so no worries!).


- we may also want to shuffle the input dataset because the order of the examples makes a difference. A good practice is to train models several times with differently shuffled train sets and report the average performance scores. (To make things reproducible, a seed is also needed so that the shuffled train sets can be reproduced.)


- whether to separate the labels from the texts does not matter, as long as you know how to deal with both situations during the later model training (most deep learning frameworks are highly composable). So do whatever is easier for you. Here we will not separate them. 


- for different models, the required inputs can differ, so **you need to build the `build_batches` function to suit your own needs**. In this repository, you will use Recurrent Neural Networks and their variants to train the text classifier, which all require the text length as an input. The `build_batches` function is specific to this context and can be rearranged in whatever way works for you. 
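As a sketch of the reproducibility point above, shuffling with a fixed seed could look like this. The `shuffled_copy` helper is a hypothetical name and uses the standard library's `random.Random`; it is an illustration under these assumptions, not the method used by `build_batches`.

```python
import random


def shuffled_copy(dataset, seed):
    """Return a shuffled copy of dataset; the same seed gives the same order."""
    rng = random.Random(seed)   # local generator, does not touch global state
    copy = list(dataset)
    rng.shuffle(copy)
    return copy


data = list(range(10))
first = shuffled_copy(data, seed=42)
second = shuffled_copy(data, seed=42)

print(first == second)           # True: same seed, same order
print(data == list(range(10)))   # True: the original list is untouched
```

Running the training with several different seeds and averaging the scores gives results that others can reproduce exactly.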

In [25]:
import numpy as np


def build_batches(dataset, 
                  batch_size, 
                  shuffle=True, 
                  pad_idx=0, 
                  max_seq_len=None, 
                  dtype="int64", 
                  reminder_threshold=0, 
                  include_seq_len=False):
    
    # ------------- building batches first -------------
    
    if shuffle:
        # shuffle a shallow copy so the caller's list is not reordered in place
        dataset = list(dataset)
        np.random.shuffle(dataset)
        
    batches = batch_creater(dataset, batch_size, reminder_threshold)

    
    # ------------- start padding -------------
    # we can reuse the pad func above but the following is more efficient
    
    def _pad(lst, max_len):
        dif = max_len - len(lst)
        if dif > 0:
            return lst + [pad_idx] * dif
        if dif < 0:
            return lst[:max_len]
        return lst
        
    
    def pad(batch):
        if max_seq_len:
            max_len = max_seq_len
        else:
            max_len = max(len(b[0]) for b in batch)
            
        text, label = [], []
        
        if include_seq_len:
            if max_seq_len:
                # lengths are capped at max_seq_len because longer texts are truncated
                text_len = [min(len(bt[0]), max_seq_len) for bt in batch]
            else:
                text_len = [len(bt[0]) for bt in batch]
                
            text_len = np.asarray(text_len, dtype=dtype)            
        
        for idx, bt in enumerate(batch):
            
            # ----- for text -----
            text.append(_pad(bt[0], max_len))

            # ----- for the label -----
            label.append(bt[-1])
        
        text = np.asarray(text, dtype=dtype)
        label = np.asarray(label, dtype=dtype)
        
        if include_seq_len:
            return [text, text_len, label]
        
        return [text, label]
    
    out = []
    for batch in batches:
        out.append(pad(batch))
    
    return out

In [26]:
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)

print(f"{len(train_set_batched)} batches created!")

63 batches created!


In [27]:
# check

for idx, batch in enumerate(train_set_batched[:5]):
    text, seq_len, label = batch
    print(f"{'-' * 10} # {idx+1} batch {'-' * 10}")
    
    print("Shape of text batch:", text.shape)
    print("Shape of seq_len batch:", seq_len.shape)
    print("Shape of label batch:", label.shape)
    print()

---------- # 1 batch ----------
Shape of text batch: (64, 224)
Shape of seq_len batch: (64,)
Shape of label batch: (64,)

---------- # 2 batch ----------
Shape of text batch: (64, 706)
Shape of seq_len batch: (64,)
Shape of label batch: (64,)

---------- # 3 batch ----------
Shape of text batch: (64, 827)
Shape of seq_len batch: (64,)
Shape of label batch: (64,)

---------- # 4 batch ----------
Shape of text batch: (64, 309)
Shape of seq_len batch: (64,)
Shape of label batch: (64,)

---------- # 5 batch ----------
Shape of text batch: (64, 221)
Shape of seq_len batch: (64,)
Shape of label batch: (64,)

