In [None]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-15
# GitHub: https://github.com/jaaack-wang 

## Quick start

The last tutorial showed at length how to convert a text dataset for training text matching models. With handy wrapper functions, like those available in many deep learning frameworks, the same process takes only a few lines of code. In real projects, you may need additional lines of code to suit your specific needs.

The following is a quick start. A more detailed explanation is given afterwards.

In [1]:
from utils import *

# ---- load dataset ----
train_set = load_dataset('train.txt')

# ---- numericalize the train set ----
V = TextVectorizer(tokenize) 
text = gather_text(train_set) # for collecting texts from train set
V.build_vocab(text) # for building mapping vocab_to_idx dictionary and text_encoder
train_set_encoded = list(encode_dataset(train_set, encoder=V.text_encoder)) # encoding train set

# ---- build mini batches for the train set ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)
print(f"Number of {len(train_set_batched)} batches created!")

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.
Number of 47 batches created!


## Intro to the wrapped functions

To avoid rewriting or copy-pasting the same functions over and over just to load and preprocess data throughout the tutorials, some wrapper functions are provided. They can be found in the `utils.py` file in the same folder.

These wrapper functions mostly come from the last tutorial (see: `2 - preprocess_data.ipynb`), with some revisions to make some functions more reusable. This tutorial repeats the steps of the last one and introduces the wrapper functions along the way.

## Load dataset

The `load_dataset` wrapper allows:

- loading a dataset (for tutorials in this repository) given its file path;
- (extended) loading multiple datasets given their file paths.
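
A minimal sketch of how one loader can accept either a single path or a list of paths. The real `load_dataset` lives in `utils.py`; the name `load_dataset_sketch`, the tab-separated `text_a<TAB>text_b<TAB>label` file layout, and the per-example list format are assumptions made only for illustration.

```python
from pathlib import Path


def load_dataset_sketch(paths):
    """Load one dataset (a str path) or several (a list/tuple of paths)."""
    # Several paths: load each one and return the results in the same order.
    if isinstance(paths, (list, tuple)):
        return [load_dataset_sketch(p) for p in paths]

    examples = []
    for line in Path(paths).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: text_a <TAB> text_b <TAB> integer label
        text_a, text_b, label = line.split("\t")
        examples.append([text_a, text_b, int(label)])
    return examples
```

Returning the datasets in the order of the given paths is what makes tuple unpacking such as `train_set, dev_set, test_set = ...` possible.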

In [2]:
from utils import load_dataset

train_set, dev_set, test_set = load_dataset(['train.txt', 'dev.txt', 'test.txt'])

# check. should be 3000, 1000, 1000 (recall `1 - get_data.ipynb`)
len(train_set), len(dev_set), len(test_set)

(3000, 1000, 1000)

## Numericalize text 

Recall that we need to encode the text data into something numerical so that we can train models on it. As elaborated in the last tutorial, this requires a tokenizer and a related dictionary that maps each token to a unique index. If we want to decode the encoded text, we also need a reversed dictionary that maps each index back to a token.

Redoing all of this can be tedious, so this tutorial introduces a wrapper class, `TextVectorizer`, to make everything easy. The `TextVectorizer` class can do the following:

- build the `vocab_to_idx` and `idx_to_vocab` mapping dictionaries quickly, given text and a tokenizer;
- encode and decode any given text(s) using the tokenizer used to build the dictionaries;
- save and load the built `vocab_to_idx` and `idx_to_vocab` dictionaries for reuse.

If you already have a list of tokens of the kind produced by the tokenizer you pass to `TextVectorizer`, you can also build the dictionaries directly from that list.

`TextVectorizer` is also callable, which means that, after initializing it, you can encode text(s) directly by calling the instance on them.

### Initialization 

To initialize `TextVectorizer`, you must pass it a tokenizer function/method that takes a `str` as input and returns a list of tokens. You can also pass a text preprocessor function/method, which is applied to a given text before it is tokenized. Alternatively, you can build a tokenizer that incorporates the preprocessor and pass only the tokenizer; the two approaches are equivalent.
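
To illustrate why a separate preprocessor and a tokenizer with built-in preprocessing amount to the same thing, here is a small sketch that folds one into the other. The functions `lowercase` and `whitespace_tokenize` are hypothetical stand-ins, not the `tokenize` from `utils.py`.

```python
def lowercase(text):
    """Hypothetical preprocessor: normalise case."""
    return text.lower()


def whitespace_tokenize(text):
    """Hypothetical tokenizer: split on whitespace."""
    return text.split()


def tokenize_with_preprocessing(text):
    """A single tokenizer that applies the preprocessor first."""
    return whitespace_tokenize(lowercase(text))


print(tokenize_with_preprocessing("How much does it cost"))
# ['how', 'much', 'does', 'it', 'cost']
```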

We will use the same tokenizer we used in the last tutorial, which is saved in the `utils.py` file.

In [3]:
from utils import tokenize, TextVectorizer

V = TextVectorizer(tokenize)

### Building `vocab_to_idx` and `idx_to_vocab` dictionaries

To build these two dictionaries, all you need is a text or a list of texts, which you pass to `TextVectorizer.build_vocab`. If you have a list of tokens instead, pass it to `TextVectorizer.build_vocab_from_list_tks`.

With `TextVectorizer.build_vocab`, you can also choose whether to build the vocab in random order or by token frequency in descending order. The default is the latter, in which case you can keep only the most frequent tokens by specifying, for example, `top=10000`. This option is not available in random mode, where it would not make sense.

To gather the text, we will use a simple function, `gather_text`, which collects all the "text_a" and "text_b" entries from our datasets into a list.
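
The frequency-based mode can be sketched with `collections.Counter`. This is an illustration, not the code in `utils.py`: the function name `build_vocab_sketch`, the reserved `[PAD]` token at index 0, and the whitespace tokenizer are assumptions. Placing `[UNK]` at index 1 is consistent with the encoded examples later in this tutorial.

```python
from collections import Counter


def build_vocab_sketch(texts, tokenizer, top=None):
    """Map tokens to indices, most frequent tokens first."""
    counts = Counter(tk for text in texts for tk in tokenizer(text))
    # Assumed special tokens: index 0 for padding, index 1 for unknown tokens.
    vocab_to_idx = {"[PAD]": 0, "[UNK]": 1}
    # most_common(None) returns every token; most_common(n) keeps the n most frequent.
    for token, _ in counts.most_common(top):
        vocab_to_idx[token] = len(vocab_to_idx)
    idx_to_vocab = {idx: tk for tk, idx in vocab_to_idx.items()}
    return vocab_to_idx, idx_to_vocab


texts = ["how do i stop", "how do i start", "how much"]
v2i, i2v = build_vocab_sketch(texts, str.split, top=3)
print(v2i)
```

With `top=3`, only the three most frequent tokens receive their own index; every other token would be encoded as `[UNK]`.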

**Remember that the dictionaries should be built upon the train set or some external source, but never on the dev set or test set.**

In [4]:
from utils import gather_text

text = gather_text(train_set)
print(f"{len(text)} pieces of texts gathered from the train set.\n")

V.build_vocab(text)

6000 pieces of texts gathered from the train set.

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.


In [5]:
from random import sample, seed
# check

seed(543)

tmp = "{:20}{}"
print("5 random examples from the vocab_to_idx dictionary\n")

for item in sample(list(V.vocab_to_idx.items()), 5):
    print(tmp.format(*item))
    
    
print("\n\n5 random examples from the idx_to_vocab dictionary\n")

for idx, tk in sample(list(V.idx_to_vocab.items()), 5):
    print(tmp.format(str(idx), tk))

5 random examples from the vocab_to_idx dictionary

genome              6909
accomplish          5258
facility            7479
wedding             1032
build               741


5 random examples from the idx_to_vocab dictionary

3816                mobility
6394                practicing
5614                cfl
4865                retrograde
6659                tablet


### Saving the mapping dictionaries

You can use `TextVectorizer.save_vocab_as_json`, which by default saves the dictionaries, if built, as `vocab_to_idx.json` and `idx_to_vocab.json` in the current working directory.

In [6]:
V.save_vocab_as_json()

vocab_to_idx.json has been successfully saved!
idx_to_vocab.json has been successfully saved!


### Reusing the mapping dictionaries

When you need to reuse them, simply call `TextVectorizer.load_vocab_from_json`. If you do not specify the file paths to the two dictionaries, it will search for `vocab_to_idx.json` and `idx_to_vocab.json` in the current working directory and load whichever of them exists.

Below, we first empty the two mapping dictionaries and then reload them from the JSON files we just saved.

In [7]:
V.vocab_to_idx, V.idx_to_vocab = None, None

V.load_vocab_from_json()

vocab_to_idx.json has been successfully loaded! Please call X.vocab_to_idx to find out more.
idx_to_vocab.json has been successfully loaded! Please call X.idx_to_vocab to find out more.

Where [X] stands for the name you used for this TextVectorizer class.
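
One detail to watch when reloading an index-to-token dictionary from JSON: JSON object keys are always strings, so integer keys come back as `"0"`, `"1"`, and so on. How `load_vocab_from_json` handles this internally is not shown here; the sketch below only demonstrates the round trip with the standard `json` module, using a toy dictionary whose contents are made up for illustration.

```python
import json

# Toy dictionary; the tokens and indices are illustrative only.
idx_to_vocab = {0: "[PAD]", 1: "[UNK]", 2: "the"}

# Serialise and parse again; a string stands in for the file on disk.
reloaded = json.loads(json.dumps(idx_to_vocab))
print(reloaded)  # the keys are now strings

# Convert the keys back to integers so lookups by index work again.
restored = {int(idx): tk for idx, tk in reloaded.items()}
assert restored == idx_to_vocab
```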


### Encoding and decoding text

- Encoding: call the initialized `TextVectorizer` instance or `TextVectorizer.text_encoder`.
- Decoding: call `TextVectorizer.text_decoder`.

Let's use some sample texts from the dev set as examples.

In [8]:
dev_text = gather_text(dev_set)
sample_texts = sample(dev_text, 3)

assert len(dev_text) == len(dev_set) * 2 # text_a plus text_b

seed(834)

for text in sample_texts:
    encoded = V.text_encoder(text) # or V(text)
    decoded = V.text_decoder(encoded)
    print("Original:", text)
    print("Encoded:", encoded)
    print("Decoded:", decoded)
    print()
    
# Or pass a list of text or text_ids to encode or decode
print(V.text_decoder(V(sample_texts)))

Original: How do I stop my Pit Bull/English Bulldog mix from biting my shoes?
Encoded: [5, 10, 6, 172, 19, 1, 1, 1, 607, 33, 1, 19, 1981]
Decoded: how do i stop my [UNK] [UNK] [UNK] mix from [UNK] my shoes

Original: How much does it cost to patent something?
Encoded: [5, 102, 21, 20, 323, 8, 1971, 326]
Decoded: how much does it cost to patent something

Original: Why am I not comfortable with the caste-based reservation system in India?
Encoded: [16, 77, 6, 57, 6218, 31, 2, 7515, 602, 194, 9, 37]
Decoded: why am i not comfortable with the castebased reservation system in india

['how do i stop my [UNK] [UNK] [UNK] mix from [UNK] my shoes', 'how much does it cost to patent something', 'why am i not comfortable with the castebased reservation system in india']
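
The `[UNK]` tokens above come from words that are missing from the vocabulary built on the train set. A minimal sketch of that lookup, assuming unknown tokens map to index 1 (as the encoded output suggests) and using a toy vocabulary and a lowercasing whitespace tokenizer that are not the ones in `utils.py`:

```python
UNK_IDX = 1  # assumed index of the unknown token
vocab_to_idx = {"[PAD]": 0, "[UNK]": UNK_IDX, "how": 2, "much": 3, "cost": 4}
idx_to_vocab = {idx: tk for tk, idx in vocab_to_idx.items()}


def encode(text):
    """Encode one string, or each string in a list."""
    if isinstance(text, list):
        return [encode(t) for t in text]
    return [vocab_to_idx.get(tk, UNK_IDX) for tk in text.lower().split()]


def decode(ids):
    """Decode one list of ids, or each list in a list of lists."""
    if ids and isinstance(ids[0], list):
        return [decode(seq) for seq in ids]
    return " ".join(idx_to_vocab.get(i, "[UNK]") for i in ids)


ids = encode("How much patents cost")
print(ids)          # [2, 3, 1, 4]
print(decode(ids))  # how much [UNK] cost
```

Decoding is lossy: once a word has been encoded as the unknown index, the original word cannot be recovered.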


### Encode the train set

We will reuse `encode_dataset`, built in the last tutorial, to encode the train set (or the dev set for validation).

In [9]:
from utils import encode_dataset

train_set_encoded = list(encode_dataset(train_set, encoder=V.text_encoder))

# check
print(train_set_encoded[0])

[[5, 10, 6, 223, 7, 43, 852], [5, 10, 6, 223, 34, 852, 9, 94], 1]
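
The call is wrapped in `list(...)` in the cell above, which suggests that `encode_dataset` yields examples lazily. A hedged sketch of such a generator, assuming each example is a `[text_a, text_b, label]` triple (the function and the toy encoder below are illustrative, not the versions in `utils.py`):

```python
def encode_dataset_sketch(dataset, encoder):
    """Yield [encoded_text_a, encoded_text_b, label] for each example."""
    for text_a, text_b, label in dataset:
        yield [encoder(text_a), encoder(text_b), label]


def toy_encoder(text):
    """Toy encoder for demonstration: token length stands in for a vocab index."""
    return [len(tk) for tk in text.split()]


toy_set = [["how are you", "how do you do", 1]]
print(list(encode_dataset_sketch(toy_set, toy_encoder)))
# [[[3, 3, 3], [3, 2, 3, 2], 1]]
```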


## Building batches

We will reuse `build_batches`, built in the last tutorial, to build the mini batches for the train set (or the dev set for validation). The parameters are the same except for one addition, `include_seq_len`. When it is set to `True`, as below, each batch also carries the sequence lengths of `text_a` and `text_b`.

In [10]:
from utils import build_batches

train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)

print(f"Number of {len(train_set_batched)} batches created!")

Number of 47 batches created!
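
The count of 47 follows from the sizes involved: 3000 examples at 64 per batch give 46 full batches plus one final batch of 56, that is, `ceil(3000 / 64) = 47`. The sketch below shows one way such batches could be built with NumPy, padding each side to the longest sequence in its own batch. The padding index 0, the absence of shuffling, and the function names are assumptions; the actual `build_batches` is in `utils.py`.

```python
import math

import numpy as np


def pad(seqs, pad_idx=0):
    """Right-pad a list of id lists to the longest one in the list."""
    max_len = max(len(s) for s in seqs)
    return np.array([s + [pad_idx] * (max_len - len(s)) for s in seqs])


def build_batches_sketch(encoded, batch_size=64, include_seq_len=False):
    """Split encoded examples into padded batches, keeping their order."""
    batches = []
    for start in range(0, len(encoded), batch_size):
        chunk = encoded[start:start + batch_size]
        a, b, labels = zip(*chunk)
        batch = [pad(list(a)), pad(list(b))]
        if include_seq_len:
            # True (unpadded) lengths of each text_a and text_b
            batch += [np.array([len(s) for s in a]), np.array([len(s) for s in b])]
        batch.append(np.array(labels))
        batches.append(batch)
    return batches


toy = [[[5, 10], [5, 10, 6], 1], [[7], [8, 9], 0], [[3, 4, 2], [1], 1]]
batches = build_batches_sketch(toy, batch_size=2, include_seq_len=True)
assert len(batches) == math.ceil(len(toy) / 2)
a, b, a_len, b_len, labels = batches[0]
print(a.shape, b.shape, a_len.shape, b_len.shape, labels.shape)
```

Because padding is done per batch and per side, the second dimension of the `text_a` and `text_b` arrays can differ both within a batch and from one batch to the next.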


In [11]:
# check

for idx, batch in enumerate(train_set_batched[:5]):
    a, b, a_l, b_l, l = batch
    print(f"{'-' * 10} # {idx+1} batch {'-' * 10}")
    
    print("Shape of text_a batch:", a.shape)
    print("Shape of text_b batch:", b.shape)
    print("Shape of text_a_len batch:", a_l.shape)
    print("Shape of text_b_len batch:", b_l.shape)
    print("Shape of label batch:", l.shape)
    print()

---------- # 1 batch ----------
Shape of text_a batch: (64, 23)
Shape of text_b batch: (64, 45)
Shape of text_a_len batch: (64,)
Shape of text_b_len batch: (64,)
Shape of label batch: (64,)

---------- # 2 batch ----------
Shape of text_a batch: (64, 30)
Shape of text_b batch: (64, 29)
Shape of text_a_len batch: (64,)
Shape of text_b_len batch: (64,)
Shape of label batch: (64,)

---------- # 3 batch ----------
Shape of text_a batch: (64, 28)
Shape of text_b batch: (64, 36)
Shape of text_a_len batch: (64,)
Shape of text_b_len batch: (64,)
Shape of label batch: (64,)

---------- # 4 batch ----------
Shape of text_a batch: (64, 26)
Shape of text_b batch: (64, 27)
Shape of text_a_len batch: (64,)
Shape of text_b_len batch: (64,)
Shape of label batch: (64,)

---------- # 5 batch ----------
Shape of text_a batch: (64, 21)
Shape of text_b batch: (64, 35)
Shape of text_a_len batch: (64,)
Shape of text_b_len batch: (64,)
Shape of label batch: (64,)

