In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-20
# GitHub: https://github.com/jaaack-wang 

## Quick start

Using the wrapped-up functions that we will build step by step in this tutorial, preprocessing the text data into a form that is ready for model training can be as simple as the following. Does it really work? Let's explore!

In [2]:
from utils import load_dataset, gather_text
from paddle_utils import * 

train_set = load_dataset('train.txt')

text = gather_text(train_set)
V = TextVectorizer()
V.build_vocab(text)

trans_fn = get_trans_fn(V, include_seq_len=False)
batchify_fn = get_batchify_fn(include_seq_len=False)
train_loader = create_dataloader(train_set, trans_fn, batchify_fn)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.


In [3]:
for example in train_loader:
    print(example)
    break

[Tensor(shape=[64, 55], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[6   , 47  , 7   , ..., 0   , 0   , 0   ],
        [26  , 208 , 3702, ..., 0   , 0   , 0   ],
        [4   , 3   , 22  , ..., 0   , 0   , 0   ],
        ...,
        [17  , 13  , 23  , ..., 0   , 0   , 0   ],
        [20  , 11  , 19  , ..., 0   , 0   , 0   ],
        [6   , 15  , 7   , ..., 0   , 0   , 0   ]]), Tensor(shape=[64, 32], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[6  , 11 , 7  , ..., 0  , 0  , 0  ],
        [5  , 24 , 321, ..., 0  , 0  , 0  ],
        [6  , 15 , 7  , ..., 0  , 0  , 0  ],
        ...,
        [17 , 13 , 23 , ..., 0  , 0  , 0  ],
        [4  , 38 , 8  , ..., 0  , 0  , 0  ],
        [26 , 651, 78 , ..., 0  , 0  , 0  ]]), Tensor(shape=[64], dtype=int64, place=CPUPlace, stop_gradient=True,
       [1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,

## Overview

In this tutorial, we will use functions from `paddle` to help us preprocess and numericalize datasets. Because these functions are native to `paddle`, they offer an advantage when training models built with `paddle`, especially when the datasets are large. You will also need [`paddlenlp`](https://github.com/PaddlePaddle/PaddleNLP), an NLP package designed by the `paddle` team, to get everything going. To install it, simply run `pip3 install --upgrade paddlenlp` on the command line.

If you need more intuition about the ins and outs of this process, please refer to `2 - preprocess_data.ipynb` in the same folder.

Below is the structure of this tutorial:

- [Load dataset](#1)
- [Create vocab_to_idx mapping dictionary](#2)
- [Text encoder](#3)
- [Example converter](#4)
    - [Converting multiple examples](#4-1)
- [Creating dataloader](#5)
    - [Transform the dataset into Dataset class using MapDataset](#5-1)
    - [A data sampler](#5-2)
    - [Building a batchify method](#5-3)
    - [Now the dataloader](#5-4)
- [A quick test](#6)
- [Wrapped up functions](#7)
    - [TextVectorizer](#7-1)
    - [Example converter](#7-2)
    - [Get trans_fn](#7-3)
    - [Get batchify_fn](#7-4)
    - [Create dataloader](#7-5)
- [More thorough tests](#8)
    - [Initializations](#8-1)
    - [Test One: CNN](#8-2)
    - [Test Two: RNN](#8-3)

<a name="1"></a>
## Load dataset

As usual, let's first use the `load_dataset` function compiled in the last two tutorials to load the datasets.

In [4]:
from utils import load_dataset

train_set = load_dataset('train.txt')

# check. should be 3000 (recall `1 - get_data.ipynb`)
len(train_set)

3000

<a name="2"></a>

## Create `vocab_to_idx` mapping dictionary

The purpose of creating a `vocab_to_idx` mapping dictionary is to encode (numericalize) text data later for model training. In the `2.1 - wrapped_up_data_preprocessor` tutorial, we learnt how to use `TextVectorizer` to do this job conveniently.

In this tutorial, we will use `paddlenlp.data.Vocab` to do a similar job. Fortunately, many English tokenizers come with `paddlenlp`, and we will use a simple one, `paddlenlp.transformers.bert.tokenizer.BasicTokenizer` (a lot of tokenizers are stored in the [transformers folder](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers)). Admittedly, nobody is going to memorize such a long import path. Alternatively, you can just use the `.split` method to tokenize English or use the `tokenize` function from `utils.py`.

Unfortunately, unlike the `TextVectorizer` we have built, `paddlenlp.data.Vocab` does not have a `text_encoder` function for us to use directly, but we can build it ourselves.

In [5]:
from paddlenlp.data import Vocab
# such a long path! sucks!
from paddlenlp.transformers.bert.tokenizer import BasicTokenizer

In [6]:
# First, we need a tokenize func, "()" is needed here!
tokenize = BasicTokenizer().tokenize

# Then we need a list of tokenized texts
from utils import gather_text
text = gather_text(train_set) # ---> gather text from the train_set
tokens = list(map(tokenize, text)) # ---> a list of tokenized texts ([[w1, w2...], [w1, w2...]...])

# build the vocabulary which will give us the mapping dictionaries for encoding
V = Vocab.build_vocab(tokens, unk_token='[UNK]', pad_token='[PAD]')

In [7]:
# Let's see what we have 
# The most useful will be: V.token_to_idx and V.idx_to_token
[d for d in dir(V) if not d.startswith("_")]

['bos_token',
 'build_vocab',
 'eos_token',
 'from_dict',
 'from_json',
 'idx_to_token',
 'load_vocabulary',
 'pad_token',
 'to_indices',
 'to_json',
 'to_tokens',
 'token_to_idx',
 'unk_token']

In [8]:
# check

tmp = "{:20}{}"
print("The first 10 examples from the token_to_idx dictionary\n")

for item in list(V.token_to_idx.items())[:10]:
    print(tmp.format(*item))
    
    
print("\n\nThe first 10 examples from the idx_to_token dictionary\n")

for idx, tk in list(V.idx_to_token.items())[:10]:
    print(tmp.format(str(idx), tk))

The first 10 examples from the token_to_idx dictionary

[PAD]               0
[UNK]               1
?                   2
the                 3
what                4
is                  5
how                 6
i                   7
a                   8
to                  9


The first 10 examples from the idx_to_token dictionary

0                   [PAD]
1                   [UNK]
2                   ?
3                   the
4                   what
5                   is
6                   how
7                   i
8                   a
9                   to


In [9]:
# for unseen token_to_idx

print("Index for \033[1mThis_Word_Does_Not_Exist\033[0m:", V.token_to_idx["This_Word_Does_Not_Exist"])

Index for This_Word_Does_Not_Exist: 1
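
To build some intuition for what such a vocabulary looks like, here is a minimal plain-Python sketch. It is not `paddlenlp`'s implementation: the helper names `build_toy_vocab` and `lookup` and the toy tokens are made up for illustration, and the ordering rule (special tokens first, then tokens by descending frequency) is an assumption that merely resembles the output shown above.

```python
from collections import Counter


def build_toy_vocab(tokenized_texts, unk_token="[UNK]", pad_token="[PAD]"):
    """Map tokens to indices: special tokens first, then tokens by frequency."""
    counts = Counter(tk for tokens in tokenized_texts for tk in tokens)
    idx_to_token = [pad_token, unk_token]
    # most_common() sorts by descending count; ties keep first-seen order
    idx_to_token += [tk for tk, _ in counts.most_common()]
    token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
    return token_to_idx, idx_to_token


def lookup(token, token_to_idx, unk_token="[UNK]"):
    # unseen tokens fall back to the index of the unknown token
    return token_to_idx.get(token, token_to_idx[unk_token])


# hypothetical toy data, already tokenized
toy_tokens = [["how", "do", "i", "write", "?"],
              ["how", "is", "the", "essay", "?"]]
toy_token_to_idx, toy_idx_to_token = build_toy_vocab(toy_tokens)

print(lookup("how", toy_token_to_idx))                       # 2
print(lookup("This_Word_Does_Not_Exist", toy_token_to_idx))  # 1
```

The fallback in `lookup` mirrors the behavior observed above, where an unseen word is mapped to the index of `[UNK]`.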


<a name="3"></a>
## Text encoder

The `text_encoder` below tokenizes a text and looks up the index of each token in `token_to_idx`. If you are interested, you can also build a `text_decoder` as we did before.

In [10]:
def text_encoder(text, 
                 tokenize=tokenize, 
                 token_to_idx=V.token_to_idx):
    
    tokens = tokenize(text)
    out = []
    for tk in tokens:
        out.append(token_to_idx[tk])
        
    return out

In [11]:
# check

print("Original text:", text[0])
print("Encoded text:", text_encoder(text[0]))

Original text: How do I write a good essay?
Encoded text: [6, 11, 7, 247, 8, 51, 932, 2]
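
For the curious, a `text_decoder` could be sketched roughly as below. The function signature and the toy `idx_to_token` mapping are assumptions made for illustration; in the notebook you would pass `V.idx_to_token` instead.

```python
def text_decoder(text_ids, idx_to_token, sep=" "):
    """Turn a list of token ids back into a string."""
    return sep.join(idx_to_token[i] for i in text_ids)


# hypothetical toy mapping, for illustration only
toy_idx_to_token = {0: "[PAD]", 1: "[UNK]", 2: "?", 3: "how", 4: "do", 5: "i"}

print(text_decoder([3, 4, 5, 2], toy_idx_to_token))  # how do i ?
```

Note that decoding does not recover the original text exactly: casing and the original spacing around punctuation are lost during tokenization.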


<a name="4"></a>
## Example converter 

We will see why this is needed shortly. The purpose of an example converter is to transform a given example into the data we need for training models. Later, we will use the `map` function to apply the `example_converter` to an entire dataset.

If you have read the previous tutorials, you will know that the `RNN` models take the text sequence length as an input, so we also need to take that into consideration. More generally, as different tasks or different models have different needs, you need to tailor the `example_converter` to your specific needs.

In [12]:
def example_converter(example, text_encoder, include_seq_len):
    
    text_a, text_b, label = example
    encoded_a = text_encoder(text_a)
    encoded_b = text_encoder(text_b)
    if include_seq_len:
        len_a, len_b = len(encoded_a), len(encoded_b)
        return encoded_a, encoded_b, len_a, len_b, label
    return encoded_a, encoded_b, label

In [13]:
# check

encoded_a, encoded_b, label = example_converter(train_set[0], text_encoder, False)
a, b, l = train_set[0]

print("Original text_a:", a)
print("Encoded text_a:", encoded_a)

print("\nOriginal text_b:", b)
print("Encoded text_b:", encoded_b)

# nothing changes for label

Original text_a: How do I write a good essay?
Encoded text_a: [6, 11, 7, 247, 8, 51, 932, 2]

Original text_b: How do I write an essay in English?
Encoded text_b: [6, 11, 7, 247, 40, 932, 10, 107, 2]
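
The check above only exercises `include_seq_len=False`. The sketch below shows the other branch. To stay self-contained it repeats the `example_converter` defined above and uses a made-up `toy_encoder` (one fake id per whitespace-separated token) in place of the real `text_encoder`.

```python
def example_converter(example, text_encoder, include_seq_len):
    # same logic as the function defined above
    text_a, text_b, label = example
    encoded_a = text_encoder(text_a)
    encoded_b = text_encoder(text_b)
    if include_seq_len:
        len_a, len_b = len(encoded_a), len(encoded_b)
        return encoded_a, encoded_b, len_a, len_b, label
    return encoded_a, encoded_b, label


def toy_encoder(text):
    # hypothetical stand-in encoder: the "id" is just the token length
    return [len(tk) for tk in text.split()]


example = ("How do I write", "How do I write an essay", 1)
encoded_a, encoded_b, len_a, len_b, label = example_converter(
    example, toy_encoder, True)

print(len_a, len_b, label)  # 4 6 1
```

With `include_seq_len=True` the converter returns five items instead of three, so whatever consumes the output has to unpack five values.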


<a name="4-1"></a>
### Converting multiple examples

To convert multiple examples, we can either loop through the examples one by one or use the `map` function. The `map` function takes a function as the first argument, followed by one or more iterables (e.g., lists or tuples), one for each parameter of that function. Our `example_converter` has three parameters, which means we would need to do something like the following:

```python
>>> examples = 'a list of examples'
>>> n = len(examples)
>>> E = text_encoder
>>> B = include_seq_len
>>> converted = map(example_converter, examples, [E] * n, [B] * n)
```

That looks bad, right? If you do not repeat the latter two arguments to make them iterable, you will see something like the following:

In [14]:
examples = train_set[:2]
converted = map(example_converter, examples, text_encoder, False)

TypeError: 'function' object is not iterable

**There are three ways to get around this:**

- First, assign default values to the latter two parameters (`text_encoder` and `include_seq_len`) so that you only need to pass the examples when using `map`. The downside is that whenever we want different values, we have to edit the `example_converter` function itself.

- Second, use `partial` from the standard-library `functools` module, or a `lambda` expression, to fix the values of some of a function's parameters, which creates a new wrapper function for you.

- Third, in our case, we can simply make the `example_converter` return the text seq length regardless, but that means, you need to construct all your models taking this into account (some models that do not need seq length info will have the related parameters but never use them). 
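
For comparison, the first option in the list above might look roughly like this. The `toy_encoder` and the toy examples are made up for illustration; in the notebook the default would be the real `text_encoder`.

```python
def toy_encoder(text):
    # hypothetical stand-in for the real text_encoder
    return text.lower().split()


def example_converter(example, text_encoder=toy_encoder, include_seq_len=False):
    # same body as before; only the signature gained default values
    text_a, text_b, label = example
    encoded_a = text_encoder(text_a)
    encoded_b = text_encoder(text_b)
    if include_seq_len:
        return encoded_a, encoded_b, len(encoded_a), len(encoded_b), label
    return encoded_a, encoded_b, label


examples = [("A b", "C d e", 1), ("F", "G h", 0)]

# with defaults in place, map only needs the examples themselves
converted = list(map(example_converter, examples))
print(converted[0])  # (['a', 'b'], ['c', 'd', 'e'], 1)
```

This works with `map`, but the encoder and the flag are now baked into the function definition, which is exactly the drawback mentioned in the first bullet.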


Below, we will show the second way because it is the one commonly adopted by the deep learning community when doing natural language processing.

#### Use `lambda`

In [15]:
trans_fn = lambda example: example_converter(example, text_encoder, False)
converted = map(trans_fn, examples)

for idx, conv in enumerate(converted):
    print(f"Example #{idx+1}....")
    e_a, e_b, l = conv
    print("Encoded text_a:", e_a)
    print("Encoded text_b:", e_b)
    print("Label:", l)

Example #1....
Encoded text_a: [6, 11, 7, 247, 8, 51, 932, 2]
Encoded text_b: [6, 11, 7, 247, 40, 932, 10, 107, 2]
Label: 1
Example #2....
Encoded text_a: [30, 5, 3, 22, 4101, 210, 2]
Encoded text_b: [30, 13, 3, 22, 725, 7069, 3809, 4101, 210, 92, 230, 2]
Label: 0


#### Use `partial`

In [16]:
from functools import partial 

# the parameters you want to fix must be passed by name here
trans_fn = partial(example_converter, 
                   text_encoder=text_encoder, 
                   include_seq_len=False)

converted = map(trans_fn, examples)

for idx, conv in enumerate(converted):
    print(f"Example # {idx+1} ...")
    e_a, e_b, l = conv
    print("Encoded text_a:", e_a)
    print("Encoded text_b:", e_b)
    print("Label:", l)

Example # 1 ...
Encoded text_a: [6, 11, 7, 247, 8, 51, 932, 2]
Encoded text_b: [6, 11, 7, 247, 40, 932, 10, 107, 2]
Label: 1
Example # 2 ...
Encoded text_a: [30, 5, 3, 22, 4101, 210, 2]
Encoded text_b: [30, 13, 3, 22, 725, 7069, 3809, 4101, 210, 92, 230, 2]
Label: 0


<a name="5"></a>
## Creating dataloader

Now comes the most important part! **I suspect that detailed explanations alone will not help you understand what is shown below; you may need to practice again and again, and compare with what we have done previously, to build a solid intuition.** Let's simply treat a dataloader as a black box. All you need to know is what goes in and what comes out. Here are some points you need to know:

- A dataloader is an iterable that works efficiently with the models built in a deep learning framework, especially when training on GPUs, because it can load data asynchronously.


- For a dataloader, you need to pass the dataset you want to train on, the `example_converter` (or the encapsulated `trans_fn`), a data `sampler` (i.e., how to build batches), as well as a `batchify` method to preprocess a given batch (or sample).


- For the dataset, its type needs to be what is called `Dataset` (map-style dataset) or `IterableDataset` (iterable-style dataset) in order to make everything work. 


Enough words. Let's just see how this can be done.

<a name="5-1"></a>
### Transform the dataset into `Dataset` class using `MapDataset`

Some properties of the `Dataset`:

- It can be iterated over with a for loop and accessed with a slicing index (just like a list!)
- It has a `Dataset.map` method that will transform (namely numericalize) the entire dataset given a `trans_fn` (which may take only one parameter)
- The transformed `Dataset` can still be iterated over with a for loop but can no longer be sliced.

In [17]:
import paddle
from paddlenlp.datasets import MapDataset

train = MapDataset(train_set)

print("Type of train", type(train))
print("Is train's type a Dataset?", isinstance(train, paddle.io.Dataset))

Type of train <class 'paddlenlp.datasets.dataset.MapDataset'>
Is train's type a Dataset? True


In [18]:
from copy import deepcopy # we need "deepcopy" to protect the original data

train_copy = deepcopy(train)
train_copy.map(trans_fn)

for idx, emp in enumerate(train_copy):
    print(f"Number {idx+1} example: ", emp)
    if idx == 2:
        break
        
# versus (remember, after transformation, we can't use a slicing index!)
print("\nThe first three examples...\n", train[:3])

Number 1 example:  ([6, 11, 7, 247, 8, 51, 932, 2], [6, 11, 7, 247, 40, 932, 10, 107, 2], 1)
Number 2 example:  ([30, 5, 3, 22, 4101, 210, 2], [30, 13, 3, 22, 725, 7069, 3809, 4101, 210, 92, 230, 2], 0)
Number 3 example:  ([6, 11, 7, 2476, 8, 3456, 36, 82, 1887, 14, 2456, 7239, 10, 24, 2], [32, 7, 21, 159, 1404, 18, 15, 7, 2770, 247, 23, 3211, 325, 10, 8, 5102, 57, 3456, 2, 5, 24, 51, 29, 321, 2], 0)

The first three examples...
 [('How do I write a good essay?', 'How do I write an essay in English?', 1), ('Which is the best thriller movies ?', 'Which are the best brain twisting psychological thriller movies ever made?', 0), ('How do I preserve a journal that has pencil and pen writings in it?', "If I'm depressed, can I atleast write my feelings down in a diary/journal? Is it good or bad?", 0)]


<a name="5-2"></a>
### A data sampler

There are two types of data samplers in `paddle`. One is `paddle.io.DistributedBatchSampler`, for distributed training on multiple GPUs. The other is `paddle.io.BatchSampler`, for normal use. We will use the latter here.

Please note that the batch sampler can only be iterated over in a for loop, but we can wrap it in `list` and then `len` to count how many batches there are. Every batch in this iterable output is a list of indices into the dataset. For example, an index "1" means "dataset\[1\]". The indices will later be used by the dataloader to retrieve examples from the transformed/numericalized dataset.

In [19]:
from paddle.io import BatchSampler

# do not forget to map the "trans_fn" to the "train" dataset. with the "map" func applied,
# you can no longer run this cell again, as the "train" has been transformed and no longer has "map" (see above)
bacth_sampler = BatchSampler(dataset=train.map(trans_fn), shuffle=True, batch_size=64)

# check: 3000 == 46 * 64 + 56
print("Number of batches:", len(list(bacth_sampler)))
print("Number of items in the last batch:",  len(list(bacth_sampler)[-1]))
print("Everything all right?", 3000 == 46 * 64 + 56)

Number of batches: 47
Number of items in the last batch: 56
Everything all right? True


In [20]:
first_10 = list(bacth_sampler)[0][:10]

print("This is the first 10 examples from the first batch:", first_10)
print("The indices are connecting with the transformed/numericalized dataset.")

This is the first 10 examples from the first batch: [2472, 2285, 646, 725, 1637, 2020, 2471, 1059, 2078, 2132]
The indices are connecting with the transformed/numericalized dataset.
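
The behavior seen above can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not how `paddle.io.BatchSampler` is implemented; the name `toy_batch_sampler` and its `seed` parameter are made up for illustration.

```python
import random


def toy_batch_sampler(num_examples, batch_size, shuffle=True, seed=None):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(num_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # cut the (possibly shuffled) indices into consecutive chunks
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]


batches = list(toy_batch_sampler(3000, 64, shuffle=True, seed=0))

print(len(batches), len(batches[-1]))  # 47 56
```

The sampler never touches the data itself. It only decides which indices go together, which is why the numbers printed above are positions in the dataset rather than token ids.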


<a name="5-3"></a>
### Building a `batchify` method 

The purpose of the `batchify` method is to further preprocess the batched dataset in a way that makes model training possible. More concretely, there are three things we need to consider:

- For the text ids (numericalized text) within a batch and of the same kind (text_a versus text_b), we need to make sure that they have the same length (padded to the max length in the batch). We can use `paddlenlp.data.Pad` to do that.
- For the label values and the like within a batch, we want them stacked together in an array. We will use `paddlenlp.data.Stack` to do that.
- For every batched element (e.g., text_a, text_b, label), we want them kept separate. We can use `paddlenlp.data.Tuple` to do that.

Please note that instances of all these classes are callable. You can use them by calling, for example, `Pad()(YourInput)`. You can also set certain values when initializing, for example `Pad(axis=0, pad_val=0)(YourInput)`. Moreover, for the RNN models, we will need to create another two `Stack()` for text_a_seq_len and text_b_seq_len respectively.

In [21]:
from paddlenlp.data import Pad, Stack, Tuple

# Tuple() applies the i-th func to the i-th field of every sample
# and returns the three outputs together (in this case)
batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=0),  # for text_a_ids; axis=0, pad_val=0 are default just so you know
        Pad(axis=0, pad_val=0),  # for text_b_ids
        Stack(dtype="int64")  # for label
    ): fn(samples)

In [22]:
# check. Note that the whole dataset is passed here as one single batch.

t_a, t_b, l = batchify_fn(list(map(trans_fn, train_set)))
print("Shape of text_a_ids preprocessed:", t_a.shape)
print("Shape of text_b_ids preprocessed:", t_b.shape)
print("Shape of labels preprocessed:", l.shape)

Shape of text_a_ids preprocessed: (3000, 67)
Shape of text_b_ids preprocessed: (3000, 74)
Shape of labels preprocessed: (3000,)
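
To make the three roles described in this section concrete, here is a plain-Python sketch. The functions `pad`, `stack` and `toy_batchify` are made-up stand-ins that return nested lists; they only imitate the idea behind `Pad`, `Stack` and `Tuple` and are not the `paddlenlp` implementations.

```python
def pad(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def stack(values):
    """Collect scalar values (e.g., labels) into one list."""
    return list(values)


def toy_batchify(samples, fns=(pad, pad, stack)):
    # zip(*samples) regroups the samples field by field:
    # all text_a ids, then all text_b ids, then all labels
    fields = zip(*samples)
    return [fn(list(field)) for fn, field in zip(fns, fields)]


samples = [([6, 11, 7], [6, 11], 1),
           ([30, 5], [30, 13, 3, 22], 0)]
text_a, text_b, labels = toy_batchify(samples)

print(text_a)  # [[6, 11, 7], [30, 5, 0]]
print(text_b)  # [[6, 11, 0, 0], [30, 13, 3, 22]]
print(labels)  # [1, 0]
```

Each field is padded to the longest sequence within its own batch, which is why text_a and text_b end up with different widths in the shapes printed above.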


<a name="5-4"></a>
### Now the dataloader

The dataloader will "load" the numericalized dataset according to our instructions and return an iterable object for the model to loop through during training. As mentioned, we will teach the dataloader how to sample (build batches for) the dataset, and how to further preprocess the built batches. The dataloader will also transform the numericalized dataset into `paddle.Tensor`, making model training quicker (for larger datasets trained on GPUs).

Again, the output cannot be retrieved by index.

In [23]:
from paddle.io import DataLoader

dataloader = DataLoader(
    train,
    batch_sampler=bacth_sampler, 
    collate_fn=batchify_fn)

In [24]:
# check

for d in dataloader:
    print(d)
    break

[Tensor(shape=[64, 56], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[4 , 5 , 24, ..., 0 , 0 , 0 ],
        [6 , 15, 7 , ..., 0 , 0 , 0 ],
        [4 , 13, 35, ..., 0 , 0 , 0 ],
        ...,
        [32, 7 , 21, ..., 0 , 0 , 0 ],
        [6 , 11, 7 , ..., 0 , 0 , 0 ],
        [6 , 15, 7 , ..., 0 , 0 , 0 ]]), Tensor(shape=[64, 32], dtype=int64, place=CPUPlace, stop_gradient=True,
       [[4   , 26  , 24  , ..., 0   , 0   , 0   ],
        [6   , 38  , 7   , ..., 0   , 0   , 0   ],
        [6   , 1928, 15  , ..., 0   , 0   , 0   ],
        ...,
        [32  , 7   , 1056, ..., 0   , 0   , 0   ],
        [6   , 13  , 3   , ..., 0   , 0   , 0   ],
        [26  , 651 , 78  , ..., 0   , 0   , 0   ]]), Tensor(shape=[64], dtype=int64, place=CPUPlace, stop_gradient=True,
       [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0])]
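
Stripped of tensors and asynchronous loading, the black box can be pictured as a small generator that glues the pieces together. The sketch below is a simplification under that assumption; `toy_dataloader`, `collate` and the toy data are made up for illustration.

```python
def toy_dataloader(dataset, batch_indices, collate_fn):
    """Yield one collated batch per list of indices."""
    for indices in batch_indices:
        samples = [dataset[i] for i in indices]  # retrieve examples by index
        yield collate_fn(samples)                # pad / stack them


def collate(samples):
    # hypothetical collate function for (ids, label) pairs
    ids, labels = zip(*samples)
    max_len = max(len(x) for x in ids)
    padded = [x + [0] * (max_len - len(x)) for x in ids]
    return padded, list(labels)


dataset = [([6, 11, 7], 1), ([30, 5], 0), ([4], 1)]  # already numericalized
batch_indices = [[2, 0], [1]]                        # what a sampler would yield

for batch in toy_dataloader(dataset, batch_indices, collate):
    print(batch)
# ([[4, 0, 0], [6, 11, 7]], [1, 1])
# ([[30, 5]], [0])
```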


<a name="6"></a>
## A quick test

It works!

In [25]:
from paddle_models.BoW import BoW

In [26]:
def get_model(model):
    model = paddle.Model(model)
    optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=5e-4)
    criterion = paddle.nn.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    model.prepare(optimizer, criterion, metric)
    return model

In [27]:
model = BoW(len(V.token_to_idx), 2)
model = get_model(model)
%time model.fit(dataloader, epochs=1, verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/1




CPU times: user 1.62 s, sys: 61.1 ms, total: 1.68 s
Wall time: 1.63 s


<a name="7"></a>
## Wrapped up functions

Before heading to the next section, you can try out the following functions and classes and see if you can use them to do a quick start yourself!

<a name="7-1"></a>
### TextVectorizer

The following wrapped-up class resembles the one that we built in `2.1 - wrapped_up_data_preprocessor.ipynb`, but here we use as many functions from `paddle` as possible. If you are interested, you can take a look back at the `TextVectorizer` inside `utils.py` and see if you can create some additional functions (such as saving the results into a JSON file for later reloading).

In [28]:
from paddlenlp.data import Vocab
from paddlenlp.transformers.bert.tokenizer import BasicTokenizer
from collections.abc import Iterable


class TextVectorizer:
     
    def __init__(self, tokenizer=None):
        self.tokenize = tokenizer if tokenizer \
        else BasicTokenizer().tokenize
        self.vocab_to_idx = None
        self.idx_to_vocab = None
        self._V = None
    
    def build_vocab(self, text):
        # use the tokenizer stored on the instance, not the global `tokenize`
        tokens = list(map(self.tokenize, text))
        self._V = Vocab.build_vocab(tokens, unk_token='[UNK]', pad_token='[PAD]')
        self.vocab_to_idx = self._V.token_to_idx
        self.idx_to_vocab = self._V.idx_to_token
        
        print('Two vocabulary dictionaries have been built!\n' \
             + 'Please call \033[1mX.vocab_to_idx | X.idx_to_vocab\033[0m to find out more' \
             + ' where [X] stands for the name you used for this TextVectorizer class.')
        
    def text_encoder(self, text):
        if isinstance(text, list):
            return [self(t) for t in text]
        
        tks = self.tokenize(text)
        out = [self.vocab_to_idx[tk] for tk in tks]
        return out
            
    def text_decoder(self, text_ids, sep=" "):
        if all(isinstance(ids, Iterable) for ids in text_ids):
            return [self.text_decoder(ids, sep) for ids in text_ids]
            
        out = []
        for text_id in text_ids:
            out.append(self.idx_to_vocab[text_id])
            
        return f'{sep}'.join(out)
    
    def __call__(self, text):
        if self.vocab_to_idx:
            return self.text_encoder(text)
        raise ValueError("No vocab is built!")
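
As one possible take on the JSON exercise suggested above, the mapping could be saved and reloaded roughly as follows. The helper names `save_vocab` and `load_vocab`, the file name and the toy vocabulary are all made up for illustration; only the `vocab_to_idx` dictionary is written, and the reverse mapping is rebuilt on loading.

```python
import json
import os
import tempfile


def save_vocab(vocab_to_idx, path):
    """Write the token-to-index mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab_to_idx, f, ensure_ascii=False)


def load_vocab(path):
    """Read the mapping back and rebuild the reverse dictionary."""
    with open(path, encoding="utf-8") as f:
        vocab_to_idx = json.load(f)
    # rebuilding keeps the indices as ints (JSON object keys are always strings)
    idx_to_vocab = {idx: tk for tk, idx in vocab_to_idx.items()}
    return vocab_to_idx, idx_to_vocab


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "?": 2, "the": 3}  # hypothetical vocabulary
path = os.path.join(tempfile.mkdtemp(), "vocab.json")

save_vocab(toy_vocab, path)
loaded_vocab, loaded_idx = load_vocab(path)
print(loaded_vocab == toy_vocab, loaded_idx[3])  # True the
```

A plain dictionary loaded this way will raise a `KeyError` for unseen tokens instead of falling back to `[UNK]`, so the encoder would need to handle that case. The `Vocab` object also lists `to_json` and `from_json` among its attributes (see the `dir(V)` output earlier), which may be a more direct route.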

<a name="7-2"></a>
### Example converter

Nothing to change here

In [29]:
def example_converter(example, text_encoder, include_seq_len):
    
    text_a, text_b, label = example
    encoded_a = text_encoder(text_a)
    encoded_b = text_encoder(text_b)
    if include_seq_len:
        len_a, len_b = len(encoded_a), len(encoded_b)
        return encoded_a, encoded_b, len_a, len_b, label
    return encoded_a, encoded_b, label

<a name="7-3"></a>
### Get trans_fn

Let's write a helper that returns `trans_fn` for us for this series of tutorials!

In [30]:
def get_trans_fn(text_encoder, include_seq_len):
    return lambda ex: example_converter(ex, text_encoder, include_seq_len)

<a name="7-4"></a>
### Get batchify_fn

Let's write a helper that returns `batchify_fn` for us for this series of tutorials!

In [31]:
from paddlenlp.data import Pad, Stack, Tuple


def get_batchify_fn(include_seq_len):
    
    if include_seq_len:
        stack = [Stack(dtype="int64")] * 3
    else:
        stack = [Stack(dtype="int64")]
    
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=0),  
        Pad(axis=0, pad_val=0),  
        *stack
    ): fn(samples)
    
    return batchify_fn

<a name="7-5"></a>
### Create dataloader 

In [32]:
from paddlenlp.datasets import MapDataset
from paddle.io import BatchSampler, DataLoader


def create_dataloader(dataset, 
                      trans_fn, 
                      batchify_fn, 
                      batch_size=64, 
                      shuffle=True, 
                      sampler=BatchSampler):
    
    
    if not isinstance(dataset, MapDataset):
        dataset = MapDataset(dataset)
        
    dataset.map(trans_fn)
    batch_sampler = sampler(dataset, 
                            shuffle=shuffle, 
                            batch_size=batch_size)
    
    dataloader = DataLoader(dataset,
                            batch_sampler=batch_sampler,
                            collate_fn=batchify_fn)

    return dataloader

<a name="8"></a>
## More thorough tests 

This time, we will include the dev_set for validation and the test_set for evaluation!

<a name="8-1"></a>
### Initializations 

In [33]:
from utils import load_dataset, gather_text

train_set, dev_set, test_set = load_dataset(['train.txt', 'dev.txt', 'test.txt'])

text = gather_text(train_set)
V = TextVectorizer()
V.build_vocab(text)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.


<a name="8-2"></a>
### Test One: CNN

In [34]:
trans_fn = get_trans_fn(V, False)
batchify_fn = get_batchify_fn(False)
train_loader = create_dataloader(train_set, trans_fn, batchify_fn)
dev_loader = create_dataloader(dev_set, trans_fn, batchify_fn)
test_loader = create_dataloader(test_set, trans_fn, batchify_fn)

In [35]:
from paddle_models.CNN import CNN

model = CNN(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_loader, dev_loader, epochs=5, verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
Eval begin...
Eval samples: 1000
Epoch 2/5
Eval begin...
Eval samples: 1000
Epoch 3/5
Eval begin...
Eval samples: 1000
Epoch 4/5
Eval begin...
Eval samples: 1000
Epoch 5/5
Eval begin...
Eval samples: 1000
CPU times: user 36 s, sys: 380 ms, total: 36.4 s
Wall time: 35.6 s


In [36]:
model.evaluate(test_loader)

Eval begin...
step 10/16 - loss: 0.8315 - acc: 0.6312 - 71ms/step
step 16/16 - loss: 0.7882 - acc: 0.6340 - 60ms/step
Eval samples: 1000


{'loss': [0.7882101], 'acc': 0.634}

<a name="8-3"></a>
### Test Two: RNN

In [37]:
trans_fn = get_trans_fn(V, True)
batchify_fn = get_batchify_fn(True)
train_loader = create_dataloader(train_set, trans_fn, batchify_fn)
dev_loader = create_dataloader(dev_set, trans_fn, batchify_fn)
test_loader = create_dataloader(test_set, trans_fn, batchify_fn)

In [38]:
from paddle_models.S_RNN import SimpleRNN

model = SimpleRNN(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_loader, dev_loader, epochs=5, verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
Eval begin...
Eval samples: 1000
Epoch 2/5
Eval begin...
Eval samples: 1000
Epoch 3/5
Eval begin...
Eval samples: 1000
Epoch 4/5
Eval begin...
Eval samples: 1000
Epoch 5/5
Eval begin...
Eval samples: 1000
CPU times: user 15.6 s, sys: 333 ms, total: 16 s
Wall time: 14.8 s


In [39]:
model.evaluate(test_loader)

Eval begin...
step 10/16 - loss: 1.1164 - acc: 0.5547 - 44ms/step
step 16/16 - loss: 1.5635 - acc: 0.5440 - 33ms/step
Eval samples: 1000


{'loss': [1.5634658], 'acc': 0.544}