In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-16, modified on 2022-01-17
# GitHub: https://github.com/jaaack-wang 

## Get paddle

If you have not installed PaddlePaddle yet, run the following cell.

In [2]:
!python3 -m pip install paddlepaddle
#pip3 install paddlepaddle



## Get Data

If you have not run `1 - get_data.ipynb`, run the following cell.

In [3]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
import get_data

get_data.get_quora_data()

--2022-01-17 21:35:43--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.53.2
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.53.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv.2’


2022-01-17 21:35:48 (13.5 MB/s) - ‘quora_duplicate_questions.tsv.2’ saved [58176133/58176133]

train.txt has been saved!
dev.txt has been saved!
test.txt has been saved!


## Preprocess and numericalize text data

If you have not run `2 - preprocess_data.ipynb`, run the following cell.

In [4]:
from utils import *

# ---- load dataset ----
train_set, dev_set, test_set = load_dataset(['train.txt', 'dev.txt', 'test.txt'])

# ---- numericalize the train, dev, and test sets ----
V = TextVectorizer(tokenize)
text = gather_text(train_set) # collect the texts from the train set
V.build_vocab(text) # build the vocab_to_idx mapping and the text_encoder
train_set_encoded = list(encode_dataset(train_set, encoder=V.text_encoder)) # encode the train set
dev_set_encoded = list(encode_dataset(dev_set, encoder=V.text_encoder)) # encode the dev set for validation
test_set_encoded = list(encode_dataset(test_set, encoder=V.text_encoder)) # encode the test set for prediction

# ---- build mini batches for the train, dev, and test sets ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=False)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=False)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=False)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.
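
To make the numericalization step concrete, here is a minimal sketch of what building a vocabulary and encoding text can look like. It does not reproduce `TextVectorizer` from `utils`: the names `build_vocab` and `encode`, the whitespace tokenizer, and the `[PAD]`/`[UNK]` special tokens are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(texts, tokenize=str.split, specials=("[PAD]", "[UNK]")):
    """Map each token to an integer index (hypothetical helper, not from utils)."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Reserve the first indices for the special tokens.
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, tokenize=str.split):
    """Turn a text into a list of indices; unseen tokens map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in tokenize(text)]


toy_texts = ["how do i learn python", "how do i learn java quickly"]
toy_vocab = build_vocab(toy_texts)
toy_ids = encode("how do i learn rust", toy_vocab)
print(toy_ids)
```

As in the cell above, the vocabulary is built from the training texts only, so a token that appears only in the dev or test set (here `rust`) falls back to the unknown-token index.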


## Training and evaluating models 

### BoW (Bag of Words) model

#### Training

In [5]:
from paddle_models.BoW import BoW
import paddle

In [6]:
def get_model(model):
    """Wrap a network in paddle.Model and attach the optimizer, loss, and metric."""
    model = paddle.Model(model)
    optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=5e-4)
    criterion = paddle.nn.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    model.prepare(optimizer, criterion, metric)
    return model
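
For readers unfamiliar with the loss and metric used in `get_model`, the following pure-Python sketch shows what cross-entropy over raw two-class logits and accuracy conventionally compute. It is an illustration only, not Paddle's implementation; the helper names and toy values are assumptions, and the exact defaults of `paddle.nn.CrossEntropyLoss` should be checked against the Paddle documentation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(softmax(l)[y]) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class is the true class."""
    preds = [max(range(len(l)), key=l.__getitem__) for l in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy batch: three question pairs, two classes (0 = not duplicate, 1 = duplicate).
toy_logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]]
toy_labels = [0, 1, 0]
toy_loss = cross_entropy(toy_logits, toy_labels)
toy_acc = accuracy(toy_logits, toy_labels)
print(round(toy_loss, 4), round(toy_acc, 4))
```

A confident wrong prediction (the third example) contributes far more to the loss than an uncertain one (the second), which is why loss and accuracy can move in different directions.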

In [7]:
model = BoW(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5

save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 1.99 s, sys: 141 ms, total: 2.13 s
Wall time: 2.14 s


#### Evaluation on the test set

In [8]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 0.8390 - acc: 0.6359 - 2ms/step
step 16/16 - loss: 0.8447 - acc: 0.6330 - 2ms/step
Eval samples: 1000


{'loss': [0.8446506], 'acc': 0.633}
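
The step counts and accuracies in the log above can be checked with a little arithmetic. The sketch below assumes that the accuracy shown at each step is the cumulative fraction of correct predictions over all samples seen so far. The counts 407 and 226 are not printed in the log; they are inferred from the reported accuracies (0.6359 × 640 ≈ 407 and 0.633 × 1000 = 633).

```python
import math


def num_steps(num_samples, batch_size):
    """Number of mini-batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)


def running_accuracy(correct_counts, sample_counts):
    """Cumulative accuracy after each chunk of samples."""
    seen = correct = 0
    history = []
    for c, n in zip(correct_counts, sample_counts):
        correct += c
        seen += n
        history.append(correct / seen)
    return history


steps = num_steps(1000, 64)            # 1000 test samples, batch size 64
last_batch = 1000 - (steps - 1) * 64   # size of the final, smaller batch

# First 10 steps (640 samples), then the remaining 6 steps (360 samples).
acc_history = running_accuracy([407, 226], [640, 360])
print(steps, last_batch, [round(a, 4) for a in acc_history])
```

Under this reading, the `loss` field reports only the last batch, while `acc` summarizes the whole test set, which matches the note printed at the start of training.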

### CNN (Convolutional Neural Network) model

#### Training

In [9]:
from paddle_models.CNN import CNN

model = CNN(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 18.7 s, sys: 224 ms, total: 18.9 s
Wall time: 19 s


#### Evaluation on the test set

In [10]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 0.9297 - acc: 0.6578 - 24ms/step
step 16/16 - loss: 0.7140 - acc: 0.6520 - 25ms/step
Eval samples: 1000


{'loss': [0.7140364], 'acc': 0.652}

## RNN (Recurrent Neural Network) models

Because the RNN models also take the sequence length as an input, we need to rebuild the mini batches for the train, dev, and test sets with `include_seq_len=True`.

In [11]:
# ---- rebuild mini batches for the train, dev, and test sets with sequence lengths ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=True)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=True)
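
A recurrent layer typically needs the true length of each sequence so that it can ignore the padding added to make all sequences in a batch equally long. The sketch below illustrates the idea; it is not the implementation of `build_batches`, and the function name `pad_batch` and the padding index `0` are assumptions.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and record the original lengths."""
    lengths = [len(s) for s in sequences]
    max_len = max(lengths)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    return padded, lengths


toy_padded, toy_lengths = pad_batch([[5, 8, 2], [7], [4, 9]])
print(toy_padded)
print(toy_lengths)
```

With the lengths available, a model can read the hidden state at the last real token of each sequence rather than at a padding position.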

### Simple RNN model

#### Training

In [12]:
from paddle_models.S_RNN import SimpleRNN

model = SimpleRNN(len(V.vocab_to_idx), 2, bidirectional=False)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 5.76 s, sys: 228 ms, total: 5.99 s
Wall time: 6.02 s


#### Evaluation on the test set

In [13]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.4090 - acc: 0.5891 - 7ms/step
step 16/16 - loss: 1.3625 - acc: 0.5670 - 7ms/step
Eval samples: 1000


{'loss': [1.3625097], 'acc': 0.567}

### GRU (Gated Recurrent Unit) model

#### Training

In [14]:
from paddle_models.GRU import GRU

model = GRU(len(V.vocab_to_idx), 2, bidirectional=False)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 12.8 s, sys: 287 ms, total: 13.1 s
Wall time: 13.2 s


#### Evaluation on the test set

In [15]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.0250 - acc: 0.6219 - 19ms/step
step 16/16 - loss: 2.1369 - acc: 0.6340 - 18ms/step
Eval samples: 1000


{'loss': [2.1368642], 'acc': 0.634}

### LSTM (Long Short-Term Memory) model

#### Training

In [16]:
from paddle_models.LSTM import LSTM

model = LSTM(len(V.vocab_to_idx), 2, bidirectional=False)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 15.3 s, sys: 292 ms, total: 15.6 s
Wall time: 15.7 s


#### Evaluation on the test set

In [17]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.1004 - acc: 0.6234 - 22ms/step
step 16/16 - loss: 1.8570 - acc: 0.6380 - 21ms/step
Eval samples: 1000


{'loss': [1.8569803], 'acc': 0.638}
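
The test accuracies and training wall times reported in the cells above can be gathered in one place for comparison. The snippet below only restates numbers already printed in this notebook; the dictionary layout is an illustrative choice.

```python
# (test accuracy, training wall time in seconds), copied from the outputs above
results = {
    "BoW": (0.633, 2.14),
    "CNN": (0.652, 19.0),
    "SimpleRNN": (0.567, 6.02),
    "GRU": (0.634, 13.2),
    "LSTM": (0.638, 15.7),
}

# Rank the models by test accuracy, best first.
ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
for name, (acc, secs) in ranked:
    print(f"{name:<10} acc={acc:.3f}  train_wall_time={secs:.2f}s")
```

Because the test set has only 1000 samples, a gap of 0.001 in accuracy corresponds to a single example, so the ordering of BoW, GRU, and LSTM should be read with caution.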