## Get Data

If you have not yet run `1 - get_data.ipynb`, run the following cell.

In [1]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
import get_data

get_data.get_quora_data()

--2022-01-16 15:58:20--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.53.2
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.53.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv.1’


2022-01-16 15:58:23 (30.8 MB/s) - ‘quora_duplicate_questions.tsv.1’ saved [58176133/58176133]

train.txt has been saved!
dev.txt has been saved!
test.txt has been saved!


## Preprocess and numericalize text data

If you have not yet run `2 - preprocess_data.ipynb`, run the following cell.

In [2]:
from utils import *

# ---- load dataset ----
train_set, dev_set, test_set = load_dataset(['train.txt', 'dev.txt', 'test.txt'])

# ---- numericalize the train set ----
V = TextVectorizer(tokenize)
text = gather_text(train_set)  # collect the texts from the train set
V.build_vocab(text)  # build the vocab_to_idx mapping and the text encoder
train_set_encoded = list(encode_dataset(train_set, encoder=V.text_encoder))  # encode the train set
dev_set_encoded = list(encode_dataset(dev_set, encoder=V.text_encoder))  # encode the dev set for validation
test_set_encoded = list(encode_dataset(test_set, encoder=V.text_encoder))  # encode the test set for prediction

# ---- build mini batches for the train, dev, and test sets ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=False)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=False)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=False)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.
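
The two dictionaries mentioned in the output map tokens to integer ids and back. The internals of `TextVectorizer` are not shown in this notebook, so the following is only a minimal sketch of the general idea. The function names `build_vocab` and `encode`, the whitespace tokenizer, and the `<pad>`/`<unk>` special tokens are all assumptions made for illustration, not the API of `utils`.

```python
from collections import Counter


def build_vocab(texts, tokenize=str.split, specials=("<pad>", "<unk>")):
    """Return (vocab_to_idx, idx_to_vocab); special tokens take the lowest ids."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    vocab_to_idx = {tok: i for i, tok in enumerate(list(specials) + ordered)}
    idx_to_vocab = {i: tok for tok, i in vocab_to_idx.items()}
    return vocab_to_idx, idx_to_vocab


def encode(text, vocab_to_idx, tokenize=str.split):
    """Replace each token by its id; unseen tokens map to the <unk> id."""
    unk = vocab_to_idx["<unk>"]
    return [vocab_to_idx.get(tok, unk) for tok in tokenize(text)]


# Hypothetical toy corpus, not taken from the Quora data.
vocab_to_idx, idx_to_vocab = build_vocab(
    ["how do i learn python", "how do i learn java"])
encoded = encode("how do i learn rust", vocab_to_idx)
print(encoded)
```

The vocabulary is built from the train set only, so any token that first appears in the dev or test set has to fall back to an unknown-token id, as `rust` does here.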


## Training and evaluating models 

### BoW (Bag of Words) model

#### Training

In [3]:
from paddle_models.BoW import BoW
import paddle

In [4]:
def get_model(model):
    # Wrap the network in paddle.Model and attach optimizer, loss, and metric.
    model = paddle.Model(model)
    optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=5e-4)
    criterion = paddle.nn.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    model.prepare(optimizer, criterion, metric)
    return model

In [5]:
model = BoW(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5

save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 1.87 s, sys: 130 ms, total: 2 s
Wall time: 2 s


#### Evaluation on the test set

In [6]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 0.9671 - acc: 0.6203 - 2ms/step
step 16/16 - loss: 1.0170 - acc: 0.6120 - 2ms/step
Eval samples: 1000


{'loss': [1.0170405], 'acc': 0.612}
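
The step counts in the evaluation log follow from the batch size. The arithmetic below is a small check, assuming the reported accuracy is accumulated over all samples seen so far (which is what the training log message suggests): 1000 samples in batches of 64 give 16 steps, and the accuracy at step 10 corresponds to a whole number of correct predictions out of 640.

```python
import math

eval_samples = 1000   # "Eval samples: 1000" in the log
batch_size = 64       # batch_size passed to build_batches

steps = math.ceil(eval_samples / batch_size)
last_batch = eval_samples - (steps - 1) * batch_size

# Accuracy logged at step 10 of the BoW evaluation above.
seen_at_step_10 = 10 * batch_size
correct_at_step_10 = round(0.6203 * seen_at_step_10)

print(steps, last_batch, seen_at_step_10, correct_at_step_10)
```

Under that assumption, the final `acc` of 0.612 means 612 of the 1000 test pairs were classified correctly.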

### CNN (Convolutional Neural Network) model

#### Training

In [7]:
from paddle_models.CNN import CNN

model = CNN(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 19 s, sys: 246 ms, total: 19.2 s
Wall time: 19.4 s


#### Evaluation on the test set

In [8]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 0.9078 - acc: 0.6422 - 26ms/step
step 16/16 - loss: 1.0210 - acc: 0.6330 - 26ms/step
Eval samples: 1000


{'loss': [1.0210375], 'acc': 0.633}

## RNN (Recurrent Neural Network) models

Because the RNN models also take the sequence length as an input, we need to rebuild the mini batches for the train, dev, and test sets with `include_seq_len=True`. The encoded sets themselves are reused unchanged.

In [9]:
# ---- rebuild mini batches for the train, dev, and test sets, now with sequence lengths ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=True)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=True)
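
Recurrent models typically need the true length of each sequence so that padded positions can be ignored. The implementation of `build_batches` is not shown here, so the following is only a rough sketch of what batching with sequence lengths can look like. The helper names `pad_batch` and `batches_with_lengths`, the `(q1_ids, q2_ids, label)` example layout, and the padding id `0` are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to the batch maximum and record the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


def batches_with_lengths(examples, batch_size, pad_id=0):
    """Yield (q1, q1_len, q2, q2_len, labels) for each mini batch.

    `examples` is assumed to be a list of (q1_ids, q2_ids, label) triples.
    """
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        q1, q1_len = pad_batch([ex[0] for ex in chunk], pad_id)
        q2, q2_len = pad_batch([ex[1] for ex in chunk], pad_id)
        labels = [ex[2] for ex in chunk]
        yield q1, q1_len, q2, q2_len, labels


# Hypothetical toy examples, not taken from the Quora data.
toy = [([3, 2, 4], [5, 6], 1), ([7], [8, 9, 10, 11], 0), ([4, 4], [2], 0)]
first = next(batches_with_lengths(toy, batch_size=2))
print(first)
```

Padding is done per batch rather than to a global maximum, which keeps short batches small.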

### Simple RNN model

#### Training

In [10]:
from paddle_models.S_RNN import SimpleRNN

model = SimpleRNN(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 5.86 s, sys: 209 ms, total: 6.07 s
Wall time: 6.11 s


#### Evaluation on the test set

In [11]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.3285 - acc: 0.5578 - 7ms/step
step 16/16 - loss: 1.3155 - acc: 0.5600 - 8ms/step
Eval samples: 1000


{'loss': [1.3154628], 'acc': 0.56}

### GRU (Gated Recurrent Unit) model

#### Training

In [12]:
from paddle_models.GRU import GRU

model = GRU(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 11.9 s, sys: 197 ms, total: 12.1 s
Wall time: 12.1 s


#### Evaluation on the test set

In [13]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.4538 - acc: 0.6094 - 18ms/step
step 16/16 - loss: 1.0900 - acc: 0.6120 - 18ms/step
Eval samples: 1000


{'loss': [1.0899506], 'acc': 0.612}

### LSTM (Long Short-Term Memory) model

#### Training

In [14]:
from paddle_models.LSTM import LSTM

model = LSTM(len(V.vocab_to_idx), 2)
model = get_model(model)
%time model.fit(train_set_batched, dev_set_batched, epochs=5, save_dir='./ckpt', verbose=1)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/0
Eval begin...
Eval samples: 1000
Epoch 2/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/1
Eval begin...
Eval samples: 1000
Epoch 3/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/2
Eval begin...
Eval samples: 1000
Epoch 4/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/3
Eval begin...
Eval samples: 1000
Epoch 5/5
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/4
Eval begin...
Eval samples: 1000
save checkpoint at /Users/wzx/Documents/GitHub/text_matching/ckpt/final
CPU times: user 14.5 s, sys: 225 ms, total: 14.7 s
Wall time: 14.7 s


#### Evaluation on the test set

In [15]:
model.evaluate(test_set_batched)

Eval begin...
step 10/16 - loss: 1.3624 - acc: 0.6078 - 22ms/step
step 16/16 - loss: 1.3373 - acc: 0.6220 - 21ms/step
Eval samples: 1000


{'loss': [1.3372626], 'acc': 0.622}
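
To compare the five models side by side, the test accuracies and training wall times reported in the outputs above can be collected into one table. The numbers below are copied from this run and will change if the notebook is rerun. The dictionary layout is just one convenient way to hold them.

```python
# (test accuracy, training wall time in seconds), copied from the outputs above.
results = {
    "BoW": (0.612, 2.0),
    "CNN": (0.633, 19.4),
    "SimpleRNN": (0.56, 6.11),
    "GRU": (0.612, 12.1),
    "LSTM": (0.622, 14.7),
}

# Rank by test accuracy, best first; ties keep their insertion order.
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)

print(f"{'model':<10} {'acc':>6} {'train (s)':>10}")
for name, (acc, seconds) in ranked:
    print(f"{name:<10} {acc:>6.3f} {seconds:>10.2f}")
```

With only 1000 test pairs and 5 epochs of training, the gaps between models are small, so this ranking should be read as a rough indication rather than a firm conclusion.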