In [1]:
# Author: Zhengxiang (Jack) Wang 
# Date: 2022-01-17
# GitHub: https://github.com/jaaack-wang 

## Get PyTorch

In case you have not installed PyTorch, run the following cell.

In [2]:
!pip3 install torch torchvision

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621


## Get Data

In case you have not run `1 - get_data.ipynb`, run the following cell.

In [3]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
import get_data

get_data.get_quora_data()

--2022-01-17 23:48:19--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.53.2
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.53.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2022-01-17 23:48:31 (5.37 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]

train.txt has been saved!
dev.txt has been saved!
test.txt has been saved!


## Preprocess and numericalize text data

In case you have not run `2 - preprocess_data.ipynb`, run the following cell.

In [4]:
from utils import *

# ---- load dataset ----
train_set, dev_set, test_set = load_dataset(['train.txt', 'dev.txt', 'test.txt'])

# ---- numericalize the train, dev, and test sets ----
V = TextVectorizer(tokenize)
text = gather_text(train_set)  # collect the texts from the train set
V.build_vocab(text)  # build the vocab_to_idx mapping and the text_encoder
train_set_encoded = list(encode_dataset(train_set, encoder=V.text_encoder))  # encode the train set
dev_set_encoded = list(encode_dataset(dev_set, encoder=V.text_encoder))  # encode the dev set for validation
test_set_encoded = list(encode_dataset(test_set, encoder=V.text_encoder))  # encode the test set for prediction

# ---- build mini batches for the train, dev, and test sets ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=False)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=False)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=False)

Two vocabulary dictionaries have been built!
Please call X.vocab_to_idx | X.idx_to_vocab to find out more where [X] stands for the name you used for this TextVectorizer class.
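
To make the numericalization steps in the cell above more concrete, here is a minimal pure-Python sketch of building a vocabulary, encoding text, and padding a mini batch. The helper names (`build_vocab`, `encode`, `pad_batch`) and the special tokens are hypothetical and are not the functions defined in `utils`; they only illustrate the general idea.

```python
from collections import Counter

PAD, UNK = "[PAD]", "[UNK]"  # hypothetical special tokens, not taken from utils


def build_vocab(texts, tokenize=str.split, min_freq=1):
    """Map every token seen at least min_freq times to an integer index."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab_to_idx = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab_to_idx[tok] = len(vocab_to_idx)
    return vocab_to_idx


def encode(text, vocab_to_idx, tokenize=str.split):
    """Turn a text into a list of indices; unseen tokens map to UNK."""
    unk = vocab_to_idx[UNK]
    return [vocab_to_idx.get(tok, unk) for tok in tokenize(text)]


def pad_batch(seqs, pad_idx=0):
    """Right-pad every sequence to the longest one and keep the true lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in seqs]
    return padded, lengths


vocab = build_vocab(["how are you", "how old are you really"])
ids = encode("how young are you", vocab)  # "young" is not in the vocabulary
padded, lengths = pad_batch([ids, encode("how are", vocab)])
```

The true lengths returned by `pad_batch` play the same role as the sequence lengths that the RNN models consume later in this notebook.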


### Convert numpy arrays into tensors

It turns out that PyTorch models do not accept NumPy arrays during model training. The problem seems to be an attribute of `torch.Tensor` that is named differently in `numpy.ndarray`; `paddle` does not have this issue.

To maintain consistency, this tutorial does not change the functions that we will build together in the later tutorials. A better way of preprocessing and numericalizing text data with packages from the PyTorch ecosystem will be introduced separately, just as I intend to do for the other two deep learning frameworks.

Likewise, `PyTorchUtils` is a wrapper class I wrote just to get these quick starts going, and it will also be introduced later. Although this is not the best practice for using `pytorch`, you will find it useful for noticing the subtle differences between deep learning frameworks.

In [5]:
from pytorch_utils import to_tensor

train_set_batched = to_tensor(train_set_batched)
dev_set_batched = to_tensor(dev_set_batched)
test_set_batched = to_tensor(test_set_batched)
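
As a rough illustration of what such a conversion can look like, the sketch below turns every NumPy array in a list of batches into a `torch.Tensor` with `torch.from_numpy`. The function name `batches_to_tensors` and the assumed batch layout (a list of arrays per batch) are hypothetical; the actual `to_tensor` in `pytorch_utils` may work differently.

```python
import numpy as np
import torch


def batches_to_tensors(batches):
    """Convert every NumPy array in each batch into a torch.Tensor.

    Assumes each batch is a sequence of NumPy arrays (inputs and labels).
    """
    return [[torch.from_numpy(np.asarray(arr)) for arr in batch] for batch in batches]


# One toy batch: padded token ids and binary labels.
toy_batch = [
    np.array([[2, 1, 3], [2, 3, 0]], dtype=np.int64),
    np.array([1, 0], dtype=np.int64),
]
tensors = batches_to_tensors([toy_batch])
```

Note that `torch.from_numpy` shares memory with the source array, so modifying one in place also changes the other.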

## Training and evaluating models 

In [6]:
from pytorch_utils import PyTorchUtils
import torch.optim as optim
import torch.nn as nn
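
For orientation, here is a minimal sketch of a single training step with the same optimizer and loss used throughout this notebook (`optim.Adam` and `nn.BCEWithLogitsLoss`). It is not the implementation of `PyTorchUtils.train`; the tiny linear model, the random data, and the one-hot targets are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Linear(8, 2)  # stand-in for a real text model with 2 output classes
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(64, 8)        # one fake mini batch of 64 examples
labels = torch.randint(0, 2, (64,))  # fake binary labels
# BCEWithLogitsLoss expects float targets with the same shape as the logits.
targets = nn.functional.one_hot(labels, num_classes=2).float()

model.train()
optimizer.zero_grad()             # clear gradients from the previous step
logits = model(features)          # forward pass, shape (64, 2)
loss = criterion(logits, targets)
loss.backward()                   # backpropagate
optimizer.step()                  # update the parameters

accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
```

A full training run would repeat this step for every mini batch in every epoch and report the averaged loss and accuracy.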

### BoW (Bag of Words) model

#### Training

In [7]:
from pytorch_models.BoW import BoW


model = BoW(len(V.vocab_to_idx), 2)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=False)
%time PT.train(train_set_batched, dev_set_batched, epochs=5)

Epoch 1/5 {'Train loss': '0.70698', 'Train accu': '33.94'}
Validation... {'Dev loss': '0.68095', 'Dev accu': '39.86'}

Epoch 2/5 {'Train loss': '0.64689', 'Train accu': '49.47'}
Validation... {'Dev loss': '0.66661', 'Dev accu': '47.07'}

Epoch 3/5 {'Train loss': '0.61382', 'Train accu': '57.27'}
Validation... {'Dev loss': '0.66069', 'Dev accu': '49.94'}

Epoch 4/5 {'Train loss': '0.58215', 'Train accu': '63.16'}
Validation... {'Dev loss': '0.65781', 'Dev accu': '51.88'}

Epoch 5/5 {'Train loss': '0.54909', 'Train accu': '67.14'}
Validation... {'Dev loss': '0.65696', 'Dev accu': '54.36'}

CPU times: user 3.46 s, sys: 227 ms, total: 3.68 s
Wall time: 1.66 s


#### Evaluation on the test set

In [8]:
PT.evaluate(test_set_batched)

{'Test loss': '0.64489', 'Test accu': '55.23'}
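
For readers wondering what a bag-of-words classifier for text pairs might look like, here is a minimal sketch. It is not the `BoW` class from `pytorch_models`; the class name `TinyBoW`, the embedding size, the masked mean pooling, and the assumption that the model receives two padded id tensors (one per question) are all illustrative choices.

```python
import torch
import torch.nn as nn


class TinyBoW(nn.Module):
    """Average the word embeddings of each text, then classify the pair."""

    def __init__(self, vocab_size, num_classes, emb_dim=32, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.fc = nn.Linear(2 * emb_dim, num_classes)

    def encode(self, ids):
        # Mask out padding so it does not dilute the average.
        mask = (ids != self.pad_idx).unsqueeze(-1).float()  # (batch, seq_len, 1)
        summed = (self.embedding(ids) * mask).sum(dim=1)    # (batch, emb_dim)
        counts = mask.sum(dim=1).clamp(min=1.0)             # avoid division by zero
        return summed / counts

    def forward(self, ids_a, ids_b):
        pooled = torch.cat([self.encode(ids_a), self.encode(ids_b)], dim=-1)
        return self.fc(pooled)  # raw logits, shape (batch, num_classes)


toy_model = TinyBoW(vocab_size=100, num_classes=2)
ids_a = torch.tensor([[5, 7, 9, 0], [3, 4, 0, 0]])
ids_b = torch.tensor([[5, 7, 0, 0], [8, 2, 6, 1]])
toy_logits = toy_model(ids_a, ids_b)
```

Because word order is discarded by the averaging, a model like this is fast to train but cannot distinguish texts that contain the same words in a different order.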

### CNN (Convolutional Neural Network) model

#### Training

In [9]:
from pytorch_models.CNN import CNN


model = CNN(len(V.vocab_to_idx), 2)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=False)
%time PT.train(train_set_batched, dev_set_batched, epochs=5)

Epoch 1/5 {'Train loss': '0.68896', 'Train accu': '31.35'}
Validation... {'Dev loss': '0.67878', 'Dev accu': '53.59'}

Epoch 2/5 {'Train loss': '0.64433', 'Train accu': '59.66'}
Validation... {'Dev loss': '0.65398', 'Dev accu': '57.85'}

Epoch 3/5 {'Train loss': '0.57345', 'Train accu': '70.37'}
Validation... {'Dev loss': '0.63968', 'Dev accu': '60.64'}

Epoch 4/5 {'Train loss': '0.46555', 'Train accu': '79.54'}
Validation... {'Dev loss': '0.65233', 'Dev accu': '61.27'}

Epoch 5/5 {'Train loss': '0.33501', 'Train accu': '88.75'}
Validation... {'Dev loss': '0.68868', 'Dev accu': '61.35'}

CPU times: user 12.8 s, sys: 2.54 s, total: 15.4 s
Wall time: 11.2 s


#### Evaluation on the test set

In [10]:
PT.evaluate(test_set_batched)

{'Test loss': '0.65676', 'Test accu': '64.84'}

## RNN (Recurrent neural network) models

As the RNN models also take the sequence length as an input, we need to rebuild the mini batches for the train set, dev set, and test set with `include_seq_len=True`.

In [11]:
# ---- build mini batches for the train, dev, and test sets ----
train_set_batched = build_batches(train_set_encoded, batch_size=64, include_seq_len=True)
dev_set_batched = build_batches(dev_set_encoded, batch_size=64, include_seq_len=True)
test_set_batched = build_batches(test_set_encoded, batch_size=64, include_seq_len=True)

train_set_batched = to_tensor(train_set_batched)
dev_set_batched = to_tensor(dev_set_batched)
test_set_batched = to_tensor(test_set_batched)
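
One common reason for passing sequence lengths to a recurrent model is to let it skip the padded positions. The sketch below shows this with `pack_padded_sequence`; whether the models in `pytorch_models` use the lengths this way is an assumption, and the layer sizes and toy inputs are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

embedding = nn.Embedding(50, 8, padding_idx=0)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

ids = torch.tensor([[4, 9, 7, 2], [6, 3, 0, 0]])  # second row has 2 padded slots
seq_len = torch.tensor([4, 2])                    # true lengths, kept on the CPU

# Packing tells the GRU to stop at each sequence's true length.
packed = pack_padded_sequence(
    embedding(ids), seq_len, batch_first=True, enforce_sorted=False
)
_, last_hidden = gru(packed)         # last_hidden: (num_layers, batch, hidden)
sentence_vectors = last_hidden[-1]   # (batch, hidden), in the original batch order
```

With packing, the final hidden state of the second sequence is taken after its second token rather than after the padding.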

### Simple RNN model

#### Training

In [12]:
from pytorch_models.S_RNN import SimpleRNN


model = SimpleRNN(len(V.vocab_to_idx), 2, bidirectional=False)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=True)
%time PT.train(train_set_batched, dev_set_batched, epochs=5)

Epoch 1/5 {'Train loss': '0.69128', 'Train accu': '35.06'}
Validation... {'Dev loss': '0.68908', 'Dev accu': '40.62'}

Epoch 2/5 {'Train loss': '0.66777', 'Train accu': '56.48'}
Validation... {'Dev loss': '0.68243', 'Dev accu': '50.10'}

Epoch 3/5 {'Train loss': '0.62369', 'Train accu': '65.12'}
Validation... {'Dev loss': '0.68859', 'Dev accu': '55.14'}

Epoch 4/5 {'Train loss': '0.55485', 'Train accu': '71.34'}
Validation... {'Dev loss': '0.72700', 'Dev accu': '55.35'}

Epoch 5/5 {'Train loss': '0.47938', 'Train accu': '77.40'}
Validation... {'Dev loss': '0.78453', 'Dev accu': '55.94'}

CPU times: user 7.52 s, sys: 1.25 s, total: 8.77 s
Wall time: 5.24 s


#### Evaluation on the test set

In [13]:
PT.evaluate(test_set_batched)

{'Test loss': '0.73299', 'Test accu': '59.28'}

### GRU (Gated recurrent units) model 

#### Training

In [14]:
from pytorch_models.GRU import GRU


model = GRU(len(V.vocab_to_idx), 2, bidirectional=False)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=True)
%time PT.train(train_set_batched, dev_set_batched, epochs=5)

Epoch 1/5 {'Train loss': '0.69199', 'Train accu': '40.32'}
Validation... {'Dev loss': '0.68728', 'Dev accu': '45.27'}

Epoch 2/5 {'Train loss': '0.67196', 'Train accu': '55.40'}
Validation... {'Dev loss': '0.67283', 'Dev accu': '55.06'}

Epoch 3/5 {'Train loss': '0.62632', 'Train accu': '64.53'}
Validation... {'Dev loss': '0.65178', 'Dev accu': '60.68'}

Epoch 4/5 {'Train loss': '0.55954', 'Train accu': '71.28'}
Validation... {'Dev loss': '0.65149', 'Dev accu': '62.83'}

Epoch 5/5 {'Train loss': '0.48512', 'Train accu': '76.65'}
Validation... {'Dev loss': '0.67842', 'Dev accu': '62.15'}

CPU times: user 14.4 s, sys: 4.09 s, total: 18.5 s
Wall time: 11.5 s


#### Evaluation on the test set

In [15]:
PT.evaluate(test_set_batched)

{'Test loss': '0.67833', 'Test accu': '60.86'}

### LSTM (Long short-term memory) model

#### Training

In [16]:
from pytorch_models.LSTM import LSTM


model = LSTM(len(V.vocab_to_idx), 2, bidirectional=False)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.BCEWithLogitsLoss()
PT = PyTorchUtils(model, optimizer, criterion, include_seq_len=True)
%time PT.train(train_set_batched, dev_set_batched, epochs=5)

Epoch 1/5 {'Train loss': '0.69284', 'Train accu': '45.27'}
Validation... {'Dev loss': '0.68995', 'Dev accu': '43.09'}

Epoch 2/5 {'Train loss': '0.67827', 'Train accu': '53.87'}
Validation... {'Dev loss': '0.67460', 'Dev accu': '57.54'}

Epoch 3/5 {'Train loss': '0.61445', 'Train accu': '65.91'}
Validation... {'Dev loss': '0.64512', 'Dev accu': '60.49'}

Epoch 4/5 {'Train loss': '0.53654', 'Train accu': '73.67'}
Validation... {'Dev loss': '0.65003', 'Dev accu': '60.53'}

Epoch 5/5 {'Train loss': '0.45890', 'Train accu': '78.61'}
Validation... {'Dev loss': '0.68671', 'Dev accu': '61.33'}

CPU times: user 18.2 s, sys: 4.87 s, total: 23 s
Wall time: 13.9 s


#### Evaluation on the test set

In [17]:
PT.evaluate(test_set_batched)

{'Test loss': '0.68404', 'Test accu': '62.68'}