# Continuous Bag-of-Words

## Training CBoW
The CBoW model is run from the command line. The run below uses the following flags: 
* *-train* and *-val* are the paths to the training and validation data, respectively. 
* *-d* takes a string denoting the direction of the text window, which can be 'before', 'after' or 'both' for the models $p\left(y_t|y_{t-c}\right)$, $p\left(y_t|y_{t+c}\right)$ and $p\left(y_t|y_{t-c}^{t+c}\right)$, respectively. 
* *-pad* is a switch that takes no value; when present, it stores True and enables padding in the model. 
* *-ws* is an integer and sets the window size, $c$. 
* *-embed* is an integer that sets the number of embedding dimensions. 
* *-b* is an integer that sets the batch size. 
* *-lr* needs a float and sets the learning rate for the model. 
* *-e* is an integer and sets the number of iterations or epochs for the model to run. 
* *-f* requires a string that will be used as postfix for all output files. 

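Additional option flags include: 
* *-wd* is the path to the working directory where output files will be saved (it is also passed in the run below). 
* *-r* is the path to a saved checkpoint of the model, which enables resuming training. 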
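
The flag list above maps naturally onto an `argparse` parser. The training script itself is not shown here, so the following is only a minimal sketch of how such a parser could look: the function name `build_parser`, the help strings and all default values are assumptions made for illustration, and only the flag names are taken from this notebook.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Train a CBoW model (illustrative sketch)")
    parser.add_argument("-train", required=True, help="path to the training data")
    parser.add_argument("-val", required=True, help="path to the validation data")
    parser.add_argument("-d", choices=["before", "after", "both"], default="both",
                        help="direction of the text window")
    # A switch: present -> True, absent -> False
    parser.add_argument("-pad", action="store_true", help="enable padding")
    parser.add_argument("-ws", type=int, default=5, help="window size c")
    parser.add_argument("-embed", type=int, default=2, help="embedding dimensions")
    parser.add_argument("-b", type=int, default=256, help="batch size")
    parser.add_argument("-lr", type=float, default=0.001, help="learning rate")
    parser.add_argument("-e", type=int, default=50, help="number of epochs")
    parser.add_argument("-f", default="", help="postfix for all output files")
    parser.add_argument("-wd", default=".", help="working directory for output files")
    parser.add_argument("-r", default=None, help="checkpoint to resume training from")
    return parser


# Parse the same arguments that the run in this notebook passes on the command line
args = build_parser().parse_args([
    "-train", "data/proteins.train.txt", "-val", "data/proteins.val.txt",
    "-d", "both", "-pad", "-ws", "5", "-b", "256", "-f", "_both_5_lr001_em2",
    "-lr", "0.001", "-e", "50", "-embed", "2", "-wd", "data/results/",
])
print(args.d, args.pad, args.ws, args.lr)
```

Restricting *-d* with `choices` would reject a misspelled direction before any data is loaded; whether the real script validates its input this way is not known.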

In [1]:
# Set flags and parameters
tr = 'data/proteins.train.txt'
v = 'data/proteins.val.txt'
ws = 5
d_ws = 'both'
bs = 256
emb = 2
post_fix = '_{0}_{1}_lr001_em{2}'.format(d_ws, ws, emb)

# Working directory, where output files will be saved
wkdir = 'data/results/'

In [None]:
# Run CBoW model through command line
!python CBoWscripts/main_CBoW_aa.py -train $tr -val $v -d $d_ws -pad -ws $ws -b $bs -f $post_fix -lr 0.001 -e 50 -embed $emb -wd $wkdir


The program above will output the following files: 
* Checkpoints containing model details that can be loaded for testing or for resuming training with *-r*. These are stored as .pt files. 
* Log files with epoch, loss, perplexity and accuracy. 
* The performance plot for the current number of epochs, provided that training is not interrupted prematurely.
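
The logged perplexity is related to the logged loss: the test loop in this notebook reports it as the exponential of the mean cross-entropy loss. The sketch below illustrates that relation, assuming the loss is a mean cross-entropy in nats (which is what `nn.CrossEntropyLoss` computes). The helper name `perplexity` is hypothetical, and the uniform baseline over the 20 amino acids is added only as a point of comparison.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


# Baseline: a model that guesses uniformly among the 20 amino acids
uniform_loss = math.log(20)

print(round(uniform_loss, 3))              # loss of the uniform guesser
print(round(perplexity(uniform_loss), 1))  # its perplexity equals the number of classes
print(round(perplexity(2.88), 1))          # perplexity for a mean loss of 2.88
```

A perplexity below 20 therefore indicates that the model does better than guessing uniformly among the amino acids.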

## Testing best model

In [2]:
# Import packages
import CBoW_scripts.functions as f
import torch
import torch.utils.data as data_utils
import torch.nn as nn
import sys
import os
from tqdm import tqdm
import numpy as np

# Set best epoch
check = 50

### Load data

In [3]:
# Load word2idx (was created during training)
word2idx = torch.load('data/word2idx.pt')
print(word2idx)

{'A': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'H': 7, 'I': 8, 'K': 9, 'L': 10, 'M': 11, 'N': 12, 'P': 13, 'Q': 14, 'R': 15, 'S': 16, 'T': 17, 'V': 18, 'W': 19, 'Y': 20, 'padding': 0}


In [4]:
# Load test data
test_data = f.DataLoader()
test_data.load_corpus(path='data/proteins.test.txt')

# Make context pairs for test data
test_data.make_context_pairs(window_size=ws, padding=True, direction=d_ws)

# Convert to numpy
test_data.words_to_index(word2idx=word2idx)

# After the data has been loaded, it is good to check what it looks like. 
print('Number of test samples:\t', test_data.context_array[0].shape)

# Loading corpus...
	Done

# Making context pairs...
	Done

# Converting words to indices...
(19153517, 10)
	Done

Number of test samples:	 (19153517, 10)
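
The second dimension of the context array is 10 because a window of size 5 in direction 'both' gives 5 residues before and 5 after each target. The implementation of `make_context_pairs` in `functions` is not shown in this notebook, so the following is only a sketch of one way such pairs could be built. The function below, its arguments and the three-letter toy sequence are assumptions for illustration; only the 'padding' token with index 0 is taken from `word2idx` above.

```python
def make_context_pairs(sequence, window_size, direction="both", pad="padding"):
    """Return a (context, target) pair for every position in `sequence`.

    Hypothetical helper: pads both ends so that every residue is a target.
    """
    if direction not in ("before", "after", "both"):
        raise ValueError("direction must be 'before', 'after' or 'both'")
    padded = [pad] * window_size + list(sequence) + [pad] * window_size
    pairs = []
    for t in range(window_size, window_size + len(sequence)):
        before = padded[t - window_size:t]
        after = padded[t + 1:t + 1 + window_size]
        if direction == "before":
            context = before
        elif direction == "after":
            context = after
        else:
            context = before + after
        pairs.append((context, padded[t]))
    return pairs


# Toy example with a subset of the vocabulary printed above
word2idx_toy = {"padding": 0, "A": 1, "C": 2, "D": 3}
pairs = make_context_pairs("ACD", window_size=2, direction="both")
indexed = [([word2idx_toy[w] for w in ctx], word2idx_toy[tgt]) for ctx, tgt in pairs]
for ctx, tgt in indexed:
    print(ctx, tgt)
# [0, 0, 2, 3] 1
# [0, 1, 3, 0] 2
# [1, 2, 0, 0] 3
```

With padding, every position in a sequence yields one sample, and each context row has `2 * window_size` entries for 'both' and `window_size` entries for 'before' or 'after'.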


In [5]:
# Make batches
test = data_utils.TensorDataset(torch.from_numpy(test_data.context_array[0]), 
                                torch.from_numpy(test_data.context_array[1]))
load_test = data_utils.DataLoader(test, batch_size=bs, shuffle=True)

### Load model to test

In [6]:
# Set up to use GPU if available
use_cuda = torch.cuda.is_available()
use_cuda

False

In [7]:
# Load net class
net = f.cbow(vocab_size=len(word2idx), embedding_dim=emb, padding=True)

# If GPU is available
if use_cuda:
    print('# Converting network to cuda-enabled')
    net.cuda()
    loc_map = None
else: 
    loc_map='cpu'
print(net)

cbow(
  (embeddings): Embedding(21, 2, padding_idx=0)
  (linear_out): Linear(in_features=2, out_features=21, bias=False)
)


In [None]:
# Set up neural net
net_path = wkdir + 'check' + str(check) + post_fix + '.pt'
# Use a separate name so the epoch number in `check` is not overwritten
checkpoint = torch.load(net_path, map_location=loc_map)
net.load_state_dict(checkpoint['model_state_dict'])
epoch = checkpoint['epoch']

# Set criterion 
criterion = nn.CrossEntropyLoss()

### Run test

In [None]:
# Run model on test set
test_acc, test_loss = [], []

### Evaluation ###
net.eval()

test_preds, test_targs = [], []
test_losses, test_accs, test_lengths = 0, 0, 0
examples, n_examples = [], 5

# Print running 
pbar_test = tqdm(load_test, position=0)
pbar_test.set_description("[Epoch {}, test]".format(epoch+1))

for i, (inputs, labels) in enumerate(pbar_test):
    #print('Batch {0}/{1}'.format(i+1, len(load_test)))
    n_samples = inputs.shape[0]

    # Convert targets and input to cuda if available
    if use_cuda: 
        inputs = inputs.cuda()
        labels = labels.cuda()

    # Get predictions
    output = net(inputs)
    preds = torch.max(input=output, dim=1)[1]

    if use_cuda: 
        preds = preds.data.cpu().numpy()
    else: 
        preds = preds.data.numpy()

    # Accumulate test loss and accuracy, weighted by batch size
    test_losses += criterion(output, labels).item() * n_samples
    test_accs += f.accuracy(y_true=labels, y_pred=output) * n_samples
    test_lengths += n_samples

    # Save predictions and labels
    test_preds += preds.tolist()
    test_targs += labels.tolist()

    # Save example inputs
    if len(examples) < n_examples: 
        for n in range(n_examples):
            examples.append([inputs[n], labels[n].item(), preds[n].item()])
    
    # Print percentage run
    pbar_test.set_postfix(loss=test_losses/test_lengths, perp=np.exp(test_losses/test_lengths), acc=test_accs/test_lengths)
print('\n### Test completed!')

[Epoch 50, test]:  29%|██▉       | 21919/74819 [02:49<06:37, 133.25it/s, acc=0.101, loss=2.88, perp=17.8]
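
The loop multiplies each batch loss and accuracy by `n_samples` before accumulating, because the last batch is usually smaller than the rest and a plain mean over batches would give it too much weight. The sketch below illustrates this with toy numbers and checks the batch count shown in the progress bar. The helper `weighted_mean` and the toy values are hypothetical; the sample count and batch size are taken from this notebook.

```python
import math


def weighted_mean(batch_means, batch_sizes):
    """Average per-batch means, weighting each batch by its number of samples."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# Batch count for the test set above: 19153517 samples in batches of 256
n_samples, batch_size = 19153517, 256
n_batches = math.ceil(n_samples / batch_size)
last_batch = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 74819 109

# Toy example: a full batch of 3 samples and a short batch of 1 sample
print(weighted_mean([2.0, 4.0], [3, 1]))  # 2.5
print(sum([2.0, 4.0]) / 2)                # 3.0 (unweighted, overweights the short batch)
```

Dividing the accumulated sums by `test_lengths` gives the per-sample mean regardless of how the samples were split into batches.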

In [None]:
# Show results of evaluation
print('# Epoch %2i, TEST: loss=%f, perp=%f, acc=%f\n' % (epoch+1, test_losses/test_lengths, 
                                                         np.exp(test_losses/test_lengths), 
                                                         test_accs/test_lengths))

In [None]:
# Show examples of predictions
print('# Input | Label | Prediction\n')
for ex in examples: 
    i, l, p = ex
    # i is a tensor and l, p are integers, so format them instead of concatenating strings
    print('{} | {} | {}\n'.format(i.tolist(), l, p))