# Neural Probabilistic Language Model (Yoshua Bengio et al.)

- Goal: Encode natural language in a vector space
- Solution: Implement the NPLM (neural probabilistic language model)

### Outline
- Basics
    - What is a probabilistic model of language?
    - ...other questions...
- Defining a NN model

### Basics

It is not obvious that natural language can be mapped to a vector space at all. This paper attempts it (and does pretty well), but first we need some definitions.

#### What is a probabilistic model of language?

A statistical model of language can be represented by the conditional probability of the next word given all the previous words. By the chain rule, the probability of a whole sequence is the product of these conditionals:

$$ \hat{P}(w_{1}^{T}) = \prod_{t=1}^{T} \hat{P}(w_t | w_{1}^{t-1})$$
where
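- $T$ is the number of words in the sequence
- $w_t$ is the $t$-th word, and $w_{1}^{t-1} = (w_1, \dots, w_{t-1})$ is the sequence of words before it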
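
To make the factorization concrete, here is a minimal sketch that multiplies conditional probabilities for one toy sentence. The `cond_prob` table and its numbers are invented for illustration; they are not from the paper or from the Brown corpus.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_1 .. w_{t-1}) for one toy sentence.
# Keys are (history tuple, next word); the values are made up.
cond_prob = {
    ((), "the"): 0.5,
    (("the",), "dog"): 0.2,
    (("the", "dog"), "runs"): 0.1,
}

def sentence_log_prob(words, table):
    """Sum log P(w_t | w_1..w_{t-1}) over the sentence (chain rule in log space)."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])
        total += math.log(table[(history, word)])
    return total

log_p = sentence_log_prob(["the", "dog", "runs"], cond_prob)
print(math.exp(log_p))  # product of the three conditionals: 0.5 * 0.2 * 0.1
```

Working in log space avoids underflow once sentences get long, which is also why the model later in the notebook outputs log-probabilities.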

**Note:** Bengio's 2003 paper computes the conditional probabilities with a full softmax over the vocabulary, which costs $O(|V|)$ per prediction. A common speed-up is hierarchical softmax, which arranges the words as leaves of a binary tree so that a prediction only evaluates the roughly $\log_2 |V|$ nodes on one root-to-leaf path instead of normalizing over every word.
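
As a rough illustration of the hierarchical softmax mentioned in the note: each word is a leaf of a binary tree, and its probability is the product of the left/right decisions on the path from the root. The sketch below assumes a complete binary tree over four words with made-up scalar node scores; in a real model each score would come from the hidden state and a learned per-node vector.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical scores for the 3 internal nodes of a complete binary tree over 4 words.
# The numbers are made up for illustration.
node_score = {"root": 0.3, "left": -1.2, "right": 0.8}

# Path for each word: (internal node, go_right?) decisions from the root to the leaf.
paths = {
    "cat":  [("root", False), ("left", False)],
    "dog":  [("root", False), ("left", True)],
    "car":  [("root", True), ("right", False)],
    "road": [("root", True), ("right", True)],
}

def word_prob(word):
    """Multiply the branch probabilities along the root-to-leaf path."""
    p = 1.0
    for node, go_right in paths[word]:
        p_right = sigmoid(node_score[node])
        p *= p_right if go_right else (1.0 - p_right)
    return p

probs = {w: word_prob(w) for w in paths}
print(probs)
print(sum(probs.values()))  # the leaf probabilities sum to 1 (up to rounding)
```

Each word touches only two internal nodes here, whereas a flat softmax would need all four scores to compute the normalizer.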

Some things to think about:
- Do we always need a corpus of words?
- What happens when we encounter a sequence of words that never appeared in the corpus? (e.g. "Dog was running in the bedroom" is in the training data, and then we encounter "Cat was running in the garage")

### Defining a Neural Model

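- Training set: a sequence of words $w_1 \dots w_T$, where every $w_t \in V$ (the vocabulary)
- Input: the previous words of the sequence (in practice only the last few; the code below uses the previous 2)
- Output: a prediction of the last word of the sequence, i.e. the word that follows the input
- Objective: learn $f(w_1 ... w_t) = \hat{P}(w_t | w_{1}^{t-1})$
- Mapping: words -> embedding vectors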
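
Written out for the trigram case used in the code below (two context words), the forward pass is roughly the following. The symbols $C$, $W_1$, $b$ and $W_2$ are naming assumptions that mirror the comments in the code rather than the paper's notation, and the paper's full model also has an output bias and optional direct connections from the embeddings to the output, which the model in this notebook omits.

$$ x' = [\,C(w_{t-2});\ C(w_{t-1})\,], \qquad h = \tanh(W_1 x' + b) $$

$$ \hat{P}(w_t = i | w_{t-2}, w_{t-1}) = \frac{\exp\big((W_2 h)_i\big)}{\sum_{j \in V} \exp\big((W_2 h)_j\big)} $$

Here $C$ is the embedding table (one row per word in $V$), $[\cdot\,;\cdot]$ is concatenation, and the last line is a softmax over the vocabulary.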

In [13]:
import nltk
import csv
import numpy as np

#If we don't have the relevant corpus (corpora in plural),
#download them
try:
    nltk.data.find('corpora/brown')
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('brown')
    nltk.download('wordnet')    

from nltk.corpus import brown
from nltk.corpus import wordnet

#Boilerplate pre-processing
num_train = 1000
UNK_symbol = "<UNK>"
vocab = set([UNK_symbol])

# build the training corpus from the first num_train Brown paragraphs
# no preprocessing other than lowercasing
brown_corpus_train = []
for idx,paragraph in enumerate(brown.paras()):
    if idx == num_train:
        break
    words = []
    for sentence in paragraph:
        for word in sentence:
            words.append(word.lower())
    brown_corpus_train.append(words)

# create term frequency of the words
words_term_frequency_train = {}
for doc in brown_corpus_train:
    for word in doc:
        # this will calculate term frequency
        # since we are taking all words now
        words_term_frequency_train[word] = words_term_frequency_train.get(word,0) + 1

# create vocabulary
for doc in brown_corpus_train:
    for word in doc:
        if words_term_frequency_train.get(word,0) >= 5:
            vocab.add(word)

# create required lists
x_train = []
y_train = []
x_dev = []
y_dev = []

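# create word to id mappings
# (sorted so the ids are identical on every run; plain set order can change
# between sessions, which would break a model saved in an earlier session)
word_to_id_mappings = {}
for idx,word in enumerate(sorted(vocab)):
    word_to_id_mappings[word] = idx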

# function to get id for a given word
# return <UNK> id if not found
def get_id_of_word(word):
    unknown_word_id = word_to_id_mappings['<UNK>']
    return word_to_id_mappings.get(word,unknown_word_id)

# creating training and dev set
for idx,paragraph in enumerate(brown.paras()):
    for sentence in paragraph:
        for i,word in enumerate(sentence):
            if i+2 >= len(sentence):
                # sentence boundary reached
                # ignoring sentence less than 3 words
                break
            # convert word to id
            x_extract = [get_id_of_word(word.lower()),get_id_of_word(sentence[i+1].lower())]
            y_extract = [get_id_of_word(sentence[i+2].lower())]
            if idx < num_train:
                x_train.append(x_extract)
                y_train.append(y_extract)
            else:
                x_dev.append(x_extract)
                y_dev.append(y_extract)

# making numpy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)
x_dev = np.array(x_dev)
y_dev = np.array(y_dev)  

[nltk_data] Downloading package brown to /home/kkang2097/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/kkang2097/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
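
The windowing in the last loop of the cell above can be hard to see in the full-corpus code. Here is a minimal sketch on a single toy sentence; the helper name `trigram_pairs` and the `toy_vocab` dictionary are made up for this example, and targets are stored as plain ids rather than one-element lists.

```python
def trigram_pairs(sentence, word_to_id, unk="<UNK>"):
    """Return (context, target) id pairs: two context words predict the third."""
    ids = [word_to_id.get(w.lower(), word_to_id[unk]) for w in sentence]
    contexts, targets = [], []
    for i in range(len(ids) - 2):
        contexts.append([ids[i], ids[i + 1]])
        targets.append(ids[i + 2])
    return contexts, targets

# Hypothetical toy vocabulary, not the Brown vocabulary built above.
toy_vocab = {"<UNK>": 0, "the": 1, "dog": 2, "was": 3, "running": 4}
x, y = trigram_pairs(["The", "dog", "was", "running", "fast"], toy_vocab)
print(x)  # [[1, 2], [2, 3], [3, 4]]
print(y)  # [3, 4, 0]  ("fast" is out of vocabulary, so it maps to <UNK>)
```

A sentence of length n yields n - 2 training pairs, so sentences shorter than three words contribute nothing.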


In [14]:
#Copy pasting tri-gram NPLM
#link: https://abhinavcreed13.github.io/blog/bengio-trigram-nplm-using-pytorch/
import torch
import multiprocessing
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import time

# Trigram Neural Network Model
class TrigramNNmodel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(TrigramNNmodel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size, bias = False)

    def forward(self, inputs):
        # compute x': concatenation of x1 and x2 embeddings
        embeds = self.embeddings(inputs).view((-1,self.context_size * self.embedding_dim))
        # compute h: tanh(W_1.x' + b)
        out = torch.tanh(self.linear1(embeds))
        # compute W_2.h
        out = self.linear2(out)
        # compute y: log_softmax(W_2.h)
        log_probs = F.log_softmax(out, dim=1)
        # return log probabilities
        # BATCH_SIZE x len(vocab)
        return log_probs
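
The model returns log-probabilities because they are paired with `nn.NLLLoss` later in the notebook: together the two amount to cross-entropy, where the loss for one example is minus the log-probability given to the true next word. The sketch below is a plain-Python stand-in (not the PyTorch implementation) with made-up scores for a four-word vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target index for a single example."""
    return -log_probs[target]

# Made-up scores for a 4-word vocabulary; index 2 is taken to be the true next word.
toy_logits = [1.0, -0.5, 2.0, 0.0]
toy_log_probs = log_softmax(toy_logits)
print(sum(math.exp(lp) for lp in toy_log_probs))  # probabilities sum to 1 (up to rounding)
print(nll(toy_log_probs, target=2))
```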

In [15]:
# create parameters
gpu = 0 
# word vectors size
EMBEDDING_DIM = 200
CONTEXT_SIZE = 2
BATCH_SIZE = 64
# hidden units
H = 100
torch.manual_seed(13013)

# check if gpu is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
available_workers = multiprocessing.cpu_count()

print("--- Creating training and dev dataloaders with {} batch size ---".format(BATCH_SIZE))
train_set = np.concatenate((x_train, y_train), axis=1)
dev_set = np.concatenate((x_dev, y_dev), axis=1)
train_loader = DataLoader(train_set, batch_size = BATCH_SIZE, num_workers = available_workers)
dev_loader = DataLoader(dev_set, batch_size = BATCH_SIZE, num_workers = available_workers)

def get_accuracy_from_log_probs(log_probs, labels):
    # argmax of the log-probabilities equals argmax of the probabilities,
    # so there is no need to exponentiate first
    predicted_label = torch.argmax(log_probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader, gpu):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        dev_st = time.time()
        for it, data_tensor in enumerate(dataloader):
            context_tensor = data_tensor[:,0:2]
            target_tensor = data_tensor[:,2]
            # use the device selected above so this also runs without a GPU
            context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)
            log_probs = model(context_tensor)
            mean_loss += criterion(log_probs, target_tensor).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target_tensor)
            count += 1
            if it % 500 == 0: 
                print("Dev Iteration {} complete. Mean Loss: {}; Mean Acc:{}; Time taken (s): {}".format(it, mean_loss / count, mean_acc / count, (time.time()-dev_st)))
                dev_st = time.time()

    return mean_acc / count, mean_loss / count


--- Creating training and dev dataloaders with 64 batch size ---


In [16]:
loss_function = nn.NLLLoss()

# create model
model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)

# move it to the selected device (GPU if available, otherwise CPU)
model.to(device)

# using ADAM optimizer
optimizer = optim.Adam(model.parameters(), lr = 2e-3)


# ------------------------- TRAIN & SAVE MODEL ------------------------
best_acc = 0
best_model_path = None
for epoch in range(5):
    st = time.time()
    print("\n--- Training model Epoch: {} ---".format(epoch+1))
    for it, data_tensor in enumerate(train_loader):       
        context_tensor = data_tensor[:,0:2]
        target_tensor = data_tensor[:,2]

        context_tensor, target_tensor = context_tensor.to(device), target_tensor.to(device)

        # zero out the gradients from the old instance
        model.zero_grad()

        # get log probabilities over next words
        log_probs = model(context_tensor)

        # calculate current accuracy
        acc = get_accuracy_from_log_probs(log_probs, target_tensor)

        # compute loss function
        loss = loss_function(log_probs, target_tensor)

        # backward pass and update gradient
        loss.backward()
        optimizer.step()

        if it % 500 == 0: 
            print("Training Iteration {} of epoch {} complete. Loss: {}; Acc:{}; Time taken (s): {}".format(it, epoch, loss.item(), acc, (time.time()-st)))
            st = time.time()

    print("\n--- Evaluating model on dev data ---")
    dev_acc, dev_loss = evaluate(model, loss_function, dev_loader, gpu)
    print("Epoch {} complete! Development Accuracy: {}; Development Loss: {}".format(epoch, dev_acc, dev_loss))
    if dev_acc > best_acc:
        print("Best development accuracy improved from {} to {}, saving model...".format(best_acc, dev_acc))
        best_acc = dev_acc
        # set best model path
        best_model_path = 'best_model_{}.dat'.format(epoch)
        # saving best model
        torch.save(model.state_dict(), best_model_path)


--- Training model Epoch: 1 ---
Training Iteration 0 of epoch 0 complete. Loss: 7.127009868621826; Acc:0.0; Time taken (s): 5.275617599487305
Training Iteration 500 of epoch 0 complete. Loss: 3.044797658920288; Acc:0.453125; Time taken (s): 0.9444775581359863

--- Evaluating model on dev data ---
Dev Iteration 0 complete. Mean Loss: 3.8231685161590576; Mean Acc:0.3125; Time taken (s): 0.33402490615844727
Dev Iteration 500 complete. Mean Loss: 3.9175908646421758; Mean Acc:0.2960641086101532; Time taken (s): 0.47676587104797363
Dev Iteration 1000 complete. Mean Loss: 4.007392706809106; Mean Acc:0.2820460796356201; Time taken (s): 0.44428467750549316
Dev Iteration 1500 complete. Mean Loss: 4.03563623329864; Mean Acc:0.27701324224472046; Time taken (s): 0.48658299446105957
Dev Iteration 2000 complete. Mean Loss: 3.9877283239531436; Mean Acc:0.2819918096065521; Time taken (s): 0.48996925354003906
Dev Iteration 2500 complete. Mean Loss: 3.948182597535937; Mean Acc:0.2864229381084442; Time t

Dev Iteration 14000 complete. Mean Loss: 3.829045092469496; Mean Acc:0.2853468060493469; Time taken (s): 0.508864164352417
Dev Iteration 14500 complete. Mean Loss: 3.8307474189836825; Mean Acc:0.2851031720638275; Time taken (s): 0.5069327354431152
Dev Iteration 15000 complete. Mean Loss: 3.8360246217900076; Mean Acc:0.28438103199005127; Time taken (s): 0.5189480781555176
Dev Iteration 15500 complete. Mean Loss: 3.842446598043196; Mean Acc:0.2835633456707001; Time taken (s): 0.4975090026855469
Epoch 1 complete! Development Accuracy: 0.2835707366466522; Development Loss: 3.8425192318297747

--- Training model Epoch: 3 ---
Training Iteration 0 of epoch 2 complete. Loss: 4.33746862411499; Acc:0.171875; Time taken (s): 0.3113560676574707
Training Iteration 500 of epoch 2 complete. Loss: 2.858914375305176; Acc:0.421875; Time taken (s): 0.8285279273986816

--- Evaluating model on dev data ---
Dev Iteration 0 complete. Mean Loss: 3.6360301971435547; Mean Acc:0.296875; Time taken (s): 0.2881963

Dev Iteration 11500 complete. Mean Loss: 3.8344505878233224; Mean Acc:0.2992457151412964; Time taken (s): 0.5461792945861816
Dev Iteration 12000 complete. Mean Loss: 3.8301251543113546; Mean Acc:0.2992263734340668; Time taken (s): 0.6074256896972656
Dev Iteration 12500 complete. Mean Loss: 3.8316258083829; Mean Acc:0.29886358976364136; Time taken (s): 0.5772831439971924
Dev Iteration 13000 complete. Mean Loss: 3.8421757816855755; Mean Acc:0.297604501247406; Time taken (s): 0.6942884922027588
Dev Iteration 13500 complete. Mean Loss: 3.8492774905103126; Mean Acc:0.29680439829826355; Time taken (s): 0.5990550518035889
Dev Iteration 14000 complete. Mean Loss: 3.851756470210722; Mean Acc:0.2963995933532715; Time taken (s): 0.6385955810546875
Dev Iteration 14500 complete. Mean Loss: 3.853557436410809; Mean Acc:0.29614338278770447; Time taken (s): 0.5067455768585205
Dev Iteration 15000 complete. Mean Loss: 3.8593542609689364; Mean Acc:0.29535531997680664; Time taken (s): 0.572192907333374
Dev
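
The dev loss in the log above is an average negative log-likelihood in nats (assuming the default mean reduction of `nn.NLLLoss`), so exponentiating it gives a perplexity, the metric the NPLM paper reports. A small sketch using the epoch-level "Development Loss" value printed above; since that value is a mean of per-batch means, treat the result as approximate.

```python
import math

def perplexity(mean_nll):
    """Perplexity is exp of the mean per-word negative log-likelihood (natural log)."""
    return math.exp(mean_nll)

dev_loss = 3.8425192318297747  # "Development Loss" copied from the log above
print(round(perplexity(dev_loss), 1))
```

The number is not comparable to the paper's results: this notebook trains on 1000 paragraphs and collapses rare words into `<UNK>`, which makes the prediction task easier.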

In [17]:
!ls

best_model_0.dat  best_model_3.dat		 naive_bayes.ipynb
best_model_2.dat  collaborative_filtering.ipynb  neural_plm.ipynb


In [18]:
# ---------------------- Loading Best Model -------------------
best_model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)
best_model.load_state_dict(torch.load(best_model_path, map_location=device))
best_model.to(device)

cos = nn.CosineSimilarity(dim=0)

lm_similarities = {}

# word pairs to calculate similarity
words = {('computer','keyboard'),('cat','dog'),('dog','car'),('keyboard','cat')}

# ----------- Calculate LM similarities using cosine similarity ----------
for word_pairs in words:
    w1 = word_pairs[0]
    w2 = word_pairs[1]
    words_tensor = torch.LongTensor([get_id_of_word(w1),get_id_of_word(w2)])
    words_tensor = words_tensor.to(device)
    # get word embeddings from the best model
    words_embeds = best_model.embeddings(words_tensor)
    # calculate cosine similarity between word vectors
    sim = cos(words_embeds[0],words_embeds[1])
    lm_similarities[word_pairs] = sim.item()

print(lm_similarities)

{('computer', 'keyboard'): 1.0, ('cat', 'dog'): 0.022388342767953873, ('keyboard', 'cat'): 1.0, ('dog', 'car'): -0.03178047016263008}
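
The two similarities of exactly 1.0 look suspicious. One likely explanation (an assumption, not checked against the actual vocabulary) is that 'computer', 'keyboard' and 'cat' occur fewer than 5 times in the 1000 training paragraphs, so `get_id_of_word` maps all three to `<UNK>` and the comparison is between one embedding row and itself. The sketch below reproduces that effect with a made-up three-row embedding table.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embedding table; row 0 plays the role of <UNK>.
toy_embeddings = [[0.2, -0.5, 0.1], [0.7, 0.1, -0.3], [-0.4, 0.6, 0.2]]
toy_ids = {"<UNK>": 0, "dog": 1, "car": 2}

def lookup(word):
    """Return the embedding row for a word, falling back to the <UNK> row."""
    return toy_embeddings[toy_ids.get(word, toy_ids["<UNK>"])]

sim_unknown = cosine(lookup("computer"), lookup("keyboard"))  # both fall back to row 0
sim_known = cosine(lookup("dog"), lookup("car"))
print(sim_unknown, sim_known)
```

Checking `'computer' in vocab` in the notebook would confirm or rule this out; lowering the frequency threshold or raising `num_train` would give these words their own embeddings.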


In [None]:
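#Layers to implement from scratch (mirroring TrigramNNmodel above):
#1st layer: Embedding lookup, concatenating the context word vectors
#2nd layer: Linear followed by the tanh non-linearity
#3rd layer: Linear followed by (log-)softmax over the vocabulary

#TODO: Code this later in the notebook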