# Lab 10 -- Language Model (RE02 - starting at 19:05)


What is a Language Model in NLP?

> A language model learns to predict the probability of a sequence of words. 

Types of Language Models


*   **Statistical Language Models**: These models use traditional statistical techniques such as N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
*   **Neural Language Models**: These are the newer players in the NLP town and have surpassed statistical language models in effectiveness. They use different kinds of neural networks to model language.



# Statistical Language Model (SLM)

A statistical language model produces probability distributions over sequences of words. Given a sequence, say of length $m$, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence. A common simplification is to assume that the probability of a word depends only on the previous $n-1$ words. This is known as an n-gram model.
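
Written out, the chain rule factorises the joint probability exactly, and the n-gram assumption then truncates each conditioning context to the previous $n-1$ words:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

For a trigram model ($n = 3$), each factor is $P(w_i \mid w_{i-2}, w_{i-1})$.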

## Bigrams and Trigrams

An n-gram model is a type of probabilistic language model that predicts the next token in a sequence, in the form of an $(n-1)$-order Markov model. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram". English cardinal numbers are sometimes used, e.g., "four-gram", "five-gram", and so on. Note that we typically only use language models that are bigram or higher. A unigram model would look at the previous $n-1 = 0$ words, and ignoring all context makes for a very poor model.

The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.
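
As a minimal sketch of such a bigram frequency distribution, the snippet below counts adjacent word pairs using only the standard library. The helper name `bigram_counts` and the toy sentence are made up for illustration and are not part of the lab code.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs in a list of tokens."""
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))

# Toy sentence, invented for this example
tokens = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(tokens)
print(counts.most_common(2))
```

A sequence of `len(tokens)` words yields `len(tokens) - 1` bigrams when no padding is used.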


Let's see how to build such a model with NLTK. Let's download some Reuters data and inspect it.

In [1]:
import nltk
from nltk.util import bigrams, trigrams
from collections import Counter, defaultdict
from nltk.corpus import reuters

nltk.download('reuters')
!unzip -qq /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
first_sentence = reuters.sents()[0]
first_sentence

['ASIAN',
 'EXPORTERS',
 'FEAR',
 'DAMAGE',
 'FROM',
 'U',
 '.',
 'S',
 '.-',
 'JAPAN',
 'RIFT',
 'Mounting',
 'trade',
 'friction',
 'between',
 'the',
 'U',
 '.',
 'S',
 '.',
 'And',
 'Japan',
 'has',
 'raised',
 'fears',
 'among',
 'many',
 'of',
 'Asia',
 "'",
 's',
 'exporting',
 'nations',
 'that',
 'the',
 'row',
 'could',
 'inflict',
 'far',
 '-',
 'reaching',
 'economic',
 'damage',
 ',',
 'businessmen',
 'and',
 'officials',
 'said',
 '.']

Now let's see what the n-grams look like. More details can be found at [bigrams()](https://www.nltk.org/api/nltk.util.html#nltk.util.bigrams),  [trigrams()](https://www.nltk.org/api/nltk.util.html#nltk.util.trigrams),  [ngrams()](https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams)

In [3]:
print("bigrams without padding: ", list(bigrams(first_sentence)))

print("bigrams with padding: ", list(bigrams(first_sentence, pad_left=True, pad_right=True)))

print("trigrams without padding: ", list(trigrams(first_sentence)))

print("trigrams with padding: ", list(trigrams(first_sentence, pad_left=True, pad_right=True)))

bigrams without padding:  [('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.')]
bigrams with padding:  [(None, 'ASIAN'), ('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'F

Now, let's build a trigram model using the Reuters corpus. Building a bigram model is completely analogous.


In [5]:
def to_dict(d):
    if isinstance(d, defaultdict):
        return dict((k, to_dict(v)) for k, v in d.items())
    return d

In [13]:
## 'lambda: 0' is a function that takes no arguments and always returns 0
a = lambda: 0
print(a())

0


In [19]:
aa = defaultdict(a)
print(aa)

defaultdict(<function <lambda> at 0x7fd034576b90>, {})


In [8]:
trigram_model = defaultdict(lambda: defaultdict(lambda: 0))
# lambda: 0 always returns zero, so missing keys start with a count of 0;
## defaultdict(int) is the more idiomatic way to get the same default.
print(trigram_model) # defaultdict
trigram_model['a'] = 5 # ordinary assignment still works and bypasses the default factory
print(trigram_model) 
print(to_dict({'a':5})) # a plain dict is returned unchanged

defaultdict(<function <lambda> at 0x7fd03447f0a0>, {})
defaultdict(<function <lambda> at 0x7fd03447f0a0>, {'a': 5})
{'a': 5}
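
As the comment in the cell above notes, `defaultdict(int)` gives the same zero default as `lambda: 0`. The sketch below, with made-up keys, shows the nested counter in that form. It also shows a side effect worth knowing: merely reading a missing key inserts it.

```python
from collections import defaultdict

# Nested counter: context -> next word -> count, with int() == 0 as the default
counts = defaultdict(lambda: defaultdict(int))
counts[("the", "cat")]["sat"] += 1
counts[("the", "cat")]["sat"] += 1

# Reading a missing key returns 0, but it also stores that key with value 0
missing = counts[("the", "cat")]["slept"]
print(counts[("the", "cat")]["sat"], missing)
print("slept" in counts[("the", "cat")])
```

Use `.get(key, 0)` instead of indexing when you want to look up a count without growing the dictionary.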


In [24]:
print(reuters.sents()[0])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


In [26]:
# create a model which contains the trigram counts  
trigram_model = defaultdict(lambda: defaultdict(lambda: 0))
## i.e., initialise a 'trigram_model' -- an empty dict

# for sentence in reuters.sents(): 
## let's check first sentence only
n = 1
for w1, w2, w3 in trigrams(reuters.sents()[0], pad_right=True, pad_left=True):
    if n == 1 or n == 2: 
        print(n)
        print(w1,w2,w3)
        print('before',trigram_model)
        trigram_model[(w1, w2)][w3] += 1 ## i.e., increment the count of w3 following (w1, w2)
        print(type(trigram_model[(w1, w2)][w3]))
        print('after',trigram_model)
        n += 1

1
None None ASIAN
before defaultdict(<function <lambda> at 0x7fd033f0a290>, {})
after defaultdict(<function <lambda> at 0x7fd033f0a290>, {(None, None): defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd033920f70>, {'ASIAN': 1})})
2
None ASIAN EXPORTERS
before defaultdict(<function <lambda> at 0x7fd033f0a290>, {(None, None): defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd033920f70>, {'ASIAN': 1})})
after defaultdict(<function <lambda> at 0x7fd033f0a290>, {(None, None): defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd033920f70>, {'ASIAN': 1}), (None, 'ASIAN'): defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd033921360>, {'EXPORTERS': 1})})


In [47]:
trigram_model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in reuters.sents(): 
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        trigram_model[(w1, w2)][w3] += 1 ## i.e., increment the count of w3 following (w1, w2)

In [29]:
# inspect the counts of some trigrams

print("count for 'what the economists':", trigram_model["what", "the"]["economists"])

print("count for 'what the nonexistingword':", trigram_model["what", "the"]["nonexistingword"])

# number of sentences that start with "The"
print("count for 'the' starting a sentence:", trigram_model[None, None]["The"])

count for 'what the economists': 2
count for 'what the nonexistingword': 0
count for 'the' starting a sentence: 8839


In [44]:
# convert counts to probabilities
n = 1
for w1_w2 in trigram_model: ## i.e., go through all (w1,w2) keys
    if n == 1 or n == 2:
        print('n--',n)
        n +=1 
        print('w1_w2',w1_w2)
        print('trigram_model[w1_w2]',trigram_model[w1_w2])
        print('.values()',trigram_model[w1_w2].values())
        total_count = float(sum(trigram_model[w1_w2].values()))
        w = 1
        for w3 in trigram_model[w1_w2]:
            if w == 1 or w == 2:
                print('w--',w)
                w += 1
                print('w3',w3)
                trigram_model[w1_w2][w3] /= total_count
                print(trigram_model[w1_w2][w3])

n-- 1
w1_w2 (None, None)
trigram_model[w1_w2] defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd022fea5f0>, {'ASIAN': 4, 'They': 446, 'But': 1054, 'The': 8839, 'Unofficial': 1, '"': 3589, 'In': 1380, 'Threat': 2, 'Taiwan': 38, 'Retaliation': 3, 'A': 764, 'Last': 202, 'Much': 8, 'He': 1586, 'Meanwhile': 41, 'Japan': 111, 'Deputy': 8, 'CHINA': 50, 'It': 1768, 'JAPAN': 164, 'MITI': 12, 'Nuclear': 1, 'THAI': 19, 'Thailand': 13, 'Export': 12, 'Products': 4, 'INDONESIA': 21, 'Prices': 43, 'Harahap': 2, 'Indonesia': 34, 'Indonesian': 9, 'AUSTRALIAN': 37, 'Cargo': 1, 'INDONESIAN': 12, 'Trading': 24, 'Physical': 2, 'Rubber': 3, 'Robusta': 2, 'No': 92, 'Trade': 52, 'Nainggolan': 1, 'Officials': 58, 'Transactions': 2, 'Total': 111, 'SRI': 9, 'WESTERN': 8, 'Bundey': 1, 'Annual': 5, 'SUMITOMO': 4, 'Osaka': 1, 'Some': 177, 'Others': 11, 'Now': 21, 'Among': 57, 'Regulations': 1, 'We': 73, 'Komatsu': 1, 'Article': 2, 'That': 104, 'Until': 15, 'Like': 6, 'SUBROTO': 3, 'Asked': 127, 'BUNDESBANK':

In [48]:
# convert counts to probabilities
for w1_w2 in trigram_model: ## i.e., go through all (w1,w2) keys
    # total number of times this (w1, w2) context was observed
    total_count = float(sum(trigram_model[w1_w2].values()))
    for w3 in trigram_model[w1_w2]:
        # P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
        trigram_model[w1_w2][w3] /= total_count


In [49]:
print(trigram_model["what", "the"]["economists"] )

print(trigram_model["what", "the"]["nonexistingword"])

# probability that a sentence starts with "The"
print(trigram_model[None, None]["The"])

0.043478260869565216
0
0.16154324146501936
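
Each value is a maximum-likelihood estimate: the count of the trigram divided by the total count of its two-word context. The printed 0.0434... equals 2/46, so the context ("what", "the") occurred 46 times in the corpus. Here is a minimal sketch of the same normalisation on toy counts; the helper `normalise` and the word "market" with its count are invented for illustration.

```python
def normalise(counts):
    """Turn {context: {word: count}} into {context: {word: probability}}."""
    model = {}
    for context, next_words in counts.items():
        total = sum(next_words.values())
        model[context] = {w: c / total for w, c in next_words.items()}
    return model

# Toy counts: only the 2 and the total of 46 match the lab output
toy_counts = {("what", "the"): {"economists": 2, "market": 44}}
toy_model = normalise(toy_counts)
print(toy_model[("what", "the")]["economists"])
```

Unlike the in-place division in the lab cell, this version builds a new dictionary, so the raw counts remain available afterwards.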


Now you have a trigram language model. Let's generate some text. The output text is actually quite readable!
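
The generation loop further down picks each next word by inverse-CDF sampling: it draws a number `r` uniformly from [0, 1) and walks through the candidate words, accumulating their probabilities until the running total reaches `r`. The sketch below isolates that step. The function name `sample_next` and the toy distribution are assumptions for illustration, and `r` is passed in explicitly so the behaviour is reproducible.

```python
def sample_next(dist, r):
    """Return the first word whose cumulative probability reaches r."""
    accumulator = 0.0
    for word, prob in dist.items():
        accumulator += prob
        if accumulator >= r:
            return word
    return None  # only reached if the probabilities sum to less than r

# Toy distribution, not taken from the Reuters model
dist = {"The": 0.5, "It": 0.3, "He": 0.2}
print(sample_next(dist, 0.10))
print(sample_next(dist, 0.65))
print(sample_next(dist, 0.95))
```

Because each word owns a slice of [0, 1) whose width equals its probability, a word is chosen with exactly that probability. The standard library offers the same behaviour through `random.choices(list(dist), weights=list(dist.values()))`.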


In [50]:
text = [None, None]
print(trigram_model[tuple(text[-2:])])

defaultdict(<function <lambda>.<locals>.<lambda> at 0x7fd021868040>, {'ASIAN': 7.31047591198187e-05, 'They': 0.008151180641859785, 'But': 0.019263104028072228, 'The': 0.16154324146501936, 'Unofficial': 1.8276189779954676e-05, '"': 0.06559324512025733, 'In': 0.02522114189633745, 'Threat': 3.655237955990935e-05, 'Taiwan': 0.0006944952116382777, 'Retaliation': 5.482856933986402e-05, 'A': 0.013963008991885371, 'Last': 0.0036917903355508444, 'Much': 0.0001462095182396374, 'He': 0.028986036991008116, 'Meanwhile': 0.0007493237809781417, 'Japan': 0.0020286570655749687, 'Deputy': 0.0001462095182396374, 'CHINA': 0.0009138094889977337, 'It': 0.03231230353095987, 'JAPAN': 0.002997295123912567, 'MITI': 0.0002193142773594561, 'Nuclear': 1.8276189779954676e-05, 'THAI': 0.00034724760581913885, 'Thailand': 0.00023759046713941077, 'Export': 0.0002193142773594561, 'Products': 7.31047591198187e-05, 'INDONESIA': 0.00038379998537904815, 'Prices': 0.000785876160538051, 'Harahap': 3.655237955990935e-05, 'Indo

In [None]:
import random

for sample_no in range(4):
    text = [None, None]
    
    sentence_finished = False

    # Keep generating the next word until reaching the end of the sentence
    while not sentence_finished: ## i.e., stop once sentence_finished is True
        # Randomly select a probability threshold r
        r = random.random() ## a fresh r is drawn at every time step
        accumulator = .0 ## reset the accumulator for each new word
    
        # Go through the possible w3 conditioned on the current w1 and w2
        for word in trigram_model[tuple(text[-2:])].keys():
            # tuple(text[-2:]) == the last two words generated so far
            # trigram_model[tuple(text[-2:])] maps each possible next word, given (w1,w2), to its probability
            # trigram_model[tuple(text[-2:])].keys() returns those candidate words only
            ## so the loop body runs once for each possible next word:
            
            # Accumulate the probability
            accumulator += trigram_model[tuple(text[-2:])][word] # the probability of 'word' given (w1,w2)
    
            # Once the cumulative probability reaches r, use the current word as the next word,
            # so each word is selected with a chance equal to its own probability
            if accumulator >= r:
                text.append(word)
                break
    
        # Two consecutive None tokens are the right padding, i.e. the end of the sentence
        if text[-2:] == [None, None]:
            sentence_finished = True

    # The generated sentence is as follows
    print("Sample output", sample_no)
    print(' '.join([t for t in text if t]))
    print()

Sample output 0
Trade Representative Clayton Yeutter ' s to 28 mln vs 312 . 4 mln dlrs because of inefficiencies , and virtual monopolies .

Sample output 1
Based on a Chicago investment advisory firm , and we must be repealed or we will continue to view MITI ' s competitive position had weakened .

Sample output 2
Balladur maintains 1987 2 . 20 dlrs Net 213 , 310 , 000 shares , or 20 cts Net loss 3 , 880 . 6 pct of Sprint .

Sample output 3
" Japanese people doubt Nakasone ' s comments boosted growing sentiment that the service sector earners of foreign exchange losses on bauxite mining and materials for national concern both over declining competitiveness of U . S . Currency up to 9 . 4 mln tonnes , compared with 14 . 8 pct increase in 1986 / 87 04 / 09 / 87 Prev Wk 4 / 2 pct month - old protection of intellectual property .



# Decoding Algorithms

In NLP tasks such as dialogue, text summarization, and machine translation, the prediction required is a sequence of words.

It is common for models developed for these types of problems to output a probability distribution over each word in the vocabulary for each word in the output sequence. **It is then left to a decoder process to transform the probabilities into a final sequence of words.**

Decoding the most likely output sequence means searching through all possible output sequences and ranking them by likelihood. The vocabulary often contains tens or hundreds of thousands of words, or even millions. The number of candidate sequences therefore grows exponentially with the length of the output, which makes an exhaustive search intractable in practice.
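
To get a feel for the numbers, the sketch below counts the candidate sequences for a given vocabulary size and output length. The function name is made up for this example, and the 50,000-word vocabulary is just a representative size.

```python
def num_candidate_sequences(vocab_size, length):
    """Number of distinct output sequences of the given length."""
    return vocab_size ** length

# The toy problem used later in this lab: 5 words, 10 steps
print(num_candidate_sequences(5, 10))

# A more realistic vocabulary of 50,000 words, still only 10 steps
print(num_candidate_sequences(50_000, 10))
```

The toy problem has just under ten million candidates, which is still enumerable. The realistic case has roughly 10^47, which is not.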

In practice, heuristic search methods are used to return one or more approximate or “good enough” decoded output sequences for a given prediction. In some special cases, we can develop methods that do not search the whole space explicitly, but do find the highest probability output nevertheless.

Candidate sequences of words are scored based on their likelihood. It is common to use a greedy search or a beam search to locate candidate sequences of text. We will look at both of these decoding algorithms now.

## Greedy Decoder

A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence. This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.

We can demonstrate the greedy search approach to decoding with a small contrived example in Python. We start with a prediction problem that involves a sequence of 10 words. Each word is predicted as a probability distribution over a vocabulary of 5 words.

In [None]:
from numpy import array
from numpy import argmax

In [None]:
# define a sequence of 10 words over a vocab of 5 words
#       1st-w,2nd-w,3rd-w,4th-w,5th-w
data = [[0.01, 0.09, 0.3, 0.4, 0.2], # the first time step  --- 0.4 -index- 3
        [0.4, 0.3, 0.2, 0.01, 0.09], # the second time step --- 0.4 -index- 0
        [0.01, 0.09, 0.3, 0.4, 0.2], # ...                   --- 0.4 -index- 3
        [0.4, 0.3, 0.2, 0.01, 0.09], # ...                   --- 0.4 -index- 0
        [0.01, 0.09, 0.3, 0.4, 0.2], # ...                   --- 0.4 -index- 3
        [0.4, 0.3, 0.2, 0.01, 0.09], # ...                   --- 0.4 -index- 0
        [0.01, 0.09, 0.3, 0.4, 0.2], # ...                   --- 0.4 -index- 3
        [0.4, 0.3, 0.2, 0.01, 0.09], # ...                   --- 0.4 -index- 0
        [0.01, 0.09, 0.3, 0.4, 0.2], # ...                   --- 0.4 -index- 3
        [0.4, 0.3, 0.2, 0.01, 0.09]] # ...                   --- 0.4 -index- 0
data = array(data)

We will assume that the words have been integer encoded, such that the column index can be used to look-up the associated word in the vocabulary. Therefore, the task of decoding becomes the task of selecting a sequence of integers from the probability distributions.

The argmax() mathematical function can be used to select the index of an array that has the largest value. We can use this function to select the word index that is most likely at each step in the sequence. This function is provided directly in numpy.

The greedy_decoder() function below implements this decoder strategy using the argmax function.

In [None]:
# greedy decoder, pick only index for largest probability each row
def greedy_decoder(data):
    ## i.e., for each time step, select the index of the highest probability
    return [argmax(s) for s in data]

Running the example outputs a sequence of integers that could then be mapped back to words in the vocabulary.

In [None]:
# decode sequence
result = greedy_decoder(data)
print(result)

[3, 0, 3, 0, 3, 0, 3, 0, 3, 0]
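
Mapping the integers back to words is a plain list lookup. The five-word vocabulary below is hypothetical, since the lab does not define one for this toy problem.

```python
# Hypothetical vocabulary; position in the list is the word's integer code
vocab = ["a", "cat", "sat", "the", "mat"]

def indices_to_words(indices, vocab):
    """Look up each integer index in the vocabulary list."""
    return [vocab[int(i)] for i in indices]

decoded = [3, 0, 3, 0, 3, 0, 3, 0, 3, 0]  # the greedy decoder output above
print(" ".join(indices_to_words(decoded, vocab)))
```

The `int(i)` conversion lets the helper accept NumPy integers, which is what `argmax` returns.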


## Beam Search Decoder

Another popular heuristic is beam search, which improves on greedy search and returns a list of most likely output sequences.

Instead of greedily choosing the most likely next step as the sequence is constructed, beam search keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

We do not need to start with random states; instead, we start with the k most likely words as the first step in the sequence. A beam width of 1 is equivalent to greedy search, and widths of 5 or 10 are common for benchmark problems in machine translation. Larger beam widths often improve output quality, because keeping several candidate sequences increases the chance that one of them closely matches the target sequence. This improvement comes at the cost of slower decoding.
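
Beam search only pays off when the next-word distribution depends on the words already chosen. The sketch below uses a tiny hypothetical conditional model, with all tokens and probabilities invented, in which the greedy choice at the first step leads to a worse overall sequence. It is a simplified illustration rather than the decoder used in this lab.

```python
import math

# Hypothetical conditional model: prefix (tuple of tokens) -> next-token probabilities
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, k):
    """Keep the k prefixes with the lowest summed negative log probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score - math.log(prob)))
        # lower score == higher probability
        beams = sorted(candidates, key=lambda c: c[1])[:k]
    return beams

greedy_seq, greedy_score = beam_search(MODEL, steps=2, k=1)[0]
beam_seq, beam_score = beam_search(MODEL, steps=2, k=2)[0]
print(greedy_seq, round(math.exp(-greedy_score), 2))
print(beam_seq, round(math.exp(-beam_score), 2))
```

With k=1 the search commits to "A" (0.6) and ends with a sequence of probability 0.3. With k=2 it keeps "B" alive and finds ("B", "x") with probability 0.36.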

In [58]:
from numpy import array
from numpy import argmax
from numpy import log

The beam_search_decoder() function below implements a beam search decoder.

In [61]:
print(len([[list(), 0.0]]))
sequences = [[list(), 0.0]]
print((-log(0.01)))

1
4.605170185988091


In [None]:
# beam search
def beam_search_decoder(data, k):
    sequences = [[list(), 0.0]] ## nested list [[[], 0.0]]
    # walk over each step in sequence
    for step,row in enumerate(data):
        # step == index of the current time step
        # row == the probabilities of the 5 words at that time step
        all_candidates = list()
        # expand each current candidate
        for i in range(len(sequences)):
            seq, score = sequences[i] # seq = [], score = 0.0 -- at the first time step
            for j in range(len(row)): # i.e., for each possible word at this time step
                ## j is the index of the word
                ## row[j] is the probability of that word
                candidate = [seq + [j], score + (-log(row[j])) ]  # we sum negative logs, so the lowest score is the highest probability
                ## candidate == [sequence, score] e.g., [[0], 4.605170185988091]
                all_candidates.append(candidate)
        # order all candidates by score
        ordered = sorted(all_candidates, key=lambda tup:tup[1])
        ## key=lambda tup:tup[1] -- sort the candidates by tup[1], i.e., by their score
        ## ascending order puts the most probable sequences first

        # select k best
        sequences = ordered[:k]
        
        # display the k-best sequences
        print("The", str(k), "best sequences at step ", str(step), ": ")
        print(sequences)
        print()

    return sequences

We can tie this together with the sample data from the previous section and this time return the 3 most likely sequences. Running the example prints the integer sequences together with their scores, i.e. their summed negative log probabilities (lower is better).

In [None]:
# decode sequence
result = beam_search_decoder(data, 3)


print()
print("The final decoded 3 best sequences: ")
for seq in result:
    print(seq)

The 3 best sequences at step  0 : 
[[[3], 0.916290731874155], [[2], 1.2039728043259361], [[4], 1.6094379124341003]]

The 3 best sequences at step  1 : 
[[[3, 0], 1.83258146374831], [[3, 1], 2.120263536200091], [[2, 0], 2.120263536200091]]

The 3 best sequences at step  2 : 
[[[3, 0, 3], 2.748872195622465], [[3, 0, 2], 3.036554268074246], [[3, 1, 3], 3.036554268074246]]

The 3 best sequences at step  3 : 
[[[3, 0, 3, 0], 3.66516292749662], [[3, 0, 3, 1], 3.952844999948401], [[3, 0, 2, 0], 3.952844999948401]]

The 3 best sequences at step  4 : 
[[[3, 0, 3, 0, 3], 4.581453659370775], [[3, 0, 3, 0, 2], 4.869135731822556], [[3, 0, 3, 1, 3], 4.869135731822556]]

The 3 best sequences at step  5 : 
[[[3, 0, 3, 0, 3, 0], 5.49774439124493], [[3, 0, 3, 0, 3, 1], 5.7854264636967105], [[3, 0, 3, 0, 2, 0], 5.785426463696711]]

The 3 best sequences at step  6 : 
[[[3, 0, 3, 0, 3, 0, 3], 6.414035123119085], [[3, 0, 3, 0, 3, 0, 2], 6.701717195570866], [[3, 0, 3, 0, 3, 1, 3], 6.701717195570866]]

The 3 
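
Because each score is a sum of negative log probabilities, `exp(-score)` recovers the probability of the sequence. The sketch below checks this on two of the scores printed above; the helper name is made up for this example.

```python
import math

def score_to_probability(neg_log_score):
    """Convert a summed negative log probability back to a probability."""
    return math.exp(-neg_log_score)

p_step0 = score_to_probability(0.916290731874155)  # step 0, sequence [3]
p_step1 = score_to_probability(1.83258146374831)   # step 1, sequence [3, 0]
print(p_step0, p_step1)
```

The values come out at approximately 0.4 and 0.16 (that is, 0.4 × 0.4), matching the rows of `data`. In this toy data each row is independent of the earlier choices, so the top beam at every step is the same sequence the greedy decoder returned.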

# Neural Language Model


Now, let's see how to build a language model for generating natural language text by implementing and training a Recurrent Neural Network. The objective of this model is to generate new text that continues a given piece of input text. Let's start building the architecture.

In [None]:
import numpy as np 

from numpy import array
from numpy import argmax
from numpy import log

Let's use a popular nursery rhyme, “Cat and Her Kittens”, as our corpus.



In [None]:
import re

# Pad sequences to the max length
def pad_sequences_pre(input_sequences, maxlen):
    output = []
    for inp in input_sequences:
        if len(inp)< maxlen:
            output.append([0]*(maxlen-len(inp)) + inp)
        else:
            output.append(inp[:maxlen])
    return output

# Prepare the data
def dataset_preparation(data):
    corpus = data.lower().split("\n")
    normalized_text=[]
    for string in corpus:
        tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
        normalized_text.append(tokens)
    tokenized_sentences=[sentence.strip().split(" ") for sentence in normalized_text]
    # The strip() method removes any leading (spaces at the beginning) and trailing (spaces at the end) characters (space is the default leading character to remove)

    word_list_dict ={}
    for sent in tokenized_sentences:
        for word in sent:
            if word != "":
                word_list_dict[word] = 1
    word_list = list(word_list_dict.keys())
    # ids start at 1 so that 0 stays reserved for padding (hence total_words = len(word_list)+1 below)
    word_to_index = {word: i + 1 for i, word in enumerate(word_list)}

    total_words = len(word_list)+1

    # create input sequences using list of tokens
    input_sequences = []
    for line in tokenized_sentences:
        token_list = []
        for word in line:
            if word!="":
                token_list.append(word_to_index[word])
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # pad sequences 
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences_pre(input_sequences, maxlen=max_sequence_len))

    # create predictors and label
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

    return predictors, np.array(label), max_sequence_len, total_words, word_list, word_to_index

data = '''The cat and her kittens
They put on their mittens
To eat a christmas pie
The poor little kittens
They lost their mittens
And then they began to cry.

O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens
If it be so, ye shall not go
For ye are naughty kittens'''

predictors, label, max_sequence_len, total_words, word_list, word_to_index = dataset_preparation(data)
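
To make the data preparation concrete: every line of the rhyme is expanded into all of its prefixes of length two or more, each prefix is left-padded to a common length, and the last token of each padded prefix becomes the label. The sketch below reproduces that idea on a made-up four-token line; `prefix_sequences` and `pad_pre` are simplified stand-ins for the logic in `dataset_preparation` and `pad_sequences_pre`.

```python
def prefix_sequences(token_ids):
    """All prefixes of length >= 2; the last id of each prefix is the label."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

def pad_pre(seq, maxlen, pad_id=0):
    """Left-pad with pad_id (or truncate) to exactly maxlen items."""
    return [pad_id] * (maxlen - len(seq)) + seq[:maxlen]

line = [1, 2, 3, 4]  # hypothetical token ids for a four-word line
for seq in prefix_sequences(line):
    padded = pad_pre(seq, maxlen=4)
    # everything but the last id is the predictor, the last id is the label
    print(padded[:-1], "->", padded[-1])
```

A line of n tokens therefore contributes n - 1 training examples.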

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score

# Define the model
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim_1, hidden_dim_2, total_words):
        super(LSTMTagger, self).__init__()
        self.hidden_dim_1 = hidden_dim_1
        self.hidden_dim_2 = hidden_dim_2
        self.word_embeddings = nn.Embedding(total_words, embedding_dim)
        self.lstm1 = nn.LSTM(embedding_dim, hidden_dim_1, batch_first=True)  
        self.lstm2 = nn.LSTM(hidden_dim_1, hidden_dim_2, batch_first=True)  
        self.hidden2tag = nn.Linear(hidden_dim_2, total_words)


    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out_1, _ = self.lstm1(embeds)
        lstm_out_2, _ = self.lstm2(lstm_out_1)
        tag_space = self.hidden2tag(lstm_out_2[:,-1,:])
        # log_softmax returns log-probabilities, which is what nn.NLLLoss expects;
        # at decoding time we sum -log(p) and keep the sequences with the lowest score
        tag_scores = F.log_softmax(tag_space, dim=1)      
        return tag_scores

# Parameter setting
EMBEDDING_DIM = 10
HIDDEN_DIM_1 = 150
HIDDEN_DIM_2 = 100
batch_size=predictors.shape[0]

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM_1, HIDDEN_DIM_2, total_words).cuda()
loss_function = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


sentence =torch.from_numpy(predictors).cuda().to(torch.int64)
targets = torch.from_numpy(label).cuda().to(torch.int64)


# Training
for epoch in range(100):  

    model.train()
    model.zero_grad()       
    tag_scores = model(sentence)
    loss = loss_function(tag_scores, targets)
    loss.backward()
    optimizer.step()


    if epoch % 10 == 9:
        model.eval()
        _, predicted = torch.max(tag_scores, 1)
        prediction = predicted.view(-1).cpu().numpy()
        t = targets.view(-1).cpu().numpy()
        acc = accuracy_score(t, prediction)  # argument order is (y_true, y_pred)
        print('Epoch: %d, training loss: %.4f, training acc: %.2f%%'%(epoch+1,loss.item(),100*acc))



Epoch: 10, training loss: 3.6729, training acc: 4.17%
Epoch: 20, training loss: 3.4693, training acc: 10.42%
Epoch: 30, training loss: 3.0252, training acc: 14.58%
Epoch: 40, training loss: 2.5667, training acc: 29.17%
Epoch: 50, training loss: 2.1609, training acc: 54.17%
Epoch: 60, training loss: 1.8292, training acc: 66.67%
Epoch: 70, training loss: 1.5625, training acc: 77.08%
Epoch: 80, training loss: 1.3559, training acc: 83.33%
Epoch: 90, training loss: 1.1748, training acc: 87.50%
Epoch: 100, training loss: 1.0250, training acc: 87.50%
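
The training loss reported above is the mean negative log-likelihood: the model outputs `log_softmax` scores and `nn.NLLLoss` averages `-log p` of the correct next word over the examples. The sketch below computes the same quantity for a single example in plain Python. The logits are made up, and the helper functions are illustrative rather than PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary
log_probs = log_softmax(logits)
print(nll(log_probs, target=0))
```

A confident correct prediction gives a loss near 0, while a target that receives little probability mass gives a large loss.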


For decoding, let's first practice beam search with k=1, which is equivalent to greedy decoding.

In [None]:
# convert index to word
def ind_to_word(predicted_ind):
    for word, index in word_to_index.items():
        if index == predicted_ind:
            return word
    return ""    


# get the k most probable predictions
def get_topK(predicted, k=1):
    
    # Indices of the k highest log-probabilities (argsort sorts ascending, so we take the last k)
    # Since the input is just one sentence, we can use [0] to extract the prediction result
    top_k = np.argsort(predicted[0])[-k:]

    # return a list of tuple
    # tuple[0]:word_id, tuple[1]:log(p)
    return [(id, predicted[0][id]) for id in top_k]



# Generate text, currently it only works with k=1 
def generate_text(seed_text, next_words, max_sequence_len, k=1):

    seed_candidates = [(seed_text, .0)]
    for _ in range(next_words):
        successives = []
        # if k = 1, len(seed_candidates) will always be 1
        for i in range(len(seed_candidates)):
            seed_text, score = seed_candidates[i]
            token_list = [word_to_index[word] for word in seed_text.split()]
            token_list = pad_sequences_pre([token_list], maxlen=max_sequence_len-1)

            seed_input = torch.from_numpy(np.array(token_list)).cuda().to(torch.int64)
            predicted = model(seed_input).cpu().detach().numpy()


            # Since it only works with k = 1, we can simply use [0] to get the word id and log(p)
            id, s = get_topK(predicted, k)[0]
            # get the output word
            output_word = ind_to_word(id)
            # append the word to the input sentence
            # calculate the accumulated score by adding -log(p)
            successives.append((seed_text + ' ' + output_word, score - s)) 

        # Get the lowest k accumulated scores (highest k accumulated probabilities)
        # Then, make them as the seed_candidate for the next word to predict
        ordered = sorted(successives, key=lambda tup: tup[1])
        seed_candidates = ordered[:k]

    return seed_candidates[0][0]


print(generate_text("we naughty", 3, max_sequence_len, k=1))


we naughty go to day


Now, let's modify based on the above code to allow k>1:

In [None]:
def generate_text(seed_text, next_words, max_sequence_len, k=1):
   
    seed_candidates = [(seed_text, .0)]
    for _ in range(next_words):
        successives = []
        for i in range(len(seed_candidates)):
            seed_text, score = seed_candidates[i]
            token_list = [word_to_index[word] for word in seed_text.split()]
            token_list = pad_sequences_pre([token_list], maxlen=max_sequence_len-1)

            seed_input = torch.from_numpy(np.array(token_list)).cuda().to(torch.int64)
            predicted = model(seed_input).cpu().detach().numpy()
            
            # if k>1 , we can't simply use [0] to get the candidates
            # instead, we will modify as follows
            for id, s in get_topK(predicted, k):
                output_word= ind_to_word(id)
                successives.append((seed_text + ' ' + output_word, score - s))
        ordered = sorted(successives, key=lambda tup: tup[1])
        seed_candidates = ordered[:k]
    return seed_candidates[0][0]

# Please note that it can happen that k=1 and k=3 have the same output because this is only a small dataset.
print(generate_text("we naughty", 3, max_sequence_len, k=1))
print(generate_text("we naughty", 3, max_sequence_len, k=3))


we naughty go to day
we naughty go to day
