# Lab 7

## KEY CONTENTS


*   Part-Of-Speech Tagging

*   Hidden Markov Models (Probability-Based)
> It estimates the probability of a tag sequence for a given word sequence

    **Hidden States** -- Each observation has m candidate hidden states; e.g., each word has m candidate tags (the tag is hidden because it is what we want to predict).
    
    **Observation** -- The sequence of observed values, e.g., the words of a sentence.

    **Model Parameters**

    1.   **Transition Probabilities** -- The probability of a specific state transition (a POS tag is treated as a state, and a transition moves from one hidden state/tag to another, e.g., Noun->Verb)

    2.   **Emission Probabilities** -- The probability of an observation given a hidden state (e.g., the probability of a word given its tag, or, in the classic weather example, the probability of a type of clothing given the weather)



*   Viterbi Algorithm
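
To make the two parameter tables concrete, here is a minimal sketch of a toy HMM in plain Python. The tag names, words, and all probabilities are made up for illustration (they are not estimated from any corpus), and `joint_probability` is a hypothetical helper name. It multiplies one transition and one emission probability per word, which is the quantity the rest of the lab maximises.

```python
# Toy HMM with two hidden states (tags) and three observable words.
# All probabilities below are invented for illustration.
transition = {
    "START": {"NOUN": 0.7, "VERB": 0.3},
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emission = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "race": 0.4},
    "VERB": {"dogs": 0.05, "bark": 0.55, "race": 0.4},
}


def joint_probability(words, tags):
    """Product over i of P(tag_i | tag_{i-1}) * P(word_i | tag_i)."""
    prob = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        prob *= transition[prev][tag] * emission[tag][word]
        prev = tag
    return prob


# Score two competing tag sequences for the same two words.
p_noun_verb = joint_probability(["dogs", "bark"], ["NOUN", "VERB"])
p_verb_noun = joint_probability(["dogs", "bark"], ["VERB", "NOUN"])
print(p_noun_verb, p_verb_noun)
```

With these numbers the NOUN, VERB reading scores far higher than VERB, NOUN, so it is the sequence a tagger built on this toy model would prefer.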


# POS Tagging
POS tagging is the process of labelling a token in a corpus with a part-of-speech tag, based on the token's context and definition. This task is not straightforward, as a particular word may have a different part of speech depending on the context in which it is used.

## Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in _-ed_ is the past tense of a verb, and any word ending with _'s_ is a possessive noun. We can express these as a list of regular expressions:



In [None]:
import nltk

# Downloading required corpus
nltk.download('punkt')
nltk.download('brown')

from nltk import word_tokenize
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [None]:
print(brown_tagged_sents[77])

[('Barber', 'NP'), (',', ','), ('who', 'WPS'), ('is', 'BEZ'), ('in', 'IN'), ('his', 'PP$'), ('13th', 'OD'), ('year', 'NN'), ('as', 'CS'), ('a', 'AT'), ('legislator', 'NN'), (',', ','), ('said', 'VBD'), ('there', 'EX'), ('``', '``'), ('are', 'BER'), ('some', 'DTI'), ('members', 'NNS'), ('of', 'IN'), ('our', 'PP$'), ('congressional', 'JJ'), ('delegation', 'NN'), ('in', 'IN'), ('Washington', 'NP'), ('who', 'WPS'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('see', 'VB'), ('it', 'PPO'), ('(', '('), ('the', 'AT'), ('resolution', 'NN'), (')', ')'), ('passed', 'VBN'), ("''", "''"), ('.', '.')]


In [None]:
# Define regular expression patterns
patterns = [
        (r'.*ing$', 'VBG'),               # gerunds
        (r'.*ed$', 'VBD'),                # simple past
        (r'.*es$', 'VBZ'),                # 3rd singular present
        (r'.*ould$', 'MD'),               # modals
        (r'.*\'s$', 'NN$'),               # possessive nouns
        (r'.*s$', 'NNS'),                 # plural nouns
        (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers (the '.' is unescaped, so it matches any one character)
        (r'.*', 'NN')                     # nouns (default)
]

In [None]:
# Build regular expression tagger using the defined patterns
regexp_tagger = nltk.RegexpTagger(patterns)

# Print one of the sentences
print(brown_sents[77])
# Print one of the tagged sentences
print(regexp_tagger.tag(brown_sents[77]))

['Barber', ',', 'who', 'is', 'in', 'his', '13th', 'year', 'as', 'a', 'legislator', ',', 'said', 'there', '``', 'are', 'some', 'members', 'of', 'our', 'congressional', 'delegation', 'in', 'Washington', 'who', 'would', 'like', 'to', 'see', 'it', '(', 'the', 'resolution', ')', 'passed', "''", '.']
[('Barber', 'NN'), (',', 'NN'), ('who', 'NN'), ('is', 'NNS'), ('in', 'NN'), ('his', 'NNS'), ('13th', 'NN'), ('year', 'NN'), ('as', 'NNS'), ('a', 'NN'), ('legislator', 'NN'), (',', 'NN'), ('said', 'NN'), ('there', 'NN'), ('``', 'NN'), ('are', 'NN'), ('some', 'NN'), ('members', 'NNS'), ('of', 'NN'), ('our', 'NN'), ('congressional', 'NN'), ('delegation', 'NN'), ('in', 'NN'), ('Washington', 'NN'), ('who', 'NN'), ('would', 'MD'), ('like', 'NN'), ('to', 'NN'), ('see', 'NN'), ('it', 'NN'), ('(', 'NN'), ('the', 'NN'), ('resolution', 'NN'), (')', 'NN'), ('passed', 'VBD'), ("''", 'NN'), ('.', 'NN')]


In [None]:
# Evaluate the tagger (Calculate the accuracy/performance)
regexp_tagger.accuracy(brown_tagged_sents)

0.20326391789486245

In [None]:
raw = 'This race is awesome, I want to race too'
tokens = word_tokenize(raw)

print(regexp_tagger.tag(tokens))

[('This', 'NNS'), ('race', 'NN'), ('is', 'NNS'), ('awesome', 'NN'), (',', 'NN'), ('I', 'NN'), ('want', 'NN'), ('to', 'NN'), ('race', 'NN'), ('too', 'NN')]
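
The outputs above show why this tagger scores so low: the patterns are tried in order and, as far as the NLTK documentation describes it, the first one that matches decides the tag. The sketch below imitates that behaviour with the standard `re` module only. `first_match_tag` is a hypothetical helper, not part of NLTK, and the pattern list is a shortened copy of the one used in this lab.

```python
import re

# A shortened copy of the lab's pattern list; order matters.
patterns = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*ould$", "MD"),
    (r".*s$", "NNS"),
    (r".*", "NN"),  # catch-all default
]


def first_match_tag(token, patterns):
    """Return the tag of the first pattern that matches the token."""
    for regex, tag in patterns:
        if re.match(regex, token):
            return tag
    return None


tokens = ["This", "is", "passed", "would", "race"]
tagged = [(tok, first_match_tag(tok, patterns)) for tok in tokens]
print(tagged)
```

"This" and "is" end in _s_, so they fall into the plural-noun rule, and both uses of "race" get the default `NN` because the patterns never look at the surrounding words.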


# Hidden Markov Models 

A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model.
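
Written out, the tagging objective described in the code comments of the next cell is the following, where the words are $w_1 \dots w_N$, the tags are $t_1 \dots t_N$, and $t_0$ is taken to be the artificial START tag (this notation is ours, chosen to mirror those comments):

```latex
\hat{t}_{1:N} = \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N}),
\qquad
P(t_{1:N} \mid w_{1:N}) \;\propto\; \prod_{i=1}^{N} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i),
\qquad t_0 = \text{START}
```

The first factor in the product is a transition probability and the second is an emission probability.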

In [None]:
# Hidden Markov Models in Python
# Katrin Erk, https://www.katrinerk.com/, March 2013 updated March 2016
#
# This HMM addresses the problem of part-of-speech tagging. It estimates
# the probability of a tag sequence for a given word sequence as follows:
#
# Say words = w1....wN
# and tags = t1..tN
#
# then
# P(tags | words) is_proportional_to  product P(ti | t{i-1}) P(wi | ti)
#
# To find the best tag sequence for a given sequence of words,
# we want to find the tag sequence that has the maximum P(tags | words)
import nltk
import sys
nltk.download('brown')

from nltk.corpus import brown
from nltk.corpus import treebank


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [None]:
# Estimating P(wi | ti) from corpus data using Maximum Likelihood Estimation (MLE):
# P(wi | ti) = count(wi, ti) / count(ti)
#
# We add an artificial "start" tag at the beginning of each sentence and
# an artificial "end" tag at the end of each sentence.
# So we start out with the brown tagged sentences,
# add the two artificial tags,
# and then make one long list of all the tag/word pairs.

brown_tags_words = []
brown_tagged_sents = brown.tagged_sents()

for sent in brown_tagged_sents:
    # sent is a list of (word, tag) pairs
    # add START/START at the beginning
    brown_tags_words.append( ("START", "START") )
    # then add the sentence's pairs, flipped to (tag, word) order,
    # with each tag shortened to its first 2 characters
    brown_tags_words.extend([ (tag[:2], word) for (word, tag) in sent ])
    # then END/END
    brown_tags_words.append( ("END", "END") )

# conditional frequency distribution
cfd_tagwords = nltk.ConditionalFreqDist(brown_tags_words)
# conditional probability distribution
cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords, nltk.MLEProbDist)
##### refers to the emission table --> given the tag, the probability of the word_t is XXX

print("The probability of an adjective (JJ) being 'new' is", cpd_tagwords["JJ"].prob("new"))
print("The probability of a verb (VB) being 'duck' is", cpd_tagwords["VB"].prob("duck"))

# Estimating P(ti | t{i-1}) from corpus data using Maximum Likelihood Estimation (MLE):
# P(ti | t{i-1}) = count(t{i-1}, ti) / count(t{i-1})
brown_tags = [tag for (tag, word) in brown_tags_words ]

# make conditional frequency distribution:
# count(t{i-1} ti)
cfd_tags= nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))
# make conditional probability distribution, using
# maximum likelihood estimate:
# P(ti | t{i-1})
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)
##### refers to the transition table --> given the previous tag, the probability of the tag_i is XXX

print("If we have just seen 'DT', the probability of 'NN' is", cpd_tags["DT"].prob("NN"))
print("If we have just seen 'VB', the probability of 'DT' is", cpd_tags["VB"].prob("DT"))
print("If we have just seen 'VB', the probability of 'NN' is", cpd_tags["VB"].prob("NN"))


The probability of an adjective (JJ) being 'new' is 0.01472344917632025
The probability of a verb (VB) being 'duck' is 6.042713350943527e-05
If we have just seen 'DT', the probability of 'NN' is 0.5057722522030194
If we have just seen 'VB', the probability of 'DT' is 0.016885067592065053
If we have just seen 'VB', the probability of 'NN' is 0.10970977711020183
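
The two MLE formulas in the comments above are just ratios of counts. As a minimal sketch, the same estimates can be computed on a tiny hand-made corpus with `collections.Counter`. The three sentences and the helper names `emission_prob` and `transition_prob` are hypothetical and only serve to make the counting visible.

```python
from collections import Counter

# Tiny hand-made tagged corpus (hypothetical data, not from Brown).
tagged_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("race", "NN"), ("starts", "VB")],
    [("dogs", "NN"), ("race", "VB")],
]

tag_word_counts = Counter()  # count(tag, word)
tag_counts = Counter()       # count(tag), denominator for emissions
bigram_counts = Counter()    # count(prev_tag, tag)
prev_counts = Counter()      # count(prev_tag), denominator for transitions

for sent in tagged_sents:
    # Wrap each sentence in the artificial START and END tags.
    tags = ["START"] + [tag for _, tag in sent] + ["END"]
    for word, tag in sent:
        tag_word_counts[(tag, word)] += 1
        tag_counts[tag] += 1
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        prev_counts[prev] += 1


def emission_prob(word, tag):
    """P(word | tag) = count(tag, word) / count(tag)."""
    return tag_word_counts[(tag, word)] / tag_counts[tag]


def transition_prob(tag, prev):
    """P(tag | prev) = count(prev, tag) / count(prev)."""
    return bigram_counts[(prev, tag)] / prev_counts[prev]


print(emission_prob("race", "NN"))
print(transition_prob("NN", "DT"))
```

Here "race" is one of three `NN` tokens, and every `DT` is followed by `NN`. Any word or tag pair that never occurs gets probability 0 under MLE, which is why most entries in the Viterbi tables below are zero.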


## Viterbi Algorithm

In [None]:
#####
# Viterbi:
# If we have a word sequence, what is the best tag sequence?
#
# The method above lets us determine the probability for a single tag sequence.
# But in order to find the best tag sequence, we need the probability
# for _all_ tag sequences.
# What Viterbi gives us is just a good way of computing all those many probabilities
# as fast as possible.

# what is the list of all tags?
distinct_tags = set(brown_tags)

sentence = ["This", "race", "is", "awesome", ",", "I", "want", "to", "race", "too" ]
#sentence = ["I", "saw", "her", "duck" ]
sentlen = len(sentence)

# viterbi:
# for each step i in 1 .. sentlen,
# store a dictionary
# that maps each tag X
# to the probability of the best tag sequence of length i that ends in X
viterbi = [ ]

# backpointer:
# for each step i in 1..sentlen,
# store a dictionary
# that maps each tag X
# to the previous tag in the best tag sequence of length i that ends in X
backpointer = [ ]


##### AT The First Timestep --- first word
first_viterbi = { }
first_backpointer = { }
for tag in distinct_tags:
    # don't record anything for the START tag
    if tag == "START": continue
    ##### cpd_tags refers to the transition table --> given the previous tag, the probability of the tag_t is XXX
    ###### cpd_tags['tag_(t-1)'].prob(tag_t) == P(tag_t|tag_(t-1))
    ##### cpd_tagwords refers to the emission table --> given the tag, the probability of the word_t is XXX
    ###### cpd_tagwords['tag_t'].prob(word_t) == P(word_t|tag_t)
    
    first_viterbi[ tag ] = cpd_tags["START"].prob(tag) * cpd_tagwords[tag].prob( sentence[0] )
    first_backpointer[ tag ] = "START"

print(first_viterbi)
print(first_backpointer)
    
viterbi.append(first_viterbi)
backpointer.append(first_backpointer)

print('================')

currbest = max(first_viterbi.keys(), key = lambda tag: first_viterbi[ tag ])
print( "Word", "'" + sentence[0] + "'", "current best two-tag sequence:", first_backpointer[ currbest], currbest)
# print( "Word", "'" + sentence[0] + "'", "current best tag:", currbest)

print('================')

##### Second-Timestep --TILL-- Last-Timestep
for wordindex in range(1, len(sentence)):
    ##### starting from the second word
    this_viterbi = { }
    this_backpointer = { }
    prev_viterbi = viterbi[-1]
    
    for tag in distinct_tags:
        # don't record anything for the START tag
        if tag == "START": continue

        # if this tag is X and the current word is w, then 
        # find the previous tag Y such that
        # the best tag sequence that ends in X
        # actually ends in Y X
        # that is, the Y that maximizes
        # prev_viterbi[ Y ] * P(X | Y) * P( w | X)
        # The following command has the same notation
        # that you saw in the sorted() command.
        best_previous = max(prev_viterbi.keys(),
                            key = lambda prevtag: \
            prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex]))

        # Instead, we can also use the following longer code:
        # best_previous = None
        # best_prob = -1.0   # below any probability, so a tag is always chosen
        # for prevtag in prev_viterbi:   # excludes START, which has no entry
        #    prob = prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob(tag) * cpd_tagwords[tag].prob(sentence[wordindex])
        #    if prob > best_prob:
        #        best_previous = prevtag
        #        best_prob = prob
        #
        this_viterbi[ tag ] = prev_viterbi[ best_previous] * \
            cpd_tags[ best_previous ].prob(tag) * cpd_tagwords[ tag].prob(sentence[wordindex])
        this_backpointer[ tag ] = best_previous

    currbest = max(this_viterbi.keys(), key = lambda tag: this_viterbi[ tag ])
    print( "Word", "'" + sentence[ wordindex] + "'", "current best two-tag sequence:", this_backpointer[ currbest], currbest)
    # print( "Word", "'" + sentence[ wordindex] + "'", "current best tag:", currbest)
    print('================')

    # done with all tags in this iteration
    # so store the current viterbi step
    viterbi.append(this_viterbi)
    backpointer.append(this_backpointer)


# done with all words in the sentence.
# now find the probability of each tag
# to have "END" as the next tag,
# and use that to find the overall best sequence
prev_viterbi = viterbi[-1]
best_previous = max(prev_viterbi.keys(),
                    key = lambda prevtag: prev_viterbi[ prevtag ] * cpd_tags[prevtag].prob("END"))

prob_tagsequence = prev_viterbi[ best_previous ] * cpd_tags[ best_previous].prob("END")

# best tagsequence: we store this in reverse for now, will invert later
best_tagsequence = [ "END", best_previous ]
# invert the list of backpointers
backpointer.reverse()

# go backwards through the list of backpointers
# (or in this case forward, because we have inverted the backpointer list)
# in each case:
# the following best tag is the one listed under
# the backpointer for the current best tag
current_best_tag = best_previous
for bp in backpointer:
    best_tagsequence.append(bp[current_best_tag])
    current_best_tag = bp[current_best_tag]

best_tagsequence.reverse()


print('================')

print( "The sentence was:", end = " ")
for w in sentence: print( w, end = " ")
print("\n")
print( "The best tag sequence is:", end = " ")
for t in best_tagsequence: print(t, end = " ")
print("\n")
print( "The probability of the best tag sequence is:", prob_tagsequence)


{'MD': 0.0, 'QL': 0.0, 'NR': 0.0, 'NI': 0.0, ')-': 0.0, '--': 0.0, 'EX': 0.0, 'WP': 0.0, 'NN': 0.0, '.': 0.0, 'RP': 0.0, 'RN': 0.0, 'PN': 0.0, 'IN': 0.0, 'WR': 0.0, 'VB': 0.0, 'END': 0.0, "''": 0.0, 'WD': 0.0, 'AP': 0.0, ',-': 0.0, "'": 0.0, 'FW': 0.0, 'WQ': 0.0, 'CD': 0.0, 'NP': 0.0, '``': 0.0, '.-': 0.0, 'TO': 0.0, '*': 0.0, 'PP': 0.0, ')': 0.0, 'UH': 0.0, 'AB': 0.0, 'DT': 0.0033218181276236437, ',': 0.0, 'HV': 0.0, ':-': 0.0, 'RB': 0.0, 'DO': 0.0, '(-': 0.0, ':': 0.0, 'CC': 0.0, 'JJ': 0.0, 'OD': 0.0, '(': 0.0, '*-': 0.0, 'BE': 0.0, 'CS': 0.0, 'AT': 0.0}
{'MD': 'START', 'QL': 'START', 'NR': 'START', 'NI': 'START', ')-': 'START', '--': 'START', 'EX': 'START', 'WP': 'START', 'NN': 'START', '.': 'START', 'RP': 'START', 'RN': 'START', 'PN': 'START', 'IN': 'START', 'WR': 'START', 'VB': 'START', 'END': 'START', "''": 'START', 'WD': 'START', 'AP': 'START', ',-': 'START', "'": 'START', 'FW': 'START', 'WQ': 'START', 'CD': 'START', 'NP': 'START', '``': 'START', '.-': 'START', 'TO': 'START', '*

In [None]:
# P(END|QL) = 0 
# P(END|RB) = 1
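
For a version that is small enough to trace by hand, the same recurrence can be packed into one function and run on a two-tag toy model. This is a sketch under stated assumptions rather than a replacement for the cell above: `viterbi_decode` is a hypothetical name, the probabilities are invented, and plain dictionaries stand in for the NLTK probability distributions.

```python
def viterbi_decode(words, tags, trans, emit):
    """Return (best_tag_sequence, probability) for a toy HMM.

    trans[prev][tag] = P(tag | prev); emit[tag][word] = P(word | tag).
    Missing entries are treated as probability 0.
    """
    # best[i][tag]: probability of the best path for words[:i+1] ending in tag
    best = [{t: trans["START"].get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{t: "START" for t in tags}]
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # The emission term does not depend on the previous tag,
            # so it can be left out of the argmax and applied afterwards.
            prev = max(tags, key=lambda p: best[-1][p] * trans[p].get(t, 0.0))
            scores[t] = best[-1][prev] * trans[prev].get(t, 0.0) * emit[t].get(word, 0.0)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Close the sequence with the END transition, then follow backpointers.
    last = max(tags, key=lambda t: best[-1][t] * trans[t].get("END", 0.0))
    prob = best[-1][last] * trans[last].get("END", 0.0)
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, prob


# Made-up probabilities for illustration only.
trans = {
    "START": {"NOUN": 0.7, "VERB": 0.3},
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "END": 0.2},
    "VERB": {"NOUN": 0.5, "VERB": 0.1, "END": 0.4},
}
emit = {
    "NOUN": {"dogs": 0.6, "race": 0.4},
    "VERB": {"dogs": 0.1, "race": 0.9},
}
path, prob = viterbi_decode(["dogs", "race"], ["NOUN", "VERB"], trans, emit)
print(path, prob)
```

Multiplying many probabilities shrinks quickly towards zero on long sentences, so practical implementations usually add log-probabilities instead. The sketch keeps products to stay close to the code in this lab.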

In [None]:
print('by default, the end statement is \n -- i.e., next print will be in a new line')
print('111next print')

print('-=-=-=-=-=-=-=-=-')

print('Now, we change the end statement to white space -- i.e., next print will be right after this print, following a white space', end = ' ')
print('222new print')

print('-=-=-=-=-=-=-=-=-')

print('try something else', end = '@@@')
print('333new print')

by default, the end statement is 
 -- i.e., next print will be in a new line
111next print
-=-=-=-=-=-=-=-=-
Now, we change the end statement to white space -- i.e., next print will be right after this print, following a white space 222new print
-=-=-=-=-=-=-=-=-
try something else@@@333new print


The HMM and Viterbi code above was written by [Katrin Erk](http://www.katrinerk.com/courses/python-worksheets/hidden-markov-models-for-pos-tagging-in-python).

## Train HMM Tagger with NLTK HMM Trainer

The code above was a complete implementation of the details of an HMM. In this section, we will use an existing implementation in NLTK.

In [None]:
# Pretagged training data
brown_tagged_sents = brown.tagged_sents()

print(brown_tagged_sents)

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

In [None]:
# Import HMM module
from nltk.tag import hmm

# Set up a trainer with default (None) values
# and train it on the tagged sentences
trainer = hmm.HiddenMarkovModelTrainer()
trained_tagger = trainer.train_supervised(brown_tagged_sents)

# Print a basic summary of the tagger
print(trained_tagger)

tokens = word_tokenize("This race is awesome, I want to race too")
print(trained_tagger.tag(tokens))

<HiddenMarkovModelTagger 472 states and 56057 output symbols>
[('This', 'DT'), ('race', 'NN'), ('is', 'BEZ'), ('awesome', 'JJ'), (',', ','), ('I', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('race', 'VB'), ('too', 'QL')]


# Bi-LSTM based POS Tagger (PyTorch)

In this example, we construct and train a PoS tagger using a Bi-LSTM model.

![Bi-LSTM N-to-N tagging architecture](https://usydnlpgroup.files.wordpress.com/2020/03/bi-lstm_nton-e1586049916759.png)

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Training data

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

nltk.download('treebank')
from nltk.corpus import treebank

import numpy as np
from sklearn.model_selection import train_test_split
 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


In [None]:
# Retrieve tagged sentences from treebank corpus
tagged_sentences = nltk.corpus.treebank.tagged_sents()
 
print(tagged_sentences[0])
print("Tagged sentences: ", len(tagged_sentences))
print("Tagged words:", len(nltk.corpus.treebank.tagged_words()))
#tagged_words(): list of (str,str) tuple

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tagged sentences:  3914
Tagged words: 100676


In [None]:
sentences, sentence_tags = [], []
for tagged_sentence in tagged_sentences:
    sentence = [v[0] for v in tagged_sentence]
    tags = [v[1] for v in tagged_sentence]
    sentences.append(np.array(sentence))
    sentence_tags.append(np.array(tags))
 
print(sentences[5])
print(sentence_tags[5])

['Lorillard' 'Inc.' ',' 'the' 'unit' 'of' 'New' 'York-based' 'Loews'
 'Corp.' 'that' '*T*-2' 'makes' 'Kent' 'cigarettes' ',' 'stopped' 'using'
 'crocidolite' 'in' 'its' 'Micronite' 'cigarette' 'filters' 'in' '1956'
 '.']
['NNP' 'NNP' ',' 'DT' 'NN' 'IN' 'JJ' 'JJ' 'NNP' 'NNP' 'WDT' '-NONE-' 'VBZ'
 'NNP' 'NNS' ',' 'VBD' 'VBG' 'NN' 'IN' 'PRP$' 'NN' 'NN' 'NNS' 'IN' 'CD'
 '.']


In [None]:
(train_sentences, 
 test_sentences, 
 train_tags, 
 test_tags) = train_test_split(sentences, sentence_tags, test_size=0.2, random_state = 42)

### Making vocab with special tokens

*PAD: Padding*

*OOV: Out Of Vocabulary*

In [None]:
words, tags = set(), set()
 
for s in train_sentences:
    for w in s:
        words.add(w.lower())

for ts in train_tags:
    for t in ts:
        tags.add(t)

word2index = {w: i + 2 for i, w in enumerate(list(words))}
word2index['-PAD-'] = 0  # The special value used for padding
word2index['-OOV-'] = 1  # The special value used for OOVs

tag2index = {t: i + 2 for i, t in enumerate(list(tags))}
tag2index['-PAD-'] = 0  # The special value used to tag padding
tag2index['-OOV-'] = 1  # The special value used to tag OOVs

In [None]:
def encode_sentences(sentences):
    res = []
    for sent in sentences:
        temp = [word2index[word.lower()] if word.lower() in word2index else word2index['-OOV-'] for word in sent]
        res.append(temp)
    return res

train_sentences_encoded = encode_sentences(train_sentences)
test_sentences_encoded = encode_sentences(test_sentences)


train_tags_y, test_tags_y = [], []

def tag_to_index(tags_list):
    res = []
    for tags in tags_list:
        temp = [tag2index[tag] if tag in tag2index else tag2index['-OOV-'] for tag in tags]
        res.append(temp)
    return res

train_tags_y = tag_to_index(train_tags)
test_tags_y = tag_to_index(test_tags)
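
A quick way to see what the encoding does with unseen words is to run the same lookup on a toy vocabulary. The sketch below uses a hypothetical five-entry `toy_word2index` and a helper named `encode`. It relies on `dict.get` with a default, which is equivalent to the conditional expression used in `encode_sentences`.

```python
# Hypothetical vocabulary; indices 0 and 1 are reserved as in the lab.
toy_word2index = {"-PAD-": 0, "-OOV-": 1, "the": 2, "dog": 3, "barks": 4}


def encode(sentence, vocab):
    """Lower-case each word and fall back to the -OOV- index when unseen."""
    return [vocab.get(word.lower(), vocab["-OOV-"]) for word in sentence]


encoded = encode(["The", "dog", "meows"], toy_word2index)
print(encoded)  # [2, 3, 1]
```

"The" is found after lower-casing, while "meows" is not in the vocabulary and is mapped to index 1. Every test-set word that never appeared in training is handled this way.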

### Padding

During training it is easier to work with sequences of equal length, but our sentences vary in length. We solve this by padding: appending "-PAD-" as many times as needed to bring every sentence to the same fixed length. Sequences longer than that length are truncated.

In [None]:
# Pad to max_length
max_length = len(max(train_sentences_encoded, key=len))
print(max_length) 

271


In [None]:
def pad_sequence(seq_list, max_length, index_dict):
    res = []
    for seq in seq_list:
        temp = seq[:]
        if len(seq)>max_length:
            res.append(temp[:max_length])
        else:
            temp += [index_dict['-PAD-']] * (max_length - len(seq))
            res.append(temp)
    return np.array(res)

train_sentences_encoded_pad = pad_sequence(train_sentences_encoded, max_length, word2index)
test_sentences_encoded_pad = pad_sequence(test_sentences_encoded, max_length, word2index)
train_tags_y_pad = pad_sequence(train_tags_y, max_length, tag2index)
test_tags_y_pad = pad_sequence(test_tags_y, max_length, tag2index)

### Build Dataset and Dataloader for training data

In [None]:
from torch.utils.data import TensorDataset
#More detailed info about the TensorDataset, https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataset.html#TensorDataset
train_data = TensorDataset(torch.from_numpy(train_sentences_encoded_pad), torch.from_numpy(train_tags_y_pad))

from torch.utils.data import DataLoader
#More detailed info about the dataLoader, https://pytorch.org/docs/1.1.0/_modules/torch/utils/data/dataloader.html
batch_size = 128
train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True) 
# shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

## Model

In [None]:
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)  
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds)
        tag_space = self.hidden2tag(lstm_out)   
        return tag_space

EMBEDDING_DIM = 128
HIDDEN_DIM = 256

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word2index), len(tag2index)).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Note: the cell below can take 30+ minutes if you do not have a GPU.

For the purpose of this lab, running for 2 epochs (which takes < 5 minutes) is sufficient if you don't want to wait and can't get access to a GPU.

In [None]:
from sklearn.metrics import accuracy_score

number_epochs = 20

for epoch in range(number_epochs):  
    loss_now = 0.0
    correct = 0

    for sentence,targets in train_loader:
        sentence = sentence.to(device)
        targets = targets.to(device)

        temp_batch_size = sentence.shape[0]

        model.train()
        optimizer.zero_grad()               
        tag_space = model(sentence)
        loss = loss_function(tag_space.view(-1, tag_space.shape[-1]), targets.view(-1))
        loss.backward()
        optimizer.step()

        loss_now += loss.item() * temp_batch_size
        predicted = torch.argmax(tag_space, -1)
        # Note: the training accuracy here also counts "-PAD-" positions, which makes it look higher than the per-word accuracy.
        correct += accuracy_score(predicted.view(-1).cpu().numpy(),targets.view(-1).cpu().numpy())*temp_batch_size

    print('Epoch: %d, training loss: %.4f, training accuracy: %.2f%%'%(epoch+1,loss_now/len(train_data),100*correct/len(train_data)))

Epoch: 1, training loss: 0.7572, training accuracy: 87.25%
Epoch: 2, training loss: 0.3024, training accuracy: 92.91%
Epoch: 3, training loss: 0.2378, training accuracy: 94.25%
Epoch: 4, training loss: 0.1994, training accuracy: 94.94%
Epoch: 5, training loss: 0.1676, training accuracy: 95.60%
Epoch: 6, training loss: 0.1409, training accuracy: 96.31%
Epoch: 7, training loss: 0.1193, training accuracy: 96.82%
Epoch: 8, training loss: 0.1020, training accuracy: 97.27%
Epoch: 9, training loss: 0.0878, training accuracy: 97.66%
Epoch: 10, training loss: 0.0761, training accuracy: 97.95%
Epoch: 11, training loss: 0.0664, training accuracy: 98.21%
Epoch: 12, training loss: 0.0585, training accuracy: 98.42%
Epoch: 13, training loss: 0.0515, training accuracy: 98.62%
Epoch: 14, training loss: 0.0456, training accuracy: 98.79%
Epoch: 15, training loss: 0.0404, training accuracy: 98.94%
Epoch: 16, training loss: 0.0360, training accuracy: 99.06%
Epoch: 17, training loss: 0.0321, training accura
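
The training accuracy printed above counts padded positions, which the model learns to predict almost perfectly, so it overstates per-word accuracy. As a minimal sketch of the difference, the plain-Python helper below (`masked_accuracy` is a hypothetical name) skips every position whose target is the padding index. The index 0 matches `tag2index['-PAD-']` in this lab, and the toy predictions are made up.

```python
PAD_INDEX = 0  # matches tag2index['-PAD-'] in this lab


def masked_accuracy(predicted, targets, pad_index=PAD_INDEX):
    """Token accuracy over positions whose target is not padding."""
    correct = total = 0
    for pred_row, target_row in zip(predicted, targets):
        for pred, target in zip(pred_row, target_row):
            if target == pad_index:
                continue  # ignore padded positions entirely
            total += 1
            correct += int(pred == target)
    return correct / total if total else 0.0


# Two toy sentences padded to length 5 (made-up tag indices).
targets = [[3, 4, 2, 0, 0], [5, 2, 0, 0, 0]]
predicted = [[3, 4, 5, 0, 0], [5, 5, 0, 0, 0]]

flat_pairs = [(p, t) for pr, tr in zip(predicted, targets) for p, t in zip(pr, tr)]
unmasked = sum(p == t for p, t in flat_pairs) / len(flat_pairs)
masked = masked_accuracy(predicted, targets)
print(unmasked, masked)  # 0.8 0.6
```

The same idea can be applied to the loss: `nn.CrossEntropyLoss` accepts an `ignore_index` argument, so passing the padding index there would keep padded positions out of the gradient. That is an optional change and is not what produced the numbers recorded in this lab.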

## Test with the test set

In [None]:
model.eval()
sentence = torch.from_numpy(test_sentences_encoded_pad).to(device)
# Gradients are not needed for evaluation; disabling them saves memory
with torch.no_grad():
    tag_space = model(sentence)
predicted = torch.argmax(tag_space, -1)
predicted = predicted.cpu().numpy()

# cut off the PAD part
test_len_list = [len(s) for s in test_sentences_encoded]
actual_predicted_list= []
for i in range(predicted.shape[0]):
    actual_predicted_list+=list(predicted[i])[:test_len_list[i]]

# get actual tag list
actual_tags = sum(test_tags_y, [])

print('Test Accuracy: %.2f%%'%(accuracy_score(actual_predicted_list,actual_tags)*100))

Test Accuracy: 88.06%
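
The predictions above are still tag indices. To read them as tags, invert the tag dictionary and cut off the padded tail, as in the sketch below. `toy_tag2index`, `toy_index2tag`, `decode_tags`, and the example prediction are hypothetical stand-ins for the lab's `tag2index` and the model output.

```python
# Hypothetical mapping in the same format as the lab's tag2index.
toy_tag2index = {"-PAD-": 0, "-OOV-": 1, "DT": 2, "NN": 3, "VBZ": 4}
toy_index2tag = {index: tag for tag, index in toy_tag2index.items()}


def decode_tags(padded_prediction, sentence_length, index2tag):
    """Drop the padded tail, then map each index back to its tag string."""
    return [index2tag[i] for i in padded_prediction[:sentence_length]]


words = ["the", "dog", "barks"]
padded_prediction = [2, 3, 4, 0, 0]  # made-up model output, padded to length 5
decoded = list(zip(words, decode_tags(padded_prediction, len(words), toy_index2tag)))
print(decoded)
```

This gives `(word, tag)` pairs in the same shape as the output of the NLTK taggers earlier in the lab, which makes the three approaches easy to compare on the same sentence.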


# Extension: Saving Data

Sample code for saving data to files.

### CSV file

Useful for saving tables of data in a way that you can look at (e.g., with a text editor) or read into a spreadsheet application.

In [None]:
import pandas as pd

data = ['this is a cat', 'today is a sunny day']
df = pd.DataFrame(data,columns=['data'])

# Save data to csv file
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
df.to_csv('save_as_csv.csv')

In [None]:
# Load saved data
# https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv
df = pd.read_csv('save_as_csv.csv')

# df.head()  # Uncomment to check what the data looks like

data = df['data'].tolist()
data[:2]

['this is a cat', 'today is a sunny day']

### JSON file

Convenient if you want to be able to look at the output (e.g., with a text editor) or read it into a different program, even one written in a different language.

In [None]:
import json
data = [['this','is','a','cat'],['today','is','a','sunny','day']]
data_dict = {'data': data}

# Save data to json file
with open('save_as_json.json','w') as f:
  json.dump(data_dict,f)

In [None]:
# Load data from json file
with open('save_as_json.json','r') as f:
  data=json.load(f)
data['data']

[['this', 'is', 'a', 'cat'], ['today', 'is', 'a', 'sunny', 'day']]

### Pickle (pkl) file

Convenient for Python objects that you plan to reopen in Python later.

In [None]:
import pickle
data = [['this','is','a','cat'],['today','is','a','sunny','day']]
data_dict = {'data': data}

# Save data to pkl file
with open('save_as_pkl.pkl','wb') as f:
  pickle.dump(data_dict,f)

In [None]:
# Load data from pkl file
with open('save_as_pkl.pkl','rb') as f:
  data=pickle.load(f)
data['data']


[['this', 'is', 'a', 'cat'], ['today', 'is', 'a', 'sunny', 'day']]
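
One practical difference between the JSON and pickle formats is what happens to Python-specific types. The sketch below round-trips the same made-up `record` through both, in memory with `dumps`/`loads` rather than through files, to show the effect.

```python
import json
import pickle

# A made-up record with a tuple value and a non-string key.
record = {"tokens": ("this", "is", "a", "cat"), 1: "integer key"}

# JSON: tuples come back as lists and non-string keys become strings.
from_json = json.loads(json.dumps(record))

# Pickle: the Python object is restored with its original types.
from_pickle = pickle.loads(pickle.dumps(record))

print(from_json)
print(from_pickle)
```

JSON is the safer choice for sharing data, because unpickling can execute arbitrary code and should only be done with files you trust. Pickle is the convenient choice when the exact Python types matter.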

### Save files from colab to your google drive

The example below shows how to mount your Google Drive on your runtime using an authorization code, and how to write and read files. Once executed, you will be able to see the new file (`foo.txt`) at [https://drive.google.com/](https://drive.google.com/).

In [None]:
from google.colab import drive
drive.mount('/gdrive')

with open('/gdrive/My Drive/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat '/gdrive/My Drive/foo.txt'

Mounted at /gdrive
Hello Google Drive!

In [None]:
# Copy the saved data file from colab to your google drive
# save_as_pkl.pkl is generated using the sample code from the section above
!cp save_as_pkl.pkl /gdrive/My\ Drive

In [None]:
# After you copy or move your saved data file to your google drive
# you can load it directly from your google drive later 
with open('/gdrive/My Drive/save_as_pkl.pkl','rb') as f:
  data=pickle.load(f)
data

{'data': [['this', 'is', 'a', 'cat'], ['today', 'is', 'a', 'sunny', 'day']]}