# Introduction
In `2. Part of Speech Tagging - LSTM` I attempted to use an RNN for Part of Speech (POS) tagging based on word embeddings created with Google's word2vec pretrained on their news data set, and some extra features.

In this notebook I will have another go at it with some more experience with RNNs. Also, this time I will be sure to ignore padded data when calculating the loss by using the `sample_weight` parameter of `fit`. I think not using sample weights might have severely harmed my previous model.

This time I will include a frozen embedding layer in my model instead of preprocessing the data into word vectors.
I'll try word embeddings trained with [Glove](https://nlp.stanford.edu/projects/glove/), Stanford's embedding model, this time.

# The data
I will be using the same training data for my tagger as in `1. Part of Speech Tagging - First Atempt`:
[Universal Dependencies - English Web Treebank](http://universaldependencies.org/treebanks/en_ewt/index.html), a CoNLL-U format corpus with 254 830 words and 16 622 sentences in English *taken from various web media including weblogs, newsgroups, emails, reviews, and Yahoo! answers*.

## Load the Data
First let's load the training, dev and test data and parse it into Python data structures.
I use the [conllu](https://github.com/EmilStenstrom/conllu) Python package to parse the CoNLL-U files to dictionaries.
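
To make the format concrete, here is a minimal hand-rolled sketch of what a CoNLL-U sentence looks like and how its ten tab-separated columns map to fields. The sample sentence and the `parse_conllu_sentence` helper are made up for illustration; in this notebook the `conllu` package does the real parsing, and the version I use exposes the tag under the key `upostag` rather than `upos`.

```python
# Column names as defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# A made-up two-token sentence (not taken from the treebank).
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)

def parse_conllu_sentence(text):
    """Hypothetical helper: turn one CoNLL-U sentence into a list of dicts."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return tokens

tokens = parse_conllu_sentence(sample)
forms = [tok["form"] for tok in tokens]
tags = [tok["upos"] for tok in tokens]
print(forms, tags)
```

Only the `form` and the universal POS tag columns matter for the tagger in this notebook.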

In [1]:
import numpy as np
import conllu

Read the data.

In [2]:
directory = 'UD/UD_English-EWT'

with open('{}/en_ewt-ud-train.conllu'.format(directory), 'r', encoding='utf-8') as f:
    train_text = f.read()

with open('{}/en_ewt-ud-dev.conllu'.format(directory), 'r', encoding='utf-8') as f:
    dev_text = f.read()

with open('{}/en_ewt-ud-test.conllu'.format(directory), 'r', encoding='utf-8') as f:
    test_text = f.read()

Convert it to a dictionary.

In [3]:
train_dict = conllu.parse(train_text)
dev_dict = conllu.parse(dev_text)
test_dict = conllu.parse(test_text)

Count sentences and tokens.

In [4]:
from functools import reduce

n_train_sentences = len(train_dict)
n_train_tokens = reduce(lambda x, y: x + len(y), train_dict, 0)

print("The training set contains {} sentences and {} tokens".format(n_train_sentences, n_train_tokens))

The training set contains 12543 sentences and 204607 tokens


In [5]:
train_sentences = [[token['form'] for token in sentence] for sentence in train_dict]
train_labels = [[token['upostag'] for token in sentence] for sentence in train_dict]

dev_sentences = [[token['form'] for token in sentence] for sentence in dev_dict]
dev_labels = [[token['upostag'] for token in sentence] for sentence in dev_dict]

test_sentences = [[token['form'] for token in sentence] for sentence in test_dict]
test_labels = [[token['upostag'] for token in sentence] for sentence in test_dict]

In [6]:
pos_tags = list(set(reduce(lambda x, y: x + y, train_labels)))

In [7]:
n_labels = len(pos_tags)
n_labels

17

In [8]:
pos_idx = dict(zip(pos_tags, np.arange(len(pos_tags))))

In [9]:
pos_encoding = {}
for pos, i in pos_idx.items():
    pos_encoding[pos] = np.zeros(len(pos_tags))
    pos_encoding[pos][i]=1

# Feature Engineering
[Wang et al.](https://arxiv.org/pdf/1510.06168.pdf) showed that a bidirectional LSTM network could achieve state of the art performance without using any morphological features. They only used these features:
* Word embedding of the word (cast to lower case)
* Suffix of length two, one-hot encoded
* Whether the word is all caps, lower case, or has an initial capital letter, one-hot encoded

I am not using a bidirectional LSTM, but at least I am using an RNN.
I'll opt to only use word embeddings for starters.
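
For reference, the two hand-crafted features from the list above could look roughly like this. This is only a sketch of my reading of the feature description, not code from the paper: the helper names are made up, the fourth "other" bucket for tokens without a clear casing (such as numbers) is my own assumption, and the suffix would still need its own index before it can be one-hot encoded.

```python
def suffix2(word):
    """Last two characters of the lower-cased word (shorter words are kept whole)."""
    return word.lower()[-2:]

def capitalisation(word):
    """One-hot vector over [all caps, lower case, initial capital, other]."""
    if word.isupper():
        return [1, 0, 0, 0]
    if word.islower():
        return [0, 1, 0, 0]
    if word[:1].isupper():
        return [0, 0, 1, 0]
    # Assumption: tokens without any cased characters end up here
    return [0, 0, 0, 1]

features = {word: (suffix2(word), capitalisation(word))
            for word in ["DPA", "this", "Syrian", "123"]}
print(features)
```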

## Glove
I will opt to use the 100 dimensional Glove 6B data. My vocabulary will be exactly the words inside the pretrained model.

In [46]:
glove_path = 'glove/glove.6B/glove.6B.100d.txt'
embeddings = {}
token_index = {}
with open(glove_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        tok, *vec = line.split()
        embeddings[tok] = np.array(vec, dtype='float32')
        # Reserve index 0 for unknown words
        token_index[tok] = i + 1

Let's check if we have any words that are out of vocabulary (OOV).

In [11]:
oov = []
for sentence in train_sentences:
    for tok in sentence:
        if tok not in embeddings:
            oov.append(tok)

In [12]:
len(oov)

31192

Ouch, about 31 000 of the 204 000 tokens are OOV. Most likely this is caused by a mismatch between my tokens and the preprocessing that was used when the embeddings were trained.

In [13]:
oov[:10]

['Al',
 'Zaman',
 'American',
 'Shaikh',
 'Abdullah',
 'Ani',
 'Qaim',
 'Syrian',
 'This',
 'DPA']

Aha! Glove is lower case only, so I should convert my data to lower case as well.

In [14]:
train_sentences = [[token.lower() for token in sentence] for sentence in train_sentences]
dev_sentences = [[token.lower() for token in sentence] for sentence in dev_sentences]
test_sentences = [[token.lower() for token in sentence] for sentence in test_sentences]

In [15]:
oov = []
for sentence in train_sentences:
    for tok in sentence:
        if tok not in embeddings:
            oov.append(tok)

In [16]:
len(oov)

2442

In [17]:
oov[:100]

["'akkab",
 'jubur',
 'batawi',
 'sarhid',
 'batawi',
 'clientelage',
 "47's",
 'fallujan',
 'saddamites',
 'fallujan',
 'sweared',
 'unscear',
 'conseguences',
 'emercom',
 'wi940',
 'http://www.ibiblio.org/expo/soviet.exhibit/chernobyl.html',
 'http://www.ibrae.ac.ru/ibrae/eng/chernobyl/nat_rep/nat_repe.htm#24',
 'http://www.nsrl.ttu.edu/chernobyl/wildlifepreserve.htm',
 'http://www.environmentalchemistry.com/yogi/hazmat/articles/chernobyl1.html',
 'http://digon_va.tripod.com/chernobyl.htm',
 'http://www.oneworld.org/index_oc/issue196/byckau.html',
 'http://www.collectinghistory.net/chernobyl/',
 'http://www.ukrainianweb.com/chernobyl_ukraine.htm',
 'http://www.bullatomsci.org/issues/1993/s93/s93marples.html',
 'http://www.calguard.ca.gov/ia/chernobyl-15%20years.htm',
 'http://www.infoukes.com/history/chornobyl/gregorovich/index.html',
 'http://www.un.org/ha/chernobyl/',
 'http://www.tecsoc.org/pubs/history/2002/apr26.htm',
 'http://www.chernobyl.org.uk/page2.htm',
 'http://www.time.

Lots of Arabic-sounding words, URLs and misspelled words. Let's see if the Glove data has a `url` token.

In [18]:
'url' in embeddings

True

Cool, let's replace all URLs with that one. Also, let's check what POS tags are expected for URLs.

In [19]:
import re

In [20]:
def process_word(word):
    # Match words starting with www., http:// or https://
    # (the original pattern `www\.*` also matched any word starting with "www")
    if re.match(r'^(?:https?://|www\.)', word):
        return "url"
    else:
        return word.lower()
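
A quick sanity check of this kind of pattern on a few tokens. The pattern below is my assumption of what should count as a URL (a token starting with `http://`, `https://` or `www.`), and `wwwhatever` is a made-up token to show that a bare `www` prefix without a dot should not match.

```python
import re

# Assumed pattern: starts with "http://", "https://" or "www."
URL_PATTERN = re.compile(r'^(?:https?://|www\.)')

def looks_like_url(token):
    """Hypothetical helper: True if the token starts like a URL."""
    return URL_PATTERN.match(token) is not None

samples = ["http://www.un.org/ha/chernobyl/", "https://example.org",
           "www.example.org", "wwwhatever", "chernobyl"]
flags = [looks_like_url(tok) for tok in samples]
print(list(zip(samples, flags)))
```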

In [21]:
train_sentences = [[process_word(word) for word in sentence] for sentence in train_sentences]
dev_sentences = [[process_word(word) for word in sentence] for sentence in dev_sentences]
test_sentences = [[process_word(word) for word in sentence] for sentence in test_sentences]

In [22]:
oov = []
for sentence in train_sentences:
    for tok in sentence:
        if tok not in embeddings:
            oov.append(tok)

In [23]:
len(oov)

2310

In [24]:
url_tags = []
for i, sentence in enumerate(train_sentences):
    for j, tok in enumerate(sentence):
        if tok == 'url':
            url_tags.append(train_labels[i][j])

In [25]:
np.unique(np.array(url_tags),return_counts=True)

(array(['NOUN', 'PROPN', 'X'],
       dtype='<U5'), array([  2,   1, 132], dtype=int64))

Most are classified as `X`. I think casting them all to `url` is sound!

It would be interesting to correct the spelling of all misspelled words. One method could be to calculate the [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance) from each OOV word to the words inside the vocabulary, and cast it to the closest match (if it is within some maximum distance).

[Norvig](http://norvig.com/spell-correct.html) shows how to find all words within a Damerau-Levenshtein distance of 2. Neat!
This method is easily extended to find all words within distance k.
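
To get a feeling for the distance itself, here is a small sketch of the restricted variant (often called optimal string alignment), where the allowed edits are insertion, deletion, substitution and swapping two adjacent characters. Note that this is the simpler restricted version, not the unrestricted Damerau-Levenshtein distance, and the function name `osa_distance` is my own.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all j characters

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(osa_distance("hirsohima", "hiroshima"))
print(osa_distance("mispeled", "misspelled"))
```

Computing this against every word in a 400 000 word vocabulary is slow, which is why generating candidate edits and looking them up in the vocabulary is attractive.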

First, find all characters present in the embeddings.

In [26]:
character_vocab = set()

for word in embeddings:
    for c in word:
        character_vocab.add(c)

In [27]:
len(character_vocab)

489

Ouch, that's a lot of characters. Let's ignore all characters that are not ASCII. (I think ASCII characters are the most common ones in misspellings.)

In [28]:
# Only keep characters in the ascii range
ascii_vocab = set([c for c in character_vocab if 32 <= ord(c) <= 126])

In [29]:
len(ascii_vocab)

68

In [30]:
def neighbour_words(word):
    # Tuples with all possible splits of word, including an empty right part
    # so that a character can also be inserted at the end
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    
    # All words generated by deleting one character
    deletes = [L + R[1:] for L, R in splits if R]
    # All words generated by swapping two adjacent characters in word
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    # All words generated by inserting a character in word
    insertions = [L + c + R for L, R in splits for c in ascii_vocab]
    # All words generated by replacing a character in word
    replace = [L + c + R[1:] for L, R in splits for c in ascii_vocab if R]
    
    return set(deletes + transposes + insertions + replace)
    

In [31]:
def fix_spelling(word, vocab, max_distance=1):
    candidates = set([word])
    
    for i in range(max_distance+1):
        for candidate in candidates:
            if candidate in vocab:
                return candidate
        new_candidates = set()
        for candidate in candidates:
            new_candidates = new_candidates.union(neighbour_words(candidate))
        candidates = new_candidates
    return None

Right, the number of possibilities explodes for longer words and this large vocabulary.
I'll only search for matches within just one edit, and if there are many possibilities I don't care which one I get.
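
A back-of-the-envelope count shows how fast this grows. The numbers below assume a made-up word length of 10, the 68 ASCII characters found above, and that every insertion position is counted (including the end of the word); duplicates are not removed, so this is a rough estimate rather than an exact count.

```python
n = 10           # word length (made-up example)
alphabet = 68    # size of the ASCII character vocabulary found above

deletes = n                      # drop any one character
transposes = n - 1               # swap any adjacent pair
inserts = alphabet * (n + 1)     # any character at any position
replaces = alphabet * n          # any character replaced by any other

one_edit = deletes + transposes + inserts + replaces
# Each distance-one candidate spawns roughly as many candidates again
two_edits_estimate = one_edit ** 2
print(one_edit, two_edits_estimate)
```

For a ten-character word this comes to 1447 raw candidates at distance one, and on the order of two million at distance two.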

Here is a version of the above function that will just look at distance one, and return the first match.

In [32]:
def find_one_neighbour(word, vocab):
    # Tuples with all possible splits of word, including an empty right part
    # so that a character can also be inserted at the end
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    
    # All words generated by deleting one character
    for L, R in splits:
        candidate = L + R[1:] if R else None
        if candidate in vocab:
            return candidate
    # All words generated by swapping two adjacent characters in word
    for L, R in splits:
        candidate = L + R[1] + R[0] + R[2:] if len(R) > 1 else None
        if candidate in vocab:
            return candidate
    # All words generated by inserting a character in word
    for L, R in splits:
        for c in ascii_vocab:
            candidate = L + c + R
            if candidate in vocab:
                return candidate
    # All words generated by replacing a character in word
    for L, R in splits:
        for c in ascii_vocab:
            candidate = L + c + R[1:] if R else None
            if candidate in vocab:
                return candidate
        
    # No in-vocabulary word within one edit
    return None
    

In [33]:
find_one_neighbour("mispeled", ["misspelled"])

In [34]:
fixed_spelling = [find_one_neighbour(word, embeddings) for word in oov]

In [35]:
new_spelling = dict(zip(oov, fixed_spelling))

In [36]:
list(new_spelling.items())[:20]

[("'akkab", None),
 ('jubur', 'bubur'),
 ('batawi', 'tatawi'),
 ('sarhid', 'sahid'),
 ('clientelage', None),
 ("47's", '47s'),
 ('fallujan', 'falluja'),
 ('saddamites', None),
 ('sweared', 'seared'),
 ('unscear', None),
 ('conseguences', 'consequences'),
 ('emercom', None),
 ('wi940', None),
 ("50's", '50s'),
 ('hirsohima', 'hiroshima'),
 ('nagaski', 'nagasaki'),
 ('.......', None),
 ('wearies', 'wearied'),
 ('post-saddam', None),
 ('jawaharal', 'jawaharlal')]

In [37]:
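# fixed_spelling holds None where no neighbour was found, so this counts
# the OOV tokens that could not be corrected.
n_not_fixed = len(oov) - np.asarray(fixed_spelling, dtype='bool').sum()

In [38]:
n_not_fixed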

1066

The corrections might not always be perfect, but I think this is better than keeping the words as OOV.

In [39]:
def replace_words(word, fix_dict):
    # Keep the original word when it has no entry or no correction was found (None)
    fixed = fix_dict.get(word)
    return fixed if fixed is not None else word

In [40]:
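def find_oov(sentences, vocab):
    # Use the arguments rather than the global train_sentences and embeddings,
    # so the dev and test sets get their own OOV lists
    oov = []
    for sentence in sentences:
        for tok in sentence:
            if tok not in vocab:
                oov.append(tok)
    return oov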

In [41]:
oov_train = find_oov(train_sentences, embeddings)
fixed_spelling_train = [find_one_neighbour(word, embeddings) for word in oov_train]
fixed_spelling_train = dict(zip(oov_train, fixed_spelling_train))

In [42]:
oov_dev = find_oov(dev_sentences, embeddings)
fixed_spelling_dev = [find_one_neighbour(word, embeddings) for word in oov_dev]
fixed_spelling_dev = dict(zip(oov_dev, fixed_spelling_dev))

In [43]:
oov_test = find_oov(test_sentences, embeddings)
fixed_spelling_test = [find_one_neighbour(word, embeddings) for word in oov_test]
fixed_spelling_test = dict(zip(oov_test, fixed_spelling_test))

In [44]:
train_sentences = [[replace_words(word, fixed_spelling_train) for word in sentence] for sentence in train_sentences]
dev_sentences = [[replace_words(word, fixed_spelling_dev) for word in sentence] for sentence in dev_sentences]
test_sentences = [[replace_words(word, fixed_spelling_test) for word in sentence] for sentence in test_sentences]

## Encoding the data
I will encode the targets and the tokens as integers.
Later I will encode targets with one-hot encoding.

### Labels
First, build a map from POS tag to an index.

In [79]:
pos_tag_index = {}
for i, pos in enumerate(pos_tags):
    # Reserve 0 for padded labels
    pos_tag_index[pos] = i + 1

In [97]:
# Keep plain lists here: the sentences have different lengths, and
# pad_sequences accepts lists of lists directly
Y_train = [[pos_tag_index[pos] for pos in sentence] for sentence in train_labels]
Y_dev = [[pos_tag_index[pos] for pos in sentence] for sentence in dev_labels]
Y_test = [[pos_tag_index[pos] for pos in sentence] for sentence in test_labels]

### Tokens
I already built the mapping from token to index.
Encode all OOV tokens as 0. Note that this is the same index that the padding will use.

In [98]:
# Plain lists again, since the sentences are still of different lengths
X_train = [[token_index.get(tok, 0) for tok in sentence] for sentence in train_sentences]
X_dev = [[token_index.get(tok, 0) for tok in sentence] for sentence in dev_sentences]
X_test = [[token_index.get(tok, 0) for tok in sentence] for sentence in test_sentences]

## Pad the data

In [99]:
max_sentence_length = max([len(sentence) for sentence in train_sentences])

In [100]:
from keras.preprocessing.sequence import pad_sequences

In [101]:
X_train = pad_sequences(X_train, maxlen=max_sentence_length, padding='post')
Y_train = pad_sequences(Y_train, maxlen=max_sentence_length, padding='post')

In [102]:
X_dev = pad_sequences(X_dev, maxlen=max_sentence_length, padding='post')
Y_dev = pad_sequences(Y_dev, maxlen=max_sentence_length, padding='post')

In [103]:
X_test = pad_sequences(X_test, maxlen=max_sentence_length, padding='post')
Y_test = pad_sequences(Y_test, maxlen=max_sentence_length, padding='post')
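
To illustrate what `padding='post'` does, here is a tiny NumPy stand-in on toy data. The `pad_post` helper is hypothetical and simplified: it cuts overly long sequences at the end, whereas `pad_sequences` by default truncates from the start.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in: right-fill each sequence with `value` up to maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        seq = seq[:maxlen]               # simplification: cut long sequences at the end
        padded[row, :len(seq)] = seq
    return padded

toy = [[5, 3, 8], [2], [7, 1]]
padded = pad_post(toy, maxlen=4)
print(padded)
```

Every row now has the same length, with zeros after the last real token, which is what lets the sentences be stacked into one array.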

### Set sample weights

In [120]:
# Weight 1 for real tokens and 0 for padding. OOV tokens are also encoded
# as 0, so they are left out of the loss as well.
sample_weights_train = (X_train != 0).astype(int)

In [121]:
sample_weights_dev = (X_dev != 0).astype(int)

In [122]:
sample_weights_test = (X_test != 0).astype(int)
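
Why the weights matter can be seen on a toy example. Below is a NumPy sketch of a per-timestep weighted cross-entropy; it illustrates the idea only, and the exact normalisation Keras applies internally may differ. All numbers are made up: one sentence with two real tokens followed by two padded positions (class 0).

```python
import numpy as np

def masked_cross_entropy(y_true, y_pred, weights, eps=1e-7):
    """Weighted mean of the per-timestep categorical cross-entropy."""
    # Cross-entropy per timestep, shape (batch, time)
    per_step = -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1)
    # Positions with weight 0 do not contribute at all
    return np.sum(per_step * weights) / np.sum(weights)

y_true = np.array([[[0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]], dtype=float)
y_pred = np.array([[[0.10, 0.80, 0.10],
                    [0.10, 0.10, 0.80],
                    [0.98, 0.01, 0.01],    # padding, trivially easy
                    [0.98, 0.01, 0.01]]])  # padding, trivially easy
weights = np.array([[1, 1, 0, 0]])

masked = masked_cross_entropy(y_true, y_pred, weights)
unmasked = masked_cross_entropy(y_true, y_pred, np.ones_like(weights))
print(masked, unmasked)
```

Without the mask, the easy padded positions pull the loss down and dilute the signal from the real tokens.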

## One-hot Encode Labels

In [104]:
def one_hot(i, n):
    arr = np.zeros(n)
    arr[i] = 1
    return arr

In [105]:
Y_train[0]

array([14,  7, 14,  7,  9,  4, 15, 14, 14, 14,  7, 14,  7,  1,  4,  2,  1,
        4,  2,  1,  4,  2, 14,  7,  2,  1,  9,  4,  7,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

In [108]:
Y_train = np.asarray([np.asarray([one_hot(i, n_labels+1) for i in sentence]) for sentence in Y_train])

Y_dev = np.asarray([np.asarray([one_hot(i, n_labels+1) for i in sentence]) for sentence in Y_dev])

Y_test = np.asarray([np.asarray([one_hot(i, n_labels+1) for i in sentence]) for sentence in Y_test])
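
As a side note, the same encoding can be done in one vectorised step by indexing an identity matrix with the label indices. A small sketch on toy labels (the label values are made up; 18 is the 17 tags plus index 0 for padding):

```python
import numpy as np

n_classes = 18                                  # 17 tags + 1 for padding
labels = np.array([[14, 7, 0], [2, 0, 0]])      # toy padded label indices

# Row i of the identity matrix is the one-hot vector for class i
one_hot_labels = np.eye(n_classes)[labels]
print(one_hot_labels.shape)
```

This avoids the nested Python loops, which can be noticeably faster on the full training set.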

In [107]:
Y_train.shape

(12543, 159, 18)

# RNN

## Embedding Layer
Create a frozen embedding layer from the Glove data.

In [109]:
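# token_index starts at 1 (0 is reserved for OOV/padding), so prepend a zero row to keep row i aligned with index i
embedding_matrix = np.pad(np.array(list(embeddings.values())), ((1, 0), (0, 0)), mode='constant')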

In [110]:
n_words, embedding_dims = embedding_matrix.shape

In [112]:
from keras.layers import Embedding

In [113]:
embedding_layer = Embedding(n_words,
                            embedding_dims,
                            weights=[embedding_matrix],
                            input_length=max_sentence_length,
                            trainable=False)
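
Since index 0 is reserved, the rows of the weight matrix have to line up with the token indices. Here is a toy sketch of that alignment and of what the embedding lookup does. The two 3-dimensional vectors and the names `toy_embeddings`, `toy_index` and `matrix` are made up for illustration.

```python
import numpy as np

# Toy 3-dimensional embeddings; insertion order defines the indices
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
# Index 0 is reserved, so real tokens start at 1
toy_index = {tok: i + 1 for i, tok in enumerate(toy_embeddings)}

dims = 3
# One extra row: row 0 stays all zeros for OOV tokens and padding
matrix = np.zeros((len(toy_embeddings) + 1, dims), dtype="float32")
for tok, idx in toy_index.items():
    matrix[idx] = toy_embeddings[tok]

encoded = np.array([toy_index.get(tok, 0) for tok in ["the", "dog", "cat"]])
vectors = matrix[encoded]    # an embedding lookup is just row indexing
print(vectors)
```

The unknown word `dog` ends up with the all-zero vector, and each known word gets exactly its own row.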

## First Model

In [150]:
from keras.layers import GRU, Input, Dense, Dropout, BatchNormalization
from keras.models import Model

In [151]:
latent_dim = 256

In [152]:
model_input = Input(shape=(None,))
x = embedding_layer(model_input)
x = BatchNormalization()(x)
x = GRU(latent_dim, return_sequences=True)(x)
model_output = Dense(n_labels + 1, activation='softmax')(x)
model = Model(model_input, model_output)

In [153]:
# Compile the model; sample_weight_mode='temporal' allows one weight per timestep
model.compile(optimizer='adam', loss='categorical_crossentropy', sample_weight_mode='temporal', metrics=['accuracy'])

In [154]:
model.fit(X_train, Y_train,
          batch_size=64,
          epochs=1,
          sample_weight=sample_weights_train,
          validation_data=(X_dev, Y_dev, sample_weights_dev))

Train on 12543 samples, validate on 2002 samples
Epoch 1/1


<keras.callbacks.History at 0x1cc924c9c50>

Wow, after just one epoch the model scores 96% accuracy on the dev set. Seems almost too good to be true.
I will have to verify this!
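
One thing worth checking is whether the reported accuracy counts the padded positions. If the `accuracy` metric is not weighted by the sample weights, all the easy padding predictions would inflate the score. A sketch of an accuracy that only looks at real tokens, shown on made-up index arrays (the function name `masked_accuracy` is my own):

```python
import numpy as np

def masked_accuracy(true_idx, pred_idx, weights):
    """Share of correctly tagged positions among those with non-zero weight."""
    correct = (true_idx == pred_idx) * weights
    return correct.sum() / weights.sum()

# Toy sentence: two real tokens followed by two padded positions
true_idx = np.array([[3, 5, 0, 0]])
pred_idx = np.array([[3, 2, 0, 0]])
weights = np.array([[1, 1, 0, 0]])

plain = (true_idx == pred_idx).mean()
masked = masked_accuracy(true_idx, pred_idx, weights)
print(plain, masked)
```

On the real data this could be applied to `Y_dev.argmax(-1)` and `model.predict(X_dev).argmax(-1)` together with `sample_weights_dev`.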

One small note first. I tried to train my model without a BatchNormalization layer, and it was not able to make progress at all. Goes to show how important normalised data is!