## Collecting a Dataset
1. You will use a dataset from the CoNLL conferences that benchmark natural language processing systems and tasks. There were two conferences on named entity recognition: CoNLL 2002 (Spanish and Dutch) and CoNLL 2003 (English and German). In this assignment, you will work on the English dataset. Read the description of the task.
2. The datasets are protected by a license and you need to obtain it to reconstruct the data. Alternatively, you can try to find a copy on GitHub (type conll2003 in the search box) or use the Google dataset search: https://toolbox.google.com/datasetsearch. A local copy is available in the /usr/local/cs/EDAN95/datasets/NER-data folder.
3. The dataset comes in the form of three files: a training set, a development set, and a test set. You will use the test set to evaluate your models. For this, you will apply the conlleval script, which computes F1, the harmonic mean of precision and recall. You have a local copy of this script in /usr/local/cs/EDAN95/datasets/ner/bin. conlleval is written in Perl, so be sure to have Perl installed on your machine to run it.
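
As a quick illustration of the metric, F1 is the harmonic mean of precision P and recall R, that is F1 = 2PR / (P + R). The sketch below is not part of conlleval; `f1_score` is a hypothetical helper name and the input values are made up.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy values in percent, chosen only for illustration
print(round(f1_score(80.0, 60.0), 2))  # 68.57
```

Because it is a harmonic mean, F1 stays closer to the lower of the two values than an arithmetic mean would.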

In [1]:
# Files in directory conll2003: train.txt, valid.txt and test.txt
# https://github.com/ningshixian/NER-CONLL2003

## Collecting the Embeddings
1. Download the GloVe embeddings 6B from https://nlp.stanford.edu/projects/glove/ and keep the 100d vectors.
2. Write a function that reads the GloVe embeddings and stores them in a dictionary, where the keys are the words and the values are the embeddings.
3. Using cosine similarity, compute the 5 closest words to the words table, france, and sweden.
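
To make step 3 concrete, here is a minimal pure-Python sketch of cosine similarity and a nearest-neighbour lookup. The helper names `cosine_similarity` and `closest` are assumptions introduced for illustration, and the 2-dimensional vectors are made up, not real GloVe values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest(word, embeddings, k=5):
    """Return the k words most similar to `word`, excluding the word itself."""
    scores = [(w, cosine_similarity(vec, embeddings[word]))
              for w, vec in embeddings.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-dimensional vectors (made up, not real GloVe values)
toy = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}
print([w for w, _ in closest('a', toy, k=2)])  # ['b', 'c']
```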

In [2]:
import os
import numpy as np
glove_dir = '/Users/Marcel/Documents/Python/edan95/project_4/glove.6b'

embeddings_index = {}
# Each line holds a word followed by the components of its vector
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.strip().split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        embeddings_index[word] = vector
print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [3]:
# Need to run this in order for my kernel not to crash
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [4]:
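import operator
def top5(word, embd):
    # Cosine similarity between `word` and every word in the dictionary
    cdict = {}
    for w in embd:
        cdict[w] = np.dot(embd[w],embd[word])/(np.linalg.norm(embd[w])*np.linalg.norm(embd[word]))
    sorted_dict = sorted(cdict.items(), key = operator.itemgetter(1),reverse=True)
    # Skip the first entry, which is the query word itself (similarity 1)
    return sorted_dict[1:6]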

words =['france','sweden','table']
for w in words:
    print(w)
    print(top5(w,embeddings_index))

france
[('belgium', 0.8076423), ('french', 0.8004377), ('britain', 0.79505277), ('spain', 0.7557464), ('paris', 0.74815863)]
sweden
[('denmark', 0.8624401), ('norway', 0.80732495), ('finland', 0.7906495), ('netherlands', 0.74684644), ('austria', 0.74668366)]
table
[('tables', 0.80211616), ('place', 0.6582379), ('bottom', 0.65597206), ('room', 0.65436906), ('side', 0.6433667)]


## Reading the Corpus and Building Indices
You will read the corpus with programs available from https://github.com/pnugues/edan95. These programs will enable you to load the files in the form of a list of dictionaries.
1. Write a function that for each sentence returns the X and Y lists of symbols consisting of words and NER tags.
2. Create a vocabulary of all the words observed in the training set and the words in GloVe.
3. Create indices and inverted indices for the words and the NER tags, i.e. you will associate each word and each tag with a number. You will use index 0 for the padding symbol and 1 for unknown words.
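
The indexing scheme in step 3 can be sketched on a toy vocabulary as follows. The helper `build_indices` and the sample words are assumptions made for illustration; the only convention taken from the text is that 0 is reserved for padding and 1 for unknown words.

```python
def build_indices(symbols, start=2):
    """Map sorted unique symbols to integers starting at `start`."""
    idx_to_symbol = dict(enumerate(sorted(set(symbols)), start=start))
    symbol_to_idx = {s: i for i, s in idx_to_symbol.items()}
    return symbol_to_idx, idx_to_symbol

# Toy vocabulary, for illustration only
word_to_idx, idx_to_word = build_indices(['lamb', 'eu', 'rejects', 'eu'])
sentence = ['eu', 'rejects', 'beef']
encoded = [word_to_idx.get(w, 1) for w in sentence]  # 1 = unknown word
print(encoded)  # [2, 4, 1]
```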

In [5]:
BASE_DIR = '/Users/Marcel/Documents/Python/edan95/project_4/conll003-englishversion/'

def load_conll2003_en():
    train_file = BASE_DIR + 'train.txt'
    dev_file = BASE_DIR + 'valid.txt'
    test_file = BASE_DIR + 'test.txt'
    column_names = ['form', 'ppos', 'pchunk', 'ner']
    train_sentences = open(train_file).read().strip()
    dev_sentences = open(dev_file).read().strip()
    test_sentences = open(test_file).read().strip()
    return train_sentences, dev_sentences, test_sentences, column_names

import re

class Token(dict):
    pass

class CoNLLDictorizer:

    def __init__(self, column_names, sent_sep='\n\n', col_sep=' +'):
        self.column_names = column_names
        self.sent_sep = sent_sep
        self.col_sep = col_sep

    def fit(self):
        pass

    def transform(self, corpus):
        corpus = corpus.strip()
        sentences = re.split(self.sent_sep, corpus)
        return list(map(self._split_in_words, sentences))

    def fit_transform(self, corpus):
        return self.transform(corpus)

    def _split_in_words(self, sentence):
        rows = re.split('\n', sentence)
        return [Token(dict(zip(self.column_names,
                               re.split(self.col_sep, row))))
                for row in rows]

In [6]:
train_sentences, dev_sentences, test_sentences, column_names = load_conll2003_en()

conll_dict = CoNLLDictorizer(column_names, col_sep=' +')
train_dict = conll_dict.transform(train_sentences)
dev_dict = conll_dict.transform(dev_sentences)
print(train_dict[0])


[{'form': '-DOCSTART-', 'ppos': '-X-', 'pchunk': '-X-', 'ner': 'O'}]


In [7]:
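def build_sequences(corpus_dict, key_x='form', key_y='ppos', tolower=True):
    """
    Creates sequences from a list of dictionaries
    :param corpus_dict: list of sentences, each a list of token dictionaries
    :param key_x: column used for the input symbols
    :param key_y: column used for the output symbols
    :param tolower: if True, lowercase the input symbols
    :return: the X and Y lists of symbol sequences
    """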
    X = []
    Y = []
    for sentence in corpus_dict:
        x = [word[key_x] for word in sentence]
        y = [word[key_y] for word in sentence]
        if tolower:
            x = list(map(str.lower, x))
        X += [x]
        Y += [y]
    return X, Y

Training set

In [8]:
# Build the words and NER sequence tags
# (index 0 is the -DOCSTART- marker, so the first real sentence is at index 1)
X_words, Y_ner = build_sequences(train_dict, key_x='form', key_y='ner')
print('First sentence, words', X_words[1])
print('First sentence, NER', Y_ner[1])
# Extract the list of unique words and NER and vocab including glove 
word_set = sorted(list(set([item for sublist in X_words for item in sublist])))
ner_set = sorted(list(set([item for sublist in Y_ner for item in sublist])))

glove_set = sorted([key for key in embeddings_index.keys()])
vocab = sorted(list(set(glove_set + word_set)))

# Building the indices: 0 is reserved for padding and 1 for unknown words
rev_word_idx = dict(enumerate(vocab, start=2))
rev_ner_idx = dict(enumerate(ner_set, start=2))
rev_word_idx[0]=0
rev_word_idx[1]='-unknown-'
word_idx = {v: k for k, v in rev_word_idx.items()}
ner_idx = {v: k for k, v in rev_ner_idx.items()}

First sentence, words ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']
First sentence, NER ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


Development set

In [9]:
# Build the words and NER sequence tags 
X_words_dev, Y_ner_dev = build_sequences(dev_dict, key_x='form', key_y='ner')

# Extract the list of unique words and NER and vocab including glove 
word_set_dev = sorted(list(set([item for sublist in X_words_dev for item in sublist])))
ner_set_dev = sorted(list(set([item for sublist in Y_ner_dev for item in sublist])))

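# Building the indices (same vocab as for training, so the word indices match;
# the NER indices match only if the development set contains the same tag set)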
rev_word_idx_dev = dict(enumerate(vocab, start=2))
rev_ner_idx_dev = dict(enumerate(ner_set_dev, start=2))
rev_word_idx_dev[0]=0
rev_word_idx_dev[1]='-unknown-'
word_idx_dev = {v: k for k, v in rev_word_idx_dev.items()}
ner_idx_dev = {v: k for k, v in rev_ner_idx_dev.items()}

## Building the Embedding Matrix
1. Create a matrix of dimensions (M, N), where M is the size of the vocabulary (the unique words in the training set plus the words in GloVe) and N is the dimension of the embeddings.
The padding symbol and the unknown word symbol will be part of the vocabulary.
The shape of your matrix should be: (402597, 100). Initialize it with random values.
2. Fill the matrix with the GloVe embeddings. You will use the indices from the previous section.
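
A minimal sketch of these two steps on toy data, assuming NumPy is available. The names `pretrained`, `toy_word_idx`, and `matrix`, the 3-dimensional vectors, and the random seed are all made up for illustration.

```python
import numpy as np

# Toy data: hypothetical 3-dimensional "pretrained" vectors and a small index
pretrained = {'eu': np.array([0.1, 0.2, 0.3], dtype='float32'),
              'lamb': np.array([0.4, 0.5, 0.6], dtype='float32')}
toy_word_idx = {'eu': 2, 'lamb': 3, 'oov-word': 4}  # 0 = padding, 1 = unknown

dim = 3
rng = np.random.default_rng(0)
# One row per index, including the padding and unknown rows
matrix = rng.random((len(toy_word_idx) + 2, dim))
for word, i in toy_word_idx.items():
    vector = pretrained.get(word)
    if vector is not None:      # words without a pretrained vector keep
        matrix[i] = vector      # their random initialization

print(matrix.shape)  # (5, 3)
```

Rows 0, 1, and 4 keep their random values here, since neither the padding symbol, the unknown symbol, nor the out-of-vocabulary word has a pretrained vector.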

In [10]:
max_words=len(rev_word_idx.keys())
embedding_dim=100
# Random initialization, uniform in [0, 3.575) (3.575 is noted as the "max value")
embedding_matrix = np.random.rand(max_words, embedding_dim) * 3.575
for word, i in word_idx.items():
    embedding_vector = embeddings_index.get(word) 
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

## Creating the X and Y Sequences
You will now create the input and output sequences with numerical indices
1. Convert the X and Y list of symbols in a list of numbers using the indices you created.
2. Pad the sentences using the pad_sequences function.
3. Do the same for the development set.
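
By default, Keras `pad_sequences` pads at the front of each sequence and also truncates from the front (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below mimics that default behavior to show what happens to the indices; `pad_left` is a hypothetical helper, not a Keras function.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_left([[5, 6], [1, 2, 3, 4]], maxlen=3))  # [[0, 5, 6], [2, 3, 4]]
```

Since the padding goes on the left, the real tokens of a sentence always occupy the last positions of its row.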

Now we have symbols:

In [11]:
print('First sentence, words', X_words[1])
print('First sentence, NER', Y_ner[1])

First sentence, words ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']
First sentence, NER ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']


We want sequences of numbers, so let us convert them!

In [12]:
X_words_idx = [list(map(lambda x: word_idx.get(x, 1), x)) for x in X_words]
Y_ner_idx = [list(map(lambda x: ner_idx.get(x, 1), x)) for x in Y_ner]

In [13]:
print('First sentence, words', X_words_idx[1])
print('First sentence, NER', Y_ner_idx[1])

First sentence, words [142143, 307143, 161836, 91321, 363368, 83766, 85852, 218260, 936]
First sentence, NER [4, 10, 3, 10, 10, 10, 3, 10, 10]


Ok, good! Now we just need to pad the sentences so that all sentences have the same length!

In [14]:
from keras.preprocessing.sequence import pad_sequences
maxlen = 150
X_words_idx = pad_sequences(X_words_idx,maxlen=maxlen)
Y_ner_idx = pad_sequences(Y_ner_idx,maxlen=maxlen)

Using TensorFlow backend.


In [15]:
print('First sentence, words', X_words_idx[1])
print('First sentence, NER', Y_ner_idx[1])

First sentence, words [     0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      0      0      0
      0      0      0      0      0      0      0      

Let's do the same for the validation set

In [16]:
X_words_idx_dev = [list(map(lambda x: word_idx_dev.get(x, 1), x)) for x in X_words_dev]
Y_ner_idx_dev = [list(map(lambda x: ner_idx_dev.get(x, 1), x)) for x in Y_ner_dev]
X_words_idx_dev = pad_sequences(X_words_idx_dev,maxlen=maxlen)
Y_ner_idx_dev = pad_sequences(Y_ner_idx_dev,maxlen=maxlen)

Let's also convert Y to categorical values!

In [17]:
from keras.utils.np_utils import to_categorical
Y_ner_idx_cat = to_categorical(Y_ner_idx)
Y_ner_idx_dev_cat = to_categorical(Y_ner_idx_dev)
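
To illustrate what the categorical conversion does to one padded sentence, here is a pure-Python sketch of one-hot encoding. `one_hot` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def one_hot(indices, num_classes):
    """Turn a list of class indices into a list of one-hot rows."""
    rows = []
    for i in indices:
        row = [0.0] * num_classes
        row[i] = 1.0
        rows.append(row)
    return rows

print(one_hot([0, 2, 1], num_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Note that padded positions carry index 0, so they become a class of their own in the one-hot encoding.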

## Building a Simple Recurrent Neural Network
1. Create a simple recurrent network and train a model with the train set. As layers, you will use Embedding, SimpleRNN, and Dense.
2. Compile and fit your network. You will report the training and validation losses and accuracies and comment on the possible overfit.
3. Apply your network to the test set and report the accuracy you obtained. You will use the evaluate method.

I interpret the output as the NER tag. There are different tags: O = not a named entity, and I-XXX = named entity, where XXX can be organization, person, location, or other.

In [18]:
ner_vocab_size=len(ner_idx.keys())+2

In [19]:
text_vocabulary_size = len(vocab) + 2
print('text_vocabulary_size\t',text_vocabulary_size)
print('embedding_dim\t\t',embedding_dim)
print('maxlen\t\t\t',maxlen)
print('ner_vocab_size\t\t',ner_vocab_size)
print('X\t\t\t',X_words_idx.shape)
print('Y\t\t\t',Y_ner_idx.shape)
print('X_val\t\t\t',X_words_idx_dev.shape)
print('Y_val\t\t\t',Y_ner_idx_dev.shape)

text_vocabulary_size	 402597
embedding_dim		 100
maxlen			 150
ner_vocab_size		 11
X			 (14987, 150)
Y			 (14987, 150)
X_val			 (3466, 150)
Y_val			 (3466, 150)


In [20]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, SimpleRNN,Bidirectional

model = Sequential() 

# the input here is the embedding matrix, which we load as weights into the embedding layer and freeze so that they cannot change
model.add(Embedding(text_vocabulary_size,embedding_dim,input_length=maxlen,mask_zero=False))
model.layers[0].set_weights([embedding_matrix]) 
model.layers[0].trainable = False
# the output will be 150 x 100

model.add(Bidirectional(SimpleRNN(32,return_sequences=True)))
model.add(Dense(ner_vocab_size, activation='softmax')) 

model.summary()






_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 150, 100)          40259700  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 150, 64)           8512      
_________________________________________________________________
dense_1 (Dense)              (None, 150, 11)           715       
Total params: 40,268,927
Trainable params: 9,227
Non-trainable params: 40,259,700
_________________________________________________________________
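
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas for a SimpleRNN layer (input weights, recurrent weights, and biases) and a Dense layer (weights and biases); the helper names are assumptions introduced for illustration.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases of one SimpleRNN layer."""
    return (input_dim + units + 1) * units

def dense_params(input_dim, units):
    """Weights + biases of one Dense layer."""
    return (input_dim + 1) * units

embedding = 402597 * 100                  # vocabulary size x embedding dim
birnn = 2 * simple_rnn_params(100, 32)    # forward and backward directions
dense = dense_params(64, 11)              # 64 = 2 x 32 concatenated outputs

print(embedding, birnn, dense, embedding + birnn + dense)
# 40259700 8512 715 40268927
```

Only the recurrent and dense layers are trainable (8512 + 715 = 9227), because the embedding layer is frozen.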


In [21]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['acc']) 




In [22]:
model.fit(X_words_idx, Y_ner_idx_cat,
          epochs=3, 
          batch_size=128,
          validation_data=(X_words_idx_dev, Y_ner_idx_dev_cat))

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 14987 samples, validate on 3466 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x13c28b9b0>

## Evaluating your System
You will use the official script to evaluate the performance of your system
1. Use the predict method to predict the tags of the whole test set
2. Write your results in a file, where the two last columns will be the hand-annotated tag and the predicted tag. The fields must be separated by a space.
3. Apply conlleval to your output. Report the F1 result.
4. Try to improve your model by modifying some parameters, adding layers, adding Bidirectional and Dropout.
5. Evaluate your network again
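
The file layout described in step 2 can be sketched as follows: one token per line with space-separated fields, the last two being the hand-annotated tag and the predicted tag, and a blank line between sentences. The function `write_conll_output` and the toy sentence are assumptions for illustration; the sketch writes to an in-memory buffer rather than a file.

```python
import io

def write_conll_output(words, gold_tags, predicted_tags, stream):
    """Write 'word gold predicted' lines, with a blank line after each sentence."""
    for sent_words, sent_gold, sent_pred in zip(words, gold_tags, predicted_tags):
        for word, gold, pred in zip(sent_words, sent_gold, sent_pred):
            stream.write(' '.join((word, gold, pred)) + '\n')
        stream.write('\n')

# Toy sentence with made-up predictions
buffer = io.StringIO()
write_conll_output([['eu', 'rejects']], [['B-ORG', 'O']], [['B-ORG', 'O']], buffer)
print(buffer.getvalue())
```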

Preprocessing the test data

In [23]:
test_dict = conll_dict.transform(test_sentences)
X_words_test, Y_ner_test = build_sequences(test_dict, key_x='form', key_y='ner')

# Extract the list of unique words and NER and vocab including glove 
word_set_test = sorted(list(set([item for sublist in X_words_test for item in sublist])))
ner_set_test = sorted(list(set([item for sublist in Y_ner_test for item in sublist])))

# Building the indices (same vocab as for training, so the word indices match)
rev_word_idx_test = dict(enumerate(vocab, start=2))
rev_ner_idx_test = dict(enumerate(ner_set_test, start=2))
rev_word_idx_test[0]=0
rev_word_idx_test[1]='-unknown-'
word_idx_test = {v: k for k, v in rev_word_idx_test.items()}
ner_idx_test = {v: k for k, v in rev_ner_idx_test.items()}

# Converting sequences to indices
X_words_idx_test = [list(map(lambda x: word_idx_test.get(x, 1), x)) for x in X_words_test]
X_words_idx_test = pad_sequences(X_words_idx_test,maxlen=maxlen)

In [24]:
predicted = model.predict(X_words_idx_test)

#### Convert predicted (3D matrix) to NER tags for each sequence in 3 steps:
1. Convert probabilities to NER index
2. Remove padding
3. Convert index to NER tag
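
The three steps can be sketched in pure Python on a toy prediction. The function `decode`, the probabilities, and the tag mapping below are made up for illustration; the sketch assumes left padding, so the real tokens sit in the last positions of each row.

```python
def decode(probabilities, lengths, idx_to_tag):
    """probabilities: [sentence][position][class]; lengths: true sentence lengths."""
    decoded = []
    for sent_probs, length in zip(probabilities, lengths):
        # Step 1: most probable class index at each position
        indices = [max(range(len(p)), key=p.__getitem__) for p in sent_probs]
        # Step 2: drop the left padding, keeping the last `length` positions
        indices = indices[-length:]
        # Step 3: map indices back to tags
        decoded.append([idx_to_tag[i] for i in indices])
    return decoded

# Toy example: 3 positions (1 padding + 2 words), 4 classes; values are made up
toy_probs = [[[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]]]
toy_tags = {0: 'O', 2: 'B-ORG', 3: 'O'}
print(decode(toy_probs, [2], toy_tags))  # [['B-ORG', 'O']]
```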

In [25]:
def creat_output(predicted, ner_idx,X_words_test,Y_ner_test,filename):
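    # Step 1: most probable NER index at each position
    Y_out_pad = np.argmax(predicted, axis=2)
    inv_ner_idx = {v: k for k, v in ner_idx.items()}
    # Indices 0 (padding) and 1 (unknown) are not real tags; give them placeholders
    inv_ner_idx[0] = 'O'
    inv_ner_idx[1] = 'wtf'
    Y_out = []
    for i in range(len(Y_out_pad)):
        # Step 2: drop the left padding, keeping the last len(sentence) positions
        temp_old = Y_out_pad[i][-(len(X_words_test[i])):]
        # Step 3: convert each index back to its NER tag
        temp_new = []
        for j in temp_old:
            temp_new.append(inv_ner_idx[j])
        Y_out.append(temp_new)

    # One token per line: word, hand-annotated tag, predicted tag
    with open(filename, 'w') as f_out:
        for i in range(len(X_words_test)): # For each sentence
            for j in range(len(X_words_test[i])): # For each word
                word = X_words_test[i][j]
                NER = Y_ner_test[i][j]
                PNER = Y_out[i][j]
                f_out.write(word + ' ' + NER + ' ' + PNER + '\n')
            f_out.write('\n')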
    return Y_out


In [26]:
Y_new = creat_output(predicted, ner_idx,X_words_test,Y_ner_test,'new_out')
!perl ./conlleval.pl <new_out

processed 46666 tokens with 5648 phrases; found: 5565 phrases; correct: 3792.
accuracy:  93.33%; precision:  68.14%; recall:  67.14%; FB1:  67.64
              LOC: precision:  70.46%; recall:  77.64%; FB1:  73.87  1838
             MISC: precision:  60.74%; recall:  51.14%; FB1:  55.53  591
              ORG: precision:  59.10%; recall:  56.11%; FB1:  57.57  1577
              PER: precision:  77.36%; recall:  74.58%; FB1:  75.94  1559


## Building an LSTM Network
1. Create a simple LSTM network and train a model with the train set. As layers, you will use Embedding, LSTM, and Dense.
2. Apply conlleval to your output. Report the F1 result.
3. Try to improve your model by modifying some parameters, adding layers, adding Bidirectional, Dropout, possibly mixing SimpleRNN.
4. Apply your network to the test set and report the accuracy you obtained. You need to reach an F1 of 82 to pass.

In [27]:
from keras.layers import LSTM, Dropout

model = Sequential() 

# the input here is the embedding matrix, which we load as weights into the embedding layer and freeze so that they cannot change
model.add(Embedding(text_vocabulary_size,embedding_dim,input_length=maxlen,mask_zero=False))
# the output will be 150 x 100

model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(100,return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(SimpleRNN(100,return_sequences=True)))
model.add(Dropout(0.2))

model.add(Dense(200, activation='relu')) 
#model.add(Dropout(0.5))
model.add(Dense(ner_vocab_size, activation='softmax')) 

model.layers[0].set_weights([embedding_matrix]) 
model.layers[0].trainable = False

model.summary()

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 150, 100)          40259700  
_________________________________________________________________
bidirectional_2 (Bidirection (None, 150, 200)          160800    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 150, 200)          60200     
_________________________________________________________________
dense_2 (Dense)              (None, 150, 200)          40200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 150, 200)          0         
_________________________________________________________________
dense_3 (Dense)              (None, 150, 11)           2211      
Total params: 40,523,111
Tr

In [28]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',metrics=['acc']) 
model.fit(X_words_idx, Y_ner_idx_cat,
          epochs=15, 
          batch_size=128,
          validation_data=(X_words_idx_dev, Y_ner_idx_dev_cat))

Train on 14987 samples, validate on 3466 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x11e0df198>

In [29]:
predicted = model.predict(X_words_idx_test)

In [30]:
Y_new = creat_output(predicted, ner_idx,X_words_test,Y_ner_test,'BILSTM100_BISRNN100_200_DO_8ep_out')

In [31]:
!perl ./conlleval.pl <BILSTM100_BISRNN100_200_DO_8ep_out

processed 46666 tokens with 5648 phrases; found: 5673 phrases; correct: 4657.
accuracy:  96.30%; precision:  82.09%; recall:  82.45%; FB1:  82.27
              LOC: precision:  87.58%; recall:  86.27%; FB1:  86.92  1643
             MISC: precision:  69.23%; recall:  65.38%; FB1:  67.25  663
              ORG: precision:  73.96%; recall:  78.51%; FB1:  76.17  1763
              PER: precision:  90.71%; recall:  89.98%; FB1:  90.34  1604


To do: look up the Keras ModelCheckpoint callback to save the weights at each epoch.