## Objectives

The objectives of this assignment are to:

1 - Write a program to recognize named entities in text

2 - Learn how to manage a text data set

3 - Apply recurrent neural networks to text

4 - Know what word embeddings are

## Programming

#### 0 - Collecting a Dataset

You will use a dataset from the CoNLL conferences that benchmark natural language processing systems and tasks. There were two conferences on named entity recognition: CoNLL 2002 (Spanish and Dutch) and CoNLL 2003 (English and German). In this assignment, you will work on the English dataset. Read the description of the task.

1 - The datasets are protected by a license and you need to obtain it to reconstruct the data. Alternatively, you can use a local copy or try to find one on github (type conll2003 in the search box) or use the Google dataset search: https://toolbox.google.com/datasetsearch. You can find a local copy in the /usr/local/cs/EDAN95/datasets/NER-data folder.

2 - The dataset comes in the form of three files: a training set, a development set, and a test set. You will use the test set to evaluate your models. For this, you will apply the conlleval script that will compute the harmonic mean of the precision and recall: F1. You have a local copy of this script in /usr/local/cs/EDAN95/datasets/ner/bin. conlleval is written in Perl. Be sure to have it on your machine to run it.

## Python Headers 

#### The Modules

In [1]:
import os
os.environ['KERAS_BACKEND']='tensorflow'
import sys

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction import DictVectorizer
import time
from keras import models, layers

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.models import load_model
import math
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import LSTM, Bidirectional, SimpleRNN, Dense


Using TensorFlow backend.


#### Some Parameters

In [2]:
OPTIMIZER = 'rmsprop'
SCALER = True
SIMPLE_MODEL = False
BATCH_SIZE = 32
EPOCHS = 10
#MINI_CORPUS = True

EMBEDDING_DIM = 100
UNKNOWN_TOKEN = '__UNK__'
W_SIZE = 2
MAX_SEQUENCE_LENGTH = 150

LSTM_UNITS = 512

## Download the GloVe word embeddings
Head to https://nlp.stanford.edu/projects/glove/ (where you can learn more about the GloVe algorithm) and download the embeddings pre-computed on the 2014 English Wikipedia. The archive is an 822 MB zip file named glove.6B.zip; it contains, among other sizes, 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it.

Here we write a function that reads the GloVe embeddings and stores them in a dictionary, where the keys are the words and the values are the embeddings.

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers. The individual dimensions in these vectors typically have no inherent meaning. Instead, it's the overall patterns of location and distance between vectors that machine learning takes advantage of. Embeddings are important for input to machine learning. Classifiers, and neural networks more generally, work on vectors of real numbers. They train best on dense vectors, where all values contribute to define an object. However, many important inputs to machine learning, such as words of text, do not have a natural vector representation. Embedding functions are the standard and effective way to transform such discrete input objects into useful continuous vectors.

In [3]:
def load(file):
    """
    Return the embeddings in the form of a dictionary
    :param file: path to a GloVe text file (one word and its vector per line)
    :return: dictionary mapping each word to a NumPy vector
    """
    embeddings = {}
    # GloVe files are UTF-8 encoded
    with open(file, encoding='utf-8') as glove:
        for line in glove:
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

#### Embeddings dictionary

In [4]:
BASE_DIR = '/media/hi8826mo-s/BEEE-DE51/Ultimi/EDAN95_Applied_Machine_Learning/labs/lab4/'
embedding_file = BASE_DIR + 'glove.6B/glove.6B.100d.txt'
embeddings_dict = load(embedding_file)
embedded_words = sorted(list(embeddings_dict.keys()))

embeddings_dict['table']

#### Embeddings Index

In [5]:
BASE_DIR = '/media/hi8826mo-s/BEEE-DE51/Ultimi/EDAN95_Applied_Machine_Learning/labs/lab4/'
glove_dir = BASE_DIR + 'glove.6B/'

embeddings_index = {}

f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


#### Using a cosine similarity, compute the 5 closest words to the words table, france, and sweden.

list(embeddings_dict.values())[:1]

list(embeddings_dict.keys())[:6]

In [6]:
from scipy.spatial.distance import cosine

# Note: scipy's cosine() returns the cosine distance (1 - cosine similarity),
# so the closest words are those with the smallest values.
target = embeddings_dict['france']
sim_dict = {}
for word, vector in embeddings_dict.items():
    sim_dict[word] = cosine(target, vector)

sorted_by_value = sorted(sim_dict.items(), key=lambda kv: kv[1])

sorted_by_value[0:5]

[('france', 0.0),
 ('belgium', 0.19235771894454956),
 ('french', 0.19956225156784058),
 ('britain', 0.2049471139907837),
 ('spain', 0.24425369501113892)]
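
The loop above calls the distance function once per word. As a minimal sketch of the same idea, the ranking can also be computed in one vectorized step with NumPy. The helper name `closest` and the three-dimensional toy vectors are assumptions made up for illustration; with the real data you would pass `embeddings_dict` instead. Note that this sketch ranks by cosine similarity (largest first) rather than by cosine distance (smallest first).

```python
import numpy as np

def closest(word, embeddings, k=5):
    """Return the k words most similar to `word` by cosine similarity."""
    words = list(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype='float32')
    # Normalize each row so that a dot product equals the cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    # Sort by decreasing similarity
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order[:k]]

# Toy 3-dimensional vectors, invented for illustration only
toy = {
    'france': [1.0, 0.9, 0.0],
    'belgium': [0.9, 1.0, 0.1],
    'table': [0.0, 0.1, 1.0],
}
neighbours = closest('france', toy, k=2)
print(neighbours)
```

As in the output above, the query word itself comes first, since it is at distance zero from itself.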

## Preprocessing
Preprocessing is more involved and proceeds in four steps.

#### Preparing the data
Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. Likewise, we can’t feed lists of integers directly into a neural network: the lists first have to be turned into tensors.

Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and
then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks.

There are multiple ways to associate a vector with a token: one-hot encoding of tokens, and token embedding (typically used exclusively for words, and called word embedding).

There are two ways to turn lists of integers into tensors:

1 - Pad your lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices), and then use as the first layer in your network a layer capable of handling such integer tensors (the Embedding layer).

2 - One-hot encode your lists to turn them into vectors of 0s and 1s. This would mean, for instance, turning the sequence [3, 5] into a 10,000 dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s. Then you could use as the first layer in your network a Dense layer, capable of handling floating-point vector data.
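
The second option can be sketched in a few lines of NumPy. The function name `multi_hot` is a hypothetical helper, and the dimension is reduced from 10,000 to 10 so that the result is easy to read.

```python
import numpy as np

def multi_hot(sequence, dimension):
    """Encode a list of indices as a vector of 0s and 1s."""
    vector = np.zeros(dimension, dtype='float32')
    # Set the positions listed in the sequence to 1
    vector[sequence] = 1.0
    return vector

encoded = multi_hot([3, 5], dimension=10)
print(encoded)
```

This encoding loses the order of the words, which is one reason the rest of this notebook uses the first option, padded integer sequences fed to an Embedding layer.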

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. 

#### 1 - Loading the Corpus: function for reading the corpus

In [8]:
BASE_DIR = '/media/hi8826mo-s/BEEE-DE51/Ultimi/EDAN95_Applied_Machine_Learning/labs/lab4/'

def load_conll2009():
    """
    Read the training, development, and test files as raw strings.
    Despite its name, this function loads the CoNLL 2003 English data.
    """
    train_file = BASE_DIR + 'NER-data/eng.train'
    dev_file = BASE_DIR + 'NER-data/eng.valid'
    test_file = BASE_DIR + 'NER-data/eng.test'

    column_names = ['form', 'pos', 'chunk', 'ner']

    with open(train_file) as f:
        train_sentences = f.read().strip()
    with open(dev_file) as f:
        dev_sentences = f.read().strip()
    with open(test_file) as f:
        test_sentences = f.read().strip()
    return train_sentences, dev_sentences, test_sentences, column_names

# Read the corpus
train_sentences, dev_sentences, test_sentences, column_names = load_conll2009()

print(type(test_sentences))

<class 'str'>


#### 2 - Class for tokenization / Storing the rows in dictionaries
To convert the corpus into dictionaries, we follow the fit-transform pattern of sklearn.

In [9]:
import regex as re

class Token(dict):
    pass

class CoNLLDictorizer:

    def __init__(self, column_names, sent_sep='\n\n', col_sep=' +'):
        self.column_names = column_names
        self.sent_sep = sent_sep
        self.col_sep = col_sep

    def fit(self):
        pass

    def transform(self, corpus):
        corpus = corpus.strip()
        sentences = re.split(self.sent_sep, corpus)
        return list(map(self._split_in_words, sentences))

    def fit_transform(self, corpus):
        return self.transform(corpus)

    def _split_in_words(self, sentence):
        rows = re.split('\n', sentence)
        return [Token(dict(zip(self.column_names,
                               re.split(self.col_sep, row))))
                for row in rows]
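
To make the transformation concrete, here is a minimal standalone sketch of what `transform` does, applied to a two-sentence toy corpus. It uses the standard `re` module instead of `regex`, the function name `parse_conll` is a hypothetical stand-in for the class above, and the second toy sentence is invented for illustration.

```python
import re

def parse_conll(corpus, column_names, sent_sep=r'\n\n', col_sep=r' +'):
    """Split a CoNLL-formatted string into sentences of token dictionaries."""
    # Sentences are separated by a blank line
    sentences = re.split(sent_sep, corpus.strip())
    # Each row is one token; its columns are paired with the column names
    return [[dict(zip(column_names, re.split(col_sep, row)))
             for row in sentence.split('\n')]
            for sentence in sentences]

# Two short sentences in the four-column CoNLL 2003 layout
sample = ('EU NNP I-NP I-ORG\n'
          'rejects VBZ I-VP O\n'
          '\n'
          'Peter NNP I-NP I-PER\n')
parsed = parse_conll(sample, ['form', 'pos', 'chunk', 'ner'])
print(parsed[0][0])
```

Each sentence becomes a list, and each token a dictionary keyed by the column names, which is the structure the later cells rely on.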

We store the rows in dictionaries:

In [10]:
#conll_dict = CoNLLDictorizer(column_names, col_sep='\t')
conll_dict = CoNLLDictorizer(column_names, col_sep=' +')
train_dict = conll_dict.transform(train_sentences)

#if MINI_CORPUS:
#   train_dict = train_dict[:len(train_dict) // 5]
    
test_dict = conll_dict.transform(test_sentences)
dev_dict = conll_dict.transform(dev_sentences)

print('First sentence, train:', train_dict[0])
print('Second sentence, train:', train_dict[1])
print('First sentence, test:', test_dict[0])

First sentence, train: [{'form': '-DOCSTART-', 'pos': '-X-', 'chunk': 'O', 'ner': 'O'}]
Second sentence, train: [{'form': 'EU', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'I-ORG'}, {'form': 'rejects', 'pos': 'VBZ', 'chunk': 'I-VP', 'ner': 'O'}, {'form': 'German', 'pos': 'JJ', 'chunk': 'I-NP', 'ner': 'I-MISC'}, {'form': 'call', 'pos': 'NN', 'chunk': 'I-NP', 'ner': 'O'}, {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP', 'ner': 'O'}, {'form': 'boycott', 'pos': 'VB', 'chunk': 'I-VP', 'ner': 'O'}, {'form': 'British', 'pos': 'JJ', 'chunk': 'I-NP', 'ner': 'I-MISC'}, {'form': 'lamb', 'pos': 'NN', 'chunk': 'I-NP', 'ner': 'O'}, {'form': '.', 'pos': '.', 'chunk': 'O', 'ner': 'O'}]
First sentence, test: [{'form': '-DOCSTART-', 'pos': '-X-', 'chunk': '-X-', 'ner': 'O'}]



#### 3 - Extracting the Context and Dictorizing it
Extract the features and store them in dictionaries.

We extract windows of five words surrounding the word.


#### 4 - Vectorizing the symbols (X and Matrices)
Vectorizing X: We transform the x symbols into numbers

## Creating the X and Y Sequences 
Function to build the parallel sequences: two lists, X (the words) and Y (the NER tags), with one entry per sentence.

In [11]:
#def build_sequences(corpus_dict, key_x='form', key_y='upos', tolower=True):
def build_sequences(corpus_dict, key_x='form', key_y='ner', tolower=True):
    """
    Creates sequences from a list of dictionaries
    :param corpus_dict:
    :param key_x:
    :param key_y:
    :return:
    """
    X = []
    Y = []
    for sentence in corpus_dict:
        x = []
        y = []
        
        for word in sentence:
            x += [word[key_x]]
            y += [word[key_y]]
            
        if tolower:
            x = list(map(str.lower, x))
            
        X += [x]
        Y += [y]
    return X, Y

In [12]:
X_train_cat, Y_train_cat = build_sequences(train_dict)

print('First sentence, words \n', X_train_cat[1])
print('First sentence, NER \n', Y_train_cat[1])
print('\n')

First sentence, words 
 ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']
First sentence, NER 
 ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']




#### Doing the same for the development set
X_dev_cat, Y_dev_cat = build_sequences(dev_dict)

print('First sentence in dev_dict, words \n', X_dev_cat[1])
print('First sentence in dev_dict, NER \n', Y_dev_cat[1])

#### Extracting the Unique Words and Named Entity Tags
Training set 

In [13]:
vocabulary_words = sorted(list(
    set([word for sentence 
         in X_train_cat for word in sentence])))

ner = sorted(list(set([ner for sentence 
                       in Y_train_cat for ner in sentence])))
print('Unique NER tags in training \n', ner)
NB_CLASSES = len(ner)
print('\n')

Unique NER tags in training 
 ['B-LOC', 'B-MISC', 'B-ORG', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']




# Extracting - Dev set
vocabulary_words_dev = sorted(list(
    set([word for sentence 
         in X_dev_cat for word in sentence])))

# We use a separate name so that the training tag set `ner`
# and NB_CLASSES are not overwritten
ner_dev = sorted(list(set([tag for sentence 
                           in Y_dev_cat for tag in sentence])))
print('Unique NER tags in Dev set \n', ner_dev)

#### We create the dictionary
We reserve two extra indices: one for the padding symbol and one for unknown words.

In [14]:
embeddings_words = embeddings_dict.keys()
print('Words in GloVe:',  len(embeddings_dict.keys()))

vocabulary_words = sorted(list(set(vocabulary_words + 
                                   list(embeddings_words))))
cnt_uniq = len(vocabulary_words) + 2
print('# unique words in the training vocabulary: embeddings and corpus:', 
      cnt_uniq)
print('\n')

Words in GloVe: 400000
# unique words in the training vocabulary: embeddings and corpus: 402597




# We create the dictionary - Dev set
embeddings_words = embeddings_dict.keys()
print('Words in GloVe:',  len(embeddings_dict.keys()))

vocabulary_words_dev = sorted(list(set(vocabulary_words_dev + 
                                   list(embeddings_words))))
cnt_uniq = len(vocabulary_words_dev) + 2
print('# unique words in the dev vocabulary: embeddings and corpus:', 
      cnt_uniq)

#### Function to convert the words or NER to indices

In [15]:
def to_index(X, idx):
    """
    Convert the word lists (or NER lists) to indexes
    :param X: List of word (or NER) lists
    :param idx: word to number dictionary
    :return:
    """
    X_idx = []
    for x in X:
        # We map the unknown words to one
        x_idx = list(map(lambda x: idx.get(x, 1), x))
        X_idx += [x_idx]
        
    return X_idx
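
The index conventions used here (0 for padding, 1 for unknown words, real entries from 2 upward) can be illustrated on a toy vocabulary. This is a minimal sketch: the names `PAD_IDX`, `UNK_IDX`, `toy_word_idx`, and `encode` are assumptions introduced for the example and do not appear elsewhere in the notebook.

```python
PAD_IDX, UNK_IDX = 0, 1  # reserved indices, following the notebook's convention

toy_vocabulary = sorted({'eu', 'rejects', 'german', 'call'})
# Real words are numbered from 2 upward
toy_word_idx = {word: i for i, word in enumerate(toy_vocabulary, start=2)}

def encode(sentence, idx):
    """Replace each word with its index; unseen words get UNK_IDX."""
    return [idx.get(word, UNK_IDX) for word in sentence]

encoded_sentence = encode(['eu', 'rejects', 'french', 'call'], toy_word_idx)
print(toy_word_idx)
print(encoded_sentence)
```

The word 'french' is not in the toy vocabulary, so it is mapped to 1, just as `to_index` does with `idx.get(x, 1)`.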

#### We create the indexes

In [16]:
# We start at two: index 0 is reserved for the padding symbol
# in RNNs and LSTMs, and index 1 for the unknown words

rev_word_idx = dict(enumerate(vocabulary_words, start=2))
ner_rev_idx = dict(enumerate(ner, start=2))

word_idx = {v: k for k, v in rev_word_idx.items()}
ner_idx = {v: k for k, v in ner_rev_idx.items()}

print('word index: \n', list(word_idx.items())[:10])
print('NER index: \n', list(ner_idx.items())[:10])

# We create the parallel sequences of indexes
X_idx = to_index(X_train_cat, word_idx)
Y_idx = to_index(Y_train_cat, ner_idx)

print('First sentences, word indices \n', X_idx[:3])
print('First sentences, NER indices \n', Y_idx[:3])


word index: 
 [('!', 2), ('!!', 3), ('!!!', 4), ('!!!!', 5), ('!!!!!', 6), ('!?', 7), ('!?!', 8), ('"', 9), ('#', 10), ('##', 11)]
NER index: 
 [('B-LOC', 2), ('B-MISC', 3), ('B-ORG', 4), ('I-LOC', 5), ('I-MISC', 6), ('I-ORG', 7), ('I-PER', 8), ('O', 9)]
First sentences, word indices 
 [[935], [142143, 307143, 161836, 91321, 363368, 83766, 85852, 218260, 936], [284434, 79019]]
First sentences, NER indices 
 [[9], [7, 9, 6, 9, 9, 9, 6, 9, 9], [8, 8]]


#### We create the indexes - Dev set

# We start at two: index 0 is reserved for the padding symbol
# in RNNs and LSTMs, and index 1 for the unknown words

rev_word_idx_dev = dict(enumerate(vocabulary_words_dev, start=2))
rev_ner_idx_dev = dict(enumerate(ner, start=2))

word_idx_dev = {v: k for k, v in rev_word_idx_dev.items()}
ner_idx_dev = {v: k for k, v in rev_ner_idx_dev.items()}

#print('word index:', list(word_idx.items())[:10])
#print('NER index:', list(ner_idx.items())[:10])

# We create the parallel sequences of indexes
X_idx_dev = to_index(X_dev_cat, word_idx_dev)
Y_idx_dev = to_index(Y_dev_cat, ner_idx_dev)

print('First sentences, word indices \n', X_idx_dev[:3])
print('First sentences, NER indices \n', Y_idx_dev[:3])


#### We pad the sentences

In [17]:
X = pad_sequences(X_idx)
Y = pad_sequences(Y_idx)

print(X[0])
print(Y[0])

# The NER classes plus two reserved indices: 0 (padding) and 1 (unknown)
Y_train = to_categorical(Y, num_classes=len(ner) + 2)
print(Y_train[0])


[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0 935]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 9]
[[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
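
The output above shows that `pad_sequences` pads on the left by default, so the real tokens sit at the end of each row. The following NumPy sketch reproduces that behaviour and the one-hot step on toy data. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins written for illustration, not the Keras implementations.

```python
import numpy as np

def pad_pre(sequences, value=0):
    """Left-pad integer sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if seq:
            # The real values occupy the last len(seq) positions
            padded[row, -len(seq):] = seq
    return padded

def one_hot(indices, num_classes):
    """Turn an integer array into one-hot vectors along a new last axis."""
    return np.eye(num_classes, dtype='float32')[indices]

toy_Y_idx = [[9], [7, 9, 6]]
toy_Y = pad_pre(toy_Y_idx)
toy_Y_cat = one_hot(toy_Y, num_classes=10)
print(toy_Y)
print(toy_Y_cat.shape)
```

The resulting tensor has shape (sentences, sequence length, classes), which matches the per-token softmax output of the tagger.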


#### We pad the sentences - Dev set

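# We use separate names so that the training arrays X and Y are not overwritten
X_dev = pad_sequences(X_idx_dev)
Y_dev_padded = pad_sequences(Y_idx_dev)

print(X_dev[0])
print(Y_dev_padded[0])

# The NER classes plus two reserved indices: 0 (padding) and 1 (unknown)
Y_dev = to_categorical(Y_dev_padded, num_classes=len(ner) + 2)
print(Y_dev[0])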


#### We create an embedding matrix

Index 0 is the padding symbol and index 1 is the unknown word.


In [18]:
rdstate = np.random.RandomState(1234567)
embedding_matrix = rdstate.uniform(-0.05, 0.05, 
                                   (len(vocabulary_words) + 2, 
                                    EMBEDDING_DIM))

In [19]:
for word in vocabulary_words:
    if word in embeddings_dict:
        # If the words are in the embeddings, we fill them with a value
        embedding_matrix[word_idx[word]] = embeddings_dict[word]


In [20]:
print('Shape of embedding matrix:', embedding_matrix.shape)
print('Embedding of table \n', embedding_matrix[word_idx['table']])
print('Embedding of the padding symbol, idx 0, random numbers \n', 
      embedding_matrix[0])

Shape of embedding matrix: (402597, 100)
Embedding of table 
 [-0.61453998  0.89692998  0.56770998  0.39102    -0.22437     0.49035001
  0.10868     0.27410999 -0.23833001 -0.52152997  0.73550999 -0.32653999
  0.51304001  0.32415    -0.46709001  0.68050998 -0.25497001 -0.040484
 -0.54417998 -1.05480003 -0.46691999  0.23557     0.31233999 -0.34536999
  0.14793    -0.53745002 -0.43215001 -0.48723999 -0.51019001 -0.90509999
 -0.17918999 -0.018376    0.09719    -0.31623     0.75120002  0.92236
 -0.49965     0.14036    -0.28296    -0.97443002 -0.0094408  -0.62944001
  0.14711    -0.94375998  0.0075222   0.18565001 -0.99172002  0.072789
 -0.18474001 -0.52901     0.38995001 -0.45677    -0.21932     1.37230003
 -0.29635999 -2.2342     -0.36667001  0.04987     0.63420999  0.53275001
 -0.53955001  0.31398001 -0.44698    -0.38389     0.066668   -0.02168
  0.20558     0.59456003 -0.24891999 -0.52794999 -0.3761      0.077104
  0.75221997 -0.2647     -0.0587      0.67540997 -0.16559    -0.49278
 -0.

#### Embedding matrix - Dev set

rdstate = np.random.RandomState(1234567)
embedding_matrix_dev = rdstate.uniform(-0.05, 0.05, 
                                   (len(vocabulary_words_dev) + 2, 
                                    EMBEDDING_DIM))

for word in vocabulary_words_dev:
    if word in embeddings_dict:
        # If the words are in the embeddings, we fill them with a value
        embedding_matrix_dev[word_idx_dev[word]] = embeddings_dict[word]


print('Shape of embedding matrix:', embedding_matrix_dev.shape)
print('Embedding of table', embedding_matrix_dev[word_idx_dev['table']])
print('Embedding of the padding symbol, idx 0, random numbers', 
      embedding_matrix_dev[0])

#### Using word embeddings
Another popular and powerful way to associate a vector with a word is the use of dense word vectors, also called word embeddings. Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros), and very high-dimensional (same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors (that is, dense vectors, as opposed to sparse vectors).

Unlike the word vectors obtained via one-hot encoding, word embeddings are learned from data. It’s common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or greater (capturing a vocabulary of 20,000 tokens, in this case). So, word embeddings pack more information into far fewer dimensions.

There are two ways to obtain word embeddings:

1 - Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.

2 - Load into your model word embeddings that were precomputed using a different machine-learning task than the one you’re trying to solve. These are called pretrained word embeddings.
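
In both cases, the embedding itself is a lookup table: a matrix with one row per word. The sketch below, on toy sizes chosen for illustration, shows that multiplying a one-hot vector by this matrix gives the same result as simply reading one row, which is why an embedding layer can take integer indices directly. The variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, dim = 6, 4          # toy sizes, chosen for illustration
# Random initial values, as when the embeddings are learned from scratch
E = rng.uniform(-0.05, 0.05, (vocab_size, dim))

word_index = 3
one_hot_vec = np.zeros(vocab_size)
one_hot_vec[word_index] = 1.0

# Multiplying a one-hot vector by E selects one row of E
via_product = one_hot_vec @ E
via_lookup = E[word_index]
print(np.allclose(via_product, via_lookup))
```

The two options listed above differ only in how the rows of the matrix are filled: by training on the task, or by copying precomputed vectors such as GloVe.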

## The Simple Recurrent Network (Tagger)

In [1]:
model = models.Sequential()

model.add(layers.Embedding(len(vocabulary_words) + 2,      
                           EMBEDDING_DIM,
                           mask_zero=True,
                           input_length=None))

model.layers[0].set_weights([embedding_matrix])
# We freeze the pretrained embeddings (trainable defaults to True)
model.layers[0].trainable = False

# a simple RNN network
#model.add(SimpleRNN(100, return_sequences=True))

# a simple RNN network with Bidirectional
#model.add(Bidirectional(SimpleRNN(100, return_sequences=True)))

#a simple LSTM network
#model.add(LSTM(100, return_sequences=True))                         # dropout=0.1, recurrent_dropout=0.5,

# a stack of two recurrent layers: an LSTM followed by a bidirectional LSTM
model.add(LSTM(100, return_sequences=True))

# return_sequences=True: the layer outputs one vector per token, as tagging requires
model.add(Bidirectional(LSTM(100, return_sequences=True)))

# Dropout is used to fight overfitting
model.add(layers.Dropout(0.25))
model.add(Dense(NB_CLASSES + 2, activation='softmax'))



#### Fitting the Model

In [633]:
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.summary()

model.fit(X, Y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, None, 100)         40069600  
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, None, 100)         20100     
_________________________________________________________________
bidirectional_13 (Bidirectio (None, None, 200)         160800    
_________________________________________________________________
dense_18 (Dense)             (None, None, 9)           1809      
Total params: 40,252,309
Trainable params: 40,252,309
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f6b748dcb38>

## Evaluating the System

#### Formatting the Test set

In [544]:
# In X_dict, we replace the words with their index
X_test_cat, Y_test_cat = build_sequences(test_dict)

# We create the parallel sequences of indexes
X_test_idx = to_index(X_test_cat, word_idx)
Y_test_idx = to_index(Y_test_cat, ner_idx)

print('X[0] test idx', X_test_idx[0])
print('Y[0] test idx', Y_test_idx[0])

X_test_padded = pad_sequences(X_test_idx)
Y_test_padded = pad_sequences(Y_test_idx)

print('X[0] test idx passed \n', X_test_padded[0])
print('Y[0] test idx padded \n', Y_test_padded[0])

# Two extra symbols: 0 (padding) and 1 (unknown)
Y_test_padded_vectorized = to_categorical(Y_test_padded, 
                                          num_classes=len(ner) + 2)
print('Y[0] test idx padded vectorized \n', Y_test_padded_vectorized[0])
print(X_test_padded.shape)
print(Y_test_padded_vectorized.shape)


X[0] test idx [891]
Y[0] test idx [8]
X[0] test idx passed [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 891]
Y[0] test idx padded [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 8]
Y[0] test idx padded vectorized [[1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 

In [545]:
# Evaluates with the padding symbol
test_loss, test_acc = model.evaluate(X_test_padded, 
                                     Y_test_padded_vectorized)
print('Loss:', test_loss)
print('Accuracy:', test_acc)


Loss: 0.21689066391976208
Accuracy: 0.9342477429702668


#### We evaluate on all the test corpus

In [546]:
print('X_test' + '\n', X_test_cat[0])
print('X_test_padded' + '\n', X_test_padded[0])

corpus_ner_predictions = model.predict(X_test_padded)

print('Y_test' + '\n', Y_test_cat[0])
print('Y_test_padded' + '\n', Y_test_padded[0])
print('predictions' + '\n', corpus_ner_predictions[0])

X_test
 ['-docstart-']
X_test_padded
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 891]
Y_test
 ['O']
Y_test_padded
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 8]
predictions
 [[3.8932071e-03 2.0994707e-03 6.1671459e-03 ... 3.7002046e-02
  2.2152795e-02 8.5350865e-01]
 [3.8932071e-03 2.0994707e-03 6.1671459e-03 ... 3.7002046e-02
  2.2152

#### Remove padding

In [547]:
ner_pred_num = []

for sent_nbr, sent_ner_predictions in enumerate(corpus_ner_predictions):
    ner_pred_num += [sent_ner_predictions[-len(X_test_cat[sent_nbr]):]]
    
print(ner_pred_num[:2])

[array([[4.7312988e-06, 2.9684611e-06, 2.4525192e-05, 1.1403169e-04,
        1.4805868e-04, 5.4486602e-04, 5.2190293e-04, 2.3054771e-04,
        9.9840838e-01]], dtype=float32), array([[2.35657950e-07, 4.72813220e-08, 2.13263456e-06, 2.08299498e-05,
        1.93560903e-04, 2.43502064e-03, 6.14209042e-04, 1.60869149e-05,
        9.96717870e-01],
       [3.48824331e-10, 1.01019589e-10, 1.23680586e-08, 1.13703969e-07,
        1.92286166e-06, 3.88664557e-05, 1.63839941e-05, 1.45495562e-06,
        9.99941230e-01],
       [8.62602178e-07, 4.31470397e-07, 4.58842114e-05, 6.77316712e-05,
        9.16631699e-01, 3.71768326e-02, 3.94256413e-02, 2.82957801e-04,
        6.36795023e-03],
       [6.13386320e-09, 1.94492467e-09, 1.21948574e-07, 2.15761929e-06,
        5.64679322e-05, 7.13074172e-04, 3.63511936e-04, 4.09523673e-05,
        9.98823702e-01],
       [4.29836389e-09, 1.28940214e-09, 6.39856381e-08, 7.42536997e-07,
        1.09179764e-05, 7.81541690e-04, 5.43775735e-04, 5.81360189e-04,
  

#### Convert NER indices to symbols

In [548]:
ner_pred = []

for sentence in ner_pred_num:
    ner_pred_idx = list(map(np.argmax, sentence))
    #ner_pred_cat = list(map(rev_ner_idx.get, ner_pred_idx))
    ner_pred_cat = list(map(ner_rev_idx.get, ner_pred_idx))
    ner_pred += [ner_pred_cat]

print(ner_pred[:2])
print(len(ner_pred))
print(Y_test_cat[:2])

[['O'], ['O', 'O', 'I-LOC', 'O', 'O', 'O', 'O', 'I-LOC', 'O', 'O', 'O', 'O']]
3684
[['O'], ['O', 'O', 'I-LOC', 'O', 'O', 'O', 'O', 'I-PER', 'O', 'O', 'O', 'O']]
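
The two steps above (removing the padding, then converting probabilities to tags) can be illustrated together on a toy prediction. This is a minimal sketch: the tag inventory `toy_rev_idx`, the helper `decode`, and the probability values are invented for illustration.

```python
import numpy as np

# Toy tag inventory following the notebook's convention: 0 = padding, 1 = unknown
toy_rev_idx = {2: 'I-LOC', 3: 'I-PER', 4: 'O'}

def decode(padded_probs, true_length, rev_idx):
    """Drop the left padding, then map the most probable index to its tag."""
    # The padding is on the left, so the real tokens are the last rows
    kept = padded_probs[-true_length:]
    return [rev_idx.get(int(i)) for i in kept.argmax(axis=-1)]

# One sentence of 2 real tokens, left-padded to length 4, with 5 classes
probs = np.array([
    [0.9, 0.0, 0.0, 0.0, 0.1],   # padding position
    [0.9, 0.0, 0.0, 0.0, 0.1],   # padding position
    [0.0, 0.0, 0.8, 0.1, 0.1],   # most probable: I-LOC
    [0.0, 0.0, 0.1, 0.1, 0.8],   # most probable: O
])
decoded = decode(probs, true_length=2, rev_idx=toy_rev_idx)
print(decoded)
```

If the most probable index were 0 or 1, `rev_idx.get` would return None, since these two indices have no tag; the same holds for `ner_rev_idx.get` in the cell above.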


#### Writing the results of our predictions and test set in one file
After using the predict() method to predict the tags of the whole test set, we need to write our results in a file, where the last two columns will be the hand-annotated tag and the predicted tag. The fields must be separated by a space.


In [549]:
def save(file, corpus_dict, column_names):
    """
    Saves the corpus in a file
    :param file: path of the output file
    :param corpus_dict: list of sentences, each a list of token dictionaries
    :param column_names: columns to write, in order
    :return:
    """
    with open(file, 'w') as f_out:
        for sentence in corpus_dict:
            sentence_lst = []

            for row in sentence:
                # Missing values are written as '_'
                items = map(lambda x: row.get(x, '_'), column_names)
                # The fields are separated by a space
                sentence_lst += [' '.join(items) + '\n']

            # A blank line separates the sentences
            sentence_lst += ['\n']
            f_out.write(''.join(sentence_lst))
            

In [550]:

testfile = 'NER-data/eng.test'

column_names_pred = ['form', 'pos', 'chunk', 'ner', 'predicted-ner']

testset = open(testfile).read().strip()

conll_dict_pred = CoNLLDictorizer(column_names_pred, col_sep=' +')
test_dict_pred = conll_dict_pred.transform(testset)
print(len(test_dict_pred))

print(test_dict_pred[2])

print((ner_pred[2]))

print("hello",test_dict_pred[1])

# test_dict_pred holds the sentences of the test set. We pair each sentence
# with the corresponding sentence in ner_pred, and each word with its
# predicted tag.
for sent_index in range(len(test_dict_pred)):
    for word_index in range(len(test_dict_pred[sent_index])):
        test_dict_pred[sent_index][word_index]['predicted-ner'] = str(ner_pred[sent_index][word_index])

# Write a file with the word, the gold-standard tag, and the predicted tag
save('out', test_dict_pred, column_names_pred)


3684
[{'form': 'Nadim', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'I-PER'}, {'form': 'Ladki', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'I-PER'}]
['I-LOC', 'I-ORG']
hello [{'form': 'SOCCER', 'pos': 'NN', 'chunk': 'I-NP', 'ner': 'O'}, {'form': '-', 'pos': ':', 'chunk': 'O', 'ner': 'O'}, {'form': 'JAPAN', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'I-LOC'}, {'form': 'GET', 'pos': 'VB', 'chunk': 'I-VP', 'ner': 'O'}, {'form': 'LUCKY', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'O'}, {'form': 'WIN', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'O'}, {'form': ',', 'pos': ',', 'chunk': 'O', 'ner': 'O'}, {'form': 'CHINA', 'pos': 'NNP', 'chunk': 'I-NP', 'ner': 'I-PER'}, {'form': 'IN', 'pos': 'IN', 'chunk': 'I-PP', 'ner': 'O'}, {'form': 'SURPRISE', 'pos': 'DT', 'chunk': 'I-NP', 'ner': 'O'}, {'form': 'DEFEAT', 'pos': 'NN', 'chunk': 'I-NP', 'ner': 'O'}, {'form': '.', 'pos': '.', 'chunk': 'O', 'ner': 'O'}]


#### Applying the conlleval evaluation script
F1 is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall).
Apply conlleval to the produced output and report the F1 result.

Run the scorer in a terminal like this:

perl conlleval < out

Here, out stands for the name of your output file.

#### The output results from conlleval
The evaluator prints precision and recall measures, and their harmonic mean (the F-measure, FB1). We also see the performance for each of the types of names, LOC, MISC, ORG and PER in this case. Note that this evaluation is quite tough: we get no credit for an almost-correct group.

Here is what the evaluator writes when running it on the output file:

processed 46666 tokens with 5648 phrases; found: 5156 phrases; correct: 3100.
accuracy:  91.59%; precision:  60.12%; recall:  54.89%; FB1:  57.39
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  7
              LOC: precision:  63.00%; recall:  68.41%; FB1:  65.59  1811
             MISC: precision:  41.90%; recall:  40.88%; FB1:  41.38  685
              ORG: precision:  57.35%; recall:  32.63%; FB1:  41.60  945
              PER: precision:  66.16%; recall:  69.88%; FB1:  67.97  1708
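
As a sanity check, the overall figures in this report can be recomputed from the three phrase counts on its first line. The helper name `prf` below is our own and is not part of conlleval; it is only a small sketch of the formulas.

```python
def prf(n_correct, n_found, n_gold):
    """Return precision, recall and F1 (as fractions) from phrase counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read from the report: 5648 gold phrases, 5156 found, 3100 correct
precision, recall, f1 = prf(3100, 5156, 5648)
print('precision %.2f%%, recall %.2f%%, FB1 %.2f'
      % (100 * precision, 100 * recall, 100 * f1))
# prints: precision 60.12%, recall 54.89%, FB1 57.39
```

The gap between the token accuracy (about 92%) and FB1 (about 57) comes from the phrase-level scoring: a multiword name counts as correct only when all its tokens are tagged correctly.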


### Evaluate

In [551]:
total, correct, total_ukn, correct_ukn = 0, 0, 0, 0

for id_s, sentence in enumerate(X_test_cat):
    for id_w, word in enumerate(sentence):
        total += 1
        if ner_pred[id_s][id_w] == Y_test_cat[id_s][id_w]:
            correct += 1
        # The word is not in the dictionary
        if word not in word_idx:
            total_ukn += 1
            if ner_pred[id_s][id_w] == Y_test_cat[id_s][id_w]:
                correct_ukn += 1

print('total %d, correct %d, accuracy %f' % 
      (total, correct, correct / total))
if total_ukn != 0:
    print('total unknown %d, correct %d, accuracy %f' % 
          (total_ukn, correct_ukn, correct_ukn / total_ukn))

total 46666, correct 43811, accuracy 0.938821
total unknown 1186, correct 944, accuracy 0.795953


## Predicting Named Entities

In [552]:
def predict_sentence(sentence, model, word_idx, 
                     vocabulary_words, rev_idx_ner, verbose=False):
    # Predict one sentence
    sentence = sentence.split()
    word_idxs = to_index([sentence], word_idx)
    word_idx_padded = pad_sequences(word_idxs)

    ner_idx_pred = model.predict(word_idx_padded)
    
    # We remove padding
    ner_idx_pred = ner_idx_pred[0][-len(sentence):]
    ner_idx = list(map(np.argmax, ner_idx_pred))
    ner = list(map(rev_idx_ner.get, ner_idx))
    
    if verbose:
        print('Sentence', sentence)
        print('Sentence word indexes', word_idxs)
        print('Padded sentence', word_idx_padded)
        print('NER predicted', ner_idx_pred[0])
        print('NER shape', ner_idx_pred.shape)
        
    return ner

def predict_sentence_feats(sentence, dict_vect, model, ner_rev_idx):
    # Variant for a model fed with DictVectorizer features. It has its own
    # name so that it does not shadow the recurrent version defined above.
    # It relies on the globals SCALER and scaler.
    column_names = ['form', 'ppos', 'pchunk', 'ner']

    sentence = list(enumerate(sentence.lower().split(), start=1))

    # Build a CoNLL-like string with one "index<TAB>word" row per token
    conll_cols = ''
    for word_id, word in sentence:
        conll_cols += str(word_id) + '\t' + word + '\n'

    conll_dict = CoNLLDictorizer(column_names, col_sep='\t')
    sent_dict = conll_dict.transform(conll_cols)

    context_dictorizer = ContextDictorizer()
    context_dictorizer.fit(sent_dict)
    X_dict, y = context_dictorizer.transform(sent_dict,
                                             training_step=False)

    X_num = dict_vect.transform(X_dict)
    if SCALER:
        # We standardize X_num
        X = scaler.transform(X_num)
    else:
        X = X_num

    y_prob = model.predict(X)
    y_pred = y_prob.argmax(axis=-1)
    y_pred_cat = [ner_rev_idx[i] for i in y_pred]

    return y_pred_cat


sentences = ["That round table might collapse .",
             "The man can learn well .",
             "The man can swim .",
             "The man can simwo ."]
for sentence in sentences:
    y_test_pred_cat = predict_sentence_feats(sentence.lower(),
                                             dict_vectorizer,
                                             model,
                                             ner_rev_idx)

In [553]:
sentences = ["That round table might collapse .",
                 "The man can learn well .",
                 "The man can swim .",
                 "The man can simwo ."]
for sentence in sentences:
    y_test_pred_cat = predict_sentence(sentence.lower(), 
                                       model, word_idx, 
                                       vocabulary_words, 
                                       ner_rev_idx)
    print(sentence)
    print(y_test_pred_cat)

That round table might collapse .
['O', 'O', 'O', 'O', 'O', 'O']
The man can learn well .
['O', 'O', 'O', 'O', 'O', 'O']
The man can swim .
['O', 'O', 'O', 'O', 'O']
The man can simwo .
['O', 'O', 'O', 'O', 'O']


In [353]:
print(test_sentences[0:50])

-DOCSTART- -X- -X- O

SOCCER NN I-NP O
- : O O
JAP



#### Tokenize the data (Chollet, Section 6.1)

Let's vectorize the texts we collected, and prepare a training and validation split. We will merely be using the concepts we introduced earlier in this section.

Because pre-trained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise, task-specific embeddings are likely to outperform them), we will add the following twist: we restrict the training data to its first 200 samples. So we will be learning to classify movie reviews after looking at just 200 examples...


Now let's build an embedding matrix that we will be able to load into an Embedding layer. 
It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in our reference word index (built during tokenization). Note that the index 0 is not supposed to stand for any word or token -- it's a placeholder.

#### Collecting the Embeddings

1 - Download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ and keep the 100d vectors.

2 - Write a function that reads the GloVe embeddings and stores them in a dictionary, where the keys are the words and the values are the embeddings.

3 - Using cosine similarity, compute the five words closest to the word table.
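
Steps 2 and 3 could be sketched as follows. This is one possible approach, not a reference solution: the function names `read_embeddings` and `closest_words` are our own, and the code assumes the plain-text GloVe format with one word per line followed by its space-separated vector.

```python
import numpy as np


def read_embeddings(path):
    """Read a GloVe text file into a dictionary {word: vector}."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            values = line.rstrip().split(' ')
            embeddings[values[0]] = np.array(values[1:], dtype='float32')
    return embeddings


def closest_words(word, embeddings, n=5):
    """Return the n words with the highest cosine similarity to word."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words])
    # After normalizing the rows, a dot product equals the cosine similarity
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = embeddings[word] / np.linalg.norm(embeddings[word])
    similarities = matrix @ query
    order = np.argsort(-similarities)
    # Skip the query word itself, which always has a similarity of 1
    return [words[i] for i in order if words[i] != word][:n]
```

With the 100d vectors downloaded in step 1, a call would look like `closest_words('table', read_embeddings('glove.6B.100d.txt'))`, where the path is an assumption about where you stored the file.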


#### 1 - Reading the Corpus and Building Indices

You will read the corpus with programs available from https://github.com/pnugues/edan95. These programs will enable you to load the files in the form of a list of dictionaries.

1 - Write a function that extracts the words and the NER tags and returns the X and Y lists of symbols.

2 - Create indices and inverted indices for the words and the NER: i.e. you will associate each word with a number. The words will be the set of all the words observed in the training set and the words in GloVe. You will use index 0 for the padding symbol and 1 for unknown words. (see Chollet page 69)
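
A minimal sketch of these two steps, under stated assumptions: the names `extract_xy` and `build_indices` are hypothetical, the input is the list of sentences (each a list of row dictionaries) produced by the dictorizer, and the words are lowercased because the GloVe vocabulary used here is lowercased.

```python
def extract_xy(sentences, x_key='form', y_key='ner', lowercase=True):
    """Return the word sequences X and the tag sequences Y."""
    X, Y = [], []
    for sentence in sentences:
        words = [row[x_key] for row in sentence]
        if lowercase:
            words = [word.lower() for word in words]
        X.append(words)
        Y.append([row[y_key] for row in sentence])
    return X, Y


def build_indices(symbols, start=2):
    """Map each symbol to an integer, keeping the indices below start free.

    With start=2, index 0 stays reserved for padding and 1 for unknown words.
    """
    idx = {symbol: i for i, symbol in enumerate(sorted(symbols), start=start)}
    rev_idx = {i: symbol for symbol, i in idx.items()}
    return idx, rev_idx


sentences = [[{'form': 'Nadim', 'ner': 'I-PER'},
              {'form': 'Ladki', 'ner': 'I-PER'}]]
X, Y = extract_xy(sentences)
word_idx, rev_word_idx = build_indices({w for sent in X for w in sent})
print(X, Y, word_idx)
```

For the word index, the set passed to `build_indices` would be the union of the training words and the GloVe keys, as the step describes.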

#### 2 - Building the Embedding Matrix

4 - Create a matrix with one row per word in the index, plus the padding and unknown symbols, and one column per embedding dimension. Initialize it with random values.

5 - Fill the matrix with the GloVe embeddings.
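
The two steps above can be sketched with NumPy. The function name `build_embedding_matrix`, the uniform range of the random values, and the seed are assumptions; the code also assumes that the word indices run contiguously from 2, with 0 and 1 reserved for padding and unknown words.

```python
import numpy as np


def build_embedding_matrix(word_idx, embeddings, embedding_dim, seed=0):
    """Return a (len(word_idx) + 2, embedding_dim) matrix of embeddings."""
    rng = np.random.default_rng(seed)
    # Random initialization: rows without a GloVe vector keep these values
    matrix = rng.uniform(-0.05, 0.05, (len(word_idx) + 2, embedding_dim))
    for word, i in word_idx.items():
        vector = embeddings.get(word)
        if vector is not None:
            # Row i holds the embedding of the word with index i
            matrix[i] = vector
    return matrix


word_idx = {'table': 2, 'simwo': 3}
embeddings = {'table': np.array([1.0, 2.0, 3.0], dtype='float32')}
embedding_matrix = build_embedding_matrix(word_idx, embeddings, 3)
print(embedding_matrix.shape)
```

Words seen in the training set but missing from GloVe, like the made-up `simwo` here, keep their random row and can still be adjusted if the embedding layer is left trainable.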

#### 3 - Creating the X and Y Sequences

You will now create the input and output sequences with numerical indices.

1 - Convert the X and Y lists of symbols into lists of numbers using the indices you created.

2 - Pad the sentences using the pad_sequences function.

3 - Do the same for the development set.
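
To make the conversion and the padding concrete, here is a small sketch. `symbols_to_index` and `pad_left` are hypothetical helpers; `pad_left` is a NumPy stand-in that mimics the default behavior of Keras `pad_sequences`, which pads with 0 on the left and truncates from the left. In the assignment itself you would call `pad_sequences` directly.

```python
import numpy as np


def symbols_to_index(sequences, idx, unknown=1):
    """Replace each symbol by its index, using unknown for unseen symbols."""
    return [[idx.get(symbol, unknown) for symbol in seq] for seq in sequences]


def pad_left(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]   # keep the last maxlen items
        if seq:
            padded[row, maxlen - len(seq):] = seq
    return padded


word_idx = {'the': 2, 'man': 3, 'can': 4, 'swim': 5, '.': 6}
X = [['the', 'man', 'can', 'swim', '.'], ['the', 'man', 'can', 'simwo']]
X_idx = symbols_to_index(X, word_idx)
print(X_idx)            # [[2, 3, 4, 5, 6], [2, 3, 4, 1]]
print(pad_left(X_idx))  # second row becomes [0, 2, 3, 4, 1]
```

Because the padding sits on the left, the predictions for the real words of a sentence are the last `len(sentence)` positions of the output, which is what `predict_sentence` relies on when it removes the padding.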

#### 4 - Building a Simple Recurrent Neural Network

1 - Create a simple recurrent network and train a model with the train set. As layers, you will use Embedding, SimpleRNN, and Dense.

2 - Compile and fit your network. You will report the training and validation losses and accuracies and comment on possible overfitting.

3 - Apply your network to the test set and report the accuracy as well as the confusion matrix you obtained. You will use the evaluate method.
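
The confusion matrix asked for in step 3 can be computed from the gold and predicted tag sequences once the padding has been removed. The sketch below uses a hypothetical helper, `confusion_matrix_tags`, written with NumPy only; `sklearn.metrics.confusion_matrix` on the flattened sequences would be an alternative.

```python
import numpy as np


def confusion_matrix_tags(y_true, y_pred, labels):
    """Rows are gold tags and columns are predicted tags, in labels order."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for gold_sentence, pred_sentence in zip(y_true, y_pred):
        for gold, pred in zip(gold_sentence, pred_sentence):
            matrix[position[gold], position[pred]] += 1
    return matrix


labels = ['O', 'I-PER', 'I-LOC']
y_true = [['I-PER', 'I-PER'], ['O', 'I-LOC', 'O']]
y_pred = [['I-LOC', 'I-PER'], ['O', 'I-LOC', 'O']]
cm = confusion_matrix_tags(y_true, y_pred, labels)
print(cm)
# The diagonal holds the correct predictions, so this is the token accuracy
print(cm.trace() / cm.sum())
```

The `labels` list must contain every symbol that occurs in either sequence, including any padding or unknown symbol the model may predict.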

#### 5 - Evaluating your System

You will use the official script to evaluate the performance of your system.

1 - Use the predict method to predict the tags of the whole test set.

2 - Write your results in a file, where the two last columns will be the hand-annotated tag and the predicted tag. The fields must be separated by a space.

3 - Apply conlleval to your output. Report the F1 result.

4 - Try to improve your model by modifying some parameters, adding layers, adding Bidirectional and Dropout.

5 - Evaluate your network again.

#### 6 - Building an LSTM Network

1 - Create a simple LSTM network and train a model with the train set. As layers, you will use Embedding, LSTM, and Dense.

2 - Apply conlleval to your output. Report the F1 result.

3 - Try to improve your model by modifying some parameters, adding layers, adding Bidirectional, Dropout, possibly mixing SimpleRNN.

4 - Apply your network to the test set and report the accuracy as well as the confusion matrix you obtained. You need to reach an F1 of 84 to pass.
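
A possible starting point for the improved network of step 3, given as a sketch rather than a reference solution. The function name `build_bilstm_tagger`, the number of units, and the dropout rate are assumptions to tune; the loss assumes one-hot encoded tags, as produced by `to_categorical`.

```python
import numpy as np
from keras import models, layers


def build_bilstm_tagger(vocab_size, embedding_dim, n_tags,
                        units=100, dropout=0.25):
    """Embedding, bidirectional LSTM, then one softmax over the tags per word."""
    model = models.Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim))
    model.add(layers.Dropout(dropout))
    # return_sequences=True yields one output vector per position, not per sentence
    model.add(layers.Bidirectional(layers.LSTM(units, return_sequences=True)))
    model.add(layers.Dropout(dropout))
    model.add(layers.Dense(n_tags, activation='softmax'))
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


# Tiny sizes, only to check the shapes; real values come from your indices
model = build_bilstm_tagger(vocab_size=50, embedding_dim=8, n_tags=5, units=4)
batch = np.array([[0, 0, 3, 7, 2],
                  [0, 4, 9, 1, 6]])
probs = model.predict(batch, verbose=0)
print(probs.shape)  # (2, 5, 5): sentences, positions, tag probabilities
```

Replacing `layers.LSTM` with `layers.SimpleRNN` in the same function gives the bidirectional variant of the simple recurrent network from the previous sections.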