# Assignment 3 - Named Entity Recognition (NER)

Welcome to the third programming assignment of Course 3. In this assignment, you will learn to build more complicated models with Trax. By completing this assignment, you will be able to: 

- Design the architecture of a neural network, train it, and test it. 
- Process features and represent them
- Understand word padding
- Implement LSTMs
- Test with your own sentence

## Outline
- [Introduction](#0)
- [Part 1:  Exploring the data](#1)
    - [1.1  Importing the Data](#1.1)
    - [1.2  Data generator](#1.2)
        - [Exercise 01](#ex01)
- [Part 2:  Building the model](#2)
    - [Exercise 02](#ex02)
- [Part 3:  Train the Model](#3)
    - [Exercise 03](#ex03)
- [Part 4:  Compute Accuracy](#4)
    - [Exercise 04](#ex04)
- [Part 5:  Testing with your own sentence](#5)

<a name="0"></a>
# Introduction

We start by defining named entity recognition (NER). NER is a subtask of information extraction that locates and classifies named entities in a text. Named entities can be organizations, persons, locations, times, etc.

For example:

<img src="https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/images/ner.png" width="500px"/>

The example above is labeled as follows:

- French: geopolitical entity
- Morocco: geographic entity 
- Christmas: time indicator

Everything else is labeled with an `O` and is not considered a named entity. In this assignment, you will build a named entity recognition system that trains in a few seconds (on a GPU) and reaches around 75% accuracy. You will then load a version of the same model that was trained for a longer period of time and evaluate it to get 96% accuracy. Finally, you will test your named entity recognition system with your own sentence.

In [1]:
%%capture
!pip -q install trax==1.3.1

In [8]:
%%capture
!wget https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/datasets/ner_dataset.csv
!wget https://github.com/martin-fabbri/colab-notebooks/raw/master/deeplearning.ai/nlp/datasets/ner-data.tar.gz
!tar -xvf ner-data.tar.gz

In [4]:
import os
import random as rnd

import numpy as np
import pandas as pd
import trax
from trax import layers as tl

trax.supervised.trainer_lib.init_random_number_generators(33)

DeviceArray([ 0, 33], dtype=uint32)

In [3]:
def get_vocab(vocab_path, tags_path):
    # map each vocabulary word to its line index (indices start at 0)
    vocab = {}
    with open(vocab_path) as f:
        for i, l in enumerate(f.read().splitlines()):
            vocab[l] = i
    # the padding token gets the next free index, after all vocabulary words
    vocab['<PAD>'] = len(vocab) # 35180
    # loading tags (we require this to map tags to their indices)
    tag_map = {}
    with open(tags_path) as f:
        for i, t in enumerate(f.read().splitlines()):
            tag_map[t] = i

    return vocab, tag_map

def get_params(vocab, tag_map, sentences_file, labels_file):
    sentences = []
    labels = []

    with open(sentences_file) as f:
        for sentence in f.read().splitlines():
            # replace each token by its index if it is in vocab
            # else use the index of the 'UNK' token
            s = [vocab[token] if token in vocab 
                 else vocab['UNK']
                 for token in sentence.split(' ')]
            sentences.append(s)

    with open(labels_file) as f:
        for sentence in f.read().splitlines():
            # replace each label by its index
            l = [tag_map[label] for label in sentence.split(' ')]
            labels.append(l) 
    return sentences, labels, len(sentences)
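
The token-to-index mapping used by the helpers above can be illustrated on a toy vocabulary. This is a minimal sketch, not the assignment's code: the word list and the `encode` helper are hypothetical, and only the `'UNK'` fallback and the `'<PAD>'` convention are taken from the functions above.

```python
# Toy vocabulary standing in for the real vocabulary file (hypothetical contents).
words = ["the", "war", "in", "Iraq", "UNK"]
vocab = {word: index for index, word in enumerate(words)}
vocab["<PAD>"] = len(vocab)  # the padding token takes the next free index


def encode(sentence, vocab):
    """Map each space-separated token to its index, falling back to 'UNK'."""
    return [vocab.get(token, vocab["UNK"]) for token in sentence.split(" ")]


encoded = encode("the war in Baghdad", vocab)
print(encoded)         # [0, 1, 2, 4]
print(vocab["<PAD>"])  # 5
```

"Baghdad" is not in the toy vocabulary, so it is mapped to the index of `'UNK'`.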


<a name="1"></a>
# Part 1:  Exploring the data

We will be using a dataset from Kaggle, which has been preprocessed for you. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tag. A few tags you might expect to see are:

* geo: geographical entity
* org: organization
* per: person 
* gpe: geopolitical entity
* tim: time indicator
* art: artifact
* eve: event
* nat: natural phenomenon
* O: filler word
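
As a small illustration, the tag list above can be kept in a lookup table. The sketch below assumes the labels follow the common `B-`/`I-` prefix convention (for example `B-geo`), where the prefix marks the position of a token within an entity; the `describe_tag` helper is hypothetical and not part of the assignment.

```python
# Descriptions copied from the tag list above.
TAG_DESCRIPTIONS = {
    "geo": "geographical entity",
    "org": "organization",
    "per": "person",
    "gpe": "geopolitical entity",
    "tim": "time indicator",
    "art": "artifact",
    "eve": "event",
    "nat": "natural phenomenon",
    "O": "filler word",
}


def describe_tag(tag):
    """Return (prefix, description); the prefix is None for the bare 'O' tag."""
    if tag == "O":
        return None, TAG_DESCRIPTIONS["O"]
    # Assumed label format: '<prefix>-<entity type>', e.g. 'B-geo'.
    prefix, entity_type = tag.split("-", 1)
    return prefix, TAG_DESCRIPTIONS[entity_type]


print(describe_tag("B-geo"))  # ('B', 'geographical entity')
print(describe_tag("O"))      # (None, 'filler word')
```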


In [12]:
data = pd.read_csv("ner_dataset.csv", encoding="ISO-8859-1")
with open("data/small/train/sentences.txt", "r") as f:
    train_sents = f.readline()
with open("data/small/train/labels.txt", "r") as f:
    train_labels = f.readline()
print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)
print('ORIGINAL DATA:\n', data.head(5))
del data, train_sents, train_labels

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O
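
To see how the sentence and its labels line up, one can pair each token with its tag and keep only the tokens that are not labeled `O`. This is a minimal sketch using the sentence and label strings printed above, copied in as literals; the variable names are illustrative.

```python
sentence = ("Thousands of demonstrators have marched through London to protest "
            "the war in Iraq and demand the withdrawal of British troops from "
            "that country .")
labels = "O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O"

tokens = sentence.split(" ")
tags = labels.split(" ")

# Every token must have exactly one tag.
assert len(tokens) == len(tags)

# Keep only the tokens tagged as named entities.
entities = [(token, tag) for token, tag in zip(tokens, tags) if tag != "O"]
print(entities)
# [('London', 'B-geo'), ('Iraq', 'B-geo'), ('British', 'B-gpe')]
```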


<a name="1.1"></a>
## 1.1  Importing the Data

In this part, we will import the preprocessed data and explore it.