## Assignment 3 - Named Entity Recognition

In this assignment, we are going to build a Named Entity Recognition model. With this model, we will also tag new data.

More on Named Entity Recognition:

https://blog.paralleldots.com/data-science/named-entity-recognition-milestone-models-papers-and-technologies/

https://blog.paralleldots.com/product/applications-named-entity-recognition-api/

### Steps:

**1. Import the data**

**2. Build the model**

**3. Pick a dataset to run the model on**

**4. Build a function to load new data and print the tags**

Your web application will load small sections of text (such as tweets or headlines) and from that, you will tag the text based on the presence of named entities.

*What you will be graded on:*

1. Ability to build a model on word and tag data

2. Ability to use the model to predict on new data and display that prediction

*The model will be based on:*
1. Embeddings from words
2. Embeddings from tag inputs

### Step 1: Importing the data

Below is some code to get you started. As in the part of speech tagging example, you will have to write code to:

0. Split your data into a train/test set (Do a 80/20 or 90/10 split since we'll be later applying this model to an entirely separate set of data)
1. Find the set of all words
2. Find the set of all tags
3. **Create a function called ent_tagger** that will turn a sentence into this output for model building :
``` [('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have',  'O'), ('marched',  'O'), ('through',  'O'), ('London', 'B-geo'), ('to',  'O'), ('protest',  'O'), ('the',  'O'), ('war',  'O'), ('in',  'O'), ('Iraq',  'B-geo'), ('and', 'O'), ('demand',  'O'), ('the',  'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops',  'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
```
4. Make a dictionary of words to index and entity tag to index

In [1]:
import pandas as pd
import numpy as np
import os

os.chdir("/Users/priya/Downloads")

### NER DATASET IS FOUND IN THE COURSE REPO
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
print(data.head(10))
X= data.iloc[:,0:3]
# X.head()

    Sentence #           Word  POS    Tag
0  Sentence: 1      Thousands  NNS      O
1  Sentence: 1             of   IN      O
2  Sentence: 1  demonstrators  NNS      O
3  Sentence: 1           have  VBP      O
4  Sentence: 1        marched  VBN      O
5  Sentence: 1        through   IN      O
6  Sentence: 1         London  NNP  B-geo
7  Sentence: 1             to   TO      O
8  Sentence: 1        protest   VB      O
9  Sentence: 1            the   DT      O


In [2]:
words = list(set(data["Word"].values))
n_words = len(words)
print(n_words)

tags= list(set(data["Tag"].values))
n_tags= len(tags)
print(n_tags)
    

35178
17


In [5]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w,t in zip(s["Word"].values.tolist(),s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
                                                           
                                                           
    def ent_tagger(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
        
getter = SentenceGetter(data)
sent = getter.ent_tagger()
print(sent)
sentences= getter.sentences
print(sentences[1])

[('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have', 'O'), ('marched', 'O'), ('through', 'O'), ('London', 'B-geo'), ('to', 'O'), ('protest', 'O'), ('the', 'O'), ('war', 'O'), ('in', 'O'), ('Iraq', 'B-geo'), ('and', 'O'), ('demand', 'O'), ('the', 'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops', 'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
[('Iranian', 'B-gpe'), ('officials', 'O'), ('say', 'O'), ('they', 'O'), ('expect', 'O'), ('to', 'O'), ('get', 'O'), ('access', 'O'), ('to', 'O'), ('sealed', 'O'), ('sensitive', 'O'), ('parts', 'O'), ('of', 'O'), ('the', 'O'), ('plant', 'O'), ('Wednesday', 'B-tim'), (',', 'O'), ('after', 'O'), ('an', 'O'), ('IAEA', 'B-org'), ('surveillance', 'O'), ('system', 'O'), ('begins', 'O'), ('functioning', 'O'), ('.', 'O')]


In [45]:
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}


### Step 1a: Formatting the data
Data will need to be

1. Indexed
2. Limited by vocabulary (ie replace tokens with UNKNOWN if they are too rare, come up with a reasonable limit based on your survey of the data and also model performance)
3. Padded

In [30]:
dic= data["Word"].unique()
indexed={}
l=len(dic)
for w in range(l):
    indexed[dic[w]]=w
# print(indexed)    
uniqueindex=[]
for w in range(0,len(data)):
    key= data["Word"][w]
    value= indexed[key]
    uniqueindex.append(value)

data["index"]=uniqueindex
data['freq'] = data.groupby('Word')['Word'].transform('count')
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag,index,freq
0,Sentence: 1,Thousands,NNS,O,0,114
1,Sentence: 1,of,IN,O,1,26354
2,Sentence: 1,demonstrators,NNS,O,2,110
3,Sentence: 1,have,VBP,O,3,5485
4,Sentence: 1,marched,VBN,O,4,65
5,Sentence: 1,through,IN,O,5,515
6,Sentence: 1,London,NNP,B-geo,6,261
7,Sentence: 1,to,TO,O,7,23213
8,Sentence: 1,protest,VB,O,8,237
9,Sentence: 1,the,DT,O,9,52573


In [31]:
max_len= 75

In [32]:
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)
y = [[tag2idx[w[1]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value= tag2idx["O"])
y = [to_categorical(i, num_classes=n_tags) for i in y]

In [18]:
len(X[1])

75

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size=0.3)

### Step 2. Build the model

Here we will build a Bidirectional LSTM-CRF model using the `Bidirectional` function from Keras and `CRF` function from Keras-contrib

**Documentation and source code:**

https://keras.io/layers/wrappers/#bidirectional

https://github.com/keras-team/keras-contrib

Fit your model with a validation split of 0.1, feel free to use as many epochs as you like. Base your predictions both from the input words **and** the tags from previous words like in the POS example.

After building your model, grade your performance on your test set, both by comparing your predicted output to the actual (*at least 3 examples*) and calculate the averaged precision and recall for your tags.

In [34]:
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF

In [35]:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words + 1, output_dim=20,
                  input_length=max_len, mask_zero=True)(input)  # 20-dim embedding
model = Bidirectional(LSTM(units=50, return_sequences=True,
                           recurrent_dropout=0.1))(model)  # variational biLSTM
model = TimeDistributed(Dense(50, activation="relu"))(model)  # a dense layer as suggested by neuralNer
crf = CRF(n_tags)  # CRF layer
out = crf(model)  # output

In [36]:
model = Model(input, out)
model.compile(optimizer='rmsprop', loss=crf.loss_function)
model.fit(X_train, np.array(y_train), epochs=3, batch_size=10, validation_split=0.1)

Train on 30213 samples, validate on 3358 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1a494d39e8>

In [47]:
i=100
p = model.predict(np.array([X_test[i]]))
p = np.argmax(p, axis=-1)
Actual = np.argmax(y_test[i], -1)
print("{:20}{:6} {:}".format("Word", "Actual", "Pred"))
pred_tags=[]
actual_tags=[]
for w,t, pred in zip(X_test[i],Actual,p[0]):
    if w != 0:
        print("{:20}: {:6} {:}".format(words[w-1], tags[t], tags[pred]))
        pred_tags.append(tags[pred])
        actual_tags.append(tags[t])
        

Word                Actual Pred
criminalizes        : O      O
symptom             : O      O
renaming            : O      O
Teams               : O      O
Yekiti              : O      O
arena               : O      O
Suu                 : O      O
chancellor          : O      O
plundered           : O      O
tours               : O      O
hijackers           : O      O
Schatten            : O      O
Arafat              : O      O
J.S.                : O      O
Gouled              : O      O
Yekiti              : O      O
economic            : B-gpe  B-geo
Self-Defense        : O      O
facets              : O      O
Hobart              : O      O
tours               : O      O
quelling            : O      O
Schatten            : O      O
Arafat              : O      O
3,700               : O      O
holiday             : O      O
Paulos              : O      O
tours               : O      O
crewmembers         : O      O
determining         : O      O
DeLay               : O      O


In [48]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(actual_tags,pred_tags)
precision = precision_score(actual_tags,pred_tags, average='weighted')
recall = recall_score(actual_tags,pred_tags, average='weighted')
f1 = f1_score(actual_tags,pred_tags, average='weighted')
    
print("ACCURACY: {:.3f}".format(accuracy))
print("PRECISION: {:.3f}".format(precision))
print("RECALL: {:.3f}".format(recall))
print("F1: {:.3f}".format(f1))

ACCURACY: 0.968
PRECISION: 0.968
RECALL: 0.968
F1: 0.968


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### Step 3. Pick a dataset

Pick a dataset that has short text, similar to the sentences you just tagged. Headlines and tweets are good choices.

https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=news&page=1&pageSize=20&size=all&filetype=all&license=all

In [49]:
df= pd.read_csv("abcnews-date-text.csv")
df.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [50]:
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
from string import punctuation, ascii_lowercase
import regex as re
from tqdm import tqdm

# setup tokenizer
tokenizer = WordPunctTokenizer()

vocab = Counter()

def text_to_wordlist(text, lower=False):
    
    # Tokenize
    text = tokenizer.tokenize(text)
    
   
    # Return a list of words
    vocab.update(text)
    return text

def process_comments(list_sentences, lower=False):
    comments = []
    for text in tqdm(list_sentences):
        txt = text_to_wordlist(text, lower=lower)
        comments.append(txt)
    return comments


list_sentences = list(df["headline_text"].fillna("NAN_WORD").values)


comments = process_comments(list_sentences, lower=True)

100%|██████████| 1103665/1103665 [00:11<00:00, 93972.19it/s]


In [51]:
flat_list =[item for sublist in comments for item in sublist]
print(flat_list[0:25])

new=flat_list[15:20]
n_words = len(flat_list)
print(n_words)
max_len= 75
word2idx = {w: i for i, w in enumerate(words)}

x_test_sent = pad_sequences(sequences=[[word2idx.get(w, 0) for w in new]],
                            padding="post", value=0, maxlen=max_len)

['aba', 'decides', 'against', 'community', 'broadcasting', 'licence', 'act', 'fire', 'witnesses', 'must', 'be', 'aware', 'of', 'defamation', 'a', 'g', 'calls', 'for', 'infrastructure', 'protection', 'summit', 'air', 'nz', 'staff', 'in']
7105908


In [57]:
p1 = model.predict(np.array([x_test_sent[0]]))
p1 = np.argmax(p1, axis=-1)
print("{:15}||{}".format("Word", "Prediction"))
word_new=[]
pred_new=[]
for w, pred in zip(flat_list, p1[0]):
    print("{:15}: {:5}".format(w, tags[pred]))
    word_new.append(w)
    pred_new.append(tags[pred])
    

Word           ||Prediction
aba            : B-geo
decides        : O    
against        : O    
community      : O    
broadcasting   : O    
licence        : B-geo
act            : B-geo
fire           : B-geo
witnesses      : B-geo
must           : B-geo
be             : B-geo
aware          : B-geo
of             : B-geo
defamation     : B-geo
a              : B-geo
g              : B-geo
calls          : B-geo
for            : B-geo
infrastructure : B-geo
protection     : B-geo
summit         : B-geo
air            : B-geo
nz             : B-geo
staff          : B-geo
in             : B-geo
aust           : B-geo
strike         : B-geo
for            : B-geo
pay            : B-geo
rise           : B-geo
air            : B-geo
nz             : B-geo
strike         : B-geo
to             : B-geo
affect         : B-geo
australian     : B-geo
travellers     : B-geo
ambitious      : B-geo
olsson         : B-geo
wins           : B-geo
triple         : B-geo
jump           : B-geo
antic 

### Step 4. Tag your new data!

Create a modification to the **ent_tagger function** that combines words and tags from your original dataset. Now allow the function to also load new text from your new data set, and output the tags predicted from your trained model alongside the text. Make your function load five random texts from your data and output the tagged text.

In [53]:
# def ent_tagger_new(self):
list=[]
for i in range(0,len(word_new)):
    list.append((word_new[i],pred_new[i]))
print(list)        

[('aba', 'B-geo'), ('decides', 'O'), ('against', 'O'), ('community', 'O'), ('broadcasting', 'O'), ('licence', 'B-geo'), ('act', 'B-geo'), ('fire', 'B-geo'), ('witnesses', 'B-geo'), ('must', 'B-geo'), ('be', 'B-geo'), ('aware', 'B-geo'), ('of', 'B-geo'), ('defamation', 'B-geo'), ('a', 'B-geo'), ('g', 'B-geo'), ('calls', 'B-geo'), ('for', 'B-geo'), ('infrastructure', 'B-geo'), ('protection', 'B-geo'), ('summit', 'B-geo'), ('air', 'B-geo'), ('nz', 'B-geo'), ('staff', 'B-geo'), ('in', 'B-geo'), ('aust', 'B-geo'), ('strike', 'B-geo'), ('for', 'B-geo'), ('pay', 'B-geo'), ('rise', 'B-geo'), ('air', 'B-geo'), ('nz', 'B-geo'), ('strike', 'B-geo'), ('to', 'B-geo'), ('affect', 'B-geo'), ('australian', 'B-geo'), ('travellers', 'B-geo'), ('ambitious', 'B-geo'), ('olsson', 'B-geo'), ('wins', 'B-geo'), ('triple', 'B-geo'), ('jump', 'B-geo'), ('antic', 'B-geo'), ('delighted', 'B-geo'), ('with', 'B-geo'), ('record', 'B-geo'), ('breaking', 'B-geo'), ('barca', 'B-geo'), ('aussie', 'B-geo'), ('qualifier',