# Named Entity Recognition (NER) using BiLSTM-LSTM Model
In this notebook, we explore the task of Named Entity Recognition (NER) and build a BiLSTM-LSTM model for it using the TensorFlow library. NER is a natural language processing (NLP) task that identifies and classifies named entities in text, such as person names, locations, and organizations.

## What is Named Entity Recognition (NER)?
Named Entity Recognition (NER) is a subtask of information extraction that aims to locate and classify named entities in text into predefined categories. The named entities can be proper nouns, such as names of people, organizations, locations, or expressions of time, quantities, monetary values, percentages, etc. NER is widely used in various NLP applications, including question answering, information retrieval, machine translation, and sentiment analysis.

### Dataset
For this task, we will use a dataset called ner_dataset.csv. This dataset contains sentences with corresponding words and their named entity tags. The data has the following columns: Sentence #, Word, POS (Part of Speech), and Tag.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model, pad_sequences, to_categorical
from sklearn.model_selection import train_test_split
import spacy
from spacy import displacy
from itertools import chain
from numpy.random import seed

# set seeds for reproducibility
seed(1)
tensorflow.random.set_seed(2)

# load data
data = pd.read_csv('ner_dataset.csv', encoding= 'unicode_escape')
data.head()

|   | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 |  | of | IN | O |
| 2 |  | demonstrators | NNS | O |
| 3 |  | have | VBP | O |
| 4 |  | marched | VBN | O |


### Data Preparation
Before we can train our NER model, we need to preprocess and prepare the data. This involves converting words and tags into numerical representations, padding sequences, and splitting the data into training, validation, and testing sets.

To prepare the data, we'll define a few helper functions. Let's go through them step by step:

### 1. Get Unique Words and Tags
We'll start by defining a function that extracts unique words and tags from the data:

In [2]:
# function to get unique words and tags from data
def get_dict_map(data, token_or_tag):
    tok2idx = {}
    idx2tok = {}
    
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    
    idx2tok = {idx:tok for  idx, tok in enumerate(vocab)}
    tok2idx = {tok:idx for  idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

This function takes the data and the type of token (word or tag) as inputs and returns dictionaries to map tokens to indices and vice versa.
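
As a minimal sketch of the same idea on a toy token list (the sentence below is made up and not taken from ner_dataset.csv), the two dictionaries can be built and used like this. One deliberate difference from the function above: the sketch sorts the vocabulary, because the order of a Python set of strings can change between interpreter runs, and the NumPy and TensorFlow seeds set earlier do not control it.

```python
# Toy token list standing in for data['Word']; not taken from ner_dataset.csv.
words = ["London", "is", "the", "capital", "of", "the", "UK"]

# sorted() is an addition in this sketch: it makes the index assignment
# deterministic, whereas list(set(...)) may order the vocabulary differently
# from one run to the next.
vocab = sorted(set(words))

tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
idx2tok = {idx: tok for idx, tok in enumerate(vocab)}

# Encode the sentence as indices, then decode it back to check the round trip.
encoded = [tok2idx[w] for w in words]
decoded = [idx2tok[i] for i in encoded]

print(tok2idx)
print(encoded)
print(decoded == words)
```

The repeated word "the" maps to a single index, which is why the vocabulary is smaller than the sentence.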

### 2. Map Words and Tags to Indices
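Next, we'll map words and tags to their corresponding indices using the dictionaries obtained from the previous step. We add two new columns to the data DataFrame: Word_idx and Tag_idx, which contain the indices for words and tags, respectively. We then forward-fill missing values, which propagates each sentence label in the "Sentence #" column from the first row of a sentence to the rows that follow it.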

In [3]:
token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
# forward-fill so every row carries its sentence label
# (fillna(method='ffill') is deprecated in recent pandas versions)
data_fillna = data.ffill()
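
The forward fill matters because, as the data.head() output shows, "Sentence #" is set only on the first row of each sentence. The sketch below reproduces that behaviour in plain Python; the first three words come from the preview above, while the second sentence is hypothetical and only there to show where the label changes.

```python
# Toy rows in the shape of data.head(): (sentence label or None, word).
rows = [
    ("Sentence: 1", "Thousands"),
    (None, "of"),
    (None, "demonstrators"),
    ("Sentence: 2", "Families"),   # hypothetical second sentence
    (None, "gathered"),
]

def forward_fill(labels):
    """Replace each missing label with the most recent non-missing one."""
    filled, last = [], None
    for label in labels:
        if label is not None:
            last = label
        filled.append(last)
    return filled

sentence_ids = forward_fill([label for label, _ in rows])
print(sentence_ids)
```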

### 3. Group Data by Sentences
Since our model will process sentences, we need to group the data by sentences. We'll use the groupby function from pandas to group the data based on the "Sentence #" column:

In [4]:
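# Group by sentence and collect each column into a list per sentence.
# The columns are selected with a list (double brackets); selecting them with
# a bare tuple is deprecated and fails in recent pandas versions.
data_group = data_fillna.groupby(['Sentence #'], as_index=False)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))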


The grouped data is stored in the data_group DataFrame, which contains the sentences, words, parts of speech (POS), tags, word indices, and tag indices.
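
To make the grouping step concrete, here is a rough plain-Python equivalent on a handful of illustrative rows (the index values are invented). It collects the word and tag indices of each sentence into lists, which is what the aggregation above does per column.

```python
from collections import defaultdict

# Toy (sentence id, word index, tag index) rows after forward filling.
# All values are illustrative.
rows = [
    ("Sentence: 1", 12, 0),
    ("Sentence: 1", 7, 0),
    ("Sentence: 1", 31, 2),
    ("Sentence: 2", 5, 1),
    ("Sentence: 2", 7, 0),
]

word_idx_by_sentence = defaultdict(list)
tag_idx_by_sentence = defaultdict(list)
for sentence_id, word_idx, tag_idx in rows:
    # Rows are appended in file order, so word order within a sentence is kept.
    word_idx_by_sentence[sentence_id].append(word_idx)
    tag_idx_by_sentence[sentence_id].append(tag_idx)

print(dict(word_idx_by_sentence))
print(dict(tag_idx_by_sentence))
```

Each sentence ends up with one list of word indices and one list of tag indices of the same length, which is the shape the padding step expects.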

### 4. Pad and Split the Data
Next, we'll define a function that pads the token sequences and splits the data into training, validation, and testing sets:

In [5]:
def get_pad_train_test_val(data_group, data):

    # vocabulary size and number of distinct tags
    n_token = len(list(set(data['Word'].to_list())))
    n_tag = len(list(set(data['Tag'].to_list())))

    # Pad tokens (X var) with index n_token, which no real word uses;
    # the Embedding layer leaves room for it via input_dim = n_token + 1
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32', padding='post', value=n_token)

    #Pad Tags (y var) and convert it into one hot encoding
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, dtype='int32', padding='post', value= tag2idx["O"])
    n_tags = len(tag2idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]
    
    #Split train, test and validation set
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_,tags_,test_size = 0.25,train_size =0.75, random_state=2020)

    print(
        'train_tokens length:', len(train_tokens),
        '\ntrain_tags length:', len(train_tags),
        '\ntest_tokens length:', len(test_tokens),
        '\ntest_tags:', len(test_tags),
        '\nval_tokens:', len(val_tokens),
        '\nval_tags:', len(val_tags),
    )
    
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, data)

train_tokens length: 32372
train_tags length: 32372
test_tokens length: 4796
test_tags: 4796
val_tokens: 10791
val_tags: 10791


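This function takes the grouped data and the original data as inputs, pads the token and tag sequences to the length of the longest sentence, one-hot encodes the tags, and returns the training, validation, and test splits. The two successive splits (90/10, then 75/25 of the remainder) yield roughly 67.5% training, 22.5% validation, and 10% test data, which matches the sizes printed above.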
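
The padding and one-hot steps can be sketched without Keras as follows. The helper names `pad_post` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='post')` and `to_categorical`, and the toy sizes are invented. The sketch assumes the padding index is `n_token`, one past the largest word index, which is the slot that `input_dim = n_token + 1` leaves free in the embedding layer.

```python
def pad_post(seqs, maxlen, value):
    """Right-pad each sequence with `value` up to `maxlen` (post-padding)."""
    return [seq + [value] * (maxlen - len(seq)) for seq in seqs]

def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

# Hypothetical toy sizes: a 6-word vocabulary and 3 tags with "O" at index 0.
n_token, n_tags, o_idx = 6, 3, 0
tokens = [[4, 1, 5], [2, 3]]
tags = [[0, 0, 2], [1, 0]]

maxlen = max(len(s) for s in tokens)

# Word sequences are padded with the reserved index n_token.
pad_tokens = pad_post(tokens, maxlen, value=n_token)

# Tag sequences are padded with the "O" tag, then one-hot encoded per token.
pad_tags = [[one_hot(t, n_tags) for t in seq]
            for seq in pad_post(tags, maxlen, value=o_idx)]

print(pad_tokens)
print(pad_tags[1])
```

After this step every sentence has the same length, and each tag position holds a vector of length `n_tags`, which is the target shape the time-distributed output layer is trained against.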

### 5. Prepare Model Input Parameters
Finally, we'll define some input parameters for our NER model:

In [6]:
input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])
n_tags = len(tag2idx)

Here, input_dim represents the vocabulary size (number of unique words plus one for padding), output_dim represents the dimensionality of the word embeddings, input_length represents the maximum sequence length, and n_tags represents the number of unique tags.

Now that we have prepared the data, let's move on to building the BiLSTM-LSTM model.

## Build the BiLSTM-LSTM Model
Our NER model will consist of an embedding layer, a bidirectional LSTM layer, a unidirectional LSTM layer, and a time-distributed dense layer. We'll use the Keras Sequential API to build the model.

Let's define a function to create the model:

In [7]:
def get_bilstm_lstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode = 'concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

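    # Add TimeDistributed layer; softmax gives one probability distribution
    # over the tags per token, which is what categorical cross-entropy expects
    model.add(TimeDistributed(Dense(n_tags, activation="softmax")))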

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

In this function, we first add an embedding layer that converts the input word indices into dense word embeddings. Then, we add a bidirectional LSTM layer to capture the contextual information from both directions. We follow it with a unidirectional LSTM layer for further sequence modeling. Finally, we add a time-distributed dense layer that predicts the tags for each word in the input sequence. The model is compiled with the categorical cross-entropy loss function and the Adam optimizer.
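
The layer sizes described above determine the parameter counts that `model.summary()` reports in the training section. As a rough check, assuming the standard LSTM layout of four gates that each have input weights, recurrent weights, and a bias, the counts can be reproduced by hand. The vocabulary size used here is inferred from the embedding row of the summary (2251456 / 64) and is not stated elsewhere in the text.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

# Values read off the model summary: 64-dim embeddings, 64 LSTM units, 17 tags.
# vocab_size is inferred from the embedding row, not given in the text.
vocab_size, embed_dim, units, n_tags = 35179, 64, 64, 17

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    # Two LSTMs (forward and backward), each reading the 64-dim embeddings.
    "bidirectional": 2 * lstm_params(embed_dim, units),
    # The second LSTM reads the concatenated 128-dim BiLSTM output.
    "lstm": lstm_params(2 * units, units),
    "time_distributed": dense_params(units, n_tags),
}
total = sum(counts.values())

print(counts)
print(total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.
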

### Train the Model
With the model architecture defined, we can now train the model on our prepared data. We'll define a function that trains the model for 25 epochs, calling `fit` once per epoch so that the training loss can be recorded after each one:

In [8]:
def train_model(X, y, model):
    loss = list()
    for i in range(25):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

results = pd.DataFrame()
model_bilstm_lstm = get_bilstm_lstm_model()
plot_model(model_bilstm_lstm)
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 104, 64)           2251456   
                                                                 
 bidirectional (Bidirectiona  (None, 104, 128)         66048     
 l)                                                              
                                                                 
 lstm_1 (LSTM)               (None, 104, 64)           49408     
                                                                 
 time_distributed (TimeDistr  (None, 104, 17)          1105      
 ibuted)                                                         
                                                                 
Total params: 2,368,017
Trainable params: 2,368,017
Non-trainable params: 0
_________________________________________________________________

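In this function, we train the model for 25 epochs by calling `fit` once per epoch, and we collect the training loss of each epoch in a list for further analysis. Note that `validation_split=0.2` holds out the last 20% of the training arrays for validation, so the separate val_tokens and val_tags created during data preparation are not used here.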
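
To get a feel for what these settings mean in practice, the following back-of-the-envelope sketch works out how many sentences are actually fitted on and how many weight updates the loop performs. It assumes that Keras floors the split point when applying `validation_split`; the sentence count comes from the split sizes printed earlier.

```python
import math

n_train_sentences = 32372   # from the printed split sizes
validation_split = 0.2
batch_size = 1000
n_epochs = 25

# Assumption: Keras holds out the tail of the arrays and floors the split point.
split_at = math.floor(n_train_sentences * (1.0 - validation_split))
n_fit = split_at
n_val = n_train_sentences - split_at

# The last batch of each epoch is smaller than batch_size.
steps_per_epoch = math.ceil(n_fit / batch_size)
total_updates = steps_per_epoch * n_epochs

print(n_fit, n_val, steps_per_epoch, total_updates)
```

With a batch size of 1000 the model sees only a few dozen gradient updates per epoch, which is one reason a run of this kind needs many epochs.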

### Testing the Model
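To see what NER output looks like on real text, let's define a sample text and visualize its named entities using the spacy library. Note that this cell uses spaCy's pretrained pipeline, not the BiLSTM-LSTM model trained above: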

In [9]:
nlp = spacy.load('en_core_web_sm')
text = nlp('London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million. \n It stands on the River Thames in south-east England at the head of a 50-mile (80 km) estuary down to the North Sea, and has been a major settlement for two millennia. \n The City of London, its ancient core and financial centre, was founded by the Romans as Londinium and retains its medieval boundaries. \n The City of Westminster, to the west of the City of London, has for centuries hosted the national government and parliament. \n Since the 19th century, the name "London" has also referred to the metropolis around this core, historically split between the counties of Middlesex, Essex, Surrey, Kent, and Hertfordshire, which since 1965 has largely comprised Greater London, which is governed by 33 local authorities and the Greater London Authority.')
displacy.render(text, style = 'ent', jupyter=True)

This code uses the en_core_web_sm model from spacy to tokenize the text and extract named entities. The displacy.render function is used to visualize the named entities in the text.
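
To inspect the BiLSTM-LSTM model's own predictions instead, the per-token score vectors returned by `model.predict` would have to be mapped back to tag labels through `idx2tag`. The sketch below shows only that decoding step. The three-tag mapping, the tokens, and the scores are all made up for illustration, and `decode_tags` is a hypothetical helper, not part of the notebook.

```python
def decode_tags(prob_rows, idx2tag):
    """Pick the highest-scoring tag index per token and map it to its label."""
    decoded = []
    for probs in prob_rows:
        best_idx = max(range(len(probs)), key=probs.__getitem__)
        decoded.append(idx2tag[best_idx])
    return decoded

# Hypothetical three-tag mapping and made-up per-token scores.
idx2tag = {0: "O", 1: "B-geo", 2: "B-per"}
tokens = ["London", "is", "large"]
prob_rows = [
    [0.10, 0.85, 0.05],
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
]

predicted = decode_tags(prob_rows, idx2tag)
print(list(zip(tokens, predicted)))
```

On real model output, positions beyond the end of the sentence are padding and should be dropped before the tags are compared with the gold labels.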

## Conclusion
We explored the task of Named Entity Recognition (NER) and built a BiLSTM-LSTM model for NER using the TensorFlow library. We covered the data preparation steps, the model architecture, and training, and we visualized named entities in a sample text with spaCy's pretrained pipeline. NER is a powerful technique in NLP and can be applied to various real-world applications for extracting structured information from unstructured text data.