<a href="https://colab.research.google.com/github/iduryodhanrao/ml-tests/blob/main/DL_NER_with_LSTM_and_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**NER prediction with LSTM and Transformers**

For this assignment we will explore the use of LSTMs and transformers for named entity recognition (NER). The task here is word-level tagging, i.e., classifying each word as a business, a place, etc.

First, download the ner_dataset.csv file from Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv) and upload it to your notebook session; we will use it for our experiments.

Import the libraries we will need.

In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from itertools import chain


Let's look at the structure of the data

In [None]:
data = pd.read_csv('ner_dataset.csv', encoding= 'unicode_escape')
data.head(10)


We next need to create a mapping between tokens, tags, and ids. Each token should map to a unique id, and each tag should map to a unique class.
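
A minimal sketch of such a mapping, assuming the tokens and tags are available as plain Python lists (in this notebook they would come from the `Word` and `Tag` columns). The helper name `build_index` is hypothetical and the sample rows are made up for illustration.

```python
def build_index(items):
    """Map each distinct item to a unique integer id (hypothetical helper)."""
    # Sort so the ids are reproducible from run to run.
    vocab = sorted(set(items))
    item2idx = {item: idx for idx, item in enumerate(vocab)}
    idx2item = {idx: item for item, idx in item2idx.items()}
    return item2idx, idx2item

# Toy rows standing in for the 'Word' and 'Tag' columns of ner_dataset.csv.
words = ["Thousands", "of", "demonstrators", "in", "Hyde", "Park", "of"]
tags = ["O", "O", "O", "O", "B-geo", "I-geo", "O"]

token2idx, idx2token = build_index(words)
tag2idx, idx2tag = build_index(tags)

# Replace every word and tag by its id.
word_ids = [token2idx[w] for w in words]
tag_ids = [tag2idx[t] for t in tags]
```

With the DataFrame loaded earlier, the same dictionaries could populate `Word_idx` and `Tag_idx` columns via `data['Word'].map(token2idx)` and `data['Tag'].map(tag2idx)`.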

Now you might have noticed that each sentence is split into multiple rows.

We need to transform this data into sequences of words and tags.

In [None]:
# Forward-fill the 'Sentence #' column, which is only set on the first row of each sentence
data_fillna = data.ffill(axis=0)
# Group rows by sentence and collect each column into a list
# ('Word_idx' and 'Tag_idx' must already have been added from the token/tag id mappings)
data_group = data_fillna.groupby(
    ['Sentence #'], as_index=False
)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(list)
# Visualise data
data_group.head(10)

Next we split the data into training and testing

In [None]:
# Enter your code here
# Sample data: one row per (sentence, word, tag)
# ['thousands of demonstrators were in Hyde Park', 'thousands', 'O']
# ['thousands of demonstrators were in Hyde Park', 'of', 'O']
# ['thousands of demonstrators were in Hyde Park', 'demonstrators', 'O']
#
# ['thousands of demonstrators were in Hyde Park', 'Hyde', 'B-geo']
# ['thousands of demonstrators were in Hyde Park', 'Park', 'I-geo']

# X, y
# ['thousands of demonstrators were in Hyde Park', 'thousands'] -> ['O']
# ['thousands of demonstrators were in Hyde Park', 'Hyde'] -> ['B-geo']

# Suppose 100 samples of class 'O', 10 of 'B-geo', 10 of 'I-geo'
# y can take one of M values (M classes):
# a multi-class classification problem
# Class imbalance: 'O' is by far the most common tag
# One option: train M binary classifiers, each predicting the i-th class vs. all others
# Training data for class 'B-geo': 10 positives vs. 110 negatives,
# so the majority class needs to be downsampled
[0.2, 0.1, 0.7]
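
One way to do the split, sketched under the assumption that the data has already been grouped into one record per sentence: shuffle the sentences with a fixed seed and hold out a fraction for testing. Splitting by sentence rather than by word keeps all tokens of a sentence on the same side. The function name `split_sentences`, the 80/20 ratio, and the toy records are illustrative choices, not part of the assignment.

```python
import random

def split_sentences(records, test_fraction=0.2, seed=0):
    """Shuffle sentence-level records and split them into train and test lists."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy records: (list of words, list of tags) per sentence.
records = [(["w%d" % i, "x%d" % i], ["O", "O"]) for i in range(10)]
train_data, test_data = split_sentences(records)
```

Any downsampling of the majority class would then be applied to the training portion only, so that the test set keeps the natural tag distribution.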

In [None]:
# Define Metrics
# Class specific PR => PR, ROC AUC
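
Because `O` dominates the labels, overall accuracy says little, and per-class precision and recall are more informative. The sketch below computes them from two lists of tag strings, treating each class as "this tag vs. the rest". The function name `per_class_precision_recall` and the toy labels are hypothetical; ROC AUC is not covered here because it needs predicted scores rather than hard labels.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {tag: (precision, recall)} computed one class vs. the rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true tag was different
            fn[t] += 1  # missed an instance of t
    scores = {}
    for tag in set(y_true) | set(y_pred):
        predicted = tp[tag] + fp[tag]  # how often the tag was predicted
        actual = tp[tag] + fn[tag]     # how often the tag truly occurs
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        scores[tag] = (precision, recall)
    return scores

y_true = ["O", "O", "O", "B-geo", "I-geo", "O"]
y_pred = ["O", "O", "B-geo", "B-geo", "O", "O"]
scores = per_class_precision_recall(y_true, y_pred)
```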

# Encoding:
## Goal:
'thousands of demonstrators were in Hyde park' =>
[0.2, 1.1, 3.0, 0.4, ...0.9]

## Approaches:


1.   Word encoding: build a vocabulary and assign each word an integer id.
Typical size: 10K - 20K words.

thousands => 345
thousand => 999
demonstrators => 12
demonstrate => 671
demonstrating => 910

Each token is then one-hot encoded, so a 10-token sentence becomes a
10 * 10K sparse matrix.

2.   Word piece encoding: split words into sub-word pieces, so related words share pieces.
thousand => thou san d' '
thousands => thou san ds' '
Then apply the same one-hot encoding to the pieces.

hundreds

3. Byte pair encoding
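
Approach 1 can be sketched in a few lines: build a vocabulary from the training sentences, map each word to its id, and expand each id into a one-hot row. The helper names, the `<unk>` entry for out-of-vocabulary words, and the tiny corpus are assumptions made for illustration; a real vocabulary would be far larger.

```python
def build_vocab(sentences):
    """Assign an integer id to every distinct word; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of word ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

def one_hot(ids, vocab_size):
    """Expand each id into a row with a single 1 at that id's position."""
    return [[1 if col == i else 0 for col in range(vocab_size)] for i in ids]

corpus = ["thousands of demonstrators were in Hyde Park"]
vocab = build_vocab(corpus)
ids = encode("thousands were in London", vocab)  # 'London' is out of vocabulary
matrix = one_hot(ids, len(vocab))
```

Sub-word schemes such as word piece or byte pair encoding would replace the `split()` call with a sub-word tokenizer; the id lookup and one-hot steps stay the same.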




In [None]:
['thousands of demonstrators were in Hyde park', 'thousands'] ['O'] =>
[10, 31, 671, ..., 0]
[10, 31, 671, ..., 2]

Next, create the LSTM model.

In [None]:
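# Placeholder lookup table: word -> embedding vector (swap in real pre-trained vectors)
EMBEDDING_DIM = 2
pre_loaded_embeddings = {'sentence': [0.1, 0.3]}

def get_embedding(sentence):
  """Return a (seq_len, EMBEDDING_DIM) float tensor, one row per word."""
  # Unknown words map to a zero vector with the same dimension as known ones
  unknown_word_embedding = [0.0] * EMBEDDING_DIM
  rows = [pre_loaded_embeddings.get(word, unknown_word_embedding)
          for word in sentence.split()]
  return torch.tensor(rows, dtype=torch.float32)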

In [None]:
# Multi-class classification model: predicts the tag of one word in a sentence
class LSTMNerModel(nn.Module):
  def __init__(self, embedding_dim, num_tags, hidden_dim=64):
    super().__init__()
    # Fixed (non-trainable) lookup: sentence -> (seq_len, embedding_dim) tensor
    self.embedding = get_embedding
    self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)
    self.linear = nn.Linear(in_features=hidden_dim, out_features=num_tags)
    self.soft_max = nn.functional.log_softmax

  def forward(self, sentence, word_index):
    embeddings = self.embedding(sentence)
    # Unbatched input, so lstm_out has shape (seq_len, hidden_dim)
    lstm_out, lstm_hidden = self.lstm(embeddings)
    # decide which lstm output to use
    # one option: pick out the output at word_index, keeping a batch dimension of 1
    y = lstm_out[word_index].unsqueeze(0)
    y = self.linear(y)                    # (1, num_tags)
    tag_scores = self.soft_max(y, dim=1)  # log-probabilities over the tags
    return tag_scores

Define the loss function for the task

In [None]:
# Enter code here
# CELoss => Binary CE Loss
learning_rate = 0.001
epochs = 100
model = LSTMNerModel(...)  # fill in embedding_dim and num_tags
# The model already returns log-probabilities, so pair it with NLLLoss
# (CrossEntropyLoss would apply log_softmax a second time)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# tags holds the class index of the target tag, shape (1,):
# e.g. torch.tensor([2]) rather than the one-hot [0, 0, 1]
for epoch in range(epochs):
  for sentence, word_position, tags in training_data:
    # Call the module itself rather than .forward() so that hooks run
    tag_scores = model(sentence, word_position)
    loss = loss_function(tag_scores, tags)

    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
    # compute metrics
    # plot convergence

Train the model. First, find some pre-trained embeddings to help with the task; for example, GloVe embeddings are available at https://nlp.stanford.edu/projects/glove/. The loader below expects the file glove.6B.100d.txt.

In [None]:
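def load_embeddings(path="glove.6B.100d.txt"):
  """Read GloVe vectors into a dict mapping word -> array of shape (1, dim)."""
  w2e = {}
  # GloVe files are UTF-8 text: one word per line followed by its vector components
  with open(path, "r", encoding="utf-8") as f:
    for line in f:
      s = line.rstrip().split(" ")
      word = s[0]
      # Store each vector as a (1, dim) row
      w2e[word] = np.asarray(s[1:], dtype=np.float64).reshape(1, -1)

  return w2e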

w2e = load_embeddings()
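
Once `w2e` is loaded, a sentence can be turned into the `(seq_len, embedding_dim)` input that an LSTM expects by looking up each word and stacking the vectors. The sketch below uses a tiny stand-in dictionary so it runs without the GloVe file; the function name `sentence_to_matrix`, the lowercasing step (the `glove.6B` vocabulary is lowercase), and the zero vector for unknown words are assumptions.

```python
import numpy as np

def sentence_to_matrix(sentence, w2e, embedding_dim):
    """Stack per-word vectors into a (seq_len, embedding_dim) array."""
    unknown = np.zeros(embedding_dim)  # fallback for out-of-vocabulary words
    # np.ravel flattens the stored (1, dim) rows into plain (dim,) vectors.
    rows = [np.ravel(w2e.get(word.lower(), unknown)) for word in sentence.split()]
    return np.stack(rows)

# Stand-in for the dictionary returned by load_embeddings(), with 3-d vectors.
toy_w2e = {
    "hyde": np.array([[0.1, 0.2, 0.3]]),  # shape (1, 3), as load_embeddings stores them
    "park": np.array([[0.4, 0.5, 0.6]]),
}
matrix = sentence_to_matrix("in Hyde Park", toy_w2e, embedding_dim=3)
```

For the model, `torch.from_numpy(matrix).float()` would convert the result into a float tensor.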



In [None]:
# Enter Code here