# Named Entity Recognition (NER)

Named Entity Recognition (NER) is an important task in natural language processing. In this assignment you will implement a neural network model for NER, using an approach called the Sliding Window Neural Network. The dataset is composed of sentences. The dataframe already has each word parsed in one column and the corresponding label (entity) in the second column. We will build a "window" model: the idea is to use a 5-word window to predict the named entity of the middle word. Here is the first observation in our data:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from ner import *
import pandas as pd
from sklearn.model_selection import train_test_split

In [14]:
data = pd.read_csv("data/Genia4ERtask1.iob2", sep="\t", header=None, names=["word", "label"])

FileNotFoundError: [Errno 2] No such file or directory: 'data/Genia4ERtask1.iob2'

In [3]:
data.head()

NameError: name 'data' is not defined

In [4]:
tiny_data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])

The second observation is the 5 words starting with 'gene', and its label is the entity for the word 'and'. We have 5 features (categorical variables), which are words. We will use a word embedding to represent each value of the categorical features. For each observation, we concatenate the 5 word embeddings for that observation. The vector of concatenated embeddings is fed to a linear layer.
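The embed, concatenate, and linear-layer steps can be sketched in plain NumPy. This is only an illustration with random weights and made-up sizes, not the `NERModel` from `ner.py` (which is not shown here); the names `embedding`, `W`, `b`, and `forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the notebook's settings).
vocab_size, emb_size, n_class, window = 50, 4, 7, 5

# Embedding table: one row of length emb_size per word index.
embedding = rng.normal(size=(vocab_size, emb_size))
# Linear layer mapping the concatenated window to one score per class.
W = rng.normal(size=(window * emb_size, n_class))
b = np.zeros(n_class)

def forward(x):
    """x: integer array of shape (batch, window) holding word indices."""
    emb = embedding[x]                   # (batch, window, emb_size)
    flat = emb.reshape(x.shape[0], -1)   # (batch, window * emb_size)
    return flat @ W + b                  # (batch, n_class)

x = np.array([[11, 30, 26, 18, 13]])     # one encoded 5-word window
scores = forward(x)
print(scores.shape)
```

Each row of `scores` holds one unnormalized score per entity class for the middle word of the window. In a trained model the embedding table and the linear weights would be learned rather than drawn at random.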

## Split dataset

In [5]:
data = tiny_data

In [6]:
N = int(data.shape[0]*0.8)
N

77

In [7]:
train_df = data.iloc[:N,].copy()
valid_df = data.iloc[N:,].copy()

In [8]:
train_df.shape, valid_df.shape

((77, 2), (20, 2))

## Word and label to index mapping

In [9]:
vocab2index = label_encoding(train_df["word"].values)
label2index = label_encoding(train_df["label"].values)

In [10]:
len(label2index)

7

In [11]:
label2index

{'B-DNA': 0,
 'B-cell_type': 1,
 'B-protein': 2,
 'I-DNA': 3,
 'I-cell_type': 4,
 'I-protein': 5,
 'O': 6}
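The mapping above is consistent with assigning indices to the distinct values in sorted order. A minimal sketch under that assumption follows; the real `label_encoding` lives in `ner.py` and may differ, and `label_encoding_sketch` is a hypothetical name.

```python
def label_encoding_sketch(values):
    """Map each distinct value to an integer index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

# A toy label column containing the seven labels seen above.
labels = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein",
          "B-cell_type", "I-cell_type", "O"]
label2index_sketch = label_encoding_sketch(labels)
print(label2index_sketch)
```

Python compares strings by code point, so uppercase letters sort before lowercase ones. That is why `'B-DNA'` comes before `'B-cell_type'`.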

## Label Encoding categorical variables

In [12]:
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)

In [13]:
actual = np.array([17, 53, 31, 25, 44, 41, 32,  0, 11,  1])
assert(np.array_equal(tiny_data_enc.iloc[30:40].word.values, actual))

## Dataset definition

In [14]:
tiny_ds = NERDataset(tiny_data_enc)

In [15]:
len(tiny_ds)

93

In [16]:
x, y = tiny_ds[0]
x,y

(array([11, 30, 26, 18, 13]), 6)

In [17]:
x, y = tiny_ds[0]
assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
assert(y == 6)
assert(len(tiny_ds) == 93)
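The length of 93 follows from the window size: the tiny dataset has 97 rows (77 + 20 in the split above), and a 5-word window fits at 97 - 5 + 1 = 93 positions. The sketch below shows one way to build such windows from already-encoded columns. It is an assumption about how `NERDataset` might index the data, not its actual code, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(word_ids, label_ids, window=5):
    """Return (x, y) pairs: x is a window of word ids, y the middle word's label."""
    half = window // 2
    samples = []
    for start in range(len(word_ids) - window + 1):
        x = word_ids[start:start + window]
        y = label_ids[start + half]
        samples.append((x, y))
    return samples

# Toy encoded columns with the same number of rows as the tiny dataset.
word_ids = list(range(97))
label_ids = [i % 7 for i in range(97)]

samples = sliding_windows(word_ids, label_ids)
print(len(samples))  # 93
print(samples[0])    # ([0, 1, 2, 3, 4], 2)
```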

## Testing

In [18]:
# encoding datasets
train_df_enc = dataset_encoding(train_df, vocab2index, label2index)
valid_df_enc = dataset_encoding(valid_df, vocab2index, label2index)
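Because `vocab2index` is built from the training split only, the validation split may contain words it has never seen. One common way to handle this, sketched here as an assumption about what `dataset_encoding` does, is to reserve one extra index for unknown words. That would also explain the `+1` in `vocab_size = len(vocab2index)+1`. The helper name `encode_words` and the toy vocabulary are hypothetical.

```python
def encode_words(words, vocab2index):
    """Encode words as indices; unseen words share one reserved index."""
    unknown_index = len(vocab2index)  # assumed reserved slot after all known words
    return [vocab2index.get(word, unknown_index) for word in words]

toy_vocab2index = {"IL-2": 0, "and": 1, "gene": 2}
encoded = encode_words(["gene", "tyrosine", "IL-2"], toy_vocab2index)
print(encoded)  # [2, 3, 0]
```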

In [19]:
# creating datasets
train_ds =  NERDataset(train_df_enc)
valid_ds = NERDataset(valid_df_enc)

# dataloaders
batch_size = 10000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

In [20]:
vocab_size = len(vocab2index)+1
n_class = len(label2index)
emb_size = 100

model = NERModel(vocab_size, n_class, emb_size)
optimizer = get_optimizer(model, lr = 0.01, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  2.226 val loss 1.594 and accuracy 0.562
train loss  1.124 val loss 1.395 and accuracy 0.500
train loss  0.503 val loss 1.346 and accuracy 0.562
train loss  0.216 val loss 1.369 and accuracy 0.562
train loss  0.090 val loss 1.424 and accuracy 0.500
train loss  0.036 val loss 1.490 and accuracy 0.500
train loss  0.015 val loss 1.556 and accuracy 0.500
train loss  0.008 val loss 1.618 and accuracy 0.500
train loss  0.004 val loss 1.676 and accuracy 0.500
train loss  0.003 val loss 1.729 and accuracy 0.500


In [21]:
optimizer = get_optimizer(model, lr = 0.001, wd = 1e-5)
train_model(model, optimizer, train_dl, valid_dl, epochs=10)

train loss  0.002 val loss 1.730 and accuracy 0.500
train loss  0.001 val loss 1.733 and accuracy 0.500
train loss  0.001 val loss 1.737 and accuracy 0.500
train loss  0.001 val loss 1.742 and accuracy 0.500
train loss  0.001 val loss 1.747 and accuracy 0.500
train loss  0.001 val loss 1.753 and accuracy 0.500
train loss  0.001 val loss 1.758 and accuracy 0.500
train loss  0.001 val loss 1.764 and accuracy 0.500
train loss  0.001 val loss 1.770 and accuracy 0.500
train loss  0.001 val loss 1.776 and accuracy 0.500


In [22]:
valid_loss, valid_acc = valid_metrics(model, valid_dl)

In [23]:
valid_loss, valid_acc

(1.7757858037948608, 0.5)

In [25]:
np.abs(valid_loss - 0.3)

1.4757858037948608

In [24]:
assert(np.abs(valid_loss - 0.3) < 0.02)

AssertionError: 

In [25]:
assert(np.abs(valid_acc - 0.9) < 0.01)

In [26]:
def read_tiny_data():
    data = pd.read_csv("data/tiny.ner.train", sep="\t", header=None, names=["word", "label"])
    return data

In [27]:
def test_Dataset():
    tiny_data = read_tiny_data()
    tiny_vocab2index = label_encoding(tiny_data["word"].values)
    tiny_label2index = label_encoding(tiny_data["label"].values)
    tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)
    tiny_ds = NERDataset(tiny_data_enc)
    x, y = tiny_ds[0]
    assert np.array_equal(x, np.array([11, 30, 26, 18, 13]))
    assert y == 6
    assert len(tiny_ds) == 93




In [28]:
tiny_data = read_tiny_data()
tiny_vocab2index = label_encoding(tiny_data["word"].values)
tiny_label2index = label_encoding(tiny_data["label"].values)
tiny_data_enc = dataset_encoding(tiny_data, tiny_vocab2index, tiny_label2index)
tiny_ds = NERDataset(tiny_data_enc)
x, y = tiny_ds[0]
#    assert(np.array_equal(x, np.array([11, 30, 26, 18, 13])))
#    assert(y == 6)
#    assert(len(tiny_ds) == 93)



In [29]:
tiny_data

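|     | word       | label     |
|-----|------------|-----------|
| 0   | IL-2       | B-DNA     |
| 1   | gene       | I-DNA     |
| 2   | expression | O         |
| 3   | and        | O         |
| 4   | NF-kappa   | B-protein |
| ... | ...        | ...       |
| 92  | to         | O         |
| 93  | involve    | O         |
| 94  | protein    | B-protein |
| 95  | tyrosine   | I-protein |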


In [30]:
y

6