# Home Exercise on Named Entity Recognition

Implement a **recurrent neural network model** (**[Bidirectional LSTM-CRF Models for Sequence Tagging](https://arxiv.org/pdf/1508.01991)**) to extract named entities from text. Entity labels are encoded in **BIO notation**: each entity label is prefixed with **B-** (Beginning) or **I-** (Inside). The **B-** prefix marks the first word of an entity, while the **I-** prefix marks the remaining words of the same entity.

These tags help identify multi-word entities. For example, in the phrase **"World War II"**, the labels would be: **(B-eve, I-eve, I-eve)**. Words that do not belong to any entity are labeled as **O** (Outside).
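
To make the tagging scheme concrete, here is a minimal sketch of how BIO tags can be grouped back into entity spans. The helper name `bio_to_spans` is hypothetical and not part of the assignment or any library; it assumes tags of the form `B-type`, `I-type`, or `O`, and leniently treats a stray `I-` tag as the start of a new entity.

```python
def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if not continues:
            # Close any open entity before starting a new one or hitting "O".
            if current is not None:
                spans.append((current, start, i))
            if prefix in ("B", "I"):
                start, current = i, etype
            else:
                start, current = None, None
    # Close an entity that runs to the end of the sentence.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


tokens = ["World", "War", "II", "ended", "in", "1945"]
tags = ["B-eve", "I-eve", "I-eve", "O", "O", "B-tim"]
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in bio_to_spans(tags)]
print(entities)
```

Here the three `eve` tags collapse into the single entity "World War II", which is exactly what the **B-**/**I-** distinction is for.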

- **Data**: [Annotated GMB Corpus](https://www.kaggle.com/datasets/shoumikgoswami/annotated-gmb-corpus?select=GMB_dataset.txt) *(the last 10% of rows serve as the test set).*
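
Because the test set is defined by position rather than by random sampling, the split can be sketched as a simple slice. This is an illustration on a toy list; `split_last_fraction` is a hypothetical helper, and how to round at the boundary is an assumption the assignment does not specify.

```python
def split_last_fraction(rows, test_fraction=0.1):
    """Return (train, test) where test is the final `test_fraction` of rows."""
    n_test = int(len(rows) * test_fraction)  # assumption: round down
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


rows = list(range(20))  # stand-in for the dataset rows, in file order
train_rows, test_rows = split_last_fraction(rows)
print(len(train_rows), len(test_rows))
```

Keeping the original row order matters here: shuffling before the slice would change which rows end up in the test set.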

**Note**: Submit only a **single Jupyter Notebook file** that can handle all tasks, including data downloading, preprocessing, model training, and model evaluation. *(Submissions that do not follow the guidelines will receive a score of 0.)*

## Grading Criteria

For valid submissions, scores will be assigned based on the **leaderboard ranking** (**strictly greater**):

- **Top 25%** → **10 points**
- **25% - 50%** → **9.0 points**
- **50% - 75%** → **8.0 points**
- **75% - 100%** → **7.0 points**


In [1]:
%pip install numpy pandas tensorflow==2.18.0 scikit-learn seqeval kaggle


Collecting seqeval
  Using cached seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Note: you may need to restart the kernel to use updated packages.








In [2]:
%pip install tensorflow-addons==0.8.3
%pip install tensorflow==2.2.0-rc3

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement tensorflow-addons==0.8.3 (from versions: none)
ERROR: No matching distribution found for tensorflow-addons==0.8.3


Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement tensorflow==2.2.0-rc3 (from versions: 2.16.0rc0, 2.16.1, 2.16.2, 2.17.0rc0, 2.17.0rc1, 2.17.0, 2.17.1, 2.18.0rc0, 2.18.0rc1, 2.18.0rc2, 2.18.0, 2.19.0rc0)
ERROR: No matching distribution found for tensorflow==2.2.0-rc3


In [3]:
# Import required libraries
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense, Dropout
from sklearn.model_selection import train_test_split
from tensorflow_addons.layers import CRF
from seqeval.metrics import classification_report as seqeval_report


ModuleNotFoundError: No module named 'tensorflow_addons'

In [None]:
# ===================================
# 📥 STEP 1: Download Dataset from Kaggle
# ===================================
KAGGLE_DATASET = "shoumikgoswami/annotated-gmb-corpus"
DATA_FILENAME = "GMB_dataset.txt"

# Requires Kaggle API credentials (e.g. ~/.kaggle/kaggle.json)
exit_code = os.system(f"kaggle datasets download -d {KAGGLE_DATASET} --unzip")
if exit_code != 0:
    raise RuntimeError("Kaggle download failed; check your API credentials")

# Path to the extracted dataset file
DATA_PATH = DATA_FILENAME  # Change if the filename is different


In [None]:
# ===================================
# 📊 STEP 2: Load and Preprocess Dataset
# ===================================
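# Read the dataset.
# Assumption: GMB_dataset.txt is tab-separated with a header row and the columns
# (row index, sentence ID, word, POS tag, entity tag); adjust sep/names if your copy differs.
df = pd.read_csv(DATA_PATH, sep='\t', header=0, encoding='latin1',
                 names=['row', 'sentence', 'word', 'pos', 'label'])

# Forward-fill sentence IDs in case they are only given on a sentence's first row
df['sentence'] = df['sentence'].ffill()

# Remove rows where the word or label is missing
df = df.dropna(subset=['word', 'label'])

# Create a list of sentences, each a list of (word, label) pairs, in file order
# (grouping by df.index would treat every row as its own one-word sentence)
sentences = [list(zip(g['word'], g['label'])) for _, g in df.groupby('sentence', sort=False)]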

In [None]:
# ===================================
# 🔢 STEP 3: Tokenization & Index Mapping
# ===================================
# Create vocabulary mappings
words = sorted(set(df["word"].values))  # Unique words (sorted so indices are reproducible)
labels = sorted(set(df["label"].values))  # Unique entity labels

# Word to index mappings
word2idx = {w: i + 1 for i, w in enumerate(words)}
label2idx = {l: i for i, l in enumerate(labels)}
idx2label = {i: l for l, i in label2idx.items()}  # Reverse mapping

# Maximum sentence length
MAX_LEN = 50

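# Convert words and labels to indices (word index 0 is reserved for padding)
X = pad_sequences([[word2idx.get(w[0], 0) for w in s] for s in sentences], maxlen=MAX_LEN, padding="post")
# Pad labels with the index of "O" so padding is not mistaken for an arbitrary entity tag
y = pad_sequences([[label2idx[w[1]] for w in s] for s in sentences], maxlen=MAX_LEN, padding="post", value=label2idx["O"])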

# Convert labels to categorical format
y = np.array([to_categorical(i, num_classes=len(labels)) for i in y])

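# Hold out the last 10% as the test set: the task asks for a positional split,
# not a random one (the cut here is made at sentence level)
split_idx = int(len(X) * 0.9)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]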
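
The index mapping and padding steps above can be illustrated without Keras. The sketch below is a plain-Python stand-in for the `pad_sequences` call, with index 0 reserved for padding; `encode_and_pad` is a hypothetical helper. It assumes post-truncation for simplicity, whereas Keras `pad_sequences` truncates from the front by default.

```python
def encode_and_pad(sentence, token2idx, max_len, pad_idx=0):
    """Map tokens to indices, then cut or right-pad to exactly max_len."""
    ids = [token2idx.get(tok, pad_idx) for tok in sentence][:max_len]
    return ids + [pad_idx] * (max_len - len(ids))


vocab = sorted({"World", "War", "II", "ended"})
toy_word2idx = {w: i + 1 for i, w in enumerate(vocab)}  # 0 is reserved for padding

padded = encode_and_pad(["World", "War", "II"], toy_word2idx, max_len=5)
print(padded)
```

Starting real word indices at 1 keeps padded positions distinguishable from actual tokens, which is what later makes it possible to ignore them during evaluation.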


In [None]:
# ===================================
# 🏗️ STEP 4: Build the BiLSTM-CRF Model
# ===================================
# Input Layer
input_layer = Input(shape=(MAX_LEN,))

# Embedding Layer (Word Embeddings); the extra row covers padding index 0
embedding_layer = Embedding(input_dim=len(words) + 1, output_dim=50)(input_layer)

# Bidirectional LSTM Layer
bilstm_layer = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.5, recurrent_dropout=0.25))(embedding_layer)

# Dense Layer for label prediction
dense_layer = TimeDistributed(Dense(len(labels), activation="relu"))(bilstm_layer)

# CRF Layer for structured prediction
crf_layer = CRF(len(labels))
output_layer = crf_layer(dense_layer)

# Define and compile the model
model = Model(input_layer, output_layer)
model.compile(optimizer="adam", loss=crf_layer.loss, metrics=["accuracy"])

# Print Model Summary
model.summary()
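
A CRF output layer scores whole tag sequences by combining per-token emission scores with tag-to-tag transition scores, and decoding picks the highest-scoring sequence. The following is a minimal Viterbi decoding sketch in plain Python with invented scores; it illustrates the idea and is not the implementation used by any particular CRF layer.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this position.
            best_prev = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            step_back.append(best_prev)
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
        scores = step_scores
        backpointers.append(step_back)
    # Follow back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


# Toy example with tags 0 = "O", 1 = "B-eve", 2 = "I-eve" (all scores are made up).
toy_transitions = [
    [0.0, 0.0, -10.0],  # O -> I-eve is heavily penalised
    [0.0, 0.0, 1.0],    # B-eve -> I-eve is encouraged
    [0.0, 0.0, 1.0],    # I-eve -> I-eve is encouraged
]
toy_emissions = [
    [0.1, 2.0, 1.9],
    [0.2, 0.1, 1.5],
    [2.0, 0.1, 0.3],
]
print(viterbi_decode(toy_emissions, toy_transitions))
```

On this toy input the decoded path is B-eve, I-eve, O. The transition scores are what let a CRF discourage invalid sequences such as an **I-** tag directly after **O**, something a per-token softmax cannot express.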

In [None]:
# ===================================
# 🚀 STEP 5: Train the Model
# ===================================
history = model.fit(
    X_train, np.array(y_train),
    validation_data=(X_test, np.array(y_test)),
    batch_size=32,
    epochs=5,
    verbose=1
)

In [None]:
# ===================================
# 📝 STEP 6: Evaluate Model Performance
# ===================================
# Make predictions
y_pred = model.predict(X_test)

# Convert predictions to label indices
y_pred = np.argmax(y_pred, axis=-1)
y_true = np.argmax(y_test, axis=-1)

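# Convert indices to label names, skipping padded positions (token index 0)
# so that padding does not inflate the scores
y_pred_labels = [[idx2label[i] for i, tok in zip(row, toks) if tok != 0] for row, toks in zip(y_pred, X_test)]
y_true_labels = [[idx2label[i] for i, tok in zip(row, toks) if tok != 0] for row, toks in zip(y_true, X_test)]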

# Print classification report
print(seqeval_report(y_true_labels, y_pred_labels))
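
seqeval scores predictions at the entity level: a predicted entity counts as correct only if both its type and its exact boundaries match. The sketch below illustrates that idea on toy data with micro-averaged precision, recall, and F1; the helper names are hypothetical and it is a simplified stand-in, not seqeval's implementation.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:
            if etype is not None:
                entities.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return entities


def entity_prf(y_true, y_pred):
    """Micro-averaged precision, recall, F1 over exact-match entity spans."""
    n_true = n_pred = n_correct = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents, pred_ents = extract_entities(true_tags), extract_entities(pred_tags)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


toy_true = [["B-eve", "I-eve", "I-eve", "O"], ["B-geo", "O"]]
toy_pred = [["B-eve", "I-eve", "O", "O"], ["B-geo", "O"]]
print(entity_prf(toy_true, toy_pred))
```

In the toy data the first prediction gets two of three tokens right but still counts as a miss because the entity boundary is wrong, which is why entity-level scores are usually lower than token accuracy.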
