# **Home Exercise on Named Entity Recognition**
Implement a **Recurrent Neural Network model (Bidirectional LSTM-CRF Models for Sequence Tagging)** to extract named entities from text. Entity labels are encoded using BIO notation, in which each word of an entity receives a **B** (Beginning) or **I** (Inside) tag. The **B-** tag marks the first word of an entity, while the **I-** tag marks the remaining words inside the same entity.

These tags help identify multi-word entities. For example, in the phrase **"World War II"**, the labels would be: **(B-eve, I-eve, I-eve)**. Words that do not belong to any entity are labeled as **O (Outside)**.
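
To make the BIO scheme concrete, here is a minimal sketch that groups a tag sequence back into entity spans. The function name `bio_to_spans`, the example sentence, and the lenient handling of a stray **I-** tag (treated as the start of a new entity) are illustrative assumptions, not part of the exercise statement.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same entity type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # Assumption: a stray "I-" tag opens a new span (lenient decoding).
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical example sentence built around the phrase used above.
words = ["World", "War", "II", "ended", "in", "1945"]
tags = ["B-eve", "I-eve", "I-eve", "O", "O", "B-tim"]
spans = bio_to_spans(tags)
entities = [(label, " ".join(words[s:e])) for label, s, e in spans]
print(entities)
```

Each span keeps token offsets rather than strings, so the same helper could be reused to compare predicted and gold entities.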

* Data: [Annotated GMB Corpus](https://www.kaggle.com/datasets/shoumikgoswami/annotated-gmb-corpus?select=GMB_dataset.txt) (**the last 10% of sentences serve as the test set**).
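
The train/test split described above could be sketched as follows. The helper name `split_last_fraction` and the choice to round the test size down are assumptions; the exercise only states that the last 10% of sentences form the test set.

```python
from typing import Sequence, Tuple


def split_last_fraction(items: Sequence, test_fraction: float = 0.1) -> Tuple[list, list]:
    """Hold out the final `test_fraction` of items, preserving order."""
    # Assumption: the test size is rounded down to a whole number of sentences.
    n_test = int(len(items) * test_fraction)
    split = len(items) - n_test
    return list(items[:split]), list(items[split:])


# Toy stand-in for the parsed corpus: 20 one-word "sentences".
sentences = [[f"w{i}"] for i in range(20)]
train_sentences, test_sentences = split_last_fraction(sentences)
print(len(train_sentences), len(test_sentences))
```

The split is positional, not random, so the same call has to be applied to the tag sequences to keep sentences and labels aligned.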

In [10]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tqdm
import shutil, sys, zipfile

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import datetime
from helper_functions import *

print(f"The last time this notebook was run is: {datetime.datetime.now().strftime('%H:%M:%S %d/%m/%Y')}")


The last time this notebook was run is: 08:48:08 21/11/%


In [5]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shoumikgoswami/annotated-gmb-corpus")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/shoumikgoswami/annotated-gmb-corpus?dataset_version_number=1...


100%|██████████| 462k/462k [00:00<00:00, 632kB/s]

Extracting files...
Path to dataset files: /home/dikhang/.cache/kagglehub/datasets/shoumikgoswami/annotated-gmb-corpus/versions/1





In [13]:
src_dir = path  # directory returned by kagglehub.dataset_download above
filename = "GMB_dataset.txt"

full_path = os.path.join(src_dir, filename)
file_path = move_file(full_path, "./data")

print("Moved to:", file_path)


Moved to: ./data/GMB_dataset.txt


## Loading the Data into a Dataset Class with DataLoader

In [17]:
from torch.utils.data import Dataset, DataLoader
import random
from collections import Counter, defaultdict
from typing import List, Tuple

START_TAG = "<START>"
STOP_TAG = "<STOP>"
PAD_TAG = "<PAD>"
UNK_TOKEN = "<UNK>"
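
One way the special tokens defined above might be used is to reserve fixed indices when building the word and tag lookup tables. This is only a sketch: the helper names `build_vocab`, `build_tag_map`, and `encode`, the `min_freq` parameter, and the index layout (padding at 0, unknown word at 1, `<START>`/`<STOP>` appended last) are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List

# Same special tokens as in the cell above, repeated so the sketch is self-contained.
START_TAG, STOP_TAG = "<START>", "<STOP>"
PAD_TAG, UNK_TOKEN = "<PAD>", "<UNK>"


def build_vocab(sentences: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent word to an index; 0 = padding, 1 = unknown."""
    counts = Counter(word for sentence in sentences for word in sentence)
    word2idx = {PAD_TAG: 0, UNK_TOKEN: 1}
    for word, freq in counts.items():
        if freq >= min_freq:
            word2idx[word] = len(word2idx)
    return word2idx


def build_tag_map(tags: List[List[str]]) -> Dict[str, int]:
    """Map each tag to an index, adding the padding tag and CRF start/stop tags."""
    tag2idx = {PAD_TAG: 0}
    for tag in sorted({t for seq in tags for t in seq}):
        tag2idx[tag] = len(tag2idx)
    tag2idx[START_TAG] = len(tag2idx)
    tag2idx[STOP_TAG] = len(tag2idx)
    return tag2idx


def encode(words: List[str], word2idx: Dict[str, int]) -> List[int]:
    """Convert words to indices, falling back to the unknown-word index."""
    return [word2idx.get(w, word2idx[UNK_TOKEN]) for w in words]


toy_sentences = [["World", "War", "II"], ["War", "ended"]]
toy_tags = [["B-eve", "I-eve", "I-eve"], ["O", "O"]]
word2idx = build_vocab(toy_sentences)
tag2idx = build_tag_map(toy_tags)
print(encode(["World", "Peace"], word2idx))
```

Keeping the padding index at 0 for both tables makes it easy to mask padded positions later, for example via the `padding_idx` argument of `nn.Embedding`.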

In [None]:
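def read_text(path: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Read a token-per-line file into parallel lists of sentences and tag sequences.

    Each non-empty line holds a token in the first whitespace-separated column
    and its tag in the last one; a blank line ends the current sentence.
    """
    sentences, tags = [], []
    words, labels = [], []

    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line:
                # Blank line: close the sentence collected so far, if any.
                if words:
                    sentences.append(words)
                    tags.append(labels)
                    words, labels = [], []
                continue
            parts = line.split()
            token = parts[0]
            # A line with a single column has no tag, so default to "O".
            tag = parts[-1] if len(parts) > 1 else "O"
            words.append(token)
            labels.append(tag)
    # Flush the last sentence when the file does not end with a blank line.
    if words:
        sentences.append(words)
        tags.append(labels)
    return sentences, tags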

       