<a href="https://colab.research.google.com/github/aparnaashok2125/Elevvo-Pathways-NLP-Internship/blob/main/Elevvo_Pathways_Task_4_Named_Entity_Recognition_(NER)_from_News_Articles_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 4: Named Entity Recognition (NER) on News Articles | NLP Internship – Elevvo Pathways

## Objective:
This task focuses on performing Named Entity Recognition (NER) using both rule-based and model-based approaches on the **CoNLL-2003 dataset**. The aim is to identify and categorize entities like **PERSON**, **ORGANIZATION**, and **GPE (Geopolitical Entity)** in news text.

The project also includes a comparison of two SpaCy models – `en_core_web_sm` and `en_core_web_lg`, and the implementation of a **BiLSTM model** for enhanced entity recognition. Additionally, visualization is done using **SpaCy's displaCy**.

---

## 🗂 Dataset:
- **CoNLL-2003 Named Entity Recognition Dataset**
- Source: Kaggle
- Format: Token-level tagging with BIO (Begin-Inside-Outside) format
- Entities considered: `PERSON`, `ORG`, `GPE`

---

## 🛠 Tools and Libraries Used:
- **Python**
- **SpaCy** – for rule-based and pre-trained model-based NER
- **Pandas & NumPy** – for data handling
- **TensorFlow/Keras or PyTorch** – for BiLSTM implementation
- **Matplotlib & displaCy** – for visualization

---

## ✅ Key Steps:
1. **Load and Preprocess the Dataset**
   - Parse and clean the CoNLL-2003 format
   - Structure the data for both rule-based and BiLSTM models

2. **Rule-Based NER (SpaCy)**
   - Apply simple rules to extract named entities using SpaCy’s built-in pipeline

3. **Model-Based NER with SpaCy**
   - Load two pre-trained models: `en_core_web_sm` and `en_core_web_lg`
   - Extract and compare named entities across models

4. **Custom BiLSTM-Based NER**
   - Tokenize and pad sequences
   - Train a BiLSTM model for sequence labeling
   - Evaluate using precision, recall, and F1-score

5. **Visualization and Comparison**
   - Use displaCy to visually highlight extracted entities
   - Compare performance between models (SpaCy vs BiLSTM)

---

## 🧾 Outcome:
By the end of this task, we will have:
- A working pipeline for named entity recognition
- Comparison of multiple models on NER
- Visual insights into how models detect entities in text



In [1]:
!pip install spacy pandas tensorflow
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m52.0 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.7/400.7 MB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: 

In [2]:
import spacy
from spacy import displacy
from spacy.language import Language
from spacy.tokens import Span
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from google.colab import files
import os
from itertools import chain
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import tensorflow
from tensorflow.keras.utils import plot_model

np.random.seed(1)
tensorflow.random.set_seed(2)


In [3]:
@Language.component("custom_rule_based_ner")
def custom_rule_based_ner(doc):
    custom_ents = []
    for token in doc:
        if token.text.lower() in ['company', 'corporation', 'inc.', 'ltd.', 'group']:
            if token.i > 0:
                span = Span(doc, token.i-1, token.i+1, label="ORG")
                custom_ents.append(span)
    doc.ents = list(doc.ents) + custom_ents
    return doc


In [4]:
print("Upload the CoNLL-2003 dataset (ner_dataset.csv)")
uploaded = files.upload()

def load_conll_data(file_path=None):
    if file_path and os.path.exists(file_path):
        data = pd.read_csv(file_path, encoding='unicode_escape')
    else:
        sample_text = "Apple Inc. launched a new product in New York. Tim Cook presented at the United Nations."
        data = pd.DataFrame({
            'Sentence #': ['Sentence: 1'],
            'Word': sample_text.split(),
            'POS': ['NNP'] * len(sample_text.split()),
            'Tag': ['O'] * len(sample_text.split())
        })
    data['Sentence #'] = data['Sentence #'].ffill()
    data_group = data.groupby('Sentence #').agg({
        'Word': lambda x: ' '.join(str(w) for w in x if pd.notnull(w)),
        'Tag': list,
        'POS': list
    }).reset_index()
    return data, data_group

file_path = list(uploaded.keys())[0] if uploaded else None
data, data_group = load_conll_data(file_path)


Upload the CoNLL-2003 dataset (ner_dataset.csv)


Saving ner_dataset.csv to ner_dataset (1).csv


In [5]:
def get_dict_map(data, token_or_tag):
    vocab = list(set(data['Word'].to_list())) if token_or_tag == 'token' else list(set(data['Tag'].to_list()))
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')

data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
data_fillna = data.ffill()
data_group_bilstm = data_fillna.groupby(['Sentence #'], as_index=False)[['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']].agg(lambda x: list(x))


In [6]:
def get_pad_train_test_val(data_group, data):
    n_token = len(set(data['Word'].to_list()))
    n_tag = len(set(data['Tag'].to_list()))
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, padding='post', value=n_token - 1)
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, padding='post', value=tag2idx["O"])
    pad_tags = [to_categorical(i, num_classes=n_tag) for i in pad_tags]
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_, tags_, test_size=0.25, random_state=2020)
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group_bilstm, data)


In [7]:
input_dim = len(set(data['Word'].to_list()))+1
output_dim = 64
input_length = max([len(s) for s in data_group_bilstm['Word_idx'].tolist()])
n_tags = len(tag2idx)

def get_bilstm_lstm_model():
    model = Sequential()
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim))
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2)))
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5))
    model.add(TimeDistributed(Dense(n_tags, activation="relu")))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

def train_model(X, y, model):
    loss = []
    for i in range(5):
        hist = model.fit(X, np.array(y), batch_size=1000, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

model_bilstm = get_bilstm_lstm_model()
_ = model_bilstm(train_tokens[:1])  # Dry run
results = pd.DataFrame()
results['bilstm_loss'] = train_model(train_tokens, train_tags, model_bilstm)


[1m26/26[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m168s[0m 5s/step - accuracy: 0.8007 - loss: 2.5987 - val_accuracy: 0.9681 - val_loss: 0.3428
[1m26/26[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m157s[0m 6s/step - accuracy: 0.9676 - loss: 0.3442 - val_accuracy: 0.9681 - val_loss: 0.2563
[1m26/26[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m149s[0m 6s/step - accuracy: 0.9676 - loss: 0.2807 - val_accuracy: 0.9681 - val_loss: 0.2064
[1m26/26[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m147s[0m 6s/step - accuracy: 0.9676 - loss: 0.2340 - val_accuracy: 0.9681 - val_loss: 0.1945
[1m26/26[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m144s[0m 6s/step - accuracy: 0.9677 - loss: 0.2130 - val_accuracy: 0.9681 - val_loss: 0.1821


In [8]:
def perform_spacy_ner(text, model_name):
    nlp = spacy.load(model_name)
    if model_name == "en_core_web_sm":
        nlp.add_pipe("custom_rule_based_ner", after="ner")
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    html = displacy.render(doc, style="ent", jupyter=False)
    return entities, html

sample_text = data_group['Word'].iloc[0]
models = ["en_core_web_sm", "en_core_web_lg"]
spacy_results = {}

for model in models:
    print(f"\nProcessing with {model}...")
    entities, html_viz = perform_spacy_ner(sample_text, model)
    spacy_results[model] = {'entities': entities, 'visualization': html_viz}
    display(pd.DataFrame(entities, columns=['Entity', 'Type']))
    display(HTML(html_viz))
    output_path = f"/content/ner_viz_{model}.html"
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(html_viz)
    files.download(output_path)



Processing with en_core_web_sm...


Unnamed: 0,Entity,Type
0,Thousands,CARDINAL
1,London,GPE
2,Iraq,GPE
3,British,NORP


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>


Processing with en_core_web_lg...


Unnamed: 0,Entity,Type
0,Thousands,CARDINAL
1,London,GPE
2,Iraq,GPE
3,British,NORP


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [9]:
print("\n📊 Comparison of SpaCy models:")
for entity_type in ['PERSON', 'ORG', 'GPE']:
    print(f"\n{entity_type} entities:")
    for model in models:
        entities = [e[0] for e in spacy_results[model]['entities'] if e[1] == entity_type]
        print(f"{model}: {entities}")



📊 Comparison of SpaCy models:

PERSON entities:
en_core_web_sm: []
en_core_web_lg: []

ORG entities:
en_core_web_sm: []
en_core_web_lg: []

GPE entities:
en_core_web_sm: ['London', 'Iraq']
en_core_web_lg: ['London', 'Iraq']
