Build a Named Entity Recognition (NER) system that extracts entities from real-world text
such as news articles or social media posts, and measure its accuracy, precision, recall, and
F1 score.

In [1]:
texts = [
    "Apple CEO Tim Cook visited India in 2023.",
    "Elon Musk announced Tesla's new factory in Texas.",
    "Prime Minister Narendra Modi addressed the UN."
]


In [2]:
true_entities = [
    [('Apple', 'ORG'), ('Tim Cook', 'PERSON'), ('India', 'GPE'), ('2023', 'DATE')],
    [('Elon Musk', 'PERSON'), ('Tesla', 'ORG'), ('Texas', 'GPE')],
    [('Narendra Modi', 'PERSON'), ('UN', 'ORG')]
]


In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm



In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")


In [5]:
predicted_entities = []

# Run the spaCy pipeline on each text and collect (entity text, label) pairs
for text in texts:
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    predicted_entities.append(entities)

print("Predicted Entities:")
for sentence_entities in predicted_entities:
    print(sentence_entities)


Predicted Entities:
[('Apple', 'ORG'), ('Tim Cook', 'PERSON'), ('India', 'GPE'), ('2023', 'DATE')]
[('Elon Musk', 'PERSON'), ('Tesla', 'ORG'), ('Texas', 'GPE')]
[('Narendra Modi', 'PERSON'), ('UN', 'ORG')]


In [6]:
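# Align gold and predicted entities into two parallel label lists.
# 'O' marks "no entity" on the side where a match is missing.
y_true = []
y_pred = []

for true, pred in zip(true_entities, predicted_entities):
    true_set = set(true)
    pred_set = set(pred)

    # Gold entities: counted as found only if the exact (text, label) pair was predicted
    for ent in true_set:
        y_true.append(ent[1])
        y_pred.append(ent[1] if ent in pred_set else 'O')

    # Spurious predictions: entities the model produced that are not in the gold set
    for ent in pred_set - true_set:
        y_true.append('O')
        y_pred.append(ent[1])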
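
The alignment above treats an entity as correct only when both its text and its label match exactly. The same idea can be expressed by counting true positives, false positives, and false negatives directly. The following is a minimal sketch, not part of the original notebook; the function name `entity_prf` and the sample data `gold` and `hypothetical_pred` are made up for illustration.

```python
def entity_prf(true_entities, predicted_entities):
    """Micro-averaged precision, recall, and F1 over exact (text, label) matches."""
    tp = fp = fn = 0
    for true, pred in zip(true_entities, predicted_entities):
        true_set, pred_set = set(true), set(pred)
        tp += len(true_set & pred_set)   # predicted and in the gold set
        fp += len(pred_set - true_set)   # predicted but not in the gold set
        fn += len(true_set - pred_set)   # in the gold set but not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical data: the model finds "Apple" but truncates "Tim Cook" to "Cook"
gold = [[("Apple", "ORG"), ("Tim Cook", "PERSON")]]
hypothetical_pred = [[("Apple", "ORG"), ("Cook", "PERSON")]]

p, r, f = entity_prf(gold, hypothetical_pred)
print(p, r, f)
```

Under exact matching, a truncated span counts as both a missed gold entity and a spurious prediction, which is why one boundary error lowers precision and recall together.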


In [7]:
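from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Fraction of aligned positions where the predicted label equals the gold label
accuracy = accuracy_score(y_true, y_pred)

# Support-weighted average over every label present in y_true or y_pred;
# zero_division=0 avoids warnings for labels that are never predicted
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0
)

print("Accuracy :", accuracy)
print("Precision:", precision)
print("Recall   :", recall)
print("F1-score :", f1)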


Accuracy : 1.0
Precision: 1.0
Recall   : 1.0
F1-score : 1.0
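
The scores above are a single weighted average. When predictions diverge from the gold labels, the padding label 'O' enters `y_true` and `y_pred` as a class of its own, and the weighted average counts it as well. A per-label breakdown that skips 'O' can make errors easier to locate. The sketch below is an illustrative assumption, not part of the original notebook: `per_label_scores`, `demo_true`, and `demo_pred` are hypothetical names and data.

```python
from collections import Counter


def per_label_scores(y_true, y_pred, ignore="O"):
    """Return {label: (precision, recall)} for every label except `ignore`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            if t != ignore:
                tp[t] += 1
        else:
            if p != ignore:
                fp[p] += 1   # predicted label p where the gold label differs
            if t != ignore:
                fn[t] += 1   # gold label t was not recovered

    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical aligned labels: one missed PERSON and one spurious GPE
demo_true = ["ORG", "PERSON", "GPE", "O"]
demo_pred = ["ORG", "O", "GPE", "GPE"]

scores = per_label_scores(demo_true, demo_pred)
for label, (precision, recall) in scores.items():
    print(label, precision, recall)
```

The same function can be called on the `y_true` and `y_pred` lists built in cell 6.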
