<a href="https://colab.research.google.com/github/martinthetechie/nlp-guide/blob/main/named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>Named Entity Recognition</h2>

In [9]:
quote = 'On January 9, 2007, Steve Jobs announced the first iPhone at the Macworld convention, receiving substantial media attention.'

<h4>Rules-Based</h4>

In [4]:
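# Rules-based NER: hand-written token patterns layered on a pretrained spaCy pipeline
import spacy

# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

# Define custom patterns (each dict inside "pattern" describes one token)
patterns = [
    # Intended for numeric dates such as 09/01/2007; the quote spells its date out,
    # so the DATE in the output comes from the model's own "ner" component
    {"label": "DATE", "pattern": [{"SHAPE": "dd"}, {"SHAPE": "/"}, {"SHAPE": "dd"}, {"SHAPE": "/"}, {"SHAPE": "dddd"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "steve"}, {"LOWER": "jobs"}]},
    # As written, this matches only the two-token sequence "iphone macworld";
    # give each name its own single-token pattern to tag them separately
    {"label": "ORG", "pattern": [{"LOWER": "iphone"}, {"LOWER": "macworld"}]}
]

# Append the entity ruler to the pipeline. By default it runs after the statistical
# "ner" component and does not overwrite entities that component already found.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)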

doc = nlp(quote)

# Print detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)



January 9, 2007 DATE
Steve Jobs PERSON
first ORDINAL
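
The output above misses iPhone and Macworld because the ORG rule lists both names inside a single pattern, so it only matches when the two tokens appear back to back. As a rough illustration of giving each name its own rule, here is a minimal sketch that uses plain regular expressions from the standard library instead of spaCy. The `RULES` table, the `rule_based_ner` helper, and the PRODUCT and EVENT labels are assumptions made for this example, not part of the original cell.

```python
import re

quote = ("On January 9, 2007, Steve Jobs announced the first iPhone at the "
         "Macworld convention, receiving substantial media attention.")

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical rule set: (label, compiled regex). Each name gets its own
# rule, so one can match without the other being present.
RULES = [
    ("DATE", re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")),
    ("PERSON", re.compile(r"\bSteve Jobs\b")),
    ("PRODUCT", re.compile(r"\biPhone\b")),
    ("EVENT", re.compile(r"\bMacworld\b")),
]

def rule_based_ner(text):
    """Return (start, span_text, label) for every rule match, in reading order."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return sorted(found)

entities = rule_based_ner(quote)
for _, span, label in entities:
    print(span, label)
```

Like any rule set, this only finds what the rules spell out: a different month format or a person other than Steve Jobs would need additional rules.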


<h4>Statistical Approach</h4>

In [5]:
!pip install sklearn-crfsuite python-crfsuite hmmlearn


Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting hmmlearn
  Downloading hmmlearn-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.9 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Downloading hmmlearn-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite, hmmlearn
Successfully installed hmmlearn-0.3.2 pytho

In [10]:
import numpy as np
from hmmlearn import hmm
from sklearn.preprocessing import LabelEncoder
import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

# Load the training data from the CoNLL-2002 corpus
train_sents = list(conll2002.iob_sents('esp.train'))

# Prepare the training data
def prepare_data(sentences):
    words = []
    labels = []
    for sent in sentences:
        for word, pos, label in sent:
            words.append(word)
            labels.append(label)
    return words, labels

words, labels = prepare_data(train_sents)

# Encode words and labels as numbers
word_encoder = LabelEncoder()
label_encoder = LabelEncoder()

X = word_encoder.fit_transform(words).reshape(-1, 1)
y = label_encoder.fit_transform(labels)

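# Define HMM model parameters
n_states = len(set(labels))  # Number of hidden states (one per NER label)
n_observations = len(set(words))  # Vocabulary size (not used below)

# Train the HMM model.
# Caveat: fit(X) runs unsupervised EM and never sees y, so the learned state
# indices are arbitrary and need not line up with label_encoder's classes.
# In hmmlearn 0.3, MultinomialHMM also no longer models one symbol per step;
# CategoricalHMM is the class meant for integer-coded words (see the issue
# links printed in the output).
model = hmm.MultinomialHMM(n_components=n_states, n_iter=100)
model.fit(X)

# Predict the hidden-state sequence for the original training data (for simplicity)
y_pred = model.predict(X)

# Map state indices onto NER tags. Because training was unsupervised, this
# mapping is a guess rather than a learned correspondence.
predicted_labels = label_encoder.inverse_transform(y_pred)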

# Display some of the predicted tokens with their predicted NER tags
for token, label in zip(words[:20], predicted_labels[:20]):  # Displaying first 20 for brevity
    print(f"{token}: {label}")

# Now predict the entities for the quote
quote_tokens = quote.split()

# Replace words the encoder never saw with '<UNK>'
# (the training corpus is Spanish, so most English words are unseen)
quote_tokens = [word if word in word_encoder.classes_ else '<UNK>' for word in quote_tokens]

# Encode the in-vocabulary words; '<UNK>' is not in the encoder, so it gets a None placeholder
quote_X = []
for word in quote_tokens:
    if word in word_encoder.classes_:
        encoded_word = word_encoder.transform([word])[0]
        quote_X.append([encoded_word])
    else:
        quote_X.append([None])  # Placeholder for unseen words

# Drop the placeholders, leaving only the in-vocabulary words
quote_X = np.array([x for x in quote_X if x[0] is not None])

# Predict the NER tags for the quote
quote_y_pred = model.predict(quote_X)

# Decode the predicted labels back to original NER tags
quote_predicted_labels = label_encoder.inverse_transform(quote_y_pred)

# Display the tokens with their predicted NER tags.
# Caveat: quote_tokens still holds every token, while the predictions cover only
# the in-vocabulary ones, so zip pairs them by position and the token column
# does not name the words that were actually tagged.
print("\nPredicted NER tags for the quote:")
for token, label in zip(quote_tokens, quote_predicted_labels):
    print(f"{token}: {label}")


[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!
https://github.com/hmmlearn/hmmlearn/issues/335
https://github.com/hmmlearn/hmmlearn/issues/340


Melbourne: B-PER
(: O
Australia: I-PER
): I-PER
,: I-PER
25: I-PER
may: I-PER
(: I-PER
EFE: I-PER
): I-PER
.: I-PER
-: I-PER
El: I-PER
Abogado: I-PER
General: I-PER
del: I-PER
Estado: I-PER
,: I-PER
Daryl: I-PER
Williams: I-PER

Predicted NER tags for the quote:
<UNK>: B-PER
<UNK>: I-ORG
<UNK>: I-LOC
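
The HMM cell above calls `fit(X)` without the labels, which helps explain why nearly every token comes back as I-PER. When tagged data is available, one alternative is to estimate start, transition, and emission probabilities by counting, then decode with the Viterbi algorithm. The following is a minimal standard-library sketch of that idea. The three training sentences are made up for illustration (they are not CoNLL-2002 data), the helper names `train_hmm`, `log_prob`, and `viterbi` are hypothetical, and add-one smoothing is an assumed choice for handling unseen words and transitions.

```python
import math
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration only.
train = [
    [("Steve", "B-PER"), ("Jobs", "I-PER"), ("spoke", "O")],
    [("Tim", "B-PER"), ("Cook", "I-PER"), ("spoke", "O"), ("today", "O")],
    [("Apple", "B-ORG"), ("announced", "O"), ("it", "O")],
]

def train_hmm(sents):
    """Count sentence-initial tags, tag-to-tag transitions, and tag-to-word emissions."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in sents:
        prev = None
        for word, tag in sent:
            emit[tag][word] += 1
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            prev = tag
    tags = sorted(emit)
    vocab = {w for counts in emit.values() for w in counts}
    return start, trans, emit, tags, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words`."""
    start, trans, emit, tags, vocab = model
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unseen words
    score = {t: log_prob(start, t, n_tags) + log_prob(emit[t], words[0], n_words)
             for t in tags}
    backpointers = []
    for word in words[1:]:
        new_score, pointer = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: score[p] + log_prob(trans[p], t, n_tags))
            new_score[t] = (score[best] + log_prob(trans[best], t, n_tags)
                            + log_prob(emit[t], word, n_words))
            pointer[t] = best
        score = new_score
        backpointers.append(pointer)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=score.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

model = train_hmm(train)
print(viterbi(["Steve", "Jobs", "spoke"], model))
```

Because every probability here comes from labeled counts, each hidden state is an NER tag by construction, so no guessed mapping from state index to label is needed.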


In [12]:
from sklearn_crfsuite import CRF, metrics
import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

# Load the training and test data from the CoNLL-2002 corpus
train_sents = list(conll2002.iob_sents('esp.train'))
test_sents = list(conll2002.iob_sents('esp.testb'))

# Function to extract features from a sentence
def sent2features(sent):
    return [token2features(sent, i) for i in range(len(sent))]

# Function to extract features from a token
def token2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

# Function to extract labels from a sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]

# Prepare the training and test data
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

# Define and train the CRF model
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

# Predict and evaluate on the test set
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, digits=3))

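# Now predict the entities for the quote.
# Caveat: the CRF was trained on Spanish text, and split() leaves punctuation
# attached to words ("9,", "2007,", "convention,").
quote = "On January 9, 2007, Steve Jobs announced the first iPhone at the Macworld convention, receiving substantial media attention."
tokens = quote.split()

# Hand-assigned Penn Treebank-style POS tags, one per token (normally, you would use a POS tagger).
# The Spanish CoNLL-2002 data uses a different tagset, so most of these values
# are unlikely to match any 'postag' feature seen in training.
pos_tags = ['IN', 'NNP', 'CD', 'CD', 'NNP', 'NNP', 'VBD', 'DT', 'JJ', 'NNP', 'IN', 'DT', 'NNP', 'NNP', 'VBG', 'JJ', 'NN', 'NN']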

# Prepare the features for the quote
X_quote = sent2features(list(zip(tokens, pos_tags)))

# Predict the NER tags for the quote
y_pred_quote = crf.predict([X_quote])[0]

# Display the tokens with their predicted NER tags
for token, label in zip(tokens, y_pred_quote):
    print(f"{token}: {label}")


[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


              precision    recall  f1-score   support

       B-LOC      0.809     0.779     0.794      1084
      B-MISC      0.732     0.540     0.621       339
       B-ORG      0.809     0.831     0.820      1400
       B-PER      0.845     0.890     0.867       735
       I-LOC      0.679     0.637     0.657       325
      I-MISC      0.701     0.591     0.641       557
       I-ORG      0.854     0.786     0.819      1104
       I-PER      0.890     0.946     0.917       634
           O      0.992     0.996     0.994     45355

    accuracy                          0.971     51533
   macro avg      0.812     0.777     0.792     51533
weighted avg      0.970     0.971     0.970     51533

On: B-ORG
January: I-ORG
9,: I-ORG
2007,: I-ORG
Steve: I-ORG
Jobs: I-ORG
announced: I-ORG
the: I-ORG
first: I-ORG
iPhone: I-ORG
at: I-ORG
the: I-ORG
Macworld: I-ORG
convention,: O
receiving: O
substantial: O
media: O
attention.: O
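
Part of the noise in the quote predictions above comes from tokenization: `quote.split()` keeps punctuation glued to words, producing tokens such as `9,` and `convention,` that the feature function then treats as ordinary words. Below is a small sketch of a regex tokenizer that separates punctuation. `simple_tokenize` is a hypothetical helper written for this example, and it is a rough stand-in for a real tokenizer rather than a replacement for one.

```python
import re

quote = ("On January 9, 2007, Steve Jobs announced the first iPhone at the "
         "Macworld convention, receiving substantial media attention.")

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    A word may contain internal hyphens or apostrophes (e.g. "state-of-the-art");
    every other non-space symbol becomes a token of its own.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_tokenize(quote)
print(len(tokens))
print(tokens[:6])
```

With punctuation split off, the quote has 22 tokens instead of 18, so any hand-written POS tag list would need one tag per token to stay aligned.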


<h4>Deep Learning Approach</h4>

In [18]:
# Using Hugging Face Transformers with BERT
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english')

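# Define NER pipeline. Without an aggregation_strategy argument it returns one
# entry per word piece (so "Jobs" appears as "Job" + "##s") and omits tokens tagged O.
ner_pipeline = pipeline('ner', model=model, tokenizer=tokenizer)

# Extract entities
entities = ner_pipeline(quote)

for entity in entities:
    print(f"{entity['word']}:{entity['entity']}")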

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Steve:I-PER
Job:I-PER
##s:I-PER
iPhone:I-MISC
Mac:I-MISC
##world:I-MISC
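
The output above is reported per word piece, which is why "Jobs" and "Macworld" are split into `Job` + `##s` and `Mac` + `##world`. The pipeline accepts an `aggregation_strategy` argument (for example `"simple"`) that groups pieces for you. As an illustration of what that grouping involves, here is a minimal sketch that joins `##` continuation pieces by hand. `merge_word_pieces` is a hypothetical helper, and `sample` simply repeats the word and label pairs printed above.

```python
def merge_word_pieces(entities):
    """Join '##' continuation pieces onto the preceding token.

    Assumes each continuation piece directly follows the piece it belongs to,
    which holds when the whole word received a non-O tag.
    """
    merged = []
    for ent in entities:
        word, label = ent["word"], ent["entity"]
        if word.startswith("##") and merged:
            prev_word, prev_label = merged[-1]
            merged[-1] = (prev_word + word[2:], prev_label)
        else:
            merged.append((word, label))
    return merged

# Word/label pairs copied from the pipeline output shown above
sample = [
    {"word": "Steve", "entity": "I-PER"},
    {"word": "Job", "entity": "I-PER"},
    {"word": "##s", "entity": "I-PER"},
    {"word": "iPhone", "entity": "I-MISC"},
    {"word": "Mac", "entity": "I-MISC"},
    {"word": "##world", "entity": "I-MISC"},
]

for word, label in merge_word_pieces(sample):
    print(f"{word}:{label}")
```

This only repairs the word boundaries; deciding that "Steve" and "Jobs" form one PERSON span would additionally need the token positions that the pipeline returns alongside each entry.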
