In [1]:
!pip uninstall sklearn_crfsuite
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git#egg=sklearn_crfsuite

[0mCollecting sklearn_crfsuite
  Cloning https://github.com/MeMartijn/updated-sklearn-crfsuite.git to /tmp/pip-install-bffy6r7o/sklearn-crfsuite_09acaf6b3da24cd09b9243dd4f3fc670
  Running command git clone --filter=blob:none --quiet https://github.com/MeMartijn/updated-sklearn-crfsuite.git /tmp/pip-install-bffy6r7o/sklearn-crfsuite_09acaf6b3da24cd09b9243dd4f3fc670
  Resolved https://github.com/MeMartijn/updated-sklearn-crfsuite.git to commit 675038761b4405f04691a83339d04903790e2b95
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting python-crfsuite>=0.8.3 (from sklearn_crfsuite)
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: sklearn_crfsuite
  Building wheel for sklearn_crfsuite (setup.py) ... [?25l[?25hdone
  Created wheel for sklearn_crfsuite: filena

In [2]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Information Extraction

Until now we have mostly focused on classification problems using BOW or TF-IDF representations of text. However, for some tasks this is not enough. Sometimes it is beneficial to know information about words that come before or after other words of a sentence. Named Entity Extraction is one of those tasks. We have looked at NER during our preprocessing lessons but now we will look into another option of training an NER model ourselves.

## Named Entity Recognition

To train a model that can classify words as specific entities we will use Conditional Random Fields (CRF). With the CRF, we use a labeled dataset of input sequences and corresponding label sequences, and apply maximum likelihood estimation to learn the weights of the features that maximize the likelihood of the observed label sequences.

The following code comes from the textbook: https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch5/02_NERTraining.ipynb

In [3]:
from nltk.tag import pos_tag
from sklearn_crfsuite import CRF, metrics
#from sklearn.metrics import make_scorer,confusion_matrix
from pprint import pprint
from sklearn.metrics import f1_score,classification_report
from sklearn.pipeline import Pipeline
import string
import warnings
import nltk
nltk.download('averaged_perceptron_tagger')
#warnings.filterwarnings('ignore')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [4]:
"""
Load the training/testing data.
input: conll format data, but with only 2 tab separated colums - words and NEtags.
output: A list where each item is 2 lists.  sentence as a list of tokens, NER tags as a list for each token.
"""
def load__data_conll(file_path):
    myoutput,words,tags = [],[],[]
    fh = open(file_path)
    for line in fh:
        line = line.strip()
        if "\t" not in line:
            #Sentence ended.
            myoutput.append([words,tags])
            words,tags = [],[]
        else:
            word, tag = line.split("\t")
            words.append(word)
            tags.append(tag)
    fh.close()
    return myoutput

**Explore what the following function does to a sentence. Try it with some sentences of your own.**

In [5]:
def sent2feats(sentence):
    feats = []
    sen_tags = pos_tag(sentence)
    for i in range(0,len(sentence)):
        word = sentence[i]
        wordfeats = {}
        wordfeats['word'] = word
        if i == 0:
            wordfeats["prevWord"] = wordfeats["prevSecondWord"] = "<S>"
        elif i==1:
            wordfeats["prevWord"] = sentence[0]
            wordfeats["prevSecondWord"] = "</S>"
        else:
            wordfeats["prevWord"] = sentence[i-1]
            wordfeats["prevSecondWord"] = sentence[i-2]
        if i == len(sentence)-2:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = "</S>"
        elif i==len(sentence)-1:
            wordfeats["nextWord"] = "</S>"
            wordfeats["nextNextWord"] = "</S>"
        else:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = sentence[i+2]

        wordfeats['tag'] = sen_tags[i][1]
        if i == 0:
            wordfeats["prevTag"] = wordfeats["prevSecondTag"] = "<S>"
        elif i == 1:
            wordfeats["prevTag"] = sen_tags[0][1]
            wordfeats["prevSecondTag"] = "</S>"
        else:
            wordfeats["prevTag"] = sen_tags[i - 1][1]

            wordfeats["prevSecondTag"] = sen_tags[i - 2][1]

        if i == len(sentence) - 2:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = "</S>"
        elif i == len(sentence) - 1:
            wordfeats["nextTag"] = "</S>"
            wordfeats["nextNextTag"] = "</S>"
        else:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = sen_tags[i + 2][1]
        #That is it! You can add whatever you want!
        feats.append(wordfeats)
    return feats

In [6]:
#Extract features from the conll data, after loading it.
def get_feats_conll(conll_data):
    feats = []
    labels = []
    for sentence in conll_data:
        feats.append(sent2feats(sentence[0]))
        labels.append(sentence[1])
    return feats, labels

Now we are ready to train the CRF model.

In [7]:
#Train a sequence model
def train_seq(X_train,Y_train,X_dev,Y_dev):
    crf = CRF(algorithm='lbfgs', c1=0.1, c2=10, max_iterations=50)#, all_possible_states=True)
    #Just to fit on training data
    crf.fit(X_train, Y_train)
    labels = list(crf.classes_)
    #testing:
    y_pred = crf.predict(X_dev)
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
    print("Overall F1 score: ", metrics.flat_f1_score(Y_dev, y_pred,average='weighted', labels=labels))
    print(metrics.flat_classification_report(Y_dev, y_pred, labels=sorted_labels, digits=3))
    #get_confusion_matrix(Y_dev, y_pred,labels=sorted_labels)

In [9]:
train_path = 'train.txt'
test_path = 'test.txt'

conll_train = load__data_conll(train_path)
conll_dev = load__data_conll(test_path)

print("Training a sequence classification model with CRF")
feats, labels = get_feats_conll(conll_train)
devfeats, devlabels = get_feats_conll(conll_dev)
train_seq(feats, labels, devfeats, devlabels)


Training a sequence classification model with CRF
Overall F1 score:  0.9255163144785534
              precision    recall  f1-score   support

           O      0.973     0.981     0.977     38289
       B-LOC      0.694     0.765     0.728      1667
       I-LOC      0.738     0.482     0.584       257
      B-MISC      0.650     0.310     0.419       701
      I-MISC      0.624     0.505     0.558       214
       B-ORG      0.670     0.561     0.610      1660
       I-ORG      0.551     0.704     0.618       834
       B-PER      0.773     0.766     0.769      1616
       I-PER      0.819     0.886     0.851      1156

    accuracy                          0.928     46394
   macro avg      0.721     0.662     0.679     46394
weighted avg      0.926     0.928     0.926     46394



## Entity Linking

Entity Linking is the challenge of resolving ambiguous textual mentions to unique concepts in a knowledge base. The full tutorial also has a video on SpaCy:

Video: https://spacy.io/universe/project/video-entity-linking

Notebook: https://github.com/explosion/projects/blob/v3/tutorials/nel_emerson/notebooks/notebook_video.ipynb

In [2]:
#!pip install spacy==3.0.6
!pip install spacy-lookups-data
!python -m spacy download en_core_web_lg

Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.5-py2.py3-none-any.whl (98.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.5/98.5 MB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.5
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do

In [30]:
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(f"Named Entity '{ent.text}' with label '{ent.label_}'")

Named Entity 'Emerson' with label 'PERSON'
Named Entity 'Wimbledon' with label 'EVENT'


In [31]:
import csv
from pathlib import Path

def load_entities():
    entities_loc = "entities.csv"

    names = dict()
    descriptions = dict()
    with open(entities_loc, "r", encoding="utf8") as csvfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        for row in csvreader:
            qid = row[0]
            name = row[1]
            desc = row[2]
            names[qid] = name
            descriptions[qid] = desc
    return names, descriptions

In [32]:
name_dict, desc_dict = load_entities()
for QID in name_dict.keys():
    print(f"{QID}, name={name_dict[QID]}, desc={desc_dict[QID]}")

Q312545, name=Roy Stanley Emerson, desc=Australian tennis player
Q48226, name=Ralph Waldo Emerson, desc=American philosopher, essayist, and poet
Q215952, name=Emerson Ferreira da Rosa, desc=Brazilian footballer


We have 3 entries here, of 3 different people called Emerson. The first step to perform Entity Linking, is to set up a knowledge base that contains the unique identifiers of the entities we are interested in.

In [33]:
#from spacy.kb import KnowledgeBase
#kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

from spacy.kb import InMemoryLookupKB
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=300)



Now we will add our entities to our knowledgebase. We provide the qid, the entity vector, and freq (Estimate of the frequency of the entity in a typical corpus).

In [34]:
for qid, desc in desc_dict.items():
    desc_doc = nlp(desc)
    desc_enc = desc_doc.vector
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)   # 342 is an arbitrary value here

We first add the full names. Here, we are 100% certain that they resolve to their corresponding QID, as there is no ambiguity.

In [35]:
for qid, name in name_dict.items():
    kb.add_alias(alias=name, entities=[qid], probabilities=[1])   # 100% prior probability P(entity|alias)

We also want to add the alias "Emerson". We'll assume that each of our 3 Emersons is equally famous and thus we set their probabilities to be equal for each entity.

In [36]:
qids = name_dict.keys()
probs = [0.3 for qid in qids]
kb.add_alias(alias="Emerson", entities=qids, probabilities=probs)

4831166512461469197

So this will be our Knowledge base. We can check the entities and aliases that are contained in it:

In [37]:
print(f"Entities in the KB: {kb.get_entity_strings()}")
print(f"Aliases in the KB: {kb.get_alias_strings()}")
print(f"Candidates for 'Roy Stanley Emerson': {[c.entity_ for c in kb.get_alias_candidates('Roy Stanley Emerson')]}")
print(f"Candidates for 'Emerson': {[c.entity_ for c in kb.get_alias_candidates('Emerson')]}")
print(f"Candidates for 'Bob': {[c.entity_ for c in kb.get_alias_candidates('Sofie')]}")

Entities in the KB: ['Q215952', 'Q312545', 'Q48226']
Aliases in the KB: ['Roy Stanley Emerson', 'Emerson Ferreira da Rosa', 'Ralph Waldo Emerson', 'Emerson']
Candidates for 'Roy Stanley Emerson': ['Q312545']
Candidates for 'Emerson': ['Q312545', 'Q48226', 'Q215952']
Candidates for 'Bob': []


In [38]:
# change the directory and file names to whatever you like
import os
output_dir = Path.cwd().parent / "my_output"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
kb.to_disk(output_dir / "my_kb")

In [39]:
nlp.to_disk(output_dir / "my_nlp")

In [40]:
import json

json_loc = "emerson_annotated_text.jsonl"
with open(json_loc, "r", encoding="utf8") as jsonfile:
    line = jsonfile.readline()
    print(line)   # print just the first line

{"text":"Interestingly, Emerson is one of only five tennis players all-time to win multiple slam sets in two disciplines, only matched by Frank Sedgman, Margaret Court, Martina Navratilova and Serena Williams.","_input_hash":2024197919,"_task_hash":-1926469210,"spans":[{"start":15,"end":22,"text":"Emerson","rank":0,"label":"ORG","score":1,"source":"en_core_web_lg","input_hash":2024197919}],"meta":{"score":1},"options":[{"id":"Q48226","html":"<a href='https://www.wikidata.org/wiki/Q48226'>Q48226: American philosopher, essayist, and poet</a>"},{"id":"Q215952","html":"<a href='https://www.wikidata.org/wiki/Q215952'>Q215952: Brazilian footballer</a>"},{"id":"Q312545","html":"<a href='https://www.wikidata.org/wiki/Q312545'>Q312545: Australian tennis player</a>"},{"id":"NIL_otherLink","text":"Link not in options"},{"id":"NIL_ambiguous","text":"Need more context"}],"_session_id":null,"_view_id":"choice","accept":["Q312545"],"answer":"accept"}



In [41]:
dataset = []

with open(json_loc, "r", encoding="utf8") as jsonfile:
    for line in jsonfile:
        example = json.loads(line)
        text = example["text"]
        if example["answer"] == "accept":
            QID = example["accept"][0]
            offset = (example["spans"][0]["start"], example["spans"][0]["end"])
            entity_label = example["spans"][0]["label"]
            entities = [(offset[0], offset[1], entity_label)]
            links_dict = {QID: 1.0}
        dataset.append((text, {"links": {offset: links_dict}, "entities": entities}))

Check what this looks like.

In [42]:
dataset[0]

('Interestingly, Emerson is one of only five tennis players all-time to win multiple slam sets in two disciplines, only matched by Frank Sedgman, Margaret Court, Martina Navratilova and Serena Williams.',
 {'links': {(15, 22): {'Q312545': 1.0}}, 'entities': [(15, 22, 'ORG')]})

**How many cases do we have annotated?**

In [43]:
gold_ids = []
for text, annot in dataset:
    for span, links_dict in annot["links"].items():
        for link, value in links_dict.items():
            if value:
                gold_ids.append(link)

from collections import Counter
print(Counter(gold_ids))

Counter({'Q312545': 10, 'Q48226': 10, 'Q215952': 10})


Prepare the training and test dataset.

In [44]:
import random

train_dataset = []
test_dataset = []
for QID in qids:
    indices = [i for i, j in enumerate(gold_ids) if j == QID]
    train_dataset.extend(dataset[index] for index in indices[0:8])  # first 8 in training
    test_dataset.extend(dataset[index] for index in indices[8:10])  # last 2 in test

random.shuffle(train_dataset)
random.shuffle(test_dataset)

In [45]:
test_dataset

[('Carlyle in particular was a strong influence on him; Emerson would later serve as an unofficial literary agent in the United States for Carlyle, and in March 1835, he tried to persuade Carlyle to come to America to lecture.',
  {'links': {(53, 60): {'Q48226': 1.0}}, 'entities': [(53, 60, 'ORG')]}),
 ('In 1841 Emerson published Essays, his second book, which included the famous essay "Self-Reliance".',
  {'links': {(8, 15): {'Q48226': 1.0}}, 'entities': [(8, 15, 'PERSON')]}),
 ("Emerson's first Wimbledon singles title came in 1964, with a final victory over Fred Stolle.",
  {'links': {(0, 7): {'Q312545': 1.0}}, 'entities': [(0, 7, 'ORG')]}),
 ('Emerson was inducted into the International Tennis Hall of Fame in 1982 and the Sport Australia Hall of Fame in 1986.',
  {'links': {(0, 7): {'Q312545': 1.0}}, 'entities': [(0, 7, 'ORG')]}),
 ('Emerson made his Brazil debut on 10 September 1997, in a home friendly match against Ecuador, in Salvador, Bahia, also scoring a goal in the match, as 

With our datasets now properly set up, we'll now create Example objects to feed into the training process. This object is new in spaCy v3. Essentially, it contains a document with predictions (predicted) and one with gold-standard annotations (reference). During training, the pipeline will compare its predictions to the gold-standard, and update the weights of the neural network accordingly.

For entity linking, the algorithm needs access to gold-standard sentences, because the algorithms use the context from the sentence to perform the disambiguation. You can either provide gold-standard sent_starts annotations, or run a component such as the parser or sentencizer on your reference documents:

In [46]:
from spacy.training import Example

TRAIN_EXAMPLES = []
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")
for text, annotation in train_dataset:
    example = Example.from_dict(nlp.make_doc(text), annotation)
    example.reference = sentencizer(example.reference)
    TRAIN_EXAMPLES.append(example)

Then, we'll create a new Entity Linking component and add it to the pipeline.

We also need to make sure the entity_linker component is properly initialized. To do this, we need a get_examples function that returns some example training data, as well as a kb_loader argument.

In [47]:
from spacy.ml.models import load_kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
entity_linker.initialize(get_examples=lambda: TRAIN_EXAMPLES, kb_loader=load_kb(output_dir / "my_kb"))

Next, we will run the actual training loop for the new component, taking care to only train the entity linker and not the other components.

In [62]:
from spacy.util import minibatch
# batch size
batch_size = 32

with nlp.select_pipes(enable=["entity_linker"]):
    optimizer = nlp.resume_training()

    prev_training_loss = float('inf')
    patience = 5  # consecutive iterations allowed with no improvement
    count_no_improvement = 0

    for itn in range(100):
        random.shuffle(TRAIN_EXAMPLES)

        # training
        batches = minibatch(TRAIN_EXAMPLES, size=batch_size)
        losses = {}
        for batch in batches:
            nlp.update(
                batch,
                drop=0.2,       # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )

        #calculate average training loss
        avg_training_loss = losses.get("entity_linker", float('inf'))

        print(f"Iteration {itn}, Training Loss: {avg_training_loss}")

        #check for early stopping based on training loss (you could also add a validation set)
        if avg_training_loss >= prev_training_loss:
            count_no_improvement += 1
        else:
            count_no_improvement = 0

        if count_no_improvement >= patience:
            print(f"Stopping early at iteration {itn} as training loss did not improve.")
            break

        prev_training_loss = avg_training_loss

print(f"Final Iteration {itn}, Training Loss: {avg_training_loss}")


Iteration 0, Training Loss: 0.2325533390045166
Stopping early at iteration 41 as training loss did not improve.
Final Iteration 41, Training Loss: 0.24009428024291993


In [63]:
for text, true_annot in test_dataset:
    print(text)
    print(f"Gold annotation: {true_annot}")
    doc = nlp(text)  # to make this more efficient, you can use nlp.pipe() just once for all the texts
    for ent in doc.ents:
        if ent.text == "Emerson":
            print(f"Prediction: {ent.text}, {ent.label_}, {ent.kb_id_}")
    print()

Carlyle in particular was a strong influence on him; Emerson would later serve as an unofficial literary agent in the United States for Carlyle, and in March 1835, he tried to persuade Carlyle to come to America to lecture.
Gold annotation: {'links': {(53, 60): {'Q48226': 1.0}}, 'entities': [(53, 60, 'ORG')]}
Prediction: Emerson, ORG, Q215952

In 1841 Emerson published Essays, his second book, which included the famous essay "Self-Reliance".
Gold annotation: {'links': {(8, 15): {'Q48226': 1.0}}, 'entities': [(8, 15, 'PERSON')]}
Prediction: Emerson, ORG, Q48226

Emerson's first Wimbledon singles title came in 1964, with a final victory over Fred Stolle.
Gold annotation: {'links': {(0, 7): {'Q312545': 1.0}}, 'entities': [(0, 7, 'ORG')]}
Prediction: Emerson, ORG, Q48226

Emerson was inducted into the International Tennis Hall of Fame in 1982 and the Sport Australia Hall of Fame in 1986.
Gold annotation: {'links': {(0, 7): {'Q312545': 1.0}}, 'entities': [(0, 7, 'ORG')]}
Prediction: Emerson

In [64]:
text = "Tennis champion Emerson was expected to win Wimbledon."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

Emerson PERSON Q215952
Wimbledon EVENT NIL
