# L5: Information extraction

# Omkar Bhutra (omkbh878)

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, **named entity recognition** (identifying mentions of entities) and **entity linking** (matching these mentions to entities in a knowledge base).

We start by loading spaCy:

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

The data that we will be using has been tokenized following the conventions of the [Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html), and we need to prevent spaCy from using its own tokenizer on top of this. We therefore override spaCy&rsquo;s tokenizer with one that simply splits on space.

In [4]:
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return Doc(self.vocab, words=text.split(" "))

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

## Data set

The main data set for this lab is a collection of news wire articles in which mentions of named entities have been annotated with page names from the [English Wikipedia](https://en.wikipedia.org/wiki/). The next code cell loads the training and the development parts of the data into Pandas data frames.

In [5]:
import bz2
import csv
import pandas as pd

with bz2.open("ner-train.tsv.bz2", 'rt',encoding='utf8') as source:
    df_train = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

with bz2.open("ner-dev.tsv.bz2", 'rt',encoding='utf8') as source:
    df_dev = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

Each row in these two data frames corresponds to one mention of a named entity and has five columns:

1. a unique identifier for the sentence containing the entity mention
2. the pre-tokenized sentence, with tokens separated by spaces
3. the start position of the token span containing the entity mention
4. the end position of the token span (exclusive, as in Python list indexing)
5. the entity label; either a Wikipedia page name or the generic label `--NME--`

The following cell prints the first five samples from the training data:

In [6]:
df_train.head()

Unnamed: 0,sentence_id,sentence,beg,end,label
0,0000-000,EU rejects German call to boycott British lamb .,0,1,--NME--
1,0000-000,EU rejects German call to boycott British lamb .,2,3,Germany
2,0000-000,EU rejects German call to boycott British lamb .,6,7,United_Kingdom
3,0000-001,Peter Blackburn,0,2,--NME--
4,0000-002,BRUSSELS 1996-08-22,0,1,Brussels


In this sample, we see that the first sentence is annotated with three entity mentions:

* the span 0–1 &lsquo;EU&rsquo; is annotated as a mention but only labelled with the generic `--NME--`
* the span 2–3 &lsquo;German&rsquo; is annotated with the page [Germany](http://en.wikipedia.org/wiki/Germany)
* the span 6–7 &lsquo;British&rsquo; is annotated with the page [United_Kingdom](http://en.wikipedia.org/wiki/United_Kingdom)

## Problem 1: Evaluation measures

To warm up, we ask you to write code to print the three measures that you will be using for evaluation:

In [7]:
import numpy as np
def evaluation_report(gold, pred):
    """Print precision, recall, and F1 score.
    
    Args:
        gold: The set with the gold-standard values.
        pred: The set with the predicted values.
    
    Returns:
        Nothing, but prints the precision, recall, and F1 values computed
        based on the specified sets.
    """
    # TODO: Replace the next line with your own code
    
    precision = len(gold.intersection(pred))*100/len(pred)
    recall = len(gold.intersection(pred))*100/len(gold)
    f1_score = 2 * (precision*recall)/(precision + recall)
    
    print("Precision score is",precision,"%")
    print("Recall score is",recall,"%")
    print("f1 score is",f1_score,"%")
    pass

To test your code, you can run the following cell:

In [8]:
evaluation_report(set(range(3)), set(range(5)))

Precision score is 60.0 %
Recall score is 100.0 %
f1 score is 75.0 %


This should give you a precision of 60%, a recall of 100%, and an F1-value of 75%.

## Problem 2: Span recognition

One of the first tasks that an information extraction system has to solve is to locate and classify (mentions of) named entities, such as persons and organizations. Here we will tackle the simpler task of recognizing **spans** of tokens that contain an entity mention, without the actual entity label.

The English language model in spaCy features a full-fledged [named entity recognizer](https://spacy.io/usage/linguistic-features#named-entities) that identifies a variety of entities, and can be updated with new entity types by the user. Your task in this problem is to evaluate the performance of this component when predicting entity spans in the development data.

Start by implementing a generator function that yields the gold-standard spans in a given data frame.

**Hint:** The Pandas method [`itertuples()`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.itertuples.html) is useful when iterating over the rows in a DataFrame.

In [18]:
def gold_spans(df):
    """Yield the gold-standard mention spans in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # TODO: Replace the next line with your own code
    for _,sentence_id, _, beg, end, *_ in df.itertuples():   
        yield(sentence_id,beg,end) 

To test your code, you can count the spans yielded by your function. When called on the development data, you should get a total of 5,917 unique triples. The first triple and the last triple should be

    ('0946-000', 2, 3)
    ('1161-010', 1, 3)  

In [19]:
spans_dev_gold = set(gold_spans(df_dev))
print(len(spans_dev_gold))

5917


In [20]:
spans_dev_gold1 = list(gold_spans(df_dev))
print(spans_dev_gold1[0])
print(spans_dev_gold1[5916])

('0946-000', 2, 3)
('1161-010', 1, 3)


Your next task is to write code that calls spaCy to predict the named entities in the development data, and to evaluate the accuracy of these predictions in terms of precision, recall, and F1. Print these scores using the function that you wrote for Problem&nbsp;1.

In [21]:
# TODO: Write code here to run and evaluate the spaCy NER on the development data
def spans(data):
    for i in data.sentence_id.unique():
        sentence = data[data.sentence_id == i].sentence.values[0]
        doc = nlp(sentence)
        for ent in doc.ents:
            yield (i, ent.start, ent.end)
evaluation_report(spans_dev_gold, set(spans(df_dev)))

Precision score is 52.94499221587961 %
Recall score is 68.97076221057968 %
f1 score is 59.904587155963306 %


## Problem 3: Error analysis

As you were able to see in Problem&nbsp;2, the span accuracy of the named entity recognizer is far from perfect. In particular, only slightly more than half of the predicted spans are correct according to the gold standard. Your next task is to analyse this result in more detail.

Write code that prints the false positives and the false negatives from the automatic prediction. Have a look at the output. What are your observations? How could you improve the result? Discuss these questions in a short text.

In [24]:
# TODO: Write code here to do your analysis
print("The number of false negatives are",len(spans_dev_gold - set(spans(df_dev))))
#print("The false negatives are",list(spans_dev_gold - set(spans(df_dev))))
def false_pos(gold, pred, df):
    for i in gold:
        if i not in pred:
            print(i)
            print(df[df['sentence_id'] == i[0]]['sentence'].values[0].split(' ')[i[1]:i[2]])
false_pos(spans_dev_gold, spans(df_dev), df_dev)            

The number of false negatives are 1836
('0969-002', 9, 10)
['French']
('1128-013', 7, 8)
['Swiss']
('1130-003', 6, 8)
['Bertillon', '166']
('1077-006', 0, 1)
['England']
('1090-014', 0, 1)
['Seles']
('1031-008', 25, 26)
['Pirelli']
('0949-006', 23, 24)
['Hoddle']
('1160-018', 4, 5)
['Collins']
('0957-018', 26, 28)
['Alex', 'Corretja']
('1009-003', 25, 26)
['U.S.']
('1054-008', 6, 7)
['Rabobank']
('1064-002', 1, 2)
['Wallaby']
('1135-013', 0, 2)
['Suu', 'Kyi']
('1156-009', 11, 12)
['Burundi']
('1146-004', 12, 13)
['Aziz']
('0959-013', 2, 3)
['PITTSBURGH']
('0966-143', 4, 5)
['Britain']
('0965-006', 0, 1)
['Jamaica']
('1152-014', 9, 10)
['F-14']
('0962-011', 3, 5)
['Brian', 'Henninger']
('1159-000', 7, 8)
['Belgium']
('1060-020', 5, 6)
['Spain']
('1100-002', 1, 3)
['Tom', 'Johnson']
('1057-002', 18, 19)
['Edgbaston']
('1134-007', 0, 1)
['Ramos']
('1074-002', 0, 1)
['Mauritania']
('0990-005', 7, 8)
['Yeltsin']
('1142-005', 9, 11)
['Faik', 'Nerweyi']
('1123-004', 0, 1)
['Ssangbangwool']
('

('1022-007', 13, 14)
['C$']
('1058-044', 13, 15)
['Dave', 'Nilsson']
('1133-011', 0, 1)
['Wang']
('1011-013', 5, 6)
['Norwest']
('1058-026', 1, 2)
['Orioles']
('1091-003', 39, 40)
['Austria']
('1096-021', 0, 1)
['Foster']
('1099-004', 0, 1)
['Scotland']
('1101-008', 33, 35)
['Osvaldo', 'Sanchez']
('1124-010', 35, 36)
['Afrikaans']
('1125-006', 28, 29)
['Zairean']
('1160-004', 4, 6)
['Liam', 'Neeson']
('1055-006', 0, 2)
['M.', 'Atherton']
('0995-010', 27, 28)
['Islamic']
('0947-006', 20, 21)
['Durham']
('0957-015', 11, 12)
['France']
('1087-018', 0, 1)
['Eyles']
('0968-004', 43, 44)
['Milan']
('0972-005', 0, 2)
['M.', 'Waugh']
('1140-001', 0, 2)
['HONG', 'KONG']
('0948-022', 7, 9)
['The', 'Oval']
('1146-002', 16, 17)
['Iraq']
('1146-000', 4, 5)
['Baghdad']
('0957-030', 5, 6)
['U.S.']
('1056-016', 8, 10)
['Andrea', 'Collinelli']
('1039-001', 6, 7)
['Indonesian']
('1095-009', 0, 2)
['New', 'York']
('0997-006', 5, 6)
['Moscow']
('1155-011', 3, 4)
['Germany']
('1052-004', 3, 6)
['Jan', 'All

['Chicago', 'PMI']
('1160-016', 9, 10)
['Dublin']
('1098-002', 6, 7)
['WBO']
('1036-014', 0, 1)
['Longyear']
('1058-027', 19, 21)
['Frank', 'Thomas']
('0974-009', 6, 9)
['Aravinda', 'de', 'Silva']
('1057-007', 16, 18)
['Wasim', 'Akram']
('1036-017', 9, 10)
['Barentsburg']
('1088-000', 2, 5)
['HONG', 'KONG', 'OPEN']
('1045-003', 7, 8)
['Palestinian']
('0954-004', 2, 4)
['Peter', 'Nicol']
('1035-001', 0, 1)
['BONN']
('1150-001', 2, 3)
['N.H.']
('1133-013', 16, 17)
['Wang']
('1008-002', 0, 1)
['WASHINGTON']
('0971-002', 3, 4)
['Australia']
('0966-121', 1, 3)
['Jean', 'Galfione']
('1138-002', 14, 18)
['Human', 'Rights', 'in', 'China']
('1031-005', 15, 16)
['Wuxi']
('0966-041', 4, 5)
['Russia']
('1015-003', 26, 27)
['Canada']
('0957-012', 0, 2)
['Alexander', 'Volkov']
('1090-000', 4, 5)
['U.S.']
('1096-041', 13, 15)
['Donne', 'Wall']
('0981-006', 10, 12)
['North', 'American']
('1101-012', 71, 73)
['Ramon', 'Ramirez']
('1072-019', 36, 38)
['Andrew', 'Mehrtens']
('0960-010', 8, 10)
['New', 'H

['Chicago', 'Board', 'of', 'Trade']
('0994-015', 10, 11)
['Moscow']
('1138-003', 13, 14)
['Wang']
('1126-010', 16, 17)
['U.N.']
('1027-012', 3, 4)
['Palestinians']
('1070-015', 1, 3)
['Mauricio', 'Gugelmin']
('0959-010', 3, 5)
['NEW', 'YORK']
('0997-006', 0, 2)
['Naina', 'Yeltsin']
('1025-011', 30, 31)
['Jerusalem']
('1072-013', 11, 13)
['Sean', 'Fitzpatrick']
('0961-007', 1, 2)
['Randall']
('0994-001', 0, 2)
['Larisa', 'Sayenko']
('1016-002', 4, 5)
['Israelis']
('0995-006', 20, 21)
['Layron']
('1126-002', 33, 34)
['Serbs']
('1055-009', 0, 2)
['R.', 'Irani']
('1027-009', 0, 1)
['Palestinians']
('1072-018', 60, 62)
['Wayne', 'Fyvie']
('0958-002', 0, 3)
['Major', 'League', 'Baseball']
('1090-005', 11, 13)
['Czech', 'Republic']
('0969-003', 5, 6)
['Nice']
('1016-000', 3, 5)
['Middle', 'East']
('0996-003', 30, 32)
['Foreign', 'Ministry']
('1011-019', 12, 14)
['Boatmen', "'s"]
('1056-027', 1, 3)
['Kathrin', 'Freitag']
('1092-007', 13, 15)
['South', 'Africa']
('0972-011', 5, 6)
['Chandana']


('1018-003', 2, 4)
['Nicole', 'Nichterwitz']
('0966-064', 1, 3)
['Merlene', 'Ottey']
('1127-006', 8, 9)
['Chechen']
('0990-012', 0, 1)
['Lebed']
('0971-003', 8, 10)
['Sri', 'Lanka']
('1094-051', 2, 3)
['PITTSBURGH']
('1148-004', 0, 2)
['LA', 'PRESSE']
('1027-016', 12, 13)
['al-Ram']
('0966-010', 1, 3)
['Gillian', 'Russell']
('1027-011', 12, 13)
['Palestinian']
('1085-001', 0, 1)
['MINSK']
('0953-012', 22, 24)
['Puerto', 'Rico']
('1154-002', 16, 17)
['Algerian']
('1096-020', 22, 23)
['Atlanta']
('1127-009', 9, 10)
['Chechnya']
('0966-031', 1, 3)
['Michael', 'Johnson']
('1077-005', 0, 1)
['Hoddle']
('0966-021', 4, 5)
['U.S.']
('0974-009', 3, 5)
['Asanka', 'Gurusinha']
('0980-014', 16, 17)
['U.S.']
('1062-000', 2, 3)
['WALES']
('1025-005', 14, 16)
['West', 'Bank']
('1036-004', 0, 1)
['Longyear']
('0966-055', 1, 3)
['Anthony', 'Washington']
('0990-002', 0, 1)
['Russian']
('0948-040', 3, 4)
['Middlesex']
('1143-003', 16, 17)
['France']
('1101-012', 88, 90)
['Francisco', 'Palencia']
('1084-0

['Corentine', 'Martins']
('0948-036', 6, 8)
['Old', 'Trafford']
('1094-015', 0, 1)
['MINNESOTA']
('1067-014', 2, 3)
['Hamilton']
('1104-002', 6, 7)
['Spanish']
('1103-015', 76, 79)
['Gilles', 'De', 'Bilde']
('1152-007', 24, 25)
['Iraqi']
('0963-024', 3, 5)
['Tom', 'Lehman']
('0957-016', 5, 6)
['Spain']
('1028-000', 7, 8)
['Mich']
('0972-011', 0, 2)
['D.', 'Lehmann']
('1154-008', 9, 10)
['ONDH']
('1087-002', 20, 23)
['Hong', 'Kong', 'Open']
('1007-027', 3, 5)
['Wall', 'Street']
('0956-016', 0, 1)
['Hanwha']
('1101-011', 82, 84)
['Patrice', 'Loko']
('1016-013', 7, 8)
['German']
('1137-008', 7, 9)
['South', 'Korean']
('1156-003', 32, 33)
['Italian']
('0960-001', 0, 2)
['Larry', 'Fine']
('1100-001', 0, 1)
['DUBLIN']
('1009-009', 9, 10)
['U.S.']
('1081-002', 0, 1)
['Armenia']
('1096-042', 8, 9)
['RBI']
('1019-013', 8, 9)
['Dutch']
('1047-000', 0, 2)
['RUGBY', 'LEAGUE']
('1131-002', 8, 10)
['Madeleine', 'Albright']
('0965-003', 1, 3)
['Dennis', 'Mitchell']
('1057-004', 7, 8)
['England']
('11

['G.', 'Hamilton']
('1004-003', 37, 38)
['Sihanouk']
('1063-002', 0, 1)
['Ukraine']
('1069-007', 0, 1)
['Glamorgan']
('1051-009', 45, 46)
['Spain']
('0966-087', 4, 5)
['Australia']
('0966-093', 4, 5)
['Germany']
('1069-004', 5, 6)
['Hampshire']
('0966-163', 4, 6)
['Davidson', 'Ezinwa']
('0949-001', 0, 1)
['LONDON']
('1131-002', 31, 33)
['Latin', 'American']
('1132-001', 0, 2)
['MEXICO', 'CITY']
('0959-012', 0, 1)
['Cincinnati']
('0957-010', 11, 12)
['Spain']
('0966-058', 1, 3)
['Virgilijus', 'Alekna']
('0993-007', 4, 5)
['Belgrade']
('0966-114', 1, 3)
['Craig', 'Winrow']
('1068-014', 3, 6)
['Queens', 'Park', 'Rangers']
('1087-017', 0, 1)
['Jansher']
('1160-024', 19, 20)
['Ireland']
('1008-010', 37, 38)
['Fla']
('1027-001', 0, 2)
['Sami', 'Aboudi']
('1051-011', 25, 27)
['Stephen', 'McAllister']
('1156-001', 0, 1)
['ROME']
('0998-001', 0, 1)
['KHASAVYURT']
('1058-030', 10, 11)
['Chicago']
('1116-008', 0, 1)
['Charles']
('1057-006', 17, 19)
['Matthew', 'Maynard']
('0977-000', 0, 1)
['Cana

['Vladimir', 'Makovsky']
('1056-034', 24, 26)
['Jean-Michel', 'Monin']
('1045-000', 0, 2)
['U.N.', 'Council']
('1094-001', 0, 2)
['NEW', 'YORK']
('1096-022', 1, 2)
['Braves']
('1081-002', 7, 9)
['World', 'Cup']
('1096-034', 19, 21)
['Brent', 'Mayne']
('0993-003', 26, 27)
['Belgrade']
('0988-002', 2, 5)
['Driefontein', 'Consolidated', 'Ltd']
('1012-009', 7, 8)
['U.N.']
('1007-004', 1, 2)
['Clinton']
('1154-003', 0, 1)
['Algerian']
('0954-001', 0, 2)
['HONG', 'KONG']
('1000-004', 0, 1)
['Cofinec']
('0966-017', 4, 5)
['Germany']
('0994-010', 32, 34)
['Soviet', 'Union']
('1156-002', 9, 10)
['Tanzanian']
('1006-010', 2, 3)
['Libyans']
('1096-036', 13, 15)
['Alan', 'Benes']
('1012-006', 1, 4)
['U.S.', 'Treasury', 'Department']
('0966-084', 1, 3)
['Fabrizio', 'Mori']
('1101-003', 37, 39)
['Franck', 'Leboeuf']
('0957-020', 18, 20)
['Czech', 'Republic']
('1009-002', 0, 1)
['WASHINGTON']
('1113-004', 2, 4)
['Hansa', 'Rostock']
('1152-017', 10, 14)
['Tarawa', 'Amphibious', 'Readiness', 'Group']
(

('1094-044', 0, 2)
['WESTERN', 'DIVISION']
('1033-003', 28, 29)
['Chechnya']
('1082-000', 4, 5)
['SWITZERLAND']
('1015-003', 5, 6)
['Titanic']
('1056-043', 14, 15)
['USA']
('1007-007', 45, 47)
['Wall', 'Street']
('1031-006', 30, 31)
['Malaysia']
('1016-014', 5, 6)
['Israeli']
('1009-015', 1, 4)
['California', 'Avocado', 'Commission']
('1073-010', 0, 2)
['New', 'Zealand']
('1017-006', 13, 15)
['Air', 'France']
('1157-002', 1, 5)
['North', 'Atlantic', 'Treaty', 'Organisation']
('0958-030', 0, 1)
['BALTIMORE']
('0966-159', 1, 3)
['Rita', 'Ramaunaskaite']
('1052-005', 8, 10)
['Peter', 'Dubovsky']
('1006-012', 3, 4)
['Egypt']
('1115-006', 0, 1)
['Orr']
('1051-009', 15, 17)
['Philip', 'Walton']
('0966-048', 1, 3)
['Laban', 'Rotich']
('1055-000', 2, 3)
['ENGLAND']
('0998-000', 0, 1)
['Lebed']
('1016-009', 3, 4)
['Bonn']
('1064-007', 40, 41)
['Northampton']
('1040-001', 0, 1)
['TOKYO']
('0993-004', 21, 22)
['Batajnica']
('1110-000', 4, 5)
['WBA']
('1136-002', 26, 28)
['China', 'Daily']
('1153-

('1096-010', 9, 10)
['Reds']
('1133-000', 0, 1)
['China']
('1012-002', 20, 21)
['U.S.']
('1145-001', 0, 1)
['MANAMA']
('1096-000', 5, 6)
['ERA']
('1031-004', 16, 21)
['Wuxi', 'Tong', 'Ling', 'Company', 'Ltd']
('0976-001', 0, 1)
['WARSAW']
('0980-015', 13, 16)
['Jackson', 'Hole', 'symposium']
('1112-002', 16, 17)
['France']
('0980-003', 3, 4)
['NAPM']
('1009-012', 9, 10)
['U.S.']
('1088-002', 21, 22)
['Pakistan']
('1087-019', 10, 12)
['Portuguese', 'Open']
('1060-009', 4, 5)
['Italy']
('0972-036', 2, 4)
['S.', 'Waugh']
('1004-003', 7, 9)
['Sereipheap', 'Thmei']
('1014-009', 6, 7)
['Dole']
('0991-006', 4, 5)
['Maskhadov']
('1075-000', 5, 6)
['BENIN']
('1103-009', 10, 12)
['Rustu', 'Recber']
('0960-013', 20, 21)
['Washington']
('1064-007', 21, 24)
['Rugby', 'Football', 'Union']
('1130-001', 0, 1)
['HAVANA']
('0986-006', 12, 13)
['McLoughlin']
('1009-010', 0, 1)
['California']
('1152-016', 22, 24)
['Saudi', 'Arabia']
('0985-004', 6, 9)
['World', 'Boxing', 'Council']
('1121-003', 41, 42)
['

('1060-017', 1, 3)
['Paul', 'McGinley']
('1096-003', 0, 1)
['Brown']
('0966-078', 4, 5)
['Kenya']
('1116-025', 2, 3)
['Belgian']
('1054-016', 4, 5)
['Denmark']
('1150-003', 6, 8)
['Lilac', 'Falls']
('0966-122', 4, 5)
['Russia']
('0996-007', 24, 25)
['U.N.']
('1144-004', 5, 6)
['KDP']
('1022-010', 10, 11)
['Canadian']
('0966-161', 10, 11)
['U.S.']
('1017-006', 8, 9)
['Communist-led']
('1133-002', 8, 9)
['China']
('1055-007', 0, 2)
['G.', 'Thorpe']
('1072-003', 16, 18)
['All', 'Blacks']
('1068-029', 0, 2)
['Cambridge', 'United']
('1056-016', 22, 24)
['Anton', 'Chantyr']
('1099-007', 7, 8)
['Scotland']
('1058-028', 0, 1)
['Thomas']
('1144-003', 16, 17)
['Arbil']
('0948-025', 3, 4)
['Gloucestershire']
('1036-018', 0, 1)
['Sletten']
('1076-029', 0, 1)
['Obilic']
('1160-014', 30, 31)
['Jordan']
('0946-004', 2, 3)
['Somerset']
('1148-008', 4, 5)
['Tunisia']
('1010-019', 8, 9)
['AIDS']
('1012-006', 22, 23)
['Libyan']
('1028-000', 2, 6)
['Bay', 'Co', 'Bldg', 'Auth']
('1066-021', 0, 1)
['Hudders

('1011-018', 3, 4)
['Missouri']
('0995-010', 4, 5)
['Afghanistan']
('1011-017', 15, 16)
['Midwest']
('1073-006', 7, 8)
['Joubert']
('0958-027', 0, 1)
['CLEVELAND']
('1072-019', 41, 43)
['Justin', 'Marshall']
('1006-014', 17, 18)
['Sudanese']
('1101-000', 5, 7)
['WORLD', 'CUP']
('0993-006', 1, 2)
['Batajnica']
('1041-001', 0, 1)
['TOKYO']
('1068-023', 0, 1)
['Millwall']
('1068-036', 0, 1)
['Scarborough']
('1008-006', 4, 5)
['Chicago']
('1116-009', 13, 14)
['Napoleon']
('1051-009', 18, 19)
['Ireland']
('0957-007', 6, 8)
['Doug', 'Flach']
('0983-006', 18, 19)
['Khartoum']
('1026-001', 0, 1)
['CAIRO']
('1044-007', 0, 1)
['Arafat']
('0966-039', 4, 5)
['Germany']
('0990-002', 17, 18)
['Moscow']
('1027-007', 36, 37)
['Gaza']
('1097-009', 10, 11)
['Edberg']
('1060-024', 1, 3)
['Paul', 'Broadhurst']
('1072-019', 67, 69)
['Robin', 'Brooke']
('1122-001', 0, 1)
['LONDON']
('1131-005', 5, 6)
['Uruguay']
('1103-015', 15, 17)
['Dirk', 'Medved']
('1060-007', 1, 3)
['Robert', 'Allenby']
('1160-014', 20

('1096-002', 2, 3)
['ERA']
('1027-014', 4, 6)
['Abu', 'Ammar']
('0972-005', 5, 6)
['Jayasuriya']
('1017-001', 0, 1)
['PARIS']
('1116-037', 13, 14)
['Uzbekistan']
('1090-003', 37, 39)
['U.S.', 'Open']
('1012-008', 19, 20)
['Tripoli']
('1103-006', 10, 12)
['Sergen', 'Yalcin']
('0996-006', 3, 4)
['Serbian']
('1073-004', 5, 9)
['Joost', 'van', 'der', 'Westhuizen']
('0960-040', 7, 8)
['Tarango']
('0984-012', 0, 2)
['UNITED', 'STATES']
('0990-006', 0, 1)
['Lebed']
('1054-003', 8, 12)
['Tour', 'of', 'the', 'Netherlands']
('0978-002', 11, 14)
['Daniels', 'Pharmaceuticals', 'Inc']
('1027-011', 3, 6)
['Arab', 'East', 'Jerusalem']
('1014-003', 17, 19)
['Scott', 'Reed']
('0983-000', 0, 1)
['Iraqi']
('1127-012', 12, 14)
['Mubatik', 'Dagayeva']
('0961-007', 10, 11)
['NFL']
('1152-007', 31, 32)
['Iraq']
('1135-000', 0, 1)
['Burma']
('1092-004', 13, 14)
['Belgium']
('1037-000', 6, 7)
['China']
('1054-011', 4, 5)
['Italy']
('1045-003', 18, 19)
['Germany']
('1156-004', 14, 16)
['Great', 'Lakes']
('1008-

In [15]:
print("The number of false positives are",len(set(spans(df_dev))-spans_dev_gold))
#print("The false positives are",set(set(spans(df_dev))-spans_dev_gold))
def false_neg(gold, pred, df):
    for i in pred:
        if i not in gold:
            print(i)
            print(df[df['sentence_id'] == i[0]]['sentence'].values[0].split(' ')[i[1]:i[2]])
false_neg(spans_dev_gold, spans(df_dev), df_dev)   

The number of false positives are 3627
('0946-002', 6, 7)
['four']
('0946-002', 8, 9)
['38']
('0946-002', 10, 11)
['Friday']
('0946-002', 19, 20)
['39']
('0946-002', 22, 24)
['two', 'days']
('0946-004', 5, 6)
['83']
('0946-004', 7, 10)
['the', 'opening', 'morning']
('0946-004', 17, 18)
['first']
('0946-004', 20, 21)
['94']
('0946-004', 27, 28)
['296']
('0946-004', 34, 35)
['three']
('0946-004', 36, 37)
['83']
('0946-005', 2, 3)
['213']
('0946-005', 11, 12)
['second']
('0946-005', 22, 23)
['174']
('0946-006', 26, 29)
['Yorkshire', 'at', 'Headingley']
('0946-007', 7, 8)
['one-day']
('0946-007', 11, 12)
['158']
('0946-007', 13, 20)
['his', 'first', 'championship', 'century', 'of', 'the', 'season']
('0946-007', 24, 25)
['372']
('0946-007', 28, 29)
['first']
('0946-007', 32, 33)
['82']
('0946-008', 20, 21)
['four']
('0946-008', 22, 23)
['24']
('0946-008', 24, 25)
['48']
('0946-008', 31, 32)
['119']
('0946-008', 33, 34)
['five']
('0946-009', 24, 25)
['four']
('0946-009', 26, 27)
['45']
('094

['3.']
('0966-023', 1, 2)
['Florian']
('0966-023', 6, 7)
['13.36']
('0966-024', 0, 1)
['4.']
('0966-024', 6, 7)
['13.52']
('0966-025', 0, 1)
['5.']
('0966-025', 6, 7)
['13.52']
('0966-026', 0, 1)
['6.']
('0966-026', 6, 7)
['13.53']
('0966-027', 6, 7)
['13.58']
('0966-028', 6, 7)
['13.60']
('0966-030', 0, 1)
['1.']
('0966-030', 6, 8)
['19.97', 'seconds']
('0966-031', 0, 1)
['2.']
('0966-031', 6, 7)
['20.02']
('0966-032', 0, 1)
['3.']
('0966-032', 6, 7)
['20.37']
('0966-033', 0, 1)
['4.']
('0966-033', 6, 7)
['20.41']
('0966-034', 0, 1)
['5.']
('0966-034', 6, 7)
['20.54']
('0966-035', 6, 7)
['20.78']
('0966-036', 6, 7)
['20.90']
('0966-037', 6, 7)
['20.96']
('0966-039', 0, 1)
['1.']
('0966-039', 6, 8)
['19.89', 'metres']
('0966-040', 0, 1)
['2.']
('0966-040', 6, 7)
['18.80']
('0966-041', 0, 1)
['3.']
('0966-041', 6, 7)
['18.63']
('0966-042', 0, 1)
['4.']
('0966-042', 6, 7)
['18.55']
('0966-043', 0, 1)
['5.']
('0966-043', 6, 7)
['18.41']
('0966-045', 0, 1)
['1.']
('0966-045', 6, 10)
['3', 

['Driefontein', 'Consolidated', 'Ltd', "'s"]
('0988-002', 16, 17)
['Wednesday']
('0988-002', 17, 18)
['night']
('0988-002', 21, 23)
['Gold', 'Fields']
('0988-002', 24, 27)
['South', 'Africa', 'Ltd']
('0988-002', 29, 30)
['Friday']
('0988-005', 0, 3)
['At', 'least', '17']
('0988-005', 17, 19)
['Driefontein', 'Consolidated']
('0988-005', 23, 27)
['Kloof', 'Gold', 'Mining', 'Co']
('0988-005', 27, 29)
['this', 'month']
('0988-006', 5, 7)
['482', '1003']
('0989-002', 9, 11)
['two', 'days']
('0989-002', 18, 19)
['Education']
('0989-002', 26, 28)
['four', 'hours']
('0989-002', 33, 34)
['Friday']
('0990-000', 2, 4)
['Lebed', 'Chechnya']
('0990-002', 7, 8)
['Friday']
('0990-002', 23, 26)
['Alexander', 'Lebed', "'s"]
('0990-003', 11, 14)
['Interfax', 'quoted', 'Chernomyrdin']
('0990-005', 9, 10)
['yesterday']
('0990-006', 18, 20)
['last', 'week']
('0990-006', 28, 32)
['more', 'than', 'a', 'year']
('0990-008', 3, 4)
['Friday']
('0990-008', 15, 17)
['the', 'day']
('0990-009', 11, 14)
['the', 'Russ

('1005-015', 10, 11)
['Thursday']
('1005-015', 13, 14)
['5.255']
('1006-003', 10, 15)
['Sunday', 'of', 'the', '27th', 'anniversary']
('1006-004', 5, 9)
['the', '"', 'great', 'revolution']
('1006-004', 27, 28)
['two']
('1006-004', 34, 35)
['1988']
('1006-005', 15, 16)
['one']
('1006-006', 15, 16)
['Sunday']
('1006-007', 0, 1)
['Three']
('1006-007', 18, 22)
['September', '1', ',', '1969']
('1006-007', 33, 34)
['27-year-old']
('1006-007', 37, 40)
['King', 'Mohammed', 'Idris']
('1006-009', 24, 25)
['42']
('1006-009', 33, 34)
['Friday']
('1006-011', 11, 12)
['July']
('1006-013', 0, 3)
['At', 'least', '20']
('1006-013', 10, 12)
['early', 'July']
('1006-014', 12, 14)
['last', 'year']
('1006-014', 15, 16)
['thousands']
('1007-004', 23, 26)
['billions', 'of', 'dollars']
('1007-005', 22, 25)
['the', 'Federal', 'Reserve']
('1007-010', 0, 2)
['This', 'week']
('1007-010', 15, 17)
['15', 'percent']
('1007-010', 29, 32)
['$', '548', 'billion']
('1007-011', 3, 6)
['$', '8.4', 'billion']
('1007-012', 1

['second']
('1034-002', 21, 22)
['1992']
('1034-002', 30, 31)
['43']
('1034-003', 28, 31)
['El', 'Al', "'s"]
('1034-004', 18, 21)
['May', 'this', 'year']
('1034-007', 5, 6)
['171']
('1034-007', 6, 7)
['542']
('1034-007', 7, 8)
['8982']
('1034-007', 10, 11)
['171']
('1035-002', 7, 10)
['earlier', 'this', 'week']
('1035-002', 32, 33)
['Friday']
('1035-003', 29, 30)
['Wednesday']
('1035-004', 10, 13)
['around', '20', 'percent']
('1035-004', 18, 20)
['May', '1994']
('1035-008', 12, 13)
['Armenian']
('1035-009', 7, 8)
['Tuesday']
('1035-009', 23, 24)
['two']
('1035-009', 25, 26)
['first']
('1035-009', 28, 30)
['last', 'December']
('1035-010', 14, 22)
['the', 'Organisation', 'for', 'Security', 'and', 'Cooperation', 'in', 'Europe']
('1036-000', 3, 5)
['Arctic', 'island']
('1036-005', 3, 4)
['Thursday']
('1036-005', 26, 27)
['141']
('1036-009', 14, 15)
['Thursday']
('1036-011', 19, 20)
['29']
('1036-014', 6, 9)
['just', 'over', '1,000']
('1036-017', 26, 27)
['52']
('1036-018', 11, 13)
['30', '

['British']
('1060-004', 0, 1)
['Saturday']
('1060-005', 0, 1)
['1.']
('1060-005', 3, 5)
['510,258', 'pounds']
('1060-006', 0, 1)
['2.']
('1060-007', 0, 1)
['3.']
('1060-008', 0, 1)
['4.']
('1060-008', 3, 4)
['301,972']
('1060-009', 0, 1)
['5.']
('1060-009', 1, 2)
['Costantino']
('1060-009', 6, 7)
['297,157']
('1060-010', 6, 7)
['254,247']
('1060-011', 0, 1)
['7.']
('1060-011', 1, 4)
['Andrew', 'Coltart', '248,142']
('1060-012', 0, 1)
['8.']
('1060-012', 6, 7)
['239,733']
('1060-013', 0, 1)
['9.']
('1060-015', 6, 7)
['211,175']
('1060-016', 7, 8)
['209,412']
('1060-017', 0, 3)
['13.', 'Paul', 'McGinley']
('1060-017', 6, 7)
['208,978']
('1060-018', 6, 7)
['202,593']
('1060-019', 7, 8)
['195,283']
('1060-020', 0, 1)
['16.']
('1060-020', 7, 8)
['184,180']
('1060-022', 7, 8)
['182,533']
('1060-023', 1, 4)
['Jonathan', 'Lomas', '181,005']
('1061-000', 0, 4)
['RUGBY', 'UNION', '-', 'ENGLISH']
('1061-000', 7, 9)
['WELSH', 'RESULTS']
('1061-003', 5, 6)
['Saturday']
('1061-004', 3, 4)
['one']
(

['Saturday']
('1084-003', 4, 6)
['81st', 'minute']
('1084-003', 10, 11)
['89th']
('1085-000', 6, 9)
['WORLD', 'CUP', 'QUALIFIER']
('1085-002', 3, 4)
['1-0']
('1085-002', 6, 7)
['1-0']
('1085-002', 15, 16)
['4']
('1085-002', 18, 19)
['Saturday']
('1085-003', 5, 6)
['35th']
('1086-002', 2, 4)
['Moldova', '2-0']
('1086-002', 6, 7)
['1-0']
('1086-002', 15, 16)
['2']
('1086-002', 18, 19)
['Saturday']
('1086-003', 5, 6)
['39th']
('1086-003', 12, 13)
['53rd']
('1087-002', 17, 18)
['one']
('1087-002', 19, 23)
['the', 'Hong', 'Kong', 'Open']
('1087-002', 25, 26)
['Saturday']
('1087-003', 21, 22)
['third']
('1087-008', 18, 19)
['15-7']
('1087-008', 21, 22)
['15-8']
('1087-015', 3, 7)
['the', 'Professional', 'Squash', 'Association']
('1087-017', 5, 6)
['eighth']
('1087-017', 6, 8)
['Hong', 'Kong']
('1087-017', 13, 16)
['Australian', 'Rodney', 'Eyles']
('1087-018', 18, 19)
['15-4']
('1087-019', 9, 12)
['the', 'Portuguese', 'Open']
('1087-019', 13, 14)
['1993']
('1088-002', 3, 7)
['the', 'Hong', 'K

('1112-003', 22, 23)
['32-year-old']
('1112-003', 32, 33)
['daily']
('1112-003', 36, 37)
['Saturday']
('1112-004', 22, 24)
['this', 'summer']
('1112-008', 0, 3)
['Coach', 'Berti', 'Vogts']
('1112-008', 11, 14)
['next', 'week', "'s"]
('1112-008', 20, 21)
['first']
('1113-002', 2, 3)
['German']
('1113-002', 4, 5)
['second']
('1113-004', 0, 2)
['Karlsruhe', '2']
('1113-004', 2, 5)
['Hansa', 'Rostock', '0']
('1113-005', 5, 6)
['3']
('1113-006', 0, 2)
['Duisburg', '1']
('1113-006', 2, 4)
['Luebeck', '0']
('1114-000', 7, 8)
['QUALIFIER']
('1114-002', 5, 6)
['1-1']
('1114-002', 15, 16)
['Friday']
('1114-004', 11, 12)
['75th']
('1114-005', 0, 4)
['Lithuania', '-', 'Danius', 'Gleveckas']
('1114-005', 5, 6)
['13rd']
('1115-002', 22, 23)
['1984']
('1115-003', 1, 2)
['Sunday']
('1115-003', 2, 3)
['Telegraph']
('1115-003', 33, 36)
['the', 'Conservative', 'Party']
('1115-003', 37, 42)
['more', 'than', 'a', 'decade', 'ago']
('1115-010', 3, 6)
['Ronald', 'Reagan', "'s"]
('1116-000', 4, 6)
['September'

['the', 'Department', 'of', 'Defense']
('1152-007', 3, 4)
['Saturday']
('1152-010', 19, 21)
['the', 'weekend']
('1152-011', 16, 17)
['1994']
('1152-013', 16, 17)
['23,000']
('1152-014', 4, 5)
['158']
('1152-014', 29, 32)
['30', 'to', '40']
('1152-014', 45, 46)
['three']
('1152-014', 49, 52)
['the', 'United', 'States']
('1152-015', 5, 7)
['nearly', '1,000']
('1152-016', 1, 2)
['23,000']
('1152-016', 11, 12)
['15,000']
('1152-016', 16, 17)
['6,000']
('1152-016', 25, 26)
['2,000']
('1152-017', 6, 7)
['three']
('1152-017', 9, 14)
['the', 'Tarawa', 'Amphibious', 'Readiness', 'Group']
('1152-018', 9, 10)
['seven']
('1152-018', 16, 17)
['nine']
('1152-018', 27, 28)
['20']
('1153-000', 0, 1)
['Two']
('1153-001', 0, 2)
['BRUSSELS', '1996-08-31']
('1153-002', 0, 1)
['Two']
('1153-002', 6, 7)
['Thursday']
('1153-002', 15, 16)
['Saturday']
('1153-006', 2, 3)
['Friday']
('1153-006', 5, 6)
['two']
('1153-006', 11, 12)
['18']
('1153-006', 17, 18)
['19']
('1153-006', 38, 39)
['Thursday']
('1153-007', 

*TODO: Write a short text that discusses the errors that you observed*

*hyphen's, comma's, brackets and numbers are presented as entities. This can be cleaned for better analysis.*

Now, use the insights from your error analysis to improve the automated prediction that you implemented in Problem&nbsp;2. While the best way to do this would be to [update spaCy&rsquo;s NER model](https://spacy.io/usage/linguistic-features#updating) using domain-specific training data, for this lab it suffices to write code to pre-process the data and/or post-process the output produced by spaCy. You should be able to improve the F1 score from Problem&nbsp;2 by at last 15 percentage points.

In [32]:
def improved_gold_spans(df):
    previous_sentence_id = 0
    for _,sentence_id, sentence, beg, end, *_ in df.itertuples():
        if previous_sentence_id != sentence_id:
            previous_sentence_id = sentence_id
            doc = nlp(sentence)
            for ent in doc.ents:
                if (len(sentence.split(' ')[ent.start])>2):
                    if ent.label_ not in ["ORDINAL","CARDINAL","TIME", "DATE", "PERCENT", "MONEY","QUANTITY"]:
                        yield sentence_id, ent.start, ent.end
evaluation_report(spans_dev_gold, set(improved_gold_spans(df_dev)))   

Precision score is 85.37625239820933 %
Recall score is 67.68632753084333 %
f1 score is 75.50904977375565 %


In [33]:
print("The number of false negatives are",len(spans_dev_gold - set(spans(df_dev))))
false_pos(spans_dev_gold, set(improved_gold_spans(df_dev)), df_dev)

The number of false negatives are 1836
('0949-006', 23, 24)
['Hoddle']
('0957-018', 26, 28)
['Alex', 'Corretja']
('1054-008', 6, 7)
['Rabobank']
('0959-013', 2, 3)
['PITTSBURGH']
('1152-014', 9, 10)
['F-14']
('0990-005', 7, 8)
['Yeltsin']
('1123-004', 0, 1)
['Ssangbangwool']
('1149-001', 0, 1)
['DETROIT']
('0972-026', 0, 2)
['A.', 'Gurusinha']
('1068-034', 3, 4)
['Hartlepool']
('1067-008', 5, 6)
['Clydebank']
('1011-002', 0, 1)
['CHICAGO']
('1111-008', 5, 7)
['Beryll', 'Remblance']
('1145-003', 16, 17)
['al-Daih']
('1022-007', 21, 22)
['C$']
('0956-007', 2, 3)
['Samsung']
('1061-019', 0, 1)
['Currie']
('1133-004', 24, 26)
['Communist', 'Party']
('1068-025', 0, 1)
['Shrewsbury']
('1008-014', 5, 7)
['Commerce', 'Department']
('1056-014', 1, 3)
['Tatian', 'Stiajkina']
('1076-038', 0, 4)
['Buducnost', '(', 'V', ')']
('1045-003', 8, 9)
['U.N.']
('1046-008', 31, 32)
['Real']
('1142-012', 4, 5)
['KDP']
('1143-004', 25, 26)
['Sikhin']
('0984-005', 0, 1)
['BRITAIN']
('1065-014', 0, 1)
['Stirlin

['Magna']
('1147-001', 0, 1)
['BAGHDAD']
('1072-018', 5, 7)
['Andre', 'Joubert']
('1018-001', 0, 1)
['BERLIN']
('1101-012', 83, 85)
['Enrique', 'Alfaro']
('0955-011', 0, 1)
['Pohang']
('1072-019', 56, 58)
['Michael', 'Jones']
('1068-010', 0, 2)
['Port', 'Vale']
('1066-032', 0, 1)
['Brentford']
('1087-015', 4, 7)
['Professional', 'Squash', 'Association']
('1000-014', 11, 13)
['Czech', 'Republic']
('1058-042', 18, 19)
['Brewers']
('0948-004', 8, 13)
['Test', 'and', 'County', 'Cricket', 'Board']
('1123-001', 0, 1)
['SEOUL']
('1014-010', 32, 36)
['Commission', 'on', 'Presidential', 'Debates']
('0948-018', 3, 4)
['Worcestershire']
('0961-000', 1, 3)
['AMERICAN', 'FOOTBALL-RANDALL']
('1042-003', 5, 6)
['Huize']
('1080-000', 4, 5)
['LITHUANIA']
('1069-017', 19, 21)
['C.', 'Hooper']
('1055-028', 3, 4)
['Knight']
('1067-018', 0, 1)
['Arbroath']
('1055-025', 0, 2)
['Ijaz', 'Ahmed']
('1011-012', 12, 13)
['Minneapolis-based']
('0972-037', 6, 8)
['M.', 'Waugh']
('1033-007', 0, 1)
['Pro-Moscow']
('0

['Bay', 'Co', 'Building', 'Authority']
('1061-021', 0, 1)
['Melrose']
('1072-017', 1, 3)
['Walter', 'Little']
('0964-003', 0, 1)
['Silva']
('1068-008', 2, 3)
['Wolverhampton']
('1041-009', 2, 5)
['Apic', 'Yamada', 'Corp']
('1061-015', 2, 3)
['Caerphilly']
('1094-024', 0, 2)
['KANSAS', 'CITY']
('1061-020', 2, 3)
['Watsonians']
('1054-009', 6, 7)
['TVM']
('1067-017', 2, 3)
['Cowdenbeath']
('0972-024', 5, 6)
['Fleming']
('1103-016', 41, 43)
['Tolunay', 'Kafkas']
('1066-035', 0, 1)
['Millwall']
('1092-007', 9, 12)
['South', 'Korea', 'Australia']
('1116-009', 18, 19)
['Borodino']
('1068-000', 2, 3)
['ENGLISH']
('1058-030', 7, 8)
['RBI']
('1054-009', 4, 5)
['Denmark']
('0949-000', 2, 3)
['SHEARER']
('0947-007', 3, 4)
['Nottinghamshire']
('1115-010', 3, 5)
['Ronald', 'Reagan']
('0995-000', 0, 1)
['Tajik']
('1039-018', 18, 19)
['PDI']
('1067-005', 3, 4)
['Falkirk']
('1093-005', 3, 5)
['Luis', 'Lazo']
('1124-001', 0, 2)
['CAPE', 'TOWN']
('1039-018', 1, 5)
['Central', 'Jakarta', 'District', 'Cou

['Toronto', 'Bond', 'Traders', "'", 'Association']
('0999-001', 0, 1)
['MOSCOW']
('1030-037', 5, 9)
['New', 'York', 'Mercantile', 'Exchange']
('1123-007', 0, 1)
['Haitai']
('1022-014', 13, 14)
['C$']
('1094-009', 0, 1)
['BOSTON']
('1043-002', 15, 18)
['Central', 'Narcotics', 'Bureau']
('1072-019', 77, 79)
['Olo', 'Brown']
('0985-009', 0, 1)
['Bruno']
('0955-004', 2, 3)
['Ulsan']
('0955-005', 2, 3)
['Chonbuk']
('0972-021', 5, 6)
['Jayasuriya']
('1136-002', 1, 4)
['Dow', 'Chemical', 'Co']
('1091-002', 9, 12)
['National', 'Tennis', 'Centre']
('0948-024', 7, 9)
['Lord', "'s"]
('1000-009', 0, 1)
['Cofinec']
('1009-018', 31, 32)
['California']
('0947-011', 7, 8)
['Derbyshire']
('0955-004', 0, 1)
['Pohang']
('0947-009', 5, 7)
['W.', 'Athey']
('1030-015', 21, 25)
['New', 'York', 'Stock', 'Exchange']
('1076-005', 0, 1)
['Hajduk']
('1000-003', 2, 3)
['France-registered']
('0971-000', 2, 4)
['SRI', 'LANKA']
('1110-000', 2, 3)
['MILLER']
('0946-005', 14, 15)
['Simmons']
('1058-031', 30, 33)
['Bost

['Gough']
('1092-000', 4, 7)
['1997', 'FED', 'CUP']
('1095-013', 0, 1)
['CHICAGO']
('1105-000', 4, 5)
['SPANISH']
('1094-016', 0, 1)
['MILWAUKEE']
('1076-023', 0, 4)
['Proleter', '(', 'Z', ')']
('0953-011', 21, 23)
['Boxing', 'Association']
('0947-010', 15, 17)
['M.', 'Gatting']
('1109-000', 6, 7)
['BRAZIL']
('1065-028', 0, 1)
['Albion']
('1149-011', 3, 8)
['National', 'Highway', 'Traffic', 'Safety', 'Administration']
('0947-008', 2, 4)
['The', 'Oval']
('1069-008', 0, 1)
['Glamorgan']
('1071-001', 0, 1)
['EDMONTON']
('1069-009', 5, 6)
['Worcestershire']
('0979-002', 22, 23)
['U.S.']
('0972-036', 6, 7)
['Law']
('0958-038', 0, 2)
['CENTRAL', 'DIVISION']
('1053-003', 15, 17)
['Ijaz', 'Ahmed']
('1128-012', 3, 4)
['UWP-Hachette']
('1072-006', 0, 1)
['Springbok']
('1123-014', 0, 1)
['OB']
('1065-035', 0, 2)
['Inverness', 'Thistle']
('0972-009', 0, 2)
['S.', 'Waugh']
('1069-006', 3, 4)
['Glamorgan']
('0959-010', 0, 2)
['San', 'Diego']
('0985-011', 13, 14)
['Tyson']
('1075-003', 0, 1)
['Benin'

In [34]:
print("The number of false positives are",len(set(spans(df_dev))-spans_dev_gold))
false_neg(spans_dev_gold, set(improved_gold_spans(df_dev)), df_dev)  

The number of false positives are 3627
('1127-022', 37, 40)
['Freedom', 'for', 'Chechnya']
('1138-008', 13, 16)
['Democracy', 'Wall', 'movement']
('0953-012', 22, 25)
['Puerto', 'Rico', "'s"]
('1058-002', 16, 20)
['the', 'Cleveland', 'Indians', '5-3']
('0966-016', 1, 4)
['Leah', 'Pells', '(']
('1005-014', 10, 11)
['Comex']
('1055-032', 2, 6)
['Knight', 'b', 'Hollioake', '2']
('1106-006', 0, 4)
['Belgium', '-', 'Marc', 'Degryse']
('0977-002', 16, 20)
['the', 'Bank', 'of', 'Canada']
('1150-005', 24, 28)
['the', 'Rochester', 'Fire', 'Department']
('1096-031', 25, 29)
['the', 'San', 'Francisco', 'Giants']
('1069-007', 5, 7)
['Durham', '114']
('1058-022', 23, 26)
['the', 'Seattle', 'Mariners']
('0987-002', 22, 26)
['the', 'national', 'news', 'agency']
('1067-013', 0, 2)
['Dumbarton', '1']
('0957-019', 0, 3)
['Add', 'Women', "'s"]
('1076-000', 3, 7)
['LEAGUE', 'RESULTS', '/', 'STANDINGS']
('1157-004', 5, 6)
['Iraqi']
('1104-002', 12, 15)
['Deportivo', 'Coruna', '1']
('0957-033', 35, 37)
['6-

('1039-019', 9, 10)
['congress']
('1087-002', 19, 23)
['the', 'Hong', 'Kong', 'Open']
('1058-016', 23, 27)
['the', 'New', 'York', 'Yankees']
('1027-014', 4, 7)
['Abu', 'Ammar', "'s"]
('0948-046', 7, 8)
['Oval']
('0981-011', 0, 4)
['Douglas', '&', 'Lomason', "'s"]
('1137-008', 32, 35)
['the', 'Korean', 'War']
('1124-010', 32, 33)
['Coloureds']
('1160-002', 0, 1)
['VENICE']
('1071-001', 0, 2)
['EDMONTON', '1996-08-31']
('1070-000', 0, 6)
['MOTOR', 'RACING', '-', 'LEADING', 'QUALIFIERS', 'FOR']
('1054-000', 0, 9)
['CYCLING', '-', 'TOUR', 'OF', 'NETHERLANDS', 'FINAL', 'RESULTS', '/', 'STANDINGS']
('1149-011', 2, 8)
['the', 'National', 'Highway', 'Traffic', 'Safety', 'Administration']
('1068-022', 0, 3)
['Luton', '1', 'Rotherham']
('0977-008', 6, 7)
['+0.578']
('1043-002', 14, 18)
['the', 'Central', 'Narcotics', 'Bureau']
('0947-008', 3, 4)
['Oval']
('0992-008', 0, 8)
['The', 'Organisation', 'for', 'Security', 'and', 'Cooperation', 'in', 'Europe']
('1064-007', 20, 24)
['the', 'Rugby', 'Foot

Show that you achieve the performance goal by reporting the evaluation measures that you implemented in Problem&nbsp;1.

Before going on, we ask you to store the outputs of the improved named entity recognizer on the development data in a new data frame. This new frame should have the same layout as the original data frame for the development data that you loaded above, but should contain the predicted start and end positions for each token span. As the `label` of each span, you can use the special value `--NME--`.

In [35]:
# TODO: Write code here to store the predicted spans in a new data frame
predicted_spans = set(improved_gold_spans(df_dev))
#pred_spans.columns = ['sentence_id','beg','end']
df_predicted = []
for j in predicted_spans:
    df_predicted.append([j[0],df_dev[df_dev.sentence_id == j[0]].sentence.values[0],j[1],j[2],'--NME--'])

In [36]:
df_predicted = pd.DataFrame(df_predicted, columns = df_dev.columns)
df_predicted.sort_values(by=['sentence_id', 'beg']).reset_index(drop = True)

Unnamed: 0,sentence_id,sentence,beg,end,label
0,0946-001,LONDON 1996-08-30,0,1,--NME--
1,0946-002,West Indian all-rounder Phil Simmons took four...,0,2,--NME--
2,0946-002,West Indian all-rounder Phil Simmons took four...,3,5,--NME--
3,0946-002,West Indian all-rounder Phil Simmons took four...,14,15,--NME--
4,0946-003,"Their stay on top , though , may be short-live...",13,14,--NME--
5,0946-003,"Their stay on top , though , may be short-live...",15,16,--NME--
6,0946-003,"Their stay on top , though , may be short-live...",17,18,--NME--
7,0946-003,"Their stay on top , though , may be short-live...",24,25,--NME--
8,0946-003,"Their stay on top , though , may be short-live...",35,36,--NME--
9,0946-004,After bowling Somerset out for 83 on the openi...,2,3,--NME--


## Problem 4: Entity linking

Now that we have a method for predicting mention spans, we turn to the task of **entity linking**, which amounts to predicting the knowledge base entity that is referenced by a given mention. In our case, for each span we want to predict the Wikipedia page that this mention references.

Start by extending the generator function that you implemented in Problem&nbsp;2 to labelled spans.

In [37]:
def gold_mentions(df):
    """Yield the gold-standard mentions in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and entity label of each span.
    """
    # TODO: Replace the next line with your own code
    for i in df.itertuples():
        yield (i[1], i[3], i[4], i[5])

A naive baseline for entity linking on our data set is to link each mention span to the Wikipedia page name that we get when we join the tokens in the span by underscores, as is standard in Wikipedia page names. Suppose, for example, that a span contains the two tokens

    Jimi Hendrix

The baseline Wikipedia page name for this span would be

    Jimi_Hendrix

Implement this naive baseline and evaluate its performance. Print the evaluation measures that you implemented in Problem&nbsp;1.

<div class="alert alert-warning">
    Here and in the remainder of this lab, you should base your entity predictions on the predicted spans that you computed in Problem&nbsp;3.
</div>

In [104]:
# TODO: Write code here to implement the baseline
def baseline_(data):
    for  row in data.itertuples():
        label = "_".join(row.sentence.split(" ")[int(row.beg):int(row.end)])
        yield row.sentence_id, row.beg, row.end, label     
evaluation_report(set(gold_mentions(df_dev)),set(baseline_(df_predicted.copy() )))        

Precision score is 32.61564698358559 %
Recall score is 25.857698157850262 %
f1 score is 28.846153846153843 %


## Problem 5: Extending the training data using the knowledge base

State-of-the-art approaches to entity linking exploit information in knowledge bases. In our case, where Wikipedia is the knowledge base, one particularly useful type of information are links to other Wikipedia pages. In particular, we can interpret the anchor texts (the highlighted texts that you click on when you click on a link) as mentions of the entities (pages) that they link to. This allows us to harvest long lists over mention–entity pairings.

The following cell loads a data frame summarizing anchor texts and page references harvested from the first paragraphs of the English Wikipedia. The data frame also contains all entity mentions in the training data (but not the development or the test data).

In [45]:
with bz2.open("kb.tsv.bz2", 'rt',encoding="utf8") as source:
    df_kb = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

To understand what information is available in this data, the following cell shows the entry for the anchor text `Sweden`.

In [46]:
df_kb.loc[df_kb["mention"] == "Sweden"]

Unnamed: 0,mention,entity,prob
17436,Sweden,Sweden,0.985768
17437,Sweden,Sweden_national_football_team,0.014173
17438,Sweden,Sweden_men's_national_ice_hockey_team,5.9e-05


As you can see, each row of the data frame contains a pair $(m, e)$ of a mention $m$ and an entity $e$, as well as the conditional probability $P(e|m)$ for mention $m$ referring to entity $e$. These probabilities were estimated based on the frequencies of mention–entity pairs in the knowledge base. The example shows that the anchor text &lsquo;Sweden&rsquo; is most often used to refer to the entity [Sweden](http://en.wikipedia.org/wiki/Sweden), but in a few cases also to refer to Sweden&rsquo;s national football and ice hockey teams. Note that references are sorted in decreasing order of probability, so that the most probable pairing stands first.

Implement an entity linking method that resolves each mention to the most probable entity in the data frame. If the mention is not included in the data frame, you can predict the generic label `--NME--`. Print the precision, recall, and F1 of your method using the function that you implemented for Problem&nbsp;1.

In [98]:
# TODO: Write code here to implement the "most probable entity" method.
def most_probable_entity(mention):
    entity = df_kb.loc[df_kb.mention == mention]
    if (len(entity) > 0):
        return entity.iloc[0].entity
    else :
        return "--NME--"

def extender_knowledge_base(data):
    for i in data.itertuples(): 
        label = i.sentence.split(" ")[int(i.beg):int(i.end)]
        label = " ".join(label)
        yield i.sentence_id, i.beg, i.end, most_probable_entity(label)
                    
evaluation_report(set(gold_mentions(df_dev)),set(extender_knowledge_base(df_predicted)))

Precision score is 65.5510552121083 %
Recall score is 51.96890316038533 %
f1 score is 57.97511312217195 %


## Problem 6: Context-sensitive disambiguation

Consider the entity mention &lsquo;Lincoln&rsquo;. The most probable entity for this mention turns out to be [Lincoln, Nebraska](http://en.wikipedia.org/Lincoln,_Nebraska); but in pages about American history, we would be better off to predict [Abraham Lincoln](http://en.wikipedia.org/Abraham_Lincoln). This suggests that we should try to disambiguate between different entity references based on the textual context on the page from which the mention was taken. Your task in this last problem is to implement this idea.

Set up a dictionary that contains, for each mention $m$ that can refer to more than one entity $e$, a separate Naive Bayes classifier to predict the correct entity $e$, given the textual context of the mention. As the prior probabilities of the classifier, choose the probabilities $P(e|m)$ that you used in Problem&nbsp;5. To let you estimate the context-specific probabilities, we have compiled a data set with mention contexts:

In [48]:
with bz2.open("contexts.tsv.bz2") as source:
    df_contexts = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

This data frame contains, for each ambiguous mention $m$ and each knowledge base entity $e$ to which this mention can refer, up to 100 randomly selected contexts in which $m$ is used to refer to $e$. For this data, a **context** is defined as a bag of words containing the mention itself, as well as the 5 tokens to the left and the 5 tokens to the right of the mention. Here are a few examples:

In [49]:
df_contexts.head()

Unnamed: 0,mention,entity,context
0,1970,UEFA_Champions_League,Cup twice the first in 1970 and the second in ...
1,1970,FIFA_World_Cup,America 1975 and during the 1970 and 1978 Worl...
2,1990 World Cup,1990_FIFA_World_Cup,Manolo represented Spain at the 1990 World Cup
3,1990 World Cup,1990_FIFA_World_Cup,Hašek represented Czechoslovakia at the 1990 W...
4,1990 World Cup,1990_FIFA_World_Cup,renovations in 1989 for the 1990 World Cup The...


From this data frame, it is easy to select the data that you need to train the classifiers – the contexts and corresponding entities for all mentions. To illustrate this, the following cell shows how to select all contexts that belong to the mention &lsquo;Lincoln&rsquo;:

In [50]:
df_contexts.context[df_contexts.mention == "Lincoln"]

41465    Nebraska Concealed Handgun Permit In Lincoln m...
41466    Lazlo restaurants are located in Lincoln and O...
41467    California Washington Overland Park Kansas Lin...
41468    City Missouri Omaha Nebraska and Lincoln Nebra...
41469    by Sandhills Publishing Company in Lincoln Neb...
41470    the Humanities NEH based in Lincoln Nebraska H...
41471    town Hallam situated halfway between Lincoln a...
41472    games in Memorial Stadium in Lincoln The Huske...
41473    site and retail store in Lincoln Nebraska and ...
41474    The eastern segment begins in Lincoln and ends...
41475    1869 with one campus in Lincoln the system now...
41476    Ron Kurtenbach is Lincoln Nebraska communist b...
41477    state of Nebraska He represented Lincoln distr...
41478    national celebrity for his 1908 Lincoln to New...
41479    brewery founded in 1990 in Lincoln Nebraska US...
41480    state government agency Located in Lincoln the...
41481    Omaha Grand Island Kearney and Lincoln Nebrask.

Implement the context-sensitive disambiguation method and evaluate its performance. Here are some more hints that may help you along the way:

**Hint 1:** The prior probabilities for a Naive Bayes classifier can be specified using the `class_prior` option. You will have to provide the probabilities in the same order as the alphabetically sorted class (entity) names.

**Hint 2:** To tune the performance of your method, you can try to tweak the behaviour of the vectorizer (for example, should it apply lowercasing or not?) and the width of the window from which you are extracting context tokens at prediction time.

**Hint 3:** Not all mentions in the knowledge base are ambiguous, and therefore not all mentions have context data. If a mention has only one possible entity, pick that one. If a mention has no entity at all, predict the `--NME--` label.

In [111]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
entitymentiondict = {}
for m in df_contexts.mention.unique():
    pipeline = Pipeline([('cv', CountVectorizer()),
                     ('mnb', MultinomialNB(class_prior = 
                    list(df_kb[df_kb.mention == m].sort_values(by=["entity"]).prob)))])
    pipeline.fit(df_contexts.context[df_contexts.mention == m], df_contexts.entity[df_contexts.mention == m])
    entitymentiondict[m] = pipeline
    
def probable_entity(mention, sentence, beg, end, n):
    if mention in entity_dict:
        context = sentence.split(" ")
        context = context[max(0, int(beg) - n):min(len(context), int(end) + n)]
        context = ' '.join(context)
        return entitymentiondict[mention].predict([context])[0]
    return most_probable_entity(mention)
    
def mnb_predict(data):
    for i in data.itertuples():
        label = i.sentence.split(" ")[int(i.beg):int(i.end)]
        label = " ".join(label)
        yield i.sentence_id, i.beg, i.end, probable_entity(label, i.sentence, i.beg, i.end, 5)

evaluation_report(set(gold_mentions(df_dev)), set(mnb_predict(df_predicted)))    

Precision score is 67.02195693881902 %
Recall score is 53.13503464593544 %
f1 score is 59.276018099547514 %


In [112]:
def nb_predict_mentions(data):
    for i in data.itertuples():
        label = i.sentence.split(" ")[int(i.beg):int(i.end)]
        label = " ".join(label)
        yield i.sentence_id, i.beg, i.end, mnb_entity(label, i.sentence, i.beg, i.end, 10)

evaluation_report(set(gold_mentions(df_dev)), set(nb_predict_mentions(df_predicted)))  

Precision score is 67.21381368578129 %
Recall score is 53.28713875274632 %
f1 score is 59.44570135746607 %


In [113]:
def nb_predict_mentions(data):
    for i in data.itertuples():
        label = i.sentence.split(" ")[int(i.beg):int(i.end)]
        label = " ".join(label)
        yield i.sentence_id, i.beg, i.end, mnb_entity(label, i.sentence, i.beg, i.end, 20)

evaluation_report(set(gold_mentions(df_dev)), set(nb_predict_mentions(df_predicted))) 

Precision score is 67.32040076742699 %
Recall score is 53.37164103430793 %
f1 score is 59.53996983408748 %


You should expect to see a small (around 1&nbsp;unit) increase in both precision, recall, and F1. Published systems report a larger impact of context-sensitive disambiguation. Feel free to think about what could explain the relatively minor impact that we see here!

*With a buffer length of 5, we get a precision of 67.02% which is higher than the 65.55% from the knowledge extension done previously, also a 1% approx. increase in recall score and f1 score is observed. It is seen from that when the buffer length is increased , we notice a marginal increase in the precision which shows that an optimum span of context is required to maximise the precision.*

**This was the last lab in the Text Mining course. Well done!**

<div class="alert alert-info">
    Please read the section ‘General information’ on the ‘Labs’ page of the course website before submitting this notebook!
</div>