# L5: Information extraction

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, **named entity recognition** (identifying mentions of entities) and **entity linking** (matching these mentions to entities in a knowledge base).

We start by loading spaCy:

In [26]:
import spacy

nlp = spacy.load('en_core_web_sm')

The data that we will be using has been tokenized following the conventions of the [Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html), and we need to prevent spaCy from using its own tokenizer on top of this. We therefore override spaCy&rsquo;s tokenizer with one that simply splits on space.

In [3]:
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return Doc(self.vocab, words=text.split(' '))

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
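
One consequence of splitting on a single space is worth checking: `str.split(' ')` and `str.split()` agree only when tokens are separated by exactly one space. The sketch below needs no spaCy. The first string is a fragment of a sentence from the data; the second, with a double space, is made up for illustration.

```python
# Penn Treebank-style tokenization keeps clitics such as 's as separate tokens.
sentence = "Clinton claims Dole 's plan would increase the deficit"

tokens_single = sentence.split(' ')   # what WhitespaceTokenizer does
tokens_any = sentence.split()         # splits on any run of whitespace

print(tokens_single[:4])
print(tokens_single == tokens_any)

# Hypothetical input with a double space: split(' ') yields an empty token,
# which shifts the index of every later token by one.
messy = "NEW  YORK"
print(messy.split(' '))
print(messy.split())
```

As long as the sentences contain single spaces only, both ways of splitting give the same token positions.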

## Data set

The main data set for this lab is a collection of news wire articles in which mentions of named entities have been annotated with page names from the [English Wikipedia](https://en.wikipedia.org/wiki/). The next code cell loads the training and the development parts of the data into Pandas data frames.

In [14]:
import bz2
import csv
import pandas as pd

with bz2.open('ner-train.tsv.bz2', 'rt') as source:
    df_train = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

with bz2.open('ner-dev.tsv.bz2', 'rt') as source:
    df_dev = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

Each row in these two data frames corresponds to one mention of a named entity and has five columns:

1. a unique identifier for the sentence containing the entity mention
2. the pre-tokenized sentence, with tokens separated by spaces
3. the start position of the token span containing the entity mention
4. the end position of the token span (exclusive, as in Python list indexing)
5. the entity label; either a Wikipedia page name or the generic label `--NME--`

The following cell prints the first five samples from the training data:

In [15]:
df_train.head()

| | sentence_id | sentence | beg | end | label |
|---|---|---|---|---|---|
| 0 | 0000-000 | EU rejects German call to boycott British lamb . | 0 | 1 | --NME-- |
| 1 | 0000-000 | EU rejects German call to boycott British lamb . | 2 | 3 | Germany |
| 2 | 0000-000 | EU rejects German call to boycott British lamb . | 6 | 7 | United_Kingdom |
| 3 | 0000-001 | Peter Blackburn | 0 | 2 | --NME-- |
| 4 | 0000-002 | BRUSSELS 1996-08-22 | 0 | 1 | Brussels |


In this sample, we see that the first sentence is annotated with three entity mentions:

* the span 0–1 &lsquo;EU&rsquo; is annotated as a mention but only labelled with the generic `--NME--`
* the span 2–3 &lsquo;German&rsquo; is annotated with the page [Germany](http://en.wikipedia.org/wiki/Germany)
* the span 6–7 &lsquo;British&rsquo; is annotated with the page [United_Kingdom](http://en.wikipedia.org/wiki/United_Kingdom)
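
To make the span convention concrete, here is a minimal sketch that recovers the mention text from the start and end positions. It uses only the first sentence and its three spans from the sample; `mention_text` is a hypothetical helper name, not part of the lab code.

```python
sentence = "EU rejects German call to boycott British lamb ."
tokens = sentence.split(' ')

# (beg, end, label) triples copied from the first three rows of the sample
mentions = [(0, 1, "--NME--"), (2, 3, "Germany"), (6, 7, "United_Kingdom")]

def mention_text(tokens, beg, end):
    """Return the surface form of the token span [beg, end)."""
    return ' '.join(tokens[beg:end])

for beg, end, label in mentions:
    print(mention_text(tokens, beg, end), '->', label)
```

Because the end position is exclusive, a span can be used directly as a Python slice.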

## Problem 1: Evaluation measures

To warm up, we ask you to write code to print the three measures that you will be using for evaluation:

In [34]:
def evaluation_report(gold, pred):
    """Print precision, recall, and F1 score.
    
    Args:
        gold: The set with the gold-standard values.
        pred: The set with the predicted values.
    
    Returns:
        Nothing, but prints the precision, recall, and F1 values computed
        based on the specified sets.
    """
    # TODO: Replace the next line with your own code
    
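    tp = len(pred.intersection(gold))  # true positives: items in both sets

    # Guard against empty sets to avoid division by zero
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * tp / (len(gold) + len(pred)) if (gold or pred) else 0.0

    print('Precision: ', precision, '\nRecall: ', recall, '\nF1: ', f1)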

To test your code, you can run the following cell:

In [7]:
evaluation_report(set(range(3)), set(range(5)))

Precision:  0.6 
Recall:  1.0 
F1:  0.75


This should give you a precision of 60%, a recall of 100%, and an F1-value of 75%.
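
These numbers can be checked by hand. Writing $G$ for the gold-standard set and $P$ for the predicted set, the test call has $|G| = 3$, $|P| = 5$ and $|G \cap P| = 3$, so:

$$
\text{precision} = \frac{|G \cap P|}{|P|} = \frac{3}{5} = 0.6,
\qquad
\text{recall} = \frac{|G \cap P|}{|G|} = \frac{3}{3} = 1.0
$$

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,|G \cap P|}{|G| + |P|}
    = \frac{6}{8} = 0.75
$$

The middle form of $F_1$ follows by substituting the two definitions into the harmonic mean and cancelling $|G \cap P|$.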

## Problem 2: Span recognition

One of the first tasks that an information extraction system has to solve is to locate and classify (mentions of) named entities, such as persons and organizations. Here we will tackle the simpler task of recognizing **spans** of tokens that contain an entity mention, without the actual entity label.

The English language model in spaCy features a full-fledged [named entity recognizer](https://spacy.io/usage/linguistic-features#named-entities) that identifies a variety of entities, and can be updated with new entity types by the user. Your task in this problem is to evaluate the performance of this component when predicting entity spans in the development data.

Start by implementing a generator function that yields the gold-standard spans in a given data frame.

**Hint:** The Pandas method [`itertuples()`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.itertuples.html) is useful when iterating over the rows in a DataFrame.

In [8]:
def gold_spans(df):
    """Yield the gold-standard mention spans in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # TODO: Replace the next line with your own code
    
    for row in df.itertuples():
        yield (row.sentence_id, row.beg, row.end)

To test your code, you can count the spans yielded by your function. When called on the development data, you should get a total of 5,917 unique triples. The first triple and the last triple should be

    ('0946-000', 2, 3)
    ('1161-010', 1, 3)  

In [9]:
spans_dev_gold = set(gold_spans(df_dev))
print(len(spans_dev_gold))

5917


Your next task is to write code that calls spaCy to predict the named entities in the development data, and to evaluate the accuracy of these predictions in terms of precision, recall, and F1. Print these scores using the function that you wrote for Problem&nbsp;1.

In [10]:
# TODO: Write code here to run and evaluate the spaCy NER on the development data

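def predictions(df):
    """Yield the spans predicted by spaCy as (sentence id, start, end) triples."""
    # Iterate over the data frame passed in, not over the global df_dev
    for row in df.itertuples():
        doc = nlp(row.sentence)
        for ent in doc.ents:
            yield (row.sentence_id, ent.start, ent.end)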


In [11]:
predictions = set(predictions(df_dev))

In [12]:
evaluation_report(spans_dev_gold, predictions)

Precision:  0.5383185017930668 
Recall:  0.6849754943383471 
F1:  0.6028558679161089


## Problem 3: Error analysis

As you were able to see in Problem&nbsp;2, the span accuracy of the named entity recognizer is far from perfect. In particular, only slightly more than half of the predicted spans are correct according to the gold standard. Your next task is to analyse this result in more detail.

Here is a function that prints the false positive and false negative spans for a data frame, given a reference set of gold-standard spans and a candidate set of predicted spans.

In [13]:
from collections import defaultdict

def error_report(df, spans_gold, spans_pred):
    false_pos = defaultdict(list)
    for s, b, e in spans_pred - spans_gold:
        false_pos[s].append((b, e))
    false_neg = defaultdict(list)
    for s, b, e in spans_gold - spans_pred:
        false_neg[s].append((b, e))
    for row in df.drop_duplicates('sentence_id').itertuples():
        if row.sentence_id in false_pos or row.sentence_id in false_neg:
            print('Sentence:', row.sentence)
            for b, e in false_pos[row.sentence_id]:
                print('  FP:', ' '.join(row.sentence.split()[b:e]))
            for b, e in false_neg[row.sentence_id]:
                print('  FN:', ' '.join(row.sentence.split()[b:e]))
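
The set differences that `error_report` relies on can be illustrated on a toy example. Spans are compared by exact match, so a prediction that is off by one token counts as one false positive and one false negative. The sentence id `s1` and the spans are hypothetical; the positions mimic a sentence in which the gold span is `White House` and the predicted span is `the White House`.

```python
# Hypothetical gold-standard and predicted spans for a single sentence
spans_gold = {("s1", 0, 1), ("s1", 2, 3), ("s1", 12, 14)}
spans_pred = {("s1", 0, 1), ("s1", 11, 14)}

false_pos = spans_pred - spans_gold   # predicted but not in the gold standard
false_neg = spans_gold - spans_pred   # in the gold standard but not predicted

print(sorted(false_pos))
print(sorted(false_neg))
```

Here the span `(11, 14)` is reported as a false positive and `(12, 14)` as a false negative, although they differ only in the leading token.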

Use this function to inspect and analyse the errors that the automated prediction makes. Can you see any patterns? Base your analysis on the first 500 rows of the training data. Summarize your observations in a short text.

In [14]:
# TODO: Write code here to do your analysis
error_report(df_dev, spans_dev_gold, predictions)

Sentence: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
  FN: LEICESTERSHIRE
Sentence: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
  FP: Friday
  FP: 39
  FP: two days
  FP: four
  FP: 38
  FN: Somerset
Sentence: After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended their first innings by 94 runs before being bowled out for 296 with England discard Andy Caddick taking three for 83 .
  FP: first
  FP: 83
  FP: the opening morning
  FP: 296
  FP: 94
  FP: three
  FP: 83
Sentence: Trailing by 213 , Somerset got a solid start to their second innings before Simmons stepped in to bundle them out for 174 .
  FP: 213
  FP: 174
  FP: second
Sentence: Essex , however , look certain to regain their top spot after Nasser Hussain and Peter Such gave them a firm grip on their match against Yorksh

Sentence: Woods was among a group of 13 players at four under , including 1993 champion Billy Mayfair , who tied for second at last week 's World Series of Golf , and former U.S. Open champ Payne Stewart .
  FP: U.S.
  FP: last week 's
  FP: four
  FP: 13
  FP: second
  FP: 1993
  FN: Woods
  FN: U.S. Open
Sentence: Defending champion Scott Hoch shot a three-under 68 and was six strokes back .
  FP: six
  FP: 68
Sentence: Phil Mickelson , the only four-time winner on the PGA Tour , skipped the tournament after winning the World Series of Golf last week .
  FP: the PGA Tour
  FP: last week
  FP: the World Series of
  FN: PGA Tour
  FN: World Series of Golf
Sentence: Mark Brooks , Tom Lehman and Mark O'Meara , who make up the rest of the top four on the money list , also took the week off .
  FP: four
  FP: the week
Sentence: SOCCER - SILVA 'S `LOST PASSPORT ' EXCUSE NOT ENOUGH FOR FIFA .
  FN: SILVA
Sentence: MADRID 1996-08-30
  FN: MADRID
Sentence: Spanish first division team Deportivo

  FN: IAAF Grand Prix Final
Sentence: SOCCER - BARATELLI TO COACH NICE .
  FN: BARATELLI
  FN: NICE
Sentence: NICE , France 1996-08-30
  FP: 1996-08-30
Sentence: Former international goalkeeper Dominique Baratelli is to coach struggling French first division side Nice , the club said on Friday .
  FP: first
  FP: Friday
Sentence: Baratelli , who played for Nice and Paris St Germain , takes over from Albert Emon who was fired on Thursday after Nice 's home defeat to Guingamp 2-1 in the league .
  FP: Guingamp 2-1
  FP: Paris
  FP: Thursday
  FN: Guingamp
  FN: Baratelli
  FN: Paris St Germain
Sentence: Nice have been unable to win any of their four league matches played this season and are lying a lowly 18th in the table .
  FP: four
  FP: this season
  FP: 18th
  FN: Nice
Sentence: SOCCER - MILAN 'S LENTINI MOVES TO ATALANTA .
  FN: MILAN
  FN: LENTINI
  FN: ATALANTA
Sentence: Former Italian international winger Gianluigi Lentini , transferred to Milan in 1992 for what was believed to 

  FP: about 2,960,000
  FP: Friday
Sentence: Jones stock closed down 1/8 at 40 Friday .
  FP: 40 Friday
  FP: 1/8
Sentence: -- Chicago newsdesk , 312 408-8787
  FP: 312
Sentence: NEW YORK 1996-08-30
  FP: NEW YORK 1996-08-30
  FN: NEW YORK
Sentence: NYMEX refined product prices lingered at session lows amid slim volume before the close while crude experienced lackluster buying ahead of the U.S. Labor Day weekend , traders said .
  FP: NYMEX
  FP: Labor Day weekend
  FN: Labor Day
Sentence: Buying interest in crude did not have enough conviction to send it much higher since many players had left early to start the Labor Day holiday weekend , traders said .
  FP: the Labor Day holiday weekend
  FN: Labor Day
Sentence: NYMEX will be closed Monday due to Labor Day .
  FP: Monday
  FP: NYMEX
Sentence: -- Harry Milling , New York Energy Desk , +1 212-859-1761
  FP: -- Harry Milling
  FN: Harry Milling
Sentence: U.S. debt futures end lower , shaken by Chicago NAPM .
  FN: NAPM
Sentence: CHICA

  FP: Saturday
Sentence: Russian peacemaker Alexander Lebed and Chechen separatist military leader Aslan Maskhadov started a new round of peace talks on Friday in this settlement just outside the rebel region .
  FP: Friday
Sentence: Lebed , who flew into Chechnya earlier in the day , said he hoped to sign a framework agreement on a political settlement of the 20-month conflict in which tens of thousands of people have died .
  FP: tens of thousands
  FP: earlier in the day
  FP: 20-month
Sentence: Russian judge stabbed to death over $ 7 fine .
  FP: over $ 7
Sentence: MOSCOW 1996-08-30
  FN: MOSCOW
Sentence: A Moscow street vendor stabbed to death a woman judge in a city court on Friday after she fined him the equivalent of seven dollars for trading illegally , Interfax news agency said .
  FP: Friday
  FP: seven dollars
Sentence: Interfax said Judge Olga Lavrentyeva , 28 , on Thursday ordered the confiscation of several overcoats , suits and shirts which vendor Valery Ivankov , 41 , 

  FP: $ 548 billion
Sentence: Clinton proposes an $ 8.4 billion re-election agenda that would spare most home-sellers from capital gains taxes and give employers tax incentives to hire people off the welfare rolls .
  FP: $ 8.4 billion
Sentence: Clinton claims Dole 's plan would increase the deficit , while the White House said some corporate taxes would be raised to offset the cost of the president 's plan .
  FP: the White House
  FN: White House
Sentence: The experts said there are some unusual risks for the market from this year 's political season because the rush to promise tax cuts to win votes could upset Wall Street 's expectations that Washington will balance the budget .
  FP: year
  FN: Wall Street
Sentence: Ronald Reagan 's tax cut won him a second four-year term in the White House in 1984 , while Democrat Walter Mondale , who promised higher taxes , lost .
  FP: second
  FP: the White House
  FP: Ronald Reagan 's
  FP: 1984
  FN: White House
  FN: Ronald Reagan
Sentence: 

  FN: Bear Stearns & Co Inc
Sentence: Business : Company 's goal is to become a national provider of high quality extended-stay hotels in strategically selected markets located throughout the United States .
  FP: the United States
  FN: United States
Sentence: Syria on Friday condemned Israel 's settlement policy and said Prime Minister Benjamin Netanyahu was preparing for war with Arabs .
  FP: Friday
Sentence: Palestinian President Yasser Arafat described this week the decision to expand settlements in the West Bank as a declaration of war .
  FP: this week
  FP: the West Bank
  FN: West Bank
Sentence: Palestinians in the West Bank and Gaza observed a strike on Thursday to protest against the Israeli move .
  FP: Thursday
  FP: the West Bank
  FN: West Bank
Sentence: Syria has held sporadic peace talks with Israel since 1991 without achieving a breakthrough .
  FP: 1991
Sentence: Egypt police detain 26 suspected Moslem militants .
  FP: 26
Sentence: Police have detained 26 suspected

  FN: EOE
  FN: European Options Exchange
Sentence: Market-making in the rarely-traded FTO contract was expected to begin today , but an EOE spokesman said the 10 banks and brokers involved in the initiative needed time to get accustomed to changes in the electronic trading system .
  FP: today
  FP: 10
Sentence: We found it wise to take some time between the commitment to start and the actual start , " EOE spokesman Lex van Drooge told Reuters .
  FN: EOE
Sentence: He said no date had been fixed yet for the start of price making in the 10-year contract , but the EOE had agreed to speak again to the participants in one to two weeks .
  FP: one to two weeks
  FP: 10-year
Sentence: -- Amsterdam newsroom +31 20 504 5000 , Fax +31 20 504 5040
  FP: 20 504 5040
Sentence: The head of an international mediating mission defended its record on Friday in the face of criticism by pro-Moscow leaders in breakway Chechnya and insisted it was doing its best to bring peace to the region .
  FP: Friday

  FP: 1:7
  FP: 3.
  FN: Ekimov
Sentence: 4. Marco Lietti ( Italy ) MG-Technogym 1:16
  FP: MG-Technogym 1:16
  FP: 4.
  FN: MG-Technogym
Sentence: 5. Erik Dekker ( Netherlands ) Rabobank 1:23
  FP: 1:23
  FP: 5.
Sentence: 6. Ludwig 1:25
  FP: 6.
  FN: Ludwig
Sentence: 6. Breukink same time
  FP: 6.
  FN: Breukink
Sentence: 8. Maarten den Bakker ( Netherlands ) TVM 1:33
  FP: 8.
  FP: 1:33
  FP: Bakker
  FN: Maarten den Bakker
Sentence: 9. Andersson 1:34
  FP: 1:34
Sentence: 10. Skibby 1:45
  FP: 1:45
  FN: Skibby
Sentence: CRICKET - ENGLAND V PAKISTAN ONE-DAY SCOREBOARD .
  FP: CRICKET
  FN: ENGLAND
  FN: PAKISTAN
Sentence: Scoreboard of the second one-day cricket match between England and Pakistan on Saturday :
  FP: second
  FP: Saturday
Sentence: N. Knight st Moin Khan b Saqlain Mushtaq 113
  FP: N. Knight st Moin Khan b
  FN: Saqlain Mushtaq
  FN: N. Knight
  FN: Moin Khan
Sentence: A. Stewart b Mushtaq Ahmed 46
  FP: 46
  FP: A. Stewart b
  FN: A. Stewart
  FN: Mushtaq Ahmed
Sent

Sentence: Bristol Rovers 3 1 2 0 2 1 5
  FP: 3 1 2
  FP: Bristol
  FP: 2 1 5
  FN: Bristol Rovers
Sentence: Peterborough 3 1 1 1 4 4 4
  FP: 3 1 1 1 4
  FN: Peterborough
Sentence: Crewe 4 1 1 2 4 6 4
  FN: Crewe
Sentence: Bristol City 4 1 0 3 7 8 3
  FP: Bristol City 4 1
  FN: Bristol City
Sentence: Wycombe 4 0 3 1 2 3 3
  FP: 3 1 2 3 3
  FN: Wycombe
Sentence: Wrexham 2 0 2 0 5 5 2
  FP: 0 2
  FN: Wrexham
Sentence: Stockport 4 0 2 2 1 3 2
  FP: 4 0 2 2 1 3 2
  FN: Stockport
Sentence: Walsall 3 0 1 2 2 4 1
  FP: 3
  FN: Walsall
Sentence: Fulham 4 3 0 1 5 3 9
  FP: 4 3
  FP: 1 5 3 9
  FN: Fulham
Sentence: Hull 4 2 2 0 4 2 8
  FN: Hull
Sentence: Hartlepool 4 2 1 1 6 5 7
  FN: Hartlepool
Sentence: Torquay 4 2 1 1 5 3 7
  FP: 4 2 1 1 5 3
  FN: Torquay
Sentence: Cardiff 4 2 1 1 3 2 7
  FP: 1 1 3 2 7
  FN: Cardiff
Sentence: Scunthorpe 4 2 1 1 3 3 7
  FP: Scunthorpe 4 2 1 1 3 3 7
  FN: Scunthorpe
Sentence: Carlisle 4 2 1 1 2 1 7
  FN: Carlisle
Sentence: Scarborough 4 1 3 0 5 3 6
  FP: 4 1 3
  

  FP: BASKETBALL - TROFEJ BEOGRAD
  FP: RESULTS
  FN: TROFEJ BEOGRAD TOURNAMENT
Sentence: BELGRADE 1996-08-31
  FN: BELGRADE
Sentence: Results in the Trofej Beograd 96
  FN: Trofej Beograd 96
Sentence: Benetton ( Italy ) 92 Dinamo ( Russia ) 81 ( halftime 50-28 )
  FP: 92
  FP: 81
  FP: 50-28
  FN: Italy
Sentence: Alba ( Germany ) 75 Red Star ( Yugoslavia ) 70 ( 42-41 )
  FP: 70
  FP: 75
  FN: Alba
Sentence: BASKETBALLSOCCER - TROFEJ BEOGRAD TOURNAMENT RESULTS .
  FP: RESULTS
  FP: BASKETBALLSOCCER - TROFEJ BEOGRAD
  FN: TROFEJ BEOGRAD TOURNAMENT
Sentence: BELGRADE 1996-08-31
  FN: BELGRADE
Sentence: Results in the Trofej Beograd 96
  FN: Trofej Beograd 96
Sentence: Benetton ( Italy ) 92 Dinamo ( Russia ) 81 ( halftime 50-28 )
  FP: 92
  FP: 81
  FP: 50-28
  FN: Italy
Sentence: Alba ( Germany ) 75 Red Star ( Yugoslavia ) 70 ( 42-41 )
  FP: 75
  FP: 70
  FN: Alba
Sentence: SOCCER - ROMANIA BEAT LITHUANIA IN WORLD CUP QUALIFIER .
  FP: SOCCER - ROMANIA BEAT LITHUANIA IN WORLD CUP QUALIFI

  FP: the World Series of Golf
  FP: six
  FP: 70
  FN: World Series of Golf
Sentence: The top four on the PGA Tour money list all skipped the tournament .
  FP: four
Sentence: TENNIS - SPAIN , U.S. TEAMS OPEN ON ROAD FOR 1997 FED CUP .
  FP: 1997
  FP: FED CUP
  FN: SPAIN
  FN: 1997 FED CUP
Sentence: NEW YORK 1996-08-31
  FN: NEW YORK
Sentence: This year 's Fed Cup finalists -- defending champion Spain and the United States -- will hit the road to open the 1997 women 's international team competition , based on the draw conducted Saturday at the U.S. Open .
  FP: the United States
  FP: 1997
  FP: year
  FP: Saturday
  FP: the U.S. Open
  FN: United States
  FN: U.S. Open
Sentence: Spain travels to Belgium , while the U.S. team heads to the Netherlands for first-round matches March 1-2 .
  FP: March
Sentence: The other two first-round ties will pit hosts Germany against the Czech Republic and visiting France against Japan .
  FP: two
  FP: the Czech Republic
  FN: Czech Republic
Sente

  FN: NASEEM
Sentence: DUBLIN 1996-08-31
  FN: DUBLIN
Sentence: Britain 's Naseem Hamed retained his WBO featherweight title on Saturday when Mexico 's Manuel Medina was retired by his corner at the end of the 11th round .
  FP: the end of the 11th
  FP: Saturday
Sentence: SOCCER - AUSTRIA DOMINATE SCOTLAND IN WORLD CUP QUALIFIER .
  FN: AUSTRIA
  FN: SCOTLAND
  FN: WORLD CUP
Sentence: Austria dominated their World Cup group four qualifier against Scotland on Saturday with wave after wave of attacks but were unable to penetrate the visitors ' defence and had to settle for a goalless draw .
  FP: Saturday
  FP: four qualifier
Sentence: Scotland , who thrashed Belarus 5-1 in their opening group four match , were unable to repeat their performance .
  FP: four
  FP: Belarus 5-1
  FN: Belarus
Sentence: Austria 's best chance came in the 63rd minute with Stephan Marasek of SC Freiburg taking advantage of a scramble in the Scottish penalty area but his shot narrowly passing the left-hand pos

  FN: division
Sentence: Deportivo Coruna 1 Real Madrid 1
  FN: Deportivo Coruna
  FN: Real Madrid
Sentence: SOCCER - BELGIUM BEAT TURKEY 2-1 IN WORLD CUP QUALIFIER .
  FP: WORLD CUP QUALIFIER
  FN: TURKEY
  FN: BELGIUM
  FN: WORLD CUP
Sentence: ( halftime 2-0 ) in a World Cup group seven soccer qualifier on
  FP: seven
Sentence: Belgium - Marc Degryse ( 13th ) , Luis Oliveira ( 38th )
  FP: 13th
  FP: 38th
Sentence: Turkey - Sergen Yalcin ( 61st )
  FP: 61st
  FN: Sergen Yalcin
Sentence: SOCCER - AUSTRIA DRAW 0-0 WITH SCOTLAND IN WORLD CUP QUALIFIER .
  FN: AUSTRIA
  FN: WORLD CUP
  FN: SCOTLAND
Sentence: Austria and Scotland drew 0-0 in a World Cup soccer European group four qualifier on Saturday .
  FP: Saturday
  FP: four
Sentence: BOXING - KNOCK-OUT SPECIALIST MILLER DEFENDS TITLE .
  FN: MILLER
Sentence: DUBLIN 1996-08-31
  FN: DUBLIN
Sentence: A powerful right hook followed by a straight left gave defending champion Nate Miller a seventh round knock-out win over fellow American 

  FN: Ruch
Sentence: News-stand chain Ruch , which controls about 65 percent of Poland 's press distribution market , had a net profit of 16.2 million zlotys on sales of 2.7 billion zlotys in 1995 .
  FP: about 65 percent
  FP: 2.7 billion zlotys
  FP: 16.2 million zlotys
  FP: 1995
  FN: Ruch
Sentence: Zycie Warszawy said the new , open consortium , which also included several listed Polish firms , would on Monday inform Privatisation Minister Wieslaw Kaczmarek of its plans .
  FP: Monday
Sentence: It aims to invest $ 200 million in Ruch over five years -- more than the sum Ruch says it needs to upgrade its outlets .
  FP: five years
  FP: $ 200 million
  FN: Ruch
  FN: Ruch
Sentence: Initially Poland offered up to 75 percent of Ruch but in March Kaczmarek cancelled the tender and offered a minority stake with an option to increase the equity .
  FP: March
  FP: up to 75 percent
  FN: Ruch
Sentence: Two consortia -- UWP-Hachette and a consortium of a Polish SPC group and Swiss firms -

  FP: five weeks
  FP: Nichols
  FN: James Leander ( Leo ) Nichols
Sentence: The Dow Chemical Co of the United States will invest $ 4 billion to build an ethylene plant in Tianjin city in northern China , the China Daily said on Saturday .
  FP: The Dow Chemical Co
  FP: the China Daily
  FP: the United States
  FP: Saturday
  FP: $ 4 billion
  FN: China Daily
  FN: United States
  FN: Dow Chemical Co
Sentence: Tianjin boasts a range of infrastructure facilities , attracting several multinational oil companies to invest in recent years .
  FP: recent years
Sentence: North Korea demanded on Saturday that South Korea return a northern war veteran who has been in the South since the 1950-53 war , Seoul 's unification ministry said .
  FP: unification ministry
  FP: Saturday
  FP: 1950-53
Sentence: Kim , an unrepentant communist , was captured during the Korean War and released after spending more than 30 years in a southern jail .
  FP: more than 30 years
  FP: the Korean War
  FN: Korean

*TODO: Write a short text that summarises the errors that you observed*

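spaCy treats numbers as entities, but the gold standard does not seem to annotate them, so they show up as false positives (for example `four`, `38`, `$ 8.4 billion`).
Dates and times are also entities in spaCy, but again not in the gold standard (for example `Friday`, `last week`, `1996-08-30`).

spaCy sometimes joins two entities, or an entity and its neighbouring tokens, into a single span (for example `N. Knight st Moin Khan b`, `Guingamp 2-1`, `the White House`). Each such span counts as a false positive and leaves the gold-standard spans it overlaps as false negatives.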


Now, use the insights from your error analysis to improve the automated prediction that you implemented in Problem&nbsp;2. While the best way to do this would be to [update spaCy&rsquo;s NER model](https://spacy.io/usage/linguistic-features#updating) using domain-specific training data, for this lab it suffices to write code to post-process the output produced by spaCy. You should be able to improve the F1 score from Problem&nbsp;2 by at least 15 percentage points.
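
As an illustration of what such post-processing could look like, here is a minimal sketch of a span-trimming heuristic. It is an assumption motivated by errors such as `the White House` and `Guingamp 2-1` in the report above, not a prescribed solution, and `trim_span` is a hypothetical helper name. It works on plain token lists, so it runs without spaCy.

```python
def trim_span(tokens, beg, end):
    """Shrink the span [beg, end) by dropping trailing tokens without letters
    and a leading 'the'.

    Returns the adjusted (beg, end) pair, or None if nothing is left.
    """
    # Drop trailing tokens that contain no letters, such as '2-1' or '1996-08-30'
    while end > beg and not any(ch.isalpha() for ch in tokens[end - 1]):
        end -= 1
    # Drop a leading determiner
    if end > beg and tokens[beg].lower() == 'the':
        beg += 1
    return (beg, end) if end > beg else None

print(trim_span("the White House said".split(' '), 0, 3))
print(trim_span("Guingamp 2-1 in the league".split(' '), 0, 2))
print(trim_span("NEW YORK 1996-08-30".split(' '), 0, 3))
print(trim_span("312".split(' '), 0, 1))
```

Whether these rules help on the development data has to be checked with `evaluation_report`; they may also remove tokens that belong to a gold-standard mention.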

In [15]:
# TODO: Write code here to improve the span prediction from Problem 2
def predictions_p(df):
    # Entity types that spaCy predicts but the gold standard does not annotate
    ignored = ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']
    for row in df.itertuples():
        sent = nlp(row.sentence)
        for ent in sent.ents:
            if ent.label_ in ignored:
                continue
            tokens = ent.text.split()
            if tokens[-1].isnumeric():
                # Drop a trailing number that spaCy attached to the mention
                yield (row.sentence_id, ent.start, ent.end - 1)
            else:
                yield (row.sentence_id, ent.start, ent.end)



In [16]:
predictions_p = set(predictions_p(df_dev))

In [17]:
evaluation_report(spans_dev_gold, predictions_p)

Precision:  0.8185467683661181 
Recall:  0.6892006084164273 
F1:  0.7483255344527021


In [18]:
#error_report(df_dev, spans_dev_gold, predictions_p)

Show that you achieve the performance goal by reporting the evaluation measures that you implemented in Problem&nbsp;1.

Before going on, we ask you to store the outputs of the improved named entity recognizer on the development data in a new data frame. This new frame should have the same layout as the original data frame for the development data that you loaded above, but should contain the *predicted* start and end positions for each token span, rather than the gold positions. As the `label` of each span, you can use the special value `--NME--`.

In [37]:
# TODO: Write code here to store the predicted spans in a new data frame
def store_df(df):
    # Entity types that spaCy predicts but the gold standard does not annotate
    ignored = ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']
    rows = []
    # Note: df has one row per gold mention, so a sentence with several
    # mentions is processed several times and its spans are stored repeatedly.
    for row in df.itertuples():
        sent = nlp(row.sentence)
        for ent in sent.ents:
            if ent.label_ in ignored:
                continue
            # Drop a trailing number that spaCy attached to the mention
            end = ent.end - 1 if ent.text.split()[-1].isnumeric() else ent.end
            rows.append([row.sentence_id, row.sentence, ent.start, end, '--NME--'])
    return pd.DataFrame(rows, columns=['sentence_id', 'sentence', 'beg', 'end', 'label'])

In [38]:
pred = store_df(df_dev)

In [39]:
pred.head()

| | sentence_id | sentence | beg | end | label |
|---|---|---|---|---|---|
| 0 | 0946-001 | LONDON 1996-08-30 | 0 | 1 | --NME-- |
| 1 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 0 | 2 | --NME-- |
| 2 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 5 | 7 | --NME-- |
| 3 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 14 | 15 | --NME-- |
| 4 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 0 | 2 | --NME-- |


In [40]:
df_dev.head()

| | sentence_id | sentence | beg | end | label |
|---|---|---|---|---|---|
| 0 | 0946-000 | CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTE... | 2 | 3 | Leicestershire_County_Cricket_Club |
| 1 | 0946-001 | LONDON 1996-08-30 | 0 | 1 | London |
| 2 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 0 | 2 | West_Indies_cricket_team |
| 3 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 3 | 5 | Phil_Simmons |
| 4 | 0946-002 | West Indian all-rounder Phil Simmons took four... | 12 | 13 | Leicestershire_County_Cricket_Club |


## Problem 4: Entity linking

Now that we have a method for predicting mention spans, we turn to the task of **entity linking**, which amounts to predicting the knowledge base entity that is referenced by a given mention. In our case, for each span we want to predict the Wikipedia page that this mention references.

Start by extending the generator function that you implemented in Problem&nbsp;2 to labelled spans.

In [28]:
def gold_mentions(df):
    """Yield the gold-standard mentions in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and entity label of each span.
    """
    # TODO: Replace the next line with your own code
    for row in df.itertuples():
        yield (row[1], row[3], row[4], row[5])

A naive baseline for entity linking on our data set is to link each mention span to the Wikipedia page name that we get when we join the tokens in the span by underscores, as is standard in Wikipedia page names. Suppose, for example, that a span contains the two tokens

    Jimi Hendrix

The baseline Wikipedia page name for this span would be

    Jimi_Hendrix

Implement this naive baseline and evaluate its performance. Print the evaluation measures that you implemented in Problem&nbsp;1.

<div class="alert alert-warning">
    Here and in the remainder of this lab, you should base your entity predictions on the predicted spans that you computed in Problem&nbsp;3.
</div>
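
The core of the baseline fits in one line. The sketch below shows it on plain token lists; `baseline_page_name` is a hypothetical helper name, and the case normalisation in the second call is an optional extra, not part of the task description.

```python
def baseline_page_name(tokens, beg, end):
    """Join the tokens of the span [beg, end) with underscores."""
    return '_'.join(tokens[beg:end])

print(baseline_page_name("Jimi Hendrix".split(' '), 0, 2))

# All-caps datelines such as 'NEW YORK' need case normalisation to match a
# page name; str.title() is one simple (and imperfect) option.
print(baseline_page_name("NEW YORK 1996-08-30".split(' '), 0, 2).title())
```

Note that `str.title()` also capitalises lowercase particles, so `Maarten_den_Bakker` would become `Maarten_Den_Bakker`.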

In [27]:
# TODO: Write code here to implement the baseline
def predictions_label(df):
    # Entity types that spaCy predicts but the gold standard does not annotate
    ignored = ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']
    for row in df.itertuples():
        sent = nlp(row.sentence)
        for ent in sent.ents:
            if ent.label_ in ignored:
                continue
            tokens = ent.text.split()
            end = ent.end
            if tokens[-1].isnumeric():
                # Drop a trailing number that spaCy attached to the mention
                tokens, end = tokens[:-1], end - 1
            # Baseline label: title-cased tokens joined by underscores
            yield (row.sentence_id, ent.start, end, '_'.join(tokens).lower().title())

In [25]:
predictions = set(predictions_label(df_dev))

In [26]:
gold = set(gold_mentions(df_dev))

evaluation_report(gold, predictions)

Precision:  0.3169409875551987 
Recall:  0.26685820517153963 
F1:  0.2897513533351684


## Problem 5: Extending the training data using the knowledge base

State-of-the-art approaches to entity linking exploit information in knowledge bases. In our case, where Wikipedia is the knowledge base, one particularly useful type of information is the links to other Wikipedia pages. In particular, we can interpret the anchor texts (the highlighted texts that you click on) as mentions of the entities (pages) that they link to. This allows us to harvest long lists of mention–entity pairings.

The following cell loads a data frame summarizing anchor texts and page references harvested from the first paragraphs of the English Wikipedia. The data frame also contains all entity mentions in the training data (but not the development or the test data).

In [18]:
with bz2.open('kb.tsv.bz2', 'rt') as source:
    df_kb = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

To understand what information is available in this data, the following cell shows the entry for the anchor text `Sweden`.

In [54]:
df_kb.loc[df_kb.mention == 'Sweden']

| | mention | entity | prob |
|---|---|---|---|
| 17436 | Sweden | Sweden | 0.985768 |
| 17437 | Sweden | Sweden_national_football_team | 0.014173 |
| 17438 | Sweden | Sweden_men's_national_ice_hockey_team | 5.9e-05 |


As you can see, each row of the data frame contains a pair $(m, e)$ of a mention $m$ and an entity $e$, as well as the conditional probability $P(e|m)$ for mention $m$ referring to entity $e$. These probabilities were estimated based on the frequencies of mention–entity pairs in the knowledge base. The example shows that the anchor text &lsquo;Sweden&rsquo; is most often used to refer to the entity [Sweden](http://en.wikipedia.org/wiki/Sweden), but in a few cases also to refer to Sweden&rsquo;s national football and ice hockey teams. Note that references are sorted in decreasing order of probability, so that the most probable pairings come first.
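The frequency-based estimate described above can be illustrated with a minimal sketch. The pairs below are made up, and the names `pairs`, `pair_counts`, `mention_counts` and `cond_prob` are hypothetical; this is not necessarily how `kb.tsv.bz2` was produced.

```python
from collections import Counter

# Hypothetical (mention, entity) pairs harvested from anchor texts; the data is made up.
pairs = [
    ('Sweden', 'Sweden'),
    ('Sweden', 'Sweden'),
    ('Sweden', 'Sweden_national_football_team'),
    ('Lincoln', 'Lincoln,_Nebraska'),
]

pair_counts = Counter(pairs)                   # count(m, e)
mention_counts = Counter(m for m, _ in pairs)  # count(m)

# Relative-frequency estimate: P(e|m) = count(m, e) / count(m)
cond_prob = {
    (m, e): n / mention_counts[m]
    for (m, e), n in pair_counts.items()
}

# List the candidates for each mention in decreasing order of probability.
for (m, e), p in sorted(cond_prob.items(), key=lambda item: (item[0][0], -item[1])):
    print(f'{m}\t{e}\t{p:.3f}')
```

For each mention, the probabilities of its candidate entities sum to one.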

Implement an entity linking method that resolves each mention to the most probable entity in the data frame. If the mention is not included in the data frame, you can predict the generic label `--NME--`. Print the precision, recall, and F1 of your method using the function that you implemented for Problem&nbsp;1.

In [36]:
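def label_kb(label):
    # Normalise the casing to match the knowledge base, e.g. 'SwedEn' -> 'Sweden'
    candidates = df_kb.loc[df_kb.mention == label.lower().title()]
    if candidates.empty:
        return '--NME--'
    # Rows are sorted by decreasing probability, so the first one is the most probable
    return candidates.iloc[0]['entity']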

In [30]:
label_kb('SwedEn')

'Sweden'
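`label_kb` scans the whole data frame on every call. A possible alternative is to build a dictionary once and look mentions up in it. The sketch below uses the three `Sweden` rows shown earlier as a toy stand-in for `df_kb`; the names `kb_rows`, `most_probable` and `label_kb_fast` are hypothetical. With the real data, the rows could presumably be taken from `df_kb[['mention', 'entity', 'prob']].itertuples(index=False)`.

```python
# Toy stand-in for the rows of df_kb, already sorted by decreasing probability.
kb_rows = [
    ('Sweden', 'Sweden', 0.985768),
    ('Sweden', 'Sweden_national_football_team', 0.014173),
    ('Sweden', "Sweden_men's_national_ice_hockey_team", 5.9e-05),
]

# Keep only the first (most probable) entity seen for each mention.
most_probable = {}
for mention, entity, prob in kb_rows:
    most_probable.setdefault(mention, entity)

def label_kb_fast(label):
    """Hypothetical dict-based variant of label_kb."""
    return most_probable.get(label.lower().title(), '--NME--')

print(label_kb_fast('SwedEn'))    # Sweden
print(label_kb_fast('Atlantis'))  # --NME--
```

This relies on the rows being sorted by decreasing probability within each mention, as stated above.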

In [49]:
## David's code

stop_list = ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']

def store_df(df_dev):
    df_pred = []
    for row in df_dev.itertuples():
        doc = nlp(row[2])
        for ent in doc.ents:
            if ent.label_ not in stop_list:
                df_pred.append([row[1], row[2], ent.start, ent.end, '--NME--'])
    return pd.DataFrame(df_pred, columns=['sentence_id', 'sentence', 'beg', 'end', 'label'])

In [58]:
davids_df = store_df(df_dev)

In [46]:
# TODO: Write code here to implement the baseline
def predictions_kb(df):
    for row in df.itertuples():
        sent = nlp(row[2])
        for ent in sent.ents:
            if ent.label_ not in ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']:
                tokens = ent.text.split()
                if tokens[-1].isnumeric():
                    # Drop a trailing numeric token before the lookup
                    new_ent = ' '.join(tokens[:-1])
                    yield (row[1], ent.start, ent.end - 1, label_kb(new_ent))
                else:
                    yield (row[1], ent.start, ent.end, label_kb(ent.text))

In [47]:
pre = set(predictions_kb(df_dev))

In [48]:
gold = set(gold_mentions(df_dev))
evaluation_report(gold, pre)

Precision:  0.46146126053793657 
Recall:  0.38854149062024673 
F1:  0.4218735663822369


In [66]:
# Test: look up the most probable entity directly in the data frame

# Note: this is the same object as davids_df, so the labels are updated in place
most_probable_ent = davids_df
for index, row in most_probable_ent.iterrows():
    words = row['sentence'].lower().title().split()
    ent = ' '.join(words[row['beg']:row['end']])
    mentions = df_kb.loc[df_kb.mention == ent]
    if len(mentions) > 0:
        most_probable_ent.at[index, 'label'] = mentions.iloc[0]['entity']
    else:
        most_probable_ent.at[index, 'label'] = '--NME--'

In [67]:
gold = set(gold_mentions(df_dev))
pre = set(gold_mentions(most_probable_ent))
evaluation_report(gold, pre)

Precision:  0.46407065435568046 
Recall:  0.3907385499408484 
F1:  0.4242591063400312


In [69]:
davids_df.head()

Unnamed: 0,sentence_id,sentence,beg,end,label
0,0946-001,LONDON 1996-08-30,0,1,London
1,0946-002,West Indian all-rounder Phil Simmons took four...,0,2,West_Indies_cricket_team
2,0946-002,West Indian all-rounder Phil Simmons took four...,5,7,--NME--
3,0946-002,West Indian all-rounder Phil Simmons took four...,14,15,Somerset_County_Cricket_Club
4,0946-002,West Indian all-rounder Phil Simmons took four...,0,2,West_Indies_cricket_team


## Problem 6: Context-sensitive disambiguation

Consider the entity mention &lsquo;Lincoln&rsquo;. The most probable entity for this mention turns out to be [Lincoln, Nebraska](http://en.wikipedia.org/wiki/Lincoln,_Nebraska); but in pages about American history, we would be better off predicting [Abraham Lincoln](http://en.wikipedia.org/wiki/Abraham_Lincoln). This suggests that we should try to disambiguate between different entity references based on the textual context on the page from which the mention was taken. Your task in this last problem is to implement this idea.

Set up a dictionary that contains, for each mention $m$ that can refer to more than one entity $e$, a separate Naive Bayes classifier that is trained to predict the correct entity $e$, given the textual context of the mention. As the prior probabilities of the classifier, choose the probabilities $P(e|m)$ that you used in Problem&nbsp;5. To let you estimate the context-specific probabilities, we have compiled a data set with mention contexts:

In [3]:
with bz2.open('contexts.tsv.bz2', 'rt') as source:
    df_contexts = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

This data frame contains, for each ambiguous mention $m$ and each knowledge base entity $e$ to which this mention can refer, up to 100 randomly selected contexts in which $m$ is used to refer to $e$. For this data, a **context** is defined as the 5 tokens to the left and the 5 tokens to the right of the mention. Here are a few examples:

In [85]:
df_contexts.head()

Unnamed: 0,mention,entity,context
0,1970,UEFA_Champions_League,Cup twice the first in @ and the second in 1983
1,1970,FIFA_World_Cup,America 1975 and during the @ and 1978 World C...
2,1990 World Cup,1990_FIFA_World_Cup,Manolo represented Spain at the @
3,1990 World Cup,1990_FIFA_World_Cup,Hašek represented Czechoslovakia at the @ and ...
4,1990 World Cup,1990_FIFA_World_Cup,renovations in 1989 for the @ The present capa...


In [158]:
df_kb.loc[df_kb.mention == 'Lincoln'].sort_values(by=['entity']).prob.values

array([0.22051282, 0.74871795, 0.03076923])

Note that, in each context, the position of the mention is indicated by the `@` symbol.
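The context definition above (5 tokens on either side, with `@` in place of the mention) can be sketched for a pre-tokenized sentence as follows. The function name `context_window` and the example sentence are hypothetical, and the sketch does not attempt to reproduce any further preprocessing that may have been applied to the contexts in the data set.

```python
def context_window(tokens, beg, end, size=5):
    """Return up to `size` tokens on each side of tokens[beg:end], with '@' for the mention."""
    left = tokens[max(0, beg - size):beg]
    right = tokens[end:end + size]
    return ' '.join(left + ['@'] + right)

# Made-up example sentence; the mention 'Lincoln' is the token span [5, 6).
tokens = 'The match between Swansea and Lincoln ended in a draw on Saturday'.split(' ')
print(context_window(tokens, 5, 6))
# The match between Swansea and @ ended in a draw on
```

Building the context from the token span, instead of replacing the mention string in the whole sentence, keeps the input to the classifier closer to the windows it was trained on.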

From this data frame, it is easy to select the data that you need to train the classifiers – the contexts and corresponding entities for all mentions. To illustrate this, the following cell shows how to select all contexts that belong to the mention &lsquo;Lincoln&rsquo;:

In [153]:
df_contexts[df_contexts.mention == 'Lincoln']

Unnamed: 0,mention,entity,context
41465,Lincoln,"Lincoln,_Nebraska",Nebraska Concealed Handgun Permit In @ municip...
41466,Lincoln,"Lincoln,_Nebraska",Lazlo restaurants are located in @ and Omaha C...
41467,Lincoln,"Lincoln,_Nebraska",California Washington Overland Park Kansas @ N...
41468,Lincoln,"Lincoln,_Nebraska",City Missouri Omaha Nebraska and @ Nebraska It...
41469,Lincoln,"Lincoln,_Nebraska",by Sandhills Publishing Company in @ Nebraska USA
...,...,...,...
41609,Lincoln,Lincoln_City_F.C.,@ Leyton Orient
41610,Lincoln,Lincoln_City_F.C.,English division three Swansea @
41611,Lincoln,Lincoln_City_F.C.,league membership narrowly edging out @ on goa...
41612,Lincoln,Lincoln_City_F.C.,@ Cambridge


Implement the context-sensitive disambiguation method and evaluate its performance. Here are some more hints that may help you along the way:

**Hint 1:** The prior probabilities for a Naive Bayes classifier can be specified using the `class_prior` option. You will have to provide the probabilities in the same order as the alphabetically sorted class (entity) names.

**Hint 2:** Not all mentions in the knowledge base are ambiguous, and therefore not all mentions have context data. If a mention has only one possible entity, pick that one. If a mention has no entity at all, predict the `--NME--` label.
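To illustrate Hint 1, the following minimal sketch fits a toy classifier with explicit priors and checks that scikit-learn stores the classes in sorted order. The contexts and prior values are made up, and the names `priors`, `class_order` and `class_prior` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training contexts and entities for a single mention.
contexts = ['restaurants in Omaha Nebraska', 'president during the war', 'league match against Swansea']
entities = ['Lincoln,_Nebraska', 'Abraham_Lincoln', 'Lincoln_City_F.C.']

# Made-up priors P(e|m), keyed by entity name.
priors = {'Lincoln,_Nebraska': 0.7, 'Abraham_Lincoln': 0.2, 'Lincoln_City_F.C.': 0.1}

# scikit-learn sorts the class labels, so the priors must follow the same order.
class_order = sorted(set(entities))
class_prior = [priors[e] for e in class_order]

pipe = Pipeline([
    ('counter', CountVectorizer()),
    ('naive', MultinomialNB(class_prior=class_prior)),
])
pipe.fit(contexts, entities)

nb = pipe.named_steps['naive']
print(list(nb.classes_))            # same order as class_order
print(np.exp(nb.class_log_prior_))  # the priors that were passed in
```

Looking the priors up by entity name, instead of relying on the row order of a data frame, makes it harder to pair a prior with the wrong class.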

In [111]:
# TODO: Write code here to implement the context-sensitive disambiguation method

# Imports for the per-mention Naive Bayes classifiers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [159]:

classifiers = {}

for mention in df_contexts.mention.unique():
    # Contexts and corresponding entities for this mention
    rows = df_contexts[df_contexts.mention == mention]
    x = rows.context.values
    y = rows.entity.values

    # Priors P(e|m) from the knowledge base, sorted by entity name to match
    # the alphabetical class order used by the classifier
    prior = df_kb.loc[df_kb.mention == mention].sort_values(by=['entity']).prob.values

    pipe = Pipeline([('counter', CountVectorizer()), ('naive', MultinomialNB(class_prior=prior))])
    classifiers[mention] = pipe.fit(x, y)

In [163]:
classifiers['Lincoln'].predict(['English division three Swansea Lincoln'])[0]

'Lincoln_City_F.C.'

In [178]:
# Predict with the new method, which also looks at the context

def get_context(mention, text):
    # Mark the position of the mention with '@', as in the training contexts
    return text.replace(mention, '@')

def nb_label(mention, sent):
    context = get_context(mention, sent)
    return classifiers[mention.lower().title()].predict([context])[0]

In [179]:
def context_pred(df):
    df_pred = []
    for row in df.itertuples():
        sent = nlp(row[2])
        for ent in sent.ents:
            if ent.label_ not in ['TIME', 'DATE', 'ORDINAL', 'CARDINAL', 'QUANTITY']:
                try:
                    label = nb_label(ent.text, row[2])
                except KeyError:
                    # No classifier was trained for this mention
                    label = '--NME--'

                df_pred.append([row[1], row[2], ent.start, ent.end, label])

    return pd.DataFrame(df_pred, columns=['sentence_id', 'sentence', 'beg', 'end', 'label'])


In [180]:
p = context_pred(df_dev)

In [181]:
gold = set(gold_mentions(df_dev))
pre = set(gold_mentions(p))
evaluation_report(gold, pre)

Precision:  0.26595744680851063 
Recall:  0.22393104613824574 
F1:  0.2431415726213414
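These scores are lower than the ones obtained in Problem&nbsp;5. One possible reason is that `context_pred` predicts `--NME--` for every mention that has no classifier, which includes the unambiguous mentions that Hint&nbsp;2 says should be resolved to their single entity. The following minimal sketch shows one way to chain the two lookups. The stand-in class `ConstantClassifier`, the toy dictionaries and the function `link_mention` are hypothetical, and the sketch has not been evaluated on the development data.

```python
class ConstantClassifier:
    """Hypothetical stand-in for a fitted scikit-learn pipeline."""
    def __init__(self, entity):
        self.entity = entity

    def predict(self, contexts):
        return [self.entity for _ in contexts]

# Toy data; in the notebook these would be derived from `classifiers` and `df_kb`.
toy_classifiers = {'Lincoln': ConstantClassifier('Lincoln_City_F.C.')}
toy_most_probable = {'Lincoln': 'Lincoln,_Nebraska', 'Sweden': 'Sweden'}

def link_mention(mention, context, classifiers, most_probable):
    key = mention.lower().title()
    if key in classifiers:
        # Ambiguous mention: let its classifier decide based on the context
        return classifiers[key].predict([context])[0]
    # Unambiguous mention: take its entity; unknown mention: generic label
    return most_probable.get(key, '--NME--')

print(link_mention('Lincoln', 'division three Swansea @', toy_classifiers, toy_most_probable))
print(link_mention('Sweden', '@ won the match', toy_classifiers, toy_most_probable))
print(link_mention('Atlantis', 'the lost city of @', toy_classifiers, toy_most_probable))
```

With this ordering, the context-sensitive method can only differ from the Problem&nbsp;5 baseline on mentions that actually have a classifier.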


You should expect to see a small (around 1&nbsp;unit) increase in precision, recall, and F1.

**This was the last lab in the Text Mining course. Congratulations!**

<div class="alert alert-info">
    Please read the section ‘General information’ on the ‘Labs’ page of the course website before submitting this notebook!
</div>