# L2: Information Extraction

In this lab you will implement and evaluate a simple system for information extraction. The task of the system is to read sentences and extract pairs of the form $(x, y)$, where $x$&nbsp;is a string denoting a person, $y$&nbsp;is a string denoting an organisation, and $x$ is the &lsquo;leader&rsquo; of&nbsp;$y$. Consider the following example sentence:

<blockquote>
Mr. Obama also selected Lisa Jackson to head the Environmental Protection Agency.
</blockquote>

From this sentence the system should extract the pair
```
("Lisa Jackson", "Environmental Protection Agency")
```

The system will have to solve the following sub-tasks:
* entity extraction &ndash; identifying mentions of person and organisation entities in text
* relation extraction &ndash; identifying instances of the &lsquo;is-leader-of&rsquo; relation

The data set for the lab consists of 62,010&nbsp;sentences from the [Groningen Meaning Bank](http://gmb.let.rug.nl) (release 2.2.0), an open corpus of English. To analyse the sentences you will use [spaCy](https://spacy.io/).

## Getting started

The first cell imports the Python module required for this lab.

In [1]:
import tm2

The next cell imports spaCy and loads its English language model.

In [2]:
import spacy
import re

nlp = spacy.load('en')

## Data and gold standard

The data is contained in the following file:

In [3]:
data_file = "/home/TDDE16/labs/l2/data/gmb.txt"

The `tm2` module defines a function `read_data` that returns an iterator over the lines in a file. You should use this function to read the data for this lab. Use the optional argument `n` to restrict the iteration to the first few lines of the file. Here is an example:

In [4]:
sent = [s for s in tm2.read_data(data_file, n=15000)]

In [5]:
sent[-1]

'After approving a new constitution and restoring multiparty politics in 1992, RAWLINGS won presidential elections in 1992 and 1996, but was constitutionally prevented from running for a third term in 2000.'
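For orientation, here is a minimal sketch of how a function like `read_data` could be written. The implementation in `tm2` is not shown in this lab, so the name `read_data_sketch` and its behaviour (stripping the trailing newline, stopping after `n` lines) are assumptions made for illustration only.

```python
import itertools
import os
import tempfile


def read_data_sketch(path, n=None):
    """Yield the lines of a text file, at most `n` of them if `n` is given.

    Hypothetical stand-in for `tm2.read_data`; the real function may differ.
    """
    with open(path, encoding="utf-8") as fp:
        # islice with stop=None iterates over the whole file.
        for line in itertools.islice(fp, n):
            yield line.rstrip("\n")


# Small self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first sentence\nsecond sentence\nthird sentence\n")
    demo_path = tmp.name

first_two = list(read_data_sketch(demo_path, n=2))
all_lines = list(read_data_sketch(demo_path))
os.remove(demo_path)
print(first_two)
```

Because the function is a generator, it can be passed directly to `nlp.pipe` without loading the whole file into memory.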

In addition to the raw data, we also provide you with a gold standard of entity pairs that your system should be able to extract. The following code loads these pairs from the file `gold.txt` and adds them to the set `gold`. Each pair is augmented with the identifier of the sentence (line number in the data file) which it was extracted from. Note that the sentence (line) numbering starts at index&nbsp;0.

In [6]:
gold_file = "/home/TDDE16/labs/l2/data/gold.txt"

gold = set()
with open(gold_file) as fp:
    for line in fp:
        columns = line.rstrip().split('\t')
        gold.add((int(columns[0]), columns[1], columns[2]))

In [7]:
gold

{(802, 'Ali Zardari', "Pakistan People 's Party"),
 (2297, 'Abdul Aziz al-Hakim', 'Supreme Council'),
 (4823, 'Slavkov', 'Bulgarian National Olympic Committee'),
 (7902, 'Mr. Hakim', 'Supreme Council'),
 (8206, 'J. Patrick Boyle', 'American Meat Institute'),
 (8633, 'Ali Rodriguez', 'Petroleos de Venezuela'),
 (9004, 'Foreign Minister Joschka Fischer', 'Green Party'),
 (11021, 'Khalaf', 'al-Qaida'),
 (11259, 'Joseph Domenech', "U.N. 's Food and Agricultural Organization"),
 (13043, 'David Petraeus', 'U.S. Central Command'),
 (15203, 'Joseph Kony', "Lord 's Resistance Army"),
 (15494, 'Khodorkovsky', 'Yukos'),
 (15906, 'President Chen Shui-bian', 'Democratic Progressive Party'),
 (18977, 'General Petraeus', 'U.S. Central Command'),
 (20496, 'Avigdor Lieberman', 'Yisrael Beitenu'),
 (20667, 'Mr. Fini', 'National Alliance'),
 (21914, 'Mr. Mwanawasa', 'Southern African Development Community'),
 (23016, 'Osama bin Laden', 'al-Qaida'),
 (28997, 'Ma', 'Nationalist Party'),
 (30171, 'al-Zarqaw

The following code prints the first 10&nbsp;pairs from the gold standard:

In [8]:
for i, person, org in sorted(gold)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

802	Ali Zardari	Pakistan People 's Party
2297	Abdul Aziz al-Hakim	Supreme Council
4823	Slavkov	Bulgarian National Olympic Committee
7902	Mr. Hakim	Supreme Council
8206	J. Patrick Boyle	American Meat Institute
8633	Ali Rodriguez	Petroleos de Venezuela
9004	Foreign Minister Joschka Fischer	Green Party
11021	Khalaf	al-Qaida
11259	Joseph Domenech	U.N. 's Food and Agricultural Organization
13043	David Petraeus	U.S. Central Command


You should take a moment to have a look at the data file and see these pairs in the context of the sentence they were extracted from.

## Entity extraction

To implement the entity extraction part of your system, you do not need to do much, as you can use the full natural language processing power built into spaCy. The following code extracts the entities from the first 5&nbsp;sentences of the data:

In [9]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=5))):
    for ent in doc.ents:
        print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))

Turkey	13	14	GPE
45	16	17	CARDINAL
at least six	20	23	CARDINAL
Turkish	0	1	NORP
Monday	6	7	DATE
Bilge	11	12	ORG
about 600 kilometers	12	15	QUANTITY
Ankara	16	17	GPE
Mardin	12	13	ORG


Read the [section about named entities](https://spacy.io/usage/linguistic-features#section-named-entities) from spaCy&rsquo;s documentation to get some background on this.

## Problem 1: Extract relevant pairs

The first problem that you will have to solve is to identify pairs of entities that are in the &lsquo;is-leader-of&rsquo; relation, as in the example above. There are many ways to do this, but for this lab it suffices to implement the strategy outlined in the section on [Relation Extraction](http://www.nltk.org/book/ch07.html#relation-extraction) in the book by Bird, Klein, and Loper (2009):

* look for all triples of the form $(x, \alpha, y)$ where $x$ and $y$ denote named entities of type *person* and *organisation*, respectively, and $\alpha$ is the intervening text
* write a regular expression to match just those instances of $\alpha$ that express the &lsquo;is-leader-of&rsquo; relation

You can restrict your attention to adjacent pairs of entities &ndash; that is, cases where $x$ precedes $y$ and $\alpha$ does not contain other named entities.
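The strategy above can be sketched without spaCy, using hand-written entity spans for the example sentence from the introduction. Everything here is illustrative: the tuple layout `(text, start, end, label)`, the helper names `span` and `adjacent_leader_pairs`, and the pattern `LEADER_PATTERN` are assumptions, and a real extractor would need many more pattern variants.

```python
import re

sentence = ("Mr. Obama also selected Lisa Jackson to head "
            "the Environmental Protection Agency.")


def span(text, label):
    """Build a (text, start_char, end_char, label) tuple for a mention."""
    start = sentence.index(text)
    return (text, start, start + len(text), label)


# Toy entity spans; in the lab these would come from spaCy's `doc.ents`.
entities = [
    span("Obama", "PERSON"),
    span("Lisa Jackson", "PERSON"),
    span("Environmental Protection Agency", "ORG"),
]

# Hypothetical pattern for the intervening text alpha.
LEADER_PATTERN = re.compile(
    r"^\s*(to head|who heads|, (the )?(head|leader) of)( the)?\s*$")


def adjacent_leader_pairs(sentence, entities):
    """Return (x, y) pairs where x is a person, y the next entity and an
    organisation, and the text between them matches LEADER_PATTERN."""
    pairs = []
    for (x, _, x_end, x_label), (y, y_start, _, y_label) in zip(entities, entities[1:]):
        if x_label != "PERSON" or y_label != "ORG":
            continue
        alpha = sentence[x_end:y_start]  # the intervening text
        if LEADER_PATTERN.match(alpha):
            pairs.append((x, y))
    return pairs


pairs = adjacent_leader_pairs(sentence, entities)
print(pairs)
```

Pairing each entity only with its immediate successor is what enforces the adjacency restriction: the intervening text can never contain another named entity.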

Write a function `extract` that takes an analysed sentence (represented as a spaCy [`Doc`](https://spacy.io/api/doc) object) and yields pairs $(x, y)$ of strings representing entity mentions predicted to be in the &lsquo;is-leader-of&rsquo; relation.

In [10]:
def extract(doc):
    """Extract relevant relation instances from the specified document.
    
    Args:
        doc: The sentence as analysed by spaCy.
    Yields:
        Pairs of strings representing the extracted relation instances.
    """
    # spaCy frequently labels people as GPE or ORG, so those labels are
    # accepted in the person slot as well (at the cost of precision).
    allowed_e1 = frozenset(['PERSON', 'GPE', 'ORG'])
    allowed_e2 = frozenset(['GPE', 'LOC', 'NORP', 'ORG'])
    
    # Pattern for the text between the two entities.
    reg = re.compile(r'(,)?((.*?((has been|as|is|new))?( the)? (head|leader|ruler|songwriter) (of|in|for)?)|.*?(who heads|heads|leads)).*?')
    # Only adjacent entity pairs are considered.
    for e1, e2 in zip(doc.ents, doc.ents[1:]):
        if e1.label_ not in allowed_e1 or e2.label_ not in allowed_e2:
            continue
        # Tokens between the end of the first and the start of the second entity.
        txt = doc[e1.end:e2.start]
        if reg.match(str(txt)):
            yield (str(e1), str(e2))

The following cell shows how your function is supposed to be used. The code prints out the extracted pairs for the first 1,000&nbsp;sentences in the data. It additionally numbers each pair with the sentence identifier.

In [142]:
for e, p in tm2.extract(doc):
    print(e,p)

In [11]:
gmap = {e[0]: (e[1], e[2]) for e in gold}

In [163]:
extracted = set()
fp = 0  # sentences with an extraction but no gold pair
fn = 0  # sentences with a gold pair but no extraction
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=60000))):
    has_extraction = False
    for person, org in extract(doc):
        print("EXTRACT:{}\t{}\t{}".format(i, person, org))
        extracted.add((i, person, org))
        has_extraction = True
    has_gold = i in gmap
    if has_gold:
        print("GOLD---:{}\t{}\t{}".format(i, gmap[i][0], gmap[i][1]))

    if has_gold and not has_extraction:
        fn += 1
        print('---False negative----')
        print(doc)
        for ent in doc.ents:
            print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))
        print('---------------------')

    if has_extraction and not has_gold:
        fp += 1
        print('---False positive----')
        print(doc)
        print('---------------------')

EXTRACT:512	Aung San Suu Kyi	the National League for Democracy
---False positive----
Aung San Suu Kyi, the head of the National League for Democracy, has spent 10 of the past 16 years in detention, mostly under house arrest.
---------------------
EXTRACT:615	Gerhard Schroeder	Social Democrats
---False positive----
The announcement followed Wednesday's third round of coalition talks between Christian Democratic leader Angela Merkel and Chancellor Gerhard Schroeder, who heads the ruling Social Democrats.
---------------------
EXTRACT:736	Viktor Yanukovych	Russian Party
---False positive----
Representatives of the country's pro-Western coalition say talks scheduled for Monday were canceled after Viktor Yanukovych, the leader of the opposition pro- Russian Party of Regions failed to attend.
---------------------
EXTRACT:802	Asif Ali Zardari	the Pakistan People's Party
GOLD---:802	Ali Zardari	Pakistan People 's Party
EXTRACT:2294	Iraq	Iraqi
---False positive----
Gunmen in Iraq have assassin

EXTRACT:13043	David Petraeus	U.S. Central Command
GOLD---:13043	David Petraeus	U.S. Central Command
EXTRACT:13596	Cuba	the Gulf of Mexico
---False positive----
The storm is predicted to cross Cuba, then intensify again as it heads north toward the Gulf of Mexico and the Florida Keys - a string of U.S. islands dividing the Atlantic Ocean from the Gulf of Mexico.
---------------------
EXTRACT:13836	Pokhrel	Nepal
---False positive----
Mr. Pokhrel was a popular religious leader in Nepal, who population is predominantly Hindu.
---------------------
EXTRACT:14069	Annan	Asia
---False positive----
Mr. Annan will head to south Asia later this week for a summit on relief efforts.
---------------------
EXTRACT:14855	Gerry Adams	IRA
---False positive----
President Bush is also expected to phone Gerry Adams, the leader of the IRA's political wing, Sinn Fein.
---------------------
EXTRACT:15038	India	Britain
---False positive----
The prize is awarded by the Gandhi Development Trust in honor of India

EXTRACT:22755	Arafat	Palestine Liberation Organization
---False positive----
Former Prime Minister Mahmoud Abbas, Mr. Arafat's successor as head of the powerful Palestine Liberation Organization, is expected to be the leading contender if his Fatah faction chooses him as its candidate.
---------------------
EXTRACT:22981	Gaza	Hamas
---False positive----
Several dozen people were wounded last month in factional clashes after Palestinian President Mahmoud Abbas rejected Hamas' appointment of a leading Gaza militant to head the Hamas-dominated security unit.
---------------------
EXTRACT:23016	Osama bin Laden	al-Qaida
GOLD---:23016	Osama bin Laden	al-Qaida
EXTRACT:23709	Richard Veerman	Angola
---False positive----
Richard Veerman is the group’s head of mission in Angola.
---------------------
EXTRACT:24833	Olusegan Obasanjo	AU
---False positive----
Among those attending the summit of the 53-member African Union are United Nations Secretary-General Kofi Annan and Nigerian President Olusega

EXTRACT:35475	Chavez	Asia
---False positive----
Later this week, Mr. Chavez heads to Asia for talks with regional leaders.
---------------------
EXTRACT:36057	Mullah Omar's	Taleban
---False positive----
The commander, said to be Mullah Omar's former deputy and now the head of Taleban operations, held a Kalashnikov assault rifle and partly covered his face with a black turban during the interview.
---------------------
EXTRACT:36074	Minas al-Yousifi	Iraqi
---False positive----
The Iraqi Vengeance Brigade had threatened to behead Minas al-Yousifi, the head of the Iraqi Christian Democratic party unless a large ransom was paid and U.S. troops withdrew from Iraq.
---------------------
EXTRACT:36114	Philippe Douste-Blazy	Colombia
---False positive----
French Foreign Minister Philippe Douste-Blazy has announced he will soon head to Colombia to try to win freedom for kidnapped former presidential candidate Ingrid Betancourt.
---------------------
GOLD---:36946	Vintsuk Vyachorka	Belarus Popula

EXTRACT:48091	Aftab Khan Sherpao	Taleban's Culture and Information Council
---False positive----
Pakistani Interior Minister Aftab Khan Sherpao says the head of the ousted Taleban's Culture and Information Council, Maulvi Muhammad Yasir, was arrested last week and is being interrogated.
---------------------
EXTRACT:48229	Hamid Karzai	EU
---False positive----
He is also scheduled to meet with Afghan President Hamid Karzai, several of the country's election candidates, as well as, the head of the EU election observation mission in Afghanistan.
---------------------
EXTRACT:48378	Abu Musab al-Zarqawi	the al Qaida
GOLD---:48378	Abu Musab al-Zarqawi	al Qaida
EXTRACT:48541	Raul Castro	Cuban
---False positive----
The media report Tuesday said Mr. Chavez also is to meet with acting Cuban President Raul Castro, who has been head of the Cuban government since Fidel Castro temporarily transferred power to him last July after undergoing intestinal surgery.
---------------------
EXTRACT:48799	Rice

EXTRACT:58831	Alejandro	Peru
---False positive----
A caretaker government oversaw new elections in the spring of 2001, which ushered in Alejandro TOLEDO Manrique as the new head of government - Peru's first democratically elected president of Native American ethnicity.
---------------------
EXTRACT:59053	France	French
---False positive----
He said France has configured its nuclear arsenal to respond to threats and noted that the number of nuclear warheads on French submarines has been reduced to allow for targeted strikes.
---------------------
EXTRACT:59063	Malcolm Rifkind	the Conservative Party
---False positive----
Former British Foreign Secretary Malcolm Rifkind has withdrawn from the race for leader of the Conservative Party and has thrown his support to former Finance Minister Kenneth Clarke.
---------------------
EXTRACT:59210	Saddam Hussein	Baghdad
---False positive----
Defense lawyers for former Iraqi President Saddam Hussein say an unidentified man attacked the ousted leader 

## Oddities

We ran the parser on roughly the first 15,000 sentences to see how it performed and to tune it. It turns out that it consistently finds meaningful "is-leader-of" relations which are not in the gold set. It also misses several items in the gold set, for instance when the person is classified as an organisation, which is not allowed by our parser.

### Sentences parsed with reasonable results, which are not in the gold set
 - 736  Viktor Yanukovych Russian Party
  - "Representatives of the country's pro-Western coalition say talks scheduled for Monday were canceled after Viktor Yanukovych, the leader of the opposition pro- Russian Party of Regions failed to attend."
 - 2945 Samuel Jang Jae Catholic +<- something
  - "The official news agency, KCNA, on Tuesday quotes Samuel Jang Jae On, described as head of the country's Catholic association, as saying North Korean Catholics are deeply saddened by the pontiff's passing."
 - 3944	Ichiro Ozawa	Japan +<- something
  - "Mr. Hu made the comments Tuesday during a meeting with Ichiro Ozawa, the leader of Japan's main opposition party."
 - 5580	Ali Abdul-Hussein	Shi'ite + something
  - "In the southern city of Basra, meanwhile, authorities say gunmen shot and killed Ali Abdul-Hussein, leader of a Shi'ite mosque, as he stood outside his house late Thursday."
 - 7427	Sonia Gandhi	Congress + something
  - 'The minister for expatriate Indians, Jagdish Tytler, said he had submitted his resignation to Sonia Gandhi, president of the ruling Congress party, and asked her to forward it to Prime Minister Manmohan Singh.'
 - 8225	Massoud Barzani	Iraq
  - "President Bush meets at the White House Tuesday with Massoud Barzani, the president of Iraq's Kurdish region."
 - 9021	Hassan Halemi	Kabul University
  - 'Hassan Halemi, head of the pathology department at Kabul University where the autopsies were carried out, said hours of testing Saturday confirmed the identities of teachers Jun Fukusho and Shinobu Hasegawa.'
  - Should be "Pathology department at..."
 - 9710	Robert Mugabe	Zimbabwe
  - "Justice minister Patrick Chinamasa announced that Mr. Mugabe appointed a 5- member body under the chairmanship of High Court Judge, George Chiweshe, who was appointed by President Robert Mugabe, leader of Zimbabwe's ruling Zanu-PF party."
 - 10395	Younus Khalis	Taleban
  - 'A statement published in Pakistan, meanwhile, reports the death last week of Younus Khalis, leader of a pro-Taleban faction in Afghanistan who had been in hiding since 2003, when he declared a holy war against foreign forces in Afghanistan.'
 - 12291	al-Jaafari	Islamic
  - 'Mr. al-Jaafari, the head of the influential Islamic Dawa party, told a news conference the day represents a big step forward for Iraq.'
 
### Sentences which are not parsed because of bad entity types
 
4823 'Slavkov will lose his position as head of the Bulgarian National Olympic Committee.'
 - Slavkov is GPE, should be PERSON

11021	Khalaf	al-Qaida
 - 'According to the department, Khalaf is a senior leader of al-Qaida in Iraq\'s "facilitation network," which controls the flow of resources -- including weapons, money and militants -- from Syria into Iraq.'
 - Khalaf is a GPE
 - Same reason as 4823 above
 
In addition, our parser does not recognize strings with more advanced structure, such as the following.
 
 
7902	Mr. Hakim	Supreme Council
 - Too complicated at the moment of writing
 "Mr. Hakim heads the Shi'ite dominated Supreme Council for the Revolution in Iraq, which has the largest representation in parliament."
 [('Hakim', 4, 9, 'PERSON'), ("Shi'ite", 20, 27, 'NORP'), ('Supreme Council for the Revolution in Iraq', 38, 80, 'ORG')]

Once you feel confident that your `extract` function does what it is supposed to do, execute the following cell to extract the entities from the full data set. Note that this will take several minutes (remember that we are processing 62k sentences).

In [12]:
extracted = set()
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file))):
    for person, org in extract(doc):
        extracted.add((i, person, org))
    print('\rProcessed {} sentences ...'.format(i+1), end='', flush=True)
print(' done')
emap = {e[0]: (e[1], e[2]) for e in extracted}

Processed 62010 sentences ... done


After executing the above cell, all extracted id-string-string triples are in the set `extracted`. The code in the next cell will print the first 10&nbsp;triples in this set.
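A cell along the following lines would do this. It is only a sketch: `extracted_demo` is a small stand-in for the real `extracted` set so that the snippet runs on its own, and its three triples are sample values copied from the output shown earlier.

```python
# Stand-in for the real `extracted` set; sample values only.
extracted_demo = {
    (8633, "Ali Rodriguez", "Petroleos de Venezuela"),
    (802, "Asif Ali Zardari", "the Pakistan People's Party"),
    (13043, "David Petraeus", "U.S. Central Command"),
}

# Sets are unordered, so sort first to get a reproducible "first 10".
first_triples = sorted(extracted_demo)[:10]
for i, person, org in first_triples:
    print("{}\t{}\t{}".format(i, person, org))
```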

## Problem 2: Evaluate your system

You now have an extractor, but how good is it? Your task now is to write code that computes the precision, recall, and F1 measure of your extractor relative to the gold standard.
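For reference, these are the standard definitions, written in terms of the gold set $G$ and the set of extracted triples $E$ (the symbols $G$ and $E$ are introduced here for illustration and do not appear elsewhere in the lab):

$$
P = \frac{|E \cap G|}{|E|}, \qquad
R = \frac{|E \cap G|}{|G|}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
$$

A triple counts as correct only if the sentence identifier and both strings match exactly.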

In [13]:
def evaluate(reference, predicted):
    """Print out the precision, recall, and F1 for the id-entity-entity
    triples in the set `predicted`, given the triples in the reference set.
    
    Args:
        reference: The reference set of triples.
        predicted: The set of predicted triples.
    Returns:
        Nothing, but prints out precision, recall, and F1.
    """
    # True positives: predicted triples that are also in the reference.
    tp = predicted & reference
    # Precision: share of predicted triples that are correct.
    precision = len(tp) / len(predicted) if predicted else 0.0
    # Recall: share of reference triples that were found.
    recall = len(tp) / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall (0 if nothing is correct).
    F1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1: ', F1)

The next cell shows how your function is intended to be used, as well as the suggested output format.

In [14]:
evaluate(gold, extracted)

Precision:  0.03111111111111111
Recall:  0.15217391304347827
F1:  0.051660516605166046
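As a sanity check, these three numbers are consistent with 7 correct triples among 225 extracted ones, measured against 46 gold triples. The counts are inferred from the printed ratios and are not stated anywhere in the notebook, so treat them as an assumption.

```python
# Counts inferred from the printed ratios (assumption, not given in the lab).
true_positives = 7
num_predicted = 225
num_gold = 46

precision = true_positives / num_predicted
recall = true_positives / num_gold
f1 = 2 * precision * recall / (precision + recall)

print("Precision: ", precision)
print("Recall: ", recall)
print("F1: ", f1)
```

With so few gold triples, each additional correct extraction moves recall by about two percentage points.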


The results are indeed terrible. One reason is the lack of entity resolution together with the fact that the comparison is done using exact matches.

## Problem 3: Entity resolution

Looking at the results of your quantitative evaluation, you will realise that your extractor (probably) does a rather poor job in matching the gold standard. One reason for this is that the NLP preprocessing is not perfect (spaCy was not trained on the annotations in the Groningen Meaning Bank), and that the approach of using regular expressions for relation extraction is rather naive.

Another reason however is that the current version of your system does not include a component for *entity resolution*. To give an example, your system does not realise that the strings `David Petraeus` and `General David Petraeus` refer to the same entity.

While writing a &lsquo;real&rsquo; entity resolver is beyond the scope of this assignment, we ask you to &lsquo;fake&rsquo; such a resolver. More specifically, you should implement a function `normalise` that takes an entity mention (a string) as its input and rewrites it to the form used in the gold standard. While this is &lsquo;cheating&rsquo;, it allows you to assess the performance of a more realistic system, and helps to illustrate that information extraction can be very domain-specific.

The following cell contains skeleton code for the `normalise` function.

In [159]:
for k in gmap.keys():
    if k in emap:
        print(gmap[k], emap[k])
    else:
        print(str(k) + ' not in emap')

32262 not in emap
11021 not in emap
41998 not in emap
('Avigdor Lieberman', 'Yisrael Beitenu') ('Avigdor Lieberman', 'the Yisrael Beitenu party')
61337 not in emap
21914 not in emap
('General David Petraeus', 'U.S. Central Command') ('David Petraeus', 'the U.S. Central Command')
('Lal Krishna Advani', 'Bharatiya Janata Party') ('Lal Krishna Advani', 'the Bharatiya Janata Party')
18977 not in emap
('Ali Zardari', "Pakistan People 's Party") ('Asif Ali Zardari', "the Pakistan People's Party")
15494 not in emap
49705 not in emap
31546 not in emap
('Foreign Minister Joschka Fischer', 'Green Party') ('Joschka Fischer', 'the Green Party')
('Ali Akbar Salehi', 'Atomic Energy Organization') ('Ali Akbar Salehi', 'the Atomic Energy Organization')
51507 not in emap
49336 not in emap
('Ali Rodriguez', 'Petroleos de Venezuela') ('Ali Rodriguez', 'Petroleos de Venezuela')
57350 not in emap
20667 not in emap
('Saad al-Fagih', 'Movement for Islamic Reform') ('Saad al-Fagih', 'the Movement for Islamic 

In [20]:
# Hand-written rewrites from extracted mentions to the gold-standard form.
GOLD_FORMS = {
    "David Petraeus": "General David Petraeus",
    "Asif Ali Zardari": "Ali Zardari",
    "Joschka Fischer": "Foreign Minister Joschka Fischer",
    "Chen Shui-bian": "President Chen Shui-bian",
    "Rafsanjani": "Mr. Rafsanjani",
    "Zarqawi": "al-Zarqawi",
    "Ahmad Jannati": "Ayatollah Ahmad Jannati",
    "Resistance Army": "Lord 's Resistance Army",
    "Abbas": "Mr. Abbas",
    "the Supreme Council for the Islamic Revolution in Iraq": "Supreme Council",
    "U.N.": "U.N. 's Food and Agricultural Organization",
}

def normalise(text):
    """Rewrite an entity mention to the form used in the gold standard."""
    if text in GOLD_FORMS:
        return GOLD_FORMS[text]
    # Drop a leading article. A prefix check is used because
    # str.lstrip('the') would strip any leading 't', 'h' or 'e' characters.
    if text.startswith('the '):
        text = text[len('the '):]
    return text.strip()
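One detail worth knowing when removing a leading article: Python's `str.lstrip` takes a set of characters, not a prefix, so it can eat into names that happen to start with `t`, `h` or `e`. The helper `strip_article` below is a hypothetical name used for this sketch, and the example strings are made up for illustration.

```python
def strip_article(text):
    """Remove a leading 'the ' (prefix match), then surrounding whitespace."""
    if text.startswith("the "):
        text = text[len("the "):]
    return text.strip()


# lstrip("the") strips any leading run of the characters 't', 'h', 'e'.
chars_stripped = "ethical board".lstrip("the")
prefix_stripped = strip_article("ethical board")
article_removed = strip_article("the Green Party")
print(chars_stripped, "|", prefix_stripped, "|", article_removed)
```

For mentions that start with a capital letter the two approaches agree, which is why the difference is easy to overlook.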

The next cell shows how `normalise` is intended to be used. Each triple in the set `extracted` is transformed by feeding the two entity mentions into the `normalise` function. The normalised triples are then added to a new set `extracted_normalised`.

In [21]:
extracted_normalised = set()
for triple in extracted:
    extracted_normalised.add((triple[0], normalise(triple[1]), normalise(triple[2])))

To pass the assignment, you should add enough normalisation rules to `normalise` to achieve a recall of at least 50%.

In [22]:
evaluate(gold, extracted_normalised)

Precision:  0.13777777777777778
Recall:  0.6739130434782609
F1:  0.22878228782287824


## Problem 4: Error analysis

Your last task in this assignment is to do a qualitative error analysis of your information extraction system. You can do this either by writing code or by manual work (inspecting the data file), or mix the two strategies. You can also use the visualisation tools provided by spaCy. For example, the following code cell visualises the output of the named entity recogniser for the given input sentence:

In [None]:
from spacy import displacy

sentence = u'Slavkov will lose his position as head of the Bulgarian National Olympic Committee.'

displacy.render(nlp(sentence), style='ent', jupyter=True)

In any case, you should enter your pairs in the provided text boxes. Use the triple format shown above, where for each pair you also specify the sentence id (line number in the data file) from which the instance was extracted.
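If you prefer to start from code, set differences give both error lists directly. The following is a minimal sketch on toy data: `gold_demo` and `predicted_demo` are hypothetical stand-ins for `gold` and `extracted_normalised`, filled with a few sample triples from the listings above.

```python
# Toy stand-ins for `gold` and `extracted_normalised`; sample values only.
gold_demo = {
    (13043, "David Petraeus", "U.S. Central Command"),
    (7902, "Mr. Hakim", "Supreme Council"),
}
predicted_demo = {
    (13043, "David Petraeus", "U.S. Central Command"),
    (14069, "Annan", "Asia"),
}

# In the gold standard but not extracted: recall-related errors.
false_negatives = sorted(gold_demo - predicted_demo)
# Extracted but not in the gold standard: precision-related errors
# (some of these may still be valid pairs missing from the gold standard).
false_positives = sorted(predicted_demo - gold_demo)

print("False negatives:", false_negatives)
print("False positives:", false_positives)
```

The sentence identifiers in each list can then be used to look up the corresponding lines in the data file.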

### Recall-related errors (false negatives)

By tuning the `normalise` function, you can deal with some of the recall-related mistakes that your system makes. Other recall-related errors cannot be fixed in this way. To illustrate this, find at least 5&nbsp;entity pairs in the gold standard that your system still does not identify correctly, and enter them into the text box below. For each example, provide a brief explanation of what goes wrong. Try to find examples that illustrate different types of errors.

7902: The sentence contains another entity between the person and the organisation, which our model does not handle.
15494: The sentence does not follow the assumed structure of "person->relation->organisation".
18977: General Petraeus is not identified as an entity by spaCy.
28997: Ma is not identified as an entity by spaCy.
41998: Ebadi is not identified as an entity by spaCy.

### Precision-related errors (false positives)

Next, provide at least 5 entity pairs that represent false positives of your system. Explain what goes wrong.

512: The sentence contains a simple "person, the head of organisation"-phrase and should be in the gold standard
615: The sentence contains a simple "person, who heads the organisation"-phrase and should be in the gold standard
736: The sentence contains a simple "person, the leader of organisation"-phrase and should be in the gold standard
2294: The sentence contains "Iraq ... leader...the northern Iraqi", which gives a false positive because the model accepts ORG and GPE entities in the "person slot" of the phrase. spaCy frequently classifies people as ORG/GPE, so these labels are allowed in order to meet the 50% recall requirement, even though it is wrong.
2815: The sentence contains "Somalia...head of...Mogadishu", which becomes a false positive for the same reason as in 2294.

### Incompleteness of the gold standard

You may have noticed that some of your system&rsquo;s false positives are actually &lsquo;correct&rsquo;. This can happen because, while each entity pair in the gold standard has been manually checked for correctness, no check has been made that the gold standard contains all relevant pairs. Find at least 5&nbsp;entity pairs in the data that are valid instances of the &lsquo;is-leader-of&rsquo; relation (according to your subjective judgement) but that are not contained in the gold standard.

Did you find any examples that you did not find when looking for false positives?

Yes.

This is the end of the assignment.