# L2: Information Extraction

In this lab you will implement and evaluate a simple system for information extraction. The task of the system is to read sentences and extract pairs of the form $(x, y)$, where $x$&nbsp;is a string denoting a person, $y$&nbsp;is a string denoting an organisation, and $x$ is the &lsquo;leader&rsquo; of&nbsp;$y$. Consider the following example sentence:

<blockquote>
Mr. Obama also selected Lisa Jackson to head the Environmental Protection Agency.
</blockquote>

From this sentence the system should extract the pair
```
("Lisa Jackson", "Environmental Protection Agency")
```

The system will have to solve the following sub-tasks:
* entity extraction &ndash; identifying mentions of person and organisation entities in text
* relation extraction &ndash; identifying instances of the &lsquo;is-leader-of&rsquo; relation

The data set for the lab consists of 62,010&nbsp;sentences from the [Groningen Meaning Bank](http://gmb.let.rug.nl) (release 2.2.0), an open corpus of English. To analyse the sentences you will use [spaCy](https://spacy.io/).

## Getting started

The first cell imports the Python module required for this lab.

In [1]:
import tm2

The next cell imports spaCy and loads its English language model.

In [2]:
import spacy

nlp = spacy.load('en')

## Data and gold standard

The data is contained in the following file:

In [3]:
data_file = "/home/TDDE16/labs/l2/data/gmb.txt"

The `tm2` module defines a function `read_data` that returns an iterator over the lines in a file. You should use this function to read the data for this lab. Use the optional argument `n` to restrict the iteration to the first few lines of the file. Here is an example:

In [4]:
for sentence in tm2.read_data(data_file, n=3):
    print(sentence)

Masked assailants with grenades and automatic weapons attacked a wedding party in southeastern Turkey, killing 45 people and wounding at least six others.
Turkish officials said the attack occurred Monday in the village of Bilge about 600 kilometers from Ankara.
The wounded were taken to the hospital in the nearby city of Mardin.
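The source of `tm2.read_data` is not shown in this lab. As a rough sketch only, a function with the same call pattern could be built on `itertools.islice`. The name `read_lines` and its behaviour (stripping the trailing newline) are assumptions, not the actual `tm2` implementation.

```python
import itertools
import os
import tempfile


def read_lines(path, n=None):
    """Yield the lines of a text file, at most n of them if n is given.

    Hypothetical stand-in for tm2.read_data; the real function may differ.
    """
    with open(path, encoding="utf-8") as fp:
        # islice(fp, None) iterates over the whole file.
        for line in itertools.islice(fp, n):
            yield line.rstrip("\n")


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\nthird\n")
    demo_path = tmp.name

first_two = list(read_lines(demo_path, n=2))
all_lines = list(read_lines(demo_path))
os.remove(demo_path)
print(first_two)
```

Because the function is a generator, lines are read lazily, which matters when only the first few of 62,010 sentences are needed.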


In [58]:
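from itertools import islice

def get_i_element(i):
    """Return the sentence at (zero-based) line i of the data file."""
    # Skip the first i lines lazily instead of collecting them in a list.
    return next(islice(tm2.read_data(data_file), i, None))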

In [59]:
print(get_i_element(802))

Asif Ali Zardari, leader of the Pakistan People's Party, says he wants to set aside the dispute over the Kashmir region so the two countries can focus on other issues, including boosting trade.


In addition to the raw data, we also provide you with a gold standard of entity pairs that your system should be able to extract. The following code loads these pairs from the file `gold.txt` and adds them to the set `gold`. Each pair is augmented with the identifier of the sentence (line number in the data file) which it was extracted from. Note that the sentence (line) numbering starts at index&nbsp;0.

In [4]:
gold_file = "/home/TDDE16/labs/l2/data/gold.txt"

gold = set()
with open(gold_file) as fp:
    for line in fp:
        columns = line.rstrip().split('\t')
        gold.add((int(columns[0]), columns[1], columns[2]))

The following code prints the first 10&nbsp;pairs from the gold standard:

In [4]:
for i, person, org in sorted(gold)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

802	Ali Zardari	Pakistan People 's Party
2297	Abdul Aziz al-Hakim	Supreme Council
4823	Slavkov	Bulgarian National Olympic Committee
7902	Mr. Hakim	Supreme Council
8206	J. Patrick Boyle	American Meat Institute
8633	Ali Rodriguez	Petroleos de Venezuela
9004	Foreign Minister Joschka Fischer	Green Party
11021	Khalaf	al-Qaida
11259	Joseph Domenech	U.N. 's Food and Agricultural Organization
13043	David Petraeus	U.S. Central Command


You should take a moment to have a look at the data file and see these pairs in the context of the sentence they were extracted from.

## Entity extraction

To implement the entity extraction part of your system, you do not need to do much, as you can use the full natural language processing power built into spaCy. The following code extracts the entities from the first 5&nbsp;sentences of the data:

In [25]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=5))):
    for ent in doc.ents:
        print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))

Turkey	13	14	GPE
45	16	17	CARDINAL
at least six	20	23	CARDINAL
Turkish	0	1	NORP
Monday	6	7	DATE
Bilge	11	12	ORG
about 600 kilometers	12	15	QUANTITY
Ankara	16	17	GPE
Mardin	12	13	ORG


Read the [section about named entities](https://spacy.io/usage/linguistic-features#section-named-entities) from spaCy&rsquo;s documentation to get some background on this.

## Problem 1: Extract relevant pairs

The first problem that you will have to solve is to identify pairs of entities that are in the &lsquo;is-leader-of&rsquo; relation, as in the example above. There are many ways to do this, but for this lab it suffices to implement the strategy outlined in the section on [Relation Extraction](http://www.nltk.org/book/ch07.html#relation-extraction) in the book by Bird, Klein, and Loper (2009):

* look for all triples of the form $(x, \alpha, y)$ where $x$ and $y$ denote named entities of type *person* and *organisation*, respectively, and $\alpha$ is the intervening text
* write a regular expression to match just those instances of $\alpha$ that express the &lsquo;is-leader-of&rsquo; relation

You can restrict your attention to adjacent pairs of entities &ndash; that is, cases where $x$ precedes $y$ and $\alpha$ does not contain other named entities.
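The two steps above can be sketched without spaCy by representing each entity as a plain tuple. In this sketch the `Entity` tuple, the `LEADER_PATTERN` expression, and the helpers `make_entity` and `extract_pairs` are illustrative assumptions. A real solution would read the entities from `doc.ents` and would probably need a richer pattern.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity: text, label, character offsets.
Entity = namedtuple("Entity", "text label start end")

# Illustrative pattern only. \b keeps "head" from matching inside "ahead".
LEADER_PATTERN = re.compile(
    r"\b(?:leader|head|chief|chairman)\s+of\b|\bto\s+(?:head|lead)\b"
)


def make_entity(sentence, text, label):
    """Build an Entity by locating `text` in `sentence`."""
    start = sentence.index(text)
    return Entity(text, label, start, start + len(text))


def extract_pairs(sentence, entities):
    """Yield (person, org) for adjacent PERSON/ORG entities whose
    intervening text matches LEADER_PATTERN."""
    # Consecutive entities have no other entity between them.
    for left, right in zip(entities, entities[1:]):
        if left.label == "PERSON" and right.label == "ORG":
            alpha = sentence[left.end:right.start]
            if LEADER_PATTERN.search(alpha):
                yield left.text, right.text


sentence = ("Mr. Obama also selected Lisa Jackson to head "
            "the Environmental Protection Agency.")
entities = [
    make_entity(sentence, "Obama", "PERSON"),
    make_entity(sentence, "Lisa Jackson", "PERSON"),
    make_entity(sentence, "Environmental Protection Agency", "ORG"),
]
pairs = list(extract_pairs(sentence, entities))
print(pairs)
```

Compared with a plain substring test, the word boundaries avoid accidental hits such as "ahead of", at the cost of a pattern that is a little harder to read.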

Write a function `extract` that takes an analysed sentence (represented as a spaCy [`Doc`](https://spacy.io/api/doc) object) and yields pairs $(x, y)$ of strings representing entity mentions predicted to be in the &lsquo;is-leader-of&rsquo; relation.

In [167]:
def extract(doc):
    """Extract relevant relation instances from the specified document.
    
    Args:
        doc: The sentence as analysed by spaCy.
    Yields:
        Pairs of strings representing the extracted relation instances.
    """
    matchers = {"leader of","lead","leads","leading","head of","head","heads","headed","heading"}
    is_person = False
    for ent in doc.ents:
        # Remember the most recent PERSON entity as a candidate leader.
        if ent.label_ == "PERSON":
            # Token index just past the person mention; the intervening text starts here.
            begin = ent.end
            person = ent
            is_person = True
            continue
        if ent.label_ == "ORG" and is_person:
            end = ent.start
            text = doc[begin:end].text
            if any(m in text for m in matchers):
                yield person.text, ent.text
            is_person = False
        else:
            is_person = False

The following cell shows how your function is supposed to be used. The code prints out the extracted pairs for the first 1,000&nbsp;sentences in the data. It additionally numbers each pair with the sentence identifier.

In [169]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=1000))):
    for person, org in extract(doc):
        print("{}\t{}\t{}".format(i, person, org))

207	Rugova	European Union
351	Jendayi Frazer	Sudan Liberation Army
512	Aung San Suu Kyi	the National League for Democracy
736	Viktor Yanukovych	Russian Party
802	Asif Ali Zardari	the Pakistan People's Party


Once you feel confident that your `extract` function does what it is supposed to do, execute the following cell to extract the pairs from the full data set. Note that this will take several minutes (remember that we are processing 62k sentences).

In [170]:
extracted = set()
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file))):
    for person, org in extract(doc):
        extracted.add((i, person, org))
    print('\rProcessed {} sentences ...'.format(i+1), end='', flush=True)
print(' done')

Processed 62010 sentences ... done


In [5]:
import pickle

In [173]:
with open("extracted.p", "wb") as fp:
    pickle.dump(extracted, fp)

In [6]:
with open("extracted.p", "rb") as fp:
    extracted = pickle.load(fp)

After executing the above cell, all extracted id-string-string triples are in the set `extracted`. The code in the next cell will print the first 10&nbsp;triples in this set.

In [7]:
for i, person, org in sorted(extracted)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

207	Rugova	European Union
351	Jendayi Frazer	Sudan Liberation Army
512	Aung San Suu Kyi	the National League for Democracy
736	Viktor Yanukovych	Russian Party
802	Asif Ali Zardari	the Pakistan People's Party
1349	Karen Hughes	State Department
1790	Koizumi	the United Nations
2297	Abdul Aziz al-Hakim	the Supreme Council for the Islamic Revolution in Iraq
3274	Jack Abramoff	Congress
3520	Peres	Amir Peretz


## Problem 2: Evaluate your system

You now have an extractor, but how good is it? Your task now is to write code that computes the precision, recall, and F1 measure of your extractor relative to the gold standard.
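As a reminder of the definitions: with $TP$ true positives, $FP$ false positives, and $FN$ false negatives, precision is $TP/(TP+FP)$, recall is $TP/(TP+FN)$, and F1 is the harmonic mean of the two. The toy triples below are invented purely to illustrate the arithmetic. They are not taken from the data.

```python
# Toy sets of (sentence id, person, organisation) triples, invented for illustration.
toy_gold = {(0, "A", "X"), (1, "B", "Y"), (2, "C", "Z"), (3, "D", "W")}
toy_predicted = {(0, "A", "X"), (1, "B", "Y"), (5, "E", "V")}

tp = len(toy_gold & toy_predicted)   # triples found in both sets
fp = len(toy_predicted - toy_gold)   # predicted but not in the gold standard
fn = len(toy_gold - toy_predicted)   # in the gold standard but not predicted

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that a triple only counts as a true positive when the sentence id and both strings match exactly.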

In [46]:
def evaluate(reference, predicted):
    """Print out the precision, recall, and F1 for the id-entity-entity
    triples in the set `predicted`, given the triples in the reference set.
    
    Args:
        reference: The reference set of triples.
        predicted: The set of predicted triples.
    Returns:
        Nothing, but prints out precision, recall, and F1.
    """
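    TP = len(reference.intersection(predicted))  # triples found in both sets
    FP = len(predicted) - TP                     # predicted, but not in the reference
    FN = len(reference) - TP                     # in the reference, but not predicted

    # Guard each ratio so that empty sets do not raise ZeroDivisionError.
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if TP else 0.0
    print("precision: {:.2f}%".format(precision * 100))
    print("recall: {:.2f}%".format(recall * 100))
    print("f1-score: {:.2f}".format(f1))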

The next cell shows how your function is intended to be used, as well as the suggested output format.

In [43]:
evaluate(gold, extracted)

precision: 0.05%
recall: 0.11%
f1-score: 0.06%


## Problem 3: Entity resolution

Looking at the results of your quantitative evaluation, you will realise that your extractor (probably) does a rather poor job of matching the gold standard. Two reasons for this are that the NLP preprocessing is not perfect (spaCy was not trained on the annotations in the Groningen Meaning Bank) and that the approach of using regular expressions for relation extraction is rather naive.

Another reason, however, is that the current version of your system does not include a component for *entity resolution*. For example, your system does not realise that the strings `David Petraeus` and `General David Petraeus` refer to the same entity.

While writing a &lsquo;real&rsquo; entity resolver is beyond the scope of this assignment, we ask you to &lsquo;fake&rsquo; such a resolver. More specifically, you should implement a function `normalise` that takes an entity mention (a string) as its input and rewrites it to the form used in the gold standard. While this is &lsquo;cheating&rsquo;, it allows you to assess the performance of a more realistic system, and helps to illustrate that information extraction can be very domain-specific.
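One way to organise such rewrite rules is a lookup table combined with a few generic rules. The sketch below is only an illustration: the table `REWRITES`, the function name `normalise_sketch`, and the choice of rules are assumptions, while the example mentions are taken from the pairs printed earlier in this notebook.

```python
# Hypothetical lookup table from extracted mentions to gold-standard forms.
REWRITES = {
    "Asif Ali Zardari": "Ali Zardari",
    "the Supreme Council for the Islamic Revolution in Iraq": "Supreme Council",
}


def normalise_sketch(text):
    """Rewrite an entity mention using exact-match rules first,
    then a generic rule that drops a leading 'the'."""
    if text in REWRITES:
        return REWRITES[text]
    if text.startswith("the "):
        return text[len("the "):]
    return text


examples = [
    "Asif Ali Zardari",
    "the National League for Democracy",
    "American Meat Institute",
]
normalised = [normalise_sketch(e) for e in examples]
print(normalised)
```

Checking the exact-match table before the generic rule keeps specific rewrites from being shadowed by the article-stripping step.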

The following cell contains skeleton code for the `normalise` function.

In [51]:
def normalise(text):
    if text == 'Asif Ali Zardari':
        return 'Ali Zardari'
    if text == 'Joschka Fischer':
        return 'Foreign Minister Joschka Fischer'   
    if text == 'Chen Shui-bian':
        return 'President Chen Shui-bian'
    if text == 'the Yisrael Beitenu party':
        return 'Yisrael Beitenu'
    if text == """the U.N.'""":
        return """U.N. 's Food and Agricultural Organization"""
    if text == 'the Supreme Council for the Islamic Revolution in Iraq':
        return 'Supreme Council'
    if text == "David Petraeus":
        return "General David Petraeus"
    if text.split(' ')[0] == 'the':
        return ' '.join(text.split(' ')[1:])
    if text == 'Mwanawasa':
        return 'Mr. Mwanawasa'
    if text == 'Mr. Fini':
        return 'Fini'
    if text == 'Rafsanjani':
        return 'Mr. Rafsanjani'
    if text == 'Ahmad Jannati':
        return 'Ayatollah Ahmad Jannati'
    if text == 'Zarqawi':
        return 'al-Zarqawi'
    if text == 'Resistance Army':
        return """Lord 's Resistance Army"""
    if text == 'Coleman':
        return 'Mr. Coleman'
    if text == 'Abbas':
        return 'Mr. Abbas'
    return text

The next cell shows how `normalise` is intended to be used. Each triple in the set `extracted` is transformed by feeding the two entity mentions into the `normalise` function. The normalised triples are then added to a new set `extracted_normalised`.

In [52]:
extracted_normalised = set()
for triple in extracted:
    extracted_normalised.add((triple[0], normalise(triple[1]), normalise(triple[2])))

To pass the assignment, you should add enough normalisation rules to `normalise` to achieve a recall of at least 50%.

In [53]:
evaluate(gold, extracted_normalised)

precision: 24.77%
recall: 58.70%
f1-score: 0.35


## Problem 4: Error analysis

Your last task in this assignment is to do a qualitative error analysis of your information extraction system. You can do this by writing code, by manual work (inspecting the data file), or by a mix of the two. You can also use the visualisation tools provided by spaCy. For example, the following code cell visualises the output of the named entity recogniser for the given input sentence:

In [54]:
from spacy import displacy

sentence = u'Slavkov will lose his position as head of the Bulgarian National Olympic Committee.'

displacy.render(nlp(sentence), style='ent', jupyter=True)

In [66]:
for e in extracted_normalised:
    print('LINE: {}\nPERSON: {}\nORG: {}'.format(e[0], e[1], e[2]))
    displacy.render(nlp(get_i_element(e[0])), style='ent', jupyter=True)
    print('\n')

LINE: 44280
PERSON: Gandhi
ORG: National Advisory Council




LINE: 48378
PERSON: Abu Musab al-Zarqawi
ORG: al Qaida




LINE: 28824
PERSON: Kenichiro Sasae
ORG: Japanese Foreign Ministry's




LINE: 8206
PERSON: J. Patrick Boyle
ORG: American Meat Institute




LINE: 54098
PERSON: Ahmet Turk
ORG: Kurdish Democratic Society Party




LINE: 59366
PERSON: Netanyahu
ORG: Likud




LINE: 37076
PERSON: Bill Clinton
ORG: U.N.




LINE: 53885
PERSON: Ta Mok
ORG: Tuol Sleng




LINE: 28759
PERSON: Nord Eclair
ORG: Rally for Republicans"




LINE: 33646
PERSON: Mr. Coleman
ORG: Senate Government Affairs




LINE: 18520
PERSON: Goss
ORG: CIA




LINE: 351
PERSON: Jendayi Frazer
ORG: Sudan Liberation Army




LINE: 736
PERSON: Viktor Yanukovych
ORG: Russian Party




LINE: 15203
PERSON: Joseph Kony
ORG: Lord 's Resistance Army




LINE: 61023
PERSON: Dominique McAdams
ORG: U.N.




LINE: 40315
PERSON: John Solecki
ORG: U.N.




LINE: 42281
PERSON: Jacques Chirac
ORG: U.N. Security Council's




LINE: 10231
PERSON: Pervez Musharraf
ORG: Non- Aligned Movement




LINE: 15906
PERSON: President Chen Shui-bian
ORG: Democratic Progressive Party




LINE: 52988
PERSON: Bush
ORG: U.S. Justice Department




LINE: 19865
PERSON: Benedict
ORG: Catholic Church




LINE: 42098
PERSON: Mr. Abbas
ORG: Fatah




LINE: 43377
PERSON: Haradinaj
ORG: Albanian Kosovo Liberation Army




LINE: 30171
PERSON: al-Zarqawi
ORG: al-Qaida




LINE: 30026
PERSON: Charles Duelfer
ORG: Iraq Survey Group




LINE: 9364
PERSON: Kony
ORG: Lord 's Resistance Army




LINE: 48091
PERSON: Aftab Khan Sherpao
ORG: Taleban's Culture and Information Council




LINE: 26170
PERSON: Bush
ORG: Army




LINE: 20667
PERSON: Fini
ORG: National Alliance




LINE: 11487
PERSON: Atta Mohammed
ORG: Northern Alliance




LINE: 61152
PERSON: Saad al-Fagih
ORG: Movement for Islamic Reform




LINE: 51157
PERSON: Abu Musab al-Zarqawi
ORG: al-Qaeda




LINE: 23016
PERSON: Osama bin Laden
ORG: al-Qaida




LINE: 10490
PERSON: Stronach
ORG: Magna




LINE: 49242
PERSON: Ayatollah Ahmad Jannati
ORG: Guardian Council




LINE: 8633
PERSON: Ali Rodriguez
ORG: Petroleos de Venezuela




LINE: 20496
PERSON: Avigdor Lieberman
ORG: Yisrael Beitenu




LINE: 57199
PERSON: Detlev Mehlis
ORG: U.N.




LINE: 57127
PERSON: Peter Feith
ORG: EU




LINE: 24833
PERSON: Olusegan Obasanjo
ORG: AU




LINE: 51967
PERSON: Saad al-Fagih
ORG: Movement for Islamic Reform




LINE: 12080
PERSON: Haim Ramon
ORG: Kadima Party




LINE: 44908
PERSON: Kevin Costner
ORG: Kevin Costner Band




LINE: 40736
PERSON: Lal Krishna Advani
ORG: Bharatiya Janata Party




LINE: 1790
PERSON: Koizumi
ORG: United Nations




LINE: 56028
PERSON: Mahmud Abbas
ORG: Hamas




LINE: 57350
PERSON: Gene Sperling
ORG: National Economic Council




LINE: 1349
PERSON: Karen Hughes
ORG: State Department




LINE: 33608
PERSON: Slobodan Milosevic
ORG: U.N.




LINE: 20273
PERSON: Gerry Adams
ORG: IRA




LINE: 55036
PERSON: Francisco Flores
ORG: Organization of America States




LINE: 31546
PERSON: Mr. Abbas
ORG: Fatah




LINE: 6297
PERSON: Putin
ORG: NATO-Russia Council




LINE: 59063
PERSON: Malcolm Rifkind
ORG: Conservative Party




LINE: 9021
PERSON: Hassan Halemi
ORG: Kabul University




LINE: 43640
PERSON: Johan Verbeke
ORG: U.N.




LINE: 46977
PERSON: Rowhani
ORG: European Union




LINE: 42563
PERSON: Sharon
ORG: Likud




LINE: 36057
PERSON: Mullah Omar's
ORG: Taleban




LINE: 27180
PERSON: Lev Ponomaryov
ORG: Memorial Human Rights Center




LINE: 3520
PERSON: Peres
ORG: Amir Peretz




LINE: 26443
PERSON: Tom Delay
ORG: Congress




LINE: 22755
PERSON: Arafat
ORG: Palestine Liberation Organization




LINE: 49094
PERSON: Nick Marinellis
ORG: Mondayinthe New South Wales District Court




LINE: 11259
PERSON: Joseph Domenech
ORG: U.N.'s




LINE: 4567
PERSON: Bush
ORG: U.S. Justice Department




LINE: 61337
PERSON: Lisa Jackson
ORG: Environmental Protection Agency




LINE: 13043
PERSON: General David Petraeus
ORG: U.S. Central Command




LINE: 802
PERSON: Ali Zardari
ORG: Pakistan People's Party




LINE: 7917
PERSON: Vladimir Putin
ORG: Group




LINE: 50150
PERSON: Jacques Chirac
ORG: United Nations




LINE: 4753
PERSON: Junichiro Koizumi
ORG: APEC




LINE: 47798
PERSON: Li Rui
ORG: Communist Party's




LINE: 2297
PERSON: Abdul Aziz al-Hakim
ORG: Supreme Council




LINE: 61476
PERSON: Bashir
ORG: International Criminal Court




LINE: 22861
PERSON: Michel Barnier
ORG: European Union




LINE: 20815
PERSON: Mohammed
ORG: Wall Street Journal




LINE: 14130
PERSON: DeLay
ORG: House




LINE: 9495
PERSON: Tom DeLay
ORG: House of Representatives




LINE: 39722
PERSON: Francisco Galan
ORG: Uribe




LINE: 5046
PERSON: Daniel Pearl
ORG: Al-Qaida




LINE: 14855
PERSON: Gerry Adams
ORG: IRA




LINE: 30176
PERSON: al-Zarqawi
ORG: al-Qaida




LINE: 10395
PERSON: Younus Khalis
ORG: Taleban




LINE: 34889
PERSON: Prince Ali
ORG: West Asian Football Federation




LINE: 48229
PERSON: Hamid Karzai
ORG: EU




LINE: 21914
PERSON: Mr. Mwanawasa
ORG: Southern African Development Community




LINE: 43850
PERSON: Yoweri Museveni
ORG: Lord 's Resistance Army




LINE: 28833
PERSON: Hamid Karzai
ORG: NATO




LINE: 512
PERSON: Aung San Suu Kyi
ORG: National League for Democracy




LINE: 30777
PERSON: Gerhard Schroeder
ORG: Christian Democratic Union




LINE: 25815
PERSON: Salihu Abubakar
ORG: Nigerian Premier League




LINE: 12618
PERSON: Merkel
ORG: African Union




LINE: 61469
PERSON: Ban
ORG: NATO




LINE: 40372
PERSON: Mahmoud Ahmadinejad
ORG: Supreme National Security Council




LINE: 53168
PERSON: Nawaz Sharif
ORG: Pakistan Muslim League




LINE: 55319
PERSON: Syed Hamid
ORG: ASEAN




LINE: 32195
PERSON: Bolton
ORG: U.N.




LINE: 53075
PERSON: Mr. Rafsanjani
ORG: Expediency Council




LINE: 3274
PERSON: Jack Abramoff
ORG: Congress




LINE: 13637
PERSON: Zardari
ORG: Pakistan People's Party




LINE: 26323
PERSON: Mehdi Karroubi
ORG: Revolutionary Guards Mohsen Rezai




LINE: 37037
PERSON: Ali Akbar Salehi
ORG: Atomic Energy Organization




LINE: 60729
PERSON: General David Petraeus
ORG: U.S. Central Command




LINE: 40460
PERSON: Mohamed GHANNOUCHI
ORG: Chamber of Deputies




LINE: 28047
PERSON: Joseph Terrence Thomas
ORG: al-Qaida




LINE: 207
PERSON: Rugova
ORG: European Union




LINE: 9004
PERSON: Foreign Minister Joschka Fischer
ORG: Green Party




LINE: 19681
PERSON: Jerry Adams
ORG: IRA






In any case, you should enter your pairs in the provided text boxes. Use the triple format shown above, where for each pair you also specify the sentence id (line number in the data file) from which the instance was extracted.

### Recall-related errors (false negatives)

By tuning the `normalise` function, you can deal with some of the recall-related mistakes that your system makes. Other recall-related errors cannot be fixed in this way. To illustrate this, find at least 5&nbsp;entity pairs in the gold standard that your system still does not identify correctly, and enter them into the text box below. For each example, provide a brief explanation of what goes wrong. Try to find examples that illustrate different types of errors.
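Before reading individual sentences, it can help to separate gold pairs whose sentence produced no extraction at all from those where something was extracted but the strings differ. The sketch below assumes triples of the form (sentence id, person, organisation) as used above. The function name `split_false_negatives` is hypothetical, and the toy sets merely reuse a few triples from the outputs shown in this notebook.

```python
from collections import defaultdict


def split_false_negatives(reference, predicted):
    """Split the false negatives into two groups.

    Returns (missed, mismatched): `missed` holds gold triples whose sentence
    yielded no prediction at all; `mismatched` maps each remaining gold
    triple to the predictions made for the same sentence.
    """
    by_sentence = defaultdict(set)
    for triple in predicted:
        by_sentence[triple[0]].add(triple)

    missed, mismatched = set(), {}
    for triple in reference - predicted:
        candidates = by_sentence.get(triple[0])
        if candidates:
            mismatched[triple] = candidates
        else:
            missed.add(triple)
    return missed, mismatched


toy_gold = {
    (802, "Ali Zardari", "Pakistan People 's Party"),
    (4823, "Slavkov", "Bulgarian National Olympic Committee"),
    (8206, "J. Patrick Boyle", "American Meat Institute"),
}
toy_predicted = {
    (802, "Ali Zardari", "Pakistan People's Party"),
    (8206, "J. Patrick Boyle", "American Meat Institute"),
}

missed, mismatched = split_false_negatives(toy_gold, toy_predicted)
print(missed)
print(mismatched)
```

In this toy input, sentence 802 lands in the mismatched group because of a tokenisation difference (`People 's` versus `People's`), whereas sentence 4823 produced no candidate pair at all.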

In [68]:
for e in gold - extracted_normalised:
    print('LINE: {}\nPERSON: {}\nORG: {}'.format(e[0], e[1], e[2]))
    displacy.render(nlp(get_i_element(e[0])), style='ent', jupyter=True)
    print('\n')

LINE: 13043
PERSON: David Petraeus
ORG: U.S. Central Command




LINE: 15494
PERSON: Khodorkovsky
ORG: Yukos




LINE: 36946
PERSON: Vintsuk Vyachorka
ORG: Belarus Popular Front




LINE: 7902
PERSON: Mr. Hakim
ORG: Supreme Council




LINE: 4823
PERSON: Slavkov
ORG: Bulgarian National Olympic Committee




LINE: 11259
PERSON: Joseph Domenech
ORG: U.N. 's Food and Agricultural Organization




LINE: 20667
PERSON: Mr. Fini
ORG: National Alliance




LINE: 49705
PERSON: Ocalan
ORG: Kurdistan Workers Party




LINE: 51507
PERSON: Abdullah Ocalan
ORG: Kurdistan Workers Party




LINE: 57420
PERSON: Major General Udi Adam
ORG: Northern Command




LINE: 18977
PERSON: General Petraeus
ORG: U.S. Central Command




LINE: 802
PERSON: Ali Zardari
ORG: Pakistan People 's Party




LINE: 32262
PERSON: Morgan Tsvangirai
ORG: Movement for Democratic Change




LINE: 11021
PERSON: Khalaf
ORG: al-Qaida




LINE: 28997
PERSON: Ma
ORG: Nationalist Party




LINE: 44637
PERSON: Saad al-Fagih
ORG: Movement of Islamic Reform




LINE: 49336
PERSON: Manie de Clerq
ORG: Public Servants Association




LINE: 44784
PERSON: Mr. Rafsanjani
ORG: Expediency Council




LINE: 41998
PERSON: Ebadi
ORG: Human Rights Center






*Enter your explanations here*

### Precision-related errors (false positives)

Next, provide at least 5 entity pairs that represent false positives of your system. Explain what goes wrong.

In [69]:
for e in extracted_normalised - gold:
    print('LINE: {}\nPERSON: {}\nORG: {}'.format(e[0], e[1], e[2]))
    displacy.render(nlp(get_i_element(e[0])), style='ent', jupyter=True)
    print('\n')

LINE: 20273
PERSON: Gerry Adams
ORG: IRA




LINE: 55036
PERSON: Francisco Flores
ORG: Organization of America States




LINE: 30026
PERSON: Charles Duelfer
ORG: Iraq Survey Group




LINE: 22861
PERSON: Michel Barnier
ORG: European Union




LINE: 20815
PERSON: Mohammed
ORG: Wall Street Journal




LINE: 9364
PERSON: Kony
ORG: Lord 's Resistance Army




LINE: 14130
PERSON: DeLay
ORG: House




LINE: 48091
PERSON: Aftab Khan Sherpao
ORG: Taleban's Culture and Information Council




LINE: 9495
PERSON: Tom DeLay
ORG: House of Representatives




LINE: 39722
PERSON: Francisco Galan
ORG: Uribe




LINE: 5046
PERSON: Daniel Pearl
ORG: Al-Qaida




LINE: 6297
PERSON: Putin
ORG: NATO-Russia Council




LINE: 28824
PERSON: Kenichiro Sasae
ORG: Japanese Foreign Ministry's




LINE: 14855
PERSON: Gerry Adams
ORG: IRA




LINE: 26170
PERSON: Bush
ORG: Army




LINE: 30176
PERSON: al-Zarqawi
ORG: al-Qaida




LINE: 10395
PERSON: Younus Khalis
ORG: Taleban




LINE: 59063
PERSON: Malcolm Rifkind
ORG: Conservative Party




LINE: 20667
PERSON: Fini
ORG: National Alliance




LINE: 9021
PERSON: Hassan Halemi
ORG: Kabul University




LINE: 54098
PERSON: Ahmet Turk
ORG: Kurdish Democratic Society Party




LINE: 43640
PERSON: Johan Verbeke
ORG: U.N.




LINE: 48229
PERSON: Hamid Karzai
ORG: EU




LINE: 11487
PERSON: Atta Mohammed
ORG: Northern Alliance




LINE: 46977
PERSON: Rowhani
ORG: European Union




LINE: 43850
PERSON: Yoweri Museveni
ORG: Lord 's Resistance Army




LINE: 42563
PERSON: Sharon
ORG: Likud




LINE: 59366
PERSON: Netanyahu
ORG: Likud




LINE: 37076
PERSON: Bill Clinton
ORG: U.N.




LINE: 28833
PERSON: Hamid Karzai
ORG: NATO




LINE: 512
PERSON: Aung San Suu Kyi
ORG: National League for Democracy




LINE: 36057
PERSON: Mullah Omar's
ORG: Taleban




LINE: 53885
PERSON: Ta Mok
ORG: Tuol Sleng




LINE: 28759
PERSON: Nord Eclair
ORG: Rally for Republicans"




LINE: 30777
PERSON: Gerhard Schroeder
ORG: Christian Democratic Union




LINE: 27180
PERSON: Lev Ponomaryov
ORG: Memorial Human Rights Center




LINE: 3520
PERSON: Peres
ORG: Amir Peretz




LINE: 10490
PERSON: Stronach
ORG: Magna




LINE: 25815
PERSON: Salihu Abubakar
ORG: Nigerian Premier League




LINE: 18520
PERSON: Goss
ORG: CIA




LINE: 351
PERSON: Jendayi Frazer
ORG: Sudan Liberation Army




LINE: 26443
PERSON: Tom Delay
ORG: Congress




LINE: 57199
PERSON: Detlev Mehlis
ORG: U.N.




LINE: 57127
PERSON: Peter Feith
ORG: EU




LINE: 24833
PERSON: Olusegan Obasanjo
ORG: AU




LINE: 12618
PERSON: Merkel
ORG: African Union




LINE: 736
PERSON: Viktor Yanukovych
ORG: Russian Party




LINE: 61469
PERSON: Ban
ORG: NATO




LINE: 40372
PERSON: Mahmoud Ahmadinejad
ORG: Supreme National Security Council




LINE: 22755
PERSON: Arafat
ORG: Palestine Liberation Organization




LINE: 12080
PERSON: Haim Ramon
ORG: Kadima Party




LINE: 49094
PERSON: Nick Marinellis
ORG: Mondayinthe New South Wales District Court




LINE: 53168
PERSON: Nawaz Sharif
ORG: Pakistan Muslim League




LINE: 61023
PERSON: Dominique McAdams
ORG: U.N.




LINE: 40315
PERSON: John Solecki
ORG: U.N.




LINE: 42281
PERSON: Jacques Chirac
ORG: U.N. Security Council's




LINE: 10231
PERSON: Pervez Musharraf
ORG: Non- Aligned Movement




LINE: 11259
PERSON: Joseph Domenech
ORG: U.N.'s




LINE: 4567
PERSON: Bush
ORG: U.S. Justice Department




LINE: 13043
PERSON: General David Petraeus
ORG: U.S. Central Command




LINE: 802
PERSON: Ali Zardari
ORG: Pakistan People's Party




LINE: 55319
PERSON: Syed Hamid
ORG: ASEAN




LINE: 32195
PERSON: Bolton
ORG: U.N.




LINE: 3274
PERSON: Jack Abramoff
ORG: Congress




LINE: 52988
PERSON: Bush
ORG: U.S. Justice Department




LINE: 1790
PERSON: Koizumi
ORG: United Nations




LINE: 56028
PERSON: Mahmud Abbas
ORG: Hamas




LINE: 7917
PERSON: Vladimir Putin
ORG: Group




LINE: 50150
PERSON: Jacques Chirac
ORG: United Nations




LINE: 13637
PERSON: Zardari
ORG: Pakistan People's Party




LINE: 26323
PERSON: Mehdi Karroubi
ORG: Revolutionary Guards Mohsen Rezai




LINE: 4753
PERSON: Junichiro Koizumi
ORG: APEC




LINE: 47798
PERSON: Li Rui
ORG: Communist Party's




LINE: 19865
PERSON: Benedict
ORG: Catholic Church




LINE: 1349
PERSON: Karen Hughes
ORG: State Department




LINE: 40460
PERSON: Mohamed GHANNOUCHI
ORG: Chamber of Deputies




LINE: 33608
PERSON: Slobodan Milosevic
ORG: U.N.




LINE: 28047
PERSON: Joseph Terrence Thomas
ORG: al-Qaida




LINE: 207
PERSON: Rugova
ORG: European Union




LINE: 61476
PERSON: Bashir
ORG: International Criminal Court




LINE: 43377
PERSON: Haradinaj
ORG: Albanian Kosovo Liberation Army




LINE: 19681
PERSON: Jerry Adams
ORG: IRA






*Enter your explanations here*

### Incompleteness of the gold standard

You may have noticed that some of your system&rsquo;s false positives are actually &lsquo;correct&rsquo;. This can happen because, while each entity pair in the gold standard has been manually checked for correctness, no check has been made that the gold standard contains all relevant pairs. Find at least 5&nbsp;entity pairs in the data that are valid instances of the &lsquo;is-leader-of&rsquo; relation (according to your subjective judgement) but that are not contained in the gold standard.

Did you find any examples that you did not find when looking for false positives?

This is the end of the assignment.