# L2: Information Extraction

In this lab you will implement and evaluate a simple system for information extraction. The task of the system is to read sentences and extract entity pairs of the form *x*&ndash;*y* where *x*&nbsp;is a person, *y*&nbsp;is an organisation, and *x* is the &lsquo;leader&rsquo; of&nbsp;*y*. Consider the following example sentence:

<blockquote>
Mr. Obama also selected Lisa Jackson to head the Environmental Protection Agency.
</blockquote>

From this sentence the system should extract the pair
```
("Lisa Jackson", "Environmental Protection Agency")
```

The system will have to solve the following sub-tasks:
* entity extraction &ndash; identifying mentions of person and organisation entities in text
* relation extraction &ndash; identifying instances of the &lsquo;is-leader-of&rsquo; relation

The data set for the lab consists of 62,010&nbsp;sentences from the [Groningen Meaning Bank](http://gmb.let.rug.nl) (release 2.2.0), an open corpus of English. To analyse the sentences you will use [spaCy](https://spacy.io/).

## Getting started

The first cell imports the Python module required for this lab.

In [1]:
import tm2

The data is contained in the following file:

In [2]:
data_file = "/home/TDDE16/labs/l2/data/gmb.txt"

The `tm2` module defines a function `read_data` that returns an iterator over the lines in a file. You should use this function to read the data for this lab. Use the optional argument `n` to restrict the iteration to the first few lines of the file. Here is an example:

In [3]:
for sentence in tm2.read_data(data_file, n=3):
    print(sentence)

Masked assailants with grenades and automatic weapons attacked a wedding party in southeastern Turkey, killing 45 people and wounding at least six others.
Turkish officials said the attack occurred Monday in the village of Bilge about 600 kilometers from Ankara.
The wounded were taken to the hospital in the nearby city of Mardin.
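The source of `tm2` is not shown in this lab, so the following is only a sketch of how a function with the described behaviour could be written. The name `read_lines`, the UTF-8 encoding, and the demonstration file are assumptions made for illustration; `tm2.read_data` itself may work differently.

```python
import itertools
import os
import tempfile

def read_lines(path, n=None):
    """Yield the lines of a text file without their trailing newlines.

    Hypothetical stand-in for tm2.read_data; the real function may differ.
    """
    with open(path, encoding="utf-8") as fp:
        # islice stops after n lines; n=None means no limit.
        for line in itertools.islice(fp, n):
            yield line.rstrip("\n")

# Demonstration on a temporary file with invented content.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first sentence\nsecond sentence\nthird sentence\n")
    demo_path = tmp.name

first_two = list(read_lines(demo_path, n=2))
all_lines = list(read_lines(demo_path))
os.remove(demo_path)
print(first_two)
```

Because the function is a generator, the file is read lazily, which is what allows the result to be passed straight into `nlp.pipe` without loading all sentences into memory first.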


The next cell imports spaCy and loads its English language model.

In [4]:
import spacy

nlp = spacy.load('en', disable=['textcat'])

## Entity extraction

To implement the entity extraction part of your system, you do not need to do much, as you can use the full natural language processing power built into spaCy. The following code extracts the entities from the first 5&nbsp;sentences of the data:

In [5]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=5))):
    for ent in doc.ents:
        print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))

Turkey	13	14	GPE
45	16	17	CARDINAL
at least six	20	23	CARDINAL
Turkish	0	1	NORP
Monday	6	7	DATE
Bilge	11	12	ORG
about 600 kilometers	12	15	QUANTITY
Ankara	16	17	GPE
Mardin	12	13	ORG


Read the [section about named entities](https://spacy.io/usage/linguistic-features#section-named-entities) from spaCy&rsquo;s documentation to get some background on this. (Please note that we are using version&nbsp;1 of the spaCy library, which means that there may be slight differences in the usage. At the time of writing, the current version&nbsp;2 is not yet stable and fast enough for this lab.)

## Problem 1: Extract relevant pairs

The first problem that you will have to solve is to identify pairs of entities that are in the &lsquo;is-leader-of&rsquo; relation, as in the example above. There are many ways to do this, but for this lab it suffices to implement the strategy outlined in the section on [Relation Extraction](http://www.nltk.org/book/ch07.html#relation-extraction) in the book by Bird, Klein, and Loper (2009):

* look for all triples of the form $(X, \alpha, Y)$ where $X$ is a named entity of type *person*, $Y$ is a named entity of type *organisation*, and $\alpha$ is the intervening text
* write a regular expression to match just those instances of $\alpha$ that express the &lsquo;is-leader-of&rsquo; relation

You can restrict your attention to adjacent pairs of entities &ndash; that is, cases where $X$ precedes $Y$ and $\alpha$ does not contain other named entities.
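The strategy can be illustrated without spaCy on the example sentence from the introduction. This is a minimal sketch, not a reference solution: the entity spans are labelled by hand here (in the lab they would come from `doc.ents`, for instance via `ent.start_char`, `ent.end_char`, and `ent.label_`), and the cue phrases in `LEADER_CUE` as well as the helper names are assumptions chosen for illustration.

```python
import re

# Cue pattern for the intervening text; the cue phrases are illustrative only.
LEADER_CUE = re.compile(r"\b(head of|leader of|to head)\b")

def adjacent_triples(text, entities):
    """Yield (X, alpha, Y) for each adjacent PERSON-ORG pair.

    `entities` is a list of (start, end, label) character spans,
    sorted by position in the text.
    """
    for (s1, e1, l1), (s2, e2, l2) in zip(entities, entities[1:]):
        if l1 == "PERSON" and l2 == "ORG":
            yield text[s1:e1], text[e1:s2], text[s2:e2]

def leader_pairs(text, entities):
    """Keep only the pairs whose intervening text contains a cue phrase."""
    for x, alpha, y in adjacent_triples(text, entities):
        if LEADER_CUE.search(alpha):
            yield x, y

def span(text, mention, label):
    """Build a character span for the first occurrence of `mention`."""
    start = text.index(mention)
    return (start, start + len(mention), label)

text = ("Mr. Obama also selected Lisa Jackson to head "
        "the Environmental Protection Agency.")
# Hand-labelled entities standing in for spaCy's output.
entities = [
    span(text, "Obama", "PERSON"),
    span(text, "Lisa Jackson", "PERSON"),
    span(text, "Environmental Protection Agency", "ORG"),
]
pairs = list(leader_pairs(text, entities))
print(pairs)
```

Pairing each entity only with its immediate successor enforces the adjacency restriction: the pair (Obama, Lisa Jackson) is skipped because both are persons, and Obama is never paired with the agency because another entity lies in between.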

Write a function `extract` that takes an analysed sentence (represented as a spaCy [`Doc`](https://spacy.io/api/doc) object) and yields pairs $(X, Y)$ of strings representing entity mentions predicted to be in the &lsquo;is-leader-of&rsquo; relation.

In [228]:
import re

def extract(doc):
    """Extract relevant relation instances from the specified document.
    
    Args:
        doc: The sentence as analysed by spaCy.
    Yields:
        Pairs of strings representing the extracted relation instances.
    """
    # Cue phrases that signal the 'is-leader-of' relation.
    cues = ['head of', 'leader', 'leading', 'heads']
    pairs = []
    
    person = ''
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # Remember the most recent person mention.
            person = ent.text
        elif ent.label_ == "ORG" and person:
            # Pair that person with the first organisation that follows it.
            pairs.append((person, ent.text))
            person = ''
    
    for person, org in pairs:
        # re.escape makes characters such as '.' and '(' match literally,
        # so an entity mention can never produce an invalid pattern.
        pattern = (r"(" + re.escape(person) + r").* (" + '|'.join(cues)
                   + r") .*(" + re.escape(org) + r")")
        for matched_person, _, matched_org in re.findall(pattern, doc.text):
            yield (matched_person, matched_org)


The following cell shows how your function is supposed to be used. The code prints out the extracted pairs for the first 1,000&nbsp;sentences in the data. It additionally numbers each pair with the identifier of the sentence (line number in the data file) which it was extracted from. Note that the sentence (line) numbering starts at index&nbsp;0.

In [106]:
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file, n=1000))):
    for person, org in extract(doc):
        print("{}\t{}\t{}".format(i, person, org))

RESULTS [('Aung San Suu Kyi', 'head of', 'the National League for Democracy')]
DOC Aung San Suu Kyi, the head of the National League for Democracy, has spent 10 of the past 16 years in detention, mostly under house arrest.
512	Aung San Suu Kyi	the National League for Democracy
RESULTS [('Viktor Yanukovych', 'leader', 'Russian Party')]
DOC Representatives of the country's pro-Western coalition say talks scheduled for Monday were canceled after Viktor Yanukovych, the leader of the opposition pro- Russian Party of Regions failed to attend.
736	Viktor Yanukovych	Russian Party
RESULTS [('Asif Ali Zardari', 'leader', "the Pakistan People's Party")]
DOC Asif Ali Zardari, leader of the Pakistan People's Party, says he wants to set aside the dispute over the Kashmir region so the two countries can focus on other issues, including boosting trade.
802	Asif Ali Zardari	the Pakistan People's Party


Once you feel confident that your `extract` function does what it is supposed to do, execute the following cell to extract the entities from the full data set. Note that this will probably take a few minutes.

In [215]:
extracted = set()
for i, doc in enumerate(nlp.pipe(tm2.read_data(data_file))):
    for person, org in extract(doc):
        extracted.add((i, person, org))

After executing the above cell, all extracted id-string-string triples are in the set `extracted`. The code in the next cell will print the first 10&nbsp;triples in this set.

In [111]:
for i, person, org in sorted(extracted)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

512	Aung San Suu Kyi	the National League for Democracy
736	Viktor Yanukovych	Russian Party
802	Asif Ali Zardari	the Pakistan People's Party
1591	Fidel Castro	the Communist Party
2297	Abdul Aziz al-Hakim	the Supreme Council for the Islamic Revolution in Iraq
4567	Bush	the U.S. Justice Department
8206	J. Patrick Boyle	the American Meat Institute
9004	Joschka Fischer	the Green Party
9021	Hassan Halemi	Kabul University
9047	Karzai	Taleban


## Problem 2: Evaluate your system

You now have an extractor, but how good is it? To help you answer this question, we provide you with a &lsquo;gold standard&rsquo; of entity pairs that your system should be able to extract. The following code loads them (again augmented with the relevant sentence id) from the file `gold.txt` and adds them to the set `gold`:

In [112]:
gold_file = "/home/TDDE16/labs/l2/data/gold.txt"

gold = set()
with open(gold_file) as fp:
    for line in fp:
        columns = line.rstrip().split('\t')
        gold.add((int(columns[0]), columns[1], columns[2]))

The following code prints the first 10&nbsp;triples from the gold standard:

In [113]:
for i, person, org in sorted(gold)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

802	Ali Zardari	Pakistan People 's Party
2297	Abdul Aziz al-Hakim	Supreme Council
4823	Slavkov	Bulgarian National Olympic Committee
7902	Mr. Hakim	Supreme Council
8206	J. Patrick Boyle	American Meat Institute
8633	Ali Rodriguez	Petroleos de Venezuela
9004	Foreign Minister Joschka Fischer	Green Party
11021	Khalaf	al-Qaida
11259	Joseph Domenech	U.N. 's Food and Agricultural Organization
13043	David Petraeus	U.S. Central Command


Your task now is to write code that computes the precision, recall, and F1 measure of your extractor relative to the gold standard.
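As a reminder, the three measures can be written in terms of sets. The symbols $G$ (the set of gold-standard triples) and $E$ (the set of extracted triples) are introduced here for illustration and do not appear elsewhere in the lab.

```latex
\begin{align*}
P &= \frac{|E \cap G|}{|E|}, &
R &= \frac{|E \cap G|}{|G|}, &
F_1 &= \frac{2PR}{P + R}
\end{align*}
```

Here $|E \cap G|$ is the number of true positives, $|E| - |E \cap G|$ the number of false positives, and $|G| - |E \cap G|$ the number of false negatives. When there are no true positives, $P + R = 0$ and $F_1$ is undefined; a common convention is to report it as 0.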

In [226]:
def evaluate(reference, predicted):
    """Print out the precision, recall, and F1 for the id-entity-entity
    triples in the set `predicted`, given the triples in the reference set.
    
    Args:
        reference: The reference set of triples.
        predicted: The set of predicted triples.
    Returns:
        Nothing, but prints out precision, recall, and F1.
    """
    # A predicted triple counts as correct only if it matches a
    # reference triple exactly.
    true_positive = len(predicted & reference)
    false_positive = len(predicted - reference)
    false_negative = len(reference - predicted)
    
    # Guard against empty sets, which would otherwise cause a
    # division by zero.
    precision = true_positive / (true_positive + false_positive) if predicted else 0.0
    recall = true_positive / (true_positive + false_negative) if reference else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if true_positive else 0.0
    
    def as_percent(value):
        # Named as_percent to avoid shadowing the built-in format().
        return str(round(value * 100, 2)) + '%'
    
    print('P: ' + as_percent(precision))
    print('R: ' + as_percent(recall))
    print('F: ' + as_percent(f1))


The next cell shows how your function is intended to be used, as well as the suggested output format.

In [227]:
evaluate(gold, extracted)

P: 5.83%
R: 13.04%
F: 8.05%


## Problem 3: Entity resolution

Looking at the results of your quantitative evaluation, you will realise that your extractor (probably) does a rather poor job of matching the gold standard. One reason for this is that the NLP preprocessing is not perfect (spaCy was not trained on the annotations in the Groningen Meaning Bank). A second reason is that the approach of using regular expressions for relation extraction is rather naive.

A third reason, however, is that the current version of your system does not include a component for *entity resolution*. To give an example, your system does not realise that the strings `David Petraeus` and `General David Petraeus` refer to the same entity.

While writing an entity resolver is beyond the scope of this assignment, we ask you to *simulate* such a resolver. More specifically, you should implement a function `normalise` that takes an entity mention (a string) as its input and rewrites it to the form used in the gold standard. While in some sense this is &lsquo;cheating&rsquo;, it allows you to assess the performance of a more realistic system.

The following cell contains skeleton code for the `normalise` function.

In [220]:
def normalise(text):
    if text == "Asif Ali Zardari":
        return "Ali Zardari"
    
    if text == "the Supreme Council for the Islamic Revolution in Iraq":
        return "Supreme Council"
    
    if text == "Joschka Fischer":
        return "Foreign Minister Joschka Fischer"
    
    if text == "Resistance Army":
        return "Lord 's Resistance Army"
    
    if text == "U.N.":
        return "U.N. 's Food and Agricultural Organization"
    
    if text == "Chen Shui-bian":
        return "President Chen Shui-bian"
    
    if text == 'the Yisrael Beitenu party':
        return 'Yisrael Beitenu'
    
    if text == 'Mwanawasa':
        return 'Mr. Mwanawasa'
    
    if text == 'Zarqawi':
        return 'al-Zarqawi'
    
    if text == 'Abbas':
        return 'Mr. Abbas'
    
    if text == 'the Movement of Islamic Reform in Arabia':
        return 'Movement of Islamic Reform'
    
    if text == 'Rafsanjani':
        return 'Mr. Rafsanjani'
    
    tokens = text.split(' ')
    if tokens[0] == "the":
        return ' '.join(tokens[1:])

    return text
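As the number of rules grows, the chain of `if` statements could alternatively be expressed as a lookup table. The following is a sketch of that idea rather than a replacement for the cell above: the names `REWRITES` and `normalise_with_table` are hypothetical, only three of the rules are copied over, and the generic fallback is written with `startswith` instead of splitting into tokens.

```python
# Mention-specific rewrites, copied from the first three rules above.
REWRITES = {
    "Asif Ali Zardari": "Ali Zardari",
    "the Supreme Council for the Islamic Revolution in Iraq": "Supreme Council",
    "Joschka Fischer": "Foreign Minister Joschka Fischer",
}

def normalise_with_table(text, rewrites=REWRITES):
    """Hypothetical variant of normalise that is driven by a lookup table."""
    # Specific rules take precedence over the generic fallback.
    if text in rewrites:
        return rewrites[text]
    # Generic fallback: drop a leading definite article.
    if text.startswith("the "):
        return text[len("the "):]
    return text

print(normalise_with_table("Asif Ali Zardari"))
print(normalise_with_table("the Green Party"))
print(normalise_with_table("Kabul University"))
```

Keeping the rules in a dictionary separates data from logic, so adding a rule means adding one entry instead of another branch.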

The next cell shows how `normalise` is intended to be used. Each triple in the set `extracted` is transformed by feeding the two entity mentions into the `normalise` function. The normalised triples are then added to a new set `extracted_normalised`.

In [221]:
extracted_normalised = set()
for triple in extracted:
    extracted_normalised.add((triple[0], normalise(triple[1]), normalise(triple[2])))

To pass the assignment, you should add enough normalisation rules to `normalise` to achieve a recall of at least 50%.

In [229]:
evaluate(gold, extracted_normalised)

P: 25.24%
R: 56.52%
F: 34.9%


## Problem 4: Limitations of the gold standard

Each entity pair in the gold standard has been manually checked for correctness. However, there is no guarantee that the gold standard contains all relevant pairs &ndash; there are in fact many pairs that are missing from the gold standard. Your last task in this assignment is to find at least 5&nbsp;entity pairs in the data that are valid instances of the &lsquo;is-leader-of&rsquo; relation but are not contained in the gold standard.

You can solve this task either by writing code or by manual work (inspecting the data file), or mix the two strategies. In any case, you should enter your pairs in the textbox below. Use the triple format shown above where for each pair you also specify the sentence id (line number in the data file) from which the instance was extracted.
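If you choose the code route, one possible starting point is to list the extracted triples that the gold standard does not contain and then check each one by hand. The sketch below uses a small toy sample in the lab's triple format, with values taken from the outputs shown earlier; the function name `unmatched_predictions` and the toy sets are assumptions for illustration.

```python
def unmatched_predictions(predicted, reference):
    """Return the predicted triples absent from the reference, sorted by sentence id."""
    return sorted(predicted - reference)

# Toy triples in the (sentence id, person, organisation) format.
toy_gold = {(8206, "J. Patrick Boyle", "American Meat Institute")}
toy_predicted = {
    (8206, "J. Patrick Boyle", "American Meat Institute"),
    (512, "Aung San Suu Kyi", "National League for Democracy"),
    (9047, "Karzai", "Taleban"),
}

candidates = unmatched_predictions(toy_predicted, toy_gold)
for i, person, org in candidates:
    print("{}\t{}\t{}".format(i, person, org))
```

The set difference only produces candidates. Each one still has to be verified against its sentence, since an unmatched triple may be a genuine instance missing from the gold standard or simply an extraction error.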

Finally, we ask you to reflect on the limitations of the evaluation that you carried out in this lab and to discuss the question: *How should systems for information extraction really be evaluated?* Here are some starting points for your discussion.

* How could one create a better gold standard for this task?
* What do precision, recall, and F1 actually measure in this context?
* What measures would be more suitable to evaluate this task?
* What other ways of evaluating systems for information extraction can you think of?

Submit your discussion as a short text (ca. 250&nbsp;words). When presenting your arguments, link back to your own results and experience from this lab, and to concepts you have learned in the lectures or in other parts of the course.

## Discussion
It would be better if the gold standard were more complete, i.e. contained more of the valid pairs that occur in the data. The gold standard should also use one consistent name for each entity (for example, one of David Petraeus/General Petraeus/General David Petraeus), or accept multiple names for the same entity.

It was good that the measure we were required to meet was recall, since recall does not depend on false positives, only on false negatives. It would perhaps be easy to game it by simply adding all PERSON-ORG pairs, but in that case precision would be low instead. For this lab, using precision is an issue because the gold standard does not include all valid pairs, so correct extractions that are missing from it are counted as false positives.

In a real-life application it would be better to use some extrinsic evaluation in order to find the real value of the work put into the entity extraction.

This is the end of the assignment.