# L5: Information extraction

Information extraction (IE) is the task of identifying named entities and semantic relations between these entities in text data. In this lab we will focus on two sub-tasks in IE, **named entity recognition** (identifying mentions of entities) and **entity linking** (matching these mentions to entities in a knowledge base).

**Reminder about our [Rules for hand-in assignments](https://www.ida.liu.se/~TDDE16/exam.en.shtml#handins) and the [Policy on cheating and plagiarism](https://www.ida.liu.se/~TDDE16/exam.en.shtml#cheating)**

We start by loading spaCy:

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

The data that we will be using has been tokenized following the conventions of the [Penn Treebank](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html), and we need to prevent spaCy from using its own tokenizer on top of this. We therefore override spaCy&rsquo;s tokenizer with one that simply splits on space.

In [2]:
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return Doc(self.vocab, words=text.split(' '))

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

## Data set

The main data set for this lab is a collection of news wire articles in which mentions of named entities have been annotated with page names from the [English Wikipedia](https://en.wikipedia.org/wiki/). The next code cell loads the training and the development parts of the data into Pandas data frames.

In [3]:
import bz2
import csv
import pandas as pd

with bz2.open('ner-train.tsv.bz2', 'rt') as source:
    df_train = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

with bz2.open('ner-dev.tsv.bz2', 'rt') as source:
    df_dev = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

Each row in these two data frames corresponds to one mention of a named entity and has five columns:

1. a unique identifier for the sentence containing the entity mention
2. the pre-tokenized sentence, with tokens separated by spaces
3. the start position of the token span containing the entity mention
4. the end position of the token span (exclusive, as in Python list indexing)
5. the entity label; either a Wikipedia page name or the generic label `--NME--`

The following cell prints the first five samples from the training data:

In [4]:
df_train.head()

Unnamed: 0,sentence_id,sentence,beg,end,label
0,0000-000,EU rejects German call to boycott British lamb .,0,1,--NME--
1,0000-000,EU rejects German call to boycott British lamb .,2,3,Germany
2,0000-000,EU rejects German call to boycott British lamb .,6,7,United_Kingdom
3,0000-001,Peter Blackburn,0,2,--NME--
4,0000-002,BRUSSELS 1996-08-22,0,1,Brussels


In this sample, we see that the first sentence is annotated with three entity mentions:

* the span 0–1 &lsquo;EU&rsquo; is annotated as a mention but only labelled with the generic `--NME--`
* the span 2–3 &lsquo;German&rsquo; is annotated with the page [Germany](http://en.wikipedia.org/wiki/Germany)
* the span 6–7 &lsquo;British&rsquo; is annotated with the page [United_Kingdom](http://en.wikipedia.org/wiki/United_Kingdom)

## Problem 1: Evaluation measures

To warm up, we ask you to write code to print the three measures that you will be using for evaluation:

In [5]:
def evaluation_report(gold, pred):
    """Print precision, recall, and F1 score.
    
    Args:
        gold: The set with the gold-standard values.
        pred: The set with the predicted values.
    
    Returns:
        Nothing, but prints the precision, recall, and F1 values computed
        based on the specified sets.
    """
    # TODO: Replace the next line with your own code
    TP = len(gold & pred) # intersection
    TN = 0
    FP = len(pred - gold)
    FN = len(gold - pred)

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F1_score = (2 * precision * recall) / (precision + recall)
    
    print("precission: ",precision, ", recall: ", recall, ", F1_score: ", F1_score )


To test your code, you can run the following cell:

In [6]:
evaluation_report(set(range(3)), set(range(5)))

precission:  0.6 , recall:  1.0 , F1_score:  0.7499999999999999


This should give you a precision of 60%, a recall of 100%, and an F1-value of 75%.

## Problem 2: Span recognition

One of the first tasks that an information extraction system has to solve is to locate and classify (mentions of) named entities, such as persons and organizations. Here we will tackle the simpler task of recognizing **spans** of tokens that contain an entity mention, without the actual entity label.

The English language model in spaCy features a full-fledged [named entity recognizer](https://spacy.io/usage/linguistic-features#named-entities) that identifies a variety of entities, and can be updated with new entity types by the user. Your task in this problem is to evaluate the performance of this component when predicting entity spans in the development data.

Start by implementing a generator function that yields the gold-standard spans in a given data frame.

**Hint:** The Pandas method [`itertuples()`](https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.itertuples.html) is useful when iterating over the rows in a DataFrame.

In [7]:
def gold_spans(df):
    """Yield the gold-standard mention spans in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        triples consisting of the sentence id, start position, and end
        position of each span.
    """
    # input a df, return each span's sentence id, start position, end position
    # TODO: Replace the next line with your own code
    # because we use row.sentence_id directly, the Args index here doesn't matter
    for row in df.itertuples():
        yield row.sentence_id, row.beg, row.end
    

To test your code, you can count the spans yielded by your function. When called on the development data, you should get a total of 5,917 unique triples. The first triple and the last triple should be

    ('0946-000', 2, 3)
    ('1161-010', 1, 3)  

In [8]:
spans_dev_gold = set(gold_spans(df_dev))
print(len(spans_dev_gold))


5917


----My Answer:

----set() function will deduplicate or remove the same data records and reorder the list.

----then, we use list() function to get the original order.

In [9]:
# the first triple
list(gold_spans(df_dev))[0]


('0946-000', 2, 3)

In [10]:
# the last triple
list(gold_spans(df_dev))[-1]


('1161-010', 1, 3)

Your next task is to write code that calls spaCy to predict the named entities in the development data, and to evaluate the accuracy of these predictions in terms of precision, recall, and F1. Print these scores using the function that you wrote for Problem&nbsp;1.

In [11]:
# TODO: Write code here to run and evaluate the spaCy named entity recognizer on the development data
# 1. use spaCy to predict named entities in df_dev
# 2. use evaluation_report() function get precission, recall, F1_score

def spaCy_predict(df):
    for row in df.itertuples():
        doc = nlp(row.sentence)
        for ent in doc.ents:
            yield row.sentence_id, ent.start, ent.end

# df_dev is the development data
# 1. use spaCy_predict() function to predict the df_dev
# 2. use evaluation_report() function to compare 'pred' and 'gold' sets

spaCy_predictions = set(spaCy_predict(df_dev))
evaluation_report(gold = spans_dev_gold, pred = spaCy_predictions)


precission:  0.5213954072029113 , recall:  0.702213959776914 , F1_score:  0.5984444764511019


----My Answer:

----Here, we get above scores, which are not very good, the Precission and F1_score are too low.

----just more than a half of the predicted spans set are correct based on gold standard.

## Problem 3: Error analysis

As you were able to see in Problem&nbsp;2, the span accuracy of the named entity recognizer is far from perfect. In particular, only slightly more than half of the predicted spans are correct according to the gold standard. Your next task is to analyse this result in more detail.

Here is a function that prints the false positives as well

 as the false negatives spans for a data frame, given a reference set of gold-standard spans and a candidate set of predicted spans.

In [12]:
from collections import defaultdict

def error_report(df, spans_gold, spans_pred):
    false_pos = defaultdict(list)
    for s, b, e in spans_pred - spans_gold:
        false_pos[s].append((b, e))
    false_neg = defaultdict(list)
    for s, b, e in spans_gold - spans_pred:
        false_neg[s].append((b, e))
    for row in df.drop_duplicates('sentence_id').itertuples():
        if row.sentence_id in false_pos or row.sentence_id in false_neg:
            print('Sentence:', row.sentence)
            for b, e in false_pos[row.sentence_id]:
                print('  FP:', ' '.join(row.sentence.split()[b:e]))
            for b, e in false_neg[row.sentence_id]:
                print('  FN:', ' '.join(row.sentence.split()[b:e]))

Use this function to inspect and analyse the errors that the automated prediction makes. Can you see any patterns? Base your analysis on the first 500 rows of the training data. Summarize your observations in a short text.

In [13]:
# TODO: Write code here to do your analysis

# use first 500 rows 
# set of gold-standard spans
# set of predicted spans
error_report(df_dev.iloc[0:500,], set(gold_spans(df_dev)), set(spaCy_predict(df_dev)))


Sentence: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
  FN: LEICESTERSHIRE
Sentence: LONDON 1996-08-30
  FP: 1996-08-30
Sentence: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
  FP: 38
  FP: Friday
  FP: 39
  FP: four
  FP: two days
Sentence: After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended their first innings by 94 runs before being bowled out for 296 with England discard Andy Caddick taking three for 83 .
  FP: 296
  FP: 83
  FP: the opening morning
  FP: 83
  FP: three
  FP: first
  FP: 94
Sentence: Trailing by 213 , Somerset got a solid start to their second innings before Simmons stepped in to bundle them out for 174 .
  FP: 174
  FP: 213
  FP: second
Sentence: Essex , however , look certain to regain their top spot after Nasser Hussain and Peter Such gave them a firm grip

*TODO: Write a short text that summarises the errors that you observed*

----My Answer:

----The main reason for the error of low precission is the influence of numbers, such as: date, time, percentage, ordinal number, number, etc.

----a FP(False Positives) is an error in binary classification in which a test result incorrectly indicates the presence of a condition, and in almost every sentence, there are FPs related to numbers or dates, so we think this is the main reason for the error.

Now, use the insights from your error analysis to improve the automated prediction that you implemented in Problem&nbsp;2. While the best way to do this would be to [update spaCy&rsquo;s NER model](https://spacy.io/usage/linguistic-features#updating) using domain-specific training data, for this lab it suffices to write code to post-process the output produced by spaCy. To filter out specific labels it is useful to know the named entity label scheme, which can be found in the [model's documentation](https://spacy.io/models/en#en_core_web_sm). You should be able to improve the F1 score from Problem&nbsp;2 by at last 15 percentage points.

----My Answer:

----from models documentation: https://spacy.io/models/en#en_core_web_sm

--> en_core_web_sm --> Label Scheme --> NER(named entity labels), we could find the: 

----original types:

['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']

----number related types:

['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT', 'QUANTITY', 'TIME'  ]

----so, we remove these number related types.


In [14]:
# TODO: Write code here to run and evaluate the spaCy named entity recognizer on the development data
def improved_spaCy_predict(df):
    for row in df.itertuples():
        doc = nlp(row.sentence)
        for ent in doc.ents:
            # here we remove these numbers or date related named entity labels.
            if ent.label_ not in ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT', 'QUANTITY', 'TIME']:
                yield row.sentence_id, ent.start, ent.end

# after remove these named entity labels
# we got improved_spaCy_predictions, and compare it with gold standard

improved_spaCy_predictions = set(improved_spaCy_predict(df_dev))
evaluation_report(gold = spans_dev_gold, pred = improved_spaCy_predictions)


precission:  0.847284605961617 , recall:  0.701368936961298 , F1_score:  0.7674526121128065


In [15]:
#  improved_spaCy_predictions' F1_score minus F1_score from Problem 2

0.7674526121128065 - 0.5984444764511019


0.16900813566170458

----My Answer:

----As we can see, after removing these numbers or date related named entirty labels, we improved approximately 16.9% F1_score.

Show that you achieve the performance goal by reporting the evaluation measures that you implemented in Problem&nbsp;1.

Before going on, we ask you to store the outputs of the improved named entity recognizer on the development data in a new data frame. This new frame should have the same layout as the original data frame for the development data that you loaded above, but should contain the *predicted* start and end positions for each token span, rather than the gold positions. As the `label` of each span, you can use the special value `--NME--`.

In [16]:
# layout of old development data
df_dev.head()


Unnamed: 0,sentence_id,sentence,beg,end,label
0,0946-000,CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTE...,2,3,Leicestershire_County_Cricket_Club
1,0946-001,LONDON 1996-08-30,0,1,London
2,0946-002,West Indian all-rounder Phil Simmons took four...,0,2,West_Indies_cricket_team
3,0946-002,West Indian all-rounder Phil Simmons took four...,3,5,Phil_Simmons
4,0946-002,West Indian all-rounder Phil Simmons took four...,12,13,Leicestershire_County_Cricket_Club


In [17]:
# we look at the improved_spaCy_predictions
improved_spaCy_predictions


{('1133-011', 9, 10),
 ('1043-003', 27, 28),
 ('0980-014', 8, 9),
 ('1143-003', 43, 44),
 ('1144-000', 0, 1),
 ('1152-016', 8, 9),
 ('1099-005', 22, 23),
 ('1031-013', 26, 27),
 ('1116-033', 17, 18),
 ('1133-004', 38, 39),
 ('1057-007', 25, 27),
 ('0959-007', 2, 4),
 ('1012-004', 17, 18),
 ('0961-005', 7, 8),
 ('1090-005', 10, 13),
 ('1097-002', 0, 2),
 ('1152-005', 34, 36),
 ('1070-009', 7, 9),
 ('1152-005', 5, 9),
 ('1094-051', 0, 1),
 ('1046-010', 21, 22),
 ('0980-013', 19, 22),
 ('1135-024', 19, 20),
 ('1149-011', 2, 8),
 ('1051-002', 15, 16),
 ('1051-003', 15, 16),
 ('0960-003', 10, 11),
 ('1014-009', 0, 2),
 ('1151-001', 0, 1),
 ('1006-000', 0, 1),
 ('1116-037', 4, 5),
 ('0961-007', 17, 19),
 ('1056-017', 27, 29),
 ('1076-020', 0, 1),
 ('1087-019', 9, 12),
 ('0966-157', 1, 3),
 ('1152-014', 9, 10),
 ('1011-017', 7, 8),
 ('0985-009', 13, 14),
 ('1072-012', 34, 37),
 ('1142-013', 5, 6),
 ('1022-002', 12, 13),
 ('1152-003', 3, 4),
 ('0948-011', 0, 1),
 ('0953-012', 0, 1),
 ('0971-00

----My Answer:

----If we look at the predicted spans(improved_spaCy_predictions) we got based on the new entity recognizer(improved_spaCy_predict):

----it has 'sentence_id', 'predicted start position', 'end positions'

----As the label of each span, we use the special value '--NME--'.

In [18]:
# TODO: Write code here to store the predicted spans in a new data frame

#1. Create an empty DataFrame, just remove the old records, but keep the column names and layout.
new_df_dev = df_dev.drop(index=df_dev.index)

#2. add new row or data records at the end of this empty DataFrame.
# based on layout we add 'sentence_id', 'sentence', 'predicted start position', 'end positions' and 'label'
for id, start, end in improved_spaCy_predictions:
    new_df_dev.loc[len(new_df_dev.index)] = [id, df_dev[df_dev['sentence_id'] == id].iloc[0]['sentence'], start, end, '--NME--']

new_df_dev.head()


Unnamed: 0,sentence_id,sentence,beg,end,label
0,1133-011,Wang was arrested in the east China city of Ha...,9,10,--NME--
1,1043-003,"Jeerasak Densakul , 24 , was arrested in 1995 ...",27,28,--NME--
2,0980-014,"Federal Reserve governor Lawrence Lindsey , sp...",8,9,--NME--
3,1143-003,Representatives of the five nations making up ...,43,44,--NME--
4,1144-000,KPD confirms Iraqi military aid-U.N. official .,0,1,--NME--


----

----My Answer:

----new_df_dev has the same layout as df_dev and 5 rows, all the labels are replaced by '--NME--'

----since the predicted spans length changed to, the new_df_dev has the different length

## Problem 4: Entity linking

Now that we have a method for predicting mention spans, we turn to the task of **entity linking**, which amounts to predicting the knowledge base entity that is referenced by a given mention. In our case, for each span we want to predict the Wikipedia page that this mention references.

Start by extending the generator function that you implemented in Problem&nbsp;2 to labelled spans.

In [19]:
def gold_mentions(df):
    """Yield the gold-standard mentions in a data frame.

    Args:
        df: A data frame.

    Yields:
        The gold-standard mention spans in the specified data frame as
        quadruples consisting of the sentence id, start position, end
        position and entity label of each span.
    """
    # TODO: Replace the next line with your own code
    # 'sentence_id', 'start position', 'end position', 'entity label'
    for row in df.itertuples():
        yield row.sentence_id, row.beg, row.end, row.label



A naive baseline for entity linking on our data set is to link each mention span to the Wikipedia page name that we get when we join the tokens in the span by underscores, as is standard in Wikipedia page names. Suppose, for example, that a span contains the two tokens

    Jimi Hendrix

The baseline Wikipedia page name for this span would be

    Jimi_Hendrix

Implement this naive baseline and evaluate its performance. Print the evaluation measures that you implemented in Problem&nbsp;1.

**Here and in the remainder of this lab, you should base your entity predictions on the predicted spans that you computed in Problem&nbsp;3.**

In [20]:
# TODO: Write code here to implement the baseline
def naive_baseline_predict(df):
    for row in df.itertuples():
        label = "_".join(row.sentence.split(' ')[row.beg : row.end])
        yield row.sentence_id, row.beg, row.end, label


In [21]:
# use gold_mentions() function on df_dev to get gold standard
gold_standard_mentions = set(gold_mentions(df_dev)) # old df dev

# predict on new_df_dev get from problem 3
naive_baseline_prediction = set(naive_baseline_predict(new_df_dev)) # new df dev

evaluation_report(gold = gold_standard_mentions, pred = naive_baseline_prediction)


precission:  0.31400571661902815 , recall:  0.25992901808348823 , F1_score:  0.2844197873324087


In [22]:
naive_baseline_prediction

{('1128-003', 37, 38, 'BRE'),
 ('0966-146', 1, 3, 'Mike_Conley'),
 ('1045-001', 0, 2, 'UNITED_NATIONS'),
 ('1074-003', 2, 3, 'Mauritania'),
 ('1081-002', 4, 5, '0-0'),
 ('1143-003', 10, 11, 'Israel'),
 ('1135-008', 22, 25, "Suu_Kyi_'s"),
 ('1144-000', 0, 1, 'KPD'),
 ('1016-011', 8, 11, 'the_Middle_East'),
 ('1033-010', 30, 31, 'OSCE'),
 ('1060-016', 0, 3, '12._Frank_Nobilo'),
 ('1097-007', 7, 10, "Ivan_Lendl_'s"),
 ('1155-010', 17, 18, 'Iranian'),
 ('1103-013', 0, 1, 'Belgium'),
 ('0985-011', 15, 17, 'Las_Vegas'),
 ('1011-012', 10, 11, 'Missouri'),
 ('0979-013', 1, 3, 'Harry_Milling'),
 ('1084-002', 2, 3, 'Latvia'),
 ('0966-026', 1, 3, 'Steve_Brown'),
 ('0966-042', 4, 5, 'Russia'),
 ('1014-004', 34, 36, 'Scott_Reed'),
 ('0966-028', 4, 5, 'U.S.'),
 ('1096-025', 23, 27, 'the_San_Diego_Padres'),
 ('1008-003', 10, 11, 'Midwest'),
 ('1036-002', 0, 1, 'LONGYEAR'),
 ('0961-009', 6, 7, 'Cunningham'),
 ('1116-010', 2, 3, 'Moscow'),
 ('1015-006', 2, 3, 'Titanic'),
 ('0963-024', 3, 5, 'Tom_Lehman

## Problem 5: Extending the training data using the knowledge base

State-of-the-art approaches to entity linking exploit information in knowledge bases. In our case, where Wikipedia is the knowledge base, one particularly useful type of information are links to other Wikipedia pages. In particular, we can interpret the anchor texts (the highlighted texts that you click on) as mentions of the entities (pages) that they link to. This allows us to harvest long lists of mention–entity pairings.

The following cell loads a data frame summarizing anchor texts and page references harvested from the first paragraphs of the English Wikipedia. The data frame also contains all entity mentions in the training data (but not the development or the test data).

In [23]:
with bz2.open('kb.tsv.bz2', 'rt') as source:
    df_kb = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)


To understand what information is availabel in this data, the following cell shows the entry for the anchor text `Sweden`.

In [24]:
df_kb.loc[df_kb.mention == 'Sweden']


Unnamed: 0,mention,entity,prob
17436,Sweden,Sweden,0.985768
17437,Sweden,Sweden_national_football_team,0.014173
17438,Sweden,Sweden_men's_national_ice_hockey_team,5.9e-05


As you can see, each row of the data frame contains a pair $(m, e)$ of a mention $m$ and an entity $e$, as well as the conditional probability $P(e|m)$ for mention $m$ referring to entity $e$. These probabilities were estimated based on the frequencies of mention–entity pairs in the knowledge base. The example shows that the anchor text &lsquo;Sweden&rsquo; is most often used to refer to the entity [Sweden](http://en.wikipedia.org/wiki/Sweden), but in a few cases also to refer to Sweden&rsquo;s national football and ice hockey teams. Note that references are sorted in decreasing order of probability, so that the most probable pairing come first.

Implement an entity linking method that resolves each mention to the most probable entity in the data frame. If the mention is not included in the data frame, you can predict the generic label `--NME--`. Print the precision, recall, and F1 of your method using the function that you implemented for Problem&nbsp;1.

In [25]:
new_df_dev.head()

Unnamed: 0,sentence_id,sentence,beg,end,label
0,1133-011,Wang was arrested in the east China city of Ha...,9,10,--NME--
1,1043-003,"Jeerasak Densakul , 24 , was arrested in 1995 ...",27,28,--NME--
2,0980-014,"Federal Reserve governor Lawrence Lindsey , sp...",8,9,--NME--
3,1143-003,Representatives of the five nations making up ...,43,44,--NME--
4,1144-000,KPD confirms Iraqi military aid-U.N. official .,0,1,--NME--


In [26]:
df_kb.head()

Unnamed: 0,mention,entity,prob
0,000 Guineas,2000_Guineas_Stakes,1.0
1,10 00,United_States_dollar,1.0
2,126 million,United_States_dollar,1.0
3,13th dynasty,Middle_Kingdom_of_Egypt,1.0
4,14th Dalai Lama,14th_Dalai_Lama,1.0


In [27]:
# # TODO: Write code here to implement the "most probable entity" method.
# # df: new_df_dev
# # since we conclude df_kb in function, we only need df Args
# Yao's code
# def entity_link(df):
#     return_list = []
#     for row in df.itertuples():
#         entity = df_kb.loc[df_kb.mention == nlp(row[2])[row[3]:row[4]].text][0:1]['entity'].values # remove _
#         if len(entity) == 0:
#             return_list += [(row[1],row[3],row[4],'--NME--')] # row[1] is sentence_id, row[3]: beg; row[4]: end
#         else:
#             return_list += [(row[1],row[3],row[4],entity[0])]
#     return set(return_list)


In [28]:
# TODO: Write code here to implement the "most probable entity" method.    
def entity_link(new_df_dev):
    return_list = []
    for row in new_df_dev.itertuples():
        df_label = '_'.join(row[2].split(' ')[row[3] : row[4]]) # add '_'
        mentions = df_kb.loc[df_kb['mention'] == df_label]
        if len(mentions) > 0:
            # the prob goes from highest to lowest, hte first one is what we need 
            label = mentions.iloc[0].entity
            return_list += [(row[1],row[3],row[4],label)]
        else:
            label = '--NME--'
            return_list += [(row[1],row[3],row[4],label)] # row[1] is sentence_id, row[3]: beg; row[4]: end

        # yield row.sentence_id, row.beg, row.end, label
    return set(return_list)


In [29]:
pred = set(entity_link(new_df_dev))
evaluation_report(gold_standard_mentions, pred = pred)


precission:  0.4889750918742344 , recall:  0.4047659286800744 , F1_score:  0.4429033749422099


## Problem 6: Context-sensitive disambiguation

Consider the entity mention &lsquo;Lincoln&rsquo;. The most probable entity for this mention turns out to be [Lincoln, Nebraska](http://en.wikipedia.org/Lincoln,_Nebraska); but in pages about American history, we would be better off to predict [Abraham Lincoln](http://en.wikipedia.org/Abraham_Lincoln). This suggests that we should try to disambiguate between different entity references based on the textual context on the page from which the mention was taken. Your task in this last problem is to implement this idea.

Set up a dictionary that contains, for each mention $m$ that can refer to more than one entity $e$, a separate Naive Bayes classifier that is trained to predict the correct entity $e$, given the textual context of the mention. As the prior probabilities of the classifier, choose the probabilities $P(e|m)$ that you used in Problem&nbsp;5. To let you estimate the context-specific probabilities, we have compiled a data set with mention contexts:

In [30]:
with bz2.open('contexts.tsv.bz2') as source:
    df_contexts = pd.read_csv(source, sep='\t', quoting=csv.QUOTE_NONE)

This data frame contains, for each ambiguous mention $m$ and each knowledge base entity $e$ to which this mention can refer, up to 100 randomly selected contexts in which $m$ is used to refer to $e$. For this data, a **context** is defined as the 5 tokens to the left and the 5 tokens to the right of the mention. Here are a few examples:

In [31]:
df_contexts.head()

Unnamed: 0,mention,entity,context
0,1970,UEFA_Champions_League,Cup twice the first in @ and the second in 1983
1,1970,FIFA_World_Cup,America 1975 and during the @ and 1978 World C...
2,1990 World Cup,1990_FIFA_World_Cup,Manolo represented Spain at the @
3,1990 World Cup,1990_FIFA_World_Cup,Hašek represented Czechoslovakia at the @ and ...
4,1990 World Cup,1990_FIFA_World_Cup,renovations in 1989 for the @ The present capa...


Note that, in each context, the position of the mention is indicated by the `@` symbol.

From this data frame, it is easy to select the data that you need to train the classifiers – the contexts and corresponding entities for all mentions. To illustrate this, the following cell shows how to select all contexts that belong to the mention &lsquo;Lincoln&rsquo;:

In [32]:
df_contexts.context[df_contexts.mention == 'Lincoln']

41465    Nebraska Concealed Handgun Permit In @ municip...
41466    Lazlo restaurants are located in @ and Omaha C...
41467    California Washington Overland Park Kansas @ N...
41468    City Missouri Omaha Nebraska and @ Nebraska It...
41469    by Sandhills Publishing Company in @ Nebraska USA
                               ...                        
41609                                      @ Leyton Orient
41610                    English division three Swansea @ 
41611    league membership narrowly edging out @ on goa...
41612                                          @ Cambridge
41613                                                   @ 
Name: context, Length: 149, dtype: object

Implement the context-sensitive disambiguation method and evaluate its performance. Here are some more hints that may help you along the way:

**Hint 1:** The prior probabilities for a Naive Bayes classifier can be specified using the `class_prior` option. You will have to provide the probabilities in the same order as the alphabetically sorted class (entity) names.

**Hint 2:** Not all mentions in the knowledge base are ambiguous, and therefore not all mentions have context data. If a mention has only one possible entity, pick that one. If a mention has no entity at all, predict the `--NME--` label.

In [33]:
# TODO: Write code here to implement the context-sensitive disambiguation method
import math
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

vectorizer1 = CountVectorizer()
vectorizer1.fit(np.append(df_contexts.context.values,df_contexts.mention.values))

dict_mention = dict()

for m in set(df_contexts.mention):
    contexts = df_contexts[df_contexts.mention == m]['context'].values
    entity = df_contexts[df_contexts.mention == m]['entity'].values
    #complete contexts by replacing the '@'
    contexts_replace = [x[0].replace('@', x[1]) for x in zip(contexts, entity)]
    #compute prior probabilities
    prior_prob_list = [df_kb.prob[(df_kb.mention == m)&(df_kb.entity == x)].values[0] for x in sorted(set(entity))]
    #vectorize context as input data
    X = vectorizer1.transform(contexts_replace)
    model = MultinomialNB(class_prior = prior_prob_list)
    dict_mention[m] = model.fit(X,entity)





In [34]:
# prediction
def predict_cs(df):
    return_list = []
    for row in df.itertuples():
        # extract mentions from sentences
        mention = nlp(row[2])[row[3]:row[4]].text
        if mention in dict_mention.keys():
            context = [nlp(row[2])[row[3]-5:row[4]-5].text]
            #predict using the model stored in dictionary
            entity_pre = dict_mention[mention].predict(vectorizer1.transform(context))[0]
            return_list += [(row[1],row[3],row[4],entity_pre)]
        
        else:
            # for the mentions which don't have contexts, use entity_link instead
            return_list += [x for x in entity_link(pd.DataFrame([row[1:]]))]

    return set(return_list)


In [35]:
evaluation_report(set(gold_mentions(df_dev)), predict_cs(new_df_dev))



precission:  0.5134748877092691 , recall:  0.42504647625485886 , F1_score:  0.46509477577438746


You should expect to see a small (around 1&nbsp;unit) increase in both precision, recall, and F1.

----My Answer:

----In problem 5:
precission:  0.4889750918742344 , recall:  0.4047659286800744 , F1_score:  0.4429033749422099

----In problem 6:
precission:  0.5134748877092691 , recall:  0.42504647625485886 , F1_score:  0.46509477577438746

----As we expected, these values have improved slightly, the function in Problem 6 works


## Reflection questions

The following reflection questions will help you prepare for the diagnostic test. Answer each of them in the form of a short text and put your answers in the cell below. You will get feedback on your answers from your lab assistant.

**RQ 5.1:** In Problem&nbsp;3, you did an error analysis on the task of recognizing text spans mentioning named entities. Summarize your results. Pick one type of error that you observed. How could you improve the model&rsquo;s performance on this type of error? What resources (such as domain knowledge, data, compute) would you need to implement this improvement?

**RQ 5.2:** Thinking back about Problem&nbsp;6, explain what the word *context* refers to in the task addressed there, and how context can help to disambiguate between different entities. Suggest other types of context that you could use for disambiguation.

**RQ 5.3:** One type of entity mentions that we did not cover explicitly in this lab are pronouns. As an example, consider the sentence pair *Ruth Bader Ginsburg was an American jurist*. *She served as an associate justice of the Supreme Court from 1993 until her death in 2020*. What facts would you want to extract from this sentence pair? How do pronouns make fact extraction hard?

*TODO: Enter your answers here*

----5.1

We improve accuracy by removing number-related types. For example, one of the FPs is 'FP: 1996-08-30', This is a type that represents the date. There are various formats of the date. We need to improve the accuracy of the predicted date through calculation and some discriminants.

----5.2 

context similarity.

for example: The meaning of "apple" in the sentences "my mobile phone is Apple" and "I like to eat apples" is different.

We can calculate similarity by context. For example, we know that "Apple: is a kind of fruit.", "Apple: a high-tech company in the United States, and its classic products include iPhone", we can express these two meanings with vectors; Next, for a sentence that needs to be recognized, such as "I want to eat an apple", the context of "apple" in this sentence is taken out and converted into a vector, and compared with the above two meanings, which one has the highest similarity, it can be considered what meaning the word represents.

Other types can be word class. We use the above apple's example, "eat" is a verb, it will be associated with apples(fruit). Adjectives such as: "systematic", "latest", they will be used to describe Apple mobile phones.


---5.3 

Ruth Bader Ginsburg served as associate justice of the Supreme Court from 1993 until her death in 2020.

Because we need to associate the pronoun with the preceding object or context, which increases the difficulty of extracting information.




**This was the last lab in the Text Mining course. Congratulations! 🥳**