# Notebook 8 - Knowledge Representation (KR)

CSI4106 Artificial Intelligence  \
Fall 2021 \
Version 1 (2020) prepared by Julian Templeton and Caroline Barrière.  Version 2 (2021) revised by Caroline Barrière.

***INTRODUCTION***:  

When reading text, understanding the type of entities within the text helps to infer additional information about the entity. For example, if a text mentions *Canada*, knowing that it is a GPE (geopolitical entity) already tells us that this entity has a surface area, a population, etc. Through the use of Named Entity Recognition (NER), we are able to determine whether an entity is a Person, Organization, Country, ... 

When exploring text online, we also occasionally see entities with clickable links to webpages that give more information on the entity. This is a form of text enhancement that allows readers to easily access the information needed to understand each entity in the text.  If we take the example of Canada again, transforming it into [Canada](https://en.wikipedia.org/wiki/Canada) through entity linking gives us access to more information.

In this notebook we will be revisiting the Covid-19 related news dataset from notebook 7 to explore how we can improve spaCy's NER and enhance the text from the news articles through the use of entity linking. This will be done in three parts: 

(1) we explore the results of spaCy's NER  \
(2) we use text coherence for post-processing spaCy's NER results \
(3) we perform text enhancement with entity linking.    

This notebook uses libraries that have been used in previous notebooks, including spaCy and pandas. 

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), rename it to *StudentNum-LastName-Notebook8.ipynb* and submit it.  

*The notebook will be marked on 30.  
Each **(TO DO)** has a number of points associated with it.*
***

In [241]:
# Before starting we will import every module that we will be using
import spacy
import pandas as pd

In [242]:
# The core spacy object can be used for tokenization, lemmatization, POS Tagging, NER, ...
# Note that this is specifically for the English language and requires the English package to be installed
# via pip to work as intended.

# sp = spacy.load('en')

# If the above causes an error after installing the package 
# then install the package as below
# !spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')

As in the last notebook, the dataset is provided on Brightspace (Module 8) along with this notebook, but details regarding the Covid-19 news dataset can be found [here](https://www.kaggle.com/ryanxjhan/cbc-news-coronavirus-articles-march-26?select=news.csv). The first thing that we will do, as usual, is load the file into a pandas dataframe.  

In [243]:
# Read the dataset, show top ten rows
df = pd.read_csv("news.csv")
df.head(10)

Unnamed: 0.1,Unnamed: 0,authors,title,publish_date,description,text,url
0,0,[],'More vital now:' Gay-straight alliances go vi...,2020-05-03 1:30,Lily Overacker and Laurell Pallot start each g...,Lily Overacker and Laurell Pallot start each g...,https://www.cbc.ca/news/canada/calgary/gay-str...
1,1,[],Scientists aim to 'see' invisible transmission...,2020-05-02 8:00,Some researchers aim to learn more about how t...,"This is an excerpt from Second Opinion, a week...",https://www.cbc.ca/news/technology/droplet-tra...
2,2,['The Canadian Press'],Coronavirus: What's happening in Canada and ar...,2020-05-02 11:28,Canada's chief public health officer struck an...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
3,3,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
4,4,[],"B.C. announces 26 new coronavirus cases, new c...",2020-05-02 18:45,B.C. provincial health officer Dr. Bonnie Henr...,B.C. provincial health officer Dr. Bonnie Henr...,https://www.cbc.ca/news/canada/british-columbi...
5,5,"['Senior Writer', 'Chris Arsenault Is A Senior...",Brazil has the most confirmed COVID-19 cases i...,2020-05-02 8:00,"From describing coronavirus as a ""little flu,""...","With infection rates spiralling, some big city...",https://www.cbc.ca/news/world/brazil-has-the-m...
6,6,['Cbc News'],The latest on the coronavirus outbreak for May 1,2020-05-01 20:43,The latest on the coronavirus outbreak from CB...,Coronavirus Brief (CBC) Canada is officiall...,https://www.cbc.ca/news/the-latest-on-the-coro...
7,7,['Cbc News'],Coronavirus: What's happening in Canada and ar...,2020-05-01 11:51,Nova Scotia announced Friday it is immediately...,The latest: The lives behind the numbers: Wha...,https://www.cbc.ca/news/canada/coronavirus-cov...
8,8,"['Senior Writer', ""Adam Miller Is Senior Digit...",Did the WHO mishandle the global coronavirus p...,2020-04-30 8:00,The World Health Organization has come under f...,The World Health Organization has come under f...,https://www.cbc.ca/news/health/coronavirus-who...
9,9,['Thomson Reuters'],Armed people in Michigan's legislature protest...,2020-04-30 21:37,"Hundreds of protesters, some armed, gathered a...","Hundreds of protesters, some armed, gathered a...",https://www.cbc.ca/news/world/protesters-michi...


**PART 1 - SpaCy's NER**  
  
Let's start by looking at the NER that is performed by spaCy.  SpaCy's documentation does not tell us how exactly their NER is done (certainly their trade secret), but we can at least look at the results.

As we've talked about in previous notebooks, when evaluating a process or a tool, we can do a quantitative or a **qualitative evaluation** of results.  In this notebook, we work at a qualitative level, meaning that we are not measuring metrics such as precision/recall on a large amount of data, but rather printing the results for a few examples and trying to understand them.
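
For contrast with the qualitative approach used here, the following is a minimal sketch of what a quantitative evaluation computes. The gold annotations and predictions are made-up toy data, not results from this notebook.

```python
# Hypothetical gold annotations and system predictions, as (surface form, type) pairs.
gold = {("Canada", "GPE"), ("WHO", "ORG"), ("Mark Loeb", "PERSON")}
predicted = {("Canada", "GPE"), ("WHO", "ORG"), ("COVID-19", "PERSON")}

# A prediction counts as correct only if both surface form and type match.
true_positives = len(gold & predicted)
precision = true_positives / len(predicted)  # share of predictions that are correct
recall = true_positives / len(gold)          # share of gold entities that were found
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On real data, the gold set would come from manual annotation, which is why a qualitative look at a few examples is often the cheaper first step.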





Below is the same sentence example as in the last notebook, for which we had looked at POS-tagging and other linguistic processes.  We now use this sentence to show how to access spaCy's NER type predictions for tokens in a text.

In [244]:
# Same example from notebook 7, recall that we loop through the iterator found in the .ents property of a parsed sentence
sentence_example = "Government guidelines in Canada recommend that people stay at least two metres away from others as part of physical distancing measures to curb the spread of COVID-19."
sentence_example_content = sp(sentence_example)
# Loop through all tokens that contain a NER type and print the token along with the corresponding NER type
for token in sentence_example_content.ents:
    print("\"" + token.text + "\" is a " + token.label_ )

"Canada" is a GPE


**(TO DO) Q1 - 5 marks** 

In the text of the **second document** (index 1) of our corpus of documents, find out which words are *PER* (spaCy uses the *PERSON* type, rather than *PER*), *ORG* (Organization), and *GPE* (Geopolitical Entity). You must do the following for this question:    
a) (2 marks) Print each element in the text tagged as *PER*, *ORG*, and *GPE* along with its NER type from spaCy.     
b) (1 mark) Is the majority of outputs correct? Provide two examples of incorrect outputs from (a).  
c) (2 marks) Do any of the problems with the NER type predictions come from an earlier step in the NLP pipeline that is performed by spaCy? Describe the problem for two examples of your output from (a).   

In [245]:
# ANSWER Q1(a) - 2 marks
# Select the second document (index 1)
doc = df["text"][1]
doc_content = sp(doc)

# Print each PER, ORG, GPE along with its type
for token in doc_content.ents:
    if token.label_ == "PERSON":
        print("PER: \"" + token.text + "\", spaCy NER type: " + token.label_)
    if token.label_ == "ORG" or token.label_ == "GPE":
        print(token.label_ +": \"" + token.text + "\", spaCy NER type: " + token.label_ )

PER: "COVID-19", spaCy NER type: PERSON
PER: "COVID-19", spaCy NER type: PERSON
ORG: "the World Health Organization", spaCy NER type: ORG
ORG: "Touches", spaCy NER type: ORG
ORG: "WHO", spaCy NER type: ORG
ORG: "the Public Health Agency of", spaCy NER type: ORG
GPE: "Canada", spaCy NER type: GPE
PER: "W.F. Wells", spaCy NER type: PERSON
ORG: "the Harvard School of Public Health", spaCy NER type: ORG
PER: "Wells", spaCy NER type: PERSON
GPE: "Canada", spaCy NER type: GPE
PER: "Lydia Bourouiba", spaCy NER type: PERSON
ORG: "the Fluid Dynamics of Disease Transmission Laboratory", spaCy NER type: ORG
ORG: "the Massachusetts Institute of Technology", spaCy NER type: ORG
PER: "Bourouiba", spaCy NER type: PERSON
PER: "Mark Loeb", spaCy NER type: PERSON
PER: "Hamilton", spaCy NER type: PERSON
ORG: "McMaster University ", spaCy NER type: ORG
ORG: "RNA", spaCy NER type: ORG
GPE: "Wuhan", spaCy NER type: GPE
GPE: "China", spaCy NER type: GPE
GPE: "Nebraska", spaCy NER type: GPE
PER: "COVID-19", s

**ANSWER Q1 (b) - 1 mark**   

The majority of outputs are correct. 

Two examples of incorrect outputs are: 
1. "COVID-19", which is incorrectly labelled as a PERSON
2. "Touches", which is incorrectly labelled as an ORG

**ANSWER Q1 (c) - 2 marks**   

1. "N95" is labelled as ORG. This problem might come from Part-of-Speech tagging, an earlier step in the NLP pipeline.
2. "COVID-19" is labelled as PERSON. This problem might also come from Part-of-Speech tagging in the NLP pipeline.

**PART 2 - Text Coherence and coreference chains**  
  
As you saw in Q1, the results of spaCy are quite good, but not perfect.  One main issue with NER (not just in spaCy but in many tools) is that the annotation is performed one entity at a time without consideration of the overall document.  

But when looking at the whole document, and knowing that text is usually coherent, we can do some post-processing of the output of spaCy's NER module and correct some mistakes.  By text being coherent, we mean, for example, that if a person is referred to with a particular name, e.g. *McGeer*, chances are that each time we see *McGeer* in the document, it is the same person.  All the mentions of *McGeer* form a coreference chain, all referring to a single entity. So it is unlikely that *McGeer* would be once a person and once an organization.  This is not always true, and there are numerous counter-examples, but it is a common assumption.  This idea is even the topic of an older, much-cited NLP article called "One sense per discourse" (Gale et al., 1992). 

With this idea of "one sense per discourse", we will explore two different strategies to use text coherence to post-process the output from the spaCy NER module.  

The first strategy (*explored in Q2/Q3*) is to find, among all NER types assigned, which is the most frequent one.  For example, the name *Bourouiba* was assigned 1 time ORG, and 2 times PERSON, so this information can be used to modify the ORG type and change it to PERSON.  

The second strategy (explored in Q4) is to try to find a longer surface form in the text.  Since that longer form should be less ambiguous, we can use it to disambiguate the shorter, more ambiguous forms.  For example, *Lydia Bourouiba* occurs in the text and is assigned PERSON.  We can use that information to assign further occurrences of the short form *Bourouiba* to also be PERSON.   

Of course, using these methods for text coherence will not work every time, and will unfortunately introduce some errors...  But let's try.  That's what empirical studies are about, we try ideas.

Let's take again the news article from Q1, but this time, let's show not only GPE, PER, ORG, but rather all the Named Entities found by spaCy.

In [246]:
# Select document 2
doc = df["text"][1]
# NER
doc_sp = sp(doc)
# Display all entities from the text along with their index in the .ents iterator and the
# corresponding NER type
for i, token in enumerate(doc_sp.ents):
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ )

0: "weekly" is a DATE
1: "Saturday" is a DATE
2: "morning" is a TIME
3: "COVID-19" is a PERSON
4: "two metres" is a QUANTITY
5: "COVID-19" is a PERSON
6: "the World Health Organization" is a ORG
7: "Touches" is a ORG
8: "WHO" is a ORG
9: "more than one metre" is a QUANTITY
10: "the Public Health Agency of" is a ORG
11: "Canada" is a GPE
12: "at least two metres" is a QUANTITY
13: "two" is a CARDINAL
14: "2 metres" is a QUANTITY
15: "the 19th century" is a DATE
16: "1934" is a DATE
17: "W.F. Wells" is a PERSON
18: "the Harvard School of Public Health" is a ORG
19: "two metres" is a QUANTITY
20: "Wells" is a PERSON
21: "56,000+" is a QUANTITY
22: "Canada" is a GPE
23: "Saturday" is a DATE
24: "Lydia Bourouiba" is a PERSON
25: "the Fluid Dynamics of Disease Transmission Laboratory" is a ORG
26: "the Massachusetts Institute of Technology" is a ORG
27: "Bourouiba" is a PERSON
28: "Canadian" is a NORP
29: "Mark Loeb" is a PERSON
30: "Hamilton" is a PERSON
31: "McMaster University " is a ORG


**(TO DO) Q2 - 3 marks**  
As you can see in the results, sometimes the same entity was assigned different entity types (e.g. *McGeer* is one time assigned entity type ORG, and one time entity type PERSON) since the NER algorithm looks sentence by sentence.  In the following function, the purpose will be to find all the possible entity types assigned to a single entity.

Complete the definition of the *find_entity_types* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter and a list of all spaCy entities defined by the *entities* parameter.     

The function must find all entities (from *entities*) having the same surface form as *entity*. For each match between the entities, add the NER type to the dictionary *type_counts* and track the number of times each NER type appears.     

The *type_counts* dictionary would contain for example *McGeer* with ORG = 1, and PERSON = 1, because the function found 2 mentions of *McGeer*, each with a different type.

In [247]:
# ANSWER Q2 
def find_entity_types(entity, entities):
    '''
    Given a specific entity and a list of entities, finds all entities from the list that match surface form of the specified
    entity, but that could be of a different type.
    
    Returns the different NER types that have been classified for an entity and the count per NER type
    as a dictionary with the keys as the NER type and the value as the count
    '''
    type_counts = { }
    
    for token in entities:
        if token.text == entity.text:
            if token.label_ in type_counts:
                type_counts[token.label_] += 1
            else:
                type_counts[token.label_] = 1
        
    return type_counts
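
As a side note, the same counting could be sketched with `collections.Counter` from the standard library. The `Mention` tuple and the `count_entity_types` name below are hypothetical stand-ins used only for illustration (a `Mention` models just the `.text` and `.label_` attributes of a spaCy entity), and the mentions are toy data.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are modelled.
Mention = namedtuple("Mention", ["text", "label_"])

def count_entity_types(entity, entities):
    """Count the NER types assigned to mentions sharing entity's surface form."""
    return Counter(m.label_ for m in entities if m.text == entity.text)

# Toy mentions, not the output of spaCy on the news article.
mentions = [
    Mention("Lydia Bourouiba", "PERSON"),
    Mention("Bourouiba", "PERSON"),
    Mention("Bourouiba", "ORG"),
    Mention("Bourouiba", "PERSON"),
]
counts = count_entity_types(mentions[1], mentions)
print(dict(counts))
```

Only exact surface-form matches are counted, so the longer form *Lydia Bourouiba* does not contribute to the counts for *Bourouiba*.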
    

In [248]:
# Test the above to find the result when checking for the types of the entity 'Bourouiba' 
# from the document loaded above
print("All possible NER types for \"" + doc_sp.ents[64].text + "\" are " + str(find_entity_types(doc_sp.ents[64], doc_sp.ents)))

All possible NER types for "Bourouiba" are {'PERSON': 2, 'ORG': 2}


**(TO DO) Q3 - 2 marks**  
In the previous method, *find_entity_types*, we found all the possible entity types for a single entity.  Now, we want to use these to find the most common type.  For example, in the case of *McGeer*, it's a tie.  But for *Bourouiba*, there is one ORG type, and 2 PERSON type, so the most common would be PERSON.

Complete the definition of the *most_common_type* function below. This function accepts as input a specific spaCy entity defined by the *entity* parameter and a list of all spaCy entities defined by the *entities* parameter.        

Note: You can handle ties as you please.  Also, make sure to use the function *find_entity_types* which you just wrote in Q2.

In [249]:
# ANSWER Q3 
def most_common_type(entity, entities):
    '''
    Given a specific entity and a list of entities, find the most similar entities and assign the
    NER type to entity based on the most common NER type assigned to entities of the same name (if there
    is a tie, you decide how to handle this).
    
    Returns the most common NER type based on similar entities
    '''
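    # Count the NER types of all mentions with the same surface form, then keep the most frequent one.
    # Ties go to the type counted first, since max() returns the first maximal key it encounters.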
    NER_possibilities = find_entity_types(entity, entities)
    return max(NER_possibilities, key=NER_possibilities.get)

In [250]:
# Test the above to find the result when checking for the types of the entity 'Bourouiba' 
# from the document loaded above
print("The most common NER type to \"" + doc_sp.ents[64].text + "\" is " + most_common_type(doc_sp.ents[64], doc_sp.ents))

The most common NER type to "Bourouiba" is PERSON
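
Since ties can be handled in several ways, here is a minimal sketch of one explicit policy: on a tie, keep the label the mention already had. The function name `most_common_with_tiebreak` and its `fallback` parameter are hypothetical, and the counts are toy values.

```python
def most_common_with_tiebreak(type_counts, fallback):
    """Return the most frequent type; on a tie, keep `fallback` if it is among the winners."""
    best = max(type_counts.values())
    winners = [label for label, n in type_counts.items() if n == best]
    if fallback in winners:
        return fallback
    return winners[0]  # otherwise the first winner in insertion order

tied = {"ORG": 2, "PERSON": 2}
print(max(tied, key=tied.get))                    # built-in max: first maximal key
print(most_common_with_tiebreak(tied, "PERSON"))  # keeps the original label on a tie
```

With the built-in `max`, the outcome of a tie depends on which type happened to be counted first, which is worth keeping in mind when reading the results.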



Our first exploration (in Q2/Q3) was about frequency of occurrence.  We assumed the most common entity type could be the correct one.  Now, we'll explore the idea that the least ambiguous reference to an entity (the actual text) could be the correct one.  For example, *McGeer* is more ambiguous (shorter form) than *Allison McGeer* (longer form).  Often the longer form of reference to an entity is the least ambiguous.  But because it is long to write, we often use it sparingly in a text (perhaps only once) and then subsequent references to the same entity will use the shorter form.  For example, the text might mention *Allison McGeer* once, and then use the short form *McGeer* to refer to the same person many times in the document.

In the course videos, we talked about coreference chains. A chain contains long and short mentions, all referring to the same entity.

The longer form is often referred to as the *normalized form*, and it is a form that we are likely to find in an external resource.  We'll see in part 3 of this notebook, when we do entity linking, that there is a Wikipedia entry for *Allison McGeer* that we could link to. We can consider the longer *Allison McGeer* form as the normalized form.

**(TO DO) Q4 (a) - 3 marks**  
 
You must write a function that will find the longest form that can match a mention.

Your function will have the same *entity* and *entities* parameters, but this time the function must assign to *entity* the NER type of another entity in the *entities* iterator, that of the longest form found.   

Specifically, you must look through *entities* to find a normalized form of *entity*. In this scenario, the longest entity that contains *entity* as a substring will be considered the normalized form and should be returned.  If no longer form is found, the entity itself *entity* should be returned.

Ex: *Lydia Bourouiba* is the normalized form of *Bourouiba*. Thus this entity should be returned.  But *McMaster University* is already the longest form, so if we search for a normalized form for that *entity*, the function should return *entity* itself.

In [251]:
# ANSWER Q4(a)
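# Look within "entities" for a surface form that contains the surface form of "entity" as a substring
def assign_normalized_form(entity, entities):
    # Default to the entity itself when no containing form is found
    longestForm = entity
    for token in entities:
        # Keep the first containing form encountered and stop there;
        # candidate lengths are not compared
        if entity.text in token.text:
            longestForm = token
            break

    return longestForm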
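
The task statement asks for the longest containing form. A variant that compares candidate lengths could be sketched as follows; `longest_containing_form` and the `Mention` tuple are hypothetical names for illustration, the mentions are toy data, and this is not the function used to produce the outputs in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are modelled.
Mention = namedtuple("Mention", ["text", "label_"])

def longest_containing_form(entity, entities):
    """Return the longest mention whose text contains entity.text, else entity itself."""
    longest = entity
    for candidate in entities:
        # Only replace the current best when the candidate is strictly longer.
        if entity.text in candidate.text and len(candidate.text) > len(longest.text):
            longest = candidate
    return longest

mentions = [
    Mention("Bourouiba", "ORG"),
    Mention("Lydia Bourouiba", "PERSON"),
    Mention("McMaster University", "ORG"),
]
print(longest_containing_form(mentions[0], mentions).text)
print(longest_containing_form(mentions[2], mentions).text)
```

Because the comparison is strict, an entity that is already the longest form is returned unchanged.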

Let's test the above function, assuming the candidates are only found in the previous mentions, as a long form is often given first (e.g. *Allison McGeer*) and subsequent forms are the short forms (e.g. *McGeer*).

In [252]:
# Testing using only the previous references as candidates
test = df["text"][1]
# Parse the text with spaCy
test_sp = sp(test)
for i, token in enumerate(test_sp.ents):
    ent = assign_normalized_form(test_sp.ents[i], test_sp.ents[0:i-1])
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

0: "weekly" is a DATE  weekly  DATE
1: "Saturday" is a DATE  Saturday  DATE
2: "morning" is a TIME  morning  TIME
3: "COVID-19" is a PERSON  COVID-19  PERSON
4: "two metres" is a QUANTITY  two metres  QUANTITY
5: "COVID-19" is a PERSON  COVID-19  PERSON
6: "the World Health Organization" is a ORG  the World Health Organization  ORG
7: "Touches" is a ORG  Touches  ORG
8: "WHO" is a ORG  WHO  ORG
9: "more than one metre" is a QUANTITY  more than one metre  QUANTITY
10: "the Public Health Agency of" is a ORG  the Public Health Agency of  ORG
11: "Canada" is a GPE  Canada  GPE
12: "at least two metres" is a QUANTITY  at least two metres  QUANTITY
13: "two" is a CARDINAL  two metres  QUANTITY
14: "2 metres" is a QUANTITY  2 metres  QUANTITY
15: "the 19th century" is a DATE  the 19th century  DATE
16: "1934" is a DATE  1934  DATE
17: "W.F. Wells" is a PERSON  W.F. Wells  PERSON
18: "the Harvard School of Public Health" is a ORG  the Harvard School of Public Health  ORG
19: "two metres" is a 

**(TO DO) Q4 (b) - 2 marks**  

Do other tests without the limitation of using only longer forms mentioned before an entity (see *test_sp.ents[0:i-1]* in the code above): try searching before and after, or try an interval (e.g. max N entities before or after).  Explain what you tested.  Any difference?  Provide at least 2 examples of changes that you notice.
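
The different search ranges can be sketched with plain slicing. The helper `candidate_window` and its `before`/`after` parameters are hypothetical names, and a list of letters stands in for the entities. Note that in Python a slice such as `[0:i-1]` stops before the immediately preceding item, and for `i = 0` it becomes `[0:-1]`, i.e. everything except the last item.

```python
def candidate_window(entities, i, before=None, after=0):
    """Return the mentions around position i (excluding i itself).

    before/after give the maximum number of mentions to take on each side;
    None means "all of them" on that side.
    """
    start = 0 if before is None else max(0, i - before)
    stop = len(entities) if after is None else min(len(entities), i + 1 + after)
    return list(entities[start:i]) + list(entities[i + 1:stop])

names = ["a", "b", "c", "d", "e"]
print(candidate_window(names, 2))                        # all previous mentions
print(candidate_window(names, 2, before=0, after=None))  # all following mentions
print(candidate_window(names, 2, before=1, after=1))     # one on each side
print(names[0:0 - 1])                                    # the slice [0:i-1] with i = 0
```

Excluding position `i` itself matters when the search returns the first match, because an entity always contains its own surface form.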


In [253]:
# ANSWER Q4(b)
# Do a different test


for i, token in enumerate(test_sp.ents):
    ent = assign_normalized_form(test_sp.ents[i], test_sp.ents)
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

# Testing using only the subsequent references as candidates
for i, token in enumerate(test_sp.ents):
    ent = assign_normalized_form(test_sp.ents[i], test_sp.ents[i+1:])
    print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

0: "weekly" is a DATE  weekly  DATE
1: "Saturday" is a DATE  Saturday  DATE
2: "morning" is a TIME  morning  TIME
3: "COVID-19" is a PERSON  COVID-19  PERSON
4: "two metres" is a QUANTITY  two metres  QUANTITY
5: "COVID-19" is a PERSON  COVID-19  PERSON
6: "the World Health Organization" is a ORG  the World Health Organization  ORG
7: "Touches" is a ORG  Touches  ORG
8: "WHO" is a ORG  WHO  ORG
9: "more than one metre" is a QUANTITY  more than one metre  QUANTITY
10: "the Public Health Agency of" is a ORG  the Public Health Agency of  ORG
11: "Canada" is a GPE  Canada  GPE
12: "at least two metres" is a QUANTITY  at least two metres  QUANTITY
13: "two" is a CARDINAL  two metres  QUANTITY
14: "2 metres" is a QUANTITY  2 metres  QUANTITY
15: "the 19th century" is a DATE  the 19th century  DATE
16: "1934" is a DATE  1934  DATE
17: "W.F. Wells" is a PERSON  W.F. Wells  PERSON
18: "the Harvard School of Public Health" is a ORG  the Harvard School of Public Health  ORG
19: "two metres" is a 

**ANSWER Q4(b)**

I ran two tests. The first test removes all limitations and searches for a longer form among all mentions, both before and after the entity. 

The second test searches only the subsequent mentions (those after the entity).

Since the original test assumes that candidates are only found in previous mentions, I wanted to see what would happen when searching with no limitations and when searching only the subsequent mentions. The algorithm seems to work better when there is no limit on before and after, and when previous mentions are searched.  The issue with searching only subsequent mentions was that the long form did not always replace the short form. For example, when searching subsequent mentions, "Bourouiba" was not always replaced by its long form "Lydia Bourouiba". The same happened with "Samira Mubareka".

**(TO DO) Q5 - 5 marks**  
Use a different news article in the corpus, the 7th article, so index 6.  

(a) (2 marks) Run the two approaches (most frequent, longest form).  For each entity found in the text, print its original entity type (as found by spaCy), then the most common entity type, and then the normalized form with its entity type. \
(b) (3 marks) Analyze and discuss the results.  Do you think these text coherence approaches help or are they too simple?  Are there conflicting results (the two approaches give different results).  If yes, show examples that are different.  

In [254]:
# ANSWER Q5(a) 
# Select document index 6
doc6 = df["text"][6]
doc6_sp = sp(doc6)
for i, token in enumerate(doc6_sp.ents):
    mct = most_common_type(doc6_sp.ents[i], doc6_sp.ents)
    asm = assign_normalized_form(doc6_sp.ents[i], doc6_sp.ents)
    
    print("{}: \"{}\" \noriginal entity type: {} \nmost common entity type: {} \nNormalized Form: \"{}\" with entity type: {} "
          .format(str(i+1), token.text, token.label_, mct, asm.text, asm.label_ ))

    
#     ent = assign_normalized_form(test_sp.ents[i], test_sp.ents)
#     print(str(i) + ": \"" + token.text + "\" is a " + token.label_ + "  " + ent.text + "  " + ent.label_)

1: "Canada" 
original entity type: GPE 
most common entity type: GPE 
Normalized Form: "Canada" with entity type: GPE 
2: "C.D. Howe" 
original entity type: ORG 
most common entity type: ORG 
Normalized Form: "C.D. Howe" with entity type: ORG 
3: "Monday" 
original entity type: DATE 
most common entity type: DATE 
Normalized Form: "Monday" with entity type: DATE 
4: "Alberta" 
original entity type: GPE 
most common entity type: GPE 
Normalized Form: "Alberta" with entity type: GPE 
5: "first" 
original entity type: ORDINAL 
most common entity type: ORDINAL 
Normalized Form: "first" with entity type: ORDINAL 
6: "Saturday" 
original entity type: DATE 
most common entity type: DATE 
Normalized Form: "Saturday" with entity type: DATE 
7: "Air Canada" 
original entity type: ORG 
most common entity type: ORG 
Normalized Form: "Air Canada" with entity type: ORG 
8: "Christmas" 
original entity type: DATE 
most common entity type: DATE 
Normalized Form: "Christmas" with entity type: DATE 
9: 

**ANSWER Q5(b)**

Coherence approaches seemed too simple and did not help for this data set. The two approaches gave conflicting results.

**PART 3 - Entity Linking / Text enhancement**  

For the third part of this notebook, we will be exploring how we can enhance the text of documents. In this scenario, we will be enhancing the text by performing entity linking. This means that we will attempt to link the entities detected by spaCy's NER to an active webpage that a reader can click on to obtain more information regarding the entity. Wikipedia is a very good resource for finding out more about an entity, and we will use it for entity linking.    

Before going straight into an example through code, below is an example of how a text with no entity linking compares to a text with entity linking:    

*No entity linking:* \
During the pandemic, U.S. cities such as Atlanta, Chicago and Denver have made several adjustments to their transit systems.      

*With entity linking:*  \
During the pandemic, U.S. cities such as <a href="http://en.wikipedia.org/wiki/Atlanta">Atlanta</a>, <a href="http://en.wikipedia.org/wiki/Chicago">Chicago</a> and <a href="http://en.wikipedia.org/wiki/Denver">Denver</a> have made several adjustments to their transit systems.

Automatically transforming a text to include clickable links requires several processing steps at the character-string level. In this notebook, we will be satisfied with finding the links without making the replacements directly in the text. This will allow us to explore the Wikipedia resource and understand the difficulties of entity linking without spending too much time on complex string manipulation.
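
To give an idea of the string-level processing involved, here is a minimal sketch that wraps character spans in HTML anchors. The function `link_mentions` is a hypothetical name, the spans are assumed not to overlap, and the offsets are found with a toy lookup; with spaCy, the `start_char` and `end_char` attributes of each entity would be a natural source for them.

```python
import html

def link_mentions(text, spans):
    """Wrap each (start, end, url) character span of text in an HTML anchor."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, url in sorted(spans, reverse=True):
        anchor = '<a href="{}">{}</a>'.format(html.escape(url, quote=True), text[start:end])
        text = text[:start] + anchor + text[end:]
    return text

sentence = "U.S. cities such as Atlanta and Denver changed their transit systems."
spans = []
for name in ("Atlanta", "Denver"):
    start = sentence.index(name)  # toy lookup; an NER tool would supply the offsets
    spans.append((start, start + len(name), "http://en.wikipedia.org/wiki/" + name))
linked = link_mentions(sentence, spans)
print(linked)
```

Processing the spans from right to left avoids having to recompute offsets after each insertion.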

For example, with the document at index 6, we would like to be able to link the entities found by spaCy to the most likely Wikipedia page giving access to additional information on that entity.

**The enriched list format shown below is the type of output requested in question Q6.**  For coding simplicity, we will use this type of output instead of an article in which the text would be replaced by links.

0: "Coronavirus Brief" is a ORG found at http://en.wikipedia.org/wiki/Coronavirus_Brief \
1: "CBC" is a ORG found at http://en.wikipedia.org/wiki/CBC \
2: "Canada" is a GPE found at http://en.wikipedia.org/wiki/Canada \
3: "C.D. Howe" is a PERSON found at http://en.wikipedia.org/wiki/C.D._Howe \
4: ... \



**(TO DO) Q6 - 5 marks**  
Write the code needed to search for a Wikipedia page for each of the entities found by spaCy (as shown above) in a particular document.  

*You can write the code as you like, but it must include the following elements:*

*   (a) A restriction on which type of entities you are linking.  For example, Wikipedia does not contain quantities (such as "two meters") so it would be inappropriate to include a link to a quantity.
*   (b) The use of the *normalized form* of the entity to perform the linking.  For example, *Allison McGeer* does have a Wikipedia page (https://en.wikipedia.org/wiki/Allison_McGeer) that you can link to, even when you are looking at the entity with label *McGeer*.  So make sure to use the function you developed in Q4.
*   (c) Attention: Wikipedia page URLs use underscores. So for example, *McMaster University* should be transformed to https://en.wikipedia.org/wiki/McMaster_University (with an underscore between *McMaster* and *University*).
*   (d) Include one element of post-processing on the longer form.  For example *the C.D. Howe Institute's* is tagged by spaCy, but Wikipedia will contain *C.D._Howe_Institute*.  You can remove small particles like *the* to increase the chance of linking.
*   (e) For a specific document, output the surface form, entity type and link to Wikipedia in a list as shown above

Be sure to put comments in your code to make it clear what corresponds to parts (a), (b), (c), (d) and (e).
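
Elements (c) and (d) can be illustrated with a small sketch that turns a surface form into a candidate URL. The function name `wikipedia_url` and the exact cleaning steps are assumptions chosen for illustration; other post-processing choices are equally valid, and the resulting page may not exist.

```python
import re
from urllib.parse import quote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

def wikipedia_url(surface_form):
    """Build a candidate Wikipedia URL from an entity's surface form."""
    title = surface_form.strip()
    title = re.sub(r"^(the|an|a)\s+", "", title, flags=re.IGNORECASE)  # leading article
    title = re.sub(r"'s$", "", title)                                   # trailing possessive
    title = re.sub(r"\s+", "_", title)                                  # spaces -> underscores
    # Percent-encode anything that is not safe in a URL path segment.
    return WIKI_PREFIX + quote(title, safe="_.-()")

print(wikipedia_url("the C.D. Howe Institute's"))
print(wikipedia_url("McMaster University "))
```

The sketch only builds the address; it does not check that the page exists.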

There will probably be many links in your output that point to Wikipedia pages that do not exist.  That's ok, don't worry about that.  Wikipedia does not contain everything, and some normalized forms will not be there.  You will be asked to discuss this later in Q7.




In [306]:
# ANSWER - Q6
import re

doc6 = df["text"][6]
doc6_sp = sp(doc6)

# I placed my solution in a function so that it could be easily used by question 7
def generateEntityLinks(docx):
    for i, token in enumerate(docx.ents):

        # b) obtaining the normalized form of the entity
        normalized_token = assign_normalized_form(docx.ents[i], docx.ents)

#         For Q7, restrictions applied for various entity types
#         Uncomment and recomment to test, also uncomment a) 
#         if  normalized_token.label_ == "GPE":
#         if  normalized_token.label_ == "PERSON":
#         if  normalized_token.label_ == "ORG":
            
        # a) restrictions for QUANTITY. 
        if  normalized_token.label_ != "QUANTITY":

            # d) post-processing to remove some small particles to improve chances of linking to Wikipedia. 
            #    Notes beside or above the code

            # convert the normalized form into a string that will be processed
            unprocessed_form = normalized_token.text 

            # remove a leading article (the, a, an, in any letter case) to increase the chance of a Wikipedia link
            processed_form1 = re.sub(r'^(the|an|a) ', '', unprocessed_form, flags=re.IGNORECASE)

            # remove a trailing punctuation mark together with any word characters that follow it
            # (Example: Institute's -> Institute, and Peter Cziborra/Reuters -> Peter Cziborra), but NOT periods(.) or hyphens(-)
            processed_form2 = re.sub(r'[\;:,\?\"\'\/!]\w*$', '', processed_form1)

            # remove all whitespace at the beginning of the string
            processed_final = processed_form2.lstrip()

            # c) transforming all white spaces into underscores and then concatenating it to create the wiki search link
            wiki_search_link = 'http://en.wikipedia.org/wiki/' + processed_final.replace(" ","_")

            # e) output of surface form, entity type, and link to wikipedia
            print("{}: {}, \"{}\" is a {} found at {}".format(i, token.text, normalized_token.text, normalized_token.label_, wiki_search_link))


generateEntityLinks(doc6_sp)
            

0: Canada, "Canada" is a GPE found at http://en.wikipedia.org/wiki/Canada
1: C.D. Howe, "C.D. Howe" is a ORG found at http://en.wikipedia.org/wiki/C.D._Howe
2: Monday, "Monday" is a DATE found at http://en.wikipedia.org/wiki/Monday
3: Alberta, "Alberta" is a GPE found at http://en.wikipedia.org/wiki/Alberta
4: first, "first" is a ORDINAL found at http://en.wikipedia.org/wiki/first
5: Saturday, "Saturday" is a DATE found at http://en.wikipedia.org/wiki/Saturday
6: Air Canada, "Air Canada" is a ORG found at http://en.wikipedia.org/wiki/Air_Canada
7: Christmas, "Christmas" is a DATE found at http://en.wikipedia.org/wiki/Christmas
8: Canadians, "Canadians" is a NORP found at http://en.wikipedia.org/wiki/Canadians
9: more than $1.2 million, "more than $1.2 million" is a MONEY found at http://en.wikipedia.org/wiki/more_than_$1.2_million
10: COVID-19, "COVID-19" is a PERSON found at http://en.wikipedia.org/wiki/COVID-19
11: Park Golf Course, "Park Golf Course" is a PERSON found at http://en.w

In [305]:
# I chose the 9th article, so index 8
doc8 = df["text"][8]
doc8_sp = sp(doc8)

generateEntityLinks(doc8_sp)

0: The World Health Organization, "The World Health Organization" is a ORG found at http://en.wikipedia.org/wiki/World_Health_Organization
1: WHO, "WHO" is a ORG found at http://en.wikipedia.org/wiki/WHO
3: Canada, "Canada" is a GPE found at http://en.wikipedia.org/wiki/Canada
4: Saturday, "Saturday" is a DATE found at http://en.wikipedia.org/wiki/Saturday
5: The United Nations, "The United Nations" is a ORG found at http://en.wikipedia.org/wiki/United_Nations
6: 1948, "1948" is a DATE found at http://en.wikipedia.org/wiki/1948
7: 194, "1948" is a DATE found at http://en.wikipedia.org/wiki/1948
8: the World Health Organization, "the World Health Organization" is a ORG found at http://en.wikipedia.org/wiki/World_Health_Organization
9: Steven Hoffman, "Steven Hoffman" is a PERSON found at http://en.wikipedia.org/wiki/Steven_Hoffman
10: the Global Strategy Lab, "the Global Strategy Lab" is a ORG found at http://en.wikipedia.org/wiki/Global_Strategy_Lab
11: York University, "York Universit

**(TO DO) Q7 - 5 marks**  
Perform a qualitative evaluation of the entity linking method you wrote in Q6.  For your qualitative evaluation, you must choose a document (any one you want from the corpus of Covid-19 related news, but make sure to mention which one) and run your method on that document.  Answer the following questions: 

* a. Give 2 examples of entities where the longer form was found in Wikipedia.  Is the page found appropriate? Would the shorter form be found too? Would it link to the same page?  
* b.  Give 2 examples of entities where the Wikipedia page did not exist.  Why is that?  Was the searched form incorrect?  
* c.  Try restricting your search with different entity types.  Do you see DATE covered by Wikipedia?  What about PERSON or GPE?  Discuss the coverage of different entity types by giving some examples.


**For Question 7, I inserted a cell above for my test because I wasn't sure whether I was supposed to reuse the cell from Q6. 
  Having a separate cell also helped me compare the results.**

**ANSWER Q7**

For my qualitative evaluation, I chose article 9 (index 8).

a. Two examples of entities where the longer form was found on Wikipedia:
   1. "Theresa Tam" (https://en.wikipedia.org/wiki/Theresa_Tam)
   2. "Amir Attaran" (https://en.wikipedia.org/wiki/Amir_Attaran)

"Theresa Tam" is a longer form that was found on Wikipedia, and the page found was appropriate. The shorter form, "Tam", did not link to the same page. It led instead to a Wikipedia page listing various entities named "Tam": some were people or story characters with "Tam" as a name or surname, and some were acronyms such as "TAM". Thus, the page found by the shorter form was not appropriate.

"Amir Attaran" is another longer form for which an appropriate page was found on Wikipedia. However, the shorter form "Attaran" did not link to any page.


b. Two examples of entities where the Wikipedia page did not exist:
   1. "Jan. 28" (https://en.wikipedia.org/wiki/Jan._28)
   2. "Lynette Ong" (http://en.wikipedia.org/wiki/Lynette_Ong)

The page for the link generated from "Jan. 28" does not exist. I first thought the searched form was incorrect, but another link generated in exactly the same way for "Dec. 31" (https://en.wikipedia.org/wiki/Dec._31) worked. Perhaps this is because December 31 is a much more popular date, since it is New Year's Eve.
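
Another possible factor is that Wikipedia titles its day pages with the full month name (for example `January_28`), so an abbreviated form only resolves when a redirect happens to exist. The sketch below shows one way the surface form could be expanded before the link is built. It is only an illustration: `expand_month_abbreviation` and `FULL_MONTHS` are hypothetical names that are not part of the Q6 solution, and the effect on this corpus was not evaluated.

```python
import re

# Hypothetical lookup table: abbreviated month -> full month name
FULL_MONTHS = {
    "jan": "January", "feb": "February", "mar": "March", "apr": "April",
    "jun": "June", "jul": "July", "aug": "August", "sep": "September",
    "sept": "September", "oct": "October", "nov": "November", "dec": "December",
}

# Match an abbreviated month (optional period) only when a day number follows,
# so that a first name such as "Jan" in "Jan Smith" is left untouched.
MONTH_PATTERN = re.compile(
    r'\b(jan|feb|mar|apr|jun|jul|aug|sept|sep|oct|nov|dec)\.?(?=\s+\d)',
    flags=re.IGNORECASE,
)

def expand_month_abbreviation(text):
    """Return text with an abbreviated month replaced by its full name."""
    return MONTH_PATTERN.sub(lambda m: FULL_MONTHS[m.group(1).lower()], text)

link = 'http://en.wikipedia.org/wiki/' + expand_month_abbreviation("Jan. 28").replace(" ", "_")
print(link)
```

Forms that are already spelled out, such as "late January", pass through unchanged.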

The link generated for "Lynette Ong" does not exist on Wikipedia. The searched form was correct, but I believe no page exists because this person is not very well known. 

I am also including a note about "A Public Health Emergency of International Concern" because it demonstrates a case where an incorrect form did not lead to an existing page. Once I adjusted my algorithm to remove the article "a", it created an appropriate link:
https://en.wikipedia.org/wiki/Public_Health_Emergency_of_International_Concern
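
The adjustment described in this note can be sketched in isolation. This is a minimal sketch rather than the exact Q6 code: `strip_leading_article` and `build_wiki_link` are hypothetical standalone helpers, whereas the Q6 solution does the equivalent work inside `generateEntityLinks`.

```python
import re

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

def strip_leading_article(surface_form):
    """Remove a leading 'the', 'a' or 'an' (any capitalization)."""
    return re.sub(r'^(?:the|an|a)\s+', '', surface_form, flags=re.IGNORECASE)

def build_wiki_link(surface_form):
    """Strip the leading article, then replace spaces with underscores."""
    return WIKI_PREFIX + strip_leading_article(surface_form).replace(" ", "_")

print(build_wiki_link("A Public Health Emergency of International Concern"))
# prints https://en.wikipedia.org/wiki/Public_Health_Emergency_of_International_Concern
```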

c. I restricted the search to different entity types in the Q6 code, such as ORG, GPE, and PERSON.

- ORG had very good results when the string was processed to remove articles such as "a" or "the" and any other noise. For example, "University of Toronto" produced an appropriate page, but "the University of Toronto" did not.

- PERSON had very good coverage with the normalized long form. An example is "Theresa Tam" from section a). However, PERSON had trouble finding pages for people who are not well known, such as "Lynette Ong" from section b).

- GPE had the best results for the short form. I believe this is because GPE entities were normalized less often than other entity types. However, tests on other data sets may disprove this. Examples with appropriate pages include "Canada" and "U.S."

- DATE has inconsistent coverage on Wikipedia. A DATE that was covered usually had a more specific surface form, such as "2005" or "Saturday". However, for a DATE with the surface form "today", the link led to a page listing more than one entity related to that surface form. Other surface forms labelled as DATE, such as "late January", had no page at all. Excluding DATE from the results seemed to produce more viable links. 

- CARDINAL and ORDINAL seemed to have the worst results overall, and excluding them seemed to produce more viable links. Examples include "first", which linked to a page listing multiple entities, and "20 to 30" (from index 6), which did not have a page.
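
The restrictions discussed in this answer could be expressed as a single filter instead of commenting lines in and out. The following is a minimal sketch under stated assumptions: entities are represented as plain `(text, label)` tuples standing in for spaCy spans, and `filter_entities` and `EXCLUDED_LABELS` are hypothetical names that do not appear in the Q6 code.

```python
# Entity types that seemed to produce few viable links in this evaluation
EXCLUDED_LABELS = {"QUANTITY", "CARDINAL", "ORDINAL", "DATE"}

def filter_entities(entities, excluded=EXCLUDED_LABELS):
    """Keep only the (text, label) pairs whose label is not excluded."""
    return [(text, label) for text, label in entities if label not in excluded]

# Sample pairs taken from the outputs shown earlier in this notebook
sample = [
    ("Canada", "GPE"),
    ("first", "ORDINAL"),
    ("Air Canada", "ORG"),
    ("Saturday", "DATE"),
    ("Steven Hoffman", "PERSON"),
]

print(filter_entities(sample))
# prints [('Canada', 'GPE'), ('Air Canada', 'ORG'), ('Steven Hoffman', 'PERSON')]
```

With a spaCy document, the same condition would be written as `ent.label_ not in EXCLUDED_LABELS` inside the loop over `docx.ents`.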
