# NLE Assessed Coursework 2: Question 3

For this assessment, you are expected to complete and submit 4 notebook files.  There is 1 notebook file for each question (to speed up load times).  This is notebook 3 out of 4.

Marking guidelines are provided as a separate document.

In order to provide unique datasets for analysis by different students, you must enter your candidate number in the following cell.

In [1]:
candidateno=181345 #this MUST be updated to your candidate number so that you get a unique data sample


In [7]:
#preliminary imports
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/Documents/teaching/NLE2018/resources')
sys.path.append(r'/Users/lucaskoh/Documents/NLE/resources')
import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from sussex_nltk.corpus_readers import ReutersCorpusReader
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import math
import spacy
import operator
nlp=spacy.load('en')
from nltk.corpus import gutenberg

In [8]:
import operator

## Question 3: Named Entity Recognition and Linking (25 marks)

The code below will run the SpaCy system on the text from Persuasion by Jane Austen.  `mysample` contains a 50% sample which is unique to your candidate number.

In [9]:
#Do NOT change the code in this cell.

#preparing corpus

def clean_text(astring):
    #replace newlines with space
    newstring=re.sub("\n"," ",astring)
    #remove title and chapter headings
    newstring=re.sub("\[[^\]]*\]"," ",newstring)
    newstring=re.sub("VOLUME \S+"," ",newstring)
    newstring=re.sub("CHAPTER \S+"," ",newstring)
    newstring=re.sub("\s\s+"," ",newstring)
    #return re.sub("([^\.|^ ])  +",r"\1 .  ",newstring).lstrip().rstrip()
    return newstring.lstrip().rstrip()


def get_sample(sentslist,seed=candidateno):
    random.seed(seed)
    random.shuffle(sentslist)
    testsize=int(len(sentslist)/2)
    return sentslist[testsize:]
    
persuasion=clean_text(gutenberg.raw('austen-persuasion.txt'))
nlp_persuasion=list(nlp(persuasion).sents)

mysample=get_sample(nlp_persuasion)

In [10]:
type(mysample[0])

spacy.tokens.span.Span

a) **Write code** and **extract**:
* the 30 most common strings referring to PEOPLE in `mysample`.
* the 30 most common strings referring to PLACES in `mysample`.

\[6 marks\]

In [14]:
def make_tag_lists(sents):
    tokens=[]
    pos=[]
    ner=[]
    for sent in sents:
        for t in sent:
            tokens.append(t.text)
            pos.append(t.pos_)
            ner.append(t.ent_type_)
    return tokens,pos,ner

def extract_entities(tokenlist,taglist,tagtype):
    entities={}
    inentity=False
    for i,(token,tag) in enumerate(zip(tokenlist,taglist)):
        if tag==tagtype:
            if inentity:
                entity+=" "+token
            else:
                entity=token
                inentity=True
        elif inentity:
            entities[entity]=entities.get(entity,0)+1
            inentity=False
    return entities

tagList = make_tag_lists(mysample)

In [15]:
toks = tagList[0]
pos = tagList[1]
ner = tagList[2]

people=extract_entities(toks,ner,"PERSON")
top_people=sorted(people.items(),key=operator.itemgetter(1),reverse=True)[:30]


places=extract_entities(toks,ner,"GPE")
top_places=sorted(places.items(),key=operator.itemgetter(1),reverse=True)[:30]

print(top_people,'\n')
print(top_places)

[('Anne', 215), ('Elliot', 97), ('Wentworth', 80), ('Mary', 72), ('Walter', 62), ('Lady Russell', 55), ('Charles', 49), ('Elizabeth', 46), ('Mrs Clay', 29), ('Mrs Smith', 26), ('Mrs Musgrove', 25), ('Louisa', 25), ('Harville', 24), ('Henrietta', 20), ('Lyme', 16), ('Captain Benwick', 16), ('Mrs Croft', 16), ('Charles Hayter', 14), ("Lady Russell 's", 14), ('Shepherd', 14), ('Anne Elliot', 13), ('Miss Elliot', 12), ('Wallis', 12), ('Musgrove', 12), ('Lady Dalrymple', 11), ('Benwick', 11), ('Mrs Harville', 11), ('Frederick', 9), ('Croft', 8), ('Lady Elliot', 7)] 

[('Kellynch', 21), ('Camden Place', 13), ('London', 8), ('Bath', 7), ('Plymouth', 4), ('Taunton', 4), ('Westgate Buildings', 4), ('Uppercross', 4), ('Laconia', 3), ('Camden', 2), ('Belmont', 2), ('Wentworth', 2), ('Gowland', 2), ('Captain', 2), ('England', 2), ('secret,)--', 1), ('Domingo', 1), ('Somersetshire', 1), ('Queen Squares', 1), ('Harville', 1), ('the hedge - row', 1), ('Frederick', 1), ('Portsmouth', 1), ('Vague', 1),

b) Making reference to specific examples from the text in `mysample`, **discuss** the different types of errors made by the named entity recogniser. \[6 marks\]

## Named entity recogniser (NER)
Named entity recognition identifies a **named entity** correctly within a span of text which refer to **real world entities** and to classify them to these following types: `persons`, `organizations`, `locations`, `expressions of times`, `quantities`, `monetary values`, `percentages` etc.. Hence, the goal is to identify specific groups of words which share common semantic characteristics. 

## Challenges
However, it is challenging to classify a word corretly due to **variation** and **ambiguity**. 
### Variation
The **the named entity recogniser** might wrongly classify an entity with a different entity as it should due to a word's *variation*. For instance, a person might be addressed in multiple manners, within the `mysample` corpus, `Miss Elliot` or `Anne Elliot`. These `person` named entity types might not be able to related to each other when extracting them as the classifier might recognize these as two distinct entities even though they refer to the same person. 
Furthermore, to overcome this issue, we might utlize a knowledge source such as a database containing every character and its variation to train our named entity recogniser which ultimately could classify the named entities more accurately. 
### Ambiguity 
Another major issue when it comes to named entity recognition is ambiguity, entities might have the same entity types. for instance, within the `mysample` corpora, the character `Miss Musgroves` has been tagged as a `WORK_OF_ART`. However, if we observe the sample list, it is show clearly that `Miss Musgroves` has dominantly a `PERSON` entity type. Furthermore, the recogniser wrongly classified `Westgate Buildings` also as a `WORK_OF_ART` which are titles of books and songs whereas it should have been classified as a `LOC` entity type. Furthermore, an entity may be different types which increases the problem of ambiguity, this is in align with the case of ambiguity of segmentation, it is often challenging to decide the nature of an entity and what isn't and defining the neccessary boundaries. 

In [16]:
for x in range(0,len(ner)-1):
    if ner[x] != '':
        print(ner[x],toks[x])

PERSON Elliot
PERSON Elliot
PERSON Anne
PERSON Mary
PERSON Anne
PERSON Henrietta
PERSON Charles
PERSON Hayter
PERSON Wentworth
PERSON Henrietta
PERSON Charles
NORP Spicers
ORG Bishop
DATE year
CARDINAL two
PERSON Mrs
PERSON Musgrove
WORK_OF_ART Merely
WORK_OF_ART Gowland
TIME their
TIME own
TIME hours
PERSON Lady
PERSON Russell
PERSON Mrs
PERSON Clay
PERSON Elliot
WORK_OF_ART Marlborough
WORK_OF_ART Buildings
ORG Uppercross
ORG Lyme
PERSON Lady
PERSON Russell
PERSON 's
PERSON him--
ORG Lyme
PERSON Lyme
GPE Plymouth
PERSON Edward
PERSON Charles
PERSON Lady
PERSON Dalrymple
PERSON Charles
PERSON Mary
PERSON Charles
PERSON Charles
PERSON Captain
PERSON Benwick
ORDINAL first
ORDINAL second
ORG Nurse
ORG Rooke
TIME half
TIME an
TIME hour
CARDINAL one
CARDINAL one
PERSON Elizabeth
DATE daily
TIME the
TIME evening
PERSON Walter
PERSON Shepherd
PERSON Croft
GPE Taunton
PERSON Louisa
DATE the
DATE Christmas
DATE holidays
PERSON Walter
PERSON Charles
PERSON Hayter
PERSON Mrs
PERSON Clay
DATE One

TIME next
TIME morning
CARDINAL one
ORG Louisa
NORP Musgroves
PERSON Dick
PERSON Dick
DATE six
DATE months
CARDINAL only
CARDINAL two
PERSON Anne
PERSON Charles
PERSON Wentworth
PERSON Wentworth
PERSON Wentworth
PERSON Wentworth
PRODUCT Cobb
NORP Irish
PERSON Charles
PERSON Anne
DATE yesterday
ORG Harvilles
FAC the
FAC Octagon
FAC Room
PERSON Walter
PERSON Elizabeth
PERSON Mrs
PERSON Clay
TIME one
TIME morning
PERSON Laura
PERSON Place
PERSON Lady
PERSON Dalrymple
TIME the
TIME same
TIME evening
PERSON Anne
GPE Westgate
GPE Buildings
PERSON Mary
PERSON Anne
ORDINAL first
PERSON Basil
PERSON Anne
PERSON Elliot
PERSON Kellynch
PERSON Lady
PERSON Russell
PERSON Elliot
PERSON himself!--she
PERSON Anne
PERSON Elliot
CARDINAL one
PERSON Mrs
PERSON Clay
CARDINAL two
PERSON Elliot
PERSON Walter
PERSON Mary
PERSON Walter
ORG Uppercross
ORG Lyme
PERSON Lady
PERSON Russell
PERSON Mrs
PERSON Smith
PERSON Elliot
PERSON Mary
PERSON Walter
PERSON Elliot
PERSON Elliot
PERSON Charles
TIME a
TIME few
TI

PERSON Lady
PERSON Russell
PERSON Walter
PERSON Lady
PERSON Dalrymple
TIME the
TIME evening
TIME the
TIME last
TIME two
TIME hours
PERSON Anne
PERSON Elliot
PERSON Harville
PERSON Anne
ORG Musgroves
DATE twentieth
DATE year
ORG Uppercross
DATE two
DATE years
DATE before
CARDINAL two
PERSON Mrs
PERSON Croft
PERSON Anne
PERSON Mary
DATE seven
DATE years
PERSON Elizabeth
PERSON Kellynch
PERSON Hall
PERSON Walter
PERSON Elliot
WORK_OF_ART Westgate
WORK_OF_ART Buildings
PERSON Anne
PERSON Elliot
GPE Westgate
GPE Buildings
PERSON Anne
PERSON Charles
PERSON Harville
DATE Six
DATE years
TIME all
TIME hours
PERSON Anne
NORP Bath
ORDINAL first
DATE three
DATE years
ORDINAL secondly
DATE the
DATE only
DATE winter
PERSON Mary
NORP Bath
PERSON Lady
PERSON Russell
PERSON Musgrove
PERSON Lady
PERSON Russell
PRODUCT the
PRODUCT Concert
PRODUCT Room
PERSON Anne
PERSON Walter
PERSON Charles
PERSON Mary
PERSON Mary
PERSON Anne
DATE day
DATE autumn
ORG Uppercross
ORG Cottage
LANGUAGE Bath
PERSON Anne
DATE

c) **Design** and **implement** a system to track the locations of characters throughout the story.  For a given PERSON named entity, your system should return a list of time-ordered LOCATIONS for that character.  Test your system using the complete text of "Persuasion" (**not** `mysample`) for at least 3 major characters.   \[13 marks\]

In [17]:
tag_list = []
token_list = []
pos_list = []
ner_list = []
final_list = []

tag_list = [make_tag_lists(nlp_persuasion)]
token_list = [tag_list[0][0]]
pos_list = [tag_list[0][1]]
ner_list = [tag_list[0][2]]

for i in range (0,len(token_list[0])):
    final_list += [(token_list[0][i],pos_list[0][i],ner_list[0][i])]

In [18]:
def location_tracker(target_person,text_corpus,windowsize):
    count = 0
    target_list = []
    place_list = []
    for sentence in text_corpus:
        while count < (len(text_corpus) - (windowsize+1)):
            target_range_words = []
            if str(sentence[count]) == target_person:
                for i in range (0,min(windowsize,count)):
                    target_range_words += [str(sentence[count-i])]
                for i in range (0,min(windowsize,len(sentence)-count)):
                    target_range_words += [str(sentence[count+i])]
            if len(target_range_words) > 0:
                target_list += [target_range_words]
            count = count + 1 
    for list_item in target_list:
        for item in list_item:
            if item in places.keys():
                place_list += [item]
    return place_list

In [19]:
location_tracker("Anne",nlp_persuasion,2000)

['Uppercross',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Frederick',
 'Wentworth',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch']

In [20]:
location_tracker("Mary",nlp_persuasion,2000)

['Kellynch',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch',
 'Crofts',
 'Frederick',
 'Wentworth',
 'Kellynch',
 'Crofts',
 'Crofts',
 'Kellynch',
 'Crofts',
 'Kellynch',
 'Uppercross',
 'Kellynch']

In [21]:
location_tracker("Wentworth",nlp_persuasion,300)

['Wentworth',
 'Kellynch',
 'Wentworth',
 'Captain',
 'Crofts',
 'Crofts',
 'Wentworth',
 'England',
 'Wentworth',
 'Captain',
 'Crofts',
 'Crofts',
 'Wentworth',
 'Frederick',
 'Captain',
 'Uppercross',
 'Wentworth',
 'Laconia',
 'Laconia',
 'Wentworth',
 'Frederick',
 'Captain',
 'Uppercross',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Laconia',
 'Laconia',
 'Wentworth',
 'Frederick',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Uppercross',
 'Crofts',
 'Kellynch',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Uppercross',
 'Crofts',
 'Kellynch',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain',
 'Wentworth',
 'Captain']

In [22]:
def location_tracker(target_person,text_corpus,windowsize):
    count = 0
    target_list = []
    place_list = []
    for sentence in text_corpus:
        while count < (len(text_corpus) - (windowsize+1)):
            target_range_words = []
            if str(sentence[count]) == target_person:
                for i in range (0,min(windowsize,count)):
                    target_range_words += [str(sentence[count-i])]
                for i in range (0,min(windowsize,len(sentence)-count)):
                    target_range_words += [str(sentence[count+i])]
            if len(target_range_words) > 0:
                target_list += [target_range_words]
            count = count + 1 
    for list_item in target_list:
        for item in list_item:
            if item in places.keys():
                place_list += [item]
    return place_list

## Design
The algorithm was designed based upon the idea of `sequence labelling`. A conventional seqence labeling task would assugn tags to define both the bondary and the type, considering the case of this algorithm, it will set the boundarie with a `window size`. Furthermore, to track the locations of characters throughout the story, the `location_tracker()` method will ,for the given target person, text corpus and window size, look for the specific person, once it has located the person within the corpora, it will scan $\pm$ the window size to identify the location entity type of the target person and return it in `place_list`. The window size parameter was implemented to increase the precision of tracking the location of the character: the assumption was that the location of a character would be within the proximity of the character within the context. 

## Implementation
The algorithm will first tokenize the **nlp_persuasion** text corpora and store the indidividual tokens, the pos tag and the ner type as a tuple in the **final_list**. Once that has been executed, the `location_tracker` method will take in a **target_person**, a **text_corpus** and **windowsize**. While the count is less than the length of the corpus subtracted by the windowsize (to make sure that if the method was searching through the first word of the list it would prevent an out of bounds exception error) the string given the input count is compared to the target person. If the target person is equivalent to the person within the text corpora, then the person will be added to a **target_list**. Futhermore, the algorithm will proceed by comparing every item in the **target_list** to the keys of the **place_list** which is a dictionary comprised of every location from the **nlp_persuasion** corpora. 
 
## Limitations 
Although the algorithm returns the time-ordered locations of the input person, it may lack total accuracy. For instance, there are many cases where a named-entity type is wrongfully classfied, `Elliot` being classified as an organization instead of a `PERSON`. To improve the entity sequencing method, we may extract information based on regex patterns, supervised learning methods when training the data or open information extraction methods. An addition could be also to implement IOB encoding which could potentially minimize false positives where a the recogniser would identify compund words as seperate entities. Furthermore, as we are given only a portion of the entire novel text, the entity recognition might be limited to the data it can acquire. As a whole, the location tracker is able to return the time-ordered location of any input character within the corpus given a window size. 

