# Exercise 11: Entity and Relation Extraction

## Task 1: Relation extraction from Wikipedia articles

Use Wikipedia to extract the relation `directedBy(Movie, Person)` by applying pattern based heuristics that utilize: *Part Of Speech Tagging*, *Named Entity Recognition* and *Regular Expressions*.

#### Required Library: SpaCy
- ```conda install -y spacy```
- ```python -m spacy download en```

In [5]:
pip install -U pip setuptools wheel

Collecting pip
  Downloading pip-21.3.1-py3-none-any.whl (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 5.2 MB/s eta 0:00:01
Collecting setuptools
  Downloading setuptools-60.5.0-py3-none-any.whl (958 kB)
[K     |████████████████████████████████| 958 kB 19.5 MB/s eta 0:00:01
Collecting wheel
  Downloading wheel-0.37.1-py2.py3-none-any.whl (35 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: wheel
    Found existing installation: wheel 0.37.0
    Uninstalling wheel-0.37.0:
      Successfully uninstalled wheel-0.37.0
  Attempting uninstall: setuptools
    Found existing installation: setuptools 58.0.4
    Uninstalling setuptools-58.0.4:
      Successfully uninstalled setuptools-58.0.4
  Attempting uninstall: pip
    Found existing installation: pip 21.2.4
    Uninstalling pip-21.2.4:
      Successfully uninstalled pip-21.2.4
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This be

In [6]:
pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


In [9]:
import urllib.request, json, csv, re
import spacy
nlp = spacy.load("en_core_web_sm")

In [11]:
#read tsv with input movies
def read_tsv():
    movies=[]
    with open('movies.tsv','r') as file:
        tsv = csv.reader(file, delimiter='\t')
        next(tsv) #remove header
        movies = [{'movie':line[0], 'director':line[1]} for line in tsv]
    return movies

#parse wikipedia page
def parse_wikipedia(movie):
    txt = ''
    try:
        with urllib.request.urlopen('https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles='+movie) as url:
            data = json.loads(url.read().decode())
            txt = next (iter (data['query']['pages'].values()))['extract']
    except:
        pass
    return txt

#### 1) Parse the raw text of a Wikipedia movie page and extract named (PER) entities.

In [12]:
def find_PER_entities(txt):
    txt = nlp(txt)
    
    persons = []
    for e in txt.ents:
        if e.label_ == 'PERSON':
            persons.append(e.text)
    return persons

#### 2) Given the raw text of a Wikipedia movie page and the extracted PER entities, find the director.

In [30]:
#simple heuristic: find the next PER entity after the word 'directed'
def find_director(txt, persons):
     # replace special symbols with empty string and split at every whitespace
    txt = re.sub('[!?,.]', '', txt).split()
    for p1 in range(0, len(txt)):
        if(txt[p1] == 'directed'):
            for p2 in range(p1, len(txt)):
                for per in persons:
                    if per.startswith(txt[p2]):
                        return per
    # otherwise empty string that is used later
    return ''

In [16]:
movies = read_tsv()
movies[:10]

[{'movie': '13_Assassins_(2010_film)', 'director': 'Takashi Miike'},
 {'movie': '14_Blades', 'director': 'Daniel Lee'},
 {'movie': '22_Bullets', 'director': 'Richard Berry'},
 {'movie': 'The_A-Team_(film)', 'director': 'Joe Carnahan'},
 {'movie': 'Alien_vs_Ninja', 'director': 'Seiji Chiba'},
 {'movie': 'Bad_Blood_(2010_film)', 'director': 'Dennis Law'},
 {'movie': 'Bangkok_Knockout', 'director': 'Panna Rittikrai'},
 {'movie': 'Blades_of_Blood', 'director': 'Lee Joon-ik'},
 {'movie': 'The_Book_of_Eli', 'director': 'Allen Hughes'},
 {'movie': 'The_Bounty_Hunter_(2010_film)', 'director': 'Andy Tennant'}]

In [52]:
statements=[]
#persons = []
tp = 0
fp = 0
retrieved = 0
gt_directors = [m['director'] for m in movies if m.get('director')]
relevant = len(gt_directors)

for m in movies:
        # Wikipedia page for movie m
        txt = parse_wikipedia(m['movie'])
        # persons of movie m
        persons = find_PER_entities(txt)
        # finds director in persons by looking at first PER after "directed"
        director = find_director(txt, persons)
            
        if director != '':
            statements.append(m['movie'] + ' is directed by ' + director + '.')
            retrieved += 1
            if director == m['director']:
                tp +=1
            else:
                fp +=1

In [24]:
# Testing purposes to see persons for all movies
persons

[['Hepburn',
  'Takashi Miike',
  'Kōji Yakusho',
  'Takayuki Yamada',
  'Sōsuke Takaoka',
  'Hiroki Matsukata',
  'Gorō Inagaki',
  'Matsudaira Naritsugu',
  'Akashi',
  'Crows Zero 2'],
 ['Daniel Lee',
  'Donnie Yen',
  'Zhao Wei',
  'Sammo Hung',
  'Wu Chun',
  'Kate Tsui',
  'Qi Yuwu'],
 ['Richard Berry', 'Jacky Imbert', "L'Immortel"],
 ['Frank Lupo',
  'Stephen J. Cannell',
  'Joe Carnahan',
  'Carnahan',
  'Brian Bloom',
  'Skip Woods',
  'Liam Neeson',
  'Bradley Cooper',
  'Quinton Jackson',
  'Jessica Biel',
  'Patrick Wilson',
  'Yul Vazquez',
  'Tony Scott',
  'Ridley Scott',
  'Cooper'],
 ['Seiji Chiba'],
 ['Dennis Law', 'Simon Yam', 'Bernice Liu', 'Andy'],
 [],
 ['Gureumeul Beoseonan Dalcheoreom', 'Lee Joon-ik', "Park Heung-yong's"],
 ['Gary Whitta',
  'Denzel Washington',
  'Gary Oldman',
  'Mila Kunis',
  'Ray Stevenson',
  'Jennifer Beals'],
 ['Andy Tennant', 'Jennifer Aniston', 'Gerard Butler'],
 ['Butcher', 'Swordsman', 'Wuershan'],
 ['Neil Marshall', 'Michael Fassben

In [32]:
statements

['13_Assassins_(2010_film) is directed by Takashi Miike.',
 '14_Blades is directed by Daniel Lee.',
 '22_Bullets is directed by Richard Berry.',
 'Alien_vs_Ninja is directed by Seiji Chiba.',
 'Bad_Blood_(2010_film) is directed by Dennis Law.',
 'Blades_of_Blood is directed by Lee Joon-ik.',
 'The_Book_of_Eli is directed by Gary Whitta.',
 'The_Bounty_Hunter_(2010_film) is directed by Andy Tennant.',
 'The_Butcher,_the_Chef_and_the_Swordsman is directed by Wuershan.',
 'Centurion_(film) is directed by Neil Marshall.',
 'The_Crazies_(2010_film) is directed by Breck Eisner.',
 'Date_Night is directed by Shawn Levy.',
 'The_Expendables_(2010_film) is directed by Jason Statham.',
 'Faster_(2010_film) is directed by George Tillman Jr..',
 'Fire_of_Conscience is directed by Dante Lam.',
 'From_Paris_with_Love_(film) is directed by Pierre Morel.',
 'Gallants_(film) is directed by Derek Kwok.',
 'Gothic_%26_Lolita_Psycho is directed by Gosu Rori Shokeinin.',
 'Inception is directed by Christop

#### 3) Compute the precision and recall based on the given ground truth (column Director from tsv file) and show examples of statements that are extracted.

In [60]:
# compute precision and recall
fn = len(movies) - len(statements)

precision = tp / retrieved
# same as precision = tp / (tp + fp)

recall_IR = tp / relevant # relevant = len(movies)
recall_sample_sol = tp / (tp + fn) # fn = len(movies) - len(statements)


print(tp)

print ('Precision: {:.0%}'.format(precision))
print ('Recall Me: {:.0%}'.format(recall_IR))
print ('Recall: {:.0%}'.format(recall_sample_sol))
print('\n***Sample Statements***')
for s in statements[:5]:
    print (s)

192
Precision: 80%
Recall Me: 67%
Recall: 80%

***Sample Statements***
13_Assassins_(2010_film) is directed by Takashi Miike.
14_Blades is directed by Daniel Lee.
22_Bullets is directed by Richard Berry.
Alien_vs_Ninja is directed by Seiji Chiba.
Bad_Blood_(2010_film) is directed by Dennis Law.


## Task 2: Named Entity Recognition using Hidden Markov Model


Define a Hidden Markov Model (HMM) that recognizes Person (*PER*) entities.
Particularly, your model must be able to recognize pairs of the form (*firstname lastname*) as *PER* entities.
Using the given sentences as training and test set:

In [61]:
training_set=['The best blues singer was Bobby Bland while Ray Charles pioneered soul music .', \
              'Bobby Bland was just a singer whereas Ray Charles was a pianist , songwriter and singer .' \
              'None of them lived in Chicago .']

test_set=['Ray Charles was born in 1930 .', \
          'Bobby Bland was born the same year as Ray Charles .', \
          'Muddy Waters is the father of Chicago Blues .']

#### 1) Annotate your training set with the labels I (for PER entities) and O (for non PER entities).
	
    *Hint*: Represent the sentences as sequences of bigrams, and label each bigram.
	Only bigrams that contain pairs of the form (*firstname lastname*) are considered as *PER* entities.

In [62]:
#Bigram Representation
def getBigrams(sents):
    return [b[0]+' '+b[1] for l in sents for b in zip(l.split(' ')[:-1], l.split(' ')[1:])]

bigrams = getBigrams(training_set)

#Annotation
PER = ['Bobby Bland', 'Ray Charles']
annotations = []

for b in bigrams:
    if b in PER:
        annotations.append([b, 'I'])
    else:
        annotations.append([b, 'O'])
        
print('Annotation\n', annotations,'\n')

Annotation
 [['The best', 'O'], ['best blues', 'O'], ['blues singer', 'O'], ['singer was', 'O'], ['was Bobby', 'O'], ['Bobby Bland', 'I'], ['Bland while', 'O'], ['while Ray', 'O'], ['Ray Charles', 'I'], ['Charles pioneered', 'O'], ['pioneered soul', 'O'], ['soul music', 'O'], ['music .', 'O'], ['Bobby Bland', 'I'], ['Bland was', 'O'], ['was just', 'O'], ['just a', 'O'], ['a singer', 'O'], ['singer whereas', 'O'], ['whereas Ray', 'O'], ['Ray Charles', 'I'], ['Charles was', 'O'], ['was a', 'O'], ['a pianist', 'O'], ['pianist ,', 'O'], [', songwriter', 'O'], ['songwriter and', 'O'], ['and singer', 'O'], ['singer .None', 'O'], ['.None of', 'O'], ['of them', 'O'], ['them lived', 'O'], ['lived in', 'O'], ['in Chicago', 'O'], ['Chicago .', 'O']] 



In [75]:
ats = []
for b, annotation in annotations:
    ats.append(annotation)
print(ats)

print(len(ats))

for i, a in enumerate(annotations):
    print(i)
    print(a)


['O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'I', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
35
0
['The best', 'O']
1
['best blues', 'O']
2
['blues singer', 'O']
3
['singer was', 'O']
4
['was Bobby', 'O']
5
['Bobby Bland', 'I']
6
['Bland while', 'O']
7
['while Ray', 'O']
8
['Ray Charles', 'I']
9
['Charles pioneered', 'O']
10
['pioneered soul', 'O']
11
['soul music', 'O']
12
['music .', 'O']
13
['Bobby Bland', 'I']
14
['Bland was', 'O']
15
['was just', 'O']
16
['just a', 'O']
17
['a singer', 'O']
18
['singer whereas', 'O']
19
['whereas Ray', 'O']
20
['Ray Charles', 'I']
21
['Charles was', 'O']
22
['was a', 'O']
23
['a pianist', 'O']
24
['pianist ,', 'O']
25
[', songwriter', 'O']
26
['songwriter and', 'O']
27
['and singer', 'O']
28
['singer .None', 'O']
29
['.None of', 'O']
30
['of them', 'O']
31
['them lived', 'O']
32
['lived in', 'O']
33
['in Chicago', 'O']
34
['Chicago .', 'O']


#### 2) Compute the transition and emission probabilities for the HMM (use smoothing parameter $\lambda$=0.5).

    *Hint*: For the emission probabilities you can utilize the morphology of the words that constitute a bigram (e.g., you can count their uppercase first characters).

In [81]:
lambda_ = 0.5

#Transition Probabilities
transition_prob={}

I_count = 0
O_count = 0

for i, a in enumerate(annotations):
    if (i != 0):
        if a[1] == 'I':
            I_count += 1
        else:
            O_count += 1

#Prior
transition_prob['P(I|start)'] = I_count/ (I_count+O_count)
transition_prob['P(O|start)'] = 1 - transition_prob['P(I|start)']

O_after_O_count = 0
O_after_I_count = 0
I_after_O_count = 0
I_after_I_count = 0

for i, _ in enumerate(annotations):
    if (i != 0):
        if annotations[i-1][1]=='O' and annotations[i][1]=='O':
            O_after_O_count +=1
        
        elif annotations[i-1][1]=='O' and annotations[i][1]=='I':
            I_after_O_count +=1

        elif annotations[i-1][1]=='I' and annotations[i][1]=='O':
            O_after_I_count +=1

        elif annotations[i-1][1]=='I' and annotations[i][1]=='I':
            I_after_I_count +=1

transition_prob['P(O|O)'] = O_after_O_count / O_count
transition_prob['P(O|I)'] = O_after_I_count / I_count
transition_prob['P(I|O)'] = I_after_O_count / O_count
transition_prob['P(I|I)'] = I_after_I_count / I_count

print('Transition Probabilities\n',transition_prob, '\n')

#Emission Probabilities
emission_prob={}
        
default_emission = 1/len(bigrams) * (1 - lambda_)

two_upper_O_count = 0
two_upper_I_count = 0
one_upper_O_count = 0
one_upper_I_count = 0
zero_upper_O_count = 0
zero_upper_I_count = 0

for a in annotations:
    upper_count = sum(1 for c in a[0] if c.isupper())
    
    if upper_count == 2 and a[1]=='O':
        two_upper_O_count += 1
    
    elif upper_count == 1 and a[1]=='O':
        one_upper_O_count += 1

    elif upper_count == 0 and a[1]=='O':
        zero_upper_O_count += 1

    elif upper_count == 2 and a[1]=='I':
        two_upper_I_count += 1
    
    elif upper_count == 1 and a[1]=='I':
        one_upper_I_count += 1

    elif upper_count == 0 and a[1]=='I':
        zero_upper_I_count += 1
        
emission_prob['P(2_upper|O)'] = lambda_ * (two_upper_O_count / O_count) + default_emission 
emission_prob['P(2_upper|I)'] = lambda_ * (two_upper_I_count / I_count) + default_emission 
emission_prob['P(1_upper|O)'] = lambda_ * (one_upper_O_count / O_count) + default_emission 
emission_prob['P(1_upper|I)'] = lambda_ * (one_upper_I_count / I_count) + default_emission 
emission_prob['P(0_upper|O)'] = lambda_ * (zero_upper_O_count / O_count) + default_emission 
emission_prob['P(0_upper|I)'] = lambda_ * (zero_upper_I_count / I_count) + default_emission 

print('Emission Probabilities\n', emission_prob, '\n')

Transition Probabilities
 {'P(I|start)': 0.11764705882352941, 'P(O|start)': 0.8823529411764706, 'P(O|O)': 0.8666666666666667, 'P(O|I)': 1.0, 'P(I|O)': 0.13333333333333333, 'P(I|I)': 0.0} 

Emission Probabilities
 {'P(2_upper|O)': 0.014285714285714285, 'P(2_upper|I)': 0.5142857142857142, 'P(1_upper|O)': 0.2142857142857143, 'P(1_upper|I)': 0.014285714285714285, 'P(0_upper|O)': 0.33095238095238094, 'P(0_upper|I)': 0.014285714285714285} 



#### 3) Predict the labels of the test set and compute the precision and the recall of your model.

In [84]:
#Prediction
bigrams = getBigrams(test_set)
entities=[]
prev_state='start'
for b in bigrams:
    I_prob = transition_prob['P(I|'+prev_state+')'] * emission_prob['P('+str(sum(1 for c in b if c.isupper()))+'_upper|I)']
    O_prob = transition_prob['P(O|'+prev_state+')'] * emission_prob['P('+str(sum(1 for c in b if c.isupper()))+'_upper|O)']
    
    if I_prob > O_prob:
        entities.append(b)
        prev_state = 'I'
    else:
        prev_state = 'O'

print('Predicted Entities\n', entities, '\n')

Predicted Entities
 ['Ray Charles', 'Bobby Bland', 'Ray Charles', 'Muddy Waters', 'Chicago Blues'] 



In [83]:
bigrams

['Ray Charles',
 'Charles was',
 'was born',
 'born in',
 'in 1930',
 '1930 .',
 'Bobby Bland',
 'Bland was',
 'was born',
 'born the',
 'the same',
 'same year',
 'year as',
 'as Ray',
 'Ray Charles',
 'Charles .',
 'Muddy Waters',
 'Waters is',
 'is the',
 'the father',
 'father of',
 'of Chicago',
 'Chicago Blues',
 'Blues .']

Precision is **80%** while recall is **100%**. 

#### 4) Comment on how you can further improve this model.

We could increase precision by computing also the probabilities for unigrams and averaging them in the prediction step.