# 8. Named Entity Recognition

Today we will look at the last type of annotation provided by the basic spaCy pipeline: *Named Entity Recognition*. Named entities are typically proper names (e.g. names of persons, places or organizations). Other types of expressions are sometimes included in this category as well, e.g. temporal expressions ("last January") or quantities ("100 000 euros"). In news, and generally in text that reports on events, the most essential information is usually encoded in the named entities.

We start as always by reading the data. We continue with the YLE English dataset from the last class:

In [1]:
from collections import defaultdict
import csv
from operator import itemgetter
import spacy

In [2]:
def read_articles(filename):
    result = []
    with open(filename) as fp:
        reader = csv.DictReader(fp)
        for row in reader:
            result.append(row['text'])
    return result

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
texts = read_articles('yle_en.csv')
docs = [nlp(text) for text in texts]

Previously, we could already recognize proper names by their part-of-speech tag `PROPN`. In addition to that, the named entity annotation distinguishes between different categories of names (e.g. `ORG` for names of organizations).

On the token level, this information is contained in the attributes `ent_type_` (the type of the entity) and `ent_iob_`. The latter is either `B`, `I` or `O` and is a way to encode annotations that span multiple tokens. The labels have the following meaning:
* `B` -- "begin" -- a named entity starts here,
* `I` -- "inside" -- a previously started named entity continues here,
* `O` -- "outside" -- there is no named entity here.
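To make the encoding concrete, here is a minimal sketch in plain Python (no spaCy needed) that groups a sequence of IOB-tagged tokens back into entity spans. The helper `iob_to_spans` is a hypothetical name introduced only for this illustration. The input uses the surface forms and tags of the first ten tokens of the first article. Joining the words with single spaces is a simplification of how spaCy reconstructs the span text.

```python
def iob_to_spans(tokens):
    """Group (text, iob, ent_type) triples into (entity_text, ent_type) pairs."""
    spans = []
    current_words, current_type = [], None
    for text, iob, ent_type in tokens:
        if iob == 'B':
            # A new entity starts here: close the previous one, if any.
            if current_words:
                spans.append((' '.join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == 'I' and current_words:
            # The entity that was started earlier continues here.
            current_words.append(text)
        else:
            # 'O' (or a stray 'I'): no entity here, close any open one.
            if current_words:
                spans.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((' '.join(current_words), current_type))
    return spans


# First ten tokens of the first article with their IOB tags and entity types.
tokens = [
    ('Nokia', 'B', 'ORG'), ('has', 'O', ''), ('announced', 'O', ''),
    ('plans', 'O', ''), ('to', 'O', ''), ('cut', 'O', ''),
    ('up', 'B', 'CARDINAL'), ('to', 'I', 'CARDINAL'),
    ('10,000', 'I', 'CARDINAL'), ('jobs', 'O', ''),
]
print(iob_to_spans(tokens))
# [('Nokia', 'ORG'), ('up to 10,000', 'CARDINAL')]
```

In practice you rarely need to do this by hand, because spaCy already exposes the grouped spans on the document object.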

In [5]:
for tok in docs[0][:10]:
    print(tok.i, tok.norm_, tok.lemma_, tok.pos_, tok.ent_iob_, tok.ent_type_, sep='\t')

0	nokia	Nokia	PROPN	B	ORG
1	has	have	AUX	O	
2	announced	announce	VERB	O	
3	plans	plan	NOUN	O	
4	to	to	PART	O	
5	cut	cut	VERB	O	
6	up	up	ADP	B	CARDINAL
7	to	to	PART	I	CARDINAL
8	10,000	10,000	NUM	I	CARDINAL
9	jobs	job	NOUN	O	


We can also get a list of all named entities as spans from the document object, using its attribute `ents`:

In [6]:
for ent in docs[0].ents:
    print(ent, ent.label_, sep='\t')

Nokia	ORG
up to 10,000	CARDINAL
the next two years	DATE
Espoo	ORG
around 600 million	CARDINAL
the end of 2023	DATE
5	CARDINAL
Finland	GPE
5	CARDINAL
Nokia	ORG
around 90,000	CARDINAL
Pekka Lundmark	PERSON
6 percent	PERCENT
Nokia	ORG
last year	DATE
918 million	CARDINAL
2020	DATE
485 million	CARDINAL
2019	DATE


The module `spacy.displacy` provides a neat visualization of text with recognized named entities:

In [7]:
spacy.displacy.render(docs[0], style='ent')

## Ex 1

Create a frequency list of all recognized named entities in the whole corpus.

In [8]:
from collections import Counter

In [16]:
ent_freqs = Counter() # similar to defaultdict(lambda: 0)
for d in docs:
    ent_freqs.update(str(ent) for ent in d.ents)
ent_freqs_lst = sorted(ent_freqs.items(), reverse=True, key=itemgetter(1))

In [18]:
# Alternative solution with defaultdict instead of Counter
ent_freqs = defaultdict(lambda: 0)
for doc in docs:
    for ent in doc.ents:
        ent_freqs[str(ent)] += 1
ent_freqs_lst = sorted(ent_freqs.items(), reverse=True, key=itemgetter(1))

In [19]:
ent_freqs_lst[:20]

[('Finland', 424),
 ('Finnish', 150),
 ('Helsinki', 143),
 ('Parliament', 82),
 ('first', 68),
 ('two', 61),
 ('last year', 59),
 ('one', 55),
 ('Tuesday', 51),
 ('three', 51),
 ('Wednesday', 46),
 ('Thursday', 43),
 ('SDP', 40),
 ('Friday', 40),
 ('Turku', 39),
 ('Monday', 36),
 ('Saturday', 36),
 ('2019', 35),
 ('EU', 33),
 ('AstraZeneca', 32)]
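The list above mixes names with numerals and dates such as 'first', 'two' or 'Tuesday'. One possible refinement, sketched here in plain Python, is to count only entities whose label belongs to a chosen set of name categories. The data is the `(text, label)` list printed for the first article; the set `NAME_LABELS` is an assumption made for this illustration, not something prescribed by spaCy.

```python
from collections import Counter

# (text, label) pairs of the entities recognized in the first article.
ents = [
    ('Nokia', 'ORG'), ('up to 10,000', 'CARDINAL'), ('the next two years', 'DATE'),
    ('Espoo', 'ORG'), ('around 600 million', 'CARDINAL'), ('the end of 2023', 'DATE'),
    ('5', 'CARDINAL'), ('Finland', 'GPE'), ('5', 'CARDINAL'), ('Nokia', 'ORG'),
    ('around 90,000', 'CARDINAL'), ('Pekka Lundmark', 'PERSON'), ('6 percent', 'PERCENT'),
    ('Nokia', 'ORG'), ('last year', 'DATE'), ('918 million', 'CARDINAL'),
    ('2020', 'DATE'), ('485 million', 'CARDINAL'), ('2019', 'DATE'),
]

# Assumed selection of labels that we treat as "names" in this sketch.
NAME_LABELS = {'PERSON', 'ORG', 'GPE', 'LOC', 'NORP'}

# Frequency of entity texts, restricted to the name categories.
name_freqs = Counter(text for text, label in ents if label in NAME_LABELS)
# Frequency of the labels themselves.
label_freqs = Counter(label for _, label in ents)

print(name_freqs.most_common(1))   # [('Nokia', 3)]
print(label_freqs['CARDINAL'])     # 7
```

With the real corpus, the same idea would read `Counter(ent.text for d in docs for ent in d.ents if ent.label_ in NAME_LABELS)`. Note that `Counter.most_common(n)` returns the items sorted by decreasing count, so it can replace the explicit `sorted(..., key=itemgetter(1))` call.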

## Ex 2

Find sentences in which a place name (like "Finland") is the subject of a verb.

You can solve this either with or without a `DependencyMatcher`.

In [21]:
# Without DependencyMatcher
for d in docs:
    for tok in d:
        if tok.dep_ == 'nsubj' and tok.ent_type_ == 'GPE':
            print(' '.join(str(t) for t in tok.subtree), tok.head, tok.sent)


Finland take 
The paper's numbers are based on an estimate from the Institute for Health and Welfare (THL) which assumes Finland will take delivery of 5.5 million doses of the vaccine by June this year.
Finland is According to current estimates--and pending Parliamentary approval of the arrangement--Finland is to receive a total of around 2.9 billion euros out of the EU's 750 billion euro stimulus package over the next few years.
Finland made According to the Minister of the Environment and Climate Change, Krista Mikkonen (Green), the green transition funding would be the largest single climate investment that Finland has ever made.

Finland receive In total, Finland can receive about 3 billion in grants.
Finland facing Like other EU states, Finland is facing an EU deadline at the end of April to present plans on how plans to use the funds.

 Finland identified 
Finland has already identified a shortage of specialists in the IT sector, healthcare and construction.
Finland be "If jobs 

Finland decided In Friday's debate PM Marin was also asked whether Finland had decided to focus incoming vaccine doses on the areas worst-hit by the pandemic, rather than vaccinating the whole country at an even pace.

National tabloid Ilta - Sanomat claims National tabloid Ilta-Sanomat claims that officials at the Ministry of Social Affairs and Health are drawing up plans that could prioritise the distribution of vaccines to the areas worst affected by the virus.
Ukraine scored Ukraine scored the first goal of the night 80 minutes in, but talismanic striker Teemu Pukki was able to equalise from the penalty spot nine minutes later after being brought down by Ukraine defender Vitaliy Mykolenko inside the area.
Finland co The Climate Coalition of Finance Ministers, which Finland co-founded and continues to lead, has its own promising track," the president writes.
Finland followed Finland has followed the twice-yearly 'Daylight Saving Time' custom of switching clocks since 1981, but the E

Jukkola says We get food for them during the quarantine period, when they stay in their own premises," says Jukkola.
Jukkola says Jukkola says that this year each farm must draw up a health and safety plan together with the chief infection doctor of the local hospital district.
Finland has According to estimates from hospital districts around the country, Finland has the capacity to administer up to 600,000 doses of the coronavirus vaccine per week.
Finland administer In May, Finland will administer around 300,000 jabs per week, Kontio said.
Finland expecting Kontio said Finland is expecting to receive about 20,000 doses of the Johnsson & Johnsson jabs per week by the end of this month.

Finland stop Kontio said that if the group recommends that AstraZeneca's vaccine be limited to people over the age of 65, it is likely that Finland would soon stop using it.

Finland target There has also been discussion about whether Finland should target vaccinations regionally, to areas hardest-hit 

In [28]:
# With DependencyMatcher
pattern = [
    {'RIGHT_ID': 'subject',
     'RIGHT_ATTRS': {'DEP': 'nsubj', 'ENT_TYPE': 'GPE'}}
]
matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
matcher.add('pattern', [pattern])
for d in docs:
    for m_id, tok_index in matcher(d):
        print(' '.join(str(t) for t in d[tok_index[0]].subtree), d[tok_index[0]].sent)
        #print(m_id, tok_index, d[tok_index[0]], d[tok_index[0]].sent)


Finland 
The paper's numbers are based on an estimate from the Institute for Health and Welfare (THL) which assumes Finland will take delivery of 5.5 million doses of the vaccine by June this year.
Finland According to current estimates--and pending Parliamentary approval of the arrangement--Finland is to receive a total of around 2.9 billion euros out of the EU's 750 billion euro stimulus package over the next few years.
Finland According to the Minister of the Environment and Climate Change, Krista Mikkonen (Green), the green transition funding would be the largest single climate investment that Finland has ever made.

Finland In total, Finland can receive about 3 billion in grants.
Finland Like other EU states, Finland is facing an EU deadline at the end of April to present plans on how plans to use the funds.

 Finland 
Finland has already identified a shortage of specialists in the IT sector, healthcare and construction.
Finland "If jobs are not created, Finland will not be able 

The City of Turku The City of Turku will be providing food and hygiene products to local university foreign exchange students ordered into quarantine.
Henriksson and the ministry The Finns Party also claimed that Henriksson and the ministry had neglected their duty to take measures to organise the election in a timely and safe manner.
Finland Some municipal parking enforcers have, however, been using them for years as Finland introduced the limited use of body cameras in early 2018.
Finland Finland is the land of milk.
Cognitive impairment 
 Jolma Cognitive impairment
Jolma said Finland does not have accurate figures on how many women abuse substances during pregnancy or how many children this behaviour affects.
Finland Cognitive impairment
Jolma said Finland does not have accurate figures on how many women abuse substances during pregnancy or how many children this behaviour affects.
Finland Finland announced 520 new cases on Sunday, with some 336 of those recorded in the region aroun

State - owned energy company Fortum , which runs Finland ’s sole plastics recycling facility in Riihimäki , State-owned energy company Fortum, which runs Finland’s sole plastics recycling facility in Riihimäki, only recycles around a third of all incoming plastic waste, according to Yle investigative programme MOT.
Fortum Over the past few years, Fortum has said that 75 percent of the household plastic waste it processes is recycled into new raw material--a statement the company still made on its website in January.
Fortum At the Riihimäki plant, household plastic passes through several processing stages, ending in plastic pellets that Fortum sells to manufacturers as raw material.

Finland Fortum previously said it lacked the capacity to process all of the sorted plastic arriving in Riihimäki, which is why Finland ships some household plastic waste to Sweden and Germany.
" Lapland , particularly in the west "Lapland, particularly in the west, can see up to 10 centimeters of snow," he 

## Ex 3

Using a `DependencyMatcher`, find all instances of a person related to an organization. We will look for the following structures in the dependency tree:
* a named entity of type `ORG`,
* a named entity of type `PERSON` contained in the **subtree** of the above, i.e. being its descendant, but not necessarily an immediate descendant.

For every match, print the document ID, the positions of the matching tokens, the matched tokens (organization and person) and the sentence in which they occur.

The results might look poor because of the many errors of both parsing and named entity recognition.
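The difference between an immediate child and an arbitrary descendant can be illustrated without spaCy. The sketch below uses a hand-constructed toy tree (not actual parser output) stored as a list of head indices, where the root points to itself. The helpers `children` and `descendants` are hypothetical names for this illustration.

```python
def children(heads, i):
    """Indices of the tokens whose head is token i (immediate dependents)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


def descendants(heads, i):
    """Indices of all tokens in the subtree of token i, excluding i itself."""
    result = []
    stack = children(heads, i)
    while stack:
        j = stack.pop()
        result.append(j)
        stack.extend(children(heads, j))
    return sorted(result)


# Hand-made toy analysis of "Nokia, led by Pekka Lundmark, grew".
tokens    = ['Nokia', ',', 'led', 'by', 'Pekka', 'Lundmark', ',', 'grew']
ent_types = ['ORG',   '',  '',    '',   'PERSON', 'PERSON',  '',  '']
heads     = [7,       0,   0,     2,    5,        3,         0,   7]

org = 0  # position of the ORG token
child_persons = [tokens[j] for j in children(heads, org) if ent_types[j] == 'PERSON']
desc_persons = [tokens[j] for j in descendants(heads, org) if ent_types[j] == 'PERSON']

print(child_persons)  # []
print(desc_persons)   # ['Pekka', 'Lundmark']
```

In this toy tree no `PERSON` token is a direct child of the organization, but both are found among its descendants. Note also that every token of a multi-token entity counts separately, which is why one person name can produce several matches.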

In [35]:
pattern = [
    # organization name
    {'RIGHT_ID': 'organization',
     'RIGHT_ATTRS': {'ENT_TYPE': 'ORG'}},
    # person name
    {'LEFT_ID': 'organization',
     'REL_OP': '>>',
     'RIGHT_ID': 'person',
     'RIGHT_ATTRS': {'ENT_TYPE': 'PERSON'}}
]
matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
matcher.add('pattern', [pattern])
for d in docs:
    for m_id, tok_idx in matcher(d):
        print(m_id, tok_idx, d[tok_idx[0]], d[tok_idx[1]])
        print(d[tok_idx[0]].sent)
        print()

15329811787164753587 [693, 695] Services Migri
The offices of the Finnish Immigration Services, Migri.

15329811787164753587 [140, 142] Vantaa Kaunianen
In Helsinki, Espoo, Vantaa and Kaunianen, incoming people with a native language other than Finnish, Swedish or Sámi accounted for 70 percent of population growth between 2015 and 2019.

15329811787164753587 [209, 217] Nokia Ylöjärvi
Föli operates in Turku, Kaarina, Raisio, Naantali, Lieto and Rusko, while Nysse runs in Tampere, Kangasala, Lempäälä, Nokia, Orivesi, Pirkkala, Vesilahti and Ylöjärvi.

15329811787164753587 [128, 137] Pirkanmaa Ostrobothnia
A similar trend was also observed at Finland's largest laboratory firm, Fimlab, which tests and analyses samples in the regions of Pirkanmaa, Central Finland, Kanta-Häme, Ostrobothnia and Päijät-Häme.

15329811787164753587 [131, 137] Finland Ostrobothnia
A similar trend was also observed at Finland's largest laboratory firm, Fimlab, which tests and analyses samples in the regions of Pir

15329811787164753587 [143, 153] AstraZeneca Pfizer
The City of Helsinki currently uses two different vaccines, AstraZeneca for those over the age of 65, and Pfizer, which is given to people between the ages of 16-64.

15329811787164753587 [60, 62] Merimasku Naantali
On Tamsaari farm in Merimasku, Naantali, where four workers from the first flight are already settling into their routines, another Covid test was issued a few days later, explains owner Stiina Lerkki.


15329811787164753587 [304, 306] Wallend Scott
Of the other respondents who committed serious drug offenses, Krister Wallend and Scott Hendry were sentenced to 10 years, ex-Cannonball leader Ari Ronkainen to 9 and a half years, Juho Kieloniemi to 9 years 4 months, Matias Palmroth to just over 9 years, and Oscar Fagerström and Jarkko Hietikko to 8 years in prison.

15329811787164753587 [304, 307] Wallend Hendry
Of the other respondents who committed serious drug offenses, Krister Wallend and Scott Hendry were sentenced to 10 

8 693 695
Services Migri
The offices of the Finnish Immigration Services, Migri.

12 140 142
Vantaa Kaunianen
In Helsinki, Espoo, Vantaa and Kaunianen, incoming people with a native language other than Finnish, Swedish or Sámi accounted for 70 percent of population growth between 2015 and 2019.

17 209 217
Nokia Ylöjärvi
Föli operates in Turku, Kaarina, Raisio, Naantali, Lieto and Rusko, while Nysse runs in Tampere, Kangasala, Lempäälä, Nokia, Orivesi, Pirkkala, Vesilahti and Ylöjärvi.

19 128 137
Pirkanmaa Ostrobothnia
A similar trend was also observed at Finland's largest laboratory firm, Fimlab, which tests and analyses samples in the regions of Pirkanmaa, Central Finland, Kanta-Häme, Ostrobothnia and Päijät-Häme.

19 131 137
Finland Ostrobothnia
A similar trend was also observed at Finland's largest laboratory firm, Fimlab, which tests and analyses samples in the regions of Pirkanmaa, Central Finland, Kanta-Häme, Ostrobothnia and Päijät-Häme.

24 13 15
Institution Kela
Finland's ne

# Optional homework

## Ex 4

Write the function `ent_rel(nlp, docs)` that finds occurrences of verbs for which both the subject (`nsubj`) and the direct object (`dobj`) are named entities of one of the following types: `PERSON, GPE, LOC, ORG, NORP`. For each such occurrence, display as in the example:
* index of the document
* the verb
* the subject and object in form `word:dependency_type:entity_type`
* the entire sentence

Use `DependencyMatcher`! In order to test whether a token belongs to *one of* the given named entity categories, use the construct `IN`. You will find examples here: https://spacy.io/api/matcher

This can be a first step towards extracting events or relations between entities.
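To get an intuition for what a constraint like `{'IN': [...]}` expresses, here is a toy sketch in plain Python. It is only an illustration of the semantics and not how spaCy implements matching; the functions `attr_matches` and `token_matches` and the dictionary representation of tokens are assumptions made for this example.

```python
def attr_matches(constraint, value):
    """Toy check of a single attribute constraint (not spaCy's implementation)."""
    if isinstance(constraint, dict):
        if 'IN' in constraint:
            # The value must be one of the listed alternatives.
            return value in constraint['IN']
        raise ValueError('this sketch only supports the IN construct')
    # A plain string constraint means exact equality.
    return constraint == value


def token_matches(attrs, token):
    """All attribute constraints must hold for the token (a plain dict here)."""
    return all(attr_matches(c, token.get(name)) for name, c in attrs.items())


ENTITY_TYPES = ['PERSON', 'GPE', 'LOC', 'ORG', 'NORP']
subject_attrs = {'DEP': 'nsubj', 'ENT_TYPE': {'IN': ENTITY_TYPES}}

print(token_matches(subject_attrs, {'DEP': 'nsubj', 'ENT_TYPE': 'ORG'}))  # True
print(token_matches(subject_attrs, {'DEP': 'nsubj', 'ENT_TYPE': ''}))     # False
print(token_matches(subject_attrs, {'DEP': 'dobj', 'ENT_TYPE': 'GPE'}))   # False
```

The constraints inside one attribute dictionary are combined conjunctively, while `IN` provides the disjunction over entity types that the exercise asks for.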

In [47]:
def ent_rel(nlp, docs):
    pattern = [
        # verb
        {'RIGHT_ID': 'verb',
         'RIGHT_ATTRS': {'POS': 'VERB'}},
        # subject
        {'LEFT_ID': 'verb',
         'REL_OP': '>',
         'RIGHT_ID': 'subject',
         'RIGHT_ATTRS': {'DEP': 'nsubj', 'ENT_TYPE': {'IN': ['PERSON', 'GPE', 'LOC', 'ORG', 'NORP']}},
        # object
        {'LEFT_ID': 'verb',
         'REL_OP': '>',
         'RIGHT_ID': 'object',
         'RIGHT_ATTRS': {'DEP': 'dobj', 'ENT_TYPE': {'IN': ['PERSON', 'GPE', 'LOC', 'ORG', 'NORP']}}}
    ]
    matcher = spacy.matcher.DependencyMatcher(nlp.vocab)
    matcher.add('verb_with_entities', [pattern])
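    for d in docs:
        # Yield one list per document; each item is a (verb, subject, object)
        # tuple of tokens, which is what the loop in the next cell expects.
        yield [tuple(d[i] for i in tok_idx) for m_id, tok_idx in matcher(d)]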

In [51]:
for i, doc_ent_rel in enumerate(ent_rel(nlp, docs)):
    for toks in doc_ent_rel:
        print(i, toks[0],
              str(toks[1])+':'+toks[1].dep_+':'+toks[1].ent_type_,
              str(toks[2])+':'+toks[2].dep_+':'+toks[2].ent_type_)
        print(toks[0].sent)
        print()

1 told Kalo:nsubj:PERSON HS:dobj:ORG
"When information about [the transition to] distance learning came, we had to arrange everything at short notice," Vantaa's Director of Basic Education Ilkka Kalo told HS.

4 urged Commission:nsubj:ORG Finland:dobj:GPE
The EU Commission has urged Finland and many other countries to focus on fewer but more effective projects.

13 won Saariaho:nsubj:PERSON Award:dobj:ORG
Finnish composer Kaija Saariaho has won the Venice Biennale's Golden Lion Award for Lifetime Achievement.

15 told Broas:nsubj:PERSON Yle:dobj:ORG
"A slightly larger amount of the Moderna vaccine came to Lapland than to the rest of the country, because distances are long here," Broas told Yle.

28 asked APN:nsubj:ORG Andersson:dobj:ORG
Finland often claims it wants to attract people from abroad, so APN asked Andersson if the education system was prepared to welcome international families.

31 told Wit:nsubj:PERSON Yle:dobj:PERSON
"I am absolutely innocent, and this [case] is revenge b