# POS Tagging, Syntactic Dependency Parsing and NER

In this project, we'll demonstrate how to perform **part-of-speech (POS) tagging, syntactic dependency parsing and named entity recognition (NER) with spaCy**.

For this, we will use two different books, both freely available on the [Project Gutenberg website](https://www.gutenberg.org):

* "Flatland: A Romance of Many Dimensions", by Edwin Abbott Abbott
* Charles Darwin's seminal book "On the Origin of Species by Means of Natural Selection"

### Part-of-speech tagging

#### 1. Perform initial imports

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

from spacy import displacy

import os

#### 2. Load data

In [None]:
with open('./text/flatland.txt', encoding='utf8') as f:
    doc = nlp(f.read())

#### 3. Explore data

In spaCy, the `Doc` class is a container for accessing linguistic annotations. Let's explore the `doc` object we've just created via the `nlp` object.

In [None]:
# sentences of our doc

sentences = list(doc.sents)

In [None]:
# sentence example

sentences[11]

In [None]:
# total number of sentences

len(sentences)

In [None]:
# span of our sentence

sentences[11][0:5]

#### 4. Extract POS tags

Let's see how we can access the **text, coarse-grained POS tags, fine-grained POS tags and the description of these fine-grained POS tags**.

In [None]:
# text and POS tag

for token in sentences[11][0:5]:
    print(token.text, token.pos_)

In [None]:
# text, POS tag, fine-grained POS tag and description for whole sentence

for token in sentences[11]:
    print(f'{token.text:{15}} {token.pos_:{8}} {token.tag_:{8}} {spacy.explain(token.tag_)}')
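To make the difference between the two tag levels concrete, here is a minimal, library-free sketch. The mapping is a small hand-written subset of Penn Treebank style tags and the coarse categories they usually correspond to. It is an illustration only, not spaCy's internal mapping, and `coarse_tag` is a hypothetical helper.

```python
# Partial, hand-written mapping from fine-grained (Penn Treebank style) tags
# to coarse-grained (Universal POS) tags. Illustrative only.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "PRP": "PRON",
    "DT": "DET",
}


def coarse_tag(fine_tag):
    """Return the coarse tag for a fine-grained tag, or 'X' if it is not in the subset."""
    return FINE_TO_COARSE.get(fine_tag, "X")


for fine in ["NN", "NNS", "NNP", "JJR", "UNKNOWN"]:
    print(f"{fine:{8}} {coarse_tag(fine)}")
```

Several fine-grained tags collapse into one coarse tag, which is why `token.tag_` carries more detail (for example singular versus plural nouns) than `token.pos_`.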

We can also easily count the total number of each POS tag in our document.

In [None]:
POS_counts = doc.count_by(spacy.attrs.POS)

POS_counts

Let's sort them by frequency and replace their IDs with the corresponding strings.

In [None]:
for key, value in sorted(POS_counts.items(), key=lambda item: item[1], reverse=True):
    print(f'{doc.vocab[key].text:{8}}: {value}')

Nouns and punctuation marks are the most common POS tags.
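The same kind of frequency table can be sketched with `collections.Counter`, which works on tag strings directly and so avoids the ID-to-string lookup. The tagged tokens below are hand-written toy data (an assumption for illustration, not output from the book). With a real `Doc`, the equivalent would be `Counter(token.pos_ for token in doc)`.

```python
from collections import Counter

# Hand-tagged toy sentence: (text, coarse POS tag) pairs.
tagged_tokens = [
    ("The", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
    ("the", "DET"), ("mice", "NOUN"), (".", "PUNCT"),
]

# Count how often each tag occurs.
pos_counts = Counter(pos for _, pos in tagged_tokens)

# most_common() returns (tag, count) pairs sorted by descending count.
for pos, count in pos_counts.most_common():
    print(f"{pos:{8}}: {count}")
```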

### Syntactic dependency parsing

#### 5.  Extract syntactic dependency labels

We can also access the **syntactic dependency labels** and their descriptions.

In [None]:
# text and dependency labels

for token in sentences[11][0:5]:
    print(token.text, token.dep_)

In [None]:
# text, dependency labels and description for whole sentence

for token in sentences[11]:
    print(f'{token.text:{15}} {token.dep_:{10}} {spacy.explain(token.dep_)}')

A **syntactic dependency label** describes the type of syntactic relation between two words in a sentence. For each pair of words, one word is the **syntactic governor, also called the head**, and the other is the **dependent, also called the child**.

Each word in a sentence has exactly one head: a word can be a child to only one head, but a given word can act as a head in none, one or several pairs.

Let's see a simple example of this:

In [None]:
# text, dependency label and head

for token in sentences[11][0:5]:
    print(f'{token.text:{10}} {token.dep_:{10}} {token.head.text}')

The ROOT label marks the token whose head is itself, in this case the verb 'call'. This verb is also the head of the words 'I' and 'Flatland'.
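The head/child structure can also be sketched without spaCy as a list of head indices. The parse below is a hand-annotated toy example (an assumption for illustration, not spaCy output), and `children_of` is a hypothetical helper. Following the convention described above, the root is the token that points to itself.

```python
# Hand-annotated toy parse: heads[i] is the index of the head of words[i].
words = ["She", "reads", "old", "books"]
heads = [1, 1, 3, 1]  # "reads" is its own head, so it is the root


def children_of(heads):
    """Invert the head list: map each token index to the indices of its children."""
    children = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if child != head:  # the root is not its own child
            children[head].append(child)
    return children


children = children_of(heads)
root = next(i for i, head in enumerate(heads) if i == head)

for i, word in enumerate(words):
    kids = [words[c] for c in children[i]]
    print(f"{word:{8}} head={words[heads[i]]:{8}} children={kids}")
```

Each token appears exactly once in `heads`, so it has exactly one head, while a token such as "reads" can have several children and a token such as "She" has none.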

We can also visualize the relationship between our words in a sentence.

#### 6. Visualize syntactic dependency parse

In [None]:
# considering only our first 5 tokens

displacy.render(list(doc.sents)[11][:5], style='dep', options={'compact': True, 'bg': '#09a3d5', 'color': 'white'})

We can also export this visualization to a file for later use.

In [None]:
# save visualization as an html page

# page = True to render as a full HTML page
# jupyter = False to override jupyter detection
html_page = displacy.render(list(doc.sents)[11][:5], style='dep', page=True, jupyter=False, options={'compact': True, 'bg': '#09a3d5', 'color': 'white'})

# output directory name
output_dir = 'dependency_vis'

# create the output directory if it doesn't exist yet
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, 'flatland.html'), 'w', encoding='utf8') as f:
    f.write(html_page)

### Named entity recognition

#### 7. Load data

In [16]:
with open('./text/origin_of_species.txt', encoding='utf8') as f:
    doc = nlp(f.read())

#### 8. Explore data

In [17]:
sentences = list(doc.sents)

In [18]:
sentences[179]

In considering the Origin of Species, it is quite conceivable that a
naturalist, reflecting on the mutual affinities of organic beings, on their
embryological relations, their geographical distribution, geological
succession, and other such facts, might come to the conclusion that each
species had not been independently created, but had descended, like
varieties, from other species.

#### 9. Recognize named entities

In [19]:
# function to display entity info

def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+': '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [20]:
show_ents(sentences[179])

the Origin of Species: WORK_OF_ART - Titles of books, songs, etc.


In [21]:
sentences[180]

Nevertheless, such a conclusion, even if
well founded, would be unsatisfactory, until it could be shown how the
innumerable species inhabiting this world have been modified, so as to
acquire that perfection of structure and coadaptation which most justly
excites our admiration.

In [22]:
show_ents(sentences[180])

No named entities found.


#### 10. Identify sentences with 'n' or more named entities

In [23]:
def sents_n_ents(doc, n=1):
    list_sentences = [sentence for sentence in doc.sents if len(sentence.ents) >= n]
    return list_sentences

In [24]:
# sentences with 4 or more named entities

ner_sentences = sents_n_ents(doc, 4)

In [25]:
len(ner_sentences)

264

We have 264 sentences with 4 or more named entities.

#### 11. Recognize named entities in one of these sentences

In [26]:
ner_sentences[14]

" Pigeons were much valued by Akber Khan
in India, about the year 1600; never less than 20,000 pigeons were taken
with the court.

In [27]:
show_ents(ner_sentences[14])

Akber Khan: PERSON - People, including fictional
India: GPE - Countries, cities, states
about the year 1600: DATE - Absolute or relative dates or periods
less than 20,000: CARDINAL - Numerals that do not fall under another type


#### 12. Visualize named entities

In [28]:
displacy.render(ner_sentences[14], style='ent')

#### 13. Add a named entity

In [29]:
ner_sentences[24]

if that
between America and Europe is ample, will that between the Continent and
the Azores, or Madeira, or the Canaries, or Ireland, be sufficient?

In [30]:
show_ents(ner_sentences[24])

America: GPE - Countries, cities, states
Europe: LOC - Non-GPE locations, mountain ranges, bodies of water
Continent: LOC - Non-GPE locations, mountain ranges, bodies of water
Canaries: LOC - Non-GPE locations, mountain ranges, bodies of water
Ireland: GPE - Countries, cities, states


As we can see, [Azores](https://en.wikipedia.org/wiki/Azores) and [Madeira](https://en.wikipedia.org/wiki/Madeira) are not recognized as named entities by spaCy, although both should be labeled LOC. Let's change that!

In order to do this, we need to know the position - the indices - of our tokens of interest in our `doc` object.

In [31]:
token_i_madeira = [token.i for token in doc if token.text == 'Madeira']

In [32]:
token_i_azores = [token.i for token in doc if token.text == 'Azores']

Since we know that for our sentence of interest the token "Madeira" is the third token after the token "Azores", we can use that information to identify our indices.

In [33]:
possible_i=[]

for token_i in token_i_azores:
    if token_i + 3 in token_i_madeira:
        possible_i.append(token_i)

In [34]:
possible_i

[20728]

Luckily, there's only one possibility. Let's see if it is correct.

In [35]:
doc[20728:20743]

Azores, or Madeira, or the Canaries, or Ireland, be sufficient?

It seems to be our sentence! Let's confirm our indices and add a named entity to these tokens.

In [36]:
doc[20728]

Azores

In [37]:
doc[20731]

Madeira
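The offset-based search used above to locate the sentence can be sketched on a plain token list. The helper `find_with_offset` and the token list are hypothetical, written only to illustrate the idea. Storing the positions of the second word in a set keeps each membership check cheap on long documents.

```python
def find_with_offset(tokens, first, second, offset):
    """Return indices i where tokens[i] == first and tokens[i + offset] == second."""
    second_positions = {i for i, tok in enumerate(tokens) if tok == second}
    return [i for i, tok in enumerate(tokens)
            if tok == first and i + offset in second_positions]


# Toy token list (hand-written, not the tokenized book).
tokens = ["Madeira", "and", "the", "Azores", ".",
          "The", "Azores", ",", "or", "Madeira", ","]

# Only the second "Azores" is followed by "Madeira" three tokens later.
print(find_with_offset(tokens, "Azores", "Madeira", 3))
```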

In [38]:
from spacy.tokens import Span

In [39]:
azores_ent = Span(doc, 20728, 20729, label='LOC')
madeira_ent = Span(doc, 20731, 20732, label='LOC')

doc.ents = list(doc.ents) + [azores_ent, madeira_ent]

Let's see if this worked.

In [40]:
show_ents(ner_sentences[24])

America: GPE - Countries, cities, states
Europe: LOC - Non-GPE locations, mountain ranges, bodies of water
Continent: LOC - Non-GPE locations, mountain ranges, bodies of water
Azores: LOC - Non-GPE locations, mountain ranges, bodies of water
Madeira: LOC - Non-GPE locations, mountain ranges, bodies of water
Canaries: LOC - Non-GPE locations, mountain ranges, bodies of water
Ireland: GPE - Countries, cities, states


In [41]:
displacy.render(ner_sentences[24], style='ent')

Everything is working as expected.

However, this means that Azores and Madeira will be identified as entities in this sentence, but not in other occurrences throughout our document. In order to change this, we'll use spaCy's **PhraseMatcher**.

#### 14. Add a named entity to all occurrences of a token

In [42]:
from spacy.matcher import PhraseMatcher

In [43]:
# initialize the matcher

matcher = PhraseMatcher(nlp.vocab)

In [44]:
# create match patterns and add them to the matcher
# (spaCy v3 expects a list of patterns; older v2 releases used matcher.add('PATTERN1', None, pattern1))

pattern1 = nlp("Azores")
pattern2 = nlp("Madeira")

matcher.add('PATTERN1', [pattern1])
matcher.add('PATTERN2', [pattern2])

In [45]:
# apply the matcher to our doc

for match_id, start, end in matcher(doc):
    print(doc.vocab.strings[match_id]+': ', str(start)+' --> '+str(end), ' ('+doc[start:end].text+')')

PATTERN2:  20570 --> 20571  (Madeira)
PATTERN1:  20728 --> 20729  (Azores)
PATTERN2:  20731 --> 20732  (Madeira)
PATTERN2:  21980 --> 21981  (Madeira)
PATTERN2:  42782 --> 42783  (Madeira)
PATTERN2:  53622 --> 53623  (Madeira)
PATTERN2:  53690 --> 53691  (Madeira)
PATTERN2:  53731 --> 53732  (Madeira)
PATTERN2:  53792 --> 53793  (Madeira)
PATTERN2:  53895 --> 53896  (Madeira)
PATTERN2:  54434 --> 54435  (Madeira)
PATTERN1:  55446 --> 55447  (Azores)
PATTERN2:  120501 --> 120502  (Madeira)
PATTERN2:  129997 --> 129998  (Madeira)
PATTERN1:  138931 --> 138932  (Azores)
PATTERN2:  149084 --> 149085  (Madeira)
PATTERN2:  149475 --> 149476  (Madeira)
PATTERN2:  149517 --> 149518  (Madeira)
PATTERN2:  149608 --> 149609  (Madeira)
PATTERN2:  149716 --> 149717  (Madeira)
PATTERN2:  150302 --> 150303  (Madeira)
PATTERN1:  150305 --> 150306  (Azores)
PATTERN2:  153942 --> 153943  (Madeira)
PATTERN2:  153984 --> 153985  (Madeira)
PATTERN1:  187445 --> 187446  (Azores)
PATTERN2:  187645 --> 187646 

We have now identified all the occurrences of our tokens "Azores" and "Madeira". Let's add a named entity to all of them.

In [None]:
# new entities excluding the ones we already added to avoid conflicts

new_ents = [Span(doc, start, end, label='LOC') for _, start, end in matcher(doc)
            if start not in (20728, 20731)]

doc.ents = list(doc.ents) + new_ents

The assignment above raises an error: spaCy had already labeled several occurrences of the token "Madeira" as ORG, and the entities in `doc.ents` cannot overlap. Let's do this only for the "Azores" token instead.

In [46]:
# initialize the matcher

matcher = PhraseMatcher(nlp.vocab)

In [47]:
# create match patterns and add them to the matcher
# (spaCy v3 expects a list of patterns; older v2 releases used matcher.add('PATTERN', None, pattern))

pattern = nlp("Azores")

matcher.add('PATTERN', [pattern])

In [48]:
# apply the matcher to our doc

for match_id, start, end in matcher(doc):
    print(doc.vocab.strings[match_id]+': ', str(start)+' --> '+str(end), ' ('+doc[start:end].text+')')

PATTERN:  20728 --> 20729  (Azores)
PATTERN:  55446 --> 55447  (Azores)
PATTERN:  138931 --> 138932  (Azores)
PATTERN:  150305 --> 150306  (Azores)
PATTERN:  187445 --> 187446  (Azores)
PATTERN:  187916 --> 187917  (Azores)
PATTERN:  190384 --> 190385  (Azores)
PATTERN:  195084 --> 195085  (Azores)


In [49]:
# new entities excluding the one we already added to avoid conflicts
# two other matches (start=187916 and start=190384) conflict with existing doc.ents

new_ents = [Span(doc, start, end, label='LOC') for _, start, end in matcher(doc)
            if start not in (20728, 187916, 190384)]

doc.ents = list(doc.ents) + new_ents
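The cell above avoids conflicts by hard-coding the start indices of the conflicting matches. A more general filter can be sketched on plain `(start, end, label)` triples. The helpers `overlaps` and `non_conflicting` and the example numbers are hypothetical, chosen only to show the overlap test on half-open token ranges.

```python
def overlaps(a, b):
    """Return True if two half-open token ranges (start, end, ...) share a token."""
    return a[0] < b[1] and b[0] < a[1]


def non_conflicting(existing, candidates):
    """Keep candidates that overlap neither an existing entity nor an earlier kept candidate."""
    kept = []
    for candidate in candidates:
        if not any(overlaps(candidate, other) for other in existing + kept):
            kept.append(candidate)
    return kept


# Hypothetical (start, end, label) triples, not taken from the book.
existing = [(10, 11, "ORG"), (40, 42, "GPE")]
candidates = [(10, 11, "LOC"), (25, 26, "LOC"), (41, 42, "LOC")]

print(non_conflicting(existing, candidates))
```

With real spaCy objects, the same test could be applied to `span.start` and `span.end`. spaCy also provides `spacy.util.filter_spans`, which removes overlaps from a list of spans by preferring longer ones.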

Let's check if this worked.

In [50]:
# for pattern with start=55446

doc[55411:55450]

Mr. Thwaites informs me that he
has observed similar facts in Ceylon, and analogous observations have been
made by Mr. H. C. Watson on European species of plants brought from the
Azores to England.

In [51]:
show_ents(doc[55411:55450])

Thwaites: PERSON - People, including fictional
Ceylon: GPE - Countries, cities, states
H. C. Watson: PERSON - People, including fictional
European: NORP - Nationalities or religious or political groups
Azores: LOC - Non-GPE locations, mountain ranges, bodies of water
England: GPE - Countries, cities, states


In [52]:
displacy.render(doc[55411:55450], style='ent')

As expected, the "Azores" token is now identified as LOC (non-GPE locations, mountain ranges, bodies of water).

As an alternative, we could have also customized the text-processing pipeline by **updating the NER pipeline component**. In order to do this, we would need to prepare some training data with annotations that the model could learn from.
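As a rough illustration of what such training data can look like, NER annotations for spaCy are commonly written as character offsets into the raw text, each paired with a label. The helper `annotate` below is hypothetical and only builds that structure from known phrases. The actual training call differs between spaCy versions and is not shown here.

```python
def annotate(text, phrase_labels):
    """Build a (text, {'entities': [...]}) pair using character offsets.

    Only the first occurrence of each phrase is annotated; phrases that
    do not occur in the text are skipped.
    """
    entities = []
    for phrase, label in phrase_labels:
        start = text.find(phrase)
        if start != -1:
            entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


text = "plants brought from the Azores to England."
example = annotate(text, [("Azores", "LOC"), ("England", "GPE")])

print(example)
```

Unlike the `Span` objects created earlier, which use token indices, these annotations use character positions, so `text[start:end]` recovers the entity string.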