In [1]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

# **Text information extraction with spaCy**

Natural language processing (NLP) lets us extract structured information from unstructured text. This notebook works through that process with spaCy, an open-source Python library for NLP.

This notebook has two objectives: first, to briefly go through the capabilities of spaCy, and second, to use those capabilities for information extraction.

In [2]:
# Import libraries and load the small English pipeline
import spacy
from collections import Counter

nlp = spacy.load('en_core_web_sm')

We now take a few paragraphs to use for the extraction. The text below is from https://en.wikipedia.org/wiki/New_Zealand

In [3]:
text = ''' New Zealand (Māori: Aotearoa) is an island country in the southwestern Pacific Ocean. It consists of two main landmasses—the North Island (Te Ika-a-Māui) and the South Island (Te Waipounamu)—and over 700 smaller islands. It is the sixth-largest island country by area and lies east of Australia across the Tasman Sea and south of the islands of New Caledonia, Fiji, and Tonga. The country's varied topography and sharp mountain peaks, including the Southern Alps, owe much to tectonic uplift and volcanic eruptions. New Zealand's capital city is Wellington, and its most populous city is Auckland.

The islands of New Zealand were the last large habitable land to be settled by humans. Between about 1280 and 1350, Polynesians began to settle in the islands and then developed a distinctive Māori culture. In 1642, the Dutch explorer Abel Tasman became the first European to sight and record New Zealand. In 1840, representatives of the United Kingdom and Māori chiefs signed the Treaty of Waitangi, which in its English version declared British sovereignty over the islands. In 1841, New Zealand became a colony within the British Empire. Subsequently, a series of conflicts between the colonial government and Māori tribes resulted in the alienation and confiscation of large amounts of Māori land. New Zealand became a dominion in 1907; it gained full statutory independence in 1947, retaining the monarch as head of state. Today, the majority of New Zealand's population of 5.1 million is of European descent; the indigenous Māori are the largest minority, followed by Asians and Pacific Islanders. Reflecting this, New Zealand's culture is mainly derived from Māori and early British settlers, with recent broadening of culture arising from increased immigration. The official languages are English, Māori, and New Zealand Sign Language, with the local dialect of English being dominant.

A developed country, it was the first to introduce a minimum wage, and the first to give women the right to vote. It ranks highly in international measures of quality of life, human rights, and it has low levels of perceived corruption. It retains visible levels of inequality, having structural disparities between its Māori and European populations. New Zealand underwent major economic changes during the 1980s, which transformed it from a protectionist to a liberalised free-trade economy. The service sector dominates the national economy, followed by the industrial sector, and agriculture; international tourism is also a significant source of revenue.

Nationally, legislative authority is vested in an elected, unicameral Parliament, while executive political power is exercised by the Cabinet, led by the prime minister, currently Chris Hipkins. Charles III is the country's king and is represented by the governor-general. In addition, New Zealand is organised into 11 regional councils and 67 territorial authorities for local government purposes. The Realm of New Zealand also includes Tokelau (a dependent territory); the Cook Islands and Niue (self-governing states in free association with New Zealand); and the Ross Dependency, which is New Zealand's territorial claim in Antarctica.

New Zealand is a member of the United Nations, Commonwealth of Nations, ANZUS, UKUSA, OECD, ASEAN Plus Six, Asia-Pacific Economic Cooperation, the Pacific Community and the Pacific Islands Forum. '''

First, we tokenise and annotate the text above using spaCy, and then display the resulting tokens.

In [4]:
# tokenise and annotate text
doc = nlp(text)

# Print the individual tokens
for token in doc:
    print(token)

 
New
Zealand
(
Māori
:
Aotearoa
)
is
an
island
country
in
the
southwestern
Pacific
Ocean
.
It
consists
of
two
main
landmasses
—
the
North
Island
(
Te
Ika
-
a
-
Māui
)
and
the
South
Island
(
Te
Waipounamu)—and
over
700
smaller
islands
.
It
is
the
sixth
-
largest
island
country
by
area
and
lies
east
of
Australia
across
the
Tasman
Sea
and
south
of
the
islands
of
New
Caledonia
,
Fiji
,
and
Tonga
.
The
country
's
varied
topography
and
sharp
mountain
peaks
,
including
the
Southern
Alps
,
owe
much
to
tectonic
uplift
and
volcanic
eruptions
.
New
Zealand
's
capital
city
is
Wellington
,
and
its
most
populous
city
is
Auckland
.



The
islands
of
New
Zealand
were
the
last
large
habitable
land
to
be
settled
by
humans
.
Between
about
1280
and
1350
,
Polynesians
began
to
settle
in
the
islands
and
then
developed
a
distinctive
Māori
culture
.
In
1642
,
the
Dutch
explorer
Abel
Tasman
became
the
first
European
to
sight
and
record
New
Zealand
.
In
1840
,
representatives
of
the
United
Kingdom
and
Māori
ch
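
Tokenisation in spaCy is non-destructive: each token records its character offset (`token.idx`) and any trailing whitespace (`token.whitespace_`), so the original string can be rebuilt from the tokens. The sketch below illustrates this on a short phrase from the text above. It assumes a blank English pipeline (`spacy.blank('en')`, which provides the tokeniser only) so that it runs without a trained model; the names `nlp_blank` and `sample_doc` are chosen here to avoid overwriting the notebook's `nlp` and `doc`.

```python
import spacy

# Assumption: a blank English pipeline (tokeniser only) is enough for this demo
nlp_blank = spacy.blank("en")
sample = "Over 700 smaller islands."
sample_doc = nlp_blank(sample)

# Each row: token text, character offset in the original string, trailing whitespace
rows = [(token.text, token.idx, token.whitespace_) for token in sample_doc]
for row in rows:
    print(row)

# Joining every token with its trailing whitespace reproduces the input exactly
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```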

spaCy can also segment the text into sentences. In the cell below, we print each sentence in turn.

In [5]:
for sent in doc.sents:
    print(sent.text)

 New Zealand (Māori: Aotearoa) is an island country in the southwestern Pacific Ocean.
It consists of two main landmasses—the North Island (Te Ika-a-Māui) and the South Island (Te Waipounamu)—and over 700 smaller islands.
It is the sixth-largest island country by area and lies east of Australia across the Tasman Sea and south of the islands of New Caledonia, Fiji, and Tonga.
The country's varied topography and sharp mountain peaks, including the Southern Alps, owe much to tectonic uplift and volcanic eruptions.
New Zealand's capital city is Wellington, and its most populous city is Auckland.


The islands of New Zealand were the last large habitable land to be settled by humans.
Between about 1280 and 1350, Polynesians began to settle in the islands and then developed a distinctive Māori culture.
In 1642, the Dutch explorer Abel Tasman became the first European to sight and record New Zealand.
In 1840, representatives of the United Kingdom and Māori chiefs signed the Treaty of Waitangi
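
In `en_core_web_sm`, sentence boundaries are derived from the dependency parse. When only sentence splitting is needed, spaCy also provides a rule-based `sentencizer` component that splits on sentence-final punctuation. The following is a minimal sketch, assuming a blank English pipeline (not the model loaded above) and a two-sentence sample adapted from the text:

```python
import spacy

# Assumption: blank pipeline plus the rule-based sentencizer, no trained model
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

sample_doc = nlp_rules(
    "New Zealand's capital city is Wellington. Its most populous city is Auckland."
)
sentences = [sent.text for sent in sample_doc.sents]
for sentence in sentences:
    print(sentence)
```

The rule-based splitter is faster but less robust than the parser on text with abbreviations or unusual punctuation, so which one to use depends on the data.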

## Part of Speech tagging with spaCy

spaCy provides two levels of part-of-speech annotation. The `pos_` attribute holds a coarse-grained tag from the Universal POS tag set, while the `tag_` attribute holds a fine-grained tag, which for the English models follows the Penn Treebank scheme.

The [spacy.explain](https://spacy.io/api/top-level#spacy.explain) function returns a short description of a POS tag, dependency label, or entity type, which makes the annotations easier to read.

In [6]:
# Part of Speech tagging with default set of labels
tags = [token.pos_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')

PROPN	111	proper noun
NOUN	100	noun
PUNCT	89	punctuation
ADP	64	adposition
ADJ	59	adjective
DET	55	determiner
VERB	43	verb
CCONJ	26	coordinating conjunction
AUX	20	auxiliary
PRON	15	pronoun
NUM	14	numeral
ADV	12	adverb
PART	12	particle
SPACE	5	space
X	1	other
SCONJ	1	subordinating conjunction


In [7]:
# Part of Speech tagging with Penn Treebank tags
tags = [token.tag_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')

NNP	107	noun, proper singular
NN	71	noun, singular or mass
IN	65	conjunction, subordinating or preposition
DT	56	determiner
JJ	56	adjective (English), other noun-modifier (Chinese)
,	41	punctuation mark, comma
NNS	29	noun, plural
CC	26	conjunction, coordinating
.	25	punctuation mark, sentence closer
VBZ	20	verb, 3rd person singular present
CD	14	cardinal number
VBD	13	verb, past tense
VBN	13	verb, past participle
RB	11	adverb
PRP	8	pronoun, personal
:	7	punctuation mark, colon or ellipsis
HYPH	7	punctuation mark, hyphen
VBG	7	verb, gerund or present participle
VB	7	verb, base form
POS	6	possessive ending
TO	6	infinitival "to"
_SP	5	whitespace
-LRB-	5	left round bracket
-RRB-	4	right round bracket
NNPS	4	noun, proper plural
VBP	3	verb, non-3rd person singular present
PRP$	3	pronoun, possessive
WDT	3	wh-determiner
JJS	2	adjective, superlative
XX	1	unknown
JJR	1	adjective, comparative
RBS	1	adverb, superlative
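
The two cells above sort the counter keys by hand. `Counter.most_common()` from the standard library gives the same descending ranking as a list of `(item, count)` pairs, and accepts an optional limit. A small sketch on a made-up tag list (the tags here are illustrative, not taken from the text):

```python
from collections import Counter

# Made-up tags standing in for [token.pos_ for token in doc]
sample_tags = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB"]
sample_freq = Counter(sample_tags)

# All tags, most frequent first
ranked = sample_freq.most_common()
for tag, count in ranked:
    print(tag, count, sep="\t")

# Only the two most frequent tags
top_two = sample_freq.most_common(2)
```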


Tokens can be filtered by their part of speech. Here we print only the proper nouns in the text.

In [8]:
filtered_tokens = []

for token in doc:
    if token.pos_ == 'PROPN':
        filtered_tokens.append(token)

print(filtered_tokens)

[New, Zealand, Māori, Aotearoa, Pacific, Ocean, North, Island, Te, Ika, a, Māui, South, Island, Te, Australia, Tasman, Sea, New, Caledonia, Fiji, Tonga, Southern, Alps, New, Zealand, Wellington, Auckland, New, Zealand, Polynesians, Māori, Abel, Tasman, European, New, Zealand, United, Kingdom, Māori, Treaty, Waitangi, New, Zealand, British, Empire, Māori, Māori, New, Zealand, New, Zealand, Māori, Asians, Pacific, Islanders, New, Zealand, Māori, English, Māori, New, Zealand, Sign, Language, English, Māori, New, Zealand, Parliament, Cabinet, Chris, Hipkins, Charles, III, New, Zealand, Realm, New, Zealand, Tokelau, Cook, Islands, Niue, New, Zealand, Ross, Dependency, New, Zealand, Antarctica, New, Zealand, United, Nations, Commonwealth, Nations, ANZUS, UKUSA, OECD, ASEAN, Six, Asia, Pacific, Economic, Cooperation, Pacific, Community, Pacific, Islands, Forum]


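The output shows that some Māori words are not handled accurately. For example, 'Te Waipounamu' appears only as 'Te', because the tokeniser left 'Waipounamu' attached to the punctuation that follows it and the resulting token was not tagged as a proper noun. This is expected: the pipeline we loaded, 'en_core_web_sm', is trained on English text, and spaCy does not currently support the Māori language.

We can output a frequency list for the proper nouns above: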

In [9]:
# select prop noun tokens only
filtered_tokens = [token.text for token in doc if token.pos_ == "PROPN"]

token_freq = Counter(filtered_tokens)
for token in sorted(token_freq, key=token_freq.get, reverse=True):
    print(token, token_freq[token])

New 16
Zealand 15
Māori 9
Pacific 5
Island 2
Te 2
Tasman 2
United 2
English 2
Islands 2
Nations 2
Aotearoa 1
Ocean 1
North 1
Ika 1
a 1
Māui 1
South 1
Australia 1
Sea 1
Caledonia 1
Fiji 1
Tonga 1
Southern 1
Alps 1
Wellington 1
Auckland 1
Polynesians 1
Abel 1
European 1
Kingdom 1
Treaty 1
Waitangi 1
British 1
Empire 1
Asians 1
Islanders 1
Sign 1
Language 1
Parliament 1
Cabinet 1
Chris 1
Hipkins 1
Charles 1
III 1
Realm 1
Tokelau 1
Cook 1
Niue 1
Ross 1
Dependency 1
Antarctica 1
Commonwealth 1
ANZUS 1
UKUSA 1
OECD 1
ASEAN 1
Six 1
Asia 1
Economic 1
Cooperation 1
Community 1
Forum 1
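
The list above counts 'New' and 'Zealand' separately because each token is tagged on its own. One way to recover multi-word names is to join runs of consecutive proper nouns. The helper below is only a sketch: `join_proper_nouns` is a hypothetical name, it works on plain `(text, pos)` pairs, and the sample pairs are hand-written to mirror tags seen in the output above.

```python
from collections import Counter
from itertools import groupby


def join_proper_nouns(tagged):
    """Join runs of consecutive PROPN tokens into single names.

    `tagged` is an iterable of (text, pos) pairs.
    """
    names = []
    for is_propn, run in groupby(tagged, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            names.append(" ".join(text for text, _ in run))
    return names


# Hand-written sample; in the notebook: [(t.text, t.pos_) for t in doc]
sample = [
    ("New", "PROPN"), ("Zealand", "PROPN"), ("lies", "VERB"), ("east", "ADV"),
    ("of", "ADP"), ("Australia", "PROPN"), ("across", "ADP"), ("the", "DET"),
    ("Tasman", "PROPN"), ("Sea", "PROPN"), (".", "PUNCT"),
]
names = join_proper_nouns(sample)
print(Counter(names).most_common())
```

This heuristic also merges unrelated names that happen to be adjacent, and it breaks names at hyphens, so it is a rough approximation of what named entity recognition does.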


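Next, we look at the tokens tagged as VERB. The tagger is not perfect: in the output below, the adjective 'tectonic' is included among the verbs.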

In [10]:
filtered_tokens = []

for token in doc:
    if token.pos_ == 'VERB':
        filtered_tokens.append(token)

print(filtered_tokens)

[consists, including, owe, tectonic, settled, began, settle, developed, became, record, signed, declared, became, resulted, became, gained, retaining, followed, Reflecting, derived, arising, increased, introduce, give, vote, ranks, has, perceived, retains, having, underwent, transformed, liberalised, dominates, followed, vested, elected, exercised, led, represented, organised, includes, governing]


Alternatively, we can use the fine-grained Penn Treebank tags, which spaCy uses for its English models, to make more precise distinctions, such as between verb forms (e.g., selecting only past-tense verbs, VBD).

In [11]:
filtered_tokens = []

for token in doc:
    if token.tag_ == 'VBD': # similar to above, except instead of pos_ we use tag_ to access the different tag set
        filtered_tokens.append(token)

print(filtered_tokens)

[were, began, developed, became, signed, declared, became, resulted, became, gained, was, underwent, transformed]
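
The same loop extends to several fine-grained tags at once by testing membership in a set. The sketch below assumes a hand-annotated `Doc` (built directly with `spacy.tokens.Doc`, so that it runs without a trained model); the words and tags are written by hand for illustration. With the notebook's `doc`, only the final comprehension is needed.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated sample (assumption: these tags are what a tagger would assign)
vocab = spacy.blank("en").vocab
words = ["New", "Zealand", "became", "a", "colony", "and", "was", "settled"]
tags = ["NNP", "NNP", "VBD", "DT", "NN", "CC", "VBD", "VBN"]
sample_doc = Doc(vocab, words=words, tags=tags)

# Select past-tense verbs and past participles in one pass
past_forms = {"VBD", "VBN"}
selected = [(token.text, token.tag_) for token in sample_doc if token.tag_ in past_forms]
print(selected)
```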


These annotations can also be used to remove or standardise specific types of tokens. In this case, we replace every number with a single placeholder token, "NUMBER". This can be useful when exploring collocation patterns that involve numbers in general rather than specific values.

In [12]:
no_numbers = []

for token in doc:
    if token.pos_ == 'NUM':
        no_numbers.append('NUMBER') # if we wanted to remove numbers completely change this line to: continue
    else:
        no_numbers.append(token)

print(no_numbers)

[ , New, Zealand, (, Māori, :, Aotearoa, ), is, an, island, country, in, the, southwestern, Pacific, Ocean, ., It, consists, of, 'NUMBER', main, landmasses, —, the, North, Island, (, Te, Ika, -, a, -, Māui, ), and, the, South, Island, (, Te, Waipounamu)—and, over, 'NUMBER', smaller, islands, ., It, is, the, sixth, -, largest, island, country, by, area, and, lies, east, of, Australia, across, the, Tasman, Sea, and, south, of, the, islands, of, New, Caledonia, ,, Fiji, ,, and, Tonga, ., The, country, 's, varied, topography, and, sharp, mountain, peaks, ,, including, the, Southern, Alps, ,, owe, much, to, tectonic, uplift, and, volcanic, eruptions, ., New, Zealand, 's, capital, city, is, Wellington, ,, and, its, most, populous, city, is, Auckland, ., 

, The, islands, of, New, Zealand, were, the, last, large, habitable, land, to, be, settled, by, humans, ., Between, about, 'NUMBER', and, 'NUMBER', ,, Polynesians, began, to, settle, in, the, islands, and, then, developed, a, distinctive, M
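
The cell above relies on the statistical tagger (`pos_ == 'NUM'`). spaCy tokens also expose a rule-based lexical flag, `like_num`, which needs no trained model. The sketch below uses it on a blank English pipeline (an assumption made for this example) and builds a plain string rather than a mixed list of tokens and strings.

```python
import spacy

# Assumption: blank English pipeline; like_num is a lexical attribute
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("It has two main landmasses and over 700 smaller islands.")

pieces = []
for token in sample_doc:
    # Replace number-like tokens, keep everything else unchanged
    replacement = "NUMBER" if token.like_num else token.text
    pieces.append(replacement + token.whitespace_)

normalised = "".join(pieces)
print(normalised)
```

Keeping each token's trailing whitespace means the result reads like the original sentence, which is convenient if it is passed to another tool afterwards.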

## Filtering by character types of tokens

Tokens can also be filtered by the types of characters they contain. In this example, we exclude tokens that contain non-alphabetic characters, such as digits or '$'. The available token attributes are listed [here](https://spacy.io/api/token#attributes). To filter differently, replace `is_alpha` with another boolean attribute, such as `is_digit`.

In [13]:
char_filtered = []

for token in doc:
    # keep only tokens made up entirely of alphabetic characters
    if token.is_alpha:
        char_filtered.append(token)

print(char_filtered)
# numbers and punctuation are gone

[New, Zealand, Māori, Aotearoa, is, an, island, country, in, the, southwestern, Pacific, Ocean, It, consists, of, two, main, landmasses, the, North, Island, Te, Ika, a, Māui, and, the, South, Island, Te, over, smaller, islands, It, is, the, sixth, largest, island, country, by, area, and, lies, east, of, Australia, across, the, Tasman, Sea, and, south, of, the, islands, of, New, Caledonia, Fiji, and, Tonga, The, country, varied, topography, and, sharp, mountain, peaks, including, the, Southern, Alps, owe, much, to, tectonic, uplift, and, volcanic, eruptions, New, Zealand, capital, city, is, Wellington, and, its, most, populous, city, is, Auckland, The, islands, of, New, Zealand, were, the, last, large, habitable, land, to, be, settled, by, humans, Between, about, and, Polynesians, began, to, settle, in, the, islands, and, then, developed, a, distinctive, Māori, culture, In, the, Dutch, explorer, Abel, Tasman, became, the, first, European, to, sight, and, record, New, Zealand, In, repres
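
Because these flags are ordinary boolean attributes, the flag name can be passed as a parameter. The following is a sketch, assuming a hypothetical helper `tokens_with_flag` and a blank English pipeline (the flags are lexical, so no trained model is required):

```python
import spacy


def tokens_with_flag(doc, flag):
    """Return the texts of tokens whose boolean attribute `flag` is True."""
    return [token.text for token in doc if getattr(token, flag)]


nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("Between about 1280 and 1350, Polynesians began to settle.")

for flag in ("is_alpha", "is_digit", "is_punct"):
    print(flag, tokens_with_flag(sample_doc, flag), sep="\t")
```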

## Noun phrases / chunks

Noun chunks are base noun phrases: a noun together with the words that describe it. Extracting them is a simple first step towards identifying the entities mentioned in a text.

In [14]:
for chunk in doc.noun_chunks:
    print(chunk.text)

 New Zealand
(Māori
Aotearoa
an island country
the southwestern Pacific Ocean
It
two main landmasses
the North Island
a
the South Island
Te Waipounamu)—and over 700 smaller islands
It
the sixth-largest island country
area
lies
Australia
the Tasman Sea
south
the islands
New Caledonia
Fiji
Tonga
The country's varied topography
sharp mountain peaks
the Southern Alps
uplift and volcanic eruptions
New Zealand's capital city
Wellington
its most populous city
Auckland
The islands
New Zealand
the last large habitable land
humans
Polynesians
the islands
a distinctive Māori culture
the Dutch explorer Abel Tasman
the first European
sight
New Zealand
representatives
the United Kingdom and Māori chiefs
the Treaty
Waitangi
which
its English version
British sovereignty
the islands
New Zealand
a colony
the British Empire
a series
conflicts
the colonial government
Māori tribes
the alienation
confiscation
large amounts
Māori land
New Zealand
a dominion
it
full statutory independence
the monarch
head
sta
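
Each noun chunk is a `Span` whose `root` token connects the phrase to the rest of the sentence, which helps when deciding whether a chunk is a subject, an object, and so on. The sketch below assumes a hypothetical helper `chunk_rows` and a small hand-annotated `Doc` (the tags, heads and labels are written by hand so that it runs without a trained model); with the notebook's `doc`, calling `chunk_rows(doc)` would be enough.

```python
import spacy
from spacy.tokens import Doc


def chunk_rows(doc):
    """Return (chunk text, root text, root dependency, head text) per noun chunk."""
    return [
        (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
        for chunk in doc.noun_chunks
    ]


# Hand-annotated sample; heads are absolute token indices
vocab = spacy.blank("en").vocab
sample_doc = Doc(
    vocab,
    words=["Wellington", "is", "the", "capital", "."],
    pos=["PROPN", "AUX", "DET", "NOUN", "PUNCT"],
    heads=[1, 1, 3, 1, 1],
    deps=["nsubj", "ROOT", "det", "attr", "punct"],
)
for row in chunk_rows(sample_doc):
    print(row)
```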

## Named Entity Recognition

spaCy can identify named entities in text. The entity labels used by the English models are listed in the documentation [here](https://spacy.io/models/en#en_core_web_sm); expand the "Label Scheme" section on that page to see the full set.

For a description of a specific label, use the `spacy.explain` function. Try different labels in the following cell to see their descriptions.

In [15]:
print(spacy.explain('NORP'))

Nationalities or religious or political groups


Here we count the named entities found in our sample text by label, again using `spacy.explain` to describe each label.

In [16]:
entities = [ent.label_ for ent in doc.ents]

entities_freq = Counter(entities)
for entity in sorted(entities_freq, key=entities_freq.get, reverse=True):
    print(entity, entities_freq[entity], spacy.explain(entity), sep='\t')

GPE	25	Countries, cities, states
ORG	13	Companies, agencies, institutions, etc.
PERSON	8	People, including fictional
DATE	8	Absolute or relative dates or periods
NORP	8	Nationalities or religious or political groups
PRODUCT	6	Objects, vehicles, foods, etc. (not services)
LOC	5	Non-GPE locations, mountain ranges, bodies of water
CARDINAL	5	Numerals that do not fall under another type
ORDINAL	4	"first", "second", etc.
LANGUAGE	3	Any named language
WORK_OF_ART	1	Titles of books, songs, etc.
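
The counts above do not show which strings sit behind each label. Grouping the entity texts by label makes that visible. The sketch below assumes a hypothetical helper `entities_by_label` and a hand-annotated `Doc` (entities set manually with `Span`, so that it runs without a trained model); with the notebook's `doc`, calling `entities_by_label(doc)` would be enough.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc, Span


def entities_by_label(doc):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hand-annotated sample; the spans and labels are set by hand for illustration
vocab = spacy.blank("en").vocab
sample_doc = Doc(
    vocab,
    words=["New", "Zealand", "lies", "east", "of", "Australia",
           "across", "the", "Tasman", "Sea", "."],
)
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="GPE"),
    Span(sample_doc, 5, 6, label="GPE"),
    Span(sample_doc, 7, 10, label="LOC"),
]
grouped = entities_by_label(sample_doc)
print(grouped)
```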


Here are all the entities, listed with their start and end character offsets in the text.

In [17]:
for ent in doc.ents:
    print(ent.label_, ent.start_char, ent.end_char, ent.text, sep='\t')

GPE	1	12	New Zealand
PERSON	14	19	Māori
LOC	72	85	Pacific Ocean
CARDINAL	102	105	two
LOC	122	138	the North Island
LOC	159	175	the South Island
CARDINAL	201	204	700
ORDINAL	232	237	sixth
GPE	286	295	Australia
LOC	303	317	the Tasman Sea
GPE	346	359	New Caledonia
GPE	361	365	Fiji
GPE	371	376	Tonga
GPE	517	530	New Zealand's
GPE	531	543	capital city
GPE	547	557	Wellington
GPE	589	597	Auckland
GPE	615	626	New Zealand
DATE	687	714	Between about 1280 and 1350
ORG	716	727	Polynesians
PRODUCT	792	797	Māori
DATE	810	814	1642
NORP	820	825	Dutch
PERSON	835	846	Abel Tasman
ORDINAL	858	863	first
NORP	864	872	European
GPE	893	904	New Zealand
DATE	909	913	1840
GPE	934	952	the United Kingdom
PERSON	957	962	Māori
WORK_OF_ART	977	999	the Treaty of Waitangi
LANGUAGE	1014	1021	English
NORP	1039	1046	British
DATE	1080	1084	1841
GPE	1086	1097	New Zealand
GPE	1121	1139	the British Empire
PERSON	1213	1218	Māori
PRODUCT	1290	1295	Māori
GPE	1302	1313	New Zealand
DATE	1335	1339	1907
DATE	1382	1386	1947
DATE	1428	1
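
In the listing above, 'Māori' is labelled PERSON or PRODUCT. One possible remedy is a rule-based `entity_ruler` component with an explicit pattern. The sketch below assumes a blank English pipeline and a single hand-written pattern, and the choice of NORP as the label is itself an assumption. With `en_core_web_sm`, the ruler would typically be added with `nlp.add_pipe('entity_ruler', before='ner')` so that the statistical recogniser works around the rule-based matches.

```python
import spacy

# Assumption: blank pipeline with only a rule-based entity ruler
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Assumption: NORP is the label we want for "Māori" in this text
ruler.add_patterns([{"label": "NORP", "pattern": "Māori"}])

sample_doc = nlp_rules("Polynesians developed a distinctive Māori culture.")
found = [(ent.text, ent.label_) for ent in sample_doc.ents]
print(found)
```

A single label is a simplification: elsewhere in the text 'Māori' names a language, so a real pattern set would need more context than one string.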

spaCy also includes a visualiser, displaCy, that highlights the named entities in the text:

In [18]:
spacy.displacy.render(doc, style='ent', jupyter=True)
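
Outside a notebook, the same visualisation can be obtained as an HTML string and saved to a file. The sketch below assumes a small hand-annotated `Doc` (entities set by hand so that it runs without a trained model) and a hypothetical output filename mentioned only in a comment.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc, Span

# Hand-annotated sample; the entity spans are set manually for illustration
vocab = spacy.blank("en").vocab
sample_doc = Doc(vocab, words=["New", "Zealand", "lies", "east", "of", "Australia", "."])
sample_doc.ents = [
    Span(sample_doc, 0, 2, label="GPE"),
    Span(sample_doc, 5, 6, label="GPE"),
]

# jupyter=False makes render() return the markup instead of displaying it;
# page=True wraps it in a complete HTML page
html = displacy.render(sample_doc, style="ent", jupyter=False, page=True)

# The string could then be written to a file of your choice,
# e.g. "entities.html" (hypothetical name)
print(html[:80])
```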

## Dependency Parsing

Dependency parsing analyses the grammatical structure of a sentence by linking each word to its head with a labelled relation (for example, linking a noun to the verb it is the subject of).

The dependency labels used by spaCy's English models are listed in the [model documentation](https://spacy.io/models/en#en_core_web_sm) (expand the "label scheme" section and look for the PARSER labels).

The displaCy visualiser can also draw the dependency tree of a sentence:

In [19]:
# make a list of sentences
sentences = list(doc.sents)
# the whole list could be passed to render(); here we display only the second sentence
spacy.displacy.render(sentences[1], style='dep', jupyter=True, options={'distance': 120})
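
The parse can also be read programmatically: every token has a dependency label (`token.dep_`) and a head (`token.head`). The sketch below assumes a hypothetical helper `dependency_rows` and a hand-annotated `Doc` (heads and labels written by hand so that it runs without a trained model); in the notebook, `dependency_rows(sentences[1])` would list the relations of the sentence drawn above.

```python
import spacy
from spacy.tokens import Doc


def dependency_rows(span):
    """Return (token, dependency label, head) for every token in a Doc or Span."""
    return [(token.text, token.dep_, token.head.text) for token in span]


# Hand-annotated sample; heads are absolute token indices
vocab = spacy.blank("en").vocab
sample_doc = Doc(
    vocab,
    words=["Wellington", "is", "the", "capital", "."],
    heads=[1, 1, 3, 1, 1],
    deps=["nsubj", "ROOT", "det", "attr", "punct"],
)
for text, dep, head in dependency_rows(sample_doc):
    print(text, dep, head, sep="\t")
```

The root of the sentence is the one token that is its own head, which is why 'is' appears on both sides of its row.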