# Ling 380 - Week 6

# Normalizing data. Named entity recognition
By now we have learned the basics of working with Python and its data types, and we have learned to process files. We are now moving on to more interesting things with textual data. In this lesson, we will learn how to clean and normalize data and how to identify named entities.

We will continue to use NLTK, but we will also install another powerful Natural Language Processing package, spaCy. If you haven't installed it yet, go to the `spacy_install.ipynb` notebook and follow the instructions there. Then come back here to import and use spaCy.

In [1]:
# optional: pandas for storing information into a dataframe
import pandas as pd

# we need to import NLTK every time we want to use it
import nltk

# import the NLTK packages we know we need
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# these two NLTK resources may already be on your system, but we download them just in case
nltk.download('punkt')
nltk.download('wordnet')

# import spaCy and the small English language model
import spacy
nlp = spacy.load("en_core_web_sm")

# this does prettier displays on spaCy
from spacy import displacy

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yifangyuan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yifangyuan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Normalizing data
Normalizing refers to a set of processes to make data uniform. This is generally useful to count things the correct way and to get to the essence of the words in a text. Think of counting the instances of the word "the" in a text. You'll want to make sure that "the", "The", and "the?" all look the same before you count them. Normalization includes:

* Converting all words to lowercase
* Removing or separating punctuation
* Stemming - stripping endings to end up with the _stem_, which may not be a real word (endings -> end; went -> went)
* Lemmatizing - reducing a word to its _lemma_, the dictionary form (endings -> ending; went -> go)

Most NLP packages (NLTK, spaCy) have built-in methods to do this. But it's also good to know how to do it yourself, in case you want to control what the output looks like.
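As a rough illustration of doing it yourself, the sketch below lowercases a short made-up sentence, separates words from punctuation with a regular expression, and counts the instances of "the". The sample sentence and the `simple_normalize` helper are invented for this example; this is not how NLTK or spaCy tokenize.

```python
import re
from collections import Counter

def simple_normalize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

# a made-up sentence, just for this sketch
sample = "The cat saw the dog. Is that the?"
tokens = simple_normalize(sample)
counts = Counter(tokens)

print(tokens)
print(counts["the"])
```

Because the text is lowercased and the question mark is split off, "The", "the", and "the?" all end up counted as the same word.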

In [2]:
# an excerpt from The Peak, https://the-peak.ca/2025/01/sfu-study-calls-for-utility-scale-solar-power-systems-in-canada/
text1 = "In December 2024, Clean Energy Research Group (CERG) published a paper calling for Canada to build “mass utility-scale solar mega projects,” according to an SFU news release. Utility-scale solar “refers to large solar installations designed to feed power directly onto the electric grid.” An electric grid is an “intricate system” that provides electricity “all the way from its generation to the customers that use it for their daily needs.”"

In [3]:
# a made-up text
text2 = "This gotta be the wéïrdest bit of text that's ne'er gonna be The thing you'll encounter, but I have to give, gave, given, something!\r, even the weirdest. Just tryna throw everything into a made-up bit that isn't making any sense.<br> And here's a sentence with the irregular plural feet and one with the irregular plural geese."

### Tokenizing without lowercase
We will use NLTK to tokenize the texts. You can print the list of tokens, and also print the count of types and tokens. You'll see that capitalized and lowercase forms, such as 'An' and 'an' in `text1` or 'And' and 'and' in `text2`, count as two different types, even though they are really the same word. That's why we lowercase, either before or after tokenizing, but always before counting. Compare this bit of code and its output to what happens when we lowercase.

In [4]:
tokens1 = nltk.word_tokenize(text1)
n_tokens1 = len(tokens1)
n_types1 = len(set(tokens1))

In [5]:
tokens1

['In',
 'December',
 '2024',
 ',',
 'Clean',
 'Energy',
 'Research',
 'Group',
 '(',
 'CERG',
 ')',
 'published',
 'a',
 'paper',
 'calling',
 'for',
 'Canada',
 'to',
 'build',
 '“',
 'mass',
 'utility-scale',
 'solar',
 'mega',
 'projects',
 ',',
 '”',
 'according',
 'to',
 'an',
 'SFU',
 'news',
 'release',
 '.',
 'Utility-scale',
 'solar',
 '“',
 'refers',
 'to',
 'large',
 'solar',
 'installations',
 'designed',
 'to',
 'feed',
 'power',
 'directly',
 'onto',
 'the',
 'electric',
 'grid.',
 '”',
 'An',
 'electric',
 'grid',
 'is',
 'an',
 '“',
 'intricate',
 'system',
 '”',
 'that',
 'provides',
 'electricity',
 '“',
 'all',
 'the',
 'way',
 'from',
 'its',
 'generation',
 'to',
 'the',
 'customers',
 'that',
 'use',
 'it',
 'for',
 'their',
 'daily',
 'needs',
 '.',
 '”']

In [6]:
for t in set(tokens1):
    print(t)

In
Canada
all
Energy
intricate
solar
its
generation
customers
Group
CERG
provides
power
from
electric
system
a
directly
for
2024
refers
“
is
calling
to
news
onto
according
their
Utility-scale
SFU
designed
use
.
that
utility-scale
(
,
published
”
An
grid
the
December
mega
build
)
mass
large
Research
an
daily
way
electricity
paper
release
grid.
feed
needs
projects
Clean
installations
it


In [7]:
print(n_tokens1)
print(n_types1)

83
63


In [8]:
# same, but for text2
tokens2 = nltk.word_tokenize(text2)
n_tokens2 = len(tokens2)
n_types2 = len(set(tokens2))

In [9]:
tokens2

['This',
 'got',
 'ta',
 'be',
 'the',
 'wéïrdest',
 'bit',
 'of',
 'text',
 'that',
 "'s",
 "ne'er",
 'gon',
 'na',
 'be',
 'The',
 'thing',
 'you',
 "'ll",
 'encounter',
 ',',
 'but',
 'I',
 'have',
 'to',
 'give',
 ',',
 'gave',
 ',',
 'given',
 ',',
 'something',
 '!',
 ',',
 'even',
 'the',
 'weirdest',
 '.',
 'Just',
 'tryna',
 'throw',
 'everything',
 'into',
 'a',
 'made-up',
 'bit',
 'that',
 'is',
 "n't",
 'making',
 'any',
 'sense.',
 '<',
 'br',
 '>',
 'And',
 'here',
 "'s",
 'a',
 'sentence',
 'with',
 'the',
 'irregular',
 'plural',
 'feet',
 'and',
 'one',
 'with',
 'the',
 'irregular',
 'plural',
 'geese',
 '.']

In [10]:
for t in set(tokens2):
    print(t)

into
thing
weirdest
made-up
irregular
plural
any
I
!
feet
something
you
tryna
's
<
here
geese
And
a
ta
got
is
'll
ne'er
to
even
na
that
.
>
,
throw
have
given
and
bit
making
sense.
encounter
the
give
one
Just
This
text
gon
gave
of
sentence
be
with
everything
but
n't
br
wéïrdest
The


In [11]:
print(n_tokens2)
print(n_types2)

73
57


### Lowercasing after tokenizing
Here we lowercase the tokens we already have. Compare the numbers and the output now.

In [12]:
# apply the lower() method to each token (each token is a string)
tokens1_lower = [w.lower() for w in tokens1]

In [13]:
tokens1_lower

['in',
 'december',
 '2024',
 ',',
 'clean',
 'energy',
 'research',
 'group',
 '(',
 'cerg',
 ')',
 'published',
 'a',
 'paper',
 'calling',
 'for',
 'canada',
 'to',
 'build',
 '“',
 'mass',
 'utility-scale',
 'solar',
 'mega',
 'projects',
 ',',
 '”',
 'according',
 'to',
 'an',
 'sfu',
 'news',
 'release',
 '.',
 'utility-scale',
 'solar',
 '“',
 'refers',
 'to',
 'large',
 'solar',
 'installations',
 'designed',
 'to',
 'feed',
 'power',
 'directly',
 'onto',
 'the',
 'electric',
 'grid.',
 '”',
 'an',
 'electric',
 'grid',
 'is',
 'an',
 '“',
 'intricate',
 'system',
 '”',
 'that',
 'provides',
 'electricity',
 '“',
 'all',
 'the',
 'way',
 'from',
 'its',
 'generation',
 'to',
 'the',
 'customers',
 'that',
 'use',
 'it',
 'for',
 'their',
 'daily',
 'needs',
 '.',
 '”']

In [14]:
n_types1_lower = len(set(tokens1_lower))

In [15]:
print(n_tokens1)
print(n_types1)
print(n_types1_lower)

83
63
61


In [16]:
# same for text 2
tokens2_lower = [w.lower() for w in tokens2]

In [17]:
tokens2_lower

['this',
 'got',
 'ta',
 'be',
 'the',
 'wéïrdest',
 'bit',
 'of',
 'text',
 'that',
 "'s",
 "ne'er",
 'gon',
 'na',
 'be',
 'the',
 'thing',
 'you',
 "'ll",
 'encounter',
 ',',
 'but',
 'i',
 'have',
 'to',
 'give',
 ',',
 'gave',
 ',',
 'given',
 ',',
 'something',
 '!',
 ',',
 'even',
 'the',
 'weirdest',
 '.',
 'just',
 'tryna',
 'throw',
 'everything',
 'into',
 'a',
 'made-up',
 'bit',
 'that',
 'is',
 "n't",
 'making',
 'any',
 'sense.',
 '<',
 'br',
 '>',
 'and',
 'here',
 "'s",
 'a',
 'sentence',
 'with',
 'the',
 'irregular',
 'plural',
 'feet',
 'and',
 'one',
 'with',
 'the',
 'irregular',
 'plural',
 'geese',
 '.']

In [18]:
n_types2_lower = len(set(tokens2_lower))

In [19]:
print(n_tokens2)
print(n_types2)
print(n_types2_lower)

73
57
55
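The three counts above can also be wrapped in a small helper, so that the comparison is easy to repeat for any list of tokens. This is only a sketch: the function name `count_tokens_and_types` and the toy token list are made up for illustration.

```python
def count_tokens_and_types(tokens):
    """Return (number of tokens, number of types, number of lowercased types)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))                         # 'An' and 'an' are different types
    n_types_lower = len({t.lower() for t in tokens})   # 'An' and 'an' collapse into one
    return n_tokens, n_types, n_types_lower

# a tiny hand-made token list, not the full tokens1 or tokens2
toy_tokens = ["An", "electric", "grid", "is", "an", "intricate", "system", "."]
n_tokens, n_types, n_types_lower = count_tokens_and_types(toy_tokens)

print(n_tokens, n_types, n_types_lower)
```

In the notebook you could call the same function on `tokens1` or `tokens2` instead of the toy list.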


## Stemming and lemmatizing with NLTK

### Stemming
Stemmers remove any endings that may be [inflectional suffixes](https://en.wikipedia.org/wiki/Inflection) in English. There are different versions of stemmers, even within NLTK. See the [overview of stemming in NLTK](https://www.nltk.org/howto/stem.html). Here, we'll use the [Porter Stemmer](https://www.nltk.org/_modules/nltk/stem/porter.html), developed by Martin Porter. 

Look at the output carefully and note where things don't seem to make sense. This is because Porter strips anything that looks like a suffix, including the '-er' in 'December', because '-er' is an inflectional ending in words like 'taller'.
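To see why rule-based stemming produces odd results, here is a deliberately naive suffix stripper. It is not the Porter algorithm, whose rules are more careful; the suffix list and the `naive_stem` function are made up for this sketch. It only shows what happens when a rule looks at the end of a word and nothing else.

```python
# NOT the Porter algorithm: a hand-picked, hypothetical list of suffixes
SUFFIXES = ["ing", "ed", "er", "s"]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["taller", "december", "projects", "this", "went"]:
    print(w, "->", naive_stem(w))
```

The rule that correctly turns 'taller' into 'tall' also turns 'december' into 'decemb', and the plural rule turns 'this' into 'thi'. An irregular form like 'went' is left alone, because there is no suffix to strip.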

### Lemmatization
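Lemmatizers are a bit smarter about inflection and are able to identify lemmas (dictionary forms), even when no suffixes are involved (gave -> give; feet -> foot). We'll use the [WordNet lemmatizer](https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet) in NLTK. By default it treats every word as a noun; to lemmatize verbs such as 'gave', you need to pass it the part of speech (`pos='v'`).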

In [20]:
# assign the stemmer to a variable, 'stemmer'
stemmer = PorterStemmer()

# go through the list of tokens (tokens1),
# lowercase each token, and stem it
# the list comprehension (the square brackets)
# applies the stemmer to every token and collects the results in a new list
tokens1_st = [stemmer.stem(token.lower()) for token in tokens1]
tokens2_st = [stemmer.stem(token.lower()) for token in tokens2]

In [21]:
tokens2_st

['thi',
 'got',
 'ta',
 'be',
 'the',
 'wéïrdest',
 'bit',
 'of',
 'text',
 'that',
 "'s",
 "ne'er",
 'gon',
 'na',
 'be',
 'the',
 'thing',
 'you',
 "'ll",
 'encount',
 ',',
 'but',
 'i',
 'have',
 'to',
 'give',
 ',',
 'gave',
 ',',
 'given',
 ',',
 'someth',
 '!',
 ',',
 'even',
 'the',
 'weirdest',
 '.',
 'just',
 'tryna',
 'throw',
 'everyth',
 'into',
 'a',
 'made-up',
 'bit',
 'that',
 'is',
 "n't",
 'make',
 'ani',
 'sense.',
 '<',
 'br',
 '>',
 'and',
 'here',
 "'s",
 'a',
 'sentenc',
 'with',
 'the',
 'irregular',
 'plural',
 'feet',
 'and',
 'one',
 'with',
 'the',
 'irregular',
 'plural',
 'gees',
 '.']

In [22]:
tokens1

['In',
 'December',
 '2024',
 ',',
 'Clean',
 'Energy',
 'Research',
 'Group',
 '(',
 'CERG',
 ')',
 'published',
 'a',
 'paper',
 'calling',
 'for',
 'Canada',
 'to',
 'build',
 '“',
 'mass',
 'utility-scale',
 'solar',
 'mega',
 'projects',
 ',',
 '”',
 'according',
 'to',
 'an',
 'SFU',
 'news',
 'release',
 '.',
 'Utility-scale',
 'solar',
 '“',
 'refers',
 'to',
 'large',
 'solar',
 'installations',
 'designed',
 'to',
 'feed',
 'power',
 'directly',
 'onto',
 'the',
 'electric',
 'grid.',
 '”',
 'An',
 'electric',
 'grid',
 'is',
 'an',
 '“',
 'intricate',
 'system',
 '”',
 'that',
 'provides',
 'electricity',
 '“',
 'all',
 'the',
 'way',
 'from',
 'its',
 'generation',
 'to',
 'the',
 'customers',
 'that',
 'use',
 'it',
 'for',
 'their',
 'daily',
 'needs',
 '.',
 '”']

In [23]:
# assign the lemmatizer to a variable
lemmatizer = WordNetLemmatizer()

# go through the list of tokens and lemmatize
tokens1_lm = [lemmatizer.lemmatize(token.lower()) for token in tokens1]
tokens2_lm = [lemmatizer.lemmatize(token.lower()) for token in tokens2]

In [24]:
tokens1_lm

['in',
 'december',
 '2024',
 ',',
 'clean',
 'energy',
 'research',
 'group',
 '(',
 'cerg',
 ')',
 'published',
 'a',
 'paper',
 'calling',
 'for',
 'canada',
 'to',
 'build',
 '“',
 'mass',
 'utility-scale',
 'solar',
 'mega',
 'project',
 ',',
 '”',
 'according',
 'to',
 'an',
 'sfu',
 'news',
 'release',
 '.',
 'utility-scale',
 'solar',
 '“',
 'refers',
 'to',
 'large',
 'solar',
 'installation',
 'designed',
 'to',
 'feed',
 'power',
 'directly',
 'onto',
 'the',
 'electric',
 'grid.',
 '”',
 'an',
 'electric',
 'grid',
 'is',
 'an',
 '“',
 'intricate',
 'system',
 '”',
 'that',
 'provides',
 'electricity',
 '“',
 'all',
 'the',
 'way',
 'from',
 'it',
 'generation',
 'to',
 'the',
 'customer',
 'that',
 'use',
 'it',
 'for',
 'their',
 'daily',
 'need',
 '.',
 '”']

In [25]:
tokens2_lm

['this',
 'got',
 'ta',
 'be',
 'the',
 'wéïrdest',
 'bit',
 'of',
 'text',
 'that',
 "'s",
 "ne'er",
 'gon',
 'na',
 'be',
 'the',
 'thing',
 'you',
 "'ll",
 'encounter',
 ',',
 'but',
 'i',
 'have',
 'to',
 'give',
 ',',
 'gave',
 ',',
 'given',
 ',',
 'something',
 '!',
 ',',
 'even',
 'the',
 'weirdest',
 '.',
 'just',
 'tryna',
 'throw',
 'everything',
 'into',
 'a',
 'made-up',
 'bit',
 'that',
 'is',
 "n't",
 'making',
 'any',
 'sense.',
 '<',
 'br',
 '>',
 'and',
 'here',
 "'s",
 'a',
 'sentence',
 'with',
 'the',
 'irregular',
 'plural',
 'foot',
 'and',
 'one',
 'with',
 'the',
 'irregular',
 'plural',
 'goose',
 '.']
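In the output above, 'feet' and 'geese' were reduced to 'foot' and 'goose', but 'gave' and 'given' were left unchanged. That is because the words were looked up as nouns. The sketch below illustrates the idea with a tiny hand-made lookup table; `TOY_LEXICON` and `toy_lemmatize` are invented for this example and are not part of NLTK.

```python
# hypothetical mini-lexicon: (word form, part of speech) -> lemma
TOY_LEXICON = {
    ("gave", "v"): "give",
    ("given", "v"): "give",
    ("feet", "n"): "foot",
    ("geese", "n"): "goose",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; fall back to the word itself."""
    word = word.lower()
    return TOY_LEXICON.get((word, pos), word)

print(toy_lemmatize("feet"))            # looked up as a noun (the default)
print(toy_lemmatize("gave"))            # looked up as a noun: no entry, so unchanged
print(toy_lemmatize("gave", pos="v"))   # looked up as a verb
```

With NLTK itself, the analogous call would be `lemmatizer.lemmatize('gave', pos='v')`, which looks the word up as a verb instead of a noun.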

### Other options: remove non-ASCII characters, remove HTML tags
There are additional things you may want to do to clean and normalize text, including converting everything to ASCII (wéïrdest -> weirdest) and removing HTML tags (`<br>` -> ''). You'll probably want to do that before tokenization, so that the angle brackets in HTML don't get tokenized as punctuation.
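One possible way to do both steps uses only Python's standard library. This is a minimal sketch, not a robust HTML cleaner: the helper names `strip_html_tags` and `to_ascii` and the sample string are made up for this example, and the regular expression only catches simple tags.

```python
import html
import re
import unicodedata

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space, then decode entities."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(no_tags)   # e.g. &amp; -> &

def to_ascii(text):
    """Decompose accented characters and drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)   # é -> e + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")

# a made-up sample in the spirit of text2
sample = "the wéïrdest bit.<br> And more &amp; more"
cleaned = to_ascii(strip_html_tags(sample))

print(cleaned)
```

Be careful with the ASCII step: characters that have no ASCII equivalent, such as the curly quotes in `text1`, are simply dropped by `to_ascii`, so only use it when that loss is acceptable.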

The [emoji](https://pypi.org/project/emoji/) library can also convert Unicode emoji into descriptions (🤗 -> 'hugging face').

## Normalizing text with spaCy
Now that you have seen how to do this with NLTK, the good news is that spaCy will do pretty much everything you need to do to clean and normalize text. It will also give you morphological and syntactic information about it. Go to the `spacy_install.ipynb` notebook for more information on how spaCy works.

We call spaCy by using the `nlp` object and passing it the text that we want to process. Then we can query and print the information contained in the `doc` object that spaCy creates.

In [26]:
# process the text with spaCy
doc1 = nlp(text1)
doc2 = nlp(text2)

In [27]:
# print the 'text' attribute of each of the tokens
for token in doc1:
    print(token.text)

In
December
2024
,
Clean
Energy
Research
Group
(
CERG
)
published
a
paper
calling
for
Canada
to
build
“
mass
utility
-
scale
solar
mega
projects
,
”
according
to
an
SFU
news
release
.
Utility
-
scale
solar
“
refers
to
large
solar
installations
designed
to
feed
power
directly
onto
the
electric
grid
.
”
An
electric
grid
is
an
“
intricate
system
”
that
provides
electricity
“
all
the
way
from
its
generation
to
the
customers
that
use
it
for
their
daily
needs
.
”


In [28]:
# if you want to see this in a single line, you can join the strings in the list, with a space between each
print(" ".join([token.text for token in doc1]))
print() # print an empty line, so that the two texts are separated on the screen
print(" ".join([token.text for token in doc2]))

In December 2024 , Clean Energy Research Group ( CERG ) published a paper calling for Canada to build “ mass utility - scale solar mega projects , ” according to an SFU news release . Utility - scale solar “ refers to large solar installations designed to feed power directly onto the electric grid . ” An electric grid is an “ intricate system ” that provides electricity “ all the way from its generation to the customers that use it for their daily needs . ”

This got ta be the wéïrdest bit of text that 's ne'er gon na be The thing you 'll encounter , but I have to give , gave , given , something !  , even the weirdest . Just tryna throw everything into a made - up bit that is n't making any sense.<br > And here 's a sentence with the irregular plural feet and one with the irregular plural geese .


In [29]:
# print the lemmas
print(" ".join([token.lemma_ for token in doc1]))
print() # empty line between the two texts
print(" ".join([token.lemma_ for token in doc2]))

in December 2024 , Clean Energy Research Group ( CERG ) publish a paper call for Canada to build " mass utility - scale solar mega project , " accord to an SFU news release . utility - scale solar " refer to large solar installation design to feed power directly onto the electric grid . " an electric grid be an " intricate system " that provide electricity " all the way from its generation to the customer that use it for their daily need . "

this get to be the wéïrd bit of text that be ne'er go to be the thing you will encounter , but I have to give , give , give , something !  , even the weird . just tryna throw everything into a make - up bit that be not make any sense.<br > and here be a sentence with the irregular plural foot and one with the irregular plural geese .


In [30]:
# print the part of speech of each word after the word
print(" ".join([f"{token.text}/{token.pos_}" for token in doc1]))
print() # empty line between the two texts
print(" ".join([f"{token.text}/{token.pos_}" for token in doc2]))

In/ADP December/PROPN 2024/NUM ,/PUNCT Clean/PROPN Energy/PROPN Research/PROPN Group/PROPN (/PUNCT CERG/PROPN )/PUNCT published/VERB a/DET paper/NOUN calling/VERB for/SCONJ Canada/PROPN to/PART build/VERB “/PUNCT mass/ADJ utility/NOUN -/PUNCT scale/NOUN solar/ADJ mega/ADJ projects/NOUN ,/PUNCT ”/PUNCT according/VERB to/ADP an/DET SFU/PROPN news/NOUN release/NOUN ./PUNCT Utility/NOUN -/PUNCT scale/NOUN solar/NOUN “/PUNCT refers/VERB to/ADP large/ADJ solar/ADJ installations/NOUN designed/VERB to/PART feed/VERB power/NOUN directly/ADV onto/ADP the/DET electric/ADJ grid/NOUN ./PUNCT ”/PUNCT An/DET electric/ADJ grid/NOUN is/AUX an/DET “/PUNCT intricate/ADJ system/NOUN ”/PUNCT that/PRON provides/VERB electricity/NOUN “/PUNCT all/DET the/DET way/NOUN from/ADP its/PRON generation/NOUN to/ADP the/DET customers/NOUN that/PRON use/VERB it/PRON for/ADP their/PRON daily/ADJ needs/NOUN ./PUNCT ”/PUNCT

This/PRON got/VERB ta/PART be/AUX the/DET wéïrdest/ADJ bit/NOUN of/ADP text/NOUN that/PRON 's/AUX

In [31]:
# the doc object in spaCy contains all kinds of information
# including rich morphology for each word
for token in doc1:
    print(token.text, "\t\t", token.lemma_, "\t\t", token.pos_, "\t\t", token.morph) # the tabs (\t) roughly line the output up in columns

In 		 in 		 ADP 		 
December 		 December 		 PROPN 		 Number=Sing
2024 		 2024 		 NUM 		 NumType=Card
, 		 , 		 PUNCT 		 PunctType=Comm
Clean 		 Clean 		 PROPN 		 Number=Sing
Energy 		 Energy 		 PROPN 		 Number=Sing
Research 		 Research 		 PROPN 		 Number=Sing
Group 		 Group 		 PROPN 		 Number=Sing
( 		 ( 		 PUNCT 		 PunctSide=Ini|PunctType=Brck
CERG 		 CERG 		 PROPN 		 Number=Sing
) 		 ) 		 PUNCT 		 PunctSide=Fin|PunctType=Brck
published 		 publish 		 VERB 		 Tense=Past|VerbForm=Fin
a 		 a 		 DET 		 Definite=Ind|PronType=Art
paper 		 paper 		 NOUN 		 Number=Sing
calling 		 call 		 VERB 		 Aspect=Prog|Tense=Pres|VerbForm=Part
for 		 for 		 SCONJ 		 
Canada 		 Canada 		 PROPN 		 Number=Sing
to 		 to 		 PART 		 
build 		 build 		 VERB 		 VerbForm=Inf
“ 		 " 		 PUNCT 		 PunctSide=Ini|PunctType=Quot
mass 		 mass 		 ADJ 		 Degree=Pos
utility 		 utility 		 NOUN 		 Number=Sing
- 		 - 		 PUNCT 		 PunctType=Dash
scale 		 scale 		 NOUN 		 Number=Sing
solar 		 solar 		 ADJ 		 Degree=Pos
mega 		 me

In [32]:
# if you don't know what the abbreviations mean, 
# you can ask for an explanation
spacy.explain("ADP")

'adposition'

In [33]:
# doc also includes syntactic information about heads and dependents:
# for each token, we print its dependency label and its head
for token in doc1:
    print(token.text, "\t\t", token.pos_, "\t\t", token.dep_, "\t\t", token.head)

In 		 ADP 		 prep 		 published
December 		 PROPN 		 pobj 		 In
2024 		 NUM 		 nummod 		 December
, 		 PUNCT 		 punct 		 published
Clean 		 PROPN 		 compound 		 Energy
Energy 		 PROPN 		 compound 		 Group
Research 		 PROPN 		 compound 		 Group
Group 		 PROPN 		 nsubj 		 published
( 		 PUNCT 		 punct 		 Group
CERG 		 PROPN 		 appos 		 Group
) 		 PUNCT 		 punct 		 Group
published 		 VERB 		 ROOT 		 published
a 		 DET 		 det 		 paper
paper 		 NOUN 		 dobj 		 published
calling 		 VERB 		 acl 		 paper
for 		 SCONJ 		 mark 		 build
Canada 		 PROPN 		 nsubj 		 build
to 		 PART 		 aux 		 build
build 		 VERB 		 advcl 		 calling
“ 		 PUNCT 		 punct 		 projects
mass 		 ADJ 		 amod 		 projects
utility 		 NOUN 		 nmod 		 scale
- 		 PUNCT 		 punct 		 scale
scale 		 NOUN 		 nmod 		 projects
solar 		 ADJ 		 amod 		 mega
mega 		 ADJ 		 compound 		 projects
projects 		 NOUN 		 dobj 		 build
, 		 PUNCT 		 punct 		 published
” 		 PUNCT 		 punct 		 published
according 		 VERB 		 prep 		 published
to 		 ADP 

In [34]:
# by the way, you can still count tokens and types with spaCy
# (new variable names, so we don't overwrite the NLTK tokens1 list from earlier)
spacy_tokens1 = [token.text for token in doc1]
spacy_types1 = set(spacy_tokens1)

# you can see those lists
print("tokens: ", spacy_tokens1)
print()
print("types: ", spacy_types1)

# and you can print their length
print()
print("number of tokens: ", len(spacy_tokens1))
print("number of types: ", len(spacy_types1))

tokens:  ['In', 'December', '2024', ',', 'Clean', 'Energy', 'Research', 'Group', '(', 'CERG', ')', 'published', 'a', 'paper', 'calling', 'for', 'Canada', 'to', 'build', '“', 'mass', 'utility', '-', 'scale', 'solar', 'mega', 'projects', ',', '”', 'according', 'to', 'an', 'SFU', 'news', 'release', '.', 'Utility', '-', 'scale', 'solar', '“', 'refers', 'to', 'large', 'solar', 'installations', 'designed', 'to', 'feed', 'power', 'directly', 'onto', 'the', 'electric', 'grid', '.', '”', 'An', 'electric', 'grid', 'is', 'an', '“', 'intricate', 'system', '”', 'that', 'provides', 'electricity', '“', 'all', 'the', 'way', 'from', 'its', 'generation', 'to', 'the', 'customers', 'that', 'use', 'it', 'for', 'their', 'daily', 'needs', '.', '”']

types:  {'In', 'Canada', '-', 'Energy', 'intricate', 'all', 'solar', 'its', 'generation', 'customers', 'scale', 'Group', 'CERG', 'provides', 'power', 'from', 'electric', 'system', 'a', 'directly', 'for', '2024', 'refers', '“', 'is', 'calling', 'to', 'news', 'ont

### Using pandas to show the output
You can also store the information in a pandas dataframe, which makes it much more readable and easier to save.

In [35]:
data1 = []

for token in doc1:
    # store the head as a string (token.head.text), not as a spaCy Token object
    data1.append([token.text, token.pos_, token.dep_, token.head.text])
    
df = pd.DataFrame(data1)
df.columns = ['Text', 'Tag', 'Dependency', 'Head']

df

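|     | Text | Tag | Dependency | Head |
| --- | --- | --- | --- | --- |
| 0 | In | ADP | prep | published |
| 1 | December | PROPN | pobj | In |
| 2 | 2024 | NUM | nummod | December |
| 3 | , | PUNCT | punct | published |
| 4 | Clean | PROPN | compound | Energy |
| ... | ... | ... | ... | ... |
| 83 | their | PRON | poss | needs |
| 84 | daily | ADJ | amod | needs |
| 85 | needs | NOUN | pobj | for |
| 86 | . | PUNCT | punct | is |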


# Named Entity Recognition
Named Entity Recognition (NER) is the process of identifying and labelling _named entities_, that is, real-world objects, locations, and identifiers such as dates, currency, or quantities. It is very useful if you want to know, for instance, who is mentioned in a text, which countries are involved, or which dates come up. It is what allows your email or messaging application to identify dates and suggest a calendar entry, as in the image below, from the iPhone Notes app.


<a href="./img/ner_date.jpeg" target="_blank">
<img src="./img/ner_date.jpeg" width="100" height="200" style="border: 1px solid gray; padding: 5px;"> </a>

spaCy has a pretty powerful NER module. It can give you the named entities in a text, with the label for each word that is part of the entity. It also knows the boundaries of the entire entity. So it knows that "Clean", "Energy", "Research", and "Group" all have the ORG (for organization) label, but it also knows that the full entity is "Clean Energy Research Group". 
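One common way to represent both the per-word labels and the entity boundaries is to mark each token as the beginning of an entity (B), inside one (I), or outside any entity (O); spaCy exposes this per token through the `ent_iob_` and `ent_type_` attributes. The sketch below shows how such tags can be merged back into full entities. The tagged list is written by hand for illustration (it is not produced by spaCy here), and `group_entities` is a made-up helper.

```python
# hand-written (token, IOB tag, label) triples, for illustration only
tagged = [
    ("Clean", "B", "ORG"),
    ("Energy", "I", "ORG"),
    ("Research", "I", "ORG"),
    ("Group", "I", "ORG"),
    ("(", "O", ""),
    ("CERG", "B", "ORG"),
    (")", "O", ""),
    ("published", "O", ""),
]

def group_entities(tagged_tokens):
    """Merge B/I-tagged tokens into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for word, iob, label in tagged_tokens:
        if iob == "B":                        # a new entity starts here
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [word], label
        elif iob == "I" and current_words:    # continue the entity that is open
            current_words.append(word)
        else:                                 # outside any entity: close the open one
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:                         # close an entity that runs to the end
        entities.append((" ".join(current_words), current_label))
    return entities

print(group_entities(tagged))
```

In practice you do not need to do this yourself, since `doc.ents` already gives you the merged entities; the sketch is only meant to show what "knowing the boundaries" involves.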

You can list all the entities in a text with the usual for loop. Using the [displacy module](https://spacy.io/usage/visualizers), you can also visualize the boundaries and the types in different colours. 

In [36]:
# print each entity and its label
for ent in doc1.ents:
    print(ent.text, ent.label_)

December 2024 DATE
Clean Energy Research Group ORG
CERG ORG
Canada GPE
SFU ORG
daily DATE


In [37]:
# same for doc2: it has no real named entities, but the model still (wrongly) labels 'geese' as NORP
for ent in doc2.ents:
    print(ent.text, ent.label_)

geese NORP


In [38]:
# it's useful to store the named entities in a text, for example to count them later

# create an empty list
named_ents1 = []

# go through the entities and append each to the list
for ent in doc1.ents:
    named_ents1.append((ent.text, ent.label_))
    
print(named_ents1)

[('December 2024', 'DATE'), ('Clean Energy Research Group', 'ORG'), ('CERG', 'ORG'), ('Canada', 'GPE'), ('SFU', 'ORG'), ('daily', 'DATE')]
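Once the entities are stored as (text, label) pairs, counting them by label takes one line with `collections.Counter`. In this sketch the list printed above is copied by hand into a new variable, `named_ents_copy` (a name made up for this example), so that the code runs on its own without spaCy.

```python
from collections import Counter

# the list printed above, copied by hand so that this runs without spaCy
named_ents_copy = [
    ("December 2024", "DATE"),
    ("Clean Energy Research Group", "ORG"),
    ("CERG", "ORG"),
    ("Canada", "GPE"),
    ("SFU", "ORG"),
    ("daily", "DATE"),
]

# count how many entities there are for each label
label_counts = Counter(label for _, label in named_ents_copy)

for label, count in label_counts.most_common():
    print(label, count)
```

In the notebook, you could build the counter directly from `named_ents1` instead of the copied list.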


In [39]:
# create a df for the entities, from the list above 
df_ents1 = pd.DataFrame(named_ents1)
# name the columns
df_ents1.columns = ['Entity', 'Label']
# print
df_ents1

|     | Entity | Label |
| --- | --- | --- |
| 0 | December 2024 | DATE |
| 1 | Clean Energy Research Group | ORG |
| 2 | CERG | ORG |
| 3 | Canada | GPE |
| 4 | SFU | ORG |
| 5 | daily | DATE |


In [40]:
# visualize the entities
displacy.render(doc1, style="ent", jupyter=True)

# Summary

We have learned quite a bit! Text normalization includes:

* Lowercasing the words
* Tokenizing (identifying words, punctuation, anything else)
* Stemming
* Lemmatizing

And we have learned to do this with both NLTK and spaCy.

We have also learned about **Named Entity Recognition** and how to extract entities with spaCy.