POS - Parts of Speech

NER - Named Entity Recognition

## Coarse-grained Part-of-speech Tags
Every token is assigned a POS Tag from the following list:


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp(u"The quick brown fox jumped over the lazy dog`s back.")

In [4]:
print(doc.text)

The quick brown fox jumped over the lazy dog`s back.


In [7]:
print(doc[4].pos_)

VERB


In [9]:
# Detailed tag
print(doc[4].tag_)

VBD


In [11]:
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

The        DET        DT         determiner
quick      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
brown      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
fox        NOUN       NN         noun, singular or mass
jumped     VERB       VBD        verb, past tense
over       ADP        IN         conjunction, subordinating or preposition
the        DET        DT         determiner
lazy       ADJ        JJ         adjective (English), other noun-modifier (Chinese)
dog`s      NOUN       NN         noun, singular or mass
back       ADV        RB         adverb
.          PUNCT      .          punctuation mark, sentence closer
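
The `{token.text:{10}}` fields in the loop above are ordinary f-string format specs: the inner `{10}` is evaluated first and becomes the minimum field width, so each string is left-aligned and padded to ten characters. A minimal sketch, with plain tuples standing in for spaCy tokens (the `rows` data and the `format_row` helper are illustrative and not part of spaCy):

```python
# Plain tuples stand in for (token.text, token.pos_, token.tag_).
rows = [("The", "DET", "DT"), ("fox", "NOUN", "NN"), ("jumped", "VERB", "VBD")]

def format_row(text, pos, tag, width=10):
    # The nested {width} field is substituted before formatting, so
    # {text:{width}} behaves like {text:10}: left-align, pad to 10 chars.
    return f"{text:{width}} {pos:{width}} {tag:{width}}"

for row in rows:
    print(format_row(*row))
```

The width is a minimum, not a limit: a value longer than ten characters is printed in full and pushes the following columns to the right.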


In [12]:
doc = nlp(u"I read books on NLP.")

In [15]:
word = doc[1]
word

read

In [16]:
token = word
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

read       VERB       VBP        verb, non-3rd person singular present


In [17]:
doc = nlp(u"I read a book on NLP.")
word = doc[1]

token = word
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

read       VERB       VBD        verb, past tense


In [19]:
doc = nlp(u"The quick brown fox jumped over the lazy dog`s back.")

POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 2, 84: 3, 92: 2, 100: 1, 85: 1, 86: 1, 97: 1}

In [22]:
doc.vocab[84].text

'ADJ'

In [23]:
doc[2].pos_

'ADJ'

In [24]:
for key,value in sorted(POS_counts.items()):
    print(f"{key}. {doc.vocab[key].text:{10}} {value}")

84. ADJ        3
85. ADP        1
86. ADV        1
90. DET        2
92. NOUN       2
97. PUNCT      1
100. VERB       1
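
`Doc.count_by` returns integer IDs as keys, which is why each key has to be looked up in `doc.vocab`. When only the names matter, the same tally can be built with `collections.Counter` over the string attributes. A sketch, using the `pos_` values printed earlier for the fox sentence as a plain list in place of a live `Doc`:

```python
from collections import Counter

# pos_ values copied from the token listing above; with a real Doc this
# would be: Counter(token.pos_ for token in doc)
pos_tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP",
            "DET", "ADJ", "NOUN", "ADV", "PUNCT"]

pos_counts = Counter(pos_tags)

# Sort by name so the listing is stable from run to run.
for name, count in sorted(pos_counts.items()):
    print(f"{name:{10}} {count}")
```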


In [25]:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for key,value in sorted(TAG_counts.items()):
    print(f"{key}. {doc.vocab[key].text:{10}} {value}")

164681854541413346. RB         1
1292078113972184607. IN         1
10554686591937588953. JJ         3
12646065887601541794. .          1
15267657372422890137. DT         2
15308085513773655218. NN         2
17109001835818727656. VBD        1


In [26]:
len(doc.vocab)

792

In [27]:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for key,value in sorted(DEP_counts.items()):
    print(f"{key}. {doc.vocab[key].text:{10}} {value}")

400. advmod     1
402. amod       3
415. det        2
429. nsubj      1
439. pobj       1
443. prep       1
445. punct      1
8206900633647566924. ROOT       1


Visualize POS

In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [30]:
doc = nlp(u"The quick brown fox jumped over the lazy dog`s back.")

In [31]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

In [32]:
options = {"distance": 110, "color": "yellow", "compact": True, "bg": "#09a3d5", "font": "Times"}

In [33]:
displacy.render(doc, style="dep", jupyter=True, options=options)

In [35]:
doc2 = nlp(u"This is a sentence. This is 2nd sentence. This is 3rd sentence")

spans = list(doc2.sents)

displacy.serve(spans, style="dep", options={"distance": 100})




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [12/Apr/2023 20:11:12] "GET / HTTP/1.1" 200 9546
127.0.0.1 - - [12/Apr/2023 20:11:13] "GET /favicon.ico HTTP/1.1" 200 9546


Shutting down server on port 5000.


Named Entity Recognition (NER)

## Entity annotations
`Doc.ents` are token spans with their own set of annotations.
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token index at which the span *starts* in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token index at which the span *stops* in the Doc (exclusive)</td></tr>
<tr><td>`ent.start_char`</td><td>The character offset at which the entity text *starts* in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The character offset at which the entity text *stops* in the Doc (exclusive)</td></tr>
</table>
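
The difference between the token offsets (`start`, `end`) and the character offsets (`start_char`, `end_char`) is easiest to see side by side. The sketch below does not call spaCy: it splits on single spaces as a stand-in tokenizer and uses a hypothetical `span_offsets` helper to compute both kinds of offsets for a token span.

```python
def span_offsets(text, start, end):
    """Return (start, end, start_char, end_char) for tokens[start:end].

    Splitting on spaces is a simplification; spaCy's tokenizer also
    separates punctuation, so real token indices can differ.
    """
    tokens = text.split(" ")
    # Character position at which each token begins in the original string.
    token_starts = []
    pos = 0
    for tok in tokens:
        token_starts.append(pos)
        pos += len(tok) + 1  # +1 for the single separating space
    start_char = token_starts[start]
    end_char = token_starts[end - 1] + len(tokens[end - 1])
    return start, end, start_char, end_char

text = "Tesla to build a U.K factory for $6 million"
start, end, start_char, end_char = span_offsets(text, 0, 1)
# Slicing the raw string by character offsets recovers the span text.
print(text[start_char:end_char])
```

Both end values are exclusive, so they can be used directly as Python slice bounds: token offsets slice the token sequence, character offsets slice the raw string.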



In [36]:
import spacy

In [37]:
nlp = spacy.load("en_core_web_sm")

In [38]:
def show_entities(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + " - " + ent.label_ + " - " + str(spacy.explain(ent.label_)))
    else:
        print("No entities found!")

In [39]:
doc = nlp(u"Hi how are you?")

In [40]:
show_entities(doc)

No entities found!


In [43]:
doc = nlp(u"May I go to Washington DC next May I see the washington Monument?")

In [44]:
show_entities(doc)

Washington DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the washington Monument - ORG - Companies, agencies, institutions, etc.


## NER Tags
Tags are accessible through the `.label_` property of an entity.
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

In [45]:
doc = nlp(u"Can i please have $500 of microsoft stocks?")

In [46]:
show_entities(doc)

500 - MONEY - Monetary values, including unit
microsoft - ORG - Companies, agencies, institutions, etc.


In [47]:
doc = nlp(u"Tesla to build a U.K factory for $6 million")

In [48]:
show_entities(doc)

U.K - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [49]:
from spacy.tokens import Span

In [53]:
ORG = doc.vocab.strings[u"ORG"]
ORG

383

In [54]:
new_entity = Span(doc, 0,1, label=ORG)

In [55]:
doc.ents = list(doc.ents) + [new_entity]

In [56]:
show_entities(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


Add multiple Named Entities

In [57]:
doc = nlp(u"our company created a brand new vacuum cleaner. "
          u"This new vacuum-cleaner is the best in show.")

In [58]:
show_entities(doc)

No entities found!


In [59]:
from spacy.matcher import PhraseMatcher

In [60]:
matcher = PhraseMatcher(nlp.vocab)

In [61]:
phrase_list = ["vacuum cleaner", "vacuum-cleaner"]

In [63]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [68]:
phrase_patterns

[vacuum cleaner, vacuum-cleaner]

In [65]:
matcher.add("newproduct", phrase_patterns)

In [67]:
found_matches = matcher(doc)
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [69]:
from spacy.tokens import Span

In [70]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [71]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]

In [72]:
new_entities = [Span(doc, match[1],match[2], label=PROD) for match in found_matches]

In [73]:
doc.ents = list(doc.ents) + new_entities

In [74]:
show_entities(doc)

vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)


In [75]:
doc = nlp(u"Originally I paid $29.95 for this car toy, but now it is marked down by $10.")

In [76]:
[ent for ent in doc.ents if ent.label_ == "MONEY"]

[29.95, 10]

In [77]:
len([ent for ent in doc.ents if ent.label_ == "MONEY"])

2

Visualizing Named Entity Recognition

In [78]:
import spacy

nlp = spacy.load("en_core_web_sm")

from spacy import displacy

In [84]:
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand Ipods for a profit of $6 million. "
          u"In contrast Sony only sold 8 thousand walkman music players.")

In [85]:
displacy.render(doc, style="ent", jupyter=True)

In [86]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style="ent", jupyter=True)

In [88]:
colors = {"ORG":"red"}
options = {"ents":["PRODUCT","ORG"], 'colors':colors}
displacy.render(doc, style="ent", jupyter=True, options=options)

In [89]:
colors = {"ORG":"radial-gradient(yellow, green)"}
options = {"ents":["PRODUCT","ORG"], 'colors':colors}
displacy.render(doc, style="ent", jupyter=True, options=options)

In [91]:
colors = {"ORG":"linear-gradient(90deg, orange, red)"}
options = {"ents":["PRODUCT","ORG"], 'colors':colors}
displacy.render(doc, style="ent", jupyter=True, options=options)

Sentence Segmentation

In [92]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [93]:
doc = nlp(u"This is the first sentence. This is another sentence. This is the last sentence.")

In [94]:
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [96]:
doc[0]

This

In [95]:
doc.sents[0]  # doc.sents is a generator, so indexing it directly fails

TypeError: 'generator' object is not subscriptable

In [110]:
list(doc.sents)[0]

This is the first sentence.

In [111]:
type(list(doc.sents)[0])

spacy.tokens.span.Span
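
`doc.sents` is a generator, so it cannot be indexed directly; it has to be materialized with `list()` or advanced with `next()`. The same behavior can be reproduced with any Python generator. In this sketch `fake_sents` is a made-up stand-in for `doc.sents`:

```python
from itertools import islice

def fake_sents():
    # Stand-in for doc.sents: yields one sentence at a time.
    yield "This is the first sentence."
    yield "This is another sentence."
    yield "This is the last sentence."

try:
    fake_sents()[0]  # generators do not support indexing
except TypeError as err:
    print(f"TypeError: {err}")

first = next(fake_sents())                  # first item only, no list built
second = next(islice(fake_sents(), 1, 2))   # item at index 1, fetched lazily
all_sents = list(fake_sents())              # materialize everything for indexing
print(first, second, all_sents[-1], sep="\n")
```

`list()` is the simplest choice for short texts; `next()` and `islice` avoid building the whole list when only one sentence is needed.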

In [112]:
doc = nlp(u'"Management is doing the right things; Leadership is doing the right things." - Peter Drucker')

In [113]:
doc.text

'"Management is doing the right things; Leadership is doing the right things." - Peter Drucker'

In [114]:
for sent in doc.sents:
    print(sent)
    print("\n")

"Management is doing the right things; Leadership is doing the right things."


- Peter Drucker




In [115]:
from spacy.language import Language

In [116]:
# Add a SEGMENTATION rule
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i+1].is_sent_start = True
    return doc

In [117]:
nlp.add_pipe("set_custom_boundaries",
             before="parser")

nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [118]:
doc4 = nlp(u'"Management is doing the right things; Leadership is doing the right things." - Peter Drucker')

In [119]:
for sent in doc4.sents:
    print(sent)

"Management is doing the right things;
Leadership is doing the right things."
- Peter Drucker
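
The rule in `set_custom_boundaries` marks the token that follows each semicolon as a sentence start. The sketch below replays that logic on a plain list of strings, without spaCy, to show which positions get flagged and how the flags turn into segments. `mark_starts` and `split_on_starts` are hypothetical helper names, and the token list is a shortened version of the quote.

```python
def mark_starts(tokens, boundary=";"):
    # Mirror of the component's loop: visit every token except the last
    # and flag the token that follows a boundary symbol.
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a segment
    for i, tok in enumerate(tokens[:-1]):
        if tok == boundary:
            starts[i + 1] = True
    return starts

def split_on_starts(tokens, starts):
    # Open a new segment whenever a token is flagged as a start.
    segments = []
    for tok, is_start in zip(tokens, starts):
        if is_start:
            segments.append([])
        segments[-1].append(tok)
    return segments

tokens = ["Management", "is", "doing", "things", ";",
          "Leadership", "is", "doing", "things", "."]
for segment in split_on_starts(tokens, mark_starts(tokens)):
    print(" ".join(segment))
```

In the real pipeline the parser still adds its own boundaries on top of the flagged ones, which is why `- Peter Drucker` remains a separate sentence in the output above; this sketch shows only the custom rule.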


In [160]:
# Change SEGMENTATION rules

In [161]:
nlp = spacy.load("en_core_web_sm")

In [162]:
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

In [163]:
print(mystring)

This is a sentence. This is another.

This is a 
third sentence.


In [164]:
doc = nlp(mystring)

for sentence in doc.sents:
    print(sentence)

This is a sentence.
This is another.


This is a 
third sentence.


In [165]:
# from spacy.pipeline import SentenceSegmenter

In [166]:
@Language.component("split_on_newlines")
def split_on_newlines(doc):
    # A spaCy v3 pipeline component must return the Doc rather than yield
    # spans, so mark the token after each newline token as a sentence start.
    for word in doc[:-1]:
        if word.text.startswith("\n"):
            doc[word.i + 1].is_sent_start = True
    return doc

In [167]:
# sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)

In [168]:
nlp.add_pipe("split_on_newlines", before="parser")

<function __main__.split_on_newlines(doc)>

In [169]:
print(nlp.pipe_names)

['tok2vec', 'tagger', 'split_on_newlines', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [170]:
doc = nlp(mystring)

In [171]:
for sentence in doc.sents:
    print(sentence)
