# Part-of-Speech Tagging and Named Entity Recognition

- understand how to retrieve parts of speech using spaCy
- named entity recognition (NER)
- visualize POS tags and named entities
- sentence segmentation

- most words are rare
- words that look the same can differ greatly in meaning
- words that look different can have the same meaning

- splitting text into word-like units is difficult
- use linguistic knowledge to add useful information

- you need to know at least noun vs. verb vs. adjective
- finer distinctions: plural noun, past-tense verb, superlative adjective
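
The coarse categories (noun, verb, adjective) and the finer distinctions in the list above correspond to two levels of tags. A minimal sketch in plain Python, assuming a hand-written and deliberately incomplete mapping from a few Penn Treebank fine-grained tags to their coarse part of speech. The names `FINE_TO_COARSE` and `coarse_tag` are illustrative, not part of spaCy:

```python
# Hand-picked subset of Penn Treebank tags; illustrative, not exhaustive.
FINE_TO_COARSE = {
    "NN": "NOUN",    # noun, singular or mass
    "NNS": "NOUN",   # noun, plural
    "VBD": "VERB",   # verb, past tense
    "VBP": "VERB",   # verb, non-3rd person singular present
    "JJ": "ADJ",     # adjective
    "JJS": "ADJ",    # adjective, superlative
}


def coarse_tag(fine_tag):
    """Return the coarse POS for a fine-grained tag, or None if unmapped."""
    return FINE_TO_COARSE.get(fine_tag)


for tag in ("NNS", "VBD", "JJS"):
    print(f"{tag:{5}} -> {coarse_tag(tag)}")
```

In spaCy the two levels are exposed as `token.pos_` (coarse) and `token.tag_` (fine-grained).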

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp(u"the quick brown fox jumped over the lazy dog's back.")

In [4]:
print(doc.text)

the quick brown fox jumped over the lazy dog's back.


In [5]:
print(doc[4].text)

jumped


In [6]:
print(doc[4].pos_)

VERB


In [7]:
print(doc[4].tag_)  

VBD


## Fine-grained Part-of-Speech Tags

### Refer to the provided notebooks for the meaning of each tag

In [8]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{15}} {token.tag_:{15}} {spacy.explain(token.tag_)}')

the        DET             DT              determiner
quick      ADJ             JJ              adjective (English), other noun-modifier (Chinese)
brown      ADJ             JJ              adjective (English), other noun-modifier (Chinese)
fox        NOUN            NN              noun, singular or mass
jumped     VERB            VBD             verb, past tense
over       ADP             IN              conjunction, subordinating or preposition
the        DET             DT              determiner
lazy       ADJ             JJ              adjective (English), other noun-modifier (Chinese)
dog        NOUN            NN              noun, singular or mass
's         PART            POS             possessive ending
back       ADV             RB              adverb
.          PUNCT           .               punctuation mark, sentence closer
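
The descriptions in the last column come from `spacy.explain`, which looks a label up in spaCy's built-in glossary. A small sketch, assuming only that spaCy is installed (no trained pipeline is loaded). `explain_or_blank` is a hypothetical helper for labels the glossary does not know, where `spacy.explain` returns `None`:

```python
import warnings

import spacy

# Glossary lookups work on plain label strings.
for label in ("VBD", "NN", "JJ"):
    print(f"{label:{5}} {spacy.explain(label)}")


def explain_or_blank(label):
    """Hypothetical helper: return '' when the glossary has no entry."""
    with warnings.catch_warnings():
        # spaCy may warn about unknown labels; silence that here.
        warnings.simplefilter("ignore")
        return spacy.explain(label) or ""
```

A fallback like this matters when the result is passed to a format spec such as `{...:{10}}`, which cannot format `None`.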


In [9]:
doc = nlp(u'I read books on NLP.')

In [10]:
word = doc[1]
word.text

'read'

In [11]:
token= word
print(f'{token.text:{8}} {token.pos_:{8}} {token.tag_:{8}} {spacy.explain(token.tag_):{15}}')

read     VERB     VBP      verb, non-3rd person singular present


In [12]:
doc = nlp(u'I read a book on NLP')

In [13]:
word = doc[1]
token = word
print(f'{token.text:{8}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}')

read     VERB       VBP        verb, non-3rd person singular present


In [14]:
doc = nlp(u'The quick brown fox  jumped over the lazy dog\'s back.')

In [15]:
# count_by returns each POS ID together with the number of tokens in the sentence that carry it


POS_counts = doc.count_by(spacy.attrs.POS)

In [16]:
POS_counts

{90: 2, 84: 3, 92: 2, 103: 1, 100: 1, 85: 1, 94: 1, 86: 1, 97: 1}

In [17]:
doc[0].pos

90

In [18]:
doc[0]

The
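
The keys returned by `count_by` are integer IDs, so they need to be decoded before they are readable. A minimal sketch of the same idea, assuming a blank English pipeline (tokenizer only, nothing to download) and counting token texts via `ORTH` instead of POS tags, since a blank pipeline has no tagger:

```python
import spacy
from spacy.attrs import ORTH

# Assumption: a blank pipeline is enough, because ORTH needs no trained model.
nlp = spacy.blank("en")
doc = nlp("the cat saw the dog")

# count_by returns {attribute ID: frequency}; the keys are integers.
orth_counts = doc.count_by(ORTH)

# Look each ID up in the string store to get readable keys.
readable = {doc.vocab.strings[key]: freq for key, freq in orth_counts.items()}
print(readable)
```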

In [254]:
POS_counts = doc.count_by(spacy.attrs.POS)

for k, val in sorted(POS_counts.items()):
    print(f'{k}.  {doc.vocab[k].text:{5}} ---> {val}')
    

84.  ADJ   ---> 3
85.  ADP   ---> 1
86.  ADV   ---> 1
90.  DET   ---> 2
92.  NOUN  ---> 2
94.  PART  ---> 1
97.  PUNCT ---> 1
100.  VERB  ---> 1
103.  SPACE ---> 1


In [20]:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k, v in sorted(TAG_counts.items()):
    print(f'{k}.  {doc.vocab[k].text:{5}}  {v}')

74.  POS    1
164681854541413346.  RB     1
1292078113972184607.  IN     1
6893682062797376370.  _SP    1
10554686591937588953.  JJ     3
12646065887601541794.  .      1
15267657372422890137.  DT     2
15308085513773655218.  NN     2
17109001835818727656.  VBD    1


In [21]:
# DEP holds the syntactic dependency label of each token

DEP_counts = doc.count_by(spacy.attrs.DEP)   
for k, v in sorted(DEP_counts.items()):
    print(f'{k}.   {doc.vocab[k].text:{5}}   {v}')

402.   amod    3
414.   dep     1
415.   det     2
429.   nsubj   1
439.   pobj    1
440.   poss    1
443.   prep    1
445.   punct   1
8110129090154140942.   case    1
8206900633647566924.   ROOT    1


## Visualizing Parts of Speech

In [22]:
import spacy

In [23]:
nlp = spacy.load('en_core_web_sm')

In [24]:
doc = nlp(u'The quick brown fox jumped over the lazy dog')

In [25]:
from spacy import displacy

In [26]:
displacy.render(doc, style = 'dep', jupyter = True, options ={'distance':110})

In [27]:
# other options that you can pass


In [28]:
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

In [29]:
displacy.render(doc, style= 'dep', jupyter=True, options= options)

In [30]:
# for dealing with large sentences it's better to deal with a list of spans

In [31]:
doc2 = nlp(u'This is a sentence. This is another sentence, possibly longer than the previous sentence')

In [32]:
spans = list(doc2.sents)

In [33]:
spans

[This is a sentence.,
 This is another sentence, possibly longer than the previous sentence]

In [36]:
displacy.render(spans, style='dep', options= {'distance':110})

# to serve the visualization on localhost instead:
# displacy.serve(doc2, style='dep', options={'distance': 110})
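
Outside a notebook, the markup can also be written to a file instead of being served. A sketch under stated assumptions: the parse is hand-written in displaCy's "manual" input format (illustrative data, not model output) so that no trained pipeline is needed, and the output file name is hypothetical:

```python
import tempfile
from pathlib import Path

from spacy import displacy

# Hand-written parse in displaCy's manual format (illustrative data).
parse = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

# jupyter=False makes render() return the markup instead of displaying it.
svg = displacy.render(parse, style="dep", manual=True, jupyter=False,
                      options={"distance": 110})

# Hypothetical output location in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "dep_parse.svg"
out_path.write_text(svg, encoding="utf-8")
```

With a processed `doc`, the same call works without `manual=True` by passing the `Doc` or a list of sentence spans.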

## Named Entity-Recognition

- named-entity recognition (NER)
 - locates and classifies named entities (person names, organizations, locations, quantities, monetary values)

In [37]:
import spacy

In [38]:
nlp = spacy.load('en_core_web_sm')

In [42]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('No entities found !')

In [43]:
doc = nlp(u'Hi how are you?')

In [44]:
show_ents(doc)

No entities found !


In [45]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monumment?')
show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monumment - ORG - Companies, agencies, institutions, etc.


In [46]:
doc = nlp(u"Can I please have 500 dollars of Microsoft stock?")

In [47]:
show_ents(doc)

500 dollars - MONEY - Monetary values, including unit
Microsoft - ORG - Companies, agencies, institutions, etc.


#### Adding a custom named entity

In [48]:
doc = nlp(u"Tesla to build a U.K. factory for $6 million")

In [49]:
show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [50]:
from spacy.tokens import Span

In [51]:
ORG = doc.vocab.strings[u"ORG"]

In [52]:
ORG

383

In [53]:
new_ent = Span(doc, 0,1, label = ORG)

In [54]:
doc.ents = list(doc.ents) + [new_ent]

In [55]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [56]:
doc2 = nlp(u"Tesla has a huge office")

In [57]:
show_ents(doc2)

No entities found !


In [58]:
# no entities here: the Span above was added to that one doc only; the model itself was not changed
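
Why the second doc has no entities can be reproduced in isolation. A minimal sketch, assuming a blank English pipeline (no statistical NER, so the only entities are the ones set explicitly). The second half uses spaCy's `entity_ruler` component, which is a different technique from the manual `Span` approach above: it stores the pattern in the pipeline, so later docs are annotated too.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline, so nothing is predicted by a model.
nlp = spacy.blank("en")

doc = nlp("Tesla to build a U.K. factory for $6 million")
doc.ents = [Span(doc, 0, 1, label="ORG")]   # annotates this Doc only
doc2 = nlp("Tesla has a huge office")
print(doc.ents, doc2.ents)                  # doc2 has no entities

# An entity_ruler keeps the pattern inside the pipeline itself.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Tesla"}])
doc3 = nlp("Tesla has a huge office")
print([(ent.text, ent.label_) for ent in doc3.ents])
```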

### Adding named entities

- add several terms as entities at once
- suppose you work at a vacuum company: both 'vaccum cleaner' and 'vaccum-cleaner' (spelled as in the example text below) should count as entities

In [70]:
doc = nlp(u'Our company created a brand new vaccum cleaner .'
         u'this new vaccum-cleaner is the best in show')

In [71]:
show_ents(doc)

No entities found !


In [72]:
from spacy.matcher import PhraseMatcher

In [73]:
matcher = PhraseMatcher(nlp.vocab)

In [74]:
phrase_list  = ['vaccum cleaner', 'vaccum-cleaner']

In [75]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [76]:
phrase_patterns

[vaccum cleaner, vaccum-cleaner]

In [77]:
matcher.add("newproduct", phrase_patterns)

In [78]:
found_match = matcher(doc)

In [79]:
found_match

[(2689272359382549672, 6, 8), (2689272359382549672, 10, 13)]

In [80]:
from spacy.tokens import Span

In [81]:
PROD = doc.vocab.strings[u'PRODUCT']

In [83]:
found_match

[(2689272359382549672, 6, 8), (2689272359382549672, 10, 13)]

In [84]:
new_ents = [Span(doc , match[1], match[2], label=PROD) for match in found_match]

In [85]:
doc.ents = list(doc.ents) + new_ents

In [86]:
show_ents(doc)

vaccum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vaccum-cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
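
The matcher above compares exact token text, so a capitalised variant would be missed. A sketch of one way around that, assuming a blank English pipeline and the `attr="LOWER"` option of `PhraseMatcher`. The example sentence and the correctly spelled terms are made up for illustration:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: a tokenizer-only pipeline is enough

# attr="LOWER" compares lowercase forms, so capitalised variants match too.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["vacuum cleaner", "vacuum-cleaner"]
matcher.add("newproduct", [nlp.make_doc(term) for term in terms])

doc = nlp("Our Vacuum Cleaner is new. This vacuum-cleaner is the best in show.")

# Each match is (match_id, start, end) in token offsets.
doc.ents = [Span(doc, start, end, label="PRODUCT")
            for _, start, end in matcher(doc)]
print([(ent.text, ent.label_) for ent in doc.ents])
```

`nlp.make_doc` only tokenizes, which is all the matcher needs for its patterns.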


In [88]:
doc = nlp(u"Originally I paid $29.95 for this car toy, but now it's marked down by $10 ")

In [97]:
ents = [(ent) for ent in doc.ents]

In [98]:
ents

[29.95, 10]

In [101]:
# only show a specified entity

[ent for ent in doc.ents if ent.label_ == "MONEY"]

[29.95, 10]

### Visualizing named entities

In [102]:
import spacy

In [104]:
nlp = spacy.load('en_core_web_sm')

In [105]:
from spacy import displacy

In [119]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. By contrast sony only sold 8 thousand cardinal walkman music players')

In [120]:
displacy.render(doc, style= 'ent', jupyter= True)

In [121]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style= 'ent')
    print('------------')


------------


------------


In [124]:
colors = {'ORG': 'linear-gradient(45deg, blue, green)'}
options = {'ents' : ['PRODUCT', 'ORG'], 'colors':colors}

In [125]:
displacy.render(doc, style='ent', jupyter= True, options= options)
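
The effect of the `ents` filter and the `colors` mapping can be checked without a trained model. A sketch assuming hand-written entity spans in displaCy's "manual" format (character offsets, illustrative data rather than model output):

```python
from spacy import displacy

text = "Apple sold iPods for a profit of $6 million."
money = "$6 million"
money_start = text.index(money)

# Hand-written entity spans; offsets are character positions in `text`.
manual_doc = {
    "text": text,
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": money_start, "end": money_start + len(money), "label": "MONEY"},
    ],
    "title": None,
}

# Only labels listed under "ents" are highlighted; "colors" overrides the
# background per label.
options = {"ents": ["ORG"],
           "colors": {"ORG": "linear-gradient(45deg, blue, green)"}}

# jupyter=False returns the HTML as a string instead of displaying it.
html = displacy.render(manual_doc, style="ent", manual=True,
                       jupyter=False, options=options)
print(html[:80])
```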

## Sentence Segmentation

In [220]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [221]:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence. ')

In [222]:
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [223]:
doc[0]

This

In [224]:
# this is how to take the individual sentences from it

sentences =list(doc.sents)

In [225]:
sentences

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

In [226]:
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peater Ducker')

In [227]:
doc.text

'"Management is doing the right things; leadership is doing the right things." -Peater Ducker'

In [228]:
for sent in doc.sents:
    print(sent)
    print('\n')

"Management is doing the right things; leadership is doing the right things."


-Peater Ducker




### Segmentation Rules
- add a segmentation rule
- change segmentation rules

- 1) add a segmentation rule

In [229]:
def print_with_ind(doc):
    for token in doc:
        print(token, end = ' - ')
        print(token.i, end = ' | ')
    

In [230]:
print_with_ind(doc)

" - 0 | Management - 1 | is - 2 | doing - 3 | the - 4 | right - 5 | things - 6 | ; - 7 | leadership - 8 | is - 9 | doing - 10 | the - 11 | right - 12 | things - 13 | . - 14 | " - 15 | -Peater - 16 | Ducker - 17 | 

In [231]:
# now we add our custom function to the pipeline
# originally the pipeline consists of:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [232]:
# selects every token except the last one

doc[:-1]

"Management is doing the right things; leadership is doing the right things." -Peater

In [233]:
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peater Ducker')

In [234]:
for sent in doc.sents:
    print(sent)

"Management is doing the right things; leadership is doing the right things."
-Peater Ducker


In [235]:
from spacy.language import Language

@Language.component("semicolonsent")
def set_custom_boundries(doc):
    print('im ex')
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i +1].is_sent_start = True
    return doc

In [236]:
doc[:-1]

"Management is doing the right things; leadership is doing the right things." -Peater

In [237]:
# ADD THE CUSTOM RULE TO THE PIPELINE (BEFORE THE PARSER)

In [238]:
nlp.add_pipe("semicolonsent", name = "customname" ,before = 'parser')

<function __main__.set_custom_boundries(doc)>

In [239]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'customname',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [242]:
# remember to re-create the doc after changing the pipeline; only then does the new rule apply
doc = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peater Ducker')
for sent in doc.sents:
    print(sent)
list(doc.sents)

im ex
"Management is doing the right things;
leadership is doing the right things."
-Peater Ducker


["Management is doing the right things;,
 leadership is doing the right things.",
 -Peater Ducker]
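
For this particular rule, an alternative to a custom component is spaCy's built-in rule-based `sentencizer`, which accepts a list of boundary characters. A sketch under stated assumptions: a blank English pipeline is used so that the sentencizer is the only component setting boundaries (this is not the parser-based setup used above), and the example sentence is made up:

```python
import spacy

# Assumption: blank pipeline, so no parser competes with the sentencizer.
nlp = spacy.blank("en")

# punct_chars lists the characters that end a sentence; ';' is added here.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", ";"]})

doc = nlp("Management is doing things right; leadership is doing the right things.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The trade-off is that the sentencizer ignores syntax entirely, whereas the custom component only adds one extra boundary and leaves the rest to the parser.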

In [210]:
# CHANGE SEGMENTATION RULES

In [211]:
nlp = spacy.load('en_core_web_sm')

In [212]:
mystring = u"This is a sentence. This is another\n\nThis is a \nthird sentence."

In [213]:
print(mystring)

This is a sentence. This is another

This is a 
third sentence.


In [214]:
type(mystring)

str

In [215]:
doc = nlp(mystring)

In [216]:
for sent in doc.sents:
    print(sent)

This is a sentence.
This is another

This is a 
third sentence.


In [219]:
from spacy.pipeline import SentenceSegmenter

ImportError: cannot import name 'SentenceSegmenter' from 'spacy.pipeline' (C:\Users\LENOVO\anaconda3\lib\site-packages\spacy\pipeline\__init__.py)

In [243]:
@Language.component("newrule_sentsplit")
def split_on_newlines(doc):
    # spaCy v3 components must return the Doc, so mark sentence starts on
    # the tokens instead of yielding spans (the v2 SentenceSegmenter style)
    seen_newline = False

    for word in doc:
        if seen_newline:
            word.is_sent_start = True          # first token after a newline
            seen_newline = False
        else:
            word.is_sent_start = word.i == 0   # no boundary anywhere else
            seen_newline = word.text.startswith('\n')

    return doc


In [245]:
# register the component by name; it has to run before the parser
nlp.add_pipe("newrule_sentsplit", before="parser")

# NOT COMPLETED, DO IT AGAIN

# Assessment

In [246]:
# create a Doc object from the file peterrabbit.txt

In [247]:
import spacy

In [249]:
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

In [250]:
myfile = open('resources/peterrabbit.txt', 'r')

In [251]:
# for every token in the third sentence print the token text, the POS tag, the fine-grained tag...

In [252]:
doc = nlp(myfile.read())

In [253]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}}  {spacy.explain(token.tag_):{10}}')

The        DET        DT          determiner
Tale       PROPN      NNP         noun, proper singular
of         ADP        IN          conjunction, subordinating or preposition
Peter      PROPN      NNP         noun, proper singular
Rabbit     PROPN      NNP         noun, proper singular
,          PUNCT      ,           punctuation mark, comma
by         ADP        IN          conjunction, subordinating or preposition
Beatrix    PROPN      NNP         noun, proper singular
Potter     PROPN      NNP         noun, proper singular
(          PUNCT      -LRB-       left round bracket
1902       NUM        CD          cardinal number
)          PUNCT      -RRB-       right round bracket
.          PUNCT      .           punctuation mark, sentence closer


         SPACE      _SP         whitespace
Once       ADV        RB          adverb    
upon       SCONJ      IN          conjunction, subordinating or preposition
a          DET        DT          determiner
time       NOUN       NN     


In [256]:
# provide a frequency list of POS tags for the entire document:

POS_counts =  doc.count_by(spacy.attrs.POS)
for k, val in sorted(POS_counts.items()):
    print(f'{k}.  {doc.vocab[k].text:{8}}-----> {val}')


84.  ADJ     -----> 54
85.  ADP     -----> 122
86.  ADV     -----> 67
87.  AUX     -----> 49
89.  CCONJ   -----> 61
90.  DET     -----> 90
92.  NOUN    -----> 166
93.  NUM     -----> 8
94.  PART    -----> 29
95.  PRON    -----> 109
96.  PROPN   -----> 76
97.  PUNCT   -----> 173
98.  SCONJ   -----> 20
100.  VERB    -----> 135
103.  SPACE   -----> 99


In [272]:
# what percentage of the tokens are nouns?
count = 0
totcount = 0
POS_counts = doc.count_by(spacy.attrs.POS)

for k, val in POS_counts.items():
    totcount += val                  # every tag counts toward the total
    if doc.vocab[k].text == 'NOUN':
        count = val                  # remember the noun frequency


In [273]:
count

166

In [275]:
count/totcount*100

13.195548489666137
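
The same figure can be checked by hand from the frequency list. A sketch in plain Python, with the counts copied from the output above (the dictionary name `pos_counts` is illustrative):

```python
# Coarse POS frequencies copied from the frequency list above.
pos_counts = {
    "ADJ": 54, "ADP": 122, "ADV": 67, "AUX": 49, "CCONJ": 61,
    "DET": 90, "NOUN": 166, "NUM": 8, "PART": 29, "PRON": 109,
    "PROPN": 76, "PUNCT": 173, "SCONJ": 20, "VERB": 135, "SPACE": 99,
}

total_tokens = sum(pos_counts.values())
noun_share = pos_counts["NOUN"] / total_tokens * 100
print(f"{pos_counts['NOUN']} / {total_tokens} = {noun_share:.2f}%")
```

Note that the denominator includes punctuation and whitespace tokens; excluding `PUNCT` and `SPACE` would give a higher share.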

In [276]:
# display the dependency parse for the third sentence

In [277]:
displacy.render(list(doc.sents)[2], style='dep', jupyter=True)

In [289]:
count =0
for entity in doc.ents:
    count+=1
    print(entity , '----->' , str(spacy.explain(entity.label_)))
    if count == 2:
        break

The Tale of Peter Rabbit -----> Titles of books, songs, etc.
Beatrix Potter -----> People, including fictional


In [296]:
sentence = [sent for sent in doc.sents]
print(len(sentence))
type(sentence[1])

55


spacy.tokens.span.Span

In [304]:
count =0
for sent in sentence:
    if len(sent.ents) > 0:
        # print(sent)
        count = count +1
print(count)

23


### Challenge: display the named entity visualization

In [306]:
displacy.render(sentence[0], style='ent')