### POS tagging

We will look at:
* coarse-grained POS tags: noun, verb, adjective
* fine-grained tags: plural noun, past-tense verb, superlative adjective

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp("The quick brown fox jumped over the lazy dog's back.")

In [4]:
# Use pos_ to get the coarse-grained tag of each token.
for i in doc:
    print(i, i.pos_, spacy.explain(i.pos_))

The DET determiner
quick ADJ adjective
brown ADJ adjective
fox PROPN proper noun
jumped VERB verb
over ADP adposition
the DET determiner
lazy ADJ adjective
dog NOUN noun
's PART particle
back NOUN noun
. PUNCT punctuation


In [5]:
# Use tag_ to get the fine-grained tag of each token.
for i in doc:
    print(i, i.tag_, spacy.explain(i.tag_))

The DT determiner
quick JJ adjective
brown JJ adjective
fox NNP noun, proper singular
jumped VBD verb, past tense
over IN conjunction, subordinating or preposition
the DT determiner
lazy JJ adjective
dog NN noun, singular or mass
's POS possessive ending
back NN noun, singular or mass
. . punctuation mark, sentence closer
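
The two outputs above can be lined up to see which fine-grained tags fall under each coarse tag. The following is a minimal sketch in plain Python: the `tagged` list is copied by hand from the printed output rather than produced by a pipeline, and the name `fine_by_coarse` is only illustrative.

```python
from collections import defaultdict

# (token, coarse tag, fine-grained tag) triples copied from the printed output above.
tagged = [
    ("The", "DET", "DT"), ("quick", "ADJ", "JJ"), ("brown", "ADJ", "JJ"),
    ("fox", "PROPN", "NNP"), ("jumped", "VERB", "VBD"), ("over", "ADP", "IN"),
    ("the", "DET", "DT"), ("lazy", "ADJ", "JJ"), ("dog", "NOUN", "NN"),
    ("'s", "PART", "POS"), ("back", "NOUN", "NN"), (".", "PUNCT", "."),
]

# Collect the set of fine-grained tags observed under each coarse tag.
fine_by_coarse = defaultdict(set)
for _, coarse, fine in tagged:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

On a longer text a coarse tag such as VERB would typically collect several fine-grained tags, which is the distinction the two attributes capture.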


### Frequency of POS tags

In [6]:
POS_counts = doc.count_by(spacy.attrs.POS)

In [7]:
# count_by returns a dict mapping each part-of-speech ID to its count.
POS_counts

{90: 2, 84: 3, 96: 1, 100: 1, 85: 1, 92: 2, 94: 1, 97: 1}

In [8]:
doc.vocab[90].text

'DET'
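
The integer keys are hard to read on their own. As a rough cross-check, the same tally can be made from the tag names directly. This sketch uses only `collections.Counter` on a hand-copied list of the `pos_` values printed earlier, so it does not call spaCy at all; `pos_tags` and `pos_counts` are illustrative names.

```python
from collections import Counter

# Coarse tags in token order, copied from the pos_ output printed earlier.
pos_tags = [
    "DET", "ADJ", "ADJ", "PROPN", "VERB", "ADP",
    "DET", "ADJ", "NOUN", "PART", "NOUN", "PUNCT",
]

# Count how often each tag occurs.
pos_counts = Counter(pos_tags)

for tag, count in pos_counts.most_common():
    print(tag, count)
```

The counts agree with the dictionary above: one tag occurs three times, two tags occur twice, and the remaining five occur once. With a loaded pipeline, each integer key can be turned back into its name with `doc.vocab[key].text`, as in the cell above.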

### Visualise POS

In [9]:
from spacy import displacy

In [10]:
displacy.render(doc,style='dep', jupyter=True)

## Named entity recognition
Named-entity recognition (NER) seeks to locate named entity mentions in unstructured text and classify them into pre-defined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

In [11]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
    else:
        print('no entities')

In [12]:
doc = nlp("Hi how are you")
show_ents(doc)

no entities


In [13]:
doc = nlp("We will be going to Delhi to see India gate in september")
show_ents(doc)

Delhi - GPE - Countries, cities, states
India - GPE - Countries, cities, states
september - DATE - Absolute or relative dates or periods


### Entity annotations

* 'ent.text' - The original entity text.
* 'ent.label' - The entity type's hash value.
* 'ent.label_' - The entity type as a string.
* 'ent.start' - The token index in the doc at which the span starts.
* 'ent.end' - The token index in the doc at which the span stops (exclusive).
* 'ent.start_char' - The character offset in the doc at which the entity text starts.
* 'ent.end_char' - The character offset in the doc at which the entity text stops.
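
The difference between token indices and character offsets is easiest to see on a concrete span. The sketch below is plain Python, not spaCy: it assumes a stand-in tokenizer that splits on single spaces (spaCy's tokenizer is more elaborate), and `describe_span` is a hypothetical helper that mimics the attributes listed above.

```python
text = "We will be going to Delhi to see India gate in september"

# Stand-in tokenizer: split on single spaces and record each token's character offset.
tokens = []
offset = 0
for word in text.split(" "):
    tokens.append((word, offset))
    offset += len(word) + 1  # +1 skips the separating space


def describe_span(start, end):
    """Return span-style attributes for tokens[start:end] (end is exclusive)."""
    start_char = tokens[start][1]
    last_word, last_offset = tokens[end - 1]
    end_char = last_offset + len(last_word)
    return {
        "text": text[start_char:end_char],
        "start": start,
        "end": end,
        "start_char": start_char,
        "end_char": end_char,
    }


# "Delhi" is the sixth token, so it covers token indices 5 to 6 (exclusive).
delhi = describe_span(5, 6)
print(delhi)
```

Slicing the token list with `start:end` and slicing the raw text with `start_char:end_char` both recover the same entity, which is why both pairs of attributes are exposed.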

### Adding a named entity

In [14]:
doc = nlp("Tesla to build a U.K factory for 100 crores")
show_ents(doc)

Tesla - ORDINAL - "first", "second", etc.
U.K - ORG - Companies, agencies, institutions, etc.
100 - CARDINAL - Numerals that do not fall under another type


In [15]:
from spacy.tokens import Span

In [16]:
ORG = doc.vocab.strings['ORG']
ORG

383

In [17]:
new_ent = Span(doc,0,1,label=ORG)

In [18]:
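# Assigning list(doc.ents) + [new_ent] raises ValueError [E103] here: token 0 ("Tesla")
# already belongs to an ORDINAL entity, and a token can be part of only one entity.
# Keep only the existing entities that do not overlap the new span.
kept = [ent for ent in doc.ents if ent.end <= new_ent.start or ent.start >= new_ent.end]
doc.ents = kept + [new_ent]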
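
The one-entity-per-token rule behind E103 amounts to an interval overlap check. Below is a minimal sketch of that check on plain `(start, end, label)` tuples rather than spaCy `Span` objects. `add_entity` is a hypothetical helper, and only the `(0, 1, 'ORDINAL')` span is taken from the error message; the other two positions are assumed for illustration.

```python
def add_entity(existing, new):
    """Return the entities with `new` added, dropping any existing span that overlaps it.

    Spans are (start, end, label) tuples with an exclusive end index.
    """
    new_start, new_end, _ = new
    # A span is kept only if it ends before the new one starts or starts after it ends.
    kept = [
        (start, end, label)
        for (start, end, label) in existing
        if end <= new_start or start >= new_end
    ]
    return sorted(kept + [new])


# (0, 1, 'ORDINAL') comes from the error message; the other spans are assumed positions.
existing = [(0, 1, "ORDINAL"), (4, 5, "ORG"), (7, 8, "CARDINAL")]
updated = add_entity(existing, (0, 1, "ORG"))
print(updated)
```

Dropping the conflicting span first is one reasonable policy; another is to keep the existing entity and skip the new one, depending on which source is trusted more.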

### Adding multiple named entities

In [None]:
doc = nlp("Our company created a brand new vacuum cleaner.This new vacuum-cleaner is the best in show")

In [None]:
show_ents(doc)

In [None]:
from spacy.matcher import PhraseMatcher

In [None]:
matcher = PhraseMatcher(nlp.vocab)

In [None]:
phrase_list = ['vacuum cleaner','vacuum-cleaner']

In [None]:
phrase_patterns = [nlp(text) for text in phrase_list]

In [None]:
matcher.add('newproduct',None,*phrase_patterns)

In [None]:
found_matches = matcher(doc)

In [None]:
found_matches

In [None]:
from spacy.tokens import Span

In [None]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [None]:
new_ents = [Span(doc,match[1],match[2],label=PROD) for match in found_matches]

In [None]:
doc.ents = list(doc.ents) + new_ents

In [None]:
show_ents(doc)
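
Each match returned by the matcher is a `(match_id, start, end)` triple, which is why the cells above read `match[1]` and `match[2]` to build the spans. To make the idea concrete without a loaded model, here is a rough plain-Python sketch of phrase matching over a token list. The tokenization is assumed (in particular, that the hyphenated form splits into three tokens), and `find_phrases` is a hypothetical helper, not part of spaCy.

```python
def find_phrases(tokens, patterns):
    """Return (start, end) token index pairs, end exclusive, for every pattern occurrence."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            # A slice that runs past the end is shorter than the pattern, so it cannot match.
            if tokens[start:end] == pattern:
                matches.append((start, end))
    return matches


# Assumed tokenization of the example sentence.
tokens = [
    "Our", "company", "created", "a", "brand", "new", "vacuum", "cleaner", ".",
    "This", "new", "vacuum", "-", "cleaner", "is", "the", "best", "in", "show",
]
patterns = [["vacuum", "cleaner"], ["vacuum", "-", "cleaner"]]

found = find_phrases(tokens, patterns)

# Turn every match into an entity-style (start, end, label) tuple.
new_ents = [(start, end, "PRODUCT") for start, end in found]
print(new_ents)
```

The real `PhraseMatcher` is far more efficient than this nested loop, but the output has the same shape: token offsets that can be wrapped in labelled spans.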

### Visualising NER

In [None]:
from spacy import displacy

In [None]:
doc = nlp("Over the last quarter Apple sold nearly 20k Ipods.")

In [None]:
displacy.render(doc,style='ent',jupyter=True)

In [None]:
# Restrict the display to particular entity types.
options = {'ents': ['PRODUCT', 'ORG']}
displacy.render(doc, style='ent', jupyter=True, options=options)

### Sentence segmentation

In [19]:
doc = nlp("This is first sentence.This is second sentence.This is last sentence.")

In [20]:
for i in doc.sents:
    print(i)

This is first sentence.
This is second sentence.
This is last sentence.


### Use your own rules to segment sentences

In [21]:
# Here we want to split the text on ';'.
doc = nlp("Mgmt is doint the right things ; leadership is doing the right things.")

In [22]:
# Add a segmentation rule: the token after each ';' starts a new sentence.
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

In [23]:
nlp.add_pipe(set_custom_boundaries,before='parser')
nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

In [27]:
doc = nlp("Mgmt is doint the right things ; Leadership is doing the right things.")

In [28]:
for i in doc.sents:
    print(i)

Mgmt is doint the right things ;
Leadership is doing the right things.
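
The effect of the rule can be mimicked on a plain token list. This is only a sketch: it splits the text on whitespace instead of using spaCy's tokenizer, and `split_after` is a hypothetical helper that closes a sentence right after each delimiter token, which is what marking the following token as a sentence start achieves.

```python
def split_after(tokens, delimiter=";"):
    """Group tokens into sentences, ending a sentence after each delimiter token."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if token == delimiter:
            # The next token starts a new sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = "Mgmt is doint the right things ; Leadership is doing the right things."
sentences = split_after(text.split())

for sentence in sentences:
    print(" ".join(sentence))
```

As in the output above, the `;` stays attached to the end of the first sentence.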


In [26]:
# Change the segmentation rule: split on newlines instead.

In [29]:
from spacy.pipeline import SentenceSegmenter

In [30]:
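def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            # The previous token was a newline, so close the current sentence here.
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):
            seen_newline = True
    # Whatever remains forms the last sentence.
    yield doc[start:]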

In [31]:
sbd = SentenceSegmenter(nlp.vocab,strategy=split_on_newlines)

In [32]:
nlp.add_pipe(sbd)

In [33]:
doc = nlp("Mgmt is doint the right things Leadership  \n is doing the right things.")

In [34]:
for i in doc.sents:
    print(i)

Mgmt is doint the right things Leadership  
 is doing the right things.
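
To see how the newline strategy behaves with more than one line break, the same generator logic can be run on a plain list of strings. This is a sketch under assumptions: the token list is made up for illustration, and the function operates on list slices rather than on a spaCy `Doc`.

```python
def split_on_newlines(tokens):
    """Yield slices of `tokens`, starting a new slice after each newline token."""
    start = 0
    seen_newline = False
    for i, token in enumerate(tokens):
        if seen_newline:
            # The previous token was a newline: emit everything up to this token.
            yield tokens[start:i]
            start = i
            seen_newline = False  # reset so only the next newline triggers a split
        elif token.startswith("\n"):
            seen_newline = True
    # Emit the remaining tokens as the final slice.
    yield tokens[start:]


# Illustrative token list containing two newline tokens.
tokens = ["first", "line", "\n", "second", "line", "\n", "third", "line"]
sentences = list(split_on_newlines(tokens))

for sentence in sentences:
    print(sentence)
```

Each newline token stays at the end of the slice it terminates. If the flag were never reset, every token after the first newline would be yielded as its own slice.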
