# Parts of Speech (POS) Tagging

Part-of-speech (POS) tagging assigns a part of speech to each word in a sentence. Unlike phrase matching, which operates at the sentence or multi-word level, POS tagging is performed at the token level.
Each token carries both a coarse-grained POS tag (noun, verb, adjective), available as `pos_`, and a fine-grained tag (plural noun, past-tense verb, superlative adjective), available as `tag_`.

#### Import Libraries

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

#### Create Doc

In [4]:
sen = nlp(u"I like to play football. I hated it in my childhood though")

#### Review the Document Text

In [5]:
print(sen.text)

I like to play football. I hated it in my childhood though


#### Print the POS Tags for Each Token

In [6]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival to
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           PRON       PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       SCONJ      IN       conjunction, subordinating or preposition
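
The aligned columns in the output above come from the nested width fields in the f-string (`{word.text:{12}}` and so on). A minimal sketch of that formatting, with a few rows hard-coded from the output so it runs without loading the model; the variable names `rows`, `text_width`, `pos_width` and `tag_width` are made up for this illustration:

```python
# Rows copied from the tagger output above: (token text, coarse POS, fine-grained tag).
rows = [("I", "PRON", "PRP"), ("football", "NOUN", "NN"), ("childhood", "NOUN", "NN")]

# Column widths, matching the 12 / 10 / 8 used in the notebook cell.
text_width, pos_width, tag_width = 12, 10, 8

lines = []
for text, pos, tag in rows:
    # A nested {...} inside the format spec is evaluated first, so
    # {text:{text_width}} pads the string to 12 characters (left-aligned).
    line = f"{text:{text_width}} {pos:{pos_width}} {tag:{tag_width}}"
    lines.append(line)
    print(line)
```

Writing the width as a literal (`{text:12}`) gives the same result; the nested form is mainly useful when the width is held in a variable.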


### Finding the Number of POS Tags

The number of occurrences of each POS tag can be found by calling the count_by method on the spaCy document object. The method takes spacy.attrs.POS as its parameter.

In [7]:
num_pos = sen.count_by(spacy.attrs.POS)
num_pos

{95: 4, 100: 3, 94: 1, 92: 2, 97: 1, 85: 1, 98: 1}

Each key is the numeric ID of a POS tag, and each value is that tag's frequency. The IDs can be mapped back to tag names through the document's vocabulary.

In [8]:
for k,v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

85. ADP     : 1
92. NOUN    : 2
94. PART    : 1
95. PRON    : 4
97. PUNCT   : 1
98. SCONJ   : 1
100. VERB    : 3
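
The same frequencies can also be obtained from the string tags directly, which avoids the ID lookup. This is a sketch using `collections.Counter`, not the method the notebook uses; the `tagged` list is hard-coded from the output above so the example runs without the model. With a loaded pipeline it would presumably be built as `[(t.text, t.pos_) for t in sen]`.

```python
from collections import Counter

# (token text, coarse POS) pairs copied from the tagger output shown earlier.
tagged = [
    ("I", "PRON"), ("like", "VERB"), ("to", "PART"), ("play", "VERB"),
    ("football", "NOUN"), (".", "PUNCT"), ("I", "PRON"), ("hated", "VERB"),
    ("it", "PRON"), ("in", "ADP"), ("my", "PRON"), ("childhood", "NOUN"),
    ("though", "SCONJ"),
]

# Count how often each POS tag occurs, keyed by tag name rather than numeric ID.
pos_counts = Counter(pos for _, pos in tagged)

for pos, freq in sorted(pos_counts.items()):
    print(f"{pos:{8}}: {freq}")
```

The counts agree with the `count_by` result above (for example, four PRON and three VERB tokens).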


# Named Entity Recognition

Named entity recognition (NER) identifies words or phrases in a sentence as entities, e.g. the name of a person, place, or organization.

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')

sen = nlp(u'Manchester United is looking to sign Harry Kane for $90 million')

In [12]:
print(sen.ents)

(Manchester United, Harry Kane, $90 million)


#### Describe Each Entity in the Sentence

In [13]:
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


#### Adding New Entities

In [17]:
sen = nlp(u'PT.ABC is setting up a new company in Indonesia')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Indonesia - GPE - Countries, cities, states


#### Add PT.ABC as an ORG Entity (Companies, Agencies, Institutions)

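Import the Span class from the spacy.tokens module. Next, we need to get the hash value of the ORG entity type from our document's vocabulary. We then create a span covering the first token and append it to the document's entities.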

In [18]:
from spacy.tokens import Span

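# Look up the hash value of the ORG label in the vocabulary's string store
ORG = sen.vocab.strings[u'ORG']
# Span over token 0 only (the end index 1 is exclusive)
new_entity = Span(sen, 0, 1, label=ORG)
# Append the new span to the entities already found
sen.ents = list(sen.ents) + [new_entity]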

#### Printing Our Span

In [21]:
print(new_entity)

PT.ABC


In [22]:
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

PT.ABC - ORG - Companies, agencies, institutions, etc.
Indonesia - GPE - Countries, cities, states
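
The `0, 1` arguments passed to `Span` are token indices with an exclusive end, the same convention as a Python slice. A plain-Python sketch of that indexing follows. The `tokens` list is an assumption: it is written out by hand for illustration, based on the printed span `PT.ABC` being a single token, and is not produced by the tokenizer here.

```python
# Hand-written token list for the sentence (assumed tokenization, for illustration only).
tokens = ["PT.ABC", "is", "setting", "up", "a", "new", "company", "in", "Indonesia"]

# Same start/end values as in Span(sen, 0, 1, label=ORG).
start, end = 0, 1

# Half-open range: includes index 0, excludes index 1.
span_tokens = tokens[start:end]
print(span_tokens)

# A two-token entity would need end = start + 2.
two_token_span = tokens[0:2]
print(two_token_span)
```

If the company name had been split into several tokens, the end index would have to be raised accordingly to cover all of them.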


#### Counting Entities

For POS tags, we could count the frequency of each tag in a document with the sen.count_by method. For named entities, no such dedicated method exists, so we count the frequency of each entity type manually.

In [24]:
sen = nlp(u'Manchester United is looking to sign Harry Kane for $90 million. David demand 100 Million Dollars')
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
David - PERSON - People, including fictional
100 Million Dollars - MONEY - Monetary values, including unit


Filter the entities by label and take the length of the result; this is the number of entities matching the label we searched for (here, PERSON).

In [25]:
len([ent for ent in sen.ents if ent.label_=='PERSON'])

2
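
To count every entity type at once rather than one label at a time, the manual count can be generalised with `collections.Counter`. This is a sketch under the assumption that the `(text, label)` pairs below, hard-coded from the output above, stand in for `[(ent.text, ent.label_) for ent in sen.ents]`, so the example runs without the model.

```python
from collections import Counter

# (entity text, label) pairs copied from the entity output shown earlier.
entities = [
    ("Manchester United", "ORG"),
    ("Harry Kane", "PERSON"),
    ("$90 million", "MONEY"),
    ("David", "PERSON"),
    ("100 Million Dollars", "MONEY"),
]

# Frequency of each entity label across the document.
label_counts = Counter(label for _, label in entities)

for label, freq in sorted(label_counts.items()):
    print(f"{label:{8}}: {freq}")
```

Looking up `label_counts["PERSON"]` gives the same value as the list-comprehension count above.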