### POS Tagging in NLP using spaCy

> A POS tag provides a considerable amount of information about a word and its neighbours. It is used in tasks such as sentiment analysis and text-to-speech conversion.

### What is POS tagging?

> Tagging means the classification of tokens into predefined classes. **Part-of-speech (POS)** tagging is the process of marking each word in a given corpus with a suitable tag, i.e. its part of speech, based on the word's context. It is also known as grammatical tagging.

### Techniques for POS tagging

**There are mainly four types of POS taggers:**


> 1. **Rule-based taggers:** The rule-based taggers work on the basis of some pre-defined rules and the context of the information provided to them to assign a part of speech to a word.
> 2. **Stochastic/Probabilistic taggers:** This is the simplest approach to POS tagging. It relies on probability, frequency and statistics. In its most basic form, the tagger finds the tag most frequently assigned to a given word in the training data and assigns that tag to the same word in the test data. Because the surrounding context is ignored, this can sometimes produce grammatically incorrect tagging.
> 3. **Memory-based taggers:** A collection of cases is kept in memory, each having a word, its context, and an appropriate tag. Based on the best match among the cases kept in memory, a new sentence is tagged.
> 4. **Transformation-based taggers:** These combine rule-based and stochastic tagging. Rules are generated automatically from the data, and some pre-defined rules are used as well. Both sets of rules are applied together to perform POS tagging.

source: [askpython.com](https://www.askpython.com/python/examples/pos-tagging-in-nlp-using-spacy)
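
To make the stochastic approach concrete, here is a minimal sketch of a most-frequent-tag baseline in plain Python. The toy corpus and the helper names `train_most_frequent_tag` and `tag_sentence` are assumptions made up for illustration; they are not part of spaCy, and real taggers use far more data and context.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each lower-cased word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_sentence(words, model, default_tag="NOUN"):
    """Tag each word with its most frequent training tag; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]

# Hypothetical, hand-tagged toy corpus: "book" appears twice as NOUN, once as VERB.
toy_corpus = [
    [("He", "PRON"), ("quit", "VERB"), ("the", "DET"), ("job", "NOUN")],
    [("They", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")],
    [("The", "DET"), ("book", "NOUN"), ("is", "AUX"), ("good", "ADJ")],
    [("A", "DET"), ("book", "NOUN"), ("fell", "VERB")],
]

model = train_most_frequent_tag(toy_corpus)
tagged = tag_sentence(["They", "book", "the", "flight"], model)
print(tagged)
```

In this toy run "book" is tagged `NOUN` even though it is used as a verb, which is the kind of grammatically incorrect result the list above warns about.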

### Parts of Speech in English
![alt text](https://promova.com/content/Parts_of_Speech_in_English_8d30a9f1d5.png "POS table")
source: [promova.com](https://promova.com/english-grammar/parts-of-speech-in-english)

### Example Code

In [1]:
import spacy

### POS Tags

In [2]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon flew to mars yesterday. He carried pork adobo with him")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Elon  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
mars  |  NOUN  |  noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
pork  |  NOUN  |  noun
adobo  |  NOUN  |  noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun


In [3]:
doc = nlp("Wow! Captain America: Winter Soldier made 90 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_))

Wow  |  INTJ  |  interjection
!  |  PUNCT  |  punctuation
Captain  |  PROPN  |  proper noun
America  |  PROPN  |  proper noun
:  |  PUNCT  |  punctuation
Winter  |  PROPN  |  proper noun
Soldier  |  PROPN  |  proper noun
made  |  VERB  |  verb
90  |  NUM  |  numeral
million  |  NUM  |  numeral
$  |  SYM  |  symbol
on  |  ADP  |  adposition
the  |  DET  |  determiner
very  |  ADV  |  adverb
first  |  ADJ  |  adjective
day  |  NOUN  |  noun


### spaCy Fine-Grained Tags

In [5]:
doc = nlp("Wow! Captain America: Winter Soldier made 90 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_), " | ", token.tag_, " | ", spacy.explain(token.tag_))

Wow  |  INTJ  |  interjection  |  UH  |  interjection
!  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
Captain  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
America  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
:  |  PUNCT  |  punctuation  |  :  |  punctuation mark, colon or ellipsis
Winter  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
Soldier  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
made  |  VERB  |  verb  |  VBD  |  verb, past tense
90  |  NUM  |  numeral  |  CD  |  cardinal number
million  |  NUM  |  numeral  |  CD  |  cardinal number
$  |  SYM  |  symbol  |  $  |  symbol, currency
on  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
the  |  DET  |  determiner  |  DT  |  determiner
very  |  ADV  |  adverb  |  RB  |  adverb
first  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
day  |  NOUN  |  noun  |  NN  |  noun, singular or mass


### Demonstration in spaCy: present vs past tense of "quit"

In [6]:
doc = nlp("He quits the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quits | VBZ | verb, 3rd person singular present


In [7]:
doc = nlp("he quit the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quit | VBD | verb, past tense
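
The two cells above show that the fine-grained tag carries tense information that the coarse `VERB` tag does not. As a rough sketch, the Penn Treebank verb tags can be mapped to a short description with a plain dictionary. The names `VERB_TAG_INFO` and `describe_verb_tag` are hypothetical helpers, not spaCy API; in the notebook, `spacy.explain(token.tag_)` already returns a similar description.

```python
# Penn Treebank verb tags and the verb form each one marks.
VERB_TAG_INFO = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present tense, not 3rd person singular",
    "VBZ": "present tense, 3rd person singular",
}

def describe_verb_tag(tag):
    """Return a short description for a verb tag, or a fallback for any other tag."""
    return VERB_TAG_INFO.get(tag, "not a verb tag")

# The (word, tag) pairs are taken from the two outputs above.
for word, tag in [("quits", "VBZ"), ("quit", "VBD")]:
    print(word, "|", tag, "|", describe_verb_tag(tag))
```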


### Removing all SPACE, PUNCT and X tokens from `earnings_text`

In [8]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

for token in doc:
    print(token, "|", token.pos_, "|", spacy.explain(token.pos_))

Microsoft | PROPN | proper noun
Corp. | PROPN | proper noun
today | NOUN | noun
announced | VERB | verb
the | DET | determiner
following | VERB | verb
results | NOUN | noun
for | ADP | adposition
the | DET | determiner
quarter | NOUN | noun
ended | VERB | verb
December | PROPN | proper noun
31 | NUM | numeral
, | PUNCT | punctuation
2021 | NUM | numeral
, | PUNCT | punctuation
as | SCONJ | subordinating conjunction
compared | VERB | verb
to | ADP | adposition
the | DET | determiner
corresponding | ADJ | adjective
period | NOUN | noun
of | ADP | adposition
last | ADJ | adjective
fiscal | ADJ | adjective
year | NOUN | noun
: | PUNCT | punctuation


 | SPACE | space
· | PUNCT | punctuation
         | SPACE | space
Revenue | NOUN | noun
was | AUX | auxiliary
$ | SYM | symbol
51.7 | NUM | numeral
billion | NUM | numeral
and | CCONJ | coordinating conjunction
increased | VERB | verb
20 | NUM | numeral
% | NOUN | noun

 | SPACE | space
· | PUNCT | punctuation
         | SPACE | space
Operatin

Remove unnecessary tokens, i.e. those whose `token.pos_` is one of `["SPACE", "PUNCT", "X"]`, in **spaCy**

In [9]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)

In [10]:
filtered_tokens[:10]

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter]
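
If the same filter is needed in several places, it could be wrapped in a small function. The sketch below is one possible shape, with `filter_tokens` and `EXCLUDED_POS` as hypothetical names. It only assumes that each token has a `pos_` attribute, so it is shown here with stand-in objects and runs without a loaded pipeline; with spaCy you would pass `doc` instead.

```python
from types import SimpleNamespace

# Coarse POS tags to drop, as in the loop above.
EXCLUDED_POS = {"SPACE", "PUNCT", "X"}

def filter_tokens(tokens, excluded=EXCLUDED_POS):
    """Return the tokens whose coarse POS tag is not in `excluded`."""
    return [token for token in tokens if token.pos_ not in excluded]

# Stand-in tokens that mimic the `text` and `pos_` attributes of spaCy tokens.
sample = [
    SimpleNamespace(text="Revenue", pos_="NOUN"),
    SimpleNamespace(text="\n", pos_="SPACE"),
    SimpleNamespace(text="·", pos_="PUNCT"),
    SimpleNamespace(text="was", pos_="AUX"),
]

kept = [token.text for token in filter_tokens(sample)]
print(kept)
```

Using a set for the excluded tags keeps the membership test cheap and makes it easy to pass a different set for other clean-up tasks.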

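Count the tokens per POS tag using `count_by()`. The keys of the returned dictionary are integer attribute IDs, which can be mapped back to tag names via `doc.vocab`.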

In [12]:
count = doc.count_by(spacy.attrs.POS)
count

{96: 13,
 92: 46,
 100: 24,
 90: 9,
 85: 16,
 93: 16,
 97: 27,
 98: 1,
 84: 20,
 103: 10,
 87: 6,
 99: 5,
 89: 12,
 86: 3,
 94: 3,
 95: 2}

In [11]:
doc.vocab[96].text

'PROPN'

In [13]:
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

PROPN | 13
NOUN | 46
VERB | 24
DET | 9
ADP | 16
NUM | 16
PUNCT | 27
SCONJ | 1
ADJ | 20
SPACE | 10
AUX | 6
SYM | 5
CCONJ | 12
ADV | 3
PART | 3
PRON | 2
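
As a small follow-up sketch, the counts printed above can be ranked and turned into shares of the whole text. The dictionary below is copied by hand from that output; the variable names `pos_counts`, `total` and `ranked` are illustrative only.

```python
# Counts copied from the output above (tag name -> number of tokens).
pos_counts = {
    "PROPN": 13, "NOUN": 46, "VERB": 24, "DET": 9,
    "ADP": 16, "NUM": 16, "PUNCT": 27, "SCONJ": 1,
    "ADJ": 20, "SPACE": 10, "AUX": 6, "SYM": 5,
    "CCONJ": 12, "ADV": 3, "PART": 3, "PRON": 2,
}

total = sum(pos_counts.values())

# Sort by count, largest first; ties keep their original order.
ranked = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for tag, n in ranked[:5]:
    print(f"{tag:<6} {n:>3}  {n / total:.1%}")
```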
