# Chapter 6 - POS_Tagging

## Approaches to POS tagging
1. Rule-based tagging
2. HMM and the Viterbi algorithm
3. Statistical approaches
4. ML approaches
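
To make approach 2 a bit more concrete, here is a minimal sketch of Viterbi decoding over a toy HMM. The tag set, the three-word vocabulary, and every probability below are made-up illustrative values, not estimates from any corpus.

```python
# Toy HMM: all probabilities below are made-up illustrative values.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"barks": 0.6, "dog": 0.05},
}

def viterbi(words):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][s] = (probability of the best path ending in tag s at word i, previous tag)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            # Try every previous tag p and keep the most probable way to reach s.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(word, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```

A real tagger would estimate the start, transition, and emission probabilities from a tagged corpus and would usually work with log-probabilities to avoid underflow on long sentences.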

## Approach 4: Machine learning approaches
1. NLTK
2. TextBlob
3. spaCy

## NLTK
- For a detailed explanation, see [POS tagging in NLTK](https://github.com/leonewtonz/NLP-Basic-Python/blob/master/Part_2-Words/Chapter_06_pos_tagging/POS_NLTK.ipynb)

In [3]:
import nltk
from nltk import word_tokenize
from nltk import pos_tag

The code below does a decent job. However, it tags the adjective 'brown' as a noun (NN); the main verb 'jumped' is correctly tagged VBD.

In [6]:
sent = 'The quick brown fox jumped over the lazy dog'  # add a comma ('quick, brown') to see brown tagged as JJ
tokens = word_tokenize(sent)
tags = pos_tag(tokens)
tags

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumped', 'VBD'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
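
The tags above are Penn Treebank tags. As a reading aid, here is a small sketch that prints each token next to a plain-English gloss. The `PENN_TAGS` dictionary is a hand-written helper covering only the tags in this output, not something NLTK provides under that name.

```python
# Hand-written glosses for the Penn Treebank tags that appear in the output above.
PENN_TAGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "IN": "preposition or subordinating conjunction",
}

# Tagged tokens copied from the output above.
tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

# Print token, tag, and gloss in aligned columns.
for token, tag in tags:
    print(f"{token:<8}{tag:<5}{PENN_TAGS.get(tag, 'unknown')}")
```

For the full tag set, NLTK ships its own lookup, `nltk.help.upenn_tagset('NN')`, which needs the `tagsets` resource to be downloaded first.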

#### Practice:
* Use the following text
* Perform POS Tagging
* Make a dictionary POS -> count
* Print the dictionary from highest to lowest count

In [7]:
text = """On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in \
S. Place and walked slowly, as though in hesitation, towards K. bridge. \
He had successfully avoided meeting his landlady on the staircase. His \
garret was under the roof of a high, five-storied house and was more \
like a cupboard than a room. The landlady who provided him with garret, \
dinners, and attendance, lived on the floor below, and every time \
he went out he was obliged to pass her kitchen, the door of which \
invariably stood open. And each time he passed, the young man had a \
sick, frightened feeling, which made him scowl and feel ashamed. He was \
hopelessly in debt to his landlady, and was afraid of meeting her."""

text

'On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S. Place and walked slowly, as though in hesitation, towards K. bridge. He had successfully avoided meeting his landlady on the staircase. His garret was under the roof of a high, five-storied house and was more like a cupboard than a room. The landlady who provided him with garret, dinners, and attendance, lived on the floor below, and every time he went out he was obliged to pass her kitchen, the door of which invariably stood open. And each time he passed, the young man had a sick, frightened feeling, which made him scowl and feel ashamed. He was hopelessly in debt to his landlady, and was afraid of meeting her.'

**Perform POS Tagging**

Note: `tags` is a list of `(token, tag)` tuples.

In [8]:
tags = pos_tag(word_tokenize(text))
tags

[('On', 'IN'),
 ('an', 'DT'),
 ('exceptionally', 'RB'),
 ('hot', 'JJ'),
 ('evening', 'VBG'),
 ('early', 'JJ'),
 ('in', 'IN'),
 ('July', 'NNP'),
 ('a', 'DT'),
 ('young', 'JJ'),
 ('man', 'NN'),
 ('came', 'VBD'),
 ('out', 'IN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('garret', 'NN'),
 ('in', 'IN'),
 ('which', 'WDT'),
 ('he', 'PRP'),
 ('lodged', 'VBD'),
 ('in', 'IN'),
 ('S.', 'NNP'),
 ('Place', 'NNP'),
 ('and', 'CC'),
 ('walked', 'VBD'),
 ('slowly', 'RB'),
 (',', ','),
 ('as', 'IN'),
 ('though', 'IN'),
 ('in', 'IN'),
 ('hesitation', 'NN'),
 (',', ','),
 ('towards', 'NNS'),
 ('K.', 'NNP'),
 ('bridge', 'NN'),
 ('.', '.'),
 ('He', 'PRP'),
 ('had', 'VBD'),
 ('successfully', 'RB'),
 ('avoided', 'VBN'),
 ('meeting', 'VBG'),
 ('his', 'PRP$'),
 ('landlady', 'NN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('staircase', 'NN'),
 ('.', '.'),
 ('His', 'PRP$'),
 ('garret', 'NN'),
 ('was', 'VBD'),
 ('under', 'IN'),
 ('the', 'DT'),
 ('roof', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('high', 'JJ'),
 (',', ','),
 ('five-stori

**Make a dict POS -> count**

In [9]:
pos_dict = {}
for token, pos in tags:
    if pos in pos_dict:
        pos_dict[pos] += 1
    else:
        pos_dict[pos] = 1
pos_dict # display the pos_dict

{'IN': 21,
 'DT': 15,
 'RB': 6,
 'JJ': 8,
 'VBG': 3,
 'NNP': 4,
 'NN': 25,
 'VBD': 17,
 'WDT': 3,
 'PRP': 8,
 'CC': 7,
 ',': 12,
 'NNS': 2,
 '.': 6,
 'VBN': 3,
 'PRP$': 5,
 'RBR': 1,
 'WP': 1,
 'TO': 2,
 'VB': 1}

**Print pos_dict from highest to lowest count**

In [10]:
for pos in sorted(pos_dict, key=pos_dict.get, reverse=True):
    print(pos, ':', pos_dict[pos])

NN : 25
IN : 21
VBD : 17
DT : 15
, : 12
JJ : 8
PRP : 8
CC : 7
RB : 6
. : 6
PRP$ : 5
NNP : 4
VBG : 3
WDT : 3
VBN : 3
NNS : 2
TO : 2
RBR : 1
WP : 1
VB : 1
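
The manual dictionary and the sort can also be done in one step with `collections.Counter` from the standard library. This is only an alternative sketch; the short `tags` list below holds a few pairs copied from the tagged output above so the snippet runs on its own.

```python
from collections import Counter

# A few (token, tag) pairs copied from the tagged output above; in the notebook,
# `tags` would be the full list returned by pos_tag().
tags = [('On', 'IN'), ('an', 'DT'), ('exceptionally', 'RB'), ('hot', 'JJ'),
        ('evening', 'VBG'), ('early', 'JJ'), ('in', 'IN'), ('July', 'NNP')]

# Count how often each POS tag occurs.
pos_counts = Counter(pos for _, pos in tags)

# most_common() returns (tag, count) pairs sorted from highest to lowest count.
for pos, count in pos_counts.most_common():
    print(pos, ':', count)
```

`Counter` is a `dict` subclass, so `pos_counts['NN']` works the same way as with `pos_dict`, and a missing tag returns 0 instead of raising `KeyError`.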


## spaCy
- For a detailed explanation, see [spaCy](https://github.com/leonewtonz/NLP-Basic-Python/blob/master/Part_2-Words/Chapter_06_pos_tagging/SpaCy.ipynb)

The spaCy Python library is designed for 'industrial-strength' NLP. See the spaCy installation instructions for details. You should be able to install it with pip or pip3:

    pip3 install -U spacy

spaCy can also be installed with conda, or compiled from source. If you have a GPU, read the instructions for linking spaCy with your CUDA library.

After spaCy is installed, you should download at least one pretrained model. There are three models (small, medium, and large) that can be downloaded as follows:

    python3 -m spacy download en_core_web_sm

Simply change the 'sm' at the end to 'md' or 'lg' for the medium or large model.

From the command line:
1. Install spaCy
2. Install a pretrained model

In [13]:
import spacy

# load model
nlp = spacy.load('en_core_web_sm')

In [27]:
# Sample text
text = "Since turning cautious Friday morning, the DJIA\
 has dropped approximately 1,700 from peak to trough."
text

'Since turning cautious Friday morning, the DJIA has dropped approximately 1,700 from peak to trough.'

In [28]:
# Create spacy object
doc = nlp(text)
doc

Since turning cautious Friday morning, the DJIA has dropped approximately 1,700 from peak to trough.

**Display doc in details**

In [29]:
for token in doc:
    print(token, token.lemma_, token.pos_, token.tag_, token.is_alpha, token.is_stop)

# other attributes: token.dep_, token.shape_

Since since SCONJ IN True True
turning turn VERB VBG True False
cautious cautious ADJ JJ True False
Friday Friday PROPN NNP True False
morning morning NOUN NN True False
, , PUNCT , False False
the the DET DT True True
DJIA DJIA PROPN NNP True False
has have AUX VBZ True True
dropped drop VERB VBN True False
approximately approximately ADV RB True False
1,700 1,700 NUM CD False False
from from ADP IN True True
peak peak NOUN NN True False
to to ADP IN True True
trough trough NOUN NN True False
. . PUNCT . False False
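
In the output above, `token.pos_` is the coarse-grained part of speech and `token.tag_` is the fine-grained tag. The sketch below groups the fine tags under each coarse tag. It works on rows copied from the printed output rather than calling spaCy, so it runs without a model installed.

```python
from collections import defaultdict

# (token, pos_, tag_) triples copied from the printed output above.
rows = [
    ("Since", "SCONJ", "IN"), ("turning", "VERB", "VBG"), ("cautious", "ADJ", "JJ"),
    ("Friday", "PROPN", "NNP"), ("morning", "NOUN", "NN"), (",", "PUNCT", ","),
    ("the", "DET", "DT"), ("DJIA", "PROPN", "NNP"), ("has", "AUX", "VBZ"),
    ("dropped", "VERB", "VBN"), ("approximately", "ADV", "RB"), ("1,700", "NUM", "CD"),
    ("from", "ADP", "IN"), ("peak", "NOUN", "NN"), ("to", "ADP", "IN"),
    ("trough", "NOUN", "NN"), (".", "PUNCT", "."),
]

# Map each coarse tag to the set of fine-grained tags seen with it.
fine_by_coarse = defaultdict(set)
for _, pos, tag in rows:
    fine_by_coarse[pos].add(tag)

for pos in sorted(fine_by_coarse):
    print(pos, sorted(fine_by_coarse[pos]))
```

Two things stand out in this sample: VERB covers both VBG and VBN, and the fine tag IN shows up under both SCONJ and ADP. If a tag is unfamiliar, `spacy.explain('VBG')` returns a short description.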


### Some methods and features of spaCy

In [30]:
# get noun phrase
[chunk.text for chunk in doc.noun_chunks]

['cautious Friday morning', 'the DJIA', 'peak', 'trough']

In [24]:
# get verbs in lemma form
[token.lemma_ for token in doc if token.pos_ == 'VERB']

['turn', 'drop']

In [33]:
# NER Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)

Friday DATE
morning TIME
approximately 1,700 CARDINAL


In [34]:
# Visualize a dependency parse
# (a look at how spaCy actually works)

from spacy import displacy

In [35]:
doc = nlp('The quick brown fox jumped over the lazy river.')

In [36]:
displacy.render(doc, style='dep')

## TextBlob
- For a detailed explanation, see [TextBlob](https://github.com/leonewtonz/NLP-Basic-Python/blob/master/Part_2-Words/Chapter_06_pos_tagging/TextBlob%20Ten.ipynb)

In order to use TextBlob methods, import textblob, then convert text to a TextBlob object. Then, you are ready to roll.

Install or upgrade from PyPI by typing the commands below at the command line:

    pip install -U textblob
    python -m textblob.download_corpora

In [39]:
from textblob import TextBlob

In [43]:
# Sample text
text = """TextBlob is a Python (2 and 3) library for processing\
 textual data. It provides a simple API for diving into common\
 natural language processing (NLP) tasks such as part-of-speech\
 tagging, noun phrase extraction, sentiment analysis, classification, translation, and more."""
text

'TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.'

In [45]:
# Convert text to TextBlob object
blob = TextBlob(text)
blob

TextBlob("TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.")

### Ten basic features of TextBlob

**1. POS Tagging**

In [46]:
# All tags are already available on the TextBlob object
blob.tags[:10]

[('TextBlob', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('Python', 'NNP'),
 ('2', 'CD'),
 ('and', 'CC'),
 ('3', 'CD'),
 ('library', 'NN'),
 ('for', 'IN'),
 ('processing', 'VBG')]

**2. Tokenize**

In [47]:
blob.sentences

[Sentence("TextBlob is a Python (2 and 3) library for processing textual data."),
 Sentence("It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.")]

In [48]:
for sent in blob.sentences:
    print(sent)

TextBlob is a Python (2 and 3) library for processing textual data.
It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.


In [49]:
blob.words

WordList(['TextBlob', 'is', 'a', 'Python', '2', 'and', '3', 'library', 'for', 'processing', 'textual', 'data', 'It', 'provides', 'a', 'simple', 'API', 'for', 'diving', 'into', 'common', 'natural', 'language', 'processing', 'NLP', 'tasks', 'such', 'as', 'part-of-speech', 'tagging', 'noun', 'phrase', 'extraction', 'sentiment', 'analysis', 'classification', 'translation', 'and', 'more'])

In [50]:
# collect all words that start with 't'

t_word = [w for w in blob.words if w.lower().startswith('t')]
t_word

['TextBlob', 'textual', 'tasks', 'tagging', 'translation']

**3. Lemmatize**

In [51]:
from textblob import Word

In [54]:
# Create Word object
w = Word('alumni')
print(w, 'lemmatized:', w.lemmatize())

alumni lemmatized: alumnus


In [55]:
# We can also lemmatize a word as a different part of speech
w = Word('had')
print(w, 'lemmatized:', w.lemmatize())
print(w, 'lemmatized verb:', w.lemmatize('v'))

had lemmatized: had
had lemmatized verb: have


**4. WordNet Integration**

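TextBlob exposes WordNet through the `Word` object: `Word.synsets` returns the WordNet synsets for a word, and `Word.definitions` returns their definitions. Some sample code is in the link.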

**5. Noun Phrase Extraction**

In [57]:
blob.noun_phrases

WordList(['textblob', 'python', 'processing textual data', 'api', 'common natural language processing', 'nlp', 'noun phrase extraction', 'sentiment analysis'])

**6. Ngrams**

An n-gram is a sequence of n consecutive words. `ngrams(n=3)` returns every run of three consecutive words in the text.

In [58]:
blob.ngrams(n=3)

[WordList(['TextBlob', 'is', 'a']),
 WordList(['is', 'a', 'Python']),
 WordList(['a', 'Python', '2']),
 WordList(['Python', '2', 'and']),
 WordList(['2', 'and', '3']),
 WordList(['and', '3', 'library']),
 WordList(['3', 'library', 'for']),
 WordList(['library', 'for', 'processing']),
 WordList(['for', 'processing', 'textual']),
 WordList(['processing', 'textual', 'data']),
 WordList(['textual', 'data', 'It']),
 WordList(['data', 'It', 'provides']),
 WordList(['It', 'provides', 'a']),
 WordList(['provides', 'a', 'simple']),
 WordList(['a', 'simple', 'API']),
 WordList(['simple', 'API', 'for']),
 WordList(['API', 'for', 'diving']),
 WordList(['for', 'diving', 'into']),
 WordList(['diving', 'into', 'common']),
 WordList(['into', 'common', 'natural']),
 WordList(['common', 'natural', 'language']),
 WordList(['natural', 'language', 'processing']),
 WordList(['language', 'processing', 'NLP']),
 WordList(['processing', 'NLP', 'tasks']),
 WordList(['NLP', 'tasks', 'such']),
 WordList(['tasks',
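
To see what is going on under the hood, n-grams can be built with a plain sliding window. This is a minimal sketch, not TextBlob's actual implementation; the `ngrams` helper and the shortened word list are only for illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list of lists."""
    # Slide a window of width n across the list, one position at a time.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# First few words of the sample text above.
words = ['TextBlob', 'is', 'a', 'Python', '2', 'and', '3', 'library']

for gram in ngrams(words, 3):
    print(gram)
```

A list of `k` words yields `k - n + 1` n-grams, so the 8 words here give 6 trigrams. A list shorter than `n` gives an empty result.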

**7. Sentiment Analysis**

Polarity ranges from -1.0 to +1.0. Subjectivity ranges from 0.0 to 1.0 where lower numbers are more objective and higher numbers are more subjective.

In [59]:
blob.sentiment

Sentiment(polarity=0.06000000000000001, subjectivity=0.4514285714285714)

In [60]:
blob2 = TextBlob('I hate seafood. I love spicy food')
blob2.sentiment

Sentiment(polarity=-0.15000000000000002, subjectivity=0.75)
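
One way to read the second result: lexicon-based analyzers score individual words and then average them. The sketch below assumes hypothetical scores for 'hate' and 'love', picked to be consistent with the output above rather than read from TextBlob's actual lexicon. The real analyzer also handles negation and intensifiers, which this toy version ignores.

```python
# Hypothetical word scores as (polarity, subjectivity); illustrative values only,
# not read from TextBlob's actual lexicon.
lexicon = {"hate": (-0.8, 0.9), "love": (0.5, 0.6)}

def average_sentiment(text):
    """Average the scores of the words found in the toy lexicon."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    if not scores:
        # No known words: treat the text as neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

polarity, subjectivity = average_sentiment("I hate seafood. I love spicy food")
print(round(polarity, 2), round(subjectivity, 2))  # -0.15 0.75
```

With one strongly negative and one moderately positive word, the average lands slightly below zero, which is why a text with both 'hate' and 'love' comes out mildly negative.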

**8. Spelling Correction**

In [61]:
messy_text = ('The stake is purfect but service is horible')
blob2 = TextBlob(messy_text)
blob2.correct()

TextBlob("The stake is perfect but service is horrible")

**9. Language Detection**

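Uses the Google Translate API and requires an internet connection. The `HTTPError` shown below suggests the web endpoint no longer accepts these requests; `detect_language()` was deprecated and later removed in newer TextBlob releases.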

In [64]:
blob = TextBlob('Hello World')
blob.detect_language()

HTTPError: HTTP Error 400: Bad Request

In [65]:
blob2 = TextBlob('Hola amor')
blob2.detect_language()

HTTPError: HTTP Error 400: Bad Request

**10. Open Source**

TextBlob is open source, which means you can dig into the code and learn more.