---------------------------------
#### Part of Speech Tagging 
--------------------------------

Part-Of-Speech tagging (or POS tagging, for short) is one of the main components of almost any NLP analysis. The task of POS-tagging simply implies labelling words with their appropriate Part-Of-Speech (Noun, Verb, Adjective, Adverb, Pronoun, …).

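We will use both __NLTK__ and __spaCy__ to understand POS tags. NLTK's default tagger uses the __Penn Treebank__ tag set; spaCy exposes Penn Treebank-style fine-grained tags through `tag_` and coarse-grained Universal POS tags through `pos_`.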

- This means labeling words in a sentence as nouns, adjectives, verbs, etc. The tags also encode finer distinctions such as verb tense and noun number.

- POS tagging is the process of marking each word in a corpus with its corresponding part-of-speech tag, _based on its context and definition_.

- POS tags are also known as __word classes__, __morphological classes__, or __lexical tags__. 

- POS Tagging simply means labeling words with their appropriate Part-Of-Speech.

    Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope 
    Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is   
    Adjective(ADJ)- big, happy, green, young, fun, crazy, three    
    Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow   
    Preposition (P)- at, on, in, from, with, near, between, about, under   
    Conjunction (CON)- and, or, but, because, so, yet, unless, since, if   
    Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this    
    Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!

POS tag list:

    CC	coordinating conjunction
    CD	cardinal number
    DT	determiner
    EX	existential there (like: "there is" ... think of it like "there exists")
    FW	foreign word
    IN	preposition/subordinating conjunction
    JJ	adjective	'big'
    JJR	adjective, comparative	'bigger'
    JJS	adjective, superlative	'biggest'
    LS	list marker	1)
    MD	modal	could, will
    NN	noun, singular 'desk'
    NNS	noun plural	'desks'
    NNP	proper noun, singular	'Harrison'
    NNPS	proper noun, plural	'Americans'
    PDT	predeterminer	'all the kids'
    POS	possessive ending	parent's
    PRP	personal pronoun	I, he, she
    PRP$	possessive pronoun	my, his, her
    RB	adverb	very, silently,
    RBR	adverb, comparative	better
    RBS	adverb, superlative	best
    RP	particle	give up
    TO	to	go 'to' the store.
    UH	interjection	errrrrrrrm
    VB	verb, base form	take
    VBD	verb, past tense	took
    VBG	verb, gerund/present participle	taking
    VBN	verb, past participle	taken
    VBP	verb, sing. present, non-3rd person	take
    VBZ	verb, 3rd person sing. present	takes
    WDT	wh-determiner	which
    WP	wh-pronoun	who, what
    WP$	possessive wh-pronoun	whose
    WRB	wh-adverb	where, when
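
Many of the tags in the list share a prefix with their word class (`NN*` for nouns, `VB*` for verbs, and so on). As a rough sketch, the fine-grained tags can be collapsed into the broad classes listed earlier. The prefix table and the `coarse_class` helper below are illustrative assumptions, not part of NLTK or spaCy.

```python
# Hypothetical prefix table: maps the start of a Penn Treebank tag to a broad class.
PREFIX_TO_CLASS = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("PRP", "pronoun"),
    ("WP", "pronoun"),
]

def coarse_class(tag):
    """Return a broad word class for a Penn Treebank tag, or 'other'."""
    for prefix, name in PREFIX_TO_CLASS:
        if tag.startswith(prefix):
            return name
    return "other"

for tag in ["NNPS", "VBD", "JJS", "RBR", "PRP$", "DT"]:
    print(tag, "->", coarse_class(tag))
```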
    
POS tagging is usually treated as a supervised learning problem that uses features such as the previous word, the next word, and whether the first letter is capitalized.
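
To make the idea of features concrete, here is a minimal sketch of a feature extractor for one token. The function name `token_features`, the feature names, and the `<START>`/`<END>` placeholders are assumptions made for illustration; real taggers use larger feature sets.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i of a tokenized sentence."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_first": i == 0,
        "is_capitalized": word[:1].isupper(),
        "suffix3": word[-3:].lower(),
        # Neighbouring words give the context that disambiguates the tag
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = ["They", "refuse", "to", "permit", "us"]
features = token_features(tokens, 1)
print(features)
```

A classifier trained on such dictionaries, paired with gold tags, would then predict the tag of each token.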

NLTK provides a function for POS tagging, `pos_tag`, which is applied after tokenization.

__Key points__

1. Identifying part-of-speech tags is much more complicated than simply mapping words to their tags, because a single word can have different part-of-speech tags in different sentences depending on context. That is why a fixed word-to-tag mapping cannot work in general.

2. Manual POS tagging is not scalable. That is why we rely on __machine-based__ POS tagging.

#### Penn Treebank Tags
The most popular tag set is the Penn Treebank tag set.

Most of the already trained taggers for English are trained on this tag set. Examples of such taggers are:

- NLTK default tagger
- Stanford CoreNLP tagger

In [4]:
from IPython.display import Image

#### Using NLTK

In [1]:
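# The tokenizer, tagger, and stop-word resources must be fetched once with
# nltk.download(); the exact resource names depend on the NLTK version.
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize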

stop_words = set(stopwords.words('english')) 

In [2]:
print (pos_tag(word_tokenize("I'm learning NLP")))

[('I', 'PRP'), ("'m", 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP')]


In [3]:
pos_tag(word_tokenize("I'm learning machine learning"))

[('I', 'PRP'),
 ("'m", 'VBP'),
 ('learning', 'VBG'),
 ('machine', 'NN'),
 ('learning', 'NN')]

Comparing the output with CoreNLP:

In [5]:
Image(r'D:\MYLEARN\2-ANALYTICS-DataScience\icons-images\pos-00.JPG', width=400)

<IPython.core.display.Image object>

In [4]:
pos_tag(['feet', 'foot'])

[('feet', 'NNS'), ('foot', 'NN')]

In [5]:
pos_tag(word_tokenize('i foot the bill'))

[('i', 'JJ'), ('foot', 'VBP'), ('the', 'DT'), ('bill', 'NN')]

In [6]:
pos_tag(word_tokenize('Hi height is 1 foot'))

[('Hi', 'NNP'), ('height', 'NN'), ('is', 'VBZ'), ('1', 'CD'), ('foot', 'NN')]

In [7]:
sentence = "The striped bats are hanging on their feet for best"
nltk.pos_tag(nltk.word_tokenize(sentence))

[('The', 'DT'),
 ('striped', 'JJ'),
 ('bats', 'NNS'),
 ('are', 'VBP'),
 ('hanging', 'VBG'),
 ('on', 'IN'),
 ('their', 'PRP$'),
 ('feet', 'NNS'),
 ('for', 'IN'),
 ('best', 'JJS')]

In [10]:
sentence = "I work for popcorn-ai, My boss is Veera."
token = nltk.word_tokenize(sentence)
token

['I', 'work', 'for', 'popcorn-ai', ',', 'My', 'boss', 'is', 'Veera', '.']

In [11]:
print(nltk.pos_tag(token))

[('I', 'PRP'), ('work', 'VBP'), ('for', 'IN'), ('popcorn-ai', 'NN'), (',', ','), ('My', 'NNP'), ('boss', 'NN'), ('is', 'VBZ'), ('Veera', 'NNP'), ('.', '.')]


In [12]:
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? \
               The weather is great, and Python is awesome !\
               The sky is pinkish-blue. \
               You shouldn\'t eat cardboard."

In [13]:
# Dummy text 
txt =  "Sukanya, Rajib and Naba are my good friends. \
        Sukanya is getting married next year. \
        Marriage is a big step in ones life. \
        It is both exciting and frightening. \
        But friendship is a sacred bond between people. \
        It is a special kind of love between us. \
        Many of you must have tried searching for a friend. \
        but never found the right one."

In [14]:
# PunktSentenceTokenizer from the nltk.tokenize.punkt module 
tokenized = sent_tokenize(txt) 
tokenized
print(len(tokenized))

8


In [15]:
for sentence in tokenized:
    # Split the sentence into words and punctuation tokens
    wordsList = nltk.word_tokenize(sentence)

    # Optionally remove stop words before tagging
    # wordsList = [w for w in wordsList if w not in stop_words]

    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)

    print(tagged)

[('Sukanya', 'NNP'), (',', ','), ('Rajib', 'NNP'), ('and', 'CC'), ('Naba', 'NNP'), ('are', 'VBP'), ('my', 'PRP$'), ('good', 'JJ'), ('friends', 'NNS'), ('.', '.')]
[('Sukanya', 'NNP'), ('is', 'VBZ'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN'), ('.', '.')]
[('Marriage', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('big', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('ones', 'NNS'), ('life', 'NN'), ('.', '.')]
[('It', 'PRP'), ('is', 'VBZ'), ('both', 'DT'), ('exciting', 'VBG'), ('and', 'CC'), ('frightening', 'NN'), ('.', '.')]
[('But', 'CC'), ('friendship', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('sacred', 'JJ'), ('bond', 'NN'), ('between', 'IN'), ('people', 'NNS'), ('.', '.')]
[('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('special', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('love', 'NN'), ('between', 'IN'), ('us', 'PRP'), ('.', '.')]
[('Many', 'JJ'), ('of', 'IN'), ('you', 'PRP'), ('must', 'MD'), ('have', 'VB'), ('tried', 'VBN'), ('searching', 'VBG'), ('for', 'IN'), ('a', 'DT'), ('friend

Example:

In [20]:
text = word_tokenize("And now for something completely different")

In [21]:
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.

Example:

In [22]:
text = word_tokenize("They refuse to permit us to obtain the refuse permit")

In [24]:
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

Notice that `refuse` and `permit` each appear both as a verb (`refuse` as a present tense verb, VBP, and `permit` as a base form verb, VB) and as a noun (NN).
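
A quick way to spot such context-dependent words is to group the tagger output by word and keep the words that received more than one tag. The sketch below hard-codes the output shown above so it runs without NLTK; the helper name `tags_by_word` is an assumption for illustration.

```python
from collections import defaultdict

# Tagger output copied from the cell above
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

def tags_by_word(pairs):
    """Collect the set of tags each (lowercased) word was given."""
    seen = defaultdict(set)
    for word, tag in pairs:
        seen[word.lower()].add(tag)
    return seen

# Keep only the words that were tagged in more than one way
ambiguous = {w: sorted(t) for w, t in tags_by_word(tagged).items() if len(t) > 1}
print(ambiguous)
```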

Lexical categories like "noun" and part-of-speech tags like NN seem to have their uses, but the details will be obscure to many readers. You might wonder what justification there is for introducing this extra level of information. Many of these categories arise from a superficial analysis of the distribution of words in text.

Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). 

The `text.similar()` method takes a word w, finds all contexts w1 w w2 in which it occurs, then finds all words w' that appear in the same context, i.e. w1 w' w2.

In [25]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

In [26]:
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word


In [27]:
text.similar('bought')

made said done put had seen found given left heard was been brought
set got that took in told felt


In [28]:
text.similar('over')

in on to of and for with from at by that into as up out down through
is all about


In [19]:
text.similar('the')

a his this their its her an that our any all one these my in your no
some other and


Observe that:
- searching for _woman_ finds _nouns_;
- searching for _bought_ mostly finds _verbs_;
- searching for _over_ generally finds _prepositions_;
- searching for _the_ finds several _determiners_.

A tagger can correctly identify the tags on these words in the context of a sentence.

Basically, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units. Such units are called tokens and, most of the time, correspond to words and symbols (e.g. punctuation).

#### Using spaCy

- NLTK is essentially a string processing library: it takes strings as input and returns strings or lists of strings as output. spaCy, in contrast, takes an object-oriented approach: parsing a text returns a document object whose words and sentences are objects themselves.

- spaCy has support for word vectors whereas NLTK does not.

- Because spaCy is built around recent, optimized algorithms, its performance is usually good compared to NLTK.

- As the comparison below shows, spaCy performs better in word tokenization and POS tagging, while NLTK outperforms spaCy in sentence tokenization. spaCy's weaker showing there is a result of differing approaches: NLTK simply attempts to split the text into sentences, whereas spaCy constructs a syntactic tree for each sentence, a more robust method that yields much more information about the text.

In [35]:
Image(r'D:\MYLEARN\2-ANALYTICS-DataScience\icons-images\pos-01.JPG', width=400)

<IPython.core.display.Image object>

In [20]:
# pip install spacy
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg
# (the `en` shortcut, `python -m spacy download en`, only works in spaCy v2)

In [17]:
import spacy

In [18]:
# load the medium-sized English spaCy model
nlp = spacy.load("en_core_web_md")

In [33]:
# create a spaCy document that we will be using to perform 
# parts of speech tagging
doc = nlp("Apple is looking at buying U.K. startup for $1 billion, One billion ...")

The spaCy document object has several attributes that can be used to perform a variety of tasks. For instance, the `text` attribute returns the text of the document.

Similarly, the `pos_` attribute of a token returns the coarse-grained POS tag.

To obtain fine-grained POS tags, we can use the `tag_` attribute. And finally, to get the explanation of a tag, we can call the `spacy.explain()` function and pass it the tag name.

In [34]:
doc

Apple is looking at buying U.K. startup for $1 billion, One billion ...

In [35]:
for token in doc:
    print(token.text, '\t',
          token.lemma_, '\t',
          token.pos_, '\t',
          token.tag_, '\t',
          token.dep_, '\t',
          token.shape_, '\t',
          token.is_alpha, '\t',
          token.is_stop)

Apple 	 Apple 	 PROPN 	 NNP 	 nsubj 	 Xxxxx 	 True 	 False
is 	 be 	 AUX 	 VBZ 	 aux 	 xx 	 True 	 True
looking 	 look 	 VERB 	 VBG 	 ROOT 	 xxxx 	 True 	 False
at 	 at 	 ADP 	 IN 	 prep 	 xx 	 True 	 True
buying 	 buy 	 VERB 	 VBG 	 pcomp 	 xxxx 	 True 	 False
U.K. 	 U.K. 	 PROPN 	 NNP 	 compound 	 X.X. 	 False 	 False
startup 	 startup 	 NOUN 	 NN 	 dobj 	 xxxx 	 True 	 False
for 	 for 	 ADP 	 IN 	 prep 	 xxx 	 True 	 True
$ 	 $ 	 SYM 	 $ 	 quantmod 	 $ 	 False 	 False
1 	 1 	 NUM 	 CD 	 compound 	 d 	 False 	 False
billion 	 billion 	 NUM 	 CD 	 pobj 	 xxxx 	 True 	 False
, 	 , 	 PUNCT 	 , 	 punct 	 , 	 False 	 False
One 	 one 	 NUM 	 CD 	 compound 	 Xxx 	 True 	 True
billion 	 billion 	 NUM 	 CD 	 npadvmod 	 xxxx 	 True 	 False
... 	 ... 	 PUNCT 	 : 	 punct 	 ... 	 False 	 False


    Text: The original word text.
    Lemma: The base form of the word.
    POS: The simple part-of-speech tag.
    Tag: The detailed part-of-speech tag.
    Dep: Syntactic dependency, i.e. the relation between tokens.
    Shape: The word shape – capitalization, punctuation, digits.
    is alpha: Does the token consist of alphabetic characters?
    is stop: Is the token part of a stop list, i.e. the most common words of the language?

More examples:

In [36]:
sen = nlp(u"I like to play football. I hated it in my childhood though")

In [37]:
print(sen.text)

I like to play football. I hated it in my childhood though


Next, let's look at the `pos_` attribute. We will print the POS tag of the word "hated", which sits at index 7 (indexing starts at 0, and the full stop counts as a token).

In [38]:
print(sen[7])
print(sen[7].pos_)

hated
VERB


Now let's print the fine-grained POS tag for the word "hated".

In [39]:
print(sen[7].tag_)

VBD


To see what VBD means, we can use spacy.explain() method as shown below:

In [27]:
print(spacy.explain(sen[7].tag_))

verb, past tense


Let's print the text, coarse-grained POS tags, fine-grained POS tags, and the explanation for the tags for all the words in the sentence.

In [40]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival "to"
play         VERB       VB       verb, base form
football     NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           PRON       PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       ADV        RB       adverb


Examples:

In [41]:
sen = nlp(u'Can you google it? ')
word = sen[2]

In [32]:
print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       VERB       VB       verb, base form


Here the word "google" is used as a verb, and the output above shows that it has been correctly identified as one (`VB`, base form).

#### More examples using spaCy

In [48]:
ex1 = nlp('he drinks a drink')

In [49]:
for word in ex1:
    print(word.text, word.pos, word.pos_)

he 95 PRON
drinks 100 VERB
a 90 DET
drink 92 NOUN


In [50]:
ex2 = nlp('i fish a fish')

In [51]:
for word in ex2:
    print(word.text, word.pos, word.pos_, word.tag_)

i 95 PRON PRP
fish 100 VERB VBP
a 90 DET DT
fish 92 NOUN NN


In [52]:
# Explain the POS abbreviation
spacy.explain('DT')

'determiner'

In [53]:
spacy.explain('PRON')

'pronoun'

In [54]:
spacy.explain('VBP')

'verb, non-3rd person singular present'

In [55]:
ex3 = nlp('All the faith he had had had had no effect on the outcome of his life')
for word in ex3:
    print(word.text, word.pos, word.pos_, word.tag_)

All 90 DET PDT
the 90 DET DT
faith 92 NOUN NN
he 95 PRON PRP
had 87 AUX VBD
had 100 VERB VBN
had 100 VERB VBN
had 100 VERB VBN
no 90 DET DT
effect 92 NOUN NN
on 85 ADP IN
the 90 DET DT
outcome 92 NOUN NN
of 85 ADP IN
his 90 DET PRP$
life 92 NOUN NN


#### Using NLTK

In [44]:
# PunktSentenceTokenizer from the nltk.tokenize.punkt module 
tokenized = sent_tokenize(u'Can you google it? One billion ...') 
tokenized

['Can you google it?', 'One billion ...']

In [45]:
for sentence in tokenized:
    # Split the sentence into words and punctuation tokens
    wordsList = nltk.word_tokenize(sentence)

    # Optionally remove stop words before tagging
    # wordsList = [w for w in wordsList if w not in stop_words]

    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)

    print(tagged)

[('Can', 'MD'), ('you', 'PRP'), ('google', 'VB'), ('it', 'PRP'), ('?', '.')]
[('One', 'CD'), ('billion', 'CD'), ('...', ':')]


Let's now see another example:

In [58]:
sen = nlp(u'Can you search it on google?')
word = sen[5]

print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

google       PROPN      NNP      noun, proper singular


In [59]:
# using NLTK
# PunktSentenceTokenizer from the nltk.tokenize.punkt module 
tokenized = sent_tokenize(u'Can you search it on google?') 
tokenized

['Can you search it on google?']

In [60]:
for sentence in tokenized:
    # Split the sentence into words and punctuation tokens
    wordsList = nltk.word_tokenize(sentence)

    # Optionally remove stop words before tagging
    # wordsList = [w for w in wordsList if w not in stop_words]

    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)

    print(tagged)

[('Can', 'MD'), ('you', 'PRP'), ('search', 'VB'), ('it', 'PRP'), ('on', 'IN'), ('google', 'NN'), ('?', '.')]


In [61]:
nltk.pos_tag(['google']) 

[('google', 'NN')]

In [62]:
nltk.pos_tag(['present']) 

[('present', 'NN')]

#### Finding the Number of POS Tags

You can find the number of occurrences of each POS tag by calling the `count_by` method on the spaCy document object. The method takes `spacy.attrs.POS` as its argument and returns a dictionary that maps each tag's integer ID to its count.

In [64]:
spacy.attrs.POS

74

In [63]:
sen = nlp(u"I like to play football. I hated it in my childhood though")

num_pos = sen.count_by(spacy.attrs.POS)
num_pos

{95: 3, 100: 3, 94: 1, 92: 2, 97: 1, 85: 1, 90: 1, 86: 1}

In [65]:
for k,v in sorted(num_pos.items()):
    print(f'{k}. {sen.vocab[k].text:{8}}: {v}')

85. ADP     : 1
86. ADV     : 1
90. DET     : 1
92. NOUN    : 2
94. PART    : 1
95. PRON    : 3
97. PUNCT   : 1
100. VERB    : 3


#### Visualizing Parts of Speech Tags        

The `displacy` module from the spaCy library is used for this purpose.

To visualize the POS tags inside a Jupyter notebook, we call the `render` function from the `displacy` module, passing it the spaCy document and the style of the visualization, and setting the `jupyter` parameter to `True`.

In [66]:
from spacy import displacy

sen = nlp(u"I like to play football. I hated it in my childhood though")
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

#### Uses of POS tagging

- Text-to-speech (TTS) applications
- Information retrieval/extraction
- An intermediate step for higher-level NLP tasks such as parsing, semantic analysis, translation, and many more
- Sentiment analysis
- Homonym disambiguation
- Predictions
- Building NER systems (most named entities are nouns)

#### The Different POS Tagging Techniques

There are different approaches for POS tagging. The following figure demonstrates different POS tagging models. 

![pos-models.PNG](attachment:pos-models.PNG)


### Supervised POS Tagging 
The supervised POS tagging models require a pre-tagged corpus, which is used during training to learn information about the tagset, word-tag frequencies, rule sets, etc.

The performance of these models generally increases with the size of this corpus.

### Unsupervised POS Tagging
Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus.

Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc.

Based on this information, they either calculate the probabilistic information needed by the stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.



Both the supervised and unsupervised POS tagging models can be of the following types. 

1. Rule Based / Transformation Based
2. Stochastic 
    - N-gram based
    - Max likelihood
    - Neural
    - Conditional Random Fields
3. Hidden Markov Model 
4. Maximum Entropy Model 
5. Memory Based Learning



### Rule-Based Methods/ Transformation Based

- The rule based POS tagging models apply a set of hand written rules and use contextual information to assign POS tags to words.
- These rules are often known as __context frame rules__
- For example, 
    - we can have a rule that says, words ending with “ed” or “ing” must be assigned to a verb. 
    - a context frame rule might say something like: “If an ambiguous/unknown word X is preceded by a Determiner and followed by a Noun, tag it as an Adjective.”
- On the other hand, the transformation based approaches use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training.
- Some models also use information about capitalization and punctuation
- In general, the rule based tagging models usually require __supervised training__ i.e. pre-annotated corpora. 
- A good amount of work has been done to automatically induce the __transformation rules__. 
    - One approach to automatic rule induction is to run an untagged text through a tagging model and get the initial output. 
    - A human then goes through the output of this first phase and corrects any erroneously tagged words by hand. 
    - This tagged text is then submitted to the tagger, which learns correction rules by comparing the two sets of data. 
    - Several iterations of this process are sometimes necessary before the tagging model can achieve considerable performance.
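
The two example rules above (the suffix rule and the context frame rule) can be sketched as a tiny two-pass tagger. The lexicon, the generic `VB` label for every suffix match, and the `NN` fallback are simplifying assumptions made for illustration; this is not how NLTK or spaCy tag text.

```python
# Hypothetical mini-lexicon of words whose tag is known in advance
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN", "door": "NN"}

def rule_based_tag(tokens):
    # Pass 1: lexicon lookup, then the suffix rule; anything else stays unknown
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in LEXICON:
            tags.append(LEXICON[low])
        elif low.endswith(("ed", "ing")):
            tags.append("VB")
        else:
            tags.append(None)
    # Pass 2: context frame rule for the words that are still unknown
    for i, tag in enumerate(tags):
        if tag is None:
            prev_tag = tags[i - 1] if i > 0 else None
            next_tag = tags[i + 1] if i < len(tags) - 1 else None
            if prev_tag == "DT" and next_tag == "NN":
                tags[i] = "JJ"   # Determiner _ Noun  ->  Adjective
            else:
                tags[i] = "NN"   # assumed fallback for unknown words
    return list(zip(tokens, tags))

print(rule_based_tag("The lazy dog opened the door".split()))
```

Here "lazy" is not in the lexicon, but it sits between a determiner and a noun, so the context frame rule tags it as an adjective.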

__Brill’s tagger__ is a transformation-based tagger that goes through the training data and finds the set of tagging rules that best describe the data and minimize POS tagging errors.

The most important point to note about Brill’s tagger is that the __rules are not hand-crafted__; they are learned from the corpus provided.

The only feature engineering required is a set of rule templates that the model can use to come up with candidate rules.

### Stochastic models
A stochastic approach is any approach that involves _frequency, probability or statistics_. 

1. The simplest stochastic approach finds out the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. 

    e.g. The simplest stochastic taggers __disambiguate__ words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered __most frequently__ in the training set with the word is the one assigned to an ambiguous instance of that word. 

    The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language. 

2. An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the __n-gram__ approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags. 
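
The first approach, picking the most frequent tag for each word, can be sketched in a few lines. The tiny training set below is made up for illustration, and defaulting unknown words to `NN` is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences
train = [
    [("they", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("go", "VB")],
    [("the", "DT"), ("refuse", "NN"), ("permit", "NN")],
    [("they", "PRP"), ("refuse", "VBP"), ("the", "DT"), ("permit", "NN")],
]

def train_most_frequent(sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_most_frequent(model, tokens, default="NN"):
    """Tag each token independently of its neighbours."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_most_frequent(train)
print(tag_most_frequent(model, ["the", "refuse", "permit"]))
```

Because "refuse" is a verb in two of its three training occurrences, it is tagged `VBP` even right after a determiner, which is exactly the kind of unacceptable tag sequence this approach can produce.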

The next level of complexity that can be introduced into a stochastic tagger combines the previous two approaches, using both tag sequence probabilities and word frequency measurements. 

This is known as the __Hidden Markov Model (HMM)__.

There are different models that can be used for stochastic POS tagging, some of which are 

1. Conditional Random Fields 
2. Hidden Markov Model 

## What is a Markov Model?

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states.

Markov models are widely employed in economics, game theory, communication theory, genetics and finance.

When it comes to real-world problems, they are used to study cruise control systems in motor vehicles, queues or lines of customers arriving at an airport, exchange rates of currencies, etc.

Hidden Markov models are especially known for their application in __reinforcement learning__ and __temporal pattern recognition__ such as 
- speech, 
- handwriting, 
- gesture recognition, 
- part-of-speech tagging,  
- musical score following, 
- partial discharges,
- bioinformatics.

## Markov Chain

A Markov chain is a random process with the Markov property. A random process, often called a stochastic process, is a mathematical object defined as a collection of random variables. A Markov chain has either a discrete state space (the set of possible values of the random variables) or a discrete index set (often representing time).

## Discrete Time Markov chain
A discrete time Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$ with the Markov property, such that the probability of moving to the next state depends only on the present state and not on the previous states. 

$\Pr(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \Pr(X_{n+1} = x \mid X_n = x_n)$

As you can see, the distribution of $X_{n+1}$ depends only on the value of $X_n$ that precedes it.

The possible values of $X_i$ form a countable set S called the state space of the chain. The state space can be anything: letters, numbers, basketball scores or weather conditions. While the time parameter is usually discrete, the state space of a discrete time Markov chain does not have any widely agreed upon restrictions, and rather refers to a process on an arbitrary state space. However, many applications of Markov chains employ finite or countably infinite state spaces, because they have a more straightforward statistical analysis.

----------------------------------


## Example

When Ipsita is sad, which isn't very usual, she either:
- goes for a run,
- gobbles down ice cream,
- or takes a nap.

From historic data, if she spent a sad day sleeping, the next day it is 60% likely she will go for a run, 20% likely she will stay in bed, and 20% likely she will pig out on ice cream.

When she is sad and goes for a run, there is a 60% chance she'll go for a run the next day, a 30% chance she gorges on ice cream, and only a 10% chance she'll spend the next day sleeping.

Finally, when she indulges in ice cream on a sad day, there is a mere 10% chance she continues to have ice cream the next day as well, a 70% chance she goes for a run, and a 20% chance that she spends the next day sleeping.

![hmm2.PNG](attachment:hmm2.PNG)

The Markov chain depicted in the state diagram has 3 possible states: sleep, run, icecream. So the transition matrix will be a 3 x 3 matrix. Notice that the probabilities on the arrows exiting a state always sum to exactly 1; similarly, the entries in each row of the transition matrix must add up to exactly 1, since each row represents a probability distribution.

![hmm3.PNG](attachment:hmm3.PNG)

In the transition matrix, the cells do the same job that the arrows do in the state diagram.

We can now answer questions like: 

"Starting from the state: sleep, what is the probability that Ipsita will be running (state: run) at the end of a sad 2-day duration?"

Let's work this one out. To move from state: sleep to state: run in two moves, Ipsita must do one of the following:

1. stay in state: sleep on the first move (or day), then move to state: run on the second move (0.2 ⋅ 0.6);
2. move to state: run on the first day and stay there on the second (0.6 ⋅ 0.6); or
3. transition to state: icecream on the first move and then to state: run on the second (0.2 ⋅ 0.7).

So the probability is (0.2 ⋅ 0.6) + (0.6 ⋅ 0.6) + (0.2 ⋅ 0.7) = 0.12 + 0.36 + 0.14 = 0.62.

So, we can now say that there is a 62% chance that Ipsita will move to __state: run__ after two days of being sad, if she started out in the __state: sleep__.
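
The same calculation is one entry of the squared transition matrix. Here is a minimal pure-Python sketch using the probabilities from the example; the names `STATES`, `P`, and `two_step` are assumptions for illustration.

```python
STATES = ["sleep", "run", "icecream"]

# Row = today's state, column = tomorrow's state (same order as STATES)
P = [
    [0.2, 0.6, 0.2],  # from sleep
    [0.1, 0.6, 0.3],  # from run
    [0.2, 0.7, 0.1],  # from icecream
]

def two_step(P, start, end):
    """Sum over every intermediate state k of P[start][k] * P[k][end]."""
    return sum(P[start][k] * P[k][end] for k in range(len(P)))

prob = two_step(P, STATES.index("sleep"), STATES.index("run"))
print(round(prob, 2))
```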

-------------------------------------------

## Markov Chains in Python

In [10]:
import numpy as np
import random as rm

In [11]:
# The statespace
states = ["Sleep","Run","Icecream"]

In [12]:
# Possible sequences of events
transitionName = [["SS","SR","SI"],["RS","RR","RI"],["IS","IR","II"]]

In [13]:
# Probabilities matrix (transition matrix)
transitionMatrix = [[0.2,0.6,0.2],[0.1,0.6,0.3],[0.2,0.7,0.1]]

Always make sure the probabilities in each row sum to 1.

In [14]:
# Each row of the transition matrix must sum to 1 (compared with a float tolerance)
if not all(np.isclose(sum(row), 1.0) for row in transitionMatrix):
    print("Somewhere, something went wrong. Transition matrix, perhaps?")
else: 
    print("All is OK... proceed!! ")

All is OK... proceed!! 


In [15]:
# A function that implements the Markov model to forecast the state/mood.
def activity_forecast(days):
    # Choose the starting state
    activityToday = "Sleep"
    print("Start state: " + activityToday)
    # Shall store the sequence of states taken. So, this only has the starting state for now.
    activityList = [activityToday]
    i = 0
    # To calculate the probability of the activityList
    prob = 1
    while i != days:
        if activityToday == "Sleep":
            change = np.random.choice(transitionName[0],replace=True,p=transitionMatrix[0])
            if change == "SS":
                prob = prob * 0.2
                activityList.append("Sleep")
                pass
            elif change == "SR":
                prob = prob * 0.6
                activityToday = "Run"
                activityList.append("Run")
            else:
                prob = prob * 0.2
                activityToday = "Icecream"
                activityList.append("Icecream")
        elif activityToday == "Run":
            change = np.random.choice(transitionName[1],replace=True,p=transitionMatrix[1])
            if change == "RR":
                prob = prob * 0.6
                activityList.append("Run")
                pass
            elif change == "RS":
                prob = prob * 0.1
                activityToday = "Sleep"
                activityList.append("Sleep")
            else:
                prob = prob * 0.3
                activityToday = "Icecream"
                activityList.append("Icecream")
        elif activityToday == "Icecream":
            change = np.random.choice(transitionName[2],replace=True,p=transitionMatrix[2])
            if change == "II":
                prob = prob * 0.1
                activityList.append("Icecream")
                pass
            elif change == "IS":
                prob = prob * 0.2
                activityToday = "Sleep"
                activityList.append("Sleep")
            else:
                prob = prob * 0.7
                activityToday = "Run"
                activityList.append("Run")
        i += 1  
    print("Possible states: " + str(activityList))
    print("End state after "+ str(days) + " days: " + activityToday)
    print("Probability of the possible sequence of states: " + str(prob))


In [16]:
# Function that forecasts the possible state for the next 2 days
activity_forecast(2)

Start state: Sleep
Possible states: ['Sleep', 'Icecream', 'Run']
End state after 2 days: Run
Probability of the possible sequence of states: 0.13999999999999999
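
The long `if`/`elif` chain can also be written with a nested dictionary, which keeps each probability in a single place. This is only a sketch of an alternative layout; the names `TRANSITIONS` and `simulate` are assumptions, and it uses the standard library's `random.choices` instead of NumPy.

```python
import random

# TRANSITIONS[today][tomorrow] = probability, taken from the example above
TRANSITIONS = {
    "Sleep":    {"Sleep": 0.2, "Run": 0.6, "Icecream": 0.2},
    "Run":      {"Sleep": 0.1, "Run": 0.6, "Icecream": 0.3},
    "Icecream": {"Sleep": 0.2, "Run": 0.7, "Icecream": 0.1},
}

def simulate(days, start="Sleep", rng=random):
    """Return (sequence of states, probability of that exact sequence)."""
    state, path, prob = start, [start], 1.0
    for _ in range(days):
        options = TRANSITIONS[state]
        # Draw tomorrow's state according to today's row of probabilities
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        prob *= options[nxt]
        path.append(nxt)
        state = nxt
    return path, prob

# A seeded generator makes the run repeatable
path, prob = simulate(2, rng=random.Random(0))
print(path, prob)
```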


## HMMs for Part of Speech Tagging

We know that to model any problem using a Hidden Markov Model we need 
- a set of observations and 
- a set of possible states. 

The states in an HMM are hidden.

In the part of speech tagging problem, the observations are the words themselves in the given sequence. As for the states, which are hidden, these would be the POS tags for the words. 

The transition probabilities would be something like P(VB | NN), that is, the probability of the current word being tagged as a verb given that the previous word was tagged as a noun.
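
Given such transition probabilities together with emission probabilities (how likely a tag is to produce a given word), the most likely tag sequence is usually found with the Viterbi algorithm. Below is a minimal sketch on a toy model; every probability in `START_P`, `TRANS_P`, and `EMIT_P` is a made-up assumption, not a value estimated from a corpus.

```python
STATES = ["PRP", "VB", "DT", "NN"]
START_P = {"PRP": 0.5, "DT": 0.3, "NN": 0.1, "VB": 0.1}
TRANS_P = {  # TRANS_P[previous tag][current tag]
    "PRP": {"VB": 0.8, "NN": 0.1, "DT": 0.05, "PRP": 0.05},
    "VB":  {"DT": 0.5, "NN": 0.2, "PRP": 0.2, "VB": 0.1},
    "DT":  {"NN": 0.9, "VB": 0.04, "DT": 0.03, "PRP": 0.03},
    "NN":  {"VB": 0.4, "NN": 0.3, "DT": 0.2, "PRP": 0.1},
}
EMIT_P = {  # EMIT_P[tag][word]
    "PRP": {"they": 1.0},
    "VB":  {"refuse": 0.5, "permit": 0.5},
    "DT":  {"the": 1.0},
    "NN":  {"refuse": 0.5, "permit": 0.5},
}

def viterbi(words, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in tag s at word t, previous tag)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score * emit_p[s].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final tag
    tag = max(states, key=lambda s: best[-1][s][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

tags = viterbi("they refuse the permit".split(), STATES, START_P, TRANS_P, EMIT_P)
print(tags)
```

In this toy model "refuse" and "permit" are equally likely to be emitted by `VB` and `NN`, so it is the transition probabilities that settle which reading wins in each position.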



The term __hidden__ refers to the first-order Markov process behind the observations: the states themselves cannot be observed directly. 

An observation is the data we know and can observe. 

In the diagram below, the Markov process is shown by the interaction between “Rainy” and “Sunny”; each of these is a __HIDDEN STATE__.

__OBSERVATIONS__ are known data; in the diagram they are “Walk”, “Shop”, and “Clean”. 

In the machine learning sense, the observations are our training data and the number of hidden states is a hyperparameter of our model. 

![hmm1.png](attachment:hmm1.png)

-------------------------------------------------------------------


$T$ - length of the observation sequence

$N$ - number of states in the model

$M$ - number of observation symbols

$Q = \{q_0, q_1, \ldots, q_{N-1}\}$ - distinct states of the Markov process

$V = \{0, 1, \ldots, M-1\}$ - set of possible observations

$A$ - state transition probabilities

$B$ - observation probability matrix

$\pi$ - initial state distribution

$O = (O_0, O_1, O_2, \ldots, O_{T-1})$ - observation sequence

---------------------------------------------------------------------------

T = not defined yet, since we don’t have any observations, 

N = 2, 

M = 3, 

Q = {“Rainy”, “Sunny”}, 

V = {“Walk”, “Shop”, “Clean”}
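
To see how the pieces of the notation fit together, the model can be written down as plain dictionaries and used to compute the probability of an observation sequence with the forward algorithm. The numbers in `pi`, `A`, and `B` below are assumed values chosen for illustration, since the diagram's probabilities are not given in the text.

```python
Q = ["Rainy", "Sunny"]            # hidden states
V = ["Walk", "Shop", "Clean"]     # possible observations

# Assumed parameters (illustrative values only)
pi = {"Rainy": 0.6, "Sunny": 0.4}                      # initial state distribution
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},            # state transition probabilities
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},  # observation probabilities
     "Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}

def sequence_probability(O):
    """Forward algorithm: P(O), summed over every hidden state path."""
    alpha = {q: pi[q] * B[q][O[0]] for q in Q}
    for o in O[1:]:
        alpha = {q: sum(alpha[p] * A[p][q] for p in Q) * B[q][o] for q in Q}
    return sum(alpha.values())

O = ("Walk", "Shop", "Clean")     # an example observation sequence
T, N, M = len(O), len(Q), len(V)
print(T, N, M)
print(sequence_probability(O))
```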

--------------------------------------------


__Lexical Based Methods__ — 
Assigns to each word the POS tag that occurs most frequently with it in the training corpus.


__Probabilistic Methods__ — This method assigns the POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to assign a POS Tag.

__Deep Learning Methods__ — Recurrent Neural Networks can also be used for POS tagging.