## <font color='darkblue'>Preface</font>
([article source](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)) <font size='3ptx'>**In the world of Natural Language Processing (NLP), the most basic models are based on Bag of Words. But such models fail to capture the syntactic relations between words.**</font>

For example, suppose we build a sentiment analyser based on only Bag of Words. Such a model will not be able to capture the difference between “I like you”, where “like” is a verb with a positive sentiment, and “I am like you”, where “like” is a preposition with a neutral sentiment.

So this leaves us with a question — how do we improve on this Bag of Words technique?

[**Part of Speech**](https://en.wikipedia.org/wiki/Part_of_speech) (<font color='brown'>hereby referred to as POS</font>) Tags are useful for building parse trees, which are used in building [**NER**](https://en.wikipedia.org/wiki/Named-entity_recognition)s (<font color='brown'>most named entities are Nouns</font>) and extracting relations between words. POS Tagging is also essential for building lemmatizers which are used to reduce a word to its root form.

POS tagging is the process of marking up a word in a corpus to a corresponding part of a speech tag, based on its context and definition. This task is not straightforward, as a particular word may have a different part of speech based on the context in which the word is used.

For example: In the sentence “Give me your answer”, answer is a Noun, but in the sentence “Answer the question”, answer is a verb.

**To understand the meaning of any sentence or to extract relationships and build a knowledge graph, POS Tagging is a very important step.**

### <font color='darkgreen'>The Different POS Tagging Techniques</font>
There are different techniques for POS Tagging:
* **Lexical Based Methods** — Assigns the POS tag the most frequently occurring with a word in the training corpus.
* **Rule-Based Methods** — Assigns POS tags based on rules. For example, we can have a rule that says, words ending with “ed” or “ing” must be assigned to a verb. Rule-Based Techniques can be used along with Lexical Based approaches to allow POS Tagging of words that are not present in the training corpus but are there in the testing data.
* **Probabilistic Methods** — This method assigns the POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to assign a POS Tag.
* **Deep Learning Methods** — Recurrent Neural Networks can also be used for POS tagging.

<a id='crf'></a>
## <font color='darkblue'>Conditional Random Fields(CRF)</font>
* <font size='3ptx'>[**Dataset**](#dataset)</font>
* <font size='3ptx'>[**Creating the Feature Function**](#feature_func)</font>
* <font size='3ptx'>[**Fitting a CRF Model**](#crf_fitting)</font>
* <font size='3ptx'>[**Evaluating the CRF Model**](#crf_evaluation)</font>

**A CRF is a Discriminative Probabilistic Classifiers.** The difference between discriminative and generative models is that while discriminative models try to model conditional probability distribution, i.e., `P(y|x)`, generative models try to model a joint probability distribution, i.e., `P(x,y)`.

**Logistic Regression, SVM, CRF are Discriminative Classifiers. Naive Bayes, HMMs are Generative Classifiers**. CRF’s can also be used for sequence labelling tasks like Named Entity Recognisers and POS Taggers.

**In CRFs, the input is a set of features** (<font color='brown'>real numbers</font>) **derived from the input sequence using feature functions, the weights associated with the features** (<font color='brown'>that are learned</font>) **and the previous label and the task is to predict the current label.** The weights of different feature functions will be determined such that the likelihood of the labels in the training data will be maximised.

**In CRF, a set of feature functions are defined to extract features for each word in a sentence.** Some examples of feature functions are: is the first letter of the word capitalised, what the suffix and prefix of the word, what is the previous word, is it the first or the last word of the sentence, is it a number etc. These set of features are called State Features. In CRF, we also pass the label of the previous word and the label of the current word to learn the weights. CRF will try to determine the weights of different feature functions that will maximise the likelihood of the labels in the training data. **The feature function dependent on the label of the previous word is Transition Feature**.

Let’s now jump into how to use CRF for identifying POS Tags in Python. The code can be found [here](https://github.com/AiswaryaSrinivas/DataScienceWithPython/blob/master/CRF%20POS%20Tagging.ipynb). Firstly, let's import necessary packages:

In [3]:
#!pip install sklearn_crfsuite

In [4]:
import nltk, re, pprint
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import pprint, time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from sklearn_crfsuite import scorers
from collections import Counter

<a id='dataset'></a>
### <font color='darkgreen'>Dataset</font> ([back](#crf))
**We will use the [NLTK](https://www.nltk.org/) Treebank dataset with the Universal Tagset**. The Universal tagset of NLTK comprises of 12 tag classes: Verb, Noun, Pronouns, Adjectives, Adverbs, Adpositions, Conjunctions, Determiners, Cardinal Numbers, Particles, Other/ Foreign words, Punctuations. This dataset has 3,914 tagged sentences and a vocabulary of 12,408 words.

In [8]:
import nltk

nltk.download('treebank')
nltk.download('universal_tagset')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\john\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\john\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.


True

In [9]:
tagged_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')
print("Number of Tagged Sentences ",len(tagged_sentence))
tagged_words=[tup for sent in tagged_sentence for tup in sent]
print("Total Number of Tagged words", len(tagged_words))
vocab=set([word for word,tag in tagged_words])
print("Vocabulary of the Corpus",len(vocab))
tags=set([tag for word,tag in tagged_words])
print("Number of Tags in the Corpus ",len(tags))

Number of Tagged Sentences  3914
Total Number of Tagged words 100676
Vocabulary of the Corpus 12408
Number of Tags in the Corpus  12


Next, we will split the data into Training and Test data in a 80:20 ratio — 3,131 sentences in the training set and 783 sentences in the test set.

In [10]:
train_set, test_set = train_test_split(tagged_sentence,test_size=0.2,random_state=1234)
print("Number of Sentences in Training Data ",len(train_set))
print("Number of Sentences in Testing Data ",len(test_set))

Number of Sentences in Training Data  3131
Number of Sentences in Testing Data  783


<a id='feature_func'></a>
### <font color='darkgreen'>Creating the Feature Function</font> ([back](#crf))
For identifying POS tags, we will create a function which returns a dictionary with the following features for each word in a sentence:
* Is the first letter of the word capitalised (Generally Proper Nouns have the first letter capitalised)?
* Is it the first word of the sentence?
* Is it the last word of the sentence
* Does the word contain both numbers and alphabets?
* Does it have a hyphen (generally, adjectives have hyphens - for example, words like fast-growing, slow-moving)
* Is the complete word capitalised?
* Is it a number?
* What are the first four suffixes and prefixes?(words ending with “ed” are generally verbs, words ending with “ous” like disastrous are adjectives)

The feature function is defined as below and the features for train and test data are extracted.

In [12]:
def features(sentence, index):
    ### sentence is of the form [w1,w2,w3,..], index is the position of the word in the sentence
    return {
        'is_first_capital':int(sentence[index][0].isupper()),
        'is_first_word': int(index==0),
        'is_last_word':int(index==len(sentence)-1),
        'is_complete_capital': int(sentence[index].upper()==sentence[index]),
        'prev_word':'' if index==0 else sentence[index-1],
        'next_word':'' if index==len(sentence)-1 else sentence[index+1],
        'is_numeric':int(sentence[index].isdigit()),
        'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence[index])))),
        'prefix_1':sentence[index][0],
        'prefix_2': sentence[index][:2],
        'prefix_3':sentence[index][:3],
        'prefix_4':sentence[index][:4],
        'suffix_1':sentence[index][-1],
        'suffix_2':sentence[index][-2:],
        'suffix_3':sentence[index][-3:],
        'suffix_4':sentence[index][-4:],
        'word_has_hyphen': 1 if '-' in sentence[index] else 0  
    }


def untag(sentence):
    return [word for word,tag in sentence]


def prepare_data(tagged_sentences):
    X, y=[], []
    for sentences in tagged_sentences:
        X.append([features(untag(sentences), index) for index in range(len(sentences))])
        y.append([tag for word, tag in sentences])
        
    return X,y


X_train, y_train=prepare_data(train_set)
X_test, y_test=prepare_data(test_set)

<a id='crf_fitting'></a>
### <font color='darkgreen'>Fitting a CRF Model</font> ([back](#crf))
The next step is to use the [**sklearn_crfsuite**](https://sklearn-crfsuite.readthedocs.io/en/latest/) to fit the CRF model. The model is optimised by Gradient Descent using the LBGS method with L1 and L2 regularisation. We will set the CRF to generate all possible label transitions, even those that do not occur in the training data.

In [14]:
crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.01, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

<a id='crf_evaluation'></a>
### <font color='darkgreen'>Evaluating the CRF Model</font>
We use [**F-score**](https://en.wikipedia.org/wiki/F-score) to evaluate the CRF Model. F-score conveys balance between Precision and Recall and is defined as:

In [16]:
y_pred=crf.predict(X_test)
print("F1 score on Test Data ")
print(metrics.flat_f1_score(y_test, y_pred,average='weighted', labels=crf.classes_))
print("F score on Training Data ")
y_pred_train = crf.predict(X_train)
metrics.flat_f1_score(
    y_train,
    y_pred_train,
    average='weighted',
    labels=crf.classes_
)

### Look at class wise score
print(metrics.flat_classification_report(
    y_test, y_pred, labels=crf.classes_, digits=3
))

F1 score on Test Data 
0.9738471726864286
F score on Training Data 
              precision    recall  f1-score   support

         ADP      0.979     0.985     0.982      1869
        NOUN      0.966     0.977     0.972      5606
        CONJ      0.994     0.994     0.994       480
        VERB      0.964     0.960     0.962      2722
         ADJ      0.911     0.874     0.892      1274
           .      1.000     1.000     1.000      2354
           X      1.000     0.997     0.998      1278
         NUM      0.991     0.993     0.992       671
         DET      0.994     0.995     0.994      1695
         ADV      0.927     0.909     0.918       585
        PRON      0.998     0.998     0.998       562
         PRT      0.979     0.982     0.980       614

    accuracy                          0.974     19710
   macro avg      0.975     0.972     0.974     19710
weighted avg      0.974     0.974     0.974     19710



From the class-wise score of the CRF (above), we observe that for <font color='darkred'>**predicting Adjectives, the precision** (0.911)**, recall** (0.874) **and F-score** (0.892) **are lower**</font> — indicating that more features related to adjectives must be added to the CRF feature function.

The next step is to look at the top 20 most likely Transition Features.

In [17]:
print("Number of Transition Features ")
len(crf.transition_features_)

Number of Transition Features 


144

In [18]:
Counter(crf.transition_features_).most_common(20)

[(('ADJ', 'NOUN'), 4.114996),
 (('NOUN', 'NOUN'), 2.935448),
 (('NOUN', 'VERB'), 2.891987),
 (('VERB', 'PRT'), 2.519179),
 (('X', 'VERB'), 2.271558),
 (('ADP', 'NOUN'), 2.265833),
 (('NOUN', 'PRT'), 2.172849),
 (('PRON', 'VERB'), 2.117186),
 (('NUM', 'NOUN'), 2.059221),
 (('DET', 'NOUN'), 2.053832),
 (('ADV', 'VERB'), 1.994419),
 (('ADV', 'ADJ'), 1.957063),
 (('NOUN', 'ADP'), 1.838684),
 (('VERB', 'NOUN'), 1.763319),
 (('ADJ', 'ADJ'), 1.660578),
 (('NOUN', 'CONJ'), 1.591359),
 (('PRT', 'NOUN'), 1.398473),
 (('NOUN', '.'), 1.381863),
 (('NOUN', 'ADV'), 1.380086),
 (('ADV', 'ADV'), 1.301282)]

As we can see, an Adjective is most likely to be followed by a Noun. A verb is most likely to be followed by a Particle (like TO), a Determinant like “The” is also more likely to be followed a noun.
Similarly, we can look at the most common state features.

In [19]:
print("Number of State Features ",len(crf.state_features_))

Number of State Features  32413


In [20]:
Counter(crf.state_features_).most_common(20)

[(('prev_word:will', 'VERB'), 6.751359),
 (('prev_word:would', 'VERB'), 5.940819),
 (('prefix_1:*', 'X'), 5.830558),
 (('suffix_4:rest', 'NOUN'), 5.644523),
 (('suffix_2:ly', 'ADV'), 5.260228),
 (('is_first_capital', 'NOUN'), 5.043121),
 (('prev_word:could', 'VERB'), 5.018842),
 (('suffix_3:ous', 'ADJ'), 4.870949),
 (('prev_word:to', 'VERB'), 4.849822),
 (('suffix_4:will', 'VERB'), 4.677684),
 (('next_word:appeal', 'ADJ'), 4.386434),
 (('prev_word:how', 'PRT'), 4.35094),
 (('suffix_4:pany', 'NOUN'), 4.329975),
 (('prefix_4:many', 'ADJ'), 4.205028),
 (('prev_word:lock', 'PRT'), 4.153643),
 (('word_has_hyphen', 'ADJ'), 4.151036),
 (('prev_word:tune', 'PRT'), 4.147576),
 (('next_word:Express', 'NOUN'), 4.137127),
 (('suffix_4:food', 'NOUN'), 4.116688),
 (('suffix_2:ed', 'VERB'), 4.070659)]

**If the previous word is “will” or “would”, it is most likely to be a Verb, or if a word ends in “ed”, it is definitely a verb.** As we discussed during defining features, if the word has a hyphen, as per CRF model the probability of being an Adjective is higher. Similarly if the first letter of a word is capitalised, it is more likely to be a NOUN. Natural language is such a complex yet beautiful thing!