
In [None]:
!pip install sklearn-crfsuite
import nltk, re, pprint
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from sklearn_crfsuite import scorers
from collections import Counter
nltk.download('treebank')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

**Introduction**

In this work, we will use a CRF classifier for POS tagging. The dataset is the Penn Treebank corpus sample distributed with NLTK, mapped to the universal tag set, which has 12 unique POS tags.


In [None]:
tagged_sentence = nltk.corpus.treebank.tagged_sents(tagset='universal')

In [None]:
print("Number of Tagged Sentences ",len(tagged_sentence))
tagged_words=[tup for sent in tagged_sentence for tup in sent]
print("Total Number of Tagged words", len(tagged_words))
vocab=set([word for word,tag in tagged_words])
print("Vocabulary of the Corpus",len(vocab))
tags=set([tag for word,tag in tagged_words])
print("Number of Tags in the Corpus ",len(tags))

Number of Tagged Sentences  3914
Total Number of Tagged words 100676
Vocabulary of the Corpus 12408
Number of Tags in the Corpus  12


#### Splitting Data into train and test set - 80-20 split

In [None]:
train_set, test_set = train_test_split(tagged_sentence,test_size=0.2,random_state=1234)
print("Number of Sentences in Training Data ",len(train_set))
print("Number of Sentences in Testing Data ",len(test_set))

Number of Sentences in Training Data  3131
Number of Sentences in Testing Data  783


### Define the feature function. The following features can be used:
1. Is the first letter capitalised?
2. Is it the first word in the sentence?
3. Is it the last word in the sentence?
4. What is the prefix of the word (first 1 to 4 characters)?
5. What is the suffix of the word (last 1 to 4 characters)?
6. Is the complete word capitalised?
7. What is the previous word?
8. What is the next word?
9. Is it numeric?
10. Is it alphanumeric?
11. Is there a hyphen in the word?

In [None]:
def features(sentence, index):
    # sentence is a list of words [w1, w2, w3, ...]; index is the position of the word in the sentence
    word = sentence[index]
    return {
        'is_first_capital': int(word[0].isupper()),
        'is_first_word': int(index == 0),
        'is_last_word': int(index == len(sentence) - 1),
        # Also 1 for tokens with no lowercase letters, such as digits and punctuation
        'is_complete_capital': int(word.upper() == word),
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'is_numeric': int(word.isdigit()),
        # As written, matches words that contain a letter and end in a digit
        'is_alphanumeric': int(bool(re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])', word))),
        'prefix_1': word[0],
        'prefix_2': word[:2],
        'prefix_3': word[:3],
        'prefix_4': word[:4],
        'suffix_1': word[-1],
        'suffix_2': word[-2:],
        'suffix_3': word[-3:],
        'suffix_4': word[-4:],
        'word_has_hyphen': 1 if '-' in word else 0
    }
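
To see what the `is_alphanumeric` pattern matches in practice, it can be run on a few sample tokens. This is a minimal sketch: the helper name `is_alphanumeric` and the sample tokens are illustrative and not part of the notebook. Because the `$` sits inside the first lookahead, the pattern requires the token to end in a digit as well as contain a letter.

```python
import re

# Same pattern as in features(); the helper name is hypothetical.
ALNUM_PATTERN = re.compile(r'^(?=.*[0-9]$)(?=.*[a-zA-Z])')

def is_alphanumeric(token):
    """Return 1 if the token contains a letter and ends in a digit, else 0."""
    return int(bool(ALNUM_PATTERN.match(token)))

# Made-up tokens covering the interesting cases.
samples = ['A4', 'mid-1990', '4A', '1990', 'Wall']
flags = {token: is_alphanumeric(token) for token in samples}
print(flags)
```

A token such as `4A` mixes letters and digits but is not flagged, which may or may not be the intended behaviour.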

#### Need to separate the labels from the sentences in both the training and test data

In [None]:
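def untag(sentence):
    # Drop the tags and keep only the words
    return [word for word, tag in sentence]


def prepareData(tagged_sentences):
    # X: one list of feature dicts per sentence; y: the matching list of tags
    X, y = [], []
    for tagged_sent in tagged_sentences:
        words = untag(tagged_sent)
        X.append([features(words, index) for index in range(len(words))])
        y.append([tag for word, tag in tagged_sent])
    return X, y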
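
The structure that `prepareData` produces can be illustrated on a toy input. sklearn-crfsuite expects `X` as a list of sentences, each a list of feature dicts, and `y` as a parallel list of tag lists. The sketch below uses a two-feature stand-in (`toy_features`, a hypothetical name) instead of the full `features` function, and a made-up two-sentence corpus.

```python
def toy_features(words, index):
    # Hypothetical stand-in for features(): just two of its keys.
    return {
        'is_first_word': int(index == 0),
        'suffix_2': words[index][-2:],
    }

def prepare_toy_data(tagged_sentences):
    # Build one list of feature dicts and one list of tags per sentence.
    X, y = [], []
    for tagged_sentence in tagged_sentences:
        words = [word for word, tag in tagged_sentence]
        X.append([toy_features(words, i) for i in range(len(words))])
        y.append([tag for word, tag in tagged_sentence])
    return X, y

# Made-up corpus for illustration only.
toy_corpus = [
    [('Dogs', 'NOUN'), ('bark', 'VERB')],
    [('Stop', 'VERB')],
]
X_toy, y_toy = prepare_toy_data(toy_corpus)
print(X_toy[0][0])
print(y_toy)
```

Each sentence in `X_toy` has exactly as many feature dicts as the matching sentence in `y_toy` has tags, which is the invariant the CRF relies on.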

In [None]:
X_train,y_train=prepareData(train_set)
X_test,y_test=prepareData(test_set)


In [None]:
X_train[0]

[{'is_alphanumeric': 0,
  'is_complete_capital': 0,
  'is_first_capital': 1,
  'is_first_word': 1,
  'is_last_word': 0,
  'is_numeric': 0,
  'next_word': 'Wall',
  'prefix_1': 'O',
  'prefix_2': 'On',
  'prefix_3': 'On',
  'prefix_4': 'On',
  'prev_word': '',
  'suffix_1': 'n',
  'suffix_2': 'On',
  'suffix_3': 'On',
  'suffix_4': 'On',
  'word_has_hyphen': 0},
 {'is_alphanumeric': 0,
  'is_complete_capital': 0,
  'is_first_capital': 1,
  'is_first_word': 0,
  'is_last_word': 0,
  'is_numeric': 0,
  'next_word': 'Street',
  'prefix_1': 'W',
  'prefix_2': 'Wa',
  'prefix_3': 'Wal',
  'prefix_4': 'Wall',
  'prev_word': 'On',
  'suffix_1': 'l',
  'suffix_2': 'll',
  'suffix_3': 'all',
  'suffix_4': 'Wall',
  'word_has_hyphen': 0},
 {'is_alphanumeric': 0,
  'is_complete_capital': 0,
  'is_first_capital': 1,
  'is_first_word': 0,
  'is_last_word': 0,
  'is_numeric': 0,
  'next_word': 'men',
  'prefix_1': 'S',
  'prefix_2': 'St',
  'prefix_3': 'Str',
  'prefix_4': 'Stre',
  'prev_word': 'Wal

In [None]:
y_train[0]

['ADP',
 'NOUN',
 'NOUN',
 'NOUN',
 'CONJ',
 'NOUN',
 'VERB',
 'ADP',
 'ADJ',
 'NOUN',
 '.',
 'X',
 'VERB',
 'NUM',
 'DET',
 'ADV',
 'ADV',
 'PRON',
 'VERB',
 'ADP',
 'NOUN',
 'X',
 '.']

#### Let us fit a CRF model with an initial set of parameters (L-BFGS training, `c1=0.01`, `c2=0.1`)

In [None]:
crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)



CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.01, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=False)

In [None]:
y_pred=crf.predict(X_test)

In [None]:
metrics.flat_f1_score(y_test, y_pred,average='weighted',labels=crf.classes_)

0.9738471726864286

In [None]:
y_pred_train=crf.predict(X_train)
metrics.flat_f1_score(y_train, y_pred_train,average='weighted',labels=crf.classes_)

0.9963402924209424

#### The CRF model has an F1 score of 0.974 on the test data and 0.996 on the training data. The gap of about 0.02 suggests some overfitting, so the model should be tuned.
Before tuning the model, let us look at where the CRF fails and which features matter most for identifying the different POS tags.
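
One simple way to approach the tuning step is an exhaustive search over the regularisation strengths `c1` (L1) and `c2` (L2). The sketch below is an assumption about how this could be organised, not the notebook's method: `grid_search` and `fit_and_score` are hypothetical names, and the scoring callback is a dummy so that the block runs on its own. In practice the callback would fit a `CRF` on a training split and return the weighted flat F1 on a separate validation split, not on the test set.

```python
from itertools import product

def grid_search(c1_values, c2_values, fit_and_score):
    """Return (best_score, best_c1, best_c2) over all (c1, c2) pairs."""
    best = None
    for c1, c2 in product(c1_values, c2_values):
        score = fit_and_score(c1, c2)
        if best is None or score > best[0]:
            best = (score, c1, c2)
    return best

def dummy_fit_and_score(c1, c2):
    # Placeholder objective that peaks at c1=0.1, c2=0.1.
    # A real callback would look roughly like (X_tr, y_tr, X_val, y_val assumed):
    #   model = CRF(algorithm='lbfgs', c1=c1, c2=c2, max_iterations=100,
    #               all_possible_transitions=True).fit(X_tr, y_tr)
    #   return metrics.flat_f1_score(y_val, model.predict(X_val), average='weighted')
    return -((c1 - 0.1) ** 2 + (c2 - 0.1) ** 2)

best_score, best_c1, best_c2 = grid_search(
    [0.01, 0.1, 1.0], [0.01, 0.1, 1.0], dummy_fit_and_score
)
print(best_c1, best_c2)
```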

In [None]:
metrics.flat_accuracy_score(y_test,y_pred)

0.9739726027397261

In [None]:
metrics.flat_accuracy_score(y_train,y_pred_train)

0.9963441444556974
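
The "flat" metrics in `sklearn_crfsuite.metrics` concatenate all sentences into one long tag sequence before scoring, so the accuracies above are per-token figures. A minimal pure-Python sketch of that idea on made-up tags (`flat_accuracy` is a hypothetical name, not the library function):

```python
from itertools import chain

def flat_accuracy(y_true, y_pred):
    """Per-token accuracy over lists of tag sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    if len(true_flat) != len(pred_flat):
        raise ValueError("y_true and y_pred must contain the same number of tokens")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)

# Made-up gold and predicted tags: 3 tokens, 2 of them correct.
gold = [['DET', 'NOUN'], ['VERB']]
pred = [['DET', 'ADJ'], ['VERB']]
accuracy = flat_accuracy(gold, pred)
print(accuracy)
```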

#### Let us look at class-wise scores

In [None]:
print(metrics.flat_classification_report(
    y_test, y_pred, labels=crf.classes_, digits=3
))

              precision    recall  f1-score   support

         ADP      0.979     0.985     0.982      1869
        NOUN      0.966     0.977     0.972      5606
        CONJ      0.994     0.994     0.994       480
        VERB      0.964     0.960     0.962      2722
         ADJ      0.911     0.874     0.892      1274
           .      1.000     1.000     1.000      2354
           X      1.000     0.997     0.998      1278
         NUM      0.991     0.993     0.992       671
         DET      0.994     0.995     0.994      1695
         ADV      0.927     0.909     0.918       585
        PRON      0.998     0.998     0.998       562
         PRT      0.979     0.982     0.980       614

    accuracy                          0.974     19710
   macro avg      0.975     0.972     0.974     19710
weighted avg      0.974     0.974     0.974     19710



Adjectives (ADJ) have the lowest precision (0.911), recall (0.874) and F1 score (0.892) of all tags; adverbs (ADV) are the next weakest class.
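
To find out which tags a weak class is being confused with, the misclassified tokens can be counted by predicted tag. The sketch below is one possible way to do this; `confusions_for` is a hypothetical helper and the gold and predicted tags are made up. In the notebook it would be called as `confusions_for(y_test, y_pred, 'ADJ')`.

```python
from collections import Counter

def confusions_for(y_true, y_pred, tag):
    """Count what `tag` tokens were predicted as, excluding correct predictions."""
    errors = Counter()
    for true_sent, pred_sent in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_sent, pred_sent):
            if true_tag == tag and pred_tag != tag:
                errors[pred_tag] += 1
    return errors

# Made-up gold and predicted tags for illustration only.
gold = [['DET', 'ADJ', 'NOUN'], ['ADJ', 'ADJ', 'NOUN']]
pred = [['DET', 'NOUN', 'NOUN'], ['ADJ', 'VERB', 'NOUN']]
adj_errors = confusions_for(gold, pred, 'ADJ')
print(adj_errors.most_common())
```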

### Let us look at the most likely transition features


In [None]:
print("Number of Transition Features ")
len(crf.transition_features_)

Number of Transition Features 


144
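
The count of 144 follows from `all_possible_transitions=True`: CRFsuite then creates a transition feature for every ordered pair of labels, including pairs never seen in training, and 12 × 12 = 144. A quick check, with the tag list copied from the classification report:

```python
from itertools import product

# The 12 universal tags that appear in the classification report.
tags = ['ADP', 'NOUN', 'CONJ', 'VERB', 'ADJ', '.', 'X', 'NUM', 'DET', 'ADV', 'PRON', 'PRT']

# Every ordered (previous_tag, current_tag) pair, including a tag paired with itself.
all_pairs = list(product(tags, repeat=2))
print(len(all_pairs))
```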

In [None]:
Counter(crf.transition_features_).most_common(20)

[(('ADJ', 'NOUN'), 4.114996),
 (('NOUN', 'NOUN'), 2.935448),
 (('NOUN', 'VERB'), 2.891987),
 (('VERB', 'PRT'), 2.519179),
 (('X', 'VERB'), 2.271558),
 (('ADP', 'NOUN'), 2.265833),
 (('NOUN', 'PRT'), 2.172849),
 (('PRON', 'VERB'), 2.117186),
 (('NUM', 'NOUN'), 2.059221),
 (('DET', 'NOUN'), 2.053832),
 (('ADV', 'VERB'), 1.994419),
 (('ADV', 'ADJ'), 1.957063),
 (('NOUN', 'ADP'), 1.838684),
 (('VERB', 'NOUN'), 1.763319),
 (('ADJ', 'ADJ'), 1.660578),
 (('NOUN', 'CONJ'), 1.591359),
 (('PRT', 'NOUN'), 1.398473),
 (('NOUN', '.'), 1.381863),
 (('NOUN', 'ADV'), 1.380086),
 (('ADV', 'ADV'), 1.301282)]

An adjective is most likely to be followed by a noun: ADJ → NOUN has the highest transition weight (4.11).

In [None]:
Counter(crf.transition_features_).most_common()[-20:]

[(('X', 'NOUN'), -1.136906),
 (('CONJ', 'PRT'), -1.140622),
 (('ADJ', 'DET'), -1.146271),
 (('.', 'DET'), -1.255028),
 (('ADJ', 'PRON'), -1.266624),
 (('PRON', 'DET'), -1.330807),
 (('DET', '.'), -1.336752),
 (('CONJ', '.'), -1.368327),
 (('ADP', 'PRT'), -1.392629),
 (('X', 'NUM'), -1.484666),
 (('DET', 'DET'), -1.509759),
 (('PRT', 'PRT'), -1.522135),
 (('PRT', 'NUM'), -1.562026),
 (('DET', 'ADP'), -1.969625),
 (('X', 'PRT'), -2.096541),
 (('CONJ', 'X'), -2.157477),
 (('PRON', 'PRT'), -2.158365),
 (('ADP', 'X'), -3.107295),
 (('.', 'PRT'), -3.193167),
 (('DET', 'PRT'), -4.377446)]

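Each pair is (tag of the previous word, tag of the current word). A determiner is very unlikely to be followed by a particle (DET → PRT, -4.38), and a particle rarely follows punctuation (. → PRT, -3.19). A token tagged X is unlikely to be followed by a NOUN (-1.14).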
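
The raw tuples are easier to read when printed as `previous -> current` lines. The sketch below shows one way to do that; `format_transitions` is a hypothetical helper, and the sample dict holds a few weights copied from the outputs above. In the notebook, `crf.transition_features_` would be passed in its place.

```python
from collections import Counter

def format_transitions(transition_features, n=3):
    """Return the n highest- and n lowest-weighted transitions as text lines."""
    ranked = Counter(transition_features).most_common()
    lines = []
    # If the dict has fewer than 2*n entries, some lines will repeat.
    for (prev_tag, next_tag), weight in ranked[:n] + ranked[-n:]:
        lines.append(f"{prev_tag:>5} -> {next_tag:<5} {weight:+.3f}")
    return lines

# A few weights copied from the transition feature outputs.
sample = {
    ('ADJ', 'NOUN'): 4.114996,
    ('NOUN', 'NOUN'): 2.935448,
    ('ADP', 'X'): -3.107295,
    ('.', 'PRT'): -3.193167,
    ('DET', 'PRT'): -4.377446,
}
lines = format_transitions(sample, n=2)
for line in lines:
    print(line)
```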



### What are the most likely state features?

In [None]:
print("Number of State Features ",len(crf.state_features_))

Number of State Features  32413


In [None]:
Counter(crf.state_features_).most_common(20)

[(('prev_word:will', 'VERB'), 6.751359),
 (('prev_word:would', 'VERB'), 5.940819),
 (('prefix_1:*', 'X'), 5.830558),
 (('suffix_4:rest', 'NOUN'), 5.644523),
 (('suffix_2:ly', 'ADV'), 5.260228),
 (('is_first_capital', 'NOUN'), 5.043121),
 (('prev_word:could', 'VERB'), 5.018842),
 (('suffix_3:ous', 'ADJ'), 4.870949),
 (('prev_word:to', 'VERB'), 4.849822),
 (('suffix_4:will', 'VERB'), 4.677684),
 (('next_word:appeal', 'ADJ'), 4.386434),
 (('prev_word:how', 'PRT'), 4.35094),
 (('suffix_4:pany', 'NOUN'), 4.329975),
 (('prefix_4:many', 'ADJ'), 4.205028),
 (('prev_word:lock', 'PRT'), 4.153643),
 (('word_has_hyphen', 'ADJ'), 4.151036),
 (('prev_word:tune', 'PRT'), 4.147576),
 (('next_word:Express', 'NOUN'), 4.137127),
 (('suffix_4:food', 'NOUN'), 4.116688),
 (('suffix_2:ed', 'VERB'), 4.070659)]

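If the previous word is "will", "would", "could" or "to", the current word is likely to be a verb, and if the first letter of a word is capitalised, it is likely to be a noun. Words ending in "ed" are likely to be verbs, and words ending in "ly" are likely to be adverbs.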

In [None]:
Counter(crf.state_features_).most_common()[-20:]

[(('suffix_4:less', 'NOUN'), -2.430638),
 (('prev_word:*', 'DET'), -2.435687),
 (('prev_word:moderate', 'NOUN'), -2.517772),
 (('prev_word:paid', 'ADP'), -2.533975),
 (('suffix_4:ment', 'ADJ'), -2.572212),
 (('prev_word:was', 'NOUN'), -2.586244),
 (('prev_word:--', 'CONJ'), -2.58728),
 (('next_word:what', 'CONJ'), -2.621051),
 (('prev_word:--', 'DET'), -2.692732),
 (('prev_word:Media', 'VERB'), -2.6973),
 (('prefix_4:shor', 'NOUN'), -2.698477),
 (('prev_word:their', 'VERB'), -2.714216),
 (('next_word:currency', 'NOUN'), -2.732162),
 (('suffix_4:good', 'NOUN'), -2.809532),
 (('suffix_4:rter', 'ADJ'), -3.174431),
 (('prev_word:*U*', 'VERB'), -3.205405),
 (('next_word:of', 'PRT'), -3.22855),
 (('next_word:swap', 'ADJ'), -3.474744),
 (('prev_word:his', 'VERB'), -3.683731),
 (('word_has_hyphen', 'VERB'), -4.63526)]

If a word contains a hyphen, it is very unlikely to be a verb (this is the most negative state weight, -4.64). A word that follows "his" is unlikely to be a verb, and a word ending in "less" is unlikely to be a noun.
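
State features are keyed by (attribute, label), so the same attribute can push towards one tag and away from another. Grouping the weights by label makes this easier to see. The sketch below is illustrative only: `state_features_for` is a hypothetical helper, and the sample dict holds a few weights copied from the outputs above. In the notebook, `crf.state_features_` would be passed in its place.

```python
from collections import Counter

def state_features_for(state_features, label, n=2):
    """Return the n strongest positive and n strongest negative attributes for one label."""
    weights = Counter(
        {attr: w for (attr, lab), w in state_features.items() if lab == label}
    )
    ranked = weights.most_common()
    return ranked[:n], ranked[-n:]

# A few (attribute, label) weights copied from the state feature outputs.
sample = {
    ('prev_word:will', 'VERB'): 6.751359,
    ('suffix_2:ed', 'VERB'): 4.070659,
    ('prev_word:his', 'VERB'): -3.683731,
    ('word_has_hyphen', 'VERB'): -4.63526,
    ('word_has_hyphen', 'ADJ'): 4.151036,
    ('suffix_3:ous', 'ADJ'): 4.870949,
}
positive, negative = state_features_for(sample, 'VERB')
print(positive)
print(negative)
```

In this sample, `word_has_hyphen` carries a positive weight for ADJ and a negative weight for VERB, which matches the observation about hyphenated words.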