
# POS Tagging as a Classification Problem

---
We apply an SVM to this multiclass classification problem, using one-vs-rest SVM classifiers.
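As a rough illustration of the one-vs-rest scheme: one binary scorer is trained per tag, and at prediction time the tag whose scorer returns the highest score wins. The sketch below shows only that prediction step. The weights are hand-picked, and `toy_weights` and `ovr_predict` are hypothetical names that do not appear in this notebook.

```python
# Hypothetical per-tag linear scorers: tag -> (feature weights, bias).
toy_weights = {
    "NN": ({"suffix=ing": -0.5, "is_capitalized": 0.2}, 0.1),
    "VBG": ({"suffix=ing": 1.5, "is_capitalized": -0.3}, -0.4),
    "NNP": ({"suffix=ing": -0.8, "is_capitalized": 1.2}, -0.2),
}

def ovr_predict(features, weights):
    """Return the tag whose binary scorer gives the largest decision value."""
    scores = {}
    for tag, (w, b) in weights.items():
        scores[tag] = b + sum(w.get(name, 0.0) * value
                              for name, value in features.items())
    return max(scores, key=scores.get), scores

tag, scores = ovr_predict({"suffix=ing": 1.0, "is_capitalized": 0.0}, toy_weights)
print(tag, scores)
```

In the real classifier the weights are learned, one binary problem per tag, rather than written by hand.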

In [None]:
import nltk
nltk.download("treebank")
nltk.download("brown")
from nltk.corpus import treebank

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [None]:
len(treebank.tagged_sents())

3914

We use the `treebank` corpus from the `nltk` library as our data set and treat tagging as a **multiclass classification** problem: each word is mapped to a POS label.

In [None]:
def extract_features(sentence, index, opt):
  if (opt == 1) :
    return {
      'word':sentence[index],
      'suffix-1':sentence[index][-1],
      'suffix-2':sentence[index][-2:],
      'suffix-3':sentence[index][-3:],
      'prev_word':'' if index == 0 else sentence[index-1],
      'next_word':'' if index == (len(sentence)-1) else sentence[index+1]
  }
  elif (opt == 2) :
    return {
      'word':sentence[index],
      'is_first':index==0,
      'is_last':index ==len(sentence)-1,
      'is_capitalized':sentence[index][0].upper() == sentence[index][0],
      'is_all_caps': sentence[index].upper() == sentence[index],
      'is_all_lower': sentence[index].lower() == sentence[index],
      'is_alphanumeric': sentence[index].isalnum(),
      'prefix-1':sentence[index][0],
      'prefix-2':sentence[index][:2],
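      'prefix-3':sentence[index][:3],
      # NOTE: duplicate key. This entry overwrites the one above, so 'prefix-3'
      # ends up holding the first four characters (see the output below).
      # Renaming it to 'prefix-4' would add a feature and change the results reported later.
      'prefix-3':sentence[index][:4],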
      'suffix-1':sentence[index][-1],
      'suffix-2':sentence[index][-2:],
      'suffix-3':sentence[index][-3:],
      'prev_word':'' if index == 0 else sentence[index-1],
      'next_word':'' if index == (len(sentence)-1) else sentence[index+1],
      'has_hyphen': '-' in sentence[index],
      'is_numeric': sentence[index].isdigit(),
      'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
  }

import pprint
pprint.pprint(extract_features(['Hello', 'read3er'], index = 0, opt = 2))

{'capitals_inside': False,
 'has_hyphen': False,
 'is_all_caps': False,
 'is_all_lower': False,
 'is_alphanumeric': True,
 'is_capitalized': True,
 'is_first': True,
 'is_last': False,
 'is_numeric': False,
 'next_word': 'read3er',
 'prefix-1': 'H',
 'prefix-2': 'He',
 'prefix-3': 'Hell',
 'prev_word': '',
 'suffix-1': 'o',
 'suffix-2': 'lo',
 'suffix-3': 'llo',
 'word': 'Hello'}


With `opt=1`, each word is mapped to a small set of features:

`word` $\mapsto$ (`word`, `prev_word`, `next_word`, `suffix-1`, `suffix-2`, `suffix-3`)

With `opt=2`, the feature set also includes prefixes and flags for capitalization, position, hyphens and digits, as in the output above.
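The mapping for the smaller feature set can be sketched on a toy sentence. The helper name `basic_features` is hypothetical; it only mirrors the `opt=1` branch of `extract_features`.

```python
def basic_features(sentence, index):
    """Sketch of the opt=1 feature set: the word, its suffixes and its neighbours."""
    word = sentence[index]
    return {
        "word": word,
        "suffix-1": word[-1],
        "suffix-2": word[-2:],
        "suffix-3": word[-3:],
        # Sentence boundaries are encoded as empty strings.
        "prev_word": "" if index == 0 else sentence[index - 1],
        "next_word": "" if index == len(sentence) - 1 else sentence[index + 1],
    }

features = basic_features(["The", "dog", "barked"], 2)
print(features)
```

For the last word, `next_word` is the empty string, and the `-ed` suffix is exposed as `suffix-2`, which is the kind of cue a classifier can use for past-tense verbs.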

In [None]:
def transform_to_dataset(tagged_sentences, opt):
    X, y = [], []
    for sentence in tagged_sentences:
        recover_sentence = [word[0] for word in sentence]
        recover_tag      = [word[1] for word in sentence]
        for word_index in range(len(sentence)):
            X.append(extract_features(recover_sentence, word_index, opt = opt))
            y.append(recover_tag[word_index])
    return X, y

opt = 2
penn_train_size = int(0.8*len(treebank.tagged_sents())) # training size is 80% of tagged sents
penn_training   = treebank.tagged_sents()[:penn_train_size]
penn_testing    = treebank.tagged_sents()[penn_train_size:]
X_penn_train, y_penn_train = transform_to_dataset(penn_training, opt = opt)
X_penn_test , y_penn_test  = transform_to_dataset(penn_testing , opt = opt)

After splitting the data into training and test sets, we store each split in a dataset format: `X_penn_train` is a list of dictionaries, and each dictionary contains the features of one word of a sentence. `y_penn_train` is the parallel list of tags. Words appear in corpus order, so consecutive entries cover consecutive sentences of the corpus.
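To make the layout concrete, here is a minimal sketch on two hand-written tagged sentences. `toy_tagged` and `toy_features` are hypothetical stand-ins for the treebank data and for `extract_features`.

```python
toy_tagged = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("It", "PRP"), ("runs", "VBZ")],
]

def toy_features(words, i):
    # Stand-in for extract_features: just the word and two position flags.
    return {"word": words[i], "is_first": i == 0, "is_last": i == len(words) - 1}

X, y = [], []
for sentence in toy_tagged:
    words = [w for w, _ in sentence]
    for i, (_, tag) in enumerate(sentence):
        X.append(toy_features(words, i))   # one feature dictionary per word
        y.append(tag)                      # the tag at the same position

print(len(X), len(y))
print(X[3], y[3])
```

The two sentences are flattened into five rows. Sentence boundaries survive only through features such as `is_first` and `is_last`.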

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

We first represent the features with **one-hot encoding**, since most of them are categorical. This gives the sparse matrices `X_train` and `X_test`, where the $i$-th row of `X_train` is the encoded vector for the $i$-th word of the training corpus (and likewise for `X_test` and the test corpus). `y_penn_train` and `y_penn_test` hold the corresponding labels.
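A small sketch of what the encoding does, using toy dictionaries rather than the treebank features (`train_dicts`, `test_dicts`, `vec`, `X_tr` and `X_te` are illustrative names). String-valued features get one column per observed value, while boolean and numeric features keep a single column.

```python
from sklearn.feature_extraction import DictVectorizer

train_dicts = [
    {"word": "the", "is_first": True},
    {"word": "cat", "is_first": False},
]
test_dicts = [{"word": "dog", "is_first": True}]  # "dog" never appears in training

vec = DictVectorizer()
X_tr = vec.fit_transform(train_dicts)  # learns one column per (feature, string value) pair
X_te = vec.transform(test_dicts)       # reuses the columns learned on the training data

print(vec.vocabulary_)   # column index of each encoded feature
print(X_tr.toarray())
print(X_te.toarray())
```

A word that was not seen during fitting has no column, so in the test row only `is_first` is set. This is why the vectorizer is fitted on the training data only and then reused for the test data.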

In [None]:
# Fit a DictVectorizer that learns the one-hot encoding on the training data
v = DictVectorizer()
X_train = v.fit_transform(X_penn_train)

# Encode test data using fitted vectorizer
X_test  = v.transform(X_penn_test)

In [None]:
X_train

<80637x42386 sparse matrix of type '<class 'numpy.float64'>'
	with 1451466 stored elements in Compressed Sparse Row format>

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_reg = LogisticRegression().fit(X_train,y_penn_train)
logistic_regPreds = logistic_reg.predict(X_test)
print(classification_report(y_penn_test,logistic_regPreds,zero_division= 1))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


              precision    recall  f1-score   support

           #       1.00      1.00      1.00         2
           $       1.00      1.00      1.00       242
          ''       1.00      1.00      1.00        78
           ,       1.00      1.00      1.00       930
       -LRB-       1.00      1.00      1.00        26
      -NONE-       1.00      1.00      1.00      1340
       -RRB-       1.00      1.00      1.00        26
           .       1.00      1.00      1.00       762
           :       1.00      1.00      1.00        77
          CC       1.00      1.00      1.00       429
          CD       1.00      1.00      1.00      1032
          DT       0.99      0.99      0.99      1611
          EX       0.88      1.00      0.93         7
          IN       0.98      0.98      0.98      1952
          JJ       0.86      0.88      0.87      1087
         JJR       0.82      0.82      0.82        76
         JJS       0.77      0.89      0.83        38
          MD       0.99    

In [None]:
from sklearn.naive_bayes import BernoulliNB
nb_clf = BernoulliNB(binarize = None).fit(X_train,y_penn_train)
nbPreds = nb_clf.predict(X_test)
print(classification_report(y_penn_test,nbPreds,zero_division= 1))

              precision    recall  f1-score   support

           #       1.00      0.00      0.00         2
           $       1.00      0.00      0.00       242
          ''       1.00      0.99      0.99        78
           ,       1.00      1.00      1.00       930
       -LRB-       1.00      0.00      0.00        26
      -NONE-       0.78      1.00      0.87      1340
       -RRB-       1.00      0.00      0.00        26
           .       1.00      1.00      1.00       762
           :       1.00      0.00      0.00        77
          CC       1.00      0.99      1.00       429
          CD       0.98      0.90      0.94      1032
          DT       0.98      0.99      0.98      1611
          EX       1.00      0.00      0.00         7
          IN       0.87      0.98      0.92      1952
          JJ       0.69      0.69      0.69      1087
         JJR       1.00      0.00      0.00        76
         JJS       1.00      0.00      0.00        38
          MD       1.00    
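Several tags in the report above show precision 1.00 together with recall 0.00. A likely explanation is that the classifier never predicts those tags: precision is then 0/0, and `zero_division=1` makes the report print 1.00 for it. The sketch below reproduces that effect with a hand-written metric; `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label, zero_division=1.0):
    """Per-label precision and recall, with a fallback value for 0/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else zero_division
    recall = tp / (tp + fn) if (tp + fn) else zero_division
    return precision, recall

y_true = ["$", "NN", "NN", "$"]
y_pred = ["NN", "NN", "NN", "NN"]   # the tag "$" is never predicted

print(precision_recall(y_true, y_pred, "$"))
print(precision_recall(y_true, y_pred, "NN"))
```

For such tags the recall and F1 columns are the informative ones; the precision of 1.00 is a product of the `zero_division` setting.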

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5).fit(X_train,y_penn_train)
neighPreds = neigh.predict(X_test)
print(classification_report(y_penn_test,neighPreds,zero_division= 1))

              precision    recall  f1-score   support

           #       1.00      1.00      1.00         2
           $       1.00      1.00      1.00       242
          ''       1.00      1.00      1.00        78
           ,       1.00      1.00      1.00       930
       -LRB-       1.00      1.00      1.00        26
      -NONE-       1.00      1.00      1.00      1340
       -RRB-       1.00      1.00      1.00        26
           .       1.00      1.00      1.00       762
           :       1.00      1.00      1.00        77
          CC       0.99      1.00      1.00       429
          CD       0.99      1.00      1.00      1032
          DT       0.98      0.99      0.99      1611
          EX       0.47      1.00      0.64         7
          IN       0.97      0.98      0.98      1952
          JJ       0.80      0.84      0.82      1087
         JJR       0.79      0.84      0.82        76
         JJS       0.81      0.89      0.85        38
          MD       0.96    

In [None]:
LinearSVCClassObj = LinearSVC().fit(X_train,y_penn_train) #fitting SVM one-vs-rest classifier
LinearSVCPreds    = LinearSVCClassObj.predict(X_test)     #predicting labels using fitted classifier
print(classification_report(y_penn_test,LinearSVCPreds,zero_division= 1))

              precision    recall  f1-score   support

           #       1.00      1.00      1.00         2
           $       1.00      1.00      1.00       242
          ''       1.00      1.00      1.00        78
           ,       1.00      1.00      1.00       930
       -LRB-       1.00      1.00      1.00        26
      -NONE-       1.00      1.00      1.00      1340
       -RRB-       1.00      1.00      1.00        26
           .       1.00      1.00      1.00       762
           :       1.00      1.00      1.00        77
          CC       1.00      1.00      1.00       429
          CD       1.00      1.00      1.00      1032
          DT       0.99      0.99      0.99      1611
          EX       0.88      1.00      0.93         7
          IN       0.98      0.98      0.98      1952
          JJ       0.87      0.90      0.88      1087
         JJR       0.82      0.86      0.84        76
         JJS       0.88      0.95      0.91        38
          MD       0.99    

We fit a linear SVM classifier (`LinearSVC`, one-vs-rest) and report the results above.

We also try a kernel SVM with a polynomial kernel.
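A polynomial-kernel SVM could be fitted along the following lines. This is a minimal sketch on toy data, not the configuration used in this notebook: the data, the variable names and `degree=2` are all assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

# Toy stand-ins for X_penn_train / y_penn_train (hypothetical data).
toy_X = [
    {"word": "the", "suffix-2": "he"},
    {"word": "dog", "suffix-2": "og"},
    {"word": "a", "suffix-2": "a"},
    {"word": "cat", "suffix-2": "at"},
]
toy_y = ["DT", "NN", "DT", "NN"]

vec = DictVectorizer()
poly_svm = SVC(kernel="poly", degree=2)   # degree=2 is an arbitrary choice for this sketch
poly_svm.fit(vec.fit_transform(toy_X), toy_y)

preds = poly_svm.predict(vec.transform([{"word": "the", "suffix-2": "he"}]))
print(preds)
```

Kernel SVMs scale much less favourably with the number of training samples than `LinearSVC`, so on the full training matrix it may be necessary to fit on a subset.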

# POS Tagging as a Sequence-to-Sequence Labelling Problem

--------------------------------------------
Now we treat POS tagging as a seq2seq labelling problem, mapping each sentence to a tag sequence of the same length, and apply a hidden Markov model (HMM).
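An HMM tagger scores a whole tag sequence using tag-to-tag transition probabilities and tag-to-word emission probabilities, and decoding is typically done with the Viterbi algorithm. The sketch below is a minimal first-order Viterbi decoder with hand-picked toy probabilities. All names and numbers are hypothetical, the sentence is assumed to be non-empty, and unseen words get a small constant probability `unseen` instead of proper smoothing.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unseen=1e-6):
    """Most probable tag sequence under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               tag at position i - 1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unseen), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], unseen), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"barks": 0.6},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# → ['DT', 'NN', 'VB']
```

Unlike the classifiers above, which label each word independently from its local features, this decoder picks the tag sequence that is jointly most probable for the whole sentence.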

In [None]:
from nltk.tag import HiddenMarkovModelTagger
# penn_training is the training data
hmm_tagger  = HiddenMarkovModelTagger.train(penn_training)
print(hmm_tagger)

<HiddenMarkovModelTagger 46 states and 11044 output symbols>


In [None]:
# Tag the raw (untagged) test sentences with the trained HMM
hmm_ = [hmm_tagger.tag(sent) for sent in treebank.sents()[penn_train_size:]]
# hmm_pred stores the predicted labels as one flat list, aligned with y_penn_test
hmm_pred = [tag for tagged_sent in hmm_ for _, tag in tagged_sent]

In [None]:
print(classification_report(y_penn_test,hmm_pred, zero_division = 1))
# Number of distinct tags seen in the training data
len(set(y_penn_train))

              precision    recall  f1-score   support

           #       1.00      1.00      1.00         2
           $       0.76      1.00      0.86       242
          ''       0.53      1.00      0.69        78
           ,       0.96      1.00      0.98       930
       -LRB-       1.00      0.88      0.94        26
      -NONE-       0.94      1.00      0.97      1340
       -RRB-       0.62      0.92      0.74        26
           .       0.90      1.00      0.95       762
           :       1.00      1.00      1.00        77
          CC       0.97      1.00      0.99       429
          CD       0.94      0.85      0.89      1032
          DT       0.90      0.99      0.95      1611
          EX       1.00      1.00      1.00         7
          IN       0.97      0.98      0.97      1952
          JJ       0.79      0.79      0.79      1087
         JJR       0.80      0.87      0.84        76
         JJS       1.00      0.84      0.91        38
          LS       0.00    

46