<a href="https://colab.research.google.com/github/lingduoduo/NLP/blob/master/NLP_Series_POS_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# So, it is time to learn to PoS Tag!

In this notebook, I'll guide you through the steps of training some models to be further utilized in our NLP Tool to do PoS Tagging. Here we won't apply any state of the art algorithm, but we won't be far either!

There are different techniques for POS Tagging:

* Lexical Based Methods — Assigns the POS tag the most frequently occurring with a word in the training corpus.
* Rule-Based Methods — Assigns POS tags based on rules. For example, we can have a rule that says, words ending with “ed” or “ing” must be assigned to a verb. Rule-Based Techniques can be used along with Lexical Based approaches to allow POS Tagging of words that are not present in the training corpus but are there in the testing data.
* Probabilistic Methods — This method assigns the POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to assign a POS Tag.
* Deep Learning Methods — Recurrent Neural Networks can also be used for POS tagging.


## Getting the data (Corpus)

Let us start by where we'll get our data (our **corpus**). There are many sources, but two are the most commonly used:
* **Penn Treebank** subset from nltk (you can buy the entire Treebank, if you want, but you'll have to invest some $700~).
* The **Universal Dependencies** Treebanks, available (as of February 2020) for 90 languages (in different quality and quantity levels).

These contain the hard work of many **annotators**, which went through selected sets of sentences and annotated each one by hand, forming a corpus to be used as **supervised** input for our **machine learning algorithms**.

The following two cells will show how to import the corpus from each of these two sources.

In [1]:
#This cell loads the Penn Treebank corpus from nltk into a list variable named penn_treebank.

#No need to install nltk in google colab since it is preloaded in the environments.
#!pip install nltk
import nltk

#Ensure that the treebank corpus is downloaded
nltk.download('treebank')

#Load the treebank corpus class
from nltk.corpus import treebank

#Now we iterate over all samples from the corpus (the fileids - that are equivalent to sentences)
#and retrieve the word and the pre-labeled PoS tag. This will be added as a list of tuples with
#a list of words and a list of their respective PoS tags (in the same order).
penn_treebank = []
for fileid in treebank.fileids():
  tokens = []
  tags = []
  for word, tag in treebank.tagged_words(fileid):
    tokens.append(word)
    tags.append(tag)
  penn_treebank.append((tokens, tags))

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/linghuang/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


In [2]:
#This cell loads the Universal Dependecies Treekbank corpus. It'll download all the packages, but we'll only use the GUM
#english package. We'll also install the conllu package, that was developed to parse data in the conLLu format, a
#format common of linguistic annotated files. We'll also have a list variable, but now named ud_treebank.

#Install conllu package, download the UD Treebanks corpus and unpack it.
!pip install conllu
!wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
!tar zxf ud-treebanks-v2.5.tgz

#The imports needed to open and parse (interpret) the conllu file. At the end we'll have a list of dicts.
from io import open
from conllu import parse_incr

#Open the file and load the sentences to a list.
data_file = open("ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-train.conllu", "r", encoding="utf-8")
ud_files = []
for tokenlist in parse_incr(data_file):
    ud_files.append(tokenlist)

#Now we iterate over all samples from the corpus and retrieve the word and the pre-labeled PoS tag (upostag). This will
#be added as a list of tuples with a list of words and a list of their respective PoS tags (in the same order).
ud_treebank = []
for sentence in ud_files:
  tokens = []
  tags = []
  for token in sentence:
    tokens.append(token['form'])
    tags.append(token['upostag'])
  ud_treebank.append((tokens, tags))

Collecting conllu
  Downloading conllu-5.0.1-py3-none-any.whl.metadata (21 kB)
Downloading conllu-5.0.1-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-5.0.1
--2024-07-18 14:19:30--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 355216681 (339M) [application/x-gzip]
Saving to: ‘ud-treebanks-v2.5.tgz’


2024-07-18 14:21:25 (2.98 MB/s) - ‘ud-treebanks-v2.5.tgz’ saved [355216681/355216681]



**Word of Caution!**

Penn Treebank and UD Treebanks use *distinct tagsets*.

We won't be able to interchange them unless we make a converter - also, we'll only be able to do so from Penn->UD, because Penn Treebank has tags more detailed than UD, and we won't be able to retrieve these details from the tags without a third function and a lot of effort.

We'll only do that later, in our code.

Let us continue with the explanation of the Tagger.

### Extracting Features form Words

Next, we have to create a function that is able to extract features from our words. These features will be used to predict the PoS.

For that,  for each word, we'll pass the sentence and word index, and we'll provide a dict with the features.
* Is the first letter capitalised.
* Is it the first word in the sentence?
* Is it the last word?
* What is the prefix of the word?
* What is the suffix of the word?
* Is the complete word captilised?
* What is the previous word?
* What is the next word?
* Is it numeric?
* Is it alphanumeric?
* Is there an hyphen in the word?

To explain about the feature set (can be changed, if you want), it is composed by:
* Word: the word itself. Some words are always one PoS, others not.
* is_first, is_last: check if it is the first or last in the sentence.
* is_capitalized: first letter is caps? Maybe it is a proper noun...
* is_all_caps or is_all_lower: checks for acronyms (or common words).
* prefixes/suffixes: check word initialization/termination
* prev_word/next_word: checks the preceding and succeding word.
* has-hyphen: words with '-' may be adjectives.
* is_numeric: for numbers.
* capitals_inside: weird cases. Maybe nouns.

If you're wondering, yes, this encoding WILL need a lot of memory for training (if you're not using categorical variables).

And we'll have to replicate this in our main code.

In [3]:
#Regex module for checking alphanumeric values.
import re

def extract_features(sentence, index):
  return {
      'word':sentence[index],
      'is_first':index==0,
      'is_last':index ==len(sentence)-1,
      'is_capitalized':sentence[index][0].upper() == sentence[index][0],
      'is_all_caps': sentence[index].upper() == sentence[index],
      'is_all_lower': sentence[index].lower() == sentence[index],
      'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence[index])))),
      'prefix-1':sentence[index][0],
      'prefix-2':sentence[index][:2],
      'prefix-3':sentence[index][:3],
      'prefix-3':sentence[index][:4],
      'suffix-1':sentence[index][-1],
      'suffix-2':sentence[index][-2:],
      'suffix-3':sentence[index][-3:],
      'suffix-3':sentence[index][-4:],
      'prev_word':'' if index == 0 else sentence[index-1],
      'next_word':'' if index < len(sentence) else sentence[index+1],
      'has_hyphen': '-' in sentence[index],
      'is_numeric': sentence[index].isdigit(),
      'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
  }

We now prepare the dataset for use in Machine Learning algorithms.

There are two steps (three, if we're doing deep learning, but that's for later) to it:
* Defining a function to transform the corpus to a more datsetish format.
* Then, divide the encoded data into training and testing sets.

In [5]:
#Ater defining the extract_features, we define a simple function to transform our data in a more 'datasetish' format.
#This function returns the data as two lists, one of Dicts of features and the other with the labels.
def transform_to_dataset(tagged_sentences):
  X, y = [], []
  for sentence, tags in tagged_sentences:
    sent_word_features, sent_tags = [],[]
    for index in range(len(sentence)):
        sent_word_features.append(extract_features(sentence, index)),
        sent_tags.append(tags[index])
    X.append(sent_word_features)
    y.append(sent_tags)
  return X, y

#We divide the set BEFORE encoding. Why? To have full sentences in training/testing sets. When we encode, we do not encode
#a sentence, but its words instead.

#First, for the Penn treebank.
penn_train_size = int(0.8*len(penn_treebank))
penn_training = penn_treebank[:penn_train_size]
penn_testing = penn_treebank[penn_train_size:]
X_penn_train, y_penn_train = transform_to_dataset(penn_training)
X_penn_test, y_penn_test = transform_to_dataset(penn_testing)

#Then, for UD Treebank.
ud_train_size = int(0.8*len(ud_treebank))
ud_training = ud_treebank[:ud_train_size]
ud_testing = ud_treebank[ud_train_size:]
X_ud_train, y_ud_train = transform_to_dataset(ud_training)
X_ud_test, y_ud_test = transform_to_dataset(ud_testing)

#Third step, vectorize datasets. For that we use sklearn DictVectorizer
#WARNING

In [10]:
len(X_penn_train)

159

In [14]:
len(y_penn_train)

159

In [22]:
X_penn_train[0]

[{'word': 'Pierre',
  'is_first': True,
  'is_last': False,
  'is_capitalized': True,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_alphanumeric': 0,
  'prefix-1': 'P',
  'prefix-2': 'Pi',
  'prefix-3': 'Pier',
  'suffix-1': 'e',
  'suffix-2': 're',
  'suffix-3': 'erre',
  'prev_word': '',
  'next_word': '',
  'has_hyphen': False,
  'is_numeric': False,
  'capitals_inside': False},
 {'word': 'Vinken',
  'is_first': False,
  'is_last': False,
  'is_capitalized': True,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_alphanumeric': 0,
  'prefix-1': 'V',
  'prefix-2': 'Vi',
  'prefix-3': 'Vink',
  'suffix-1': 'n',
  'suffix-2': 'en',
  'suffix-3': 'nken',
  'prev_word': 'Pierre',
  'next_word': '',
  'has_hyphen': False,
  'is_numeric': False,
  'capitals_inside': False},
 {'word': ',',
  'is_first': False,
  'is_last': False,
  'is_capitalized': True,
  'is_all_caps': True,
  'is_all_lower': True,
  'is_alphanumeric': 0,
  'prefix-1': ',',
  'prefix-2': ',',
  'prefix-3': 

In [23]:
len(X_penn_test)

40

In [24]:
len(y_penn_test)

40

In [25]:
X_penn_test[0]

[{'word': 'Savin',
  'is_first': True,
  'is_last': False,
  'is_capitalized': True,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_alphanumeric': 0,
  'prefix-1': 'S',
  'prefix-2': 'Sa',
  'prefix-3': 'Savi',
  'suffix-1': 'n',
  'suffix-2': 'in',
  'suffix-3': 'avin',
  'prev_word': '',
  'next_word': '',
  'has_hyphen': False,
  'is_numeric': False,
  'capitals_inside': False},
 {'word': 'Corp.',
  'is_first': False,
  'is_last': False,
  'is_capitalized': True,
  'is_all_caps': False,
  'is_all_lower': False,
  'is_alphanumeric': 0,
  'prefix-1': 'C',
  'prefix-2': 'Co',
  'prefix-3': 'Corp',
  'suffix-1': '.',
  'suffix-2': 'p.',
  'suffix-3': 'orp.',
  'prev_word': 'Savin',
  'next_word': '',
  'has_hyphen': False,
  'is_numeric': False,
  'capitals_inside': False},
 {'word': 'reported',
  'is_first': False,
  'is_last': False,
  'is_capitalized': False,
  'is_all_caps': False,
  'is_all_lower': True,
  'is_alphanumeric': 0,
  'prefix-1': 'r',
  'prefix-2': 're',
  'pref

# Training a Tagger

Now, we can train supervised machine learning algorithms to PoS Tagging.

We'll use the Conditional Random Fields (CRF) algorithm. Here's a brief explanation:

* **CRF**: A variation of Markov Random Field. Okay, that might not have helped. It is a discriminative model that, in a quick summary, evaluates the probabilities that a set of states are dependant or not between themselves based on a set of observations. In this case, it evaluates the probabilities that a word observed in a context (defined by the above mentioned features) belongs to a specific PoS. In training time, it takes what is the best state given the set of current observations and probabilities.

<div>
<img src="https://miro.medium.com/max/681/1*8hOWH7YF5INMF2OPhKjVxA.png" width="400"/>
</div>

Want more math? Read this: https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776

So, to achieve this, we'll use scikit learn (sklearn) and a sklearn compatible crf suite (skleran_crfsuit). If you don't know what is sklearn, [read this](https://scikit-learn.org/stable/getting_started.html).

In [26]:
#Ignoring some warnings for the sake of readability.
import warnings
warnings.filterwarnings('ignore')

#First, install sklearn_crfsuite, as it is not preloaded into Colab.
!pip install sklearn_crfsuite
from sklearn_crfsuite import CRF

#This loads the model. Specifics are:
#algorithm: methodology used to check if results are improving. Default is lbfgs (gradient descent).
#c1 and c2:  coefficients used for regularization.
#max_iterations: max number of iterations (DUH!)
#all_possible_transitions: since crf creates a "network", of probability transition states,
#this option allows it to map even "connections" not present in the data.
penn_crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
#The fit method is the default name used by Machine Learning algorithms to start training.
print("Started training on Penn Treebank corpus!")
penn_crf.fit(X_penn_train, y_penn_train)
print("Finished training on Penn Treebank corpus!")

#Same for UD
ud_crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
print("Started training on UD corpus!")
ud_crf.fit(X_ud_train, y_ud_train)
print("Finished training on UD corpus!")

Collecting sklearn_crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn_crfsuite)
  Downloading python_crfsuite-0.9.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m13.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: python-crfsuite, sklearn_crfsuite
Successfully installed python-crfsuite-0.9.10 sklearn_crfsuite-0.5.0
Started training on Penn Treebank corpus!
Finished training on Penn Treebank corpus!
Started training on UD corpus!
Finished training on UD corpus!


# Checking the Results

For that, we'll use a score method named balanced f-score. This score takes into account *precision* and *recall*.

* **precision**: Considering the universe of tagged words, how many were correctly tagged?
* **recall**: Considering the universe of correct tags, how many words were really correctly tagged?

The distinction is in the direction you look. Precision looks at all tagged words to find how many are ok; Recall looks at correct tags to find how many were able to be "guessed".

F-score is then calculated using these two. I won't go into the maths of it.  If you want,
* You can read the wikipedia article here: https://en.wikipedia.org/wiki/F1_score
* Or watch a neat simple video here: https://www.youtube.com/watch?v=j-EB6RqqjGI&ab_channel=CodeEmporium

Also, here's the wikipedia image to help you understand:
<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png"/>
</div>

We won't go into the computations either. Let the package do its thing (after all, we're interested in NLP now, not in statistics):

In [27]:
#We'll use the sklearn_crfsuit own metrics to compute f1 score.
from sklearn_crfsuite import metrics
from sklearn_crfsuite import scorers
print("## Penn ##")

#First calculate a prediction from test data, then we print the metrics for f-1 using the .flat_f1_score method.
y_penn_pred=penn_crf.predict(X_penn_test)
print("F1 score on Test Data")
print(metrics.flat_f1_score(y_penn_test, y_penn_pred,average='weighted',labels=penn_crf.classes_))
#For the sake of clarification, we do the same for train data.
y_penn_pred_train=penn_crf.predict(X_penn_train)
print("F1 score on Training Data ")
print(metrics.flat_f1_score(y_penn_train, y_penn_pred_train,average='weighted',labels=penn_crf.classes_))

# This presents class wise score. Helps see which classes (tags) are the ones with most problems.
print("Class wise score:")
print(metrics.flat_classification_report(
    y_penn_test, y_penn_pred, labels=penn_crf.classes_, digits=3
))

#Same for UD
print("## UD ##")

y_ud_pred=ud_crf.predict(X_ud_test)
print("F1 score on Test Data ")
print(metrics.flat_f1_score(y_ud_test, y_ud_pred,average='weighted',labels=ud_crf.classes_))
y_ud_pred_train=ud_crf.predict(X_ud_train)
print("F1 score on Training Data ")
print(metrics.flat_f1_score(y_ud_train, y_ud_pred_train,average='weighted',labels=ud_crf.classes_))

### Look at class wise score
print("Class wise score:")
print(metrics.flat_classification_report(
    y_ud_test, y_ud_pred, labels=ud_crf.classes_, digits=3
))


## Penn ##
F1 score on Test Data
0.9668646324625245
F1 score on Training Data 
0.9936643188628935
Class wise score:
              precision    recall  f1-score   support

         NNP      0.952     0.963     0.957      1213
           ,      1.000     1.000     1.000       592
          CD      1.000     0.999     0.999       683
         NNS      0.964     0.986     0.975       740
          JJ      0.879     0.912     0.895       731
          MD      0.993     1.000     0.996       135
          VB      0.980     0.946     0.963       313
          DT      0.992     0.993     0.992      1062
          NN      0.962     0.955     0.958      1899
          IN      0.981     0.980     0.981      1285
           .      1.000     1.000     1.000       509
         VBZ      0.958     0.936     0.947       219
         VBG      0.936     0.876     0.905       185
          CC      1.000     0.997     0.998       287
         VBD      0.965     0.945     0.955       492
         VBN      0

Not too shabby!

Remember that State of the Art results for Penn Treebank are at 97% f1.

Now, notice how UD is worse (90%)? Probably because there aren't many tags, so less variation and less classes for probability distribution.

---

But, wouldn't it be better if we could see it actually working?

That's what the following cell does. It also helps us understand what we'll have to implement in our main algorithm for it to work.

Feel free to play with the input phrase.



In [28]:
#First, we pass the sentence and "quickly tokenize it" - we've already done it in our code, so I'll just mock here with a split:
sent = "The tagger produced good results"
features = [extract_features(sent.split(), idx) for idx in range(len(sent.split()))]

#Then we tell the algorithm to make a prediction on a single input (sentence). I'll do once for Penn Treebank and once for UD.
penn_results = penn_crf.predict_single(features)
ud_results = ud_crf.predict_single(features)

#These line magics are just there to make it a neaty print, making a (word, POS) style print;
penn_tups = [(sent.split()[idx], penn_results[idx]) for idx in range(len(sent.split()))]
ud_tups = [(sent.split()[idx], ud_results[idx]) for idx in range(len(sent.split()))]

#The results come out here! Notice the difference in tags.
print(penn_tups)
print(ud_tups)

[('The', 'DT'), ('tagger', 'NN'), ('produced', 'VBN'), ('good', 'JJ'), ('results', 'NNS')]
[('The', 'DET'), ('tagger', 'NOUN'), ('produced', 'VERB'), ('good', 'ADJ'), ('results', 'NOUN')]


### Top Most likely Transition Features

In [None]:
print("Number of Transition Features ")
len(crf.transition_features_)

In [None]:
from collections import Counter

Counter(crf.transition_features_).most_common(20)

In [None]:
Counter(crf.transition_features_).most_common()[-20:]

### Top Most Likely State Features

In [None]:
print("Number of State Features ")
len(crf.state_features_)

In [None]:
Counter(crf.state_features_).most_common(20)

In [None]:
Counter(crf.state_features_).most_common()[-20:]

# Saving the Weights

We will want to load this to our NLPTools, right? So we have to save the weights. This means saving the classifier we trained to be able to classify our tokens.

To do it, we use Pickle, which is a Python package to save a readable binary file extension called "pickle". We'll later open this in our tool.



In [None]:
#import the pickle module
import pickle

#Simply dump! Use 'wb' in open to write bytes.

penn_filename = 'penn_treebank_crf_postagger.sav'
pickle.dump(penn_crf, open(penn_filename, 'wb'))

ud_filename = 'ud_crf_postagger.sav'
pickle.dump(ud_crf, open(ud_filename,'wb'))

To open the file, we just have to import the module and read the file using:

`model = pickle.load(open(filename, 'rb'))`

Great, we now have pickle files that can be loaded in our tool. Just download them using the lefthand file explorer and we're good to go!
See you back at the article!