<a href="https://colab.research.google.com/github/h-aldarmaki/NLPCourse/blob/main/POS_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Part-of-Speech Tagging**

In this exercise, we will implement part-of-speech tagging using the following models:


*   Most Frequent Tag (baseline)
*   NLTK ``pos_tag`` function
*   Hidden Markov Model

The objective of this exercise is to familiarize you with sequence labeling tasks, baselines, and model evaluation. 

## Data & Tagset

We will use the Brown corpus. Download the datasets from Blackboard and upload the files here, or fetch them with the download cell in Step 1.1.

The Brown tagset includes 87 tags. To simplify things, we will convert the tags to the universal tagset. The NLTK mapping used in this notebook produces the following 12 tags:

* ADJ: adjective
* ADP: adposition
* ADV: adverb
* CONJ: conjunction
* DET: determiner
* NOUN: noun
* NUM: numeral
* PRT: particle
* PRON: pronoun
* VERB: verb
* `.`: punctuation
* X: other


## Tools

* We will use ``nltk.tag.mapping`` for mapping the tags between different tagsets.

* We will use ``nltk.pos_tag`` for tagging and compare its performance with the baseline.

* We will use ``nltk.HiddenMarkovModelTagger`` for training a hidden Markov model.












# Step 1: Prepare the Data

## Step 1.1: Read and process input files

After running the following block, you will have two files: ``brown.train.tagged.txt``  and ``brown.test.tagged.txt``

The blocks after that read both files and store the sentences in the lists ``trainset_txt`` and ``testset_txt``.

In [None]:
#Download files

!wget https://raw.githubusercontent.com/h-aldarmaki/NLPCourse/main/data/brown.train.tagged.txt
!wget https://raw.githubusercontent.com/h-aldarmaki/NLPCourse/main/data/brown.test.tagged.txt

In [None]:
#Let's first read the file: training set
filename = 'brown.train.tagged.txt'
with open(filename, 'rt') as file: #the file is closed automatically when the block ends
    trainset_txt = file.read()

#split sentences by new line character
trainset_txt = trainset_txt.split('\n')

#check the output (printing the first 10 sentences)
print(trainset_txt[:10])

In [None]:
#Now let's read the test set file
filename = 'brown.test.tagged.txt'
with open(filename, 'rt') as file: #the file is closed automatically when the block ends
    testset_txt = file.read()

#split sentences
testset_txt = testset_txt.split('\n')

print(testset_txt[:10])

## Step 1.2 

In the following block, we separate the words from the tags and store each sentence as a list of (word, tag) tuples. The resulting sentences will be stored in ``trainset`` and ``testset``.
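
As a small illustration of this conversion, the sketch below parses one made-up line in the ``word/TAG`` format into (word, tag) tuples. The sample line and the helper name ``parse_tagged_line`` are assumptions made for illustration, and the tagset mapping step is left out so that the snippet runs without any NLTK data.

```python
def parse_tagged_line(line):
    """Turn a line such as 'The/at dog/nn' into [('the', 'AT'), ('dog', 'NN')]."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last "/", so a word that itself contains "/" stays whole
        word, sep, tag = token.rpartition("/")
        if sep and word:  # skip tokens that have no "/" or an empty word
            pairs.append((word.lower(), tag.upper()))
    return pairs


sample_line = "The/at dog/nn barked/vbd ./."  # made-up example line
print(parse_tagged_line(sample_line))
```

Splitting on the last slash is a design choice: the tag never contains a slash, but a word might.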

In [None]:
#Now we will convert training sentences to tuples in the form (WORD, TAG)
#The tuples will be stored in the list trainset
import nltk
from  nltk.tag import mapping
nltk.download("universal_tagset")

trainset = []
for sent in trainset_txt:
   words = sent.split()
   sent_parts = []
   for word in words:
     #split on the last "/" only, so that words containing "/" keep their tag
     parts = word.rsplit("/", 1)
     if len(parts) == 2 and parts[0] != '': #skip malformed tokens
       word = parts[0].lower()
       tag = parts[1].upper()
       tag = mapping.map_tag('en-brown', 'universal', tag) #map from Brown tagset to Universal tagset
       sent_parts.append((word, tag))
   if len(sent_parts) > 0:
     trainset.append(sent_parts)


#convert test sentences to tuples (WORD, TAG)
#The tuples will be stored in the list testset
testset = []
testset_text = [] # we will also store a text version of the sentences here (will be needed for the HMM model later)
for sent in testset_txt:
   words = sent.split()
   sent_parts = []
   sent_words = []
   for word in words:
     #same parsing as for the training set, so both sets are tokenized consistently
     parts = word.rsplit("/", 1)
     if len(parts) == 2 and parts[0] != '':
       word = parts[0].lower()
       tag = parts[1].upper()
       tag = mapping.map_tag('en-brown', 'universal', tag)
       sent_parts.append((word, tag))
       sent_words.append(word)
   if len(sent_parts) > 0:
     testset.append(sent_parts)
     testset_text.append(sent_words)

#print one sentence from the list to double-check that everything is good
print(testset[0])
print(testset_text[0])

# Step 2: Baseline

We will now implement the Most-Frequent-Tag baseline. 

## Step 2.1
In the following block, we first count all the tags to get the most frequent tag overall. You can use ``nltk.FreqDist()`` or ``collections.Counter`` for that.
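
To show the idea on a small scale, here is a minimal sketch using ``collections.Counter`` on toy data. The sentences and the names ``toy_trainset`` and ``toy_most_frequent_tag`` are made up for illustration; the same pattern should carry over to ``trainset``, which has the same shape.

```python
from collections import Counter

# toy data in the same shape as trainset: a list of sentences,
# each sentence being a list of (word, tag) tuples
toy_trainset = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), ("soundly", "ADV")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

# count every tag occurrence across all sentences
tag_counts = Counter(tag for sent in toy_trainset for _, tag in sent)

# most_common(1) returns a one-element list holding the (tag, count) pair with the highest count
toy_most_frequent_tag, count = tag_counts.most_common(1)[0]
print(toy_most_frequent_tag, count)
```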

### **Question 1:**
Calculate the most frequent tag in the training set. 

In [None]:
#write code to find the most frequent tag in trainset
#Store the tag as most_frequent_tag


## Step 2.2

In the next section, we implement the most-frequent-tag baseline tagger. This tagger needs to count the tags for each word, so we go over the list of tuples, and count the tags for each word. The resulting counts will be stored in the dictionary ``word_tags``
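
The same per-word counts can also be built more compactly. The sketch below is an alternative, not the notebook's own method: it uses ``collections.defaultdict`` with ``collections.Counter`` on made-up toy data, and the names ``toy_trainset`` and ``toy_word_tags`` are illustrative.

```python
from collections import Counter, defaultdict

# toy data: "run" appears once as a noun and twice as a verb
toy_trainset = [
    [("the", "DET"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
]

# each unseen word automatically gets an empty Counter,
# so no explicit "is the key already there?" checks are needed
toy_word_tags = defaultdict(Counter)
for sent in toy_trainset:
    for word, tag in sent:
        toy_word_tags[word][tag] += 1

print(dict(toy_word_tags["run"]))
```

The explicit nested-dictionary version in the next cell produces the same counts; the compact form only removes the membership checks.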

In [None]:
#count the tags for each word. Results will be stored in word_tags

word_tags = {}

for sent in trainset:
  for item in sent:
    tag = item[1]
    if item[0] in word_tags: # if word is already in dictionary
      if tag in word_tags[item[0]]: # if the tag is already added for this word
        word_tags[item[0]][tag] = word_tags[item[0]][tag] + 1
      else:#if the tag has not been added yet, we need to add it now
        word_tags[item[0]][tag] = 1
    else:#if the word has not been added to dictionary yet
      word_tags[item[0]] = {}
      word_tags[item[0]][tag] = 1
  



## Step 2.3

Next, we use the counts we just calculated to tag the test set by finding the most frequent tag for each word. We will also calculate the accuracy of this tagger by comparing the most frequent tag with the true tag. 

In the process, we will create two lists: ``test_true_tags`` to store the true tags in the test set, and ``pred_tags`` to store the tags predicted by our model. We will use these lists in the following steps to calculate the accuracy and to produce the confusion matrix.
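
The tagging rule itself can be sketched on toy data as follows. The counts, the fallback tag, and the helper name ``predict_tag`` are assumptions made for illustration only.

```python
# made-up per-word tag counts and a made-up overall most frequent tag
toy_word_tags = {"run": {"NOUN": 1, "VERB": 2}, "the": {"DET": 3}}
toy_fallback_tag = "NOUN"


def predict_tag(word, word_tags, fallback_tag):
    """Return the most frequent tag seen for this word, or the fallback for unseen words."""
    counts = word_tags.get(word)
    if counts:
        return max(counts, key=counts.get)  # tag with the highest count
    return fallback_tag


toy_sentence = [("the", "DET"), ("zebras", "NOUN"), ("run", "VERB")]
toy_true = [tag for _, tag in toy_sentence]
toy_pred = [predict_tag(word, toy_word_tags, toy_fallback_tag) for word, _ in toy_sentence]
print(toy_true)
print(toy_pred)
```

Here "zebras" does not appear in the counts, so it receives the fallback tag.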

In [None]:
#Based on the above counts, find the most frequent tag for each word in the test set
#For unseen words, use the most frequent tag overall (most_frequent_tag from Question 1).

test_true_tags = []
pred_tags = []
for sent in testset:
  for word, true_tag in sent:
    test_true_tags.append(true_tag)
    if word in word_tags:
      possible_tags = word_tags[word]
      #pick the tag with the highest count for this word
      pred_tag = max(possible_tags, key=possible_tags.get)
    else:#if the word is not in our list, we use the most frequent tag overall
      pred_tag = most_frequent_tag

    pred_tags.append(pred_tag)




### **Question 2:**
* (a) Calculate the accuracy of the tagger above. 
* (b) Produce a confusion matrix 

**Hint**
You may use functions from the ``nltk.metrics`` package, such as ``nltk.metrics.ConfusionMatrix``
https://www.nltk.org/api/nltk.metrics.confusionmatrix.html 
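
To make clear what these two quantities measure, the sketch below computes them by hand with the standard library on two short, made-up tag lists. It is an illustration of the definitions rather than a replacement for the NLTK functions, and the names ``toy_true`` and ``toy_pred`` are assumptions.

```python
from collections import Counter

# made-up reference tags and predicted tags, aligned position by position
toy_true = ["DET", "NOUN", "VERB", "NOUN", "ADJ"]
toy_pred = ["DET", "NOUN", "NOUN", "NOUN", "NOUN"]

# accuracy: fraction of positions where the prediction matches the reference
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
accuracy = correct / len(toy_true)
print(f"accuracy = {accuracy:.2f}")

# confusion counts: how often each (true tag, predicted tag) pair occurs
confusion = Counter(zip(toy_true, toy_pred))
for (true_tag, pred_tag), n in sorted(confusion.items()):
    print(f"true={true_tag:5} pred={pred_tag:5} count={n}")
```

Off-diagonal pairs, such as a true VERB predicted as NOUN, are the errors worth inspecting.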

In [None]:
#Your code here


# Step 3: NLTK Tagger

In this section, we will use the built-in NLTK POS tagger, ``nltk.pos_tag()``, and calculate the accuracy of this model on the test set.

### **Question 3:**
The first block below shows you how to use the NLTK tagger to tag the first sentence from the testset. Given this, write a loop to tag all the sentences in the test set, and add the tags to ``pred_tags``. After that, calculate the accuracy of this tagger and produce the confusion matrix. 

In [None]:
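import nltk
nltk.download('averaged_perceptron_tagger')
#Assumption: recent NLTK releases look for the tagger under this name instead;
#if the name is not available in your setup, the call only reports an error
nltk.download('averaged_perceptron_tagger_eng')

test_sentence = testset_text[0]
print(test_sentence)

model_output = nltk.pos_tag(test_sentence, tagset='universal')
print(model_output)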

In [None]:

pred_tags = []

#pos_tag expects a list of words, so we will use testset_text (created in step 1)
for sent in testset_text:
  #your code here
  pass #placeholder so the cell runs; replace with your code


# Step 4: Hidden Markov Model

In this section we will use ``nltk.HiddenMarkovModelTagger`` to train an HMM tagger using our trainset. We will then use the tagger to produce tags for the testset (using ``testset_text`` since the input should be a list of words)

Note that the tagger might take a while to tag all sentences in the test set. 

### **Question 4:**
Calculate the accuracy of this tagger and produce the confusion matrix. 
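
``tag_sents`` returns one list of (word, tag) pairs per sentence, whereas ``test_true_tags`` is a single flat list, so the predictions need to be flattened before the two can be compared. A minimal sketch of that step follows; ``toy_results`` is a made-up stand-in for the tagger output and ``toy_hmm_pred_tags`` is an illustrative name.

```python
from itertools import chain

# made-up stand-in for the output of tagger.tag_sents(...):
# one list of (word, tag) tuples per sentence
toy_results = [
    [("the", "DET"), ("dog", "NOUN")],
    [("it", "PRON"), ("barks", "VERB")],
]

# chain.from_iterable walks through all sentences in order,
# so the flat list keeps the original token order
toy_hmm_pred_tags = [tag for _, tag in chain.from_iterable(toy_results)]
print(toy_hmm_pred_tags)
```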

In [None]:
import nltk

#training the tagger:
tagger = nltk.HiddenMarkovModelTagger.train(trainset)


In [None]:
#Tag the test set
results = tagger.tag_sents(testset_text)

In [None]:
#Calculate the accuracy and confusion matrix

#your code here