# Twitter POS tagging

Author: Muhammad Atif

Python version: 3.6

<b>Overview</b>: In this notebook, we will adapt a POS tagger to Twitter data, starting from a tagger trained on the Penn Treebank and using prior information on the Twitter tagset to obtain better performance. We will also analyse our results in a more fine-grained way.

### Part 1: Preprocessing

Our first task is to preprocess the data. We will use two datasets for training: 1) the Penn Treebank (PTB) sample and 2) the twitter_samples data. In order to adapt the tagger to the Twitter data, we need to build a *joint* vocabulary containing all the word types in the PTB and twitter_samples corpora.

The vocabulary and the tagset will be stored in Python dictionaries, mapping each word (or tag) to an index (integer).

Let's start with the PTB data. We will iterate over all sentences and words and build the vocabulary and the tagset, making sure to <b>lowercase</b> words before they are added to the dictionary. We will also generate the preprocessed corpus: a list in which each element is a tagged sentence, itself represented as a list of (word index, tag index) tuples corresponding to the original words and tags.


In [1]:
import nltk
import numpy as np
import re
import urllib
from nltk.corpus import treebank, twitter_samples

corpus = treebank.tagged_sents()
vocab = {}
tagset = {}

preProcessedCorpusPTBTrain = []
for sent in corpus:
    num_sent = []
    for word, tag in sent:
        wi = vocab.setdefault(word.lower().strip(), len(vocab))
        ti = tagset.setdefault(tag, len(tagset))
        num_sent.append((wi, ti))
    preProcessedCorpusPTBTrain.append(num_sent)
    
print('\nFirst sentence in preprocessed PTB train corpus: \n', preProcessedCorpusPTBTrain[0])
print('\nIndex for the word electricity: \n', vocab['electricity'])
print('\nLength of the full tagset: \n', len(tagset))


First sentence in preprocessed PTB train corpus: 
 [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (2, 1), (6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (11, 7), (12, 4), (13, 8), (14, 0), (15, 2), (16, 10)]

Index for the word electricity: 
 1095

Length of the full tagset: 
 46


Now we will do the same with the twitter_samples dataset. From now on, we will refer to this dataset as the **training** tweets. Since this data is not tagged, the preprocessed corpus will be a list where each element is another list containing word indices only (instead of (word, tag) tuples). Besides generating the corpus, we will also **update** the vocabulary with the new words from this corpus.

There are two things to keep in mind when doing this process:

1) We will do a bit more preprocessing on this dataset, besides lowercasing. Specifically, we will replace special tokens with special symbols, as follows:
- Username mentions are tokens that start with '@': replace these tokens with 'USER_TOKEN'
- Hashtags are tokens that start with '#': replace these with 'HASHTAG_TOKEN'
- Retweets are represented as the token 'RT' (or 'rt' if you lowercase first): replace these with 'RETWEET_TOKEN'
- URLs are tokens that start with 'https://' or 'http://': replace these with 'URL_TOKEN'

2) **We will not create a new vocabulary**. Instead, we will update the vocabulary built from PTB with any new words present in this corpus. These should *include* the special tokens defined above but *not* the original un-preprocessed tokens.

In [2]:
preProcessedCorpusTwitterTrain = []
for tokens in twitter_samples.tokenized():
    num_sent = []
    for word in tokens:
        #pre-processing on words
        word = word.lower().strip()
        word = re.sub(r'^@.*', 'USER_TOKEN', word)
        word = re.sub(r'^#.*', 'HASHTAG_TOKEN', word)
        word = re.sub(r'^http(s)?://.*', 'URL_TOKEN', word)
        word = re.sub('^rt$', 'RETWEET_TOKEN', word)

        wi = vocab.setdefault(word, len(vocab))
        num_sent.append(wi)   
    preProcessedCorpusTwitterTrain.append(num_sent)

print('\nFirst sentence in preprocessed twitter_samples train corpus: \n', preProcessedCorpusTwitterTrain[0])
print('\nIndex for the word electricity: \n', vocab['electricity'])
print('\nIndex for HASHTAG_TOKEN: \n', vocab['HASHTAG_TOKEN'])


First sentence in preprocessed twitter_samples train corpus: 
 [11387, 182, 11388, 11389]

Index for the word electricity: 
 1095

Index for HASHTAG_TOKEN: 
 11409
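
The same lowercasing and replacement rules are applied again later to the test tweets, so it may help to keep them in one place. The following is a minimal sketch under that assumption; `normalise_token` and `_SPECIAL_PATTERNS` are hypothetical names that do not appear in the original cells.

```python
import re

# Hypothetical helper (not part of the original notebook): one place for the
# lowercasing and special-token rules listed above.
_SPECIAL_PATTERNS = [
    (re.compile(r'^@.*'), 'USER_TOKEN'),
    (re.compile(r'^#.*'), 'HASHTAG_TOKEN'),
    (re.compile(r'^https?://.*'), 'URL_TOKEN'),
    (re.compile(r'^rt$'), 'RETWEET_TOKEN'),
]

def normalise_token(word):
    """Lowercase a token and map special Twitter tokens to placeholders."""
    word = word.lower().strip()
    for pattern, replacement in _SPECIAL_PATTERNS:
        if pattern.match(word):
            return replacement
    return word

print([normalise_token(w) for w in ['@Bob', '#NLP', 'RT', 'https://t.co/x', 'Hello']])
```

Because the placeholders are upper-case and every ordinary word is lowercased first, a placeholder cannot collide with a real word type in the vocabulary.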


Now we will preprocess the tagged Twitter corpus used in W7 (Ritter et al.). From now on, this dataset will be referred to as the **test** tweets. Before we do that, we will update the tagset.

This dataset has a few extra tags, besides the PTB ones. These were added to cover specific phenomena that happen on Twitter:
- "USR": username mentions
- "HT": hashtags
- "RT": retweets
- "URL": URL addresses

Notice that these special tags correspond to the special tokens we preprocessed before. These steps will be important in Part 3 later.

There are a few additional tags which are not specific to Twitter but are not present in the PTB sample:
- "VPP"
- "TD"
- "O"

We will add these seven new tags to the tagset we built when reading the PTB corpus.

Another task is to add an extra type to the vocabulary: `<unk>`. This is in order to account for unknown or out-of-vocabulary words.

Finally, we will build two "inverted indices" for the vocabulary and the tagset. These should be lists, where the "i"-th element should contain the word (or tag) corresponding to the index "i" in the vocabulary (or tagset).

In [3]:
tagset.setdefault('USR', len(tagset))
tagset.setdefault('HT', len(tagset))
tagset.setdefault('RT', len(tagset))
tagset.setdefault('URL', len(tagset))
tagset.setdefault('VPP', len(tagset))
tagset.setdefault('TD', len(tagset))
tagset.setdefault('O', len(tagset))

vocab.setdefault('<unk>', len(vocab))

invVocab = [None] * len(vocab)
for word, index in vocab.items():
    invVocab[index] = word
invTagset = [None] * len(tagset)
for tag, index in tagset.items():
    invTagset[index] = tag
    
print('\nIndex for ''<unk>'': \n', vocab['<unk>'])
print('\nLength of resulting tagset: \n', len(invTagset))


Index for <unk>: 
 26069

Length of resulting tagset: 
 53
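
The inverted indices rely on `setdefault(key, len(d))` handing out contiguous indices 0, 1, 2, ... in order of first appearance, so every slot of the list gets filled. A small sketch on a toy vocabulary (the names `toy_vocab` and `toy_inv` are illustrative only) shows the round trip:

```python
# Toy vocabulary standing in for the notebook's `vocab` dictionary.
toy_vocab = {}
for w in ['the', 'cat', 'sat', 'the']:
    # a repeated word keeps the index it was given the first time
    toy_vocab.setdefault(w, len(toy_vocab))

# Build the inverted index: position i holds the word with index i.
toy_inv = [None] * len(toy_vocab)
for word, index in toy_vocab.items():
    toy_inv[index] = word

# Every index maps back to the word it came from.
round_trip_ok = all(toy_vocab[toy_inv[i]] == i for i in range(len(toy_inv)))
print(toy_inv, round_trip_ok)
```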


Now we can read the test tweets, storing them in the same format as the PTB corpus (a list of lists containing (word, tag) index tuples). We will apply the same preprocessing steps as for the training tweets (lowercasing and replacing special tokens). However, **we will not** update the vocabulary. Why? Because the test set should simulate a real-world scenario, where out-of-vocabulary words can appear. Instead, after preprocessing each word, we will check whether that word is in the vocabulary: if it is, we replace it with its index; otherwise, we replace it with the index of the `<unk>` token.

When reading the POS tags for the test tweets we will do some additional preprocessing. There are three tags in this dataset which correspond to PTB tags but are represented with different names:
- "(". In PTB, this is represented as "-LRB-"
- ")". In PTB, this is represented as "-RRB-"
- "NONE". In PTB, this is represented as "-NONE-"

As we build the corpus for the test tweets, we will check if the tag for a word is one of the above. If it is, we will use the PTB equivalent instead. In practice, it is sufficient to ensure that we use the correct index for the corresponding tag, using our tagset dictionary. This is sometimes referred to as *tag harmonisation*, where two different tagsets are mapped to each other.

In [10]:
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:  # Python 2
    from urllib import urlretrieve
# download the annotated tweets (Ritter et al.)
urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt", "pos.txt")
    
preProcessedCorpusTwitterTest = []
with open('pos.txt') as f:
    wordsTags = []
    for line in f:
        if line.strip() == '':
            preProcessedCorpusTwitterTest.append(wordsTags)
            wordsTags = []
        else:
            word, tag = line.strip().split()          
            #pre-processing on words
            word = word.lower().strip()
            word = re.sub(r'^@.*', 'USER_TOKEN', word)
            word = re.sub(r'^#.*', 'HASHTAG_TOKEN', word)
            word = re.sub(r'^http(s)?://.*', 'URL_TOKEN', word)
            word = re.sub('^rt$', 'RETWEET_TOKEN', word)
            #pre-processing on tags
            tag = tag.replace("(", "-LRB-")
            tag = tag.replace(")", "-RRB-")
            tag = tag.replace("NONE", "-NONE-")
         
            wi = vocab.get(word, vocab.get('<unk>'))
            ti = tagset.get(tag)
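            wordsTags.append((wi, ti))
    # keep the final tweet if the file does not end with a blank line
    if wordsTags:
        preProcessedCorpusTwitterTest.append(wordsTags)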
           
print('\nFirst sentence in preprocessed twitter test corpus: \n', preProcessedCorpusTwitterTest[0])   


First sentence in preprocessed twitter test corpus: 
 [(11392, 46), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 0), (1146, 0), (2495, 0), (14039, 43), (26069, 0), (16, 10), (4263, 17), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]
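
The tag harmonisation above uses substring replacement, which works for this file but would also rewrite a tag that merely contains one of the patterns. An exact-match lookup is one alternative; the sketch below assumes the three renamed tags listed earlier and uses hypothetical names (`TAG_MAP`, `encode_token`) plus toy dictionaries. It also raises an error for a tag missing from the tagset instead of silently storing `None`.

```python
# Exact-match lookup for the three renamed tags; all other tags pass through.
TAG_MAP = {'(': '-LRB-', ')': '-RRB-', 'NONE': '-NONE-'}

def encode_token(word, tag, vocab, tagset):
    """Map a preprocessed (word, tag) pair to indices, using <unk> for OOV words."""
    tag = TAG_MAP.get(tag, tag)
    if tag not in tagset:
        raise KeyError('tag not in tagset: %r' % tag)
    return vocab.get(word, vocab['<unk>']), tagset[tag]

# Toy dictionaries standing in for the notebook's `vocab` and `tagset`.
toy_vocab = {'hello': 0, '<unk>': 1}
toy_tagset = {'UH': 0, '-LRB-': 1}
print(encode_token('hello', 'UH', toy_vocab, toy_tagset))  # known word
print(encode_token('zzz', '(', toy_vocab, toy_tagset))     # OOV word, renamed tag
```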


### Part 2: Running the PTB tagger on the test tweets

Our next task is to train a POS tagger on the PTB data and try it on the test tweets. 

First, we encapsulate the HMM training code into a function, which we will name `count`. This function takes these input parameters:
- A tagged corpus, in the format described above (list of lists containing (word, tag) index tuples).
- The vocabulary (a dict).
- The tagset (a dict).

Output return values will contain:
- The initial tag probabilities (a vector).
- The transition probabilities (a matrix).
- The emission probabilities (a matrix).

Notice that we pass vocabulary and tagset explicitly as parameters. This is to ensure our tagger can take into account the words in the training tweets and the extra tags. We will initialise the probabilities with an `eps` value, to ensure we end up with non-zero probabilities for unseen events.

In [5]:
def count(corpus, vocab, tagset):
    S = len(tagset)
    V = len(vocab)
    
    # initialise
    eps = 0.1
    pi = eps * np.ones(S)
    A = eps * np.ones((S, S))
    O = eps * np.ones((S, V))
    
    # count
    for sent in corpus:
        last_tag = None
        for word, tag in sent:
            O[tag, word] += 1
            if last_tag is None:
                pi[tag] += 1
            else:
                A[last_tag, tag] += 1
            last_tag = tag
            
    # normalise
    pi /= np.sum(pi)
    for s in range(S):
        O[s,:] /= np.sum(O[s,:])
        A[s,:] /= np.sum(A[s,:])
    
    return pi, A, O
    
[initialMatrix, transitionMatrix, emissionMatrix] = count(preProcessedCorpusPTBTrain, vocab, tagset)
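
Initialising with `eps` and then normalising amounts to add-`eps` smoothing: each emission estimate is (count(tag, word) + eps) / (count(tag) + eps * V), where V is the vocabulary size. One consequence is that a tag with no training counts gets a uniform row of 1/V. The sketch below illustrates this on made-up counts; `smoothed_rows` is a hypothetical helper, not part of `count`.

```python
import numpy as np

def smoothed_rows(counts, eps=0.1):
    """Add eps to every cell, then normalise each row to sum to one."""
    smoothed = counts + eps
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy emission counts: 2 tags x 4 word types; tag 1 was never observed.
toy_counts = np.array([[3.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 0.0]])
toy_O = smoothed_rows(toy_counts)
print(toy_O[0])  # seen tag: mass concentrated on observed words
print(toy_O[1])  # unseen tag: uniform over the vocabulary
```

In this notebook the seven tags added in Part 1 never occur in PTB, so their emission rows are uniform over the 26070 vocabulary entries, i.e. about 3.8358e-05 each, which matches the last rows of the emission matrix printed in Part 3.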

Now we will write a function for Viterbi. The input parameters are:
- The parameters (probabilities) of the HMM (a tuple (initial, transition, emission)).
- The input words (a list of word indices).

The output is a list of (word, tag) indices, containing the original input word and the predicted tag.

We will run Viterbi on the test tweets and store the predictions in a list (might take a few seconds).

In [6]:
def viterbi(params, observations):
    pi, A, O = params
    M = len(observations)
    S = pi.shape[0]
    
    alpha = np.zeros((M, S))
    alpha[:,:] = float('-inf')
    backpointers = np.zeros((M, S), 'int')
    
    # base case
    alpha[0, :] = pi * O[:,observations[0]]
    
    # recursive case
    for t in range(1, M):
        for s2 in range(S):
            for s1 in range(S):
                score = alpha[t-1, s1] * A[s1, s2] * O[s2, observations[t]]
                if score > alpha[t, s2]:
                    alpha[t, s2] = score
                    backpointers[t, s2] = s1
    
    # now follow backpointers to resolve the state sequence
    ss = []
    ss.append(np.argmax(alpha[M-1,:]))
    for i in range(M-1, 0, -1):
        ss.append(backpointers[i, ss[-1]])
        
    return list(zip(observations, list(reversed(ss))))

predictions = []
for sent in preProcessedCorpusTwitterTest:
    encoded_sent = [wordTags[0] for wordTags in sent]
    pred = viterbi((initialMatrix, transitionMatrix, emissionMatrix), encoded_sent)
    predictions.append(pred)
    
print('\nFirst sentence of predicted list: \n', predictions[0])


First sentence of predicted list: 
 [(11392, 27), (61, 19), (114, 11), (8, 7), (3224, 8), (170, 9), (325, 33), (1325, 19), (2375, 22), (3205, 12), (182, 9), (799, 2), (1522, 3), (16, 10), (8490, 29), (1146, 8), (2495, 8), (14039, 10), (26069, 38), (16, 10), (4263, 29), (1760, 4), (9464, 8), (2259, 17), (888, 4), (741, 8), (16, 10)]
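
The function above multiplies probabilities directly, which is fine for tweet-length inputs but can underflow on long sequences, and its triple loop is slow in pure Python. A common alternative is to work with log probabilities and vectorise the inner loops. The following is a sketch of that variant, not the notebook's implementation; `viterbi_log` is a hypothetical name and the toy HMM numbers are made up.

```python
import numpy as np

def viterbi_log(params, observations):
    """Viterbi decoding in log space; returns (word, tag) index pairs."""
    pi, A, O = params
    with np.errstate(divide='ignore'):  # log(0) = -inf is intended here
        log_pi, log_A, log_O = np.log(pi), np.log(A), np.log(O)
    M, S = len(observations), len(pi)
    alpha = np.full((M, S), -np.inf)
    backpointers = np.zeros((M, S), dtype=int)

    # base case
    alpha[0] = log_pi + log_O[:, observations[0]]
    # recursive case
    for t in range(1, M):
        # scores[s1, s2]: best path ending in s1 at t-1, then moving to s2
        scores = alpha[t - 1][:, None] + log_A
        backpointers[t] = np.argmax(scores, axis=0)
        alpha[t] = np.max(scores, axis=0) + log_O[:, observations[t]]

    # follow backpointers from the best final state
    tags = [int(np.argmax(alpha[M - 1]))]
    for t in range(M - 1, 0, -1):
        tags.append(int(backpointers[t, tags[-1]]))
    return list(zip(observations, tags[::-1]))

# Toy 2-state HMM with made-up numbers to exercise the function.
toy_pi = np.array([0.6, 0.4])
toy_A = np.array([[0.7, 0.3], [0.4, 0.6]])
toy_O = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi_log((toy_pi, toy_A, toy_O), [0, 1, 2]))
```

Zero probabilities become `-inf` scores, so this variant also copes with emission rows that contain exact zeros.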


We will now evaluate the results. We will write a function that takes (word, tag) lists as input and outputs the tag sequence using the original tags in the tagset. Its inputs are a list of sentences and the inverted tag index we built before.

We will run this function on the predictions we obtained above **and** on the test tweets, storing the results in two separate lists. Each list is flattened over all sentences, so that accuracy is reported per token.

In [7]:
def getOrigTags(sentences, tagInvIndex):
    tagSequence = []
    for wordTagsList in sentences:
        for wordTags in wordTagsList:
            tagSequence.append(tagInvIndex[wordTags[1]])
    return tagSequence

tagSequenceTest = getOrigTags(preProcessedCorpusTwitterTest, invTagset)
tagSequencePred = getOrigTags(predictions, invTagset)

from sklearn.metrics import accuracy_score as acc
print('\nAccuracy:\n', round(acc(tagSequenceTest, tagSequencePred) * 100, 1), '%')


Accuracy:
 63.7 %
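
For reference, token-level accuracy is simply the fraction of positions where the predicted tag equals the gold tag. A dependency-free sketch on toy data follows; `token_accuracy` is a hypothetical helper, and it compares tag indices directly rather than tag names.

```python
def token_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag index equals the gold tag index."""
    gold = [tag for sent in gold_sents for _, tag in sent]
    pred = [tag for sent in pred_sents for _, tag in sent]
    if len(gold) != len(pred):
        raise ValueError('gold and predicted sequences differ in length')
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy corpora in the notebook's format: lists of (word index, tag index) pairs.
toy_gold = [[(5, 0), (6, 1)], [(7, 2)]]
toy_pred = [[(5, 0), (6, 2)], [(7, 2)]]
print(token_accuracy(toy_gold, toy_pred))  # two of three tokens are correct
```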


### Part 3: Adapting the tagger using prior information

Now our task is to adapt the tagger using prior information. What do we mean by that? Remember from part 1 that the twitter tagset has some extra tags, related to special tokens such as mentions and hashtags. In other words, **we know beforehand** that these special tokens **should** have these tags. However, because these tags never appear in the PTB data, the tagger has no such information. We are going to add this in order to improve the tagger.

To recap, we know these things about the twitter data:
- username mentions should have the tag 'USR'
- hashtags should have the tag 'HT'
- retweet tokens should have the tag 'RT'
- URL tokens should have the tag 'URL'

Remember how we replaced these tokens with unique special ones (such as 'USER_TOKEN')? Our task now is to adapt the emission probabilities for these tokens. We will modify the emission matrix: assign probability 1.0 to the emission P('USER_TOKEN'|'USR') and 0.0 to P(word|'USR') for all other words, doing the same for the other three special tags.

To do that, we will use the vocabulary and tagset dictionaries to obtain the indices of the corresponding words and tags. Then, we will use these indices to locate the values in the emission matrix and modify them.

In [11]:
emissionMatrix[tagset['USR']] = 0.0
emissionMatrix[tagset['USR']][vocab['USER_TOKEN']] = 1.0

emissionMatrix[tagset['HT']] = 0.0
emissionMatrix[tagset['HT']][vocab['HASHTAG_TOKEN']] = 1.0

emissionMatrix[tagset['URL']] = 0.0
emissionMatrix[tagset['URL']][vocab['URL_TOKEN']] = 1.0

emissionMatrix[tagset['RT']] = 0.0
emissionMatrix[tagset['RT']][vocab['RETWEET_TOKEN']] = 1.0

print('\nEmission Matrix:\n', emissionMatrix)


Emission Matrix:
 [[9.15369893e-05 1.74752434e-04 8.32154448e-06 ... 8.32154448e-06
  8.32154448e-06 8.32154448e-06]
 [1.33457894e-05 1.33457894e-05 6.51955158e-01 ... 1.33457894e-05
  1.33457894e-05 1.33457894e-05]
 [1.62522347e-05 1.62522347e-05 1.62522347e-05 ... 1.62522347e-05
  1.62522347e-05 1.62522347e-05]
 ...
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]
 [3.83582662e-05 3.83582662e-05 3.83582662e-05 ... 3.83582662e-05
  3.83582662e-05 3.83582662e-05]]
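
The cell above edits `emissionMatrix` in place, so the unadapted PTB tagger can only be recovered by running `count` again. If both versions are needed side by side, the same update can be written as a function that returns a modified copy. This is a sketch with a hypothetical name (`pin_emissions`) and toy inputs:

```python
import numpy as np

def pin_emissions(O, tagset, vocab, pairs):
    """Return a copy of O in which each listed tag emits only its paired token."""
    O = O.copy()
    for tag, token in pairs:
        O[tagset[tag], :] = 0.0
        O[tagset[tag], vocab[token]] = 1.0
    return O

# Toy dictionaries and a uniform emission matrix for illustration.
toy_tagset = {'NN': 0, 'USR': 1}
toy_vocab = {'dog': 0, 'USER_TOKEN': 1, '<unk>': 2}
toy_O = np.full((2, 3), 1.0 / 3.0)
pinned = pin_emissions(toy_O, toy_tagset, toy_vocab, [('USR', 'USER_TOKEN')])
print(pinned[1])  # all mass on USER_TOKEN for the USR tag
```

Each pinned row still sums to one, so it remains a valid emission distribution.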


Now we will evaluate our new tagger on the test tweets again. We will report accuracy but also do a fine-grained error analysis. We will print the F-scores for **each tag**, reporting the tags that performed the best and the worst.
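
The best and worst tags can also be selected programmatically rather than read off a printed report. The sketch below computes per-tag F1 as 2TP / (2TP + FP + FN) with the standard library only; `per_tag_f1` is a hypothetical helper, the toy sequences are made up, and the 0.9 and 0.1 thresholds mirror the ones used in this notebook.

```python
from collections import Counter

def per_tag_f1(gold, pred):
    """Return {tag: F1} computed from two equal-length tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted where it should not have been
            fn[g] += 1  # g was missed
    scores = {}
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores[tag] = 2 * tp[tag] / denom if denom else 0.0
    return scores

toy_gold = ['NN', 'NN', 'VB', 'UH']
toy_pred = ['NN', 'VB', 'VB', 'NN']
scores = per_tag_f1(toy_gold, toy_pred)
best = sorted(t for t, f in scores.items() if f >= 0.9)
worst = sorted(t for t, f in scores.items() if f <= 0.1)
print(scores, best, worst)
```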

In [12]:
predictions = []
for sent in preProcessedCorpusTwitterTest:
    encoded_sent = [wordTags[0] for wordTags in sent]
    pred = viterbi((initialMatrix, transitionMatrix, emissionMatrix), encoded_sent)
    predictions.append(pred)
    
tagSequenceTest = getOrigTags(preProcessedCorpusTwitterTest, invTagset)
tagSequencePred = getOrigTags(predictions, invTagset)

print('\nAccuracy:\n', round(acc(tagSequenceTest, tagSequencePred) * 100, 1), '%')    

from sklearn.metrics import classification_report
print('\nClassification report:\n', classification_report(tagSequenceTest, tagSequencePred))
print("\nBest tags having F1-Score >= 0.9:\n\"TO\", \"WRB\", \",\", \"CC\", \"URL\", \"USR\", \"HT\", \"RT\"")
print("\nWorst tags having F1-Score <= 0.1:\n\"#\", \"$\", \"-LRB-\", \"-NONE-\", \"FW\", \"LS\", \"NNPS\", \"O\", \"PDT\", \"SYM\", \"TD\", \"UH\", \"VPP\", \"WP$\", \"``\", \"''\", \"-RRB-\"")


Accuracy:
 69.5 %

Classification report:
              precision    recall  f1-score   support

          #       0.00      0.00      0.00         0
          $       0.00      0.00      0.00         0
         ''       0.03      0.20      0.06        91
          ,       0.85      1.00      0.92       303
      -LRB-       0.00      0.00      0.00        32
     -NONE-       0.00      0.00      0.00         2
      -RRB-       0.04      0.15      0.07        34
          .       0.72      0.83      0.77       875
          :       0.97      0.76      0.85       562
         CC       0.96      0.88      0.92       305
         CD       0.59      0.59      0.59       268
         DT       0.74      0.93      0.82       825
         EX       0.38      0.80      0.52        10
         FW       0.00      0.00      0.00         3
         HT       0.98      0.98      0.98       135
         IN       0.81      0.88      0.85      1091
         JJ       0.64      0.59      0.61       670
  



Finally, based on the information we got above, we will do some analysis. Why did the tagger perform worst on the tags we mentioned above? How can we improve the tagger? We will inspect some instances manually to write our analysis.
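
One way to choose which instances to inspect is to count the most frequent (gold, predicted) tag confusions and look at tokens from the largest groups first. This is a minimal sketch on made-up tag sequences; `top_confusions` is a hypothetical helper.

```python
from collections import Counter

def top_confusions(gold, pred, n=5):
    """Most frequent (gold, predicted) tag pairs among the errors."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(n)

toy_gold = ['UH', 'UH', 'NN', 'UH', 'VB']
toy_pred = ['NN', 'NN', 'NN', ':', 'VB']
print(top_confusions(toy_gold, toy_pred))
```

In the notebook, the two arguments would be the flattened `tagSequenceTest` and `tagSequencePred` lists.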

<b>Training is done on a subset of the Penn Treebank (PTB) corpus that is freely available with NLTK. As a result, many words in the tweets are not covered by the tagged data: the PTB vocabulary consists of 11387 word types, which grows to 26069 once we add the new words from the Twitter training dataset, and those new words carry no POS tags. Also, some tags either do not appear in the PTB dataset at all (e.g. "TD", "VPP", "O") or appear with very low frequency (e.g. "UH", which appears only 3 times in PTB). Further, most of the worst-performing tags have very low support, meaning that they occur very infrequently in the test dataset. One exception, however, is the "UH" (interjection) tag, which appears 493 times in the test dataset. It performs poorly because tweets contain emoticons and internet slang such as 'lol', 'hahaha', ':)' and 'omg' that are tagged "UH" but are not present in the tagged PTB corpus (in PTB only three words, 'OK', 'no' and 'Oh', are tagged "UH").

Some ways to improve the tagger are (1) to add emoticons and common internet slang, such as the examples above, to our dictionaries with the "UH" tag and explicitly set their emission probabilities, just as we did for special tokens like "URL_TOKEN"; (2) to use a more comprehensive tagged corpus; and (3) to replace the bigram model with a trigram model, i.e. to compute the probability of a tag given the two preceding tags.</b>