Text Mining Assignment 2
Tasks:
1. Download W-NUT_data.zip from the Brightspace assignment and unzip the directory. It
contains 3 IOB files: wnut17train.conll (train), emerging.dev.conll (dev),
emerging.test.annotated (test)
2. The IOB files do not contain POS tags yet. Add a function to your CRFsuite script that reads
the IOB files and adds POS tags (using an existing package for linguistic processing such as
Spacy or NLTK). The data needs to be stored in the same way as the benchmark data from
the tutorial (an array of triples (word,pos,biotag)).
3. Run a baseline run (train -> test) with the features directly copied from the tutorial.
4. Set up hyperparameter optimization using the dev set and evaluate the result on the test set.
5. Extend the features: add a larger context (-2 .. +2 or more) and engineer a few other features
that might be relevant for this task. Have a look at the train/dev data to get inspiration on
potentially relevant papers.
6. Experiment with the effect of different feature sets on the quality of the labelling.

In [73]:
#Imports
import nltk
nltk.download('averaged_perceptron_tagger')
import sklearn
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\cheye\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Function to read in the data per sentence. 

In [36]:
def parse_data(file):    
    sents = []
    with open(file, encoding='utf-8') as fp:
        new_sent = []
        for line in fp:
            if (line == '\n') or (line == '\t\n'):
                #new line so end of sentence, append new_sent to sents array and clear the new_sent
                sents.append(new_sent)
                new_sent = []
            else:
                #create tuple and add to sentence
                new_line = line.strip()
                new_sent.append(tuple(new_line.split('\t')))
    return sents

In [78]:
#parse all files
train_sents = parse_data('wnut17train.conll')
val_sents = parse_data('emerging.dev.conll')
test_sents =parse_data('emerging.test.annotated')

In [79]:
def add_POS_tag(word_tuple):
    #convert tuple to list
    l = list(word_tuple)
    
    #insert new value at index 1
    new_val = nltk.pos_tag(word_tuple)
    l.insert(1, new_val[0][1])
    
    #convert list again to tuple
    new_word_tuple = tuple(l)
    return new_word_tuple  

In [80]:
#add pos tag to each dataset, can take a few minutes
train_sents = [[add_POS_tag(word) for word in sentence] for sentence in train_sents]
val_sents = [[add_POS_tag(word) for word in sentence] for sentence in val_sents]
test_sents = [[add_POS_tag(word) for word in sentence] for sentence in test_sents]

3. Run a baseline run (train -> test) with the features directly copied from the tutorial.

In [83]:

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

labels = list(crf.classes_)
labels.remove('O')

y_pred = crf.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred,
                      average='weighted', labels=labels))

sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

0.15146209008241585
                 precision    recall  f1-score   support

  B-corporation      0.000     0.000     0.000        66
  I-corporation      0.000     0.000     0.000        22
B-creative-work      0.278     0.035     0.062       142
I-creative-work      0.333     0.041     0.073       218
        B-group      0.304     0.042     0.074       165
        I-group      0.316     0.086     0.135        70
     B-location      0.396     0.240     0.299       150
     I-location      0.360     0.096     0.151        94
       B-person      0.555     0.154     0.241       429
       I-person      0.517     0.229     0.317       131
      B-product      0.667     0.016     0.031       127
      I-product      0.385     0.040     0.072       126

      micro avg      0.428     0.101     0.163      1740
      macro avg      0.342     0.082     0.121      1740
   weighted avg      0.412     0.101     0.151      1740



Set up hyperparameter optimization using the dev set and evaluate the result on the test set.