# Training and Evaluating a CRF model on CONLL2003
In this lab we will build our own model for NER. We will use a Python implementation of a CRF, for which we have to build our own set of learning features. As in the previous lab, we will use the CONLL 2003 dataset to train and evaluate our model.

## Set-up
In this section we will set up the notebook by mounting the drive, doing all the required imports, and downloading the POS tagger model for NLTK.

We are going to install and use `sklearn-crfsuite`, a [Python wrapper](https://github.com/TeamHG-Memex/sklearn-crfsuite) for the original [crfsuite](https://www.chokkan.org/software/crfsuite/).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
## need to downgrade scikit-learn to make it compatible with sklearn-crfsuite
!pip install scikit-learn==0.24

In [None]:
!pip install sklearn-crfsuite

In [None]:
import string
import numpy as np

from sklearn_crfsuite import CRF, metrics
from sklearn.metrics import make_scorer,confusion_matrix
from sklearn.metrics import f1_score,classification_report
from sklearn.pipeline import Pipeline

from pprint import pprint

from nltk import download
from nltk.tag import pos_tag
download('averaged_perceptron_tagger') # download the model used by pos_tag

## Loading the data
In this section we provide a function (`load_data_conll`) to load data in CONLL format, which has one token per line along with several levels of annotation. Each line contains 4 whitespace-separated columns (word, PoS, chunk and NE tags), and a blank line marks the end of a sentence. The function outputs a list where each item is composed of 2 lists: 1) the sentence as a list of tokens, and 2) the NER tags, one per token. For example:

`[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']]`
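
To make the mapping from file format to output concrete, here is a minimal sketch that parses a small in-memory sample instead of a file. The sample lines are illustrative, and `parse_conll_text` is a hypothetical helper, not the loader used in the rest of the notebook.

```python
# Illustrative four-column CoNLL-style sample: token, PoS tag, chunk tag, NE tag.
sample = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll_text(text):
    """Turn CoNLL-style text into a list of [tokens, ner_tags] pairs."""
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4:
            words.append(parts[0])   # first column: the token
            tags.append(parts[-1])   # last column: the NE tag
        elif words:
            # A blank (or malformed) line closes the current sentence.
            sentences.append([words, tags])
            words, tags = [], []
    if words:
        # Keep the last sentence when the text does not end with a blank line.
        sentences.append([words, tags])
    return sentences

parsed = parse_conll_text(sample)
print(parsed[0])
print(parsed[1])
```

Each blank line closes one sentence, so the sample yields two `[tokens, tags]` pairs.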

In [None]:
#functions for preparing the data in the *.txt files
def load_data_conll(file_path):
    """
    Load the training/testing data.
    input: conll format data, with 4 whitespace separated columns - words, PoS, Chunk and NE tags.
    output: A list where each item is 2 lists.  sentence as a list of tokens, NER tags as a list for each token.
    """
    myoutput,words,tags = [],[],[]
    with open(file_path, encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("-DOCSTART"):
                #skip -DOCSTART- and the next line
                fh.readline()
            elif line == "":
                #Sentence ended (repeated blank lines are ignored).
                if words:
                    myoutput.append([words,tags])
                    words,tags = [],[]
            else:
                parts = line.split()
                #word, pos_tag, chunk_tag, ner_tag = line.split()
                if len(parts) == 4:
                    words.append(parts[0])
                    tags.append(parts[-1])
    #Keep the last sentence when the file does not end with a blank line.
    if words:
        myoutput.append([words,tags])
    return myoutput

## Extracting features

The code below presents a function (`sent2feats`) that extracts the features of a given sentence. That is, given a sentence (a list of tokens), it extracts the set of features related to each token (a list of dictionaries, where each dictionary holds the features of one word).

With the default features, the function extracts only two features per token:

 - `wordfeats['word'] = word`: the word form of the current token
 - `wordfeats['tag'] = tag`: the PoS tag of the current token


You can first evaluate the current feature function to see how informative it is to model each token with just its word form and part-of-speech tag.

Then, as an exercise, you can define and extract a larger set of features. For example, you can extract the following feature sets:

 - __Word features__: word, prev 2 words, next 2 words in the sentence.
 - __POS tag features__: current tag, previous and next 2 tags.
 - __Word shape features__: Whether the current word is all caps, its word shape, and the shape of the next 2 words.
 - __Gazetteers__: The presence of the current word in a gazetteer. Geonames can be a good resource for this task: https://www.geonames.org/ Using the whole dataset might be too cumbersome as the file is quite large, so for this lab it might be enough to use the list of cities with a population bigger than 15000.
   - Check the readme of geonames files: http://download.geonames.org/export/dump/readme.txt
   - All files can be downloaded from here: http://download.geonames.org/export/dump/

You need to implement at least the first two feature sets to obtain competitive results, **but first continue with the current default features until the end of the notebook**.
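
As a rough illustration of the gazetteer idea, the sketch below builds a set of place names from GeoNames-style tab-separated rows and turns membership into token features. It assumes the GeoNames dump layout, in which the second and third columns hold the name and its ASCII form. The rows are made up, the file name `cities15000.txt` is an assumption about the downloaded dump, and `load_gazetteer` and `gazetteer_feats` are hypothetical helpers.

```python
import io

def load_gazetteer(lines):
    """Collect lower-cased place names from GeoNames-style tab-separated lines."""
    names = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 2:
            names.add(cols[1].lower())  # column 1: name
            names.add(cols[2].lower())  # column 2: asciiname
    names.discard("")
    return names

def gazetteer_feats(sentence, i, gazetteer):
    """Mark whether token i, alone or joined with the next token, is a known place."""
    word = sentence[i].lower()
    bigram = word + " " + sentence[i + 1].lower() if i + 1 < len(sentence) else None
    return {
        "inGazetteer": word in gazetteer,
        "bigramInGazetteer": bigram in gazetteer,
    }

# Made-up rows in the assumed layout; a real run would instead pass
# open("cities15000.txt", encoding="utf-8") (assumed file name).
fake_dump = io.StringIO(
    "1\tBilbao\tBilbao\t\t43.26\t-2.93\n"
    "2\tSan Sebastián\tSan Sebastian\t\t43.31\t-1.97\n"
)
gazetteer = load_gazetteer(fake_dump)
print(gazetteer_feats(["She", "lives", "in", "Bilbao", "."], 3, gazetteer))
```

The returned dictionary could be merged into the per-token feature dictionary. The bigram lookup is one simple way to catch two-word city names.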


In [None]:
def sent2feats(sentence):
    """
    Get features for all words in the sentence
    Features:
    - word context: a window of 2 words on either side of the current word, and current word.
    - POS context: a window of 2 POS tags on either side of the current word, and current tag.
    input: sentence as a list of tokens.
    output: list of dictionaries. each dict represents features for that word.
    """
    feats = []
    sen_tags = pos_tag(sentence) #This format is specific to this POS tagger!
    for i in range(0,len(sentence)):
        word = sentence[i]
        wordfeats = {}
        #word features: word, prev 2 words, next 2 words in the sentence.
        wordfeats['word'] = word
        if i == 0:
            wordfeats["prevWord"] = wordfeats["prevSecondWord"] = "<S>"
        elif i==1:
            wordfeats["prevWord"] = sentence[0]
            wordfeats["prevSecondWord"] = "<S>"
        else:
            wordfeats["prevWord"] = sentence[i-1]
            wordfeats["prevSecondWord"] = sentence[i-2]
        #next two words as features
        if i == len(sentence)-2:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = "</S>"
        elif i==len(sentence)-1:
            wordfeats["nextWord"] = "</S>"
            wordfeats["nextNextWord"] = "</S>"
        else:
            wordfeats["nextWord"] = sentence[i+1]
            wordfeats["nextNextWord"] = sentence[i+2]
        
        #POS tag features: current tag, previous and next 2 tags.
        wordfeats['tag'] = sen_tags[i][1]
        if i == 0:
            wordfeats["prevTag"] = wordfeats["prevSecondTag"] = "<S>"
        elif i == 1:
            wordfeats["prevTag"] = sen_tags[0][1]
            wordfeats["prevSecondTag"] = "<S>"
        else:
            wordfeats["prevTag"] = sen_tags[i - 1][1]
            wordfeats["prevSecondTag"] = sen_tags[i - 2][1]
        #next two tags as features
        if i == len(sentence) - 2:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = "</S>"
        elif i == len(sentence) - 1:
            wordfeats["nextTag"] = "</S>"
            wordfeats["nextNextTag"] = "</S>"
        else:
            wordfeats["nextTag"] = sen_tags[i + 1][1]
            wordfeats["nextNextTag"] = sen_tags[i + 2][1]
        
        # WordShape: current wordShape
        wordfeats['wordShape'] = wordshape(word)
        #That is it! You can add whatever you want!
        feats.append(wordfeats)
    return feats

In [None]:
import re

def wordshape(text):
    # Map uppercase letters to 'X', lowercase letters to 'x' and digits to 'd'.
    t1 = re.sub('[A-Z]', 'X', text)
    t2 = re.sub('[a-z]', 'x', t1)
    return re.sub('[0-9]', 'd', t2)
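
The word-shape idea can be combined with a few other orthographic cues, such as the all-caps test suggested in the feature list. The following self-contained sketch is only illustrative: `shape_feats` and its feature names are hypothetical and not part of the notebook.

```python
import re

def shape_feats(word):
    """Orthographic cues for a single token (feature names are illustrative)."""
    shape = re.sub(r"[0-9]", "d", re.sub(r"[a-z]", "x", re.sub(r"[A-Z]", "X", word)))
    return {
        "wordShape": shape,
        # Collapse runs of the same symbol so 'Xxxxxx' and 'Xxx' share one short shape.
        "shortShape": re.sub(r"(.)\1+", r"\1", shape),
        "isAllCaps": word.isupper(),
        "isTitle": word.istitle(),
        "hasDigit": any(ch.isdigit() for ch in word),
    }

for token in ["EU", "German", "1996-08-22"]:
    print(token, shape_feats(token))
```

The collapsed shape is less sparse than the full shape, since capitalized words of any length map to the same value.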

In [None]:
#Extract features from the conll data, after loading it.
def get_feats_conll(conll_data):
    feats = []
    labels = []
    for sentence in conll_data:
        feats.append(sent2feats(sentence[0]))
        labels.append(sentence[1])
    return feats, labels

In [None]:
# input example
example = [[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']]]

feats, labels = get_feats_conll(example)
print(example)
feats[0]

## Training the model

In [None]:
#Train a sequence model
def train_seq(X_train,Y_train,X_dev,Y_dev):
    crf = CRF(algorithm='lbfgs', c1=0.1, c2=10, max_iterations=50)#, all_possible_states=True)
    #Just to fit on training data
    crf.fit(X_train, Y_train)
    labels = list(crf.classes_)

    #testing:
    y_pred = crf.predict(X_dev)
    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
    print(metrics.flat_f1_score(Y_dev, y_pred,average='weighted', labels=labels))
    print(metrics.flat_classification_report(Y_dev, y_pred, labels=sorted_labels, digits=3))
    get_confusion_matrix(Y_dev, y_pred,labels=sorted_labels)
    return crf

In [None]:
def print_cm(cm, labels):
    """pretty print for confusion matrices"""
    print("\n")
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print("    " + empty_cell, end=" ")
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        row_total = 0  # avoid shadowing the built-in sum()
        for j in range(len(labels)):
            row_total += int(cm[i, j])
            print("%{0}.0f".format(columnwidth) % cm[i, j], end=" ")
        print(row_total) #Prints the total number of instances per cat at the end.

In [None]:
#python-crfsuite does not have a confusion matrix function, 
#so writing it using sklearn's confusion matrix and print_cm from github
def get_confusion_matrix(y_true,y_pred,labels):
    trues,preds = [], []
    for yseq_true, yseq_pred in zip(y_true, y_pred):
        trues.extend(yseq_true)
        preds.extend(yseq_pred)
    print_cm(confusion_matrix(trues,preds,labels=labels),labels)

Make sure your path is correct!

In [None]:
# main script
work_dir = "drive/MyDrive/Colab Notebooks/nlp-app-II/data"
conll_dir = work_dir + "/conll2003/en"
train_path = conll_dir + "/train.txt"
dev_path = conll_dir + "/valid.txt"
test_path = conll_dir + "/test.txt"

conll_train = load_data_conll(train_path)
conll_dev = load_data_conll(dev_path)

print("Training a Sequence classification model with CRF")
feats, labels = get_feats_conll(conll_train)
devfeats, devlabels = get_feats_conll(conll_dev)
crf_model = train_seq(feats, labels, devfeats, devlabels)
print("Done with sequence model")

## Exercise 1
First train the CRF model with the default feature set (that is, current word and part-of-speech tag), and analyze the results. Why are we getting such high overall results? Are they actually good results? What is happening there?

It is always a good idea to use the official scorer when possible.
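
To see why span-level scoring can differ from token-level scores, the sketch below computes precision, recall and F1 over exactly matching entity spans in BIO-tagged sequences. It is a simplified approximation for building intuition, not a replacement for the official scorer; `bio_spans` and `entity_prf` are hypothetical helpers and the tag sequences are made up.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # A span opens on B-X, or on an I-X that does not continue an open span.
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return spans

def entity_prf(gold_seqs, pred_seqs):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1

gold = [["B-ORG", "O", "B-MISC", "O", "B-PER", "I-PER"]]
pred = [["B-ORG", "O", "O", "O", "B-PER", "O"]]
print(entity_prf(gold, pred))
```

In this toy case four of the six tokens are tagged correctly, yet only one of the three gold entities is recovered exactly, because the truncated `PER` span does not count as a match.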


In [None]:
# Note we are using the dev partition as test.
# Change to "test.txt" when you want to evaluate on test (do it once you are happy with the dev results)
test_path = conll_dir + "/valid.txt"

conll_test = load_data_conll(test_path)
testfeats, testlabels = get_feats_conll(conll_test)
words = [tokens[0] for tokens in conll_test]

preds = crf_model.predict(testfeats)

In [None]:
def dump_to_file(words, labels, preds):
    # One "token gold predicted" line per token, with a blank line between sentences.
    with open('output.tsv', 'w', encoding='utf-8') as f:
        for i in range(len(preds)):
            for w, l, p in zip(words[i], labels[i], preds[i]):
                f.write(w + " " + l + " " + p + "\n")
            f.write('\n')

dump_to_file(words, testlabels, preds)

In [None]:
!cp "drive/MyDrive/00-Irakaskuntza/HAP-LAP-masterra/NLP-Applications-2/Part1: Information-extraction/notebooks/conlleval.txt" .

In [None]:
!perl conlleval.txt < output.tsv

## Exercise 2

Look at the confusion matrix. Note that rows are gold labels and columns are predicted labels. Where is the confusion located?
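
One way to read a confusion matrix is to list only its largest off-diagonal cells. The sketch below does this directly from gold and predicted tag sequences using the standard library; `top_confusions` is a hypothetical helper and the sequences are made up for illustration.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=5):
    """Most frequent (gold, predicted) label pairs where the two labels differ."""
    pairs = Counter()
    for gold_seq, pred_seq in zip(y_true, y_pred):
        for gold, pred in zip(gold_seq, pred_seq):
            if gold != pred:
                pairs[(gold, pred)] += 1
    return pairs.most_common(n)

y_true = [["B-ORG", "O", "B-MISC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["B-ORG", "O", "O", "O"], ["B-PER", "O", "O"]]
for (gold, pred), count in top_confusions(y_true, y_pred):
    print(f"gold={gold:<7} predicted={pred:<7} count={count}")
```

Each printed pair corresponds to one cell of the matrix: the gold label selects the row and the predicted label selects the column.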

## Exercise 3
Build your own feature set. You can extract the following feature sets:

 - __Word features__: word, prev 2 words, next 2 words in the sentence.
 - __POS tag features__: current tag, previous and next 2 tags.
 - __Word shape features__: Whether the current word is all caps, its word shape, and the shape of the next 2 words. We provide a function to get the word shape.
 - __Gazetteers__: The presence of the current word in a gazetteer.

In order to obtain competitive results (and understand how to build a function to extract features for sequence labeling) you should implement the first two sets of features (word features and PoS tag features).
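
A context window can be written once and reused for both words and PoS tags. The following is a minimal sketch under that assumption; `window_feats` and its feature naming scheme are hypothetical, and positions outside the sentence are padded with `<S>` and `</S>`.

```python
def window_feats(items, i, name, size=2):
    """Features for the items at offsets -size..+size around position i."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if j < 0:
            value = "<S>"    # padding before the sentence start
        elif j >= len(items):
            value = "</S>"   # padding after the sentence end
        else:
            value = items[j]
        feats[f"{name}[{offset:+d}]"] = value
    return feats

sentence = ["EU", "rejects", "German", "call"]
print(window_feats(sentence, 0, "word"))
print(window_feats(sentence, 3, "word"))
```

The same helper could be called with the list of PoS tags and `name="tag"`, and the two dictionaries merged into a single feature dictionary per token.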