# COLX 563 Lab Assignment 1: Named Entity Recognition (Cheat sheet)

## Assignment Objectives

In this lab, you will be training two models to perform Named Entity Recognition (NER). The primary focus of this lab will be on feature generation. Components of this lab include:

1. Read in data and convert data to IOB-labeling
2. Define basic features and train a baseline classifier on your data
3. Train a Conditional Random Field classifier
4. Have your model compete in a class Kaggle competition

Note that parts of this lab are based on the [sklearn_crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/) docs.

## Getting Started

Run the code below to import the relevant modules (you can add to this as needed).

In [41]:
# !pip install sklearn-crfsuite 

In [42]:
# import nltk
# nltk.download('gazetteers')

In [3]:
#provided code
import os
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score, classification_report
from bs4 import BeautifulSoup
from nltk.corpus import names,gazetteers 
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score, flat_classification_report
import re

In this lab, you will be working with the OntoNotes 5.0 dataset, a commonly used dataset for NER. The full data is available from the UBC library, but the subset you will be working with is available in the class GitHub repo in the [data directory](https://github.ubc.ca/mds-cl-2021-22/COLX_563_adv-semantics_students/tree/master/labs/Data/Lab1). You can pull this data from GitHub and change the path below to access it.

In [4]:
#provided code
ontonotes_path = "/Users/MDS2021-2022/COLX_563_adv-sem_labs/Data/Lab1/"

Then run the code below to get a list of files for the training and dev sets.

In [5]:
train_data = ['train/' + filename for filename in os.listdir(ontonotes_path + 'train')]
dev_data = ['dev/' + filename for filename in os.listdir(ontonotes_path + 'dev')]

print(f"Read {len(train_data)} training files")
print(f"Read {len(dev_data)} development files")

## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Exercise 1: Initial Data Processing
rubric={accuracy:4,quality:1}

Before training, you need to convert your data from the .name files to standard IOB/BIO (**I**nside-**O**utside-**B**eginning) tags for NER. Each line of the data file contains a sentence with XML tags indicating the named entities. For example, if the sentence contains a *GPE* tag such as:


![BIO](BIO.png)

```
<ENAMEX TYPE="GPE"> Hong Kong </ENAMEX>

-> 

Hong, Kong
B-GPE, I-GPE
```

The tag for 'Hong' is *B-GPE* and 'Kong' is *I-GPE* (GPE stands for Geopolitical Entity).  Write a *sentence2iob* function that reads in a sentence from the dataset and converts it to a list of tokens with corresponding IOB-tags.

Note that there are a few ways to approach this, and for each of them you will encounter challenges. You will likely end up missing some cases in your first pass, and will need to look for the specific cases which are causing your asserts to fail. 

**Notes:** 

* Spaces have been inserted between words and punctuation, so you can just use `split`; there is no need to tokenize.
* ENAMEX tags sometimes contain attributes like `S_OFF="1"` and `E_OFF="4"`. You should ignore these attributes.
* There are a number of nested elements in the dataset. You should ignore these.
* When the sentence contains no tokens (i.e. it consists entirely of whitespace) or the sentence contains a start-of-document tag `<DOC ...` or an end-of-document tag `</DOC>`, you should return an empty token list and an empty tag list.

```
<DOC DOCNO="wb/a2e/00/a2e_0016@0016@a2e@wb@en@on">          <--- ignore this line
<ENAMEX TYPE="ORG">The National Syrian Party</ENAMEX> in <ENAMEX TYPE="GPE">Lebanon</ENAMEX> , and Very Dangerous Facts
<ENAMEX TYPE="PERSON">Al Shaheen2005</ENAMEX>
At a time when the <ENAMEX TYPE="NORP">Lebanese</ENAMEX> judiciary continues its investigation into ...
...
That 's not <ENAMEX TYPE="ORG">fun . At <ENAMEX TYPE="PERSON">all</ENAMEX></ENAMEX>     <--- nested: extract ORG, ignore PERSON
... pursued <ENAMEX TYPE="NORP" S_OFF="5">anti-Serbian</ENAMEX> policies .   <--- remove ' S_OFF="[0-9]*"' (and likewise ' E_OFF="[0-9]*"')
...
</DOC>                                                      <--- ignore this line
```
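The notes above can be sketched as follows. This is only one possible approach, not the reference solution: the names `sentence2iob_sketch`, `TAG_RE` and `TYPE_RE` are hypothetical, the sketch assumes every opening tag carries a `TYPE` attribute, and it assumes that tokens inside a nested element simply continue the outer entity. It has not been checked against the full-corpus token count used later in the lab.

```python
import re

# Hypothetical helper patterns (not part of the provided code).
TAG_RE = re.compile(r'(<ENAMEX[^>]*>|</ENAMEX>)')   # split text on opening/closing tags
TYPE_RE = re.compile(r'TYPE="([^"]+)"')             # S_OFF/E_OFF are ignored by only reading TYPE


def sentence2iob_sketch(sentence):
    """Return (tokens, tags) for one line, or two empty lists for DOC lines."""
    if "<DOC" in sentence or "</DOC>" in sentence:
        return [], []

    tokens, tags = [], []
    depth = 0            # how many ENAMEX elements are currently open
    entity_type = None   # type of the outermost open entity, if any
    at_start = False     # True until the first token of that entity is emitted

    for piece in TAG_RE.split(sentence):
        if piece.startswith("<ENAMEX"):
            depth += 1
            if depth == 1:  # nested (inner) tags are ignored
                entity_type = TYPE_RE.search(piece).group(1)
                at_start = True
        elif piece == "</ENAMEX>":
            depth = max(depth - 1, 0)
            if depth == 0:
                entity_type = None
        else:
            for token in piece.split():
                if entity_type is None:
                    tags.append("O")
                else:
                    tags.append(("B-" if at_start else "I-") + entity_type)
                    at_start = False
                tokens.append(token)
    return tokens, tags


example = 'That \'s not <ENAMEX TYPE="ORG">fun . At <ENAMEX TYPE="PERSON">all</ENAMEX></ENAMEX>'
example_tokens, example_tags = sentence2iob_sketch(example)
print(list(zip(example_tokens, example_tags)))
```

Splitting on the tags first and calling `split` on the text in between avoids having to glue `ENAMEX TYPE` into a single token, at the cost of tracking nesting depth by hand.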

In [15]:
from re import search, sub
    
def sentence2iob(sentence):
    '''Input sentence is a string from the Ontonotes corpus, with xml tags indicating named entities
    output is a list of tokens and a list of NER IOB-tags corresponding to those tokens'''
    
    tokens=[]
    tags=[]
    
    # your code here

    # return empty lists for lines containing <DOC or </DOC>
    # strip the S_OFF/E_OFF attributes by replacing them with '' (see re.sub)
    # rewrite "ENAMEX TYPE" as "ENAMEX_TYPE" so that split() keeps each opening tag in one piece,
        # e.g. '<ENAMEX_TYPE="GPE">Moscow</ENAMEX>' or '<ENAMEX_TYPE="QUANTITY">2' is then a single token

    return tokens, tags


In [16]:
check_sentence = '<ENAMEX TYPE="GPE">Moscow</ENAMEX> , overcast changing to moderate snow , <ENAMEX TYPE="QUANTITY">2 degrees below zero</ENAMEX> to <ENAMEX TYPE="QUANTITY">1 degree</ENAMEX> .'
curr_tokens, curr_tags = sentence2iob(check_sentence)

<ENAMEX_TYPE="GPE">Moscow</ENAMEX>
,
overcast
changing
to
moderate
snow
,
<ENAMEX_TYPE="QUANTITY">2
degrees
below
zero</ENAMEX>
to
<ENAMEX_TYPE="QUANTITY">1
degree</ENAMEX>
.


In [7]:
check_sentence = 'While <ENAMEX TYPE="PERSON">Galloway</ENAMEX> \'s <ENAMEX TYPE="ORG" S_OFF="4">pro-Wal-Mart</ENAMEX> film introduces us to grateful employees /-'
curr_tokens, curr_tags = sentence2iob(check_sentence)
assert curr_tags == ['O', 'B-PERSON', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

check_sentence = '<ENAMEX TYPE="GPE">Moscow</ENAMEX> , overcast changing to moderate snow , <ENAMEX TYPE="QUANTITY">2 degrees below zero</ENAMEX> to <ENAMEX TYPE="QUANTITY">1 degree</ENAMEX> .'
curr_tokens, curr_tags = sentence2iob(check_sentence)
assert curr_tags == ['B-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'O', 'B-QUANTITY', 'I-QUANTITY', 'O']

print("Success!")

Success!


Run the following code to build lists containing the tokenized and tagged sentences for training and development. We will use these lists later.

In [None]:
train_sents = []
dev_sents = []

for filenames, sents in [(train_data, train_sents), (dev_data, dev_sents)]: 
    for filename in filenames:
        with open(ontonotes_path + filename, encoding="utf-8") as f:
            for sentence in f:
                curr_tokens, curr_tags = sentence2iob(sentence)
                assert "" not in curr_tokens # if you have empty strings, you've done something wrong
                sents.append((curr_tokens, curr_tags))

train_token_count = sum([len(tokens) for tokens, tags in train_sents])
assert train_token_count == 1096878
print("Success!")

In [8]:
# Cheat sheet
print(check_sentence) 
curr_tokens, curr_tags = sentence2iob(check_sentence)
print(curr_tokens)
print(curr_tags) 

<ENAMEX TYPE="GPE">Moscow</ENAMEX> , overcast changing to moderate snow , <ENAMEX TYPE="QUANTITY">2 degrees below zero</ENAMEX> to <ENAMEX TYPE="QUANTITY">1 degree</ENAMEX> .
['Moscow', ',', 'overcast', 'changing', 'to', 'moderate', 'snow', ',', '2', 'degrees', 'below', 'zero', 'to', '1', 'degree', '.']
['B-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'I-QUANTITY', 'O', 'B-QUANTITY', 'I-QUANTITY', 'O']


## Exercise 2: Naive Bayes Classification

Now we are going to train a simple Naive Bayes classifier to perform NER.

### Exercise 2.1
rubric={accuracy:2}

The quality of the model depends on utilizing informative features for our task. Modify the *word2features* function to generate features for a specific word in the sentence that will be useful for Named Entity Recognition. You should include at least:

- A feature which looks at neighbouring words (Note that this should be one feature in your feature dict, but it will correspond to multiple features in your sparse matrix output from your vectorizer) 
- A feature which looks at word morphology, for example the last few letters of the word. 
- A feature which considers the "shape" of the word (i.e. which letters are upper or lower case). You may want to consider the length or location of the word in the sentence to derive a high-performing feature here (what's special about the first word in an English sentence?). 
- A gazetteer feature (use `names` and/or `gazetteers` from `nltk.corpus`; note that `gazetteers` has multiword expressions like *United States* which won't correspond to individual tokens)


You will use this same function in your CRF, and you may need to come back here later and improve your set of features to increase your performance in the Kaggle competition.
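As a rough illustration of the four feature types listed above, here is a minimal sketch. The function name `word2features_sketch`, the feature names, the `<START>`/`<END>` padding strings and the tiny gazetteer sets are all assumptions made for this example; in the lab you would use the sets built from `nltk.corpus` instead.

```python
def word2features_sketch(sentence, idx, name_gazetteer=frozenset(),
                         location_gazetteer=frozenset()):
    """Build a feature dict for sentence[idx]; sentence is a list of tokens."""
    word = sentence[idx]
    return {
        "word_lowercase": word.lower(),
        # neighbouring words (one dict entry each, many columns after vectorizing)
        "prev_word": sentence[idx - 1].lower() if idx > 0 else "<START>",
        "next_word": sentence[idx + 1].lower() if idx < len(sentence) - 1 else "<END>",
        # morphology: the last few letters of the word
        "suffix_2": word[-2:],
        "suffix_3": word[-3:],
        # shape: capitalisation, ignoring the sentence-initial position
        "is_all_caps": word.isupper(),
        "is_title_not_first": idx > 0 and word.istitle(),
        "has_digit": any(ch.isdigit() for ch in word),
        # gazetteer lookups on the single token
        "in_name_gazetteer": word in name_gazetteer,
        "in_location_gazetteer": word in location_gazetteer,
    }


# Toy gazetteers, for illustration only.
toy_names = {"William", "Ryan"}
toy_locations = {"Canada"}

sentence = ["G.", "William", "Ryan", ",", "president"]
features = word2features_sketch(sentence, 1, toy_names, toy_locations)
print(features)
```

Excluding the first token from the title-case feature is one way to account for the fact that English sentences start with a capital letter regardless of whether the word is a name.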

In [18]:
# my code here
import nltk

names_gazetteer = set(names.words())
location_gazetteer = set()
for location in gazetteers.words():
    words = location.split()
    for word in words:
        if word[0].isupper() and len(word) > 3:
            location_gazetteer.add(word)
# my code here

def word2features(sentence, idx):
    word_features = {}
    word_features['word_lowercase'] = sentence[idx].lower()

    # your code here
    # word_features['my_features'] = ...
    # word_features['prev_word'] = 
    # word_features['next_word'] = 
    # word_features['1-suffix'] = 
    # word_features['2-suffix'] = 
    # word_features['is_names_gazetteer'] = 
    # word_features['is_location_gazetteer'] = 

    return word_features
    
def sentence2features(sentence):
    return [word2features(sentence, idx) for idx in range(len(sentence))]

In [10]:
# Cheat sheet
print("Canada" in location_gazetteer)
print("canada" in location_gazetteer)
print("United States" in location_gazetteer)
print("United" in location_gazetteer)
print("States" in location_gazetteer)

True
False
False
True
True


### Exercise 2.2
rubric={accuracy:1}

Write a function `prepare_ner_feature_dicts` which takes `train_sents` or `dev_sents`, which we prepared above, and runs `sentence2features` on the tokenized sentences. You should return two lists: one containing the feature dictionaries for every token in the dataset and another containing all tags. Note that these should be plain lists of dictionaries and tags (not lists of lists). 
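The flat output shape described above can be sketched on toy data. The helper names `toy_features` and `flatten_sentences` are hypothetical, and `toy_features` stands in for your own `word2features`.

```python
def toy_features(tokens, idx):
    """Stand-in for word2features: a single lowercase-word feature."""
    return {"word_lowercase": tokens[idx].lower()}


def flatten_sentences(sents, featurize=toy_features):
    """Turn [(tokens, tags), ...] into one flat list of dicts and one flat list of tags."""
    all_dicts, all_tags = [], []
    for tokens, tags in sents:
        all_dicts.extend(featurize(tokens, i) for i in range(len(tokens)))
        all_tags.extend(tags)
    return all_dicts, all_tags


toy_sents = [
    (["Hong", "Kong", "is", "busy"], ["B-GPE", "I-GPE", "O", "O"]),
    (["Hello"], ["O"]),
]
flat_dicts, flat_tags = flatten_sentences(toy_sents)
print(len(flat_dicts), flat_tags)
```

Using `extend` rather than `append` is what keeps both outputs as plain lists with one entry per token.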

In [20]:
def prepare_ner_feature_dicts(sents):
    '''sents is a list of (tokens, tags) pairs, as built above. Returns a flat list of feature
    dictionaries and a flat list of IOB tags, one per token in the entire dataset'''
    all_dicts = []
    all_tags = []
    # your code here

    return all_dicts, all_tags

In [21]:
train_dicts, train_tags = prepare_ner_feature_dicts(train_sents)
dev_dicts, dev_tags = prepare_ner_feature_dicts(dev_sents)

assert len(train_dicts) == 1096878
print("Success!")

Success!


In [18]:
# Cheat sheet 
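train_toks = [token for tokens, _ in train_sents for token in tokens]  # flat token list, built here only for display
print("printing purpose:", train_toks[:9])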
print(train_dicts[:2])
print(train_tags[:9])

printing purpose: ['G.', 'William', 'Ryan', ',', 'president', 'of', 'Post-Newsweek', 'Stations', ',']
[{'word_lowercase': 'g.', 'all_caps': True, 'title_case_not_first_word': False, 'prev_word': '*None*', 'next_word': 'William', 'last_2_letters': 'G.', 'location_gazetteer': False, 'name_gazetteer': False, 'has_number': False}, {'word_lowercase': 'william', 'all_caps': False, 'title_case_not_first_word': True, 'prev_word': 'G.', 'next_word': 'Ryan', 'last_2_letters': 'Wi', 'location_gazetteer': False, 'name_gazetteer': True, 'has_number': False}]
['B-PERSON', 'I-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O']


### Exercise 2.3
rubric={accuracy:2,reasoning:1}

Now use your features to train a Multinomial Naive Bayes classifier on `train_dicts` and `train_tags`, with default settings. You will need to vectorize `train_dicts` first using `DictVectorizer` from sklearn. Evaluate the model on `dev_dicts`, comparing system-generated tags to `dev_tags`. 

Using [sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), print out a macroaveraged f-score, a microaveraged f-score, and a classification report, so you can see how you're doing with each class. Note you will get very divergent scores for microaverage and macroaveraged f-score. Briefly explain why, with reference to the classification report.
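To see how the two averages can diverge, here is a small pure-Python sketch on made-up tags. The function `f1_scores` is a hypothetical helper written for this illustration (it is not the sklearn implementation), and the toy data is invented: a classifier that always predicts `O` on a set where 5 of 100 tokens are `B-LAW`.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro F1, macro F1, per-label F1) for two equal-length tag lists."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    macro = sum(per_label.values()) / len(labels)   # every label counts equally
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))  # every token counts equally
    return micro, macro, per_label


gold = ["O"] * 95 + ["B-LAW"] * 5
pred = ["O"] * 100
micro, macro, per_label = f1_scores(gold, pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
print(per_label)
```

In this toy case the frequent `O` class dominates the micro average, while the macro average is pulled down by the rare class that is never predicted.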

In [22]:
# your code here


print("MicroF1:",f1_score(dev_tags, y_pred,average="micro"))
print("MacroF1:",f1_score(dev_tags, y_pred,average="macro"))

MicroF1: 0.9324081364058332
MacroF1: 0.34002264404986104


In [24]:
print(classification_report(dev_tags, y_pred))



               precision    recall  f1-score   support

   B-CARDINAL       0.62      0.55      0.58      1216
       B-DATE       0.74      0.61      0.67      2230
      B-EVENT       1.00      0.02      0.03       130
        B-FAC       0.00      0.00      0.00       149
        B-GPE       0.73      0.94      0.82      2738
   B-LANGUAGE       0.00      0.00      0.00       114
        B-LAW       0.00      0.00      0.00        47
        B-LOC       1.00      0.01      0.03       231
      B-MONEY       0.71      0.52      0.60       712
       B-NORP       0.82      0.81      0.81       928
    B-ORDINAL       0.79      0.14      0.24       222
        B-ORG       0.71      0.55      0.62      3024
    B-PERCENT       0.75      0.78      0.76       574
     B-PERSON       0.74      0.90      0.81      2082
    B-PRODUCT       1.00      0.01      0.02       101
   B-QUANTITY       0.00      0.00      0.00       125
       B-TIME       1.00      0.01      0.03       203
B-WORK_OF



YOUR ANSWER HERE



### Exercise 2.4
rubric={accuracy:1}

One problem with using a regular (non-sequential) classifier for IOB-based NER is that it may create ill-formed named entities, i.e. an `I-` tag with no corresponding `B-` or `I-` tag before it. Check how often this happens in the dev set with your classifier (the answer is "a lot").

In [26]:
# Cheat sheet
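# an I-X tag is well formed only if the previous tag is B-X or I-X
# note: y_pred is flat, so sentence boundaries are not checked here
total_i_tags = 0
broken_i_tags = 0
for i, tag in enumerate(y_pred):
    if tag.startswith("I-"):
        total_i_tags += 1
        prev_tag = y_pred[i - 1] if i > 0 else "O"
        if prev_tag[2:] != tag[2:]:
            broken_i_tags += 1
            print(prev_tag, tag)
print(broken_i_tags)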


# the number can vary (depending on your features)
# There are 163,282 I-tags in total
# There are 6915 broken I-tags
# 4.24% of all I-tags are broken

3493 : the number can vary...


In [None]:
# your code here



## Exercise 3: Training a CRF

Next, you're going to train a CRF model using the [`sklearn_crfsuite`](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html) package.

#### Exercise 3.1
rubric={accuracy:1}

First, do the appropriate modification to `prepare_ner_feature_dicts` to put the data in the right format. The only difference is that you are now building lists of lists of feature dicts, with the "extra" lists corresponding to sentences. 
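The only change from the flat version is the nesting, which can be sketched on toy data. The names `toy_features` and `group_by_sentence` are hypothetical; `toy_features` stands in for your own `word2features`.

```python
def toy_features(tokens, idx):
    """Stand-in for word2features: a single lowercase-word feature."""
    return {"word_lowercase": tokens[idx].lower()}


def group_by_sentence(sents, featurize=toy_features):
    """Turn [(tokens, tags), ...] into a list of per-sentence dict lists and tag lists."""
    all_dicts, all_tags = [], []
    for tokens, tags in sents:
        # append (not extend): each sentence stays together as its own inner list
        all_dicts.append([featurize(tokens, i) for i in range(len(tokens))])
        all_tags.append(list(tags))
    return all_dicts, all_tags


toy_sents = [
    (["Hong", "Kong", "is", "busy"], ["B-GPE", "I-GPE", "O", "O"]),
    (["Hello"], ["O"]),
]
nested_dicts, nested_tags = group_by_sentence(toy_sents)
print([len(sentence) for sentence in nested_dicts], nested_tags)
```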

In [27]:
def prepare_ner_feature_dicts(sents):
    '''sents is a list of (tokens, tags) pairs, as built above. Returns a list of lists of feature
    dictionaries and a list of lists of IOB tags, with one inner list per sentence'''
    all_dicts = []
    all_tags = []
    # your code here

    
    return all_dicts, all_tags

#### Exercise 3.2:
rubric={accuracy:1}

Now train and evaluate your model in the same way as you did with the Naive Bayes model. Note that the CRF will take a lot longer to train than Naive Bayes, so you might want to set the `max_iterations` parameter low to start (but you will need to set it fairly high to get good results). If you want to see the progress of training, use `verbose=True`; it will help your sanity and give you a sense of how many iterations you need.

**Note:** 

1. [`sklearn_crfsuite`](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html) does not require a DictVectorizer, you can input features and tags directly into the model.
1. You may get an error `AttributeError: 'CRF' object has no attribute 'keep_tempfiles'` when training the CRF model because of an incompatible sklearn version. This is not fatal. Wrap the call in a `try ... except` clause that catches the error and ignores it (note that this only works because the error is raised after the model parameters have been set):

```
    try:
        call_produces_an_error()
    except:
        pass
```
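A bare `except` also hides unrelated failures, so a slightly narrower variant may be safer. The sketch below is an assumption-laden illustration: `fit_ignoring_keep_tempfiles` is a hypothetical helper, and `FakeCRF` is a stand-in class that only imitates the error message quoted above so the example runs without the real library.

```python
def fit_ignoring_keep_tempfiles(model, X, y):
    """Call model.fit(X, y), swallowing only the known 'keep_tempfiles' AttributeError."""
    try:
        model.fit(X, y)
    except AttributeError as err:
        if "keep_tempfiles" not in str(err):
            raise  # any other AttributeError is a real problem
    return model


class FakeCRF:
    """Stand-in model: 'trains', then raises the error described in the note above."""

    def __init__(self):
        self.trained = False

    def fit(self, X, y):
        self.trained = True
        raise AttributeError("'CRF' object has no attribute 'keep_tempfiles'")


model = fit_ignoring_keep_tempfiles(FakeCRF(), [[{"word_lowercase": "a"}]], [["O"]])
print(model.trained)
```

With the real model you would pass your `CRF` instance and the nested training lists in place of the stand-ins.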

In [38]:
# your code here



loading training data to CRFsuite: 100%|██████████████████████████████████████████████████████████████████████████████| 57447/57447 [00:08<00:00, 6559.70it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 186119
Seconds required: 2.704

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=10.94 loss=1204633.74 active=186119 feature_norm=5.00
Iter 2   time=3.69  loss=1017711.04 active=186119 feature_norm=4.11
Iter 3   time=3.63  loss=991387.84 active=186119 feature_norm=4.03
Iter 4   time=3.83  loss=871534.63 active=186119 feature_norm=4.80
Iter 5   time=3.64  loss=729903.23 active=186119 feature_norm=7.10
Iter 6   time=3.64  loss=627632.90 active=186119 feature_norm=10.10
Iter 7   time=3.71  loss=518951.38 active=186119 feature_norm=14.17
Iter 8   time=7.21  loss=464748.65 active=186119 feature_norm=16.87
Iter 9   time=3.93  loss=419471.75 active=186119 feature_

In [39]:
print('Microaveraged fscore', flat_f1_score(dev_tags, dev_pred, average='micro'))
print('Macroaveraged fscore', flat_f1_score(dev_tags, dev_pred, average='macro'))

Microaveraged fscore 0.9637951092257253
Macroaveraged fscore 0.6816350011552526


In [None]:
from sklearn_crfsuite import metrics

labels = list(crf.classes_)
y_pred = crf.predict(dev_dicts)
print(metrics.flat_f1_score(dev_tags, y_pred, average='weighted', labels=labels))

In [30]:
# Cheat sheet
# for some reason `flat_classification_report()` works only with the previous version of sklearn...
from sklearn import metrics
from sklearn_crfsuite.utils import flatten

print(metrics.classification_report(flatten(dev_tags), flatten(y_pred)))

# import itertools
# print(classification_report(list(itertools.chain(*dev_tags)), list(itertools.chain(*dev_pred))))

In [23]:
# Recommended: phrase-level evaluation with conlleval
# !python3 -m pip install conlleval

from conlleval import evaluate # https://github.com/sighsmile/conlleval


evaluate(flatten(dev_tags), flatten(y_pred), verbose=True) 

# processed 170198 tokens with 14893 phrases; found: 14463 phrases; correct: 12367.
# accuracy:  85.43%; (non-O)
# accuracy:  96.77%; precision:  85.51%; recall:  83.04%; FB1:  84.26
#          CARDINAL: precision:  82.59%; recall:  83.88%; FB1:  83.23  1235
#              DATE: precision:  85.23%; recall:  85.11%; FB1:  85.17  2227
#             EVENT: precision:  57.33%; recall:  33.08%; FB1:  41.95  75
#               FAC: precision:  49.43%; recall:  28.86%; FB1:  36.44  87
#               GPE: precision:  91.45%; recall:  92.99%; FB1:  92.21  2784
#          LANGUAGE: precision:  87.69%; recall:  50.00%; FB1:  63.69  65
#               LAW: precision:  51.85%; recall:  29.79%; FB1:  37.84  27
#               LOC: precision:  69.15%; recall:  60.17%; FB1:  64.35  201
#             MONEY: precision:  92.40%; recall:  90.45%; FB1:  91.41  697
#              NORP: precision:  87.71%; recall:  91.49%; FB1:  89.56  968
#           ORDINAL: precision:  78.39%; recall:  83.33%; FB1:  80.79  236
#               ORG: precision:  82.47%; recall:  78.24%; FB1:  80.30  2869
#           PERCENT: precision:  90.37%; recall:  89.90%; FB1:  90.13  571
#            PERSON: precision:  88.45%; recall:  87.94%; FB1:  88.20  2070
#           PRODUCT: precision:  50.00%; recall:  27.72%; FB1:  35.67  56
#          QUANTITY: precision:  82.61%; recall:  60.80%; FB1:  70.05  92
#              TIME: precision:  63.45%; recall:  45.32%; FB1:  52.87  145
#       WORK_OF_ART: precision:  34.48%; recall:  29.85%; FB1:  32.00  58

#### Exercise 3.3:
rubric={accuracy:1,reasoning:1}

Look at the top and bottom 10 transitions in terms of weight (**Hint**: [`sklearn_crfsuite` tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-check-what-classifier-learned)). Do they make sense?  
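The ranking step can be sketched with `collections.Counter` on a small dictionary shaped like a `{(label_from, label_to): weight}` mapping. The dictionary `toy_transitions` and the helper `top_and_bottom` are hypothetical; the six weights are copied from the sample output in this notebook purely as illustration, and your own values will differ.

```python
from collections import Counter

# Toy stand-in for a {(label_from, label_to): weight} mapping.
toy_transitions = {
    ("B-DATE", "I-DATE"): 6.768138,
    ("O", "O"): 6.695223,
    ("B-PERSON", "I-PERSON"): 5.629678,
    ("B-PERCENT", "O"): -1.28613,
    ("B-PERSON", "B-PERSON"): -1.177372,
    ("I-ORG", "B-PERSON"): -1.174839,
}


def top_and_bottom(transition_weights, n):
    """Return the n highest-weight and n lowest-weight transitions."""
    ranked = Counter(transition_weights).most_common()  # sorted by weight, descending
    return ranked[:n], ranked[-n:]


top, bottom = top_and_bottom(toy_transitions, 2)
for (label_from, label_to), weight in top + bottom:
    print(f"{label_from:10} -> {label_to:10} {weight: .6f}")
```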

In [34]:
# your code here

# put `crf.transition_features_` in a `Counter` and use `most_common`
# `crf.transition_features_` maps (label_from, label_to) to a weight

Top 10: [(('O', 'B-CARDINAL'), 4.573875), (('I-MONEY', 'I-MONEY'), 4.620261), (('I-PERSON', 'I-PERSON'), 4.666901), (('B-PERCENT', 'I-PERCENT'), 4.716632), (('B-ORG', 'I-ORG'), 4.990618), (('I-ORG', 'I-ORG'), 5.134131), (('I-DATE', 'I-DATE'), 5.385755), (('B-PERSON', 'I-PERSON'), 5.629678), (('O', 'O'), 6.695223), (('B-DATE', 'I-DATE'), 6.768138)]
Bottom 10: [(('B-PERCENT', 'O'), -1.28613), (('B-PERSON', 'B-PERSON'), -1.177372), (('I-ORG', 'B-PERSON'), -1.174839), (('B-PERSON', 'B-ORG'), -0.984528), (('I-ORG', 'B-GPE'), -0.984249), (('I-ORG', 'B-ORG'), -0.967117), (('I-ORDINAL', 'O'), -0.963708), (('I-LANGUAGE', 'O'), -0.96031), (('B-GPE', 'B-GPE'), -0.956907), (('B-GPE', 'B-PERSON'), -0.937251)]


YOUR ANSWER HERE

