# Named Entity Recognition with linear Conditional Random Fields 

This assignment consists of two parts:

1. **Theory**: Solve two exercises about POS tagging and run Viterbi for a simple HMM model.
2. **Implementation**: Experiment with a named entity recognizer using sequence tagging with CRFs.

For the implementation part, you will be using the `python-crfsuite` package, which can be installed using Anaconda as follows: `conda install -c conda-forge python-crfsuite`

## Write Your Name Here: Shreyas Lokesha
##### UNC-id: 801210964
##### e-mail: slokesha@uncc.edu

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells).
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *crf-ner.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *crf-ner.ipynb*.
6. The PDF file may not show the table with the Viterbi solution. Make sure that you submit the notebook file as well!
7. Submit **both** your PDF and notebook on Canvas. Also upload the **CRF output** on the testb blind test set in CoNLL format.

# Theory

## Theory: POS tagging exercises

1. Exercise 8.1 in Chapter 8 in J&M.
2. Exercise 8.2 in Chapter 8 in J&M, parts 1, 2, 3, and 4.

To solve these exercises, it will be useful to consult the POS tagging guidelines linked on the course website, as well as the POS tagging examples provided in the files under `data/wsj/23/*.pos`. <br>

(1)
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN <br>
Tagging Error Rectified: Atlanta/NNP (Proper Noun) <br>

2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS<br>
Tagging Error Rectified: dinner/NN (singular noun) <br>

3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP<br>
Tagging Error Rectified: have/VBP (verb, non-3rd person singular present) <br>

4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS <br>
Tagging Error Rectified: Can/MD<br>
<br>
<br>
<br>

(2)
1. It is a nice night.<br>
Tagged Sequence: It/PRP is/VBZ a/DT nice/JJ night/NN ./PUNC<br>

2. This crap game is over a garage in Fifty-second Street. . .<br>
Tagged Sequence: This/DT crap/NN game/NN is/VBZ over/IN a/DT garage/NN in/IN Fifty-Second/NNP Street/NNP ./PUNC ./PUNC ./PUNC <br>

3.  . . . Nobody ever takes the newspapers she sells . . . <br>
Tagged Sequence:./PUNC ./PUNC ./PUNC Nobody/NN ever/RB takes/VBZ the/DT newspapers/NNS she/PRP sells/VBZ ./PUNC 
./PUNC ./PUNC <br>

4. He is a tall, skinny guy with a long, sad, mean-looking kisser, and a mournful voice. <br>
Tagged Sequence: He/PRP is/VBZ a/DT tall/JJ ,/PUNC skinny/JJ guy/NN with/IN a/DT long/JJ ,/PUNC sad/JJ ,/PUNC 
mean-looking/JJ kisser/NN ,/PUNC and/CC a/DT mournful/JJ voice/NN ./PUNC <br>




## Theory: Run Viterbi for a simple HMM model

Consider the following parameters for a very simple HMM:

<img src="files/table1.png" width="350">

<img src="files/table2.png" width="250">

Create the trellis for the sentence "Fed cuts rates in half".

<table style="width:100%">
  <tr>
    <th align="left">State/Word</th>
    <th>(t = 1) Fed</th>
    <th>(t = 2) cuts</th>
    <th>(t = 3) rates</th>
    <th>(t = 4) in</th>
    <th>(t = 5) half</th>
  </tr>
  <tr>
    <td align="left">$S_1$ = Noun</td>
    <td>0.5 * 0.2 / 0</td>
    <td>0.0144 / 2</td>
    <td>0.0027 / 2</td>
    <td>0 / 0</td>
    <td>0.000020736 / 3</td>
  </tr>
  <tr>
    <td align="left">$S_2$ = Verb</td>
    <td>0.3 * 0.2 / 0</td>
    <td>0.015 / 1</td>
    <td>0.00288 / 1</td>
    <td>0 / 0</td>
    <td>0 / 0</td>
  </tr>
  <tr>
    <td align="left">$S_3$ = Prep</td>
    <td>0.2 * 0.0 / 0</td>
    <td>0 / 0</td>
    <td>0 / 0</td>
    <td>0.0002304 / 2</td>
    <td>0 / 0</td>
  </tr>
    <caption>Trellis table, showing the score $\delta_i(t)$ and the back-pointer $\psi_i(t)$ (written as $\delta$ / $\psi$) for each state $S_i$ at step $t$.</caption>
</table>

1. Show the computed parameters $\delta_i(t)$ / $\psi_i(t)$ at each node in the trellis. 
2. What is the most likely sequence of tags? What is its probability?

1. 
$\delta_1(1) = 0.5 \times 0.2 = 0.10$

$\delta_2(1) = 0.3 \times 0.2 = 0.06$

$\delta_3(1) = 0.2 \times 0.0 = 0$


$\delta_1(2) = \max(0.10 \times 0.2 \times 0.4,\ 0.06 \times 0.6 \times 0.4,\ 0 \times 0.9 \times 0.4) = \max(0.008,\ 0.0144,\ 0) = 0.0144 \Rightarrow \psi_1(2) = 2$

$\delta_2(2) = \max(0.10 \times 0.5 \times 0.3,\ 0.06 \times 0.0 \times 0.3,\ 0 \times 0.1 \times 0.3) = \max(0.015,\ 0,\ 0) = 0.015 \Rightarrow \psi_2(2) = 1$

$\delta_3(2) = \max(0.10 \times 0.3 \times 0.0,\ 0.06 \times 0.4 \times 0.0,\ 0 \times 0.0 \times 0.0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_3(2) = 0$


$\delta_1(3) = \max(0.0144 \times 0.2 \times 0.3,\ 0.015 \times 0.6 \times 0.3,\ 0 \times 0.9 \times 0.3) = \max(0.000864,\ 0.0027,\ 0) = 0.0027 \Rightarrow \psi_1(3) = 2$

$\delta_2(3) = \max(0.0144 \times 0.5 \times 0.4,\ 0.015 \times 0.0 \times 0.4,\ 0 \times 0.1 \times 0.4) = \max(0.00288,\ 0,\ 0) = 0.00288 \Rightarrow \psi_2(3) = 1$

$\delta_3(3) = \max(0.0144 \times 0.3 \times 0,\ 0.015 \times 0.4 \times 0,\ 0 \times 0 \times 0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_3(3) = 0$


$\delta_1(4) = \max(0.0027 \times 0.2 \times 0,\ 0.00288 \times 0.6 \times 0,\ 0 \times 0.9 \times 0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_1(4) = 0$

$\delta_2(4) = \max(0.0027 \times 0.5 \times 0,\ 0.00288 \times 0.0 \times 0,\ 0 \times 0.1 \times 0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_2(4) = 0$

$\delta_3(4) = \max(0.0027 \times 0.3 \times 0.2,\ 0.00288 \times 0.4 \times 0.2,\ 0 \times 0 \times 0.2) = \max(0.000162,\ 0.0002304,\ 0) = 0.0002304 \Rightarrow \psi_3(4) = 2$


$\delta_1(5) = \max(0 \times 0.2 \times 0.1,\ 0 \times 0.6 \times 0.1,\ 0.0002304 \times 0.9 \times 0.1) = \max(0,\ 0,\ 0.000020736) = 0.000020736 \Rightarrow \psi_1(5) = 3$

$\delta_2(5) = \max(0 \times 0.5 \times 0,\ 0 \times 0 \times 0,\ 0.0002304 \times 0.1 \times 0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_2(5) = 0$

$\delta_3(5) = \max(0 \times 0.3 \times 0,\ 0 \times 0.4 \times 0,\ 0.0002304 \times 0 \times 0) = \max(0,\ 0,\ 0) = 0 \Rightarrow \psi_3(5) = 0$


2. Most likely sequence of tags, recovered by following the back-pointers $\psi$ from the best final state $\delta_1(5)$:

Verb -> Noun -> Verb -> Prep -> Noun  <br>

Probability of this sequence: 0.000020736 <br>
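
The trellis can be double-checked with a few lines of Python. This is only a sketch: the parameter tables are given as images, so the values in `pi`, `trans`, and `emit` below are assumptions read off from the worked computations above (in particular, the emissions of "in" and "half" for Verb are taken to be 0, as in the solution). The names `viterbi`, `pi`, `trans`, and `emit` are illustrative.

```python
# Assumed HMM parameters, read off from the hand computation above.
states = ["Noun", "Verb", "Prep"]
pi = {"Noun": 0.5, "Verb": 0.3, "Prep": 0.2}
# trans[prev][cur] = P(cur | prev)
trans = {
    "Noun": {"Noun": 0.2, "Verb": 0.5, "Prep": 0.3},
    "Verb": {"Noun": 0.6, "Verb": 0.0, "Prep": 0.4},
    "Prep": {"Noun": 0.9, "Verb": 0.1, "Prep": 0.0},
}
# emit[state][word] = P(word | state), only for the words of the sentence
emit = {
    "Noun": {"Fed": 0.2, "cuts": 0.4, "rates": 0.3, "in": 0.0, "half": 0.1},
    "Verb": {"Fed": 0.2, "cuts": 0.3, "rates": 0.4, "in": 0.0, "half": 0.0},
    "Prep": {"Fed": 0.0, "cuts": 0.0, "rates": 0.0, "in": 0.2, "half": 0.0},
}


def viterbi(words):
    """Return (best tag sequence, its probability) for a list of words."""
    # Initialization: delta_i(1) = pi_i * b_i(w_1)
    delta = [{s: pi[s] * emit[s][words[0]] for s in states}]
    psi = [{s: None for s in states}]
    # Recursion: delta_i(t) = max_j delta_j(t-1) * a_ji * b_i(w_t)
    for t in range(1, len(words)):
        delta.append({})
        psi.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[t - 1][p] * trans[p][s])
            delta[t][s] = delta[t - 1][best_prev] * trans[best_prev][s] * emit[s][words[t]]
            psi[t][s] = best_prev
    # Termination and back-tracing through the stored back-pointers
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


path, prob = viterbi("Fed cuts rates in half".split())
print(" -> ".join(path), prob)
```

Under these assumed parameters the script recovers the same tag sequence as the hand computation, with a probability of about 2.0736e-05. Back-pointers of zero-score cells may differ from the table, since any predecessor is an equally valid choice there.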

# Implementation

## Background

Named entity recognition is the task of identifying references to named entities of certain types in text. We
use data presented in the CoNLL 2003 Shared Task (Tjong Kim Sang and De Meulder, 2003). An example
of the data is given below:

    Singapore NNP I-NP B-ORG
    Refining NNP I-NP I-ORG
    Company NNP I-NP I-ORG
    expected VBD I-VP O
    to TO I-VP O
    shut VB I-VP O
    CDU NNP I-NP B-ORG
    3 CD I-NP I-ORG
    . . O NONE O
    
There are four columns here: the word, the POS tag, the chunk bit (a form of shallow parsing—you can ignore this), and the column containing the NER tag. NER labels are given in a BIO tag scheme: beginning, inside, outside. In the example above, two named entities are present: Singapore Refining Company and CDU 3. O tags denote text not part of a named entity. B tags indicate the start of a named entity, and I tags indicate the continuation of the previous named entity. Both B and I tags are hyphenated and contain a type after the hyphen, which in this dataset is one of PER, ORG, LOC, or MISC. A B tag can immediately follow another B tag in the case where a one-word entity is followed immediately by another entity. However, note that an I tag can only follow an I tag or B tag of the same type.
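
To make the BIO scheme concrete, here is a minimal sketch of how a tag sequence can be converted into labeled chunks. The function `bio_to_chunks` is a hypothetical stand-in written for illustration; the course's own `chunks_from_bio_tag_seq` may behave differently, for example on invalid sequences. Chunks are returned as `(label, start, end)` tuples with an exclusive end index.

```python
from typing import List, Tuple


def bio_to_chunks(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (label, start, end) chunks, end exclusive."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open chunk only if the types match.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            chunks.append((label, start, i))  # close the open chunk
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new chunk
        # A stray I- tag with no matching open chunk is ignored in this sketch.
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks


tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-ORG", "I-ORG", "O"]
print(bio_to_chunks(tags))
```

For the example sentence above this yields one chunk covering "Singapore Refining Company" and one covering "CDU 3".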

An NER system’s job is to predict the NER chunks of an unseen sentence, i.e., predict the last column given the others. Output is typically evaluated according to chunk-level F-measure. To evaluate a single
sentence, let C denote the predicted set of labeled chunks, each represented by a tuple (label, start index, end
index), and let C* denote the gold set of chunks. We compute precision (P), recall (R), and F1 as follows:

$P = \displaystyle\frac{|C \cap C^*|}{|C|}$; $R = \displaystyle\frac{|C \cap C^*|}{|C^*|}$; $F_1 = \displaystyle\frac{2PR}{(P+R)}$

The gold labeled chunks from the example above are (ORG, 0, 3) and (ORG, 6, 8), using 0-based indexing
and half-open intervals (start index inclusive, end index exclusive).

To generalize to corpus-level evaluation, the numerators and denominators of precision and recall are aggregated across the corpus. State-of-the-art systems can get above 90 F1 on this dataset; we’ll be aiming to get close to this and build systems that can get in at least the mid-80s.
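
The corpus-level computation can be sketched as follows. This is an illustrative helper, not the course's `print_evaluation`: the function name `chunk_prf` and the representation of each sentence as a list of `(label, start, end)` tuples are assumptions made for the example.

```python
def chunk_prf(gold_sents, pred_sents):
    """Corpus-level chunk precision, recall and F1.

    Each sentence is a collection of (label, start, end) tuples.
    Counts are aggregated over all sentences before dividing.
    """
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold, pred = set(gold), set(pred)
        correct += len(gold & pred)  # |C ∩ C*|: label and span must both match
        n_pred += len(pred)          # |C|
        n_gold += len(gold)          # |C*|
    p = correct / n_pred if n_pred else 0.0
    r = correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [[("ORG", 0, 3), ("ORG", 6, 8)]]
pred = [[("ORG", 0, 3), ("ORG", 6, 7), ("LOC", 4, 5)]]
print(chunk_prf(gold, pred))
```

In this toy case only one of three predicted chunks matches one of the two gold chunks, so P = 1/3, R = 1/2, and F1 = 0.4. A chunk with the right label but the wrong boundaries, such as (ORG, 6, 7), earns no credit.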

In [3]:
from typing import List

import sklearn
import pycrfsuite

from sklearn.metrics import classification_report

from nerdata import Token, Chunk, LabeledSentence, chunks_from_bio_tag_seq, read_data, print_evaluation, print_output

print(sklearn.__version__)

ModuleNotFoundError: No module named 'nerdata'

In [3]:
# Load the training and development data.
train = read_data('../data/eng.train')
dev = read_data('../data/eng.testa')

# Load the stop-word list, one word per line (used by neClassifier below).
with open('../data/stopWordlist.txt', 'r', encoding='utf-8') as file:
    stopList = {line.strip('\n') for line in file}

# Here are a few sentences...
print("Examples of sentences:")
print(dev[1])
print(dev[3])
print(dev[5])

Examples of sentences:
['Token(CRICKET, NNP, I-NP)', 'Token(-, :, O)', 'Token(LEICESTERSHIRE, NNP, I-NP)', 'Token(TAKE, NNP, I-NP)', 'Token(OVER, IN, I-PP)', 'Token(AT, NNP, I-NP)', 'Token(TOP, NNP, I-NP)', 'Token(AFTER, NNP, I-NP)', 'Token(INNINGS, NNP, I-NP)', 'Token(VICTORY, NN, I-NP)', 'Token(., ., O)']
['(2, 3, ORG)']
['Token(West, NNP, I-NP)', 'Token(Indian, NNP, I-NP)', 'Token(all-rounder, NN, I-NP)', 'Token(Phil, NNP, I-NP)', 'Token(Simmons, NNP, I-NP)', 'Token(took, VBD, I-VP)', 'Token(four, CD, I-NP)', 'Token(for, IN, I-PP)', 'Token(38, CD, I-NP)', 'Token(on, IN, I-PP)', 'Token(Friday, NNP, I-NP)', 'Token(as, IN, I-PP)', 'Token(Leicestershire, NNP, I-NP)', 'Token(beat, VBD, I-VP)', 'Token(Somerset, NNP, I-NP)', 'Token(by, IN, I-PP)', 'Token(an, DT, I-NP)', 'Token(innings, NN, I-NP)', 'Token(and, CC, O)', 'Token(39, CD, I-NP)', 'Token(runs, NNS, I-NP)', 'Token(in, IN, I-PP)', 'Token(two, CD, I-NP)', 'Token(days, NNS, I-NP)', 'Token(to, TO, I-VP)', 'Token(take, VB, I-VP)', 'Tok

## Features

Next, define some features. To get full credit on the assignment, you should get a score of at least 85% F1 on the development set. Assignments falling short of this will be judged based on completeness and awarded partial credit accordingly. The instructors’ reference implementation was able to get over 88% F1; see if you can beat that!

In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used.

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.
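
As an illustration of the suffix and shape features mentioned above, here is a small sketch operating on plain strings. The helpers `word_shape` and `extra_features` and the feature names are hypothetical; whether they improve F1 on this dataset would have to be checked experimentally.

```python
import re


def word_shape(word: str) -> str:
    """Collapsed word shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol, e.g. 'Xxxxx' -> 'Xx'.
    return re.sub(r"(.)\1+", r"\1", shape)


def extra_features(word: str) -> list:
    """Suffix, shape and capitalization features for a single word."""
    return [
        "word.suffix3=" + word[-3:].lower(),
        "word.shape=" + word_shape(word),
        "word.istitle=%s" % word.istitle(),
    ]


print(extra_features("Somerset"))
print(word_shape("all-rounder"), word_shape("296"), word_shape("CDU"))
```

Feature strings of this form could be appended to the list returned by a per-token feature function, for the current token as well as for its neighbors.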

In [78]:
def my_features(sent, i):
    # YOUR CODE HERE
    word = sent.tokens[i].word
    postag = sent.tokens[i].pos
    # Features of the current token: capitalization heuristic, POS tag,
    # and simple surface properties.
    features = [
        'word.isNamedEnt=' + str(neClassifier(word)),
        'word.posTag=' + postag,
        'word.isNum=' + str(word.isdigit()),
        'word.isAlpha=' + str(word.isalpha()),
        'word.wordLen=' + str(len(word))
    ]
    if i > 0:
        word1 = sent.tokens[i - 1].word
        postag1 = sent.tokens[i-1].pos
        features.extend([
            '-1:word.isNamedEnt=' + str(neClassifier(word1)),
            '-1:word.posTag=' + postag1,
            '-1:word.isNum=' + str(word1.isdigit()),
            '-1:word.isAlpha=' + str(word1.isalpha()),
            '-1:word.wordLen=' + str(len(word1))
        ])
    else:
        features.append('BOS')
        
    if i < len(sent) - 1:
        word1 = sent.tokens[i + 1].word
        postag1 = sent.tokens[i + 1].pos
        features.extend([
            '+1:word.isNamedEnt=' + str(neClassifier(word1)),
            '+1:word.posTag=' + postag1,
            '+1:word.isNum=' + str(word1.isdigit()),
            '+1:word.isAlpha=' + str(word1.isalpha()),
            '+1:word.wordLen=' + str(len(word1))
        ])
    else:
        features.append('EOS') 
            
        
    return features

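def neClassifier(word):
    # Heuristic: a capitalized word that is not a stop word is a likely
    # named entity. The bool(word) guard avoids an IndexError on empty strings.
    return bool(word) and word[0].isupper() and word.lower() not in stopList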

def word2features(sent, i):
    word = sent.tokens[i].word
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word.isupper=%s' % word.isupper(),
    ]
    if i > 0:
        word1 = sent.tokens[i - 1].word
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
        ])
    else:
        features.append('BOS')
        
    if i < len(sent) - 1:
        word1 = sent.tokens[i + 1].word
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
        ])
    else:
        features.append('EOS')
                
    return features


def sent2features(sent):
    return [my_features(sent, i) + word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return sent.bio_tags

def sent2words(sent):
    return [token.word for token in sent.tokens]

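This shows a sample of the features created with `sent2features()`, which concatenates the output of `my_features()` and `word2features()`, for the first token of a sentence: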

In [79]:
sent2features(dev[5])[0]

['word.isNamedEnt=False',
 'word.posTag=IN',
 'word.isNum=False',
 'word.isAlpha=True',
 'word.wordLen=5',
 'BOS',
 '+1:word.isNamedEnt=False',
 '+1:word.posTag=VBG',
 '+1:word.isNum=False',
 '+1:word.isAlpha=True',
 '+1:word.wordLen=7',
 'bias',
 'word.lower=after',
 'word.isupper=False',
 'BOS',
 '+1:word.lower=bowling',
 '+1:word.isupper=False']

Map training and development sentences to their feature sets, one set of features for each position in the sentence.

In [80]:
%%time
X_train = [sent2features(s) for s in train]
y_train = [sent2labels(s) for s in train]

X_dev = [sent2features(s) for s in dev]
y_dev = [sent2labels(s) for s in dev]

Wall time: 2.31 s


## Train the CRF model

To train the model, we create `pycrfsuite.Trainer`, load the training data and call the `train()` method.

First, create `pycrfsuite.Trainer` and load the training data into the trainer object.

In [81]:
%%time
trainer = pycrfsuite.Trainer(verbose = False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Wall time: 2.7 s


Next, set the training parameters. We will use the default L-BFGS training algorithm with Elastic Net (L1 + L2) regularization.

In [82]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
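
The two coefficients set above enter the training objective as penalty terms. As a sketch, assuming the usual Elastic Net form (the exact scaling constants used inside CRFsuite may differ), training minimizes the regularized negative log-likelihood over the weight vector $w$, where $(x^{(n)}, y^{(n)})$ for $n = 1, \dots, N$ are the training sequences:

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) \; + \; c_1 \lVert w \rVert_1 \; + \; c_2 \lVert w \rVert_2^2
$$

A larger `c1` pushes more weights to exactly zero and gives a sparser model, while `c2` shrinks all weights smoothly toward zero.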

This displays a list of all the possible parameters for the default training algorithm.

In [83]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

Train the CRF model and save it into a file in the current folder.

In [84]:
%%time
trainer.train('conll2002-eng.crfmodel')

Wall time: 11.4 s


`pycrfsuite.Trainer.train` saves the model to a file:

In [97]:
!dir conll2002-eng.crfmodel

 Volume in drive C has no label.
 Volume Serial Number is 3A5C-23CD

 Directory of C:\Users\dell\Documents\PythonScripts\JupyterNB\hw07\code

29-11-2021  15:53           384,000 conll2002-eng.crfmodel
               1 File(s)        384,000 bytes
               0 Dir(s)  120,762,679,296 bytes free


We can also get information about the final state of the model by looking at the `pycrfsuite.Trainer.logparser`.

If we had tagged our input data using the optional `group` argument of `append`, and had used the optional `holdout` argument of `train`, the log would also contain information about the trainer's performance on the holdout set.

In [85]:
trainer.logparser.last_iteration

{'num': 50,
 'scores': {},
 'loss': 14462.099812,
 'feature_norm': 119.144445,
 'error_norm': 554.819342,
 'active_features': 6351,
 'linesearch_trials': 1,
 'linesearch_step': 1.0,
 'time': 0.201}

We can also get this information for every step using `pycrfsuite.Trainer.logparser.iterations`.

In [86]:
len(trainer.logparser.iterations), trainer.logparser.iterations[-1]

(50,
 {'num': 50,
  'scores': {},
  'loss': 14462.099812,
  'feature_norm': 119.144445,
  'error_norm': 554.819342,
  'active_features': 6351,
  'linesearch_trials': 1,
  'linesearch_step': 1.0,
  'time': 0.201})

## Make predictions on a sample sentence

To use the trained model, create a `pycrfsuite.Tagger` object, load the model into it, and use the `tag` method. Let's tag a sentence to see how it works.

In [87]:
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-eng.crfmodel')

example_sent = dev[5]
print(' '.join(sent2words(example_sent)), end = '\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended their first innings by 94 runs before being bowled out for 296 with England discard Andy Caddick taking three for 83 .

Predicted: O O B-ORG O O O O O O O O B-LOC I-LOC O B-ORG O O O O O O O O O O O O O O B-LOC O B-PER I-PER O O O O O
Correct:   O O B-ORG O O O O O O O O B-LOC I-LOC O B-ORG O O O O O O O O O O O O O O B-LOC O B-PER I-PER O O O O O


## CRF model evaluation on development data

Predict entity labels for all sentences in the development set.

In [88]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_dev]

Wall time: 575 ms


Assemble the predicted BIO token-level tags into NE label chunks and compare against the gold NE labels to compute Precision, Recall, and F1 measure.

In [89]:
pred = []
for i, sent in enumerate(dev):
    pred.append(LabeledSentence(sent.tokens, chunks_from_bio_tag_seq(y_pred[i])))

print_evaluation(dev, pred)

Labeled F1: 86.09, precision: 5037/5759 = 87.46, recall: 5037/5943 = 84.76


## Type-level evaluation of NE recognizer

Implement a function `fine_grained_evaluation(gold, pred)` that computes precision, recall, and F1 measure for each of the four named entity types (LOC, MISC, ORG, PER).

In [1]:
def fine_grained_evaluation(gold: List[LabeledSentence], pred: List[LabeledSentence]):
    results = {'LOC': (0, 0, 0), 'MISC': (0, 0, 0), 'ORG': (0, 0, 0), 'PER': (0, 0, 0), }
    
    # YOUR CODE HERE
    # Per-type counts: correctly predicted chunks, predicted chunks, gold chunks.
    value = {ne_type: {'correctTag': 0, 'predNum': 0, 'goldNum': 0} for ne_type in results}

    for g, p in zip(gold, pred):
        # Every predicted chunk counts toward precision; it is correct only if
        # an identical chunk (same label and span) appears in the gold sentence.
        for chunk in p.chunks:
            value[chunk.label]['predNum'] += 1
            if chunk in g.chunks:
                value[chunk.label]['correctTag'] += 1
        # Every gold chunk counts toward recall.
        for chunk in g.chunks:
            value[chunk.label]['goldNum'] += 1

    for ne_type, counts in value.items():
        precision = counts['correctTag'] / counts['predNum'] if counts['predNum'] else 0.0
        recall = counts['correctTag'] / counts['goldNum'] if counts['goldNum'] else 0.0
        # Guard against 0/0 when no chunk of this type was predicted correctly.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[ne_type] = (precision * 100, recall * 100, f1 * 100)

    return results

NameError: name 'List' is not defined

In [2]:
def opeval(gold: List[LabeledSentence], pred: List[LabeledSentence]):
    setx = zip(gold,pred)
    for a,b in setx:
        print(a.chunks)
        print(b.chunks)
        
opeval(dev, pred)

NameError: name 'List' is not defined

In [102]:
results = fine_grained_evaluation(dev, pred)
for ne_type in results:
    p, r, f1 = results[ne_type]
    print('Performance on', ne_type, 'is: P =', p, '; R =', r, '; F1 =', f1)

Performance on LOC is: P = 88.61439312567131 ; R = 89.82035928143712 ; F1 = 89.213300892133
Performance on MISC is: P = 87.63769889840881 ; R = 77.57313109425785 ; F1 = 82.29885057471263
Performance on ORG is: P = 84.50244698205547 ; R = 77.25577926920208 ; F1 = 80.71679002726918
Performance on PER is: P = 88.18770226537217 ; R = 88.76221498371335 ; F1 = 88.47402597402598


## Save output on blind test data

Once you are done with feature engineering on development data, run the trained CRF model on the blind test data and save the output into a file. 

In [103]:
%%time
test = read_data('../data/eng.testb.blind')
X_test = [sent2features(s) for s in test]

y_pred = [tagger.tag(xseq) for xseq in X_test]
pred = []
for i, sent in enumerate(test):
    pred.append(LabeledSentence(sent.tokens, chunks_from_bio_tag_seq(y_pred[i])))

print_output(pred, '../data/eng.testb.out')

Wrote predictions on 3684 labeled sentences to ../data/eng.testb.out
Wall time: 2.16 s


## [5111] Comparison: Conditional Random Fields vs. Logistic Regression

Train and evaluate a NE recognizer on the same dataset using Logistic Regression. Use the same features as in the CRF model. Compare the performance of the two models and explain any observed difference in performance. It is recommended that you do this in a separate notebook file called *lr-ner.ipynb*, with the output saved as PDF in *lr-ner.pdf*.

*This portion is mandatory for graduate students. Undergraduate students who implement this will get a substantial number of bonus points*.

## Bonus points ##
Anything extra goes here. For example:

- **Features** Try features on POS tags and other improvements to the feature set. If you do this, you should do more than just add a few new features: add detailed quantitative analysis and cite some examples showing what helps and what doesn’t.

- **German** You also have access to German NER data—does the system perform well on this data? Can you add features or change it to get better performance? Note that simply running your model on this dataset and reporting results does not constitute a substantial extension.

- **Sentence boundaries** The "invalid tag sequence" warnings are caused by I tags predicted at the beginning of sentences. One way of helping the system learn that sentences do not start with an I tag is to pad each sentence with a special `<beg>` token at the beginning and `<end>` token at the end, both tagged as O. Implement this data transformation and train and evaluate the CRF model on this data.
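
The sentence-boundary padding described in the last item can be sketched on plain lists of words and tags. The helpers `pad_sentence` and `unpad_tags` are hypothetical; adapting them to the `LabeledSentence` and `Token` classes of `nerdata` (for instance, choosing a POS tag for the padding tokens) is left open here.

```python
def pad_sentence(words, tags, beg="<beg>", end="<end>"):
    """Add boundary tokens, both tagged O, around a sentence."""
    padded_words = [beg] + list(words) + [end]
    padded_tags = ["O"] + list(tags) + ["O"]
    return padded_words, padded_tags


def unpad_tags(tags):
    """Drop the tags predicted for the two boundary tokens."""
    return list(tags[1:-1])


words = ["Singapore", "Refining", "Company", "expected"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O"]
padded_words, padded_tags = pad_sentence(words, tags)
print(padded_words)
print(padded_tags)
```

After tagging a padded sentence, the boundary predictions should be removed with `unpad_tags` before chunks are extracted, so that chunk indices still refer to the original tokens.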

## Analysis ##
Include an analysis of the results that you obtained in the experiments above. Present results from both the basic CRF model as well as with different features and system variants as part of your extension, and optionally discuss error cases addressed by your extension or describe how the system could be further improved.