# Named Entity Recognition with MIT Restaurant Dataset

Your name: Ha Trung Tin

Student ID: BI11-261

**Due: 23:59 19/3/2023**

## Task Description

In this assignment, you will train a NER model using Conditional Random Fields (CRF) on the MIT Restaurant training data and report the accuracy of your model on the test dataset.

You will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/) for this task.

## How to submit

- Attach the notebook file (.ipynb) and submit your work to Google Classroom
- Name your file as YourName_StudentID_Assignment4.ipynb. E.g., Nguyen_Van_A_ST099834_Assignment4.ipynb
- Write your name and student ID into this notebook
- Copying others' assignments is strictly prohibited.


## Install python-crfsuite

In [1]:
!pip install -q python-crfsuite


## Imports

In [2]:
from itertools import chain
import pycrfsuite

## Dataset

We will use the [MIT Restaurant Dataset](https://groups.csail.mit.edu/sls/downloads/restaurant/).

The dataset is already in a CoNLL-style BIO format, with one tab-separated tag and word per line. We will use the [train](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio) data to create the NER model and evaluate the model on the [test](https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio) data.

### Download data

In [3]:
%%capture
!rm -f restauranttrain.bio
!rm -f restauranttest.bio

!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttest.bio
!wget https://groups.csail.mit.edu/sls/downloads/restaurant/restauranttrain.bio

## Loading data (30 points)

In this part, you will load a data file into a list of sentences. Each sentence is a list of (word, tag) tuples.

**Note: Blank lines are used to separate sentences.**

For instance, the sentence below will be loaded into a list of seven (word, tag) tuples:

```
O	a
B-Rating	four
I-Rating	star
O	restaurant
B-Location	with
I-Location	a
B-Amenity	bar
```
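As a minimal sketch of the target structure, the snippet below turns that sample into a list of (word, tag) tuples. The names `sample` and `sentence` are only illustrative, and the snippet handles a single sentence, not a whole file with blank-line separators.

```python
# The sample above as raw file text: one "tag<TAB>word" pair per line.
sample = (
    "O\ta\n"
    "B-Rating\tfour\n"
    "I-Rating\tstar\n"
    "O\trestaurant\n"
    "B-Location\twith\n"
    "I-Location\ta\n"
    "B-Amenity\tbar\n"
)

# Split each non-empty line on the tab and swap the fields into (word, tag) order.
sentence = [tuple(reversed(line.split("\t"))) for line in sample.splitlines() if line]
print(sentence)
```

Note that the file stores the tag first and the word second, so the two fields have to be swapped when building the tuples.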

You will complete the function below

In [4]:
## Add necessary import here
def load_data(file_path):
    """Load a data file into a list of sentences.

    Args:
        file_path (str): Path to data

    Returns:
        sentences: list of sentences, each a list of (word, tag) tuples
    """
    sentences = []
    
    #TODO: Write your code here
    with open(file_path, "r") as f:
        sentence = []
        for line in f:
            line = line.strip()
            if line:
                tag, word = line.split('\t')
                sentence.append((word, tag))
            else:
                if sentence:
                    sentences.append(sentence)
                    sentence = []

        if sentence:
            sentences.append(sentence)

    return sentences

In [5]:
train_sents = load_data('restauranttrain.bio')
test_sents = load_data('restauranttest.bio')

Let's check the number of sentences in train and test data

In [6]:
len(train_sents)

7660

In [7]:
len(test_sents)

1521

In [8]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

## Features (50 points)

You can extract as many features as you want. You will implement the following basic features.

※ Of course, you can add more features.

*Word identity (lowercase)*

- Previous word identity
- Current word identity
- Next word identity
- Previous word and current word combination. Concatenate the previous word and the current word with '||'
- Current word and next word combination. Concatenate the two words with '||'

*Word shapes*

- Word prefix and suffix (4 characters)
- Whether the first character of the current word is a capital letter

**All you need to do is to complete the function `word2features`.**

In [12]:
def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    # Neighbouring words, lowercased; empty string at the sentence boundaries
    prev_word = '' if i == 0 else sentence[i-1].lower()
    next_word = '' if i == len(sentence)-1 else sentence[i+1].lower()
    features = {
        #TODO: Write your features here
        # Word identity (lowercase)
        'word.lower()': word.lower(),
        'prev_word.lower()': prev_word,
        'next_word.lower()': next_word,
        # Word combinations joined by '||'
        'prev_word||word.lower()': f'{prev_word}||{word.lower()}',
        'word.lower()||next_word.lower()': f'{word.lower()}||{next_word}',
        # Word shapes; word[:1] avoids an IndexError on an empty token
        'word_prefix': word[:4],
        'word_suffix': word[-4:],
        'is_capitalized': word[:1].isupper(),
    }
    
    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, tag)
    """    
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [token for token, _ in sentence]

Let's try to extract features for the first sentence

In [13]:
train_sents[0]

[('2', 'B-Rating'),
 ('start', 'I-Rating'),
 ('restaurants', 'O'),
 ('with', 'O'),
 ('inside', 'B-Amenity'),
 ('dining', 'I-Amenity')]

In [14]:
sent2features(untag(train_sents[0]))[0]

{'word.lower()': '2',
 'prev_word.lower()': '',
 'next_word.lower()': 'start',
 'prev_word||word.lower()': '||2',
 'word.lower()||next_word.lower()': '2||start',
 'word_prefix': '2',
 'word_suffix': '2',
 'is_capitalized': False}
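The samples shown in this notebook are all lowercase, so `is_capitalized` is unlikely to carry much signal here. As the features section notes, more features can be added. Below is a small sketch of a few optional extras; the function name `extra_features` and all of its feature keys are hypothetical and not part of the assignment.

```python
def extra_features(sentence, i):
    """Hypothetical add-on features for the i-th word of a list of words."""
    word = sentence[i]
    return {
        'is_first': i == 0,                  # word starts the sentence
        'is_last': i == len(sentence) - 1,   # word ends the sentence
        'is_digit': word.isdigit(),          # e.g. '2' in '2 start restaurants'
        'has_hyphen': '-' in word,
        'word_length': len(word),            # numeric value, used as a feature weight
    }


words = ['2', 'start', 'restaurants', 'with', 'inside', 'dining']
print(extra_features(words, 0))
```

One way to use these would be to call `features.update(extra_features(sentence, i))` inside `word2features` before returning. Whether they improve the scores would need to be checked on the test data.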

### Create train/test data

In [15]:
X_train = [sent2features(untag(s)) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(untag(s)) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

In [16]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 683 ms, sys: 22.2 ms, total: 705 ms
Wall time: 708 ms


In [17]:
#@title Set model parameters

max_iterations = 50 #@param[50, 20, 100]

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': max_iterations,

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
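For orientation, here is a sketch of what `c1` and `c2` control, assuming the trainer's default L-BFGS algorithm. Training minimizes the negative log-likelihood of the training sequences plus an L1 and an L2 penalty on the feature weights $\mathbf{w}$. The formula follows the general description of these two coefficients and is not taken from this notebook, so treat the exact scaling as an assumption.

$$
\min_{\mathbf{w}} \; -\sum_{n=1}^{N} \log p\left(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \mathbf{w}\right) \;+\; c_1 \lVert \mathbf{w} \rVert_1 \;+\; c_2 \lVert \mathbf{w} \rVert_2^2
$$

With `c1 = 1.0` and `c2 = 1e-3`, the L1 term dominates, which tends to drive many feature weights to exactly zero and keeps the model sparse.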

In [18]:
%%time
trainer.train('mitrestaurant.crfsuite')

CPU times: user 7.31 s, sys: 38 ms, total: 7.35 s
Wall time: 7.4 s


## Evaluation (20 points)

We will use the [seqeval](https://github.com/chakki-works/seqeval) package to evaluate the NER results.

In [19]:
!pip install -q seqeval[cpu]



### Make Predictions

In [20]:
tagger = pycrfsuite.Tagger()
tagger.open('mitrestaurant.crfsuite')

<contextlib.closing at 0x7fc4c2d23700>

In [21]:
example_sent = test_sents[0]
example_sent

[('a', 'O'),
 ('four', 'B-Rating'),
 ('star', 'I-Rating'),
 ('restaurant', 'O'),
 ('with', 'B-Location'),
 ('a', 'I-Location'),
 ('bar', 'B-Amenity')]

In [22]:
print("Predicted:", ' '.join(tagger.tag(sent2features(untag(example_sent)))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Predicted: O B-Rating I-Rating O O O B-Amenity
Correct:   O B-Rating I-Rating O B-Location I-Location B-Amenity


In [23]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 84.5 ms, sys: 4.03 ms, total: 88.6 ms
Wall time: 89 ms


In [24]:
from seqeval.metrics import classification_report

print(classification_report(y_test, y_pred))

                 precision    recall  f1-score   support

        Amenity       0.71      0.65      0.68       533
        Cuisine       0.84      0.81      0.83       532
           Dish       0.78      0.72      0.75       288
          Hours       0.73      0.65      0.69       212
       Location       0.82      0.80      0.81       812
          Price       0.80      0.81      0.80       171
         Rating       0.79      0.77      0.78       201
Restaurant_Name       0.78      0.75      0.77       402

      micro avg       0.79      0.75      0.77      3151
      macro avg       0.78      0.75      0.76      3151
   weighted avg       0.79      0.75      0.77      3151
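The report above is computed over whole entities, not individual tags. As a rough, simplified sketch of that idea (not seqeval's actual implementation), the hypothetical helper `bio_to_spans` below collects entity spans from a BIO sequence and scores the example sentence from the prediction step.

```python
def bio_to_spans(tags):
    """Collect (label, start, end) spans from a BIO tag list (end is inclusive)."""
    spans = set()
    label, start = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last open span
        # An open span ends on "O", on a new "B-", or on an "I-" with another label.
        if label is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            spans.add((label, start, i - 1))
            label = None
        # A span opens on "B-", or on a stray "I-" when no span is open.
        if tag != "O" and label is None:
            label, start = tag[2:], i
    return spans


gold = "O B-Rating I-Rating O B-Location I-Location B-Amenity".split()
pred = "O B-Rating I-Rating O O O B-Amenity".split()

gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
tp = len(gold_spans & pred_spans)  # spans matching in label and both boundaries

precision = tp / len(pred_spans)
recall = tp / len(gold_spans)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For this sentence the model finds two of the three gold entities and predicts nothing spurious, so precision is 1.00, recall is 0.67, and F1 is 0.80. An entity only counts as correct when both its label and its boundaries match.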



# References

1. [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets)
2. [sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system)
3. [Quick Recipe: Build a POS tagger using a Conditional Random Field](https://nlpforhackers.io/crf-pos-tagger/)
4. [NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields](https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31)
5. [CRFsuite - Tutorial on Chunking Task](http://www.chokkan.org/software/crfsuite/tutorial.html)