# Lab4a: Introduction to Named Entity Recognition and Classification

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In Lab4a, we focus on Named Entity Recognition and Classification (NERC).

We assume that you have studied the notebooks of Lab1 and Lab2 before you start this lab. Lab2 is especially important for the assignment of this lab.

**At the end of this notebook, you will be able to**:
* understand the IOB format used to format NERC data
* load existing data sets in the IOB format to use them for training a classifier

The following notebooks should be studied in this order:

* Lab4a.1-NERC-introduction.ipynb (this notebook)
* Lab4a-Assignment.ipynb


In this notebook, we briefly explain some basics about NERC and the typical data formats used.


## 1. NERC
In Named Entity Recognition and Classification, the goal is to determine which noun phrases refer to named entities and to classify them.
Named entities can be persons, locations, organizations, etc. (see [NLTK Chapter 7, Section 5](https://www.nltk.org/book/ch07.html) for more information on the task).

It is not trivial to represent NERC data in a way that makes it easy to both train and evaluate NLP systems. One of the most used formats is called [Inside–outside–beginning (IOB)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Let's look at an example from one of the most popular datasets, which is [CoNLL-2003](http://aclweb.org/anthology/W03-0419).
```
Germany NNP B-NP B-LOC
's POS B-NP O
representative NN I-NP O
to TO B-PP O
the DT B-NP O
European NNP I-NP B-ORG
Union NNP I-NP I-ORG
```

The first observation is that all information is represented at the **token-level**. For each token, e.g., *Germany*, we receive information about:
* **the word**: e.g., *Germany*
* **the part of speech**: e.g., *NNP* (from [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))
* **the phrase type**: e.g., a noun phrase
* **the NERC label**: e.g., a location (LOC).

This example contains two named entities: *Germany* and *European Union*.

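The label of the first token of every named entity is prefixed with *B-*. The label of every token after that, e.g., *Union* in *European Union*, is prefixed with *I-*. Tokens that are not part of a named entity get the label *O* (outside).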

Please note that the IOB format is at the **token-level**, which means that we also are going to train and evaluate an NLP system at the token-level! The goal will hence not be to classify *European Union* as an *Organization*, but to classify:
* *European* as the first token of an entity that is an *Organization*
* *Union* as a token inside of an entity that is an *Organization*
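
To make the relation between token-level tags and full entities concrete, here is a minimal sketch that groups the IOB tags of the example above back into entities. The function name `iob_to_entities` is our own choice for illustration and is not part of any library.

```python
def iob_to_entities(tagged_tokens):
    """Group (token, IOB tag) pairs into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        if tag.startswith('B-'):
            # a B- tag always opens a new entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # an O tag (or an inconsistent I- tag) closes the open entity
            if current_tokens:
                entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


example = [('Germany', 'B-LOC'), ("'s", 'O'), ('representative', 'O'),
           ('to', 'O'), ('the', 'O'), ('European', 'B-ORG'), ('Union', 'I-ORG')]
print(iob_to_entities(example))
# [('Germany', 'LOC'), ('European Union', 'ORG')]
```

A classifier only ever sees and predicts the individual tags; a grouping step like this one is only needed if you want to read off the entities afterwards.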

Please make sure you understand the format before you proceed ;)

Despite the token-level representation of features and tags in the CoNLL format, we can consider NERC as a typical sequence tagging task. There is a strong dependency within a sentence across words and their tags. A sequence of words (a phrase) predicts the next sequence of words, and the same can be said for a sequence of tags, whether or not they reflect named entity expressions. On top of that, there can be hard constraints: in an IOB annotation, an I-tag cannot follow an I-tag of a different type.
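
The hard constraint can be expressed as a small check. This is only a sketch, assuming the convention described above in which every entity starts with a *B-* tag; the function name `is_valid_iob` is hypothetical.

```python
def is_valid_iob(tags):
    """Check that every I- tag continues an entity of the same type."""
    previous = 'O'
    for tag in tags:
        if tag.startswith('I-'):
            entity_type = tag[2:]
            # the previous tag must be B- or I- with the same entity type
            if previous not in ('B-' + entity_type, 'I-' + entity_type):
                return False
        previous = tag
    return True


print(is_valid_iob(['B-ORG', 'I-ORG', 'O']))  # True
print(is_valid_iob(['B-ORG', 'I-LOC', 'O']))  # False
```

A classifier that labels each token independently can produce sequences that fail this check, which is one reason to model the tags as a sequence.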

Machine learning approaches to NERC typically exploit these sequence dependencies by learning which tag sequences are most probable and then selecting, for a given sentence, the most probable sequence among all possible tag sequences using the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm). A conditional random field (CRF) classifier exploits such dependencies and allows you to define as many features as you like.
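
To give an impression of how the Viterbi algorithm picks the best tag sequence, here is a minimal sketch in plain Python. All scores below are made up for illustration (a real CRF would learn them from data), and the function and variable names are our own.

```python
def viterbi(emission_scores, transition_scores, tags):
    """Return the highest-scoring tag sequence.

    emission_scores: list of dicts, one per token, mapping tag -> score
    transition_scores: dict mapping (previous tag, tag) -> score
    Scores are added up, so they can be read as log-probabilities.
    """
    # best[i][tag] = score of the best path that ends in `tag` at position i
    best = [{tag: emission_scores[0][tag] for tag in tags}]
    backpointers = []
    for scores in emission_scores[1:]:
        column, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transition_scores[(p, tag)])
            column[tag] = best[-1][prev] + transition_scores[(prev, tag)] + scores[tag]
            pointers[tag] = prev
        best.append(column)
        backpointers.append(pointers)
    # follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


tags = ['O', 'B-ORG', 'I-ORG']
# toy scores for the tokens "the European Union" (invented for this example)
emissions = [{'O': 0.0, 'B-ORG': -5.0, 'I-ORG': -5.0},
             {'O': -2.0, 'B-ORG': -1.5, 'I-ORG': -1.0},
             {'O': -3.0, 'B-ORG': -3.0, 'I-ORG': -0.5}]
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[('O', 'I-ORG')] = -100.0  # strongly penalise an I-tag directly after O

greedy = [max(tags, key=lambda t: scores[t]) for scores in emissions]
print(greedy)                                  # ['O', 'I-ORG', 'I-ORG']
print(viterbi(emissions, transitions, tags))   # ['O', 'B-ORG', 'I-ORG']
```

Choosing the best tag for each token separately yields an I-tag directly after *O*, whereas the sequence-level search returns a well-formed sequence.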

## 2. NERC datasets

Now that we've seen how to represent linguistic features, we also need to access real linguistic training data for the NERC task. In this section, we will look at large data sets that have been created by the community in which people have been annotating entities. In the assignment, you will use this data to train and test models that give a realistic performance.

Here, we will load two NERC datasets and quickly inspect their contents.

**Preparation**: Please download the .zip file with the two datasets from [this link](https://drive.google.com/drive/folders/1LChp70lMFNL9BtNi0nVy-V-fCRkEobgr?usp=sharing).

Then unpack the .zip, so that the folder `nerc_datasets` is created in the same directory as this notebook. If you want to store it elsewhere, you can do that but need to adapt the path in the calls below.

### 2.1 CoNLL-2003

One of the most popular datasets is [CoNLL-2003](http://aclweb.org/anthology/W03-0419), which is included in the zip file you just downloaded. You can open the file "train.txt" in a text editor to inspect its content:

````
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O
````

It follows the IOB format with one token per line, followed by columns with the PoS, the chunk (constituent) tag and the IOB entity tag. You can check the "test.txt" file to see that it has a similar format.
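
To see what this format amounts to, the following sketch reads a few lines in this layout with plain Python. It assumes whitespace-separated columns and blank lines between sentences, as in the excerpt above; `read_conll` and the dictionary keys are names we chose for illustration.

```python
def read_conll(text):
    """Split CoNLL-style text into sentences of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # a blank line marks the end of a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith('-DOCSTART-'):
            continue  # document boundary marker, not a real token
        word, pos, chunk, ne_label = line.split()
        current.append({'word': word, 'pos': pos, 'chunk': chunk, 'ne': ne_label})
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences = read_conll(sample)
print(len(sentences))   # 2
print(sentences[1])
```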

You can load it using the following code snippet, which uses the NLTK class ConllCorpusReader to do the magic. More information on the ConllCorpusReader can be found here: https://www.nltk.org/_modules/nltk/corpus/reader/conll.html

We pass three arguments:

* the path to the folder where CoNLL-2003 is stored (a local folder in my case)
* the name of the file that will be loaded from that folder
* labels for the columns that are expected in the input file; note that we label the third column 'ignore' and the fourth column (the entity tag) 'chunk', so that `iob_words()` returns the entity tag together with the word and the PoS

We store the result in a variable with the name 'train', which is of the type 'nltk.corpus.reader.conll.ConllCorpusReader'.

In [1]:
from nltk.corpus.reader import ConllCorpusReader

# adapt the path to point to the local copy of the nerc_datasets folder
train = ConllCorpusReader('/Users/piek/Desktop/ONDERWIJS/data/nerc_datasets/CONLL2003',
                          'train.txt',  # this will load the file 'train.txt'; for the exercise you also need to load 'test.txt'
                          ['words', 'pos', 'ignore', 'chunk'])




We can use 'dir' to see that it has many attributes and methods, which correspond to the many different features that can be found in CoNLL data.

In [None]:
dir(train)

We are for now only interested in the token, the pos and the ne_label. Let's check the first one in train:

In [None]:
for token, pos, ne_label in train.iob_words():
    print(token, pos, ne_label) # please represent this information using a dictionary for the feature representation
    break

EU NNP B-ORG


We can for example iterate through this data, and make a list of the tokens as inputs, and of the `ne_label` values as desirable outputs. 
The input tokens could for example be looked up in a word embeddings dictionary (see Lab2 for details on word embeddings).

In [None]:
import gensim
##### Adapt the path to point to your local copy of the Google embeddings model
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('/Users/piek/Desktop/ONDERWIJS/data/model/GoogleNews-vectors-negative300.bin', binary=True)  

In [None]:
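input_vectors = []
labels = []
for token, pos, ne_label in train.iob_words():
    # Note: the document marker in train.txt is written '-DOCSTART-', so the check below does not filter it out
    if token != '' and token != 'DOCSTART':
        if token in word_embedding_model:
            vector = word_embedding_model[token]
        else:
            # unknown words are represented by a 300-dimensional zero vector
            vector = [0] * 300
        input_vectors.append(vector)
        labels.append(ne_label)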

We have successfully loaded our data. Let's see how many tokens/labels we have:

In [None]:
print(len(labels))

203621


In [None]:
print('First ten labels =', labels[:10])

First ten labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'B-PER']


The list input_vectors should have the same length:

In [None]:
print(len(input_vectors))

203621


In a next step, we could train a model on this data by passing the input vectors and the labels to a fit function.
You will see that training the classifier takes quite some time on this data set, which has over 200K instances.
On my machine it took about 5 minutes.

In [None]:
from sklearn import svm

lin_clf = svm.LinearSVC()

In [None]:
lin_clf.fit(input_vectors, labels)

If you want to apply this classifier to a data set for testing, you need to apply the same vectorization procedure as you have followed for the training data.
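
One way to make sure that training and test data are vectorized identically is to put the procedure in a single function and call it for both. The sketch below assumes the same zero-vector fallback as used above; the helper `vectorize` and the toy dictionary that stands in for the word embedding model are hypothetical.

```python
def vectorize(tokens, embedding_model, dimensions=300):
    """Map each token to its embedding, or to a zero vector if it is unknown."""
    vectors = []
    for token in tokens:
        if token in embedding_model:
            vectors.append(embedding_model[token])
        else:
            vectors.append([0] * dimensions)
    return vectors


# a plain dictionary with 3-dimensional vectors stands in for the embedding model
toy_model = {'EU': [0.1, 0.2, 0.3], 'rejects': [0.4, 0.5, 0.6]}
train_vectors = vectorize(['EU', 'rejects'], toy_model, dimensions=3)
test_vectors = vectorize(['EU', 'unknownword'], toy_model, dimensions=3)
print(test_vectors)  # [[0.1, 0.2, 0.3], [0, 0, 0]]
```

With the real model you would pass `word_embedding_model` and keep the default of 300 dimensions.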

Before you apply a classifier to a data set, it is important to know the data set and especially how the labels are distributed. In other words, how often does each label occur among the tokens of the human-annotated data set?

This tells you how frequent or rare certain data categories are and how challenging it is for a system to learn and predict each category.

Because we have created a list of labels from our data, we can use the *Counter* class from Python's collections module to get the statistics:

In [None]:
from collections import Counter 
print(Counter(labels))

Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155})


This clearly shows that most tokens get the label *O*, while the counts for the actual entity labels range between 1155 and 7140.
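
Relative frequencies make the imbalance easier to read. The sketch below simply copies the counts from the output above and prints each label's share of all tokens.

```python
from collections import Counter

# counts copied from the Counter output above
label_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})
total = sum(label_counts.values())
for label, count in label_counts.most_common():
    # print the label, its absolute count and its share of all tokens
    print(f'{label:7s} {count:7d} {count / total:6.1%}')
```

Roughly 83% of the tokens are labelled *O*, so a system that always predicts *O* would already be right for most tokens without finding a single entity. Keep this in mind when you interpret accuracy scores.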

### 2.2 Kaggle
[*Kaggle*](https://www.kaggle.com/docs) is an online platform for sharing data and hosting competitions. It hosts thousands of datasets and frequently releases new data and challenges. We are going to have a quick look at the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) that is hosted there and that is also included in the zip file you downloaded as two so-called CSV files: ner.csv and ner_v2.csv. CSV stands for comma-separated values, a commonly used format for exchanging, e.g., Excel or other spreadsheet data as text files. Each data instance is represented on a separate line, with its values separated by commas. Another format is tab-separated values (TSV), in which tabs are used as separators, as in the CoNLL formats. People often store TSV data in files with the extension ".csv", so it is good practice to check the actual content to see which separator is used. The first line of a CSV or TSV file is usually a header that labels the columns.
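
A quick way to check which separator a file uses is to count candidate characters in its header line. The function below is a simple heuristic sketch with a hypothetical name, not a robust format detector.

```python
def guess_separator(header_line, candidates=(',', '\t', ';')):
    """Return the candidate separator that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


print(repr(guess_separator('id,lemma,pos,word,tag')))      # ','
print(repr(guess_separator('id\tlemma\tpos\tword\ttag')))  # '\t'
```

For a real file you would read its first line with `open(path).readline()` and pass that to the function. If the file turns out to be tab-separated, `pandas.read_csv` accepts the separator through its `sep` parameter.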

The [*pandas*](https://pandas.pydata.org) package is a powerful package to handle data in various formats. You can check the website for details and documentation. Here we are going to use it to inspect the data.

Let's load the data in the following way:

In [None]:
import pandas

# Adapt the path to point to your local copy of the nerc_datasets
path = '/Users/piek/Desktop/ONDERWIJS/data/nerc_datasets/kaggle/ner_v2.csv'
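# Note: error_bad_lines was removed in pandas 2.0; with newer versions use on_bad_lines='warn' instead
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)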

b'Skipping line 281837: expected 25 fields, saw 34\n'


You will see the following output after running the above code cell:
```
b'Skipping line 281837: expected 25 fields, saw 34\n'
```
You can ignore this.

**pandas.read_csv** will load the csv file into a [pandas DataFrame](https://towardsdatascience.com/pandas-dataframe-a-lightweight-intro-680e3a212b96).

You can inspect which columns are in the csv file by running the following code cell:

In [None]:
kaggle_dataset.columns

Index(['id', 'lemma', 'next-lemma', 'next-next-lemma', 'next-next-pos',
       'next-next-shape', 'next-next-word', 'next-pos', 'next-shape',
       'next-word', 'pos', 'prev-iob', 'prev-lemma', 'prev-pos',
       'prev-prev-iob', 'prev-prev-lemma', 'prev-prev-pos', 'prev-prev-shape',
       'prev-prev-word', 'prev-shape', 'prev-word', 'sentence_idx', 'shape',
       'word', 'tag'],
      dtype='object')

You can see that a wide range of features is given for each token. [Here](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus), you can read what each column represents.

You can loop through the dataset in the following way:

In [None]:
for index, instance in kaggle_dataset.iterrows():
    print()
    print(index)
    print(instance) # you can access information by using instance['A COLUMN NAME'] which you can use to convert to a dictionary needed for the feature representation.
    print('NERC label', instance['tag'])
    break


0
id                             0
lemma                   thousand
next-lemma                    of
next-next-lemma         demonstr
next-next-pos                NNS
next-next-shape        lowercase
next-next-word     demonstrators
next-pos                      IN
next-shape             lowercase
next-word                     of
pos                          NNS
prev-iob              __START1__
prev-lemma            __start1__
prev-pos              __START1__
prev-prev-iob         __START2__
prev-prev-lemma       __start2__
prev-prev-pos         __START2__
prev-prev-shape         wildcard
prev-prev-word        __START2__
prev-shape              wildcard
prev-word             __START1__
sentence_idx                   1
shape                capitalized
word                   Thousands
tag                            O
Name: 0, dtype: object
NERC label O


You can see that each token has many different features that people have considered useful for the task of NERC. In addition to the usual suspects that we saw before, each token also has features indicating the previous and next words and their PoS, but also the shape of the word (upper- and lower-case patterns), and even the previous IOB tags.

We could use all these features as inputs in a machine learning model with our DictVectorizer, or by transforming them using embeddings if the values are words.
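
As a minimal sketch of the first option, the function below selects a few columns of one row and returns them as a feature dictionary. The choice of columns and the name `row_to_features` are only an illustration, and a plain dictionary with values copied from the first row shown above stands in for the pandas row.

```python
def row_to_features(row, feature_names=('word', 'pos', 'shape', 'prev-word', 'next-word')):
    """Select a subset of the columns of one row as a feature dictionary."""
    return {name: row[name] for name in feature_names}


# values copied from the first row shown above
first_row = {'word': 'Thousands', 'pos': 'NNS', 'shape': 'capitalized',
             'prev-word': '__START1__', 'next-word': 'of', 'tag': 'O'}
features = row_to_features(first_row)
label = first_row['tag']
print(features, label)
```

A list of such dictionaries can be passed to the `fit_transform` method of scikit-learn's `DictVectorizer`, with the corresponding `tag` values serving as the labels.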

## End of this notebook