<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# Conditional Random Field for NER

Please download the following dataset:

https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus#ner_dataset.csv



https://www.aitimejournal.com/@akshay.chavan/complete-tutorial-on-named-entity-recognition-ner-using-python-and-keras


In NLP, named entity recognition (NER) is the task of extracting relevant information from a large corpus and classifying the extracted entities into predefined categories such as location, organization, person name and so on. The example below is a simple one; more complex, domain-specific entity types can be defined depending on the problem at hand.

<img src='https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/3.png' width=800>

In this tutorial, we will use the following dataset:

https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus#ner_dataset.csv

This dataset is extracted from the GMB (Groningen Meaning Bank) corpus, which is tagged and annotated specifically for training classifiers to predict named entities such as names, locations, etc.
All entities are labeled using the BIO scheme, where each entity label is prefixed with either B- or I-. B- denotes the beginning of an entity and I- the inside of one. Words that are not of interest are labeled with the O tag.

<img src='https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/Capture1.png' width=350>
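To make the BIO scheme concrete, the following minimal sketch groups a tag sequence back into entity spans. The helper name `bio_to_spans` and the toy sentence are hypothetical, introduced only for illustration; they are not part of the dataset or of any library.

```python
def bio_to_spans(tags):
    """Collect (entity_type, start, end) spans from a BIO tag sequence.

    `end` is exclusive. Hypothetical helper, for illustration only.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            # Close the entity that was open, if any.
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
            # B- opens a new entity; a stray I- is leniently treated the same way.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    # Close an entity that runs until the end of the sentence.
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans


# Toy sentence (an assumption, not a row sequence copied from the corpus).
tokens = ["Thousands", "marched", "through", "London", "to", "protest", "the", "Labor", "Party"]
tags = ["O", "O", "O", "B-geo", "O", "O", "O", "B-org", "I-org"]

for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Here the single-token entity "London" is a span of length one, while "Labor Party" is recovered as one two-token entity because its second tag is I-org rather than B-org.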


Note: This notebook needs to downgrade scikit-learn to be compatible with sklearn-crfsuite, a CRFsuite (python-crfsuite) wrapper that provides the scikit-learn-compatible sklearn_crfsuite.CRF estimator.


In [1]:
# https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/60
!pip install -U 'scikit-learn<0.24'



In [2]:
!pip install -q sklearn-crfsuite


In [3]:
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_classification_report

Before loading the dataset, we need to mount our Google Drive folder:

In [4]:
from google.colab import drive
drive.mount("/content/drive/")

PATH = "/content/drive/My Drive/Colab Notebooks/data/ner/"


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).



## Loading the dataset

Now we can load the dataset. The 'Sentence #' column indicates the sentence number, and each sentence consists of words that are labeled using the BIO scheme in the 'Tag' column.


In [5]:
import pandas as pd
path_data=PATH+'ner_dataset.csv'

#Reading the csv file
df = pd.read_csv(path_data, encoding = "ISO-8859-1")
print(df.head(15))
print()
print('some statistics about the dataset:')
df.describe()


     Sentence #           Word  POS    Tag
0   Sentence: 1      Thousands  NNS      O
1           NaN             of   IN      O
2           NaN  demonstrators  NNS      O
3           NaN           have  VBP      O
4           NaN        marched  VBN      O
5           NaN        through   IN      O
6           NaN         London  NNP  B-geo
7           NaN             to   TO      O
8           NaN        protest   VB      O
9           NaN            the   DT      O
10          NaN            war   NN      O
11          NaN             in   IN      O
12          NaN           Iraq  NNP  B-geo
13          NaN            and   CC      O
14          NaN         demand   VB      O

some statistics about the dataset:


         Sentence #     Word      POS      Tag
count         47959  1048575  1048575  1048575
unique        47959    35178       42       17
top     Sentence: 1      the       NN        O
freq              1    52573   145807   887908




Observations:
- There are 47,959 sentences in the dataset.
- There are 35,178 unique words in the dataset.
- There are 17 labels (tags) in total.


Let us show the set of tags, which are the classes used to classify the tokens under the IOB scheme:





In [6]:
##Displaying the unique Tags
df['Tag'].unique()


array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)

There are many missing values in the 'Sentence #' column, because only the first token of each sentence carries the sentence number. We will therefore fill them with the pandas 'ffill' method, which propagates the last valid observation forward to the following rows.



In [7]:

#Checking null values, if any.
df.isnull().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

The pandas DataFrame.ffill() function fills the missing values in the dataframe. 'ffill' stands for 'forward fill' and propagates the last valid observation forward.

In [8]:
# Forward fill; equivalent to df.fillna(method='ffill'), which is deprecated in recent pandas
df = df.ffill()
df.head(100)

     Sentence #           Word  POS    Tag
0   Sentence: 1      Thousands  NNS      O
1   Sentence: 1             of   IN      O
2   Sentence: 1  demonstrators  NNS      O
3   Sentence: 1           have  VBP      O
4   Sentence: 1        marched  VBN      O
..          ...            ...  ...    ...
95  Sentence: 5             's  POS      O
96  Sentence: 5         ruling  VBG      O
97  Sentence: 5          Labor  NNP  B-org
98  Sentence: 5          Party  NNP  I-org
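The effect of forward fill can be sketched in plain Python. This is only an illustration of the idea, not how pandas implements it: the helper `forward_fill` is hypothetical, and `None` stands in for the NaN values pandas uses.

```python
def forward_fill(values, missing=None):
    """Replace each missing entry with the last non-missing value seen so far.

    Hypothetical helper mimicking the behaviour of forward fill on one column.
    Leading missing entries stay missing, since there is nothing to propagate.
    """
    filled = []
    last = missing
    for value in values:
        if value is not missing:
            last = value
        filled.append(last)
    return filled


# Toy column shaped like 'Sentence #': only the first token of a sentence is labeled.
column = ["Sentence: 1", None, None, "Sentence: 2", None]
print(forward_fill(column))
# ['Sentence: 1', 'Sentence: 1', 'Sentence: 1', 'Sentence: 2', 'Sentence: 2']
```

After the fill, every row carries the identifier of the sentence it belongs to, which is what makes grouping the rows by sentence possible.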


NER can be viewed as a token-level classification task. That is, the goal is to assign each token in the input sequence one of the classes {'O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat'}.

Therefore, each token to be classified must be represented as a set of features.

We define a class to represent the sentences. Each sentence is represented by a list of triples. Each triple represents a token together with its PoS tag and its IOB tag.

In [9]:
class Sentences(object):
    """Groups the token rows of the dataframe into sentences."""
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        # Turn the rows of one sentence into a list of (word, POS, tag) triples
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_next(self):
        """Return the next sentence, or None when there are no more."""
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

In [10]:
#Displaying one full sentence
getter = Sentences(df)
# rebuild the texts of sentences
text_sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
print(text_sentences[0])

i = 0
# show the tokens...
while i < 3:
    print("Sentence : ", i, getter.get_next())
    i += 1

sentences = getter.sentences

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .
Sentence :  0 [('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
Sentence :  1 [('Families', 'NNS', 'O'), ('of', 'IN', 'O'), ('soldiers', 'NNS', 'O'), ('killed', 'VBN', 'O'), ('in', 'IN', 'O'), ('the', 'DT', 'O'), ('conflict', 'NN', 'O'), ('joined', 'VBD', 'O'), ('the', 'DT', 'O'), ('protesters', 'NNS', 'O'), ('who', 'WP', 'O'), ('carried', 'VBD', 'O'), 

## Feature Preparation
To represent each token, we will use the most common feature set for the NER task, that is, the contextual information of the tokens around the token being represented.
In particular, we use a window of size 1 around the token to be represented.
For each token, we will use the following features:
- lowercase word.
- suffixes of length 2 and 3.
- isupper(): boolean indicating whether all the word's characters are uppercase.
- istitle(): whether the word starts with an uppercase letter followed by lowercase letters.
- isdigit(): whether the word consists only of digits.
- postag: morphological category (PoS tag).


In [11]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],  # suffix of length 3
        'word[-2:]': word[-2:],  # suffix of length 2
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [12]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

We must split the dataset into training and test sets:

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

## CRF

CRFs are used for predicting sequences: they use contextual information from neighboring positions in the sequence to help the model make a correct prediction for each element.

Below is the formula for CRF where y is the output variable and X is input sequence.



<img src='https://d2ueix13hy5h3i.cloudfront.net/wp-content/uploads/2019/06/CodeCogsEqn1-6.png' width=500>



The probability of the output sequence is modeled as a normalized product of factors, one per position, each built from the weighted feature functions.
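For reference, a common way of writing the linear-chain CRF is given below. This is the standard textbook form, and its notation is an assumption that may differ slightly from the image above: $f_j$ are the feature functions, $\lambda_j$ their learned weights, $n$ the sequence length, and $Z(X)$ the normalization term, which sums over all candidate label sequences $y'$.

$$
p(y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(X, i, y_{i-1}, y_i)\right),
\qquad
Z(X) = \sum_{y'} \exp\left(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(X, i, y'_{i-1}, y'_i)\right)
$$

Because the exponential of a sum is a product of exponentials, the numerator can equally be read as a product of one factor per position $i$, which $Z(X)$ then normalizes into a probability.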

Some interesting sources about CRF:

https://repository.upenn.edu/cis_papers/159/?ref=https://githubhelp.com

https://www.sciencedirect.com/topics/computer-science/conditional-random-field

https://www.youtube.com/watch?v=rc3YDj5GiVM

https://www.aitimejournal.com/@akshay.chavan/introduction-to-conditional-random-fields-crfs

https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776


## Training the CRF model
Now we can train a model. This process can take several minutes.



In [14]:
crf = CRF(algorithm = 'lbfgs',
         c1 = 0.1,
         c2 = 0.1,
         max_iterations = 100,
         all_possible_transitions = False)
crf.fit(X_train, y_train)



CRF(algorithm='lbfgs', all_possible_transitions=False, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

## Evaluation on the test dataset

In [15]:
#Predicting on the test set.
y_pred = crf.predict(X_test)

f1_score = flat_f1_score(y_test, y_pred, average = 'weighted')
print(f1_score)

report = flat_classification_report(y_test, y_pred)
print(report)


0.9711478612209549




              precision    recall  f1-score   support

       B-art       0.48      0.16      0.24        88
       B-eve       0.41      0.37      0.39        52
       B-geo       0.86      0.90      0.88      7489
       B-gpe       0.97      0.94      0.96      3159
       B-nat       0.68      0.44      0.53        39
       B-org       0.80      0.75      0.77      4123
       B-per       0.85      0.83      0.84      3334
       B-tim       0.93      0.88      0.90      4113
       I-art       0.45      0.07      0.11        76
       I-eve       0.32      0.32      0.32        44
       I-geo       0.83      0.80      0.82      1510
       I-gpe       0.90      0.65      0.75        40
       I-nat       0.75      0.38      0.50        16
       I-org       0.80      0.80      0.80      3478
       I-per       0.85      0.89      0.87      3447
       I-tim       0.84      0.78      0.81      1273
           O       0.99      0.99      0.99    176865

    accuracy              
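The "flat" metrics used above first flatten the list of sentences into one long list of tags and then score it label by label. The sketch below reproduces that idea in plain Python on toy data (an assumption, not the notebook's predictions); the helper names `flat_per_label_scores` and `weighted_f1` are hypothetical and are not part of sklearn-crfsuite.

```python
from collections import Counter
from itertools import chain


def flat_per_label_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[t] += 1  # missed an instance of the true label t
    scores = {}
    for label in sorted(set(true_flat) | set(pred_flat)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


def weighted_f1(scores):
    """Average the per-label F1 values, weighting each label by its support."""
    total_support = sum(s[3] for s in scores.values())
    return sum(s[2] * s[3] for s in scores.values()) / total_support


# Toy gold and predicted tags for two sentences.
y_true_toy = [["O", "B-geo", "O"], ["B-per", "I-per", "O"]]
y_pred_toy = [["O", "B-geo", "O"], ["B-per", "O", "O"]]

toy_scores = flat_per_label_scores(y_true_toy, y_pred_toy)
for label, (p, r, f1, support) in toy_scores.items():
    print(f"{label:>6}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
print("weighted F1:", round(weighted_f1(toy_scores), 4))
```

Since each label is weighted by its support, the weighted F1 is driven mostly by the most frequent label. In the report above that label is O, which is worth keeping in mind when comparing the overall score with the lower scores of rare classes such as B-art or I-art.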

PROPOSED EXERCISE:

In the previous cells, we used only the previous and next tokens of the token to be represented.

1. Increase the window size to 2. Does it improve the results?
2. Extend the feature set by adding new features that can help you improve the results.
3. Train a new system to recognize drug names (https://github.com/isegura/DDIcorpus).


