# <span style="font-family:Courier New; color:#CCCCCC">**Named Entity Recognition CRF**</span>

## <span style="font-family:Courier New; color:#336666">**Load Data and Imports**</span>

In [2]:
from preprocessing import convert_BIO
from NER_evaluation import *
from feature_getter import Feature_getter
import pycrfsuite

import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

esp_train = conll2002.iob_sents('esp.train') 
esp_val = conll2002.iob_sents('esp.testa')
esp_test = conll2002.iob_sents('esp.testb')

ned_train = conll2002.iob_sents('ned.train')
ned_val = conll2002.iob_sents('ned.testa')
ned_test = conll2002.iob_sents('ned.testb')

[nltk_data] Downloading package conll2002 to
[nltk_data]     /home/jordigb/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


## <span style="font-family:Courier New; color:#336666">**Train Classifier**</span>

In [2]:
esp_train = convert_BIO(esp_train)
model = nltk.tag.CRFTagger(feature_func = Feature_getter())
model.train(esp_train, 'model.crf.tagger')

In [None]:
esp_test = convert_BIO(esp_test)
X_esp_test = [[word[0] for word in sent] for sent in esp_test]
pred = model.tag_sents(X_esp_test)

In [4]:
results, results_agg_ent = compute_metrics(esp_test, pred)
results

{'correct': 2577,
 'incorrect': 530,
 'partial': 108,
 'missed': 393,
 'spurious': 268,
 'possible': 3608,
 'actual': 3483,
 'precision': 0.739879414298019,
 'recall': 0.7142461197339246,
 'F1-score': 0.726836835425187}
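
The derived fields in this dictionary follow directly from the five match counts. The sketch below is a reconstruction inferred from the numbers shown, not the source of `compute_metrics` in `NER_evaluation`; the helper name `metrics_from_counts` is hypothetical. It assumes `possible` is the number of gold entities, `actual` the number of predicted entities, and that only exact matches count toward precision and recall.

```python
def metrics_from_counts(correct, incorrect, partial, missed, spurious):
    """Derive entity-level precision, recall and F1 from match counts.

    Hypothetical helper, reconstructed from the output shown above.
    """
    # Entities in the gold annotation
    possible = correct + incorrect + partial + missed
    # Entities produced by the tagger
    actual = correct + incorrect + partial + spurious
    # Strict scoring: only exact matches are counted as hits
    precision = correct / actual if actual else 0.0
    recall = correct / possible if possible else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        'possible': possible,
        'actual': actual,
        'precision': precision,
        'recall': recall,
        'F1-score': f1,
    }


scores = metrics_from_counts(correct=2577, incorrect=530, partial=108,
                             missed=393, spurious=268)
print(scores['possible'], scores['actual'])  # 3608 3483
```

With these counts the sketch gives the same `possible` and `actual` totals as the output above, which suggests that partial matches are not given any credit in the reported precision and recall.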

## <span style="font-family:Courier New; color:#336666">**Hyperparameter selection**</span>

We begin with hyperparameter selection, performed on the base features of the classifier only. Training with all features takes much longer, and we assume that the hyperparameters do not interact strongly with the choice of features, so values tuned on the base features should carry over to the richer feature sets.

We write a custom function that runs a grid search over the candidate values.
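
One compact way to enumerate such a grid is `itertools.product`. The sketch below is not the notebook's implementation: `grid_search`, `score_fn` and `toy_score` are hypothetical names, and the toy scoring function merely stands in for "train a CRF with these options and return its validation F1".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Return (best_score, best_params) over the Cartesian product of param_grid.

    param_grid maps each parameter name to a list of candidate values;
    score_fn takes a dict of parameters and returns a number (higher is better).
    """
    names = list(param_grid)
    best_score, best_params = float('-inf'), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Toy stand-in for training a model and scoring it on validation data
def toy_score(params):
    return -(params['c1'] - 0.5) ** 2 - (params['c2'] - 0.1) ** 2


grid = {'c1': [0.1, 0.5, 1.0], 'c2': [0.1, 0.5, 1.0]}
best_score, best_params = grid_search(grid, toy_score)
print(best_params)  # {'c1': 0.5, 'c2': 0.1}
```

Passing the scoring step in as a function keeps the search loop independent of the model, so the same loop could be reused for other feature configurations.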

In [7]:
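# Convert each split with the project's convert_BIO helper
train_sents = convert_BIO(esp_train)
test_sents = convert_BIO(esp_test)
val_sents = convert_BIO(esp_val)

# Token-only versions of the validation and test sentences, used as tagger input
X_val_sents = [[word[0] for word in sent] for sent in val_sents]
X_test_sents = [[word[0] for word in sent] for sent in test_sents]

# Candidate values explored by the grid search
hyperparameters = {
    'c1': [0.1, 0.5, 1.0],
    'c2': [0.1, 0.5, 1.0],
    'max_iterations': [50, 100, 200]
}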

In [11]:
def gridsearch_cv(hyperparameters, train_sents, val_sents, X_val_sents):
    """Grid search over c1, c2 and max_iterations, scored on a held-out validation set.

    Despite the name, no cross-validation is performed: each combination is
    trained once on train_sents and evaluated once on val_sents.
    """
    best_f1 = 0
    best_params = {}
    num_combinations = (len(hyperparameters['c1'])
                        * len(hyperparameters['c2'])
                        * len(hyperparameters['max_iterations']))
    current_combination = 0
    for c1 in hyperparameters['c1']:
        for c2 in hyperparameters['c2']:
            for max_iter in hyperparameters['max_iterations']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}')
                # No feature_func is passed, so the tagger's default (base) features are used
                training_opt = {'c1': c1, 'c2': c2, 'max_iterations': max_iter}
                model = nltk.tag.CRFTagger(training_opt=training_opt)
                model.train(train_sents, 'model.crf.tagger')

                # Score this combination on the validation sentences
                pred = model.tag_sents(X_val_sents)
                results, _ = compute_metrics(val_sents, pred)

                # Keep the combination with the highest validation F1
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = training_opt

    return best_f1, best_params

In [12]:
best, best_params = gridsearch_cv(hyperparameters,train_sents,val_sents,X_val_sents)

Fitting model 1 of 27
Fitting model 2 of 27
Fitting model 3 of 27
Fitting model 4 of 27
Fitting model 5 of 27
Fitting model 6 of 27
Fitting model 7 of 27
Fitting model 8 of 27
Fitting model 9 of 27
Fitting model 10 of 27
Fitting model 11 of 27
Fitting model 12 of 27
Fitting model 13 of 27
Fitting model 14 of 27
Fitting model 15 of 27
Fitting model 16 of 27
Fitting model 17 of 27
Fitting model 18 of 27
Fitting model 19 of 27
Fitting model 20 of 27
Fitting model 21 of 27
Fitting model 22 of 27
Fitting model 23 of 27
Fitting model 24 of 27
Fitting model 25 of 27
Fitting model 26 of 27
Fitting model 27 of 27


In [14]:
best, best_params

(0.706310110250997, {'c1': 0.1, 'c2': 0.1, 'max_iterations': 100})

## <span style="font-family:Courier New; color:#336666">**Feature selection**</span>

### Best n-gram

Unigram (base features; this is the best score from the grid search)

In [16]:
best

0.706310110250997

Bigram

In [19]:
model = nltk.tag.CRFTagger(
    training_opt=best_params,
    feature_func=Feature_getter(bigram=True, trigram=False, morphology=False,
                                length=False, prefix=False, sufix=True,
                                lemma=False, POS=False, shape=False))
model.train(train_sents, 'model.crf.tagger')

pred = model.tag_sents(X_val_sents)
results, _ = compute_metrics(val_sents, pred)

results['F1-score']

0.6664314278993179

Trigram

In [None]:
model = nltk.tag.CRFTagger(
    training_opt=best_params,
    feature_func=Feature_getter(bigram=False, trigram=True, morphology=False,
                                length=False, prefix=False, sufix=True,
                                lemma=False, POS=False, shape=False))
model.train(train_sents, 'model.crf.tagger')

pred = model.tag_sents(X_val_sents)
results, _ = compute_metrics(val_sents, pred)

results['F1-score']

In [22]:
results['F1-score']

0.6408484270734033

Unigrams work best: the validation F1-score is 0.706 with unigrams, compared with 0.666 with bigrams and 0.641 with trigrams.
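
The internals of `Feature_getter` are not shown here, so the following is only an illustration of what word n-gram context features could look like, not the features actually used above. It relies on the fact that NLTK's `CRFTagger` calls its `feature_func` with the token list and an index and expects a list of feature strings; the function name `ngram_features`, the feature prefixes and the `<s>`/`</s>` padding symbols are all hypothetical.

```python
def ngram_features(tokens, idx, bigram=False, trigram=False):
    """Hypothetical feature function: string features for tokens[idx]."""
    def tok(i):
        # Pad positions outside the sentence with boundary symbols
        if i < 0:
            return '<s>'
        if i >= len(tokens):
            return '</s>'
        return tokens[i]

    # Unigram feature: the word itself
    features = ['WORD_' + tokens[idx]]
    if bigram:
        # Word pairs formed with the previous and the next token
        features.append('PREV_BIGRAM_' + tok(idx - 1) + '_' + tok(idx))
        features.append('NEXT_BIGRAM_' + tok(idx) + '_' + tok(idx + 1))
    if trigram:
        # Word triple centred on the current token
        features.append('TRIGRAM_' + tok(idx - 1) + '_' + tok(idx) + '_' + tok(idx + 1))
    return features


sentence = ['Juan', 'vive', 'en', 'Madrid']
print(ngram_features(sentence, 3, bigram=True))
# ['WORD_Madrid', 'PREV_BIGRAM_en_Madrid', 'NEXT_BIGRAM_Madrid_</s>']
```

Features of this kind are far sparser than single-word features, which is one plausible, though unverified, explanation for the lower scores of the bigram and trigram runs.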

### Including morphology

Without morphology

In [23]:
# The best score so far is still the unigram baseline
best

0.706310110250997

Including morphology

In [24]:
model = nltk.tag.CRFTagger(
    training_opt=best_params,
    feature_func=Feature_getter(bigram=False, trigram=False, morphology=True,
                                length=False, prefix=False, sufix=True,
                                lemma=False, POS=False, shape=False))
model.train(train_sents, 'model.crf.tagger')

pred = model.tag_sents(X_val_sents)
results, _ = compute_metrics(val_sents, pred)

best_morphology = results['F1-score']

In [25]:
best_morphology

0.6484235574063059

Including all other features

In [26]:
# The best score remains the same
best

0.706310110250997

In [27]:
model = nltk.tag.CRFTagger(
    training_opt=best_params,
    feature_func=Feature_getter(bigram=False, trigram=False, morphology=True,
                                length=True, prefix=True, sufix=True,
                                lemma=True, POS=True, shape=True))
model.train(train_sents, 'model.crf.tagger')

pred = model.tag_sents(X_val_sents)
results, _ = compute_metrics(val_sents, pred)

best_other = results['F1-score']

In [28]:
best_other

0.6526241468580842
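
To compare the feature configurations side by side, the validation F1-scores reported in this section can be collected and ranked. The snippet below is only a convenience sketch: the scores are copied from the outputs above, while the dictionary keys are shorthand labels chosen here, not names from the notebook.

```python
# Validation F1-scores copied from the cells in this section
val_f1 = {
    'base features (unigram)': 0.706310110250997,
    'bigram': 0.6664314278993179,
    'trigram': 0.6408484270734033,
    'morphology': 0.6484235574063059,
    'all other features': 0.6526241468580842,
}

# Rank the configurations from best to worst
ranked = sorted(val_f1.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f'{name:<26}{score:.4f}')

best_name, best_score = ranked[0]
```

On these numbers the base-feature model from the grid search remains the strongest configuration on the validation set.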