# <span style="font-family:Courier New; color:#CCCCCC">**Dutch Named Entity Recognition CRF**</span>

## <span style="font-family:Courier New; color:#336666">**Load Data and Imports**</span>

In [4]:
from preprocessing import convert_BIO
from NER_evaluation import *
from feature_getter import Feature_getter
import pycrfsuite
from collections import Counter
import pandas as pd

import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

ned_train = conll2002.iob_sents('ned.train')
ned_val = conll2002.iob_sents('ned.testa')
ned_test = conll2002.iob_sents('ned.testb')

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\Jordi\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


## <span style="font-family:Courier New; color:#336666">**Preprocessing Data**</span>

In [5]:
ned_train_BIO = convert_BIO(ned_train)
ned_val_BIO = convert_BIO(ned_val)
ned_test_BIO = convert_BIO(ned_test)

X_val_BIO = [[word[0] for word in sent] for sent in ned_val_BIO]
y_val_BIO = [[word[1] for word in sent] for sent in ned_val_BIO]
X_test_BIO = [[word[0] for word in sent] for sent in ned_test_BIO]
y_test_BIO = [[word[1] for word in sent] for sent in ned_test_BIO]

## <span style="font-family:Courier New; color:#336666">**Train Classifier**</span>

In [6]:
# Summary evaluation table
results_df = pd.DataFrame()
def save_ent_results(nclf, results, results_agg_ent, df):
    # Note: the 'total acc' column stores entity-level precision, not accuracy.
    df.loc[nclf,'total acc'] = results["precision"]
    df.loc[nclf,'total recall'] = results["recall"]
    df.loc[nclf,'total F1'] = results["F1-score"]
    df.loc[nclf,'PER F1'] = results_agg_ent["PER"]["F1-score"]
    df.loc[nclf,'ORG F1'] = results_agg_ent["ORG"]["F1-score"]
    df.loc[nclf,'LOC F1'] = results_agg_ent["LOC"]["F1-score"]
    df.loc[nclf,'MISC F1'] = results_agg_ent["MISC"]["F1-score"]
    return df

### <span style="font-family:Courier New; color:#336633">**Default Feature Getter**</span>

<span style="font-family:Courier New">The Feature&Hyperparameter_selection notebook suggests that the best hyperparameters for nltk's CRFTagger with the default feature getter are: {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'possible_transitions': True, 'possible_states': True, 'min_freq': 0}. </span>
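
For orientation, `c1` and `c2` are the L1 and L2 regularization coefficients of CRFsuite's default L-BFGS trainer. As a sketch (this is our reading of the CRFsuite documentation, not something stated in the selection notebook), training minimizes a penalized negative log-likelihood over the weight vector $w$:

$$
\min_{w}\; -\sum_{n} \log p\left(y^{(n)} \mid x^{(n)};\, w\right) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

Here $(x^{(n)}, y^{(n)})$ are the training sentences and their tag sequences. A larger `c1` pushes more weights to exactly zero, while a larger `c2` shrinks all weights. `feature.minfreq` ignores features whose training frequency does not exceed the threshold (0 disables the cut-off), and `feature.possible_states` / `feature.possible_transitions` ask CRFsuite to also generate state and transition features that never occur in the training data.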

In [7]:
default_hyperparams = {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'feature.possible_transitions': True,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(training_opt = default_hyperparams)
model.train(ned_train_BIO, 'models/ned_default_BIO.tagger')

#### <span style="font-family:Courier New; color:#994C00">**Evaluation**</span>

In [113]:
pred_ned_BIO = model.tag_sents(X_val_BIO)
y_pred_BIO = [[word[1] for word in sent] for sent in pred_ned_BIO]

print(bio_classification_report(y_val_BIO, y_pred_BIO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(ned_val_BIO, pred_ned_BIO)
results_df = save_ent_results("Default_BIO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.72      0.62      0.66       479
       I-LOC       0.53      0.30      0.38        64
      B-MISC       0.70      0.69      0.69       748
      I-MISC       0.26      0.42      0.32       215
       B-ORG       0.90      0.57      0.70       686
       I-ORG       0.79      0.65      0.71       396
       B-PER       0.61      0.74      0.67       703
       I-PER       0.74      0.93      0.82       423

   micro avg       0.67      0.67      0.67      3714
   macro avg       0.66      0.61      0.62      3714
weighted avg       0.71      0.67      0.68      3714
 samples avg       0.07      0.07      0.07      3714

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.672 | 0.627 | 0.649 | 0.586 | 0.794 | 0.687 | 0.61 |
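
The token-level report and the entity-level table measure different things: the first scores every tag on its own, the second only counts an entity as correct when the whole span is right. The sketch below illustrates one common way to do the latter, exact-match scoring over `(start, end, type)` spans. It is not the `compute_metrics` implementation from `NER_evaluation`, which may use different matching rules; `extract_entities` and `entity_prf` are hypothetical helper names.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag list.

    `end` is exclusive. An I- tag that does not continue the current span
    is treated leniently as the start of a new span.
    """
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_prf(y_true, y_pred):
    """Exact-match entity precision, recall and F1 over lists of tag sequences."""
    tp = n_pred = n_true = 0
    for gold_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
        tp += len(gold & pred)   # spans with identical boundaries and type
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the PER span is cut short, so it counts as an entity error
# even though three of the four token tags are correct.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this kind of scoring a single wrong boundary tag costs both a false positive and a false negative, which is one reason entity-level figures tend to be lower than token-level ones.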


#### <span style="font-family:Courier New; color:#994C00">**Feature Importance**</span>

In [114]:
def print_state_features(state_features):
        for (attr, label), weight in state_features:
            string = "%0.3f %-6s %s" % (weight, label, attr)
            print(string, end = " "*(40 - len(string)))

def feature_importance(model):

    info = model._tagger.info()
    positive_features = Counter(info.state_features).most_common(10)
    negative_features = Counter(info.state_features).most_common()[-10:]

    print("Top positive:                       |     Top negative:")
    print("-----------------------------------------------------------------------------")

    for positive, negative in zip(positive_features, negative_features):
        print_state_features([positive])
        print_state_features([negative])
        print()
feature_importance(model)

Top positive:                       |     Top negative:
-----------------------------------------------------------------------------
10.313 O      PUNCTUATION               -3.545 O      WORD_3D-Design            
8.892 O      WORD_U                     -3.570 O      WORD_doorsnee-Hollander   
8.098 O      SUF_E                      -3.595 O      WORD_leclicgagnant        
7.872 O      WORD_I                     -3.612 O      WORD_oud-EU-commissievoorzitter
7.759 O      SUF_e                      -3.634 O      WORD_toverbonen           
7.557 O      SUF_t                      -3.678 O      WORD_racismebestrijding   
7.399 O      SUF_f                      -4.045 O      WORD_kabinet-Aelvoet      
7.301 O      SUF_p                      -4.211 O      WORD_pet                  
7.267 O      SUF_m                      -4.246 O      WORD_groenen              
7.224 O      SUF_d                      -7.079 O      CAPITALIZATION            


### <span style="font-family:Courier New; color:#336633">**Custom Feature Getter**</span>

<span style="font-family:Courier New">The Feature&Hyperparameter_selection notebook suggests that the best hyperparameters for nltk's CRFTagger with the custom feature getter are: {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'possible_transitions': False, 'possible_states': True, 'min_freq': 0}. </span>

In [115]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned'), training_opt = customed_hyperparams)
model.train(ned_train_BIO, 'models/ned_customed_BIO.tagger')

#### <span style="font-family:Courier New; color:#994C00">**Evaluation**</span>

In [116]:
pred_ned_BIO = model.tag_sents(X_val_BIO)
y_pred_BIO = [[word[1] for word in sent] for sent in pred_ned_BIO]

print(bio_classification_report(y_val_BIO, y_pred_BIO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(ned_val_BIO, pred_ned_BIO)
results_df = save_ent_results("Custom_BIO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.78      0.78      0.78       479
       I-LOC       0.68      0.41      0.51        64
      B-MISC       0.83      0.76      0.80       748
      I-MISC       0.61      0.56      0.58       215
       B-ORG       0.87      0.64      0.74       686
       I-ORG       0.88      0.64      0.74       396
       B-PER       0.72      0.87      0.79       703
       I-PER       0.81      0.94      0.87       423

   micro avg       0.79      0.75      0.77      3714
   macro avg       0.77      0.70      0.73      3714
weighted avg       0.80      0.75      0.77      3714
 samples avg       0.07      0.07      0.07      3714

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.672 | 0.627 | 0.649 | 0.586 | 0.794 | 0.687 | 0.61 |
| Custom_BIO | 0.778 | 0.75 | 0.764 | 0.714 | 0.806 | 0.774 | 0.784 |


#### <span style="font-family:Courier New; color:#994C00">**Feature Importance**</span>

In [117]:
feature_importance(model)

Top positive:                       |     Top negative:
-----------------------------------------------------------------------------
4.465 O      SHAPE_xxxx                 -1.071 B-MISC LEN_3                     
3.431 O      LEN_1                      -1.109 B-PER  HAS_NUM                   
2.941 O      PUNCTUATION                -1.131 B-ORG  SHAPE_xxx                 
2.845 O      POS_ADV                    -1.134 B-MISC LEN_2                     
2.807 O      POS_PUNCT                  -1.156 B-PER  -1_POS_DET                
2.500 O      SHAPE_xxx                  -1.169 B-ORG  SHAPE_Xxx                 
2.489 O      POS_PRON                   -1.397 B-LOC  SHAPE_xxxx                
2.470 O      HAS_NUM                    -1.398 B-ORG  SHAPE_xxxx                
2.405 O      LEN_2                      -1.558 B-MISC SHAPE_xxxx                
2.107 O      SHAPE_xxxx-xxxx            -1.873 I-MISC -1_SUF_se                 


<div class="alert alert-block alert-info">
<b>See:</b> in contrast with the models trained in the feature&hyperparameter_selection notebook, the custom feature getter now yields a much better model than the default one (total F1 of 0.764 vs 0.649 on the validation set). Since we are still not satisfied with the results, we will keep improving this custom model.
</div>

### <span style="font-family:Courier New; color:#336633">**Changing tagger format**</span>

<span style="font-family:Courier New">At this point, let's check whether changing the tagging scheme used to encode entities has a positive impact on performance. </span>
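
To make the schemes concrete, here is a minimal sketch of re-encoding a BIO tag sequence as IO, BIOS or BIOES. `convert_scheme` is a hypothetical helper written for illustration; it borrows the `begin`, `single` and `end` flag names from the `convert_BIO` calls in this notebook, but the real `convert_BIO` in `preprocessing` works on whole corpora and may behave differently.

```python
def convert_scheme(tags, begin=True, single=False, end=False):
    """Re-encode one BIO tag sequence.

    begin=False                      -> IO    (every entity token is I-)
    begin=True                       -> BIO   (input returned unchanged)
    begin=True, single=True          -> BIOS  (one-token entities become S-)
    begin=True, single=True, end=True -> BIOES (last token of longer entities becomes E-)
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        starts = prefix == "B"
        ends = nxt != "I-" + etype  # the entity stops unless the next tag continues it
        if not begin:
            new = "I"
        elif starts and ends and single:
            new = "S"
        elif starts:
            new = "B"
        elif ends and end:
            new = "E"
        else:
            new = "I"
        out.append(new + "-" + etype)
    return out


bio = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
print(convert_scheme(bio, begin=False))
# ['I-PER', 'I-PER', 'I-PER', 'O', 'I-LOC', 'O']
print(convert_scheme(bio, begin=True, single=True))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'S-LOC', 'O']
print(convert_scheme(bio, begin=True, single=True, end=True))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC', 'O']
```

IO has the fewest labels but cannot separate two adjacent entities of the same type; BIOS and BIOES encode boundaries more explicitly at the cost of more classes, each with fewer training examples.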

#### <span style="font-family:Courier New; color:#994C00">**IO**</span>

In [118]:
ned_train_IO = convert_BIO(ned_train, begin = False)
ned_val_IO = convert_BIO(ned_val, begin = False)
ned_test_IO = convert_BIO(ned_test, begin = False)

X_val_IO = [[word[0] for word in sent] for sent in ned_val_IO]
y_val_IO = [[word[1] for word in sent] for sent in ned_val_IO]
X_test_IO = [[word[0] for word in sent] for sent in ned_test_IO]
y_test_IO = [[word[1] for word in sent] for sent in ned_test_IO]

In [119]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned'), training_opt = customed_hyperparams)
model.train(ned_train_IO, 'models/ned_customed_IO.tagger')

In [120]:
pred_ned_IO = model.tag_sents(X_val_IO)
y_pred_IO = [[word[1] for word in sent] for sent in pred_ned_IO]

print(bio_classification_report(y_val_IO, y_pred_IO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(ned_val_IO, pred_ned_IO)
results_df = save_ent_results("Custom_IO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       I-LOC       0.77      0.75      0.76       543
      I-MISC       0.79      0.72      0.75       963
       I-ORG       0.88      0.66      0.75      1082
       I-PER       0.76      0.90      0.83      1126

   micro avg       0.80      0.76      0.78      3714
   macro avg       0.80      0.76      0.77      3714
weighted avg       0.80      0.76      0.77      3714
 samples avg       0.07      0.07      0.07      3714

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.672 | 0.627 | 0.649 | 0.586 | 0.794 | 0.687 | 0.61 |
| Custom_BIO | 0.778 | 0.75 | 0.764 | 0.714 | 0.806 | 0.774 | 0.784 |
| Custom_IO | 0.77 | 0.739 | 0.755 | 0.703 | 0.797 | 0.782 | 0.765 |


#### <span style="font-family:Courier New; color:#994C00">**BIOS**</span>

In [121]:
ned_train_BIOS = convert_BIO(ned_train, begin = True, single = True)
ned_val_BIOS = convert_BIO(ned_val, begin = True, single = True)
ned_test_BIOS = convert_BIO(ned_test, begin = True, single = True)

X_val_BIOS = [[word[0] for word in sent] for sent in ned_val_BIOS]
y_val_BIOS = [[word[1] for word in sent] for sent in ned_val_BIOS]
X_test_BIOS = [[word[0] for word in sent] for sent in ned_test_BIOS]
y_test_BIOS = [[word[1] for word in sent] for sent in ned_test_BIOS]

In [122]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned'), training_opt = customed_hyperparams)
model.train(ned_train_BIOS, 'models/ned_customed_BIOS.tagger')

In [123]:
pred_ned_BIOS = model.tag_sents(X_val_BIOS)
y_pred_BIOS = [[word[1] for word in sent] for sent in pred_ned_BIOS]

print(bio_classification_report(y_val_BIOS, y_pred_BIOS))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(ned_val_BIOS, pred_ned_BIOS)
results_df = save_ent_results("Custom_BIOS", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.58      0.30      0.39        61
       I-LOC       0.57      0.33      0.42        64
       S-LOC       0.78      0.84      0.81       418
      B-MISC       0.72      0.57      0.64       212
      I-MISC       0.62      0.54      0.58       215
      S-MISC       0.80      0.77      0.78       536
       B-ORG       0.85      0.70      0.77       298
       I-ORG       0.86      0.65      0.74       396
       S-ORG       0.81      0.54      0.64       388
       B-PER       0.82      0.93      0.87       386
       I-PER       0.81      0.95      0.87       423
       S-PER       0.61      0.76      0.67       317

   micro avg       0.77      0.73      0.75      3714
   macro avg       0.74      0.66      0.68      3714
weighted avg       0.77      0.73      0.74      3714
 samples avg       0.07      0.07      0.07      3714

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.672 | 0.627 | 0.649 | 0.586 | 0.794 | 0.687 | 0.61 |
| Custom_BIO | 0.778 | 0.75 | 0.764 | 0.714 | 0.806 | 0.774 | 0.784 |
| Custom_IO | 0.77 | 0.739 | 0.755 | 0.703 | 0.797 | 0.782 | 0.765 |
| Custom_BIOS | 0.763 | 0.735 | 0.749 | 0.711 | 0.789 | 0.76 | 0.757 |


#### <span style="font-family:Courier New; color:#994C00">**BIOES**</span>

In [124]:
ned_train_BIOES = convert_BIO(ned_train, begin = True, single = True, end = True)
ned_val_BIOES = convert_BIO(ned_val, begin = True, single = True, end = True)
ned_test_BIOES = convert_BIO(ned_test, begin = True, single = True, end = True)

X_val_BIOES = [[word[0] for word in sent] for sent in ned_val_BIOES]
y_val_BIOES = [[word[1] for word in sent] for sent in ned_val_BIOES]
X_test_BIOES = [[word[0] for word in sent] for sent in ned_test_BIOES]
y_test_BIOES = [[word[1] for word in sent] for sent in ned_test_BIOES]

In [125]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned'), training_opt = customed_hyperparams)
model.train(ned_train_BIOES, 'models/ned_customed_BIOES.tagger')

In [126]:
pred_ned_BIOES = model.tag_sents(X_val_BIOES)
y_pred_BIOES = [[word[1] for word in sent] for sent in pred_ned_BIOES]

print(bio_classification_report(y_val_BIOES, y_pred_BIOES))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(ned_val_BIOES, pred_ned_BIOES)
results_df = save_ent_results("Custom_BIOES", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.63      0.28      0.39        61
       E-LOC       0.65      0.30      0.41        50
       I-LOC       0.50      0.14      0.22        14
       S-LOC       0.77      0.84      0.80       418
      B-MISC       0.68      0.59      0.63       212
      E-MISC       0.56      0.57      0.56       127
      I-MISC       0.61      0.43      0.51        88
      S-MISC       0.79      0.78      0.79       536
       B-ORG       0.86      0.69      0.77       298
       E-ORG       0.86      0.70      0.77       276
       I-ORG       0.84      0.38      0.53       120
       S-ORG       0.81      0.54      0.65       388
       B-PER       0.82      0.94      0.88       386
       E-PER       0.82      0.95      0.88       380
       I-PER       0.78      0.88      0.83        43
       S-PER       0.60      0.75      0.67       317

   micro avg       0.77      0.73      0.75      3714
   macro avg       0.72   

| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.672 | 0.627 | 0.649 | 0.586 | 0.794 | 0.687 | 0.61 |
| Custom_BIO | 0.778 | 0.75 | 0.764 | 0.714 | 0.806 | 0.774 | 0.784 |
| Custom_IO | 0.77 | 0.739 | 0.755 | 0.703 | 0.797 | 0.782 | 0.765 |
| Custom_BIOS | 0.763 | 0.735 | 0.749 | 0.711 | 0.789 | 0.76 | 0.757 |
| Custom_BIOES | 0.761 | 0.738 | 0.749 | 0.715 | 0.793 | 0.759 | 0.752 |


<span style="font-family:Courier New">As the table shows, BIO is the tagging scheme that works best, with the highest total F1-score (0.764). Looking more closely, the more classes the model has to predict (i.e. BIOS and BIOES), the more it struggles to perform well on all of them. Between BIO and IO the results are similar, but BIO predicts entities better overall (total F1 of 0.764 vs 0.755). We will therefore continue improving the Custom_BIO model. </span>
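
The comparison can be checked directly from the summary table. The sketch below hard-codes the validation figures reported above (the `rows` dict and the `f1` helper are ours, for illustration only), picks the model with the highest total F1, and recomputes F1 as the harmonic mean of the `total acc` and `total recall` columns. The recomputed values agree with the reported ones up to rounding, which is consistent with `total acc` holding precision, as stored by `save_ent_results`.

```python
# (total acc, total recall, total F1) copied from the summary table.
rows = {
    "Default_BIO":  (0.672, 0.627, 0.649),
    "Custom_BIO":   (0.778, 0.750, 0.764),
    "Custom_IO":    (0.770, 0.739, 0.755),
    "Custom_BIOS":  (0.763, 0.735, 0.749),
    "Custom_BIOES": (0.761, 0.738, 0.749),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Model with the highest reported total F1.
best = max(rows, key=lambda name: rows[name][2])
print("best:", best)  # best: Custom_BIO

# Compare reported F1 with the value recomputed from the other two columns.
for name, (p, r, reported) in rows.items():
    print(f"{name:13s} reported={reported:.3f} recomputed={f1(p, r):.3f}")
```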

### <span style="font-family:Courier New; color:#336633">**Adding Gazetteers**</span>

In [None]:
'''names =  []
for sent in ned_train_BIO:
    for token, label in sent:
        if label == 'B-PER':
            names.append(token)
r = Counter(names)
print(r.keys())'''
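
The commented-out cell above collects the tokens tagged `B-PER` in the training set. As a minimal sketch of where that could lead, the code below turns such a collection into a gazetteer (a set of known names) and a membership feature. `build_gazetteer`, `gazetteer_feature`, the `IN_PER_GAZ` feature name and the toy sentences are all assumptions made for illustration; they are not part of `Feature_getter`, and sentences are assumed to be lists of `(token, label)` pairs as in that cell.

```python
from collections import Counter


def build_gazetteer(sents, label="B-PER", min_count=1):
    """Collect tokens that carry `label` at least `min_count` times."""
    counts = Counter(tok for sent in sents for tok, lab in sent if lab == label)
    return {tok for tok, c in counts.items() if c >= min_count}


def gazetteer_feature(token, gazetteer, name="IN_PER_GAZ"):
    """Return a one-element feature list if the token is in the gazetteer."""
    return [name] if token in gazetteer else []


# Toy data standing in for ned_train_BIO (hypothetical sentences).
toy_train = [
    [("Jan", "B-PER"), ("Peeters", "I-PER"), ("woont", "O"), ("in", "O"), ("Gent", "B-LOC")],
    [("Jan", "B-PER"), ("lacht", "O")],
]

per_gazetteer = build_gazetteer(toy_train)
print(per_gazetteer)                             # {'Jan'}
print(gazetteer_feature("Jan", per_gazetteer))   # ['IN_PER_GAZ']
print(gazetteer_feature("Gent", per_gazetteer))  # []
```

A gazetteer built only from the training labels mostly restates what the word-identity features already capture, so an external name list would probably be needed for it to help on unseen entities.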