# <span style="font-family:Courier New; color:#CCCCCC">**Spanish Named Entity Recognition CRF**</span>

## <span style="font-family:Courier New; color:#336666">**Load Data and Imports**</span>

In [11]:
from preprocessing import convert_BIO
from NER_evaluation import *
from feature_getter import Feature_getter
import pycrfsuite
from collections import Counter
import pandas as pd

import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

esp_train = conll2002.iob_sents('esp.train')
esp_val = conll2002.iob_sents('esp.testa')
esp_test = conll2002.iob_sents('esp.testb')

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\Jordi\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


## <span style="font-family:Courier New; color:#336666">**Preprocessing Data**</span>

In [12]:
esp_train_BIO = convert_BIO(esp_train)
esp_val_BIO = convert_BIO(esp_val)
esp_test_BIO = convert_BIO(esp_test)

X_val_BIO = [[word[0] for word in sent] for sent in esp_val_BIO]
y_val_BIO = [[word[1] for word in sent] for sent in esp_val_BIO]
X_test_BIO = [[word[0] for word in sent] for sent in esp_test_BIO]
y_test_BIO = [[word[1] for word in sent] for sent in esp_test_BIO]

## <span style="font-family:Courier New; color:#336666">**Train Classifier**</span>

In [13]:
# Summary evaluation table
results_df = pd.DataFrame()
def save_ent_results(nclf, results, results_agg_ent, df):
    # Note: the 'total acc' column stores entity-level precision, not accuracy.
    df.loc[nclf,'total acc'] = results["precision"]
    df.loc[nclf,'total recall'] = results["recall"]
    df.loc[nclf,'total F1'] = results["F1-score"]
    df.loc[nclf,'PER F1'] = results_agg_ent["PER"]["F1-score"]
    df.loc[nclf,'ORG F1'] = results_agg_ent["ORG"]["F1-score"]
    df.loc[nclf,'LOC F1'] = results_agg_ent["LOC"]["F1-score"]
    df.loc[nclf,'MISC F1'] = results_agg_ent["MISC"]["F1-score"]
    return df

### <span style="font-family:Courier New; color:#336633">**Default Feature Getter**</span>

<span style="font-family:Courier New">The Feature&Hyperparameter_selection notebook suggests that the best hyperparameters for CRFTagger with the default feature getter are: {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'possible_transitions': True, 'possible_states': True, 'min_freq': 0}. </span>
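
<span style="font-family:Courier New">For orientation: in CRFsuite, which backs nltk.tag.CRFTagger, c1 and c2 are the coefficients for L1 and L2 regularization, max_iterations caps the number of optimizer iterations, and feature.minfreq is the frequency cut-off below which features are dropped. As a rough sketch, assuming the usual elastic-net form (the exact constant factors follow CRFsuite's implementation and are not restated in this notebook), training minimizes: </span>

$$
\mathcal{L}(w) = -\sum_{(x, y)} \log p(y \mid x; w) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

<span style="font-family:Courier New">Larger coefficients shrink the feature weights more strongly; the L1 term additionally tends to push some weights to exactly zero. </span>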

In [14]:
default_hyperparams = {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'feature.possible_transitions': True,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(training_opt = default_hyperparams)
model.train(esp_train_BIO, 'models/esp_default_BIO.tagger')

#### <span style="font-family:Courier New; color:#994C00">**Evaluation**</span>

In [15]:
pred_esp_BIO = model.tag_sents(X_val_BIO)
y_pred_BIO = [[word[1] for word in sent] for sent in pred_esp_BIO]

print(bio_classification_report(y_val_BIO, y_pred_BIO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(esp_val_BIO, pred_esp_BIO)
results_df = save_ent_results("Default_BIO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.64      0.78      0.70       984
       I-LOC       0.57      0.74      0.64       337
      B-MISC       0.64      0.53      0.58       445
      I-MISC       0.39      0.47      0.43       654
       B-ORG       0.81      0.73      0.77      1700
       I-ORG       0.74      0.70      0.72      1366
       B-PER       0.85      0.75      0.80      1222
       I-PER       0.86      0.90      0.88       859

   micro avg       0.72      0.72      0.72      7567
   macro avg       0.69      0.70      0.69      7567
weighted avg       0.73      0.72      0.72      7567
 samples avg       0.10      0.10      0.10      7567

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.732 | 0.702 | 0.717 | 0.836 | 0.745 | 0.622 | 0.561 |
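
The entity-level scores above come from `compute_metrics` in the `NER_evaluation` module, whose implementation is not shown in this notebook. As a minimal sketch of what exact-match entity scoring can look like, assuming a predicted entity counts only when both its span and its type match the gold entity, the hypothetical helpers `extract_entities` and `entity_prf` below work directly on BIO tag sequences. The real module may score differently.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in one BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        # An open entity ends on "O", on a new "B-", or on a type change.
        if start is not None and (prefix != "I" or label != ent_type):
            entities.add((start, i - 1, ent_type))
            start, ent_type = None, None
        # Open a new entity on "B-" (or on a stray "I-" when nothing is open).
        if prefix in ("B", "I") and start is None:
            start, ent_type = i, label
    return entities


def entity_prf(gold_sents, pred_sents):
    """Exact-match entity precision, recall and F1 over parallel tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)   # same span and same type
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this scoring a prediction with the right span but the wrong type is both a false positive and a false negative, which is one reason entity-level scores can sit below token-level ones.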


#### <span style="font-family:Courier New; color:#994C00">**Feature Importance**</span>

In [16]:
def print_state_features(state_features):
    # Print each (attribute, label) weight padded to a 40-character column
    for (attr, label), weight in state_features:
        string = "%0.3f %-6s %s" % (weight, label, attr)
        print(string, end = " "*(40 - len(string)))

def feature_importance(model):
    # Rank all state features by learned weight, highest first
    info = model._tagger.info()
    ranked = Counter(info.state_features).most_common()
    positive_features = ranked[:10]
    negative_features = ranked[-10:]

    print("Top positive:                       |     Top negative:")
    print("-----------------------------------------------------------------------------")

    for positive, negative in zip(positive_features, negative_features):
        print_state_features([positive])
        print_state_features([negative])
        print()

feature_importance(model)

Top positive:                       |     Top negative:
-----------------------------------------------------------------------------
9.353 O      WORD_.                     -3.557 I-LOC  WORD_A                    
7.999 O      WORD_Y                     -3.773 I-PER  SUF_A                     
6.971 O      WORD_y                     -3.880 O      WORD_petrobras            
6.862 O      WORD_A                     -3.913 B-PER  WORD_San                  
5.733 I-PER  WORD_Gándara               -3.993 O      WORD_2000                 
5.506 O      WORD_Día                   -4.039 O      WORD_3-TELEVISION         
5.200 O      WORD_Por                   -4.260 O      WORD_NOTICIAS             
4.835 O      WORD_Para                  -4.272 O      WORD_'                    
4.827 O      SUF_O                      -4.316 O      WORD_"                    
4.810 O      WORD_En                    -8.732 O      CAPITALIZATION            
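
The ranking behind this table relies on `Counter.most_common`, which sorts entries by value in descending order, so the head of the list holds the largest positive weights and the tail the most negative ones. A small sketch on a toy dictionary, using four weights copied from the output above as a stand-in for `info.state_features`:

```python
from collections import Counter

# Toy stand-in for `info.state_features`: {(attribute, label): weight}.
state_features = {
    ("WORD_.", "O"): 9.353,
    ("WORD_Y", "O"): 7.999,
    ("WORD_San", "B-PER"): -3.913,
    ("CAPITALIZATION", "O"): -8.732,
}

ranked = Counter(state_features).most_common()  # sorted by weight, descending
top_positive = ranked[:2]    # largest weights first
top_negative = ranked[-2:]   # the tail ends with the most negative weight

print(top_positive[0])   # (('WORD_.', 'O'), 9.353)
print(top_negative[-1])  # (('CAPITALIZATION', 'O'), -8.732)
```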


### <span style="font-family:Courier New; color:#336633">**Custom Feature Getter**</span>

<span style="font-family:Courier New">The Feature&Hyperparameter_selection notebook suggests that the best hyperparameters for CRFTagger with the custom feature getter are: {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'possible_transitions': False, 'possible_states': True, 'min_freq': 0}. </span>

In [17]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='esp'), training_opt = customed_hyperparams)
model.train(esp_train_BIO, 'models/esp_customed_BIO.tagger')

#### <span style="font-family:Courier New; color:#994C00">**Evaluation**</span>

In [18]:
pred_esp_BIO = model.tag_sents(X_val_BIO)
y_pred_BIO = [[word[1] for word in sent] for sent in pred_esp_BIO]

print(bio_classification_report(y_val_BIO, y_pred_BIO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(esp_val_BIO, pred_esp_BIO)
results_df = save_ent_results("Custom_BIO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.65      0.77      0.70       984
       I-LOC       0.68      0.76      0.72       337
      B-MISC       0.67      0.47      0.55       445
      I-MISC       0.58      0.46      0.51       654
       B-ORG       0.77      0.72      0.74      1700
       I-ORG       0.79      0.69      0.74      1366
       B-PER       0.81      0.74      0.77      1222
       I-PER       0.82      0.89      0.85       859

   micro avg       0.74      0.71      0.72      7567
   macro avg       0.72      0.69      0.70      7567
weighted avg       0.74      0.71      0.72      7567
 samples avg       0.10      0.10      0.10      7567

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.732 | 0.702 | 0.717 | 0.836 | 0.745 | 0.622 | 0.561 |
| Custom_BIO | 0.703 | 0.678 | 0.69 | 0.77 | 0.709 | 0.64 | 0.509 |


#### <span style="font-family:Courier New; color:#994C00">**Feature Importance**</span>

In [19]:
feature_importance(model)

Top positive:                       |     Top negative:
-----------------------------------------------------------------------------
3.387 O      POS_ADV                    -1.160 B-LOC  SUF_ión                   
3.129 O      SHAPE_xxxx                 -1.178 B-MISC POS_NUM                   
2.670 O      POS_CCONJ                  -1.195 I-PER  LEN_1                     
2.395 O      LEN_1                      -1.232 B-PER  -1_gender_Fem             
2.109 O      POS_AUX                    -1.233 I-PER  PUNCTUATION               
2.038 O      POS_PRON                   -1.302 O      +1_SHAPE_dd,dd            
2.026 O      POS_VERB                   -1.468 B-LOC  HAS_NUM                   
1.946 O      HAS_NUM                    -1.545 B-MISC SHAPE_xxxx                
1.916 O      POS_SCONJ                  -2.005 B-ORG  HAS_NUM                   
1.641 O      SHAPE_xxxx-xxxx            -2.293 B-ORG  SHAPE_xxxx                


<div class="alert alert-block alert-info">
<b>See:</b> At first glance, introducing custom features does not improve results for Spanish: total F1 drops from 0.717 to 0.69, and only LOC F1 improves (0.622 to 0.64). We continue the analysis below to see whether further conclusions can be drawn.
</div>

### <span style="font-family:Courier New; color:#336633">**Changing tagger format**</span>

<span style="font-family:Courier New">At this point, let's check whether changing the encoding of entity tags has a positive impact on performance. </span>
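
The schemes compared below differ only in how token positions inside an entity are marked: IO uses `I-` everywhere, BIOS adds `S-` for single-token entities, and BIOES also adds `E-` for the last token of a multi-token entity. The sketch below is a hypothetical re-encoder, not the `convert_BIO` function from the `preprocessing` module (whose implementation is not shown here). It borrows the same keyword names (`begin`, `single`, `end`) for readability and assumes well-formed BIO input.

```python
def bio_to_scheme(tags, begin=True, single=False, end=False):
    """Re-encode one well-formed BIO tag sequence; the defaults return BIO unchanged."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label   # does the entity go on after this token?
        starts = prefix == "B"
        if starts and not continues and single:
            new = "S"                     # single-token entity
        elif starts:
            new = "B" if begin else "I"   # IO drops the begin marker
        elif not continues and end:
            new = "E"                     # last token of a multi-token entity
        else:
            new = "I"
        out.append(new + "-" + label)
    return out


tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
print(bio_to_scheme(tags, begin=False))
# ['I-PER', 'I-PER', 'I-PER', 'O', 'I-LOC']
print(bio_to_scheme(tags, single=True))
# ['B-PER', 'I-PER', 'I-PER', 'O', 'S-LOC']
print(bio_to_scheme(tags, single=True, end=True))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC']
```

Note that IO cannot separate two adjacent entities of the same type, while BIOS and BIOES spread the same training tokens over more labels.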

#### <span style="font-family:Courier New; color:#994C00">**IO**</span>

In [20]:
esp_train_IO = convert_BIO(esp_train, begin = False)
esp_val_IO = convert_BIO(esp_val, begin = False)
esp_test_IO = convert_BIO(esp_test, begin = False)

X_val_IO = [[word[0] for word in sent] for sent in esp_val_IO]
y_val_IO = [[word[1] for word in sent] for sent in esp_val_IO]
X_test_IO = [[word[0] for word in sent] for sent in esp_test_IO]
y_test_IO = [[word[1] for word in sent] for sent in esp_test_IO]

In [21]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='esp'), training_opt = customed_hyperparams)
model.train(esp_train_IO, 'models/esp_customed_IO.tagger')

In [22]:
pred_esp_IO = model.tag_sents(X_val_IO)
y_pred_IO = [[word[1] for word in sent] for sent in pred_esp_IO]

print(bio_classification_report(y_val_IO, y_pred_IO))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(esp_val_IO, pred_esp_IO)
results_df = save_ent_results("Custom_IO", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       I-LOC       0.65      0.76      0.70      1321
      I-MISC       0.58      0.48      0.53      1099
       I-ORG       0.81      0.70      0.75      3066
       I-PER       0.83      0.81      0.82      2081

   micro avg       0.75      0.71      0.73      7567
   macro avg       0.72      0.69      0.70      7567
weighted avg       0.75      0.71      0.73      7567
 samples avg       0.10      0.10      0.10      7567

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.732 | 0.702 | 0.717 | 0.836 | 0.745 | 0.622 | 0.561 |
| Custom_BIO | 0.703 | 0.678 | 0.69 | 0.77 | 0.709 | 0.64 | 0.509 |
| Custom_IO | 0.691 | 0.654 | 0.672 | 0.753 | 0.698 | 0.616 | 0.472 |


#### <span style="font-family:Courier New; color:#994C00">**BIOS**</span>

In [23]:
esp_train_BIOS = convert_BIO(esp_train, begin = True, single = True)
esp_val_BIOS = convert_BIO(esp_val, begin = True, single = True)
esp_test_BIOS = convert_BIO(esp_test, begin = True, single = True)

X_val_BIOS = [[word[0] for word in sent] for sent in esp_val_BIOS]
y_val_BIOS = [[word[1] for word in sent] for sent in esp_val_BIOS]
X_test_BIOS = [[word[0] for word in sent] for sent in esp_test_BIOS]
y_test_BIOS = [[word[1] for word in sent] for sent in esp_test_BIOS]

In [24]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='esp'), training_opt = customed_hyperparams)
model.train(esp_train_BIOS, 'models/esp_customed_BIOS.tagger')

In [25]:
pred_esp_BIOS = model.tag_sents(X_val_BIOS)
y_pred_BIOS = [[word[1] for word in sent] for sent in pred_esp_BIOS]

print(bio_classification_report(y_val_BIOS, y_pred_BIOS))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(esp_val_BIOS, pred_esp_BIOS)
results_df = save_ent_results("Custom_BIOS", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.72      0.61      0.66       270
       I-LOC       0.71      0.75      0.73       337
       S-LOC       0.59      0.80      0.68       714
      B-MISC       0.55      0.43      0.49       260
      I-MISC       0.58      0.46      0.51       654
      S-MISC       0.64      0.45      0.53       185
       B-ORG       0.76      0.63      0.69       632
       I-ORG       0.79      0.70      0.74      1366
       S-ORG       0.75      0.73      0.74      1068
       B-PER       0.83      0.89      0.86       684
       I-PER       0.84      0.89      0.86       859
       S-PER       0.73      0.54      0.62       538

   micro avg       0.73      0.70      0.71      7567
   macro avg       0.71      0.66      0.68      7567
weighted avg       0.73      0.70      0.71      7567
 samples avg       0.10      0.10      0.10      7567

Entity level evaluation


| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.732 | 0.702 | 0.717 | 0.836 | 0.745 | 0.622 | 0.561 |
| Custom_BIO | 0.703 | 0.678 | 0.69 | 0.77 | 0.709 | 0.64 | 0.509 |
| Custom_IO | 0.691 | 0.654 | 0.672 | 0.753 | 0.698 | 0.616 | 0.472 |
| Custom_BIOS | 0.698 | 0.677 | 0.687 | 0.783 | 0.71 | 0.626 | 0.487 |


#### <span style="font-family:Courier New; color:#994C00">**BIOES**</span>

In [26]:
esp_train_BIOES = convert_BIO(esp_train, begin = True, single = True, end = True)
esp_val_BIOES = convert_BIO(esp_val, begin = True, single = True, end = True)
esp_test_BIOES = convert_BIO(esp_test, begin = True, single = True, end = True)

X_val_BIOES = [[word[0] for word in sent] for sent in esp_val_BIOES]
y_val_BIOES = [[word[1] for word in sent] for sent in esp_val_BIOES]
X_test_BIOES = [[word[0] for word in sent] for sent in esp_test_BIOES]
y_test_BIOES = [[word[1] for word in sent] for sent in esp_test_BIOES]

In [27]:
customed_hyperparams = {'c1': 0.01, 'c2': 1, 'max_iterations': 100, 'feature.possible_transitions': False,
                                            'feature.possible_states': True, 'feature.minfreq': 0}
model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='esp'), training_opt = customed_hyperparams)
model.train(esp_train_BIOES, 'models/esp_customed_BIOES.tagger')

In [28]:
pred_esp_BIOES = model.tag_sents(X_val_BIOES)
y_pred_BIOES = [[word[1] for word in sent] for sent in pred_esp_BIOES]

print(bio_classification_report(y_val_BIOES, y_pred_BIOES))
print('='*80)
print('Entity level evaluation')
print('='*80)
results, results_agg_ent = compute_metrics(esp_val_BIOES, pred_esp_BIOES)
results_df = save_ent_results("Custom_BIOES", results, results_agg_ent, results_df)
results_df

              precision    recall  f1-score   support

       B-LOC       0.69      0.61      0.65       270
       E-LOC       0.69      0.70      0.69       227
       I-LOC       0.51      0.73      0.60       110
       S-LOC       0.60      0.80      0.68       714
      B-MISC       0.60      0.49      0.54       260
      E-MISC       0.55      0.46      0.50       250
      I-MISC       0.55      0.40      0.46       404
      S-MISC       0.66      0.44      0.53       185
       B-ORG       0.76      0.61      0.67       632
       E-ORG       0.69      0.58      0.63       590
       I-ORG       0.78      0.71      0.74       776
       S-ORG       0.75      0.74      0.74      1068
       B-PER       0.83      0.90      0.86       684
       E-PER       0.83      0.92      0.87       664
       I-PER       0.83      0.80      0.81       195
       S-PER       0.74      0.52      0.61       538

   micro avg       0.72      0.69      0.70      7567
   macro avg       0.69   

| Model | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Default_BIO | 0.732 | 0.702 | 0.717 | 0.836 | 0.745 | 0.622 | 0.561 |
| Custom_BIO | 0.703 | 0.678 | 0.69 | 0.77 | 0.709 | 0.64 | 0.509 |
| Custom_IO | 0.691 | 0.654 | 0.672 | 0.753 | 0.698 | 0.616 | 0.472 |
| Custom_BIOS | 0.698 | 0.677 | 0.687 | 0.783 | 0.71 | 0.626 | 0.487 |
| Custom_BIOES | 0.7 | 0.68 | 0.69 | 0.782 | 0.711 | 0.626 | 0.518 |


<span style="font-family:Courier New">As the table shows, the custom features do not work as intended for Spanish; if anything, they seem to hurt performance. However, the gap in total F1 between Default_BIO and the best custom configurations is about 0.03, and the validation split is not very large, so we need to be careful about drawing conclusions. </span>
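
One way to gauge whether a gap of this size is meaningful is to resample validation sentences with replacement and look at the spread of the F1 difference. The sketch below is a paired bootstrap and is not part of the original analysis. It assumes per-sentence `(tp, fp, fn)` entity counts are available for both models, and the counts shown are made-up toy numbers, not results from this notebook.

```python
import random


def f1_from_counts(counts):
    """Micro F1 from a list of per-sentence (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_diff(counts_a, counts_b, n_resamples=1000, seed=0):
    """Approximate 95% percentile bounds for F1(a) - F1(b), resampling sentences in pairs."""
    rng = random.Random(seed)
    n = len(counts_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same sentences for both models
        diffs.append(f1_from_counts([counts_a[i] for i in idx])
                     - f1_from_counts([counts_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Toy per-sentence counts (hypothetical), one tuple per validation sentence.
counts_a = [(2, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
counts_b = [(1, 0, 1), (1, 1, 0), (0, 0, 2), (0, 1, 1)]
low, high = bootstrap_f1_diff(counts_a, counts_b)
print(low, high)
```

If the resulting interval contains 0, the observed difference could plausibly be explained by the choice of validation sentences alone.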

### <span style="font-family:Courier New; color:#336633">**Adding Gazetteers**</span>

In [29]:
'''names =  []
for sent in esp_train_BIO:
    for token, label in sent:
        if label == 'B-PER':
            names.append(token)
r = Counter(names)
print(r.keys())'''
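
The disabled cell above collects every token tagged `B-PER` in the training data. As a hedged sketch of how such a list could become a gazetteer feature, the hypothetical helpers below build a name set and emit a feature string when a token is in it. The names `build_gazetteer`, `gazetteer_feature`, and `IN_PER_GAZ` are illustrative and not part of `Feature_getter`, and the sentences are toy data.

```python
from collections import Counter


def build_gazetteer(tagged_sents, label="B-PER", min_count=1):
    """Collect tokens seen with `label` at least `min_count` times."""
    counts = Counter(token for sent in tagged_sents
                     for token, tag in sent if tag == label)
    return {token for token, count in counts.items() if count >= min_count}


def gazetteer_feature(token, gazetteer, name="IN_PER_GAZ"):
    """Return a one-element feature list if the token is in the gazetteer, else an empty list."""
    return [name] if token in gazetteer else []


# Toy (token, tag) sentences, standing in for esp_train_BIO.
toy_train = [
    [("Juan", "B-PER"), ("vive", "O"), ("en", "O"), ("Madrid", "B-LOC")],
    [("Juan", "B-PER"), ("Pérez", "I-PER")],
]
per_gazetteer = build_gazetteer(toy_train)
print(gazetteer_feature("Juan", per_gazetteer))    # ['IN_PER_GAZ']
print(gazetteer_feature("Madrid", per_gazetteer))  # []
```

A gazetteer built from the training labels mostly restates what the word-identity features already capture, so an external name list is likely the more informative source.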