# <span style="font-family:Courier New; color:#CCCCCC">**Feature and Hyperparameter Selection NED**</span>

## <span style="font-family:Courier New; color:#336666">**Load Data and Imports**</span>

In [2]:
from preprocessing import convert_BIO
from NER_evaluation import *
from feature_getter import Feature_getter
import pycrfsuite
import pandas as pd

import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

ned_train = conll2002.iob_sents('ned.train')
ned_val = conll2002.iob_sents('ned.testa')

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\jerez\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


## <span style="font-family:Courier New; color:#336666">**Preprocessing Data**</span>

In [3]:
train_sents = convert_BIO(ned_train)
val_sents = convert_BIO(ned_val)

X_val = [[word[0] for word in sent] for sent in val_sents]
y_val = [[word[1] for word in sent] for sent in val_sents]

## <span style="font-family:Courier New; color:#336666">**Train Baseline Classifier**</span>

In [3]:
model = nltk.tag.CRFTagger()
model.train(train_sents, 'models/model.crf.tagger')

In [4]:
results_df = pd.DataFrame()
def save_results(nclf, results, results_agg_ent, df):
    # Adds one row per classifier; note that the 'total acc' column stores the overall precision.
    df.loc[nclf,'total acc'] = results["precision"]
    df.loc[nclf,'total recall'] = results["recall"]
    df.loc[nclf,'total F1'] = results["F1-score"]
    df.loc[nclf,'PER F1'] = results_agg_ent["PER"]["F1-score"]
    df.loc[nclf,'ORG F1'] = results_agg_ent["ORG"]["F1-score"]
    df.loc[nclf,'LOC F1'] = results_agg_ent["LOC"]["F1-score"]
    df.loc[nclf,'MISC F1'] = results_agg_ent["MISC"]["F1-score"]
    return df


pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
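
<span style="font-family:Courier New">The implementation of compute_metrics (from the project's NER_evaluation module) is not shown here. As a rough illustration only, the sketch below shows one common way to obtain entity-level precision, recall and F1 from BIO tags, assuming exact-match scoring of (start, end, type) spans. The helper names extract_entities and entity_prf are hypothetical and may not match what the module actually does.</span>

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans found in one BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        # An open entity ends on "O", on a new "B-", or on an "I-" of another type.
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            entities.add((start, i, etype))
            start, etype = None, None
        # Lenient choice (assumption): an "I-" without a preceding "B-" also opens an entity.
        if tag != "O" and start is None:
            start, etype = i, tag[2:]
    return entities


def entity_prf(gold_sents, pred_sents):
    """Exact-match entity-level precision, recall and F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "F1-score": f1}


# Toy example: the prediction finds the PER entity but misses the LOC entity.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
scores = entity_prf(gold, pred)
print(scores)
```

<span style="font-family:Courier New">Under exact-match scoring, a predicted entity only counts as correct when both its boundaries and its type agree with the gold annotation.</span>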


## <span style="font-family:Courier New; color:#336666">**Feature selection**</span>

<span style="font-family:Courier New">In this section, we perform feature selection to find the best-performing feature set. We start by examining the isolated effect of individual feature groups, activating them one at a time. </span>
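
<span style="font-family:Courier New">The code of Feature_getter is not included in this notebook. To make the idea of switching feature groups on and off concrete, here is a minimal sketch of a callable with boolean flags. It follows the signature that nltk.tag.CRFTagger expects from feature_func (a token list and an index, returning a list of string features). The class ToggleFeatures, its flags and the feature strings are hypothetical and far simpler than the real Feature_getter.</span>

```python
def word_shape(token):
    """Map characters to a coarse shape, e.g. 'Gent' -> 'Xxxx', 'A4' -> 'Xd'."""
    return "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in token
    )


class ToggleFeatures:
    """Hypothetical stand-in for Feature_getter: each flag enables one feature group."""

    def __init__(self, prev_tok=False, next_tok=False, shape=False):
        self.prev_tok = prev_tok
        self.next_tok = next_tok
        self.shape = shape

    def __call__(self, tokens, idx):
        token = tokens[idx]
        feats = ["WORD_" + token]  # the current word is always included
        if self.shape:
            feats.append("SHAPE_" + word_shape(token))
        if self.prev_tok:
            # "BOS" marks the beginning of the sentence
            feats.append("PREV_" + tokens[idx - 1] if idx > 0 else "BOS")
        if self.next_tok:
            # "EOS" marks the end of the sentence
            feats.append("NEXT_" + tokens[idx + 1] if idx < len(tokens) - 1 else "EOS")
        return feats


sentence = ["Jan", "woont", "in", "Gent"]
only_prev = ToggleFeatures(prev_tok=True)
print(only_prev(sentence, 3))
```

<span style="font-family:Courier New">An instance like this could be passed as feature_func to the tagger, so each experiment differs only in the flags that are set.</span>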

### <span style="font-family:Courier New; color:#336633">**Including Context Features**</span>

#### <span style="font-family:Courier New; color:#994C00">**Previous Token**</span>

In [5]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = False, morphology = False, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |


#### <span style="font-family:Courier New; color:#994C00">**Previous and Next Tokens**</span>

In [6]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = True, morphology = False, length = False, prefix = False,
                lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |


<div class="alert alert-block alert-info">
<b>See:</b> including contextual features raises total F1 considerably, from 0.616 (baseline) to 0.690 with the previous token and 0.698 with both the previous and next tokens.
</div>

### <span style="font-family:Courier New; color:#336633">**Including Morphology**</span>

#### <span style="font-family:Courier New; color:#994C00">**Combined with Baseline**</span>

In [7]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language='ned', prev_tok = False, next_tok = False, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous Token**</span>

In [8]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language = 'ned', prev_tok = True, next_tok = False, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |
| Prev_tok_wMorpho | 0.721 | 0.678 | 0.699 | 0.66 | 0.79 | 0.688 | 0.695 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous and Next Tokens**</span>

In [9]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language = 'ned', prev_tok = True, next_tok = True, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |
| Prev_tok_wMorpho | 0.721 | 0.678 | 0.699 | 0.66 | 0.79 | 0.688 | 0.695 |
| Prev_tok_Next_wMorpho | 0.721 | 0.673 | 0.696 | 0.646 | 0.769 | 0.712 | 0.695 |


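<span style="font-family:Courier New">As we can see, including morphology changes total F1 only slightly (0.616 to 0.625 for the baseline, 0.690 to 0.699 with the previous token, and a marginal drop from 0.698 to 0.696 with both context tokens), so we will explore more features. </span>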

### <span style="font-family:Courier New; color:#336633">**Including the Rest of the Features**</span>

#### <span style="font-family:Courier New; color:#994C00">**Combined with Baseline**</span>

In [10]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language = 'ned', prev_tok = False, next_tok = False, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |
| Prev_tok_wMorpho | 0.721 | 0.678 | 0.699 | 0.66 | 0.79 | 0.688 | 0.695 |
| Prev_tok_Next_wMorpho | 0.721 | 0.673 | 0.696 | 0.646 | 0.769 | 0.712 | 0.695 |
| Baseline_wAll | 0.718 | 0.699 | 0.708 | 0.644 | 0.806 | 0.732 | 0.703 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous Token**</span>

In [11]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language = 'ned', prev_tok = True, next_tok = False, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |
| Prev_tok_wMorpho | 0.721 | 0.678 | 0.699 | 0.66 | 0.79 | 0.688 | 0.695 |
| Prev_tok_Next_wMorpho | 0.721 | 0.673 | 0.696 | 0.646 | 0.769 | 0.712 | 0.695 |
| Baseline_wAll | 0.718 | 0.699 | 0.708 | 0.644 | 0.806 | 0.732 | 0.703 |
| Prev_tok_wAll | 0.772 | 0.743 | 0.757 | 0.705 | 0.798 | 0.772 | 0.783 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous and Next Tokens**</span>

In [12]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(language='ned', prev_tok = True, next_tok = True, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.649 | 0.586 | 0.616 | 0.572 | 0.736 | 0.652 | 0.574 |
| Prev_tok | 0.721 | 0.663 | 0.69 | 0.639 | 0.759 | 0.728 | 0.681 |
| Prev_tok_Next | 0.73 | 0.669 | 0.698 | 0.645 | 0.774 | 0.729 | 0.689 |
| Baseline_wMorpho | 0.648 | 0.604 | 0.625 | 0.574 | 0.77 | 0.631 | 0.599 |
| Prev_tok_wMorpho | 0.721 | 0.678 | 0.699 | 0.66 | 0.79 | 0.688 | 0.695 |
| Prev_tok_Next_wMorpho | 0.721 | 0.673 | 0.696 | 0.646 | 0.769 | 0.712 | 0.695 |
| Baseline_wAll | 0.718 | 0.699 | 0.708 | 0.644 | 0.806 | 0.732 | 0.703 |
| Prev_tok_wAll | 0.772 | 0.743 | 0.757 | 0.705 | 0.798 | 0.772 | 0.783 |
| Prev_tok_Next_wAll | 0.778 | 0.748 | 0.762 | 0.726 | 0.779 | 0.767 | 0.79 |


<span style="font-family:Courier New">At this point, we can observe that the best model is the one with the most features, that is, the one that also uses the previous and next token features. However, its improvement over the model without next-token features is too small to justify the added dimensionality. Thus, we will choose the model that takes into account all features of a token and of its previous token. </span>
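
<span style="font-family:Courier New">The trade-off can be checked with simple arithmetic on the total F1 values reported in the last results table. The snippet below only re-uses those three numbers; the variable names are ours.</span>

```python
# Total F1 values copied from the results table above.
total_f1 = {
    "Baseline_wAll": 0.708,
    "Prev_tok_wAll": 0.757,
    "Prev_tok_Next_wAll": 0.762,
}

# Gain from adding previous-token features to the all-features model.
gain_prev = round(total_f1["Prev_tok_wAll"] - total_f1["Baseline_wAll"], 3)
# Additional gain from also adding next-token features.
gain_next = round(total_f1["Prev_tok_Next_wAll"] - total_f1["Prev_tok_wAll"], 3)

print(f"Adding previous-token features: +{gain_prev:.3f} F1")
print(f"Adding next-token features:     +{gain_next:.3f} F1")
```

<span style="font-family:Courier New">The previous token contributes roughly ten times more F1 than the next token does, which is the basis for dropping the next-token features.</span>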

## <span style="font-family:Courier New; color:#336666">**Hyperparameter Selection**</span>

<span style="font-family:Courier New">For hyperparameter optimization, we build on the feature selection above, that is, models using our custom Feature Getter with the features of each token and of its previous token. </span>

In [1]:
hyperparameters = {
    'c1': [0.01, 0.1, 1.0],
    'c2': [0.01, 0.1, 1.0],
    'max_iterations': [50, 100, 200]
}

In [14]:
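def gridsearch_cv(hyperparameters, train_sents, val_sents, X_val):
    # Exhaustive grid search scored on the single validation split (no k-fold cross-validation, despite the name).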

    results_df = pd.DataFrame(columns = ['c1', 'c2', 'max_iterations', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['c1']) * len(hyperparameters['c2']) * len(hyperparameters['max_iterations'])
    current_combination = 0
    for c1 in hyperparameters['c1']:
        for c2 in hyperparameters['c2']:
            for max_iter in hyperparameters['max_iterations']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned', next_tok=False), 
                                           training_opt = {'c1': c1, 'c2': c2, 'max_iterations': max_iter})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [c1, c2, max_iter, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'c1': c1, 'c2': c2, 'max_iterations': max_iter}

    return best_f1, best_params, results_df

best, best_params, results_df = gridsearch_cv(hyperparameters, train_sents, val_sents, X_val)

Fitting model 27 of 27
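
<span style="font-family:Courier New">The three nested loops above enumerate every combination of the grid. As an alternative sketch, the same enumeration can be written for any number of hyperparameters with itertools.product. The helpers iter_grid and grid_search and the toy_score objective are hypothetical: in the notebook, the scoring function would train a CRFTagger with the given training options and return the validation F1.</span>

```python
from itertools import product


def iter_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, evaluate):
    """Return (best_score, best_params); ties keep the first combination found."""
    best_score, best_params = float("-inf"), None
    for params in iter_grid(grid):
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


grid = {
    "c1": [0.01, 0.1, 1.0],
    "c2": [0.01, 0.1, 1.0],
    "max_iterations": [50, 100, 200],
}


def toy_score(params):
    # Placeholder objective for illustration only; it is NOT a real F1-score.
    return -abs(params["c1"] - 0.1) - abs(params["c2"] - 1.0) - params["max_iterations"] / 1e4


best_score, best_params = grid_search(grid, toy_score)
print(len(list(iter_grid(grid))), best_params)
```

<span style="font-family:Courier New">Separating the enumeration from the scoring function means the second search below could reuse the same helper with a different grid.</span>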

In [15]:
results_df.sort_values(by = 'F1-score', ascending = False).head(5)

| | c1 | c2 | max_iterations | F1-score |
|---|---|---|---|---|
| 3 | 0.01 | 0.1 | 50.0 | 0.772 |
| 4 | 0.01 | 0.1 | 100.0 | 0.77 |
| 5 | 0.01 | 0.1 | 200.0 | 0.766 |
| 12 | 0.1 | 0.1 | 50.0 | 0.764 |
| 9 | 0.1 | 0.01 | 50.0 | 0.763 |


In [16]:
best_params

{'c1': 0.01, 'c2': 0.1, 'max_iterations': 50}

<span style="font-family:Courier New">To finish, let's complete the best combination with the remaining training hyperparameters:
- **feature.minfreq** -> Minimum frequency of features.
- **feature.possible_states** -> Force the generation of possible state features.
- **feature.possible_transitions** -> Force the generation of possible transition features. </span>
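
<span style="font-family:Courier New">To give an intuition for a minimum-frequency cut-off, the sketch below drops features that occur fewer than a given number of times in a toy training set. It is a conceptual illustration only: the function prune_rare_features is hypothetical, and the exact counting and threshold comparison used inside CRFsuite may differ.</span>

```python
from collections import Counter


def prune_rare_features(sentences_features, minfreq):
    """Keep only features seen at least `minfreq` times (assumed comparison: >=).

    `sentences_features` is a list of sentences, each a list of per-token feature lists.
    """
    counts = Counter(f for sent in sentences_features for tok in sent for f in tok)
    return [
        [[f for f in tok if counts[f] >= minfreq] for tok in sent]
        for sent in sentences_features
    ]


# Toy data: "w=a" and "cap" occur twice, "w=b" occurs once.
data = [
    [["w=a", "cap"], ["w=b", "cap"]],
    [["w=a"]],
]
pruned = prune_rare_features(data, minfreq=2)
print(pruned)
```

<span style="font-family:Courier New">A higher threshold shrinks the feature space but can discard rare, informative cues such as uncommon names, so it is worth tuning on the validation set.</span>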

In [5]:
hyperparameters = {
    'feature.possible_transitions': [True, False],
    'feature.possible_states': [True, False],
    'feature.minfreq': [0, 5, 10, 15]
}

In [6]:
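def last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val):
    # Same hold-out grid search, now over the feature-generation options, with c1, c2 and max_iterations fixed to the best values found above.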

    results_df = pd.DataFrame(columns = ['poss_transitions', 'poss_states', 'min_freq', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['feature.possible_transitions']) * len(hyperparameters['feature.possible_states']) * len(hyperparameters['feature.minfreq'])
    current_combination = 0
    for trans in hyperparameters['feature.possible_transitions']:
        for states in hyperparameters['feature.possible_states']:
            for min_freq in hyperparameters['feature.minfreq']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(feature_func=Feature_getter(language='ned', next_tok=False),
                                           training_opt = {'c1': 0.01, 'c2': 0.1, 'max_iterations': 50, 'feature.possible_transitions': trans,
                                                            'feature.possible_states': states, 'feature.minfreq': min_freq})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [trans, states, min_freq, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'feature.possible_transitions': trans, 'feature.possible_states': states, 'feature.minfreq': min_freq}

    return best_f1, best_params, results_df

best_complete, best_params_complete, results_df_complete = last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val)

Fitting model 16 of 16

In [8]:
results_df_complete.sort_values(by = 'F1-score', ascending = False).head(5)

| | poss_transitions | poss_states | min_freq | F1-score |
|---|---|---|---|---|
| 0 | True | True | 0 | 0.776 |
| 4 | True | False | 0 | 0.773 |
| 12 | False | False | 0 | 0.772 |
| 8 | False | True | 0 | 0.768 |
| 1 | True | True | 5 | 0.737 |


<span style="font-family:Courier New">
<b>Conclusion:</b> after tuning features and hyperparameters to improve entity-level performance (our aim), we find that the best model is the one using the custom Feature getter (validation F1 of 0.776), with the following hyperparameters: {'c1': 0.01, 'c2': 0.1, 'max_iterations': 50, 'feature.possible_transitions': True, 'feature.possible_states': True, 'feature.minfreq': 0} </span>
