# <span style="font-family:Courier New; color:#CCCCCC">**Feature and Hyperparameter Selection ESP**</span>

## <span style="font-family:Courier New; color:#336666">**Load Data and Imports**</span>

In [2]:
from preprocessing import convert_BIO
from NER_evaluation import *
from feature_getter import Feature_getter
import pycrfsuite
import pandas as pd

import nltk
nltk.download('conll2002')
from nltk.corpus import conll2002

esp_train = conll2002.iob_sents('esp.train')
esp_val = conll2002.iob_sents('esp.testa')

[nltk_data] Downloading package conll2002 to
[nltk_data]     C:\Users\Jordi\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


## <span style="font-family:Courier New; color:#336666">**Preprocessing Data**</span>

In [3]:
train_sents = convert_BIO(esp_train)
val_sents = convert_BIO(esp_val)

X_val = [[word[0] for word in sent] for sent in val_sents]
y_val = [[word[1] for word in sent] for sent in val_sents]
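
The split above relies on each sentence being a list of `(token, tag)` pairs. `convert_BIO` is not shown in this notebook, so that shape is an assumption inferred from the `word[0]` / `word[1]` indexing; the toy tokens and tags below are made up for illustration only.

```python
# Toy sentences in the assumed (token, tag) shape; not taken from conll2002.
toy_sents = [
    [("Juan", "B-PER"), ("vive", "O"), ("en", "O"), ("Madrid", "B-LOC")],
    [("La", "O"), ("ONU", "B-ORG"), ("responde", "O")],
]

# Same comprehension pattern as the cell above: tokens go to X, tags go to y.
X_toy = [[tok for tok, _ in sent] for sent in toy_sents]
y_toy = [[tag for _, tag in sent] for sent in toy_sents]

print(X_toy[0])
print(y_toy[0])
```

The tagger only sees `X_toy` at prediction time, while `y_toy` stays aside as the gold standard for evaluation.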

## <span style="font-family:Courier New; color:#336666">**Train Baseline Classifier**</span>

In [3]:
model = nltk.tag.CRFTagger()
model.train(train_sents, 'models/model.crf.tagger')

In [4]:
results_df = pd.DataFrame()
def save_results(nclf, results, results_agg_ent, df):
    # Note: the 'total acc' column stores the overall precision, not accuracy.
    df.loc[nclf,'total acc'] = results["precision"]
    df.loc[nclf,'total recall'] = results["recall"]
    df.loc[nclf,'total F1'] = results["F1-score"]
    df.loc[nclf,'PER F1'] = results_agg_ent["PER"]["F1-score"]
    df.loc[nclf,'ORG F1'] = results_agg_ent["ORG"]["F1-score"]
    df.loc[nclf,'LOC F1'] = results_agg_ent["LOC"]["F1-score"]
    df.loc[nclf,'MISC F1'] = results_agg_ent["MISC"]["F1-score"]
    return df


pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
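
The scores above come from `compute_metrics` in `NER_evaluation`, whose code is not shown here. As a rough sketch of what an entity-level evaluation can look like, the block below extracts spans from BIO tags and counts exact matches only. Both function names are hypothetical, and the real module may handle partial matches or stray `I-` tags differently.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in one BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold_sents, pred_sents):
    """Exact-match entity-level precision, recall and F1 over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
p, r, f1 = precision_recall_f1(gold, pred)
```

In this toy case one of two gold entities is found and nothing spurious is predicted, so precision is 1.0 and recall is 0.5. The reported baseline is consistent with the same harmonic-mean formula: 2 · 0.707 · 0.666 / (0.707 + 0.666) ≈ 0.686.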


## <span style="font-family:Courier New; color:#336666">**Feature selection**</span>

 <span style="font-family:Courier New">In this section, we will attempt to perform feature selection to achieve optimal performance. We will start by examining the isolated effects of individual features, activating them one at a time. </span>
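
To make the idea of switching features on and off concrete, here is a minimal sketch of a feature function with boolean toggles. `ToggleFeatures` and `word_shape` are hypothetical stand-ins and not the actual `Feature_getter`, whose code lives in `feature_getter.py`; the only thing assumed about NLTK is that `CRFTagger` calls `feature_func(tokens, idx)` and expects a list of string features back.

```python
def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit); keep anything else."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


class ToggleFeatures:
    """Hypothetical feature function whose feature groups can be enabled one by one."""

    def __init__(self, prev_tok=False, next_tok=False, shape=False):
        self.prev_tok = prev_tok
        self.next_tok = next_tok
        self.shape = shape

    def __call__(self, tokens, idx):
        token = tokens[idx]
        feats = ["WORD_" + token]  # always-on base feature
        if self.prev_tok:
            feats.append("PREV_" + (tokens[idx - 1] if idx > 0 else "<BOS>"))
        if self.next_tok:
            last = len(tokens) - 1
            feats.append("NEXT_" + (tokens[idx + 1] if idx < last else "<EOS>"))
        if self.shape:
            feats.append("SHAPE_" + word_shape(token))
        return feats


sentence = ["Juan", "vive", "en", "Madrid"]
print(ToggleFeatures(prev_tok=True)(sentence, 0))
print(ToggleFeatures(prev_tok=True, next_tok=True, shape=True)(sentence, 3))
```

Each experiment below then amounts to building the feature function with a different set of flags and retraining.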

### <span style="font-family:Courier New; color:#336633">**Including Context Features**</span>

#### <span style="font-family:Courier New; color:#994C00">**Previous Token**</span>

In [5]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = False, morphology = False, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |


#### <span style="font-family:Courier New; color:#994C00">**Previous and Next Tokens**</span>

In [6]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = True, morphology = False, length = False, prefix = False,
                lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |


<div class="alert alert-block alert-info">
<b>See:</b> performance increased markedly once we introduced context: adding the previous token raised the total F1 from 0.686 to 0.731. Adding the next token helped less (0.745), but the gain is still enough to keep it.
</div>

### <span style="font-family:Courier New; color:#336633">**Including Morphology**</span>

#### <span style="font-family:Courier New; color:#994C00">**Combined with Baseline**</span>

In [7]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = False, next_tok = False, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous Token**</span>

In [8]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = False, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |
| Prev_tok_wMorpho | 0.747 | 0.722 | 0.734 | 0.821 | 0.781 | 0.647 | 0.555 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous and Next Tokens**</span>

In [9]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = True, morphology = True, length = False, prefix = False,
                           lemma = False, POS = False, shape = False))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next_wMorpho", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |
| Prev_tok_wMorpho | 0.747 | 0.722 | 0.734 | 0.821 | 0.781 | 0.647 | 0.555 |
| Prev_tok_Next_wMorpho | 0.75 | 0.728 | 0.739 | 0.841 | 0.797 | 0.631 | 0.535 |


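<span style="font-family:Courier New">As we can see, morphology has only a small and mixed effect: the total F1 is unchanged on the baseline (0.686), rises slightly with the previous token (0.731 to 0.734) and drops slightly with both context tokens (0.745 to 0.739). We will keep it for now and explore the remaining features. </span>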

### <span style="font-family:Courier New; color:#336633">**Including the Rest of the Features**</span>

#### <span style="font-family:Courier New; color:#994C00">**Combined with Baseline**</span>

In [10]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = False, next_tok = False, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Baseline_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |
| Prev_tok_wMorpho | 0.747 | 0.722 | 0.734 | 0.821 | 0.781 | 0.647 | 0.555 |
| Prev_tok_Next_wMorpho | 0.75 | 0.728 | 0.739 | 0.841 | 0.797 | 0.631 | 0.535 |
| Baseline_wAll | 0.719 | 0.691 | 0.705 | 0.805 | 0.756 | 0.591 | 0.577 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous Token**</span>

In [11]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = False, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |
| Prev_tok_wMorpho | 0.747 | 0.722 | 0.734 | 0.821 | 0.781 | 0.647 | 0.555 |
| Prev_tok_Next_wMorpho | 0.75 | 0.728 | 0.739 | 0.841 | 0.797 | 0.631 | 0.535 |
| Baseline_wAll | 0.719 | 0.691 | 0.705 | 0.805 | 0.756 | 0.591 | 0.577 |
| Prev_tok_wAll | 0.748 | 0.733 | 0.74 | 0.85 | 0.79 | 0.641 | 0.551 |


#### <span style="font-family:Courier New; color:#994C00">**Combined with Previous and Next Tokens**</span>

In [12]:
model = nltk.tag.CRFTagger(feature_func = Feature_getter(prev_tok = True, next_tok = True, morphology = True, length = True, prefix = True,
                lemma = True, POS = True, shape = True))
model.train(train_sents, 'models/model.crf.tagger')

pred = model.tag_sents(X_val)
results, results_agg_ent = compute_metrics(val_sents, pred)
save_results("Prev_tok_Next_wAll", results, results_agg_ent, results_df)

| | total acc | total recall | total F1 | PER F1 | ORG F1 | LOC F1 | MISC F1 |
|---|---|---|---|---|---|---|---|
| Baseline | 0.707 | 0.666 | 0.686 | 0.769 | 0.728 | 0.588 | 0.573 |
| Prev_tok | 0.746 | 0.716 | 0.731 | 0.809 | 0.774 | 0.653 | 0.558 |
| Prev_tok_Next | 0.756 | 0.733 | 0.745 | 0.837 | 0.804 | 0.642 | 0.544 |
| Baseline_wMorpho | 0.706 | 0.668 | 0.686 | 0.772 | 0.726 | 0.588 | 0.57 |
| Prev_tok_wMorpho | 0.747 | 0.722 | 0.734 | 0.821 | 0.781 | 0.647 | 0.555 |
| Prev_tok_Next_wMorpho | 0.75 | 0.728 | 0.739 | 0.841 | 0.797 | 0.631 | 0.535 |
| Baseline_wAll | 0.719 | 0.691 | 0.705 | 0.805 | 0.756 | 0.591 | 0.577 |
| Prev_tok_wAll | 0.748 | 0.733 | 0.74 | 0.85 | 0.79 | 0.641 | 0.551 |
| Prev_tok_Next_wAll | 0.753 | 0.739 | 0.746 | 0.858 | 0.799 | 0.636 | 0.567 |


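<span style="font-family:Courier New">At this point, the best total F1 (0.746) comes from combining both context tokens with all the remaining features, although the margin over the context tokens alone (0.745) is small; the clearest gain is on PER (0.837 to 0.858). We will therefore keep all features. </span>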
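
To compare the configurations at a glance, the reported total F1 scores can be turned into differences against the baseline. The numbers below are copied from the results table; the variable names are only illustrative.

```python
# Total F1 on the validation split, copied from the results table.
total_f1 = {
    "Baseline": 0.686,
    "Prev_tok": 0.731,
    "Prev_tok_Next": 0.745,
    "Baseline_wMorpho": 0.686,
    "Prev_tok_wMorpho": 0.734,
    "Prev_tok_Next_wMorpho": 0.739,
    "Baseline_wAll": 0.705,
    "Prev_tok_wAll": 0.740,
    "Prev_tok_Next_wAll": 0.746,
}

# Gain of each configuration over the baseline, rounded to three decimals.
deltas = {name: round(score - total_f1["Baseline"], 3) for name, score in total_f1.items()}
best = max(total_f1, key=total_f1.get)

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24}{delta:+.3f}")
```

Most of the gain comes from the context tokens; the remaining features add comparatively little on top of them.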

## <span style="font-family:Courier New; color:#336666">**Hyperparameter Selection**</span>

### <span style="font-family:Courier New; color:#336633">**Best model for Base Features**</span>

<span style="font-family:Courier New">First, we will run the search with the base features, since training with all features takes considerably longer. This simplification assumes that the best hyperparameters do not depend strongly on the feature set. </span>

In [8]:
hyperparameters = {
    'c1': [0.01, 0.1, 1.0],
    'c2': [0.01, 0.1, 1.0],
    'max_iterations': [50, 100, 200]
}

In [None]:
def gridsearch_cv(hyperparameters, train_sents, val_sents, X_val):

    results_df = pd.DataFrame(columns = ['c1', 'c2', 'max_iterations', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['c1']) * len(hyperparameters['c2']) * len(hyperparameters['max_iterations'])
    current_combination = 0
    for c1 in hyperparameters['c1']:
        for c2 in hyperparameters['c2']:
            for max_iter in hyperparameters['max_iterations']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(training_opt = {'c1': c1, 'c2': c2, 'max_iterations': max_iter})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [c1, c2, max_iter, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'c1': c1, 'c2': c2, 'max_iterations': max_iter}

    return best_f1, best_params, results_df

best, best_params, results_df = gridsearch_cv(hyperparameters, train_sents, val_sents, X_val)

In [21]:
results_df.sort_values(by = 'F1-score', ascending = False).head(5)

| | c1 | c2 | max_iterations | F1-score |
|---|---|---|---|---|
| 5 | 0.01 | 0.1 | 200.0 | 0.716 |
| 4 | 0.01 | 0.1 | 100.0 | 0.714 |
| 11 | 0.1 | 0.01 | 200.0 | 0.713 |
| 13 | 0.1 | 0.1 | 100.0 | 0.712 |
| 14 | 0.1 | 0.1 | 200.0 | 0.71 |


In [25]:
best_params

{'c1': 0.01, 'c2': 0.1, 'max_iterations': 200}
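
The exhaustive search used above can also be written generically with `itertools.product`, so that the same loop works for any grid. This is only a sketch: `grid_search` and `toy_score` are hypothetical names, and the toy scorer is a made-up stand-in for training a tagger and computing its validation F1.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate score_fn on every combination in grid and keep the best one."""
    names = list(grid)
    best_score, best_params, rows = float("-inf"), None, []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        rows.append({**params, "score": score})
        if score > best_score:  # strict: ties keep the earlier combination
            best_score, best_params = score, params
    return best_score, best_params, rows


grid = {
    "c1": [0.01, 0.1, 1.0],
    "c2": [0.01, 0.1, 1.0],
    "max_iterations": [50, 100, 200],
}


def toy_score(params):
    # Made-up objective for illustration; a real scorer would train and evaluate a model.
    return -abs(params["c1"] - 0.1) - abs(params["c2"] - 0.1) + params["max_iterations"] / 1000


best_score, best_found, rows = grid_search(grid, toy_score)
print(len(rows), best_found)
```

Passing the scoring step as a function keeps the search loop independent of the feature set, which avoids copying the whole function for each variant.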

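<span style="font-family:Courier New">To finish, let's complete the best combination with the remaining training hyperparameters:
- **feature.minfreq** -> Minimum frequency a feature needs in the training data to be kept.
- **feature.possible_states** -> Force the generation of all possible state features, including those that never occur in the training data.
- **feature.possible_transitions** -> Force the generation of all possible transition features, including those that never occur in the training data. </span>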
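
To build some intuition for the frequency cut-off, the sketch below counts how often each feature string occurs and drops the rare ones. It is a conceptual illustration, not CRFsuite's implementation: CRFsuite applies the cut-off to its internal state and transition features, and whether a feature occurring exactly `min_freq` times is kept is an assumption here. `apply_min_freq` is a hypothetical helper.

```python
from collections import Counter


def apply_min_freq(feature_lists, min_freq):
    """Keep only features that occur at least min_freq times across all tokens."""
    counts = Counter(feat for feats in feature_lists for feat in feats)
    return [[feat for feat in feats if counts[feat] >= min_freq] for feats in feature_lists]


# Made-up per-token features: word identity plus a capitalisation flag.
toy_features = [
    ["WORD_Juan", "CAP"],
    ["WORD_vive"],
    ["WORD_Madrid", "CAP"],
]

print(apply_min_freq(toy_features, 0))  # no cut-off: everything is kept
print(apply_min_freq(toy_features, 2))  # word identities occur once and are dropped
```

With a cut-off, word-identity features of rare names disappear first, which is one plausible reason a high threshold can hurt NER.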

In [1]:
hyperparameters = {
    'feature.possible_transitions': [True, False],
    'feature.possible_states': [True, False],
    'feature.minfreq': [0, 5, 10, 15]
}

In [5]:
def last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val):

    results_df = pd.DataFrame(columns = ['poss_transitions', 'poss_states', 'min_freq', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['feature.possible_transitions']) * len(hyperparameters['feature.possible_states']) * len(hyperparameters['feature.minfreq'])
    current_combination = 0
    for trans in hyperparameters['feature.possible_transitions']:
        for states in hyperparameters['feature.possible_states']:
            for min_freq in hyperparameters['feature.minfreq']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(training_opt = {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'feature.possible_transitions': trans,
                                                            'feature.possible_states': states, 'feature.minfreq': min_freq})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [trans, states, min_freq, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'feature.possible_transitions': trans, 'feature.possible_states': states, 'feature.minfreq': min_freq}

    return best_f1, best_params, results_df

best_complete, best_params_complete, results_df_complete = last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val)

Fitting model 16 of 16

In [6]:
results_df_complete.sort_values(by = 'F1-score', ascending = False).head(5)

| | poss_transitions | poss_states | min_freq | F1-score |
|---|---|---|---|---|
| 0 | True | True | 0 | 0.717 |
| 12 | False | False | 0 | 0.716 |
| 4 | True | False | 0 | 0.713 |
| 8 | False | True | 0 | 0.712 |
| 1 | True | True | 5 | 0.664 |


<span style="font-family:Courier New">The search suggests that the best hyperparameters for CRFTagger with the default feature function are: {'c1': 0.01, 'c2': 0.1, 'max_iterations': 200, 'feature.possible_transitions': True, 'feature.possible_states': True, 'feature.minfreq': 0}, with a validation F1 of 0.717. </span>

### <span style="font-family:Courier New; color:#336633">**Best Model for Feature Selection**</span>

<span style="font-family:Courier New">Now, let's repeat the search with the model that combines the previous token, the next token and the rest of the features. </span>

In [9]:
def gridsearch_cv_wFeatures(hyperparameters, train_sents, val_sents, X_val):

    results_df = pd.DataFrame(columns = ['c1', 'c2', 'max_iterations', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['c1']) * len(hyperparameters['c2']) * len(hyperparameters['max_iterations'])
    current_combination = 0
    for c1 in hyperparameters['c1']:
        for c2 in hyperparameters['c2']:
            for max_iter in hyperparameters['max_iterations']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(feature_func= Feature_getter(), training_opt = {'c1': c1, 'c2': c2, 'max_iterations': max_iter})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [c1, c2, max_iter, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'c1': c1, 'c2': c2, 'max_iterations': max_iter}

    return best_f1, best_params, results_df

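# Note: `hyperparameters` must hold a c1 / c2 / max_iterations grid here, not the feature.* grid defined above.
# The output below (9 fits, all with max_iterations = 200) suggests a reduced 3 x 3 x 1 grid was used.
best_wFeatures, best_params_w_Features, results_df_wFeatures = gridsearch_cv_wFeatures(hyperparameters, train_sents, val_sents, X_val)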

Fitting model 9 of 9

In [10]:
results_df_wFeatures.sort_values(by = 'F1-score', ascending = False).head(5)

| | c1 | c2 | max_iterations | F1-score |
|---|---|---|---|---|
| 2 | 0.01 | 1.0 | 200.0 | 0.749 |
| 4 | 0.1 | 0.1 | 200.0 | 0.748 |
| 1 | 0.01 | 0.1 | 200.0 | 0.746 |
| 3 | 0.1 | 0.01 | 200.0 | 0.745 |
| 5 | 0.1 | 1.0 | 200.0 | 0.744 |


In [4]:
hyperparameters_custom = {'c1':0.01, 'c2': 1, 'max_iterations': 200}

In [5]:
def last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val):

    results_df = pd.DataFrame(columns = ['poss_transitions', 'poss_states', 'min_freq', 'F1-score'])
    best_f1, best_params = 0, dict()
    num_combinations = len(hyperparameters['feature.possible_transitions']) * len(hyperparameters['feature.possible_states']) * len(hyperparameters['feature.minfreq'])
    current_combination = 0
    for trans in hyperparameters['feature.possible_transitions']:
        for states in hyperparameters['feature.possible_states']:
            for min_freq in hyperparameters['feature.minfreq']:
                current_combination += 1
                print(f'Fitting model {current_combination} of {num_combinations}', end = '\r')
                model = nltk.tag.CRFTagger(feature_func= Feature_getter(), training_opt = {'c1': 0.01, 'c2': 1, 'max_iterations': 200, 'feature.possible_transitions': trans,
                                                                        'feature.possible_states': states, 'feature.minfreq': min_freq})
                model.train(train_sents, 'models/model.crf.tagger')

                pred = model.tag_sents(X_val)
                results, _ = compute_metrics(val_sents, pred)
                results_df.loc[len(results_df)] = [trans, states, min_freq, results['F1-score']]
                if results['F1-score'] > best_f1:
                    best_f1 = results['F1-score']
                    best_params = {'feature.possible_transitions': trans, 'feature.possible_states': states, 'feature.minfreq': min_freq}

    return best_f1, best_params, results_df

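# Note: `hyperparameters` must hold the feature.* grid here.
# The output below (4 fits, all with min_freq = 0) suggests feature.minfreq was restricted to [0].
best_complete, best_params_complete, results_df_complete = last_gridsearch_cv(hyperparameters, train_sents, val_sents, X_val)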

Fitting model 4 of 4

In [6]:
results_df_complete.sort_values(by = 'F1-score', ascending = False).head(5)

| | poss_transitions | poss_states | min_freq | F1-score |
|---|---|---|---|---|
| 2 | False | True | 0 | 0.756 |
| 0 | True | True | 0 | 0.753 |
| 3 | False | False | 0 | 0.749 |
| 1 | True | False | 0 | 0.747 |


<span style="font-family:Courier New">The search suggests that the best hyperparameters for CRFTagger with the custom feature getter are: {'c1': 0.01, 'c2': 1, 'max_iterations': 200, 'feature.possible_transitions': False, 'feature.possible_states': True, 'feature.minfreq': 0}, with a validation F1 of 0.756. </span>
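
Since the search ran in two stages, the final configuration is the union of both results. A minimal sketch of assembling it into a single options dictionary follows; the variable names are illustrative, and the values are the ones reported above.

```python
# Stage 1: regularisation strengths and iteration budget.
stage_one = {"c1": 0.01, "c2": 1, "max_iterations": 200}

# Stage 2: feature-generation options found with stage 1 held fixed.
stage_two = {
    "feature.possible_transitions": False,
    "feature.possible_states": True,
    "feature.minfreq": 0,
}

# Merge both stages; the keys do not overlap, so nothing is overwritten.
training_opt = {**stage_one, **stage_two}
print(training_opt)
```

The merged dictionary can then be passed as `training_opt` to `nltk.tag.CRFTagger` together with the custom feature function, in the same way as in the search cells.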

<div class="alert alert-block alert-success"> 
<b>Conclusion:</b> Adding our custom features clearly improves on the default feature function: the best validation F1 rises from 0.717 to 0.756. That said, <b>we will see how the model performs on the test split in the test notebook</b>.
</div>