### Researcher Name Extraction Dataset

Dataset statistics:

| Data file  | Documents | Sentences | Tokens | Names |
|------------|-----------|-----------|--------|-------|
| Training   | 80        | 24728     | 110269 | 5822  |
| Validation | 35        | 8743      | 36757  | 1788  |
| Test       | 35        | 10399     | 44795  | 2723  |
| Total      | 145       | 43870     | 191821 | 10333 |

In [None]:
import numpy as np
import time
import os
import random
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

from optparse import OptionParser
from pathlib import Path
from model.hmm import HiddenMarkov, load_dataset

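LABELS = ['O', 'B-PER', 'I-PER']

# Note: both the gold tags (t) and the "predictions" (p) are built from Y,
# so this cell only exercises the output file format and the evaluation script.
for name in ['train', 'valid', 'test']:
    _, Y, T = load_dataset('../data/ner_on_html/' + name)
    t = [[LABELS[t__] for t__ in t_] for t_ in Y]
    p = [[LABELS[p__] for p__ in p_] for p_ in Y]
    w = T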
    
    with Path('../results/score/{}.preds.txt'.format(name)).open('wb') as f:
        for words, preds, tags in zip(w, p, t):
            f.write(b'\n')
            for word, pred, tag in zip(words, preds, tags):
                f.write(' '.join([word, tag, pred]).encode() + b'\n')

!cd .. && ./eval.sh | grep processed

In [None]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

def plot_word_frequency(directory, color):
    my_counter = Counter()
    for fname in ['train', 'valid', 'test']:
        with open(directory + '/' + fname) as f:
            words = [line.strip().lower().split()[0] for line in f if len(line.strip()) > 0]
            words = [w for w in words if w != '-docstart-']
            my_counter.update(words)

    data = [(key, my_counter[key]) for key in my_counter]    
    data.sort(key=lambda x: x[1], reverse=True)
    
    print([(i, x[1]) for i, x in enumerate(data)][:100])
    plt.plot([x[1] for x in data][:100], color)
    return data[:50]
    
plt.title('Word frequencies')
data1 = plot_word_frequency('../data/conll2003', 'r')
data2 = plot_word_frequency('../data/ner_on_html', 'b')

print(' '.join([d[0] for d in data1[:10]]))
print()
print(' '.join([d[0] for d in data2[:10]]))

for d1, d2 in zip(data1, data2):
    print('%s & %d & %s & %d' % (d1[0], d1[1], d2[0], d2[1]))

In [None]:
import pandas as pd
from dython import nominal

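def load_raw_dataset(path):
    # Read a file with one token per line and a blank line between sentences.
    with open(path, 'r', encoding='utf8') as f:
        data = f.read().strip()
    sentences = [s.split('\n') for s in data.split('\n\n') if not s.startswith('-DOCSTART-')]
    # Flatten to one row per token, skipping empty lines.
    X = [t.split(' ') for s in sentences for t in s if len(t) > 0]
    # Keep columns 2-4 (token, exact match, partial match) and 7 onward (remaining features).
    return [x[2:5] + x[7:] for x in X]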

X = load_raw_dataset('../data/ner_on_html/train')
X += load_raw_dataset('../data/ner_on_html/valid')
X += load_raw_dataset('../data/ner_on_html/test')

data = {}
data['words']         = [x[0 ] for x in X]
data['exact_match']   = [int(x[1]) for x in X]
data['partial_match'] = [int(x[2]) for x in X]
data['email']         = [int(x[3]) for x in X]
data['number']        = [int(x[4]) for x in X]
data['honorific']     = [int(x[5]) for x in X] 
data['url']           = [int(x[6]) for x in X]
data['capitalized']   = [int(x[7]) for x in X]
data['punctuation']   = [int(x[8]) for x in X]
data['html_tag']      = [x[9 ] for x in X]
data['css_class']     = [x[10] for x in X]

data['words'][0]
df = pd.DataFrame(data)

nominal.associations(df, nominal_columns=['words','html_tag', 'css_class'])

### How to do it: https://github.com/shakedzy/dython/issues/2

Calculates Cramér's V statistic for categorical-categorical association.
Uses the bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328.
This is a symmetric coefficient: V(x,y) = V(y,x).

https://github.com/shakedzy/dython/blob/master/dython/nominal.py
https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
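
To make the statistic concrete, here is a minimal sketch of the bias-corrected Cramér's V using only the standard library. The function name `cramers_v_corrected` is hypothetical and this is not dython's code; in particular, the sketch computes the plain Pearson chi-squared statistic without any continuity correction, which is an assumption of this example.

```python
import math
from collections import Counter

def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equal-length categorical sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError('x and y must have the same length of at least 2')
    joint = Counter(zip(x, y))      # observed counts per (x, y) cell
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic (assumption: no continuity correction).
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Bias correction: shrink phi^2 and the table dimensions.
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0  # a variable with a single category carries no association
    return math.sqrt(phi2_corr / denom)

tags = ['div', 'div', 'span', 'span'] * 25
classes = ['name', 'bio', 'name', 'bio'] * 25
print(cramers_v_corrected(tags, tags))     # identical variables
print(cramers_v_corrected(tags, classes))  # independent variables
```

Swapping the two arguments only transposes the contingency table, which is why the coefficient is symmetric.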

### Nested cross-validation

5-fold cross-validation

Partition the data randomly into five folds; each fold is held out once for evaluation while the remaining four are used for training.
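
The partitioning step can be sketched as follows. This is a plain (not nested) 5-fold split; the helper names `make_folds` and `cross_validation_splits` and the `seed` argument are assumptions made for this illustration, and the 145 items stand in for the 145 documents listed in the dataset table.

```python
import random

def make_folds(items, k=5, seed=0):
    """Shuffle a copy of items and split it into k contiguous folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        end = start + fold_size if i < k - 1 else len(shuffled)
        folds.append(shuffled[start:end])
    return folds

def cross_validation_splits(folds):
    """Yield (train, held_out) pairs, holding out each fold exactly once."""
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out

folds = make_folds(range(145), k=5)
for train, held_out in cross_validation_splits(folds):
    print(len(train), len(held_out))
```

Fixing the seed keeps the folds identical across models, so their scores are computed on the same held-out documents.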

Nested CV
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

Common error with cross validation
https://www.youtube.com/watch?v=S06JpVoNaA0

https://www.kdnuggets.com/2017/08/dataiku-predictive-model-holdout-cross-validation.html

https://www.datarobot.com/wiki/training-validation-holdout/

The dataset is split into three files: train, valid, and test. Each token is accompanied by 11 features:

| Feature                          | Type        |
|----------------------------------|-------------|
| Unaccented lowercase token       | Categorical |
| Exact dictionary match           | Binary      |
| Partial dictionary match         | Binary      |
| Email                            | Binary      |
| Number                           | Binary      |
| Honorific (Mr., Mrs., Dr., etc.) | Binary      |
| Matches a URL                    | Binary      |
| Is capitalized                   | Binary      |
| Is a punctuation sign            | Binary      |
| HTML tag + parent                | Categorical |
| CSS class                        | Categorical |

### Hidden Markov Models

In [None]:
import numpy as np
import time
import os
import random
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

from optparse import OptionParser
from pathlib import Path
from model.hmm import HiddenMarkov, load_dataset

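def test_hmm(timesteps, use_features, dataset):
    """Run 5-fold cross-validation and write one prediction file per fold.

    Passing timesteps == 0 selects the naive Bayes variant.
    """
    start_time = time.time()
    naive_bayes = timesteps == 0
    if naive_bayes:
        timesteps = 1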
        
    print('Fitting...')
    X1, Y1, T1 = load_dataset(dataset + '/train')
    X2, Y2, T2 = load_dataset(dataset + '/valid')
    X3, Y3, T3 = load_dataset(dataset + '/test')    
    # Pool all three files; the folds below re-partition the pooled data.
    training_set = list(zip(X1 + X2 + X3, Y1 + Y2 + Y3, T1 + T2 + T3))

    random.shuffle(training_set)
    fold_size = len(training_set) // 5
    
    folds = []
    for i in range(5):
        start = i * fold_size
        end = start + fold_size if (i < 4) else len(training_set)
        folds.append(training_set[start:end])
    print('Fold size:', fold_size)
    
    for i in range(5):
        train = []        
        for j in range(5):        
            if i != j:
                train = train + folds[j]
                
        # Unzip the (X, Y, T) triples back into three parallel lists.
        train_X, train_Y, train_T = [list(t) for t in zip(*train)]
        test_X, test_Y, test_T = [list(t) for t in zip(*folds[i])]
        
        hmm = HiddenMarkov(timesteps, naive_bayes=naive_bayes, use_features=use_features, self_train=False)
        hmm.fit(train_X, train_Y)

        t = test_Y
        p = hmm.predict(test_X)

        t = [[['O', 'B-PER', 'I-PER'][t__] for t__ in t_] for t_ in t]
        p = [[['O', 'B-PER', 'I-PER'][p__] for p__ in p_] for p_ in p]
        w = test_T

        name = 'fold_' + str(i)
        print('Writing', name)
        with Path('../results/score/{}.preds.txt'.format(name)).open('wb') as f:
            for words, preds, tags in zip(w, p, t):
                f.write(b'\n')
                for word, pred, tag in zip(words, preds, tags):
                    f.write(' '.join([word, tag, pred]).encode() + b'\n')

    print('Elapsed time: %.4f' % (time.time() - start_time))

#### Naive Bayes

In [None]:
# test_hmm(0, False, '../data/ner_on_html')

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/nb
!mv ../results/score/fold* ../results/cross_validation/nb

### Maximum Entropy

In [None]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Disable debug logs Tensorflow.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences'    
})
estimator.train_cv()
# estimator.test()

### LSTM-CRF

In [2]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Disable debug logs Tensorflow.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "elmo",
    "char_representation": "lstm",
    "decoder": "crf",  
    # "loss": "cross_entropy"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/lstm_crf_elmo
!mv ../results/score/fold* ../results/cross_validation/lstm_crf_elmo

Fold size: 29
Loss: 0.1017, Acc: 1.0000, Time: 121.6090, Step: 1000
Loss: 0.0378, Acc: 1.0000, Time: 238.9198, Step: 2000
Loss: 0.0001, Acc: 1.0000, Time: 359.0481, Step: 3000
Loss: 0.0098, Acc: 1.0000, Time: 423.8942, Step: 3553
fold_0 - Epoch 0, Precision: 0.9042, Recall: 0.9230, F1: 0.9135
Loss: 0.0124, Acc: 1.0000, Time: 63.2519, Step: 836
fold_0 - Epoch 0, Precision: 0.9511, Recall: 0.8532, F1: 0.8995
Loss: 0.0028, Acc: 1.0000, Time: 117.8207, Step: 1000
Loss: 0.0052, Acc: 1.0000, Time: 238.0059, Step: 2000
Loss: 0.0018, Acc: 1.0000, Time: 353.4686, Step: 3000
Loss: 0.0083, Acc: 1.0000, Time: 415.8655, Step: 3553
fold_0 - Epoch 1, Precision: 0.9519, Recall: 0.9579, F1: 0.9549
Loss: 0.0044, Acc: 1.0000, Time: 62.0387, Step: 836
fold_0 - Epoch 1, Precision: 0.9280, Recall: 0.8030, F1: 0.8610
Loss: 0.0006, Acc: 1.0000, Time: 117.5116, Step: 1000
Loss: 0.0777, Acc: 1.0000, Time: 236.7794, Step: 2000
Loss: 0.0996, Acc: 0.9714, Time: 354.4473, Step: 3000
Loss: 0.0013, Acc: 1.0000, Time:

Loss: 0.0643, Acc: 1.0000, Time: 123.6233, Step: 1000
Loss: 0.0293, Acc: 1.0000, Time: 245.8858, Step: 2000
Loss: 0.0194, Acc: 1.0000, Time: 367.9636, Step: 3000
Loss: 0.8547, Acc: 0.8667, Time: 418.9395, Step: 3419
fold_4 - Epoch 0, Precision: 0.8740, Recall: 0.9160, F1: 0.8945
Loss: 0.0001, Acc: 1.0000, Time: 71.9694, Step: 971
fold_4 - Epoch 0, Precision: 0.9024, Recall: 0.9431, F1: 0.9223
Loss: 0.0643, Acc: 1.0000, Time: 122.6884, Step: 1000
Loss: 0.0348, Acc: 1.0000, Time: 244.4384, Step: 2000
Loss: 0.0010, Acc: 1.0000, Time: 365.9124, Step: 3000
Loss: 0.0003, Acc: 1.0000, Time: 416.6621, Step: 3419
fold_4 - Epoch 1, Precision: 0.9523, Recall: 0.9636, F1: 0.9579
Loss: 0.0000, Acc: 1.0000, Time: 71.4247, Step: 971
fold_4 - Epoch 1, Precision: 0.9475, Recall: 0.7862, F1: 0.8593
Loss: 0.0708, Acc: 1.0000, Time: 121.0148, Step: 1000
Loss: 0.0607, Acc: 1.0000, Time: 243.3112, Step: 2000
Loss: 0.0224, Acc: 1.0000, Time: 366.4818, Step: 3000
Loss: 0.0018, Acc: 1.0000, Time: 416.9944, Ste

In [3]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Disable debug logs Tensorflow.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "crf",  
    # "loss": "cross_entropy"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
# Store the GloVe results in their own directory so they do not overwrite the ELMo run.
!mkdir -p ../results/cross_validation/lstm_crf_glove
!mv ../results/score/fold* ../results/cross_validation/lstm_crf_glove

Fold size: 29
(35214, 300)
Loss: 1.2083, Acc: 0.9759, Time: 72.0324, Step: 1000
Loss: 0.0967, Acc: 1.0000, Time: 139.2430, Step: 2000
Loss: 0.6713, Acc: 0.9818, Time: 201.1569, Step: 3000
Loss: 0.0024, Acc: 1.0000, Time: 227.0417, Step: 3395
fold_0 - Epoch 0, Precision: 0.8549, Recall: 0.8824, F1: 0.8684
Loss: 0.1735, Acc: 1.0000, Time: 34.5421, Step: 995
fold_0 - Epoch 0, Precision: 0.9307, Recall: 0.9481, F1: 0.9393
Loss: 0.0007, Acc: 1.0000, Time: 70.0688, Step: 1000
Loss: 0.2897, Acc: 0.9714, Time: 137.5691, Step: 2000
Loss: 0.0082, Acc: 1.0000, Time: 201.5033, Step: 3000
Loss: 0.0026, Acc: 1.0000, Time: 225.4529, Step: 3395
fold_0 - Epoch 1, Precision: 0.9486, Recall: 0.9560, F1: 0.9523
Loss: 0.0404, Acc: 1.0000, Time: 34.5108, Step: 995
fold_0 - Epoch 1, Precision: 0.9346, Recall: 0.9527, F1: 0.9436
Loss: 0.0011, Acc: 1.0000, Time: 70.6903, Step: 1000
Loss: 0.0017, Acc: 1.0000, Time: 137.4134, Step: 2000
Loss: 0.3458, Acc: 0.9750, Time: 199.6181, Step: 3000
Loss: 0.0011, Acc: 1.0

Loss: 0.1035, Acc: 1.0000, Time: 234.7522, Step: 3537
fold_4 - Epoch 0, Precision: 0.8973, Recall: 0.9085, F1: 0.9029
Loss: 0.0011, Acc: 1.0000, Time: 27.8724, Step: 853
fold_4 - Epoch 0, Precision: 0.8271, Recall: 0.9678, F1: 0.8919
Loss: 0.0004, Acc: 1.0000, Time: 70.7192, Step: 1000
Loss: 0.0260, Acc: 1.0000, Time: 135.8408, Step: 2000
Loss: 0.1739, Acc: 0.9842, Time: 198.8692, Step: 3000
Loss: 0.0014, Acc: 1.0000, Time: 233.4605, Step: 3537
fold_4 - Epoch 1, Precision: 0.9545, Recall: 0.9620, F1: 0.9583
Loss: 0.0004, Acc: 1.0000, Time: 27.3897, Step: 853
fold_4 - Epoch 1, Precision: 0.8437, Recall: 0.9633, F1: 0.8996
Loss: 0.0004, Acc: 1.0000, Time: 72.1640, Step: 1000
Loss: 0.0015, Acc: 1.0000, Time: 136.4272, Step: 2000
Loss: 0.0060, Acc: 1.0000, Time: 198.7079, Step: 3000
Loss: 0.0009, Acc: 1.0000, Time: 233.3336, Step: 3537
fold_4 - Epoch 2, Precision: 0.9672, Recall: 0.9742, F1: 0.9707
Loss: 0.0000, Acc: 1.0000, Time: 27.4816, Step: 853
fold_4 - Epoch 2, Precision: 0.8377, Rec

In [2]:
!pip install tensorflow_hub

Collecting tensorflow_hub
  Downloading https://files.pythonhosted.org/packages/ac/64/3bba86ca49ef21a4add11a4d37e3f6cd05d2e61d207ebe26a8a96b340826/tensorflow_hub-0.6.0-py2.py3-none-any.whl (84kB)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.6.0


In [2]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Disable debug logs Tensorflow.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.5,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_05_loss
!mv ../results/score/fold* ../results/cross_validation/f_05_loss

Fold size: 29
(35214, 300)
Loss: -0.1895, Acc: 0.9167, Time: 76.9837, Step: 1000
Loss: -0.1957, Acc: 0.9750, Time: 147.6236, Step: 2000
Loss: -0.0667, Acc: 0.9571, Time: 214.5439, Step: 3000
Loss: -0.0871, Acc: 0.9722, Time: 250.9961, Step: 3569
fold_0 - Epoch 0, Precision: 0.3205, Recall: 0.9180, F1: 0.4751
Loss: 0.0000, Acc: 0.6667, Time: 8.2895, Step: 821
fold_0 - Epoch 0, Precision: 0.3130, Recall: 0.8684, F1: 0.4602
Loss: -0.1332, Acc: 0.9125, Time: 71.0156, Step: 1000
Loss: -0.0666, Acc: 0.9778, Time: 138.4262, Step: 2000
Loss: -0.1602, Acc: 0.9625, Time: 208.1224, Step: 3000
Loss: 0.0000, Acc: 0.8857, Time: 246.1954, Step: 3569
fold_0 - Epoch 1, Precision: 0.4165, Recall: 0.9721, F1: 0.5831
Loss: 0.0000, Acc: 0.6667, Time: 8.2661, Step: 821
fold_0 - Epoch 1, Precision: 0.2954, Recall: 0.9038, F1: 0.4453
Loss: -0.0666, Acc: 0.9333, Time: 69.7046, Step: 1000
Loss: -0.2597, Acc: 0.9357, Time: 134.7231, Step: 2000
Loss: -0.1332, Acc: 0.9800, Time: 203.9247, Step: 3000
Loss: -0.2857,

Loss: -0.0000, Acc: 0.0000, Time: 8.4373, Step: 767
fold_3 - Epoch 4, Precision: 0.5653, Recall: 0.8482, F1: 0.6785
Writing fold_3
(35214, 300)
Loss: -0.1266, Acc: 0.8875, Time: 66.0849, Step: 1000
Loss: -0.0637, Acc: 0.8875, Time: 130.9750, Step: 2000
Loss: -0.1305, Acc: 0.9545, Time: 195.8233, Step: 3000
Loss: -0.1139, Acc: 0.9222, Time: 233.4504, Step: 3570
fold_4 - Epoch 0, Precision: 0.3569, Recall: 0.9188, F1: 0.5141
Loss: -0.2889, Acc: 0.8769, Time: 9.1577, Step: 820
fold_4 - Epoch 0, Precision: 0.3560, Recall: 0.9192, F1: 0.5133
Loss: -0.0663, Acc: 0.8875, Time: 68.8678, Step: 1000
Loss: -0.0398, Acc: 0.8667, Time: 134.9228, Step: 2000
Loss: 0.0000, Acc: 0.9800, Time: 197.3293, Step: 3000
Loss: -0.2665, Acc: 0.9600, Time: 233.7834, Step: 3570
fold_4 - Epoch 1, Precision: 0.4141, Recall: 0.9721, F1: 0.5808
Loss: -0.3025, Acc: 0.8154, Time: 9.2101, Step: 820
fold_4 - Epoch 1, Precision: 0.3377, Recall: 0.9333, F1: 0.4959
Loss: -0.3330, Acc: 0.9556, Time: 67.4635, Step: 1000
Loss:

In [3]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.1,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_01_loss
!mv ../results/score/fold* ../results/cross_validation/f_01_loss

Fold size: 29
(35214, 300)
Loss: -0.1418, Acc: 0.9091, Time: 59.4684, Step: 1000
Loss: -0.3191, Acc: 0.9882, Time: 117.1921, Step: 2000
Loss: -0.1547, Acc: 0.9875, Time: 180.3011, Step: 3000
Loss: -0.5356, Acc: 1.0000, Time: 220.4531, Step: 3636
fold_0 - Epoch 0, Precision: 0.3893, Recall: 0.9215, F1: 0.5474
Loss: 0.0000, Acc: 0.9200, Time: 7.8569, Step: 754
fold_0 - Epoch 0, Precision: 0.4794, Recall: 0.9368, F1: 0.6343
Loss: -0.1803, Acc: 0.9714, Time: 59.3815, Step: 1000
Loss: 0.0000, Acc: 0.9000, Time: 116.7831, Step: 2000
Loss: -0.4504, Acc: 0.9944, Time: 179.5015, Step: 3000
Loss: -0.3633, Acc: 0.9200, Time: 220.3705, Step: 3636
fold_0 - Epoch 1, Precision: 0.4471, Recall: 0.9295, F1: 0.6038
Loss: 0.0000, Acc: 0.8400, Time: 7.9903, Step: 754
fold_0 - Epoch 1, Precision: 0.4528, Recall: 0.9080, F1: 0.6042
Loss: 0.0000, Acc: 0.9800, Time: 59.6331, Step: 1000
Loss: -0.3627, Acc: 1.0000, Time: 118.1575, Step: 2000
Loss: -0.2715, Acc: 0.9727, Time: 180.4573, Step: 3000
Loss: -0.1818, 

Loss: -0.0872, Acc: 0.9929, Time: 181.7452, Step: 3000
Loss: -0.1753, Acc: 0.9259, Time: 220.3931, Step: 3676
fold_4 - Epoch 0, Precision: 0.3852, Recall: 0.9250, F1: 0.5439
Loss: -0.0000, Acc: 0.8696, Time: 8.8416, Step: 714
fold_4 - Epoch 0, Precision: 0.3872, Recall: 0.8346, F1: 0.5290
Loss: -0.0909, Acc: 0.9867, Time: 67.1319, Step: 1000
Loss: -0.1803, Acc: 1.0000, Time: 125.4330, Step: 2000
Loss: -0.3608, Acc: 0.9937, Time: 182.9833, Step: 3000
Loss: 0.0000, Acc: 0.9753, Time: 224.0297, Step: 3676
fold_4 - Epoch 1, Precision: 0.4285, Recall: 0.9686, F1: 0.5942
Loss: -0.0000, Acc: 0.9565, Time: 9.0653, Step: 714
fold_4 - Epoch 1, Precision: 0.4965, Recall: 0.8609, F1: 0.6298
Loss: -0.0908, Acc: 0.9400, Time: 66.1054, Step: 1000
Loss: -0.2722, Acc: 0.9889, Time: 125.2519, Step: 2000
Loss: -0.0908, Acc: 0.9867, Time: 183.5213, Step: 3000
Loss: -0.1006, Acc: 0.9524, Time: 223.7957, Step: 3676
fold_4 - Epoch 2, Precision: 0.4517, Recall: 0.9717, F1: 0.6167
Loss: -0.0000, Acc: 0.7391, T

In [4]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.2,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_02_loss
!mv ../results/score/fold* ../results/cross_validation/f_02_loss

Fold size: 29
(35214, 300)
Loss: -0.1897, Acc: 0.9923, Time: 68.3791, Step: 1000
Loss: -0.1577, Acc: 0.9600, Time: 134.1882, Step: 2000
Loss: -0.0540, Acc: 0.9680, Time: 200.4276, Step: 3000
Loss: -0.1667, Acc: 0.8500, Time: 218.2753, Step: 3261
fold_0 - Epoch 0, Precision: 0.3332, Recall: 0.9256, F1: 0.4900
Loss: 0.0000, Acc: 1.0000, Time: 10.1060, Step: 1000
Loss: 0.0000, Acc: 0.9467, Time: 11.4363, Step: 1129
fold_0 - Epoch 0, Precision: 0.4689, Recall: 0.9601, F1: 0.6301
Loss: -0.0831, Acc: 0.8750, Time: 68.7389, Step: 1000
Loss: -0.2481, Acc: 0.9824, Time: 135.1296, Step: 2000
Loss: -0.1420, Acc: 0.9625, Time: 200.9225, Step: 3000
Loss: -0.2202, Acc: 0.8667, Time: 218.9280, Step: 3261
fold_0 - Epoch 1, Precision: 0.3919, Recall: 0.9713, F1: 0.5584
Loss: 0.0000, Acc: 1.0000, Time: 10.2079, Step: 1000
Loss: 0.0000, Acc: 0.9467, Time: 11.5487, Step: 1129
fold_0 - Epoch 1, Precision: 0.4650, Recall: 0.9546, F1: 0.6254
Loss: 0.0000, Acc: 0.9803, Time: 68.3576, Step: 1000
Loss: -0.0829,

Loss: -0.6016, Acc: 0.9615, Time: 9.0658, Step: 794
fold_3 - Epoch 4, Precision: 0.3839, Recall: 0.9087, F1: 0.5397
Writing fold_3
(35214, 300)
Loss: -0.0942, Acc: 0.9444, Time: 66.3876, Step: 1000
Loss: -0.1393, Acc: 0.9935, Time: 130.5749, Step: 2000
Loss: -0.1430, Acc: 0.9545, Time: 193.5614, Step: 3000
Loss: 0.0000, Acc: 0.9500, Time: 229.0970, Step: 3555
fold_4 - Epoch 0, Precision: 0.3389, Recall: 0.9173, F1: 0.4950
Loss: 0.0000, Acc: 1.0000, Time: 8.7598, Step: 835
fold_4 - Epoch 0, Precision: 0.5748, Recall: 0.9491, F1: 0.7160
Loss: -0.1650, Acc: 0.9600, Time: 65.4617, Step: 1000
Loss: -0.0833, Acc: 0.9714, Time: 129.8029, Step: 2000
Loss: -0.0833, Acc: 0.9333, Time: 194.2681, Step: 3000
Loss: -0.2077, Acc: 0.9833, Time: 230.5243, Step: 3555
fold_4 - Epoch 1, Precision: 0.4213, Recall: 0.9665, F1: 0.5868
Loss: 0.0000, Acc: 1.0000, Time: 8.9777, Step: 835
fold_4 - Epoch 1, Precision: 0.6336, Recall: 0.9449, F1: 0.7585
Loss: -0.3329, Acc: 0.9667, Time: 66.3195, Step: 1000
Loss: -

In [5]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.3,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_03_loss
!mv ../results/score/fold* ../results/cross_validation/f_03_loss

Fold size: 29
(35214, 300)
Loss: -0.2582, Acc: 0.9300, Time: 70.7980, Step: 1000
Loss: -0.2768, Acc: 0.9684, Time: 136.8705, Step: 2000
Loss: -0.1512, Acc: 0.9000, Time: 198.4584, Step: 3000
Loss: -0.1491, Acc: 1.0000, Time: 226.0133, Step: 3460
fold_0 - Epoch 0, Precision: 0.3649, Recall: 0.9296, F1: 0.5241
Loss: 0.0000, Acc: 1.0000, Time: 9.9330, Step: 930
fold_0 - Epoch 0, Precision: 0.4231, Recall: 0.8723, F1: 0.5698
Loss: 0.0000, Acc: 0.8800, Time: 69.4947, Step: 1000
Loss: -0.1533, Acc: 0.9400, Time: 134.9982, Step: 2000
Loss: -0.0743, Acc: 0.9289, Time: 198.2523, Step: 3000
Loss: -0.3070, Acc: 0.9750, Time: 226.4860, Step: 3460
fold_0 - Epoch 1, Precision: 0.4238, Recall: 0.9775, F1: 0.5912
Loss: 0.0000, Acc: 0.9000, Time: 9.9387, Step: 930
fold_0 - Epoch 1, Precision: 0.3980, Recall: 0.9234, F1: 0.5563
Loss: -0.1537, Acc: 0.9846, Time: 71.0840, Step: 1000
Loss: -0.1973, Acc: 0.8800, Time: 135.5641, Step: 2000
Loss: -0.2305, Acc: 0.9812, Time: 198.4721, Step: 3000
Loss: -0.3076,

Loss: -0.1477, Acc: 0.9636, Time: 194.0298, Step: 3000
Loss: 0.0000, Acc: 0.6667, Time: 221.7383, Step: 3404
fold_4 - Epoch 0, Precision: 0.3874, Recall: 0.9149, F1: 0.5443
Loss: 0.0000, Acc: 0.9786, Time: 9.9506, Step: 986
fold_4 - Epoch 0, Precision: 0.4285, Recall: 0.9796, F1: 0.5962
Loss: -0.0425, Acc: 0.9091, Time: 66.3667, Step: 1000
Loss: -0.1523, Acc: 0.9600, Time: 132.0872, Step: 2000
Loss: -0.0765, Acc: 0.9444, Time: 197.0097, Step: 3000
Loss: -0.1447, Acc: 0.9167, Time: 223.9748, Step: 3404
fold_4 - Epoch 1, Precision: 0.4574, Recall: 0.9686, F1: 0.6213
Loss: 0.0000, Acc: 0.9714, Time: 10.7331, Step: 986
fold_4 - Epoch 1, Precision: 0.4553, Recall: 0.9841, F1: 0.6226
Loss: -0.1532, Acc: 0.9538, Time: 65.7879, Step: 1000
Loss: -0.2305, Acc: 0.9800, Time: 131.0514, Step: 2000
Loss: -0.2304, Acc: 1.0000, Time: 196.2168, Step: 3000
Loss: 0.0000, Acc: 1.0000, Time: 222.8339, Step: 3404
fold_4 - Epoch 2, Precision: 0.4720, Recall: 0.9793, F1: 0.6370
Loss: 0.0000, Acc: 0.9571, Time

In [6]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.4,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_04_loss
!mv ../results/score/fold* ../results/cross_validation/f_04_loss

Fold size: 29
(35214, 300)
Loss: -0.2365, Acc: 0.9286, Time: 66.9386, Step: 1000
Loss: -0.0669, Acc: 0.9444, Time: 129.3893, Step: 2000
Loss: -0.0714, Acc: 0.8875, Time: 192.1344, Step: 3000
Loss: -0.1167, Acc: 0.9545, Time: 218.4487, Step: 3410
fold_0 - Epoch 0, Precision: 0.3317, Recall: 0.9066, F1: 0.4857
Loss: 0.0000, Acc: 0.0000, Time: 10.9439, Step: 980
fold_0 - Epoch 0, Precision: 0.5036, Recall: 0.9446, F1: 0.6570
Loss: -0.2520, Acc: 0.9722, Time: 64.2913, Step: 1000
Loss: -0.2131, Acc: 0.9600, Time: 127.3415, Step: 2000
Loss: -0.1426, Acc: 0.9200, Time: 190.2364, Step: 3000
Loss: 0.0000, Acc: 0.9444, Time: 217.3048, Step: 3410
fold_0 - Epoch 1, Precision: 0.4037, Recall: 0.9743, F1: 0.5709
Loss: 0.0000, Acc: 0.0000, Time: 11.0959, Step: 980
fold_0 - Epoch 1, Precision: 0.4520, Recall: 0.9446, F1: 0.6114
Loss: -0.0714, Acc: 0.9571, Time: 63.6129, Step: 1000
Loss: -0.2851, Acc: 0.9773, Time: 126.6520, Step: 2000
Loss: -0.1426, Acc: 0.9231, Time: 189.7553, Step: 3000
Loss: -0.237

Loss: 0.0000, Acc: 0.9500, Time: 8.5270, Step: 729
fold_3 - Epoch 4, Precision: 0.4526, Recall: 0.9689, F1: 0.6169
Writing fold_3
(35214, 300)
Loss: -0.0630, Acc: 0.9091, Time: 65.8795, Step: 1000
Loss: -0.1413, Acc: 0.9000, Time: 127.7201, Step: 2000
Loss: -0.0712, Acc: 0.9857, Time: 189.9020, Step: 3000
Loss: -0.0000, Acc: 0.5000, Time: 231.7668, Step: 3669
fold_4 - Epoch 0, Precision: 0.3295, Recall: 0.9153, F1: 0.4845
Loss: 0.0000, Acc: 0.7111, Time: 7.6475, Step: 721
fold_4 - Epoch 0, Precision: 0.2614, Recall: 0.9219, F1: 0.4073
Loss: -0.1428, Acc: 0.9500, Time: 65.1395, Step: 1000
Loss: -0.2136, Acc: 0.9455, Time: 125.9109, Step: 2000
Loss: -0.0709, Acc: 0.9909, Time: 188.1173, Step: 3000
Loss: -0.7143, Acc: 1.0000, Time: 231.9309, Step: 3669
fold_4 - Epoch 1, Precision: 0.4259, Recall: 0.9715, F1: 0.5922
Loss: 0.0000, Acc: 0.7778, Time: 7.6663, Step: 721
fold_4 - Epoch 1, Precision: 0.3852, Recall: 0.9308, F1: 0.5449
Loss: -0.1843, Acc: 0.9500, Time: 65.8462, Step: 1000
Loss: -

In [7]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.6,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_06_loss
!mv ../results/score/fold* ../results/cross_validation/f_06_loss

Fold size: 29
(35214, 300)
Loss: -0.1045, Acc: 0.9421, Time: 61.2864, Step: 1000
Loss: -0.1238, Acc: 0.9444, Time: 123.4944, Step: 2000
Loss: -0.1233, Acc: 0.9167, Time: 185.7995, Step: 3000
Loss: -0.2468, Acc: 0.9769, Time: 224.1854, Step: 3610
fold_0 - Epoch 0, Precision: 0.3074, Recall: 0.9021, F1: 0.4585
Loss: -0.0597, Acc: 0.9560, Time: 8.2708, Step: 779
fold_0 - Epoch 0, Precision: 0.3701, Recall: 0.9363, F1: 0.5305
Loss: -0.1005, Acc: 0.9375, Time: 62.1359, Step: 1000
Loss: -0.1248, Acc: 0.9364, Time: 125.6334, Step: 2000
Loss: -0.1246, Acc: 0.9765, Time: 186.7771, Step: 3000
Loss: -0.1250, Acc: 0.9500, Time: 225.2337, Step: 3610
fold_0 - Epoch 1, Precision: 0.3816, Recall: 0.9705, F1: 0.5478
Loss: -0.0620, Acc: 0.9600, Time: 8.1740, Step: 779
fold_0 - Epoch 1, Precision: 0.4354, Recall: 0.9553, F1: 0.5982
Loss: -0.1250, Acc: 0.9125, Time: 60.4673, Step: 1000
Loss: -0.1250, Acc: 0.9625, Time: 123.7438, Step: 2000
Loss: -0.1875, Acc: 0.9500, Time: 184.7223, Step: 3000
Loss: -0.37

Loss: -0.0621, Acc: 0.9875, Time: 183.2897, Step: 3000
Loss: 0.0000, Acc: 0.9167, Time: 203.1314, Step: 3324
fold_4 - Epoch 0, Precision: 0.3192, Recall: 0.9005, F1: 0.4713
Loss: 0.0000, Acc: 0.8500, Time: 10.9289, Step: 1000
Loss: 0.0000, Acc: 0.7083, Time: 11.6480, Step: 1066
fold_4 - Epoch 0, Precision: 0.4266, Recall: 0.9247, F1: 0.5839
Loss: -0.1847, Acc: 0.8833, Time: 62.5835, Step: 1000
Loss: -0.2950, Acc: 0.9895, Time: 120.3462, Step: 2000
Loss: -0.1248, Acc: 0.9700, Time: 182.1480, Step: 3000
Loss: 0.0000, Acc: 0.8750, Time: 201.8973, Step: 3324
fold_4 - Epoch 1, Precision: 0.4127, Recall: 0.9650, F1: 0.5782
Loss: 0.0000, Acc: 0.8500, Time: 11.0244, Step: 1000
Loss: 0.0000, Acc: 0.7083, Time: 11.7520, Step: 1066
fold_4 - Epoch 1, Precision: 0.3953, Recall: 0.9050, F1: 0.5502
Loss: -0.1870, Acc: 0.9467, Time: 60.9425, Step: 1000
Loss: -0.2180, Acc: 0.9077, Time: 119.0442, Step: 2000
Loss: -0.0624, Acc: 0.9462, Time: 179.0083, Step: 3000
Loss: -0.1562, Acc: 0.9167, Time: 199.556

In [8]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.7,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_07_loss
!mv ../results/score/fold* ../results/cross_validation/f_07_loss

Fold size: 29
(35214, 300)
Loss: -0.2188, Acc: 0.9556, Time: 61.7152, Step: 1000
Loss: -0.1136, Acc: 0.9143, Time: 124.5705, Step: 2000
Loss: -0.1733, Acc: 0.9267, Time: 190.2107, Step: 3000
Loss: -0.2502, Acc: 0.9683, Time: 222.5983, Step: 3481
fold_0 - Epoch 0, Precision: 0.3227, Recall: 0.8881, F1: 0.4734
Loss: -0.3903, Acc: 0.9333, Time: 9.3371, Step: 909
fold_0 - Epoch 0, Precision: 0.3523, Recall: 0.9625, F1: 0.5158
Loss: -0.0588, Acc: 0.9000, Time: 60.7225, Step: 1000
Loss: -0.1175, Acc: 0.9889, Time: 125.2617, Step: 2000
Loss: -0.2934, Acc: 0.9786, Time: 192.2431, Step: 3000
Loss: -0.2208, Acc: 0.9365, Time: 223.9518, Step: 3481
fold_0 - Epoch 1, Precision: 0.4199, Recall: 0.9653, F1: 0.5852
Loss: -0.3918, Acc: 0.9600, Time: 9.5099, Step: 909
fold_0 - Epoch 1, Precision: 0.3518, Recall: 0.9596, F1: 0.5148
Loss: -0.1176, Acc: 0.9000, Time: 60.1005, Step: 1000
Loss: -0.0587, Acc: 0.9684, Time: 123.9086, Step: 2000
Loss: -0.1764, Acc: 0.9824, Time: 188.7524, Step: 3000
Loss: -0.16

Loss: -0.1754, Acc: 0.8625, Time: 174.5842, Step: 3000
Loss: -0.2103, Acc: 0.9625, Time: 205.8599, Step: 3546
fold_4 - Epoch 0, Precision: 0.2840, Recall: 0.9014, F1: 0.4320
Loss: 0.0000, Acc: 0.8538, Time: 10.1619, Step: 843
fold_4 - Epoch 0, Precision: 0.3712, Recall: 0.8631, F1: 0.5191
Loss: -0.1752, Acc: 0.9300, Time: 60.0086, Step: 1000
Loss: -0.1760, Acc: 0.9600, Time: 117.2509, Step: 2000
Loss: -0.0586, Acc: 0.9800, Time: 173.6964, Step: 3000
Loss: -0.1627, Acc: 0.9526, Time: 205.5174, Step: 3546
fold_4 - Epoch 1, Precision: 0.3940, Recall: 0.9652, F1: 0.5596
Loss: 0.0000, Acc: 0.8692, Time: 10.2022, Step: 843
fold_4 - Epoch 1, Precision: 0.4265, Recall: 0.8636, F1: 0.5710
Loss: -0.1763, Acc: 0.9563, Time: 60.0654, Step: 1000
Loss: -0.2347, Acc: 0.9950, Time: 117.5740, Step: 2000
Loss: -0.1764, Acc: 0.9167, Time: 174.9110, Step: 3000
Loss: -0.0588, Acc: 0.9545, Time: 205.4467, Step: 3546
fold_4 - Epoch 2, Precision: 0.4072, Recall: 0.9744, F1: 0.5743
Loss: 0.0000, Acc: 0.8769, T

In [10]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 10,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.8,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_08_loss
!mv ../results/score/fold* ../results/cross_validation/f_08_loss

Fold size: 29
(35214, 300)
Loss: -0.0453, Acc: 0.9000, Time: 66.5641, Step: 1000
Loss: -0.0901, Acc: 0.8667, Time: 139.4658, Step: 2000
Loss: -0.1068, Acc: 0.9706, Time: 208.9476, Step: 3000
Loss: 0.0000, Acc: 1.0000, Time: 18889.0373, Step: 3403
fold_0 - Epoch 0, Precision: 0.3211, Recall: 0.8667, F1: 0.4686
Loss: -0.3452, Acc: 0.9688, Time: 3195.1964, Step: 987
fold_0 - Epoch 0, Precision: 0.3298, Recall: 0.9352, F1: 0.4877


KeyboardInterrupt: 

In [None]:
estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "logits",  
    "f_score_alpha": 0.9,      
    "loss": "f1"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/f_09_loss
!mv ../results/score/fold* ../results/cross_validation/f_09_loss
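
The F-loss cells above differ only in `f_score_alpha` and in the results directory, so the sweep can be generated instead of copied. The sketch below only builds the parameter dictionaries and directory names; `BASE_PARAMS` and `f_loss_run` are hypothetical names, and it uses 5 epochs throughout, whereas the interrupted 0.8 run above used 10.

```python
BASE_PARAMS = {
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    'model': 'lstm_crf',
    'epochs': 5,
    'batch_size': 10,
    'use_features': False,
    'word_embeddings': 'glove',
    'char_representation': 'lstm',
    'decoder': 'logits',
    'loss': 'f1',
}

def f_loss_run(alpha):
    """Hypothetical helper: parameters and results directory for one alpha."""
    params = dict(BASE_PARAMS, f_score_alpha=alpha)
    # 0.5 -> 'f_05_loss', matching the directory names used in the cells above.
    results_dir = '../results/cross_validation/f_{:02d}_loss'.format(round(alpha * 10))
    return params, results_dir

runs = [f_loss_run(a / 10) for a in range(1, 10)]
for params, results_dir in runs:
    print(params['f_score_alpha'], results_dir)
    # In the notebook, each run would then be executed as in the cells above:
    #   estimator = Estimator()
    #   estimator.set_dataset_params(params)
    #   estimator.train_cv()
    # followed by ./eval_model.sh and moving ../results/score/fold* into results_dir.
```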