### Researcher Name Extraction Dataset

Dataset statistics:

| Data file  | Documents | Sentences | Tokens | Names |
|------------|-----------|-----------|--------|-------|
| Training   | 80        | 24728     | 110269 | 5822  |
| Validation | 35        | 8743      | 36757  | 1788  |
| Test       | 35        | 10399     | 44795  | 2723  |
| Total      | 145       | 43870     | 191821 | 10333 |
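Counts like the ones in this table can be recomputed from the data files. The sketch below assumes a CoNLL-style layout (one token per line, blank lines between sentences, `-DOCSTART-` lines between documents), which matches how the cells below read the files. The function name `corpus_statistics` and the `label_column` parameter are hypothetical; the position of the label column is an assumption and may need adjusting.

```python
def corpus_statistics(text, label_column=1):
    """Count documents, sentences, tokens and names in CoNLL-style text.

    Assumption: `label_column` is the index of the BIO label on each line.
    """
    stats = {'documents': 0, 'sentences': 0, 'tokens': 0, 'names': 0}
    for block in text.strip().split('\n\n'):
        lines = [line for line in block.split('\n') if line.strip()]
        if not lines:
            continue
        if lines[0].startswith('-DOCSTART-'):
            # A document marker is not counted as a sentence or a token.
            stats['documents'] += 1
            lines = lines[1:]
            if not lines:
                continue
        stats['sentences'] += 1
        for line in lines:
            columns = line.split()
            stats['tokens'] += 1
            # Every B-PER label opens one name.
            if columns[label_column] == 'B-PER':
                stats['names'] += 1
    return stats


sample = (
    "-DOCSTART- O\n\n"
    "John B-PER\nSmith I-PER\nteaches O\n\n"
    "Contact O\nMary B-PER\n\n"
    "-DOCSTART- O\n\n"
    "Hello O\n"
)
print(corpus_statistics(sample))
```

To apply it to a real file, read the file into a string first and pass the column index that holds the label.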

In [None]:
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

from pathlib import Path
from model.hmm import load_dataset

LABELS = ['O', 'B-PER', 'I-PER']

for name in ['train', 'valid', 'test']:
    _, Y, T = load_dataset('../data/ner_on_html/' + name)
    # Map label ids to BIO tags.
    t = [[LABELS[t__] for t__ in t_] for t_ in Y]
    # The gold labels double as predictions: no model is involved here, the
    # files only feed the scoring script so that it reports corpus counts.
    p = [[LABELS[p__] for p__ in p_] for p_ in Y]
    w = T

    # One "word tag pred" line per token, with a blank line before each sequence.
    with Path('../results/score/{}.preds.txt'.format(name)).open('wb') as f:
        for words, preds, tags in zip(w, p, t):
            f.write(b'\n')
            for word, pred, tag in zip(words, preds, tags):
                f.write(' '.join([word, tag, pred]).encode() + b'\n')

!cd .. && ./eval.sh | grep processed

In [None]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

def plot_word_frequency(directory, color):
    """Plot the counts of the 100 most frequent tokens and return the top 50."""
    my_counter = Counter()
    for fname in ['train', 'valid', 'test']:
        with open(directory + '/' + fname) as f:
            # The first column of every non-empty line is the token.
            words = [line.strip().lower().split()[0] for line in f if len(line.strip()) > 0]
            # Drop document boundary markers.
            words = [w for w in words if w != '-docstart-']
            my_counter.update(words)

    # (token, count) pairs, most frequent first.
    data = my_counter.most_common()

    print([(i, x[1]) for i, x in enumerate(data)][:100])
    plt.plot([x[1] for x in data][:100], color)
    return data[:50]
    
plt.title('Word frequencies')
data1 = plot_word_frequency('../data/conll2003', 'r')
data2 = plot_word_frequency('../data/ner_on_html', 'b')

print(' '.join([d[0] for d in data1[:10]]))
print()
print(' '.join([d[0] for d in data2[:10]]))

for d1, d2 in zip(data1, data2):
    print('%s & %d & %s & %d' % (d1[0], d1[1], d2[0], d2[1]))

In [None]:
import pandas as pd
from dython import nominal

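def load_raw_dataset(path):
    """Read a dataset file and return the feature columns of every token."""
    with open(path, 'r', encoding='utf8') as f:
        data = f.read().strip()
    # Sentences are separated by blank lines; document markers are skipped.
    sentences = [s.split('\n') for s in data.split('\n\n') if not s.startswith('-DOCSTART-')]
    # Flatten to tokens, ignoring empty lines.
    X = [t.split(' ') for s in sentences for t in s if len(t) > 0]
    # Keep raw columns 2-4 and 7 onward: the feature columns unpacked below.
    for i in range(len(X)):
        X[i] = X[i][2:5] + X[i][7:]
    return X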

X = load_raw_dataset('../data/ner_on_html/train')
X += load_raw_dataset('../data/ner_on_html/valid')
X += load_raw_dataset('../data/ner_on_html/test')

data = {}
data['words']         = [x[0 ] for x in X]
data['exact_match']   = [int(x[1]) for x in X]
data['partial_match'] = [int(x[2]) for x in X]
data['email']         = [int(x[3]) for x in X]
data['number']        = [int(x[4]) for x in X]
data['honorific']     = [int(x[5]) for x in X] 
data['url']           = [int(x[6]) for x in X]
data['capitalized']   = [int(x[7]) for x in X]
data['punctuation']   = [int(x[8]) for x in X]
data['html_tag']      = [x[9 ] for x in X]
data['css_class']     = [x[10] for x in X]

data['words'][0]
df = pd.DataFrame(data)

nominal.associations(df, nominal_columns=['words','html_tag', 'css_class'])

### How to do it: https://github.com/shakedzy/dython/issues/2

Cramér's V measures the association between two categorical variables.
dython's implementation applies the bias correction from Bergsma (2013), Journal of the Korean Statistical Society 42: 323-328.
The coefficient is symmetric: V(x, y) = V(y, x).

https://github.com/shakedzy/dython/blob/master/dython/nominal.py
https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
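To make the statistic concrete, here is a minimal pure-Python sketch of bias-corrected Cramér's V. It is not dython's code: the function name `cramers_v` is reused only for illustration, the chi-squared statistic is computed directly from the contingency counts, and no continuity correction is applied, so values may differ slightly from the library's on small tables.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Bias-corrected Cramér's V between two equally long categorical sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError('x and y must have the same length, at least 2')
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Bias correction: shrink phi^2 and the table dimensions.
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(k_corr - 1, r_corr - 1)
    if denominator <= 0:
        return 0.0  # Degenerate table, e.g. a constant column.
    return math.sqrt(phi2_corr / denominator)


tags = ['td', 'td', 'a', 'a']
classes = ['name', 'name', 'link', 'link']
print(cramers_v(tags, classes))
```

Both arguments play the same role in the formula, which is why the coefficient is symmetric.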

### Nested cross-validation

5-fold cross-validation: partition the training data randomly into five folds, then hold out each fold once for evaluation while fitting on the other four.

Nested CV:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

Common error with cross-validation:
https://www.youtube.com/watch?v=S06JpVoNaA0

https://www.kdnuggets.com/2017/08/dataiku-predictive-model-holdout-cross-validation.html

https://www.datarobot.com/wiki/training-validation-holdout/
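The idea behind nested cross-validation can be sketched in a few lines: an outer loop holds out one fold for the final score, and an inner loop on the remaining data picks the hyperparameter. This is an illustrative sketch, not the procedure used by the models in this notebook; `k_folds`, `cv_score`, `nested_cv` and the `fit_and_score(train, test, param)` callback are hypothetical names.

```python
import random


def k_folds(items, k, rng):
    """Shuffle the items and deal them into k folds of near-equal size."""
    items = list(items)
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]


def cv_score(folds, param, fit_and_score):
    """Mean score of one parameter value over the given folds."""
    scores = []
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(fit_and_score(train, held_out, param))
    return sum(scores) / len(scores)


def nested_cv(items, params, fit_and_score, outer_k=5, inner_k=3, seed=0):
    """Return one (best_param, outer_score) pair per outer fold."""
    rng = random.Random(seed)
    outer = k_folds(items, outer_k, rng)
    results = []
    for i, held_out in enumerate(outer):
        development = [x for j, fold in enumerate(outer) if j != i for x in fold]
        # Model selection only ever sees the development data.
        inner = k_folds(development, inner_k, rng)
        best = max(params, key=lambda p: cv_score(inner, p, fit_and_score))
        # The held-out outer fold is used once, for the final estimate.
        results.append((best, fit_and_score(development, held_out, best)))
    return results


def toy_score(train, test, param):
    # Stand-in for "fit on train, evaluate on test": the score peaks at param == 2.
    return -abs(param - 2)


print(nested_cv(range(20), [1, 2, 3], toy_score))
```

Keeping the hyperparameter search inside the outer training portion is what prevents the held-out fold from influencing model selection.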

The dataset is split into three files: train, valid, and test. Each token is accompanied by 11 features:

| Feature                          | Type        |
|----------------------------------|-------------|
| Unaccented lowercase token       | Categorical |
| Exact dictionary match           | Binary      |
| Partial dictionary match         | Binary      |
| Email                            | Binary      |
| Number                           | Binary      |
| Honorific (Mr., Mrs., Dr., etc.) | Binary      |
| Matches a URL                    | Binary      |
| Is capitalized                   | Binary      |
| Is a punctuation sign            | Binary      |
| HTML tag + parent                | Categorical |
| CSS class                        | Categorical |

### Hidden Markov Models

In [None]:
import numpy as np
import time
import os
import random
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

from optparse import OptionParser
from pathlib import Path
from model.hmm import HiddenMarkov, load_dataset

def test_hmm(timesteps, use_features, dataset):
    """Run 5-fold cross-validation and write one predictions file per fold.

    Passing timesteps=0 selects the Naive Bayes variant of the model.
    """
    labels = ['O', 'B-PER', 'I-PER']
    start_time = time.time()
    naive_bayes = timesteps == 0
    if naive_bayes:
        timesteps = 1

    print('Fitting...')
    # Pool the three splits: the folds are drawn from the whole dataset.
    X1, Y1, T1 = load_dataset(dataset + '/train')
    X2, Y2, T2 = load_dataset(dataset + '/valid')
    X3, Y3, T3 = load_dataset(dataset + '/test')
    training_set = list(zip(X1 + X2 + X3, Y1 + Y2 + Y3, T1 + T2 + T3))

    random.shuffle(training_set)
    fold_size = len(training_set) // 5

    # Slice the shuffled data into five folds; the last one takes the remainder.
    folds = []
    for i in range(5):
        start = i * fold_size
        end = start + fold_size if (i < 4) else len(training_set)
        folds.append(training_set[start:end])
    print('Fold size:', fold_size)

    for i in range(5):
        # Train on every fold except fold i, which is held out for testing.
        train = []
        for j in range(5):
            if i != j:
                train = train + folds[j]

        train_X, train_Y, train_T = [list(t) for t in zip(*train)]
        test_X, test_Y, test_T = [list(t) for t in zip(*folds[i])]

        hmm = HiddenMarkov(timesteps, naive_bayes=naive_bayes, use_features=use_features, self_train=False)
        hmm.fit(train_X, train_Y)

        # Convert label ids to BIO tags for the scoring script.
        t = [[labels[t__] for t__ in t_] for t_ in test_Y]
        p = [[labels[p__] for p__ in p_] for p_ in hmm.predict(test_X)]
        w = test_T

        name = 'fold_' + str(i)
        print('Writing', name)
        # One "word tag pred" line per token, with a blank line before each sequence.
        with Path('../results/score/{}.preds.txt'.format(name)).open('wb') as f:
            for words, preds, tags in zip(w, p, t):
                f.write(b'\n')
                for word, pred, tag in zip(words, preds, tags):
                    f.write(' '.join([word, tag, pred]).encode() + b'\n')

    print('Elapsed time: %.4f' % (time.time() - start_time))
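The prediction files written above pair a gold tag with a predicted tag for every token. The scoring script itself (`eval_model.sh`) is not shown in this notebook, so the following is only a sketch of the usual entity-level precision, recall and F1 computed from BIO tag sequences, under the assumption that a name counts as correct only when its full span matches. `extract_spans` and `entity_scores` are hypothetical names.

```python
def extract_spans(tags):
    """Return the set of (start, end) spans of names in one BIO tag sequence."""
    spans = set()
    start = None
    for i, tag in enumerate(list(tags) + ['O']):  # The sentinel closes a trailing span.
        if tag in ('O', 'B-PER') and start is not None:
            spans.add((start, i))
            start = None
        if tag == 'B-PER':
            start = i
        elif tag == 'I-PER' and start is None:
            # Assumption: a stray I-PER without a preceding B-PER opens a span.
            start = i
    return spans


def entity_scores(gold_sequences, predicted_sequences):
    """Entity-level precision, recall and F1 over parallel lists of tag sequences."""
    gold, predicted = set(), set()
    for n, (g, p) in enumerate(zip(gold_sequences, predicted_sequences)):
        gold |= {(n,) + span for span in extract_spans(g)}
        predicted |= {(n,) + span for span in extract_spans(p)}
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = [['B-PER', 'I-PER', 'O', 'B-PER']]
pred = [['B-PER', 'I-PER', 'O', 'O']]
print(entity_scores(gold, pred))
```

In this toy example the single predicted name is correct, but one of the two gold names is missed, so precision is perfect while recall is one half.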

#### Naive Bayes

In [None]:
# test_hmm(0, False, '../data/ner_on_html')

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/nb
!mv ../results/score/fold* ../results/cross_validation/nb

### Maximum Entropy

In [None]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow's C++ log messages.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences'    
})
estimator.train_cv()
# estimator.test()

### LSTM-CRF

In [2]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow's C++ log messages.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "elmo",
    "char_representation": "lstm",
    "decoder": "crf",  
    # "loss": "cross_entropy"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/lstm_crf_elmo
!mv ../results/score/fold* ../results/cross_validation/lstm_crf_elmo

Fold size: 29
Loss: 0.1017, Acc: 1.0000, Time: 121.6090, Step: 1000
Loss: 0.0378, Acc: 1.0000, Time: 238.9198, Step: 2000
Loss: 0.0001, Acc: 1.0000, Time: 359.0481, Step: 3000
Loss: 0.0098, Acc: 1.0000, Time: 423.8942, Step: 3553
fold_0 - Epoch 0, Precision: 0.9042, Recall: 0.9230, F1: 0.9135
Loss: 0.0124, Acc: 1.0000, Time: 63.2519, Step: 836
fold_0 - Epoch 0, Precision: 0.9511, Recall: 0.8532, F1: 0.8995
Loss: 0.0028, Acc: 1.0000, Time: 117.8207, Step: 1000
Loss: 0.0052, Acc: 1.0000, Time: 238.0059, Step: 2000
Loss: 0.0018, Acc: 1.0000, Time: 353.4686, Step: 3000
Loss: 0.0083, Acc: 1.0000, Time: 415.8655, Step: 3553
fold_0 - Epoch 1, Precision: 0.9519, Recall: 0.9579, F1: 0.9549
Loss: 0.0044, Acc: 1.0000, Time: 62.0387, Step: 836
fold_0 - Epoch 1, Precision: 0.9280, Recall: 0.8030, F1: 0.8610
Loss: 0.0006, Acc: 1.0000, Time: 117.5116, Step: 1000
Loss: 0.0777, Acc: 1.0000, Time: 236.7794, Step: 2000
Loss: 0.0996, Acc: 0.9714, Time: 354.4473, Step: 3000
Loss: 0.0013, Acc: 1.0000, Time:

Loss: 0.0643, Acc: 1.0000, Time: 123.6233, Step: 1000
Loss: 0.0293, Acc: 1.0000, Time: 245.8858, Step: 2000
Loss: 0.0194, Acc: 1.0000, Time: 367.9636, Step: 3000
Loss: 0.8547, Acc: 0.8667, Time: 418.9395, Step: 3419
fold_4 - Epoch 0, Precision: 0.8740, Recall: 0.9160, F1: 0.8945
Loss: 0.0001, Acc: 1.0000, Time: 71.9694, Step: 971
fold_4 - Epoch 0, Precision: 0.9024, Recall: 0.9431, F1: 0.9223
Loss: 0.0643, Acc: 1.0000, Time: 122.6884, Step: 1000
Loss: 0.0348, Acc: 1.0000, Time: 244.4384, Step: 2000
Loss: 0.0010, Acc: 1.0000, Time: 365.9124, Step: 3000
Loss: 0.0003, Acc: 1.0000, Time: 416.6621, Step: 3419
fold_4 - Epoch 1, Precision: 0.9523, Recall: 0.9636, F1: 0.9579
Loss: 0.0000, Acc: 1.0000, Time: 71.4247, Step: 971
fold_4 - Epoch 1, Precision: 0.9475, Recall: 0.7862, F1: 0.8593
Loss: 0.0708, Acc: 1.0000, Time: 121.0148, Step: 1000
Loss: 0.0607, Acc: 1.0000, Time: 243.3112, Step: 2000
Loss: 0.0224, Acc: 1.0000, Time: 366.4818, Step: 3000
Loss: 0.0018, Acc: 1.0000, Time: 416.9944, Ste

In [3]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow's C++ log messages.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'sentences',
    "model": "lstm_crf",  
    "epochs": 5,
    "batch_size": 10,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "crf",  
    # "loss": "cross_entropy"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/lstm_crf_glove
!mv ../results/score/fold* ../results/cross_validation/lstm_crf_glove

Fold size: 29
(35214, 300)
Loss: 1.2083, Acc: 0.9759, Time: 72.0324, Step: 1000
Loss: 0.0967, Acc: 1.0000, Time: 139.2430, Step: 2000
Loss: 0.6713, Acc: 0.9818, Time: 201.1569, Step: 3000
Loss: 0.0024, Acc: 1.0000, Time: 227.0417, Step: 3395
fold_0 - Epoch 0, Precision: 0.8549, Recall: 0.8824, F1: 0.8684
Loss: 0.1735, Acc: 1.0000, Time: 34.5421, Step: 995
fold_0 - Epoch 0, Precision: 0.9307, Recall: 0.9481, F1: 0.9393
Loss: 0.0007, Acc: 1.0000, Time: 70.0688, Step: 1000
Loss: 0.2897, Acc: 0.9714, Time: 137.5691, Step: 2000
Loss: 0.0082, Acc: 1.0000, Time: 201.5033, Step: 3000
Loss: 0.0026, Acc: 1.0000, Time: 225.4529, Step: 3395
fold_0 - Epoch 1, Precision: 0.9486, Recall: 0.9560, F1: 0.9523
Loss: 0.0404, Acc: 1.0000, Time: 34.5108, Step: 995
fold_0 - Epoch 1, Precision: 0.9346, Recall: 0.9527, F1: 0.9436
Loss: 0.0011, Acc: 1.0000, Time: 70.6903, Step: 1000
Loss: 0.0017, Acc: 1.0000, Time: 137.4134, Step: 2000
Loss: 0.3458, Acc: 0.9750, Time: 199.6181, Step: 3000
Loss: 0.0011, Acc: 1.0

Loss: 0.1035, Acc: 1.0000, Time: 234.7522, Step: 3537
fold_4 - Epoch 0, Precision: 0.8973, Recall: 0.9085, F1: 0.9029
Loss: 0.0011, Acc: 1.0000, Time: 27.8724, Step: 853
fold_4 - Epoch 0, Precision: 0.8271, Recall: 0.9678, F1: 0.8919
Loss: 0.0004, Acc: 1.0000, Time: 70.7192, Step: 1000
Loss: 0.0260, Acc: 1.0000, Time: 135.8408, Step: 2000
Loss: 0.1739, Acc: 0.9842, Time: 198.8692, Step: 3000
Loss: 0.0014, Acc: 1.0000, Time: 233.4605, Step: 3537
fold_4 - Epoch 1, Precision: 0.9545, Recall: 0.9620, F1: 0.9583
Loss: 0.0004, Acc: 1.0000, Time: 27.3897, Step: 853
fold_4 - Epoch 1, Precision: 0.8437, Recall: 0.9633, F1: 0.8996
Loss: 0.0004, Acc: 1.0000, Time: 72.1640, Step: 1000
Loss: 0.0015, Acc: 1.0000, Time: 136.4272, Step: 2000
Loss: 0.0060, Acc: 1.0000, Time: 198.7079, Step: 3000
Loss: 0.0009, Acc: 1.0000, Time: 233.3336, Step: 3537
fold_4 - Epoch 2, Precision: 0.9672, Recall: 0.9742, F1: 0.9707
Loss: 0.0000, Acc: 1.0000, Time: 27.4816, Step: 853
fold_4 - Epoch 2, Precision: 0.8377, Rec

In [2]:
!pip install tensorflow_hub

Collecting tensorflow_hub
  Downloading https://files.pythonhosted.org/packages/ac/64/3bba86ca49ef21a4add11a4d37e3f6cd05d2e61d207ebe26a8a96b340826/tensorflow_hub-0.6.0-py2.py3-none-any.whl (84kB)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.6.0


In [None]:
import numpy as np
import time
import os
import sys
sys.path.insert(1, os.path.realpath(os.path.pardir))

import tensorflow as tf
from pathlib import Path
from model.estimator import Estimator

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Suppress TensorFlow's C++ log messages.
tf.logging.set_verbosity(tf.logging.ERROR)

estimator = Estimator()
estimator.set_dataset_params({
    'datadir': '../data/ner_on_html',
    'dataset_mode': 'batch',
    "model": "html_attention",  
    "epochs": 5,
    "batch_size": 1,
    "use_features": False,
    "word_embeddings": "glove",
    "char_representation": "lstm",
    "decoder": "crf",  
    # "loss": "cross_entropy"
})
estimator.train_cv()

!cd .. && ./eval_model.sh
!mkdir -p ../results/cross_validation/hard_attention
!mv ../results/score/fold* ../results/cross_validation/hard_attention

Fold size: 812
(35214, 300)
Loss: 5.4085, Acc: 0.9674, Time: 187.4245, Step: 1000
Loss: 0.0660, Acc: 1.0000, Time: 370.4121, Step: 2000
Loss: 0.5303, Acc: 1.0000, Time: 553.1166, Step: 3000
Loss: 0.0626, Acc: 1.0000, Time: 598.8995, Step: 3251
fold_0 - Epoch 0, Precision: 0.6763, Recall: 0.7535, F1: 0.7128
Loss: 0.0447, Acc: 1.0000, Time: 96.2599, Step: 813
fold_0 - Epoch 0, Precision: 0.8902, Recall: 0.9437, F1: 0.9162
Loss: 1.9442, Acc: 0.9556, Time: 187.6508, Step: 1000
Loss: 0.1932, Acc: 1.0000, Time: 369.7559, Step: 2000
Loss: 0.2126, Acc: 1.0000, Time: 553.2094, Step: 3000
Loss: 0.0522, Acc: 1.0000, Time: 598.0694, Step: 3251
fold_0 - Epoch 1, Precision: 0.9042, Recall: 0.9136, F1: 0.9088
Loss: 0.0196, Acc: 1.0000, Time: 95.8701, Step: 813
fold_0 - Epoch 1, Precision: 0.9410, Recall: 0.9127, F1: 0.9266
Loss: 0.0663, Acc: 1.0000, Time: 186.7542, Step: 1000
Loss: 0.0623, Acc: 1.0000, Time: 369.2649, Step: 2000
Loss: 4.6239, Acc: 0.9636, Time: 551.7218, Step: 3000
Loss: 3.0582, Acc: