# Event detection

In [1]:
# This script trains the BiLSTM architecture for event detection (BIO chunk tagging)
# on the HistoMention dataset.
# The code uses pre-trained word embeddings (see embeddingsPath below).
from __future__ import print_function
import os
import logging
import sys
from neuralnets.BiLSTM import BiLSTM
from util.preprocessing import perpareDataset, loadDatasetPickle

'''
# :: Change into the working dir of the script ::
abspath = os.path.abspath(__file__)
dname = os.path.dirname(abspath)
os.chdir(dname)
'''

# :: Logging level ::
loggingLevel = logging.INFO
logger = logging.getLogger()
logger.setLevel(loggingLevel)

ch = logging.StreamHandler(sys.stdout)
ch.setLevel(loggingLevel)
formatter = logging.Formatter('%(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


######################################################
#
# Data preprocessing
#
######################################################
datasets = {
    'HistoMention':                      # Name of the dataset
        {'columns': {0:'tokens', 1:'lemma', 2:'POS', 5:'chunk_BIO'},
         'label': 'chunk_BIO',           # Column we want to predict
         'evaluate': True,               # Evaluate on this task? Always set to True for single-task setups
         'commentSymbol': None}
}


#Path on your computer to the word embeddings
embeddingsPath = 'glove.6B.300d.txt'#'HistoGlove.txt'

#Prepares the dataset for use with the LSTM network. Creates and stores cPickle files in the pkl/ folder
pickleFile = perpareDataset(embeddingsPath, datasets)


######################################################
#
# The training of the network starts here
#
######################################################


#Load the embeddings and the dataset
embeddings, mappings, data = loadDatasetPickle(pickleFile)

# Some network hyperparameters
params = {'classifier': ['Softmax'], 'LSTM-Size': [75, 75], 'dropout': (0.25, 0.25),
         'featureNames': ['tokens', 'lemma', 'casing', 'POS'], 'addFeatureDimensions': 10,
         'miniBatchSize': 64, 'earlyStopping': 10}

model = BiLSTM(params)
model.setMappings(mappings, embeddings)
model.setDataset(datasets, data)
model.storeResults('./unidep_pos_results.csv') #Path to store performance scores for dev / test
model.modelSavePath = "models/[ModelName]_[DevScore]_[TestScore]_[Epoch].h5" #Path to store models
model.fit(epochs=25)
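
The `columns` entry in `datasets` maps column indices of the input file to feature names. The sketch below illustrates that idea on a toy input; it is not the reader used by `util.preprocessing`. The function name `read_conll`, the whitespace separator, the sample rows, and the `B-EVENT` label are all assumptions made for illustration.

```python
def read_conll(lines, columns, comment_symbol=None):
    """Group token lines into sentences, keeping only the mapped columns.

    Hypothetical helper: assumes whitespace-separated columns and blank
    lines between sentences.
    """
    def empty_sentence():
        return {name: [] for name in columns.values()}

    sentences, current = [], empty_sentence()
    for line in lines:
        line = line.rstrip("\n")
        # Skip comment lines only if a comment symbol is configured
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue
        if not line.strip():
            # A blank line closes the current sentence
            if any(current.values()):
                sentences.append(current)
                current = empty_sentence()
            continue
        fields = line.split()
        for idx, name in columns.items():
            current[name].append(fields[idx])
    if any(current.values()):
        sentences.append(current)
    return sentences


columns = {0: 'tokens', 1: 'lemma', 2: 'POS', 5: 'chunk_BIO'}
# Made-up sample rows; columns 3 and 4 are ignored by the mapping above
sample = [
    "The the DET _ _ O",
    "army army NOUN _ _ O",
    "attacked attack VERB _ _ B-EVENT",
    "",
    "Peace peace NOUN _ _ B-EVENT",
]
sentences = read_conll(sample, columns)
print(len(sentences), sentences[0]['tokens'])
```

Columns that are not listed in the mapping (here 3 and 4) are simply never read, which is why the file may carry more columns than the model uses.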

Using TensorFlow backend.


Using existent pickle file: pkl/HistoMention_glove.6B.300d.pkl
--- HistoMention ---
1985 train sentences
258 dev sentences
257 test sentences
LSTM-Size: [75, 75]
_____________________________________________________________________________________________________________________________
Layer (type)                             Output Shape               Param #        Connected to                              
words_input (InputLayer)                 (None, None)               0                                                        
_____________________________________________________________________________________________________________________________
lemma_input (InputLayer)                 (None, None)               0                                                        
_____________________________________________________________________________________________________________________________
casing_input (InputLayer)                (None, None)               0             

Train-Data: Prec: 0.875, Rec: 0.896, F1: 0.8857
Wrong BIO-Encoding 29/882 labels, 3.29%
Wrong BIO-Encoding 24/877 labels, 2.74%
Dev-Data: Prec: 0.776, Rec: 0.755, F1: 0.7653
Wrong BIO-Encoding 27/893 labels, 3.02%
Wrong BIO-Encoding 24/890 labels, 2.70%
Test-Data: Prec: 0.778, Rec: 0.774, F1: 0.7761

Scores from epoch with best dev-scores:
  Train-Score: 0.8857
  Dev-Score: 0.7653

1.76 sec for evaluation
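
The F1 values in the log are the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 from the logged, rounded precision and recall; the helper name `f1_score` is an assumption, not a function from the training code. Because the inputs are already rounded to three decimals, the result matches the logged F1 only to about three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values taken from the log above
train_f1 = f1_score(0.875, 0.896)
dev_f1 = f1_score(0.776, 0.755)
test_f1 = f1_score(0.778, 0.774)
print(train_f1, dev_f1, test_f1)
```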

--------- Epoch 9 -----------
4.87 sec for training (47.90 total)
-- HistoMention --
Wrong BIO-Encoding 294/7300 labels, 4.03%
Wrong BIO-Encoding 252/7258 labels, 3.47%
Train-Data: Prec: 0.891, Rec: 0.902, F1: 0.8962
Wrong BIO-Encoding 30/893 labels, 3.36%
Wrong BIO-Encoding 28/891 labels, 3.14%
Dev-Data: Prec: 0.783, Rec: 0.771, F1: 0.7770
Wrong BIO-Encoding 36/906 labels, 3.97%
Wrong BIO-Encoding 32/902 labels, 3.55%
Test-Data: Prec: 0.791, Rec: 0.790, F1: 0.7904

Scores from epoch with best dev-scores:
  Train-Score: 0.8962
  Dev-Score: 0.7770

2.51 sec for evaluation
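
The `Wrong BIO-Encoding` lines report how many predicted labels violate the BIO scheme, for example an `I-` tag that does not continue a chunk of the same type. The sketch below shows one way to detect such tags and reproduces the percentage arithmetic from the log. It is an illustration only: `count_invalid_i_tags` is a hypothetical helper, the `EVENT` label is made up, and the evaluation code may count errors and labels differently.

```python
def count_invalid_i_tags(tags):
    """Count I-X tags that do not directly follow B-X or I-X."""
    errors = 0
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # Valid only when the previous tag belongs to the same chunk type
            if previous == "O" or previous[2:] != tag[2:]:
                errors += 1
        previous = tag
    return errors


# Made-up prediction: the first I-EVENT has no preceding B-EVENT
predicted = ["O", "I-EVENT", "B-EVENT", "I-EVENT", "O"]
print(count_invalid_i_tags(predicted))

# Percentage as printed in the log line "29/882 labels"
print(format(100 * 29 / 882, ".2f") + "%")
```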

--------- E

Wrong BIO-Encoding 30/885 labels, 3.39%
Dev-Data: Prec: 0.789, Rec: 0.770, F1: 0.7794
Wrong BIO-Encoding 46/903 labels, 5.09%
Wrong BIO-Encoding 38/895 labels, 4.25%
Test-Data: Prec: 0.777, Rec: 0.765, F1: 0.7708

Scores from epoch with best dev-scores:
  Train-Score: 0.9733
  Dev-Score: 0.7794

1.71 sec for evaluation

--------- Epoch 23 -----------
4.72 sec for training (115.48 total)
-- HistoMention --
Wrong BIO-Encoding 238/7222 labels, 3.30%
Wrong BIO-Encoding 161/7145 labels, 2.25%
Train-Data: Prec: 0.973, Rec: 0.982, F1: 0.9773
Wrong BIO-Encoding 44/922 labels, 4.77%
Wrong BIO-Encoding 33/911 labels, 3.62%
Dev-Data: Prec: 0.781, Rec: 0.782, F1: 0.7818
Wrong BIO-Encoding 49/932 labels, 5.26%
Wrong BIO-Encoding 39/922 labels, 4.23%
Test-Data: Prec: 0.762, Rec: 0.773, F1: 0.7674

Scores from epoch with best dev-scores:
  Train-Score: 0.9773
  Dev-Score: 0.7818

1.71 sec for evaluation
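
The log reports scores from the epoch with the best dev score, and the hyperparameters set `'earlyStopping': 10`. A minimal sketch of that bookkeeping follows, assuming early stopping means "stop once the dev score has not improved for `patience` epochs". The function name and the toy score list are invented for illustration and are not taken from the run above or from the `BiLSTM` class.

```python
def best_epoch_with_patience(dev_scores, patience):
    """Return (best_epoch, best_score), stopping after `patience` epochs without improvement."""
    best_score, best_epoch = float("-inf"), None
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            # New best dev score: remember it and the epoch it came from
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training
            break
    return best_epoch, best_score


# Toy dev scores, not from the log
toy_scores = [0.70, 0.75, 0.74, 0.73, 0.76]
print(best_epoch_with_patience(toy_scores, patience=2))
print(best_epoch_with_patience(toy_scores, patience=10))
```

With a small patience the loop stops before it sees the later, higher score, which is the usual trade-off between training time and the chance of a late improvement.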

--------- Epoch 24 -----------
4.73 sec for training (120.21 total)
-- HistoMention --
Wrong BIO-