# Ongoing Event Detection Tool

**Contributors:** [Mariano Maisonnave](https://cs.uns.edu.ar/~mmaisonnave), [Fernando Delbianco](https://scholar.google.com.ar/citations?user=cQ3vD7oAAAAJ&hl=en&oi=ao), [Fernando Tohmé](https://scholar.google.com.ar/citations?user=butwPD4AAAAJ&hl=en&oi=ao), [Ana Maguitman](https://scholar.google.com.ar/citations?user=upxByNEAAAAJ&hl=en&oi=ao) and [Evangelos Milios](https://scholar.google.com.ar/citations?user=ME8aQywAAAAJ&hl=en&oi=ao)

This notebook contains the complete code used for the experiments conducted for the paper: "[Improving Event Detection using Contextual Word and Sentence Embeddings](https://arxiv.org/abs/2007.01379)".

Although this is the complete code for that article, the experiments were not run on Google Colab, mainly because of storage and computational limitations. If you run into trouble running this notebook in Google Colab, we recommend downloading it and running it locally once all the requirements are met and all the required data is downloaded.

The notebook was used for the development and testing of the models. The final models were run by copying all the code for each model into a single file (baseline_svm.py, baseline_cnn.py, proposed_model.py) and then running it from the command line using the syntax described below.



---



***Additional files for running the experiments can be found at [https://github.com/mmaisonnave/ongoing-event-detection](https://github.com/mmaisonnave/ongoing-event-detection)***



---




**How to run the proposed model (RNN):**
```
python3 proposed_model.py --seed=1 --model_name=prop_model_seed1 --epochs=3000 --bert_enabled=True --pos_enabled=False --dep_enabled=False --tag_enabled=False --words_enabled=False --entities_enabled=False --spacyvecs_enabled=False --bert_sent_enabled=False --lstm_layers=7 --evaluate_only=False
```
**How to run the state-of-the-art baseline model (CNN):**
```
python3 baseline_cnn.py --seed=1 --model_name=baseline_seed1 --words_enabled=False --entities_enabled=True --positions_enabled=True --windows_size=11 --epochs=3000 --batch_size=50 --evaluate_only=False --patience=200

```


```

Parameter description:
  - seed: Used to obtain reproducible results.
  - model_name: The name of the model. Used to store weights, architecture and results on a folder with the same name as the model. 
  - epochs: Maximum number of epochs the model will train. If this value is high, early stopping will usually halt training before that number of epochs is reached.
  - evaluate_only: If "True" the model will not train.
  - patience: Patience value for early stopping (number of epochs without improvement before training stops).
```


```

Parameters specific to the CNN model
  - words_enabled: If True, the Word2vec feature will be used in the model.
  - entities_enabled: If True, the Entity embedding feature will be used in the model.
  - positions_enabled: If True, the Position embedding feature will be used in the model.
  - windows_size: Size of the context window. An odd number is required. For example, if windows_size=5, the two previous and two following tokens are used to predict each token.
  - batch_size: Batch size.
Parameters specific to the RNN model
  - bert_enabled: If True, the BERT embedding feature will be used in the model.
  - pos_enabled: If True, the Part-Of-Speech (simplified) embedding feature will be used in the model.
  - dep_enabled: If True, the Dependency Parser Tag embedding feature will be used in the model.
  - tag_enabled: If True, the Part-Of-Speech (detailed) embedding feature will be used in the model.
  - words_enabled: If True, the Word2vec feature will be used in the model.
  - entities_enabled: If True, the Entity embedding feature will be used in the model.
  - spacyvecs_enabled: If True, the spaCy context-sensitive embedding feature will be used in the model.
  - bert_sent_enabled: If True, the sentence BERT embedding feature will be used in the model. This embedding is computed by summing all the individual BERT embeddings.
  - lstm_layers: A comma-separated list of numbers. This list represents the number of hidden units of each layer. For example, lstm_layers=7 builds a single-layer Bi-LSTM model with seven hidden units, and lstm_layers=15,5 builds a two-layer Bi-LSTM model with 15 and 5 hidden units in the first and second layer, respectively.
```
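As an illustration of the `lstm_layers` format described above, the following sketch parses the comma-separated value into a list of hidden-unit counts. The helper name `parse_lstm_layers` is hypothetical and is not taken from `proposed_model.py`; the actual script may parse the argument differently.

```python
def parse_lstm_layers(value):
    """Turn a string such as '15,5' into [15, 5] (one entry per Bi-LSTM layer)."""
    units = [int(part) for part in value.split(',') if part.strip()]
    if not units or any(u <= 0 for u in units):
        raise ValueError(f'invalid lstm_layers value: {value!r}')
    return units

# One hidden-unit count per layer, in order.
single_layer = parse_lstm_layers('7')
two_layers = parse_lstm_layers('15,5')
print(single_layer, two_layers)
```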

**How to run the classical baseline model (SVM):**
```
python3 baseline_svm.py [-h] [--kernel KERNEL] [--balanced] [--standarize] 

Parameters specific to the SVM model
  -h, --help            show this help message and exit
  -k KERNEL, --kernel KERNEL
                        Specifies the kernel type to be used in the algorithm.
                        It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.
                        If none is given, ‘rbf’ will be used.
  -b, --balanced        The “balanced” mode uses the values of y to
                        automatically adjust weights inversely proportional to
                        class frequencies in the input data as n_samples /
                        (n_classes * np.bincount(y)).
  -s, --standarize      Specifies whether to standardize the data or not.
```
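To make the "balanced" weighting formula quoted in the help text concrete, here is a small NumPy sketch that evaluates `n_samples / (n_classes * np.bincount(y))` on a toy label vector. The labels are made up for illustration and are not taken from the event detection dataset.

```python
import numpy as np

# Toy labels: three non-event tokens (0) and one event trigger (1).
y = np.array([0, 0, 0, 1])

n_samples = len(y)
n_classes = len(np.unique(y))

# Weight of each class is inversely proportional to its frequency.
class_weights = n_samples / (n_classes * np.bincount(y))
print(class_weights)
```

The rare class receives the larger weight, which is the intended effect when event triggers are much less frequent than ordinary tokens.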




**Required software and data:**
*   Software: [spaCy](https://spacy.io/), [Keras](https://keras.io/), [TensorFlow](https://www.tensorflow.org/), [NumPy](https://numpy.org/), [Gensim](https://radimrehurek.com/gensim/).
*   Data: [Event Detection Dataset](https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data), [Word2Vec](https://code.google.com/archive/p/word2vec/) 


The complete list of Python packages used and their versions can be found [here](https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data/requirements.txt). These requirements were obtained using `pip freeze`.



Note that the code uses BERT pre-trained word embeddings. We built these embeddings by summing the last four layers of the BERT pre-trained model. This process resulted in a 768-dimensional embedding for each word. We computed these embeddings with another notebook and added them to the data. Therefore, the process for building those embeddings is not described here. If you are interested in that script, you can email the first author of the work.
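The summation step can be sketched with NumPy alone. The snippet below is only an illustration under stated assumptions: the hidden states are random stand-ins rather than real BERT outputs, the layer count (13, that is, the embedding output plus 12 layers) assumes a BERT-base-sized model, and the original embedding script is not shown in this notebook, so details such as how word pieces were mapped to words are not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes for illustration: 13 hidden states, 6 tokens, 768 dimensions.
num_layers, num_tokens, hidden_size = 13, 6, 768
hidden_states = rng.normal(size=(num_layers, num_tokens, hidden_size))

# Sum the last four layers element-wise: one 768-dimensional vector per token.
token_embeddings = hidden_states[-4:].sum(axis=0)
print(token_embeddings.shape)
```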


***If you plan to use or replicate our work and run into any issue, do not hesitate to contact the first author of the work.***

**Disclaimer**

The following notebook is intended to give an overview of how the model was built and executed. Downloading and configuring the data is required. For example, the Word2vec word embeddings have to be downloaded, as well as the event detection dataset ([link](https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data)).
 
*If you plan to replicate the models, we strongly suggest you contact the first author of this work.*

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Ongoing Event Detection Tool</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="https://cs.uns.edu.ar/~mmaisonnave/resources/ED_code/" property="cc:attributionName" rel="cc:attributionURL">Mariano Maisonnave</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.<br />Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://arxiv.org/abs/2007.01379" rel="dct:source">https://arxiv.org/abs/2007.01379</a>.

## Classical Baseline (SVM)

### Imports

In [None]:
import os
import re
import spacy
import numpy as np
import sys
import pickle



import datetime
def info(msg):
    print(f'{datetime.datetime.now()} [ INFO  ] {msg}')
def warning(msg):
    print(f'{datetime.datetime.now()} [WARNING] {msg}')
def ok(msg):
    print(f'{datetime.datetime.now()} [  OK   ] {msg}')

info('Loading spacy library')
nlp = spacy.load('en_core_web_sm')

2020-09-15 19:42:19.823343 [ INFO  ] Loading spacy library


### arguments and parameters

**If running from console this cell should not be executed.**

In [None]:
import argparse
parser = argparse.ArgumentParser()

parser.add_argument("-k", "--kernel", help="Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’. If none is given, ‘rbf’ will be used.",
                    type=str)

parser.add_argument("-b", "--balanced", help="The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).",
                    action="store_true")

parser.add_argument("-s", "--standarize", help="Specifies whether to standardize the data or not.",
                    action="store_true")



args = parser.parse_args()

info(f"STANDARIZE option {'ENABLED' if args.standarize else 'DISABLED'}.")
info(f"BALANCED   option {'ENABLED' if args.balanced else 'DISABLED'}.")
kernel = 'rbf'
if args.kernel:
    kernel = args.kernel
else:
    warning('Kernel not provided, rbf assumed.')
info(f"{kernel.upper()} kernel selected.")


**To run the following cell, you must first download the required data from https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data/**

In [None]:
data_training_path = 'data/training'
data_testing_path = 'data/testing'
assert os.path.exists(data_training_path) and os.path.exists(data_testing_path)

In [None]:
info('Searching training and testing files')
training_files = []
for dirpath, _, filenames in os.walk(data_training_path):
    for filename in filenames:
        file_ = os.path.join(dirpath,filename)
        assert file_.endswith('xml')
        training_files.append(file_)
info(f'Number of training files (*.xml): {len(training_files)}')

testing_files = []
for dirpath, _, filenames in os.walk(data_testing_path):
    for filename in filenames:
        file_ = os.path.join(dirpath,filename)
        assert file_.endswith('xml')
        testing_files.append(file_)
info(f'Number of testing files (*.xml): {len(testing_files)}')

In [None]:
# my_word2vec is a dictionary that maps each word to its vector. I built it
# from the Word2vec file available at https://code.google.com/archive/p/word2vec/
# and reduced it to only the words that are present in the dataset used
# here. If you need this file you can email me. If you prefer, you can also
# build it yourself from the original file available at the URL above.
my_word2vec = pickle.load(open('my_word2vec.p','rb'))

**Auxiliary methods for reading the files from the dataset**

In [None]:
def _read_file(file_):
    # Extract the sentence body and build a tag-free copy of the text.
    with open(file_, 'r') as f:
        content = re.findall(r'<sentence>(.*)</sentence>', f.read(), re.DOTALL)[0].strip()
    texto = re.sub(r'</*event>', '', content)

    # Remove one <event> tag pair at a time so that each match offset
    # refers to a position in the tag-free text.
    match_obj = re.search(r'<event>.*?</event>', content)
    events = []
    while match_obj is not None:
        idx = match_obj.start()
        trigger = match_obj.group()[7:-8]  # deleting <event></event> tags
        events.append(((idx, idx+len(trigger)), trigger))
        content = re.sub(r'<event>(.*?)</event>', r'\g<1>', content, count=1)
        match_obj = re.search(r'<event>.*?</event>', content)

    assert all([texto[ini:fin]==trigger for ((ini,fin),trigger) in events])
    return texto, events

def get_matrices(file_):
    texto, events = _read_file(file_)
    indices = set([ini for ((ini,fin),trigger) in events])
    doc = nlp(texto)
    
    y = np.array([1 if token.idx in indices else 0 for token in doc])
    assert np.sum(y)==len(events)
    X = np.zeros(shape=(len(doc), 300))
    
    for idx,token in enumerate(doc):
        if token.text in my_word2vec:
            X[idx,:] = my_word2vec[token.text]
#         else:
#             print(token.text)
#         if token.has_vector:
#             X[idx,:] = token.vector
            
    return X,y
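The offset bookkeeping done when reading a file can be seen on a single in-memory sentence. The sketch below is a simplified, self-contained variant written for illustration: the function name `strip_event_tags` and the example sentence are made up, and it computes offsets in one pass with `re.finditer` instead of removing one tag pair at a time.

```python
import re

TAG_OVERHEAD = len('<event>') + len('</event>')

def strip_event_tags(content):
    """Return the tag-free text and a list of ((start, end), trigger) tuples."""
    events = []
    removed = 0  # number of tag characters seen before the current match
    for match in re.finditer(r'<event>(.*?)</event>', content):
        start = match.start() - removed
        trigger = match.group(1)
        events.append(((start, start + len(trigger)), trigger))
        removed += TAG_OVERHEAD
    text = re.sub(r'</?event>', '', content)
    return text, events

# Hypothetical annotated sentence in the same tag format as the dataset files.
sample = 'Stocks <event>fell</event> after the bank <event>announced</event> losses.'
text, events = strip_event_tags(sample)
print(text)
print(events)
```

Each stored offset indexes the tag-free text, which is what allows the labels to be aligned with spaCy's `token.idx` afterwards.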


### Reading training files

In [None]:
info('Reading training files')
Xs = []
Ys = []
for file_ in training_files:
    X, y = get_matrices(file_)
    Xs.append(X)
    Ys.append(y)
    
X_train = np.vstack(Xs)
y_train = np.hstack(Ys)
info(X_train.shape)
info(y_train.shape)

### Training

In [None]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC

info(f'Training SVM classifier with {kernel} kernel')
if args.standarize:
    clf = make_pipeline(
        StandardScaler(), 
        SVC(gamma='auto', kernel=kernel, class_weight='balanced' if args.balanced else None)
    )
else:
    clf = SVC(gamma='scale', kernel=kernel, class_weight='balanced' if args.balanced else None)
clf.fit(X_train, y_train)

### Reading testing files

In [None]:
info('Reading testing files')

Xs = []
Ys = []
for file_ in testing_files:
    X, y = get_matrices(file_)
    Xs.append(X)
    Ys.append(y)
    
X_test = np.vstack(Xs)
y_test = np.hstack(Ys)
info(X_test.shape)
info(y_test.shape)

### Computing training metrics

In [None]:
info('Computing Training Metrics:')
yhat = clf.predict(X_train)
true_positives = np.sum(yhat[y_train==1]==y_train[y_train==1])
true_negatives =  np.sum(yhat[y_train==0]==y_train[y_train==0])

# Positives predicted as negative are false negatives, and vice versa.
false_negatives = np.sum(yhat[y_train==1]!=y_train[y_train==1])
false_positives = np.sum(yhat[y_train==0]!=y_train[y_train==0])

total_positives = np.sum(y_train==1)
total_negatives = np.sum(y_train==0)

sens = true_positives/total_positives
spec = true_negatives/total_negatives
# Note: this "f1" is the harmonic mean of sensitivity and specificity.
f1 = 2*((sens*spec)/(sens+spec))
acc = clf.score(X_train, y_train)


info(f'\t - accuracy:    {acc:4.3f}')
info(f'\t - sensitivity: {sens:4.3f}')
info(f'\t - specificity: {spec:4.3f}')
info(f'\t - f1-score:    {f1:4.3f}')

### Computing testing metrics

In [None]:
info('Computing Testing Metrics:')

yhat = clf.predict(X_test)
true_positives = np.sum(yhat[y_test==1]==y_test[y_test==1])
true_negatives =  np.sum(yhat[y_test==0]==y_test[y_test==0])

# Positives predicted as negative are false negatives, and vice versa.
false_negatives = np.sum(yhat[y_test==1]!=y_test[y_test==1])
false_positives = np.sum(yhat[y_test==0]!=y_test[y_test==0])



total_positives = np.sum(y_test==1)
total_negatives = np.sum(y_test==0)

sens = true_positives/total_positives
spec = true_negatives/total_negatives
# Note: this "f1" is the harmonic mean of sensitivity and specificity.
f1 = 2*((sens*spec)/(sens+spec))
acc = clf.score(X_test, y_test)


info(f'\t - accuracy:    {acc:4.3f}')
info(f'\t - sensitivity: {sens:4.3f}')
info(f'\t - specificity: {spec:4.3f}')
info(f'\t - f1-score:    {f1:4.3f}')

print()
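As a quick sanity check of the metric definitions used in the two cells above, the following sketch evaluates sensitivity, specificity and their harmonic mean (the quantity the notebook stores in `f1`) on a toy prediction vector. The arrays are made up for illustration.

```python
import numpy as np

# Toy ground truth and predictions: 2 positives, 4 negatives.
y_true = np.array([1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1])

# Sensitivity: fraction of true positives among all positives.
sens = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
# Specificity: fraction of true negatives among all negatives.
spec = np.sum((y_pred == 0) & (y_true == 0)) / np.sum(y_true == 0)
# Harmonic mean of the two.
score = 2 * (sens * spec) / (sens + spec)

print(f'sensitivity={sens:.3f} specificity={spec:.3f} harmonic mean={score:.3f}')
```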


## State-of-the-art Baseline (CNN)

### arguments and parameters


In [None]:
import sys

my_seed = 1
windows_size = 3
model_name = 'default'

patience = 200

nro_of_filters = 150 # fixed according to Nguyen et al. 2015
epochs = 1
batch_size = 50

entities_enabled = True
positions_enabled = True
words_enabled = True
evaluate_only=False


**If running from console this cell should not be executed.**

In [None]:
import getopt

argv = sys.argv[1:]

try:
    opts, args = getopt.getopt(argv,"",["seed=",
                                        "model_name=",
                                        "words_enabled=",
                                        "entities_enabled=",
                                        "positions_enabled=",
                                        "windows_size=",
                                        "epochs=",
                                        "batch_size=",
                                        'evaluate_only=',
                                        'patience='    
                                       ])
except getopt.GetoptError:
    print('Bad parameters')
    sys.exit(2)
for opt, arg in opts:
    if opt == "--seed":
        my_seed = int(arg)
    elif opt == "--model_name":
        model_name = arg
    elif opt == "--words_enabled":
        words_enabled = arg=="True"
    elif opt == "--entities_enabled":
        entities_enabled = arg=="True"
    elif opt == "--positions_enabled":
        positions_enabled = arg=="True"
    elif opt == "--windows_size":
        windows_size = int(arg)
    elif opt == "--epochs":
        epochs = int(arg)
    elif opt == "--batch_size":
        batch_size = int(arg)
    elif opt == '--evaluate_only':
        evaluate_only = arg=='True'
    elif opt == '--patience':
        patience = int(arg)
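The parsing pattern used in the cell above can be tried without touching `sys.argv` by passing an explicit argument list to `getopt.getopt`. The argument values below are illustrative only. Note that boolean flags are compared against the literal string `"True"`, so any other spelling (for example `true` or `1`) is treated as `False`.

```python
import getopt

# Illustrative command-line arguments, as they would appear in sys.argv[1:].
example_argv = ['--seed=1', '--words_enabled=False', '--windows_size=11']

opts, remaining = getopt.getopt(
    example_argv, '', ['seed=', 'words_enabled=', 'windows_size='])

settings = {}
for opt, arg in opts:
    if opt == '--seed':
        settings['seed'] = int(arg)
    elif opt == '--words_enabled':
        settings['words_enabled'] = arg == 'True'  # only the exact string "True" enables it
    elif opt == '--windows_size':
        settings['windows_size'] = int(arg)

print(settings)
```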
        



In [None]:
model_name = '{}_({})_w{}'.format(model_name,my_seed,windows_size)
                  
def info(cad):
    print('[ INFO  ] {}'.format(cad))
def ok(cad):
    print('[  OK   ] {}'.format(cad))
def warning(cad):
    print('[WARNING] {}'.format(cad))
    

info('seed:              {}'.format(my_seed))
info('model_name:        {}'.format(model_name))
info('entities_enabled:  {}'.format(entities_enabled))
info('words_enabled:     {}'.format(words_enabled))
info('positions_enabled: {}'.format(positions_enabled))
info('windows size =     {}'.format(windows_size))
info('no. epochs =       {}'.format(epochs))
if evaluate_only:
    warning('EVALUATE ONLY.')
if epochs<=10:
    warning('FEW EPOCHS.')

[ INFO  ] seed:              1
[ INFO  ] model_name:        default_(1)_w3
[ INFO  ] entities_enabled:  False
[ INFO  ] words_enabled:     True
[ INFO  ] positions_enabled: True
[ INFO  ] windows size =     3
[ INFO  ] no. epochs =       1


### Paths 

In [None]:
import os
import sys

from numpy.random import seed
seed(my_seed)

# For saving the model information:
home_path = '/home/mariano/work/python3.workspace/Event Detection - Experimentos Finales/Nguyen Replication/'

# Where to look for the event detection data set (it is not in the same format as on the webpage)
data_path = '/home/mariano/work/python3.workspace/Event Detection - Experimentos Finales/data/'

# if not os.path.exists(home_path):
#     home_path = '/home/maiso/Event Detection/Nguyen Replication/'
#     data_path = '/home/maiso/Event Detection/data/'


word2vec_path = os.path.join(data_path,'word2vec/GoogleNews-vectors-negative300.bin')
my_word2vec_path = os.path.join(data_path, 'word2vec/my_word2vec.p')

training_sents_path = os.path.join(data_path,'training_sents.p') 
testing_sents_path = os.path.join(data_path,'testing_sents.p')



os.makedirs(os.path.join(home_path,'models/{}'.format(model_name)), exist_ok=True)

best_model_weights = os.path.join(home_path,'models/{}/weights.h5'.format(model_name))
best_model_architecture = os.path.join(home_path,'models/{}/architecture.json'.format(model_name))
history_path = os.path.join(home_path, 'models/{}/history.p'.format(model_name))




### Data

In [None]:
import pickle
oraciones = pickle.load(open(training_sents_path, 'rb'))
oraciones_testing = pickle.load(open(testing_sents_path, 'rb'))
info('Training/validation sentences retrieved (size = {:4.0f}).'.format(len(oraciones)))
info('Testing sentences             retrieved (size = {:4.0f}).'.format(len(oraciones_testing)))

[ INFO  ] Training/validation sentences retrieved (size = 2000).
[ INFO  ] Testing sentences             retrieved (size =  200).


### Features

##### positions

In [None]:
mitad = int(windows_size/2)
def token2position(token_idx, oracion):
    return list(map(abs,list(range(-mitad,mitad+1))))
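The position feature depends only on the window size: it is the absolute distance of each window slot from the centre token. A standalone sketch (the function name `position_features` is introduced here for illustration) shows the values for two window sizes.

```python
def position_features(window_size):
    """Absolute offset of every slot in the window relative to the centre token."""
    half = window_size // 2
    return [abs(offset) for offset in range(-half, half + 1)]

window_3 = position_features(3)
window_5 = position_features(5)
print(window_3, window_5)
```

Because the largest value is `window_size // 2`, the position embedding layer needs `window_size // 2 + 1` entries.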

##### entities

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# # # # # # # # # # # # # #
# ENTITY VOCAB GENERATION #
# # # # # # # # # # # # # #
entity_vocab = set()
for oracion in oraciones:
    if not 'spacy_doc' in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
    assert all([token.idx==i and (token.idx+len(token.text))==f for (i,f),token in zip(oracion['tokens'], oracion['spacy_doc'])]), 'Stored token offsets do not match the spaCy tokenization'
    e = [token.ent_iob_+'-'+token.ent_type_ for token in oracion['spacy_doc']]
    entity_vocab.update(set(e))

for oracion in oraciones_testing:
    if not 'spacy_doc' in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
    assert all([token.idx==i and (token.idx+len(token.text))==f for (i,f),token in zip(oracion['tokens'], oracion['spacy_doc'])]), 'Stored token offsets do not match the spaCy tokenization'
    e = [token.ent_iob_+'-'+token.ent_type_ for token in oracion['spacy_doc']]
    entity_vocab.update(set(e))

entity_vocab.add('[PAD]')
entity_vocab = list(entity_vocab)
ent2index = dict([(ent,index) for index,ent in enumerate(entity_vocab)])
info('Entity Vocabulary Size: {}'.format(len(ent2index)))

# # # # # # # # # # # # # # # # # # # # # # # #
# FUNCTION DEFINITION TO GET THE FEATURE LIST #
# # # # # # # # # # # # # # # # # # # # # # # #

def token2entity(token_idx, oracion):
    doc = oracion['spacy_doc']
    token = doc[token_idx]
    l = int(windows_size/2)
    idx = token.i
    ini = max(0,idx-l)
    fin = min(len(doc),1+idx+l)
    
    x = [ent2index[t.ent_iob_+'-'+t.ent_type_] for t in doc[ini:fin]]

    if (ini>idx-l):
        # PAD at the beginning
        dif = ini -(idx-l)
        x = [ent2index['[PAD]']]*dif +x
    if (fin< 1+idx+l):
        # PAD at the end
        dif = 1+ idx+l-fin
        x = x+  [ent2index['[PAD]']]*dif
    return x

[ INFO  ] Entity Vocabulary Size: 35


##### word2vec


In [None]:
import gensim
# # # # # # # # # # # #
# WORD2VEC GENERATION #
# # # # # # # # # # # #
if not os.path.isfile(my_word2vec_path):
    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True) 
    # GENERATION
    my_word2vec = {}
    for oracion in oraciones:
        for ini,fin in oracion['tokens']:
            token = oracion['texto'][ini:fin]
            if token in word2vec_model:
                my_word2vec[token] = word2vec_model[token]
    for oracion in oraciones_testing:
        for ini,fin in oracion['tokens']:
            token = oracion['texto'][ini:fin]
            if token in word2vec_model:
                my_word2vec[token] = word2vec_model[token]
                
    pickle.dump(my_word2vec, open(my_word2vec_path, 'wb'))
else:
    my_word2vec = pickle.load(open(my_word2vec_path,'rb'))
    
vocab = set(my_word2vec.keys())
vocab.update(set(['[UNK]','[PAD]']))
vocab = list(vocab)
info('Word Vocabulary Size: {}'.format(len(vocab)))
word2index = dict([(word,idx) for idx, word in enumerate(vocab)])

# # # # # # # # # # # # # # # # # # # # # # # #
# FUNCTION DEFINITION TO GET THE FEATURE LIST #
# # # # # # # # # # # # # # # # # # # # # # # #

def token2word(token_idx, oracion):
    token_list = oracion['tokens']
    texto = oracion['texto']
    idx = token_idx
    
    l = int(windows_size/2)
    ini = max(0,idx-l)
    fin = min(len(token_list),1+idx+l)

    x = [word2index[texto[i:j]] if texto[i:j] in word2index else word2index['[UNK]'] for i,j in token_list[ini:fin]]
    if (ini>idx-l):
        # PAD at the beginning
        dif = ini -(idx-l)
        x = [word2index['[PAD]']]*dif +x
    if (fin< 1+idx+l):
        # PAD at the end
        dif = 1+ idx+l-fin
        x = x+  [word2index['[PAD]']]*dif

    return x

# token2word(0, oraciones[0])

[ INFO  ] Word Vocabulary Size: 8647
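Both `token2entity` and `token2word` follow the same windowing scheme: take the tokens around a centre index and pad with `[PAD]` wherever the window runs past the sentence boundary. The sketch below isolates that logic on plain strings; the helper name `window_with_padding` and the example tokens are introduced here for illustration, and the notebook functions return vocabulary indices rather than strings.

```python
def window_with_padding(items, idx, window_size, pad='[PAD]'):
    """Return the window of `window_size` items centred on `idx`, padded at the edges."""
    half = window_size // 2
    start = max(0, idx - half)
    end = min(len(items), idx + half + 1)
    left_pad = start - (idx - half)       # slots missing before the sentence start
    right_pad = (idx + half + 1) - end    # slots missing after the sentence end
    return [pad] * left_pad + list(items[start:end]) + [pad] * right_pad

tokens = ['Stocks', 'fell', 'sharply']
first = window_with_padding(tokens, 0, 3)
last = window_with_padding(tokens, 2, 5)
print(first)
print(last)
```

Every window has exactly `window_size` entries, which is what the embedding layers expect as `input_length`.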


### The Model

##### Custom metrics

In [None]:
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore",category=DeprecationWarning)
    warnings.filterwarnings("ignore",category=FutureWarning)
    from keras import backend as K

def f1_score(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    sens =  true_positives / (possible_positives + K.epsilon())
    true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1)))
    spec = true_negatives / (possible_negatives + K.epsilon())
    return 2*((sens*spec)/(sens+spec+K.epsilon()))

def sensitivity(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def specificity(y_true, y_pred):
    true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1)))
    return true_negatives / (possible_negatives + K.epsilon())
info('f1-score, sensitivity and specificity custom metrics defined.')

Using TensorFlow backend.


[ INFO  ] f1-score, sensitivity and specificity custom metrics defined.


##### callbacks

In [None]:
from keras.callbacks import EarlyStopping,ModelCheckpoint
if not evaluate_only:
    mc = ModelCheckpoint(best_model_weights, monitor='val_f1_score',save_weights_only=True, save_best_only=True,mode='max')
    es = EarlyStopping(monitor='val_f1_score', mode='max', verbose=1, patience=patience)
info('Early Stopping Defined (patience={})'.format(patience))
info('Saving model weights into: {}'.format(best_model_weights))

[ INFO  ] Early Stopping Defined (patience=200)
[ INFO  ] Saving model weights into: /home/mariano/work/python3.workspace/Event Detection - Experimentos Finales/Nguyen Replication/models/default_(1)_w3/weights.h5
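The interaction between `epochs` and `patience` can be illustrated with a plain-Python simulation. This is a simplified sketch of patience-based early stopping on a monitored score that should be maximised, not the Keras implementation; the function name and the validation scores are made up.

```python
def simulate_early_stopping(val_scores, patience):
    """Return (last epoch run, best score) for a maximised metric."""
    best = float('-inf')
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0          # improvement: reset the counter
        else:
            wait += 1         # no improvement this epoch
            if wait >= patience:
                return epoch, best
    return len(val_scores), best

scores = [0.50, 0.60, 0.58, 0.59, 0.57, 0.70]
stopped_at, best_score = simulate_early_stopping(scores, patience=3)
print(stopped_at, best_score)
```

With a small patience, training stops before the later improvement is seen, which is one reason to pair a large `epochs` value with a generous `patience` and to checkpoint the best weights.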


##### the architecture

In [None]:
from keras.models import Sequential,Model
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Embedding,Input, Dropout# Create the model
import numpy as np
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers import Lambda
import tensorflow as tf
from keras.backend import expand_dims
from keras.layers.merge import concatenate
import tensorflow
from keras.models import model_from_json
# clear session
from keras.regularizers import l1_l2

# tensorflow.keras.backend.clear_session()
# Parameters

def define_architecture():
    embedding_size = 300
    vocab_size = len(vocab)



    # Model
    words_input_layer = Input((None,))
    entities_input_layer = Input((None,))
    positions_input_layer = Input((None,))

    # transfer learning with word2vec
    emb_matrix = np.random.random((len(vocab),300))
    for idx,word in enumerate(vocab):
        if word in my_word2vec:
            emb_matrix[idx,:] = my_word2vec[word]

    words_emb_layer = Embedding(vocab_size, 300, input_length=windows_size, weights=[emb_matrix],activity_regularizer=l1_l2(0.01,0.01),)(words_input_layer)

    entities_emb_layer = Embedding(len(entity_vocab), 50, input_length=windows_size,activity_regularizer=l1_l2(0.01,0.01),)(entities_input_layer)
    positions_emb_layer = Embedding(int(windows_size/2)+1, 50, input_length=windows_size,activity_regularizer=l1_l2(0.01,0.01),)(positions_input_layer)
    
    vec_length = 0
    embedding_layers = []
    if words_enabled:
        embedding_layers.append(words_emb_layer)
        vec_length+=300
    if entities_enabled:
        embedding_layers.append(entities_emb_layer)
        vec_length+=50
    if positions_enabled:
        embedding_layers.append(positions_emb_layer)
        vec_length+=50

    if len(embedding_layers)>1:
        concat_layer  = concatenate(embedding_layers)
    else:
        concat_layer = embedding_layers[0] # the list has a single element, so we take it and use it as the input

    lamb_layer = Lambda(lambda x: expand_dims(x, 3))(concat_layer)
    # conv_layer = Conv2D(nro_of_filters, (size_of_filters, windows_size),activity_regularizer=l1_l2(0.01,0.01),  padding='same', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)

    pool_layers = []
    
    if windows_size==1:
        conv_layer_1 = Conv2D(nro_of_filters, (1, vec_length),  padding='valid', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)
        pool_layer_1 = MaxPooling2D(pool_size=(1,1))(conv_layer_1)
        pool_layers.append(pool_layer_1)
    if windows_size>=2:
        conv_layer_2 = Conv2D(nro_of_filters, ( 2, vec_length),  padding='valid', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)
        pool_layer_2 = MaxPooling2D(pool_size=(windows_size-1,1))(conv_layer_2)
        pool_layers.append(pool_layer_2)
    
    if windows_size>=3:
        conv_layer_3 = Conv2D(nro_of_filters, ( 3, vec_length),  padding='valid', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)
        pool_layer_3 = MaxPooling2D(pool_size=(windows_size-2,1))(conv_layer_3)
        pool_layers.append(pool_layer_3)

    if windows_size>=4:
        conv_layer_4 = Conv2D(nro_of_filters, ( 4, vec_length),  padding='valid', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)
        pool_layer_4 = MaxPooling2D(pool_size=(windows_size-3,1))(conv_layer_4)
        pool_layers.append(pool_layer_4)

    if windows_size>=5:
        conv_layer_5 = Conv2D(nro_of_filters, ( 5, vec_length),  padding='valid', activation='relu', kernel_constraint=maxnorm(3))(lamb_layer)
        pool_layer_5 = MaxPooling2D(pool_size=(windows_size-4,1))(conv_layer_5)
        pool_layers.append(pool_layer_5)


    if len(pool_layers)==1:
        conv_layer = pool_layers[0]
    else:
        conv_layer = concatenate(pool_layers, axis=1)

    #pool_layer = MaxPooling2D(pool_size=(10,1))(conv_layer)
    flat_layer = Flatten()(conv_layer)
    dropout_layer = Dropout(0.5)(flat_layer)
    dense_layer = Dense(1, activation='sigmoid')(dropout_layer)


    inputs = []
    if words_enabled:
        inputs.append(words_input_layer)
    if entities_enabled:
        inputs.append(entities_input_layer)
    if positions_enabled:
        inputs.append(positions_input_layer)
    model = Model(inputs=inputs, outputs=[dense_layer])
    model_json = model.to_json()
    with open(best_model_architecture, "w") as json_file:
        json_file.write(model_json)
        
    return model

        
        
if evaluate_only:
    try:
        # load json and create model
        print('[WARNING] Model retrieved from JSON')
        json_file = open(best_model_architecture, 'r')
        loaded_model_json = json_file.read()
        json_file.close()
        model = model_from_json(loaded_model_json)
    except:
        print('[WARNING] JSON could not be properly loaded')
        model = define_architecture()
else:
    model = define_architecture()
    

# Compilation
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy',sensitivity,specificity, f1_score])
model.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 3, 300)       2594100     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 3, 50)        100         input_3[0][0]                    
__________________________________________________________________________________________________
concatenat
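The shape arithmetic behind the convolution branches in `define_architecture` can be checked without Keras. The sketch below assumes the behaviour of `'valid'` convolutions and of max pooling with stride equal to the pool size; the helper name `conv_branch_summary` is introduced here for illustration.

```python
def conv_branch_summary(window_size, n_filters=150, max_kernel=5):
    """Map each kernel height to (rows after conv, rows after pooling) and count flat features."""
    if window_size == 1:
        kernel_heights = [1]
    else:
        kernel_heights = list(range(2, min(window_size, max_kernel) + 1))

    branches = {}
    for k in kernel_heights:
        conv_rows = window_size - k + 1    # 'valid' convolution over the window axis
        pool_rows = window_size - (k - 1)  # pool size used in the notebook
        branches[k] = (conv_rows, conv_rows // pool_rows)

    # Each branch is pooled down to one row with n_filters channels.
    flat_features = sum(pooled for _, pooled in branches.values()) * n_filters
    return branches, flat_features

branches_3, flat_3 = conv_branch_summary(3)
branches_11, flat_11 = conv_branch_summary(11)
print(branches_3, flat_3)
print(branches_11, flat_11)
```

Under these assumptions, each pooling layer covers the whole convolution output, so every branch contributes `nro_of_filters` values to the flattened vector fed to the final dense layer.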

##### the features

In [None]:
from keras.utils import to_categorical

x1 = np.array([token2word(idx, oracion) for oracion in oraciones for idx,_ in enumerate(oracion['tokens'])])
x2 = np.array([token2entity(idx, oracion) for oracion in oraciones for idx,_ in enumerate(oracion['tokens'])])
x3 = np.array([token2position(idx, oracion) for oracion in oraciones for idx,_ in enumerate(oracion['tokens'])])


y = np.array([valor for oracion in oraciones for valor in oracion['etiquetas']])


indices = np.arange(len(x1))
np.random.shuffle(indices)
train_idx,test_idx = indices[:int(len(indices)*.80)], indices[int(len(indices)*.8):]

x1_train, x1_test = x1[train_idx], x1[test_idx]
x2_train, x2_test = x2[train_idx], x2[test_idx]
x3_train, x3_test = x3[train_idx], x3[test_idx]

y_train, y_val = y[train_idx],y[test_idx]

flags = [words_enabled, entities_enabled, positions_enabled]
info('Training set size: {:6,.0f} tokens'.format(len(y_train)))
info('Testing  set size: {:6,.0f} tokens'.format(len(y_val)))
info('Word token list for first training data:     {}'.format(x1_train[0]))
info('Entity token list for first training data:   {}'.format(x2_train[0]))
info('Position token list for first training data: {}'.format(x3_train[0]))

[ INFO  ] Training set size: 61,303 tokens
[ INFO  ] Testing  set size: 15,326 tokens
[ INFO  ] Word token list for first training data:     [1937 6860 3455]
[ INFO  ] Entity token list for first training data:   [11 11 11]
[ INFO  ] Position token list for first training data: [1 0 1]
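
The 80/20 split above shuffles with NumPy's global random state, so it is only reproducible if that state was seeded beforehand. A minimal sketch of the same split with a local, explicitly seeded generator is shown below; `split_indices` and its parameters are hypothetical names introduced for illustration and are not part of the original code.

```python
import numpy as np

def split_indices(n_items, train_fraction=0.8, seed=1):
    """Return shuffled (train_idx, val_idx) index arrays for n_items samples."""
    rng = np.random.RandomState(seed)    # local generator, independent of the global state
    indices = rng.permutation(n_items)   # shuffled copy of 0..n_items-1
    cut = int(n_items * train_fraction)  # same truncation rule as int(len(indices)*.80)
    return indices[:cut], indices[cut:]

train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))  # 8 2
```

The returned arrays can index `x1`, `x2`, `x3` and `y` exactly as `train_idx` and `test_idx` do above.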


##### the training

In [None]:
train_data = []
val_data = []
if words_enabled:
    train_data.append(np.vstack(x1_train))
    val_data.append(np.vstack(x1_test))
if entities_enabled:
    train_data.append(np.vstack(x2_train))
    val_data.append(np.vstack(x2_test))
if positions_enabled:
    train_data.append(np.vstack(x3_train))
    val_data.append(np.vstack(x3_test))
    
if not evaluate_only:
    info('Starting training.')
    history = model.fit(train_data,
                        np.vstack(y_train),
                        batch_size=batch_size,
                        callbacks=[mc, es],
                        validation_data=(val_data, np.vstack(y_val)),
                        epochs=epochs)
    
    info('Finish training, saving weights.')
    pickle.dump(history.history, open(history_path, 'wb'))
else:    
    warning('NO TRAINING, EVALUATE ONLY')

[ INFO  ] Starting training.
Train on 61303 samples, validate on 15326 samples
Epoch 1/1
[ INFO  ] Finish training, saving weights.


In [None]:
print('[  OK   ] # # # # # # # # # ON BATCH # # # # # # # # # # # ')
print('[  OK   ] Loading best weights.')
model.load_weights(best_model_weights)

print('[  OK   ] Loading test set')
x1 = np.array([token2word(idx, oracion) for oracion in oraciones_testing for idx,_ in enumerate(oracion['tokens'])])
x2 = np.array([token2entity(idx, oracion) for oracion in oraciones_testing for idx,_ in enumerate(oracion['tokens'])])
x3 = np.array([token2position(idx, oracion) for oracion in oraciones_testing for idx,_ in enumerate(oracion['tokens'])])

y = np.array([valor for oracion in oraciones_testing for valor in oracion['etiquetas']])

test_data = []
if words_enabled:
    test_data.append(np.vstack(x1))
if entities_enabled:
    test_data.append(np.vstack(x2))
if positions_enabled:
    test_data.append(np.vstack(x3))


print('[  OK   ] Performing the prediction.')
print('[ INFO  ] TRAINING')
values = model.evaluate(train_data,
                    np.vstack(np.hstack([y_train])),
                        batch_size=batch_size
    )
for metric,value in zip(model.metrics_names, values):
    print('{}: {:5.4f}'.format(metric, value))

print('[ INFO  ] VAL')
values = model.evaluate(val_data,
                    np.vstack(np.hstack([y_val])),
                        batch_size=batch_size
    )
for metric,value in zip(model.metrics_names, values):
    print('{}: {:5.4f}'.format(metric, value))


print('[ INFO  ] TESTING')
values = model.evaluate(test_data,
                    np.vstack(y),
                        batch_size=batch_size
    )
for metric,value in zip(model.metrics_names, values):
    print('{}: {:5.4f}'.format(metric, value))


     

[  OK   ] # # # # # # # # # ON BATCH # # # # # # # # # # # 
[  OK   ] Loading best weights.
[  OK   ] Loading test set
[  OK   ] Performing the prediction.
[ INFO  ] TRAINING
loss: 9.7533
acc: 0.9415
sensitivity: 0.1866
specificity: 0.9942
f1_score: 0.2521
[ INFO  ] VAL
loss: 10.3884
acc: 0.9314
sensitivity: 0.1520
specificity: 0.9905
f1_score: 0.2091
[ INFO  ] TESTING
loss: 12.1221
acc: 0.9455
sensitivity: 0.1860
specificity: 0.9898
f1_score: 0.2511


## Proposed Model (RNN)

### arguments and parameters


In [None]:
import sys
import getopt

my_seed = 1
model_name = 'default'

epochs = 1

bert_enabled = True
bert_sent_enabled=True
dep_enabled = False
tag_enabled = False
pos_enabled = False
words_enabled = False
entities_enabled = True
spacyvecs_enabled = True

evaluate_only=True

lstm_layers = []

**Run the next cell only when executing the code from the console. It parses `sys.argv`, so inside the notebook it should be skipped.**

In [None]:
# RUN THE FOLLOWING CODE ONLY FROM THE CONSOLE; SKIP IT INSIDE THE NOTEBOOK.


argv = sys.argv[1:]

try:
    opts, args = getopt.getopt(argv,"",["seed=",
                                        "model_name=",
                                        "epochs=",
                                        "bert_enabled=",
                                        'bert_sent_enabled=',
                                        "pos_enabled=",
                                        "dep_enabled=",
                                        "tag_enabled=",
                                        "words_enabled=",
                                        "entities_enabled=",
                                        "lstm_layers=",
                                        'spacyvecs_enabled=',
                                        'evaluate_only='                                        
                                       ])
except getopt.GetoptError:
    print('Bad parameters')
    sys.exit(2)
for opt, arg in opts:
    if opt == "--seed":
        my_seed = int(arg)
    elif opt == "--model_name":
        model_name = arg
    elif opt == "--epochs":
        epochs = int(arg)
    elif opt == "--bert_enabled":
        bert_enabled = arg=="True"
    elif opt == "--pos_enabled":
        pos_enabled = arg=="True"
    elif opt == "--dep_enabled":
        dep_enabled = arg=="True"
    elif opt == "--tag_enabled":
        tag_enabled = arg=="True"
    elif opt == "--words_enabled":
        words_enabled = arg=="True"
    elif opt == "--entities_enabled":
        entities_enabled = arg=="True"
    elif opt == "--lstm_layers":
        lstm_layers = [int(value) for value in arg.split(',')]
    elif opt == "--spacyvecs_enabled":
        spacyvecs_enabled = arg=="True"
    elif opt == '--bert_sent_enabled':
        bert_sent_enabled = arg=='True'
    elif opt == '--evaluate_only':
        evaluate_only = arg=='True'
        


Bad parameters


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
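
The `getopt` parsing above aborts with `Bad parameters` whenever `sys.argv` contains an option it does not know. As a hedged alternative, the sketch below uses `argparse.parse_known_args`, which collects unknown options instead of failing. It covers only a subset of the flags; `str2bool` and `build_parser` are hypothetical helpers, and the defaults mirror the ones set in the cell above.

```python
import argparse

def str2bool(text):
    """Treat the literal string 'True' as True and anything else as False, like the getopt code."""
    return text == "True"

def build_parser():
    parser = argparse.ArgumentParser(description="Proposed model (RNN) options")
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--model_name", default="default")
    parser.add_argument("--epochs", type=int, default=1)
    # Comma-separated layer sizes, e.g. --lstm_layers=100,15,5
    parser.add_argument("--lstm_layers", default=[],
                        type=lambda s: [int(v) for v in s.split(",")])
    for flag in ("bert_enabled", "evaluate_only"):
        parser.add_argument("--" + flag, type=str2bool, default=True)
    return parser

# Unknown options (here a made-up "-f kernel.json") are returned separately instead of raising.
example_argv = ["--seed=3", "--lstm_layers=100,15", "--bert_enabled=False", "-f", "kernel.json"]
args, unknown = build_parser().parse_known_args(example_argv)
print(args.seed, args.lstm_layers, args.bert_enabled, args.evaluate_only)
```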


In [None]:
model_name = '{}_{}_({})'.format(model_name,'_'.join([str(value) for value in lstm_layers]), my_seed)
                          
print("Parameters:")
print("\tseed: {}".format(my_seed))
print('\tmodel_name: {}'.format(model_name))

print('bert_enabled: {}'.format(bert_enabled))
print('dep_enabled: {}'.format(dep_enabled))
print('tag_enabled: {}'.format(tag_enabled))
print('pos_enabled: {}'.format(pos_enabled))
print('words_enabled: {}'.format(words_enabled))
print('entities_enabled: {}'.format(entities_enabled))
print('spacyvecs_enabled: {}'.format(spacyvecs_enabled))
print('bert_sent_enabled: {}'.format(bert_sent_enabled))

if evaluate_only:
    print('[WARNING] EVALUATE ONLY !!')

### Paths 

In [None]:
import os
import sys

from numpy.random import seed
seed(my_seed)

# home_path = '/home/maiso/Event Detection/My RNN Model/'
# data_path = '/home/maiso/Event Detection/data'


# For saving the model information:
home_path = '/home/mariano/work/python3.workspace/Event Detection - Experimentos Finales/My RNN Model/'

# Where to look for the event detection data set (its format differs from the one published on the web page)
data_path = '/home/mariano/work/python3.workspace/Event Detection - Experimentos Finales/data/'

if not os.path.exists(home_path):
    home_path = '/home/maiso/Event Detection/My RNN Model/'
    data_path = '/home/maiso/Event Detection/data'
word2vec_path = os.path.join(data_path,'word2vec/GoogleNews-vectors-negative300.bin')
my_word2vec_path = os.path.join(data_path, 'word2vec/my_word2vec.p')

training_sents_path = os.path.join(data_path,'training_sents.p') 
testing_sents_path = os.path.join(data_path,'testing_sents.p')


os.makedirs(os.path.join(home_path,'models/{}'.format(model_name)), exist_ok=True)

best_model_weights = os.path.join(home_path,'models/{}/weights.h5'.format(model_name))
best_model_architecture = os.path.join(home_path,'models/{}/architecture.json'.format(model_name))
history_path = os.path.join(home_path, 'models/{}/history.p'.format(model_name))



flags = [bert_enabled, bert_sent_enabled, dep_enabled, tag_enabled, pos_enabled, words_enabled, entities_enabled,spacyvecs_enabled]
flags

[True, True, False, False, False, False, True, True]
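
The `flags` list works as a positional mask: later cells pair it with lists of layers, inputs and per-sentence features through `zip`, so all of those lists have to follow the same order. A minimal sketch of that selection, where `feature_names` is a hypothetical list added only to make the result readable:

```python
# Hypothetical labels, listed in the same order as the flags
feature_names = ['bert', 'bert_sent', 'dep', 'tag', 'pos', 'words', 'entities', 'spacyvecs']
flags = [True, True, False, False, False, False, True, True]

# Keep only the entries whose flag is set
enabled = [name for name, flag in zip(feature_names, flags) if flag]
print(enabled)  # ['bert', 'bert_sent', 'entities', 'spacyvecs']
```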

### Data

In [None]:
import pickle
import spacy
nlp = spacy.load('en_core_web_sm')
oraciones = pickle.load(open(training_sents_path, 'rb'))
for oracion in oraciones:
    if 'spacy_doc' not in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
        
oraciones_testing = pickle.load(open(testing_sents_path, 'rb'))
for oracion in oraciones_testing:
    if 'spacy_doc' not in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
print(len(oraciones))
print(len(oraciones_testing))

2000
200


### Features

##### POStag && DEPtag && reduced POStag


In [None]:
tag_vocab = list(set([token.tag_ for oracion in oraciones+oraciones_testing for token in oracion['spacy_doc'] ]))
tag2index = dict([(tag,idx) for idx,tag in enumerate(tag_vocab)])
print('size of POSTag: {}'.format(len(tag2index)))

dep_vocab = list(set([token.dep_ for oracion in oraciones+oraciones_testing for token in oracion['spacy_doc'] ]))
dep2index = dict([(dep,idx) for idx,dep in enumerate(dep_vocab)])
print('size of dependency tags: {}'.format(len(dep2index)))

pos_vocab = list(set([token.pos_ for oracion in oraciones+oraciones_testing for token in oracion['spacy_doc'] ]))
pos2index = dict([(pos,idx) for idx,pos in enumerate(pos_vocab)])
print('size of pos tags: {}'.format(len(pos2index)))

size of POSTag: 47
size of dependency tags: 47
size of pos tags: 16


##### entities

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# # # # # # # # # # # # # # #
# ENTITY VOCAB GENERATION   #
# # # # # # # # # # # # # # #
entity_vocab = set()
for oracion in oraciones:
    if 'spacy_doc' not in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
    # Check that the stored character offsets agree with spaCy's tokenization.
    # The message avoids the comprehension variables, which are out of scope here in Python 3.
    assert all([token.idx==i and (token.idx+len(token.text))==f for (i,f),token in zip(oracion['tokens'], oracion['spacy_doc'])]), 'token offsets do not match spaCy tokens in: {}'.format(oracion['texto'])
    e = [token.ent_iob_+'-'+token.ent_type_ for token in oracion['spacy_doc']]
    entity_vocab.update(set(e))
for oracion in oraciones_testing:
    if 'spacy_doc' not in oracion:
        oracion['spacy_doc'] = nlp(oracion['texto'])
    assert all([token.idx==i and (token.idx+len(token.text))==f for (i,f),token in zip(oracion['tokens'], oracion['spacy_doc'])]), 'token offsets do not match spaCy tokens in: {}'.format(oracion['texto'])
    e = [token.ent_iob_+'-'+token.ent_type_ for token in oracion['spacy_doc']]
    entity_vocab.update(set(e))

entity_vocab.add('[PAD]')
entity_vocab = list(entity_vocab)
ent2index = dict([(ent,index) for index,ent in enumerate(entity_vocab)])
print('entity vocabulary size: {}'.format(len(ent2index)))

entity vocabulary size: 35


##### word2vec


In [None]:
import gensim
# # # # # # # # # # # # #
# WORD2VEC GENERATION   #
# # # # # # # # # # # # #
if not os.path.isfile(my_word2vec_path):
    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
    # Build the reduced word2vec dictionary (only tokens present in the data set)
    my_word2vec = {}
    for oracion in oraciones:
        for ini,fin in oracion['tokens']:
            token = oracion['texto'][ini:fin]
            if token in word2vec_model:
                my_word2vec[token] = word2vec_model[token]
                
    for oracion in oraciones_testing:
        for ini,fin in oracion['tokens']:
            token = oracion['texto'][ini:fin]
            if token in word2vec_model:
                my_word2vec[token] = word2vec_model[token]

    pickle.dump(my_word2vec, open(my_word2vec_path, 'wb'))
else:
    my_word2vec = pickle.load(open(my_word2vec_path,'rb'))
    
vocab = set(my_word2vec.keys())
vocab.update(set(['[UNK]']))
vocab = list(vocab)
print('vocabulary size: {}'.format(len(vocab)))
word2index = dict([(word,idx) for idx, word in enumerate(vocab)])

vocabulary size: 8646
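
The mapping above is built from a `set`, and the iteration order of a set of strings can change between Python processes because of hash randomization. A minimal sketch of a sorted vocabulary with an `[UNK]` fallback follows; `build_index` and `lookup` are hypothetical helpers and the tokens are toy data.

```python
def build_index(tokens, unk='[UNK]'):
    """Map every distinct token (plus the unknown marker) to a stable integer index."""
    vocab = sorted(set(tokens) | {unk})  # sorting makes the order reproducible across runs
    return {word: idx for idx, word in enumerate(vocab)}

def lookup(word2index, token, unk='[UNK]'):
    """Return the index of token, falling back to the unknown marker."""
    return word2index.get(token, word2index[unk])

word2index = build_index(['prices', 'rose', 'sharply', 'rose'])
print(lookup(word2index, 'rose'))  # 2
print(lookup(word2index, 'fell'))  # 0, the index of '[UNK]'
```

A stable order may matter when the weights are reloaded in a separate process (for example with `evaluate_only=True`), since the embedding rows are addressed by these indices.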


##### spacy word2vec


In [None]:
spacy_vocab = set()
for oracion in oraciones:
    for token in oracion['spacy_doc']:
        if token.has_vector:
            spacy_vocab.add(token.text)
for oracion in oraciones_testing:
    for token in oracion['spacy_doc']:
        if token.has_vector:
            spacy_vocab.add(token.text)
spacy_vocab.add('[UNK]')
spacyword2index = dict([(word,idx) for idx, word in enumerate(spacy_vocab)])
len(spacyword2index)

9113

### The Model

##### Custom metrics

In [None]:
from keras import backend as K

def f1_score(y_true, y_pred):
    # Note: this is the harmonic mean of sensitivity and specificity,
    # not the usual precision/recall F1.
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    sens =  true_positives / (possible_positives + K.epsilon())
    true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1)))
    spec = true_negatives / (possible_negatives + K.epsilon())
    return 2*((sens*spec)/(sens+spec+K.epsilon()))

def sensitivity(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def specificity(y_true, y_pred):
    true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1)))
    return true_negatives / (possible_negatives + K.epsilon())

Using TensorFlow backend.
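
The `f1_score` metric defined above combines sensitivity and specificity, whereas the conventional F1 is the harmonic mean of precision and recall. The pure-Python sketch below contrasts the two on hard 0/1 predictions. The helper names and the toy labels are illustrative assumptions and are not taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def sens_spec_score(y_true, y_pred):
    """Harmonic mean of sensitivity and specificity (the combination used by f1_score above)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return harmonic_mean(sens, spec)

def standard_f1(y_true, y_pred):
    """Conventional F1: harmonic mean of precision and recall."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return harmonic_mean(precision, recall)

# Toy data: 2 positives out of 10 tokens, one found, one false alarm
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(sens_spec_score(y_true, y_pred), 4))  # 0.6364
print(round(standard_f1(y_true, y_pred), 4))      # 0.5
```

On imbalanced data the two values can differ noticeably, because the many true negatives raise specificity but do not enter precision or recall.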


##### callbacks

In [None]:
from keras.callbacks import EarlyStopping,ModelCheckpoint
mc = ModelCheckpoint(best_model_weights, monitor='val_f1_score', save_best_only=True,mode='max')
es = EarlyStopping(monitor='val_f1_score', mode='max', verbose=1, patience=400)

##### the architecture

In [None]:
from keras.models import Sequential,Model
from keras.layers import Dense, Embedding,Input, Dropout, Bidirectional, LSTM# Create the model
import numpy as np
from keras.constraints import maxnorm
from keras.optimizers import SGD
from keras.layers import Lambda
import tensorflow as tf
from keras.backend import expand_dims
from keras.layers.merge import concatenate
import tensorflow
from keras.models import model_from_json
# clear session
from keras.regularizers import l1_l2

# tensorflow.keras.backend.clear_session()
# Parameters
if evaluate_only:
    # load json and create model
    print('[WARNING] Model retrieved from JSON')
    
    json_file = open(best_model_architecture, 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
else:
    embedding_size = 300
    vocab_size = len(vocab)



    # Model
    bert_input = Input((None,768))
    bert_sent_input = Input((None,768))
    #embedding_wrd = Embedding(len(word2index),96, weights=[embedding_matrix])(input_wrd) #pretrained weights

    input_dep = Input((None,))
    embedding_dep  = Embedding(len(dep2index), 10)(input_dep)

    input_tag = Input((None,))
    embedding_tag= Embedding(len(tag2index), 10)(input_tag)

    input_pos = Input((None,))
    embedding_pos  = Embedding(len(pos2index), 10)(input_pos)

    input_word = Input((None,))

    embedding_matrix = np.random.random((len(word2index), 300))
    visitados = set()
    for oracion in oraciones:
        for token in oracion['spacy_doc']:
            if not token.text in visitados and token.text in word2index:
                embedding_matrix[word2index[token.text]]= my_word2vec[token.text]
                visitados.add(token.text)

    embedding_word  = Embedding(len(word2index), 300, weights=[embedding_matrix] )(input_word)

    ## 
    input_spacyword = Input((None,))

    embedding_matrix = np.random.random((len(spacyword2index), 96))
    visitados = set()
    for oracion in oraciones:
        for token in oracion['spacy_doc']:
            if not token.text in visitados and token.text in spacyword2index:
                embedding_matrix[spacyword2index[token.text]]= token.vector
                visitados.add(token.text)
    for oracion in oraciones_testing:
        for token in oracion['spacy_doc']:
            if not token.text in visitados and token.text in spacyword2index:
                embedding_matrix[spacyword2index[token.text]]= token.vector
                visitados.add(token.text)

    embedding_spacyword  = Embedding(len(spacyword2index), 96, weights=[embedding_matrix] )(input_spacyword)
    ##

    input_entities = Input((None,))
    embedding_entities  = Embedding(len(ent2index), 10)(input_entities)

    # Candidate feature layers, in the same order as `flags`
    layers = [bert_input,
              bert_sent_input,
              embedding_dep,
              embedding_tag,
              embedding_pos,
              embedding_word,
              embedding_entities,
              embedding_spacyword]

    layers = [layer for layer,flag in zip(layers,flags) if flag]

    if len(layers)>1:
        merged = concatenate(layers)
    else:
        merged = layers[0]

    # merged = concatenate([bert_input, 
    #                      embedding_dep,
    #                      embedding_tag,
    #                      embedding_pos,
    #                      embedding_word,
    #                      embedding_entities                     
    #                      ])

    # Joint architecture
    dropout_1 = Dropout(0.1)(merged)

    last_layer = dropout_1
    for lstm in lstm_layers:
        lstm_i = Bidirectional(LSTM(lstm,return_sequences=True,activity_regularizer=l1_l2(0.001,0.001)))(last_layer)
        last_layer = lstm_i
    # lstm_1 = Bidirectional(LSTM(100,return_sequences=True,activity_regularizer=l1_l2(0.001,0.001)))(dropout_1)
    # lstm_2 = Bidirectional(LSTM(15,return_sequences=True,activity_regularizer=l1_l2(0.001,0.001)))(lstm_1)
    # lstm_3 = Bidirectional(LSTM(5,return_sequences=True,activity_regularizer=l1_l2(0.001,0.001)))(lstm_2)

    dense_out = Dense(1, activation='sigmoid')(last_layer)

    inputs = [bert_input, bert_sent_input, input_dep, input_tag, input_pos, input_word, input_entities,input_spacyword]

    inputs = [input_ for input_,flag in zip(inputs,flags) if flag]

    model = Model(inputs=inputs, outputs=[dense_out])

    model_json = model.to_json()
    with open(best_model_architecture, "w") as json_file:
        json_file.write(model_json)

    
model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy',sensitivity,specificity, f1_score])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_8 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (None, None, 768)    0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 768)    0                                            
__________________________________________________________________________________________________
embedding_

##### the features

In [None]:
import numpy as np
def sent2xlist(oracion):
#     xbert = np.zeros(shape=(1,len(oracion['spacy_doc']),768)) #oracion.bert_embeddings  # input layer, not embeddings
    xbert = np.array(oracion['bert'])[np.newaxis,:,:]
    xsentbert = np.average(xbert, axis=1) *np.ones(shape=(1,len(xbert[0,:,0]),768))
    xdep = np.zeros(shape=(1,len(oracion['spacy_doc']))) # one-dimensional because these are embedding indices.
    xtag = np.zeros(shape=(1,len(oracion['spacy_doc'])))
    xpos = np.zeros(shape=(1,len(oracion['spacy_doc'])))
    
    xwrd = np.zeros(shape=(1,len(oracion['spacy_doc']))) # embedding
    
    xents = np.zeros(shape=(1,len(oracion['spacy_doc'])))
    
    xspacy = np.zeros(shape=(1,len(oracion['spacy_doc'])))
    
    
#     for idx,token in enumerate(oracion.spacy_doc):
#         xwrd[0,idx] = word2index[token.lemma_.lower()]
    for idx,token in enumerate(oracion['spacy_doc']):
        xdep[0,idx] = dep2index[token.dep_]
    for idx,token in enumerate(oracion['spacy_doc']):
        xtag[0,idx] = tag2index[token.tag_]
    for idx,token in enumerate(oracion['spacy_doc']):
        xpos[0,idx] = pos2index[token.pos_]
        
    for idx,token in enumerate(oracion['spacy_doc']):
        if token.text in word2index:
            xwrd[0,idx] = word2index[token.text]
        else:
            xwrd[0,idx] = word2index['[UNK]']
        
    for idx,token in enumerate(oracion['spacy_doc']):
        xents[0,idx] = ent2index[token.ent_iob_+'-'+token.ent_type_]
        
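    for idx,token in enumerate(oracion['spacy_doc']):
        # Fall back to '[UNK]' for tokens that are not in the spaCy vocabulary
        xspacy[0,idx] = spacyword2index.get(token.text, spacyword2index['[UNK]'])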
    
    return [xbert, xsentbert, xdep, xtag, xpos,xwrd, xents, xspacy]
sent2xlist(oraciones[0])

def g(label_data):
    """Cycle endlessly over label_data, yielding one (x, y) pair at a time."""
    j = 0
    while True:
        yield label_data[j]
        j += 1
        if j == len(label_data):
            j = 0
                
sent2xlist(oraciones[0])[0].shape

(1, 33, 768)
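
In `sent2xlist`, `xsentbert` is the mean of the token-level BERT vectors repeated once per token, so every position carries the same sentence-level vector. The sketch below reproduces that computation on random data with explicit broadcasting. The array sizes and the random input are assumptions standing in for `oracion['bert']`.

```python
import numpy as np

n_tokens, dim = 4, 768
rng = np.random.RandomState(0)
xbert = rng.rand(1, n_tokens, dim)                 # stand-in for the token embeddings

sentence_vec = xbert.mean(axis=1, keepdims=True)   # shape (1, 1, dim)
xsentbert = np.broadcast_to(sentence_vec, xbert.shape).copy()  # shape (1, n_tokens, dim)

# Same result as the expression used in sent2xlist
reference = np.average(xbert, axis=1) * np.ones(shape=(1, n_tokens, dim))
print(xsentbert.shape, np.allclose(xsentbert, reference))  # (1, 4, 768) True
```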

##### the training

In [None]:
x = [sent2xlist(oracion) for oracion in oraciones]
y = [np.array(oracion['etiquetas'])[np.newaxis,:,np.newaxis] for oracion in oraciones]

x = [[elem for elem,flag in zip(list_,flags) if flag] for list_ in x]

indices = np.arange(len(x))
np.random.shuffle(indices)
train_idx,val_idx = set(indices[:int(len(indices)*.80)]), set(indices[int(len(indices)*.8):])

x_train, x_val = [elem for idx,elem in enumerate(x) if idx in train_idx], [elem for idx,elem in enumerate(x) if idx in val_idx]
y_train, y_val = [elem for idx,elem in enumerate(y) if idx in train_idx], [elem for idx,elem in enumerate(y) if idx in val_idx]
# x_train = [sent2xlist(oracion) for oracion in oraciones]
# y_train = [np.array(oracion['etiquetas'])[np.newaxis,:,np.newaxis] for oracion in oraciones]

# x_val = [sent2xlist(oracion) for oracion in oraciones]
# y_val = [np.array(oracion['etiquetas'])[np.newaxis,:,np.newaxis] for oracion in oraciones]

if not evaluate_only:
    history = model.fit_generator(g([(x,y) for x,y in zip(x_train,y_train)]), 
                                   epochs=epochs, 
                                   steps_per_epoch=len(x_train), 
                                   callbacks=[es,mc],
                                   validation_data=g([(x,y) for x,y in zip(x_val,y_val)]),
                                   validation_steps=len(x_val), 
                                  verbose=2)

    pickle.dump(history.history, open(history_path, 'wb'))
    
else:
    print('[WARNING] NO TRAINING, EVALUATE ONLY')
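
The generator `g` feeds the model one sentence per step and wraps around at the end of the list, so `steps_per_epoch=len(x_train)` amounts to one pass over the training sentences per epoch. The sketch below shows, on toy data, that this index-based loop behaves like `itertools.cycle`. The `batches` list is a hypothetical stand-in for the per-sentence `(x, y)` pairs.

```python
from itertools import cycle, islice

def g(label_data):
    """Index-based endless generator, equivalent to the one defined in the notebook."""
    j = 0
    while True:
        yield label_data[j]
        j += 1
        if j == len(label_data):
            j = 0

batches = ['s0', 's1', 's2']  # stand-ins for per-sentence (x, y) pairs
first_five = list(islice(g(batches), 5))
same_with_cycle = list(islice(cycle(batches), 5))
print(first_five)                     # ['s0', 's1', 's2', 's0', 's1']
print(first_five == same_with_cycle)  # True
```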



In [None]:
print('[  OK   ] Loading best weights from H5.')
model.load_weights(best_model_weights)

print('[  OK   ] Loading test set')
x_test = [sent2xlist(oracion) for oracion in oraciones_testing]
y_test = [np.array(oracion['etiquetas'])[np.newaxis,:,np.newaxis] for oracion in oraciones_testing]

x_test = [[elem for elem,flag in zip(list_,flags) if flag] for list_ in x_test]

print('[  OK   ] Performing the prediction.')

print()
print('[ INFO  ] TRAINING')
values = model.evaluate_generator(g([(x,y) for x,y in zip(x_train,y_train)]), steps=len(x_train))

for metric,value in zip(model.metrics_names, values):
    print('{}: {:5.4f}'.format(metric, value))

print()
print('[ INFO  ] VAL')
values = model.evaluate_generator(g([(x,y) for x,y in zip(x_val,y_val)]), steps=len(x_val))

for metric,value in zip(model.metrics_names, values):
    print('{}: {:5.4f}'.format(metric, value))


print()
print('[ INFO  ] TESTING')
values = model.evaluate_generator(g([(x,y) for x,y in zip(x_test,y_test)]), steps=len(x_test))

for metric,value in zip(model.metrics_names, values):
    print('{}: {:4.4f}'.format(metric, value))



[  OK   ] Loading best weights from H5.
[  OK   ] Loading test set
[  OK   ] Performing the prediction.

[ INFO  ] TRAINING
loss: 0.2090
acc: 0.9403
sensitivity: 0.2131
specificity: 0.9966
f1_score: 0.2691

[ INFO  ] VAL
loss: 0.2045
acc: 0.9424
sensitivity: 0.2232
specificity: 0.9968
f1_score: 0.2765

[ INFO  ] TESTING
loss: 0.1635
acc: 0.9529
sensitivity: 0.2328
specificity: 0.9968
f1_score: 0.2703
