# OneIE Demo page
This notebook demos the OneIE model for event extraction, using the authors' trained model. The paper can be found [here](https://www.aclweb.org/anthology/2020.acl-main.713/).

In [1]:
import os
import json
import glob
import tqdm
import traceback
from argparse import ArgumentParser

import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertConfig
from nltk.tokenize import sent_tokenize, word_tokenize

import sys

from pathlib import Path
cur_dir = Path.cwd()
sys.path.append(str(cur_dir.parents[0] / 'oneie'))

from model import OneIE
from config import Config
from util import save_result
from data import IEDatasetEval, InstanceLdcEval, BatchLdcEval
from convert import json_to_cs



format_ext_mapping = {'txt': 'txt', 'ltf': 'ltf.xml', 'json': 'json',
                      'json_single': 'json'}


### 1. Loading the Model
The model can be downloaded [from this link](http://blender.cs.illinois.edu/software/oneie/).  
Be sure to save the model in the `oneie/models` directory.
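
Before calling `load_model`, it can help to confirm that the checkpoint is where the notebook expects it. The following is a minimal sketch, assuming the `oneie/models/best.role.mdl` layout used in the next cell; `resolve_checkpoint` is a hypothetical helper and not part of OneIE.

```python
from pathlib import Path


def resolve_checkpoint(repo_root, name="best.role.mdl"):
    """Return the checkpoint path under <repo_root>/oneie/models, or raise.

    Hypothetical helper for this notebook; the directory layout is the one
    assumed by the cell that calls load_model.
    """
    path = Path(repo_root) / "oneie" / "models" / name
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a OneIE checkpoint at {path}; download it first."
        )
    return path
```

With this in place, `model_path = resolve_checkpoint(Path.cwd().parents[0])` would fail early with a readable message if the download step was skipped.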

In [6]:
from predict import load_model
model_path = cur_dir.parents[0] / 'oneie' / 'models' / 'best.role.mdl'
model, tokenizer, config = load_model(model_path, 
                                      device=0, 
                                      gpu=False,
                                      beam_size=5)

Loading the model from /home/vinitrinh/Desktop/Event Extraction/ONEIE/predictions/eai-dsta/oneie/models/best.role.mdl


### 2. Preprocess
Some preprocessing is needed before the model can make predictions. First, the raw text is split into word tokens. The expected output has the form:  
`[(doc_id_str, [(token1, 0, 1), (token2, 1, 2), ...])]`

In [3]:
text = "Prime Minister Abdullah Gul resigned earlier Tuesday to make way for Erdogan, who won a parliamentary seat in by-elections Sunday."
def text_to_tokens(text):
    """
    Tokenize the text into words with NLTK's word tokenizer.
    Returns:
        doc_tokens: a list with one (sentence id, tokens) pair
        tokens: a list of (token, start, end) triples, where start and end
            are token positions rather than character offsets
    """
    doc_id = 'asd'
    offset = 0
    tokens = word_tokenize(text)
    tokens = [(token, offset + i, offset + i + 1)
              for i, token in enumerate(tokens)]
    doc_tokens = [(f'{doc_id}0', tokens)]
    return doc_tokens, tokens

doc_tokens, tokens = text_to_tokens(text)
doc_tokens[:4]

[('asd0',
  [('Prime', 0, 1),
   ('Minister', 1, 2),
   ('Abdullah', 2, 3),
   ('Gul', 3, 4),
   ('resigned', 4, 5),
   ('earlier', 5, 6),
   ('Tuesday', 6, 7),
   ('to', 7, 8),
   ('make', 8, 9),
   ('way', 9, 10),
   ('for', 10, 11),
   ('Erdogan', 11, 12),
   (',', 12, 13),
   ('who', 13, 14),
   ('won', 14, 15),
   ('a', 15, 16),
   ('parliamentary', 16, 17),
   ('seat', 17, 18),
   ('in', 18, 19),
   ('by-elections', 19, 20),
   ('Sunday', 20, 21),
   ('.', 21, 22)])]
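
The function above treats the whole input as a single sentence. For longer inputs, the same `(sent_id, [(token, start, end), ...])` structure can hold one entry per sentence. The sketch below is an assumption about how that might look, with token positions that keep counting across sentences. It uses naive regex splitting as a stand-in for NLTK's `sent_tokenize` and `word_tokenize` so that it runs without extra downloads, and `doc_to_sentences` is a hypothetical name.

```python
import re


def doc_to_sentences(text, doc_id="asd"):
    """Split text into sentences of (token, start, end) triples.

    start/end are token positions (not character offsets) and keep
    increasing across sentences. The regex splitting is a rough stand-in
    for NLTK's tokenizers.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    doc_tokens, offset = [], 0
    for sent_idx, sentence in enumerate(sentences):
        # Words (optionally hyphenated) or single punctuation characters.
        words = re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)
        triples = [(w, offset + i, offset + i + 1) for i, w in enumerate(words)]
        offset += len(words)
        doc_tokens.append((f"{doc_id}{sent_idx}", triples))
    return doc_tokens


doc = doc_to_sentences("Gul resigned Tuesday. Erdogan won a seat.")
for sent_id, triples in doc:
    print(sent_id, triples)
```

Each sentence gets its own id (`asd0`, `asd1`), and the first token of the second sentence starts where the first sentence ended.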

Most of the important preprocessing happens in `numberize`, which is a method of `IEDatasetEval` in `data.py`.

As the name suggests, it converts string tokens into their integer indices; for example, the token 'asd' might correspond to index 345 in the vocabulary. The resulting output contains the word-piece indices and the attention mask that BERT expects.

One misleading name is `token_ids`: these ids have nothing to do with the model's tokenizer. Each one identifies a specific word in the document and follows the format `<doc-id>:<start>-<end>`.
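
The bookkeeping that `numberize` performs can be shown without loading BERT. The sketch below is illustrative only: `toy_wordpiece` is a hypothetical stand-in for `tokenizer.tokenize` that cuts words into chunks of at most four characters, and `max_length=12` is chosen only to keep the output short. It shows how `token_lens` records the number of pieces per token, how the ids are built as `<doc-id>:<start>-<end>`, and how the attention mask separates real positions from padding.

```python
def toy_wordpiece(word, size=4):
    """Hypothetical stand-in for tokenizer.tokenize: fixed-size chunks,
    with continuation pieces prefixed by '##' as in BERT's WordPiece."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return chunks[:1] + ["##" + c for c in chunks[1:]]


def toy_numberize(doc_id, sent_tokens, max_length=12):
    pieces, token_lens, token_ids = [], [], []
    for token_text, start, end in sent_tokens:
        token_pieces = toy_wordpiece(token_text)
        pieces.extend(token_pieces)
        token_lens.append(len(token_pieces))  # pieces per token
        token_ids.append(f"{doc_id}:{start}-{end}")
    # Two slots are reserved for the [CLS] and [SEP] special tokens.
    real_len = len(pieces) + 2
    if real_len > max_length:
        raise ValueError("sentence is too long for max_length")
    attn_mask = [1] * real_len + [0] * (max_length - real_len)
    return pieces, token_lens, token_ids, attn_mask


sent = [("Erdogan", 0, 1), ("won", 1, 2), ("by-elections", 2, 3)]
pieces, token_lens, token_ids, attn_mask = toy_numberize("asd", sent)
print(pieces)      # ['Erdo', '##gan', 'won', 'by-e', '##lect', '##ions']
print(token_lens)  # [2, 1, 3]
print(token_ids)   # ['asd:0-1', 'asd:1-2', 'asd:2-3']
print(attn_mask)   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Because `sum(token_lens)` equals `len(pieces)`, the piece sequence can always be regrouped into the original tokens.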

In [4]:
def numberize(data, tokenizer):
    numberized_data = []
    for i, (sent_id, sent_tokens) in enumerate(data):
        tokens = []
        token_ids = []
        pieces = []
        token_lens = []
        for token_text, start_char, end_char in sent_tokens:
            token_id = '{}:{}-{}'.format("asd", start_char, end_char)
            token_pieces = [p for p in tokenizer.tokenize(token_text) if p]
            if len(token_pieces) == 0:
                continue
            tokens.append(token_text)
            pieces.extend(token_pieces)
            token_lens.append(len(token_pieces))
            token_ids.append(token_id)

        # skip overlength sentences, set max_length = 200 for purpose of demo
        if len(pieces) > 200 - 2:
            continue
        # skip empty sentences
        if len(pieces) == 0:
            continue

        # pad word pieces with special tokens
        piece_idxs = tokenizer.encode(pieces,
                                      add_special_tokens=True,
                                      max_length=200)
        pad_num = 200 - len(piece_idxs)
        attn_mask = [1] * len(piece_idxs) + [0] * pad_num
        piece_idxs = piece_idxs + [0] * pad_num

        instance = InstanceLdcEval(
            sent_id=sent_id,
            tokens=tokens,
            token_ids=token_ids,
            pieces=pieces,
            piece_idxs=piece_idxs,
            token_lens=token_lens,
            attention_mask=attn_mask
        )
        numberized_data.append(instance)
    return numberized_data

numberize(doc_tokens, tokenizer)

[InstanceLdcEval(sent_id='asd0', tokens=['Prime', 'Minister', 'Abdullah', 'Gul', 'resigned', 'earlier', 'Tuesday', 'to', 'make', 'way', 'for', 'Erdogan', ',', 'who', 'won', 'a', 'parliamentary', 'seat', 'in', 'by-elections', 'Sunday', '.'], token_ids=['asd:0-1', 'asd:1-2', 'asd:2-3', 'asd:3-4', 'asd:4-5', 'asd:5-6', 'asd:6-7', 'asd:7-8', 'asd:8-9', 'asd:9-10', 'asd:10-11', 'asd:11-12', 'asd:12-13', 'asd:13-14', 'asd:14-15', 'asd:15-16', 'asd:16-17', 'asd:17-18', 'asd:18-19', 'asd:19-20', 'asd:20-21', 'asd:21-22'], pieces=['Prime', 'Minister', 'Abdullah', 'G', '##ul', 'resigned', 'earlier', 'Tuesday', 'to', 'make', 'way', 'for', 'E', '##rdo', '##gan', ',', 'who', 'won', 'a', 'parliamentary', 'seat', 'in', 'by', '-', 'elections', 'Sunday', '.'], piece_idxs=[101, 3460, 2110, 14677, 144, 4654, 4603, 2206, 9667, 1106, 1294, 1236, 1111, 142, 16525, 3820, 117, 1150, 1281, 170, 6774, 1946, 1107, 1118, 118, 3212, 3625, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

OneIE's model class only predicts on batches, so it does not accept the `InstanceLdcEval` objects we just created. For this demo we copy the `collate_fn` from `IEDatasetEval`, which turns a list of `InstanceLdcEval` objects into a `BatchLdcEval`.

In [5]:
def collate_fn(batch):
    batch_piece_idxs = []
    batch_tokens = []
    batch_token_lens = []
    batch_attention_masks = []
    batch_sent_ids = []
    batch_token_ids = []
    batch_token_nums = []

    for inst in batch:
        # Gather each field column-wise across the batch.
        batch_piece_idxs.append(inst.piece_idxs)
        batch_attention_masks.append(inst.attention_mask)
        batch_token_lens.append(inst.token_lens)
        batch_tokens.append(inst.tokens)
        batch_sent_ids.append(inst.sent_id)
        batch_token_ids.append(inst.token_ids)
        batch_token_nums.append(len(inst.tokens))

    batch_piece_idxs = torch.LongTensor(batch_piece_idxs)
    batch_attention_masks = torch.FloatTensor(
        batch_attention_masks)
    batch_token_nums = torch.LongTensor(batch_token_nums)

    return BatchLdcEval(sent_ids=batch_sent_ids,
                        token_ids=batch_token_ids,
                        tokens=batch_tokens,
                        piece_idxs=batch_piece_idxs,
                        token_lens=batch_token_lens,
                        attention_masks=batch_attention_masks,
                        token_nums=batch_token_nums)

All the steps are combined here. The functions below are not part of OneIE's source code; they are written to help us eyeball the entities, triggers, and roles that OneIE predicts.

In [6]:
def prepare_text(text):
    data, tokens = text_to_tokens(text)
    data = numberize(data, tokenizer)
    data = collate_fn(data)
    return data, tokens

In [7]:
def match_span(tokens, start, end):
    """Collect the token strings covered by the span from start to end."""
    matched_tokens = []
    collect = False
    for token_text, token_start, token_end in tokens:
        if token_start == start:
            collect = True
        if collect:
            matched_tokens.append(token_text)
        if end in (token_start, token_end):
            collect = False
    return matched_tokens


def examine_trigger_roles(graph, text, tokens):
    # Invert the vocabularies once: label index -> label string.
    entity_itos = {i: s for s, i in graph.vocabs['entity_type'].items()}
    trigger_itos = {i: s for s, i in graph.vocabs['event_type'].items()}
    role_itos = {i: s for s, i in graph.vocabs['role_type'].items()}

    print(text + '\n')

    print("entities:")
    entity_strings = []
    for start, end, entity_type in graph.entities:
        matched_tokens = match_span(tokens, start, end)
        entity_strings.append(matched_tokens)
        print(f"   {' '.join(matched_tokens)} - {entity_itos[entity_type]}")

    print("\ntriggers:")
    trigger_class_strings = []
    for start, end, trigger_type in graph.triggers:
        matched_tokens = match_span(tokens, start, end)
        trigger_label = trigger_itos[trigger_type]
        trigger_class_strings.append(trigger_label)
        print(f"   {' '.join(matched_tokens)} - {trigger_label}")

    print("\nroles:")
    for trigger_idx, entity_idx, role_type in graph.roles:
        # Roles refer to triggers and entities by their list positions.
        entity_string = ' '.join(entity_strings[entity_idx])
        role_label = role_itos[role_type]
        print(f"   {entity_string} - {role_label} - {trigger_class_strings[trigger_idx]}")

In [8]:
def demo(text):
    data, tokens = prepare_text(text)
    graph = model.predict(data)

    graph = graph[0]
    graph.clean(relation_directional=config.relation_directional,
                symmetric_relations=config.symmetric_relations)

    examine_trigger_roles(graph, text, tokens)
    return graph

### 3. Demo model
Enter any input text below to run the demo.

In [9]:
text = "Prime Minister Abdullah Gul resigned earlier Tuesday to make way for Erdogan, who won a parliamentary seat in by-elections Sunday."
graph = demo(text)

Prime Minister Abdullah Gul resigned earlier Tuesday to make way for Erdogan, who won a parliamentary seat in by-elections Sunday.

entities:
   Minister - PER
   Abdullah Gul - PER
   Erdogan - PER
   who - PER
   parliamentary - ORG

triggers:
   resigned - Personnel:End-Position
   won - Personnel:Elect

roles:
   Prime - Person
   Minister - Person


In [10]:
text = "A civilian aid worker from San Francisco was killed in an attack in Afghanistan"
graph = demo(text)

A civilian aid worker from San Francisco was killed in an attack in Afghanistan

entities:
   civilian - PER
   worker - PER
   San Francisco - GPE
   Afghanistan - GPE

triggers:
   killed - Life:Die
   attack - Conflict:Attack

roles:
   A - Victim
   civilian - Target
   A civilian aid - Place
   civilian aid - Place


In [19]:
text = "CNN_CF_20030303.1900.02 In San Francisco there is Harry no event in this sentence"
graph = demo(text)

CNN_CF_20030303.1900.02 In San Francisco there is Harry no event in this sentence

entities:
   San Francisco - GPE
   Harry - PER

triggers:

roles:


### 4. OneIE's key output: graph
The key output of this process is the `graph` object that OneIE returns. The entity mentions, triggers, relations, and roles are all available as attributes of the object, and the `to_dict` method conveniently gathers them into a single dictionary.

In [12]:
dir(graph)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'add_entity',
 'add_relation',
 'add_role',
 'add_trigger',
 'clean',
 'copy',
 'empty_graph',
 'entities',
 'entity_num',
 'entity_scores',
 'graph_local_score',
 'mentions',
 'relation_num',
 'relation_scores',
 'relations',
 'role_num',
 'role_scores',
 'roles',
 'to_dict',
 'to_label_idxs',
 'trigger_num',
 'trigger_scores',
 'triggers',
 'vocabs']

In [13]:
graph.to_dict()

{'entities': [], 'triggers': [], 'relations': [], 'roles': []}
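
To make the structure concrete, the sketch below decodes a graph dictionary into readable strings. It assumes, based on how `graph.entities`, `graph.triggers` and `graph.roles` are unpacked earlier in this notebook, that entities and triggers are `[start, end, label]` spans over token positions, that roles are `[trigger_index, entity_index, label]`, and that the dictionary carries string labels. The input dictionary is hand-written for illustration and is not model output; `describe_graph` is a hypothetical helper.

```python
def describe_graph(graph_dict, words):
    """Render entities, triggers and roles of a graph dict as strings.

    Assumes spans are half-open [start, end) over token positions and that
    roles index into the triggers and entities lists.
    """
    def span_text(start, end):
        return " ".join(words[start:end])

    lines = []
    for start, end, label in graph_dict["entities"]:
        lines.append(f"entity: {span_text(start, end)} - {label}")
    for start, end, label in graph_dict["triggers"]:
        lines.append(f"trigger: {span_text(start, end)} - {label}")
    for trigger_idx, entity_idx, label in graph_dict["roles"]:
        _, _, trigger_label = graph_dict["triggers"][trigger_idx]
        e_start, e_end, _ = graph_dict["entities"][entity_idx]
        lines.append(f"role: {span_text(e_start, e_end)} - {label} - {trigger_label}")
    return lines


words = ["Abdullah", "Gul", "resigned", "Tuesday"]
toy_graph = {  # hand-written example, not model output
    "entities": [[0, 2, "PER"]],
    "triggers": [[2, 3, "Personnel:End-Position"]],
    "relations": [],
    "roles": [[0, 0, "Person"]],
}
for line in describe_graph(toy_graph, words):
    print(line)
```

Slicing `words[start:end]` relies on the half-open span assumption; if the real spans turn out to be inclusive, the slice would need `end + 1` instead.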