<a href="https://colab.research.google.com/github/disi-unibo-nlp/bio-ee-egv/blob/main/notebooks/create_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Constants

In [1]:
random_seed = 42
test_size_ratio = 0.1
min_class_occurance = 1
train_set_name = 'train'
validation_set_name = 'validation'
test_set_name = 'test'
datasets_type = '.tsv'
datasets_separator = '\t'
dataset_file_name = 'dataset.zip'
ann_datasets = ['genia-mk']
t5_version = "base"
dataset_url = 'https://raw.githubusercontent.com/disi-unibo-nlp/bio-ee-egv/main/data/datasets/original_datasets.tar.gz'

## Imports

In [2]:
!pip install tqdm
!pip install sentencepiece
!pip install transformers

import os, glob, re, spacy, tarfile, torch, itertools, csv, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from tqdm import tqdm
from functools import partial
import xml.etree.ElementTree as ET
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, BartTokenizer
from google.colab import files

tqdm = partial(tqdm, position=0, leave=True)
tokenizerT5 = T5Tokenizer.from_pretrained("t5-" + t5_version)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 6.4 MB/s 
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.0-py3-none-any.whl (4.9 MB)
[K     |████████████████████████████████| 4.9 MB 8.0 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 45.7 MB/s 
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


## Downloads

In [3]:
!rm -R /content/dataset
os.chdir("/content")
!wget {dataset_url}
tar = tarfile.open(dataset_url.split("/")[-1])
tar.extractall()
tar.close()
!rm original_datasets.tar.gz
os.chdir("./dataset")
!python3 -m spacy download en_core_web_sm

rm: cannot remove '/content/dataset': No such file or directory
--2022-09-15 12:43:32--  https://raw.githubusercontent.com/disi-unibo-nlp/bio-ee-egv/main/data/datasets/original_datasets.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9032157 (8.6M) [application/octet-stream]
Saving to: ‘original_datasets.tar.gz’


2022-09-15 12:43:32 (158 MB/s) - ‘original_datasets.tar.gz’ saved [9032157/9032157]

2022-09-15 12:43:53.702501: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/d

# Functions

## Read files

In [4]:
datasetsNames = [datasetName[:-1] for datasetName in glob.glob("*/")]

In [5]:
datasets = {}
for datasetName in datasetsNames:
  filelist = [filename for filename in glob.glob("./" + datasetName + "/*.*")] # exclude README and LICENSE
  articleIDs = {}  
  for i, x in enumerate(filelist):  
      key = os.path.basename(x).split('.')[0]
      group = articleIDs.get(key,[])
      group.append(x)  
      articleIDs[key] = group
  datasets[datasetName] = articleIDs
#print(datasets)  # datasetName -> list of triples (abstract, extracted entities, extracted events)

### Read .a1/.a2

In [6]:
predictions = {}
goldEntities = {}
abstracts = {}
debugStructure = {}

for dataset in datasets.keys():
  if not dataset in ann_datasets:
    debugStructure[dataset] = {}
    predictions[dataset] = []
    goldEntities[dataset] = []
    abstracts[dataset] = []
    for article in datasets[dataset]:
      predictionFile = [x for x in datasets[dataset][article] if "a2" in x][0]
      entityFile = [x for x in datasets[dataset][article] if "a1" in x][0]
      abstractFile = [x for x in datasets[dataset][article] if "txt" in x][0]
      with open(predictionFile) as f:
          predictionList = {}
          for row in f.readlines():
              # Clean up
              temp = row.replace("\t", "%%&%%").replace("\n", "").split("%%&%%")
              temp[1] = re.split('\s|;', temp[1])
              dictId = temp[0] 
              temp.pop(0)
              predictionList[dictId] = temp
          predictions[dataset].append(predictionList)
          
      with open(entityFile) as f:
          entityList = {}
          for row in f.readlines():
              # Clean up
              temp = row.replace("\t", "%%&%%").replace("\n", "").split("%%&%%") ; 
              temp[1] = re.split('\s|;', temp[1])
              dictId = temp[0] 
              temp.pop(0)
              entityList[dictId] = temp
          goldEntities[dataset].append(entityList)

      abstract = ""
          
      with open(abstractFile) as f:
          abstract = f.read()
          abstracts[dataset].append(abstract)

      debugStructure[dataset][article] = {}
      debugStructure[dataset][article]['predictions'] = predictionList
      debugStructure[dataset][article]['entitities'] = entityList
      debugStructure[dataset][article]['abstract'] = abstract

### Read .ann

In [16]:
articles = {}

for dataset in datasets.keys():
  if dataset in ann_datasets:
    predictions[dataset] = []
    abstracts[dataset] = []
    for i, article in enumerate(datasets[dataset]):
      articles[i] = article
      predictionFile = [x for x in datasets[dataset][article] if "ann" in x][0]
      abstractFile = [x for x in datasets[dataset][article] if "txt" in x][0]
      with open(predictionFile) as f:
          predictionList = {}
          for row in f.readlines():
              # Clean up
              temp = row.replace("\t", "%%&%%").replace("\n", "").split("%%&%%") ; 
              temp[1] = re.split('\s|;', temp[1])
              dictId = temp[0] 
              temp.pop(0)
              predictionList[dictId] = temp
          predictions[dataset].append(predictionList)
          
      with open(abstractFile) as f:
          abstracts[dataset].append(f.read())

## Utility functions

In [17]:
# Load the Spacy model to use for tokenization
nlp = spacy.load('en_core_web_sm')

def sentence_tokenization(text, spacy_model=nlp):
    if (spacy_model is None):
        # Match punctuation characters and add spaces after them
        sentences = re.sub(r'([.,!?()]+)([a-zA-Z0-9_])', r'\1 \2', text)
        # Collapse multiple spaces
        sentences = re.sub('\s{2,}', ' ', text)
    else:
        doc = spacy_model(text)
        #print([sent for sent in doc.sents])
        sentences = [{'sentence': str(sent).strip(), 'start_index': sent.start_char, 'end_index': sent.end_char} 
          for sent in doc.sents]
    return sentences

View [Manipulate datasets section](#datasets-maniuplation) to interpret these magic numbers ;)

In [18]:
def select_sentence_class(sentence):
  length = len(sentence)
  if length <= 123:
    return 'S'
  if length <= 182:
    return 'M'
  return 'L'

Clean sets that contain elements that are sub-elements of others.

In [19]:
def clean_set(my_set):
    my_list = list(my_set)
    my_list.sort(key=lambda s: len(s), reverse=True)
    out = []
    for s in my_list:
        if not any([s in o for o in out]) and not s == "":
             out.append(s)
    return out

Write a Python Dict object into a `.csv` file.

In [20]:
def write_dict_to_csv(filename, myDict):
  with open(filename, 'w') as csv_file:  
    writer = csv.writer(csv_file)
    for key, value in sorted(myDict.items()):
       writer.writerow([key, value])

Debug function. It retrives (the first occurrence of) `articleID` given a sentence as input.

In [21]:
def retrive_article_ID_by_sentence(sentence, dataset_name):
  for article in datasets[dataset_name]:
    abstractFile = [x for x in datasets[dataset_name][article] if "txt" in x][0]
    with open(abstractFile) as f:
        if sentence in f.read():
          return abstractFile.split("/")[-1].split(".")[0]
  return False

## Event creation

In [22]:
def subsitute_object_with_content(datasetName, file_number, relationObject):
    objectContent = object
    try:
        objectContent = prediction[relationObject]
    except: # if not found search in goldEntities
        objectContent = goldEntities[datasetName][file_number][relationObject]
    return objectContent

def process_arguments(arguments, datasetName, file_number, event):
  relations = []
  for argument in arguments:  # exclude the trigger
    if argument:
      inner = {}
      inner['predicate'] = argument.rsplit(':')[0]
      relationObject = argument.rsplit(':')[1]
      if "E" in relationObject:
        #eventsMentioned.append(relationObject)
        nestedEvent = events[relationObject]
        #print(os.linesep + "\tNested event found: " + relationObject + " " + str(nestedEvent) + os.linesep)
        relationObject = create_event(datasetName, file_number, \
                                      (relationObject, events[relationObject]))
        inner['object'] = relationObject
      else:
        inner['object'] = subsitute_object_with_content(datasetName, file_number, relationObject)
      relations.append(inner)
  return relations

def create_event(datasetName, file_number, event):
    eventID = event[0]
    myEvent = {}
    myEvent['eventID'] = event[0]
    myEvent['subject'] = prediction[event[1][0][0].rsplit(':')[1]] # subject = trigger ID
    myEvent['relations'] = process_arguments(event[1][0][1:], datasetName, file_number, event)
    return myEvent

def create_serialized_event(triple, modifiers, eventID, nested = False, roleEvent = None, parentEvent = None, nodeNumber = 0):
  serializedEvent = []
  serializedEvent.append("[")
  serializedEvent.append(triple['subject'][1] + " | " + triple['subject'][0][0])
  eventModifier = []
  nodeNumber += 1
  # If exist modifiers for this event
  if eventID in modifiers.keys():
    eventModifier = modifiers[eventID]
  for modifier in eventModifier:
    serializedEvent.append("|")
    serializedEvent.append(modifier[0] + " = " + modifier[1])
  if nested:
    serializedEvent.append(" | " + roleEvent + " = " + parentEvent)
  serializedEvent.append("]")
  for relation in triple['relations']:
      predicate = relation['predicate']
      trigger = triple['subject'][1]
      relationObject = "["
      try:
        nodeNumber += 1
        relationObject = relationObject + relation['object'][1] + " | " + relation['object'][0][0] + " | " + \
          predicate + " = " + trigger
      except:
          eventID = relation['object']['eventID']
          relationObject = relationObject + " ".join(create_serialized_event(relation['object'], modifiers, eventID, nested = True, roleEvent = predicate, parentEvent = trigger, nodeNumber = nodeNumber)[0])
      relationObject = relationObject + "]"
      serializedEvent.append(relationObject)
  return serializedEvent, nodeNumber

def get_indices(event):
  indices = [int(index) for index in event['subject'][0][1:]]
  for relation in event['relations']:   
    if type(relation['object']) == dict: # nested event
      indices.extend(get_indices(relation['object']))
    else:
      for index in relation['object'][0][1:]:
        indices.append(int(index))
  return indices

def get_event_mentions(event, sentences): # get sentences from min to max indices
    indices = get_indices(event)
    minIndex = min(indices)
    maxIndex = max(indices)
   
    sentences = [s['sentence'] for s in sentences 
                    if s['start_index'] <= minIndex <= s['end_index'] or
                    s['start_index'] <= maxIndex <= s['end_index']]
    return sentences

## Modifiers

In [23]:
def set_modifier_value(modifier):
  if modifier == "Polarity":
    value = "Negative"
  elif modifier == "Speculation":
    value = "True"
  else:
    value = "None"
  return value

def prepare_modifiers(prediction):
  modifiers = {}
  for key, value in prediction.items():
    if 'M' in key or 'A' in key:
      eventID = value[0][1]
      modifier = value[0][0]
      modifierValue = ""
      try: ## if modifier value is present
        modifierValue = value[0][2]
      except: ## else choose its default value
        modifierValue = set_modifier_value(modifier)
      # (Negation = True) = (polarity = False) 
      modifier = "Polarity" if modifier == "Negation" else modifier
      if eventID not in modifiers.keys():
        modifiersList = []
        modifiersList.append((modifier, modifierValue))
        modifiers[eventID] = modifiersList
      else:
        modifiers[eventID].append((modifier, modifierValue))
  return modifiers # eventID -> list(modificator, value)

## Scompone events

In [24]:
def processSquaredBrackets(row, startIndex, endIndex, structuredEvent, itemID, trigger, nestedLevel):
  row = row[startIndex : endIndex + 1]
  if '|' in row: # There could be squared brackets that are not events but entitities
    matchedText = row.rsplit('|')[0].rsplit('[')[1].lstrip().rstrip()
    itemType = row.rsplit('|')[1].split("]")[0].lstrip().rstrip()
    if trigger:
      modifiers = [modificator.strip() for modificator in row.rsplit('|')[2:] for modificator in modificator.split("]")[0].split("=")] if nestedLevel == 0 else [modificator.strip() for modificator in row.rsplit('|')[2:-1] for modificator in modificator.split("]")[0].split("=")]
      item = {"item": (matchedText, itemType), "modifiers": tuple(modifiers)}
    else:
      role, macthedTriggerText = row.rsplit('|')[-1].rsplit(']')[0].rsplit('=')
      item = (matchedText, itemType, role.lstrip().rstrip(), 
              macthedTriggerText.lstrip().rstrip())
    if not itemID in structuredEvent.keys():
      structuredEvent[itemID] = []
    structuredEvent[itemID].append(item)  

def scompone_event(row):
  nestedLevel = 0
  structuredEvent = {}
  lastBracket = ''
  itemVisited = 0 # resetted at every nesting level

  for index, char in enumerate(row):
    if char == '[':
      nestedLevel += 1
      startIndex = index
      if lastBracket == '[':
        itemVisited = 0
      lastBracket = '['
    elif char == ']':
      nestedLevel -= 1
      endIndex = index
      if lastBracket != ']':
        itemVisited += 1
        if itemVisited == 1:
          itemID = 'trigger'
        elif itemVisited > 1:
          itemID = 'entity' 
        itemID += '_nesting_' + str(nestedLevel)
        processSquaredBrackets(row, startIndex, endIndex, structuredEvent, itemID, itemVisited == 1, nestedLevel)
      lastBracket = ']'
  
  if nestedLevel != 0:
    raise Exception("Error: wrong nesting in " + str(row) + " at level " + str(nestedLevel))

  return structuredEvent

scompone_event("[regulating | Regulation | modifier1 = demo1 | modifier2 = demo2][[transformation | Cell_transformation  | Theme = regulating][cell | Cell | Theme = transformation]]")

{'trigger_nesting_0': [{'item': ('regulating', 'Regulation'),
   'modifiers': ('modifier1', 'demo1', 'modifier2', 'demo2')}],
 'trigger_nesting_1': [{'item': ('transformation', 'Cell_transformation'),
   'modifiers': ()}],
 'entity_nesting_1': [('cell', 'Cell', 'Theme', 'transformation')]}

# Core EGV/EE datasets creation

In order to perform a smarter splitting in training/validation/test sets, a class label has been added for each row. 
Class label (EGV) consists in a jointly rappresentation of 4 elements:
DATASET_NAME - EVENT_TYPE - NUMBER_OF_EVENT_MENTIONS

In [25]:
numberEvents = 0
problematicEvents = 0
singleEvents = 0
datasetEGV = [] #(event) -> event_mentions
datasetEE = [] #(event_mention) -> event

for datasetName in tqdm(predictions.keys()):
  for file_number, prediction in enumerate(predictions[datasetName][:]): # text by text
    abstract = abstracts[datasetName][file_number]
    sentences = sentence_tokenization(abstracts[datasetName][file_number])
    senteces_without_event = set([sentence['sentence'] for sentence in sentences])
    events = {k: v for k, v in prediction.items() if 'E' in k}
    modifiers = prepare_modifiers(prediction)
    for event in events.items():
        eventID = event[0]
        row = {}
        numberEvents += 1
        try:
          myEvent = create_event(datasetName, file_number, event)
          serializedEvent, nodeNumber = create_serialized_event(myEvent, modifiers, eventID)
          event_mentions = get_event_mentions(myEvent, sentences)

          # create row
          row['event'] = " ".join(serializedEvent).replace(" [", "[").replace("[ ", "[").replace(" ]", "]").replace("] ", "]")
          row['event_mention'] = str(event_mentions)
          row['class_for_splitting'] = datasetName + " " + myEvent['subject'][0][0] + " " + str(len(event_mentions))
          
          if not nodeNumber == 1:
            datasetEGV.append(row)
          else:
            singleEvents == 1
          if len(event_mentions) == 1:
            senteces_without_event = set(senteces_without_event).difference(event_mentions)
            row['event_mention'] = event_mentions[0]
            row['class_for_splitting'] = datasetName + " " + select_sentence_class(row['event_mention']) # possibly multiple events per sentence.
            datasetEE.append(row)
        except:
          problematicEvents += 1
    for sentence in senteces_without_event: # append negative examples
      datasetEE.append({'event': "", 'event_mention': sentence, 'class_for_splitting': datasetName + " " + select_sentence_class(sentence)})
print(os.linesep + "Total events: " + str(numberEvents))
print("Deleted " + str(problematicEvents) + " nested problematic events")
print("Deleted " + str(singleEvents) + " single node events in EGV")
print(os.linesep + "Dataset EGV with " + str(len(datasetEGV)) + " examples created.")
print(os.linesep + "Dataset EE with " + str(len(datasetEE)) + " examples created.")

100%|██████████| 10/10 [04:14<00:00, 25.48s/it]


Total events: 100083
Deleted 15 nested problematic events
Deleted 0 single node events in EGV

Dataset EGV with 96929 examples created.

Dataset EE with 117209 examples created.





<a name="datasets-maniuplation"></a>
## Datasets manipulatation

Sometimes an event is a substring of anothers. With `clean_set` function, we mantain only the most complete one. With `select_sentence_class` function we map each sentence to its length class.

In [26]:
# Delete all duplicates
dfEGV = pd.DataFrame(datasetEGV).drop_duplicates(subset='event')

# Delete all rows that belong to a restricted class
dfEGV = dfEGV[dfEGV.groupby('class_for_splitting').class_for_splitting.transform('count') > min_class_occurance]

The following code is used to decide thresholds in sentences classes.

In [27]:
dfEE = pd.DataFrame(datasetEE)
vector_lengths = []
for sentence in dfEE['event_mention']:
  vector_lengths.append(len(sentence))

smallThreshold = pd.Series(vector_lengths).quantile(q=0.33, interpolation='linear') # prints 123
mediumThreshold = pd.Series(vector_lengths).quantile(q=0.66, interpolation='linear') # prints 182
largeThreshold = pd.Series(vector_lengths).quantile(q=1, interpolation='linear')

In [28]:
TOKEN_SEP = " "

def manipulate_dataframe(row):
  # how many events? if there's at least one event number of separator + 1, else 0.
  number_events = row['event'].count(f' {TOKEN_SEP} ') + 1 if not len(row['event']) == 0 else 0
  class_for_splitting_items = row['class_for_splitting'].split(" ")
  row['dataset'] = class_for_splitting_items[0]

  if number_events > 1:
    symbol = "+"
  elif number_events == 0:
    symbol = "None"
  elif number_events == 1:
    symbol = row['event'].split("|")[1].split("]")[0].strip(" ")
  row['class_for_splitting'] = class_for_splitting_items[0] + " " + symbol + " " + class_for_splitting_items[1]
  return row

drop_duplicates = True
cut_unrepresentative_classes = True

dfEE = pd.DataFrame(datasetEE)
if drop_duplicates:
  dfEE = dfEE.drop_duplicates(["event", "event_mention"]) # insert class_for_splitting to mantain same sentence from different source
else:
  dfEE[dfEE.duplicated(subset=["event_mention"], keep=False)].sort_values("event_mention")

dfEE = dfEE.groupby(by=['event_mention', 'class_for_splitting']) # group by sentence
dfEE = dfEE['event'].apply(lambda event: f' {TOKEN_SEP} '.join(clean_set(list(event)))).reset_index() # concatenate each correlated event
dfEE = dfEE.apply(lambda row: manipulate_dataframe(row), axis=1) #axis = 1 means row by row

dfEE = dfEE.sort_values(["class_for_splitting"]).drop_duplicates("event_mention", keep="last") # mantain only genia-mk when there's a conflict

if cut_unrepresentative_classes:
  dfEE = dfEE[dfEE.groupby('class_for_splitting').class_for_splitting.transform('count') > 1]

dfEE.loc[dfEE['event'] == "", "event"] = "ND"
dfEE = dfEE.replace("\n", " ", regex=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


## Splitting EGV
The whole dataset has been splitted in train/validation/test sets, according to the distribution of classes (they are indicated in the column `class_for_splitting`). 

In [32]:
X_train, X_test, y_train, y_test = train_test_split(dfEGV.drop('event_mention', axis=1), dfEGV['event_mention'], \
                                                    test_size=test_size_ratio, random_state=random_seed, \
                                                    stratify=dfEGV['class_for_splitting'])
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=random_seed)
print("Training set EGV size: " + str(X_train.shape[0]))
print("Validation set EGV size: " + str(X_validation.shape[0]))
print("Test set EGV size: " + str(X_test.shape[0]))
print("Total EGV size: " + str(X_train.shape[0] + X_validation.shape[0] + X_test.shape[0]))

Training set EGV size: 61332
Validation set EGV size: 3407
Test set EGV size: 3408
Total EGV size: 68147


## Splitting EE

In [33]:
X_train_ee, X_test_ee, y_train_ee, y_test_ee = train_test_split(dfEE['event_mention'], dfEE.drop('event_mention', axis=1), \
                                                                test_size=test_size_ratio, random_state=random_seed, \
                                                                stratify=dfEE['class_for_splitting'])
X_validation_ee, X_test_ee, y_validation_ee, y_test_ee = train_test_split(X_test_ee, y_test_ee, test_size=0.5, random_state=random_seed)
print("Training set EE size: " + str(X_train_ee.shape[0]))
print("Validation set EE size: " + str(X_validation_ee.shape[0]))
print("Test set EE size: " + str(X_test_ee.shape[0]))
print("Total EE size: " + str(X_train_ee.shape[0] + X_validation_ee.shape[0] + X_test_ee.shape[0]))

Training set EE size: 32107
Validation set EE size: 1784
Test set EE size: 1784
Total EE size: 35675


## Export datasets

In [None]:
train = pd.concat([X_train.drop(['class_for_splitting'], axis=1), y_train], axis=1)
validation = pd.concat([X_validation.drop(['class_for_splitting'], axis=1), y_validation], axis=1)
test = pd.concat([X_test.drop(['class_for_splitting'], axis=1), y_test], axis=1)

train_ee = pd.concat([X_train_ee, y_train_ee.drop(['class_for_splitting'], axis=1)], axis=1)
validation_ee = pd.concat([X_validation_ee, y_validation_ee.drop(['class_for_splitting'], axis=1)], axis=1)
test_ee = pd.concat([X_test_ee, y_test_ee.drop(['class_for_splitting'], axis=1)], axis=1)

train.to_csv(train_set_name + datasets_type, index=False, sep=datasets_separator)
validation.to_csv(validation_set_name + datasets_type, index=False, sep=datasets_separator)
test.to_csv(test_set_name + datasets_type, index=False, sep=datasets_separator)

train_ee.to_csv(train_set_name + "_ee" + datasets_type, index=False, sep=datasets_separator)
validation_ee.to_csv(validation_set_name + "_ee" + datasets_type, index=False, sep=datasets_separator)
test_ee.to_csv(test_set_name + "_ee" + datasets_type, index=False, sep=datasets_separator)

!zip dataset.zip *.t*
!rm *.tsv
files.download(dataset_file_name)

  adding: test_ee.tsv (deflated 73%)
  adding: test.tsv (deflated 76%)
  adding: train_ee.tsv (deflated 74%)
  adding: train.tsv (deflated 76%)
  adding: validation_ee.tsv (deflated 74%)
  adding: validation.tsv (deflated 76%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Statistics
Each cell contains two sub-cells. The first one is related to EGV, the last one to EE.

## Event types

In [34]:
error = 0

eventTypes = set()
for row in dfEGV['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in X_train['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in X_validation['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in X_test['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

166
163
101
98


In [35]:
error = 0

eventTypes = set()
for row in dfEE['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in y_train_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in y_validation_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

eventTypes = set()
for row in y_test_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            triggerType = eventItems[key][0]['item'][1]
            eventTypes.add(triggerType)
      except:
        error += 1
print(len(eventTypes))

160
159
86
84


## Entity types

In [36]:
error = 0

entityTypes = set()
for row in dfEGV['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in X_train['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in X_validation['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in X_test['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

150
146
93
84


In [37]:
error = 0

entityTypes = set()
for row in dfEE['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in y_train_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in y_validation_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

entityTypes = set()
for row in y_test_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            entityTypes.add(entities[1])
      except:
        error += 1
print(len(entityTypes))

132
129
54
58


## Role types

In [38]:
error = 0

roleTypes = set()
for row in dfEGV['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in X_train['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in X_validation['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in X_test['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

19
19
13
14


In [39]:
error = 0

roleTypes = set()
for row in dfEE['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in y_train_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in y_validation_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

roleTypes = set()
for row in y_test_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "entity" in key:
            entities = eventItems[key][0]
            roleTypes.add(entities[2])
      except:
        error += 1
print(len(roleTypes))

19
19
13
13


## Modifier types

In [40]:
error = 0

modifiersTypes = set()
for row in dfEGV['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiers = eventItems[key][0]['modifiers']
            for modifier in modifiers[0::2]:
              modifiersTypes.add(modifier)
      except:
        error += 1
print(len(modifiersTypes))

6


In [41]:
error = 0

modifiersTypes = set()
for row in dfEE['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiers = eventItems[key][0]['modifiers']
            for modifier in modifiers[0::2]:
              modifiersTypes.add(modifier)
      except:
        error += 1
print(len(modifiersTypes))

modifiersTypes = set()
for row in y_train_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiers = eventItems[key][0]['modifiers']
            for modifier in modifiers[0::2]:
              modifiersTypes.add(modifier)
      except:
        error += 1
print(len(modifiersTypes))

modifiersTypes = set()
for row in y_validation_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiers = eventItems[key][0]['modifiers']
            for modifier in modifiers[0::2]:
              modifiersTypes.add(modifier)
      except:
        error += 1
print(len(modifiersTypes))

modifiersTypes = set()
for row in y_test_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiers = eventItems[key][0]['modifiers']
            for modifier in modifiers[0::2]:
              modifiersTypes.add(modifier)
      except:
        error += 1
print(len(modifiersTypes))

6
6
6
6


## Nodes per event

In [42]:
numberNodes = []

for index, row in dfEGV.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in X_train.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in X_validation.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in X_test.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

count    68147.000000
mean         4.291942
std          2.689906
min          2.000000
25%          2.000000
50%          3.000000
75%          5.000000
max         35.000000
dtype: float64
count    61332.000000
mean         4.291707
std          2.691245
min          2.000000
25%          2.000000
50%          3.000000
75%          5.000000
max         35.000000
dtype: float64
count    3407.000000
mean        4.260053
std         2.651050
min         2.000000
25%         2.000000
50%         3.000000
75%         5.000000
max        35.000000
dtype: float64
count    3408.000000
mean        4.328052
std         2.704682
min         2.000000
25%         2.000000
50%         3.000000
75%         5.000000
max        33.000000
dtype: float64


In [43]:
numberNodes = []

for index, row in dfEE.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in y_train_ee.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in y_validation_ee.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

numberNodes = []

for index, row in y_test_ee.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      numberNodes.append(event.count("]"))

print(pd.Series(numberNodes).describe())

count    66255.000000
mean         2.351596
std          2.254288
min          0.000000
25%          0.000000
50%          2.000000
75%          3.000000
max         20.000000
dtype: float64
count    59636.000000
mean         2.354316
std          2.255126
min          0.000000
25%          0.000000
50%          2.000000
75%          3.000000
max         20.000000
dtype: float64
count    3272.000000
mean        2.290342
std         2.186377
min         0.000000
25%         0.000000
50%         2.000000
75%         3.000000
max        18.000000
dtype: float64
count    3347.000000
mean        2.363012
std         2.303999
min         0.000000
25%         0.000000
50%         2.000000
75%         3.000000
max        15.000000
dtype: float64


## Modifiers per event

In [44]:
error = 0

modifiersNumber = []
for row in dfEGV['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in X_train['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in X_validation['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in X_test['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

count    103797.000000
mean          2.424164
std           2.467239
min           0.000000
25%           0.000000
50%           1.000000
75%           5.000000
max           5.000000
dtype: float64
count    93393.000000
mean         2.426595
std          2.467427
min          0.000000
25%          0.000000
50%          1.000000
75%          5.000000
max          5.000000
dtype: float64
count    5170.000000
mean        2.408317
std         2.466610
min         0.000000
25%         0.000000
50%         1.000000
75%         5.000000
max         5.000000
dtype: float64
count    5234.000000
mean        2.396446
std         2.464748
min         0.000000
25%         0.000000
50%         1.000000
75%         5.000000
max         5.000000
dtype: float64


In [45]:
error = 0

modifiersNumber = []
for row in dfEE['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in y_train_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in y_validation_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

modifiersNumber = []
for row in y_test_ee['event'][:]:
  events = row.split(f' {TOKEN_SEP} ')
  for event in events:
    if len(event) > 0:
      try:
        eventItems = scompone_event(event)
        for key in eventItems.keys():
          if "trigger" in key:
            modifiersNumber.append(len(eventItems[key][0]['modifiers'][0::2]))
      except:
        error += 1
print(pd.Series(modifiersNumber).describe())

count    64618.000000
mean         0.516156
std          1.439591
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          5.000000
dtype: float64
count    58241.000000
mean         0.519222
std          1.443336
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          5.000000
dtype: float64
count    3128.000000
mean        0.517583
std         1.437876
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         5.000000
dtype: float64
count    3249.000000
mean        0.459834
std         1.371666
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         5.000000
dtype: float64


## Sentences per event mention

In [46]:
error = 0

sentencesPerEventMention = []
for row in dfEGV['event_mention'][:]:
  try:
    number_event_mentions = row.count("'") // 2 if "['" in row else 1
    sentencesPerEventMention.append(number_event_mentions)
  except:
    error += 1
print(pd.Series(sentencesPerEventMention).describe())

sentencesPerEventMention = []
for row in y_train[:]:
  try:
    number_event_mentions = row.count("'") // 2 if "['" in row else 1
    sentencesPerEventMention.append(number_event_mentions)
  except:
    error += 1
print(pd.Series(sentencesPerEventMention).describe())

sentencesPerEventMention = []
for row in y_validation[:]:
  try:
    number_event_mentions = row.count("'") // 2 if "['" in row else 1
    sentencesPerEventMention.append(number_event_mentions)
  except:
    error += 1
print(pd.Series(sentencesPerEventMention).describe())

sentencesPerEventMention = []
for row in y_test[:]:
  try:
    number_event_mentions = row.count("'") // 2 if "['" in row else 1
    sentencesPerEventMention.append(number_event_mentions)
  except:
    error += 1
print(pd.Series(sentencesPerEventMention).describe())

count    68147.000000
mean         1.137878
std          0.359690
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
dtype: float64
count    61332.000000
mean         1.137840
std          0.359461
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
dtype: float64
count    3407.000000
mean        1.132375
std         0.353366
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
dtype: float64
count    3408.000000
mean        1.144073
std         0.369937
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         3.000000
dtype: float64


## Tokens per event mention

In [47]:
error = 0

sentencesLengths = []
for row in dfEGV['event_mention'][:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in y_train[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in y_validation[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in y_test[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

count    68147.000000
mean        63.267613
std         34.120959
min          6.000000
25%         40.000000
50%         55.000000
75%         77.000000
max        407.000000
dtype: float64
count    61332.000000
mean        63.307164
std         34.215564
min          6.000000
25%         40.000000
50%         55.000000
75%         77.000000
max        407.000000
dtype: float64
count    3407.000000
mean       62.607866
std        33.416095
min        10.000000
25%        40.000000
50%        54.000000
75%        76.000000
max       313.000000
dtype: float64
count    3408.000000
mean       63.215376
std        33.100881
min        12.000000
25%        40.000000
50%        55.000000
75%        78.000000
max       407.000000
dtype: float64


In [48]:
error = 0

sentencesLengths = []
for row in dfEE['event_mention'][:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in X_train_ee[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in X_validation_ee[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

sentencesLengths = []
for row in X_test_ee[:]:
    try:
      sentencesLengths.append(len(tokenizerT5.encode(row)))
    except:
      error += 1
print(pd.Series(sentencesLengths).describe())

count    35675.000000
mean        44.342144
std         22.584563
min          1.000000
25%         29.000000
50%         40.000000
75%         55.000000
max        381.000000
dtype: float64
count    32107.000000
mean        44.320273
std         22.536463
min          1.000000
25%         30.000000
50%         40.000000
75%         55.000000
max        381.000000
dtype: float64
count    1784.000000
mean       44.712444
std        22.666069
min         2.000000
25%        30.000000
50%        41.000000
75%        54.000000
max       217.000000
dtype: float64
count    1784.000000
mean       44.365471
std        23.364095
min         2.000000
25%        29.000000
50%        40.000000
75%        54.000000
max       183.000000
dtype: float64


## Events per event mentions

In [49]:
eventsPerSentence = []
for row in dfEE['event']:
  number_events = row.count(f' {TOKEN_SEP} ') + 1 if not len(row) == 0 else 0
  eventsPerSentence.append(number_events)

print(pd.Series(eventsPerSentence).describe())


eventsPerSentence = []
for row in y_train_ee['event']:
  number_events = row.count(f' {TOKEN_SEP} ') + 1 if not len(row) == 0 else 0
  eventsPerSentence.append(number_events)

print(pd.Series(eventsPerSentence).describe())

eventsPerSentence = []
for row in y_validation_ee['event']:
  number_events = row.count(f' {TOKEN_SEP} ') + 1 if not len(row) == 0 else 0
  eventsPerSentence.append(number_events)

print(pd.Series(eventsPerSentence).describe())

eventsPerSentence = []
for row in y_test_ee['event']:
  number_events = row.count(f' {TOKEN_SEP} ') + 1 if not len(row) == 0 else 0
  eventsPerSentence.append(number_events)

print(pd.Series(eventsPerSentence).describe())

count    35675.000000
mean         1.857183
std          1.931766
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max         60.000000
dtype: float64
count    32107.000000
mean         1.857414
std          1.936216
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max         60.000000
dtype: float64
count    1784.000000
mean        1.834081
std         1.810868
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        22.000000
dtype: float64
count    1784.000000
mean        1.876121
std         1.969143
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        21.000000
dtype: float64


## Plots
Code to create `.csv` files in order to draw paper plots (figure 3).

### Nodes per event graph

In [50]:
plotEGV = {}

for index, row in dfEGV.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      number_nodes = event.count("]")
      plotEGV[number_nodes] = 1 if number_nodes not in plotEGV else plotEGV[number_nodes] + 1

plotEE = {}

for index, row in dfEE.iterrows():
  for event in row['event'].split(f' {TOKEN_SEP} '):
    if len(event) > 0:
      number_nodes = event.count("]")
      plotEE[number_nodes] = 1 if number_nodes not in plotEE else plotEE[number_nodes] + 1

write_dict_to_csv('egv_nodes_per_graph.csv', plotEGV)
files.download('egv_nodes_per_graph.csv')

write_dict_to_csv('ee_nodes_per_graph.csv', plotEE)
files.download('ee_nodes_per_graph.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Tokens per event graph

In [51]:
plotEGV = {}
for event in dfEGV['event']:
  length = len(tokenizerT5.encode(event))
  plotEGV[length] = 1 if length not in plotEGV else plotEGV[length] + 1

plotEE = {}
for row in dfEE['event']:
  for event in row.split(f' {TOKEN_SEP} '):
    length = len(tokenizerT5.encode(event)) if not len(event) == 0 else 0
    plotEE[length] = 1 if length not in plotEE else plotEE[length] + 1

write_dict_to_csv('egv_tokens_per_graph.csv', plotEGV)
files.download('egv_tokens_per_graph.csv')

write_dict_to_csv('ee_tokens_per_graph.csv', plotEE)
files.download('ee_tokens_per_graph.csv')

Token indices sequence length is longer than the specified maximum sequence length for this model (586 > 512). Running this sequence through the model will result in indexing errors


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### Tokens per event mention

In [52]:
plotEGV = {}
for sentence in dfEGV['event_mention']:
  length = len(tokenizerT5.encode(sentence))
  plotEGV[length] = 1 if length not in plotEGV else plotEGV[length] + 1

plotEE = {}
for sentence in dfEE['event_mention']:
  length = len(tokenizerT5.encode(sentence))
  plotEE[length] = 1 if length not in plotEE else plotEE[length] + 1

write_dict_to_csv("egv_tokens_per_event_mention.csv", plotEGV)
files.download("egv_tokens_per_event_mention.csv")

write_dict_to_csv('ee_tokens_per_event_mention.csv', plotEE)
files.download('ee_tokens_per_event_mention.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>