
Protocol Generators Experiment


To run this experiment, download the following versions of the PROHOW dataset:

You will need a triplestore that exposes a SPARQL endpoint, such as the one offered by OpenLink Virtuoso.

Data Filtering

From these datasets, we extract a subset of articles we are interested in using this tool:

This script is run with the following configuration:

list_of_allowed_languages = ["en","es"]
list_of_allowed_categories = ["","" # 468
                             ,"","" # 15681
                             ,"",""] # 16733
perform_sparql_filtering = True
remove_multiple_methods = True
remove_multiple_requirements = True
min_number_of_steps = 4
max_number_of_steps = 20
min_number_of_requirements = 4
max_number_of_requirements = 20
owl_sameAs_required_prefixes = [["",""]]
save_simplified = True
concatenate_label_abstract = False
parse_html_into_text = True
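
As a rough illustration of how the step and requirement limits in this configuration might act on a single set of instructions, here is a minimal sketch. The function name `within_limits` is hypothetical and not part of the filtering tool, and treating both bounds as inclusive is an assumption.

```python
# Thresholds copied from the configuration above.
min_number_of_steps = 4
max_number_of_steps = 20
min_number_of_requirements = 4
max_number_of_requirements = 20


def within_limits(n_steps, n_requirements):
    """Hypothetical check: keep an article only if both counts are in range.

    Assumption: the minimum and maximum values are inclusive.
    """
    steps_ok = min_number_of_steps <= n_steps <= max_number_of_steps
    requirements_ok = (
        min_number_of_requirements <= n_requirements <= max_number_of_requirements
    )
    return steps_ok and requirements_ok
```

Under these assumptions, an article with 3 steps is rejected regardless of how many requirements it has.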

Triplestore Creation

The simplified triples extracted in the previous phase are loaded into a triplestore (I used Virtuoso).

Protocol Generator

The Protocol Generator is configured to access the dataset at the given endpoint (http://localhost:8890/sparql/ by default).
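
To check that the endpoint is reachable before running the generator, a query can be sent over plain HTTP. The sketch below uses only the Python standard library and the SPARQL 1.1 Protocol convention of passing the query in a `query` parameter. The helper names `build_request` and `run_select` are illustrative and are not taken from the Protocol Generator itself.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql/"  # default endpoint mentioned above


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks the endpoint for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query and return the list of result bindings.

    Requires a running triplestore at `endpoint`.
    """
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        return json.load(response)["results"]["bindings"]
```

With the triplestore running, `run_select("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")` should return a single binding holding the number of loaded triples.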

The generator is then run, and the following files are generated:

  • all_labels.txt contains instance-label pairs
  • all_dependencies.txt contains instance-instance pairs, where the second instance needs to come after the first
  • all_turtle contains the same data as the previous two files, but in valid RDF Turtle format (just add an arbitrary prefix for :, for example the line PREFIX : <http://e.c>)
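
The dependency pairs describe a partial order over instances. As a sketch of how they could be consumed, the snippet below derives one valid ordering from a list of (first, second) pairs using the standard-library `graphlib` module (Python 3.9+). The instance identifiers are made up for illustration, and the on-disk layout of all_dependencies.txt is not assumed here; the pairs are given directly in memory.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def valid_order(dependencies):
    """Return one ordering of instances that respects every (first, second) pair."""
    sorter = TopologicalSorter()
    for first, second in dependencies:
        # `second` must come after `first`, so `first` is its predecessor.
        sorter.add(second, first)
    return list(sorter.static_order())


# Hypothetical instance identifiers, for illustration only.
example_pairs = [(":step1", ":step2"), (":step2", ":step3"), (":step1", ":step3")]
order = valid_order(example_pairs)
```

If the pairs contain a cycle, `static_order()` raises `graphlib.CycleError`, which can serve as a quick sanity check on the generated file.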

Results in Numbers

  • The original datasets have 254,349 instructions, about 120,000 per language

Data Filtering

  • In the selected category there were 21,635 instructions (91.4% loss from dataset)
  • After the filtering we got 5,867 instructions (72.8% loss from previous step)
  • Of those there are 2,887 pairs of English-Spanish versions of the same set of instructions (1.5% loss from previous step)

Protocol Generator

  • Then we filter out those pairs that do not have exactly the same number of steps and requirements, losing 538 pairs (18.6% loss from previous step)
  • Then the Protocol Generator tries to assign each requirement to the first step that requires it, to generate a more interesting graph
  • The Protocol Generator discards those pairs whose graphs are not isomorphic, as we want protocols with exactly the same structure
  • The Protocol Generator discards those graphs that have fewer than 5 requirements assigned to tasks after the first one
  • Finally we get 679 pairs of instructions/protocols (71% loss from previous step, 0.5% of the original set)
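
The loss figures in this section can be recomputed from the raw counts. The sketch below does so, counting each English-Spanish pair as two instructions when comparing against instruction counts. The variable names are illustrative, and small differences in the last digit against the figures above may come from rounding.

```python
def loss_percent(before, after):
    """Percentage of items lost when going from `before` to `after`."""
    return 100.0 * (before - after) / before


original = 254_349           # instructions in the original datasets
in_category = 21_635         # instructions in the selected category
after_filtering = 5_867      # instructions left after filtering
pairs = 2_887                # English-Spanish pairs
pairs_same_size = pairs - 538  # pairs with matching step/requirement counts
final_pairs = 679            # pairs kept by the Protocol Generator

funnel = [
    ("category selection", loss_percent(original, in_category)),
    ("filtering", loss_percent(in_category, after_filtering)),
    ("pairing", loss_percent(after_filtering, 2 * pairs)),
    ("same size", loss_percent(pairs, pairs_same_size)),
    ("protocol generation", loss_percent(pairs_same_size, final_pairs)),
]
final_share = 100.0 * (2 * final_pairs) / original  # percent of the original set

for stage, loss in funnel:
    print(f"{stage}: {loss:.1f}% loss")
print(f"final share of original set: {final_share:.1f}%")
```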

Label Simplification

The script takes a file called labels.txt and transforms it into another one called labels_converted.txt where all the labels have been modified by another script.

The external script can be called by this function, which is given the language and type of the label to convert:

# Tags: type_tag
# st : step
# re : requirement
# co : requirement consumable, usually ingredient
# nc : requirement non-consumable, usually a tool
# Tags: language_tag
# en : English
# es : Spanish
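
To make the role of the two tags concrete, here is a minimal sketch of a dispatching function that validates them before handing the label to an external converter. The names `convert_label` and `toy_converter` are hypothetical and are not the repository's actual function or script; the toy converter only normalises whitespace and case as a stand-in.

```python
# Tag tables taken from the comments above.
TYPE_TAGS = {
    "st": "step",
    "re": "requirement",
    "co": "requirement consumable",
    "nc": "requirement non-consumable",
}
LANGUAGE_TAGS = {"en": "English", "es": "Spanish"}


def convert_label(label, language_tag, type_tag, converter):
    """Validate the tags, then delegate to `converter` (hypothetical interface)."""
    if language_tag not in LANGUAGE_TAGS:
        raise ValueError(f"unknown language tag: {language_tag!r}")
    if type_tag not in TYPE_TAGS:
        raise ValueError(f"unknown type tag: {type_tag!r}")
    return converter(label, language_tag, type_tag)


def toy_converter(label, language_tag, type_tag):
    """Placeholder for the external script: collapse whitespace and lowercase."""
    return " ".join(label.split()).lower()
```

A real converter would typically branch on `language_tag` and `type_tag`, for example to treat step labels differently from requirement labels.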