Protocol Generators Experiment
To run this experiment, download the following versions of the PROHOW dataset:
You will need a triplestore that exposes a SPARQL endpoint, such as OpenLink Virtuoso.
From these datasets, we extract the subset of articles we are interested in, using this tool:
This script is run with the following configuration:
```python
list_of_allowed_languages = ["en", "es"]
list_of_allowed_categories = [
     "http://es.wikihow.com/Categor%C3%ADa:Desayunos", "http://www.wikihow.com/Category:Breakfast"  # 468
    ,"http://www.wikihow.com/Category:Recipes", "http://es.wikihow.com/Categor%C3%ADa:Recetas"  # 15681
    ,"http://es.wikihow.com/Categor%C3%ADa:Recetas-para-dietas-especiales", "http://www.wikihow.com/Category:Specialty-Diet-Recipes"
    ,"http://www.wikihow.com/Category:Food-Preparation", "http://www.wikihow.com/Category:Food-Preparation"  # 16733
    ,"http://www.wikihow.com/Category:Cooking-for-Children"
    ,"http://www.wikihow.com/Category:Drinks", "http://es.wikihow.com/Categor%C3%ADa:Bebidas"
    ,"http://www.wikihow.com/Category:Holiday-Cooking", "http://es.wikihow.com/Categor%C3%ADa:Comidas-festivas"
    ,"http://www.wikihow.com/Category:Party-Snacks", "http://es.wikihow.com/Categor%C3%ADa:Bocadillos-para-fiestas"
]
perform_sparql_filtering = True
remove_multiple_methods = True
remove_multiple_requirements = True
min_number_of_steps = 4
max_number_of_steps = 20
min_number_of_requirements = 4
max_number_of_requirements = 20
owl_sameAs_required_prefixes = [["http://es.wikihow.com/", "http://www.wikihow.com/"]]
save_simplified = True
concatenate_label_abstract = False
parse_html_into_text = True
```
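For context, here is a minimal sketch (hypothetical, not part of the actual tool) of how the step and requirement bounds above could be applied while filtering; the article dict layout is an assumption for illustration only:

```python
# Hypothetical filter using the bounds from the configuration above.
def passes_size_filters(article, cfg):
    return (cfg["min_number_of_steps"]
                <= len(article["steps"])
                <= cfg["max_number_of_steps"]
            and cfg["min_number_of_requirements"]
                <= len(article["requirements"])
                <= cfg["max_number_of_requirements"])
```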
The simplified triples extracted in the previous phase are loaded into a triplestore (I used Virtuoso).
dependency_extractor.py is configured to access the dataset at the given endpoint (http://localhost:8890/sparql/ by default).
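As a sketch of what that access could look like, assuming a SPARQLWrapper-based client (the query below is illustrative, not the one actually issued by dependency_extractor.py):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical probe of the endpoint used by dependency_extractor.py.
sparql = SPARQLWrapper("http://localhost:8890/sparql/")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label WHERE { ?s rdfs:label ?label . } LIMIT 10
""")
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"], row["label"]["value"])
```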
dependency_extractor.py is run and the following files are generated:
- all_labels.txt contains instance-label pairs
- all_dependencies.txt contains instance-instance pairs, where the second instance needs to come after the first
- all_turtle contains the same data as the previous files, but in valid RDF Turtle format (just add an arbitrary prefix for :, such as the line PREFIX : <http://e.c>)
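One way to check that the Turtle data is valid after prepending such a prefix line (a sketch assuming rdflib and the file name all_turtle from above):

```python
import rdflib

# Prepend the prefix line mentioned above, then parse to validate.
with open("all_turtle") as f:
    data = "PREFIX : <http://e.c>\n" + f.read()

g = rdflib.Graph()
g.parse(data=data, format="turtle")
print(len(g), "triples parsed")
```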
Results in Numbers
- The original dataset has 254,349 instructions, roughly 120,000 per language
- In the selected categories there were 21,635 (a 91.4% loss from the full dataset)
- After the filtering we got 5,867 (a 72.8% loss from the previous step)
- Of those, there are 2,887 pairs of English-Spanish versions of the same set of instructions (a 1.5% loss from the previous step)
- We then filter out those pairs that do not have exactly the same number of steps and requirements, losing 538 pairs (an 18.6% loss from the previous step)
- The Protocol Generator then tries to assign each requirement to the first step that requires it, to generate a more interesting graph (see the sketch after this list)
- The Protocol Generator deletes those pairs whose graphs are not isomorphic, as we want exactly matching protocols
- The Protocol Generator also deletes those graphs that have fewer than 5 requirements assigned to steps other than the first one
- Finally, we get 679 pairs of instructions/protocols (a 71% loss from the previous step, 0.5% of the original set)
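Below is a minimal sketch (not the actual Protocol Generator code) of the assignment and filtering logic described above; the graph encoding, the naive text-match used as the "requires" test, and all function names are assumptions:

```python
import networkx as nx

def build_protocol_graph(steps, requirements):
    """Chain the steps in order, then attach each requirement to the
    first step whose text mentions it (a naive 'requires' test)."""
    g = nx.DiGraph()
    for i in range(len(steps) - 1):
        g.add_edge(("step", i), ("step", i + 1))
    for req in requirements:
        for i, step in enumerate(steps):
            if req.lower() in step.lower():
                g.add_edge(("req", req), ("step", i))
                break
    return g

def keep_pair(g_en, g_es, min_late=5):
    """Keep a pair only if the two graphs are isomorphic and at least
    `min_late` requirements attach to a step other than the first."""
    if not nx.is_isomorphic(g_en, g_es):
        return False
    late = sum(1 for u, v in g_en.edges()
               if u[0] == "req" and v[0] == "step" and v[1] > 0)
    return late >= min_late
```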
The label_converter.py script takes a file called labels.txt and transforms it into another file called labels_converted.txt, in which all the labels have been modified by an external script.
The external script, such as simple_label_parser.py, is invoked through this function, which receives the language and type of the label to convert:
```python
# Tags: type_tag
#   st : step
#   re : requirement
#   co : consumable requirement, usually an ingredient
#   nc : non-consumable requirement, usually a tool
# Tags: language_tag
#   en : English
#   es : Spanish
def process_label(language_tag, type_tag, label):
    ...
```
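A minimal sketch of what an implementation in a script like simple_label_parser.py could look like; the normalization rules below are assumptions, not the ones actually used:

```python
import re

def process_label(language_tag, type_tag, label):
    # Basic cleanup shared by all label types.
    label = re.sub(r"\s+", " ", label).strip().lower()
    if type_tag in ("re", "co", "nc"):
        # Hypothetical rule: strip leading quantities like "2 cups of".
        label = re.sub(r"^\d+\s*\w*\s*(of|de)\s+", "", label)
    # language_tag ("en" or "es") could select language-specific rules here.
    return label
```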