New Treex

Experiments with efficient API for processing dependency trees (new version of Treex).

Benchmark

First, run make data to download UD 1.2 and store in data directory. See definition of the CoNLL-U data format. Add your implementation of the New Treex API (TODO: add a link to the specification) in a directory named by the programming language of your choice (python, perl, java, c,...). Add possible compilation into the Makefile. Edit @COMMANDS in benchmark.pl so it calls a command which runs the benchmark on selected treebanks. For correct memory consumption statistics the command should not contain shell metacharacters and it should not execute any child process. The command may contain options to run different version of your implementation (e.g. prioritizing speed over memory). Finally, run make benchmark.

The benchmark command should do several tasks and after finishing each task print the task name on STDOUT. Nothing else should be printed on STDOUT. The tasks are:

init Intialize whatever is needed.
load Load the CoNLL-U file (specified as the first parameter) to memory.
iter Iterate over all nodes in all sentences by their word order.
iterF Iterate over all nodes (fast). Nodes within one sentence may be iterated in any order.
iterS Iterate over all (sorted) descendants of all (sorted) children of the root. Some implementations have faster descendants() if called on the root, so this task should evaluate the difference.
iterN Take the root and iterate over all (sorted) nodes by calling next_node() in a while cycle.
read Iterate over all nodes (sorted, this holds for the rest of the tasks) and create a variable with concatenated form and lemma.
write Set deprel attribute of each node to the value dep.
rehang Rehang each node to a random* parent in the same sentence. Method set_parent should raise an exception if this would lead to a cycle. Catch such exceptions and leave such nodes under their original parents. Alternatively, there may be a parameter cycles=skip, which prevents the exception.
remove Delete random 10% of nodes. That is if (myrand(10)==0){ $node->remove()}, so it does not need to be exactly 10%. Removing a node by default removes all its descendants. (Note that if you iterate over a list of original nodes, you may encounter already deleted nodes. You should check this case and don't try to delete already deleted nodes.)
add Select random 10% of nodes (as above) and add a child under them (lemma=x, form=x). It should be the last child according to word order (and last=rightmost descendant).
reorder Shift 10% of nodes (with their whole subtree) after a random node (except when that random node is a descendant). From the rest of the nodes, shift 10% of nodes without their subtree before a random subtree.**
save Save the in-memory document to an output CoNLL-U file (specified as the second parameter).

*) For selecting random node and random 10% in tasks 9-12, use an equivalent of the following function myrand, so it is deterministic and replicable accross different programming languages.

my $seed = 42;
my $maxseed = 2**32;
sub myrand {
    my ($modulo) = @_;
    $seed = (1103515245 * $seed + 12345) % $maxseed;
    return $seed % $modulo;
}

**) The code for task 10 (reorder) is:

  # for each $tree
  my @nodes = $tree->descendants;
  # The order of the following two lines is IMPORTANT for replicability.
  my $rand_index = myrand($#nodes+1);
  if (myrand(10)==0){
    try{$node->shift_after_node($nodes[$rand_index]);}
    # catch '$reference_node is a descendant of $self'
  } elsif (myrand(10)==0) {
    $node->shift_before_subtree($nodes[$rand_index], {without_children=>1});
  }

The command/script running the task should do each task independently, not keeping any datastructure (e.g. array of all bundles, or even sorted nodes) except for the one document. This means that iteration over all nodes should be done again for each task (to simulate the real usage). Don't forget to switch off stdout buffering, so the task names are printed immediately (and timing is accurate).

If the script is executed with -d (debug), it should save the document into conll files after all task which modify the document (load, write, rehang, remove, add, reorder). The filenames should contain implementation name and task name, e.g. pytreex-load.conllu, pytreex-write.conllu,... and they should be saved in the current directory.

Results

MAXMEM is maximum (resident set size) memory (ps -orss) in MiB. Other columns are time in seconds. Run on x86_64.

data/UD_Czech/cs-ud-train-l.conllu (68 MB): REPEATS=10 hostname=tauri2

experiment	REAL	CPU	MAXMEM	init	load	save	iter	iterF	iterS	iterN	read	write	rehang	remove	add	reorder	exit	RSD
old_Treex	2989.996	?	18023.746	2.772	2501.309	201.291	7.647	3.169	?	?	9.185	11.618	55.882	58.347	47.265	35.765	?	?
pytreex	240.090	240.039	3808.691	0.097	158.064	7.869	2.851	0.990	0.977	15.217	3.291	3.111	7.348	8.918	17.059	11.215	3.086	0.003
utreex	45.508	45.471	879.117	0.029	24.096	5.657	0.143	0.142	0.744	1.726	0.373	0.267	4.579	2.240	3.049	1.215	1.248	0.005
perl	20.646	20.601	748.066	0.077	6.793	2.932	0.193	0.185	1.069	0.589	0.617	0.524	2.441	2.067	1.259	1.863	0.039	0.004
java	14.365	36.185	1323.910	0.099	8.636	0.913	0.309	0.204	0.219	1.389	0.309	0.227	0.322	0.487	0.446	0.740	0.066	0.014

Comments:

cpp_raw does not implement rehang, remove, add, reorder yet.
More details (and newer results) are at the GitHub wiki

Name		Name	Last commit message	Last commit date
Latest commit History 189 Commits
cpp_raw		cpp_raw
demo		demo
doc		doc
java		java
perl		perl
python		python
ref/ro-ud-train		ref/ro-ud-train
release		release
tests		tests
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
bench_udapi_new.py		bench_udapi_new.py
benchmark.pl		benchmark.pl
dummy.pl		dummy.pl
printpid.sh		printpid.sh

martinpopel/newtreex

Folders and files

Latest commit

History

Repository files navigation

New Treex

Benchmark

Results

Comments:

About

Resources

Stars

Watchers

Forks

Languages