## Evaluation for Dependency Parsing

The evaluation for dependency parsing is split into three stages:

1) Parsing the dependency outputs on the disambiguated text
2) Correcting mistakes in these outputs to create a gold standard
3) Evaluating the dependency parsing model on the gold standard


### Parsing the dependency outputs

To parse the dependency outputs, we run the current dependency parsing model on `disambig_gold_100.txt`. Running the model on the disambiguation gold standard, rather than on raw input text, lets us build a gold standard that isolates the dependency model: the gold dependencies should reflect only the dependency parsing logic, so the input readings need to be correctly disambiguated beforehand. When evaluating the dependency set, we still check for failure cases that may have been caused by the disambiguation model, but such cases should be relatively rare.

Using the 100 hand-corrected sentences, we now generate the system dependency outputs. We use code from `src/booklets.py`, which also produces an HTML visualization to help with annotating the gold standard:

In [1]:
import sys
from pathlib import Path
ROOT = Path.cwd().resolve()
while ROOT != ROOT.parent and not (ROOT / "src").exists():
    ROOT = ROOT.parent
sys.path.insert(0, str(ROOT))

In [2]:
from src.booklets import build_dep_booklet_from_disamb
from pathlib import Path

# Build the gold sample (Paths are from project root)
source = ROOT / "evaluation" / "eval_data" / "gold" / "disambig_gold_100.txt"
dep_cg3 = ROOT / "data" / "grammars" / "dependency.cg3"
out_conllu = ROOT / "evaluation" / "eval_data" / "sample_dep" / "opd_dep_100.conllu"
out_html = ROOT / "evaluation" / "eval_data" / "sample_dep" / "opd_dep_100.html"

build_dep_booklet_from_disamb(source, dep_cg3, out_conllu, out_html,
                              html_title="Ojibwe Dependencies on Disambiguation Gold")


Wrote booklet: /Users/matthias/ELF-Lab Repos/Ojibwe_Constraint_Grammar/evaluation/eval_data/sample_dep/opd_dep_100.html  (100 sentences)
Wrote treebank: /Users/matthias/ELF-Lab Repos/Ojibwe_Constraint_Grammar/evaluation/eval_data/sample_dep/opd_dep_100.conllu


### Create the gold standard

The gold standard was created manually by editing and reassigning relations in `opd_dep_100.conllu`. Only the dependency relations that are currently modeled were checked. The gold set is in `eval_data/gold/dep_gold_100.conllu`.
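
To review which tokens were changed during annotation, the system file and the gold file can be compared on the HEAD and DEPREL columns (columns 7 and 8 of the ten tab-separated CoNLL-U columns). The following is a minimal sketch, not part of the project code: the helper names `read_head_deprel`, `changed_tokens`, and `toy_line` are hypothetical, the toy tokens `w1` to `w3` are placeholders rather than real data, and it assumes both files contain the same sentences in the same order.

```python
def read_head_deprel(conllu_text):
    """Map (sentence index, token id) -> (form, head, deprel) for a CoNLL-U string."""
    rows = {}
    sent_index = 0
    seen_token = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (ignore repeated blanks).
            if seen_token:
                sent_index += 1
                seen_token = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        rows[(sent_index, int(cols[0]))] = (cols[1], cols[6], cols[7])
        seen_token = True
    return rows


def changed_tokens(sys_text, gold_text):
    """List tokens whose (head, deprel) pair differs between system and gold."""
    sys_rows = read_head_deprel(sys_text)
    gold_rows = read_head_deprel(gold_text)
    changes = []
    for key in sorted(gold_rows):
        if key in sys_rows and sys_rows[key][1:] != gold_rows[key][1:]:
            changes.append((key, gold_rows[key][0], sys_rows[key][1:], gold_rows[key][1:]))
    return changes


def toy_line(tok_id, form, head, deprel):
    """Build one CoNLL-U token line with placeholder values in the unused columns."""
    return "\t".join([str(tok_id), form, "_", "_", "_", "_", head, deprel, "_", "_"])


# Hypothetical toy sentence: the annotator attaches w3 as an oblique of w2.
SYS_TOY = "\n".join([
    "# sent_id = toy-1",
    toy_line(1, "w1", "2", "advmod"),
    toy_line(2, "w2", "0", "root"),
    toy_line(3, "w3", "_", "_"),
    "",
])
GOLD_TOY = "\n".join([
    "# sent_id = toy-1",
    toy_line(1, "w1", "2", "advmod"),
    toy_line(2, "w2", "0", "root"),
    toy_line(3, "w3", "2", "obl"),
    "",
])

for key, form, sys_pair, gold_pair in changed_tokens(SYS_TOY, GOLD_TOY):
    print(key, form, sys_pair, "->", gold_pair)
```

On the real files, the two strings would come from reading `opd_dep_100.conllu` and `dep_gold_100.conllu` with `Path.read_text()`.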

### Evaluating the dependency parsing module

Based on the gold set, we can now evaluate the actual performance of our dependency model.

In [3]:
from evaluation.eval_modules.data_io import parse_conllu
from evaluation.eval_modules.dep_eval import eval_rels_detailed, write_per_rel_table

# Data directory for evaluation
DATA = ROOT / "evaluation" / "eval_data"

# Dependency eval
# List of dep relations we focus on
FOCUS_RELS = ["nsubj","obj","iobj","det","obl","nummod","discourse","case","advmod","neg","acl:relcl","csubj","ccomp"]
gold_dep = DATA / "gold/dep_gold_100.conllu"
sys_dep  = DATA / "sample_dep/opd_dep_100.conllu"
gold_blocks = parse_conllu(gold_dep)
sys_blocks  = parse_conllu(sys_dep)
overall, per_rel = eval_rels_detailed(
    gold_blocks, sys_blocks,
    FOCUS_RELS
)
overall


{'tokens_considered': 237,
 'precision': 0.9903381642512077,
 'recall': 0.8649789029535865,
 'f1': 0.9234234234234234,
 'accuracy': 0.8649789029535865,
 'sys_total': 207,
 'gold_total': 237,
 'correct': 205,
 'macro_precision': 0.7642776735459662,
 'macro_recall': 0.6612493197859052,
 'macro_f1': 0.709040367700942}
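
The micro-averaged figures can be recomputed from the three counts in this output. The sketch below uses the standard definitions of precision, recall, and F1. The function name `micro_scores` is hypothetical and this is not the code of `eval_rels_detailed`, whose internals are not shown here.

```python
def micro_scores(correct, sys_total, gold_total):
    """Micro precision, recall and F1 from pooled counts."""
    precision = correct / sys_total if sys_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from the output above.
p, r, f1 = micro_scores(correct=205, sys_total=207, gold_total=237)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")  # P=0.9903 R=0.8650 F1=0.9234
```

Two observations follow directly from the counts. The reported `accuracy` equals recall, since both divide `correct` by the 237 gold relations. The gap between precision and recall comes mostly from coverage: the system assigns a focus relation in 207 cases against 237 in the gold set, while only 2 of its 207 assignments are wrong.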

#### Eval Table: Per-Relationship Type

In [4]:
print(write_per_rel_table(per_rel, FOCUS_RELS))


rel             gold   sys  correct      P      R     F1
--------------------------------------------------------
nsubj             41    40       40  1.000  0.976  0.988
obj               41    41       40  0.976  0.976  0.976
iobj               0     0        0  0.000  0.000  0.000
det               27    27       27  1.000  1.000  1.000
obl               33    25       24  0.960  0.727  0.828
nummod             1     1        1  1.000  1.000  1.000
discourse         22    22       22  1.000  1.000  1.000
case               7     6        6  1.000  0.857  0.923
advmod            55    40       40  1.000  0.727  0.842
neg                4     4        4  1.000  1.000  1.000
acl:relcl          3     1        1  1.000  0.333  0.500
csubj              0     0        0  0.000  0.000  0.000
ccomp              3     0        0  0.000  0.000  0.000
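
The macro scores in the overall output are noticeably lower than the micro scores. The reported values appear consistent with averaging over all 13 focus relations, where relations without any gold or system instances (`iobj`, `csubj`) count as 0.000, and with `macro_f1` being the harmonic mean of macro precision and macro recall. This is inferred from the printed numbers, not from the evaluation code. The sketch below recomputes the macro scores from the table and, for comparison, restricts the average to relations that occur in the gold set. The names `COUNTS`, `ratio`, and `macro_pr` are hypothetical.

```python
# (gold, sys, correct) per relation, copied from the table above.
COUNTS = {
    "nsubj": (41, 40, 40),
    "obj": (41, 41, 40),
    "iobj": (0, 0, 0),
    "det": (27, 27, 27),
    "obl": (33, 25, 24),
    "nummod": (1, 1, 1),
    "discourse": (22, 22, 22),
    "case": (7, 6, 6),
    "advmod": (55, 40, 40),
    "neg": (4, 4, 4),
    "acl:relcl": (3, 1, 1),
    "csubj": (0, 0, 0),
    "ccomp": (3, 0, 0),
}


def ratio(num, den):
    """Safe division: an empty denominator yields 0.0, as in the table."""
    return num / den if den else 0.0


def macro_pr(counts, rels):
    """Unweighted mean of per-relation precision and recall over `rels`."""
    precisions = [ratio(counts[r][2], counts[r][1]) for r in rels]
    recalls = [ratio(counts[r][2], counts[r][0]) for r in rels]
    return sum(precisions) / len(rels), sum(recalls) / len(rels)


all_rels = list(COUNTS)
attested = [r for r in all_rels if COUNTS[r][0] > 0]  # relations present in gold

macro_p_all, macro_r_all = macro_pr(COUNTS, all_rels)
macro_f1_all = 2 * macro_p_all * macro_r_all / (macro_p_all + macro_r_all)
macro_p_att, macro_r_att = macro_pr(COUNTS, attested)

print(f"all {len(all_rels)} relations: P={macro_p_all:.3f} R={macro_r_all:.3f} F1={macro_f1_all:.3f}")
print(f"{len(attested)} attested relations: P={macro_p_att:.3f} R={macro_r_att:.3f}")
```

Under this reading, the two zero-support relations pull the macro averages down without reflecting any parser error, so the per-relation rows with non-zero gold counts are the more informative view. The actual recall losses are concentrated in `advmod`, `obl`, `acl:relcl`, and `ccomp`.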
