# Evaluation

This notebook reports the evaluation metrics for [`Rantanplan`](https://pypi.org/project/rantanplan/0.4.3/) v0.4.3.

In [1]:
from datetime import datetime
print(f"Last run: {datetime.utcnow().strftime('%B %d %Y - %H:%M:%S')}")

Last run: June 17 2020 - 23:05:22


## Setup

Install the necessary libraries and load the reference corpus [EDFU](https://github.com/linhd-postdata/edfu/).

In [2]:
!pip install -q pandas numpy "spacy<2.3.0" spacy_affixes
!pip install -q https://github.com/linhd-postdata/averell/archive/develop.zip

In [3]:
%%bash --out _
python -m spacy download es_core_news_md 
python -m spacy_affixes download es

In [4]:
!pip install -q "rantanplan==0.4.3"

In [5]:
import requests
import json

In [6]:
edfu = requests.get("https://github.com/linhd-postdata/edfu/blob/v1.0.0/edfu.json?raw=true").json()
with open("edfu.txt", "w") as file:
    file.write("\n".join(edfu.keys()))
print(f"EDFU corpus contains {len(edfu)} syllabified words")

EDFU corpus contains 106353 syllabified words


## Agirrezabal

Agirrezabal's syllabification system is a finite-state machine written in [_foma_](https://fomafst.github.io/), which is available for most operating systems. We use the native Ubuntu package, installed via `apt` (it also installs the `flookup` tool):
```bash
$ sudo apt install -y foma
```

In [7]:
%%bash --out _
# downloads a foma script for Spanish syllabification
wget -q -O silabaEs.script https://bitbucket.org/manexagirrezabal/syllabification_gold_standard/raw/master/ruleSyllabifiers/silabaEs.script
# compiles the foma script
foma -q -f silabaEs.script
# runs foma on the words in EDFU
cat edfu.txt | flookup -ix silabaEs.fst > agirrezabal_edfu.txt 

In [8]:
with open("agirrezabal_edfu.txt", "r") as manex_file:
    lines = manex_file.read().replace(".", "-").split("\n")
    agirrezabal_edfu = {line.replace("-", ""): line for line in lines if line.strip()}
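
The parsing step above can be illustrated on a tiny made-up sample. This is a minimal sketch, assuming (as the cell implies) that the `flookup` output holds one syllabified word per line with `.` as the syllable separator; the helper name `parse_flookup_output` and the sample words are hypothetical and not part of the notebook.

```python
def parse_flookup_output(raw: str) -> dict:
    """Map each plain word to its hyphen-separated syllabification."""
    mapping = {}
    # Normalise the syllable separator from "." to "-", as EDFU uses hyphens
    for line in raw.replace(".", "-").split("\n"):
        if line.strip():  # skip blank lines, e.g. the trailing newline
            # The key is the word with every hyphen removed
            mapping[line.replace("-", "")] = line
    return mapping


# Hypothetical sample, not taken from the real flookup output
sample = "ca.sa\npe.rro\n\n"
parsed = parse_flookup_output(sample)
print(parsed)
```

Because the key is rebuilt by deleting every hyphen, a word that itself contains a hyphen or a period would end up under a different key than the one in `edfu.txt`.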

### Accuracy on EDFU

In [9]:
correct = sum(agirrezabal_edfu[key] == edfu[key] for key in edfu)
agirrezabal_edfu_accuracy = correct / len(edfu)
print(f"Agirrezabal on EDFU: {agirrezabal_edfu_accuracy:.4f}")

Agirrezabal on EDFU: 0.9807
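
The metric used here is exact-match accuracy: a word counts as correct only when the predicted syllabification is identical to the reference string. The sketch below shows the same computation on made-up data; the function name `exact_match_accuracy` is hypothetical, and using `dict.get` (so a word missing from the predictions counts as an error instead of raising `KeyError`) is an assumption that differs slightly from the cell above.

```python
def exact_match_accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of gold words whose predicted syllabification matches exactly."""
    correct = sum(
        # .get returns None for a missing word, which never equals a gold string
        predictions.get(word) == syllables
        for word, syllables in gold.items()
    )
    return correct / len(gold)


# Hypothetical toy data, not taken from EDFU
gold = {"casa": "ca-sa", "perro": "pe-rro", "aire": "ai-re"}
predictions = {"casa": "ca-sa", "perro": "per-ro"}  # "aire" is missing

accuracy = exact_match_accuracy(predictions, gold)
print(f"Toy accuracy: {accuracy:.4f}")
```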


---

## Navarro-Colorado

Navarro-Colorado's system is not packaged, so its syllabification algorithm is not exposed. To work around this limitation, we run a custom script inside the Docker image. Expect a long run, almost a day.

In [None]:
navarro_colorado_syllabification = "from modules.escansion import *; [print(AnalizaSilabas.analizaSilabas(AnalizaCategoriaGramaticalFreeling.Analiza(verso=modernizaTexto(word))).lower()) for word in open('/data/edfu.txt').read().split()];"
navarro_colorado_edfu = ! docker run -v $(pwd):/data linhdpostdata/adso python3 -c "{navarro_colorado_syllabification}"

### Accuracy on EDFU

In [None]:
correct = sum(navarro_colorado_edfu[index] == edfu[key] for index, key in enumerate(edfu.keys()))
navarro_colorado_edfu_accuracy = correct / len(edfu)
print(f"Navarro-Colorado on EDFU: {navarro_colorado_edfu_accuracy:.4f}")
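
The comparison above pairs predictions with references by position, so it relies on the Docker output holding exactly one line per EDFU word, in the same order. A minimal sketch of a guarded version follows; the helper `positional_accuracy` and the toy data are hypothetical, and the length check is an added safeguard, not something the notebook does.

```python
def positional_accuracy(predictions: list, gold: dict) -> float:
    """Exact-match accuracy when predictions are aligned with gold by position."""
    if len(predictions) != len(gold):
        # Extra or missing output lines would silently shift every comparison
        raise ValueError(
            f"Expected {len(gold)} predictions, got {len(predictions)}"
        )
    # Dicts preserve insertion order, so values() follows the order of keys()
    correct = sum(
        predicted == syllables
        for predicted, syllables in zip(predictions, gold.values())
    )
    return correct / len(gold)


# Hypothetical toy data, not taken from EDFU
toy_gold = {"casa": "ca-sa", "perro": "pe-rro"}
toy_predictions = ["ca-sa", "per-ro"]

toy_accuracy = positional_accuracy(toy_predictions, toy_gold)
print(f"Toy accuracy: {toy_accuracy:.4f}")
```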

---

## Rantanplan

In [None]:
from rantanplan.core import syllabify

rantanplan_edfu = ["-".join(syllabify(word)[0]) for word in edfu.keys()]

### Accuracy on EDFU

In [None]:
correct = sum(rantanplan_edfu[index] == edfu[key] for index, key in enumerate(edfu))
rantanplan_edfu_accuracy = correct / len(edfu)
print(f"Rantanplan on EDFU: {rantanplan_edfu_accuracy:.4f}")

---

## Results

In [None]:
import pandas as pd
from IPython.display import display, HTML

display(HTML(
    pd.DataFrame([
        ["Navarro-Colorado", navarro_colorado_edfu_accuracy],
        ["Agirrezabal", agirrezabal_edfu_accuracy],
        ["Rantanplan", rantanplan_edfu_accuracy],
    ], columns=["Model", "Accuracy"]).to_html(index=False)
))