# Evaluation

This notebook contains the evaluation metrics for [`Rantanplan`](https://pypi.org/project/rantanplan/0.4.3/) v0.4.3.

In [1]:
from datetime import datetime
print(f"Last run: {datetime.utcnow().strftime('%B %d %Y - %H:%M:%S')}")

Last run: June 18 2020 - 11:25:28


## Setup

Installing dependencies and downloading necessary corpora using [`Averell`](https://pypi.org/project/averell/).

In [2]:
!pip install -q pandas numpy "spacy<2.3.0" spacy_affixes
!pip install -q https://github.com/linhd-postdata/averell/archive/develop.zip

In [3]:
%%bash --out _
# pip install https://github.com/explosion/spacy-models/archive/es_core_news_md-2.2.5.zip
python -m spacy download es_core_news_md
python -m spacy_affixes download es

In [4]:
!pip install -q "rantanplan==0.4.3"

In [5]:
!averell list

  id  name                size      docs    words  granularity    license
----  ------------------  ------  ------  -------  -------------  -----------
   1  Disco V2            22M       4088   381539  stanza         CC-BY
                                                   line
   2  Disco V3            28M       4080   377978  stanza         CC-BY
                                                   line
   3  Sonetos Siglo       6.8M      5078   466012  stanza         CC-BY-NC
      de Oro                                       line           4.0
   4  ADSO 100            128K       100     9208  stanza         CC-BY-NC
      poems corpus                                 line           4.0
   5  Poesía Lírica       3.8M       475   299402  stanza         CC-BY-NC
      Castellana Siglo                             line           4.0
      de Oro                                       word
                                                   syllable
   6  Gongocorpus         9

In [6]:
%%bash
averell download 3 4 > /dev/null 2>&1
averell export 3 --granularity line
mv corpora/line.json sonnets.json
averell export 4 --granularity line
mv corpora/line.json adso.json
du -h *.json

Using corpora folder: './corpora'
Using corpora folder: './corpora'
3.6M	adso.json
175M	sonnets.json


Defining helper functions

In [7]:
import json
import math
import re
from io import StringIO

import numpy as np
import pandas as pd

def clean_text(string):
    output = string.strip()
    # replacements = (("“", '"'), ("”", '"'), ("//", ""), ("«", '"'), ("»",'"'))
    replacements = (("“", ''), ("”", ''), ("//", ""), ("«", ''), ("»",''))
    for replacement in replacements:
        output = output.replace(*replacement)
    output = re.sub(r'(?is)\s+', ' ', output)
    output = re.sub(r"(\w)-(\w)", r"\1\2", output)  # "Villa-nueva" breaks Navarro-Colorado's system
    return output
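
As a quick check of what `clean_text` does, the sketch below repeats the helper (so the block runs on its own) and applies it to a made-up string. The sample text is an assumption chosen only to exercise each replacement; it is not taken from the corpora.

```python
import re


def clean_text(string):
    # Same steps as the helper above: trim, drop typographic quotes and "//",
    # collapse runs of whitespace, and join words split by an inner hyphen.
    output = string.strip()
    replacements = (("“", ""), ("”", ""), ("//", ""), ("«", ""), ("»", ""))
    for replacement in replacements:
        output = output.replace(*replacement)
    output = re.sub(r"\s+", " ", output)
    output = re.sub(r"(\w)-(\w)", r"\1\2", output)
    return output


# Hypothetical input: quotes, a "//" marker, irregular spacing and a hyphenated word.
sample = "«Villa-nueva»  //  de  los\tInfantes"
cleaned = clean_text(sample)
print(cleaned)
```

Note that `strip()` runs before the replacements, so a `//` at the very end of a line would leave a trailing space behind.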

In [8]:
adso = pd.DataFrame.from_records(
    json.load(open("adso.json"))
).query("manually_checked == True")[["line_text", "metrical_pattern"]].drop_duplicates().reset_index(drop=True)
adso.line_text = adso.line_text.apply(clean_text)
adso

Unnamed: 0,line_text,metrical_pattern
0,"Yo vi unos ojos bellos, que hirieron",++-+-+---+-
1,"con dulce flecha un corazón cuitado,",-+-++--+-+-
2,y que para encender nuevo cuidado,-----++--+-
3,su fuerza toda contra mí pusieron.,-+-+---+-+-
4,Yo vi que muchas veces prometieron,++-+-+---+-
...,...,...
1385,"cobró armado y prudente su denuedo,",-++--+---+-
1386,que sin victorias no contó algún día.,---+-+-+++-
1387,Esto fue don Fadrique de Toledo.,+-+--+---+-
1388,"Hoy nos da, desatado en sombra fría,",+-+--+-+-+-


In [9]:
sonnets = pd.DataFrame.from_records(
    json.load(open("sonnets.json"))
).query("manually_checked == True")[["line_text", "metrical_pattern"]].drop_duplicates().reset_index(drop=True)
sonnets.line_text = sonnets.line_text.apply(clean_text)
sonnets

Unnamed: 0,line_text,metrical_pattern
0,Cuando la alegre y dulce primavera,---+-+---+-
1,"a partir sus riquezas comenzaba,",--+--+---+-
2,y de los verdes campos desterraba,---+-+---+-
3,"aquella estéril sequedad primera,",-+-+---+-+-
4,un pastor triste y solo en la ribera,+-++-+---+-
...,...,...
10205,y en tus labios purpúrea competencia,--+--+---+-
10206,"agora al alba y al clavel ofrece,",-+-+---+-+-
10207,"la edad, con invisible diligencia,",-+---+---+-
10208,en el común ocaso lo oscurece;,---+-+---+-


In [10]:
# Fixes an issue with lines containing line breaks, e.g., 7518
sonnets.line_text = sonnets.line_text.str.strip().str.split("\n").str[0].str.strip()

Importing `Rantanplan` main functions and warming up the cache.

In [88]:
from rantanplan.rhymes import analyze_rhyme
from rantanplan import get_scansion

In [89]:
%%time
get_scansion("Prueba")
pass

CPU times: user 10.6 ms, sys: 128 µs, total: 10.7 ms
Wall time: 9.47 ms


## Navarro-Colorado

Preparing corpora and measuring running times for the Navarro-Colorado scansion system.

In [13]:
!mkdir -p sonnets
!mkdir -p adso
!mkdir -p outputs

In [14]:
with open("adso/adso.txt", "w") as file:
    file.write("\n".join(adso["line_text"].values))
with open("sonnets/sonnets.txt", "w") as file:
    file.write("\n".join(sonnets["line_text"].values))

We built and pushed a Docker image with the Navarro-Colorado scansion system. Running the next cells takes a very long time and produces very verbose output, which we omit. Alternatively, the files `./data/navarro_colorado_adso.xml` and `./data/navarro_colorado_sonnets.xml` contain the output of the last run.

In [15]:
%%bash --err adso_timing --out adso_output
time -p docker run -v $(pwd)/adso:/adso/data_in -v $(pwd)/outputs:/adso/data_out linhdpostdata/adso
cp outputs/adso.xml data/navarro_colorado_adso.xml

In [16]:
navarro_colorado_adso_times = dict(pair.split(" ") for pair in adso_timing.strip().split("\n")[-3:])
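
The cell above relies on `time -p` writing three `name value` lines (`real`, `user`, `sys`) at the end of standard error. A minimal sketch of that parsing step follows; `parse_time_output` is a hypothetical helper name, the sample text is invented for illustration, and unlike the notebook cell it converts the values to `float` right away.

```python
def parse_time_output(stderr_text):
    """Turn the last three lines of `time -p` output into a dict of floats."""
    last_lines = stderr_text.strip().split("\n")[-3:]
    pairs = (line.split() for line in last_lines)
    return {name: float(value) for name, value in pairs}


# Made-up stderr capture: one unrelated log line followed by the timing report.
sample_stderr = "some log line from the container\nreal 1.50\nuser 0.20\nsys 0.10\n"
times = parse_time_output(sample_stderr)
print(times["real"])
```

Taking only the last three lines keeps any earlier output on standard error from interfering with the parse.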

In [17]:
%%bash --err sonnets_timing --out sonnets_output
time -p docker run -v $(pwd)/sonnets:/adso/data_in -v $(pwd)/outputs:/adso/data_out linhdpostdata/adso
cp outputs/sonnets.xml data/navarro_colorado_sonnets.xml

In [18]:
navarro_colorado_sonnets_times = dict(pair.split(" ") for pair in sonnets_timing.strip().split("\n")[-3:])

Loading the outputs of the ADSO system into Pandas `DataFrame`s

In [19]:
from glob import glob
from xml.etree import ElementTree


def load_tei(filename):
    lines = []
    with open(filename, "r") as xml:
        contents = xml.read()
        tree = ElementTree.fromstring(contents)
        tags = tree.findall(".//{http://www.tei-c.org/ns/1.0}l")
        for tag in tags:
            text = clean_text(tag.text)
            lines.append((text, tag.attrib['met']))
    return pd.DataFrame(lines, columns=["line_text", "metrical_pattern"])

In [20]:
navarro_colorado_adso = load_tei("outputs/adso.xml")
navarro_colorado_adso

Unnamed: 0,line_text,metrical_pattern
0,"Yo vi unos ojos bellos, que hirieron",++-+-+---+-
1,"con dulce flecha un corazón cuitado,",-+-++--+-+-
2,y que para encender nuevo cuidado,-----++--+-
3,su fuerza toda contra mí pusieron.,-+-+---+-+-
4,Yo vi que muchas veces prometieron,++-+-+---+-
...,...,...
1385,"cobró armado y prudente su denuedo,",-++--+---+-
1386,que sin victorias no contó algún día.,---+-+-+++-
1387,Esto fue don Fadrique de Toledo.,+-++-+---+-
1388,"Hoy nos da, desatado en sombra fría,",+-+--+-+-+-


In [21]:
navarro_colorado_sonnets = load_tei("outputs/sonnets.xml")
navarro_colorado_sonnets

Unnamed: 0,line_text,metrical_pattern
0,Cuando la alegre y dulce primavera,---+-+---+-
1,"a partir sus riquezas comenzaba,",--+--+---+-
2,y de los verdes campos desterraba,---+-+---+-
3,"aquella estéril sequedad primera,",-+-+---+-+-
4,un pastor triste y solo en la ribera,+-++-+---+-
...,...,...
10205,y en tus labios purpúrea competencia,--+--+---+-
10206,"agora al alba y al clavel ofrece,",-+-+---+-+-
10207,"la edad, con invisible diligencia,",-+---+---+-
10208,en el común ocaso lo oscurece;,---+-+---+-


### Accuracy on ADSO

In [22]:
correct = sum(navarro_colorado_adso.metrical_pattern == adso.metrical_pattern)
accuracy_navarro_colorado_adso = correct / adso.metrical_pattern.size
print(f"Navarro-Colorado on ADSO: {accuracy_navarro_colorado_adso:.4f} ({navarro_colorado_adso_times['real']}s)")

Navarro-Colorado on ADSO: 0.9453 (2698.15s)
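
The accuracy reported here is exact-match accuracy: a line counts as correct only if its whole metrical pattern equals the gold pattern. A plain-Python sketch of the same measure, using the last two ADSO lines shown in the tables above (`exact_match_accuracy` is a hypothetical helper name):

```python
def exact_match_accuracy(predicted, gold):
    """Fraction of positions where the predicted pattern equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Gold and Navarro-Colorado patterns for lines 1387 and 1388 as printed above.
gold = ["+-+--+---+-", "+-+--+-+-+-"]
predicted = ["+-++-+---+-", "+-+--+-+-+-"]
score = exact_match_accuracy(predicted, gold)
print(f"{score:.4f}")
```

A single differing syllable, as in line 1387, makes the whole line count as an error.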


### Accuracy on Sonnets

In [24]:
correct = sum(navarro_colorado_sonnets.metrical_pattern == sonnets.metrical_pattern)
accuracy_navarro_colorado_sonnets = correct / sonnets.metrical_pattern.size
print(f"Navarro-Colorado on Sonnets: {accuracy_navarro_colorado_sonnets:.4f} ({navarro_colorado_sonnets_times['real']}s)")

Navarro-Colorado on Sonnets: 0.9088 (19945.01s)


---

## Gervás

Gervás was kind enough to run his system against the ADSO corpus and send us the results for evaluation. We include the raw results and the transformation functions we used to evaluate its performance.

In [28]:
with open("data/gervas_adso.txt", "r") as file:
    lines = file.read().split("\n")
    gervas = pd.DataFrame.from_records(
        [(lines[index-1], *lines[index].split(" ")) for index, line in enumerate(lines) if index % 2 != 0]
    ).drop([1, 6, 9, 10], axis=1).rename(columns={
        0: "line_text",
        2: "stress",
        3: "indexed_metrical_pattern",
        4: "length",
        5: "metrical_type",
        7: "consonant_ending",
        8: "asonant_ending",
    })
    
def indexed2binary(df):
    binary = ["-" for i in range(int(df["length"]))]
    for pos in df["indexed_metrical_pattern"].split("'"):
        binary[int(pos) - 1] = "+"
    return "".join(binary)
    
gervas["metrical_pattern"] = gervas.apply(indexed2binary, axis=1)

Calculating overlap of verses evaluated by Gervás and those in ADSO

In [29]:
overlap_adso = list(set(gervas.line_text.tolist()) & set(adso.line_text.tolist()))
print(f"{len(overlap_adso)} lines from ADSO")

1291 lines from ADSO


In [30]:
overlap_sonnets = list(set(gervas.line_text.tolist()) & set(sonnets.line_text.tolist()))
print(f"{len(overlap_sonnets)} lines from Sonnets")

9639 lines from Sonnets


### Accuracy on ADSO

In [31]:
gervas_metrical_patterns = gervas[gervas.line_text.isin(overlap_adso)].sort_values("line_text").metrical_pattern.values
adso_metrical_patterns = adso[adso.line_text.isin(overlap_adso)].sort_values("line_text").metrical_pattern.values
accuracy_gervas_adso = sum(gervas_metrical_patterns == adso_metrical_patterns) / len(adso_metrical_patterns)
print(f"Gervás on {len(overlap_adso)} ADSO lines: {accuracy_gervas_adso:.4f}")

Gervás on 1291 ADSO lines: 0.7088


Despite the difference in the number of lines, this value is considerably lower than the one originally reported by Gervás (0.8873).
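
The comparison above pairs the two systems by sorting both sides on `line_text`. An alternative that makes the pairing explicit is to key the patterns by line text and compare only the shared keys. The sketch below is not the method used in this notebook; the gold patterns come from the ADSO table above, while the predictions are invented for illustration and are not Gervás's output.

```python
# Gold patterns for three ADSO lines, as shown earlier in this notebook.
gold = {
    "Yo vi unos ojos bellos, que hirieron": "++-+-+---+-",
    "con dulce flecha un corazón cuitado,": "-+-++--+-+-",
    "y que para encender nuevo cuidado,": "-----++--+-",
}
# Toy predictions, made up for this example only.
predicted = {
    "con dulce flecha un corazón cuitado,": "-+-++--+-+-",
    "y que para encender nuevo cuidado,": "--+--+---+-",
    "a line missing from the gold standard": "-+-+-+-+-+-",
}

# Only lines present on both sides take part in the evaluation.
shared = sorted(gold.keys() & predicted.keys())
accuracy = sum(gold[text] == predicted[text] for text in shared) / len(shared)
print(f"{len(shared)} shared lines, accuracy {accuracy:.4f}")
```

Keying by text assumes each line text maps to a single pattern, which is the same assumption the `drop_duplicates("line_text")` call makes in the Sonnets comparison.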

### Accuracy on Sonnets

In [43]:
gervas_metrical_patterns = (gervas[gervas.line_text.isin(overlap_sonnets)]
    .drop_duplicates("line_text")
    .sort_values("line_text")
    .metrical_pattern
    .values)
sonnets_metrical_patterns = (sonnets[sonnets.line_text.isin(overlap_sonnets)]
    .drop_duplicates("line_text")
    .sort_values("line_text")
    .metrical_pattern
    .values)
accuracy_gervas_sonnets = sum(gervas_metrical_patterns == sonnets_metrical_patterns) / len(sonnets_metrical_patterns)
print(f"Gervás on {len(overlap_sonnets)} Sonnets lines: {accuracy_gervas_sonnets:.4f}")

Gervás on 9639 Sonnets lines: 0.6759


---

## Rantanplan

Importing libraries. We will disable the cache so that repeated calls made while timing execution are not affected by it.

In [44]:
import rantanplan.pipeline
from rantanplan import get_scansion

Measuring running times for Rantanplan on ADSO

In [73]:
adso_text = "\n".join(adso.line_text.values)
adso_lengths = [11] * adso.line_text.size

In [64]:
# disabling cache
rantanplan_adso_times = %timeit -o rantanplan.pipeline._load_pipeline = {}; get_scansion(adso_text, rhythmical_lengths=adso_lengths)

9.54 s ± 176 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
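
`%timeit -o` is an IPython magic, so it only works inside a notebook or IPython shell. Outside of one, a roughly comparable measurement (7 runs, 1 loop each) can be sketched with the standard `timeit` module. Here `work` is a hypothetical stand-in for the `get_scansion` call, so the numbers it prints are unrelated to the timings reported in this notebook.

```python
import timeit
from statistics import mean, stdev


def work():
    # Placeholder workload; replace with the call to be measured.
    return sum(range(100_000))


# Seven independent runs of a single call each, mirroring "7 runs, 1 loop each".
runs = timeit.repeat(work, repeat=7, number=1)
average = mean(runs)
spread = stdev(runs)
print(f"{average:.6f} s ± {spread:.6f} s per loop (mean ± std. dev. of {len(runs)} runs)")
```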


In [74]:
rantanplan_adso = get_scansion(adso_text, rhythmical_lengths=adso_lengths)
rantanplan_adso_stress = [line["rhythm"]["stress"] for line in rantanplan_adso]

Measuring running times for Rantanplan on Sonnets

In [49]:
sonnets_text = "\n".join(sonnets.line_text.values)
sonnets_lengths = [11] * sonnets.line_text.size

In [66]:
# disabling cache
rantanplan_sonnets_times = %timeit -o rantanplan.pipeline._load_pipeline = {}; rantanplan_sonnets = get_scansion(sonnets_text, rhythmical_lengths=sonnets_lengths)

53.5 s ± 1.78 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [52]:
rantanplan_sonnets = get_scansion(sonnets_text, rhythmical_lengths=sonnets_lengths)
rantanplan_sonnets_stress = [line["rhythm"]["stress"] for line in rantanplan_sonnets]

### Accuracy on ADSO

In [75]:
rantanplan_adso_stress = [line["rhythm"]["stress"] for line in rantanplan_adso]
accuracy_rantanplan_adso = sum(rantanplan_adso_stress == adso.metrical_pattern) / adso.metrical_pattern.size
print(f"Rantanplan on ADSO: {accuracy_rantanplan_adso:.4f} ({rantanplan_adso_times.average:.4f}s)")

Rantanplan on ADSO: 0.9245 (9.5431s)


### Accuracy on Sonnets

In [68]:
rantanplan_sonnets_stress = [line["rhythm"]["stress"] for line in rantanplan_sonnets]
accuracy_rantanplan_sonnets = sum(rantanplan_sonnets_stress == sonnets.metrical_pattern) / sonnets.metrical_pattern.size
print(f"Rantanplan on Sonnets: {accuracy_rantanplan_sonnets:.4f} ({rantanplan_sonnets_times.average:.4f}s)")

Rantanplan on Sonnets: 0.8793 (53.5131s)


---

# Results 

In [69]:
from IPython.display import display, HTML

## ADSO

In [70]:
display(HTML(
    pd.DataFrame([
        ["Gervás", accuracy_gervas_adso, "N/A"],
        ["Navarro-Colorado", accuracy_navarro_colorado_adso, float(navarro_colorado_adso_times["real"])],
        ["Rantanplan", accuracy_rantanplan_adso, rantanplan_adso_times.average]
    ], columns=["Model", "Accuracy", "Time"]).to_html(index=False)
))

Model,Accuracy,Time
Gervás,0.708753,
Navarro-Colorado,0.945324,2698.15
Rantanplan,0.92446,9.54309


## Sonnets

In [71]:
display(HTML(
    pd.DataFrame([
        ["Gervás", accuracy_gervas_sonnets, "N/A"],
        ["Navarro-Colorado", accuracy_navarro_colorado_sonnets, float(navarro_colorado_sonnets_times["real"])],
        ["Rantanplan", accuracy_rantanplan_sonnets, rantanplan_sonnets_times.average]
    ], columns=["Model", "Accuracy", "Time"]).to_html(index=False)
))

Model,Accuracy,Time
Gervás,0.6759,
Navarro-Colorado,0.908815,19945.0
Rantanplan,0.879334,53.5131
