<a href="https://colab.research.google.com/github/isegura/iso4simplify/blob/main/TextSimplification_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text simplification evaluation (Prompting + LLM)

This notebook (Google Colab) loads `test.csv` and `pred.csv` (or a single combined CSV), aligns them by `pair_id`, and evaluates:

**Simplification metrics:** BLEU, SARI, BERTScore, MeaningBERT, and LENS  
**Readability:** (depending on the language) FKGL, Flesch Reading Ease, and SMOG (or ES/FR adaptations where applicable)

At the end, it saves a CSV with the scores:
`scores_{dataset}_{model}_{prompt_type}_{prompt_strategy}.csv`


## 1) Configuration


In [None]:
#@title Parameters (edit as needed)
DATASET = "tsar2024"          #@param ["tsar2024","clara-med","cochrane","meds","wiki"]
MODEL_NAME = "gemma-3-12b-it" #@param {type:"string"}
PROMPT_STRATEGY = "zero"      #@param ["zero","one","few"]
PROMPT_TYPE = "ISO"           #@param ["ISO","2steps","brief"]


# Paths (Drive) or file names (if you upload with the uploader)
TEST_CSV_PATH = ""      #@param {type:"string"}
PRED_CSV_PATH = ""      #@param {type:"string"}
COMBINED_CSV_PATH = ""  #@param {type:"string"}

# Columns: these are the assumed defaults (the notebook also tries to autodetect them).
# If your CSV uses different names, edit them here.
COL_PAIR_ID = "pair_id"         #@param {type:"string"}
COL_COMPLEX = "complex"         #@param {type:"string"}
COL_REFERENCE = "simple"       #@param {type:"string"}
COL_PREDICTION = "prediction"   #@param {type:"string"}
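
The comment in the cell above mentions column autodetection. A minimal sketch of how that could work, assuming a hypothetical helper `autodetect_column` and a hypothetical alias table (neither name comes from the original notebook):

```python
# Hypothetical alias table: candidate header names for each logical column.
COLUMN_ALIASES = {
    "pair_id": ["pair_id", "id", "pairid"],
    "complex": ["complex", "source", "original", "src"],
    "simple": ["simple", "reference", "ref", "target"],
    "prediction": ["prediction", "pred", "output", "simplified"],
}


def autodetect_column(columns, logical_name, aliases=COLUMN_ALIASES):
    """Return the first header matching an alias (case-insensitive), else None."""
    normalized = {c.strip().lower(): c for c in columns}
    for alias in aliases.get(logical_name, []):
        if alias in normalized:
            return normalized[alias]
    return None
```

One possible use is `COL_COMPLEX = autodetect_column(test_df.columns, "complex") or COL_COMPLEX`, which keeps the configured default whenever nothing matches.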


## 2) Installing dependencies


In [None]:
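# evaluate (and sacremoses, which its SARI metric imports) are needed by the metrics cell
!pip -q install sacrebleu bert-score textstat evaluate sacremoses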

In [None]:
!pip uninstall -y transformers tokenizers numpy
!pip -q install --no-cache-dir numpy==1.26.4
# Quote the specifier so the shell does not treat ">" as an output redirection
!pip -q install --no-cache-dir "transformers>=4.40.0" tokenizers

Found existing installation: transformers 5.2.0
Uninstalling transformers-5.2.0:
  Successfully uninstalled transformers-5.2.0
Found existing installation: tokenizers 0.22.2
Uninstalling tokenizers-0.22.2:
  Successfully uninstalled tokenizers-0.22.2
Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bert-score 0.3.13 requires transformers>=3.0.0, which is not installed.
sentence-transformers 5.2.3 requires transformers<6.0.0,>=4.41.0, which is not installed.
torchtune 0.6.1 requires tokenizers, which is not installed.
peft 0.18.1 requires transformers, which is not installed.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 3.0.1 which is incompatible.
opencv-contrib-python 4.13.0.92 requires numpy>=2; python_version >= "3.9", but you ha
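
Since the log above reports dependency conflicts, it may help to check what is actually installed before running the metric cells. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and the package list is an assumption taken from the install cells:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages this notebook relies on (list assumed from the install cells).
for pkg in ["numpy", "transformers", "tokenizers", "sacrebleu", "bert-score", "textstat"]:
    print(f"{pkg:15s} {installed_version(pkg) or 'NOT INSTALLED'}")
```

In Colab, a runtime restart is usually needed after reinstalling `numpy` so that already imported modules pick up the new version.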

## 3) Loading files (Google Drive or manual upload)


In [None]:
from google.colab import files
import os

def upload_files():
    """Open the Colab upload dialog and return the uploaded file names."""
    uploaded = files.upload()
    return list(uploaded.keys())

print("Upload exactly two files: test.csv and pred.csv")

uploaded_files = upload_files()

if len(uploaded_files) != 2:
    raise ValueError("You must upload exactly two files: test.csv and pred.csv")

# Build absolute paths inside Colab
uploaded_paths = [os.path.join("/content", f) for f in uploaded_files]

# Detect each file by name
test_path = ""
pred_path = ""

for p in uploaded_paths:
    name = os.path.basename(p).lower()
    if "test" in name:
        test_path = p
    elif "pred" in name:
        pred_path = p

# If name detection fails, assign whichever file is still unused
if not test_path:
    test_path = next(p for p in uploaded_paths if p != pred_path)

if not pred_path:
    pred_path = next(p for p in uploaded_paths if p != test_path)

print("✅ test_path =", test_path)
print("✅ pred_path =", pred_path)

Upload exactly two files: test.csv and pred.csv


TypeError: 'NoneType' object is not subscriptable
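
The `TypeError` above most likely comes from running `files.upload()` without an interactive browser session. One way to make this step easier to test is to keep the name-based assignment in a pure function. This is a sketch under that assumption; `assign_paths` is a hypothetical helper, not part of the original notebook:

```python
import os


def assign_paths(paths):
    """Split two file paths into (test_path, pred_path) by file name.

    Falls back to the given order when the names are not informative.
    """
    if len(paths) != 2:
        raise ValueError("Expected exactly two files: test.csv and pred.csv")
    first, second = paths
    name_first = os.path.basename(first).lower()
    name_second = os.path.basename(second).lower()
    # Swap only when the names clearly indicate the reverse order.
    if ("pred" in name_first and "pred" not in name_second) or (
        "test" in name_second and "test" not in name_first
    ):
        return second, first
    return first, second
```

With this in place, the upload dialog could be skipped whenever `TEST_CSV_PATH` and `PRED_CSV_PATH` are both set in the configuration cell, and `assign_paths` used only for uploaded files.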

## 4) Reading the files

In [None]:
import pandas as pd

# Load the files
test_df = pd.read_csv(test_path)
pred_df = pd.read_csv(pred_path)

# Keep only the columns needed for evaluation
test_df = test_df[[COL_PAIR_ID, COL_COMPLEX, COL_REFERENCE]].copy()
pred_df = pred_df[[COL_PAIR_ID, COL_PREDICTION]].copy()


print("✅ test_df loaded. Shape:", test_df.shape)
print("✅ pred_df loaded. Shape:", pred_df.shape)

#display(test_df.head(2))
#display(pred_df.head(2))

# Align references and predictions on the pair identifier
df = test_df.merge(pred_df, on=COL_PAIR_ID, how="inner")
display(df.head(2))


✅ test_df loaded. Shape: (117, 3)
✅ pred_df loaded. Shape: (117, 2)


| | pair_id | complex | simple | prediction |
|---|---|---|---|---|
| 0 | CD001096 | Twenty-eight studies (reporting a total of thi... | The reminders improved physician practices by ... | Studies: 28.\nReminders: Helped.\nImprovement:... |
| 1 | CD006466 | Of 12,620 identified citations, 10 RCTs fulfil... | The studies used two types of blood thinner:\n... | Of 12 studies, 10 showed that oral medications... |
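
An inner join silently drops any `pair_id` that appears in only one file, and repeated identifiers multiply rows. A minimal sketch of a pre-merge check, using plain Python lists of identifiers; `alignment_report` is a hypothetical helper name:

```python
from collections import Counter


def alignment_report(test_ids, pred_ids):
    """Summarize how two lists of pair identifiers line up."""
    test_counts, pred_counts = Counter(test_ids), Counter(pred_ids)
    return {
        "missing_predictions": sorted(set(test_counts) - set(pred_counts)),
        "unexpected_predictions": sorted(set(pred_counts) - set(test_counts)),
        "duplicated_ids": sorted(
            i for i in (test_counts + pred_counts)
            if test_counts[i] > 1 or pred_counts[i] > 1
        ),
    }


# Toy identifiers, for illustration only.
report = alignment_report(["a", "b", "c"], ["b", "c", "c", "d"])
print(report)
```

In the notebook it could be called as `alignment_report(test_df[COL_PAIR_ID].tolist(), pred_df[COL_PAIR_ID].tolist())` before the merge.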


## 5) Simplification metrics (BLEU, SARI, BERTScore, MeaningBERT, LENS)


In [None]:
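import numpy as np
import torch      # used to pick the BERTScore device
import evaluate   # used to load SARI and MeaningBERT
from sacrebleu.metrics import BLEU
from bert_score import score as bert_score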

DATASET2LANG_DEFAULT = {
    "tsar2024": "en",
    "clara-med": "es",
    "cochrane": "fr",
    "meds": "fr",
    "wiki": "fr",
}


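FORCE_LANG = ""  # Options: "en", "es", "fr" (leave empty to use the dataset default)

# Language used to choose the BERTScore backbone
lang_default = FORCE_LANG.strip().lower() or DATASET2LANG_DEFAULT.get(DATASET, "en")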

sources = df[COL_COMPLEX].tolist()
preds   = df[COL_PREDICTION].tolist()
refs_1  = df[COL_REFERENCE].tolist()
refs_for_hf = [[r] for r in refs_1]

scores = {}

# BLEU (corpus)
bleu = BLEU()
bleu_res = bleu.corpus_score(preds, [refs_1])
scores["BLEU"] = float(bleu_res.score)

# SARI
sari = evaluate.load("sari")
sari_res = sari.compute(sources=sources, predictions=preds, references=refs_for_hf)
scores["SARI"] = float(sari_res["sari"])

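# BERTScore
# lang_default: "en" / "es" / "fr"
# You can pin an explicit backbone (recommended for ES/FR):
#  - en: "roberta-large" (the classic choice)
#  - es: "PlanTL-GOB-ES/roberta-base-bne" (or similar)
#  - fr: "camembert-base"
# Note: for backbones outside bert_score's built-in layer table,
# you may also need to pass num_layers explicitly.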
BACKBONE_BY_LANG = {
    "en": "roberta-large",
    "es": "PlanTL-GOB-ES/roberta-base-bne",
    "fr": "camembert-base",
}

model_type = BACKBONE_BY_LANG.get(lang_default, "roberta-large")

P, R, F1 = bert_score(
    cands=preds,
    refs=refs_1,
    lang=lang_default,       # optional if you pass model_type
    model_type=model_type,
    verbose=False,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

scores["BERTScore_P"]  = float(P.mean().item())
scores["BERTScore_R"]  = float(R.mean().item())
scores["BERTScore_F1"] = float(F1.mean().item())

# MeaningBERT (meaning preservation between source and prediction)
meaningbert = evaluate.load("davebulaval/meaningbert")
mb_res = meaningbert.compute(references=sources, predictions=preds)
if "score" in mb_res:
    scores["MeaningBERT"] = float(mb_res["score"])
elif "scores" in mb_res:
    scores["MeaningBERT"] = float(np.mean(mb_res["scores"]))
else:
    scores["MeaningBERT"] = float(list(mb_res.values())[0])



scores


ModuleNotFoundError: Could not import module 'AutoTokenizer'. Are this object's requirements defined correctly?
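
The traceback above aborts the whole cell, so no metric after the failing one is stored. A minimal sketch of a guard that records `NaN` for a metric that raises and lets the rest continue; `safe_metric` is a hypothetical helper, not part of the original notebook:

```python
import math


def safe_metric(scores, name, compute):
    """Run compute() and store its float result; store NaN if it raises."""
    try:
        scores[name] = float(compute())
    except Exception as exc:  # deliberately broad: any metric failure becomes NaN
        print(f"⚠️ {name} could not be computed: {type(exc).__name__}: {exc}")
        scores[name] = math.nan
    return scores[name]


# Toy demonstration: one metric succeeds, one fails.
demo_scores = {}
safe_metric(demo_scores, "OK", lambda: 42)
safe_metric(demo_scores, "Broken", lambda: 1 / 0)
```

In the metrics cell, a call could look like `safe_metric(scores, "BLEU", lambda: BLEU().corpus_score(preds, [refs_1]).score)`.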

## 6) Readability metrics (by language)


In [None]:
import numpy as np
import textstat

lang = (FORCE_LANG.strip().lower() or DATASET2LANG_DEFAULT.get(DATASET, "en"))
if lang not in {"en","es","fr"}:
    print(f"⚠️ Language '{lang}' is not supported in this notebook. Falling back to 'en'.")
    lang = "en"

try:
    textstat.set_lang(lang)
except Exception:
    pass

def kandel_moles_french(text: str) -> float:
    """Kandel & Moles (French adaptation of Flesch): 207 - 1.015*ASL - 73.6*ASW"""
    sentences = max(1, textstat.sentence_count(text))
    words = max(1, textstat.lexicon_count(text, removepunct=True))
    syllables = max(1, textstat.syllable_count(text))
    asl = words / sentences
    asw = syllables / words
    return 207.0 - (1.015 * asl) - (73.6 * asw)

def avg_metric(texts, fn):
    vals = []
    for t in texts:
        try:
            vals.append(float(fn(t)))
        except Exception:
            vals.append(np.nan)
    return float(np.nanmean(vals))

pred_texts = preds

if lang == "en":
    scores["FKGL"] = avg_metric(pred_texts, textstat.flesch_kincaid_grade)
    scores["FleschReadingEase"] = avg_metric(pred_texts, textstat.flesch_reading_ease)
    scores["SMOG"] = avg_metric(pred_texts, textstat.smog_index)

elif lang == "es":
    scores["FernandezHuerta"] = avg_metric(pred_texts, getattr(textstat, "fernandez_huerta", lambda x: np.nan))
    scores["SzigrisztPazos"]  = avg_metric(pred_texts, getattr(textstat, "szigriszt_pazos", lambda x: np.nan))
    scores["GutierrezPolini"] = avg_metric(pred_texts, getattr(textstat, "gutierrez_polini", lambda x: np.nan))

elif lang == "fr":
    scores["KandelMoles"] = float(np.nanmean([kandel_moles_french(t) for t in pred_texts]))

scores
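
To make the readability formulas concrete, here is a small worked sketch that computes them from raw counts. ASL is the average sentence length in words and ASW is the average number of syllables per word. The counts are hypothetical, and these functions take counts directly instead of calling `textstat`:

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: 206.835 - 1.015*ASL - 84.6*ASW (higher is easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: 0.39*ASL + 11.8*ASW - 15.59 (lower is easier)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def kandel_moles(words, sentences, syllables):
    """Kandel & Moles: 207 - 1.015*ASL - 73.6*ASW (same formula as the cell above)."""
    return 207.0 - 1.015 * (words / sentences) - 73.6 * (syllables / words)


# Hypothetical counts: 20 words, 2 sentences, 30 syllables -> ASL = 10, ASW = 1.5
counts = dict(words=20, sentences=2, syllables=30)
for fn in (flesch_reading_ease, flesch_kincaid_grade, kandel_moles):
    print(f"{fn.__name__}: {fn(**counts):.3f}")
```

The two Flesch-style scores differ only in their constants, which is why the French adaptation can reuse the same sentence, word, and syllable counts.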


## 7) Show results and save to CSV


In [None]:
import pandas as pd
from datetime import datetime

meta = {
    "dataset": DATASET,
    "model": MODEL_NAME,
    "prompt_type": PROMPT_TYPE,
    "prompt_strategy": PROMPT_STRATEGY,
    "lang": lang,
    "n": len(df),
    "timestamp": datetime.now().isoformat(timespec="seconds"),
}

row = {**meta, **scores}
res_df = pd.DataFrame([row])

print("=== Results ===")
display(res_df)

out_name = f"scores_{DATASET}_{MODEL_NAME}_{PROMPT_TYPE}_{PROMPT_STRATEGY}.csv"
res_df.to_csv(out_name, index=False)
print("✅ Saved:", out_name)
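
If `MODEL_NAME` ever contains a path separator (for example a namespaced model id), the output name above would point into a non-existent directory. A minimal sketch of a sanitizing step; `safe_filename_part` and `scores_filename` are hypothetical helpers, and the model id in the example is illustrative only:

```python
import re


def safe_filename_part(value):
    """Replace runs of characters that are unsafe in file names with '-'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", str(value)).strip("-")


def scores_filename(dataset, model, prompt_type, prompt_strategy):
    """Build the output name following the scores_{...}.csv pattern used above."""
    parts = [dataset, model, prompt_type, prompt_strategy]
    return "scores_" + "_".join(safe_filename_part(p) for p in parts) + ".csv"


# Illustrative call with a namespaced model id.
print(scores_filename("tsar2024", "google/gemma-3-12b-it", "ISO", "zero"))
```

Values without unsafe characters, such as the defaults in the configuration cell, pass through unchanged.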


## Quick notes

- The notebook tries to **autodetect common column names** if your CSVs do not use the default names.
- For **multiple references** (more than one gold simplification per `pair_id`), adapt `refs_for_hf` so that it is a list of lists holding all the references for each example.
- If `LENS` cannot be computed, the cell leaves `NaN` and shows the runtime error.
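
For the multiple-reference case, a minimal sketch of how the reference lists could be built from `(pair_id, reference)` rows. The helper names and the toy rows are hypothetical. Note that the per-example layout suits `refs_for_hf`, while sacrebleu's `corpus_score` expects the transposed layout (one list per reference stream):

```python
from collections import defaultdict


def group_references(rows):
    """Group (pair_id, reference) rows into one reference list per pair_id.

    Returns (ids, refs_per_example), with ids in first-seen order.
    """
    grouped = defaultdict(list)
    for pair_id, reference in rows:
        grouped[pair_id].append(reference)
    ids = list(grouped)
    return ids, [grouped[i] for i in ids]


def to_reference_streams(refs_per_example):
    """Transpose per-example references into sacrebleu-style reference streams.

    Assumes every example has the same number of references.
    """
    return [list(stream) for stream in zip(*refs_per_example)]


# Toy rows, for illustration only.
rows = [("p1", "ref A1"), ("p2", "ref B1"), ("p1", "ref A2"), ("p2", "ref B2")]
ids, refs_for_hf = group_references(rows)
streams = to_reference_streams(refs_for_hf)
```

Predictions and sources would then need to be ordered by `ids` so that each position refers to the same example.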
