# Text simplification evaluation (Prompting + LLM)

The input file must contain the following columns (they can be renamed in the notebook):
- COL_COMPLEX = "complex"
- COL_REFERENCE = "simple"
- COL_PREDICTION = "prediction"
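A quick check of the CSV header before running the metrics can catch a missing column early. The sketch below is only an illustration: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the toy CSV is invented data standing in for a predictions file.

```python
import csv
import io

# Default column names listed above; adjust if the notebook parameters are renamed.
REQUIRED_COLUMNS = ["complex", "simple", "prediction"]

def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from a CSV header."""
    present = set(header)
    return [name for name in required if name not in present]

# Toy CSV standing in for a predictions file (illustrative data only).
sample = io.StringIO("complex,simple\nA hard sentence.,An easy sentence.\n")
header = next(csv.reader(sample))
print(missing_columns(header))  # ['prediction']
```

An empty list means all three columns are present.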



**Output:** saves a CSV with the scores:
`scores_{dataset}_{model}_{prompt_type}_{prompt_strategy}.csv`
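The file name pattern can be built with a small helper. This is a sketch, not the notebook's own code: `scores_filename` is a hypothetical name, and replacing path separators is an added assumption, meant for model identifiers that contain a slash.

```python
def scores_filename(dataset, model, prompt_type, prompt_strategy):
    """Build the scores file name, replacing path separators in any part."""
    parts = [dataset, model, prompt_type, prompt_strategy]
    # Assumption: slashes and backslashes are swapped for hyphens so the
    # result is always a single file name rather than a nested path.
    safe = [str(p).replace("/", "-").replace("\\", "-") for p in parts]
    return "scores_" + "_".join(safe) + ".csv"

print(scores_filename("tsar2024", "gemma-3-12b-it", "minimal", "one"))
```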



### 📊 Simplification metrics

- BLEU
- SARI
- BERTScore
- MeaningBERT


### 📊 Readability metrics

- FKGL: U.S. school grade level required to understand the text, e.g. 8.2 ≈ 8th grade.
- Flesch Reading Ease: reading ease based on sentence length and syllables per word. Higher = easier.


| Metric | Spanish adaptation | French adaptation |
|----------|----------------------|----------------------|
| FKGL (Flesch–Kincaid Grade Level) | No | No |
| Flesch Reading Ease | Fernández-Huerta | Kandel & Moles |
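For reference, the readability formulas can be written directly in terms of word, sentence and syllable counts. The sketch below uses the commonly published constants for the two English formulas and the Kandel & Moles constants that appear later in this notebook. The function names are illustrative, the counts are invented, and `textstat` may count words and syllables differently, so its values need not match exactly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease (English): higher means easier."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level (English): approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def kandel_moles(words, sentences, syllables):
    """Kandel & Moles (French adaptation of Flesch Reading Ease)."""
    return 207.0 - 1.015 * (words / sentences) - 73.6 * (syllables / words)

# Invented counts: 100 words in 5 sentences with 150 syllables.
counts = dict(words=100, sentences=5, syllables=150)
for fn in (flesch_reading_ease, flesch_kincaid_grade, kandel_moles):
    print(fn.__name__, round(fn(**counts), 1))
```

Longer sentences and more syllables per word lower the two ease scores and raise the grade level.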






## 1) Configuration


In [27]:
#@title Parameters (edit as needed)
PRED_FILE = "predictions_gemma-3-12b-it_docs_0.csv"    #@param {type:"string"}

DATASET = "tsar2024"          #@param ["tsar2024","clara-med","cochrane","meds","wiki"]

MODEL_NAME = "gemma-3-12b-it" #@param {type:"string"}
PROMPT_STRATEGY = "one"      #@param ["zero","one","few"]
PROMPT_TYPE = "minimal"           #@param ["minimal", "ISO", "ISO+examples"]


# If your CSV uses other column names, edit them here.
COL_COMPLEX = "complex"         #@param {type:"string"}
COL_REFERENCE = "simple"       #@param {type:"string"}
COL_PREDICTION = "prediction"   #@param {type:"string"}


# Default language for each dataset; unknown datasets fall back to English.
DATASET2LANG_DEFAULT = {
    "tsar2024": "en",
    "clara-med": "es",
    "cochrane": "fr",
    "meds": "fr",
    "wiki": "fr",
}
lang = DATASET2LANG_DEFAULT.get(DATASET, "en")
print(DATASET, lang)


tsar2024 en


## 2) Installing dependencies


In [28]:
!pip -q install sacrebleu bert-score textstat sentence-transformers


## 3) Loading the file and dataset


In [29]:
import pandas as pd
print("✅ file_path =", PRED_FILE)

df = pd.read_csv(PRED_FILE)
# COL_PAIR_ID is not set in the configuration cell, so the ID column is kept
# only if that variable has been defined elsewhere in the session.
cols = [COL_COMPLEX, COL_REFERENCE, COL_PREDICTION]
if "COL_PAIR_ID" in globals():
    cols = [COL_PAIR_ID] + cols
df = df[cols].copy()


print("✅ Dataset loaded. Shape:", df.shape)
# print(df.head(1))
# Extract the text columns
sources = df[COL_COMPLEX].astype(str).tolist()
references = df[COL_REFERENCE].astype(str).tolist()
predictions = df[COL_PREDICTION].astype(str).tolist()

✅ file_path = predictions_gemma-3-12b-it_docs_0.csv
✅ Dataset loaded. Shape: (117, 4)


## 4) Simplification metrics (BLEU, SARI, BERTScore, MeaningBERT)


### BLEU and SARI


In [30]:
import sacrebleu
import numpy as np

scores={}

# ======================
# 🔵 BLEU
# ======================
bleu = sacrebleu.corpus_bleu(predictions, [references]).score
print(f"BLEU: {bleu:.2f}")
scores["BLEU"]=round(bleu,2)

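# ======================
# 🟢 SARI  (implementation without easse)
# ======================
# Note: simplified, set-based, single-reference variant. The delete score uses F1 here,
# whereas the standard SARI definition uses precision only for deletions.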
def sari_score(orig_sents, sys_sents, refs_sents):
    def get_ngrams(sentence, n):
        tokens = sentence.split()
        return [" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

    def f1(tp, denom_p, denom_r):
        p = tp / denom_p if denom_p else 0.0
        r = tp / denom_r if denom_r else 0.0
        return (2*p*r/(p+r)) if (p+r) else 0.0

    total = 0.0
    for orig, sys, ref in zip(orig_sents, sys_sents, refs_sents):
        add, keep, dele = [], [], []
        for n in range(1, 5):
            o = set(get_ngrams(orig, n))
            s = set(get_ngrams(sys, n))
            r = set(get_ngrams(ref, n))

            # ADD
            s_add = s - o
            r_add = r - o
            tp = len(s_add & r_add)
            add.append(f1(tp, len(s_add), len(r_add)))

            # KEEP
            s_keep = s & o
            r_keep = r & o
            tp = len(s_keep & r_keep)
            keep.append(f1(tp, len(s_keep), len(r_keep)))

            # DELETE
            s_del = o - s
            r_del = o - r
            tp = len(s_del & r_del)
            dele.append(f1(tp, len(s_del), len(r_del)))

        total += (np.mean(add) + np.mean(keep) + np.mean(dele)) / 3.0

    return (total / len(orig_sents)) * 100.0 if orig_sents else 0.0


sari = sari_score(sources, predictions, references)
print(f"SARI: {sari:.2f}")
scores["SARI"]=round(sari,2)


BLEU: 5.36
SARI: 33.32
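To make the three operations in `sari_score` concrete, the sketch below computes the unigram ADD, KEEP and DELETE sets for a single toy triple and the F1 of each operation. The sentences are invented for illustration and are not from the dataset. The variable names are hypothetical, and only n = 1 is shown, whereas the cell above averages over n = 1..4.

```python
def f1(tp, n_sys, n_ref):
    """F1 from true positives and the sizes of the system and reference sets."""
    p = tp / n_sys if n_sys else 0.0
    r = tp / n_ref if n_ref else 0.0
    return 2 * p * r / (p + r) if (p + r) else 0.0

orig_set = set("the feline sat".split())    # source sentence
sys_set = set("the cat sat".split())        # system output
ref_set = set("the cat sat down".split())   # reference simplification

# ADD: words not in the source that the system / reference introduced.
add_sys, add_ref = sys_set - orig_set, ref_set - orig_set
# KEEP: source words that the system / reference retained.
keep_sys, keep_ref = sys_set & orig_set, ref_set & orig_set
# DELETE: source words that the system / reference dropped.
del_sys, del_ref = orig_set - sys_set, orig_set - ref_set

add_f1 = f1(len(add_sys & add_ref), len(add_sys), len(add_ref))
keep_f1 = f1(len(keep_sys & keep_ref), len(keep_sys), len(keep_ref))
del_f1 = f1(len(del_sys & del_ref), len(del_sys), len(del_ref))

print(sorted(add_sys), sorted(add_ref), round(add_f1, 3))
print(round((add_f1 + keep_f1 + del_f1) / 3 * 100, 2))
```

The system adds "cat" but misses "down", so ADD precision is 1 and recall is 0.5, while KEEP and DELETE agree with the reference exactly.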


### BERT_SCORE

- For English, *microsoft/deberta-xlarge-mnli* is the model that correlates best with human evaluation (see https://pypi.org/project/bert-score/). That option is currently commented out in the cell below, so *xlm-roberta-large* is used for every language.

- For the other languages, we use *xlm-roberta-large* because it correlates better than bert-base.
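The per-language choice described in the two bullets above could be expressed as a small helper. This is a sketch only: `pick_bertscore_model` is a hypothetical name, not part of `bert_score`.

```python
def pick_bertscore_model(lang):
    """Return the BERTScore model name for a language code, per the two bullets above."""
    if lang == "en":
        return "microsoft/deberta-xlarge-mnli"
    return "xlm-roberta-large"

for code in ("en", "es", "fr"):
    print(code, pick_bertscore_model(code))
```

The returned name would then be passed as `model_type` to `bert_score.score`.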

In [31]:
from bert_score import score

model_type="xlm-roberta-large"

#if lang=='en':
    # model with the best correlation with human evaluation
    # model_type='microsoft/deberta-xlarge-mnli'


print("🤖 Selected model:", model_type)

P, R, F1 = score(predictions, references, model_type=model_type,
    lang=lang,  # language code set in the configuration cell
    rescale_with_baseline=True, # normalizes scores for interpretability
    verbose=True
)

precision = round(P.mean().item(),4)
recall = round(R.mean().item(),4)
f1 = round(F1.mean().item(),4)

print("BERTScore Precision:", precision)
print("BERTScore Recall:", recall)
print("BERTScore F1:", f1)
scores["BERTScore-P:"]=precision
scores["BERTScore-R:"]=recall
scores["BERTScore-F1:"]=f1

🤖 Selected model: xlm-roberta-large


XLMRobertaModel LOAD REPORT from: xlm-roberta-large
Key                       | Status     |  | 
--------------------------+------------+--+-
lm_head.layer_norm.bias   | UNEXPECTED |  | 
lm_head.dense.weight      | UNEXPECTED |  | 
lm_head.bias              | UNEXPECTED |  | 
lm_head.layer_norm.weight | UNEXPECTED |  | 
lm_head.dense.bias        | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


calculating scores...
computing bert embedding.


computing greedy matching.


done in 12.49 seconds, 9.36 sentences/sec
BERTScore Precision: 0.2111
BERTScore Recall: -0.0606
BERTScore F1: 0.0707
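A negative recall is possible because the cell above sets `rescale_with_baseline=True`. As the bert-score documentation describes it, rescaling is a linear map of the raw score against a precomputed baseline, so any raw score below the baseline becomes negative. The sketch below illustrates that map. The function name is hypothetical and the baseline value is made up, not the real baseline for *xlm-roberta-large*.

```python
def rescale(raw_score, baseline):
    """Linearly rescale a raw score so the baseline maps to 0 and 1 stays at 1."""
    return (raw_score - baseline) / (1.0 - baseline)

b = 0.85  # hypothetical baseline, NOT the real value for any model
for raw in (0.80, 0.85, 0.90, 1.00):
    print(raw, round(rescale(raw, b), 3))
```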


### MeaningBERT


In [32]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Note: this cell computes cosine similarity between multilingual sentence
# embeddings as a meaning-preservation proxy; it does not load the MeaningBERT model itself.
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

complex_emb = embedder.encode(sources, convert_to_numpy=True, batch_size=32)
ref_emb = embedder.encode(references, convert_to_numpy=True, batch_size=32)
pred_emb = embedder.encode(predictions, convert_to_numpy=True, batch_size=32)

# The diagonal of each similarity matrix holds the similarity of pair i with itself.
sim_OR_diag = cosine_similarity(complex_emb, ref_emb).diagonal()
sim_OR = round(np.mean(sim_OR_diag),4)
print("MeaningBERT (complex → simple):", sim_OR)
scores["MBERT_OR"]=sim_OR

sim_OP_diag = cosine_similarity(complex_emb, pred_emb).diagonal()
sim_OP = round(np.mean(sim_OP_diag),4)
print("MeaningBERT (complex → prediction):", sim_OP)
scores["MBERT_OP"]=sim_OP

sim_RP_diag = cosine_similarity(ref_emb, pred_emb).diagonal()
sim_RP = round(np.mean(sim_RP_diag),4)
print("MeaningBERT (simple → prediction):", sim_RP)
scores["MBERT_RP"]=sim_RP




BertModel LOAD REPORT from: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


MeaningBERT (complex → simple): 0.7774
MeaningBERT (complex → prediction): 0.6879
MeaningBERT (simple → prediction): 0.6239
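The cell above builds a full N×N similarity matrix for each comparison and keeps only its diagonal. Since only the N paired similarities are needed, they can also be computed row by row. The sketch below shows the idea in plain Python on toy vectors. `paired_cosine` is a hypothetical helper and the vectors are invented, not real embeddings.

```python
import math

def paired_cosine(rows_a, rows_b):
    """Cosine similarity between rows_a[i] and rows_b[i] for each index i."""
    sims = []
    for a, b in zip(rows_a, rows_b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        # Zero vectors have no direction; report 0.0 instead of dividing by zero.
        sims.append(dot / norm if norm else 0.0)
    return sims

a = [[1.0, 0.0], [1.0, 1.0]]
b = [[2.0, 0.0], [1.0, -1.0]]
print(paired_cosine(a, b))  # [1.0, 0.0]
```

For 117 pairs the full matrix is cheap, so this matters mainly for much larger files.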


## 6) Readability metrics (by language)


In [33]:
import numpy as np
import textstat

if lang not in {"en","es","fr"}:
    print(f"⚠️ Language '{lang}' is not supported in this notebook. Falling back to 'en'.")
    lang = "en"

try:
    textstat.set_lang(lang)
except Exception:
    pass

def kandel_moles_french(text: str) -> float:
    """Kandel & Moles (French adaptation of Flesch): 207 - 1.015*ASL - 73.6*ASW"""
    sentences = max(1, textstat.sentence_count(text))
    words = max(1, textstat.lexicon_count(text, removepunct=True))
    syllables = max(1, textstat.syllable_count(text))
    asl = words / sentences
    asw = syllables / words
    return 207.0 - (1.015 * asl) - (73.6 * asw)

def avg_metric(texts, fn):
    vals = []
    for t in texts:
        try:
            vals.append(float(fn(t)))
        except Exception:
            vals.append(np.nan)
    return float(np.nanmean(vals))

pred_texts = predictions

if lang == "en":
    scores["FKGL"] = avg_metric(pred_texts, textstat.flesch_kincaid_grade)
    scores["Flesch"] = avg_metric(pred_texts, textstat.flesch_reading_ease)
    #scores["SMOG"] = avg_metric(pred_texts, textstat.smog_index)

elif lang == "es":
    scores["FernandezHuerta"] = avg_metric(pred_texts, getattr(textstat, "fernandez_huerta", lambda x: np.nan))
    # scores["SzigrisztPazos"]  = avg_metric(pred_texts, getattr(textstat, "szigriszt_pazos", lambda x: np.nan))
    # scores["GutierrezPolini"] = avg_metric(pred_texts, getattr(textstat, "gutierrez_polini", lambda x: np.nan))

elif lang == "fr":
    scores["KandelMoles"] = float(np.nanmean([kandel_moles_french(t) for t in pred_texts]))

scores


{'BLEU': 5.36,
 'SARI': np.float64(33.32),
 'BERTScore-P:': 0.2111,
 'BERTScore-R:': -0.0606,
 'BERTScore-F1:': 0.0707,
 'MBERT_OR': np.float32(0.7774),
 'MBERT_OP': np.float32(0.6879),
 'MBERT_RP': np.float32(0.6239),
 'FKGL': 6.521908516060169,
 'Flesch': 67.74894170928633}

## 7) Show results and save to CSV


In [34]:
import pandas as pd
from datetime import datetime

meta = {
    "dataset": DATASET,
    "model": MODEL_NAME,
    "prompt_type": PROMPT_TYPE,
    "prompt_strategy": PROMPT_STRATEGY,
    "lang": lang,
    "n": len(df),
    "timestamp": datetime.now().isoformat(timespec="seconds"),
}

row = {**meta, **scores}
res_df = pd.DataFrame([row])

print("=== Results ===")
display(res_df)

out_name = f"scores_{DATASET}_{MODEL_NAME}_{PROMPT_TYPE}_{PROMPT_STRATEGY}.csv"
res_df.to_csv(out_name, index=False)
print("✅ Saved:", out_name)


=== Results ===


Unnamed: 0,dataset,model,prompt_type,prompt_strategy,lang,n,timestamp,BLEU,SARI,BERTScore-P:,BERTScore-R:,BERTScore-F1:,MBERT_OR,MBERT_OP,MBERT_RP,FKGL,Flesch
0,tsar2024,gemma-3-12b-it,minimal,one,en,117,2026-02-27T16:51:47,5.36,33.32,0.2111,-0.0606,0.0707,0.7774,0.6879,0.6239,6.521909,67.748942


✅ Saved: scores_tsar2024_gemma-3-12b-it_minimal_one.csv
