### Decide which hand-crafted rules are meaning-preserving and thus safe to include in the reward model (RM) or PPO training.

- Look into a language model with targeted German support:
  - "xlm-roberta-base" (multilingual)
  - "dbmdz/bert-base-german-uncased"
  - "deepset/gbert-base"
  - "bert-base-german-dbmdz-uncased"

- Simplification score:
  - use the rule-compliance tracker
  - adding SARI as well would be too simplistic

Possible score combination:
- combine the simplification score with BERTScore
- reward = alpha * simplification_score + beta * bert_score
- alpha and beta can be tuned depending on priorities (which score is more critical?)
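
The weighted combination above can be sketched as a small helper. This is only an illustration: `combined_reward`, the default weights, and the assumption that both scores are already scaled to [0, 1] are hypothetical and not part of the pipeline.

```python
def combined_reward(simplification_score: float,
                    bert_score_f1: float,
                    alpha: float = 0.5,
                    beta: float = 0.5) -> float:
    """Weighted sum of a simplification score and a meaning-preservation score.

    Assumption: both inputs are already scaled to [0, 1]; alpha and beta are
    illustrative defaults, not tuned values.
    """
    return alpha * simplification_score + beta * bert_score_f1


# Same scores, different priorities
equal_weights = combined_reward(0.8, 0.95)
meaning_first = combined_reward(0.8, 0.95, alpha=0.3, beta=0.7)
print(equal_weights, meaning_first)
```

Raising `beta` relative to `alpha` makes the reward favour meaning preservation over rule compliance, and vice versa.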

In [1]:
# pip install torch torchvision transformers
# pip install bert-score

In [2]:
from bert_score import score
from transformers import AutoTokenizer, AutoModel
import matplotlib.pyplot as plt
import pandas as pd
import re



In [3]:
input_path = "master_data/0_original/all.txt"
output_path = "master_data/3_simplified/all_simplified_plain.txt"
#log_path = "simplification_logs/all_parsed_log_2025-09-13_23-31-26.csv"
log_path = "simplification_logs/all_parsed_log_2025-09-14_12-38-08.csv"

In [4]:
df = pd.read_csv(log_path)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112715 entries, 0 to 112714
Data columns (total 7 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   uid                        112715 non-null  int64 
 1   original                   112710 non-null  object
 2   initial_original_sentence  112715 non-null  object
 3   rule                       112715 non-null  object
 4   applied                    112715 non-null  bool  
 5   simplified                 112654 non-null  object
 6   doc_name                   112715 non-null  object
dtypes: bool(1), int64(1), object(5)
memory usage: 5.3+ MB


#### There are null rows in `simplified` (61 of 112,715), traced to the word_to_number() conversion. They need to be filtered out.

In [6]:
df = df.dropna(how='any', axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 112654 entries, 0 to 112714
Data columns (total 7 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   uid                        112654 non-null  int64 
 1   original                   112654 non-null  object
 2   initial_original_sentence  112654 non-null  object
 3   rule                       112654 non-null  object
 4   applied                    112654 non-null  bool  
 5   simplified                 112654 non-null  object
 6   doc_name                   112654 non-null  object
dtypes: bool(1), int64(1), object(5)
memory usage: 6.1+ MB


In [7]:
df.head(10)

Unnamed: 0,uid,original,initial_original_sentence,rule,applied,simplified,doc_name
0,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,clean_punctuation,False,Der Iran wird teilweise aus dem Atom-Abkommen ...,all_parsed.txt
1,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,rewrite_apposition,False,Der Iran wird teilweise aus dem Atom-Abkommen ...,all_parsed.txt
2,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,simplify_subordinate,False,Der Iran wird teilweise aus dem Atom-Abkommen ...,all_parsed.txt
3,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,convert_passive_to_active,False,Der Iran wird teilweise aus dem Atom-Abkommen ...,all_parsed.txt
4,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,normalize_verb_tense,True,Der Iran wird teilweise aus dem Atom-Abkommen ...,all_parsed.txt
5,2,Präsidentin,Brüssel Ursula von der Leyen ist die Präsident...,split_compound,True,Präsi·Dentin,all_parsed.txt
6,2,Brüssel Ursula von der Leyen ist die Präsi·Den...,Brüssel Ursula von der Leyen ist die Präsident...,clean_punctuation,False,Brüssel Ursula von der Leyen ist die Präsi·Den...,all_parsed.txt
7,2,Brüssel Ursula von der Leyen ist die Präsi·Den...,Brüssel Ursula von der Leyen ist die Präsident...,rewrite_apposition,False,Brüssel Ursula von der Leyen ist die Präsi·Den...,all_parsed.txt
8,2,Brüssel Ursula von der Leyen ist die Präsi·Den...,Brüssel Ursula von der Leyen ist die Präsident...,simplify_subordinate,False,Brüssel Ursula von der Leyen ist die Präsi·Den...,all_parsed.txt
9,2,Brüssel Ursula von der Leyen ist die Präsi·Den...,Brüssel Ursula von der Leyen ist die Präsident...,convert_passive_to_active,False,Brüssel Ursula von der Leyen ist die Präsi·Den...,all_parsed.txt


In [None]:
#df.to_csv("master_data/output_assessment/all_simplifications.csv", index=False)

## Exploration

In [None]:
# Rows logged for the compound-splitting rule
df_compound = df[df["rule"] == "split_compound"]
df_compound.info()

In [None]:
df_number = df[df["rule"] == "convert_word_to_number"]
df_number.info()

In [None]:
filtered_comp = df_compound[df_compound["applied"] == True]
filtered_comp.info()

In [None]:
filtered_number = df_number[df_number["applied"] == True]
filtered_number.info()

In [None]:
# def assess_rule_output(df):
#     results = []

#     for uid, group in df.groupby("uid"):
#         original = group["original"].iloc[0]              # the very first "original" sentence
#         simplified = group["simplified"].iloc[-1]         # the last simplification
#         applied_rules = group.loc[group["applied"] == True, "rule"].tolist()

#         results.append({
#             "uid": uid,
#             "original": original,
#             "simplified": simplified,
#             "applied_rules": applied_rules
#         })

#     return pd.DataFrame(results)

## Outdated approach: aggregating from applied=True rows

In [None]:
df.info()

In [None]:
# Keep only the rows where a rule was applied
df_applied = df[df["applied"] == True]
df_applied.info()

In [None]:
df_applied_dropped.head(15)

In [None]:
df_applied_dropped.to_csv("master_data/output_assessment/all_applied_rules.csv", index=False)

In [None]:
#OUTDATED code for the original simplification approach on original/complex sentences

# # Get the last applied simplification per sentence UID
# # (Assume rules are applied in order of appearance)
# last_applied_per_uid = df_applied.groupby("uid").tail(1)

# # Also get original sentences from any row (all identical for a UID)
# originals_per_uid = df.groupby("uid").first().reset_index()[["uid", "original"]]

# # Merge to get (original, final simplified) pairs
# final_pairs = pd.merge(originals_per_uid, last_applied_per_uid[["uid", "simplified"]], on="uid")

# # Extract all applied rules per UID (True only)
# # gives out UID and a second column of all applied rules according to uid

# applied_rules_per_uid = (
#     df[df["applied"] == True]
#     .groupby("uid")["rule"]
#     .apply(list)
#     .reset_index()
#     .rename(columns={"rule": "applied_rules"})
# )

# # Merge with the final_pairs (which already has original + final simplified)
# final_pairs_with_rules = pd.merge(final_pairs, applied_rules_per_uid, on="uid", how="left")

In [None]:
# ### ====OUTDATED, was applied on applied=True ==== ###
# # Get the unique original sentences in the order of their first appearance
# #unique_originals = df_applied['original'].unique()
# grouped = df_applied_dropped.groupby('uid')

# processed_data = []

# #for sentence in unique_originals:
# for uid, group in grouped:
#         # Get all rows for the current original sentence
#         #group = df_applied[df_applied['original'] == sentence]
        
#         # Find all rules that were successfully applied for this group
#         applied_rules_list = group[group['applied'] == True]['rule'].tolist()
        
#         # We only want to include sentences where at least one rule was applied
#         if not applied_rules_list:
#             continue

#         # De-duplicate the list of rules while preserving order
#         unique_applied_rules = list(dict.fromkeys(applied_rules_list))

#         # Heuristic: The "main" original sentence is the longest one in the group
#         main_original_sentence = group.loc[group['original'].str.len().idxmax(), 'original']

#         # The final simplification is the 'simplified' text from the very last logged step
#         final_simplification_text = group.loc[group.index.max(), 'simplified']

#         # Alternative (if you want the very last simplification regardless of application):    
#         # The final simplification is the 'simplified' text from the very last entry in the group
#         #final_simplification_text = group['simplified'].iloc[-1]
        
#         # Append the structured data
#         processed_data.append({
#             'uid': uid,
#             'original_sentence': main_original_sentence,
#             'final_simplification': final_simplification_text,
#             'applied_rules': unique_applied_rules
#         })

# # Create the final DataFrame from our processed list
# result_df = pd.DataFrame(processed_data)

# Filter and aggregate from the simplification log

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 112654 entries, 0 to 112714
Data columns (total 7 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   uid                        112654 non-null  int64 
 1   original                   112654 non-null  object
 2   initial_original_sentence  112654 non-null  object
 3   rule                       112654 non-null  object
 4   applied                    112654 non-null  bool  
 5   simplified                 112654 non-null  object
 6   doc_name                   112654 non-null  object
dtypes: bool(1), int64(1), object(5)
memory usage: 6.1+ MB


In [9]:
# Define rule categories
PARTIAL_RULES = {"split_compound", "convert_word_to_number"}
SPLIT_RULES = {"rewrite_apposition", "simplify_subordinate"}

In [10]:
def normalize(s: str) -> str:
    """Whitespace-normalize a string for duplicate checks."""
    s = str(s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def split_into_sentences(text: str):
    """
    Very lightweight sentence splitter for cleanup/dedup.
    Splits on . ! ? while keeping punctuation.
    """
    text = text.strip()
    if not text:
        return []
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [normalize(p) for p in parts if normalize(p)]

def dedup_preserve_order(items):
    """Remove duplicates while preserving order."""
    seen = set()
    out = []
    for it in items:
        key = normalize(it)
        if key not in seen:
            out.append(it)
            seen.add(key)
    return out

records = []

# Process sentence-by-sentence groups
for uid, group in df.groupby("uid", sort=False):
    group = group.reset_index(drop=True)

    # Collect applied rules
    applied_rules_list = group[group["applied"] == True]["rule"].tolist()
    if not applied_rules_list:
        continue

    # Deduplicate rules while preserving order
    seen_rules = set()
    unique_applied_rules = []
    for r in applied_rules_list:
        if r not in seen_rules:
            unique_applied_rules.append(r)
            seen_rules.add(r)

    # Did any split-type rule fire?
    split_applied = any(r in SPLIT_RULES for r in unique_applied_rules)

    # Start from the true original
    main_sentence = normalize(group["initial_original_sentence"].iloc[0])
    sentences = [main_sentence]
    seen_sentences = {main_sentence}

    # Replay transformations
    for _, row in group.iterrows():
        if not row["applied"]:
            continue

        rule = row["rule"]
        simplified_piece = normalize(row["simplified"]) if pd.notna(row["simplified"]) else ""
        original_piece   = normalize(row["original"]) if pd.notna(row["original"]) else ""

        if rule in PARTIAL_RULES:
            # Patch fragment into the last sentence
            if original_piece and original_piece in sentences[-1]:
                sentences[-1] = sentences[-1].replace(original_piece, simplified_piece, 1)

        elif rule in SPLIT_RULES:
            # Append new sentence(s), deduped
            if simplified_piece:
                new_sents = split_into_sentences(simplified_piece) or [simplified_piece]
                for ns in new_sents:
                    ns_norm = normalize(ns)
                    if ns_norm not in seen_sentences:
                        sentences.append(ns)
                        seen_sentences.add(ns_norm)

        else:
            # Full-sentence rewrite
            if simplified_piece:
                sentences[-1] = simplified_piece

    # --- Post-processing ---
    # sentences = dedup_preserve_order(sentences)

    # # Drop the original if split happened and we now have other sentences
    # original_norm = normalize(group["initial_original_sentence"].iloc[0])
    # if split_applied and any(normalize(s) != original_norm for s in sentences):
    #     sentences = [s for s in sentences if normalize(s) != original_norm]

    # # Join final sentences
    # final_text = " ".join(sentences).strip()

    # # Safety: remove the original if it still appears verbatim in the joined text
    # if split_applied and len(sentences) > 1 and original_norm in normalize(final_text):
    #     raw_original = group["initial_original_sentence"].iloc[0]
    #     final_text = final_text.replace(raw_original, "", 1).strip()
    #     final_text = normalize(final_text)

    # --- Post-processing ---
    sentences = dedup_preserve_order(sentences)

    original_raw  = group["initial_original_sentence"].iloc[0]
    original_norm = normalize(original_raw)

    if split_applied:
        # Check if the original still appears exactly as one of the collected sentences
        has_exact_original = any(normalize(s) == original_norm for s in sentences)
        has_transformed    = any(original_norm in normalize(s) and normalize(s) != original_norm for s in sentences)

        if has_exact_original and not has_transformed:
            # Only drop the original if it is *unchanged* and other sentences exist
            if len(sentences) > 1:
                sentences = [s for s in sentences if normalize(s) != original_norm]

    # Join final sentences
    final_text = " ".join(sentences).strip()


    # Store result
    records.append({
        "uid": uid,
        "original_sentence": group["initial_original_sentence"].iloc[0],
        "final_simplification": final_text,
        "applied_rules": unique_applied_rules
    })

# Build DataFrame
result_df = pd.DataFrame(records)
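
To make the replay logic above concrete, here is a minimal, pandas-free sketch on hand-written steps. The function `replay_steps` and the toy log entries are hypothetical. Only the three rule categories (patch a fragment, append split sentences, replace the whole sentence) mirror the cell above, and the post-processing that drops the unchanged original is left out.

```python
import re

PARTIAL_RULES = {"split_compound", "convert_word_to_number"}
SPLIT_RULES = {"rewrite_apposition", "simplify_subordinate"}


def replay_steps(initial_sentence, steps):
    """Replay applied (rule, original_piece, simplified_piece) steps in order."""
    sentences = [initial_sentence]
    for rule, original_piece, simplified_piece in steps:
        if rule in PARTIAL_RULES:
            # Patch the fragment into the most recent sentence (first match only)
            if original_piece and original_piece in sentences[-1]:
                sentences[-1] = sentences[-1].replace(original_piece, simplified_piece, 1)
        elif rule in SPLIT_RULES:
            # Append every new sentence that is not already present
            for piece in re.split(r"(?<=[.!?])\s+", simplified_piece.strip()):
                if piece and piece not in sentences:
                    sentences.append(piece)
        elif simplified_piece:
            # Any other rule rewrites the whole current sentence
            sentences[-1] = simplified_piece
    return sentences


# Toy log for one uid (hypothetical values)
steps = [
    ("split_compound", "Corona-Test", "Corona·Test"),
    ("convert_word_to_number", "zwei", "2"),
]
result = replay_steps("Man kann zwei Mal einen Corona-Test machen.", steps)
print(result)
```

Because partial rules log only the changed fragment, they have to be patched into the running sentence, whereas full-sentence rules can simply overwrite it.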

In [11]:
result_df.tail(15)

Unnamed: 0,uid,original_sentence,final_simplification,applied_rules
11092,16520,Derzeit gibt es eine schlimme Gesundheits-Kris...,Anschober hat Derzeit gibt es eine schlimme Ge...,"[split_compound, normalize_verb_tense]"
11093,16521,Deshalb braucht Österreich einen Gesundheits-M...,Deshalb braucht Österreich einen Gesundheits·M...,[split_compound]
11094,16524,Ein Arzt wird neuer Gesundheits-Minister von Ö...,Ein Arzt wird neuer Gesundheits·Minister von Ö...,[split_compound]
11095,16527,Aber jetzt wird er in Wien und in Niederösterr...,Man hat er verlängert.,[convert_passive_to_active]
11096,16528,Dort dauert der Lockdown nun bis 2. Mai .,Dort dauert der Lockdown nun bis 2 Mai .,[convert_word_to_number]
11097,16529,Der Lockdown sollte bis zum 18. April dauern .,Der Lockdown hat bis zum 18 April dauern gesollt.,"[convert_word_to_number, normalize_verb_tense]"
11098,16530,"Man weiß noch nicht , ob der Lockdown auch im ...",Man hat Man nicht verlängert.,[convert_passive_to_active]
11099,16531,Das soll am Mittwoch entschieden werden .,Man hat Das entschieden.,[convert_passive_to_active]
11100,16535,Im Ramadan essen und trinken die Muslime tagsü...,Im Ramadan essen und trinken die Muslime tagsü...,[normalize_verb_tense]
11101,16537,Der Fußball-Trainer Adi Hütter wechselt zu Mön...,Der Fußball·Trainer Adi Hütter wechselt zu Mön...,[split_compound]


In [12]:
# Sort the final result by UID to approximate the original file order
result_df = result_df.sort_values(by='uid').reset_index(drop=True)
result_df

Unnamed: 0,uid,original_sentence,final_simplification,applied_rules
0,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,[normalize_verb_tense]
1,2,Brüssel Ursula von der Leyen ist die Präsident...,Brüssel Ursula von der Leyen ist die Präsi·Den...,[split_compound]
2,3,Am Mittwoch hat sie ihre erste Rede zur Lage d...,Am Mittwoch hat sie ihre 1. Rede zur Lage der ...,[convert_word_to_number]
3,4,Bis zum Jahr 2030 soll es in der Europäische U...,Bis zum Jahr 2030 soll es in der Europäische U...,[normalize_verb_tense]
4,5,"Das ist sehr viel , denn in den letzten 29 Jah...","Das hat ist sehr viel , denn in den letzten 29...",[normalize_verb_tense]
...,...,...,...,...
11102,16539,Mönchengladbach Adi Hütter ist ein Fußball-Tra...,Mönchengladbach Adi Hütter ist ein Fußball·Tra...,[split_compound]
11103,16541,Hütter wird aber mit Saison-Ende in der deutsc...,Hütter wird aber mit Saison-Ende in der deutsc...,[normalize_verb_tense]
11104,16544,2024.,2024,[convert_word_to_number]
11105,16545,Für Hütter muss Mönchengladbach eine hohe Ablö...,Für Hütter muss Mönchengladbach eine hohe Ablö...,[normalize_verb_tense]


In [13]:
result_df.head()

Unnamed: 0,uid,original_sentence,final_simplification,applied_rules
0,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,[normalize_verb_tense]
1,2,Brüssel Ursula von der Leyen ist die Präsident...,Brüssel Ursula von der Leyen ist die Präsi·Den...,[split_compound]
2,3,Am Mittwoch hat sie ihre erste Rede zur Lage d...,Am Mittwoch hat sie ihre 1. Rede zur Lage der ...,[convert_word_to_number]
3,4,Bis zum Jahr 2030 soll es in der Europäische U...,Bis zum Jahr 2030 soll es in der Europäische U...,[normalize_verb_tense]
4,5,"Das ist sehr viel , denn in den letzten 29 Jah...","Das hat ist sehr viel , denn in den letzten 29...",[normalize_verb_tense]


In [14]:
result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11107 entries, 0 to 11106
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   uid                   11107 non-null  int64 
 1   original_sentence     11107 non-null  object
 2   final_simplification  11107 non-null  object
 3   applied_rules         11107 non-null  object
dtypes: int64(1), object(3)
memory usage: 347.2+ KB


In [15]:
df_cleanup = result_df.copy()

In [16]:
df_cleanup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11107 entries, 0 to 11106
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   uid                   11107 non-null  int64 
 1   original_sentence     11107 non-null  object
 2   final_simplification  11107 non-null  object
 3   applied_rules         11107 non-null  object
dtypes: int64(1), object(3)
memory usage: 347.2+ KB


In [17]:

df_cleanup.columns = df_cleanup.columns.str.strip() # This removes leading/trailing spaces from each column name

def clean_all_whitespace(sentence):
  """
  Collapses runs of whitespace into a single space, strips leading/trailing
  whitespace, and removes the space before punctuation marks.
  """
  # 0: If the input is not a string, return it as is
  if not isinstance(sentence, str):
      return sentence
  # 1: Collapse internal whitespace and strip both ends
  sentence = re.sub(r'\s+', ' ', sentence).strip()
  # 2: Remove whitespace before punctuation (e.g. "Mai ." -> "Mai.")
  sentence = re.sub(r'\s+([.,:;?!])', r'\1', sentence)
  return sentence

columns_to_clean = ['original_sentence', 'final_simplification']

print(f"Attempting to strip whitespace from columns: {', '.join(columns_to_clean)}")

# Loop through the selected columns and apply the cleaning function
for col in columns_to_clean:
  if col in df_cleanup.columns and df_cleanup[col].dtype == 'object':
    print(f"Cleaning column: '{col}'...")
    # Apply clean_all_whitespace to each sentence in the column
    df_cleanup[col] = df_cleanup[col].apply(clean_all_whitespace)
  else:
    print(f"Column '{col}' not found or is not a text column.")

Attempting to strip whitespace from columns: original_sentence, final_simplification
Cleaning column: 'original_sentence'...
Cleaning column: 'final_simplification'...
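
As a quick illustration of what the cleanup above does to a tokenized sentence, here is a standalone copy of the two regex steps applied to a made-up sample in the style of the log. `tidy` is a hypothetical name used only for this sketch.

```python
import re


def tidy(sentence: str) -> str:
    """Collapse whitespace, then remove the space before punctuation."""
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return re.sub(r"\s+([.,:;?!])", r"\1", sentence)


sample = "Man weiß noch nicht , ob der  Lockdown verlängert wird ."
cleaned = tidy(sample)
print(cleaned)
```

The order matters: collapsing whitespace first means the punctuation step only ever has to remove a single space.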


In [18]:
print(df_cleanup.head().to_markdown(index=False))

|   uid | original_sentence                                                                              | final_simplification                                                                           | applied_rules              |
|------:|:-----------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------|:---------------------------|
|     1 | Der Iran wird teilweise aus dem Atom-Abkommen aussteigen.                                      | Der Iran wird teilweise aus dem Atom-Abkommen aussteigen.                                      | ['normalize_verb_tense']   |
|     2 | Brüssel Ursula von der Leyen ist die Präsidentin von der Europäische Union-Kommission.         | Brüssel Ursula von der Leyen ist die Präsi·Dentin von der Europäische Union-Kommission.        | ['split_compound']         |
|     3 | Am Mittwoch hat sie ihre erste Rede zur Lage der Europäisc

In [19]:
df_cleanup.head(20)

Unnamed: 0,uid,original_sentence,final_simplification,applied_rules
0,1,Der Iran wird teilweise aus dem Atom-Abkommen ...,Der Iran wird teilweise aus dem Atom-Abkommen ...,[normalize_verb_tense]
1,2,Brüssel Ursula von der Leyen ist die Präsident...,Brüssel Ursula von der Leyen ist die Präsi·Den...,[split_compound]
2,3,Am Mittwoch hat sie ihre erste Rede zur Lage d...,Am Mittwoch hat sie ihre 1. Rede zur Lage der ...,[convert_word_to_number]
3,4,Bis zum Jahr 2030 soll es in der Europäische U...,Bis zum Jahr 2030 soll es in der Europäische U...,[normalize_verb_tense]
4,5,"Das ist sehr viel, denn in den letzten 29 Jahr...","Das hat ist sehr viel, denn in den letzten 29 ...",[normalize_verb_tense]
5,7,Die Europäische Union-Kommission will vor alle...,Die Europäische Union-Kommission will vor alle...,[normalize_verb_tense]
6,8,Dazu gehören zum Beispiel die Renovierung von ...,Dazu gehören zum Beispiel die Renovierung von ...,[split_compound]
7,10,Sie schlägt zum Beispiel Europäische Union-Ges...,Sie schlägt zum Beispiel Europäische Union·Ges...,[split_compound]
8,15,Die Maßnahmen gegen den Corona-Virus sind sehr...,Die Maßnahmen gegen den Corona·Virus sind sehr...,[split_compound]
9,17,Vor allem in Europa und in den USA schrumpft d...,man angenommen hat. Dann Vor allem in Europa u...,[simplify_subordinate]


In [20]:
output_filename = 'master_data/output_assessment/ordered_simplifications_with_rules_clean_FINAL_FINAL.csv'
df_cleanup.to_csv(output_filename, index=False)

In [21]:
print(f"\nFinally saved the final, ordered file: '{output_filename}'")
print("\nHere is a preview of the new format:")
print(df_cleanup.head().to_markdown(index=False))


Finally saved the final, ordered file: 'master_data/output_assessment/ordered_simplifications_with_rules_clean_FINAL_FINAL.csv'

Here is a preview of the new format:
|   uid | original_sentence                                                                              | final_simplification                                                                           | applied_rules              |
|------:|:-----------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------|:---------------------------|
|     1 | Der Iran wird teilweise aus dem Atom-Abkommen aussteigen.                                      | Der Iran wird teilweise aus dem Atom-Abkommen aussteigen.                                      | ['normalize_verb_tense']   |
|     2 | Brüssel Ursula von der Leyen ist die Präsidentin von der Europäische Union-Kommission.         | Brüssel Ursula von der Leye

In [22]:
df_cleanup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11107 entries, 0 to 11106
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   uid                   11107 non-null  int64 
 1   original_sentence     11107 non-null  object
 2   final_simplification  11107 non-null  object
 3   applied_rules         11107 non-null  object
dtypes: int64(1), object(3)
memory usage: 347.2+ KB


In [34]:
# Keep only uid, original_sentence and final_simplification
final_pairs = df_cleanup[['uid', 'original_sentence', 'final_simplification']]
final_pairs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11107 entries, 0 to 11106
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   uid                   11107 non-null  int64 
 1   original_sentence     11107 non-null  object
 2   final_simplification  11107 non-null  object
dtypes: int64(1), object(2)
memory usage: 260.4+ KB


In [35]:
final_pairs.to_csv("master_data/output_assessment/final_simplified_pairs_cleaned_FINAL.csv", index=False)

# Assess the performance using BERTScore

In [1]:
import ast
import json
from collections import Counter, defaultdict

import numpy as np
import pandas as pd
from bert_score import score




In [None]:
#original -> simplified

# Load your exported sentence pairs
df = pd.read_csv("master_data/output_assessment/final_simplified_pairs_cleaned_FINAL.csv")

originals = df["original_sentence"].tolist()
simplifieds = df["final_simplification"].tolist()

# Compute BERTScore with the multilingual xlm-roberta-large model
# (model_type is set explicitly, so lang="de" does not select the model)
P, R, F1 = score(simplifieds, originals, model_type="xlm-roberta-large", lang="de")

# Add scores back to dataframe
df["bertscore_f1"] = F1.tolist()

# Save the results
df.to_csv("bert_score_results_25sept.csv", index=False)
print("Done! Results saved to 'bert_score_results_25sept.csv'.")

Done! Results saved to 'bert_score_results_25sept.csv'.
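
For reference, BERTScore greedily matches each token embedding against the tokens of the other sentence. With reference tokens $x$ (here the original sentence), candidate tokens $\hat{x}$ (here the simplification), and length-normalized contextual embeddings, the standard definitions without IDF weighting are:

```latex
R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
P = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
F_1 = \frac{2 P R}{P + R}
```

Only $F_1$ is written to the results file in the cell above; $P$ and $R$ are computed but discarded.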


In [49]:
# -------------------------------------------------------
# 1. Load your files
# -------------------------------------------------------
# BERTScore results (per sentence)
df_scores = pd.read_csv("bert_score_results_25sept.csv")

# Rules + sentence pairs
df_rules = pd.read_csv("master_data/output_assessment/ordered_simplifications_with_rules_clean_FINAL_FINAL.csv")
# -------------------------------------------------------
# 2. Merge BERTScore into rules file
# -------------------------------------------------------
df = df_rules.merge(
    df_scores[["uid", "bertscore_f1"]],
    on="uid", how="left"
)

# -------------------------------------------------------
# 3. Helper: compute mean, std and a bootstrap 95% CI of the mean
# -------------------------------------------------------
def mean_ci(x, n_boot=2000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    arr = np.array(x, dtype=float)
    boots = [np.mean(rng.choice(arr, size=len(arr), replace=True)) for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100*alpha/2, 100*(1-alpha/2)])
    return float(np.mean(arr)), float(np.std(arr)), float(lo), float(hi)

# -------------------------------------------------------
# 4. Aggregate scores per rule
# -------------------------------------------------------
rule_to_scores = defaultdict(list)

for _, row in df.iterrows():
    if pd.isna(row["applied_rules"]) or pd.isna(row["bertscore_f1"]):
        continue

    rules = row["applied_rules"]

    # Parse applied_rules robustly
    if isinstance(rules, str):
        try:
            rules = json.loads(rules)  # works if JSON (with double quotes)
        except Exception:
            try:
                rules = ast.literal_eval(rules)  # works if Python-style list
            except Exception:
                continue

    if not isinstance(rules, list):
        continue

    for rule in rules:
        rule_to_scores[rule].append(row["bertscore_f1"])

# -------------------------------------------------------
# 5. Build results table
# -------------------------------------------------------
results = []
for rule, scores in rule_to_scores.items():
    mean_, std_, lo, hi = mean_ci(scores)
    results.append({
        "rule": rule,
        "N": len(scores),
        "mean_f1": mean_,
        "std": std_,
        "ci95_lo": lo,
        "ci95_hi": hi
    })

df_results = pd.DataFrame(results).sort_values("mean_f1")

print("\n=== Average BERTScore-F1 per rule (ranked) ===")
print(df_results.to_string(index=False, float_format="%.4f"))



=== Average BERTScore-F1 per rule (ranked) ===
                     rule    N  mean_f1    std  ci95_lo  ci95_hi
convert_passive_to_active 1653   0.9272 0.0292   0.9258   0.9286
     simplify_subordinate  273   0.9361 0.0216   0.9335   0.9387
           split_compound 4541   0.9623 0.0227   0.9616   0.9629
        clean_punctuation  116   0.9678 0.0246   0.9630   0.9722
   convert_word_to_number  917   0.9690 0.0295   0.9670   0.9709
       rewrite_apposition  214   0.9710 0.0274   0.9673   0.9747
     normalize_verb_tense 6409   0.9728 0.0244   0.9722   0.9734
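
The percentile bootstrap in `mean_ci` can be illustrated without NumPy. The sketch below is a standard-library re-implementation on made-up scores. `bootstrap_mean_ci` and the sample values are hypothetical, it uses nearest-rank percentiles instead of NumPy's interpolation, and its random stream differs from NumPy's, so it will not reproduce the numbers in the table.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Nearest-rank percentiles of the bootstrap distribution
    lo = boot_means[int((alpha / 2) * (n_boot - 1))]
    hi = boot_means[int((1 - alpha / 2) * (n_boot - 1))]
    return statistics.fmean(values), lo, hi


scores = [0.93, 0.95, 0.97, 0.91, 0.99, 0.96, 0.94, 0.98]  # made-up F1 values
mean_, lo, hi = bootstrap_mean_ci(scores)
print(f"mean={mean_:.4f}, 95% CI=[{lo:.4f}, {hi:.4f}]")
```

With thousands of sentences per rule the intervals become very narrow, which is why most rules in the table have intervals that do not overlap.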


In [50]:
# -------------------------------------------------------
# 6. Extract good/bad examples per rule
# -------------------------------------------------------
examples = defaultdict(lambda: {"good": [], "bad": []})

for rule, scores in rule_to_scores.items():
    # Literal substring match on the stringified rule list
    subset = df[df["applied_rules"].str.contains(rule, na=False, regex=False)]
    if subset.empty:
        continue

    # Sort by score
    subset_sorted = subset.sort_values("bertscore_f1")

    # Worst 2
    bad = subset_sorted.head(2)[["original_sentence", "final_simplification", "bertscore_f1"]].to_dict("records")
    # Best 2
    good = subset_sorted.tail(2)[["original_sentence", "final_simplification", "bertscore_f1"]].to_dict("records")

    examples[rule]["bad"] = bad
    examples[rule]["good"] = good

# Example printout
for rule, ex in examples.items():
    print(f"\n--- {rule} ---")
    print("Good examples:")
    for e in ex["good"]:
        print(f"  {e['bertscore_f1']:.3f} | {e['original_sentence']} -> {e['final_simplification']}")
    print("Bad examples:")
    for e in ex["bad"]:
        print(f"  {e['bertscore_f1']:.3f} | {e['original_sentence']} -> {e['final_simplification']}")



--- normalize_verb_tense ---
Good examples:
  1.000 | Mit der Quarantäne will man verhindern, dass andere Menschen krank werden. -> Mit der Quarantäne will man verhindern, dass andere Menschen krank werden.
  1.000 | Die Wahl soll im September sein. -> Die Wahl soll im September sein.
Bad examples:
  0.850 | Dabei kam heraus: -> ist Dabei heraus, gekommen
  0.851 | Am Donnerstag waren es nur noch 360. -> es ist Am Donners·Tag nur noch 360 gewesen

--- split_compound ---
Good examples:
  0.993 | In der Steiermark kann man in Liezen, Bruck an der Mur, Gleisdorf, Judenburg und Graz einen Corona-Test machen. -> In der Steiermark kann man in Liezen, Bruck an der Mur, Gleisdorf, Judenburg und Graz einen Corona·Test machen.
  0.994 | Nach dem so genannten Ibiza-Skandal ist der damalige FPÖ-Chef Heinz-Christian Strache als Vize-Kanzler zurück getreten. -> Nach dem so genannten Ibiza-Skandal ist der damalige FPÖ-Chef Heinz-Christian Strache als Vize·Kanzler zurück getreten.
Bad examples:
  0.8
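
Note that the per-rule aggregation credits a sentence's score to every rule in its `applied_rules` list, so a rule's mean can be pulled down by another rule that fired on the same sentence. One way to check this, sketched here on made-up rows, is to look only at sentences where exactly one rule was applied. The helper `single_rule_means` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def single_rule_means(rows):
    """Mean score per rule, using only rows where exactly one rule fired."""
    buckets = defaultdict(list)
    for rules, score in rows:
        if len(rules) == 1:
            buckets[rules[0]].append(score)
    return {rule: fmean(vals) for rule, vals in buckets.items()}


rows = [  # (applied_rules, bertscore_f1), made-up values
    (["split_compound"], 0.98),
    (["split_compound", "convert_passive_to_active"], 0.90),
    (["convert_passive_to_active"], 0.92),
    (["split_compound"], 0.96),
]
means = single_rule_means(rows)
print(means)
```

Comparing these single-rule means with the all-sentences means would show how much of a rule's apparent meaning loss comes from co-occurring rules.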

In [4]:
df_count = pd.read_csv("master_data/output_assessment/ordered_simplifications_with_rules_clean_FINAL_FINAL.csv")

# Helper to parse applied_rules into Python list
def parse_rules(x):
    if pd.isna(x):
        return []
    if isinstance(x, list):
        return x
    if isinstance(x, str):
        try:
            return json.loads(x)  # works if JSON with double quotes
        except Exception:
            try:
                return ast.literal_eval(x)  # works if Python-style ['rule1','rule2']
            except Exception:
                return []
    return []

df_count["applied_rules"] = df_count["applied_rules"].apply(parse_rules)

# Flatten all rules into one big list
all_rules = [rule for rules in df_count["applied_rules"] for rule in rules]

# Count occurrences
rule_counts = Counter(all_rules)

# Convert to DataFrame for easy viewing
df_counts = pd.DataFrame(rule_counts.items(), columns=["rule", "count"]).sort_values("count", ascending=False)

print(df_counts)

                        rule  count
0       normalize_verb_tense   6409
1             split_compound   4541
6  convert_passive_to_active   1653
2     convert_word_to_number    917
3       simplify_subordinate    273
5         rewrite_apposition    214
4          clean_punctuation    116
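
The two-step parsing in `parse_rules` matters because pandas writes Python lists to CSV using their string form, which uses single quotes and is therefore not valid JSON. A minimal standard-library sketch of the same fallback (`parse_rule_list` and the sample cells are hypothetical):

```python
import ast
import json
from collections import Counter


def parse_rule_list(text: str) -> list:
    """Parse a stringified list, trying JSON first and Python literals second."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        try:
            return ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return []


cells = [
    "['split_compound', 'normalize_verb_tense']",  # Python-style, as stored in the CSV
    '["split_compound"]',                          # JSON-style
    "not a list",                                  # unparsable, yields []
]
counts = Counter(rule for cell in cells for rule in parse_rule_list(cell))
print(counts)
```

Unparsable cells are silently counted as "no rules", so a sudden drop in totals is worth checking against the raw file.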


In [None]:
# #Example calc

# # Original vs. Simplified sentences
# originals = ["Der Hund läuft schnell zur Tür."]
# simplifieds = ["Der Hund rennt zur Tür."]

# # Compute BERTScore using a German or multilingual model
# P, R, F1 = score(simplifieds, originals, lang="de", model_type="bert-base-multilingual-cased")

# print(f"Precision: {P.mean().item():.4f}")
# print(f"Recall: {R.mean().item():.4f}")
# print(f"F1: {F1.mean().item():.4f}")
