1. Load a 'toy' policy model
2. Define a dummy reward function (e.g. reward = length of output or presence of a keyword)
3. Run one PPO iteration
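
A minimal sketch of what the dummy reward in step 2 could look like. The function name `dummy_reward`, the keyword `"einfach"`, the cap of 50 tokens, and the 50/50 mix are all assumptions for illustration, not part of the actual pipeline.

```python
def dummy_reward(text: str, keyword: str = "einfach", max_len: int = 50) -> float:
    """Toy reward: half from a length score, half from a keyword bonus."""
    # Length component: whitespace-separated tokens, capped at max_len, scaled to [0, 1].
    n_tokens = len(text.split())
    length_score = min(n_tokens, max_len) / max_len
    # Keyword component: 1.0 if the keyword occurs (case-insensitive), else 0.0.
    keyword_score = 1.0 if keyword.lower() in text.lower() else 0.0
    return 0.5 * length_score + 0.5 * keyword_score


# Hypothetical usage: score two candidate generations.
rewards = [dummy_reward(t) for t in ["Das ist einfach.", "Das ist schwer."]]
```

A reward like this is only good for checking that the PPO loop runs end to end; it says nothing about simplification quality.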

Once that is done, bring Ollama / Groq into the picture
- After training is done, for a lightweight inference deployment:
    - export the fine-tuned weights as a Hugging Face repo
    - convert them to GGUF format (the successor to GGML) for llama.cpp or Ollama
    - spin up an OpenWebUI or Groq instance to serve them with low latency

Created a conda env with Python 3.10.
- Then:
    - pip install torch transformers accelerate trl datasets

## Train a reward model

In [1]:
import numpy as np
import torch
from datasets import Dataset, DatasetDict
from tqdm.auto import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    GenerationConfig,
    Trainer,
    TrainingArguments,
)
from trl import PPOConfig, PPOTrainer

W0914 18:39:46.663000 32807 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


  warn("The installed version of bitsandbytes was compiled without GPU support. "


In [2]:
# for processing:
from transformers import GPT2Tokenizer
from trl import AutoModelForCausalLMWithValueHead

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [4]:
from ipynb.fs.full.reward_computation import compute_reward

Loading data file - german_dict/german_utf8.dic
0.0
1.0
Checking sentence: 'Der Riesenräder und das Riesenrad oder die Riesenräder sind sehr speziell.'
Dissect compound:  Riesenrad
Violation detected: Found unsplit compound 'Riesenrad'
Has unsplit compound violation: True

Checking sentence: 'Die Hütte steht in der Sonne.'
Dissect compound:  Hütte
Dissect compound:  Sonne
Has unsplit compound violation: False

Checking sentence: 'Der Donaudampfschifffahrtskapitänsmützenabzeichen und zweihundert Riesenrad sind sehr speziell.'
--------------------
Dissect compound:  Donaudampfschifffahrtskapitänsmützenabzeichen
Dissect compound:  Riesenrad
Compound violation score: 0.00
Number violation score: 0.00
--------------------

Checking sentence: 'Die Hütte steht in der Sonne.'
--------------------
Dissect compound:  Hütte
Dissect compound:  Sonne
Compound violation score: 1.00
Number violation score: 0.00
--------------------

Checking sentence: 'Das zweite Haus wurde drei Mal verkauft und kost

No sentence-transformers model found with name deepset/gbert-base. Creating a new one with mean pooling.


Model similarity function set to: 'dot'
Model loaded.
Grammar score (good): 1.00
Grammar score (bad): 0.89
Sentence: 'Diese Sätze sind gut nicht schreiben.'
Grammar score: 1.00

Sentence: 'Dieser Satz sind gut nicht schreiben.'
Grammar score: 0.86
Found 1 errors in the sentence: 'Dieser Satz sind falsch. Man sollte das vermeiden.'

--- Error 1 ---
Message: Bitte prüfen, ob hier „ist“ stehen sollte.
Category: GRAMMAR
Rule ID: DE_SUBJECT_VERB_AGREEMENT
Context: 'Dieser Satz sind falsch. Man sollte das vermeiden.'
Suggested replacements: ['ist']



# Start with Training Reward Model

# 1) Load the data

In [None]:
# Load the dataset
df = pd.read_csv("data/ordered_simplifications_with_rules_clean_FINAL.csv", index_col=0)

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.rename(columns={"original_sentence": "original", "final_simplification": "simplified"}, inplace=True)

In [None]:
df.info()

In [None]:
df = df.sample(n=50, random_state=42)

In [None]:
df.info()

# Choice of the Reward Model

AutoModelForSequenceClassification
  - one of the AutoModel classes from Hugging Face
  - loads a pretrained encoder and attaches a classification head on top

That classification head is used
  - for classification: outputs logits over N classes (e.g. positive/negative)
  - for regression: with num_labels=1, it outputs a single scalar
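
To make the difference between the two head types concrete, here is a dependency-free sketch. It is not the Hugging Face implementation: `linear_head`, `mse_loss`, and all numbers are made up for illustration. The only point is that the same kind of linear layer yields N logits for classification and a single scalar for regression, and the scalar is compared to a float label with MSE.

```python
def linear_head(pooled, weights, bias):
    """Apply a linear head: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]


def mse_loss(preds, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


pooled = [0.5, -1.0, 2.0]  # hypothetical pooled encoder output (hidden size 3)

# Classification head (num_labels=2): one logit per class.
cls_logits = linear_head(pooled, [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.4]], [0.0, 0.1])

# Regression head (num_labels=1): a single scalar, trained against a float reward.
reg_output = linear_head(pooled, [[0.2, 0.1, 0.3]], [0.05])
loss = mse_loss(reg_output, [0.8])  # 0.8 is a hypothetical target reward
```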

### BERT base vs DistilBERT

BERT base uncased
 - 12 layers, 110M parameters.
 - Higher accuracy, but slower and heavier to train and to run inference with.

DistilBERT - _Picking this one_
- A distilled (compressed) version of BERT.
- ~40% smaller, ~60% faster, only ~3% loss in accuracy.
- Often preferred as a reward model because PPO will call it a LOT (every generation gets scored).
- Faster inference = cheaper RL training, and therefore faster PPO training.

## 2) Define Grid Search + Model and Hyperparameter Space

In [5]:
# Make sure every trial of the hyperparameter search starts from a freshly initialized copy of the chosen base model
def model_init():
    model_name = "distilbert-base-german-cased"
    return AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1, problem_type="regression"
    )

In [None]:
def model_hp_space(trial):
    return {
        # Log-uniform search; typical fine-tuning values are 1e-5 to 3e-5 (open question: what else to add here?)
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        # Kept small for speed (open question: how to set this range?)
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }
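
The `log=True` flag on the learning rate means the value is drawn uniformly in log space rather than linearly, so small learning rates are not under-sampled. The sketch below imitates that with the standard library only; `sample_log_uniform` is a hypothetical helper, not an Optuna function.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 5e-5, rng) for _ in range(1000)]

# In log space the midpoint is the geometric mean sqrt(low * high), about 2.24e-5,
# so roughly half of the draws should fall below it.
geo_mid = math.sqrt(1e-5 * 5e-5)
share_below = sum(s < geo_mid for s in samples) / len(samples)
```

With a linear search the midpoint would be 3e-5 instead, which pushes more trials toward the upper end of the range.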

In [7]:
CANDIDATE_WEIGHTS = [
    {"name": "balanced", "weights": {"rules_score": 0.5, "meaning_score": 0.3, "grammar_score": 0.2}}, #baseline
    {"name": "rules_heavy", "weights": {"rules_score": 0.7, "meaning_score": 0.2, "grammar_score": 0.1}},
    {"name": "meaning_heavy", "weights": {"rules_score": 0.2, "meaning_score": 0.7, "grammar_score": 0.1}},
    {"name": "grammar_focused", "weights": {"rules_score": 0.4, "meaning_score": 0.2, "grammar_score": 0.4}},
]
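
How the weights are applied lives in `compute_reward` in the separate `reward_computation` notebook, which is not shown here. A plausible reading, and only an assumption, is a plain weighted sum of the three component scores. The helper `combine_scores` and the example scores below are hypothetical.

```python
def combine_scores(scores: dict, weights: dict) -> float:
    """Weighted sum of component scores; weights are expected to sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


balanced = {"rules_score": 0.5, "meaning_score": 0.3, "grammar_score": 0.2}
rules_heavy = {"rules_score": 0.7, "meaning_score": 0.2, "grammar_score": 0.1}

# Hypothetical component scores for one (original, simplified) pair.
scores = {"rules_score": 1.0, "meaning_score": 0.8, "grammar_score": 0.5}

reward_balanced = combine_scores(scores, balanced)        # 0.5*1.0 + 0.3*0.8 + 0.2*0.5
reward_rules_heavy = combine_scores(scores, rules_heavy)  # 0.7*1.0 + 0.2*0.8 + 0.1*0.5
```

Because every candidate's weights sum to 1, the combined reward stays in the same range as the component scores, which keeps the regression targets comparable across configurations.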

## 2.5) Load, Split and Tokenize Data ONCE 

In [None]:
#previous versions, keeping for reference

# #Define the preprocessing function
# # Tokenize the 'text' column (the simplified sentences)
# def preprocess_function(examples, tokenizer):
#     tokenized = tokenizer(
#         examples["simplified"],
#         truncation=True,
#         padding="max_length",
#         max_length=128,
#         #load_from_cache_file=False, ##make sure that the cached files are not used
#     )

# # Tokenize the 'text' column (the simplified sentences)
# def preprocess_function(examples, tokenizer):
#     tokenized = tokenizer(
#         examples["text"],
#         truncation=True,
#         padding="max_length",
#         max_length=128,
#         load_from_cache_file=False, ##make sure that the cached files are not used
#     )
#     tokenized["labels"] = [float(x) for x in examples["reward"]] #np.array(examples["labels"], dtype=np.float32)
#     return tokenized

In [None]:
# Predefine the preprocessing function
def preprocess_function(examples, tokenizer):
    """Tokenizes the 'simplified' text column."""
    # Important: Tokenization is applied on the 'simplified' column now, which becomes the 'text' for the RM
    return tokenizer(examples["simplified"], truncation=True, padding="max_length", max_length=128)

In [None]:
# --- 1. Load, Split, and Tokenize Data ONCE ---
print("--- Loading and tokenizing data once ---")
df = pd.read_csv("data/ordered_simplifications_with_rules.csv", index_col=0)
df.rename(columns={"original_sentence": "original", "final_simplification": "simplified"}, inplace=True)
df = df.sample(n=50, random_state=42) #reduce size for testing

In [None]:
# Create the initial dataset from the full DataFrame
full_dataset = Dataset.from_pandas(df)
# Split it into train and test sets
split_dataset = full_dataset.train_test_split(test_size=0.15, seed=42)

In [None]:
# Load the tokenizer and apply tokenization on both split sets

tokenizer_rm = AutoTokenizer.from_pretrained("distilbert-base-german-cased")

# Tokenize the base train and test sets without the labels
# We keep the original text columns to calculate rewards later
tokenized_train_base = split_dataset["train"].map(
    lambda examples: preprocess_function(examples, tokenizer_rm),
    batched=True
)
tokenized_test_base = split_dataset["test"].map(
    lambda examples: preprocess_function(examples, tokenizer_rm),
    batched=True
)
print("--- Data tokenized successfully. Starting grid search... ---")

### Custom Eval Metrics for the Regression RM
- The regression RM is trained with MSE loss.
- The following metrics are also logged:
  - MSE (Mean Squared Error) → matches the training loss, so you can track consistency.
  - MAE (Mean Absolute Error) → more interpretable (average absolute difference).
  - R² (Coefficient of Determination) → tells you how well the model explains variance (1 = perfect, 0 = baseline that always predicts the mean, negative = worse than that baseline).

In [8]:
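import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def compute_metrics(eval_pred):
    """Regression metrics for the reward model, computed over the whole eval set."""
    logits, labels = eval_pred
    # Flatten (n, 1) predictions to (n,). reshape(-1) also works for n == 1,
    # where squeeze() would collapse the array to 0 dimensions.
    preds = np.asarray(logits).reshape(-1)
    labels = np.asarray(labels).reshape(-1)

    mse = mean_squared_error(labels, preds)
    mae = mean_absolute_error(labels, preds)
    r2  = r2_score(labels, preds)

    return {"mse": mse, "mae": mae, "r2": r2}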
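
A small worked example of the three metrics, written without sklearn so the arithmetic is visible. The labels and predictions are invented. They are chosen to show why R² can be hugely negative while MSE looks moderate: R² = 1 - SS_res / SS_tot, so when the eval labels barely vary (tiny SS_tot), even a modest constant offset in the predictions blows the ratio up.

```python
def regression_metrics(labels, preds):
    """MSE, MAE and R² computed by hand for two equally long lists."""
    n = len(labels)
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))  # residual sum of squares
    ss_tot = sum((y - mean_label) ** 2 for y in labels)        # label spread around the mean
    return {
        "mse": ss_res / n,
        "mae": sum(abs(y - p) for y, p in zip(labels, preds)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


labels = [0.80, 0.82, 0.78, 0.80]  # hypothetical rewards with very little variance
preds = [0.40, 0.42, 0.38, 0.40]   # correct ordering, but a constant offset of 0.4
metrics = regression_metrics(labels, preds)
# mse is about 0.16, mae about 0.4, r2 about -799
```

With only 8 eval rows and rewards that sit close together, strongly negative R² values early in training are expected and do not by themselves indicate a bug.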

## 3) Main Training Loop including Grid Search
- with the current setup and 50 rows, it took 19min

In [None]:
# --- 2. Grid Search and Training Loop ---
for config in CANDIDATE_WEIGHTS:
    config_name = config["name"]
    weights = config["weights"]
    output_dir_base = f"rm_out_{config_name}"

    print(f"\n--- Processing configuration: {config_name} ---")
    print(f"Weights: {weights}")

    # --- a. Calculate rewards for this configuration (Fast) ---
    train_rewards = [
        compute_reward(ex['original'], ex['simplified'], weights) 
        for ex in tqdm(split_dataset["train"], desc="Calculating train rewards")
    ]
    test_rewards = [
        compute_reward(ex['original'], ex['simplified'], weights) 
        for ex in tqdm(split_dataset["test"], desc="Calculating test rewards")
    ]

    # --- b. Add the new rewards as the 'labels' column ---
    train_dataset_for_run = tokenized_train_base.add_column("labels", train_rewards)
    test_dataset_for_run = tokenized_test_base.add_column("labels", test_rewards)
    
    # --- c. Set the final format for the Trainer ---
    train_dataset_for_run.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
    test_dataset_for_run.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    # --- d. Set up Training Arguments ---
    training_args = TrainingArguments(
        output_dir=output_dir_base,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="mse",
        greater_is_better=False,
    )

    trainer = Trainer(
        model_init=model_init,
        args=training_args,
        train_dataset=train_dataset_for_run,
        eval_dataset=test_dataset_for_run,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer_rm,
    )

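    # --- e. Run Hyperparameter Search ---
    print(f"--- Running hyperparameter search for {config_name} ---")
    # Without compute_objective, the Trainer's default objective is the sum of all
    # custom eval metrics (mse + mae + r2). Minimizing that sum favors a very
    # negative R², i.e. the worse trial, so the objective is pinned to eval MSE.
    # The recorded output below comes from a run with the default objective,
    # which is why its objective values are negative.
    best_run = trainer.hyperparameter_search(
        direction="minimize",
        hp_space=model_hp_space,
        n_trials=2,
        compute_objective=lambda metrics: metrics["eval_mse"],
    )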
    print(f"Best run for {config_name}: {best_run}")

    # --- f. Train Final Model with Best Hyperparameters ---
    print(f"--- Training final model for {config_name} with best hyperparameters ---")
    for k, v in best_run.hyperparameters.items():
        setattr(trainer.args, k, v)
    
    trainer.train()

    # --- g. Save the Final Model ---
    final_output_dir = f"{output_dir_base}_final"
    trainer.save_model(final_output_dir)
    print(f"--- Saved final optimized model to {final_output_dir} ---\n")
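
A note on how the "best" trial gets picked. The sketch below assumes the Trainer's default objective behaves like `default_compute_objective` in transformers: it returns the eval loss if no other metrics exist, otherwise the sum of the remaining metrics after dropping timing entries. `default_objective` is a hypothetical re-implementation for illustration, not the library function. The metric values are the final-epoch eval results of the two "balanced" trials from the recorded log.

```python
def default_objective(metrics: dict) -> float:
    """Eval loss alone, or the sum of the extra metrics if any exist."""
    m = dict(metrics)
    loss = m.pop("eval_loss", None)
    m.pop("epoch", None)
    # Drop speed/timing entries, which are not quality metrics.
    for key in [k for k in m if k.endswith("_runtime") or k.endswith("_per_second")]:
        m.pop(key)
    return loss if not m else sum(m.values())


trial_0 = {"eval_loss": 0.14334534108638763, "eval_mse": 0.14334532618522644,
           "eval_mae": 0.3756512999534607, "eval_r2": -180.9593048095703,
           "eval_runtime": 2.3401, "epoch": 4.0}
trial_1 = {"eval_loss": 0.01033814437687397, "eval_mse": 0.01033814437687397,
           "eval_mae": 0.08971166610717773, "eval_r2": -12.123005867004395,
           "eval_runtime": 0.3507, "epoch": 3.0}

# Minimizing the summed objective prefers the trial with the more negative R².
by_default = min([trial_0, trial_1], key=default_objective)
# Minimizing eval MSE alone prefers the trial with the smaller error.
by_mse = min([trial_0, trial_1], key=lambda m: m["eval_mse"])
```

The summed objective reproduces the logged values (about -180.44 for trial 0 and -12.02 for trial 1), so under `direction="minimize"` trial 0 wins although its MSE is roughly 14 times higher. Passing a `compute_objective` that returns only `eval_mse` to `hyperparameter_search` avoids this.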

--- Loading and tokenizing data once ---


Map:   0%|          | 0/42 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

--- Data tokenized successfully. Starting grid search... ---

--- Processing configuration: balanced ---
Weights: {'rules_score': 0.5, 'meaning_score': 0.3, 'grammar_score': 0.2}


Calculating train rewards:   0%|          | 0/42 [00:00<?, ?it/s]

Dissect compound:  Zahl
Dissect compound:  Person
Dissect compound:  Durch·Schnitt
Dissect compound:  Lebensmittel
Dissect compound:  Tisch
Dissect compound:  Masken
Dissect compound:  Beispiel
Dissect compound:  Bauarbeiter
Dissect compound:  Land
Dissect compound:  Feuerwehr
Dissect compound:  Wasser
Dissect compound:  Kellern
Dissect compound:  Frist
Dissect compound:  Brexit
Dissect compound:  Geschäfte
Dissect compound:  Lokale
Dissect compound:  Notarzt
Dissect compound:  Schuss·Waffen
Dissect compound:  Pistolen
Dissect compound:  Gewehre
Dissect compound:  Polizei
Dissect compound:  Bub
Dissect compound:  Ziel
Dissect compound:  Klima·Volksbegehren
Dissect compound:  Beispiel
Dissect compound:  Ausbau
Dissect compound:  Zoo
Dissect compound:  Orang·Utan
Dissect compound:  Jahre
Dissect compound:  Bauern
Dissect compound:  Gegend
Dissect compound:  Exil
Dissect compound:  Viertel
Dissect compound:  SPÖ·Mitarbeiter
Dissect compound:  Klassen
Dissect compound:  Finanz-Minister
Dis

Calculating test rewards:   0%|          | 0/8 [00:00<?, ?it/s]

Dissect compound:  Strom
Dissect compound:  Lieder
Dissect compound:  Folklore
Dissect compound:  Preis
Dissect compound:  Album
Dissect compound:  Möglichkeiten
Dissect compound:  Dichterin
Dissect compound:  Literatur·Nobelpreis
Dissect compound:  Zeit
Dissect compound:  Auto·Fahrer
Dissect compound:  Schnee
Dissect compound:  Matsch
Dissect compound:  Eis
Dissect compound:  Autos
Dissect compound:  Ausbruch
Dissect compound:  Virus
Dissect compound:  Corona-Pandemie
Dissect compound:  Lebens·Erwartung
Dissect compound:  Mal
Dissect compound:  Zeit


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-09-14 18:40:18,552] A new study created in memory with name: no-name-c0eb53af-9105-4877-939d-c7462f609c19


--- Running hyperparameter search for balanced ---




  0%|          | 0/8 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.39793527126312256, 'eval_mse': 0.39793527126312256, 'eval_mae': 0.6295639872550964, 'eval_r2': -504.1300048828125, 'eval_runtime': 0.3581, 'eval_samples_per_second': 22.338, 'eval_steps_per_second': 2.792, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2510538399219513, 'eval_mse': 0.2510538399219513, 'eval_mae': 0.499216765165329, 'eval_r2': -317.6820373535156, 'eval_runtime': 0.8264, 'eval_samples_per_second': 9.681, 'eval_steps_per_second': 1.21, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.17322784662246704, 'eval_mse': 0.17322784662246704, 'eval_mae': 0.4136693775653839, 'eval_r2': -218.89149475097656, 'eval_runtime': 1.2237, 'eval_samples_per_second': 6.537, 'eval_steps_per_second': 0.817, 'epoch': 3.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.14334534108638763, 'eval_mse': 0.14334532618522644, 'eval_mae': 0.3756512999534607, 'eval_r2': -180.9593048095703, 'eval_runtime': 2.3401, 'eval_samples_per_second': 3.419, 'eval_steps_per_second': 0.427, 'epoch': 4.0}


[I 2025-09-14 18:41:15,193] Trial 0 finished with value: -180.44030818343163 and parameters: {'learning_rate': 2.3113453937889977e-05, 'num_train_epochs': 4, 'per_device_train_batch_size': 32, 'weight_decay': 0.0837112572386638}. Best is trial 0 with value: -180.44030818343163.


{'train_runtime': 56.1171, 'train_samples_per_second': 2.994, 'train_steps_per_second': 0.143, 'train_loss': 0.3073101341724396, 'epoch': 4.0}




  0%|          | 0/9 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.0850880965590477, 'eval_mse': 0.0850880965590477, 'eval_mae': 0.2886233925819397, 'eval_r2': -107.00889587402344, 'eval_runtime': 0.1766, 'eval_samples_per_second': 45.289, 'eval_steps_per_second': 5.661, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.004014439415186644, 'eval_mse': 0.004014439415186644, 'eval_mae': 0.05303340405225754, 'eval_r2': -4.0958380699157715, 'eval_runtime': 0.5444, 'eval_samples_per_second': 14.694, 'eval_steps_per_second': 1.837, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.01033814437687397, 'eval_mse': 0.01033814437687397, 'eval_mae': 0.08971166610717773, 'eval_r2': -12.123005867004395, 'eval_runtime': 0.3507, 'eval_samples_per_second': 22.808, 'eval_steps_per_second': 2.851, 'epoch': 3.0}


[I 2025-09-14 18:41:45,216] Trial 1 finished with value: -12.022956056520343 and parameters: {'learning_rate': 4.705555295931991e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.09409394281344818}. Best is trial 0 with value: -180.44030818343163.


{'train_runtime': 28.8909, 'train_samples_per_second': 4.361, 'train_steps_per_second': 0.312, 'train_loss': 0.15060599644978842, 'epoch': 3.0}
Best run for balanced: BestRun(run_id='0', objective=-180.44030818343163, hyperparameters={'learning_rate': 2.3113453937889977e-05, 'num_train_epochs': 4, 'per_device_train_batch_size': 32, 'weight_decay': 0.0837112572386638}, run_summary=None)
--- Training final model for balanced with best hyperparameters ---




  0%|          | 0/8 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.39793533086776733, 'eval_mse': 0.3979353606700897, 'eval_mae': 0.6295640468597412, 'eval_r2': -504.130126953125, 'eval_runtime': 0.3649, 'eval_samples_per_second': 21.924, 'eval_steps_per_second': 2.741, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2510538101196289, 'eval_mse': 0.2510538101196289, 'eval_mae': 0.4992167055606842, 'eval_r2': -317.6820068359375, 'eval_runtime': 0.8819, 'eval_samples_per_second': 9.072, 'eval_steps_per_second': 1.134, 'epoch': 2.0}




  0%|          | 0/1 [00:01<?, ?it/s]

{'eval_loss': 0.17322784662246704, 'eval_mse': 0.17322784662246704, 'eval_mae': 0.4136693477630615, 'eval_r2': -218.89149475097656, 'eval_runtime': 3.9559, 'eval_samples_per_second': 2.022, 'eval_steps_per_second': 0.253, 'epoch': 3.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.14334538578987122, 'eval_mse': 0.14334538578987122, 'eval_mae': 0.37565135955810547, 'eval_r2': -180.95938110351562, 'eval_runtime': 0.5354, 'eval_samples_per_second': 14.943, 'eval_steps_per_second': 1.868, 'epoch': 4.0}
{'train_runtime': 45.0878, 'train_samples_per_second': 3.726, 'train_steps_per_second': 0.177, 'train_loss': 0.30731016397476196, 'epoch': 4.0}
--- Saved final optimized model to rm_out_balanced_final ---


--- Processing configuration: rules_heavy ---
Weights: {'rules_score': 0.7, 'meaning_score': 0.2, 'grammar_score': 0.1}


Calculating train rewards:   0%|          | 0/42 [00:00<?, ?it/s]


Calculating test rewards:   0%|          | 0/8 [00:00<?, ?it/s]



[I 2025-09-14 18:42:47,115] A new study created in memory with name: no-name-a67380fc-d1dd-45cc-8335-39f28e2559a3


--- Running hyperparameter search for rules_heavy ---




  0%|          | 0/6 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.23542112112045288, 'eval_mse': 0.23542112112045288, 'eval_mae': 0.4822903275489807, 'eval_r2': -134.37728881835938, 'eval_runtime': 0.4293, 'eval_samples_per_second': 18.637, 'eval_steps_per_second': 2.33, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.08608052879571915, 'eval_mse': 0.08608052879571915, 'eval_mae': 0.28667086362838745, 'eval_r2': -48.5000114440918, 'eval_runtime': 0.4463, 'eval_samples_per_second': 17.927, 'eval_steps_per_second': 2.241, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.043034877628088, 'eval_mse': 0.043034877628088, 'eval_mae': 0.19651466608047485, 'eval_r2': -23.74690818786621, 'eval_runtime': 1.8654, 'eval_samples_per_second': 4.289, 'eval_steps_per_second': 0.536, 'epoch': 3.0}


[I 2025-09-14 18:43:42,688] Trial 0 finished with value: -23.507358644157648 and parameters: {'learning_rate': 3.9910352452023014e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.06796030512435813}. Best is trial 0 with value: -23.507358644157648.


{'train_runtime': 53.0541, 'train_samples_per_second': 2.375, 'train_steps_per_second': 0.113, 'train_loss': 0.23748638232549033, 'epoch': 3.0}




  0%|          | 0/9 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.39701712131500244, 'eval_mse': 0.39701712131500244, 'eval_mae': 0.6283830404281616, 'eval_r2': -227.30194091796875, 'eval_runtime': 0.9167, 'eval_samples_per_second': 8.727, 'eval_steps_per_second': 1.091, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2834542989730835, 'eval_mse': 0.2834542989730835, 'eval_mae': 0.5300941467285156, 'eval_r2': -161.99842834472656, 'eval_runtime': 0.5459, 'eval_samples_per_second': 14.654, 'eval_steps_per_second': 1.832, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2431916445493698, 'eval_mse': 0.24319162964820862, 'eval_mae': 0.490520179271698, 'eval_r2': -138.8456573486328, 'eval_runtime': 0.5464, 'eval_samples_per_second': 14.64, 'eval_steps_per_second': 1.83, 'epoch': 3.0}


[I 2025-09-14 18:44:46,340] Trial 1 finished with value: -138.1119455397129 and parameters: {'learning_rate': 1.442456559412578e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.005419066603899559}. Best is trial 1 with value: -138.1119455397129.


{'train_runtime': 61.7631, 'train_samples_per_second': 2.04, 'train_steps_per_second': 0.146, 'train_loss': 0.3732747501797146, 'epoch': 3.0}
Best run for rules_heavy: BestRun(run_id='1', objective=-138.1119455397129, hyperparameters={'learning_rate': 1.442456559412578e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.005419066603899559}, run_summary=None)
--- Training final model for rules_heavy with best hyperparameters ---




  0%|          | 0/9 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.39701712131500244, 'eval_mse': 0.39701709151268005, 'eval_mae': 0.6283829808235168, 'eval_r2': -227.3019256591797, 'eval_runtime': 1.1713, 'eval_samples_per_second': 6.83, 'eval_steps_per_second': 0.854, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2834542393684387, 'eval_mse': 0.2834542393684387, 'eval_mae': 0.5300941467285156, 'eval_r2': -161.99839782714844, 'eval_runtime': 0.4726, 'eval_samples_per_second': 16.928, 'eval_steps_per_second': 2.116, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2431916892528534, 'eval_mse': 0.2431916743516922, 'eval_mae': 0.4905202090740204, 'eval_r2': -138.84568786621094, 'eval_runtime': 1.3739, 'eval_samples_per_second': 5.823, 'eval_steps_per_second': 0.728, 'epoch': 3.0}
{'train_runtime': 62.6641, 'train_samples_per_second': 2.011, 'train_steps_per_second': 0.144, 'train_loss': 0.3732748031616211, 'epoch': 3.0}
--- Saved final optimized model to rm_out_rules_heavy_final ---


--- Processing configuration: meaning_heavy ---
Weights: {'rules_score': 0.2, 'meaning_score': 0.7, 'grammar_score': 0.1}


Calculating train rewards:   0%|          | 0/42 [00:00<?, ?it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Dissect compound:  Schießerei
Dissect compound:  Synagoge
Dissect compound:  Million
Dissect compound:  Menschen
Dissect compound:  Corona·Virus


Calculating test rewards:   0%|          | 0/8 [00:00<?, ?it/s]



[I 2025-09-14 18:51:25,406] A new study created in memory with name: no-name-6d70b450-bb8e-4228-9fbc-d8a59627ff45


--- Running hyperparameter search for meaning_heavy ---




  0%|          | 0/4 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.5381573438644409, 'eval_mse': 0.5381573438644409, 'eval_mae': 0.732668399810791, 'eval_r2': -981.9249877929688, 'eval_runtime': 0.5128, 'eval_samples_per_second': 15.6, 'eval_steps_per_second': 1.95, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.4720667600631714, 'eval_mse': 0.4720667600631714, 'eval_mae': 0.686156690120697, 'eval_r2': -861.212890625, 'eval_runtime': 0.7465, 'eval_samples_per_second': 10.716, 'eval_steps_per_second': 1.34, 'epoch': 2.0}


[I 2025-09-14 18:52:26,575] Trial 0 finished with value: -860.0546671748161 and parameters: {'learning_rate': 1.5054253165962535e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32, 'weight_decay': 0.09454363237216178}. Best is trial 0 with value: -860.0546671748161.


{'train_runtime': 56.3304, 'train_samples_per_second': 1.491, 'train_steps_per_second': 0.071, 'train_loss': 0.5417770147323608, 'epoch': 2.0}




  0%|          | 0/9 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.4675424098968506, 'eval_mse': 0.467542439699173, 'eval_mae': 0.6828292608261108, 'eval_r2': -852.9494018554688, 'eval_runtime': 0.3974, 'eval_samples_per_second': 20.129, 'eval_steps_per_second': 2.516, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.34296509623527527, 'eval_mse': 0.3429650664329529, 'eval_mae': 0.5846346616744995, 'eval_r2': -625.413330078125, 'eval_runtime': 1.1776, 'eval_samples_per_second': 6.794, 'eval_steps_per_second': 0.849, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2984173595905304, 'eval_mse': 0.2984173595905304, 'eval_mae': 0.5452090501785278, 'eval_r2': -544.0485229492188, 'eval_runtime': 0.4383, 'eval_samples_per_second': 18.253, 'eval_steps_per_second': 2.282, 'epoch': 3.0}


[I 2025-09-14 18:53:35,051] Trial 1 finished with value: -543.2048965394497 and parameters: {'learning_rate': 1.439470108427635e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 16, 'weight_decay': 0.09564979132984286}. Best is trial 0 with value: -860.0546671748161.


{'train_runtime': 63.2216, 'train_samples_per_second': 1.993, 'train_steps_per_second': 0.142, 'train_loss': 0.4083252747853597, 'epoch': 3.0}
Best run for meaning_heavy: BestRun(run_id='0', objective=-860.0546671748161, hyperparameters={'learning_rate': 1.5054253165962535e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 32, 'weight_decay': 0.09454363237216178}, run_summary=None)
--- Training final model for meaning_heavy with best hyperparameters ---


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/4 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.5381574034690857, 'eval_mse': 0.5381574034690857, 'eval_mae': 0.732668399810791, 'eval_r2': -981.925048828125, 'eval_runtime': 0.626, 'eval_samples_per_second': 12.779, 'eval_steps_per_second': 1.597, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.4720667600631714, 'eval_mse': 0.4720667600631714, 'eval_mae': 0.686156690120697, 'eval_r2': -861.212890625, 'eval_runtime': 1.2449, 'eval_samples_per_second': 6.426, 'eval_steps_per_second': 0.803, 'epoch': 2.0}
{'train_runtime': 44.1107, 'train_samples_per_second': 1.904, 'train_steps_per_second': 0.091, 'train_loss': 0.5417770147323608, 'epoch': 2.0}
--- Saved final optimized model to rm_out_meaning_heavy_final ---


--- Processing configuration: grammar_focused ---
Weights: {'rules_score': 0.4, 'meaning_score': 0.2, 'grammar_score': 0.4}


Calculating train rewards:   0%|          | 0/42 [00:00<?, ?it/s]

Dissect compound:  Zahl
Dissect compound:  Person
Dissect compound:  Durch·Schnitt
Dissect compound:  Lebensmittel
Dissect compound:  Tisch
Dissect compound:  Masken
Dissect compound:  Beispiel
Dissect compound:  Bauarbeiter
Dissect compound:  Land
Dissect compound:  Feuerwehr
Dissect compound:  Wasser
Dissect compound:  Kellern
Dissect compound:  Frist
Dissect compound:  Brexit
Dissect compound:  Geschäfte
Dissect compound:  Lokale
Dissect compound:  Notarzt
Dissect compound:  Schuss·Waffen
Dissect compound:  Pistolen
Dissect compound:  Gewehre
Dissect compound:  Polizei
Dissect compound:  Bub
Dissect compound:  Ziel
Dissect compound:  Klima·Volksbegehren
Dissect compound:  Beispiel
Dissect compound:  Ausbau
Dissect compound:  Zoo
Dissect compound:  Orang·Utan
Dissect compound:  Jahre
Dissect compound:  Bauern
Dissect compound:  Gegend
Dissect compound:  Exil
Dissect compound:  Viertel
Dissect compound:  SPÖ·Mitarbeiter
Dissect compound:  Klassen
Dissect compound:  Finanz-Minister
Dis

Calculating test rewards:   0%|          | 0/8 [00:00<?, ?it/s]

Dissect compound:  Strom
Dissect compound:  Lieder
Dissect compound:  Folklore
Dissect compound:  Preis
Dissect compound:  Album
Dissect compound:  Möglichkeiten
Dissect compound:  Dichterin
Dissect compound:  Literatur·Nobelpreis
Dissect compound:  Zeit
Dissect compound:  Auto·Fahrer
Dissect compound:  Schnee
Dissect compound:  Matsch
Dissect compound:  Eis
Dissect compound:  Autos
Dissect compound:  Ausbruch
Dissect compound:  Virus
Dissect compound:  Corona-Pandemie
Dissect compound:  Lebens·Erwartung
Dissect compound:  Mal
Dissect compound:  Zeit


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-09-14 18:54:44,473] A new study created in memory with name: no-name-fc907a5a-f0fa-4f33-902e-078afe069d32


--- Running hyperparameter search for grammar_focused ---


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/6 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.31654101610183716, 'eval_mse': 0.31654101610183716, 'eval_mae': 0.5610895752906799, 'eval_r2': -564.7183837890625, 'eval_runtime': 0.4701, 'eval_samples_per_second': 17.019, 'eval_steps_per_second': 2.127, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.1577865034341812, 'eval_mse': 0.1577865034341812, 'eval_mae': 0.3946250379085541, 'eval_r2': -280.99420166015625, 'eval_runtime': 0.9074, 'eval_samples_per_second': 8.817, 'eval_steps_per_second': 1.102, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.10440894961357117, 'eval_mse': 0.10440894961357117, 'eval_mae': 0.31951093673706055, 'eval_r2': -185.59844970703125, 'eval_runtime': 0.4273, 'eval_samples_per_second': 18.722, 'eval_steps_per_second': 2.34, 'epoch': 3.0}


[I 2025-09-14 18:55:39,517] Trial 0 finished with value: -185.17452982068062 and parameters: {'learning_rate': 3.4225812239941936e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 32, 'weight_decay': 0.026386231545732487}. Best is trial 0 with value: -185.17452982068062.


{'train_runtime': 50.5122, 'train_samples_per_second': 2.494, 'train_steps_per_second': 0.119, 'train_loss': 0.3165178894996643, 'epoch': 3.0}


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/8 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.4164586067199707, 'eval_mse': 0.4164586067199707, 'eval_mae': 0.6439814567565918, 'eval_r2': -743.2899169921875, 'eval_runtime': 0.3678, 'eval_samples_per_second': 21.748, 'eval_steps_per_second': 2.719, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.2666419744491577, 'eval_mse': 0.2666419446468353, 'eval_mae': 0.5146216154098511, 'eval_r2': -475.53936767578125, 'eval_runtime': 0.4224, 'eval_samples_per_second': 18.939, 'eval_steps_per_second': 2.367, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.18672196567058563, 'eval_mse': 0.18672195076942444, 'eval_mae': 0.42981430888175964, 'eval_r2': -332.707275390625, 'eval_runtime': 0.3662, 'eval_samples_per_second': 21.844, 'eval_steps_per_second': 2.73, 'epoch': 3.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.15587641298770905, 'eval_mse': 0.15587639808654785, 'eval_mae': 0.39217400550842285, 'eval_r2': -277.5804748535156, 'eval_runtime': 1.6777, 'eval_samples_per_second': 4.768, 'eval_steps_per_second': 0.596, 'epoch': 4.0}


[I 2025-09-14 18:57:15,134] Trial 1 finished with value: -277.03242444992065 and parameters: {'learning_rate': 2.291994616609982e-05, 'num_train_epochs': 4, 'per_device_train_batch_size': 32, 'weight_decay': 0.031242999994574305}. Best is trial 1 with value: -277.03242444992065.


{'train_runtime': 93.4935, 'train_samples_per_second': 1.797, 'train_steps_per_second': 0.086, 'train_loss': 0.33749592304229736, 'epoch': 4.0}
Best run for grammar_focused: BestRun(run_id='1', objective=-277.03242444992065, hyperparameters={'learning_rate': 2.291994616609982e-05, 'num_train_epochs': 4, 'per_device_train_batch_size': 32, 'weight_decay': 0.031242999994574305}, run_summary=None)
--- Training final model for grammar_focused with best hyperparameters ---


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/8 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.4164586067199707, 'eval_mse': 0.4164586067199707, 'eval_mae': 0.6439814567565918, 'eval_r2': -743.2899169921875, 'eval_runtime': 1.1356, 'eval_samples_per_second': 7.045, 'eval_steps_per_second': 0.881, 'epoch': 1.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.26664191484451294, 'eval_mse': 0.26664191484451294, 'eval_mae': 0.5146216154098511, 'eval_r2': -475.539306640625, 'eval_runtime': 0.5576, 'eval_samples_per_second': 14.348, 'eval_steps_per_second': 1.794, 'epoch': 2.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.18672195076942444, 'eval_mse': 0.18672195076942444, 'eval_mae': 0.42981433868408203, 'eval_r2': -332.707275390625, 'eval_runtime': 0.7917, 'eval_samples_per_second': 10.104, 'eval_steps_per_second': 1.263, 'epoch': 3.0}




  0%|          | 0/1 [00:00<?, ?it/s]

{'eval_loss': 0.15587642788887024, 'eval_mse': 0.15587642788887024, 'eval_mae': 0.39217400550842285, 'eval_r2': -277.5805358886719, 'eval_runtime': 0.4987, 'eval_samples_per_second': 16.042, 'eval_steps_per_second': 2.005, 'epoch': 4.0}
{'train_runtime': 90.1293, 'train_samples_per_second': 1.864, 'train_steps_per_second': 0.089, 'train_loss': 0.33749592304229736, 'epoch': 4.0}
--- Saved final optimized model to rm_out_grammar_focused_final ---
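
Note on the search objective in the logs above: each reported trial value (for example -860.05 for trial 0 of meaning_heavy) equals eval_mse + eval_mae + eval_r2 from that trial's last evaluation. This is consistent with `Trainer.hyperparameter_search` summing all non-loss metrics when no `compute_objective` is passed. With direction="minimize", the search then favours the most negative R² rather than the lowest MSE. The sketch below reproduces the arithmetic and shows an MSE-only objective. `summed_objective` is a hypothetical stand-in written for illustration, not the library's code.

```python
# Metrics reported for meaning_heavy, trial 0, epoch 2 (copied from the log above).
metrics = {
    "eval_loss": 0.4720667600631714,
    "eval_mse": 0.4720667600631714,
    "eval_mae": 0.686156690120697,
    "eval_r2": -861.212890625,
    "eval_runtime": 0.7465,
    "eval_samples_per_second": 10.716,
    "eval_steps_per_second": 1.34,
    "epoch": 2.0,
}


def summed_objective(metrics):
    """Hypothetical stand-in: sum every metric except loss, epoch and speed stats."""
    skipped = ("eval_loss", "epoch")
    return sum(
        value
        for name, value in metrics.items()
        if name not in skipped
        and not name.endswith(("_runtime", "_per_second"))
    )


def mse_objective(metrics):
    """Objective that tracks only the evaluation MSE."""
    return metrics["eval_mse"]


print(summed_objective(metrics))  # compare with the logged value for trial 0
print(mse_objective(metrics))
```

Passing such a function as `compute_objective=mse_objective` to `trainer.hyperparameter_search(...)` should make direction="minimize" pick the trial with the lowest evaluation MSE.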



# Outdated RM code

In [None]:
# for config in CANDIDATE_WEIGHTS:
#     config_name = config["name"]
#     weights = config["weights"]
#     output_dir_base = f"rm_out_{config_name}"

#     print(f"--- Processing configuration: {config_name} ---")
#     print(f"Weights: {weights}")

#     # --- 1. Generate Reward Data for this specific configuration ---
#     reward_data = []
#     for index, row in tqdm(df.iterrows(), total=df.shape[0], desc=f"Generating rewards for {config_name}"):
#         original = row['original']
#         simplified = row['simplified']

#         try:
#             # Use the new function with the current set of weights
#             score = compute_reward(original, simplified, weights)
#             reward_data.append({"text": simplified, "reward": float(score)})
#         except Exception as e:
#             # print(f"Row {index} failed for {config_name}: {e}")
#             continue

#     #Create DataSet and split into train and test
#     reward_df = pd.DataFrame(reward_data)
#     reward_dataset = Dataset.from_pandas(reward_df).train_test_split(test_size=0.15, seed=42)

#     #Load tokenizer and process the dataset
#     tokenizer_rm = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
#     tokenized_dataset_rm = reward_dataset.map(
#         lambda examples: preprocess_function(examples, tokenizer_rm), # Pass tokenizer to the map function
#         batched=True, remove_columns=["text"]
#     )
#     #tokenized_dataset_rm = reward_dataset.map(preprocess_function, batched=True, remove_columns=["text"]) #remove text after tokenization
    
    
#     tokenized_dataset_rm.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

#     # --- 2. Set up Training Parameters ---
#     training_args = TrainingArguments(
#         output_dir=output_dir_base,
#         evaluation_strategy="epoch",
#         save_strategy="epoch",
#         load_best_model_at_end=True,
#         metric_for_best_model="mse",
#         greater_is_better=False,
#         pin_memory=False, ##TRUE if GPU is available
#     )

#     trainer = Trainer(
#         model_init=model_init, # Use the model_init function
#         args=training_args,
#         train_dataset=tokenized_dataset_rm["train"],
#         eval_dataset=tokenized_dataset_rm["test"],
#         compute_metrics=compute_metrics,
#         tokenizer=tokenizer_rm,
#     )
#     # --- 3. Run Hyperparameter Search ---
#     print(f"--- Running hyperparameter search for {config_name} ---")
#     best_run = trainer.hyperparameter_search(
#         direction="minimize",
#         hp_space=model_hp_space,
#         n_trials= 2 #10, # Number of hyperparameter combinations to try
#     )
#     print(f"Best run for {config_name}: {best_run}")

#     # --- 4. Train Final Model with Best Hyperparameters ---
#     print(f"--- Training final model for {config_name} with best hyperparameters ---")
#     for k, v in best_run.hyperparameters.items():
#         setattr(trainer.args, k, v)
    
#     trainer.train()

#     # --- 5. Save the Final Model ---
#     final_output_dir = f"{output_dir_base}_final"
#     trainer.save_model(final_output_dir)
#     print(f"--- Saved final optimized model to {final_output_dir} ---\n")


In [10]:
# for config in CANDIDATE_WEIGHTS:
#     config_name = config["name"]
#     weights = config["weights"]
#     output_dir_base = f"rm_out_{config_name}"

#     print(f"--- Processing configuration: {config_name} ---")
#     print(f"Weights: {weights}")

#     # --- 1. Generate Reward Data for this specific configuration ---
#     print("Calculating reward scores...")
#     train_rewards = [
#         compute_reward(ex['original'], ex['simplified'], weights) 
#         for ex in tqdm(split_dataset["train"], desc="Train rewards")
#     ]
#     test_rewards = [
#         compute_reward(ex['original'], ex['simplified'], weights) 
#         for ex in tqdm(split_dataset["test"], desc="Test rewards")
#     ]

#     # --- 1.2. Add reward 'labels' to the pre-tokenized datasets ---
#     train_dataset_for_run = tokenized_train_base.add_column("labels", train_rewards)
#     test_dataset_for_run = tokenized_test_base.add_column("labels", test_rewards)
    
#     # --- 1.3. Finalize dataset format for the Trainer ---
#     train_dataset_for_run.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
#     test_dataset_for_run.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


#     # --- 2. Set up Training Parameters ---
#     training_args = TrainingArguments(
#         output_dir=output_dir_base,
#         evaluation_strategy="epoch",
#         save_strategy="epoch",
#         load_best_model_at_end=True,
#         metric_for_best_model="mse",
#         greater_is_better=False,
#         pin_memory=False, ##TRUE if GPU is available
#     )

#     trainer = Trainer(
#         model_init=model_init, # Use the model_init function
#         args=training_args,
#         train_dataset=train_dataset_for_run,
#         eval_dataset=test_dataset_for_run,
#         compute_metrics=compute_metrics,
#         tokenizer=tokenizer_rm,
#     )
#     # --- 3. Run Hyperparameter Search ---
#     print(f"--- Running hyperparameter search for {config_name} ---")
#     best_run = trainer.hyperparameter_search(
#         direction="minimize",
#         hp_space=model_hp_space,
#         n_trials= 2 #10, # Number of hyperparameter combinations to try
#     )
#     print(f"Best run for {config_name}: {best_run}")

#     # --- 4. Train Final Model with Best Hyperparameters ---
#     print(f"--- Training final model for {config_name} with best hyperparameters ---")
#     for k, v in best_run.hyperparameters.items():
#         setattr(trainer.args, k, v)
    
#     trainer.train()

#     # --- 5. Save the Final Model ---
#     final_output_dir = f"{output_dir_base}_final"
#     trainer.save_model(final_output_dir)
#     print(f"--- Saved final optimized model to {final_output_dir} ---\n")


## Compute the Rewards

In [None]:
# 2. Generate reward scores for each example
# This will take time, as it runs your full spaCy/SBERT pipeline for each row
reward_data = []
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    original = row['original']
    simplified = row['simplified']
    
    # implement custom reward computation function
    try:
        score = compute_reward(original, simplified, weights)
    except Exception as e:
        #print(f"Row {index} failed with error: {e}")
        #print(f"Original: {original}")
        #print(f"Simplified: {simplified}")
        continue  # Skip rows whose reward computation fails (no default score is assigned)
        
    
    reward_data.append({
        "text": simplified, # The input for the reward model
        "reward": float(score)  # The score we want it to predict
    })

In [None]:
#turn the reward data into a df
reward_df = pd.DataFrame(reward_data)

# 3. Create a Hugging Face Dataset
reward_dataset = Dataset.from_pandas(reward_df)

# 4. Split into training and testing sets
reward_dataset = reward_dataset.train_test_split(test_size=0.15, seed=42) # 85% train, 15% test

In [None]:
# import matplotlib.pyplot as plt


# # 1. Create a larger figure to give the labels more space
# fig, ax = plt.subplots(figsize=(18, 8))

# # 2. Create the bar plot
# ax.bar(reward_df['text'], reward_df['labels'], color='skyblue')
# ax.set_ylim(0, 1)
# plt.tight_layout()

# reward_df_sort = reward_df.sort_values('labels')
# reward_df_sort_capped = reward_df_sort[:20]  # Take the top 20 rows for better visibility

In [None]:
# reward_df_sort = reward_df.sort_values('labels')
# reward_df_sort_capped = reward_df_sort[:20]  # Take the top 20 rows for better visibility

In [None]:
# reward_df_sort.head()

In [None]:
# reward_df_sort.tail()

### Load Model

In [None]:
model_name = "distilbert-base-german-cased" #bert-base-german-cased
tokenizer_rm = AutoTokenizer.from_pretrained(model_name)
model_rm = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1, problem_type="regression")

### Eval Metrics for regression RM
- The regression RM is trained with MSE loss.
- The following metrics are also logged:
  - MSE (Mean Squared Error) → matches the training loss, so you can track consistency.
  - MAE (Mean Absolute Error) → more interpretable (average absolute difference).
  - R² (Coefficient of Determination) → tells you how well the model explains variance (1 = perfect, 0 = no better than always predicting the mean label, negative = worse than that baseline).
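
The small worked example below uses made-up numbers to show how the three metrics are computed. It also shows why R² can fall far below 0, as in the training logs earlier in this notebook: when the evaluation labels barely vary, the denominator of R² (the total variance of the labels) is tiny, so even a moderate MSE produces a large negative value. `regression_metrics` is a hypothetical helper, written in plain Python for illustration.

```python
def regression_metrics(labels, preds):
    """Compute MSE, MAE and R² for two equally long lists of floats."""
    n = len(labels)
    errors = [p - y for p, y in zip(preds, labels)]
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only (not taken from the dataset): the labels cluster
# tightly around 0.8 while the predictions sit near 0.1.
labels = [0.78, 0.80, 0.82, 0.80]
preds = [0.10, 0.12, 0.08, 0.10]

print(regression_metrics(labels, preds))
```

With a small test split, a strongly negative R² says more about the narrow spread of the labels than about the model, so MSE and MAE are the safer numbers to compare across runs.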

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.asarray(logits).reshape(-1)   # shape: (batch,); also safe for a batch of one
    labels = np.asarray(labels).reshape(-1)

    mse = mean_squared_error(labels, preds)
    mae = mean_absolute_error(labels, preds)
    r2  = r2_score(labels, preds)

    return {"mse": mse, "mae": mae, "r2": r2}

In [None]:
reward_df.head()

### Dataset formatting

In [None]:
# Tokenize the 'text' column (the simplified sentences)
def preprocess_function(examples):
    tokenized = tokenizer_rm(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )
    tokenized["labels"] = [float(x) for x in examples["reward"]]  # regression targets as floats
    return tokenized

tokenized_dataset_rm = reward_dataset.map(preprocess_function, batched=True, remove_columns=["text"]) #remove text after tokenization
tokenized_dataset_rm.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


In [None]:
# training_args = TrainingArguments(
#     output_dir="rm_out",
#     num_train_epochs=3,                     # increase if dataset is small
#     per_device_train_batch_size=16,         # 8 if GPU-limited
#     per_device_eval_batch_size=16,
#     learning_rate=2e-5,                     # tune 1e-5 ↔ 3e-5
#     weight_decay=0.01,
#     logging_steps=50,                       # not too spammy
#     do_eval=True,            # run evaluation during training
#     eval_steps=200,                         # less frequent, more meaningful
#     metric_for_best_model="mse",            # if you define compute_metrics
#     greater_is_better=False,              # because lower MSE is better
# )

In [None]:
training_args = TrainingArguments(
    output_dir="rm_out",
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_total_limit=2,
    logging_steps=50,
    #load_best_model_at_end=True,
    metric_for_best_model="mse",
    greater_is_better=False,
)

In [None]:
# Trainer instantiation

trainer = Trainer(
    model=model_rm,
    args=training_args,
    train_dataset=tokenized_dataset_rm["train"],
    eval_dataset=tokenized_dataset_rm["test"],
    compute_metrics=compute_metrics,
)

In [None]:
# # Define the hyperparameter search space
# def model_hp_space(trial):
#     return {
#         "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
#         "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
#         "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
#         "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
#     }

# # Run hyperparameter search (Optuna by default)
# best_run = trainer.hyperparameter_search(
#     direction="minimize",   # since we want to minimize mse
#     hp_space=model_hp_space,
#     n_trials=10,            # how many configs to try
# )

#RuntimeError: To use hyperparameter search, you need to pass your model through a model_init function.
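
The RuntimeError above arises because hyperparameter search has to build a fresh model for every trial. Below is a minimal sketch of such a factory, reusing the checkpoint name and regression settings from the Load Model cell. The constant `MODEL_NAME` and the lazy import inside the function are choices made for this sketch only.

```python
MODEL_NAME = "distilbert-base-german-cased"  # same checkpoint as in the Load Model cell


def model_init(trial=None):
    """Return a freshly initialised regression model for each trial."""
    # Lazy import: the function can be defined before transformers is loaded.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=1,
        problem_type="regression",
    )
```

The trainer would then be created with `Trainer(model_init=model_init, ...)` instead of `model=model_rm`, so that every trial starts from the same pretrained weights.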

In [None]:
# Start the training process
trainer.train()

In [None]:
# Save the final, trained reward model
trainer.save_model("model_rm")
tokenizer_rm.save_pretrained("model_rm")

### Model_rm has been trained with the whole dataset now

# Start the PPO process

- If you use AutoModelForCausalLMWithValueHead, it already bundles BOTH the policy and the value head into a single model.
- You do not need to supply a separate value model.

- I managed to make it work with GPT-2.
- Some adjustments might be needed once frhew/sigdial_ft_a2 actually loads.

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
)
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
import torch.nn.functional as F
from transformers import BitsAndBytesConfig


In [None]:
# --- 1. Configuration and Model Loading ---
# load your target model using quantization
MODEL_ID = "frhew/sigdial_ft_a2" 
RM_PATH = "reward_model_ger" 

In [None]:
# --- 2. Load Models and Tokenizers with Quantization ---

# Define the quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16  # Use bfloat16 for modern GPUs
)

In [None]:
# Policy Model (the one we're training)
# We load it with the quantization config.
# device_map="auto" is still useful here to correctly place the quantized model on the GPU.
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    MODEL_ID,
    quantization_config=quantization_config,
    device_map="auto"
)

In [None]:
# # 1. choose your base model
# MODEL_ID = "gpt2"
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
# policy_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
# value_model  = AutoModelForCausalLM.from_pretrained(MODEL_ID)   # simple value net

# RM_PATH = "model_rm"
# reward_model = AutoModelForSequenceClassification.from_pretrained(RM_PATH) # or instantiate your RM
# tokenizer_rm = AutoTokenizer.from_pretrained(RM_PATH)

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)

In [None]:
# #1 use policy model and value head - both will be the same

#device = "cuda" if torch.cuda.is_available() else "cpu"

# MODEL_ID = "frhew/sigdial_ft_a2"
# tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# # Add a padding token if one doesn't exist
# if tokenizer.pad_token is None:
#     tokenizer.pad_token = tokenizer.eos_token

# policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
#     MODEL_ID, 
#     device_map="auto", 
#     dtype=torch.float16)
# #value_model = policy_model  # use the same model for policy and value head, PPOTrainer will handle this automatically

In [None]:
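# No active cell above loads the tokenizer, so load it before saving
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
policy_model.save_pretrained("frhew/sigdial_ft_a2_local")
tokenizer.save_pretrained("frhew/sigdial_ft_a2_local")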

In [None]:
MODEL_ID = "dbmdz/german-gpt2"
RM_PATH = "model_rm"

In [None]:
# PPO Configuration
config = PPOConfig(
    model_name=MODEL_ID,
    learning_rate=1.41e-5,
    batch_size=8, # Use a slightly larger batch size
    mini_batch_size=2,
    gradient_accumulation_steps=1,
    kl_penalty="kl",
    target_kl=0.1,
    use_score_scaling=True, #enable reward normalization
    score_clip=10.0,
    seed=42,
    log_with=None, # Set to "wandb" if you use it
)

# # 2. Configure PPO
# config = PPOConfig(
#     # — PPO-specific
#     exp_name                      = "my_ppo_test",
#     reward_model_path             = "model_rm",       # only needed if you want to reload RM via name
#     num_ppo_epochs                = 3,
#     kl_coef                       = 0.1,
#     cliprange                     = 0.2,
#     vf_coef                       = 0.1,
#     gamma                         = 0.99,
#     lam                           = 0.95,
#     # — TrainingArguments / OnPolicyConfig
#     learning_rate                 = 1e-5,
#     per_device_train_batch_size   = 2,
#     gradient_accumulation_steps   = 1,
#     num_mini_batches              = 1,
#     local_rollout_forward_batch_size = 1,
#     response_length               = 20,
#     temperature                   = 1.0,
#     report_to                     = None,            # or "wandb"
#     fp16                          = False,
#     bf16= False,
#)
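
The batch settings in the active PPOConfig interact with each other. As far as I understand this version of TRL, `batch_size` has to be a multiple of `mini_batch_size * gradient_accumulation_steps`, and the PPO dataloader drops incomplete batches. The sketch below works through that arithmetic for the values used here, with 32 prompts (the sample size used for the PPO dataset in this notebook). `ppo_batch_plan` is a hypothetical helper, not a TRL function.

```python
def ppo_batch_plan(num_samples, batch_size, mini_batch_size, gradient_accumulation_steps):
    """Check the PPO batch settings and derive step counts (illustrative only)."""
    backward_batch_size = mini_batch_size * gradient_accumulation_steps
    if batch_size % backward_batch_size != 0:
        raise ValueError(
            "batch_size must be a multiple of "
            "mini_batch_size * gradient_accumulation_steps"
        )
    return {
        # Assumes incomplete final batches are dropped by the dataloader.
        "ppo_steps_per_epoch": num_samples // batch_size,
        # Optimizer updates per pass over one collected batch.
        "optimizer_updates_per_ppo_epoch": batch_size // backward_batch_size,
    }


plan = ppo_batch_plan(
    num_samples=32,
    batch_size=8,
    mini_batch_size=2,
    gradient_accumulation_steps=1,
)
print(plan)
```

With these settings, one pass over the 32 prompts gives four calls to `ppo_trainer.step`, which is very little signal. The numbers are better read as a smoke test than as a real training run.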

In [None]:
# 1. Policy and reference model: both start from the same checkpoint

policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_ID)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_ID)


policy_tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
policy_tokenizer.pad_token = policy_tokenizer.eos_token
policy_tokenizer.padding_side  = "left" # GPT-2 is decoder-only: pad on the LEFT so generation continues directly after the prompt (tokenizers pad on the right by default)
# policy_model.save_pretrained("gpt2_local")
# tokenizer.save_pretrained("gpt2_local")

#load the pretrained reward model & tokenizer

reward_model = AutoModelForSequenceClassification.from_pretrained(RM_PATH)
reward_tokenizer = AutoTokenizer.from_pretrained(RM_PATH)
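
To illustrate the `padding_side = "left"` setting in the cell above: a decoder-only model appends new tokens at the right end of each row, so right padding would place pad tokens between a short prompt and its continuation. Below is a toy sketch with made-up token ids. `pad_batch` and `PAD` are hypothetical names, not part of any tokenizer API.

```python
PAD = 0  # hypothetical pad id for this toy example


def pad_batch(sequences, side="left", pad_id=PAD):
    """Pad lists of token ids to a common length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


prompts = [[11, 12, 13], [21, 22]]
left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

print(left)   # every row ends with a real prompt token
print(right)  # the shorter row ends with padding
```

With left padding, the last position of every row is a real prompt token, which is the position the model generates from.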

In [None]:
# Ensure the reward model is in evaluation mode and move to device
reward_model.to("cuda" if torch.cuda.is_available() else "cpu")
reward_model.eval()

## Prepare Dataset and PPO Trainer

In [None]:
df = pd.read_csv("data/ordered_simplifications_with_rules.csv", index_col=0)

#Use the original sentence as the prompt/query
df.rename(columns={"original_sentence": "query"}, inplace=True)

#Create a train/eval set
train_df, eval_df = train_test_split(df, test_size=0.15, random_state=42)

print(f"Training set size: {len(train_df)}, Evaluation set size: {len(eval_df)}")

In [None]:
# For demonstration, using a small sample
train_sample_df = train_df.sample(n=32, random_state=42)

# Create the PPO dataset ONLY from the training split
dataset = Dataset.from_pandas(train_sample_df)
print("Created Hugging Face Dataset:")
print(dataset)

In [None]:
# Step 2: Create the correct tokenization function
def tokenize_function(examples):
    # The tokenizer, when called on a list of strings, returns a dictionary
    # with `input_ids` and `attention_mask`. This is the format we need.
    # We set the padding and truncation strategy here.
    output = policy_tokenizer(examples["query"], truncation=True, padding="max_length", max_length=40)
    return output

# Step 3: Apply the tokenization using .map()
# batched=True makes it much faster
tokenized_dataset = dataset.map(tokenize_function, 
                                batched=True,
                                remove_columns=['final_simplification', 'applied_rules', 'uid', 'query']
                                )

# Important: The PPOTrainer needs the columns in torch format
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

print("\nTokenized Dataset format:")
print(tokenized_dataset)
print(tokenized_dataset[0])

In [None]:
ppo_trainer = PPOTrainer(
    config=config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=policy_tokenizer,
    dataset=tokenized_dataset,
)

In [None]:

def compute_rewards_from_rm(responses: list[str]) -> torch.Tensor:
    """
    Computes rewards for a list of strings using the trained regression reward model.
    """
    with torch.no_grad():
        # Tokenize the responses
        enc = reward_tokenizer(
            responses,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=128 # Match the max_length used during RM training
        )
        
        # Move tensors to the same device as the reward model
        inputs = {k: v.to(reward_model.device) for k, v in enc.items()}
        
        # Get the reward model's output
        out = reward_model(**inputs)
        
        # The output logits are the scalar reward values.
        # Squeeze to remove the extra dimension.
        rewards = out.logits.squeeze(-1)
        
    return rewards

In [None]:
# --- 5. The PPO Training Loop ---

# Generation settings for creating responses
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": policy_tokenizer.eos_token_id,
    "max_new_tokens": 50, # How long the simplifications should be
}

In [None]:
for batch in tqdm(ppo_trainer.dataloader, "Training Step"):
    #2D tensor
    query_tensors = batch["input_ids"]

    # Convert the 2D batch tensor into a list of 1D tensors for generation
    queries_list = [q for q in query_tensors]

    # Generate responses from the policy model.
    # return_prompt=False returns only the new tokens, so the reward model
    # scores the continuation alone and step() receives response-only tensors.
    response_tensors = ppo_trainer.generate(queries_list, return_prompt=False, **generation_kwargs)
    
    # Decode the responses in one call with batch_decode
    batch["response"] = policy_tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Compute rewards
    rewards_tensor = compute_rewards_from_rm(batch["response"])
    rewards_list = [r for r in rewards_tensor]
    
    # Perform the PPO optimization step
    stats = ppo_trainer.step(queries_list, response_tensors, rewards_list)
    ppo_trainer.log_stats(stats, batch, rewards_list)

print("PPO Training Finished!")

# Save the trained model
print("Saving PPO-tuned model...")
ppo_trainer.save_pretrained("my_ppo_tuned_model")
print("Model saved to 'my_ppo_tuned_model'")

In [None]:
print("\n--- Evaluating and Logging Final Model Performance ---")

# Take a small sample from the held-out evaluation split
# These prompts were not part of the PPO training sample

eval_df = eval_df.sample(n=8, random_state=49)
eval_dataset = Dataset.from_pandas(eval_df)

# Tokenize the evaluation prompts
# Note: We are not using the dataloader here, just tokenizing a small batch
eval_prompts = policy_tokenizer(
    eval_df["query"].tolist(), # Use the pandas DataFrame here
    return_tensors="pt",
    padding=True,
    truncation=True
)
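# Strip padding with the attention mask so each query is an unpadded 1D tensor
# (padding tokens would otherwise be treated as part of the prompt),
# then move it to the training device
eval_queries_list = [
    ids[mask.bool()].to(ppo_trainer.accelerator.device)
    for ids, mask in zip(eval_prompts["input_ids"], eval_prompts["attention_mask"])
]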

# Generate responses with the FINAL trained model
# We use torch.no_grad() for efficiency as we are not training
with torch.no_grad():
    eval_response_tensors = ppo_trainer.generate(eval_queries_list, **generation_kwargs)

# Isolate the newly generated tokens before decoding.
# By default generate() returns each query followed by its continuation, so the
# length of the query that was passed in marks where the generated part begins.
generated_tokens = []
for query_tensor, response_tensor in zip(eval_queries_list, eval_response_tensors):
    generated_tokens.append(response_tensor[len(query_tensor):])

# Decode only the generated part
eval_responses = policy_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

# Score the final responses with the reward model
eval_rewards = compute_rewards_from_rm(eval_responses)

# Print the results in a clean format
print("\n--- Final Model Generations ---")
for query, response, reward in zip(eval_dataset["query"], eval_responses, eval_rewards):
    print(f"Query:    {query}")
    print(f"Response: {response}")
    print(f"Reward:   {reward.item():.4f}")
    print("-" * 50)
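
The prompt-stripping step is easy to get wrong when prompts are padded. The sketch below replays it with plain Python lists instead of tensors. `PAD`, `unpad`, `strip_prompt` and the token ids are all made up for illustration.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def unpad(ids, attention_mask):
    """Keep only the positions the attention mask marks as real tokens."""
    return [tok for tok, keep in zip(ids, attention_mask) if keep == 1]


def strip_prompt(output_ids, query_len):
    """Drop the first `query_len` tokens, leaving only the continuation."""
    return output_ids[query_len:]


# A left-padded prompt with two real tokens, and a made-up continuation.
padded_query = [PAD, PAD, 11, 12]
mask = [0, 0, 1, 1]
continuation = [21, 22]

# Slicing a padded output by the unpadded length cuts in the wrong place.
padded_output = padded_query + continuation
wrong = strip_prompt(padded_output, sum(mask))

# Unpadding first makes the query length and the slice point agree.
query = unpad(padded_query, mask)
right = strip_prompt(query + continuation, len(query))

print(wrong)  # [11, 12, 21, 22]
print(right)  # [21, 22]
```

Slicing by the length of the query exactly as it was passed to generation avoids depending on which side the tokenizer pads.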

In [None]:
# from transformers import (
#     AutoTokenizer,
#     AutoModelForCausalLM,
#     AutoModelForSequenceClassification,
# )
# from trl import PPOConfig, PPOTrainer
# from datasets import load_dataset

# # 1️⃣  Load your policy & value LMs
# MODEL_ID     = "gpt2"
# tokenizer    = AutoTokenizer.from_pretrained(MODEL_ID)
# policy_model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
# value_model  = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# # 2️⃣  Load your fine-tuned Reward Model (must use SequenceClassification loader)
# tokenizer_rm = AutoTokenizer.from_pretrained("rm_dummy")
# reward_model = AutoModelForSequenceClassification.from_pretrained("rm_dummy")

# # 3️⃣  Build your PPO config
# config = PPOConfig(
#     exp_name                         = "ppo_with_correct_rm",
#     num_ppo_epochs                   = 4,
#     kl_coef                          = 0.1,
#     cliprange                        = 0.2,
#     vf_coef                          = 0.1,
#     gamma                            = 0.99,
#     lam                              = 0.95,

#     learning_rate                    = 1e-5,
#     per_device_train_batch_size      = 2,
#     gradient_accumulation_steps      = 1,
#     num_mini_batches                 = 1,
#     local_rollout_forward_batch_size = 1,
#     response_length                  = 20,
#     temperature                      = 1.0,
#     report_to                        = None,
#     fp16                             = False,
# )

# # 4️⃣  Instantiate your PPOTrainer
# ppo_trainer = PPOTrainer(
#     args         = config,
#     model        = policy_model,
#     ref_model    = None,           # default: frozen copy of policy_model
#     reward_model = reward_model,
#     value_model  = value_model,
#     tokenizer    = tokenizer,
#     train_dataset= load_dataset("daily_dialog", split="train[:8]"),
# )

# # 5️⃣  Wrap RM inference into your reward function
# import torch.nn.functional as F

# def compute_rewards(responses):
#     # tokenize the generated strings with the RM’s tokenizer
#     enc = tokenizer_rm(
#         responses,
#         padding=True,
#         truncation=True,
#         max_length=64,
#         return_tensors="pt",
#     ).to(reward_model.device)

#     # forward pass through your classifier head
#     out = reward_model(**enc)
#     # assume label 1 = “good”; take its probability
#     probs = F.softmax(out.logits, dim=-1)[:, 1]
#     return probs.tolist()

# # 6️⃣  Run a single PPO step
# batch    = ["Tell me a joke.", "What is the capital of France?"]
# tok      = tokenizer(batch, return_tensors="pt", padding=True)
# queries  = tok.input_ids.to(ppo_trainer.args.device)

# # generate responses
# response_tensors = ppo_trainer.generate(queries)
# responses        = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

# # score them
# rewards = compute_rewards(responses)

# # update the policy
# stats = ppo_trainer.step(queries, response_tensors, rewards)

# print("RESPONSES:", responses)
# print("REWARDS:  ", rewards)
# print("PPO STATS:", stats)


In [None]:
# # 5. train loop
# for epoch in range(2):        # two passes for demo
#     for prompt in prompts:
#         # tokenize prompt; generate() and step() expect lists of 1D tensors
#         inputs = tokenizer(prompt, return_tensors="pt")
#         query_tensors = [inputs.input_ids[0].to(ppo_trainer.accelerator.device)]

#         # generate a response (this calls model.generate under the hood)
#         response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=20)

#         # decode
#         responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

#         # compute rewards; step() expects one scalar tensor per sample
#         rewards = [torch.tensor(float(r)) for r in reward_fn(responses)]

#         # one PPO step (updates model in-place); stepping on a single prompt
#         # only works if the configured PPO batch size is 1
#         stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

#         print(f"PROMPT: {prompt!r}")
#         print(f"RESPONSES: {responses}")
#         print(f"REWARDS:   {rewards}")
#         print(f"PPO stats: {stats}")
#         print("-" * 40)