<a href="https://colab.research.google.com/github/mario1870/swabianGPT/blob/main/swabianGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Setup**
-> install requirements<br>
-> log in to Hugging Face (store your token as a Colab secret named hf_token)<br>
-> log in to Weights & Biases (store your API key as a Colab secret named wanb)

**Install requirements**<br>
-q: "quiet" - suppresses most output; only warnings and errors are shown<br>
-U: "upgrade" - upgrades the package to the newest available version

In [None]:
%%capture
%pip install -q -U transformers
%pip install -q -U datasets
%pip install -q -U accelerate
%pip install -q -U peft
%pip install -q -U trl
%pip install -q -U bitsandbytes
%pip install -q -U wandb
%pip install pyarrow==15.0.2
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

**Log in to Hugging Face**

In [None]:
from google.colab import userdata
from huggingface_hub import login

# load token from env-variables in Colab
token = userdata.get('hf_token')
login(token=token)
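
`google.colab.userdata` only exists inside Colab. The sketch below shows one way the same notebook could fall back to ordinary environment variables when run elsewhere. The helper name `get_secret` is hypothetical and not part of the original notebook.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's secret store, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    # Inside Colab use the secret store; elsewhere read an environment variable
    value = userdata.get(name) if userdata is not None else os.environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value


# Usage (same secret names as in this notebook):
# login(token=get_secret("hf_token"))
# wandb.login(key=get_secret("wanb"))
```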

**Log in to W&B**<br>
Used to create training reports

In [None]:
import wandb
from google.colab import userdata

# load token from env-variables in Colab
wb_token = userdata.get("wanb")

wandb.login(key=wb_token)
run = wandb.init(
    project='swabianGPT', # enter your own projectname
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mmarioraach01[0m ([33mmarioraach01-student[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# **Step 1: SFT**
First training step.<br>
Supervised fine-tuning (SFT) of the model with SFTTrainer

## **a) Load model**
-> load model & tokenizer from HF <br>
-> apply LoRA-Adapters to model

**Imports**

In [None]:
from unsloth import FastLanguageModel
import torch

**Loading the Model**

In [None]:
def load_model(model_name, max_seq_length=2048, load_in_4bit=True, dtype=None):
  # Load model & Tokenizer from unsloth
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name=model_name,
      max_seq_length=max_seq_length,
      load_in_4bit=load_in_4bit,
      dtype=dtype,
  )
  return model, tokenizer

**Applying LoRA**

In [None]:
def apply_lora_to_original_model(model, tokenizer):
  model = FastLanguageModel.get_peft_model(
      model,
      r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
      target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
      lora_alpha = 128,
      lora_dropout = 0, # Supports any, but = 0 is optimized
      bias = "none",    # Supports any, but = "none" is optimized
      # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
      use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
      use_rslora = False,  # We support rank stabilized LoRA
      loftq_config = None, # And LoftQ
  )
  return model, tokenizer


**Change `if False` to `if True` to run the code**

In [None]:
if False:
  model, tokenizer = load_model(model_name="unsloth/Meta-Llama-3.1-8B")
  model, tokenizer = apply_lora_to_original_model(model, tokenizer)
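
The LoRA configuration above (r = 128 on all seven projection modules) determines how many parameters are trained. As a rough check, the sketch below counts them. The layer dimensions are assumptions taken from the public Llama 3.1 8B configuration, not from this notebook. Each adapted weight matrix gets two low-rank factors, A (r x in) and B (out x r).

```python
# Assumed Llama 3.1 8B dimensions (public model config, not defined in this notebook)
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
RANK = 128             # r from the LoRA config above

# (in_features, out_features) of each LoRA target module
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    # Each adapted matrix adds rank * in (factor A) plus out * rank (factor B)
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


print(lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS))  # 335544320
```

Under these assumptions the result equals the "Number of trainable parameters" that Unsloth prints in the training log of section c).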

## **b) Preprocess dataset**

**Imports**

In [None]:
import pandas as pd
from google.colab import userdata
from datasets import Dataset

Step 1 - Data Quality

In [None]:
def clean_df(df):
   """
   Cleans and preprocesses a DataFrame containing Swabian-German translations.

   This function performs several cleaning steps:
   1. Removes rows with missing values
   2. Removes duplicate entries
   3. Removes rows with empty text (if 'text' column exists)
   4. Strips whitespace from all text columns
   5. Ensures Swabian sentences end with proper punctuation

   Args:
       df (pd.DataFrame): Input DataFrame containing translation pairs
                         Expected columns: 'schwaebisch' and optional 'text'

   Returns:
       pd.DataFrame: Cleaned DataFrame with invalid or problematic rows removed

   Example:
       >>> df = pd.DataFrame({
       >>>     'schwaebisch': ['Servus!  ', 'Griaß di', ''],
       >>>     'hochdeutsch': ['Hallo!', 'Grüß dich', 'Test']
       >>> })
       >>> cleaned_df = clean_df(df)
       # Returns DataFrame with only 'Servus!' row as it has proper punctuation
   """
   # Remove any rows containing NaN values to ensure data completeness
   df = df.dropna()

   # Remove exact duplicates to prevent redundancy in training data
   df = df.drop_duplicates()

   # If 'text' column exists, remove rows where it's empty after stripping whitespace
   # (.copy() avoids pandas' SettingWithCopyWarning in the assignments below)
   if 'text' in df.columns:
       df = df[df['text'].str.strip() != ''].copy()

   # Strip whitespace from all text columns to ensure consistent formatting
   text_columns = df.select_dtypes(include=['object']).columns
   for column in text_columns:
       df[column] = df[column].str.strip()

   # Keep only sentences ending with proper punctuation (.!?)
   # This helps ensure we're working with complete sentences
   df = df[df['schwaebisch'].str.endswith(('.', '!', '?'))]

   return df

In [None]:
def generate_synthetic_data_with_anthropic_api(df, api_key):
   """
   Generates synthetic training data by expanding existing Swabian-German translation pairs
   using the Anthropic API. Each translation pair is enriched with minimal context
   while preserving the original translation.

   Args:
       df (pd.DataFrame): DataFrame containing translation pairs with columns
                         'schwaebisch' and 'hochdeutsch'
       api_key (str): Authentication key for the Anthropic API

   Returns:
       list: List of dictionaries containing original and generated translations
             Keys: 'hochdeutsch', 'schwaebisch', 'neuer_schwaebisch_text'

   Example:
       >>> df = pd.DataFrame({
       >>>     'hochdeutsch': ['Guten Morgen'],
       >>>     'schwaebisch': ['Guada Morga']
       >>> })
       >>> generated = generate_synthetic_data_with_anthropic_api(df, 'your-api-key')
   """
   # Install and import required package
   !pip install anthropic
   import anthropic

   # Initialize the Anthropic SDK client; base_url points it at the xAI API ("grok-beta" below)
   client = anthropic.Anthropic(
       api_key=api_key,
       base_url="https://api.x.ai",
   )

   generated_data = []

   # Process each translation pair
   for i in range(len(df)):
       schwaebisch = df.iloc[i]['schwaebisch']
       hochdeutsch = df.iloc[i]['hochdeutsch']

       print(f"{i}: {hochdeutsch}")  # Progress tracking

       # Construct prompt with detailed instructions and examples
       prompt = f"""Original Übersetzungspaar:
       Hochdeutsch: {hochdeutsch}
       Schwäbisch: {schwaebisch}

       Erstelle daraus ein kurzes Trainingsbeispiel zum finetunen eines LLMs auf den schwäbischen Dialekt.
       Füge minimalen Kontext in hinzu, sodass der neue Satz die ursprünglichen enthält, aber einen vollständigen Satz bildet.
       Formatiere das so, dass die wörtliche Übersetzung wenn "wörtl." dabei ist nur minimal in die Übersetzung zählt, aber nicht explizit erwähnt wird.

       Gebe die Ausgabe nur in Hochdeutsch mit sehr wenig Kontext und mit der passenden schwäbischen Übersetzung aus.

       Sehr wichtig: Nutze nur sehr einfache schwäbische Wörter und baue die Sätze sehr kurz und simpel auf!
       Achte auf eine korrekte Satzbildung und eine korrekte Anwendung des schwäbischen Dialekts!
       Füge nur den erweiterten Satz und seine Übersetzung hinzu. Keine zusätzliche Erklärung!

       Beispielsätze für die grundlegende schwäbische Syntax:
       Hochdeutsch: "Ich habe keine Zeit, ich muss noch die Stangen wegputzen und die Wäsche aufhängen."
       Schwäbisch: I han koi Zeit, i muaß no d'Gschdäng wegbutza ond d'Wäsch aufhenga.

       Hochdeutsch: "Die Oma macht die besten Maultaschen mit Kartoffelsalat."
       Schwäbisch: "D'Omma macht d'beschde Maultäschla mit Kardofflsalat."

       Hochdeutsch: "Das geht doch nicht, du kannst doch nicht einfach so ein Dummkopf sein!"
       Schwäbisch: "Des goht doch et, du kannsch doch et oifach so en Seggel sei!"

       Beispielformat:
       Original Übersetzungspaar:
       Hochdeutsch: Guten Morgen
       Schwäbisch: Guada Morga

       Outputformat:
       Hochdeutsch: Guten Morgen\nSchwäbisch: Guada Morga
       """

       try:
           # Generate the expanded pair through the Anthropic SDK
           message = client.messages.create(
               model="grok-beta",
               max_tokens=128,
               system="You are an expert translator for standard german and the swabian dialect.",
               messages=[
                   {"role": "user", "content": prompt}
               ]
           )

           # message.content is a list of content blocks; keep the text of the first one
           generated_text = message.content[0].text
           print(f"{i}: {generated_text}")  # Log generated content

           # Store original and generated translations
           generated_data.append({
               'hochdeutsch': hochdeutsch,
               'schwaebisch': schwaebisch,
               'neuer_schwaebisch_text': generated_text
           })

       except Exception as e:
           print(f"Error at line {i}: {e}")
           continue  # Skip failed generations and continue with next pair

   return generated_data

Collecting anthropic
  Downloading anthropic-0.39.0-py3-none-any.whl.metadata (22 kB)
Downloading anthropic-0.39.0-py3-none-any.whl (198 kB)
Installing collected packages: anthropic
Successfully installed anthropic-0.39.0
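
The prompt above asks the model to answer in the form `Hochdeutsch: ...` followed by `Schwäbisch: ...` on the next line. Before the manual review step, such a response could be split back into two columns. The following is a minimal sketch. The helper name `parse_generated_pair` and the regular expression are illustrative assumptions, and real responses may deviate from the requested format.

```python
import re

# Matches "Hochdeutsch: <text>" followed on a later line by "Schwäbisch: <text>"
PAIR_PATTERN = re.compile(
    r"Hochdeutsch:\s*(?P<hochdeutsch>.+?)\s*\n\s*Schwäbisch:\s*(?P<schwaebisch>.+)",
    re.DOTALL,
)


def parse_generated_pair(text):
    """Split a generated response into its two sentences, or return None."""
    match = PAIR_PATTERN.search(text)
    if match is None:
        return None  # response did not follow the requested output format
    # Drop surrounding whitespace and optional double quotes
    return {
        key: value.strip().strip('"')
        for key, value in match.groupdict().items()
    }


print(parse_generated_pair("Hochdeutsch: Guten Morgen\nSchwäbisch: Guada Morga"))
```

Returning `None` for malformed responses makes it easy to collect them separately for manual inspection.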


In [None]:
def load_dataset_as_dataframe():
   """
   Loads and preprocesses the Swabian-German translation dataset from a TSV file.

   The function:
   1. Reads the TSV file from the specified Google Drive location
   2. Cleans the data using the clean_df function
   3. Prints basic dataset statistics

   Returns:
       pd.DataFrame: Cleaned DataFrame containing Swabian-German translations
                    with columns 'hochdeutsch' and 'schwaebisch'

   Raises:
       FileNotFoundError: If the TSV file is not found at the specified location
       pd.errors.EmptyDataError: If the TSV file is empty

   Example:
       >>> df = load_dataset_as_dataframe()
       >>> print(f"Loaded {len(df)} translation pairs")
   """
   try:
       # Load TSV file from Google Drive
       file_path = '/content/drive/MyDrive/datasets/tsv_latest_2.tsv'
       df_shg = pd.read_csv(file_path, sep='\t')

       # Clean and preprocess the dataset
       df_shg = clean_df(df_shg)

       # Print dataset statistics
       print("First 5 entries of the cleaned dataset:")
       print(df_shg.head())
       print(f"\nTotal number of translation pairs: {len(df_shg)}")

       return df_shg

   except FileNotFoundError:
       raise FileNotFoundError(f"Dataset file not found at {file_path}. Please check the path.")
   except pd.errors.EmptyDataError:
       raise pd.errors.EmptyDataError("The TSV file is empty.")
   except Exception as e:
       raise Exception(f"Error loading dataset: {str(e)}")

# Load and prepare the dataset
df_shg = load_dataset_as_dataframe()

                                         hochdeutsch  \
0  Aber ja, das können wir auf jeden Fall so mach...   
1  Das ist ja alles völlig verquer, von hinten na...   
2  Donnerwetter, ist das eine Überraschung! *Ausr...   
3  Du bist ein kleiner Gauner, das weiß doch jede...   
4  Du bist ja so ein Angsthase! Trau dich doch en...   

                                         schwaebisch  
0  Ha scho, des kenna mr uf jeda Fall so macha, k...  
1  Des isch ja alls vollkomma hindrschefirre gmac...  
2  Herrgottsakrament, isch des a Überraschung! He...  
3  Du bisch a Herrgottsfeddz, des woiß doch jeder...  
4  Du bisch fei so en Angschdhas! Drau de doch en...  
11878


In [None]:
def create_translation_dataset(df):
   """
   Creates a bidirectional translation dataset from a DataFrame containing
   Swabian-German language pairs. For each pair, creates two training examples:
   1. Swabian to Standard German
   2. Standard German to Swabian

   Args:
       df (pd.DataFrame): DataFrame containing translation pairs with columns
                         'schwaebisch' and 'hochdeutsch'

   Returns:
       datasets.Dataset: Hugging Face Dataset containing instruction-tuned
                        translation examples in both directions

   Example:
       >>> df = pd.DataFrame({
       >>>     'hochdeutsch': ['Guten Tag', 'Auf Wiedersehen'],
       >>>     'schwaebisch': ['Griaß Gott', 'Ade']
       >>> })
       >>> dataset = create_translation_dataset(df)
       >>> print(f"Created {len(dataset)} examples")  # Will print 4 examples

   Notes:
       - Each row in the input DataFrame generates two training examples
       - The resulting dataset has the format required for instruction fine-tuning:
         {'instruction': str, 'input': str, 'output': str}
       - Instructions emphasize correct sentence structure in both dialects
   """
   data = []

   # Iterate through each translation pair
   for _, row in df.iterrows():
       # Create Swabian to Standard German example
       data.append({
           "instruction": "Übersetze den schwäbischen Text ins Hochdeutsche. "
                         "Achte auf eine sinnvolle und korrekte Satzbildung!",
           "input": row['schwaebisch'],
           "output": row['hochdeutsch']
       })

       # Create Standard German to Swabian example
       data.append({
           "instruction": "Übersetze den hochdeutschen Text ins Schwäbische. "
                         "Achte auf eine sinnvolle und korrekte Satzbildung!",
           "input": row['hochdeutsch'],
           "output": row['schwaebisch']
       })

   # Convert to Hugging Face Dataset format
   return Dataset.from_list(data)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/Mario12355/synthetic_data_schwaebisch_deutsch/commit/fdc6c23c83f9d882e3f34279cb44dcf9f9307b70', commit_message='Upload dataset', commit_description='', oid='fdc6c23c83f9d882e3f34279cb44dcf9f9307b70', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Mario12355/synthetic_data_schwaebisch_deutsch', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Mario12355/synthetic_data_schwaebisch_deutsch'), pr_revision=None, pr_num=None)
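
To illustrate the record layout without the `datasets` dependency, here is a plain-Python sketch of the same bidirectional expansion. The function name `make_bidirectional` is hypothetical; the instruction strings are the ones used in `create_translation_dataset`.

```python
TO_GERMAN = ("Übersetze den schwäbischen Text ins Hochdeutsche. "
             "Achte auf eine sinnvolle und korrekte Satzbildung!")
TO_SWABIAN = ("Übersetze den hochdeutschen Text ins Schwäbische. "
              "Achte auf eine sinnvolle und korrekte Satzbildung!")


def make_bidirectional(pairs):
    """Turn (hochdeutsch, schwaebisch) tuples into two records per pair."""
    records = []
    for hochdeutsch, schwaebisch in pairs:
        # Swabian -> Standard German
        records.append({"instruction": TO_GERMAN, "input": schwaebisch, "output": hochdeutsch})
        # Standard German -> Swabian
        records.append({"instruction": TO_SWABIAN, "input": hochdeutsch, "output": schwaebisch})
    return records


pairs = [("Guten Tag", "Griaß Gott"), ("Auf Wiedersehen", "Ade")]
records = make_bidirectional(pairs)
print(len(records))  # 4
print(2 * 11878)     # 23756
```

The second line of output shows that the 11878 cleaned pairs reported by the loading cell correspond to the 23,756 examples listed in the Unsloth training log.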

**Change `if False` to `if True` to run the code**

In [None]:
"""
Data Processing Pipeline for SwabianGPT

This script handles the complete data processing pipeline from raw CSV to
the final dataset on Hugging Face. It's currently disabled (if False) as
these steps only need to be run once during initial setup.

Pipeline Steps:
1. Load and clean initial data
2. Generate synthetic context using Anthropic API
3. Manual processing in Excel/Sheets
4. Load processed data
5. Create and upload final dataset

Requirements:
- Access to source CSV files
- Anthropic API key
- Hugging Face account with write access
- Google Drive mounted (for Colab)
"""

if False:  # Pipeline is disabled by default to prevent accidental execution
   # Step 1: Load and clean initial dataset
   path_to_csv = "/content/file1_cleaned.csv"
   try:
       dataframe = pd.read_csv(path_to_csv)
       cleaned_dataframe = clean_df(dataframe)
       print(f"Loaded and cleaned initial dataset: {len(cleaned_dataframe)} entries")
   except Exception as e:
       print(f"Error loading initial dataset: {e}")

   # Step 2: Generate synthetic training data with context
   # This step expands simple translations with natural language context
   try:
       api_key = userdata.get('xai')
       new_data = generate_synthetic_data_with_anthropic_api(cleaned_dataframe, api_key)
       print(f"Generated synthetic data: {len(new_data)} entries")
   except Exception as e:
       print(f"Error generating synthetic data: {e}")

   # Step 3: Manual Processing
   # At this point, data is manually reviewed and cleaned in Excel/Sheets
   # This ensures translation quality and context appropriateness

   # Step 4: Load processed and reviewed dataset
   try:
       processed_df = pd.read_csv('/content/drive/MyDrive/datasets/tsv_latest_2.tsv',
                                sep='\t')
       cleaned_processed_df = clean_df(processed_df)
       print(f"Loaded processed dataset: {len(cleaned_processed_df)} entries")
   except Exception as e:
       print(f"Error loading processed dataset: {e}")

   # Step 5: Create final dataset and upload to Hugging Face
   try:
       translation_dataset = create_translation_dataset(cleaned_processed_df)
       translation_dataset.push_to_hub(
           "Mario12355/synthetic_data_schwaebisch_deutsch",
           private=True
       )
       print("Successfully uploaded dataset to Hugging Face")
   except Exception as e:
       print(f"Error uploading to Hugging Face: {e}")

## **c) Train model**
-> load dataset from HF-hub<br>
-> init trainer<br>
-> start training

**Imports**

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from datasets import load_dataset

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [None]:
def formatting_prompts_func(examples):
   """
   Formats examples into prompts following the Alpaca training format.
   Each example is formatted as instruction + input + output with the required
   EOS token to properly terminate sequences during training.

   Args:
       examples (dict): Dictionary containing lists of:
           - instruction (str): Translation instruction
           - input (str): Text to translate
           - output (str): Expected translation

   Returns:
       dict: Dictionary with key 'text' containing formatted prompts

   Example:
       >>> examples = {
       >>>     "instruction": ["Übersetze ins Hochdeutsche"],
       >>>     "input": ["Griaß Gott"],
       >>>     "output": ["Guten Tag"]
       >>> }
       >>> formatted = formatting_prompts_func(examples)
       >>> print(formatted["text"][0])
       "Below is an instruction... {EOS_TOKEN}"

   Notes:
       - Uses global alpaca_prompt template and EOS_TOKEN
       - EOS_TOKEN must be set before using this function
       - Critical for proper sequence termination during training
   """
   # Initialize lists for batch processing
   instructions = examples["instruction"]
   inputs = examples["input"]
   outputs = examples["output"]
   texts = []

   # Process each example in the batch
   for instruction, input_text, output in zip(instructions, inputs, outputs):
       # Format using Alpaca template and add EOS token
       # EOS token is crucial to prevent infinite generation
       formatted_prompt = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
       texts.append(formatted_prompt)

   return {"text": texts}

# Global variable for sequence termination
EOS_TOKEN = tokenizer.eos_token  # Must be set before using formatting_prompts_func
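
To make the resulting training text concrete, the sketch below formats a single example without loading a tokenizer. The EOS string `<|end_of_text|>` is an assumption about the Llama 3.1 base tokenizer; in the notebook the value always comes from `tokenizer.eos_token`.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Assumed EOS string; the notebook reads it from tokenizer.eos_token instead
EOS_TOKEN = "<|end_of_text|>"

example_text = alpaca_prompt.format(
    "Übersetze den schwäbischen Text ins Hochdeutsche. "
    "Achte auf eine sinnvolle und korrekte Satzbildung!",
    "Griaß Gott",   # input: text to translate
    "Guten Tag",    # output: expected translation
) + EOS_TOKEN

print(example_text)
```

The text ends with the expected translation immediately followed by the EOS token, which is what teaches the model to stop after the translation.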

**Training Configuration**

In [None]:
def initialize_sft_trainer(epochs):
   """
   Initializes a Supervised Fine-Tuning (SFT) trainer with optimized parameters
   for translation model training.

   Args:
       epochs (int): Number of training epochs to run

   Returns:
       SFTTrainer: Configured trainer instance ready for model fine-tuning

   Notes:
       Key configurations:
       - Uses 8-bit AdamW optimizer for memory efficiency
       - Implements gradient accumulation for larger effective batch size
       - Automatically selects between FP16/BF16 based on hardware support
       - Integrates with Weights & Biases for experiment tracking

   Hardware Requirements:
       - Minimum 16GB GPU RAM recommended
       - Supports both consumer and datacenter GPUs

   Example:
       >>> trainer = initialize_sft_trainer(epochs=3)
       >>> trainer.train()
   """
   trainer = SFTTrainer(
       # Model Configuration
       model=model,
       tokenizer=tokenizer,
       train_dataset=dataset,
       dataset_text_field="text",
       max_seq_length=2048,      # Maximum sequence length for training
       dataset_num_proc=2,       # Number of preprocessing workers
       packing=False,            # Disabled for translation tasks

       # Training Arguments
       args=TrainingArguments(
           # Batch Size Configuration
           per_device_train_batch_size=8,   # Batch size per GPU
           gradient_accumulation_steps=8,    # Accumulate gradients for larger effective batch

           # Learning Rate Schedule
           warmup_steps=20,                 # Gradual LR warmup
           num_train_epochs=epochs,         # Total number of training epochs
           learning_rate=2e-4,             # Initial learning rate
           lr_scheduler_type="linear",      # Linear LR decay

           # Optimization Settings
           optim="adamw_8bit",             # Memory-efficient optimizer
           weight_decay=0.01,              # L2 regularization
           fp16=not is_bfloat16_supported(),  # Use FP16 if BF16 not available
           bf16=is_bfloat16_supported(),      # Prefer BF16 when supported

           # Training Management
           logging_steps=1,                 # Log metrics every step
           seed=3407,                       # Fixed seed for reproducibility
           output_dir="outputs",            # Save directory
           report_to="wandb",              # Log to Weights & Biases
       ),
   )

   return trainer
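
The batch-size and schedule settings above can be checked with a little arithmetic. The sketch below assumes a single GPU and floor rounding when batches are grouped into optimizer steps, which reproduces the step count in the training log further down; other library versions may round differently. The function names are illustrative and not part of `transformers`.

```python
import math


def steps_per_epoch(num_examples, per_device_batch, grad_accum):
    # Batches per epoch (last partial batch kept), grouped into optimizer steps
    batches = math.ceil(num_examples / per_device_batch)
    return max(batches // grad_accum, 1)


def linear_lr(step, total_steps, warmup_steps=20, peak_lr=2e-4):
    # Linear warmup to peak_lr, then linear decay to zero at total_steps
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


effective_batch = 8 * 8                      # per-device batch x accumulation steps
total = steps_per_epoch(23_756, 8, 8)
print(effective_batch)                       # 64
print(total)                                 # 371
print(linear_lr(10, total))                  # halfway through warmup
print(linear_lr(total, total))               # 0.0
```

With 23,756 examples this gives an effective batch size of 64 and 371 optimizer steps for one epoch, and the learning rate reaches zero at the last step, consistent with the logged `train/learning_rate`.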

Change `if False` to `if True` to start training

In [None]:
"""
SwabianGPT Training Script

This script handles the SFT (Supervised Fine-Tuning) training process.
Currently disabled (if False) as this is a one-time training operation
that should be run carefully due to computational resources and costs.

Process:
1. Load the prepared dataset from Hugging Face
2. Format the prompts for training
3. Initialize and run the SFT trainer
4. Monitor training with Weights & Biases
"""

if False:  # Training pipeline is disabled by default
   try:
       # Step 1: Load dataset from Hugging Face Hub
       print("Loading dataset...")
       dataset = load_dataset(
           "Mario12355/synthetic_data_schwaebisch_deutsch",
           split="train"
       )
       print(f"Loaded {len(dataset)} examples")

       # Step 2: Format prompts for training
       print("Formatting prompts...")
       dataset = dataset.map(
           formatting_prompts_func,
           batched=True,
           desc="Formatting prompts"  # Progress description
       )
       print("Prompt formatting complete")

       # Step 3: Initialize trainer with specified epochs
       print("Initializing trainer...")
       trainer = initialize_sft_trainer(epochs=1)

       # Step 4: Start training
       print("Starting training...")
       trainer_stats = trainer.train()

       # Log final training statistics
       print("\nTraining completed!")
       print(f"Final loss: {trainer_stats.training_loss}")
       print(f"Total training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")

       # Clean up W&B run
       run.finish()
       print("Weights & Biases logging completed")

   except Exception as e:
       print(f"Error during training: {e}")
       # Ensure W&B run is properly closed even if training fails
       run.finish()
       raise

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 23,756 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 371
 "-____-"     Number of trainable parameters = 335,544,320


Step,Training Loss
1,2.9354
2,2.8742
3,2.832
4,2.5094
5,2.1776
6,1.985
7,1.7945
8,1.4599
9,1.3729
10,1.1789


Metric,History
train/epoch,▁▁▁▁▁▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇██
train/grad_norm,█▆▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate,████▇▇▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▁▁▁▁
train/loss,█▇▅▃▂▂▂▁▂▂▁▂▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Metric,Value
total_flos,1.6531672776258355e+17
train/epoch,0.99933
train/global_step,371.0
train/grad_norm,0.31192
train/learning_rate,0.0
train/loss,0.6309
train_loss,0.77217
train_runtime,10443.5755
train_samples_per_second,2.275
train_steps_per_second,0.036


In [None]:
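#@title Show final memory and time stats
# Note: start_gpu_memory and max_memory are not defined in the cells shown above;
# both have to be recorded before trainer.train() is called.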
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

10443.5755 seconds used for training.
174.06 minutes used for training.
Peak reserved memory = 11.697 GB.
Peak reserved memory for training = 4.607 GB.
Peak reserved memory % of max memory = 79.312 %.
Peak reserved memory for training % of max memory = 31.238 %.
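
The percentages above depend on two values that have to be captured before training: the memory already reserved at that point and the total GPU memory. The sketch below reproduces the arithmetic as a pure function. The function name `memory_report` is hypothetical, and the inputs `start_gb=7.09` and `max_gb=14.748` are back-solved from the printed output, not values logged by the notebook.

```python
def memory_report(peak_gb, start_gb, max_gb):
    """Summarise peak GPU memory relative to the pre-training baseline."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }


# In a live session the inputs would come from PyTorch, for example:
#   start_gb: torch.cuda.max_memory_reserved() / 1024**3, read before training
#   max_gb:   torch.cuda.get_device_properties(0).total_memory / 1024**3
report = memory_report(peak_gb=11.697, start_gb=7.09, max_gb=14.748)
print(report)
```

With these assumed inputs the function returns the same 4.607 GB, 79.312 % and 31.238 % that the cell above printed.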


In [None]:
model.save_pretrained("llama_3.1_20.11_fini") # Local saving
tokenizer.save_pretrained("llama_3.1_20.11_fini")

model.push_to_hub("Mario12355/llama_3.1_20.11_fini") # Online saving
tokenizer.push_to_hub("Mario12355/llama_3.1_20.11_fini") # Online saving

No files have been modified since last commit. Skipping to prevent empty commit.


Saved model to https://huggingface.co/Mario12355/llama_3.1_20.11_fini


No files have been modified since last commit. Skipping to prevent empty commit.


In [None]:
#model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
#model.push_to_hub_merged("Mario12355/lora_model_8b_08.11", tokenizer, save_method="merged_16bit")

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("GGUF", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("Mario12355/gguf", tokenizer)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "Mario12355/gguf", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
    )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.63 out of 12.67 RAM for saving.


 38%|███▊      | 12/32 [00:01<00:02,  9.83it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:49<00:00,  3.41s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving Mario12355/gguf/pytorch_model-00001-of-00004.bin...
Unsloth: Saving Mario12355/gguf/pytorch_model-00002-of-00004.bin...
Unsloth: Saving Mario12355/gguf/pytorch_model-00003-of-00004.bin...
Unsloth: Saving Mario12355/gguf/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m', 'q8_0', 'q5_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at Mario12355/gguf into f16 GGUF format.
The output location will be /content/Mario12355/gguf/unsloth.F16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: gguf
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/Mario12355/gguf
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Mario12355/gguf
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q5_K_M.gguf:   0%|          | 0.00/5.73G [00:00<?, ?B/s]

HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/api/models/Mario12355/gguf/commit/main (Request ID: Root=1-673f5edb-78eea6b42526542e082754a7;b7556d0b-cb6b-4f5d-a9df-0a7c8ccfe470)

Internal Error - We're working hard to fix this as soon as possible!

## **d) Inference after SFT**
-> load fine-tuned model from HF-hub<br>
-> generate a translation with token-by-token streaming

**Imports**

In [None]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

**Load Model**

In [None]:
def load_sft_finetuned_model_from_huggingface():
   """
   Loads the SFT fine-tuned SwabianGPT model from Hugging Face Hub.
   Uses optimized loading settings for efficient inference.

   Returns:
       tuple: (model, tokenizer)
           - model: The loaded language model in 4-bit quantization
           - tokenizer: Associated tokenizer for the model

   Notes:
       - Uses 4-bit quantization for memory efficiency
       - Maximum sequence length set to 2048 tokens
       - Model is loaded from personal Hugging Face repository
       - Requires ~8GB GPU memory with 4-bit quantization

   Example:
       >>> model, tokenizer = load_sft_finetuned_model_from_huggingface()
       >>> text = "Übersetze ins Schwäbische: Guten Tag"
       >>> inputs = tokenizer(text, return_tensors="pt")
       >>> outputs = model.generate(**inputs)

   Raises:
       Exception: If model loading fails (e.g., connection issues, insufficient memory)
   """
   try:
       # Load model and tokenizer with optimized settings
       model, tokenizer = FastLanguageModel.from_pretrained(
           model_name="Mario12355/llama_3.1_20.11_fini",  # Fine-tuned model path
           max_seq_length=2048,                           # Maximum context length
           dtype=None,                                    # Auto-detect optimal dtype
           load_in_4bit=True,                            # Use 4-bit quantization
       )

       print("Model loaded successfully")
       print("Model loaded in 4-bit quantization")
       print("Maximum sequence length: 2048 tokens")

       return model, tokenizer

   except Exception as e:
       print(f"Error loading model: {str(e)}")
       print("Please check:")
       print("- Internet connection")
       print("- GPU memory availability")
       print("- Model repository access permissions")
       raise

In [None]:
def generate_answer_with_stream(direction, input):
   """
   Generates a streaming translation between Swabian and Standard German.

   Args:
       direction (str): Translation direction, either:
           - 'hochdeutsch_to_schwaebisch': Standard German to Swabian
           - 'schwaebisch_to_hochdeutsch': Swabian to Standard German
       input (str): Text to translate

   Returns:
       None: Outputs translation directly via TextStreamer

   Example:
       >>> generate_answer_with_stream(
       >>>     direction="hochdeutsch_to_schwaebisch",
       >>>     input="Guten Tag, wie geht es dir?"
       >>> )
       "Griaß Gott, wie goht's dr?"

   Raises:
       ValueError: If an invalid translation direction is specified
   """
   # Set instruction based on translation direction
   if direction == "hochdeutsch_to_schwaebisch":
       instruction = ("Übersetze den hochdeutschen Text ins Schwäbische. "
                     "Achte auf eine sinnvolle und korrekte Satzbildung!")
   elif direction == "schwaebisch_to_hochdeutsch":
       instruction = ("Übersetze den schwäbischen Text ins Hochdeutsche. "
                     "Achte auf eine sinnvolle und korrekte Satzbildung!")
   else:
       raise ValueError(
           "Invalid direction. Must be 'hochdeutsch_to_schwaebisch' "
           "or 'schwaebisch_to_hochdeutsch'."
       )

   # Prepare input for model using Alpaca prompt template
   inputs = tokenizer(
       [
           alpaca_prompt.format(
               instruction,     # Translation instruction
               input,          # Text to translate
               "",            # Empty output for generation
           )
       ],
       return_tensors="pt"
   ).to("cuda")

   # Initialize streamer for real-time output
   text_streamer = TextStreamer(tokenizer)

   # Generate translation with streaming
   _ = model.generate(
       **inputs,
       streamer=text_streamer,
       max_new_tokens=128      # Limit output length
   )

In [None]:
"""
SwabianGPT Inference Script

This script handles model loading and inference testing.
Both sections are currently disabled (if False) as they are
for demonstration and testing purposes.
"""

# Section 1: Model Loading
if False:
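   # Load model if not already loaded (a plain `model is None` check would
   # raise NameError when `model` has never been defined)
   if globals().get("model") is None: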
       print("Loading fine-tuned model...")
       try:
           model, tokenizer = load_sft_finetuned_model_from_huggingface()
           FastLanguageModel.for_inference(model)  # Enable optimized inference
           print("Model loaded and optimized for inference")
       except Exception as e:
           print(f"Error loading model: {e}")

# Section 2: Translation Testing
if False:
   # Test translation with sample text
   try:
       print("Testing Swabian to Standard German translation...")
       test_input = ("Oinr isch emmer dr Arsch, ond er woiß id mol warom. "
                    "Oiner bleibt emmer übrig ond koiner schert sich drom")

       print("\nInput text:")
       print(f"Swabian: {test_input}")
       print("\nTranslation:")

       generate_answer_with_stream(
           direction="schwaebisch_to_hochdeutsch",
           input=test_input
       )
   except Exception as e:
       print(f"Error during translation: {e}")

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Übersetze den schwäbischen Text ins Hochdeutsche. Achte auf eine sinnvolle und korrekte Satzbildung!

### Input:
Oinr isch emmer dr Arsch, ond er woiß id mol warom. Oiner bleibt emmer übrig ond koiner schert sich drom

### Response:
Einer ist immer der Arsch, und er weiß nicht mal warum. Einer bleibt immer übrig und keiner schert sich darum<|end_of_text|>


# **Step 2: DPO**
2. Step of training<br>
Direct Preference Optimization training using DPOTrainer
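
As a rough sketch of what DPO optimizes, the snippet below computes the standard sigmoid-form DPO loss for a single preference pair from sequence log-probabilities. The function name `dpo_loss` and the numeric log-probabilities are illustrative assumptions; what `DPOTrainer` computes internally may differ in detail (for example other loss types or label smoothing).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid-form DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each answer under the model
    being trained (policy) and under the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors an answer
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Made-up log-probabilities, purely for illustration.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)  # policy prefers "chosen"
print(loss_at_start, loss_after_shift)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss drops as the policy moves probability toward the chosen answer relative to the reference, and `beta` controls how strongly that shift is rewarded.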

## **Data Preparation for DPO Part 1 - Generate Dataframe for manual preferencing**
-> create dataframe with label, prompt, output1 & output2<br>
-> 50% hochdeutsch -> schwaebisch & 50% schwaebisch -> hochdeutsch

**Imports**

In [None]:
import pandas as pd

In [None]:
def generate_answer_without_stream(direction, prompt):
   """
   Generates a translation between Swabian and Standard German without streaming.
   Includes temperature and repetition penalty for more controlled generation.

   Args:
       direction (str): Translation direction, either:
           - 'hochdeutsch_to_schwaebisch': Standard German to Swabian
           - 'schwaebisch_to_hochdeutsch': Swabian to Standard German
       prompt (str): Text to translate

   Returns:
       str: Generated translation with special tokens and prompt removed

   Example:
       >>> result = generate_answer_without_stream(
       >>>     direction="hochdeutsch_to_schwaebisch",
       >>>     prompt="Guten Tag, wie geht es dir?"
       >>> )
       >>> print(result)
       "Griaß Gott, wie goht's dr?"

   Raises:
       ValueError: If an invalid translation direction is specified
   """
   # Set instruction based on translation direction
   if direction == "hochdeutsch_to_schwaebisch":
       instruction = ("Übersetze den hochdeutschen Text ins Schwäbische. "
                     "Achte auf eine sinnvolle und korrekte Satzbildung!")
   elif direction == "schwaebisch_to_hochdeutsch":
       instruction = ("Übersetze den schwäbischen Text ins Hochdeutsche. "
                     "Achte auf eine sinnvolle und korrekte Satzbildung!")
   else:
       raise ValueError(
           "Invalid direction. Must be 'hochdeutsch_to_schwaebisch' "
           "or 'schwaebisch_to_hochdeutsch'."
       )

   # Configure generation parameters for better quality
   generation_config = {
       "max_new_tokens": 128,      # Maximum length of generated translation
       "do_sample": True,          # Enable sampling; temperature has no effect under greedy decoding
       "temperature": 0.3,         # Lower temperature for more focused outputs
       "repetition_penalty": 1.15  # Penalize repetitive text
   }

   # Prepare input using Alpaca prompt template
   inputs = tokenizer(
       [
           alpaca_prompt.format(
               instruction,  # Translation instruction
               prompt,      # Text to translate
               "",         # Empty output for generation
           )
       ],
       return_tensors="pt"
   ).to("cuda")

   # Generate translation
   output = model.generate(**inputs, **generation_config)

   # Decode the output tokens to text
   decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)

   # Extract only the response part (remove prompt and instruction)
   response_start = decoded_output.find("### Response:") + len("### Response:")
   response = decoded_output[response_start:].strip()

   return response
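
The response extraction above relies on `str.find`, which returns -1 when the marker is missing, so the slice would then start at an arbitrary offset. A minimal sketch of a more defensive variant follows; the helper name `extract_response` is hypothetical and the fallback behavior (returning the whole text) is an assumption about what is useful here.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_output: str) -> str:
    """Return the text after the first response marker.

    Falls back to the whole (stripped) text if the marker is absent,
    instead of slicing from a meaningless offset.
    """
    _, marker, tail = decoded_output.partition(RESPONSE_MARKER)
    if not marker:  # marker not found
        return decoded_output.strip()
    return tail.strip()


decoded = (
    "### Instruction:\nÜbersetze den schwäbischen Text ins Hochdeutsche.\n\n"
    "### Input:\nOinr isch emmer dr Arsch\n\n"
    "### Response:\nEiner ist immer der Arsch"
)
print(extract_response(decoded))
```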

In [None]:
def generate_df_with_50_50_samples(number_of_samples, lookuptable_dataframe):
   """
   Creates a balanced dataset for DPO training by randomly sampling equal numbers
   of Standard German and Swabian sentences from a lookup table.

   Args:
       number_of_samples (int): Total number of samples to generate (should be even)
       lookuptable_dataframe (pd.DataFrame): DataFrame containing translation pairs
                                           with 'hochdeutsch' and 'schwaebisch' columns

   Returns:
       pd.DataFrame: New DataFrame with columns:
           - label: Translation direction
           - prompt: Text to translate
           - output1: Placeholder for first translation attempt
           - output2: Placeholder for second translation attempt

   Example:
       >>> lookup_df = pd.DataFrame({
       >>>     'hochdeutsch': ['Guten Tag', 'Auf Wiedersehen'],
       >>>     'schwaebisch': ['Griaß Gott', 'Ade']
       >>> })
       >>> result = generate_df_with_50_50_samples(2, lookup_df)
       >>> print(result.shape)  # (2, 4)

   Notes:
       - Number of samples should be even for balanced split
       - Uses random_state=8 for reproducibility
       - Output columns (output1, output2) are initialized as None
   """
   # Clean input DataFrame
   dpo_cleaned_dataframe = clean_df(lookuptable_dataframe)

   # Shuffle DataFrame with fixed random seed for reproducibility
   dpo_shuffled_cleaned_dataframe = dpo_cleaned_dataframe.sample(
       frac=1,
       random_state=8
   )

   # Split samples equally between German and Swabian
   half_length = number_of_samples // 2
   values_hochdeutsch = dpo_shuffled_cleaned_dataframe['hochdeutsch'].iloc[:half_length]
   values_schwaebisch = dpo_shuffled_cleaned_dataframe['schwaebisch'].iloc[half_length:number_of_samples]

   # Create DataFrame for Standard German to Swabian translations
   hochdeutsch_df = pd.DataFrame({
       'label': 'hochdeutsch -> schwaebisch',
       'prompt': values_hochdeutsch,
       'output1': None,  # Placeholder for first translation
       'output2': None,  # Placeholder for second translation
   })

   # Create DataFrame for Swabian to Standard German translations
   schwaebisch_df = pd.DataFrame({
       'label': 'schwaebisch -> hochdeutsch',
       'prompt': values_schwaebisch,
       'output1': None,
       'output2': None,
   })

   # Combine DataFrames and reset index
   new_df = pd.concat([hochdeutsch_df, schwaebisch_df]).reset_index(drop=True)

   # Validate output
   assert len(new_df) == number_of_samples, "Output size doesn't match requested samples"
   assert len(new_df[new_df['label'] == 'hochdeutsch -> schwaebisch']) == len(new_df[new_df['label'] == 'schwaebisch -> hochdeutsch']), "Unbalanced split"

   return new_df
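
The splitting logic can be tried without pandas, the source TSV, or `clean_df`. The sketch below is a pure-Python stand-in that works on a list of dicts; the function name `split_50_50` is hypothetical, and `random.Random(8)` will not reproduce the row order of `DataFrame.sample(random_state=8)`.

```python
import random


def split_50_50(pairs, number_of_samples, seed=8):
    """Build balanced rows: half Standard German prompts, half Swabian prompts.

    `pairs` is a list of dicts with 'hochdeutsch' and 'schwaebisch' keys.
    Each half draws from different pairs, as in the notebook function.
    """
    if number_of_samples % 2 != 0:
        raise ValueError("number_of_samples must be even")
    if number_of_samples > len(pairs):
        raise ValueError("not enough translation pairs")

    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility

    half = number_of_samples // 2
    rows = []
    for pair in shuffled[:half]:
        rows.append({"label": "hochdeutsch -> schwaebisch",
                     "prompt": pair["hochdeutsch"],
                     "output1": None, "output2": None})
    for pair in shuffled[half:number_of_samples]:
        rows.append({"label": "schwaebisch -> hochdeutsch",
                     "prompt": pair["schwaebisch"],
                     "output1": None, "output2": None})
    return rows


# Toy lookup table; the Swabian strings are placeholders, not real data.
toy_pairs = [{"hochdeutsch": f"hd-{i}", "schwaebisch": f"sw-{i}"} for i in range(6)]
rows = split_50_50(toy_pairs, 4)
for row in rows:
    print(row["label"], "|", row["prompt"])
```

Checking for an odd or too-large `number_of_samples` up front gives a clearer error than the assertions at the end of the notebook function.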

In [None]:
def generate_2_outputs_and_load_it_into_df(df):
   """
   Generates two different translations for each prompt in the DataFrame using
   the fine-tuned model. Ensures the translations are different by regenerating
   if necessary.

   Args:
       df (pd.DataFrame): DataFrame containing prompts with columns:
           - label: Translation direction
           - prompt: Text to translate
           - output1: First translation (will be filled)
           - output2: Second translation (will be filled)

   Returns:
       pd.DataFrame: Updated DataFrame with generated translations in
                    output1 and output2 columns

   Example:
       >>> df = pd.DataFrame({
       >>>     'label': ['hochdeutsch -> schwaebisch'],
       >>>     'prompt': ['Guten Tag'],
       >>>     'output1': [None],
       >>>     'output2': [None]
       >>> })
       >>> result = generate_2_outputs_and_load_it_into_df(df)
       >>> print(result['output1'].iloc[0])  # First translation
       >>> print(result['output2'].iloc[0])  # Second translation

   Notes:
       - Generates translations one prompt at a time with periodic progress updates
       - Ensures output1 and output2 are different for each prompt
       - Handles both translation directions
   """
   print(f"Generating translations for {len(df)} prompts...")

   # Iterate through DataFrame with progress tracking
   for index, row in df.iterrows():
       try:
           # Get translation direction and text
           label = row['label']
           prompt = row['prompt']

           # Map the DataFrame label to the direction key expected by
           # generate_answer_without_stream
           if label == "hochdeutsch -> schwaebisch":
               direction = "hochdeutsch_to_schwaebisch"
           elif label == "schwaebisch -> hochdeutsch":
               direction = "schwaebisch_to_hochdeutsch"
           else:
               raise ValueError(
                   "Invalid label. Must be 'hochdeutsch -> schwaebisch' "
                   "or 'schwaebisch -> hochdeutsch'."
               )

           # Generate first translation
           output1 = generate_answer_without_stream(
               direction=direction,
               prompt=prompt
           )

           # Generate second translation (ensure it's different)
           attempts = 0
           max_attempts = 5
           while attempts < max_attempts:
               output2 = generate_answer_without_stream(
                   direction=direction,
                   prompt=prompt
               )
               if output1 != output2:
                   break
               attempts += 1

           if output1 == output2:
               print(f"Warning: Could not generate different translations "
                     f"for prompt at index {index} after {max_attempts} attempts")

           # Update DataFrame with generated translations
           df.at[index, 'output1'] = output1
           df.at[index, 'output2'] = output2

           # Progress update every 10 prompts
           if (index + 1) % 10 == 0:
               print(f"Processed {index + 1}/{len(df)} prompts")

       except Exception as e:
           print(f"Error processing prompt at index {index}: {e}")
           # Continue with next prompt if one fails
           continue

   print("Translation generation completed!")
   return df
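
The retry logic can be isolated from the model to see how it behaves. The sketch below takes any zero-argument `generate` callable and retries until the second output differs from the first; the helper name `two_distinct_outputs` and the stand-in generators are hypothetical.

```python
def two_distinct_outputs(generate, max_attempts=5):
    """Call `generate` once, then up to `max_attempts` more times until the
    second output differs from the first.

    Returns (first, second, distinct), where `distinct` is False if every
    retry produced the same text as the first call.
    """
    first = generate()
    second = first
    for _ in range(max_attempts):
        second = generate()
        if second != first:
            return first, second, True
    return first, second, False


# Stand-in for the model: a fixed sequence of fake outputs.
fake_outputs = iter(["Grüß Gott", "Grüß Gott", "Griaß Gott"])
first, second, distinct = two_distinct_outputs(lambda: next(fake_outputs))
print(first, "|", second, "|", distinct)

# A deterministic generator never yields a different second output.
same_first, same_second, same_distinct = two_distinct_outputs(lambda: "Ade")
print(same_first, "|", same_second, "|", same_distinct)
```

The second call shows why sampling matters here: with deterministic decoding every retry returns the same text and the loop only burns generation time.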

In [None]:
"""
DPO Dataset Generation Script

This script generates a balanced dataset for DPO (Direct Preference Optimization) training
by creating pairs of translations for both directions (Swabian ↔ Standard German).
Currently disabled (if False) as this is a one-time operation for dataset creation.

Note: This process can be time-consuming as it generates multiple translations
for each prompt to ensure diversity in the training data.
"""

if False:  # Dataset generation pipeline is disabled by default
   try:
       print("Starting DPO dataset generation process...")

       # Step 1: Load and prepare initial dataset
       print("\nLoading source translations...")
       dpo_dataframe = pd.read_csv(
           '/content/drive/MyDrive/datasets/tsv_latest_2.tsv',
           sep='\t'
       )
       print(f"Loaded {len(dpo_dataframe)} source translations")

       # Step 2: Create balanced sample set
       print("\nGenerating balanced sample set...")
       dpo_dataframe = generate_df_with_50_50_samples(
           number_of_samples=500,  # Total samples to generate
           lookuptable_dataframe=dpo_dataframe
       )
       print("Sample distribution:")
       print(dpo_dataframe['label'].value_counts())

       # Step 3: Generate translation pairs
       print("\nGenerating translation pairs...")
       print("This may take a while as each prompt needs two different translations")
       dpo_dataframe = generate_2_outputs_and_load_it_into_df(df=dpo_dataframe)

       # Step 4: Validate results
       null_count = dpo_dataframe[['output1', 'output2']].isnull().sum().sum()
       if null_count > 0:
           print(f"\nWarning: Found {null_count} missing translations")
       else:
           print("\nAll translations generated successfully!")

       # Print sample of results
       print("\nSample of generated translations:")
       sample = dpo_dataframe.sample(n=3, random_state=42)
       for _, row in sample.iterrows():
           print(f"\nDirection: {row['label']}")
           print(f"Prompt: {row['prompt']}")
           print(f"Translation 1: {row['output1']}")
           print(f"Translation 2: {row['output2']}")

   except Exception as e:
       print(f"\nError during dataset generation: {e}")
       print("Please check:")
       print("- File path and permissions")
       print("- Available disk space")
       print("- GPU memory availability")

                        label  \
0  hochdeutsch -> schwaebisch   
1  hochdeutsch -> schwaebisch   
2  hochdeutsch -> schwaebisch   
3  hochdeutsch -> schwaebisch   
4  hochdeutsch -> schwaebisch   

                                              prompt output1 output2  
0                Ich gehe auf gut Glück in den Wald.    None    None  
1  Mann, ich bin heute total daneben, ich habe ge...    None    None  
2  Der betrunkene Mann torkelte schwankend die St...    None    None  
3                          Der Fernseher steht fern.    None    None  
4  der Hund hat einen Knochen im Maul und will ih...    None    None  

## **Data Preparation for DPO Part 2 - Generate Dataset based on the Preference-Dataframe**
-> Load processed dataset<br>
-> reformat the dataset to prompt, chosen, rejected<br>
-> upload to HF-hub

**Imports**

In [None]:
import pandas as pd
from datasets import Dataset, load_dataset

In [None]:
def load_preference_df_from_csv():
   """
   Loads the manually curated preference dataset for DPO training from a TSV file.
   This dataset contains pairs of translations with human preference annotations.

   Returns:
       pd.DataFrame: DataFrame containing preference data with columns:
           - label: Translation direction
           - prompt: Original text to translate
           - output1: Preferred translation (becomes "chosen")
           - output2: Less preferred translation (becomes "rejected")

   Raises:
       FileNotFoundError: If the TSV file cannot be found
       pd.errors.EmptyDataError: If the TSV file is empty

   Example:
       >>> df = load_preference_df_from_csv()
       >>> print(f"Loaded {len(df)} preference pairs")
       >>> print(df.head())

   Notes:
       - File should be tab-separated
       - Expects a specific file path in the Colab filesystem
       - Used for Direct Preference Optimization (DPO) training
   """
   try:
       # Load preference dataset from TSV - enter your location
       file_path = '/content/Unbenannte Tabelle - dpo_translations.tsv'
       df = pd.read_csv(file_path, sep='\t')

       # Validate loaded data: these are the columns that
       # create_preference_dataset reads from each row
       required_columns = ['label', 'prompt', 'output1', 'output2']
       missing_columns = [col for col in required_columns if col not in df.columns]
       if missing_columns:
           raise ValueError(f"Missing required columns: {missing_columns}")

       # Print dataset statistics
       print(f"Successfully loaded preference dataset:")
       print(f"- Total preference pairs: {len(df)}")
       print(f"- Columns: {', '.join(df.columns)}")

       return df

   except FileNotFoundError:
       raise FileNotFoundError(
           f"Preference dataset not found at {file_path}. "
           "Please check the file path."
       )
   except pd.errors.EmptyDataError:
       raise pd.errors.EmptyDataError(
           "The TSV file is empty. Please check the file content."
       )
   except Exception as e:
       raise Exception(f"Error loading preference dataset: {str(e)}")

In [None]:
def create_preference_dataset(df):
   """
   Creates a preference dataset for DPO training from a DataFrame containing
   pairs of translations. Each example includes an instruction, input text,
   and two translations (chosen and rejected).

   Args:
       df (pd.DataFrame): DataFrame containing translation pairs with columns:
           - label: Translation direction
           - prompt: Text to translate
           - output1: First translation (assumed chosen)
           - output2: Second translation (assumed rejected)

   Returns:
       datasets.Dataset: Hugging Face Dataset formatted for DPO training
                        with columns: prompt, chosen, rejected

   Example:
       >>> input_df = pd.DataFrame({
       >>>     'label': ['hochdeutsch -> schwaebisch'],
       >>>     'prompt': ['Guten Tag'],
       >>>     'output1': ['Griaß Gott'],
       >>>     'output2': ['Servus']
       >>> })
       >>> dataset = create_preference_dataset(input_df)
       >>> print(dataset[0])

   Raises:
       ValueError: If translation direction is invalid
   """
   preference_examples = []

   # Process each row in the DataFrame
   for index, row in df.iterrows():
       try:
           # Extract values from row
           label = row['label']
           prompt = row['prompt']
           output1 = row['output1']
           output2 = row['output2']

           # Set instruction based on translation direction
           if label == "hochdeutsch -> schwaebisch":
               instruction = ("Übersetze den hochdeutschen Text ins Schwäbische. "
                            "Achte auf eine sinnvolle und korrekte Satzbildung!")
           elif label == "schwaebisch -> hochdeutsch":
               instruction = ("Übersetze den schwäbischen Text ins Hochdeutsche. "
                            "Achte auf eine sinnvolle und korrekte Satzbildung!")
           else:
               raise ValueError(
                   "Invalid label. Must be 'hochdeutsch -> schwaebisch' "
                   "or 'schwaebisch -> hochdeutsch'."
               )

           # Create preference example
           preference_examples.append({
               "prompt": f"{instruction}: {prompt}",
               "chosen": output1,
               "rejected": output2
           })

           # Progress update for large datasets
           if (index + 1) % 100 == 0:
               print(f"Processed {index + 1}/{len(df)} examples")

       except Exception as e:
           print(f"Error processing row {index}: {e}")
           continue

   # Convert to Hugging Face Dataset format
   try:
       dataset = Dataset.from_list(preference_examples)
       print(f"\nCreated preference dataset with {len(dataset)} examples")
       return dataset

   except Exception as e:
       raise Exception(f"Error creating dataset: {str(e)}")
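
The per-row conversion can be sketched without pandas or `datasets`. The version below uses a lookup dict in place of the repeated `if`/`elif` chain and produces the same `prompt`/`chosen`/`rejected` record shape as the function above; the names `INSTRUCTIONS` and `to_preference_example` are hypothetical.

```python
# Instruction text per translation direction (strings taken from the notebook).
INSTRUCTIONS = {
    "hochdeutsch -> schwaebisch": (
        "Übersetze den hochdeutschen Text ins Schwäbische. "
        "Achte auf eine sinnvolle und korrekte Satzbildung!"
    ),
    "schwaebisch -> hochdeutsch": (
        "Übersetze den schwäbischen Text ins Hochdeutsche. "
        "Achte auf eine sinnvolle und korrekte Satzbildung!"
    ),
}


def to_preference_example(row):
    """Convert one annotated row into a DPO record.

    Assumes output1 is the preferred translation and output2 the
    rejected one, as in the notebook.
    """
    label = row["label"]
    if label not in INSTRUCTIONS:
        raise ValueError(f"Invalid label: {label!r}")
    return {
        "prompt": f"{INSTRUCTIONS[label]}: {row['prompt']}",
        "chosen": row["output1"],
        "rejected": row["output2"],
    }


example = to_preference_example({
    "label": "hochdeutsch -> schwaebisch",
    "prompt": "Guten Tag",
    "output1": "Griaß Gott",
    "output2": "Servus",
})
print(example)
```

A list of such dicts is what `Dataset.from_list` receives in the notebook function.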

In [None]:
def upload_dpo_dataset_to_hf_hub(dataset, name, private=True):
   """
   Uploads a DPO (Direct Preference Optimization) dataset to the Hugging Face Hub.

   Args:
       dataset (datasets.Dataset): Hugging Face dataset to upload
       name (str): Repository name on Hugging Face Hub (format: "username/dataset-name")
       private (bool, optional): Whether to make the dataset private. Defaults to True

   Raises:
       ValueError: If repository name format is invalid
       Exception: If upload fails

   Example:
       >>> dataset = create_preference_dataset(df)
       >>> upload_dpo_dataset_to_hf_hub(
       >>>     dataset=dataset,
       >>>     name="username/swabian-preferences",
       >>>     private=True
       >>> )

   Notes:
       - Requires Hugging Face authentication
       - Repository name should follow format "username/dataset-name"
       - Private datasets are visible only to you and to accounts you grant access
   """
   try:
       # Validate repository name format
       if "/" not in name:
           raise ValueError(
               "Invalid repository name. Must be in format 'username/dataset-name'"
           )

       print(f"Starting upload to {name}...")
       print(f"Dataset visibility: {'Private' if private else 'Public'}")

       # Upload dataset to Hugging Face Hub
       dataset.push_to_hub(
           name,
           private=private
       )

       print(f"\nSuccessfully uploaded dataset to {name}")
       print(f"Dataset contains {len(dataset)} examples")
       print(f"Access your dataset at: https://huggingface.co/datasets/{name}")

   except Exception as e:
       print(f"\nError uploading dataset: {str(e)}")
       print("\nPlease check:")
       print("- Hugging Face authentication")
       print("- Repository name format")
       print("- Internet connection")
       print("- Hub access permissions")
       raise
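
The `"/" not in name` check accepts strings such as `"/data"` or `"a/b/c"`. A slightly stricter check is sketched below; it only tests for exactly one slash with non-empty parts, and is an assumption for illustration rather than the Hub's own validation rules. The helper name `split_repo_id` is hypothetical.

```python
def split_repo_id(name: str):
    """Split 'username/dataset-name' into its two parts.

    Raises ValueError unless there is exactly one '/' with non-empty
    text on both sides.
    """
    namespace, sep, repo = name.partition("/")
    if not sep or not namespace or not repo or "/" in repo:
        raise ValueError(
            "Invalid repository name. Must be in format 'username/dataset-name'"
        )
    return namespace, repo


print(split_repo_id("Mario12355/preference_dataset_1"))

for bad in ["preference_dataset_1", "/data", "user/", "a/b/c"]:
    try:
        split_repo_id(bad)
    except ValueError as err:
        print(f"{bad!r}: {err}")
```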


In [None]:
"""
DPO Preference Dataset Processing Pipeline

This script handles the complete pipeline for creating and uploading a
preference dataset for DPO (Direct Preference Optimization) training:
1. Load human-annotated preferences
2. Format data for DPO training
3. Upload to Hugging Face Hub

Currently disabled (if False) as this is a one-time setup operation.
"""

if False:  # DPO dataset processing pipeline is disabled by default
   try:
       print("Starting DPO preference dataset processing...\n")

       # Step 1: Load preference annotations
       print("Loading preference annotations...")
       df = load_preference_df_from_csv()
       print(f"Loaded {len(df)} annotated translation pairs\n")

       # Step 2: Create formatted dataset
       print("Creating DPO training dataset...")
       dataset = create_preference_dataset(df)
       print(f"Successfully created dataset with {len(dataset)} examples\n")

       # Step 3: Upload to Hugging Face Hub
       print("Uploading dataset to Hugging Face Hub...")
       upload_dpo_dataset_to_hf_hub(
           dataset=dataset,
           name="Mario12355/preference_dataset_1"  # Repository name
       )
       print("Pipeline completed successfully!")

   except Exception as e:
       print(f"\nError in DPO dataset pipeline: {e}")
       print("\nPipeline failed. Please check the error message above.")

       # Provide specific error handling guidance
       if "authentication" in str(e).lower():
           print("Hint: Check your Hugging Face authentication token")
       elif "permission" in str(e).lower():
           print("Hint: Verify your Hugging Face account permissions")

## **DPO Training with Unsloth**

**Imports**

In [None]:
from unsloth import PatchDPOTrainer
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOConfig, DPOTrainer

In [None]:
def load_model(model_name, max_seq_length=2048, dtype=None, load_in_4bit=True):
   """
   Loads and configures a language model for fine-tuning using optimized settings.
   Implements FastLanguageModel with LoRA for efficient training.

   Args:
       model_name (str): Name/path of the pretrained model on Hugging Face Hub
       max_seq_length (int, optional): Maximum sequence length. Defaults to 2048
       dtype (torch.dtype, optional): Model precision. Defaults to None (auto-detect)
       load_in_4bit (bool, optional): Whether to use 4-bit quantization. Defaults to True

   Returns:
       tuple: (model, tokenizer)
           - model: Configured model with LoRA adapters
           - tokenizer: Associated tokenizer

   Example:
       >>> model, tokenizer = load_model(
       >>>     model_name="Mario12355/llama_3.1_20.11_fini",
       >>>     max_seq_length=2048,
       >>>     load_in_4bit=True
       >>> )

   Notes:
       LoRA Configuration:
       - rank (r): 128 for higher capacity
       - target modules: All attention and FFN layers
       - alpha: 128 for stable training
       - Optimized settings: no dropout, no bias
       - Uses unsloth gradient checkpointing for long sequences

   Hardware Requirements:
       - 4-bit quantization reduces VRAM usage significantly
       - Supports consumer GPUs with 8GB+ VRAM
       - Gradient checkpointing further reduces memory usage
   """
   try:
       print(f"Loading model: {model_name}")

       # Step 1: Load base model and tokenizer
       model, tokenizer = FastLanguageModel.from_pretrained(
           model_name=model_name,
           max_seq_length=max_seq_length,
           dtype=dtype,
           load_in_4bit=load_in_4bit,
       )
       print("Base model loaded successfully")

       # Step 2: Configure LoRA adapters
       print("Configuring LoRA adapters...")
       model = FastLanguageModel.get_peft_model(
           model,
           # LoRA hyperparameters
           r=128,                # Rank for LoRA adaptations
           lora_alpha=128,       # Scale factor for LoRA
           lora_dropout=0,       # Optimized: no dropout

           # Target all important model components
           target_modules=[
               "q_proj",        # Query projection
               "k_proj",        # Key projection
               "v_proj",        # Value projection
               "o_proj",        # Output projection
               "gate_proj",     # Gate projection
               "up_proj",       # Upscaling projection
               "down_proj",     # Downscaling projection
           ],

           # Optimization settings
           bias="none",         # Optimized: no bias
           use_gradient_checkpointing="unsloth",  # Memory optimization
           use_rslora=False,    # Standard LoRA (not rank stabilized)
           loftq_config=None,   # No LoftQ quantization
       )
       print("LoRA configuration completed")

       return model, tokenizer

   except Exception as e:
       print(f"\nError loading model: {str(e)}")
       print("\nPlease check:")
       print("- Model name/path")
       print("- GPU memory availability")
       print("- Internet connection")
       raise
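
To get a feel for what `r=128` across all seven target modules means, the sketch below counts the LoRA parameters this configuration would add. Each adapted linear layer of shape (d_in, d_out) gains two matrices, A (r x d_in) and B (d_out x r), so r * (d_in + d_out) parameters. The layer dimensions are assumptions based on the published Llama 3.1 8B configuration; the notebook does not state the model size, so adjust them for a different checkpoint.

```python
# Assumed dimensions (Llama 3.1 8B config); not stated in the notebook.
HIDDEN = 4096          # hidden size
INTERMEDIATE = 14336   # FFN intermediate size
KV_DIM = 1024          # 8 key/value heads x head dim 128 (grouped-query attention)
NUM_LAYERS = 32

# LoRA settings taken from the load_model function above.
R = 128
ALPHA = 128

# (d_in, d_out) of each targeted linear layer inside one transformer block.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, R)
                for d_in, d_out in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the LoRA update (standard LoRA, no rsLoRA)

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumed dimensions the adapters account for a few hundred million trainable parameters, which is large for LoRA. With `lora_alpha` equal to `r`, the scaling factor is 1.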

In [None]:
def load_dataset(dataset_name):
   """
   Loads a dataset from the Hugging Face Hub for DPO (Direct Preference Optimization)
   training. Includes validation and error handling.

   Args:
       dataset_name (str): Name of dataset on Hugging Face Hub
                          (format: "username/dataset-name")

   Returns:
       datasets.DatasetDict: Loaded dataset splits (must include 'train')

   Raises:
       ValueError: If dataset name format is invalid
       Exception: If dataset loading fails

   Example:
       >>> dataset = load_dataset("Mario12355/preference_dataset_1")
       >>> print(f"Loaded {len(dataset['train'])} training examples")

   Notes:
       - Requires internet connection
       - May require authentication for private datasets
       - Expected columns: prompt, chosen, rejected
   """
   try:
       # Validate dataset name format
       if "/" not in dataset_name:
           raise ValueError(
               "Invalid dataset name. Must be in format 'username/dataset-name'"
           )

       print(f"Loading dataset: {dataset_name}")

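       # Load dataset from Hugging Face Hub. The Hub loader is imported under an
       # alias because this wrapper function shadows the name `load_dataset`;
       # calling `load_dataset` here would recurse into this function.
       from datasets import load_dataset as hf_load_dataset
       dataset = hf_load_dataset(dataset_name)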

       # Validate dataset structure
       if 'train' not in dataset:
           raise ValueError("Dataset must contain a 'train' split")

       required_columns = ['prompt', 'chosen', 'rejected']
       missing_columns = [col for col in required_columns
                        if col not in dataset['train'].features]
       if missing_columns:
           raise ValueError(f"Missing required columns: {missing_columns}")

       # Print dataset statistics
       print(f"\nDataset loaded successfully:")
       print(f"- Total examples: {len(dataset['train'])}")
       print(f"- Available splits: {', '.join(dataset.keys())}")
       print(f"- Columns: {', '.join(dataset['train'].features.keys())}")

       return dataset

   except Exception as e:
       print(f"\nError loading dataset: {str(e)}")
       print("\nPlease check:")
       print("- Dataset name")
       print("- Internet connection")
       print("- Hub authentication (for private datasets)")
       print("- Dataset structure and format")
       raise

In [None]:
def create_dpo_trainer(model, tokenizer, dataset, epochs=3):
   """
   Initializes a DPO (Direct Preference Optimization) trainer with optimized
   configurations for fine-tuning language models on preference data.

   Args:
       model (PreTrainedModel): Model to be trained
       tokenizer (PreTrainedTokenizer): Associated tokenizer
       dataset (datasets.DatasetDict): Dataset splits; the 'train' split holds the preference pairs
       epochs (int, optional): Number of training epochs. Defaults to 3

   Returns:
       DPOTrainer: Configured trainer ready for preference optimization

   Example:
       >>> model, tokenizer = load_model("Mario12355/llama_3.1_20.11_fini")
       >>> dataset = load_dataset("Mario12355/preference_dataset_1")
       >>> trainer = create_dpo_trainer(model, tokenizer, dataset, epochs=3)
       >>> trainer.train()

   Notes:
       Training Configuration:
       - Uses 8-bit AdamW optimizer
       - Implements gradient accumulation
       - Automatically selects between FP16/BF16
       - Integrates with Weights & Biases
       - Beta=0.1 for preference learning

   Hardware Requirements:
       - Recommended: 16GB+ VRAM
       - Supports gradient accumulation for memory efficiency
   """
   try:
       print("Initializing DPO trainer...")

       # Apply DPO patches
       PatchDPOTrainer()

       # Create DPO trainer with optimized settings
       dpo_trainer = DPOTrainer(
           # Model configuration
           model=model,
           ref_model=None,          # None: TRL derives the reference from `model` (adapters disabled for PEFT models)
           tokenizer=tokenizer,

           # Training parameters
           beta=0.1,                # Preference learning temperature
           max_length=2048,         # Maximum sequence length
           max_prompt_length=512,   # Maximum prompt length
           train_dataset=dataset["train"],

           # Training configuration
           args=DPOConfig(
               # Batch size and accumulation
               per_device_train_batch_size=4,
               gradient_accumulation_steps=8,

               # Training schedule
               warmup_ratio=0.1,    # Gradual warmup
               num_train_epochs=epochs,

               # Optimization settings
               optim="adamw_8bit",  # Memory-efficient optimizer
               fp16=not is_bfloat16_supported(),
               bf16=is_bfloat16_supported(),

               # Training management
               logging_steps=1,
               seed=42,             # Reproducibility
               output_dir="outputs",
               report_to="wandb",   # Experiment tracking
           ),
       )

       print("\nDPO trainer initialized with settings:")
       print(f"- Epochs: {epochs}")
       print(f"- Batch size: 4 (effective batch size: {4 * 8})")
       print(f"- Training examples: {len(dataset['train'])}")
       print(f"- Using {'BF16' if is_bfloat16_supported() else 'FP16'} precision")

       return dpo_trainer

   except Exception as e:
       print(f"\nError creating DPO trainer: {str(e)}")
       print("\nPlease check:")
       print("- Model and tokenizer compatibility")
       print("- Dataset format")
       print("- GPU memory availability")
       raise
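
The `beta=0.1` setting scales the log-probability ratios between the trained model and the reference model inside the loss. As a rough illustration, the standard sigmoid form of the DPO loss for one preference pair fits in a few lines. This is a sketch only: `dpo_loss` and the log-probability values are hypothetical, and it is not the TRL implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair (sequence log-probabilities)."""
    # Implicit rewards: how much more likely each answer is under the policy
    # than under the reference model
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, loss is log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))      # 0.6931

# Policy favours the chosen answer more than the reference does: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2))   # True
```

A larger `beta` makes the same log-probability gap count for more, so the loss reacts more sharply to deviations from the reference.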

In [None]:
import time


def train_model(dpo_trainer):
   """
   Executes DPO (Direct Preference Optimization) training and handles the
   training process with comprehensive logging and error handling.

   Args:
       dpo_trainer (DPOTrainer): Configured DPO trainer instance

   Returns:
       TrainOutput: Training statistics and metrics

   Example:
       >>> trainer = create_dpo_trainer(model, tokenizer, dataset)
       >>> results = train_model(trainer)
       >>> print(f"Final loss: {results.training_loss}")

   Notes:
       - Integrates with Weights & Biases for experiment tracking
       - Saves checkpoints to the specified output directory
       - Training progress is logged at each step
       - Early stopping is not implemented by default
   """
   try:
       print("\nStarting DPO training...")
       print("Training progress will be logged to Weights & Biases")

       # Record start time for duration calculation
       start_time = time.time()

       # Execute training
       training_output = dpo_trainer.train()

       # Calculate training duration
       duration = time.time() - start_time

       # Print training summary
       print("\nTraining completed successfully!")
       print(f"Training duration: {duration/60:.2f} minutes")
       print(f"Final loss: {training_output.training_loss:.4f}")

       # Print model save location
       print(f"\nModel checkpoints saved to: {dpo_trainer.args.output_dir}")
       print("You can now use the model for inference or push it to the Hub")

       return training_output

   except Exception as e:
       print(f"\nError during training: {str(e)}")
       print("\nPossible issues:")
       print("- Out of GPU memory")
       print("- Training instability")
       print("- Data formatting problems")
       print("\nCheck the error message above and your GPU memory usage")
       raise
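
Because `logging_steps=1`, one log entry is written per optimizer step, so it helps to know roughly how many steps a run will take. The sketch below estimates this from the settings used in `create_dpo_trainer` (batch size 4, gradient accumulation 8, warmup ratio 0.1). `training_schedule` and the dataset size of 1000 are hypothetical, and the exact rounding inside the Hugging Face trainer may differ by a step.

```python
import math


def training_schedule(num_examples, per_device_batch_size=4,
                      grad_accum_steps=8, epochs=3, warmup_ratio=0.1,
                      num_devices=1):
    """Estimate (effective batch, steps per epoch, total steps, warmup steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


# Hypothetical dataset of 1000 preference pairs
print(training_schedule(1000))  # (32, 32, 96, 10)
```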

In [None]:
def push_model_to_hub(name):
   """
   Uploads both the trained model and its tokenizer to the Hugging Face Hub.

   Args:
       name (str): Repository name on Hugging Face Hub (format: "username/model-name")

   Raises:
       ValueError: If repository name format is invalid
       Exception: If upload fails

   Example:
       >>> push_model_to_hub("Mario12355/swabian-translator-v1")
       # Prints upload progress and the model URL; returns None

   Notes:
       - Uses the global `model` and `tokenizer` objects defined earlier
       - Requires Hugging Face authentication
       - Uploads both model weights and tokenizer files
       - Model size affects upload time
       - Ensure stable internet connection
   """
   try:
       # Validate repository name
       if "/" not in name:
           raise ValueError(
               "Invalid repository name. Must be in format 'username/model-name'"
           )

       print(f"Starting upload to {name}...")
       print("This may take several minutes depending on model size")

       # Upload model with progress tracking
       print("\nUploading model weights...")
       try:
           model.push_to_hub(name)
           print("Model uploaded successfully")
       except Exception as e:
           raise RuntimeError(f"Error uploading model: {e}") from e

       # Upload tokenizer
       print("\nUploading tokenizer...")
       try:
           tokenizer.push_to_hub(name)
           print("Tokenizer uploaded successfully")
       except Exception as e:
           raise RuntimeError(f"Error uploading tokenizer: {e}") from e

       print("\nUpload completed successfully!")
       print(f"Your model is now available at: https://huggingface.co/{name}")
       print("\nUsage example:")
       print("from transformers import AutoModelForCausalLM, AutoTokenizer")
       print(f"model = AutoModelForCausalLM.from_pretrained('{name}')")
       print(f"tokenizer = AutoTokenizer.from_pretrained('{name}')")

   except Exception as e:
       print(f"\nError during upload: {str(e)}")
       print("\nPlease check:")
       print("- Hugging Face authentication")
       print("- Internet connection")
       print("- Repository name format")
       print("- Write permissions")
       raise
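
The check `"/" not in name` also accepts names such as `"/model"` or `"a/b/c"`. A stricter check could look like the sketch below. The pattern is an assumption about what a sensible `username/model-name` looks like, not the Hub's official validation rule, and `is_valid_repo_id` is a hypothetical helper.

```python
import re

# Assumed pattern: exactly one "/", with non-empty parts made of
# letters, digits, ".", "_" or "-". The Hub's real rules may be stricter.
REPO_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]+/[A-Za-z0-9._-]+")


def is_valid_repo_id(name):
    """Return True if `name` looks like 'username/model-name'."""
    return REPO_ID_PATTERN.fullmatch(name) is not None


print(is_valid_repo_id("Mario12355/swabian-translator-v1"))  # True
print(is_valid_repo_id("swabian-translator"))                # False
print(is_valid_repo_id("a/b/c"))                             # False
```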

In [None]:
"""
DPO Training Pipeline Script

This script executes the complete DPO (Direct Preference Optimization) training pipeline:
1. Load the pre-trained model and preference dataset
2. Configure and initialize the DPO trainer
3. Execute training
4. Upload the trained model to Hugging Face Hub

Disabled by default (`if False`): training is a one-time operation that consumes
significant GPU time, so enable it deliberately by changing the condition to `if True`.
"""

if False:  # Training pipeline is disabled by default
   try:
       # Step 1: Load model, tokenizer, and dataset
       print("Initializing training components...")

       print("\nLoading model and tokenizer...")
       model, tokenizer = load_model(
           model_name="Mario12355/llama_3.1_20.11_fini"
       )

       print("\nLoading preference dataset...")
       dataset = load_dataset(
           dataset_name="Mario12355/preference_dataset_1"
       )

       # Step 2: Create DPO trainer
       print("\nConfiguring DPO trainer...")
       dpo_trainer = create_dpo_trainer(
           model=model,
           tokenizer=tokenizer,
           dataset=dataset
       )

       # Step 3: Execute training
       print("\nStarting DPO training...")
       train_model(dpo_trainer=dpo_trainer)

       # Step 4: Upload trained model
       print("\nUploading trained model to Hugging Face Hub...")
       push_model_to_hub(
           name="Mario12355/swabian_german_translator"
       )

       print("\nDPO training pipeline completed successfully!")

   except Exception as e:
       print(f"\nError in training pipeline: {str(e)}")
       print("\nPipeline failed. Please check:")
       print("- GPU availability and memory")
       print("- Dataset integrity")
       print("- Internet connection")
       print("- Hub permissions")

       # Ensure W&B run is properly closed even if training fails
       try:
           import wandb
           if wandb.run is not None:
               wandb.finish()
       except Exception:
           pass  # Cleanup is best-effort; the original error is re-raised below

       raise

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!




ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
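
The `ValueError` above is raised when the quantized weights do not fit on the GPU. A rough estimate of the weight footprint helps to judge whether a model can fit at all. The sketch below assumes an 8-billion-parameter model (an assumption based on the Llama 3.1 name; the notebook does not state the size) and counts the weights only, ignoring activations, adapters, optimizer state and the CUDA context. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


ASSUMED_PARAMS = 8e9  # assumption: 8B parameters, not stated in the notebook

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(ASSUMED_PARAMS, bits):.1f} GB")
```

Measured against the 14.748 GB reported for the Tesla T4 above, 16-bit weights alone would not fit under this assumption, while 4-bit weights would leave headroom. If the error still appears with 4-bit loading, memory held by other objects in the same session is one possible cause; restarting the runtime before loading may help.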

# **Inference**
-> load finetuned model from HF-hub<br>
-> generate a translation with token-by-token streaming

**Imports**

In [None]:
from unsloth import FastLanguageModel
from transformers import TextStreamer  # used by generate_answer_with_stream
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
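
The three `{}` placeholders are filled positionally with the instruction, the input text and the response. For inference the response slot stays empty so that the model continues after `### Response:`. A minimal sketch of this, where `build_prompt` and the example sentence are hypothetical and the template is repeated only to keep the block self-contained:

```python
# Same template as above, repeated so the example runs on its own
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def build_prompt(instruction, text, response=""):
    """Fill the template; leave `response` empty for inference."""
    return ALPACA_PROMPT.format(instruction, text, response)


prompt = build_prompt(
    "Übersetze den schwäbischen Text ins Hochdeutsche.",
    "I han koi Zeit.",  # made-up example sentence
)
print(prompt)
```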

**Load Model**

In [None]:
def load_model(model_name, max_seq_length=2048, dtype=None, load_in_4bit=True):
   """
   Loads a pre-trained language model and its tokenizer with optimized settings
   for inference.

   Args:
       model_name (str): Name/path of the model on Hugging Face Hub
       max_seq_length (int, optional): Maximum sequence length. Defaults to 2048
       dtype (torch.dtype, optional): Model precision. Defaults to None (auto-detect)
       load_in_4bit (bool, optional): Whether to use 4-bit quantization. Defaults to True

   Returns:
       tuple: (model, tokenizer)
           - model: Loaded language model
           - tokenizer: Associated tokenizer

   Example:
       >>> model, tokenizer = load_model(
       >>>     model_name="Mario12355/swabian-translator",
       >>>     load_in_4bit=True
       >>> )

   Notes:
       - 4-bit quantization reduces memory usage significantly
       - Auto-detects optimal precision settings
       - Suitable for inference on consumer GPUs
   """
   try:
       print(f"Loading model: {model_name}")

       # Load model and tokenizer with specified settings
       model, tokenizer = FastLanguageModel.from_pretrained(
           model_name=model_name,
           max_seq_length=max_seq_length,  # Context length
           dtype=dtype,                    # Precision setting
           load_in_4bit=load_in_4bit,     # Quantization
       )

       print("Model loaded successfully")
       print(f"Max sequence length: {max_seq_length}")
       print(f"4-bit quantization: {'enabled' if load_in_4bit else 'disabled'}")

       return model, tokenizer

   except Exception as e:
       print(f"\nError loading model: {str(e)}")
       print("\nPlease check:")
       print("- Model name/path")
       print("- GPU memory availability")
       print("- Internet connection")
       raise

In [None]:
def generate_answer_with_stream(direction, input):
   """
   Generates a streaming translation between Standard German and Swabian dialect.
   Outputs the translation token by token in real-time.

   Args:
       direction (str): Translation direction, either:
           - 'hochdeutsch_to_schwaebisch': Standard German to Swabian
           - 'schwaebisch_to_hochdeutsch': Swabian to Standard German
       input (str): Text to translate

   Raises:
       ValueError: If an invalid translation direction is specified

   Example:
       >>> generate_answer_with_stream(
       >>>     direction="hochdeutsch_to_schwaebisch",
       >>>     input="Guten Tag, wie geht es dir?"
       >>> )
       # Streams: "Griaß Gott, wie goht's dir?"
   """
   try:
       # Set instruction based on translation direction
       if direction == "hochdeutsch_to_schwaebisch":
           instruction = ("Übersetze den hochdeutschen Text ins Schwäbische. "
                        "Achte auf eine sinnvolle und korrekte Satzbildung!")
       elif direction == "schwaebisch_to_hochdeutsch":
           instruction = ("Übersetze den schwäbischen Text ins Hochdeutsche. "
                        "Achte auf eine sinnvolle und korrekte Satzbildung!")
       else:
           raise ValueError(
               "Invalid direction. Must be 'hochdeutsch_to_schwaebisch' "
               "or 'schwaebisch_to_hochdeutsch'."
           )

       # Prepare input using Alpaca prompt template
       inputs = tokenizer(
           [
               alpaca_prompt.format(
                   instruction,  # Translation instruction
                   input,       # Text to translate
                   "",         # Empty output for generation
               )
           ],
           return_tensors="pt"
       ).to("cuda")

       # Initialize streamer and generate translation
       text_streamer = TextStreamer(tokenizer)
       _ = model.generate(
           **inputs,
           streamer=text_streamer,
           max_new_tokens=128  # Limit output length
       )

   except Exception as e:
       print(f"\nError during translation: {str(e)}")
       print("\nPlease check:")
       print("- Input text format")
       print("- GPU availability")
       print("- Model loading status")
       raise

In [None]:
"""
SwabianGPT Inference Script

This script loads the DPO-trained model and demonstrates real-time translation
from Swabian to Standard German with token-by-token streaming output.
"""

if True:  # Model loading section
   try:
       print("Initializing SwabianGPT translator...")

       # Load model and tokenizer
       model, tokenizer = load_model(
           model_name="Mario12355/swabian_german_translator"
       )

       # Enable optimized inference
       FastLanguageModel.for_inference(model)
       print("Model loaded and optimized for inference")

   except Exception as e:
       print(f"\nError initializing model: {str(e)}")
       print("Please check your GPU availability and internet connection")
       raise

# Translation demonstration
if True:
   try:
       print("\nTranslating from Swabian to Standard German:")
       print("Input: Oinr isch emmer dr Arsch, ond er woiß id mol warom.")
       print("Translation:")

       generate_answer_with_stream(
           direction="schwaebisch_to_hochdeutsch",
           input="Oinr isch emmer dr Arsch, ond er woiß id mol warom."
       )

   except Exception as e:
       print(f"\nError during translation: {str(e)}")
       print("Please check the model initialization status")
       raise

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA L4. Max memory: 22.168 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.9. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Übersetze den schwäbischen Text ins Hochdeutsche. Achte auf eine sinnvolle und korrekte Satzbildung!

### Input:
Oinr isch emmer dr Arsch, ond er woiß id mol warom.

### Response:
Einer ist immer der Arsch, und er weiß nicht mal warum.<|end_of_text|>
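
The streamed output above repeats the whole prompt before the translation, because the streamer prints everything the model sees and generates. If only the translation is needed, the decoded text can be trimmed afterwards. The sketch below is one way to do that under the assumption that the Alpaca template and the special tokens shown above are used; `extract_response` is a hypothetical helper. For streaming, passing `skip_prompt=True` to `TextStreamer` should hide the prompt as well.

```python
RESPONSE_MARKER = "### Response:"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")


def extract_response(decoded_text):
    """Return only the text after the last response marker, without special tokens."""
    # rpartition keeps the whole text as the tail if the marker is missing
    _, _, tail = decoded_text.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        tail = tail.replace(token, "")
    return tail.strip()


decoded = (
    "<|begin_of_text|>Below is an instruction ...\n\n"
    "### Input:\nOinr isch emmer dr Arsch, ond er woiß id mol warom.\n\n"
    "### Response:\nEiner ist immer der Arsch, und er weiß nicht mal warum.<|end_of_text|>"
)
print(extract_response(decoded))
# Einer ist immer der Arsch, und er weiß nicht mal warum.
```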


In [None]:
"""
SwabianGPT Demo Script

This script demonstrates the real-time translation capabilities
of SwabianGPT from Swabian dialect to Standard German.
"""

if True:
   try:
       print("\nSwabianGPT Translation Demo")
       print("-" * 40)

       # Input text to translate
       input_text = "Oinr isch immer der Arsch, und er woiß id mol warum."

       print("Input (Swabian):")
       print(f"{input_text}\n")
       print("Translation (Standard German):")

       # Generate streaming translation
       generate_answer_with_stream(
           direction="schwaebisch_to_hochdeutsch",
           input=input_text
       )

       print("\n" + "-" * 40)

   except Exception as e:
       print(f"\nTranslation error: {str(e)}")
       print("\nPlease verify:")
       print("- Model is properly loaded")
       print("- Input text formatting")
       print("- GPU availability")
       raise

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Übersetze den schwäbischen Text ins Hochdeutsche. Achte auf eine sinnvolle und korrekte Satzbildung!

### Input:
Oinr isch immer der Arsch, und er woiß id mol warum.

### Response:
Ein Mann ist immer der Arsch, und er weiß nicht mal warum.<|end_of_text|>


In [None]:
"""
SwabianGPT Model Loading Script

This script loads a fine-tuned and DPO-optimized language model
for Swabian dialect translation using PEFT (Parameter-Efficient Fine-Tuning).
"""

try:
   # Import required libraries
   from peft import AutoPeftModelForCausalLM
   from transformers import AutoTokenizer

   print("Loading SwabianGPT model and tokenizer...")

   # Load the fine-tuned model with PEFT adaptations
   model = AutoPeftModelForCausalLM.from_pretrained(
       "Mario12355/swabian_german_translator",  # Model repository name
       device_map="auto",                       # Automatic device placement
       torch_dtype="auto"                       # Automatic precision selection
   )
   print("Model loaded successfully")

   # Load the associated tokenizer
   tokenizer = AutoTokenizer.from_pretrained(
       "Mario12355/llama_3.1_20.11_fini_dpo"   # Tokenizer repository (differs from the model repository above)
   )
   print("Tokenizer loaded successfully")

   print("\nModel is ready for inference!")
   print("Use model.generate() for text generation")
   print("Use tokenizer.encode() for input preprocessing")

except Exception as e:
   print(f"\nError loading model: {str(e)}")
   print("\nPlease check:")
   print("- Internet connection")
   print("- Model repository accessibility")
   print("- GPU/CPU availability")
   print("- Required library installations")
   raise

`low_cpu_mem_usage` was None, now default to True since model is quantized.

