<a href="https://colab.research.google.com/github/kyrillosishak/GermanNumberSimplifying/blob/main/ML_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This approach uses a two-phase strategy to build a model that simplifies numbers in German text (for example, converting "324.620,22 Euro" to "etwa 325.000 Euro").
- Phase 1: Synthetic Data Generation
Instead of writing training examples by hand, the approach uses LLaMA 3 (a large language model) to generate thousands of diverse training examples. The model is shown a few examples of how numbers should be simplified and is then asked to generate many more in the same style. This helps because:

  - It saves a great deal of time compared to manual data creation
  - LLaMA 3 generates natural-sounding German sentences that fit their context
  - It can produce diverse examples covering many number formats and contexts

- Phase 2: Training
Once this large dataset exists, it is used to train a much smaller, specialized model (mT5) that focuses solely on number simplification. Think of it like having a master craftsman (LLaMA 3) teach a specific skill to an apprentice (mT5). The smaller model:

  - Is faster and more efficient than LLaMA 3
  - Specializes in a single task
  - Can run on less powerful hardware

This two-phase approach combines the best of both worlds:

1. Uses a powerful model for data generation
2. Uses an efficient model for the actual task
3. Results in a practical, focused tool for number simplification

It's like using a factory (LLaMA 3) to create training materials, then using those materials to train a specialized worker (mT5) who becomes very good at one specific job.
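
To make the target transformation concrete, here is a small rule-based sketch that rounds a German-formatted number and prefixes it with "etwa". It only illustrates the kind of output the trained model is meant to produce and is not part of the pipeline. The function names and the `digits` parameter are assumptions made for this example.

```python
from decimal import Decimal, ROUND_HALF_UP


def parse_german_number(text: str) -> Decimal:
    """Parse a German-formatted number: '.' groups thousands, ',' marks decimals."""
    return Decimal(text.replace(".", "").replace(",", "."))


def round_significant(value: Decimal, digits: int) -> int:
    """Round to `digits` significant digits, but never finer than whole units."""
    if value == 0:
        return 0
    # Position of the last digit to keep, clamped so the result is an integer
    exponent = max(value.adjusted() - (digits - 1), 0)
    quantum = Decimal(1).scaleb(exponent)
    return int(value.quantize(quantum, rounding=ROUND_HALF_UP))


def format_german_int(number: int) -> str:
    """Format an integer with '.' as the thousands separator."""
    return f"{number:,}".replace(",", ".")


def simplify_number(text: str, digits: int = 1) -> str:
    """Hypothetical helper: return 'etwa <rounded number>' for a German number string."""
    rounded = round_significant(parse_german_number(text), digits)
    return f"etwa {format_german_int(rounded)}"


print(simplify_number("324.620,22", digits=3))  # etwa 325.000
print(simplify_number("1.897"))                 # etwa 2.000
print(simplify_number("38,7", digits=2))        # etwa 39
```

A fixed rule like this cannot decide how many digits to keep or when to switch to a phrase such as "fast alle", which is the part the learned model is expected to handle.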

# Phase 1 : Synthetic Data Generation

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U accelerate


In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)

In [None]:
!huggingface-cli login


    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 

In [None]:
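model_name = "meta-llama/Llama-3.1-8B-Instruct"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# The Llama 3 tokenizer defines no unk token, so reuse the EOS token for padding
# (assigning pad_token also sets pad_token_id)
tokenizer.pad_token = tokenizer.eos_token
# Left padding keeps the prompt adjacent to the generated tokens
tokenizer.padding_side = 'left'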

In [None]:
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0}
)
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
from transformers import GenerationConfig
import re
import json
from tqdm import tqdm

# Ensure pad_token_id is set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Define the generation function
def generate(instruction):
    # The tokenizer prepends <|begin_of_text|> itself, so the prompt is passed as plain text.
    # Appending <|end_of_text|> would mark the text as finished before generation starts.
    inputs = tokenizer(instruction, return_tensors="pt", padding=True, truncation=True)
    input_ids = inputs["input_ids"].to(model.device)
    attention_mask = inputs["attention_mask"].to(model.device)  # Explicitly pass attention_mask

    # Define generation configuration
    generation_config = GenerationConfig(
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,  # Without sampling, temperature/top_p/top_k are ignored and every call returns the same text
        temperature=2.0,
        top_p=1.0,
        top_k=50,
        num_beams=1,
        return_legacy_cache=True  # Maintain legacy behavior if desired
    )

    # Generate output
    generation_output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256
    )

    # Decode only the newly generated tokens; otherwise the few-shot examples
    # in the prompt would be extracted again as "new" pairs on every call
    prompt_length = input_ids.shape[1]
    outputs = []
    for seq in generation_output.sequences:
        outputs.append(tokenizer.decode(seq[prompt_length:], skip_special_tokens=True))
    return outputs
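
The `temperature` and `top_k` settings in the generation configuration only take effect when sampling is enabled. As a rough plain-Python sketch of what they do to a next-token distribution (the logits below are made-up values, not model output, and the helper names are assumptions for this example):

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf (ties may keep more)."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [4.0, 2.0, 1.0, 0.5, -1.0]  # hypothetical scores for five tokens
kept = top_k_filter(toy_logits, k=3)

for temperature in (1.0, 2.0):
    probs = softmax_with_temperature(kept, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At temperature 2.0 the most likely token loses probability mass to the other candidates that survived the top-k cut, which makes repeated calls produce more varied sentences at the cost of more noise.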

In [None]:
# Example data and template
examples = """
These are examples of sentences with numbers and their simplified versions:

Input: 324.620,22 Euro wurden gespendet.
Output: etwa 325.000 Euro wurden gespendet.

Input: 1.897 Menschen nahmen teil.
Output: etwa 2.000 Menschen nahmen teil.

Input: 25 Prozent der Bevölkerung sind betroffen.
Output: jeder Vierte der Bevölkerung ist betroffen.

Input: 90 Prozent stimmten zu.
Output: fast alle stimmten zu.

Input: 14 Prozent lehnten ab.
Output: wenige lehnten ab.

Input: Bei 38,7 Grad Celsius ist es sehr heiß.
Output: Bei etwa 39 Grad Celsius ist es sehr heiß.

Input: denn die Rente steigt um 4,57 Prozent.
Output: denn die Rente steigt um wenige.

Input: Im Jahr 2024 gab es 1.234 Ereignisse.
Output: Im Jahr 2024 gab es etwa 1.000 Ereignisse.

Input: Am 1. Januar 2024 waren es 5.678 Teilnehmer.
Output: Am 1. Januar 2024 waren es etwa 6.000 Teilnehmer.

Input: Im Jahr 2025 gab es 2018 Ereignisse.
Output: Im Jahr 2025 gab es etwa 2000 Ereignisse.
"""

prompt_template_generate = f"""
{examples}

Generate 100 new German sentences that contain a number, each followed by its simplified version:
Input:
Output:
"""

In [None]:
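# Extract Input and Output pairs from the generated text
def extract_pairs(text):
    # The trailing \n requires a complete Output line, so a pair that was cut off
    # when generation hit max_new_tokens is skipped instead of being stored truncated
    input_output_pattern = re.compile(r"Input: (.+?)\nOutput: (.+?)\n")
    matches = input_output_pattern.findall(text)
    return [{"input": match[0].strip(), "output": match[1].strip()} for match in matches]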

# Main function to run and collect data
def collect_data(num_samples, prompt_template, batch_size=100):
    collected_data = []

    with tqdm(total=num_samples) as pbar:
        while len(collected_data) < num_samples:
            outputs = generate(prompt_template)
            for output in outputs:
                pairs = extract_pairs(output)
                collected_data.extend(pairs)
                pbar.update(len(pairs)) #update progress bar
                if len(collected_data) >= num_samples:
                    break

    return collected_data[:num_samples]
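
Sampled output from a language model tends to contain repeats and malformed pairs, so a light cleaning pass before saving can be useful. A minimal sketch, assuming the pair format returned by `extract_pairs`; the function name `clean_pairs` and the specific filter rules are assumptions, not part of the original pipeline:

```python
import re


def clean_pairs(pairs):
    """Drop pairs without a digit in the input, unchanged pairs, and duplicate inputs."""
    seen_inputs = set()
    cleaned = []
    for pair in pairs:
        source = pair["input"].strip()
        target = pair["output"].strip()
        if not re.search(r"\d", source):
            continue  # nothing to simplify
        if source == target:
            continue  # the model copied the input unchanged
        if source in seen_inputs:
            continue  # keep only the first occurrence of each input
        seen_inputs.add(source)
        cleaned.append({"input": source, "output": target})
    return cleaned


raw = [
    {"input": "1.897 Menschen nahmen teil.", "output": "etwa 2.000 Menschen nahmen teil."},
    {"input": "1.897 Menschen nahmen teil.", "output": "etwa 2.000 Menschen nahmen teil."},
    {"input": "Viele Menschen nahmen teil.", "output": "Viele Menschen nahmen teil."},
]
print(clean_pairs(raw))
```

Dropping unchanged pairs is a judgment call: if the dataset should also teach the model to leave some numbers alone, that rule would need to be relaxed.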

In [None]:
# Collect 10,000 examples
data = collect_data(num_samples=10000, prompt_template=prompt_template_generate)

# Save the data to a JSON file
with open("simplified_numbers_data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print(f"Collected {len(data)} input-output pairs and saved to 'simplified_numbers_data.json'")

# Example of generated data
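
One way to look at a few generated pairs is to read them back from the saved JSON file. This is a small sketch; `preview_pairs` is a hypothetical helper, and it assumes the file layout written by `json.dump` above (a list of objects with `input` and `output` keys).

```python
import json


def preview_pairs(path, limit=5):
    """Print and return the first `limit` input/output pairs stored in a JSON file."""
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)
    for pair in pairs[:limit]:
        print(f"Input:  {pair['input']}")
        print(f"Output: {pair['output']}")
        print()
    return pairs[:limit]
```

Calling `preview_pairs("simplified_numbers_data.json")` after the collection step prints the first five pairs.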

# Phase 2: Training

Let's now train an encoder-decoder model (mT5).

In [None]:
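import pandas as pd
import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated; note the PyTorch default weight_decay is 0.01
from torch.utils.data import Dataset, DataLoader
from transformers import (
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
)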

In [None]:
# Create a custom dataset class
class NumberToTextDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_length=128):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        target = str(self.targets[idx])

        # Prepare input
        inputs = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors="pt"
        )

        # Prepare target
        targets = self.tokenizer(
            target,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors="pt"
        )

        # Replace padding ids in the labels with -100 so the loss ignores padded positions
        labels = targets['input_ids'].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'labels': labels
        }

In [None]:
def train_model():
    # Initialize tokenizer and model
    model_name = "google/mt5-small"  # You can also use "google/mt5-base" or larger models
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

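    # Load the synthetic pairs saved in Phase 1 and create the dataset
    df = pd.read_json("simplified_numbers_data.json")
    dataset = NumberToTextDataset(
        df['input'].tolist(),
        df['output'].tolist(),
        tokenizer
    )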

    # Create dataloader
    dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

    # Training settings
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=5e-5)
    num_epochs = 50
    num_training_steps = num_epochs * len(dataloader)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps
    )

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss
            total_loss += loss.item()

            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

    # Save the model
    model.save_pretrained("number_to_text_model")
    tokenizer.save_pretrained("number_to_text_model")

    return model, tokenizer
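
The scheduler used in the training loop scales the learning rate linearly from its base value down to zero over `num_training_steps`. A plain-Python sketch of that multiplier, assuming the full 10,000 pairs are used at `batch_size=2` (so 5,000 steps per epoch); `linear_schedule_factor` is a name chosen for this example and mirrors, rather than calls, the library function:

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
num_training_steps = 50 * 5000  # assumption: 50 epochs x (10,000 pairs / batch size 2)

for step in (0, num_training_steps // 2, num_training_steps):
    factor = linear_schedule_factor(step, 0, num_training_steps)
    print(f"step {step}: lr = {base_lr * factor:.2e}")
```

With `num_warmup_steps=0` the first step already runs at the full learning rate, which is then halved by the midpoint of training.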

In [None]:
# Function to test the model
def test_model(model, tokenizer, text):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # Disable dropout for inference

    inputs = tokenizer(
        text,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    ).to(device)

    # No gradients are needed at inference time
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=128,
            num_beams=4,
            early_stopping=True
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
model, tokenizer = train_model()

# Test the model
test_input = "324.620,22 Euro wurden gespendet."
prediction = test_model(model, tokenizer, test_input)
print(f"Input: {test_input}")
print(f"Prediction: {prediction}")
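
A single test sentence says little about overall quality. One simple option is to measure exact-match accuracy on pairs that were held out from training. The sketch below assumes such a held-out list exists; `exact_match_accuracy` and the stub predictor are hypothetical and only stand in for the trained model.

```python
def exact_match_accuracy(pairs, predict):
    """Share of pairs for which predict(input) equals the reference output exactly."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for pair in pairs
        if predict(pair["input"]).strip() == pair["output"].strip()
    )
    return hits / len(pairs)


held_out = [
    {"input": "1.897 Menschen nahmen teil.", "output": "etwa 2.000 Menschen nahmen teil."},
    {"input": "90 Prozent stimmten zu.", "output": "fast alle stimmten zu."},
]


def stub_predict(text):
    """Stand-in predictor that handles only one pattern, for demonstration."""
    return text.replace("1.897", "etwa 2.000")


print(exact_match_accuracy(held_out, stub_predict))
```

With the trained model, the predictor would be `lambda text: test_model(model, tokenizer, text)`. Exact match is strict for free-form text, since a valid alternative phrasing counts as a miss, so it is best read as a lower bound.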