# WMT16 English-German with FLAN-T5
Transfer Learning for Enhanced Translation Performance

## Install, import and setup

In [1]:
%pip install transformers torch accelerate nltk rouge-score



In [2]:
%pip install --upgrade datasets fsspec



In [3]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration
from datasets import load_dataset
import torch.nn as nn
from tqdm import tqdm
import numpy as np
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Download required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [5]:
# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")

Using device: cuda
GPU: Tesla T4


## Load and Explore the WMT16 English-German Dataset

In [6]:
try:
    dataset = load_dataset('wmt16', 'de-en', trust_remote_code=True)
    print("Successfully loaded WMT16 de-en dataset")
except Exception:
    # Fall back to WMT14 if WMT16 cannot be loaded
    # (a bare except would also swallow KeyboardInterrupt and SystemExit)
    print("WMT16 not available, using WMT14 as fallback")
    dataset = load_dataset('wmt14', 'de-en', trust_remote_code=True)

print(f"Dataset structure: {dataset}")
print(f"Training samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['validation'])}")

# Display sample data
sample = dataset['train'][0]
print(f"\nSample translation pair:")
print(f"English: {sample['translation']['en']}")
print(f"German: {sample['translation']['de']}")

Successfully loaded WMT16 de-en dataset
Dataset structure: DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4548885
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 2169
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2999
    })
})
Training samples: 4548885
Validation samples: 2169

Sample translation pair:
English: Resumption of the session
German: Wiederaufnahme der Sitzungsperiode


## Load Pre-trained FLAN-T5 Model and Tokenizer

In [7]:
# Load the tokenizer and model
model_name = "google/flan-t5-base"
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = T5ForConditionalGeneration.from_pretrained(model_name)
model.to(device)


print(f"✅ Model loaded successfully")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary size: {tokenizer.vocab_size}")

Loading model: google/flan-t5-base



✅ Model loaded successfully
Model parameters: 247,577,856
Vocabulary size: 32100


## Data Preprocessing and Preparation

In [10]:
def preprocess_function(examples):
    # Extract English and German texts from the translation pairs
    english_texts = [translation['en'] for translation in examples['translation']]
    german_texts = [translation['de'] for translation in examples['translation']]

    # Create input prompts for translation
    inputs = [f"Translate from English to German: {en}" for en in english_texts]
    targets = german_texts

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=128,
        truncation=True,
        padding="max_length"
    )

    # Tokenize targets; the text_target argument replaces the deprecated
    # as_target_tokenizer() context manager
    labels = tokenizer(
        text_target=targets,
        max_length=128,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
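
To make the effect of `padding="max_length"` concrete, the following pure-Python sketch shows what happens to a single tokenized target and how the padded positions are later excluded from the loss by the `-100` replacement in the training loop. The token IDs are made up for illustration, and `pad_to_max_length` and `mask_padding` are hypothetical helpers, not part of the `transformers` API. The pad ID of 0 and EOS ID of 1 match the T5 tokenizer.

```python
PAD_ID = 0           # pad token id used by the T5 tokenizer
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def pad_to_max_length(token_ids, max_length, pad_id=PAD_ID):
    """Truncate to max_length, right-pad with pad_id, and build the attention mask.

    Simplified: the real tokenizer keeps the EOS token when it truncates.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


target = [644, 4598, 229, 1]  # made-up ids ending in EOS (id 1)
labels, attention = pad_to_max_length(target, max_length=8)

print(labels)                # [644, 4598, 229, 1, 0, 0, 0, 0]
print(attention)             # [1, 1, 1, 1, 0, 0, 0, 0]
print(mask_padding(labels))  # [644, 4598, 229, 1, -100, -100, -100, -100]
```

Because every example is padded to 128 tokens, short sentences carry mostly padding; padding per batch instead would reduce wasted computation, at the cost of a custom collate step.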

In [19]:
train_size = 100000
val_size = 10000

print(f"Selecting {train_size} training samples and {val_size} validation samples")

train_dataset = dataset['train'].select(range(min(train_size, len(dataset['train']))))
val_dataset = dataset['validation'].select(range(min(val_size, len(dataset['validation']))))

print("Preprocessing datasets...")
train_dataset = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(preprocess_function, batched=True, remove_columns=val_dataset.column_names)

print("Data preprocessing completed")

Selecting 100000 training samples and 10000 validation samples
Preprocessing datasets...


Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2169 [00:00<?, ? examples/s]

Data preprocessing completed


## Create PyTorch DataLoaders

In [20]:
class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        return {
            'input_ids': torch.tensor(item['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(item['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(item['labels'], dtype=torch.long)
        }

# Create PyTorch datasets and dataloaders
train_torch_dataset = TranslationDataset(train_dataset)
val_torch_dataset = TranslationDataset(val_dataset)

batch_size = 8
train_dataloader = DataLoader(train_torch_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_torch_dataset, batch_size=batch_size, shuffle=False)

print(f"DataLoaders created with batch size: {batch_size}")

DataLoaders created with batch size: 8
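
As a quick sanity check on the DataLoader sizes, the number of batches per epoch can be worked out by hand. This sketch assumes the default `drop_last=False`, so a final partial batch is kept. `batches_per_epoch` is a hypothetical helper. The validation split only has 2,169 rows, so the `min()` in the selection step caps the requested 10,000 at 2,169.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


train_batches = batches_per_epoch(100_000, 8)
val_batches = batches_per_epoch(min(10_000, 2_169), 8)

print(train_batches)  # 12500
print(val_batches)    # 272
```

With 12,500 training steps per epoch, a 1,000-step warmup covers the first 8% of the first epoch.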


## Model Configuration and Transfer Learning Setup

In [21]:
print("Model architecture:")
total_params = 0
trainable_params = 0

for name, param in model.named_parameters():
    total_params += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters before freezing: {trainable_params:,}")

Model architecture:
Total parameters: 247,577,856
Trainable parameters before freezing: 162,623,616


### Freeze encoder layers for transfer learning

In [22]:
print("\nApplying transfer learning strategy...")
print("Freezing encoder layers to preserve pre-trained knowledge...")

for name, param in model.named_parameters():
    if 'encoder' in name:
        param.requires_grad = False

# Count trainable parameters after freezing
trainable_params_after = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params_after

print(f"Parameters frozen: {frozen_params:,}")
print(f"Trainable parameters after freezing: {trainable_params_after:,}")
print(f"Reduction in trainable parameters: {(1 - trainable_params_after/trainable_params)*100:.1f}%")


Applying transfer learning strategy...
Freezing encoder layers to preserve pre-trained knowledge...
Parameters frozen: 84,954,240
Trainable parameters after freezing: 162,623,616
Reduction in trainable parameters: 0.0%
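
The 0.0% figure above deserves a second look: the "before" count printed earlier (162,623,616) is already equal to the "after" count, which suggests the counting cell was re-run after the encoder had been frozen. The arithmetic below is a sketch that recomputes the reduction from the printed totals, under the assumption that every parameter was trainable before freezing.

```python
# Values copied from the notebook output
total_params = 247_577_856
frozen_params = 84_954_240

trainable_after = total_params - frozen_params

# Assumption: all parameters were trainable before the encoder was frozen
trainable_before = total_params
reduction_pct = (1 - trainable_after / trainable_before) * 100

print(f"Trainable after freezing: {trainable_after:,}")  # 162,623,616
print(f"Reduction: {reduction_pct:.1f}%")                # 34.3%
```

Under that assumption, freezing the encoder removes roughly a third of the trainable parameters.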


## Training Setup and Implementation

In [23]:
learning_rate = 3e-4
epochs = 3
warmup_steps = 1000

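# Only pass parameters that still require gradients (the frozen encoder is skipped)
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=learning_rate)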
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=warmup_steps)

print(f"Learning rate: {learning_rate}")
print(f"Training epochs: {epochs}")
print(f"Warmup steps: {warmup_steps}")

def train_epoch(model, dataloader, optimizer, scheduler, device, epoch):
    """Train the model for one epoch"""
    model.train()
    total_loss = 0
    num_batches = len(dataloader)

    progress_bar = tqdm(dataloader, desc=f"Training Epoch {epoch}")

    for batch_idx, batch in enumerate(progress_bar):
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Replace padding tokens with -100 for loss calculation
        labels[labels == tokenizer.pad_token_id] = -100

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss

        # Backward pass
        optimizer.zero_grad()
        loss.backward()

        # Gradient clipping for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()

        # Learning rate scheduling: LinearLR holds the final rate once
        # total_iters is reached, so stepping on every batch is safe
        scheduler.step()

        total_loss += loss.item()

        # Update progress bar
        current_lr = optimizer.param_groups[0]['lr']
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'avg_loss': f'{total_loss/(batch_idx+1):.4f}',
            'lr': f'{current_lr:.2e}'
        })

    return total_loss / num_batches

def evaluate_model(model, dataloader, device):
    """Evaluate the model"""
    model.eval()
    total_loss = 0

    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Evaluating")
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            labels[labels == tokenizer.pad_token_id] = -100

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            total_loss += outputs.loss.item()
            progress_bar.set_postfix({'loss': f'{outputs.loss.item():.4f}'})

    return total_loss / len(dataloader)

Learning rate: 0.0003
Training epochs: 3
Warmup steps: 1000
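
The warmup configured above can be written in closed form. With `start_factor=0.1` and the default end factor of 1.0, `LinearLR` scales the base learning rate linearly from 10% to 100% over `total_iters` steps and holds it there afterwards. The sketch below reproduces that schedule in plain Python; `warmup_lr` is a hypothetical helper for illustration, not a PyTorch function.

```python
def warmup_lr(step, base_lr=3e-4, start_factor=0.1, total_iters=1000):
    """Learning rate after `step` scheduler steps of a linear warmup."""
    progress = min(step, total_iters) / total_iters
    return base_lr * (start_factor + (1.0 - start_factor) * progress)


for step in (0, 500, 1000, 12500):
    print(f"step {step:>5}: lr = {warmup_lr(step):.2e}")
# step     0: lr = 3.00e-05
# step   500: lr = 1.65e-04
# step  1000: lr = 3.00e-04
# step 12500: lr = 3.00e-04
```

Note that this schedule only warms up; there is no decay afterwards, so the rate stays at 3e-4 for the rest of training.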


## Model Training

In [None]:
best_val_loss = float('inf')
training_history = {'train_loss': [], 'val_loss': []}

for epoch in range(epochs):
    print(f"\nEpoch {epoch + 1}/{epochs}")
    print("-" * 50)

    # Training
    train_loss = train_epoch(model, train_dataloader, optimizer, scheduler, device, epoch + 1)
    print(f"Training Loss: {train_loss:.4f}")

    # Validation
    val_loss = evaluate_model(model, val_dataloader, device)
    print(f"Validation Loss: {val_loss:.4f}")

    # Save training history
    training_history['train_loss'].append(train_loss)
    training_history['val_loss'].append(val_loss)

    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        print(f"New best model! Validation loss: {val_loss:.4f}")
        # torch.save(model.state_dict(), 'best_translation_model.pth')

    print(f"Best validation loss: {best_val_loss:.4f}")

print("\n✅ Training completed!")
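
The gradient clipping step inside `train_epoch` rescales all gradients together when their combined L2 norm exceeds `max_norm`. The following is a minimal pure-Python sketch of that rule on a flat list of numbers. `clip_by_global_norm` is a hypothetical helper, and the small `eps` term mirrors the constant PyTorch adds to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale grads so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]

unchanged, _ = clip_by_global_norm([0.3, 0.4])
print(unchanged)    # [0.3, 0.4], already within the limit
```

The direction of the update is preserved; only its magnitude is limited, which helps keep early fine-tuning steps stable.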

## Translation and BLEU Score Evaluation

In [17]:
def translate_text(model, tokenizer, text, device, max_length=128):
    """Translate English text to German"""
    model.eval()

    # Prepare input
    input_text = f"Translate from English to German: {text}"
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=max_length,
        truncation=True,
        padding=True
    ).to(device)

    # Generate translation
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
            do_sample=False  # deterministic beam search; temperature only applies when sampling
        )

    # Decode output
    translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return translation

def calculate_scores(reference, hypothesis):
    """Calculate ROUGE F-measures and a smoothed sentence-level BLEU score for one translation pair"""
    # Initialize scorers
    # Note: use_stemmer=True applies the Porter stemmer, which is designed for English
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_smoothing = SmoothingFunction().method4

    # Calculate ROUGE scores (argument order is target, then prediction)
    rouge_scores = rouge.score(reference, hypothesis)

    # Calculate BLEU score; sentence_bleu expects a list of tokenized references
    # Note: word_tokenize uses its default English models here
    reference_tokens = [nltk.word_tokenize(reference)]
    hypothesis_tokens = nltk.word_tokenize(hypothesis)
    bleu_score = sentence_bleu(reference_tokens, hypothesis_tokens, smoothing_function=bleu_smoothing)

    return rouge_scores, bleu_score

In [18]:
# Evaluate on a subset of validation data
eval_size = 100
print(f"Evaluating on {eval_size} validation samples...")

# Get original validation data for reference
original_val = dataset['validation'].select(range(eval_size))

print("Generating translations and calculating scores...")

total_rouge1 = 0
total_rouge2 = 0
total_rougeL = 0
total_bleu = 0

print("\n" + "="*80)
print("SAMPLE TRANSLATIONS WITH SCORES")
print("="*80)

for i in tqdm(range(min(10, eval_size)), desc="Sample translations"):
    sample = original_val[i]
    english_text = sample['translation']['en']
    reference_german = sample['translation']['de']

    # Generate translation
    hypothesis_german = translate_text(model, tokenizer, english_text, device)

    # Calculate scores
    rouge_scores, bleu_score = calculate_scores(reference_german, hypothesis_german)

    # Accumulate scores
    total_rouge1 += rouge_scores['rouge1'].fmeasure
    total_rouge2 += rouge_scores['rouge2'].fmeasure
    total_rougeL += rouge_scores['rougeL'].fmeasure
    total_bleu += bleu_score

    # Display sample
    print(f"\nExample {i+1}:")
    print(f"🇺🇸 English: {english_text}")
    print(f"🇩🇪 Reference: {reference_german}")
    print(f"Generated: {hypothesis_german}")
    print(f"ROUGE-1: {rouge_scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-2: {rouge_scores['rouge2'].fmeasure:.3f}")
    print(f"ROUGE-L: {rouge_scores['rougeL'].fmeasure:.3f}")
    print(f"BLEU Score: {bleu_score:.3f}")
    print("-" * 80)

# Calculate remaining samples for overall averages
print(f"\nCalculating scores for remaining {eval_size-10} samples...")
for i in tqdm(range(10, eval_size), desc="Calculating scores"):
    sample = original_val[i]
    english_text = sample['translation']['en']
    reference_german = sample['translation']['de']

    # Generate translation
    hypothesis_german = translate_text(model, tokenizer, english_text, device)

    # Calculate scores
    rouge_scores, bleu_score = calculate_scores(reference_german, hypothesis_german)

    # Accumulate scores
    total_rouge1 += rouge_scores['rouge1'].fmeasure
    total_rouge2 += rouge_scores['rouge2'].fmeasure
    total_rougeL += rouge_scores['rougeL'].fmeasure
    total_bleu += bleu_score

# Calculate averages
avg_rouge1 = total_rouge1 / eval_size
avg_rouge2 = total_rouge2 / eval_size
avg_rougeL = total_rougeL / eval_size
avg_bleu = total_bleu / eval_size

print("\n" + "="*60)
print("EVALUATION RESULTS")
print("="*60)
print(f"Average ROUGE-1: {avg_rouge1:.3f}")
print(f"Average ROUGE-2: {avg_rouge2:.3f}")
print(f"Average ROUGE-L: {avg_rougeL:.3f}")
print(f"Average BLEU Score: {avg_bleu:.3f}")

# Performance summary
print("\n" + "="*60)
print("PERFORMANCE SUMMARY")
print("="*60)
print(f"Model: {model_name}")
print(f"Dataset: WMT16 English-German (or WMT14 fallback)")
print(f"Training Samples: {train_size}")
print(f"Validation Samples: {val_size}")
print(f"Evaluation Samples: {eval_size}")
print(f"Transfer Learning: Encoder frozen, decoder fine-tuned")
print(f"Training Epochs: {epochs}")
print(f"Final Validation Loss: {training_history['val_loss'][-1]:.4f}")
print(f"Average BLEU Score: {avg_bleu:.3f}")

Evaluating on 100 validation samples...
Generating translations and calculating scores...

SAMPLE TRANSLATIONS WITH SCORES


Sample translations:  10%|█         | 1/10 [00:01<00:12,  1.44s/it]


Example 1:
🇺🇸 English: India and Japan prime ministers meet in Tokyo
🇩🇪 Reference: Die Premierminister Indiens und Japans trafen sich in Tokio.
Generated: Die Minister Indien und japanischen Premierminister treffen in Tokio.
ROUGE-1: 0.667
ROUGE-2: 0.250
ROUGE-L: 0.556
BLEU Score: 0.153
--------------------------------------------------------------------------------


Sample translations:  20%|██        | 2/10 [00:02<00:11,  1.48s/it]


Example 2:
🇺🇸 English: India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.
🇩🇪 Reference: Indiens neuer Premierminister Narendra Modi trifft bei seinem ersten wichtigen Auslandsbesuch seit seinem Wahlsieg im Mai seinen japanischen Amtskollegen Shinzo Abe in Toko, um wirtschaftliche und sicherheitspolitische Beziehungen zu besprechen.
Generated: Der neue Präsidentin Indien, Narendra Modi, wird seinem jäsenischen Kollegen, Shinzo Abe, in Tokio treffen, um wirtschaftliche und Sicherheitsbeziehungen zu diskutieren.
ROUGE-1: 0.415
ROUGE-2: 0.196
ROUGE-L: 0.415
BLEU Score: 0.108
--------------------------------------------------------------------------------


Sample translations:  30%|███       | 3/10 [00:03<00:08,  1.19s/it]


Example 3:
🇺🇸 English: Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.
🇩🇪 Reference: Herr Modi befindet sich auf einer fünftägigen Reise nach Japan, um die wirtschaftlichen Beziehungen mit der drittgrößten Wirtschaftsnation der Welt zu festigen.
Generated: Herr Modi ist auf eine fünf-jährige Reise in Japan zu stärken, um die Wirtschaftsbeziehungen mit der dritten größten Wirtschaft in der Welt zu stärken.
ROUGE-1: 0.509
ROUGE-2: 0.189
ROUGE-L: 0.509
BLEU Score: 0.111
--------------------------------------------------------------------------------


Sample translations:  40%|████      | 4/10 [00:04<00:05,  1.11it/s]


Example 4:
🇺🇸 English: High on the agenda are plans for greater nuclear co-operation.
🇩🇪 Reference: Pläne für eine stärkere kerntechnische Zusammenarbeit stehen ganz oben auf der Tagesordnung.
Generated: Auf der Tagesordnung stehen die Vorschläge für eine stärkere Nuklearkontrolle.
ROUGE-1: 0.643
ROUGE-2: 0.462
ROUGE-L: 0.357
BLEU Score: 0.132
--------------------------------------------------------------------------------


Sample translations:  50%|█████     | 5/10 [00:05<00:04,  1.02it/s]


Example 5:
🇺🇸 English: India is also reportedly hoping for a deal on defence collaboration between the two nations.
🇩🇪 Reference: Berichten zufolge hofft Indien darüber hinaus auf einen Vertrag zur Verteidigungszusammenarbeit zwischen den beiden Nationen.
Generated: In Indien hoffe ich, daß es zwischen den beiden Staaten eine Verhandlungsverhandlungen über die Vergemeinschaftung der Verteidigung ergibt.
ROUGE-1: 0.294
ROUGE-2: 0.125
ROUGE-L: 0.235
BLEU Score: 0.071
--------------------------------------------------------------------------------


Sample translations:  60%|██████    | 6/10 [00:06<00:03,  1.04it/s]


Example 6:
🇺🇸 English: Karratha police arrest 20-year-old after high speed motorcycle chase
🇩🇪 Reference: Polizei von Karratha verhaftet 20-Jährigen nach schneller Motorradjagd
Generated: Die Polizei in Karratha ermordete 20 Jahre nach einer hohen Geschwindigkeitsmärkten Unfall 20 Jahre nach.
ROUGE-1: 0.308
ROUGE-2: 0.000
ROUGE-L: 0.308
BLEU Score: 0.021
--------------------------------------------------------------------------------


Sample translations:  70%|███████   | 7/10 [00:07<00:03,  1.10s/it]


Example 7:
🇺🇸 English: A motorcycle has been seized after it was ridden at 125km/h in a 70km/h zone and through bushland to escape police in the Pilbara.
🇩🇪 Reference: Ein Motorrad wurde beschlagnahmt, nachdem der Fahrer es mit 125 km/h in einer 70 km/h-Zone und durch Buschland gefahren hatte, um der Polizei in Bilbara zu entkommen.
Generated: Eine Motor wurde nach seiner Beförderung von 125 km/h in einem Gebiet der 70 km/h und durch schwammlichen Gebiete gefangen, um die Polizei in der Pilbara zu fahren.
ROUGE-1: 0.557
ROUGE-2: 0.237
ROUGE-L: 0.492
BLEU Score: 0.079
--------------------------------------------------------------------------------


Sample translations:  80%|████████  | 8/10 [00:08<00:02,  1.15s/it]


Example 8:
🇺🇸 English: Traffic police on patrol in Karratha this morning tried to pull over a blue motorcycle when they spotted it reaching 125km/h as it pulled out of a service station on Bathgate Road.
🇩🇪 Reference: Verkehrspolizisten in Karratha versuchten heute morgen, ein blaues Motorrad zu stoppen, nachdem sie es dabei beobachtet hatten, wie es mit 125 km/h eine Tankstelle auf der Bathdate Road verließ.
Generated: Der Verkehrspolizei in Karratha hat heute morgen einen blauen Motorwerk erschüttert, wenn sie es auf dem Weg von einer Dienststelle im Bathgate Road gelangt hat, 125 km/h erreicht.
ROUGE-1: 0.400
ROUGE-2: 0.172
ROUGE-L: 0.300
BLEU Score: 0.042
--------------------------------------------------------------------------------


Sample translations:  90%|█████████ | 9/10 [00:10<00:01,  1.19s/it]


Example 9:
🇺🇸 English: Police say the rider then failed to stop and continued on to Burgess Road before turning into bushland, causing the officers to lose sight of it.
🇩🇪 Reference: Die Polizei berichtet, dass der Fahrer die Haltesignale dann ignorierte und weiter auf der Burgess Road fuhr, bevor er in das Buschland abbog, wo die Beamten es aus den Augen verloren.
Generated: Die Polizei sagen, daß der Fahrer dann nicht gekommen ist, und er folgte nach Burgessstraße, bevor er sich in den Bänken zu fahren, daß die Behörde es nicht gesehen haben.
ROUGE-1: 0.375
ROUGE-2: 0.097
ROUGE-L: 0.344
BLEU Score: 0.065
--------------------------------------------------------------------------------


Sample translations: 100%|██████████| 10/10 [00:10<00:00,  1.10s/it]



Example 10:
🇺🇸 English: The motorcycle and a person matching the description of the rider was then spotted at a house on Walcott Way in Bulgarra.
🇩🇪 Reference: Das Motorrad sowie eine Person, die der Beschreibung des Fahrers entsprach wurden später bei einem Haus im Walcott Way in Bulgarra gesehen.
Generated: Der Motorrad und eine Person, die die Beschreibung der Fahrer entspricht, wurden dann auf einem Haus auf Walcott Way in Bulgarra gefunden.
ROUGE-1: 0.622
ROUGE-2: 0.279
ROUGE-L: 0.578
BLEU Score: 0.229
--------------------------------------------------------------------------------

Calculating scores for remaining 90 samples...


Calculating scores: 100%|██████████| 90/90 [01:41<00:00,  1.13s/it]


EVALUATION RESULTS
Average ROUGE-1: 0.417
Average ROUGE-2: 0.178
Average ROUGE-L: 0.368
Average BLEU Score: 0.100

PERFORMANCE SUMMARY
Model: google/flan-t5-base
Dataset: WMT16 English-German (or WMT14 fallback)
Training Samples: 5000
Validation Samples: 500
Evaluation Samples: 100
Transfer Learning: Encoder frozen, decoder fine-tuned
Training Epochs: 3
Final Validation Loss: 2.6233
Average BLEU Score: 0.100
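
The averages reported above are means of per-sentence scores. Corpus-level metrics instead pool n-gram counts over all sentences before dividing, and the two can differ noticeably. The toy sketch below illustrates the difference using clipped unigram precision only (one ingredient of BLEU, not the full metric). The sentences and the helper names are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(reference, hypothesis, n=1):
    """Return (clipped n-gram matches, total hypothesis n-grams)."""
    ref_counts = Counter(ngrams(reference, n))
    hyp_counts = Counter(ngrams(hypothesis, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


pairs = [
    ("die katze sitzt".split(), "die katze sitzt".split()),
    ("der hund schläft auf dem sofa".split(), "ein hund liegt auf einem sofa heute".split()),
]

stats = [clipped_matches(ref, hyp) for ref, hyp in pairs]

# Mean of per-sentence precisions: every sentence counts equally
sentence_avg = sum(m / t for m, t in stats) / len(stats)

# Pooled precision: longer sentences carry more weight
pooled = sum(m for m, _ in stats) / sum(t for _, t in stats)

print(f"{sentence_avg:.3f}")  # 0.714
print(f"{pooled:.3f}")        # 0.600
```

For this reason, a mean of sentence-level scores is not directly comparable to corpus-level BLEU figures.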





# BLEU Score Interpretation Guide

## BLEU Score Ranges

| Score range (0-1 scale) | Translation quality |
|-------------------------|---------------------|
| 0.0 to <0.1             | Very poor           |
| 0.1 to <0.2             | Poor                |
| 0.2 to <0.3             | Fair                |
| 0.3 to <0.4             | Good                |
| 0.4 to <0.5             | Very good           |
| 0.5 and above           | Excellent           |

## Important Considerations

BLEU scores can vary significantly based on several factors:

- **Dataset characteristics and domain**: Technical texts vs. conversational language
- **Reference translation quality**: Professional vs. automated translations
- **Model size and training data**: Larger models with more data typically score higher
- **Language pair difficulty**: Some language pairs are inherently more challenging

## Additional Notes

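- BLEU scores should be interpreted relative to the specific task, domain, and test set; they are only comparable when computed with the same references and tokenization
- The BLEU values in this notebook are smoothed sentence-level scores on a 0-1 scale, averaged over 100 validation sentences, so they are not directly comparable to corpus-level BLEU figures
- As a rough rule of thumb, a score of 0.3 or higher is often considered acceptable for practical use, and scores above 0.4 usually indicate high-quality translations; neither threshold guarantees professional-grade output
- Always consider multiple evaluation metrics alongside BLEU for a comprehensive assessment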