# SQL-to-Text Model Evaluation

This notebook evaluates pre-trained models for SQL-to-text generation without any training.

## Features
- Evaluate any HuggingFace model (seq2seq or causal LM)
- Model-specific batch sizes based on memory requirements
- Batched generation for efficient GPU utilization
- GPU memory monitoring
- Multiple metrics: BLEU, ROUGE, CHRF, LaBSE similarity

In [1]:
# Install dependencies
!pip install -q torch transformers datasets sacremoses sentence-transformers sacrebleu rouge-score

## Configuration

Set the model and evaluation parameters:

In [2]:
MODEL_NAME = "cointegrated/rut5-base"
DATA_DIR = "."
OUTPUT_FILE = "evaluation_results.json"
NUM_SAMPLES = None  # Use full dataset
MAX_NEW_TOKENS = 100
TEMPERATURE = 0.7

# Model-specific batch sizes (adjust based on GPU memory)
MODEL_BATCH_SIZES = {
    "cointegrated/rut5-base": 256,
    "cointegrated/rut5-small": 256,
    "google/flan-t5-base": 128,
    "google/flan-t5-small": 256,
    "google/flan-t5-large": 64,
    "facebook/bart-base": 128,
    "facebook/bart-large": 32,
    "Qwen/Qwen2.5-0.5B-Instruct": 256,
    "Qwen/Qwen2.5-1.5B-Instruct": 128,
    "Qwen/Qwen2.5-3B-Instruct": 64,
    "Qwen/Qwen2.5-7B-Instruct": 32,
    "meta-llama/Llama-3.2-1B-Instruct": 256,
    "meta-llama/Llama-3.2-3B-Instruct": 128,
    "meta-llama/Llama-3.2-8B-Instruct": 32,
    "t-tech/T-lite-it-1.0": 32
}

def get_batch_size(model_name: str, default_size: int = 128) -> int:
    """Get batch size for model, with fallback to default."""
    return MODEL_BATCH_SIZES.get(model_name, default_size)

# Get batch size for selected model
BATCH_SIZE = get_batch_size(MODEL_NAME, 128)

print(f"Model: {MODEL_NAME}")
print(f"Samples to evaluate: All")
print(f"Batch size: {BATCH_SIZE}")

Model: cointegrated/rut5-base
Samples to evaluate: All
Batch size: 256


In [3]:
from google.colab import files
print("Please upload pauq_dev.json")
uploaded = files.upload()

Please upload pauq_dev.json


Saving pauq_dev.json to pauq_dev (1).json


## GPU Memory Check

In [4]:
import torch

print("=" * 80)
print("GPU Memory Check")
print("=" * 80)

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Device: {device}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    free = total_memory - allocated

    print(f"Total Memory: {total_memory:.2f} GB")
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved: {reserved:.2f} GB")
    print(f"Available: {free:.2f} GB")
    print("=" * 80)
else:
    print("No GPU available")
    print("=" * 80)

GPU Memory Check
Device: cuda
GPU: Tesla T4
Total Memory: 15.64 GB
Allocated: 0.00 GB
Reserved: 0.00 GB
Available: 15.64 GB


## Data Loading Functions

In [5]:
import json
import os
from typing import List, Dict

def load_pauq_data(data_dir: str, split: str = "dev") -> List[Dict]:
    filename = f"pauq_{split}.json"
    filepath = os.path.join(data_dir, filename)

    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Data file not found: {filepath}")

    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)

    print(f"Loaded {len(data)} examples from {filename}")
    return data

In [6]:
def get_eval_subset(data: List[Dict], num_samples: int = None) -> List[Dict]:
    if num_samples:
        return data[:num_samples]
    return data

In [7]:
dev_data = load_pauq_data(DATA_DIR, "dev")
eval_data = get_eval_subset(dev_data, NUM_SAMPLES)
print(f"\nEvaluating on {len(eval_data)} samples")

Loaded 1076 examples from pauq_dev.json

Evaluating on 1076 samples


## Model Loading Functions

In [8]:
import torch

def load_model_and_tokenizer(model_name: str):
    from transformers import AutoTokenizer
    print(f"Loading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    try:
        from transformers import AutoModelForSeq2SeqLM
        model = AutoModelForSeq2SeqLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        is_seq2seq = True
        print("Detected: Seq2Seq model")
    except (OSError, ValueError, KeyError):
        from transformers import AutoModelForCausalLM
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        is_seq2seq = False
        print("Detected: Causal LM")
    except Exception as e:
        raise RuntimeError(f"Failed to load model {model_name}: {e}")

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer, is_seq2seq

In [9]:
model, tokenizer, is_seq2seq = load_model_and_tokenizer(MODEL_NAME)
print(f"Model loaded on: {model.device}")
print(f"Model parameters: {model.num_parameters():,}")

Loading model: cointegrated/rut5-base


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!


Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



Detected: Seq2Seq model
Model loaded on: cuda:0
Model parameters: 244,309,248


## Batched Generation Functions

In [10]:
def generate_questions_batched(model, tokenizer, sql_queries: List[str], is_seq2seq: bool,
                                   max_new_tokens: int = 100, temperature: float = 0.7):
    """Generate questions from SQL queries in batches for efficient GPU usage."""
    if is_seq2seq:
        prompts = [f"SQL: {sql}" for sql in sql_queries]
        inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Decode all outputs
        results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    else:
        # Causal LM: need to handle continuation format
        prompts = [f"SQL: {sql}\Вопрос:" for sql in sql_queries]
        inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Decode and extract continuation after "Question:"
        full_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        results = [output.split("Вопрос:")[-1].strip() for output in full_outputs]
    return results

## Quick Evaluation

Test the model on a few samples:

In [11]:
print("\nSample predictions:")
print("=" * 80)

# Get sample data
sample_data = eval_data[:5]
sample_sqls = [item.get("query", {}).get("en", "") for item in sample_data]

# Generate in batch
predictions = generate_questions_batched(
    model, tokenizer, sample_sqls, is_seq2seq, MAX_NEW_TOKENS, TEMPERATURE
)

for i, (item, predicted) in enumerate(zip(sample_data, predictions)):
    sql_query = item.get("query", {}).get("en", "")
    actual_question = item.get("question", {}).get("en", "")
    print(f"\n--- Sample {i+1} ---")
    print(f"SQL: {sql_query}")
    print(f"Expected: {actual_question}")
    print(f"Predicted: {predicted}")
    print("-" * 80)


Sample predictions:

--- Sample 1 ---
SQL: SELECT count(*) FROM singer;
Expected: How many singers do we have?
Predicted: нос­осхac1 100 * <extra_id_29>o?-70each() 10s- <extra_id_53> <extra_id_40> <extra_id_8>лемент <extra_id_8>алай <extra_id_54> <extra_id_55>
--------------------------------------------------------------------------------

--- Sample 2 ---
SQL: SELECT count(*) FROM singer;
Expected: What is the total number of singers?
Predicted: ersедин  extra(фасNето  hell Moreскаль44c <extra_id_32>ere <extra_id_14> Home <extra_id_52> <extra_id_45> Home milli aim <extra_id_20> <extra_id_35> <extra_id_20>zar... <extra_id_55> <extra_id_21> Faceлор ... <extra_id_41>used <extra_id_55> Face <extra_id_40> RT <extra_id_56> <extra_id_18> <extra_id_20> non откai <extra_id_41> ор <extra_id_43> Ев <extra_id_23> больше+ <extra_id_25>ƒ* PC/лам <extra_id_25> него SB"," <extra_id_15> <extra_id_21> паль <extra_id_20> claus капеписа..." <extra_id_29>слухчеш <extra_id_20> <extra_id_23>рады <extra_id

## Evaluation Functions

Functions to compute metrics:

In [12]:
import numpy as np

from sentence_transformers import SentenceTransformer, util

# LaBSE Model Caching
labse_model = None

def get_labse_model():
    global labse_model
    if labse_model is None:
        print("Loading LaBSE model...")
        labse_model = SentenceTransformer('sentence-transformers/LaBSE')
    return labse_model

In [13]:
def compute_metrics(references, hypotheses):
    from sacrebleu.metrics import BLEU, CHRF
    import torch
    from rouge_score import rouge_scorer
    from sentence_transformers import SentenceTransformer, util

    print("Computing BLEU...")
    bleu_metric = BLEU()
    bleu_result = bleu_metric.corpus_score(hypotheses, [references])

    print("Computing ROUGE...")
    rouge_scorer_instance = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'], use_stemmer=True
    )
    rouge1_scores = []
    rouge2_scores = []
    rougeL_scores = []
    for ref, hyp in zip(references, hypotheses):
        scores = rouge_scorer_instance.score(ref, hyp)
        rouge1_scores.append(scores['rouge1'].fmeasure)
        rouge2_scores.append(scores['rouge2'].fmeasure)
        rougeL_scores.append(scores['rougeL'].fmeasure)

    print("Computing CHRF...")
    chrf_metric = CHRF()
    chrf_result = chrf_metric.corpus_score(hypotheses, [references])

    # LaBSE similarity - use cached model
    labse = get_labse_model()
    ref_embeddings = labse.encode(references, convert_to_tensor=True)
    hyp_embeddings = labse.encode(hypotheses, convert_to_tensor=True)
    similarities = util.cos_sim(hyp_embeddings, ref_embeddings)
    similarity_scores = torch.diagonal(similarities).cpu().numpy()

    return {
        'BLEU-1': bleu_result.precisions[0],
        'BLEU-2': bleu_result.precisions[1],
        'BLEU-3': bleu_result.precisions[2],
        'BLEU-4': bleu_result.precisions[3],
        'ROUGE-1': np.mean(rouge1_scores) * 100,
        'ROUGE-2': np.mean(rouge2_scores) * 100,
        'ROUGE-L': np.mean(rougeL_scores) * 100,
        'CHRF': chrf_result.score,
        'LaBSE-Similarity': np.mean(similarity_scores) * 100,
    }

In [14]:
def run_evaluation_batched(model, tokenizer, eval_data, is_seq2seq,
                                   batch_size, max_new_tokens, temperature, output_file=None):
    references, hypotheses, predictions = [], [], []
    num_samples = len(eval_data)
    num_batches = (num_samples + batch_size - 1) // batch_size
    print(f"\nEvaluating {num_samples} samples in {num_batches} batches (size={batch_size})...")

    for batch_idx in range(num_batches):
        start_idx = batch_idx * batch_size
        end_idx = min(start_idx + batch_size, num_samples)
        batch_data = eval_data[start_idx:end_idx]

        # Prepare batch inputs
        batch_sqls = [item.get("query", {}).get("en", "") for item in batch_data]
        batch_refs = [item.get("question", {}).get("en", "") for item in batch_data]

        # Generate in batch
        batch_preds = generate_questions_batched(
            model, tokenizer, batch_sqls, is_seq2seq, max_new_tokens, temperature
        )

        # Store results
        for item, ref, pred in zip(batch_data, batch_refs, batch_preds):
            references.append(ref)
            hypotheses.append(pred)
            predictions.append({
                'id': item.get('id', ''),
                'sql': item.get('query', {}).get('en', ''),
                'expected': ref,
                'predicted': pred,
            })

        print(f"Processed batch {batch_idx + 1}/{num_batches} ({end_idx}/{num_samples} samples)")

    print("Computing metrics...")
    metrics = compute_metrics(references, hypotheses)

    if output_file:
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(predictions, f, indent=2, ensure_ascii=False)
        print(f"\nPredictions saved to {output_file}")

    return metrics

## Run Full Evaluation

In [15]:
metrics = run_evaluation_batched(
    model, tokenizer, eval_data, is_seq2seq,
    BATCH_SIZE, MAX_NEW_TOKENS, TEMPERATURE, OUTPUT_FILE
)

print("\n" + "=" * 80)
print("EVALUATION RESULTS")
print("=" * 80)
print(f"\nModel: {MODEL_NAME}")
print(f"Dataset: {len(eval_data)} samples")
print(f"Batch size: {BATCH_SIZE}")
print("\n--- Metrics ---")
for key, value in metrics.items():
    print(f"{key}: {value:.4f}")
print("=" * 80)


Evaluating 1076 samples in 5 batches (size=256)...
Processed batch 1/5 (256/1076 samples)
Processed batch 2/5 (512/1076 samples)
Processed batch 3/5 (768/1076 samples)
Processed batch 4/5 (1024/1076 samples)
Processed batch 5/5 (1076/1076 samples)
Computing metrics...
Computing BLEU...
Computing ROUGE...
Computing CHRF...
Loading LaBSE model...


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/LaBSE
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Predictions saved to evaluation_results.json

EVALUATION RESULTS

Model: cointegrated/rut5-base
Dataset: 1076 samples
Batch size: 256

--- Metrics ---
BLEU-1: 0.6737
BLEU-2: 0.0031
BLEU-3: 0.0005
BLEU-4: 0.0003
ROUGE-1: 0.9266
ROUGE-2: 0.0051
ROUGE-L: 0.8975
CHRF: 7.0931
LaBSE-Similarity: 13.2966


## Custom SQL Test

Test with your own SQL queries:

In [16]:
test_sqls = [
    "SELECT name FROM users WHERE age > 25;",
    "SELECT COUNT(*) FROM orders WHERE status = 'completed';",
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY SUM(quantity) DESC;",
]
print("\nCustom SQL Tests:")
print("=" * 80)

# Generate in batch
predictions = generate_questions_batched(
    model, tokenizer, test_sqls, is_seq2seq, MAX_NEW_TOKENS, TEMPERATURE
)

for i, (test_sql, predicted) in enumerate(zip(test_sqls, predictions)):
    print(f"\n--- Test {i+1} ---")
    print(f"SQL: {test_sql}")
    print(f"Generated Question: {predicted}")


Custom SQL Tests:

--- Test 1 ---
SQL: SELECT name FROM users WHERE age > 25;
Generated Question: mini:ом? Yet? digital <extra_id_55> <extra_id_16>! <extra_id_53> G <extra_id_20> add more last > <extra_id_8>ft Th <extra_id_23> <extra_id_44> <extra_id_16>trackeryлипсерияa ( <extra_id_46>коя ~remark <extra_id_11>стир }датчик <extra_id_56>evilturkey httpwww­nomine T

--- Test 2 ---
SQL: SELECT COUNT(*) FROM orders WHERE status = 'completed';
Generated Question: VRS & T – L- W from root View ... <extra_id_19>....... <extra_id_16> <extra_id_43> <extra_id_25> <extra_id_23> <extra_id_5> <extra_id_51> <extra_id_11>Left...... <extra_id_55> $ <extra_id_56> <extra_id_56> <extra_id_20>-<0xDA> <extra_id_51>цель <extra_id_14> <extra_id_39> <extra_id_51>

--- Test 3 ---
SQL: SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY SUM(quantity) DESC;
Generated Question: РAllEamr $oE (I... M <extra_id_25> [ <extra_id_51> Yet <extra_id_53> <extra_id_12>...... <extra_id_17>AroundNUMнимал <ext

## Compare Multiple Models

Define a list of models to compare:

In [None]:
all_results = []

In [21]:
MODELS_TO_COMPARE = [
    "cointegrated/rut5-base",
    "google/flan-t5-base",
    "Qwen/Qwen2.5-0.5B-Instruct",
    "bond005/FRED-T5-large-instruct-v0.1",
    "t-tech/T-lite-it-1.0",
    "bond005/meno-tiny-0.1",
    "bond005/meno-lite-0.1",
    "RefalMachine/RuadaptQwen2.5-7B-Lite-Beta",
    "Qwen/Qwen3-4B-Instruct-2507",
    "RefalMachine/ruadapt_llama3_8b_instruct_extended_lep_ft"
    ]

COMPARE_NUM_SAMPLES = None  # Use full dataset
compare_data = get_eval_subset(dev_data, COMPARE_NUM_SAMPLES)
print(f"\nComparing {len(MODELS_TO_COMPARE)} models on {len(compare_data)} samples...")


for model_name in MODELS_TO_COMPARE:
    print(f"\n" + "=" * 80)
    print(f"Evaluating: {model_name}")
    print(f"=" * 80)
    try:
        compare_model, compare_tokenizer, compare_is_seq2seq = load_model_and_tokenizer(model_name)
        compare_batch_size = get_batch_size(model_name, 128)
        compare_metrics = run_evaluation_batched(
            compare_model, compare_tokenizer, compare_data, compare_is_seq2seq,
            compare_batch_size, MAX_NEW_TOKENS, TEMPERATURE, None
        )
        result = {"model": model_name}
        result.update(compare_metrics)
        all_results.append(result)
        del compare_model, compare_tokenizer
        torch.cuda.empty_cache()
    except Exception as e:
        print(f"Error evaluating {model_name}: {e}")
        continue


Comparing 10 models on 1076 samples...

Evaluating: cointegrated/rut5-base
Loading model: cointegrated/rut5-base


Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



Detected: Seq2Seq model

Evaluating 1076 samples in 5 batches (size=256)...
Processed batch 1/5 (256/1076 samples)
Processed batch 2/5 (512/1076 samples)
Processed batch 3/5 (768/1076 samples)
Processed batch 4/5 (1024/1076 samples)
Processed batch 5/5 (1076/1076 samples)
Computing metrics...
Computing BLEU...
Computing ROUGE...
Computing CHRF...

Evaluating: google/flan-t5-base
Loading model: google/flan-t5-base


Loading weights:   0%|          | 0/282 [00:00<?, ?it/s]



Detected: Seq2Seq model

Evaluating 1076 samples in 9 batches (size=128)...
Processed batch 1/9 (128/1076 samples)
Processed batch 2/9 (256/1076 samples)
Processed batch 3/9 (384/1076 samples)
Processed batch 4/9 (512/1076 samples)
Processed batch 5/9 (640/1076 samples)
Processed batch 6/9 (768/1076 samples)
Processed batch 7/9 (896/1076 samples)
Processed batch 8/9 (1024/1076 samples)
Processed batch 9/9 (1076/1076 samples)
Computing metrics...
Computing BLEU...
Computing ROUGE...
Computing CHRF...

Evaluating: Qwen/Qwen2.5-0.5B-Instruct
Loading model: Qwen/Qwen2.5-0.5B-Instruct


Loading weights:   0%|          | 0/290 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Detected: Causal LM

Evaluating 1076 samples in 5 batches (size=256)...


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processed batch 1/5 (256/1076 samples)


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processed batch 2/5 (512/1076 samples)


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processed batch 3/5 (768/1076 samples)


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Processed batch 4/5 (1024/1076 samples)
Processed batch 5/5 (1076/1076 samples)
Computing metrics...
Computing BLEU...
Computing ROUGE...
Computing CHRF...

Evaluating: bond005/FRED-T5-large-instruct-v0.1
Loading model: bond005/FRED-T5-large-instruct-v0.1


config.json:   0%|          | 0.00/970 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/688 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/3.28G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/558 [00:00<?, ?it/s]



generation_config.json:   0%|          | 0.00/173 [00:00<?, ?B/s]

The following generation flags are not valid and may be ignored: ['top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Detected: Seq2Seq model

Evaluating 1076 samples in 9 batches (size=128)...
Processed batch 1/9 (128/1076 samples)
Processed batch 2/9 (256/1076 samples)
Processed batch 3/9 (384/1076 samples)
Processed batch 4/9 (512/1076 samples)
Processed batch 5/9 (640/1076 samples)
Processed batch 6/9 (768/1076 samples)
Processed batch 7/9 (896/1076 samples)
Processed batch 8/9 (1024/1076 samples)
Processed batch 9/9 (1076/1076 samples)
Computing metrics...
Computing BLEU...
Computing ROUGE...
Computing CHRF...

Evaluating: t-tech/T-lite-it-1.0
Loading model: t-tech/T-lite-it-1.0


config.json:   0%|          | 0.00/712 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/339 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Detected: Causal LM

Evaluating 1076 samples in 9 batches (size=128)...
Error evaluating t-tech/T-lite-it-1.0: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.56 GiB of which 131.81 MiB is free. Including non-PyTorch memory, this process has 14.43 GiB memory in use. Of the allocated memory 14.07 GiB is allocated by PyTorch, and 237.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Evaluating: bond005/meno-tiny-0.1
Loading model: bond005/meno-tiny-0.1


config.json:   0%|          | 0.00/717 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/499 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

Loading weights:   0%|          | 0/338 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Detected: Causal LM

Evaluating 1076 samples in 9 batches (size=128)...
Error evaluating bond005/meno-tiny-0.1: CUDA out of memory. Tried to allocate 144.00 MiB. GPU 0 has a total capacity of 14.56 GiB of which 37.81 MiB is free. Including non-PyTorch memory, this process has 14.52 GiB memory in use. Of the allocated memory 14.26 GiB is allocated by PyTorch, and 136.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Evaluating: bond005/meno-lite-0.1
Loading model: bond005/meno-lite-0.1


config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/759 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [18]:
import pandas as pd

df_results = pd.DataFrame(all_results)
print("\n" + "=" * 80)
print("MODEL COMPARISON RESULTS")
print("=" * 80)
print(df_results.to_string(index=False))
df_results.to_csv('model_comparison.csv', index=False)
print("\nResults saved to model_comparison.csv")


MODEL COMPARISON RESULTS
                     model   BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE-1  ROUGE-2   ROUGE-L      CHRF  LaBSE-Similarity
    cointegrated/rut5-base 0.714371 0.003296 0.000556 0.000281  1.051476 0.000000  1.001957  7.126219         13.304046
       google/flan-t5-base 2.980756 0.034745 0.001491 0.000768  5.661441 0.120799  4.622805  9.719327         19.479076
Qwen/Qwen2.5-0.5B-Instruct 6.938039 1.375375 0.505185 0.192722 14.453878 3.991435 11.598139 20.516740         32.164391

Results saved to model_comparison.csv


## Download Results (Optional)

Download evaluation results:

In [19]:
print("\nStarting download...")
files.download(OUTPUT_FILE)
if os.path.exists('model_comparison.csv'):
    files.download('model_comparison.csv')


Starting download...


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>