# Fine-Tuning Gemma 7B: Comparison Across Agreement Levels

This notebook fine-tunes Gemma 7B separately on each agreement-level subset of FinancialPhraseBank to measure how label quality affects sentiment classification performance.

## Agreement Levels:
- **sentences_50agree**: ≥50% annotator agreement (4,846 sentences)
- **sentences_66agree**: ≥66% annotator agreement (4,217 sentences) 
- **sentences_75agree**: ≥75% annotator agreement (3,453 sentences)
- **sentences_allagree**: 100% annotator agreement (2,264 sentences)
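
To make the trade-off between agreement and dataset size concrete, the short sketch below computes the share of the 50%-agreement set that survives at each stricter threshold. The counts are the ones listed above; `subset_sizes` and `retained_share` are illustrative names and not part of the notebook.

```python
# Sentence counts per configuration, as listed above.
subset_sizes = {
    "sentences_50agree": 4846,
    "sentences_66agree": 4217,
    "sentences_75agree": 3453,
    "sentences_allagree": 2264,
}

def retained_share(sizes, reference="sentences_50agree"):
    """Return each subset's size as a fraction of the reference subset."""
    base = sizes[reference]
    return {name: count / base for name, count in sizes.items()}

shares = retained_share(subset_sizes)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the 50%-agreement set")
```

Requiring unanimous agreement keeps a little under half of the sentences, so any gain from cleaner labels has to be weighed against the smaller pool to sample from.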

## Hypothesis:
Higher agreement levels should lead to better model performance due to reduced label noise.

In [1]:
# Install required packages
%pip install -q -U torch --index-url https://download.pytorch.org/whl/cu117
%pip install -q -U transformers==4.38.2
%pip install -q accelerate==0.32.0
%pip install -q -i https://pypi.org/simple/ bitsandbytes
%pip install -q -U datasets==2.16.1
%pip install -q -U trl==0.7.11
%pip install -q -U peft==0.10.0

Note: you may need to restart the kernel to use updated packages.

In [2]:
import os
import gc
import time
import warnings
from datetime import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig, 
    TrainingArguments, 
    pipeline, 
    logging
)
from datasets import Dataset, load_dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer

from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix
)
from sklearn.model_selection import train_test_split

print(f"transformers=={transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

transformers==4.38.2
PyTorch version: 2.5.1+cu121
CUDA available: True


## Define Helper Functions

In [4]:
def clear_memory():
    """Clear GPU and CPU memory"""
    print("Clearing memory...")
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print(f"GPU memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

def wait_and_clear(seconds=60):
    """Wait for specified seconds and clear memory"""
    print(f"Waiting {seconds} seconds...")
    time.sleep(seconds)
    clear_memory()

In [5]:
def prepare_dataset(agreement_level):
    """Load and prepare dataset for specific agreement level"""
    print(f"\nLoading dataset: {agreement_level}...")
    
    # Load the dataset from HuggingFace
    dataset = load_dataset("takala/financial_phrasebank", agreement_level)
    
    # Convert to pandas for easier manipulation
    df = dataset['train'].to_pandas()
    df = df.rename(columns={'sentence': 'text', 'label': 'sentiment'})
    
    print(f"Dataset shape: {df.shape}")
    print("Sentiment distribution:")
    sentiment_dist = df['sentiment'].value_counts()
    print(sentiment_dist)
    
    return df, sentiment_dist

In [6]:
def create_balanced_splits(df, max_samples_per_class=300):
    """Create balanced train/val/test splits"""
    # Calculate minimum class size
    min_class_size = df['sentiment'].value_counts().min()
    samples_per_class = min(max_samples_per_class, min_class_size)
    
    # Use 70% for training, 15% for validation, 15% for testing
    train_size_per_class = int(samples_per_class * 0.7)
    val_size_per_class = int(samples_per_class * 0.15)
    test_size_per_class = samples_per_class - train_size_per_class - val_size_per_class
    
    print(f"Split sizes per class: Train={train_size_per_class}, Val={val_size_per_class}, Test={test_size_per_class}")
    
    X_train, X_val, X_test = [], [], []
    
    # Map integer labels to sentiment names
    label_mapping = {0: "negative", 1: "neutral", 2: "positive"}
    
    for sentiment_label in [0, 1, 2]:
        sentiment_name = label_mapping[sentiment_label]
        sentiment_data = df[df.sentiment == sentiment_label]
        
        if len(sentiment_data) == 0:
            print(f"Warning: No samples found for {sentiment_name} sentiment")
            continue
        
        # Sample the required number for this class
        if len(sentiment_data) >= samples_per_class:
            sampled_data = sentiment_data.sample(n=samples_per_class, random_state=42)
        else:
            sampled_data = sentiment_data
            print(f"Warning: Only {len(sentiment_data)} samples available for {sentiment_name}")
        
        # Split the sampled data
        if len(sampled_data) >= 3:  # Need at least 3 samples to split
            temp_data, test_data = train_test_split(
                sampled_data, 
                test_size=min(test_size_per_class, len(sampled_data)//3),
                random_state=42
            )
            
            if len(temp_data) >= 2:
                train_data, val_data = train_test_split(
                    temp_data,
                    test_size=min(val_size_per_class, len(temp_data)//2),
                    random_state=42
                )
            else:
                train_data = temp_data
                val_data = pd.DataFrame()
        else:
            train_data = sampled_data
            val_data = pd.DataFrame()
            test_data = pd.DataFrame()
        
        if len(train_data) > 0:
            X_train.append(train_data)
        if len(val_data) > 0:
            X_val.append(val_data)
        if len(test_data) > 0:
            X_test.append(test_data)
    
    # Combine and shuffle
    X_train = pd.concat(X_train).sample(frac=1, random_state=10).reset_index(drop=True) if X_train else pd.DataFrame()
    X_val = pd.concat(X_val).sample(frac=1, random_state=10).reset_index(drop=True) if X_val else pd.DataFrame()
    X_test = pd.concat(X_test).sample(frac=1, random_state=10).reset_index(drop=True) if X_test else pd.DataFrame()
    
    print(f"Final splits: Train={len(X_train)}, Val={len(X_val)}, Test={len(X_test)}")
    
    return X_train, X_val, X_test
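
As a quick check on the split logic, the sketch below repeats only the size arithmetic from `create_balanced_splits`. The helper name `split_sizes` is hypothetical; the value 604 is the size of the smallest class in `sentences_50agree`, taken from the run output further down.

```python
def split_sizes(min_class_size, max_samples_per_class=300):
    """Per-class train/val/test sizes under the 70/15/15 rule."""
    samples_per_class = min(max_samples_per_class, min_class_size)
    train = int(samples_per_class * 0.7)
    val = int(samples_per_class * 0.15)
    # The test split absorbs any rounding remainder.
    test = samples_per_class - train - val
    return train, val, test

# Smallest class in sentences_50agree has 604 sentences, so the 300 cap applies.
train, val, test = split_sizes(604)
print(train, val, test)              # sizes per class
print(3 * train, 3 * val, 3 * test)  # totals across the three classes
```

Because the cap of 300 is below the smallest class size in the logged runs, those runs train on the same number of examples and differ only in which sentences are sampled.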

In [7]:
def prepare_prompts(X_train, X_val, X_test, tokenizer):
    """Prepare training prompts"""
    EOS_TOKEN = tokenizer.eos_token
    label_mapping = {0: "negative", 1: "neutral", 2: "positive"}
    
    def generate_prompt(data_point):
        sentiment_text = label_mapping[data_point["sentiment"]]
        return f"""Analyze the sentiment of the news headline enclosed in square brackets, 
                determine if it is positive, neutral, or negative, and return the answer as 
                the corresponding sentiment label "positive" or "neutral" or "negative"

                [{data_point["text"]}] = {sentiment_text}
                """.strip() + EOS_TOKEN

    def generate_test_prompt(data_point):
        return f"""Analyze the sentiment of the news headline enclosed in square brackets, 
                determine if it is positive, neutral, or negative, and return the answer as 
                the corresponding sentiment label "positive" or "neutral" or "negative"

                [{data_point["text"]}] = 
                """.strip()
    
    # Generate prompts
    train_prompts = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
    val_prompts = pd.DataFrame(X_val.apply(generate_prompt, axis=1), columns=["text"])
    
    # Test prompts and true labels
    y_true = [label_mapping[label] for label in X_test['sentiment'].tolist()]
    test_prompts = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
    
    # Convert to HuggingFace datasets
    train_data = Dataset.from_pandas(train_prompts)
    eval_data = Dataset.from_pandas(val_prompts)
    
    return train_data, eval_data, test_prompts, y_true
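
The key property of the two prompt builders is that the inference prompt is an exact prefix of the training prompt, so the first generated token should be the label. The sketch below illustrates this with a flattened template: it omits the indentation whitespace that the triple-quoted strings above embed, and `build_prompt` and the `"<eos>"` default are placeholders (the notebook uses `tokenizer.eos_token`).

```python
INSTRUCTION = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative"'
)

def build_prompt(text, label=None, eos_token="<eos>"):
    """Build an inference prompt (label=None) or a training prompt (label given)."""
    prompt = f"{INSTRUCTION}\n\n[{text}] ="
    if label is None:
        return prompt
    # Training prompts append the gold label and the end-of-sequence token.
    return f"{prompt} {label}{eos_token}"

headline = "Operating profit rose compared with the previous year"  # made-up example
infer_prompt = build_prompt(headline)
train_prompt = build_prompt(headline, label="positive")
print(train_prompt[len(infer_prompt):])
```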

In [8]:
def load_model_and_tokenizer():
    """Load fresh model and tokenizer"""
    model_name = "google/gemma-7b"
    
    compute_dtype = torch.float16
    
    # Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config, 
    )
    
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    
    # Load tokenizer
    max_seq_length = 2048
    tokenizer = AutoTokenizer.from_pretrained(model_name, max_seq_length=max_seq_length)
    
    return model, tokenizer

In [9]:
def evaluate_model(y_true, y_pred, agreement_level):
    """Comprehensive evaluation function"""
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none': 1, 'negative': 0}
    
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    
    print(f"\n{'='*60}")
    print(f"RESULTS FOR {agreement_level.upper()}")
    print(f"{'='*60}")
    print(f'Overall Accuracy: {accuracy:.3f}')
    
    # Calculate per-class accuracy
    unique_labels = set(y_true_mapped)
    
    class_accuracies = {}
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) 
                         if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        label_name = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}[label]
        class_accuracies[label_name] = label_accuracy
        print(f'{label_name} Accuracy: {label_accuracy:.3f}')
        
    # Generate classification report; pass labels explicitly so target_names
    # stay aligned even if a class is missing from y_true or y_pred
    report_kwargs = dict(
        y_true=y_true_mapped,
        y_pred=y_pred_mapped,
        labels=[0, 1, 2],
        target_names=['Negative', 'Neutral', 'Positive'],
        zero_division=0,
    )
    class_report = classification_report(output_dict=True, **report_kwargs)
    
    print('\nClassification Report:')
    print(classification_report(**report_kwargs))
    
    return {
        'agreement_level': agreement_level,
        'overall_accuracy': accuracy,
        'class_accuracies': class_accuracies,
        'classification_report': class_report,
        'test_samples': len(y_true)
    }
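
The per-class "accuracy" computed above is the fraction of each true class that was predicted correctly, which is the same quantity as per-class recall (compare `Negative Accuracy: 0.867` with the recall of 0.87 in the report below). The pure-Python sketch that follows reproduces that calculation on made-up labels; `per_class_accuracy` is an illustrative helper, not part of the notebook.

```python
LABEL_TO_ID = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}

def per_class_accuracy(y_true, y_pred):
    """Accuracy restricted to the samples of each true class (per-class recall)."""
    true_ids = [LABEL_TO_ID.get(label, 1) for label in y_true]
    pred_ids = [LABEL_TO_ID.get(label, 1) for label in y_pred]
    result = {}
    for class_id in sorted(set(true_ids)):
        pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t == class_id]
        result[class_id] = sum(t == p for t, p in pairs) / len(pairs)
    return result

# Toy labels for illustration only.
y_true_demo = ["negative", "negative", "neutral", "positive", "positive"]
y_pred_demo = ["negative", "neutral", "none", "positive", "negative"]
print(per_class_accuracy(y_true_demo, y_pred_demo))
```

Note that an unparseable prediction (`"none"`) is mapped to neutral, so in this toy example it counts as a correct neutral prediction. Tracking the number of `"none"` outputs separately would show how much this mapping contributes to the neutral score.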

In [10]:
def predict_sentiment(test_prompts, model, tokenizer):
    """Prediction function"""
    y_pred = []
    for i in tqdm(range(len(test_prompts)), desc="Predicting"):
        prompt = test_prompts.iloc[i]["text"]
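        # Move inputs to whichever device the model was placed on
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            # Greedy decoding; temperature is not used when do_sample=False
            outputs = model.generate(
                **inputs, 
                max_new_tokens=1, 
                do_sample=False
            )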
        
        result = tokenizer.decode(outputs[0])
        answer = result.split("=")[-1].lower().strip()
        
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    
    return y_pred
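
The answer-parsing step can be isolated and tested without a model. The sketch below mirrors the logic in `predict_sentiment` (take the text after the last `=`, lowercase it, then match keywords in the same order); the function name `parse_sentiment` and the sample strings are illustrative.

```python
def parse_sentiment(decoded_text):
    """Map decoded model output to a label using the text after the last '='."""
    answer = decoded_text.split("=")[-1].lower().strip()
    # Same keyword order as predict_sentiment: positive, negative, neutral.
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"

print(parse_sentiment("[Sales grew] = positive"))
print(parse_sentiment("[Margin = 5%] = Neutral<eos>"))
print(parse_sentiment("[Sales grew] = the"))
```

Splitting on the last `=` keeps the parser robust to headlines that themselves contain an equals sign, and anything that does not contain a known label falls through to `"none"`.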

In [11]:
def fine_tune_model(model, tokenizer, train_data, eval_data, agreement_level):
    """Fine-tune model for specific agreement level"""
    # LoRA configuration
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    
    # Training arguments
    training_steps = len(train_data) // 8
    eval_steps = max(training_steps // 10, 10)
    
    training_arguments = TrainingArguments(
        output_dir=f"logs_{agreement_level}",
        num_train_epochs=3,  # Reduced epochs for comparison
        gradient_checkpointing=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="paged_adamw_32bit",
        save_steps=0,
        logging_steps=max(training_steps // 20, 5),
        learning_rate=2e-4,
        weight_decay=0.001,
        fp16=True,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=False,
        evaluation_strategy='steps',
        eval_steps=eval_steps,
        eval_accumulation_steps=1,
        lr_scheduler_type="cosine",
        report_to="tensorboard",
    )
    
    # Initialize trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=eval_data,
        peft_config=peft_config,
        dataset_text_field="text",
        tokenizer=tokenizer,
        max_seq_length=2048,
        args=training_arguments,
        packing=False,
    )
    
    print(f"\nStarting fine-tuning for {agreement_level}...")
    print(f"Training samples: {len(train_data)}")
    print(f"Validation samples: {len(eval_data)}")
    
    # Start training
    trainer.train()
    
    # Save model
    model_save_path = f"../models/trained-gemma-{agreement_level}"
    trainer.model.save_pretrained(model_save_path)
    print(f"Model saved to: {model_save_path}")
    
    return trainer.model
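
The step counts implied by these training arguments can be worked out by hand. The sketch below is an approximation under stated assumptions: a single GPU, the incomplete accumulation group at the end of each epoch being dropped, and linear warmup. `schedule_summary` is a hypothetical helper, not a Transformers API.

```python
import math

def schedule_summary(num_samples, batch_size=1, grad_accum=8, epochs=3,
                     warmup_ratio=0.03, peak_lr=2e-4):
    """Derive step counts from the TrainingArguments values used above."""
    # One optimizer step consumes batch_size * grad_accum samples.
    steps_per_epoch = num_samples // (batch_size * grad_accum)
    total_steps = steps_per_epoch * epochs
    eval_steps = max(steps_per_epoch // 10, 10)
    logging_steps = max(steps_per_epoch // 20, 5)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    # Assumed linear warmup: learning rate at the first logged step.
    if warmup_steps:
        lr_at_first_log = peak_lr * min(logging_steps / warmup_steps, 1.0)
    else:
        lr_at_first_log = peak_lr
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "eval_steps": eval_steps,
        "logging_steps": logging_steps,
        "warmup_steps": warmup_steps,
        "lr_at_first_log": lr_at_first_log,
    }

summary = schedule_summary(630)
print(summary)
```

For 630 training samples this gives 234 optimizer steps in total and a learning rate of 1.25e-4 at the first logged step, which is consistent with the `0/234` progress bar and the first `learning_rate` entry in the training log below.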

## Main Comparison Loop

In [12]:
# Define agreement levels to test
agreement_levels = [
    "sentences_50agree",
    "sentences_66agree", 
    "sentences_75agree",
    "sentences_allagree"
]

# Store results
all_results = []
detailed_results = {}

print(f"Starting comparison across {len(agreement_levels)} agreement levels...")
print(f"Agreement levels: {agreement_levels}")

# Clear initial memory
clear_memory()

Starting comparison across 4 agreement levels...
Agreement levels: ['sentences_50agree', 'sentences_66agree', 'sentences_75agree', 'sentences_allagree']
Clearing memory...
GPU memory allocated: 0.00 GB
GPU memory reserved: 0.00 GB


In [13]:
for i, agreement_level in enumerate(agreement_levels):
    print(f"\n{'='*80}")
    print(f"PROCESSING {agreement_level.upper()} ({i+1}/{len(agreement_levels)})")
    print(f"{'='*80}")
    
    try:
        # Step 1: Prepare dataset
        df, sentiment_dist = prepare_dataset(agreement_level)
        
        # Step 2: Create splits
        X_train, X_val, X_test = create_balanced_splits(df)
        
        if len(X_train) == 0 or len(X_test) == 0:
            print(f"Skipping {agreement_level} - insufficient data")
            continue
        
        # Step 3: Load fresh model and tokenizer
        print("\nLoading fresh model and tokenizer...")
        model, tokenizer = load_model_and_tokenizer()
        
        # Step 4: Prepare prompts
        train_data, eval_data, test_prompts, y_true = prepare_prompts(X_train, X_val, X_test, tokenizer)
        
        # Step 5: Test baseline performance (quick test on subset)
        print("\nTesting baseline performance...")
        test_subset_size = min(20, len(test_prompts))  # Small subset for baseline
        test_subset = test_prompts.head(test_subset_size)
        true_subset = y_true[:test_subset_size]
        
        baseline_predictions = predict_sentiment(test_subset, model, tokenizer)
        # Map labels to integers; unparseable predictions ('none') count as neutral
        to_ids = np.vectorize(lambda x: {'positive': 2, 'neutral': 1, 'none': 1, 'negative': 0}.get(x, 1))
        baseline_accuracy = accuracy_score(to_ids(true_subset), to_ids(baseline_predictions))
        print(f"Baseline accuracy on {test_subset_size} samples: {baseline_accuracy:.3f}")
        
        # Step 6: Fine-tune model
        fine_tuned_model = fine_tune_model(model, tokenizer, train_data, eval_data, agreement_level)
        
        # Step 7: Evaluate fine-tuned model
        print(f"\nEvaluating fine-tuned model on {len(test_prompts)} test samples...")
        final_predictions = predict_sentiment(test_prompts, fine_tuned_model, tokenizer)
        
        # Step 8: Calculate metrics
        results = evaluate_model(y_true, final_predictions, agreement_level)
        results['baseline_accuracy'] = baseline_accuracy
        results['dataset_size'] = len(df)
        results['train_size'] = len(train_data)
        results['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        # Store results
        all_results.append(results)
        detailed_results[agreement_level] = {
            'y_true': y_true,
            'y_pred': final_predictions,
            'test_data': X_test.copy()
        }
        
        print(f"\n✅ Completed {agreement_level}")
        print(f"Final accuracy: {results['overall_accuracy']:.3f}")
        # The baseline was measured on a 20-sample subset, so this difference is only indicative
        print(f"Improvement over baseline: {results['overall_accuracy'] - baseline_accuracy:.3f}")
        
    except Exception as e:
        print(f"❌ Error processing {agreement_level}: {str(e)}")
        continue
    
    finally:
        # Clean up memory after each model
        if 'model' in locals():
            del model
        if 'tokenizer' in locals():
            del tokenizer
        if 'fine_tuned_model' in locals():
            del fine_tuned_model
        if 'trainer' in locals():
            del trainer
        
        # Wait and clear memory between runs (except for the last one)
        if i < len(agreement_levels) - 1:
            wait_and_clear(60)

print(f"\n🎉 Completed all {len(all_results)} agreement levels!")


PROCESSING SENTENCES_50AGREE (1/4)

Loading dataset: sentences_50agree...
Dataset shape: (4846, 2)
Sentiment distribution:
sentiment
1    2879
2    1363
0     604
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading fresh model and tokenizer...



Testing baseline performance...


Predicting: 100%|██████████| 20/20 [00:04<00:00,  4.16it/s]


Baseline accuracy on 20 samples: 0.600



Starting fine-tuning for sentences_50agree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.7753, 'grad_norm': 2.1528613567352295, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.4162, 'grad_norm': 7.059887409210205, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


{'eval_loss': 5.235414028167725, 'eval_runtime': 42.3383, 'eval_samples_per_second': 3.189, 'eval_steps_per_second': 0.402, 'epoch': 0.13}
{'loss': 1.1484, 'grad_norm': 1.1082799434661865, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 1.0193, 'grad_norm': 1.278225302696228, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


{'eval_loss': 3.6322758197784424, 'eval_runtime': 40.4586, 'eval_samples_per_second': 3.337, 'eval_steps_per_second': 0.42, 'epoch': 0.25}
{'loss': 0.9916, 'grad_norm': 0.8759284615516663, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.9316, 'grad_norm': 0.766567587852478, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


{'eval_loss': 3.611525774002075, 'eval_runtime': 40.5168, 'eval_samples_per_second': 3.332, 'eval_steps_per_second': 0.42, 'epoch': 0.38}
{'loss': 0.9062, 'grad_norm': 0.7578486800193787, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.9142, 'grad_norm': 0.8973798155784607, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


{'eval_loss': 4.369481563568115, 'eval_runtime': 40.4892, 'eval_samples_per_second': 3.334, 'eval_steps_per_second': 0.42, 'epoch': 0.51}
{'loss': 0.9045, 'grad_norm': 0.5756798386573792, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.7784, 'grad_norm': 0.650130033493042, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


{'eval_loss': 3.950406551361084, 'eval_runtime': 40.5139, 'eval_samples_per_second': 3.332, 'eval_steps_per_second': 0.42, 'epoch': 0.63}
{'loss': 0.871, 'grad_norm': 0.5725348591804504, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8963, 'grad_norm': 0.5672242045402527, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


{'eval_loss': 4.308538436889648, 'eval_runtime': 40.4922, 'eval_samples_per_second': 3.334, 'eval_steps_per_second': 0.42, 'epoch': 0.76}
{'loss': 0.7387, 'grad_norm': 0.4933378994464874, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.891, 'grad_norm': 0.7776598930358887, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


{'eval_loss': 4.332681179046631, 'eval_runtime': 40.4371, 'eval_samples_per_second': 3.339, 'eval_steps_per_second': 0.42, 'epoch': 0.89}
{'loss': 0.8738, 'grad_norm': 0.7762870192527771, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7396, 'grad_norm': 0.5276827812194824, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


{'eval_loss': 2.2527530193328857, 'eval_runtime': 40.4719, 'eval_samples_per_second': 3.336, 'eval_steps_per_second': 0.42, 'epoch': 1.02}
{'loss': 0.5781, 'grad_norm': 0.5705731511116028, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.6356, 'grad_norm': 0.6096354126930237, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


{'eval_loss': 2.6295926570892334, 'eval_runtime': 40.4381, 'eval_samples_per_second': 3.338, 'eval_steps_per_second': 0.42, 'epoch': 1.14}
{'loss': 0.6093, 'grad_norm': 0.5657178163528442, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.5614, 'grad_norm': 0.6040478348731995, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


{'eval_loss': 4.040938377380371, 'eval_runtime': 40.4247, 'eval_samples_per_second': 3.34, 'eval_steps_per_second': 0.421, 'epoch': 1.27}
{'loss': 0.6427, 'grad_norm': 0.6794120669364929, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.7171, 'grad_norm': 0.8418689370155334, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


{'eval_loss': 4.495939254760742, 'eval_runtime': 40.4302, 'eval_samples_per_second': 3.339, 'eval_steps_per_second': 0.42, 'epoch': 1.4}
{'loss': 0.6115, 'grad_norm': 0.8697348237037659, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.5785, 'grad_norm': 0.4955008924007416, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


{'eval_loss': 4.329401969909668, 'eval_runtime': 40.4282, 'eval_samples_per_second': 3.339, 'eval_steps_per_second': 0.42, 'epoch': 1.52}
{'loss': 0.6095, 'grad_norm': 0.847461462020874, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.5485, 'grad_norm': 0.7011503577232361, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


{'eval_loss': 3.2928597927093506, 'eval_runtime': 40.4946, 'eval_samples_per_second': 3.334, 'eval_steps_per_second': 0.42, 'epoch': 1.65}
{'loss': 0.6, 'grad_norm': 0.6665135622024536, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.566, 'grad_norm': 0.6151303052902222, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


{'eval_loss': 3.0917019844055176, 'eval_runtime': 40.5086, 'eval_samples_per_second': 3.333, 'eval_steps_per_second': 0.42, 'epoch': 1.78}
{'loss': 0.6956, 'grad_norm': 0.5662703514099121, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.5927, 'grad_norm': 0.7945682406425476, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


{'eval_loss': 3.0964395999908447, 'eval_runtime': 40.5266, 'eval_samples_per_second': 3.331, 'eval_steps_per_second': 0.419, 'epoch': 1.9}
{'loss': 0.5669, 'grad_norm': 0.6821892857551575, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.5383, 'grad_norm': 0.5944995880126953, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


{'eval_loss': 3.2918949127197266, 'eval_runtime': 40.4569, 'eval_samples_per_second': 3.337, 'eval_steps_per_second': 0.42, 'epoch': 2.03}
{'loss': 0.4032, 'grad_norm': 0.655983030796051, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.3695, 'grad_norm': 1.3740296363830566, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


{'eval_loss': 4.03077507019043, 'eval_runtime': 40.5848, 'eval_samples_per_second': 3.326, 'eval_steps_per_second': 0.419, 'epoch': 2.16}
{'loss': 0.3497, 'grad_norm': 0.6783967018127441, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.3529, 'grad_norm': 0.9170962572097778, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


{'eval_loss': 3.944071054458618, 'eval_runtime': 40.4597, 'eval_samples_per_second': 3.337, 'eval_steps_per_second': 0.42, 'epoch': 2.29}
{'loss': 0.3922, 'grad_norm': 0.8543192148208618, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.348, 'grad_norm': 0.7622824907302856, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


{'eval_loss': 3.8780148029327393, 'eval_runtime': 40.4449, 'eval_samples_per_second': 3.338, 'eval_steps_per_second': 0.42, 'epoch': 2.41}
{'loss': 0.3503, 'grad_norm': 0.8458753228187561, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.3446, 'grad_norm': 0.7818285226821899, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


{'eval_loss': 3.9033308029174805, 'eval_runtime': 40.4975, 'eval_samples_per_second': 3.334, 'eval_steps_per_second': 0.42, 'epoch': 2.54}
{'loss': 0.3506, 'grad_norm': 0.9490656852722168, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.3506, 'grad_norm': 0.9947255849838257, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


{'eval_loss': 3.9194488525390625, 'eval_runtime': 40.4636, 'eval_samples_per_second': 3.336, 'eval_steps_per_second': 0.42, 'epoch': 2.67}
{'loss': 0.3685, 'grad_norm': 0.9885782599449158, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.4013, 'grad_norm': 0.8801100254058838, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


{'eval_loss': 3.9195425510406494, 'eval_runtime': 40.4703, 'eval_samples_per_second': 3.336, 'eval_steps_per_second': 0.42, 'epoch': 2.79}
{'loss': 0.3561, 'grad_norm': 0.8704021573066711, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.3641, 'grad_norm': 1.006036400794983, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


{'eval_loss': 3.924083948135376, 'eval_runtime': 40.4914, 'eval_samples_per_second': 3.334, 'eval_steps_per_second': 0.42, 'epoch': 2.92}
{'train_runtime': 2099.7948, 'train_samples_per_second': 0.9, 'train_steps_per_second': 0.111, 'train_loss': 0.6795015676408751, 'epoch': 2.97}
Model saved to: ../models/trained-gemma-sentences_50agree

Evaluating fine-tuned model on 135 test samples...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Predicting: 100%|██████████| 135/135 [00:27<00:00,  4.95it/s]



RESULTS FOR SENTENCES_50AGREE
Overall Accuracy: 0.859
Negative Accuracy: 0.867
Neutral Accuracy: 0.844
Positive Accuracy: 0.867

Classification Report:
              precision    recall  f1-score   support

    Negative       0.97      0.87      0.92        45
     Neutral       0.81      0.84      0.83        45
    Positive       0.81      0.87      0.84        45

    accuracy                           0.86       135
   macro avg       0.87      0.86      0.86       135
weighted avg       0.87      0.86      0.86       135


✅ Completed sentences_50agree
Final accuracy: 0.859
Improvement over baseline: 0.259
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.53 GB

PROCESSING SENTENCES_66AGREE (2/4)

Loading dataset: sentences_66agree...


Dataset shape: (4217, 2)
Sentiment distribution:
sentiment
1    2535
2    1168
0     514
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading fresh model and tokenizer...



Testing baseline performance...


Predicting: 100%|██████████| 20/20 [00:03<00:00,  5.67it/s]


Baseline accuracy on 20 samples: 0.700



Starting fine-tuning for sentences_66agree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.7076, 'grad_norm': 2.1542627811431885, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.5103, 'grad_norm': 4.750561237335205, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


{'eval_loss': 4.02640962600708, 'eval_runtime': 199.4027, 'eval_samples_per_second': 0.677, 'eval_steps_per_second': 0.085, 'epoch': 0.13}
{'loss': 1.131, 'grad_norm': 2.109680652618408, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 0.9349, 'grad_norm': 1.1098700761795044, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


{'eval_loss': 4.014948844909668, 'eval_runtime': 188.636, 'eval_samples_per_second': 0.716, 'eval_steps_per_second': 0.09, 'epoch': 0.25}
{'loss': 0.8997, 'grad_norm': 0.8331458568572998, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.8715, 'grad_norm': 0.6664701700210571, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


{'eval_loss': 3.226595878601074, 'eval_runtime': 175.4024, 'eval_samples_per_second': 0.77, 'eval_steps_per_second': 0.097, 'epoch': 0.38}
{'loss': 0.878, 'grad_norm': 0.8562365770339966, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.8807, 'grad_norm': 0.7893714308738708, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


{'eval_loss': 2.844292163848877, 'eval_runtime': 168.6373, 'eval_samples_per_second': 0.801, 'eval_steps_per_second': 0.101, 'epoch': 0.51}
{'loss': 0.8249, 'grad_norm': 0.6834694743156433, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.8937, 'grad_norm': 0.8234925866127014, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


{'eval_loss': 2.610210418701172, 'eval_runtime': 186.0692, 'eval_samples_per_second': 0.726, 'eval_steps_per_second': 0.091, 'epoch': 0.63}
{'loss': 0.9732, 'grad_norm': 0.7330573201179504, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8364, 'grad_norm': 0.8210954070091248, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.56644868850708, 'eval_runtime': 185.0694, 'eval_samples_per_second': 0.729, 'eval_steps_per_second': 0.092, 'epoch': 0.76}
{'loss': 0.8142, 'grad_norm': 0.7060080766677856, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.8501, 'grad_norm': 0.7035314440727234, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.6381585597991943, 'eval_runtime': 186.7178, 'eval_samples_per_second': 0.723, 'eval_steps_per_second': 0.091, 'epoch': 0.89}
{'loss': 0.8127, 'grad_norm': 0.6339845061302185, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7462, 'grad_norm': 0.5305508971214294, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.9566586017608643, 'eval_runtime': 184.5406, 'eval_samples_per_second': 0.732, 'eval_steps_per_second': 0.092, 'epoch': 1.02}
{'loss': 0.6321, 'grad_norm': 0.6710834503173828, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.5961, 'grad_norm': 0.6381396055221558, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.6212563514709473, 'eval_runtime': 194.3682, 'eval_samples_per_second': 0.695, 'eval_steps_per_second': 0.087, 'epoch': 1.14}
{'loss': 0.6298, 'grad_norm': 0.7598622441291809, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.6106, 'grad_norm': 0.6594679355621338, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.6490116119384766, 'eval_runtime': 190.2341, 'eval_samples_per_second': 0.71, 'eval_steps_per_second': 0.089, 'epoch': 1.27}
{'loss': 0.5946, 'grad_norm': 0.6872744560241699, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.6881, 'grad_norm': 0.7025502920150757, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 4.378303050994873, 'eval_runtime': 192.2249, 'eval_samples_per_second': 0.702, 'eval_steps_per_second': 0.088, 'epoch': 1.4}
{'loss': 0.6109, 'grad_norm': 0.7040687203407288, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.5664, 'grad_norm': 0.532467782497406, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 4.387423038482666, 'eval_runtime': 188.1066, 'eval_samples_per_second': 0.718, 'eval_steps_per_second': 0.09, 'epoch': 1.52}
{'loss': 0.5488, 'grad_norm': 0.7776100635528564, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.5784, 'grad_norm': 0.709304928779602, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.827875852584839, 'eval_runtime': 185.9931, 'eval_samples_per_second': 0.726, 'eval_steps_per_second': 0.091, 'epoch': 1.65}
{'loss': 0.6483, 'grad_norm': 0.8374121785163879, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.6096, 'grad_norm': 0.7866350412368774, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.621126651763916, 'eval_runtime': 192.965, 'eval_samples_per_second': 0.7, 'eval_steps_per_second': 0.088, 'epoch': 1.78}
{'loss': 0.5907, 'grad_norm': 0.6101877689361572, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.5799, 'grad_norm': 0.7361197471618652, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.5515027046203613, 'eval_runtime': 192.4323, 'eval_samples_per_second': 0.702, 'eval_steps_per_second': 0.088, 'epoch': 1.9}
{'loss': 0.5958, 'grad_norm': 0.8123001456260681, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.4762, 'grad_norm': 0.4804086983203888, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.6541905403137207, 'eval_runtime': 187.0767, 'eval_samples_per_second': 0.722, 'eval_steps_per_second': 0.091, 'epoch': 2.03}
{'loss': 0.3773, 'grad_norm': 0.595813512802124, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.4254, 'grad_norm': 1.4746308326721191, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.7535037994384766, 'eval_runtime': 185.9249, 'eval_samples_per_second': 0.726, 'eval_steps_per_second': 0.091, 'epoch': 2.16}
{'loss': 0.3272, 'grad_norm': 1.0901398658752441, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.4029, 'grad_norm': 0.8006512522697449, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.7597174644470215, 'eval_runtime': 195.4226, 'eval_samples_per_second': 0.691, 'eval_steps_per_second': 0.087, 'epoch': 2.29}
{'loss': 0.3298, 'grad_norm': 0.7152491807937622, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.3491, 'grad_norm': 0.6010178923606873, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.7930080890655518, 'eval_runtime': 192.6026, 'eval_samples_per_second': 0.701, 'eval_steps_per_second': 0.088, 'epoch': 2.41}
{'loss': 0.3473, 'grad_norm': 0.7069100737571716, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.3491, 'grad_norm': 0.8636981844902039, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.8121564388275146, 'eval_runtime': 194.2038, 'eval_samples_per_second': 0.695, 'eval_steps_per_second': 0.088, 'epoch': 2.54}
{'loss': 0.3261, 'grad_norm': 0.8348325490951538, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.3247, 'grad_norm': 0.8679013848304749, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.80822491645813, 'eval_runtime': 192.1458, 'eval_samples_per_second': 0.703, 'eval_steps_per_second': 0.088, 'epoch': 2.67}
{'loss': 0.3694, 'grad_norm': 0.9941308498382568, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.3428, 'grad_norm': 0.7783603072166443, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.809690237045288, 'eval_runtime': 186.3824, 'eval_samples_per_second': 0.724, 'eval_steps_per_second': 0.091, 'epoch': 2.79}
{'loss': 0.3252, 'grad_norm': 0.7951807379722595, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.4068, 'grad_norm': 1.1584914922714233, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.8103692531585693, 'eval_runtime': 190.4354, 'eval_samples_per_second': 0.709, 'eval_steps_per_second': 0.089, 'epoch': 2.92}
{'train_runtime': 7124.3233, 'train_samples_per_second': 0.265, 'train_steps_per_second': 0.033, 'train_loss': 0.6706424752871195, 'epoch': 2.97}
Model saved to: ../models/trained-gemma-sentences_66agree

Evaluating fine-tuned model on 135 test samples...


Predicting: 100%|██████████| 135/135 [03:34<00:00,  1.59s/it]



RESULTS FOR SENTENCES_66AGREE
Overall Accuracy: 0.896
Negative Accuracy: 0.933
Neutral Accuracy: 0.867
Positive Accuracy: 0.889

Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      0.93      0.97        45
     Neutral       0.83      0.87      0.85        45
    Positive       0.87      0.89      0.88        45

    accuracy                           0.90       135
   macro avg       0.90      0.90      0.90       135
weighted avg       0.90      0.90      0.90       135


✅ Completed sentences_66agree
Final accuracy: 0.896
Improvement over baseline: 0.196
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.53 GB

PROCESSING SENTENCES_75AGREE (3/4)

Loading dataset: sentences_75agree...


Generating train split:   0%|          | 0/3453 [00:00<?, ? examples/s]

Dataset shape: (3453, 2)
Sentiment distribution:
sentiment
1    2146
2     887
0     420
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading fresh model and tokenizer...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]


Testing baseline performance...


Predicting: 100%|██████████| 20/20 [00:03<00:00,  5.58it/s]


Baseline accuracy on 20 samples: 0.600


Map:   0%|          | 0/630 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]


Starting fine-tuning for sentences_75agree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.7559, 'grad_norm': 2.2697901725769043, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.4252, 'grad_norm': 5.801547527313232, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 4.3807244300842285, 'eval_runtime': 761.2046, 'eval_samples_per_second': 0.177, 'eval_steps_per_second': 0.022, 'epoch': 0.13}
{'loss': 1.0738, 'grad_norm': 1.1775695085525513, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 1.0123, 'grad_norm': 1.0291610956192017, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.7285358905792236, 'eval_runtime': 771.5395, 'eval_samples_per_second': 0.175, 'eval_steps_per_second': 0.022, 'epoch': 0.25}
{'loss': 0.8627, 'grad_norm': 0.8501899838447571, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.8763, 'grad_norm': 0.9155899882316589, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.7896528244018555, 'eval_runtime': 992.3208, 'eval_samples_per_second': 0.136, 'eval_steps_per_second': 0.017, 'epoch': 0.38}
{'loss': 0.8026, 'grad_norm': 0.8193289637565613, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.811, 'grad_norm': 1.1094731092453003, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.7001962661743164, 'eval_runtime': 1121.9366, 'eval_samples_per_second': 0.12, 'eval_steps_per_second': 0.015, 'epoch': 0.51}
{'loss': 0.9329, 'grad_norm': 0.8104057908058167, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.8977, 'grad_norm': 0.8204874396324158, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.455503225326538, 'eval_runtime': 1124.2921, 'eval_samples_per_second': 0.12, 'eval_steps_per_second': 0.015, 'epoch': 0.63}
{'loss': 0.7635, 'grad_norm': 0.6147134304046631, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8839, 'grad_norm': 0.8285109996795654, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.719355821609497, 'eval_runtime': 1146.9288, 'eval_samples_per_second': 0.118, 'eval_steps_per_second': 0.015, 'epoch': 0.76}
{'loss': 0.8476, 'grad_norm': 0.7275451421737671, 'learning_rate': 0.00017216960824649303, 'epoch': 0.83}
{'loss': 0.838, 'grad_norm': 0.5734236240386963, 'learning_rate': 0.00016718807618570106, 'epoch': 0.89}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.5927846431732178, 'eval_runtime': 1093.7436, 'eval_samples_per_second': 0.123, 'eval_steps_per_second': 0.016, 'epoch': 0.89}
{'loss': 0.7605, 'grad_norm': 0.6987824440002441, 'learning_rate': 0.00016188209975614542, 'epoch': 0.95}
{'loss': 0.7375, 'grad_norm': 0.7233102917671204, 'learning_rate': 0.00015627730097695638, 'epoch': 1.02}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.43646502494812, 'eval_runtime': 1058.6111, 'eval_samples_per_second': 0.128, 'eval_steps_per_second': 0.016, 'epoch': 1.02}
{'loss': 0.6329, 'grad_norm': 0.523972749710083, 'learning_rate': 0.00015040074484992, 'epoch': 1.08}
{'loss': 0.6478, 'grad_norm': 0.6137544512748718, 'learning_rate': 0.00014428080866534396, 'epoch': 1.14}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.699547290802002, 'eval_runtime': 1029.9835, 'eval_samples_per_second': 0.131, 'eval_steps_per_second': 0.017, 'epoch': 1.14}
{'loss': 0.6861, 'grad_norm': 0.7580453753471375, 'learning_rate': 0.00013794704497101655, 'epoch': 1.21}
{'loss': 0.6432, 'grad_norm': 0.6304428577423096, 'learning_rate': 0.00013143003886596669, 'epoch': 1.27}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.324995756149292, 'eval_runtime': 1051.0945, 'eval_samples_per_second': 0.128, 'eval_steps_per_second': 0.016, 'epoch': 1.27}
{'loss': 0.6021, 'grad_norm': 0.6311206817626953, 'learning_rate': 0.00012476126030813963, 'epoch': 1.33}
{'loss': 0.5776, 'grad_norm': 0.7232437133789062, 'learning_rate': 0.00011797291214917881, 'epoch': 1.4}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.7622039318084717, 'eval_runtime': 1094.2857, 'eval_samples_per_second': 0.123, 'eval_steps_per_second': 0.016, 'epoch': 1.4}
{'loss': 0.6298, 'grad_norm': 0.856391191482544, 'learning_rate': 0.00011109777463013915, 'epoch': 1.46}
{'loss': 0.6007, 'grad_norm': 0.6362707018852234, 'learning_rate': 0.00010416904708904548, 'epoch': 1.52}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.368398427963257, 'eval_runtime': 1122.2814, 'eval_samples_per_second': 0.12, 'eval_steps_per_second': 0.015, 'epoch': 1.52}
{'loss': 0.5362, 'grad_norm': 0.7105799317359924, 'learning_rate': 9.722018764467461e-05, 'epoch': 1.59}
{'loss': 0.5641, 'grad_norm': 0.6604537963867188, 'learning_rate': 9.028475163071141e-05, 'epoch': 1.65}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.437361717224121, 'eval_runtime': 1123.7999, 'eval_samples_per_second': 0.12, 'eval_steps_per_second': 0.015, 'epoch': 1.65}
{'loss': 0.6096, 'grad_norm': 0.8567232489585876, 'learning_rate': 8.339622956046417e-05, 'epoch': 1.71}
{'loss': 0.5945, 'grad_norm': 0.9001504778862, 'learning_rate': 7.658788540459062e-05, 'epoch': 1.78}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.5141549110412598, 'eval_runtime': 1084.071, 'eval_samples_per_second': 0.125, 'eval_steps_per_second': 0.016, 'epoch': 1.78}
{'loss': 0.6695, 'grad_norm': 0.6156822443008423, 'learning_rate': 6.989259596277582e-05, 'epoch': 1.84}
{'loss': 0.5759, 'grad_norm': 0.6438992619514465, 'learning_rate': 6.334269210501875e-05, 'epoch': 1.9}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.608677387237549, 'eval_runtime': 1089.9293, 'eval_samples_per_second': 0.124, 'eval_steps_per_second': 0.016, 'epoch': 1.9}
{'loss': 0.5909, 'grad_norm': 0.6954379677772522, 'learning_rate': 5.696980264915777e-05, 'epoch': 1.97}
{'loss': 0.4618, 'grad_norm': 0.5764989256858826, 'learning_rate': 5.080470162853472e-05, 'epoch': 2.03}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.851266384124756, 'eval_runtime': 1124.9149, 'eval_samples_per_second': 0.12, 'eval_steps_per_second': 0.015, 'epoch': 2.03}
{'loss': 0.4068, 'grad_norm': 0.6058506965637207, 'learning_rate': 4.487715968732568e-05, 'epoch': 2.1}
{'loss': 0.3663, 'grad_norm': 0.9526309370994568, 'learning_rate': 3.921580032113602e-05, 'epoch': 2.16}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.2406156063079834, 'eval_runtime': 1116.9907, 'eval_samples_per_second': 0.121, 'eval_steps_per_second': 0.015, 'epoch': 2.16}
{'loss': 0.3267, 'grad_norm': 0.9221230745315552, 'learning_rate': 3.3847961657058845e-05, 'epoch': 2.22}
{'loss': 0.3682, 'grad_norm': 0.8725353479385376, 'learning_rate': 2.879956444064703e-05, 'epoch': 2.29}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.253580093383789, 'eval_runtime': 946.4538, 'eval_samples_per_second': 0.143, 'eval_steps_per_second': 0.018, 'epoch': 2.29}
{'loss': 0.3938, 'grad_norm': 0.8309304714202881, 'learning_rate': 2.409498686727587e-05, 'epoch': 2.35}
{'loss': 0.3397, 'grad_norm': 0.6865515112876892, 'learning_rate': 1.9756946862323545e-05, 'epoch': 2.41}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.224029541015625, 'eval_runtime': 940.161, 'eval_samples_per_second': 0.144, 'eval_steps_per_second': 0.018, 'epoch': 2.41}
{'loss': 0.34, 'grad_norm': 0.8626471757888794, 'learning_rate': 1.580639237862608e-05, 'epoch': 2.48}
{'loss': 0.3664, 'grad_norm': 0.8477368354797363, 'learning_rate': 1.2262400240949023e-05, 'epoch': 2.54}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.242103338241577, 'eval_runtime': 953.967, 'eval_samples_per_second': 0.142, 'eval_steps_per_second': 0.018, 'epoch': 2.54}
{'loss': 0.3386, 'grad_norm': 0.7671940922737122, 'learning_rate': 9.142084025945984e-06, 'epoch': 2.6}
{'loss': 0.3573, 'grad_norm': 0.7648212909698486, 'learning_rate': 6.460511422441984e-06, 'epoch': 2.67}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.269228219985962, 'eval_runtime': 939.1299, 'eval_samples_per_second': 0.144, 'eval_steps_per_second': 0.018, 'epoch': 2.67}
{'loss': 0.3587, 'grad_norm': 0.8221639394760132, 'learning_rate': 4.230631471100655e-06, 'epoch': 2.73}
{'loss': 0.3866, 'grad_norm': 0.8934240937232971, 'learning_rate': 2.4632120348272003e-06, 'epoch': 2.79}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.2673697471618652, 'eval_runtime': 944.1319, 'eval_samples_per_second': 0.143, 'eval_steps_per_second': 0.018, 'epoch': 2.79}
{'loss': 0.3646, 'grad_norm': 0.8070080280303955, 'learning_rate': 1.1667878018564171e-06, 'epoch': 2.86}
{'loss': 0.3651, 'grad_norm': 1.033717155456543, 'learning_rate': 3.4761907261356976e-07, 'epoch': 2.92}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.268368721008301, 'eval_runtime': 940.7441, 'eval_samples_per_second': 0.144, 'eval_steps_per_second': 0.018, 'epoch': 2.92}
{'train_runtime': 80922.8707, 'train_samples_per_second': 0.023, 'train_steps_per_second': 0.003, 'train_loss': 0.6687080264091492, 'epoch': 2.97}
Model saved to: ../models/trained-gemma-sentences_75agree

Evaluating fine-tuned model on 135 test samples...


Predicting: 100%|██████████| 135/135 [18:48<00:00,  8.36s/it]



RESULTS FOR SENTENCES_75AGREE
Overall Accuracy: 0.956
Negative Accuracy: 0.956
Neutral Accuracy: 0.956
Positive Accuracy: 0.956

Classification Report:
              precision    recall  f1-score   support

    Negative       0.96      0.96      0.96        45
     Neutral       0.93      0.96      0.95        45
    Positive       0.98      0.96      0.97        45

    accuracy                           0.96       135
   macro avg       0.96      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135


✅ Completed sentences_75agree
Final accuracy: 0.956
Improvement over baseline: 0.356
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.53 GB

PROCESSING SENTENCES_ALLAGREE (4/4)

Loading dataset: sentences_allagree...
Dataset shape: (2264, 2)
Sentiment distribution:
sentiment
1    1391
2     570
0     303
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]


Testing baseline performance...


Predicting: 100%|██████████| 20/20 [00:03<00:00,  5.49it/s]


Baseline accuracy on 20 samples: 0.650


Map:   0%|          | 0/630 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]


Starting fine-tuning for sentences_allagree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.7024, 'grad_norm': 2.0866994857788086, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.3267, 'grad_norm': 1.1144582033157349, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 4.551583766937256, 'eval_runtime': 301.8283, 'eval_samples_per_second': 0.447, 'eval_steps_per_second': 0.056, 'epoch': 0.13}
{'loss': 0.9411, 'grad_norm': 2.4472694396972656, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 0.9481, 'grad_norm': 1.1820392608642578, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.3061013221740723, 'eval_runtime': 300.6065, 'eval_samples_per_second': 0.449, 'eval_steps_per_second': 0.057, 'epoch': 0.25}
{'loss': 0.8817, 'grad_norm': 0.6980094313621521, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.8163, 'grad_norm': 0.7427102327346802, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.178292989730835, 'eval_runtime': 297.6257, 'eval_samples_per_second': 0.454, 'eval_steps_per_second': 0.057, 'epoch': 0.38}
{'loss': 0.8266, 'grad_norm': 0.6613677144050598, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.7819, 'grad_norm': 0.7747270464897156, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.933924913406372, 'eval_runtime': 302.5757, 'eval_samples_per_second': 0.446, 'eval_steps_per_second': 0.056, 'epoch': 0.51}
{'loss': 0.7099, 'grad_norm': 0.6444739699363708, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.731, 'grad_norm': 0.7612096667289734, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.8503944873809814, 'eval_runtime': 298.676, 'eval_samples_per_second': 0.452, 'eval_steps_per_second': 0.057, 'epoch': 0.63}
{'loss': 0.8121, 'grad_norm': 0.6304430365562439, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.7935, 'grad_norm': 0.8347581028938293, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.0055606365203857, 'eval_runtime': 298.3719, 'eval_samples_per_second': 0.452, 'eval_steps_per_second': 0.057, 'epoch': 0.76}
{'loss': 0.7635, 'grad_norm': 0.6826755404472351, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.7505, 'grad_norm': 0.7690250873565674, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.4949073791503906, 'eval_runtime': 298.3138, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 0.89}
{'loss': 0.8214, 'grad_norm': 0.7085044980049133, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7042, 'grad_norm': 0.41645824909210205, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.4369285106658936, 'eval_runtime': 297.0952, 'eval_samples_per_second': 0.454, 'eval_steps_per_second': 0.057, 'epoch': 1.02}
{'loss': 0.5418, 'grad_norm': 0.47218045592308044, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.5467, 'grad_norm': 0.6585941910743713, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.226062059402466, 'eval_runtime': 299.6632, 'eval_samples_per_second': 0.451, 'eval_steps_per_second': 0.057, 'epoch': 1.14}
{'loss': 0.5413, 'grad_norm': 0.5264376401901245, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.5813, 'grad_norm': 0.48963382840156555, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.7443692684173584, 'eval_runtime': 300.9069, 'eval_samples_per_second': 0.449, 'eval_steps_per_second': 0.056, 'epoch': 1.27}
{'loss': 0.5874, 'grad_norm': 0.7128834128379822, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.5968, 'grad_norm': 0.6966096758842468, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.8696184158325195, 'eval_runtime': 297.82, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 1.4}
{'loss': 0.5881, 'grad_norm': 0.6516638994216919, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.5017, 'grad_norm': 0.5505711436271667, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.9633214473724365, 'eval_runtime': 299.1376, 'eval_samples_per_second': 0.451, 'eval_steps_per_second': 0.057, 'epoch': 1.52}
{'loss': 0.5495, 'grad_norm': 0.7012750506401062, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.5872, 'grad_norm': 0.6071786880493164, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.8263134956359863, 'eval_runtime': 297.604, 'eval_samples_per_second': 0.454, 'eval_steps_per_second': 0.057, 'epoch': 1.65}
{'loss': 0.5084, 'grad_norm': 0.6189647912979126, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.6158, 'grad_norm': 0.7965925335884094, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.6772215366363525, 'eval_runtime': 300.7085, 'eval_samples_per_second': 0.449, 'eval_steps_per_second': 0.057, 'epoch': 1.78}
{'loss': 0.574, 'grad_norm': 0.5997666120529175, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.5019, 'grad_norm': 0.5529484748840332, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.7246572971343994, 'eval_runtime': 297.9635, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 1.9}
{'loss': 0.5292, 'grad_norm': 0.697863757610321, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.413, 'grad_norm': 0.4452820122241974, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 2.8784255981445312, 'eval_runtime': 299.1063, 'eval_samples_per_second': 0.451, 'eval_steps_per_second': 0.057, 'epoch': 2.03}
{'loss': 0.3727, 'grad_norm': 0.49667060375213623, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.3492, 'grad_norm': 0.9106413722038269, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.298081398010254, 'eval_runtime': 298.6313, 'eval_samples_per_second': 0.452, 'eval_steps_per_second': 0.057, 'epoch': 2.16}
{'loss': 0.3476, 'grad_norm': 0.8978917598724365, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.3766, 'grad_norm': 0.7201527953147888, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.1816036701202393, 'eval_runtime': 298.0978, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 2.29}
{'loss': 0.345, 'grad_norm': 0.6159949898719788, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.3525, 'grad_norm': 0.6905192136764526, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.1399388313293457, 'eval_runtime': 301.5172, 'eval_samples_per_second': 0.448, 'eval_steps_per_second': 0.056, 'epoch': 2.41}
{'loss': 0.342, 'grad_norm': 0.6366906762123108, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.3325, 'grad_norm': 0.943338930606842, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.189056158065796, 'eval_runtime': 299.3682, 'eval_samples_per_second': 0.451, 'eval_steps_per_second': 0.057, 'epoch': 2.54}
{'loss': 0.3105, 'grad_norm': 0.7626591324806213, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.3202, 'grad_norm': 0.7246250510215759, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.2168548107147217, 'eval_runtime': 299.3846, 'eval_samples_per_second': 0.451, 'eval_steps_per_second': 0.057, 'epoch': 2.67}
{'loss': 0.3165, 'grad_norm': 0.7485545873641968, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.3373, 'grad_norm': 0.6718629002571106, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.226066827774048, 'eval_runtime': 297.9706, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 2.79}
{'loss': 0.3318, 'grad_norm': 0.8214324116706848, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.3187, 'grad_norm': 0.7478442192077637, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 3.22692608833313, 'eval_runtime': 298.3111, 'eval_samples_per_second': 0.453, 'eval_steps_per_second': 0.057, 'epoch': 2.92}
{'train_runtime': 15764.6666, 'train_samples_per_second': 0.12, 'train_steps_per_second': 0.015, 'train_loss': 0.6225597960317236, 'epoch': 2.97}
Model saved to: ../models/trained-gemma-sentences_allagree

Evaluating fine-tuned model on 135 test samples...


Predicting: 100%|██████████| 135/135 [04:13<00:00,  1.88s/it]


RESULTS FOR SENTENCES_ALLAGREE
Overall Accuracy: 0.978
Negative Accuracy: 1.000
Neutral Accuracy: 0.956
Positive Accuracy: 0.978

Classification Report:
              precision    recall  f1-score   support

    Negative       0.96      1.00      0.98        45
     Neutral       1.00      0.96      0.98        45
    Positive       0.98      0.98      0.98        45

    accuracy                           0.98       135
   macro avg       0.98      0.98      0.98       135
weighted avg       0.98      0.98      0.98       135


✅ Completed sentences_allagree
Final accuracy: 0.978
Improvement over baseline: 0.328

🎉 Completed all 4 agreement levels!
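
The first `learning_rate` values logged in each run above are consistent with a linear warmup followed by cosine decay. The sketch below reproduces them under assumptions that are inferred from the logged numbers, not read from the training configuration: a peak learning rate of 2e-4, 8 warmup steps, 234 total optimizer steps, and one log entry every 5 steps. The function name `cosine_lr_with_warmup` is illustrative only.

```python
import math

# Assumed values, inferred from the logged numbers (not taken from the training config).
PEAK_LR = 2e-4
WARMUP_STEPS = 8
TOTAL_STEPS = 234


def cosine_lr_with_warmup(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                          total_steps=TOTAL_STEPS):
    """Linear warmup to peak_lr, then cosine decay towards zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# First three learning rates logged at the start of each run,
# keyed by the optimizer step they are assumed to correspond to.
logged = {
    5: 0.000125,
    10: 0.00019996135574945544,
    15: 0.00019952695086820975,
}

for step, expected in logged.items():
    print(f"step {step:3d}  computed={cosine_lr_with_warmup(step):.12e}  logged={expected:.12e}")
```

If the computed and logged values agree, the schedule assumptions are plausible, but only the actual `TrainingArguments` can confirm them.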





## Results Analysis and Comparison
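
In the training logs above, the training loss keeps falling through the third epoch, while `eval_loss` reaches its minimum much earlier and then rises again. This pattern is commonly read as overfitting. The sketch below shows one way to locate the lowest-`eval_loss` point from such log lines. The entries are a subset of the `sentences_allagree` eval logs, trimmed to the two fields used here, and the helper name `best_eval_point` is hypothetical.

```python
import ast

# Subset of the eval log entries printed for sentences_allagree,
# trimmed to the 'eval_loss' and 'epoch' fields.
log_lines = [
    "{'eval_loss': 4.551583766937256, 'epoch': 0.13}",
    "{'eval_loss': 2.4949073791503906, 'epoch': 0.89}",
    "{'eval_loss': 2.4369285106658936, 'epoch': 1.02}",
    "{'eval_loss': 2.8784255981445312, 'epoch': 2.03}",
    "{'eval_loss': 3.22692608833313, 'epoch': 2.92}",
]


def best_eval_point(lines):
    """Parse dict-style log lines and return the record with the lowest eval_loss."""
    records = [ast.literal_eval(line) for line in lines]
    evals = [r for r in records if "eval_loss" in r]
    return min(evals, key=lambda r: r["eval_loss"])


best = best_eval_point(log_lines)
last = ast.literal_eval(log_lines[-1])
print(f"lowest eval_loss {best['eval_loss']:.3f} at epoch {best['epoch']}")
print(f"final eval_loss  {last['eval_loss']:.3f} at epoch {last['epoch']}")
```

Whether the checkpoint with the lowest `eval_loss` would also give the best test accuracy is not something these logs can confirm; that would need a separate evaluation of that checkpoint.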

In [14]:
# Create comprehensive results DataFrame
if all_results:
    comparison_df = pd.DataFrame([
        {
            'Agreement_Level': result['agreement_level'],
            'Dataset_Size': result['dataset_size'],
            'Train_Size': result['train_size'],
            'Test_Size': result['test_samples'],
            'Baseline_Accuracy': result['baseline_accuracy'],
            'Final_Accuracy': result['overall_accuracy'],
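            # Absolute accuracy gain over the pre-fine-tuning baseline
            # (the run logs report the baseline on a 20-sample subset)
            'Improvement': result['overall_accuracy'] - result['baseline_accuracy'],
            # Relative gain, as a percentage of the baseline accuracy
            'Improvement_Percent': ((result['overall_accuracy'] - result['baseline_accuracy']) / result['baseline_accuracy']) * 100,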
            'Timestamp': result['timestamp']
        }
        for result in all_results
    ])
    
    # Sort by agreement level for better display
    level_order = ['sentences_50agree', 'sentences_66agree', 'sentences_75agree', 'sentences_allagree']
    comparison_df['Level_Order'] = comparison_df['Agreement_Level'].apply(lambda x: level_order.index(x) if x in level_order else 999)
    comparison_df = comparison_df.sort_values('Level_Order').drop('Level_Order', axis=1)
    
    print("\n" + "="*100)
    print("COMPREHENSIVE RESULTS COMPARISON")
    print("="*100)
    
    # Display results table
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    print(comparison_df.round(3))
    
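    # Save results (create the output directory first so to_csv does not fail on a fresh checkout)
    results_filename = "../results/gemma_agreement_levels_comparison.csv"
    os.makedirs(os.path.dirname(results_filename), exist_ok=True)
    comparison_df.to_csv(results_filename, index=False)
    print(f"\n📊 Results saved to: {results_filename}")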
    
else:
    print("❌ No results to display - all experiments failed")


COMPREHENSIVE RESULTS COMPARISON
      Agreement_Level  Dataset_Size  Train_Size  Test_Size  Baseline_Accuracy  \
0   sentences_50agree          4846         630        135               0.60   
1   sentences_66agree          4217         630        135               0.70   
2   sentences_75agree          3453         630        135               0.60   
3  sentences_allagree          2264         630        135               0.65   

   Final_Accuracy  Improvement  Improvement_Percent            Timestamp  
0           0.859        0.259               43.210  2025-06-27 08:32:18  
1           0.896        0.196               28.042  2025-06-27 10:37:32  
2           0.956        0.356               59.259  2025-06-28 09:28:01  
3           0.978        0.328               50.427  2025-06-28 13:57:57  

📊 Results saved to: ../results/gemma_agreement_levels_comparison.csv
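
Each accuracy in the table above is estimated from 135 test sentences, so it carries sampling uncertainty. As a rough sketch that is not part of the original experiment, a 95% Wilson score interval can be computed from the reported accuracy and the test size. The helper name `wilson_interval` is hypothetical, and the accuracies are the rounded values copied from the table.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Rounded Final_Accuracy values copied from the results table.
# Test_Size is 135 for every agreement level.
reported = {
    "sentences_50agree": 0.859,
    "sentences_66agree": 0.896,
    "sentences_75agree": 0.956,
    "sentences_allagree": 0.978,
}

intervals = {level: wilson_interval(acc, 135) for level, acc in reported.items()}
for level, (low, high) in intervals.items():
    print(f"{level}: [{low:.3f}, {high:.3f}]")
```

Overlapping intervals between neighbouring agreement levels suggest that the difference between them could be due to the small test set rather than to the training data.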


In [15]:
# Analyze trends and insights
if all_results:
    print("\n" + "="*80)
    print("KEY INSIGHTS AND ANALYSIS")
    print("="*80)
    
    # Find best and worst performing models
    best_model = comparison_df.loc[comparison_df['Final_Accuracy'].idxmax()]
    worst_model = comparison_df.loc[comparison_df['Final_Accuracy'].idxmin()]
    
    print("\n🏆 BEST PERFORMING MODEL:")
    print(f"  Agreement Level: {best_model['Agreement_Level']}")
    print(f"  Final Accuracy: {best_model['Final_Accuracy']:.3f}")
    print(f"  Dataset Size: {best_model['Dataset_Size']:,}")
    print(f"  Improvement: {best_model['Improvement']:.3f} ({best_model['Improvement_Percent']:+.1f}%)")
    
    print("\n📉 LOWEST PERFORMING MODEL:")
    print(f"  Agreement Level: {worst_model['Agreement_Level']}")
    print(f"  Final Accuracy: {worst_model['Final_Accuracy']:.3f}")
    print(f"  Dataset Size: {worst_model['Dataset_Size']:,}")
    print(f"  Improvement: {worst_model['Improvement']:.3f} ({worst_model['Improvement_Percent']:+.1f}%)")
    
    # Analyze relationship between agreement level and performance
    print("\n📈 TRENDS ANALYSIS:")
    print(f"  Average Final Accuracy: {comparison_df['Final_Accuracy'].mean():.3f}")
    print(f"  Average Improvement: {comparison_df['Improvement'].mean():.3f}")
    print(f"  Standard Deviation: {comparison_df['Final_Accuracy'].std():.3f}")
    
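    # Check if higher agreement correlates with better performance.
    # Note: this is a Pearson correlation over only four data points (one per
    # agreement level), so it describes the trend but does not establish significance.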
    agreement_mapping = {
        'sentences_50agree': 50,
        'sentences_66agree': 66,
        'sentences_75agree': 75,
        'sentences_allagree': 100
    }
    
    comparison_df['Agreement_Percent'] = comparison_df['Agreement_Level'].map(agreement_mapping)
    correlation = comparison_df['Agreement_Percent'].corr(comparison_df['Final_Accuracy'])
    
    print("\n🔍 CORRELATION ANALYSIS:")
    print(f"  Agreement Level vs Final Accuracy Correlation: {correlation:.3f}")
    
    if correlation > 0.5:
        print("  ✅ Strong positive correlation - Higher agreement improves performance")
    elif correlation > 0.2:
        print("  ✅ Moderate positive correlation - Higher agreement tends to improve performance")
    elif correlation > -0.2:
        print("  ⚠️  Weak correlation - Agreement level has minimal impact on performance")
    else:
        print("  ❌ Negative correlation - Unexpected result, may need investigation")
    
    print("\n💡 RECOMMENDATIONS:")
    print(f"  • Use {best_model['Agreement_Level']} for production deployment")
    print("  • Data quality vs quantity trade-off analysis completed")
    print("  • Consider ensemble methods if performance differences are small")
    print("  • Monitor real-world performance to validate these results")
    
else:
    print("❌ No analysis possible - no successful experiments")


KEY INSIGHTS AND ANALYSIS

🏆 BEST PERFORMING MODEL:
  Agreement Level: sentences_allagree
  Final Accuracy: 0.978
  Dataset Size: 2,264
  Improvement: 0.328 (+50.4%)

📉 LOWEST PERFORMING MODEL:
  Agreement Level: sentences_50agree
  Final Accuracy: 0.859
  Dataset Size: 4,846
  Improvement: 0.259 (+43.2%)

📈 TRENDS ANALYSIS:
  Average Final Accuracy: 0.922
  Average Improvement: 0.285
  Standard Deviation: 0.054

🔍 CORRELATION ANALYSIS:
  Agreement Level vs Final Accuracy Correlation: 0.939
  ✅ Strong positive correlation - Higher agreement improves performance

💡 RECOMMENDATIONS:
  • Use sentences_allagree for production deployment
  • Data quality vs quantity trade-off analysis completed
  • Consider ensemble methods if performance differences are small
  • Monitor real-world performance to validate these results
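
The correlation reported above is computed from four (agreement, accuracy) pairs. As a sketch of what that calculation involves, the snippet below recomputes Pearson's r from the rounded table values using only the standard library. The function name `pearson_r` is hypothetical, and because the inputs are rounded to three decimals the result can differ slightly from the value printed by the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values from agreement_mapping and the rounded Final_Accuracy column.
agreement_percent = [50, 66, 75, 100]
final_accuracy = [0.859, 0.896, 0.956, 0.978]

r = pearson_r(agreement_percent, final_accuracy)
print(f"Pearson r = {r:.3f}")
```

With only four points, a single run that lands a few test sentences differently can move r noticeably, so repeating each experiment with several seeds would make the trend more convincing.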


In [16]:
# Save detailed results for each agreement level
if detailed_results:
    for agreement_level, data in detailed_results.items():
        # Create detailed predictions DataFrame
        detailed_df = pd.DataFrame({
            'text': data['test_data']['text'].tolist(),
            'true_sentiment': data['y_true'],
            'predicted_sentiment': data['y_pred'],
            'correct': [t == p for t, p in zip(data['y_true'], data['y_pred'])],
            'agreement_level': agreement_level
        })
        
        # Save detailed results
        detail_filename = f"../results/detailed_predictions_{agreement_level}.csv"
        detailed_df.to_csv(detail_filename, index=False)
        print(f"📝 Detailed predictions saved: {detail_filename}")

print("\n✅ All results and analysis completed!")
print("📂 Check the ../results/ folder for all output files")

📝 Detailed predictions saved: ../results/detailed_predictions_sentences_50agree.csv
📝 Detailed predictions saved: ../results/detailed_predictions_sentences_66agree.csv
📝 Detailed predictions saved: ../results/detailed_predictions_sentences_75agree.csv
📝 Detailed predictions saved: ../results/detailed_predictions_sentences_allagree.csv

✅ All results and analysis completed!
📂 Check the ../results/ folder for all output files
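
The saved per-sentence files make it possible to see which classes a model gets wrong, not just how often. The following is a minimal sketch, assuming the column layout written by the cell above; the sample rows are invented for illustration and are read from an in-memory string rather than from the real `detailed_predictions_<level>.csv` files.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the same column layout as the saved CSV files.
# The sentences and labels are made up for illustration only.
sample_csv = """text,true_sentiment,predicted_sentiment,correct,agreement_level
Operating profit rose.,positive,positive,True,sentences_allagree
Sales fell sharply.,negative,negative,True,sentences_allagree
The company has 200 employees.,neutral,positive,False,sentences_allagree
Net loss widened.,negative,neutral,False,sentences_allagree
"""

# Count (true label, predicted label) pairs: a sparse confusion matrix.
confusion = Counter()
for row in csv.DictReader(io.StringIO(sample_csv)):
    confusion[(row["true_sentiment"], row["predicted_sentiment"])] += 1

# Derive per-class accuracy from the confusion counts.
per_class_total = Counter()
per_class_correct = Counter()
for (true_label, pred_label), count in confusion.items():
    per_class_total[true_label] += count
    if true_label == pred_label:
        per_class_correct[true_label] += count

per_class_accuracy = {
    label: per_class_correct[label] / total
    for label, total in per_class_total.items()
}
print(per_class_accuracy)
```

To run this on real output, replace `io.StringIO(sample_csv)` with an open file handle for one of the saved CSV files.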


In [17]:
# Final memory cleanup
clear_memory()
print("\n🧹 Final memory cleanup completed")
print("\n🎯 Experiment completed successfully!")

Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.53 GB

🧹 Final memory cleanup completed

🎯 Experiment completed successfully!
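
The `clear_memory()` helper is defined earlier in the notebook and is not shown in this section. For readers without that cell at hand, here is a minimal sketch of what a helper of this kind might do, assuming it runs Python's garbage collector, releases PyTorch's cached GPU blocks, and reports the two figures printed above. The name `clear_memory_sketch` is hypothetical, and the original implementation may differ.

```python
import gc

try:
    import torch  # optional: the sketch degrades gracefully without PyTorch
except ImportError:
    torch = None

def clear_memory_sketch():
    """Free Python garbage and cached GPU memory; return (allocated_gb, reserved_gb) or None."""
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # return unused cached blocks to the GPU driver
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    reserved_gb = torch.cuda.memory_reserved() / 1024**3
    print(f"GPU memory allocated: {allocated_gb:.2f} GB")
    print(f"GPU memory reserved: {reserved_gb:.2f} GB")
    return allocated_gb, reserved_gb

stats = clear_memory_sketch()
```

On a machine without a CUDA device the function only runs the garbage collector and returns `None`.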
