# Fine-tuning Llama 3-8B: Comparison Across Agreement Levels

This notebook fine-tunes Llama 3-8B separately on each annotator-agreement subset of FinancialPhraseBank and compares test accuracy across subsets, to measure how label quality affects model performance.

## Agreement Levels:
- **sentences_50agree**: >=50% annotator agreement (4,846 sentences)
- **sentences_66agree**: >=66% annotator agreement (4,217 sentences) 
- **sentences_75agree**: >=75% annotator agreement (3,453 sentences)
- **sentences_allagree**: 100% annotator agreement (2,264 sentences)
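
As a quick check on how much data each stricter threshold discards, the counts listed above can be compared against the full 4,846-sentence subset. This is only an arithmetic sketch; the names `subset_sizes` and `retained` are illustrative and do not appear elsewhere in the notebook.

```python
# Sentence counts per agreement level, copied from the list above.
subset_sizes = {
    "sentences_50agree": 4846,
    "sentences_66agree": 4217,
    "sentences_75agree": 3453,
    "sentences_allagree": 2264,
}

full_size = subset_sizes["sentences_50agree"]

# Fraction of the loosest subset that survives each stricter threshold.
retained = {name: size / full_size for name, size in subset_sizes.items()}

for name, fraction in retained.items():
    print(f"{name}: {subset_sizes[name]} sentences ({fraction:.1%} of 50agree)")
```

Requiring unanimous agreement keeps a little under half of the sentences, so any gain from cleaner labels comes with a smaller pool to sample from.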

## Hypothesis:
Higher agreement levels should lead to better model performance due to reduced label noise.

In [1]:
# Install required packages
# CUDA 12.1 wheels, matching the PyTorch build reported by the import cell (2.5.1+cu121)
%pip install -q -U torch --index-url https://download.pytorch.org/whl/cu121
%pip install -q -U transformers==4.38.2
%pip install -q accelerate==0.32.0
%pip install -q -i https://pypi.org/simple/ bitsandbytes
%pip install -q -U datasets==2.16.1
%pip install -q -U trl==0.7.11
%pip install -q -U peft==0.10.0

Note: you may need to restart the kernel to use updated packages.




In [2]:
import os
import gc
import time
import warnings
from datetime import datetime

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig, 
    TrainingArguments, 
    pipeline, 
    logging
)
from datasets import Dataset, load_dataset
from peft import LoraConfig, PeftConfig
import bitsandbytes as bnb
from trl import SFTTrainer

from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix
)
from sklearn.model_selection import train_test_split

print(f"transformers=={transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

transformers==4.38.2
PyTorch version: 2.5.1+cu121
CUDA available: True


## Define Helper Functions

In [4]:
def clear_memory():
    """Run garbage collection, release cached GPU memory, and report GPU usage"""
    print("Clearing memory...")
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        # Report usage only when a CUDA device is actually present
        print(f"GPU memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
        print(f"GPU memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

def wait_and_clear(seconds=60):
    """Wait for specified seconds and clear memory"""
    print(f"Waiting {seconds} seconds...")
    time.sleep(seconds)
    clear_memory()

In [5]:
def prepare_dataset(agreement_level):
    """Load and prepare dataset for specific agreement level"""
    print(f"\nLoading dataset: {agreement_level}...")
    
    # Load the dataset from HuggingFace
    dataset = load_dataset("takala/financial_phrasebank", agreement_level)
    
    # Convert to pandas for easier manipulation
    df = dataset['train'].to_pandas()
    df = df.rename(columns={'sentence': 'text', 'label': 'sentiment'})
    
    print(f"Dataset shape: {df.shape}")
    print("Sentiment distribution:")
    sentiment_dist = df['sentiment'].value_counts()
    print(sentiment_dist)
    
    return df, sentiment_dist

In [6]:
def create_balanced_splits(df, max_samples_per_class=300):
    """Create balanced train/val/test splits"""
    # Calculate minimum class size
    min_class_size = df['sentiment'].value_counts().min()
    samples_per_class = min(max_samples_per_class, min_class_size)
    
    # Use 70% for training, 15% for validation, 15% for testing
    train_size_per_class = int(samples_per_class * 0.7)
    val_size_per_class = int(samples_per_class * 0.15)
    test_size_per_class = samples_per_class - train_size_per_class - val_size_per_class
    
    print(f"Split sizes per class: Train={train_size_per_class}, Val={val_size_per_class}, Test={test_size_per_class}")
    
    X_train, X_val, X_test = [], [], []
    
    # Map integer labels to sentiment names
    label_mapping = {0: "negative", 1: "neutral", 2: "positive"}
    
    for sentiment_label in [0, 1, 2]:
        sentiment_name = label_mapping[sentiment_label]
        sentiment_data = df[df.sentiment == sentiment_label]
        
        if len(sentiment_data) == 0:
            print(f"Warning: No samples found for {sentiment_name} sentiment")
            continue
        
        # Sample the required number for this class
        if len(sentiment_data) >= samples_per_class:
            sampled_data = sentiment_data.sample(n=samples_per_class, random_state=42)
        else:
            sampled_data = sentiment_data
            print(f"Warning: Only {len(sentiment_data)} samples available for {sentiment_name}")
        
        # Split the sampled data
        if len(sampled_data) >= 3:  # Need at least 3 samples to split
            temp_data, test_data = train_test_split(
                sampled_data, 
                test_size=min(test_size_per_class, len(sampled_data)//3),
                random_state=42
            )
            
            if len(temp_data) >= 2:
                train_data, val_data = train_test_split(
                    temp_data,
                    test_size=min(val_size_per_class, len(temp_data)//2),
                    random_state=42
                )
            else:
                train_data = temp_data
                val_data = pd.DataFrame()
        else:
            train_data = sampled_data
            val_data = pd.DataFrame()
            test_data = pd.DataFrame()
        
        if len(train_data) > 0:
            X_train.append(train_data)
        if len(val_data) > 0:
            X_val.append(val_data)
        if len(test_data) > 0:
            X_test.append(test_data)
    
    # Combine and shuffle
    X_train = pd.concat(X_train).sample(frac=1, random_state=10).reset_index(drop=True) if X_train else pd.DataFrame()
    X_val = pd.concat(X_val).sample(frac=1, random_state=10).reset_index(drop=True) if X_val else pd.DataFrame()
    X_test = pd.concat(X_test).sample(frac=1, random_state=10).reset_index(drop=True) if X_test else pd.DataFrame()
    
    print(f"Final splits: Train={len(X_train)}, Val={len(X_val)}, Test={len(X_test)}")
    
    return X_train, X_val, X_test
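
The per-class split arithmetic used above can be checked in isolation. The sketch below repeats the same three lines of arithmetic in a standalone helper; `split_sizes` is a hypothetical name introduced only for this illustration.

```python
def split_sizes(samples_per_class, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) sizes for one class.

    Train and validation sizes are truncated to integers; the test split
    receives whatever remains, so the three sizes always sum to the input.
    """
    train = int(samples_per_class * train_frac)
    val = int(samples_per_class * val_frac)
    test = samples_per_class - train - val
    return train, val, test


train, val, test = split_sizes(300)
num_classes = 3  # negative, neutral, positive

print(f"Per class: train={train}, val={val}, test={test}")
print(f"Totals: train={train * num_classes}, val={val * num_classes}, test={test * num_classes}")
```

With the default cap of 300 samples per class this gives 210/45/45 per class, which is where the 630/135/135 totals printed by the main loop come from.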

In [7]:
def prepare_prompts(X_train, X_val, X_test, tokenizer):
    """Prepare training prompts"""
    EOS_TOKEN = tokenizer.eos_token
    label_mapping = {0: "negative", 1: "neutral", 2: "positive"}
    
    def generate_prompt(data_point):
        sentiment_text = label_mapping[data_point["sentiment"]]
        return f"""Analyze the sentiment of the news headline enclosed in square brackets, 
                determine if it is positive, neutral, or negative, and return the answer as 
                the corresponding sentiment label "positive" or "neutral" or "negative"

                [{data_point["text"]}] = {sentiment_text}
                """.strip() + EOS_TOKEN

    def generate_test_prompt(data_point):
        return f"""Analyze the sentiment of the news headline enclosed in square brackets, 
                determine if it is positive, neutral, or negative, and return the answer as 
                the corresponding sentiment label "positive" or "neutral" or "negative"

                [{data_point["text"]}] = 
                """.strip()
    
    # Generate prompts
    train_prompts = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
    val_prompts = pd.DataFrame(X_val.apply(generate_prompt, axis=1), columns=["text"])
    
    # Test prompts and true labels
    y_true = [label_mapping[label] for label in X_test['sentiment'].tolist()]
    test_prompts = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
    
    # Convert to HuggingFace datasets
    train_data = Dataset.from_pandas(train_prompts)
    eval_data = Dataset.from_pandas(val_prompts)
    
    return train_data, eval_data, test_prompts, y_true
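
Because the triple-quoted f-strings above are indented, the spaces at the start of each continuation line become part of the prompt text. A possible alternative, shown here only as a sketch, assembles the same wording without the embedded indentation. `INSTRUCTION` and `build_prompt` are hypothetical names, and the `<|end_of_text|>` string in the usage lines is an assumed stand-in for `tokenizer.eos_token`. Switching to this would change the exact prompt string, so results would not be directly comparable with the runs logged in this notebook.

```python
INSTRUCTION = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative"'
)


def build_prompt(text, sentiment=None, eos_token=""):
    """Build a test prompt (no label) or a training prompt (label plus EOS)."""
    prompt = f"{INSTRUCTION}\n\n[{text}] ="
    if sentiment is not None:
        # Training prompts carry the gold label followed by the EOS token.
        prompt += f" {sentiment}{eos_token}"
    return prompt


# Usage: one test prompt and one training prompt for the same headline.
test_prompt = build_prompt("Operating profit rose to EUR 13.1 mn")
train_prompt = build_prompt(
    "Operating profit rose to EUR 13.1 mn",
    sentiment="positive",
    eos_token="<|end_of_text|>",  # assumed EOS string, for illustration only
)
print(test_prompt)
print(train_prompt)
```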

In [8]:
def load_model_and_tokenizer():
    """Load fresh model and tokenizer"""
    model_name = "meta-llama/Meta-Llama-3-8B"
    
    compute_dtype = torch.float16
    
    # Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config, 
        token=True  # use the locally stored Hugging Face token (required for gated models)
    )
    
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    
    # Load tokenizer
    max_seq_length = 2048
    # model_max_length is the tokenizer argument that sets the maximum sequence length
    tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=max_seq_length, token=True)
    
    # Add padding token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    return model, tokenizer
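
To see roughly why 4-bit loading matters here, the weight storage can be estimated from the parameter count alone. The figure of about 8.03 billion parameters is an assumption used for illustration, and the estimate is a lower bound: it ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
# Approximate parameter count for Llama 3 8B (assumed for this estimate).
num_params = 8.03e9

BYTES_PER_PARAM_FP16 = 2.0   # 16 bits per weight
BYTES_PER_PARAM_NF4 = 0.5    # 4 bits per weight

GIB = 1024 ** 3

fp16_gib = num_params * BYTES_PER_PARAM_FP16 / GIB
nf4_gib = num_params * BYTES_PER_PARAM_NF4 / GIB

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB (lower bound)")
```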

In [9]:
def evaluate_model(y_true, y_pred, agreement_level):
    """Comprehensive evaluation function"""
    # Unparseable predictions ("none") are scored as neutral (class 1)
    mapping = {'positive': 2, 'neutral': 1, 'none': 1, 'negative': 0}
    
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    
    print(f"\n{'='*60}")
    print(f"RESULTS FOR {agreement_level.upper()}")
    print(f"{'='*60}")
    print(f'Overall Accuracy: {accuracy:.3f}')
    
    # Calculate per-class accuracy
    unique_labels = set(y_true_mapped)
    
    class_accuracies = {}
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) 
                         if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        label_name = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}[label]
        class_accuracies[label_name] = label_accuracy
        print(f'{label_name} Accuracy: {label_accuracy:.3f}')
        
    # Generate classification report; passing labels explicitly keeps the
    # report aligned with target_names even if a class is absent from the data
    report_kwargs = dict(
        y_true=y_true_mapped,
        y_pred=y_pred_mapped,
        labels=[0, 1, 2],
        target_names=['Negative', 'Neutral', 'Positive'],
        zero_division=0,
    )
    class_report = classification_report(**report_kwargs, output_dict=True)
    
    print('\nClassification Report:')
    print(classification_report(**report_kwargs))
    
    return {
        'agreement_level': agreement_level,
        'overall_accuracy': accuracy,
        'class_accuracies': class_accuracies,
        'classification_report': class_report,
        'test_samples': len(y_true)
    }
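
The per-class "accuracy" computed above only looks at samples whose true label is the class in question, which makes it the same quantity as per-class recall. A dependency-free sketch on toy data illustrates this; `per_class_accuracy` and the example labels are hypothetical.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {label: hits[label] / totals[label] for label in totals}


y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "none"]

# Mirror the notebook's choice of scoring unparseable outputs ("none") as neutral.
y_pred = ["neutral" if pred == "none" else pred for pred in y_pred]

scores = per_class_accuracy(y_true, y_pred)
print(scores)
```

Because "none" is folded into neutral, an unparseable answer counts as correct whenever the true label is neutral, which is worth keeping in mind when reading the neutral score.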

In [10]:
def predict_sentiment(test_prompts, model, tokenizer):
    """Prediction function"""
    y_pred = []
    for i in tqdm(range(len(test_prompts)), desc="Predicting"):
        prompt = test_prompts.iloc[i]["text"]
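        # Place the inputs on whichever device holds the model
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs, 
                max_new_tokens=1,  # the answer is read from a single generated token
                do_sample=False    # greedy decoding, so no temperature is needed
            )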
        
        result = tokenizer.decode(outputs[0])
        answer = result.split("=")[-1].lower().strip()
        
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    
    return y_pred
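
The answer-parsing step at the end of the loop can be exercised without a model. The sketch below isolates the same logic (take the text after the last `=`, lower-case it, then look for a label substring); `extract_label` is a hypothetical helper and the decoded strings are made-up examples.

```python
def extract_label(decoded):
    """Map decoded model output to a sentiment label, or "none" if no label is found."""
    answer = decoded.split("=")[-1].lower().strip()
    # Same precedence as the notebook: positive, then negative, then neutral.
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"


examples = [
    "[Operating profit rose] = positive",
    "[Sales = EUR 5 mn, down from 7 mn] = Negative<|end_of_text|>",
    "[The company is based in Helsinki] = the",
]
parsed = [extract_label(text) for text in examples]
print(parsed)
```

Splitting on the last `=` means an equals sign inside the headline itself does not disturb the parse, as the second example shows.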

In [11]:
def fine_tune_model(model, tokenizer, train_data, eval_data, agreement_level):
    """Fine-tune model for specific agreement level"""
    # LoRA configuration for Llama 3
    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    
    # Training arguments
    # Optimizer steps per epoch (per-device batch size 1 x 8 gradient-accumulation steps)
    training_steps = len(train_data) // 8
    eval_steps = max(training_steps // 10, 10)
    
    training_arguments = TrainingArguments(
        output_dir=f"logs_{agreement_level}",
        num_train_epochs=3,  # Reduced epochs for comparison
        gradient_checkpointing=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        optim="paged_adamw_32bit",
        save_steps=0,
        logging_steps=max(training_steps // 20, 5),
        learning_rate=2e-4,
        weight_decay=0.001,
        fp16=True,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=0.03,
        group_by_length=False,
        evaluation_strategy='steps',
        eval_steps=eval_steps,
        eval_accumulation_steps=1,
        lr_scheduler_type="cosine",
        report_to="tensorboard",
    )
    
    # Initialize trainer
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=eval_data,
        peft_config=peft_config,
        dataset_text_field="text",
        tokenizer=tokenizer,
        max_seq_length=2048,
        args=training_arguments,
        packing=False,
    )
    
    print(f"\nStarting fine-tuning for {agreement_level}...")
    print(f"Training samples: {len(train_data)}")
    print(f"Validation samples: {len(eval_data)}")
    
    # Start training
    trainer.train()
    
    # Save model
    model_save_path = f"../models/trained-llama3-{agreement_level}"
    trainer.model.save_pretrained(model_save_path)
    print(f"Model saved to: {model_save_path}")
    
    return trainer.model
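
For a sense of scale, the number of trainable parameters implied by the LoRA configuration above can be worked out by hand: each adapted linear layer of shape `in x out` gains `r * (in + out)` parameters. The layer dimensions below are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers) rather than read from the loaded model.

```python
RANK = 64          # r in the LoRA config above
LORA_ALPHA = 16    # lora_alpha in the LoRA config above

# Assumed Llama 3 8B dimensions (not read from the model in this sketch).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 dimensions per head
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(f"Trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters hold roughly 168 million trainable parameters, and the update is scaled by 0.25 because `lora_alpha` is smaller than `r`.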

## Main Comparison Loop

In [12]:
# Define agreement levels to test
agreement_levels = [
    "sentences_50agree",
    "sentences_66agree", 
    "sentences_75agree",
    "sentences_allagree"
]

# Store results
all_results = []
detailed_results = {}

print(f"Starting comparison across {len(agreement_levels)} agreement levels...")
print(f"Agreement levels: {agreement_levels}")

# Clear initial memory
clear_memory()

Starting comparison across 4 agreement levels...
Agreement levels: ['sentences_50agree', 'sentences_66agree', 'sentences_75agree', 'sentences_allagree']
Clearing memory...
GPU memory allocated: 0.00 GB
GPU memory reserved: 0.00 GB


In [13]:
for i, agreement_level in enumerate(agreement_levels):
    print(f"\n{'='*80}")
    print(f"PROCESSING {agreement_level.upper()} ({i+1}/{len(agreement_levels)})")
    print(f"{'='*80}")
    
    try:
        # Step 1: Prepare dataset
        df, sentiment_dist = prepare_dataset(agreement_level)
        
        # Step 2: Create splits
        X_train, X_val, X_test = create_balanced_splits(df)
        
        if len(X_train) == 0 or len(X_test) == 0:
            print(f"Skipping {agreement_level} - insufficient data")
            continue
        
        # Step 3: Load fresh model and tokenizer
        print("\nLoading fresh model and tokenizer...")
        model, tokenizer = load_model_and_tokenizer()
        
        # Step 4: Prepare prompts
        train_data, eval_data, test_prompts, y_true = prepare_prompts(X_train, X_val, X_test, tokenizer)
        
        # Step 5: Test baseline performance (quick test on subset)
        print("\nTesting baseline performance...")
        test_subset_size = min(20, len(test_prompts))  # Small subset for baseline
        test_subset = test_prompts.head(test_subset_size)
        true_subset = y_true[:test_subset_size]
        
        baseline_predictions = predict_sentiment(test_subset, model, tokenizer)
        baseline_accuracy = accuracy_score(
            np.vectorize(lambda x: {'positive': 2, 'neutral': 1, 'none': 1, 'negative': 0}.get(x, 1))(true_subset),
            np.vectorize(lambda x: {'positive': 2, 'neutral': 1, 'none': 1, 'negative': 0}.get(x, 1))(baseline_predictions)
        )
        print(f"Baseline accuracy on {test_subset_size} samples: {baseline_accuracy:.3f}")
        
        # Step 6: Fine-tune model
        fine_tuned_model = fine_tune_model(model, tokenizer, train_data, eval_data, agreement_level)
        
        # Step 7: Evaluate fine-tuned model
        print(f"\nEvaluating fine-tuned model on {len(test_prompts)} test samples...")
        final_predictions = predict_sentiment(test_prompts, fine_tuned_model, tokenizer)
        
        # Step 8: Calculate metrics
        results = evaluate_model(y_true, final_predictions, agreement_level)
        results['baseline_accuracy'] = baseline_accuracy
        results['dataset_size'] = len(df)
        results['train_size'] = len(train_data)
        results['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        # Store results
        all_results.append(results)
        detailed_results[agreement_level] = {
            'y_true': y_true,
            'y_pred': final_predictions,
            'test_data': X_test.copy()
        }
        
        print(f"\nCompleted {agreement_level}")
        print(f"Final accuracy: {results['overall_accuracy']:.3f}")
        print(f"Improvement over baseline: {results['overall_accuracy'] - baseline_accuracy:.3f}")
        
    except Exception as e:
        print(f"Error processing {agreement_level}: {str(e)}")
        continue
    
    finally:
        # Clean up memory after each model
        if 'model' in locals():
            del model
        if 'tokenizer' in locals():
            del tokenizer
        if 'fine_tuned_model' in locals():
            del fine_tuned_model
        if 'trainer' in locals():
            del trainer
        
        # Wait and clear memory between runs (except for the last one)
        if i < len(agreement_levels) - 1:
            wait_and_clear(60)

print(f"\nCompleted all {len(all_results)} agreement levels!")


PROCESSING SENTENCES_50AGREE (1/4)

Loading dataset: sentences_50agree...
Dataset shape: (4846, 2)
Sentiment distribution:
sentiment
1    2879
2    1363
0     604
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading fresh model and tokenizer...



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Testing baseline performance...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Baseline accuracy on 20 samples: 0.500




Starting fine-tuning for sentences_50agree...
Training samples: 630
Validation samples: 135



{'loss': 2.9091, 'grad_norm': 0.5394439101219177, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.7832, 'grad_norm': 0.9753520488739014, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}



{'eval_loss': 1.2524452209472656, 'eval_runtime': 49.154, 'eval_samples_per_second': 2.746, 'eval_steps_per_second': 0.346, 'epoch': 0.13}
{'loss': 1.142, 'grad_norm': 0.3727717697620392, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 1.0257, 'grad_norm': 0.5527883172035217, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}



{'eval_loss': 1.0498992204666138, 'eval_runtime': 45.9934, 'eval_samples_per_second': 2.935, 'eval_steps_per_second': 0.37, 'epoch': 0.25}
{'loss': 1.0193, 'grad_norm': 0.4222785234451294, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.9567, 'grad_norm': 0.32622280716896057, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}



{'eval_loss': 0.9793999791145325, 'eval_runtime': 46.9093, 'eval_samples_per_second': 2.878, 'eval_steps_per_second': 0.362, 'epoch': 0.38}
{'loss': 0.9352, 'grad_norm': 0.3624098002910614, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.9384, 'grad_norm': 0.27643492817878723, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}



{'eval_loss': 0.9562624096870422, 'eval_runtime': 47.1221, 'eval_samples_per_second': 2.865, 'eval_steps_per_second': 0.361, 'epoch': 0.51}
{'loss': 0.9114, 'grad_norm': 0.25675663352012634, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.791, 'grad_norm': 0.24408505856990814, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}



{'eval_loss': 0.9446445107460022, 'eval_runtime': 46.8777, 'eval_samples_per_second': 2.88, 'eval_steps_per_second': 0.363, 'epoch': 0.63}
{'loss': 0.886, 'grad_norm': 0.28039151430130005, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.9059, 'grad_norm': 0.32270902395248413, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}



{'eval_loss': 0.9350346326828003, 'eval_runtime': 47.7045, 'eval_samples_per_second': 2.83, 'eval_steps_per_second': 0.356, 'epoch': 0.76}
{'loss': 0.764, 'grad_norm': 0.24870052933692932, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.8961, 'grad_norm': 0.4486890435218811, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}



{'eval_loss': 0.9253931045532227, 'eval_runtime': 47.1801, 'eval_samples_per_second': 2.861, 'eval_steps_per_second': 0.36, 'epoch': 0.89}
{'loss': 0.8928, 'grad_norm': 0.3704843521118164, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7744, 'grad_norm': 0.24348542094230652, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}



{'eval_loss': 0.916245698928833, 'eval_runtime': 47.4302, 'eval_samples_per_second': 2.846, 'eval_steps_per_second': 0.358, 'epoch': 1.02}
{'loss': 0.6611, 'grad_norm': 0.2870008051395416, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.7261, 'grad_norm': 0.28376466035842896, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}



{'eval_loss': 0.9374059438705444, 'eval_runtime': 47.502, 'eval_samples_per_second': 2.842, 'eval_steps_per_second': 0.358, 'epoch': 1.14}
{'loss': 0.6997, 'grad_norm': 0.26925286650657654, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.6596, 'grad_norm': 0.22904078662395477, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}



{'eval_loss': 0.9309154748916626, 'eval_runtime': 47.5794, 'eval_samples_per_second': 2.837, 'eval_steps_per_second': 0.357, 'epoch': 1.27}
{'loss': 0.7472, 'grad_norm': 0.3242730498313904, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.8373, 'grad_norm': 0.4210909903049469, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}



{'eval_loss': 0.9469806551933289, 'eval_runtime': 46.642, 'eval_samples_per_second': 2.894, 'eval_steps_per_second': 0.364, 'epoch': 1.4}
{'loss': 0.6988, 'grad_norm': 0.39314207434654236, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.6876, 'grad_norm': 0.2516789734363556, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}



{'eval_loss': 0.9292834401130676, 'eval_runtime': 46.6709, 'eval_samples_per_second': 2.893, 'eval_steps_per_second': 0.364, 'epoch': 1.52}
{'loss': 0.7352, 'grad_norm': 0.3526500463485718, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.631, 'grad_norm': 0.3447439968585968, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}



{'eval_loss': 0.9360257983207703, 'eval_runtime': 47.9472, 'eval_samples_per_second': 2.816, 'eval_steps_per_second': 0.355, 'epoch': 1.65}
{'loss': 0.7201, 'grad_norm': 0.3587779700756073, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.6421, 'grad_norm': 0.2924353778362274, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}



{'eval_loss': 0.9335948824882507, 'eval_runtime': 47.6309, 'eval_samples_per_second': 2.834, 'eval_steps_per_second': 0.357, 'epoch': 1.78}
{'loss': 0.8031, 'grad_norm': 0.31765973567962646, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.698, 'grad_norm': 0.38972336053848267, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}



{'eval_loss': 0.9342652559280396, 'eval_runtime': 46.1723, 'eval_samples_per_second': 2.924, 'eval_steps_per_second': 0.368, 'epoch': 1.9}
{'loss': 0.6709, 'grad_norm': 0.35808658599853516, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.6912, 'grad_norm': 0.33790621161460876, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}



{'eval_loss': 0.9364230632781982, 'eval_runtime': 47.6642, 'eval_samples_per_second': 2.832, 'eval_steps_per_second': 0.357, 'epoch': 2.03}
{'loss': 0.5707, 'grad_norm': 0.31081220507621765, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.5257, 'grad_norm': 0.3687444031238556, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}



{'eval_loss': 0.9695270657539368, 'eval_runtime': 46.0496, 'eval_samples_per_second': 2.932, 'eval_steps_per_second': 0.369, 'epoch': 2.16}
{'loss': 0.5245, 'grad_norm': 0.3477259576320648, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.5052, 'grad_norm': 0.409900039434433, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}



{'eval_loss': 1.001775860786438, 'eval_runtime': 45.8078, 'eval_samples_per_second': 2.947, 'eval_steps_per_second': 0.371, 'epoch': 2.29}
{'loss': 0.5545, 'grad_norm': 0.5080394744873047, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.4804, 'grad_norm': 0.4507872760295868, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}



{'eval_loss': 1.0013407468795776, 'eval_runtime': 45.9395, 'eval_samples_per_second': 2.939, 'eval_steps_per_second': 0.37, 'epoch': 2.41}
{'loss': 0.4976, 'grad_norm': 0.4272882640361786, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.5167, 'grad_norm': 0.4582081735134125, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}



{'eval_loss': 1.00126051902771, 'eval_runtime': 46.3291, 'eval_samples_per_second': 2.914, 'eval_steps_per_second': 0.367, 'epoch': 2.54}
{'loss': 0.536, 'grad_norm': 0.6060853600502014, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.5516, 'grad_norm': 0.5654637217521667, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}



{'eval_loss': 0.997919499874115, 'eval_runtime': 45.9686, 'eval_samples_per_second': 2.937, 'eval_steps_per_second': 0.37, 'epoch': 2.67}
{'loss': 0.5292, 'grad_norm': 0.5464028716087341, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.5483, 'grad_norm': 0.44941914081573486, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}



{'eval_loss': 0.99681556224823, 'eval_runtime': 46.1454, 'eval_samples_per_second': 2.926, 'eval_steps_per_second': 0.368, 'epoch': 2.79}
{'loss': 0.5345, 'grad_norm': 0.4495753347873688, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.5099, 'grad_norm': 0.48617178201675415, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}



{'eval_loss': 0.9966216087341309, 'eval_runtime': 46.1799, 'eval_samples_per_second': 2.923, 'eval_steps_per_second': 0.368, 'epoch': 2.92}
{'train_runtime': 2465.3032, 'train_samples_per_second': 0.767, 'train_steps_per_second': 0.095, 'train_loss': 0.7819739845063951, 'epoch': 2.97}
Model saved to: ../models/trained-llama3-sentences_50agree
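
The step count and learning rates in the training log above can be reproduced from the training arguments. The sketch below assumes the scheduler applies a linear warmup followed by a half-cosine decay to zero, and that the warmup length is the warmup ratio times the total steps, rounded up; under those assumptions it matches the logged 234 steps and the first two logged learning rates. The helper `lr_at` is a hypothetical name.

```python
import math

# Values taken from the training arguments and the printed split sizes.
train_samples = 630
per_device_batch = 1
grad_accum = 8
epochs = 3
base_lr = 2e-4
warmup_ratio = 0.03

batches_per_epoch = math.ceil(train_samples / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum       # 78
total_steps = epochs * updates_per_epoch                  # 234
warmup_steps = math.ceil(total_steps * warmup_ratio)      # 8


def lr_at(step):
    """Learning rate after `step` optimizer updates (assumed warmup + cosine decay)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"total optimizer steps: {total_steps}")
print(f"lr after 5 steps:  {lr_at(5):.6f}")
print(f"lr after 10 steps: {lr_at(10):.10f}")
```

Logging every 5 steps and evaluating every 10 also explains the epoch stamps in the log: 5 updates of 8 samples each cover 40 of 630 samples, or about 0.06 of an epoch.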

Evaluating fine-tuned model on 135 test samples...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


RESULTS FOR SENTENCES_50AGREE
Overall Accuracy: 0.874
Negative Accuracy: 0.911
Neutral Accuracy: 0.844
Positive Accuracy: 0.867

Classification Report:
              precision    recall  f1-score   support

    Negative       0.93      0.91      0.92        45
     Neutral       0.83      0.84      0.84        45
    Positive       0.87      0.87      0.87        45

    accuracy                           0.87       135
   macro avg       0.87      0.87      0.87       135
weighted avg       0.87      0.87      0.87       135


Completed sentences_50agree
Final accuracy: 0.874
Improvement over baseline: 0.374
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.04 GB
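
In the sentences_50agree training log above, the evaluation loss is lowest around epoch 1.02 and drifts upward afterwards while the training loss keeps falling, a pattern that is often read as the onset of overfitting. The sketch below simply locates that minimum; it uses a subset of the logged `(epoch, eval_loss)` pairs, rounded to four decimals, and the variable names are illustrative.

```python
# (epoch, eval_loss) pairs copied from the sentences_50agree log, rounded.
eval_history = [
    (0.13, 1.2524),
    (0.51, 0.9563),
    (0.89, 0.9254),
    (1.02, 0.9162),
    (1.14, 0.9374),
    (1.52, 0.9293),
    (2.03, 0.9364),
    (2.29, 1.0018),
    (2.92, 0.9966),
]

# Checkpoint with the lowest evaluation loss.
best_epoch, best_loss = min(eval_history, key=lambda pair: pair[1])

# How much worse the final evaluation loss is than the best one.
final_epoch, final_loss = eval_history[-1]
degradation = final_loss - best_loss

print(f"best eval_loss {best_loss:.4f} at epoch {best_epoch}")
print(f"final eval_loss {final_loss:.4f} at epoch {final_epoch} (+{degradation:.4f})")
```

Since the notebook evaluates the weights from the end of training, a comparison across agreement levels may also reflect how far each run has drifted past its best evaluation loss.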

PROCESSING SENTENCES_66AGREE (2/4)

Loading dataset: sentences_66agree...
Dataset shape: (4217, 2)
Sentiment distribution:
sentiment
1    2535
2    1168
0     514
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Testing baseline performance...


Predicting:   0%|          | 0/20 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Baseline accuracy on 20 samples: 0.400


Map:   0%|          | 0/630 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]


Starting fine-tuning for sentences_66agree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.8304, 'grad_norm': 0.5446348190307617, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.8879, 'grad_norm': 0.8633529543876648, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 1.156339406967163, 'eval_runtime': 211.2112, 'eval_samples_per_second': 0.639, 'eval_steps_per_second': 0.08, 'epoch': 0.13}
{'loss': 1.1505, 'grad_norm': 1.007103443145752, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 0.9782, 'grad_norm': 0.5200950503349304, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


{'eval_loss': 0.9706398248672485, 'eval_runtime': 207.3175, 'eval_samples_per_second': 0.651, 'eval_steps_per_second': 0.082, 'epoch': 0.25}
{'loss': 0.9653, 'grad_norm': 0.29955658316612244, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.8947, 'grad_norm': 0.3100181221961975, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


{'eval_loss': 0.9062876105308533, 'eval_runtime': 208.7365, 'eval_samples_per_second': 0.647, 'eval_steps_per_second': 0.081, 'epoch': 0.38}
{'loss': 0.8886, 'grad_norm': 0.3610437214374542, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.8745, 'grad_norm': 0.30494892597198486, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


{'eval_loss': 0.8732984662055969, 'eval_runtime': 206.9268, 'eval_samples_per_second': 0.652, 'eval_steps_per_second': 0.082, 'epoch': 0.51}
{'loss': 0.8472, 'grad_norm': 0.3035907447338104, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.8987, 'grad_norm': 0.3648854196071625, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


{'eval_loss': 0.8642804622650146, 'eval_runtime': 206.648, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 0.63}
{'loss': 0.9902, 'grad_norm': 0.313923180103302, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8262, 'grad_norm': 0.31440868973731995, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


{'eval_loss': 0.8562947511672974, 'eval_runtime': 203.81, 'eval_samples_per_second': 0.662, 'eval_steps_per_second': 0.083, 'epoch': 0.76}
{'loss': 0.8269, 'grad_norm': 0.26375895738601685, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.888, 'grad_norm': 0.401833176612854, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


{'eval_loss': 0.8436304330825806, 'eval_runtime': 207.2318, 'eval_samples_per_second': 0.651, 'eval_steps_per_second': 0.082, 'epoch': 0.89}
{'loss': 0.8405, 'grad_norm': 0.30799952149391174, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7823, 'grad_norm': 0.21458658576011658, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


{'eval_loss': 0.8395868539810181, 'eval_runtime': 207.1333, 'eval_samples_per_second': 0.652, 'eval_steps_per_second': 0.082, 'epoch': 1.02}
{'loss': 0.7198, 'grad_norm': 0.22920845448970795, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.683, 'grad_norm': 0.2732987105846405, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


{'eval_loss': 0.837647020816803, 'eval_runtime': 206.6152, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 1.14}
{'loss': 0.7323, 'grad_norm': 0.3369291126728058, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.7093, 'grad_norm': 0.33015990257263184, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


{'eval_loss': 0.837715744972229, 'eval_runtime': 206.2022, 'eval_samples_per_second': 0.655, 'eval_steps_per_second': 0.082, 'epoch': 1.27}
{'loss': 0.6911, 'grad_norm': 0.34088942408561707, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.776, 'grad_norm': 0.3414470851421356, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


{'eval_loss': 0.8460949063301086, 'eval_runtime': 206.3149, 'eval_samples_per_second': 0.654, 'eval_steps_per_second': 0.082, 'epoch': 1.4}
{'loss': 0.6992, 'grad_norm': 0.34536927938461304, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.6639, 'grad_norm': 0.38471129536628723, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


{'eval_loss': 0.8463066220283508, 'eval_runtime': 206.579, 'eval_samples_per_second': 0.654, 'eval_steps_per_second': 0.082, 'epoch': 1.52}
{'loss': 0.6545, 'grad_norm': 0.36988595128059387, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.6964, 'grad_norm': 0.34648168087005615, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


{'eval_loss': 0.841667115688324, 'eval_runtime': 208.1792, 'eval_samples_per_second': 0.648, 'eval_steps_per_second': 0.082, 'epoch': 1.65}
{'loss': 0.7604, 'grad_norm': 0.4095156192779541, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.7227, 'grad_norm': 0.3905446529388428, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


{'eval_loss': 0.8395272493362427, 'eval_runtime': 205.3851, 'eval_samples_per_second': 0.657, 'eval_steps_per_second': 0.083, 'epoch': 1.78}
{'loss': 0.6974, 'grad_norm': 0.34515321254730225, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.7026, 'grad_norm': 0.369934618473053, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


{'eval_loss': 0.8392063975334167, 'eval_runtime': 206.7554, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 1.9}
{'loss': 0.7357, 'grad_norm': 0.35470104217529297, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.6057, 'grad_norm': 0.27844318747520447, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


{'eval_loss': 0.8451243042945862, 'eval_runtime': 206.7452, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 2.03}
{'loss': 0.5628, 'grad_norm': 0.35753053426742554, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.585, 'grad_norm': 0.4752093553543091, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


{'eval_loss': 0.8696701526641846, 'eval_runtime': 206.6026, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 2.16}
{'loss': 0.5237, 'grad_norm': 0.4945109188556671, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.5681, 'grad_norm': 0.43065571784973145, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


{'eval_loss': 0.8923985362052917, 'eval_runtime': 206.3531, 'eval_samples_per_second': 0.654, 'eval_steps_per_second': 0.082, 'epoch': 2.29}
{'loss': 0.5092, 'grad_norm': 0.49138087034225464, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.4653, 'grad_norm': 0.40004271268844604, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


{'eval_loss': 0.8872116208076477, 'eval_runtime': 207.9782, 'eval_samples_per_second': 0.649, 'eval_steps_per_second': 0.082, 'epoch': 2.41}
{'loss': 0.536, 'grad_norm': 0.441256046295166, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.4915, 'grad_norm': 0.43019312620162964, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


{'eval_loss': 0.8817506432533264, 'eval_runtime': 210.2621, 'eval_samples_per_second': 0.642, 'eval_steps_per_second': 0.081, 'epoch': 2.54}
{'loss': 0.5002, 'grad_norm': 0.47106724977493286, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.4908, 'grad_norm': 0.5028514266014099, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


{'eval_loss': 0.8831735253334045, 'eval_runtime': 210.3494, 'eval_samples_per_second': 0.642, 'eval_steps_per_second': 0.081, 'epoch': 2.67}
{'loss': 0.5375, 'grad_norm': 0.49816927313804626, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.4582, 'grad_norm': 0.45445743203163147, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


{'eval_loss': 0.8836924433708191, 'eval_runtime': 208.0274, 'eval_samples_per_second': 0.649, 'eval_steps_per_second': 0.082, 'epoch': 2.79}
{'loss': 0.501, 'grad_norm': 0.44629451632499695, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.5477, 'grad_norm': 0.4729022979736328, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


{'eval_loss': 0.8837672472000122, 'eval_runtime': 206.7775, 'eval_samples_per_second': 0.653, 'eval_steps_per_second': 0.082, 'epoch': 2.92}
{'train_runtime': 7362.4668, 'train_samples_per_second': 0.257, 'train_steps_per_second': 0.032, 'train_loss': 0.7756713343481733, 'epoch': 2.97}
Model saved to: ../models/trained-llama3-sentences_66agree

Evaluating fine-tuned model on 135 test samples...


Predicting:   0%|          | 0/135 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


RESULTS FOR SENTENCES_66AGREE
Overall Accuracy: 0.911
Negative Accuracy: 0.933
Neutral Accuracy: 0.956
Positive Accuracy: 0.844

Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      0.93      0.97        45
     Neutral       0.81      0.96      0.88        45
    Positive       0.95      0.84      0.89        45

    accuracy                           0.91       135
   macro avg       0.92      0.91      0.91       135
weighted avg       0.92      0.91      0.91       135


Completed sentences_66agree
Final accuracy: 0.911
Improvement over baseline: 0.511
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.04 GB

PROCESSING SENTENCES_75AGREE (3/4)

Loading dataset: sentences_75agree...
Dataset shape: (3453, 2)
Sentiment distribution:
sentiment
1    2146
2     887
0     420
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Testing baseline performance...


Predicting:   0%|          | 0/20 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Baseline accuracy on 20 samples: 0.450


Map:   0%|          | 0/630 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]


Starting fine-tuning for sentences_75agree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.8732, 'grad_norm': 0.5393697619438171, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.8321, 'grad_norm': 0.8845611214637756, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 1.1338236331939697, 'eval_runtime': 219.7067, 'eval_samples_per_second': 0.614, 'eval_steps_per_second': 0.077, 'epoch': 0.13}
{'loss': 1.091, 'grad_norm': 0.4055427610874176, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 1.0231, 'grad_norm': 0.7259054780006409, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


{'eval_loss': 0.9373804330825806, 'eval_runtime': 217.0687, 'eval_samples_per_second': 0.622, 'eval_steps_per_second': 0.078, 'epoch': 0.25}
{'loss': 0.9119, 'grad_norm': 0.41434502601623535, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.9088, 'grad_norm': 0.3927311897277832, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


{'eval_loss': 0.8770038485527039, 'eval_runtime': 216.4366, 'eval_samples_per_second': 0.624, 'eval_steps_per_second': 0.079, 'epoch': 0.38}
{'loss': 0.8343, 'grad_norm': 0.32796576619148254, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.8306, 'grad_norm': 0.46996957063674927, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


{'eval_loss': 0.8681343197822571, 'eval_runtime': 213.6059, 'eval_samples_per_second': 0.632, 'eval_steps_per_second': 0.08, 'epoch': 0.51}
{'loss': 0.9763, 'grad_norm': 0.35498496890068054, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.899, 'grad_norm': 0.3547478914260864, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


{'eval_loss': 0.8263314962387085, 'eval_runtime': 222.3409, 'eval_samples_per_second': 0.607, 'eval_steps_per_second': 0.076, 'epoch': 0.63}
{'loss': 0.7724, 'grad_norm': 0.26523557305336, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8923, 'grad_norm': 0.3638657331466675, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


{'eval_loss': 0.8185635209083557, 'eval_runtime': 218.8139, 'eval_samples_per_second': 0.617, 'eval_steps_per_second': 0.078, 'epoch': 0.76}
{'loss': 0.8327, 'grad_norm': 0.2931210994720459, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.8642, 'grad_norm': 0.24488259851932526, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


{'eval_loss': 0.8090540170669556, 'eval_runtime': 215.1266, 'eval_samples_per_second': 0.628, 'eval_steps_per_second': 0.079, 'epoch': 0.89}
{'loss': 0.777, 'grad_norm': 0.24602742493152618, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7644, 'grad_norm': 0.3128562271595001, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


{'eval_loss': 0.8089891076087952, 'eval_runtime': 216.6737, 'eval_samples_per_second': 0.623, 'eval_steps_per_second': 0.078, 'epoch': 1.02}
{'loss': 0.7298, 'grad_norm': 0.23146384954452515, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.7617, 'grad_norm': 0.2591002285480499, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


{'eval_loss': 0.8133487105369568, 'eval_runtime': 221.1349, 'eval_samples_per_second': 0.61, 'eval_steps_per_second': 0.077, 'epoch': 1.14}
{'loss': 0.7676, 'grad_norm': 0.28644463419914246, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.7226, 'grad_norm': 0.2920090854167938, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


{'eval_loss': 0.8161007165908813, 'eval_runtime': 216.4418, 'eval_samples_per_second': 0.624, 'eval_steps_per_second': 0.079, 'epoch': 1.27}
{'loss': 0.6878, 'grad_norm': 0.2940739095211029, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.6505, 'grad_norm': 0.3880539536476135, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


{'eval_loss': 0.8204386234283447, 'eval_runtime': 217.1192, 'eval_samples_per_second': 0.622, 'eval_steps_per_second': 0.078, 'epoch': 1.4}
{'loss': 0.704, 'grad_norm': 0.30593281984329224, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.6447, 'grad_norm': 0.2970358431339264, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


{'eval_loss': 0.8163177967071533, 'eval_runtime': 214.8824, 'eval_samples_per_second': 0.628, 'eval_steps_per_second': 0.079, 'epoch': 1.52}
{'loss': 0.6117, 'grad_norm': 0.3853822946548462, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.6382, 'grad_norm': 0.35897135734558105, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


{'eval_loss': 0.8250334858894348, 'eval_runtime': 216.757, 'eval_samples_per_second': 0.623, 'eval_steps_per_second': 0.078, 'epoch': 1.65}
{'loss': 0.7086, 'grad_norm': 0.3674660921096802, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.6403, 'grad_norm': 0.3801726698875427, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


{'eval_loss': 0.8075795769691467, 'eval_runtime': 221.2036, 'eval_samples_per_second': 0.61, 'eval_steps_per_second': 0.077, 'epoch': 1.78}
{'loss': 0.7855, 'grad_norm': 0.3012968599796295, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.6695, 'grad_norm': 0.31696459650993347, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


{'eval_loss': 0.8047964572906494, 'eval_runtime': 216.5422, 'eval_samples_per_second': 0.623, 'eval_steps_per_second': 0.079, 'epoch': 1.9}
{'loss': 0.7236, 'grad_norm': 0.3670088052749634, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.5816, 'grad_norm': 0.2919301986694336, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


{'eval_loss': 0.8079885840415955, 'eval_runtime': 218.2693, 'eval_samples_per_second': 0.619, 'eval_steps_per_second': 0.078, 'epoch': 2.03}
{'loss': 0.5719, 'grad_norm': 0.28881630301475525, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.5246, 'grad_norm': 0.4111449420452118, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


{'eval_loss': 0.8460132479667664, 'eval_runtime': 216.1742, 'eval_samples_per_second': 0.624, 'eval_steps_per_second': 0.079, 'epoch': 2.16}
{'loss': 0.4804, 'grad_norm': 0.4492410123348236, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.4856, 'grad_norm': 0.4509808123111725, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


{'eval_loss': 0.8626189827919006, 'eval_runtime': 216.0102, 'eval_samples_per_second': 0.625, 'eval_steps_per_second': 0.079, 'epoch': 2.29}
{'loss': 0.567, 'grad_norm': 0.41515856981277466, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.5049, 'grad_norm': 0.36313995718955994, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


{'eval_loss': 0.8476336002349854, 'eval_runtime': 221.9463, 'eval_samples_per_second': 0.608, 'eval_steps_per_second': 0.077, 'epoch': 2.41}
{'loss': 0.4733, 'grad_norm': 0.4357021450996399, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.4651, 'grad_norm': 0.4105619192123413, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


{'eval_loss': 0.8472341895103455, 'eval_runtime': 222.0203, 'eval_samples_per_second': 0.608, 'eval_steps_per_second': 0.077, 'epoch': 2.54}
{'loss': 0.4754, 'grad_norm': 0.39201226830482483, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.4654, 'grad_norm': 0.3738134503364563, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


{'eval_loss': 0.8522493839263916, 'eval_runtime': 223.9052, 'eval_samples_per_second': 0.603, 'eval_steps_per_second': 0.076, 'epoch': 2.67}
{'loss': 0.4862, 'grad_norm': 0.45464855432510376, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.5554, 'grad_norm': 0.4134228825569153, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


{'eval_loss': 0.8541361689567566, 'eval_runtime': 219.8418, 'eval_samples_per_second': 0.614, 'eval_steps_per_second': 0.077, 'epoch': 2.79}
{'loss': 0.4825, 'grad_norm': 0.4669239819049835, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.5167, 'grad_norm': 0.5111904740333557, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


{'eval_loss': 0.8541510701179504, 'eval_runtime': 219.019, 'eval_samples_per_second': 0.616, 'eval_steps_per_second': 0.078, 'epoch': 2.92}
{'train_runtime': 8013.7805, 'train_samples_per_second': 0.236, 'train_steps_per_second': 0.029, 'train_loss': 0.7601264531795795, 'epoch': 2.97}
Model saved to: ../models/trained-llama3-sentences_75agree

Evaluating fine-tuned model on 135 test samples...


Predicting:   0%|          | 0/135 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


RESULTS FOR SENTENCES_75AGREE
Overall Accuracy: 0.963
Negative Accuracy: 0.978
Neutral Accuracy: 0.933
Positive Accuracy: 0.978

Classification Report:
              precision    recall  f1-score   support

    Negative       0.96      0.98      0.97        45
     Neutral       0.95      0.93      0.94        45
    Positive       0.98      0.98      0.98        45

    accuracy                           0.96       135
   macro avg       0.96      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135


Completed sentences_75agree
Final accuracy: 0.963
Improvement over baseline: 0.513
Waiting 60 seconds...
Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.04 GB

PROCESSING SENTENCES_ALLAGREE (4/4)

Loading dataset: sentences_allagree...
Dataset shape: (2264, 2)
Sentiment distribution:
sentiment
1    1391
2     570
0     303
Name: count, dtype: int64
Split sizes per class: Train=210, Val=45, Test=45
Final splits: Train=630, Val=135, Test=135

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Testing baseline performance...


Predicting:   0%|          | 0/20 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

Baseline accuracy on 20 samples: 0.500


Map:   0%|          | 0/630 [00:00<?, ? examples/s]

Map:   0%|          | 0/135 [00:00<?, ? examples/s]


Starting fine-tuning for sentences_allagree...
Training samples: 630
Validation samples: 135


  0%|          | 0/234 [00:00<?, ?it/s]

{'loss': 2.879, 'grad_norm': 0.5210840702056885, 'learning_rate': 0.000125, 'epoch': 0.06}
{'loss': 1.7489, 'grad_norm': 0.9715730547904968, 'learning_rate': 0.00019996135574945544, 'epoch': 0.13}


  0%|          | 0/17 [00:00<?, ?it/s]

{'eval_loss': 1.0943827629089355, 'eval_runtime': 174.5634, 'eval_samples_per_second': 0.773, 'eval_steps_per_second': 0.097, 'epoch': 0.13}
{'loss': 0.9792, 'grad_norm': 0.5262844562530518, 'learning_rate': 0.00019952695086820975, 'epoch': 0.19}
{'loss': 0.9663, 'grad_norm': 0.5234163403511047, 'learning_rate': 0.00019861194048993863, 'epoch': 0.25}


{'eval_loss': 0.9017727375030518, 'eval_runtime': 155.7085, 'eval_samples_per_second': 0.867, 'eval_steps_per_second': 0.109, 'epoch': 0.25}
{'loss': 0.918, 'grad_norm': 0.3805803954601288, 'learning_rate': 0.00019722074310645553, 'epoch': 0.32}
{'loss': 0.8599, 'grad_norm': 0.3570477366447449, 'learning_rate': 0.00019536007666806556, 'epoch': 0.38}


{'eval_loss': 0.8367169499397278, 'eval_runtime': 157.6639, 'eval_samples_per_second': 0.856, 'eval_steps_per_second': 0.108, 'epoch': 0.38}
{'loss': 0.8566, 'grad_norm': 0.2690417766571045, 'learning_rate': 0.00019303892614326836, 'epoch': 0.44}
{'loss': 0.7909, 'grad_norm': 0.26692622900009155, 'learning_rate': 0.00019026850013126157, 'epoch': 0.51}


{'eval_loss': 0.8157224059104919, 'eval_runtime': 281.9759, 'eval_samples_per_second': 0.479, 'eval_steps_per_second': 0.06, 'epoch': 0.51}
{'loss': 0.7339, 'grad_norm': 0.32899463176727295, 'learning_rate': 0.00018706217673675811, 'epoch': 0.57}
{'loss': 0.7448, 'grad_norm': 0.23386503756046295, 'learning_rate': 0.00018343543896848273, 'epoch': 0.63}


{'eval_loss': 0.8091524839401245, 'eval_runtime': 333.6511, 'eval_samples_per_second': 0.405, 'eval_steps_per_second': 0.051, 'epoch': 0.63}
{'loss': 0.8291, 'grad_norm': 0.2396279275417328, 'learning_rate': 0.00017940579997330165, 'epoch': 0.7}
{'loss': 0.8183, 'grad_norm': 0.28816738724708557, 'learning_rate': 0.00017499271846702213, 'epoch': 0.76}


{'eval_loss': 0.7917585372924805, 'eval_runtime': 330.8633, 'eval_samples_per_second': 0.408, 'eval_steps_per_second': 0.051, 'epoch': 0.76}
{'loss': 0.7838, 'grad_norm': 0.3204214572906494, 'learning_rate': 0.0001702175047702382, 'epoch': 0.83}
{'loss': 0.7703, 'grad_norm': 0.3157523572444916, 'learning_rate': 0.00016510321790296525, 'epoch': 0.89}


{'eval_loss': 0.7821682691574097, 'eval_runtime': 329.8693, 'eval_samples_per_second': 0.409, 'eval_steps_per_second': 0.052, 'epoch': 0.89}
{'loss': 0.8344, 'grad_norm': 0.2941927909851074, 'learning_rate': 0.00015967455423498387, 'epoch': 0.95}
{'loss': 0.7386, 'grad_norm': 0.19458281993865967, 'learning_rate': 0.00015395772822958845, 'epoch': 1.02}


{'eval_loss': 0.7822928428649902, 'eval_runtime': 331.1346, 'eval_samples_per_second': 0.408, 'eval_steps_per_second': 0.051, 'epoch': 1.02}
{'loss': 0.6381, 'grad_norm': 0.2493957132101059, 'learning_rate': 0.00014798034585661695, 'epoch': 1.08}
{'loss': 0.6227, 'grad_norm': 0.3655543923377991, 'learning_rate': 0.00014177127128603745, 'epoch': 1.14}


{'eval_loss': 0.7965009808540344, 'eval_runtime': 320.855, 'eval_samples_per_second': 0.421, 'eval_steps_per_second': 0.053, 'epoch': 1.14}
{'loss': 0.621, 'grad_norm': 0.26448148488998413, 'learning_rate': 0.00013536048750581494, 'epoch': 1.21}
{'loss': 0.6734, 'grad_norm': 0.2443794459104538, 'learning_rate': 0.00012877895153711935, 'epoch': 1.27}


{'eval_loss': 0.7839676141738892, 'eval_runtime': 335.1248, 'eval_samples_per_second': 0.403, 'eval_steps_per_second': 0.051, 'epoch': 1.27}
{'loss': 0.6927, 'grad_norm': 0.5456637144088745, 'learning_rate': 0.0001220584449460274, 'epoch': 1.33}
{'loss': 0.6791, 'grad_norm': 0.32138553261756897, 'learning_rate': 0.0001152314203735805, 'epoch': 1.4}


{'eval_loss': 0.7837921380996704, 'eval_runtime': 343.3427, 'eval_samples_per_second': 0.393, 'eval_steps_per_second': 0.05, 'epoch': 1.4}
{'loss': 0.6793, 'grad_norm': 0.2975497841835022, 'learning_rate': 0.00010833084482529048, 'epoch': 1.46}
{'loss': 0.5684, 'grad_norm': 0.30668407678604126, 'learning_rate': 0.00010139004047683151, 'epoch': 1.52}


{'eval_loss': 0.7946357727050781, 'eval_runtime': 356.7692, 'eval_samples_per_second': 0.378, 'eval_steps_per_second': 0.048, 'epoch': 1.52}
{'loss': 0.6218, 'grad_norm': 0.3050818145275116, 'learning_rate': 9.444252376465171e-05, 'epoch': 1.59}
{'loss': 0.6844, 'grad_norm': 0.2932189106941223, 'learning_rate': 8.752184353851916e-05, 'epoch': 1.65}


{'eval_loss': 0.7911360859870911, 'eval_runtime': 346.2204, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 1.65}
{'loss': 0.5891, 'grad_norm': 0.29990899562835693, 'learning_rate': 8.066141905754723e-05, 'epoch': 1.71}
{'loss': 0.6925, 'grad_norm': 0.5221179127693176, 'learning_rate': 7.389437861200024e-05, 'epoch': 1.78}


{'eval_loss': 0.7838712334632874, 'eval_runtime': 346.5302, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 1.78}
{'loss': 0.6611, 'grad_norm': 0.26235347986221313, 'learning_rate': 6.725339955015777e-05, 'epoch': 1.84}
{'loss': 0.5832, 'grad_norm': 0.3248666226863861, 'learning_rate': 6.0770550482731924e-05, 'epoch': 1.9}


{'eval_loss': 0.7827178835868835, 'eval_runtime': 345.1218, 'eval_samples_per_second': 0.391, 'eval_steps_per_second': 0.049, 'epoch': 1.9}
{'loss': 0.5858, 'grad_norm': 0.37818285822868347, 'learning_rate': 5.447713642681612e-05, 'epoch': 1.97}
{'loss': 0.5354, 'grad_norm': 0.27961787581443787, 'learning_rate': 4.840354763714991e-05, 'epoch': 2.03}


{'eval_loss': 0.7843722701072693, 'eval_runtime': 344.3247, 'eval_samples_per_second': 0.392, 'eval_steps_per_second': 0.049, 'epoch': 2.03}
{'loss': 0.5218, 'grad_norm': 0.292674720287323, 'learning_rate': 4.257911285467754e-05, 'epoch': 2.1}
{'loss': 0.4874, 'grad_norm': 0.35253503918647766, 'learning_rate': 3.7031957681048604e-05, 'epoch': 2.16}


{'eval_loss': 0.8154512643814087, 'eval_runtime': 346.0057, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 2.16}
{'loss': 0.4377, 'grad_norm': 0.43229809403419495, 'learning_rate': 3.178886876295578e-05, 'epoch': 2.22}
{'loss': 0.4937, 'grad_norm': 0.42884498834609985, 'learning_rate': 2.6875164442149147e-05, 'epoch': 2.29}


{'eval_loss': 0.8285250663757324, 'eval_runtime': 345.8385, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 2.29}
{'loss': 0.4929, 'grad_norm': 0.4064130187034607, 'learning_rate': 2.2314572495745746e-05, 'epoch': 2.35}
{'loss': 0.4901, 'grad_norm': 0.47014114260673523, 'learning_rate': 1.8129115557213262e-05, 'epoch': 2.41}


{'eval_loss': 0.8230857849121094, 'eval_runtime': 346.2235, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 2.41}
{'loss': 0.497, 'grad_norm': 0.4197785556316376, 'learning_rate': 1.433900477131882e-05, 'epoch': 2.48}
{'loss': 0.4396, 'grad_norm': 0.44456347823143005, 'learning_rate': 1.0962542196571634e-05, 'epoch': 2.54}


{'eval_loss': 0.8221232891082764, 'eval_runtime': 344.0138, 'eval_samples_per_second': 0.392, 'eval_steps_per_second': 0.049, 'epoch': 2.54}
{'loss': 0.4602, 'grad_norm': 0.3751881420612335, 'learning_rate': 8.016032426448817e-06, 'epoch': 2.6}
{'loss': 0.4307, 'grad_norm': 0.45388710498809814, 'learning_rate': 5.5137038561761115e-06, 'epoch': 2.67}


{'eval_loss': 0.8235680460929871, 'eval_runtime': 346.0437, 'eval_samples_per_second': 0.39, 'eval_steps_per_second': 0.049, 'epoch': 2.67}
{'loss': 0.4622, 'grad_norm': 0.4631974995136261, 'learning_rate': 3.467639975257997e-06, 'epoch': 2.73}
{'loss': 0.4497, 'grad_norm': 0.4747866690158844, 'learning_rate': 1.88772101753929e-06, 'epoch': 2.79}


{'eval_loss': 0.8236562609672546, 'eval_runtime': 345.4396, 'eval_samples_per_second': 0.391, 'eval_steps_per_second': 0.049, 'epoch': 2.79}
{'loss': 0.4795, 'grad_norm': 0.44055768847465515, 'learning_rate': 7.815762505632096e-07, 'epoch': 2.86}
{'loss': 0.4512, 'grad_norm': 0.41676968336105347, 'learning_rate': 1.545471346164007e-07, 'epoch': 2.92}


{'eval_loss': 0.8235984444618225, 'eval_runtime': 344.2162, 'eval_samples_per_second': 0.392, 'eval_steps_per_second': 0.049, 'epoch': 2.92}
{'train_runtime': 17526.6598, 'train_samples_per_second': 0.108, 'train_steps_per_second': 0.013, 'train_loss': 0.7132101629534339, 'epoch': 2.97}
Model saved to: ../models/trained-llama3-sentences_allagree

Evaluating fine-tuned model on 135 test samples...


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


RESULTS FOR SENTENCES_ALLAGREE
Overall Accuracy: 0.993
Negative Accuracy: 1.000
Neutral Accuracy: 0.978
Positive Accuracy: 1.000

Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00        45
     Neutral       1.00      0.98      0.99        45
    Positive       0.98      1.00      0.99        45

    accuracy                           0.99       135
   macro avg       0.99      0.99      0.99       135
weighted avg       0.99      0.99      0.99       135


Completed sentences_allagree
Final accuracy: 0.993
Improvement over baseline: 0.493

Completed all 4 agreement levels!
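
In the sentences_allagree log above, the evaluation loss bottoms out near the end of the second epoch and climbs during the third, while the training loss keeps falling. The sketch below shows one way to locate the lowest-loss evaluation point from logged entries. The list is copied by hand from the log lines shown above (epoch 1.14 onward, rounded to four decimals), and `best_eval_entry` is a hypothetical helper, not part of the notebook.

```python
# Evaluation entries copied from the sentences_allagree log above
# (epoch 1.14 onward; eval_loss rounded to four decimals).
eval_log = [
    {"epoch": 1.14, "eval_loss": 0.7965},
    {"epoch": 1.27, "eval_loss": 0.7840},
    {"epoch": 1.40, "eval_loss": 0.7838},
    {"epoch": 1.52, "eval_loss": 0.7946},
    {"epoch": 1.65, "eval_loss": 0.7911},
    {"epoch": 1.78, "eval_loss": 0.7839},
    {"epoch": 1.90, "eval_loss": 0.7827},
    {"epoch": 2.03, "eval_loss": 0.7844},
    {"epoch": 2.16, "eval_loss": 0.8155},
    {"epoch": 2.29, "eval_loss": 0.8285},
    {"epoch": 2.41, "eval_loss": 0.8231},
    {"epoch": 2.54, "eval_loss": 0.8221},
    {"epoch": 2.67, "eval_loss": 0.8236},
    {"epoch": 2.79, "eval_loss": 0.8237},
    {"epoch": 2.92, "eval_loss": 0.8236},
]


def best_eval_entry(entries):
    """Return the logged entry with the lowest eval_loss (hypothetical helper)."""
    return min(entries, key=lambda entry: entry["eval_loss"])


best = best_eval_entry(eval_log)
final = eval_log[-1]

print(f"Lowest eval_loss: {best['eval_loss']:.4f} at epoch {best['epoch']}")
print(f"Final eval_loss:  {final['eval_loss']:.4f} at epoch {final['epoch']}")
print(f"Increase after the best point: {final['eval_loss'] - best['eval_loss']:.4f}")
```

If the saved adapter comes from the last step rather than the lowest-loss step, options such as `load_best_model_at_end` in `TrainingArguments` may be worth considering. Whether that applies here depends on the training configuration defined earlier in the notebook.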





## Results Analysis and Comparison

In [14]:
# Create comprehensive results DataFrame
if all_results:
    comparison_df = pd.DataFrame([
        {
            'Agreement_Level': result['agreement_level'],
            'Dataset_Size': result['dataset_size'],
            'Train_Size': result['train_size'],
            'Test_Size': result['test_samples'],
            'Baseline_Accuracy': result['baseline_accuracy'],
            'Final_Accuracy': result['overall_accuracy'],
            'Improvement': result['overall_accuracy'] - result['baseline_accuracy'],
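            # Relative improvement; NaN when the baseline accuracy is 0 to avoid division by zero
            'Improvement_Percent': (((result['overall_accuracy'] - result['baseline_accuracy']) / result['baseline_accuracy']) * 100) if result['baseline_accuracy'] else float('nan'),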
            'Timestamp': result['timestamp']
        }
        for result in all_results
    ])
    
    # Sort by agreement level for better display
    level_order = ['sentences_50agree', 'sentences_66agree', 'sentences_75agree', 'sentences_allagree']
    comparison_df['Level_Order'] = comparison_df['Agreement_Level'].apply(lambda x: level_order.index(x) if x in level_order else 999)
    comparison_df = comparison_df.sort_values('Level_Order').drop('Level_Order', axis=1)
    
    print("\n" + "="*100)
    print("COMPREHENSIVE RESULTS COMPARISON")
    print("="*100)
    
    # Display results table
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    print(comparison_df.round(3))
    
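    # Save results (create the output directory first if it does not exist)
    results_filename = "../results/llama3_agreement_levels_comparison.csv"
    os.makedirs(os.path.dirname(results_filename), exist_ok=True)
    comparison_df.to_csv(results_filename, index=False)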
    print(f"\nResults saved to: {results_filename}")
    
else:
    print("No results to display - all experiments failed")


COMPREHENSIVE RESULTS COMPARISON
      Agreement_Level  Dataset_Size  Train_Size  Test_Size  Baseline_Accuracy  \
0   sentences_50agree          4846         630        135               0.50   
1   sentences_66agree          4217         630        135               0.40   
2   sentences_75agree          3453         630        135               0.45   
3  sentences_allagree          2264         630        135               0.50   

   Final_Accuracy  Improvement  Improvement_Percent            Timestamp  
0           0.874        0.374               74.815  2025-06-29 17:55:52  
1           0.911        0.511              127.778  2025-06-29 20:04:12  
2           0.963        0.513              113.992  2025-06-29 22:25:33  
3           0.993        0.493               98.519  2025-06-30 03:32:31  

Results saved to: ../results/llama3_agreement_levels_comparison.csv
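
Each final accuracy in the table above is measured on 135 test sentences, so sampling uncertainty matters when comparing levels. The sketch below computes a 95% Wilson score interval for each accuracy. The correct-prediction counts are an assumption, reconstructed by rounding accuracy × 135. `wilson_interval` is a hypothetical helper, and `z=1.96` is the usual normal quantile for a 95% interval.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (hypothetical helper)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


TEST_SIZE = 135

# Final accuracies copied from the comparison table above.
final_accuracy = {
    "sentences_50agree": 0.874,
    "sentences_66agree": 0.911,
    "sentences_75agree": 0.963,
    "sentences_allagree": 0.993,
}

intervals = {}
for level, acc in final_accuracy.items():
    # Assumption: the number of correct predictions is round(accuracy * test size).
    correct = round(acc * TEST_SIZE)
    intervals[level] = wilson_interval(correct, TEST_SIZE)
    low, high = intervals[level]
    print(f"{level:20s} {correct:3d}/{TEST_SIZE}  95% CI: [{low:.3f}, {high:.3f}]")
```

Where the intervals of two levels overlap, a single run may not be enough to separate them. Repeating the runs with different seeds would give a firmer comparison.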


In [15]:
# Analyze trends and insights
if all_results:
    print("\n" + "="*80)
    print("KEY INSIGHTS AND ANALYSIS")
    print("="*80)
    
    # Find best and worst performing models
    best_model = comparison_df.loc[comparison_df['Final_Accuracy'].idxmax()]
    worst_model = comparison_df.loc[comparison_df['Final_Accuracy'].idxmin()]
    
    print(f"\nBEST PERFORMING MODEL:")
    print(f"  Agreement Level: {best_model['Agreement_Level']}")
    print(f"  Final Accuracy: {best_model['Final_Accuracy']:.3f}")
    print(f"  Dataset Size: {best_model['Dataset_Size']:,}")
    print(f"  Improvement: {best_model['Improvement']:.3f} ({best_model['Improvement_Percent']:+.1f}%)")
    
    print(f"\nLOWEST PERFORMING MODEL:")
    print(f"  Agreement Level: {worst_model['Agreement_Level']}")
    print(f"  Final Accuracy: {worst_model['Final_Accuracy']:.3f}")
    print(f"  Dataset Size: {worst_model['Dataset_Size']:,}")
    print(f"  Improvement: {worst_model['Improvement']:.3f} ({worst_model['Improvement_Percent']:+.1f}%)")
    
    # Analyze relationship between agreement level and performance
    print(f"\nTRENDS ANALYSIS:")
    print(f"  Average Final Accuracy: {comparison_df['Final_Accuracy'].mean():.3f}")
    print(f"  Average Improvement: {comparison_df['Improvement'].mean():.3f}")
    print(f"  Standard Deviation: {comparison_df['Final_Accuracy'].std():.3f}")
    
    # Check whether higher agreement correlates with better performance.
    # Pearson r over only four runs is descriptive; it is not a significance test.
    agreement_mapping = {
        'sentences_50agree': 50,
        'sentences_66agree': 66,
        'sentences_75agree': 75,
        'sentences_allagree': 100
    }
    
    comparison_df['Agreement_Percent'] = comparison_df['Agreement_Level'].map(agreement_mapping)
    correlation = comparison_df['Agreement_Percent'].corr(comparison_df['Final_Accuracy'])
    
    print(f"\nCORRELATION ANALYSIS:")
    print(f"  Agreement Level vs Final Accuracy Correlation: {correlation:.3f}")
    
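    if pd.isna(correlation):
        # NaN fails every comparison below and would otherwise be reported as negative
        print(f"  Correlation undefined - need at least two runs with differing accuracy")
    elif correlation > 0.5: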
        print(f"  Strong positive correlation - Higher agreement improves performance")
    elif correlation > 0.2:
        print(f"  Moderate positive correlation - Higher agreement tends to improve performance")
    elif correlation > -0.2:
        print(f"  Weak correlation - Agreement level has minimal impact on performance")
    else:
        print(f"  Negative correlation - Unexpected result, may need investigation")
    
    print(f"\nRECOMMENDATIONS:")
    print(f"  • Use {best_model['Agreement_Level']} for production deployment")
    print(f"  • Data quality vs quantity trade-off analysis completed")
    print(f"  • Consider ensemble methods if performance differences are small")
    print(f"  • Monitor real-world performance to validate these results")
    
else:
    print("No analysis possible - no successful experiments")


KEY INSIGHTS AND ANALYSIS

BEST PERFORMING MODEL:
  Agreement Level: sentences_allagree
  Final Accuracy: 0.993
  Dataset Size: 2,264
  Improvement: 0.493 (+98.5%)

LOWEST PERFORMING MODEL:
  Agreement Level: sentences_50agree
  Final Accuracy: 0.874
  Dataset Size: 4,846
  Improvement: 0.374 (+74.8%)

TRENDS ANALYSIS:
  Average Final Accuracy: 0.935
  Average Improvement: 0.473
  Standard Deviation: 0.053

CORRELATION ANALYSIS:
  Agreement Level vs Final Accuracy Correlation: 0.959
  Strong positive correlation - Higher agreement improves performance

RECOMMENDATIONS:
  • Use sentences_allagree for production deployment
  • Data quality vs quantity trade-off analysis completed
  • Consider ensemble methods if performance differences are small
  • Monitor real-world performance to validate these results
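
The correlation reported above rests on four data points, so it helps to see how easily a value that high could arise by chance. The sketch below runs an exact permutation test using only the standard library. It recomputes Pearson's r for all 24 orderings of the four accuracies and counts how many reach the observed value. The inputs are the rounded figures from the comparison table, so the recomputed r may differ from the notebook's value in the third decimal. `pearson` is a hypothetical helper.

```python
import math
from itertools import permutations

# Values copied (rounded) from the comparison table above.
agreement = [50, 66, 75, 100]
accuracy = [0.874, 0.911, 0.963, 0.993]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


observed = pearson(agreement, accuracy)

# Exact one-sided permutation test: shuffle accuracies across agreement levels.
perm_rs = [pearson(agreement, list(p)) for p in permutations(accuracy)]
at_least = sum(r >= observed - 1e-12 for r in perm_rs)
p_value = at_least / len(perm_rs)

print(f"Observed r: {observed:.3f}")
print(f"Orderings with r >= observed: {at_least} of {len(perm_rs)}")
print(f"One-sided permutation p-value: {p_value:.3f}")
```

With four runs there are only 24 possible orderings, which bounds how small the p-value can be. More agreement levels or repeated seeds would be needed for a stronger statement.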


In [16]:
# Save detailed results for each agreement level
if detailed_results:
    for agreement_level, data in detailed_results.items():
        # Create detailed predictions DataFrame
        detailed_df = pd.DataFrame({
            'text': data['test_data']['text'].tolist(),
            'true_sentiment': data['y_true'],
            'predicted_sentiment': data['y_pred'],
            'correct': [t == p for t, p in zip(data['y_true'], data['y_pred'])],
            'agreement_level': agreement_level
        })
        
        # Save detailed results
        detail_filename = f"../results/detailed_predictions_llama3_{agreement_level}.csv"
        detailed_df.to_csv(detail_filename, index=False)
        print(f"Detailed predictions saved: {detail_filename}")

print(f"\nAll results and analysis completed!")
print(f"Check the ../results/ folder for all output files")

Detailed predictions saved: ../results/detailed_predictions_llama3_sentences_50agree.csv
Detailed predictions saved: ../results/detailed_predictions_llama3_sentences_66agree.csv
Detailed predictions saved: ../results/detailed_predictions_llama3_sentences_75agree.csv
Detailed predictions saved: ../results/detailed_predictions_llama3_sentences_allagree.csv

All results and analysis completed!
Check the ../results/ folder for all output files
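
The detailed prediction files store one row per test sentence with `true_sentiment` and `predicted_sentiment` columns, which makes it straightforward to tally which classes are confused with each other. The sketch below does this with the standard library. The rows are invented placeholders that only mimic the column layout, not actual predictions from this experiment, and the lowercase label spelling is an assumption. `confusion_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Placeholder rows that mirror the column layout of the saved CSV files.
# They are illustrative only and are NOT taken from the real results.
sample_csv = io.StringIO(
    "text,true_sentiment,predicted_sentiment,correct,agreement_level\n"
    "Example sentence A,positive,positive,True,sentences_allagree\n"
    "Example sentence B,neutral,positive,False,sentences_allagree\n"
    "Example sentence C,negative,negative,True,sentences_allagree\n"
    "Example sentence D,neutral,neutral,True,sentences_allagree\n"
)


def confusion_counts(rows):
    """Count (true label, predicted label) pairs (hypothetical helper)."""
    return Counter((row["true_sentiment"], row["predicted_sentiment"]) for row in rows)


rows = list(csv.DictReader(sample_csv))
counts = confusion_counts(rows)
errors = [row for row in rows if row["true_sentiment"] != row["predicted_sentiment"]]

for (true_label, predicted_label), n in sorted(counts.items()):
    print(f"true={true_label:9s} predicted={predicted_label:9s} count={n}")
print(f"Misclassified rows: {len(errors)}")
```

To run this on real output, replace `sample_csv` with an open file handle for one of the `detailed_predictions_llama3_*.csv` files listed above.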


In [17]:
# Final memory cleanup
clear_memory()
print("\nFinal memory cleanup completed")
print("\nExperiment completed successfully!")

Clearing memory...
GPU memory allocated: 0.02 GB
GPU memory reserved: 1.04 GB

Final memory cleanup completed

Experiment completed successfully!
