## Fine-tune Gemma 3 270M-it for Sentiment Analysis

This notebook provides a hands-on tutorial for fine-tuning the Gemma 3 270M-it model for sentiment analysis on financial and economic information. Analyzing sentiment in this domain is crucial for businesses to gain market insights, manage risks, and inform investment decisions.

To demonstrate the fine-tuning process, we use the FinancialPhraseBank dataset. Annotated datasets for financial and economic text are rare, and many are reserved for proprietary use. To address this shortage, researchers from the Aalto University School of Business published a set of approximately 5,000 sentences in 2014 as a human-annotated benchmark for evaluating alternative modeling techniques. The 16 annotators, all with adequate background knowledge of financial markets, were instructed to assess each sentence solely from an investor's perspective: does the news have a potentially positive, negative, or neutral impact on the stock price?

The FinancialPhraseBank dataset has two columns, "Sentiment" and "News Headline", and labels each headline as negative, neutral, or positive from the viewpoint of a retail investor. It has been used in many studies since its introduction in Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P., "Good debt or bad debt: Detecting semantic orientations in economic texts," Journal of the Association for Information Science and Technology, 2014.

### 1. Setup Environment

First, we install the required libraries.
* accelerate: A library by Hugging Face for efficient PyTorch training on any hardware configuration, including multi-GPU setups.
* peft: A library for Parameter-Efficient Fine-Tuning. It enables us to adapt pre-trained models by fine-tuning only a small fraction of their parameters, thereby significantly reducing computational costs.
* trl: A library by Hugging Face for training transformer language models with techniques like Supervised Fine-tuning (SFT), which we will use here.
* flash-attn: An optional library that provides a highly optimized attention mechanism, which can speed up training and reduce memory usage on compatible GPUs.

In [None]:
# Install Pytorch & other libraries
%pip -q install torch tensorboard

# Install Hugging Face libraries
%pip -q install transformers datasets accelerate evaluate trl peft protobuf sentencepiece

In [None]:
import subprocess
import sys
import torch

def install_package(package_name, *pip_args):
    """
    A helper function to install a pip package using subprocess.
    Returns True if the installation succeeded, False otherwise.
    """
    try:
        command = [sys.executable, "-m", "pip", "install", package_name]
        command.extend(pip_args)
        subprocess.check_call(command)
        print(f"Successfully installed {package_name}.")
        return True
    except subprocess.CalledProcessError as e:
        print(f"Error installing {package_name}: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
    return False

def install_flash_attn_conditionally():
    """
    Checks the GPU's compute capability and installs flash-attn if FlashAttention 2 is supported.
    Returns True only if flash-attn 2.x was installed successfully.
    """
    if not torch.cuda.is_available():
        print("No CUDA-enabled GPU found. Skipping flash-attn installation.")
        return False

    try:
        # Get the compute capability of the first available GPU as a (major, minor) tuple
        capability = torch.cuda.get_device_capability(0)
        gpu_name = torch.cuda.get_device_name(0)
        print(f"Found GPU: {gpu_name} with Compute Capability: {capability[0]}.{capability[1]}")

        # Check for Ampere, Ada, Hopper, or newer architectures (for FlashAttention 2)
        if capability >= (8, 0):
            if torch.cuda.is_bf16_supported():
                print("GPU supports BF16 and is compatible with FlashAttention 2.")
                print("Proceeding with installation of the latest 'flash-attn'...")
                # Each pip flag must be passed as a separate argument
                return install_package("flash-attn", "-q", "--no-build-isolation")
            else:
                print("GPU architecture is compatible, but BF16 is not supported. Skipping installation.")
                return False
        # Turing GPUs (e.g. T4) only work with FlashAttention 1.x, which cannot be
        # used through the "flash_attention_2" setting selected later in this notebook
        elif capability == (7, 5):
            print("Turing architecture GPU detected. FlashAttention 2 is not supported. Skipping installation.")
            return False
        else:
            print(f"GPU with compute capability {capability[0]}.{capability[1]} is not supported by flash-attn. Skipping installation.")
            return False
    except Exception as e:
        print(f"An error occurred during GPU check or installation: {e}")
        return False

is_flash_attn_available = install_flash_attn_conditionally()

We set environment variables to select the GPU and control tokenizer parallelism.
* CUDA_VISIBLE_DEVICES: Set to the desired GPU ID. "0" uses the first available GPU.
* TOKENIZERS_PARALLELISM: Set to "false" to disable tokenizer parallelism, which avoids warnings and potential deadlocks in some environments.

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

During training, Hugging Face libraries can produce numerous non-critical warnings. We'll suppress these to keep the output clean.

In [None]:
import warnings
warnings.filterwarnings("ignore")

### 2. Load Model and Tokenizer
Now, we load the Gemma 3 270M-it model and its corresponding tokenizer. We'll load the model in bfloat16 for memory efficiency, a capability supported by modern GPUs.
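
To make the memory argument concrete, here is a rough back-of-the-envelope estimate. It assumes about 270 million parameters (taken from the "270M" in the model name, not measured) and counts weights only, ignoring activations, optimizer state, and framework overhead.

```python
# Approximate weight memory for two dtypes; the parameter count is an assumption.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}
num_params = 270_000_000  # rough figure inferred from the model name

weight_gb = {dtype: num_params * nbytes / 1e9
             for dtype, nbytes in BYTES_PER_PARAM.items()}

for dtype, gigabytes in weight_gb.items():
    print(f"{dtype}: ~{gigabytes:.2f} GB for weights alone")
```

Under these assumptions, bfloat16 halves the weight footprint relative to float32.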

In [None]:
# General imports
import os
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import torch

# Hugging Face imports
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import Dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Scikit-learn for evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

In [None]:
print(f"transformers=={transformers.__version__}")

In [None]:
def set_deterministic(seed):
    """Sets the Python, NumPy, and PyTorch random seeds for reproducible results."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seeds every GPU when using multi-GPU
    set_seed(seed)

SEED = 0
set_deterministic(SEED)

In [None]:
# We specify the model path on Kaggle.
GEMMA_PATH = "/kaggle/input/gemma-3/transformers/gemma-3-270m-it/1"

# Determine the attention implementation.
# Use the faster "flash_attention_2" if installed, otherwise fall back to the eager implementation.
attn_implementation = "flash_attention_2" if is_flash_attn_available else "eager"

model = AutoModelForCausalLM.from_pretrained(
    GEMMA_PATH,
    dtype="auto", # Uses the dtype stored in the checkpoint (typically bfloat16 for Gemma)
    device_map="auto",
    attn_implementation=attn_implementation
)

max_seq_length = 2048
tokenizer = AutoTokenizer.from_pretrained(GEMMA_PATH, model_max_length=max_seq_length)

# Explicitly enable use_cache for faster inference
model.config.use_cache = True

# We use the end-of-sequence token as the padding token.
# Padding on the left is a common practice for decoder-only models.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model.config.pad_token_id = tokenizer.pad_token_id
model.generation_config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.generation_config.bos_token_id = tokenizer.bos_token_id

# Store the End-Of-Sequence token for use in prompt formatting
EOS_TOKEN = tokenizer.eos_token

print(f"Device: {model.device}")
print(f"DType: {model.dtype}")
print(f"Attention Implementation: {attn_implementation}")

### 3. Prepare the Dataset

We perform the following steps to prepare our data for fine-tuning:

1. Load Data: Read the all-data.csv file.
2. Create Splits:
   * Training Set: A balanced set of 300 examples for each sentiment (positive, neutral, negative).
   * Test Set: A balanced set of 300 examples for each sentiment, separate from the training set.
   * Evaluation Set: A smaller, balanced set of 50 examples per sentiment, sampled with replacement from the remaining data. This is used for validation during training.
3. Format Prompts: We convert the raw text into structured prompts that guide the model to perform the sentiment analysis task. A special prompt format is used for training (including the answer) and another for testing (without the answer).
4. Create Datasets: The prepared data is converted into Hugging Face Dataset objects, which are the standard format for the SFTTrainer.
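
The splitting logic in step 2 can be sketched on toy data using only the standard library. This is an illustration under assumed names (`balanced_splits` and the toy headlines are hypothetical), not the notebook's actual pandas/scikit-learn implementation: train and test are disjoint and class-balanced, and the evaluation rows are drawn with replacement from whatever is left over.

```python
import random
from collections import defaultdict

def balanced_splits(rows, per_class_train, per_class_test, per_class_eval, seed=0):
    """Split (label, text) rows into disjoint, class-balanced train/test sets,
    then draw an evaluation set with replacement from the leftover rows."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    train, test, evaluation = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        train += items[:per_class_train]
        test += items[per_class_train:per_class_train + per_class_test]
        leftover = items[per_class_train + per_class_test:]
        # Sampling with replacement lets a small leftover pool still yield
        # per_class_eval rows (duplicates are possible).
        evaluation += rng.choices(leftover, k=per_class_eval)
    rng.shuffle(train)
    return train, test, evaluation

# Toy data: 10 hypothetical headlines per class.
toy_rows = [(label, f"{label} headline {i}")
            for label in ("positive", "neutral", "negative")
            for i in range(10)]
train, test, evaluation = balanced_splits(toy_rows, 4, 4, 3)
print(len(train), len(test), len(evaluation))
```

Sampling with replacement matters when a class has few rows left after the train and test splits are taken; the evaluation set can then contain duplicates, so its loss is a coarse signal rather than a precise estimate.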

In [None]:
# Load the dataset from the CSV file
filename = "../input/sentiment-analysis-for-financial-news/all-data.csv"

df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

# Stratified sampling to create balanced train and test sets
X_train, X_test = [], []
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df[df.sentiment==sentiment], 
                                    train_size=300,
                                    test_size=300, 
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

# Concatenate and shuffle the training data
X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

# Create a balanced evaluation set from the rows not used for training or testing
# (use the concatenated splits so that all three sentiment classes are excluded)
used_idx = set(X_train.index) | set(X_test.index)
X_eval = df[~df.index.isin(used_idx)]
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

In [None]:
# Prompt engineering for training and inference 

def create_training_prompt(data_point):
    """Formats a data point for training, including the expected sentiment."""
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip() + EOS_TOKEN

def create_test_prompt(data_point):
    """Formats a data point for inference, leaving the sentiment for the model to generate."""
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = 

            """.strip()

In [None]:
# Apply prompt formatting
X_train["text"] = X_train.apply(create_training_prompt, axis=1)
X_eval["text"] = X_eval.apply(create_training_prompt, axis=1)

# Store true labels for final evaluation and format test set for inference
y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(create_test_prompt, axis=1), columns=["text"])

# Convert pandas DataFrames to Hugging Face Dataset objects
train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)

In [None]:
print(f"Training samples: {len(train_data)}")
print(f"Evaluation samples: {len(eval_data)}")
print(f"Test samples: {len(X_test)}")

### 4. Define Evaluation Metrics

We create a function to evaluate the model's predictions. This function will calculate and display:
* Overall accuracy.
* Accuracy per sentiment class.
* A detailed classification report with precision, recall, and F1-score.
* A confusion matrix to visualize where the model is making errors.
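
As a minimal sketch of what per-class accuracy and the confusion matrix measure, the following standard-library example computes both on a handful of made-up labels. The helper names are hypothetical and this is not the scikit-learn implementation used in the notebook.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

def confusion_counts(y_true, y_pred):
    """Return a nested dict: counts[true_label][predicted_label]."""
    pairs = Counter(zip(y_true, y_pred))
    return {t: {p: pairs[(t, p)] for p in LABELS} for t in LABELS}

def per_class_accuracy(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label]
            for label in LABELS if totals[label]}

# Made-up labels for illustration only.
toy_true = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
toy_pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]

print(per_class_accuracy(toy_true, toy_pred))
print(confusion_counts(toy_true, toy_pred))
```

Per-class accuracy computed this way is the same quantity as per-class recall: it looks only at the rows belonging to one true class.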

In [None]:
def evaluate(y_true, y_pred):
    """Calculates and prints comprehensive evaluation metrics."""
    
    labels = ['negative', 'neutral', 'positive']  # list position matches the integer code below
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        # Unparseable predictions ("none" or anything unexpected) count as neutral
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Calculate accuracy per class (equivalent to per-class recall)
    for label in sorted(set(y_true)):
        mask = y_true == label
        label_accuracy = accuracy_score(y_true[mask], y_pred[mask])
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred,
                                         labels=[0, 1, 2], target_names=labels)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)

### 5. Baseline Performance (Zero-Shot)

Before fine-tuning, let's establish a baseline by evaluating the pre-trained Gemma 3 270M-it model on our test set. This "zero-shot" performance demonstrates the model's ability to understand the task without any specific training.
The following prediction function is optimized to process the entire test set in batches, which is significantly faster than predicting one by one. It tokenizes the prompts, sends them to the GPU, and generates responses.
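
Once text has been generated, it still has to be mapped back to a label. The sketch below isolates that parsing step on a few made-up outputs; `parse_sentiment` is a hypothetical helper that follows the "[headline] = label" prompt layout used in this notebook.

```python
def parse_sentiment(generated_text):
    """Map free-form model output to one of the three labels, or "none"."""
    # Keep only what follows the last "=", mirroring the "[headline] = label" layout.
    answer = generated_text.split("=")[-1].lower().strip()
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"

# Made-up generations for illustration only.
examples = [
    "[Profit rose 20%] = positive",
    "[Sales fell sharply] =  Negative.",
    "[The company is based in Helsinki] = I am not sure",
]
parsed = [parse_sentiment(text) for text in examples]
print(parsed)
```

Substring matching tolerates extra punctuation or capitalization, and the "none" fallback keeps unparseable outputs visible instead of silently assigning them a class at this stage.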

In [None]:
def predict(X_test, model, tokenizer):
    """Performs batch inference on the test set."""

    y_pred = []
    # Convert DataFrame column to a list of prompts
    prompts = X_test["text"].tolist()

    # Set batch size depending on GPU memory
    batch_size = 8 
    
    for i in tqdm(range(0, len(prompts), batch_size)):
        batch = prompts[i:i+batch_size]
        inputs = tokenizer(batch,
                           return_tensors="pt",
                           padding=True,
                           truncation=True,
                           max_length=max_seq_length).to(model.device)
        
        outputs = model.generate(
            **inputs,
            # Allow a few extra tokens so the full label word can be generated
            max_new_tokens=10, 
            do_sample=False, # Greedy decoding for deterministic output; sampling parameters are not needed
            pad_token_id=tokenizer.eos_token_id
        )
        
        # Decode and parse the generated text
        decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        
        for output in decoded_outputs:
            # The generated answer is after the last '=' sign
            answer = output.split("=")[-1].lower().strip()
            
            if "positive" in answer:
                y_pred.append("positive")
            elif "negative" in answer:
                y_pred.append("negative")
            elif "neutral" in answer:
                y_pred.append("neutral")
            else:
                # Fallback for unexpected or empty outputs
                y_pred.append("none")
                
    return y_pred

In [None]:
# Evaluate the base model
y_pred = predict(X_test, model, tokenizer)

In [None]:
evaluate(y_true, y_pred)

As expected, the base model's performance is poor. It often defaults to a single sentiment (like neutral) because it hasn't been specifically trained for this nuanced financial analysis task. This result highlights the need for fine-tuning.

### 6. Fine-Tuning with PEFT (LoRA)

We will use the SFTTrainer from the TRL library to perform Supervised Fine-tuning. To make this process efficient, we'll use a PEFT method called LoRA (Low-Rank Adaptation).
LoRA freezes the pre-trained model weights and injects trainable, low-rank matrices into selected linear layers (here, the attention and MLP projections listed in target_modules). We only train these small matrices, drastically reducing the number of trainable parameters and memory requirements.
Below, we define the configurations for LoRA and the trainer.
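
To see where the savings come from, consider a single weight matrix of shape d_out x d_in. LoRA adds two factors, A (r x d_in) and B (d_out x r), and scales their product by lora_alpha / r. The arithmetic below is a sketch: the layer width of 640 is an assumed value chosen for illustration, not read from the model configuration, while r and lora_alpha match the values used in this section.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two LoRA factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical square projection; 640 is an assumed width for illustration.
d_in = d_out = 640
r, lora_alpha = 40, 48

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, r)
scaling = lora_alpha / r  # the low-rank update B @ A is multiplied by this factor

print(f"full matrix:  {full:,} params")
print(f"LoRA factors: {lora:,} params ({lora / full:.1%} of full)")
print(f"scaling factor: {scaling}")
```

A fairly large rank such as 40 still trains only a fraction of each targeted matrix, and a smaller rank would reduce that fraction further at the cost of adapter capacity.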

In [None]:
# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=48,
    lora_dropout=0.1,
    r=40,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
)

# SFT (Supervised Fine-tuning) configuration
training_arguments = SFTConfig(
    output_dir="logs",
    seed=SEED,
    num_train_epochs=5,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",
    save_steps=0,
    logging_steps=25,
    learning_rate=7.75e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,
    eval_strategy='steps',
    eval_steps = 112,
    eval_accumulation_steps=1,
    lr_scheduler_type="cosine",
    dataset_text_field="text",
    packing=False,
    max_length=max_seq_length,
    report_to="tensorboard",
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    processing_class=tokenizer,
    args=training_arguments,
    
)

### 7. Start Training

We can now start the fine-tuning process. With a T4 GPU on Kaggle, this should take around 15-20 minutes. The training loss and validation loss (if eval_dataset is provided) will be printed periodically.
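
The relationship between dataset size, batch settings, and eval_steps can be checked with a little arithmetic. This sketch assumes a single GPU and the 900-example training set described earlier; whether a partial final accumulation batch counts as a step depends on the library version, so both roundings are shown.

```python
import math

num_examples = 900            # 300 per class x 3 classes
per_device_batch_size = 1
grad_accum_steps = 8
num_epochs = 5

effective_batch = per_device_batch_size * grad_accum_steps
steps_low = num_examples // effective_batch              # partial final batch dropped
steps_high = math.ceil(num_examples / effective_batch)   # partial final batch counted

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_low} to {steps_high}")
print(f"total optimizer steps: {steps_low * num_epochs} to {steps_high * num_epochs}")
```

Under these assumptions, eval_steps=112 triggers validation roughly once per epoch.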

In [None]:
# Train model
trainer.train()

# Save the fine-tuned LoRA adapter
trainer.model.save_pretrained("trained-model")

In [None]:
# Access the log history
log_history = trainer.state.log_history

# Extract training / validation loss
train_losses = [log["loss"] for log in log_history if "loss" in log]
epoch_train = [log["epoch"] for log in log_history if "loss" in log]
eval_losses = [log["eval_loss"] for log in log_history if "eval_loss" in log]
epoch_eval = [log["epoch"] for log in log_history if "eval_loss" in log]

# Plot the training and validation loss
plt.plot(epoch_train, train_losses, label="Training Loss")
plt.plot(epoch_eval, eval_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss per Epoch")
plt.legend()
plt.grid(True)
plt.show()

You can monitor the training progress using TensorBoard, which provides visualizations of metrics like training loss.

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs/runs

### 8. Evaluate the Fine-Tuned Model
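After training, the LoRA adapter stays attached to the base model as separate weights; SFTTrainer does not merge it automatically (call merge_and_unload() on the PEFT model if you need a standalone merged checkpoint). The adapter is active during generation, so we can use the same predict function to evaluate performance on the test set. We should see a dramatic improvement over the baseline.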

In [None]:
# Set model configuration for inference
model.gradient_checkpointing_disable()
model.config.use_cache = True

y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

The results should show a significant increase in accuracy, precision, and recall across all sentiment classes. This demonstrates the power of fine-tuning for adapting a general-purpose model to a specific domain and task.

### 9. Analyze Predictions
Finally, let's create a CSV file containing the test prompts (each of which embeds the original headline), the true labels, and the model's predictions. This is useful for error analysis: examining the specific cases where the model failed can provide insights for further improvements, such as refining the prompt or adding more diverse training data.
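
One simple way to start the error analysis is to count which (true, predicted) pairs occur most often among the mistakes. The sketch below uses the standard library on a small in-memory stand-in for the saved file; the rows are made up, and only the column names follow the DataFrame built in this section.

```python
import csv
import io
from collections import Counter

# Stand-in for the contents of the predictions file (hypothetical rows).
csv_text = """text,y_true,y_pred
[Profit rose 20%] =,positive,positive
[Sales fell sharply] =,negative,neutral
[Orders were flat] =,neutral,neutral
[Margins narrowed] =,negative,neutral
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
errors = [row for row in rows if row["y_true"] != row["y_pred"]]
error_pairs = Counter((row["y_true"], row["y_pred"]) for row in errors)

for (true_label, predicted), count in error_pairs.most_common():
    print(f"{true_label} -> {predicted}: {count}")
```

Reading the prompts behind the most frequent confusion pair is usually a quicker route to a fix than scanning every misclassified row.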

In [None]:
evaluation_df = pd.DataFrame({'text': X_test["text"], 
                              'y_true':y_true, 
                              'y_pred': y_pred},
                            )
evaluation_df.to_csv("test_predictions.csv", index=False)

print("Predictions saved to test_predictions.csv")
evaluation_df.head()