## Project Overview

This notebook fine-tunes a Llama 3.1 large language model to generate news headlines that include numerical information. The project uses:

1. **Meta's Llama 3.1-8B-Instruct model** as the base model
2. **LoRA (Low-Rank Adaptation)** for efficient fine-tuning
3. **The NumHG dataset** containing news articles paired with headlines
4. **Unsloth library** for optimized training of LLMs

## Key Components Explained

### Data Preparation
- The code loads news articles and their headlines from CSV files
- It formats them into conversations (system prompt + article + headline)
- The data is preprocessed to match Llama 3.1's chat template format

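The conversion described above can be sketched without loading a model. This is a minimal illustration rather than the notebook's exact code: `row_to_conversation` and `to_role_content` are hypothetical helper names, the article and headline strings are toy data, and the role mapping is an assumption about what `standardize_sharegpt` does (ShareGPT `from`/`value` keys become `role`/`content`).

```python
# Map ShareGPT-style speaker names to the role names used by chat templates.
# Assumption: this mirrors the renaming performed by standardize_sharegpt.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def row_to_conversation(article, headline):
    """Build one ShareGPT-style conversation from an article/headline pair."""
    return [
        {"from": "system", "value": "You are a journalist tasked with creating headlines for news articles."},
        {"from": "human", "value": article},
        {"from": "gpt", "value": headline},
    ]

def to_role_content(conversation):
    """Convert {"from", "value"} turns into {"role", "content"} turns."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conversation]

# Toy example (not taken from NumHG).
convo = row_to_conversation("Shares rose 5% on Monday ...", "Shares Climb 5%")
messages = to_role_content(convo)
print([m["role"] for m in messages])
```

The resulting `messages` list is the shape that `tokenizer.apply_chat_template` expects before it renders the Llama 3.1 special tokens.
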
### Model Fine-tuning
- Uses 4-bit quantization to reduce memory requirements
- Applies LoRA so that only small adapter matrices are trained, a small fraction of the model's parameters (roughly 0.5% at rank 16 for the 8B model)
- Computes the loss on the assistant (headline) responses only
- Uses the 8-bit AdamW optimizer (`adamw_8bit`)

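To make the LoRA parameter budget concrete, the sketch below counts the adapter parameters for rank 16 on the seven projection layers this notebook targets. The layer dimensions are assumptions taken from the published Llama 3.1 8B configuration, and `lora_params` is a hypothetical helper; Unsloth reports its own count when the adapters are attached.

```python
# Assumed Llama 3.1 8B dimensions (from the published model config).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
LAYERS = 32            # num_hidden_layers
RANK = 16              # LoRA rank r used in this notebook

# (in_features, out_features) of each linear layer that receives an adapter.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * LAYERS
print(f"Trainable LoRA parameters: {total:,}")
```

Dividing `total` by the roughly 8 billion base parameters gives the trainable fraction.
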
### Evaluation Pipeline
- Sets up three different evaluation scenarios:
  1. Base model (original Llama 3.1)
  2. Fine-tuned model with basic prompts
  3. Fine-tuned model with chain-of-thought prompts
- Includes a special peer evaluation case with an RTF article

### Outputs
- Saves the fine-tuned LoRA adapter and tokenizer to Google Drive
- Generates headline predictions using all three approaches
- Saves the generated headlines to text files for comparison

The notebook demonstrates a complete machine learning pipeline from data preparation through model training to evaluation and inference, with a focus on generating newspaper headlines that effectively incorporate numerical information.

In [None]:
# News Headline Generation Model with Numeric Features
# This notebook fine-tunes a Llama 3.1 model to generate news headlines with numerical features.
# The model is trained on the NumHG dataset which contains news articles and their corresponding headlines.

# Mount Google Drive and clone the NumHG dataset repository
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
!git clone https://github.com/ArrowHuang/NumHG.git

In [None]:
%%capture
# Install required packages for model training and evaluation
# (%%capture must be the first line of the cell, so the comment goes below it)
!pip install unsloth bert-score rouge_score datasets

In [None]:
# Import libraries for data processing and model training
from unsloth import FastLanguageModel  # Optimized LLM training; Unsloth recommends importing it before transformers
import pandas as pd
from transformers import AutoTokenizer
import torch
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
import re
from datasets import Dataset

In [None]:
# Log in to Hugging Face to access the base model
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Load the base model: Llama 3.1 8B Instruct variant
# Using 4-bit quantization to reduce memory usage while maintaining model quality
max_seq_length = 3100  # Maximum sequence length for input texts
dtype = None  # Auto-detect best data type based on available hardware
load_in_4bit = True  # Enable 4-bit quantization to reduce memory requirements

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "",  # Hugging Face access token placeholder; do not commit a real token
)

In [None]:
# Prepare datasets for training
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Define paths for training and validation data
train_path = "./NumHG/Dataset/fold-1/train.csv"
val_path = "./NumHG/Dataset/fold-1/val.csv"

# Set up the chat template to format our data according to Llama 3.1 requirements
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1"
)

# Function to convert conversations to the expected format for the model
def formatting_train_prompts_func(examples):
  convos = examples["conversations"]
  texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
  return {"text": texts}

# Preprocess function to convert CSV data into conversational format for training
def train_preprocess(path, train_type):
  # Read the CSV file
  df = pd.read_csv(path)
  # Add a label column for either train or validation data
  if train_type == "train":
    df["train"] = "train"
  else:
    df["val"] = "val"

  # Create conversational format with system, human, and assistant roles
  df["conversations"] = df.apply(lambda row: [
      {"from": "system", "value": "You are a journalist tasked with creating headlines for news articles."},
      {"from": "human", "value": row["text"]},
      {"from": "gpt", "value": row["summary"]}
  ], axis=1)

  # Remove columns that aren't needed for training
  df = df.drop(columns=["text", "cloze_gt", "cloze_annotation", "need_reasoning", "summary", "cloze"])

  # Convert to HuggingFace Dataset format and apply conversational templating
  dataset = Dataset.from_pandas(df)
  dataset = standardize_sharegpt(dataset)
  dataset = dataset.map(formatting_train_prompts_func, batched=True)
  return dataset

# Process training and validation datasets
train_dataset = train_preprocess(train_path, "train")
val_dataset = train_preprocess(val_path, "val")

In [None]:
# Add LoRA (Low-Rank Adaptation) adapters to the model
# LoRA enables efficient fine-tuning by training only small adapter matrices
# (roughly 0.5% of the parameters with the settings below)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # Rank of the update matrices, higher values = more capacity but more parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],  # Layers to apply LoRA to
    lora_alpha = 16,  # LoRA scaling factor
    lora_dropout = 0,  # Dropout probability for LoRA layers (optimized to be 0)
    bias = "none",  # Whether to train bias parameters (optimized to be "none")
    use_gradient_checkpointing = "unsloth",  # Memory optimization technique
    random_state = 3407,  # Random seed for reproducibility
    use_rslora = False,  # Whether to use rank stabilized LoRA
    loftq_config = None,  # Configuration for LoftQ initialization
)

In [None]:
# Set up the trainer using SFTTrainer (Supervised Fine-Tuning) from Hugging Face's TRL library
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,  # Number of processes for dataset preparation
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False,  # Whether to pack sequences together (disabled for longer sequences)
    args = TrainingArguments(
        per_device_train_batch_size = 32,  # Batch size per GPU for training
        gradient_accumulation_steps = 8,  # Number of batches whose gradients are accumulated before each optimizer step
        warmup_steps = 13,  # Number of warmup steps for learning rate scheduler
        num_train_epochs = 2,  # Number of complete passes through the dataset
        eval_strategy = "steps",  # When to evaluate (every few steps)
        save_strategy = "steps",  # When to save checkpoints
        eval_steps = 5,  # Evaluate every 5 steps
        save_steps = 5,  # Save checkpoint every 5 steps
        logging_steps = 5,  # Log metrics every 5 steps
        logging_strategy = "steps",
        logging_dir = "logs",
        per_device_eval_batch_size = 32,  # Batch size for evaluation
        learning_rate = 2e-5,  # Learning rate for optimizer
        fp16 = not is_bfloat16_supported(),  # Use FP16 if BF16 not available
        bf16 = is_bfloat16_supported(),  # Use BF16 if available (better precision)
        optim = "adamw_8bit",  # Use 8-bit AdamW optimizer for memory efficiency
        weight_decay = 0.01,  # Weight decay coefficient (applied in decoupled form by AdamW)
        lr_scheduler_type = "linear",  # Learning rate scheduler type
        seed = 3407,  # Random seed for reproducibility
        output_dir = "outputs",
        report_to = "none",  # Disable external reporting (e.g., WandB)
    ),
)

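The batch settings above interact, so a short worked calculation may help. The sketch below computes the effective batch size per optimizer step and an approximate step count. `num_examples` is a hypothetical dataset size (not the real NumHG count) and `num_gpus = 1` assumes a single Colab GPU; the Trainer's exact rounding of steps may differ slightly.

```python
import math

per_device_batch = 32      # per_device_train_batch_size
grad_accum = 8             # gradient_accumulation_steps
num_gpus = 1               # assumption: a single Colab GPU
epochs = 2                 # num_train_epochs
num_examples = 20_000      # hypothetical dataset size, for illustration only

# Each optimizer step sees this many sequences in total.
effective_batch = per_device_batch * grad_accum * num_gpus

# Approximate number of optimizer steps.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total_steps}")
```

Comparing `total_steps` with `warmup_steps`, `eval_steps`, and `save_steps` shows what share of training is spent warming up and how many evaluations and checkpoints to expect.
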
In [None]:
# Configure the trainer to only train on the assistant (response) portions
# This is important because we want the model to learn to generate headlines (assistant responses)
# rather than learning to repeat the article text (human prompts)
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",  # Marker for user input
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n"  # Marker for assistant response
)

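The idea behind response-only training can be illustrated on a toy sequence. This is a simplified sketch, not Unsloth's implementation: it works on string tokens with single-token markers, whereas the real function operates on token IDs and multi-token headers. `mask_non_response` and the short marker strings are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default

def mask_non_response(tokens, instruction_marker, response_marker):
    """Return labels in which only tokens after a response marker are kept."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)   # the marker itself is not a target
        elif tok == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

# Toy sequence: one user turn followed by one assistant turn.
tokens = ["<user>", "Article", "text", "<assistant>", "Headline", "<eot>"]
labels = mask_non_response(tokens, "<user>", "<assistant>")
print(labels)
```

Only the headline tokens and the end-of-turn token contribute to the loss, so the model is not rewarded for reproducing the article text.
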
In [None]:
# Display current GPU memory usage statistics
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
# Start the training process
trainer_stats = trainer.train()

In [None]:
# Calculate and display final memory usage and training time statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory/max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

In [None]:
# Save the fine-tuned LoRA adapter weights and tokenizer
model.save_pretrained("news_summary_model_finalvInstructGen")
tokenizer.save_pretrained("news_summary_model_finalvInstructGen")

In [None]:
# Copy the saved model to Google Drive for persistence
import shutil

# Source folder path (local)
source_folder = "/content/news_summary_model_finalvInstructGen"

# Destination path in Google Drive
destination = "/content/drive/MyDrive/news_summary_model_finalvInstructGen"

# Copy the folder; dirs_exist_ok=True lets the cell be re-run without a FileExistsError
shutil.copytree(source_folder, destination, dirs_exist_ok=True)

In [None]:
# Load models for evaluation and testing
from transformers import set_seed

# Reload the base model for comparison
max_seq_length = 3100
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "",  # Hugging Face access token placeholder; do not commit a real token
)

# Set model to inference mode
FastLanguageModel.for_inference(model)

# Load the fine-tuned model
saved_model, saved_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/drive/MyDrive/news_summary_model_finalvInstructGen",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

# Set fine-tuned model to inference mode
FastLanguageModel.for_inference(saved_model)

In [None]:
# Install and import striprtf for handling RTF files (for peer evaluation)
!pip install striprtf
from striprtf.striprtf import rtf_to_text

# Load a peer evaluation article (RTF format)
with open("IS3433FinalArticle.rtf", 'r', encoding='utf-8') as file:
    rtf_content = file.read()

# Convert RTF to plain text
peer_eval = rtf_to_text(rtf_content)

In [None]:
# Prepare test prompts for model evaluation
test_df = pd.read_csv("./NumHG/Dataset/fold-1/test.csv")

# Create three sets of test prompts:
# 1. Base prompts (simple headline generation)
# 2. Fine-tuned model prompts (same as base)
# 3. Fine-tuned with chain-of-thought prompts (more detailed instructions)
test_feed_base = []
test_feed_ft = []
test_feed_ftcot = []

# Create a special prompt for the peer evaluation article
final_prompt = f"Generate a single headline for this news article that includes at least one key numeric feature:\n\n{peer_eval}\n\nWhen generating the headline consider the following:\n1. What is the subject of the article?\n2. What is the sentiment of the article?\n3. Does the headline accurately portray the subject and sentiment of the article?\n4. Does the headline match the journalistic style of the examples?\n5. Is the headline an appropriate length?\n6. Is the key numeric feature formatted as a number?\n\nBe sure to return only the generated headline without any enclosing quotation marks."

final_peer = [
      {"role": "system", "content": "You are a journalist tasked to provide headlines for news articles!"},
      {"role": "user", "content": final_prompt},
  ]

# Prepare prompts for test dataset evaluation
for index, row in test_df.iterrows():
  news = row["text"]
  # Basic prompt without chain-of-thought guidance
  prompt = f"Generate a single headline for this news article that includes at least one key numeric feature:\n\n{news}\n\nBe sure to return only the generated headline without any enclosing quotation marks."
  messages = [
      {"role": "system", "content": "You are a journalist tasked to provide headlines for news articles!"},
      {"role": "user", "content": prompt},
  ]
  test_feed_base.append(messages)
  test_feed_ft.append(messages)

# Prepare prompts with chain-of-thought guidance
for index, row in test_df.iterrows():
  news = row["text"]
  prompt = f"Generate a single headline for this news article that includes at least one key numeric feature:\n\n{news}\n\nWhen generating the headline consider the following:\n1. What is the subject of the article?\n2. What is the sentiment of the article?\n3. Does the headline accurately portray the subject and sentiment of the article?\n4. Does the headline match the journalistic style of the examples?\n5. Is the headline an appropriate length?\n6. Is the key numeric feature formatted as a number?\n\nBe sure to return only the generated headline without any enclosing quotation marks."
  messages = [
      {"role": "system", "content": "You are a journalist tasked to provide headlines for news articles!"},
      {"role": "user", "content": prompt},
  ]
  test_feed_ftcot.append(messages)

In [None]:
# Set up text generation pipeline for the base model
from transformers import pipeline, set_seed  # pipeline was used below without being imported
set_seed(43)  # Set seed for reproducibility
base_generator = pipeline("text-generation",
                          model = model,
                          tokenizer = tokenizer
                          )

# Define special tokens that indicate the end of generated text
terminators = [
    base_generator.tokenizer.eos_token_id,
    base_generator.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Generate headlines using the base model
output_base = base_generator(test_feed_base, max_new_tokens=40, eos_token_id = terminators, batch_size = 31)


In [None]:
# Set up text generation pipeline for the fine-tuned model
set_seed(43)  # Set seed for reproducibility
ft_generator = pipeline("text-generation",
                        model = saved_model,
                        tokenizer = saved_tokenizer)

# Generate headlines using the fine-tuned model (without chain-of-thought)
output_ft = ft_generator(test_feed_ft, max_new_tokens=40, eos_token_id = terminators, batch_size = 31)

# Generate headlines using the fine-tuned model with chain-of-thought prompts
output_ftcot = ft_generator(test_feed_ftcot, max_new_tokens=40, eos_token_id = terminators, batch_size = 31)

# Generate headline for the peer evaluation article
output_final = ft_generator(final_peer, max_new_tokens=40, eos_token_id = terminators)

# Print the generated headline for the peer evaluation article
print(output_final[0]["generated_text"][-1]["content"])

In [None]:
# Helper function to save generated headlines to text files
def save_preds(filename, gend, strip = False):
  with open(filename, "w") as f:
    for element in gend:
      headline = element[0]["generated_text"][-1]["content"]
      # Option to strip quotation marks from generated headlines
      if strip:
        if headline.startswith('"') and headline.endswith('"'):
          headline = headline[1:-1]
      f.write(headline + "\n")

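The indexing inside `save_preds` implies a nested output shape: one list per input, holding a dict whose `generated_text` is the conversation with the new assistant turn appended. The sketch below reproduces that extraction on mock data so the shape can be checked without running a model. `extract_headline` is a hypothetical helper and `mock_output` is invented for illustration.

```python
def extract_headline(element, strip=False):
    """Pull the assistant's reply out of one pipeline result."""
    headline = element[0]["generated_text"][-1]["content"]
    # Optionally drop one pair of enclosing double quotes.
    if strip and len(headline) >= 2 and headline.startswith('"') and headline.endswith('"'):
        headline = headline[1:-1]
    return headline

# Mock result for a single chat-style input (not real model output).
mock_output = [
    [{"generated_text": [
        {"role": "system", "content": "You are a journalist ..."},
        {"role": "user", "content": "Generate a single headline ..."},
        {"role": "assistant", "content": '"Shares Climb 5%"'},
    ]}],
]

for element in mock_output:
    print(extract_headline(element, strip=True))
```

The extra length check avoids turning a headline that consists of a single quote character into an empty string.
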
In [None]:
# Save the base model predictions
save_preds("BASEpreds-head.txt", output_base, True)

In [None]:
# Save the fine-tuned model predictions
save_preds("FTpreds-head.txt", output_ft)

In [None]:
# Save the fine-tuned with chain-of-thought model predictions
save_preds("FTCOTpreds-head.txt", output_ftcot)
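
Once the three prediction files exist, they can be compared against the reference headlines. The sketch below shows one simple check, the share of generated headlines that contain at least one number also present in the reference. This is an illustrative assumption, not the official NumHG metric and not part of the original notebook; `extract_numbers` and `numeric_match_rate` are hypothetical helpers, and the example headlines are toy data.

```python
import re

# Matches integers and decimals, including thousands separators such as 1,200.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def extract_numbers(text):
    """Return the set of digit-based numbers in text, with commas removed."""
    return {match.replace(",", "") for match in NUMBER_RE.findall(text)}

def numeric_match_rate(predictions, references):
    """Fraction of predictions sharing at least one number with their reference."""
    if not predictions:
        return 0.0
    hits = sum(
        1 for pred, ref in zip(predictions, references)
        if extract_numbers(pred) & extract_numbers(ref)
    )
    return hits / len(predictions)

# Toy data (not taken from NumHG).
preds = ["Shares Climb 5%", "Quake Kills Dozens"]
refs = ["Stocks Up 5% Monday", "Quake Kills 12"]
print(numeric_match_rate(preds, refs))
```

A check like this only covers numbers written as digits; spelled-out numbers such as "dozens" or "twelve" would need extra handling.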