# Homework: Fine-tuning Llama-3.2 for Conversation Summarization

## Getting Started

This homework assignment builds on the concepts covered in the in-class lab. You will implement supervised fine-tuning of a language model for conversation summarization.

### Implementation Tasks

This notebook contains implementation tasks where you're expected to write your own code. Look for sections marked with:

```python
# ------------ Edit AREA Start---------------------------------
# Task description and instructions...

# Your code goes here...

# ------------ Edit AREA End----------------------------------
```

## Submission Guidelines

Your final submission should include:
1. This completed notebook with all code cells implemented and executed
2. A PDF report (maximum 2 pages) addressing the questions in the Written Report Requirements section
3. The evaluation and test output data files for three parameter sets.

**Total Points: 50**

*In response to your feedback, this assignment increases the share of code that you are expected to write yourself. If you have difficulty with any task, you may refer to the code from the in-class lab.*

# Conversation Summarization for Business Professionals

## Business Context

As a data scientist at a large organization, you've been tasked with developing an AI solution to help business professionals manage the overwhelming amount of textual information they process daily.

**The Problem:**
- Executives spend an average of 23 hours per week in meetings
- Teams exchange hundreds of emails and chat messages daily
- Important decisions and action items get buried in lengthy conversations
- Cross-functional communication is inefficient when participants need to read through entire conversations

**Your Solution:**
You will fine-tune a language model to automatically generate concise, accurate summaries of business conversations. This will help:

1. **Save Time**: Reduce time spent reading through lengthy conversations
2. **Improve Decision Making**: Quickly extract key points, decisions, and action items
3. **Enhance Knowledge Sharing**: Create searchable summaries of institutional knowledge
4. **Scale Communication**: Enable teams to efficiently process more information

This homework will guide you through the process of fine-tuning a language model for this specific business application.

## 1. Setup & Base Model

### 1.1 Installation and Setup
First, we install the necessary libraries for our fine-tuning task.

In [1]:
%%capture
# Import basic utilities
import os, re

# Check if we're in Google Colab and install appropriate packages
if "COLAB_" not in "".join(os.environ.keys()):
    # For non-Colab environments, a simple install is sufficient
    !pip install unsloth
else:
    # For Colab, we need more specific installations due to environment constraints
    # Get the torch version to ensure compatible installations
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

    # Install dependencies without overwriting existing packages
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

# Install specific versions of transformers and trl for compatibility
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2

### 1.2 Loading the Base Model

In this section, you'll load a pre-trained language model that will serve as the base for fine-tuning.

**Task:** Load the pre-trained model `unsloth/Llama-3.2-1B-Instruct` using the helper from Unsloth.

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Suitable for most conversations in the dataset

# We use 4-bit quantization to reduce memory requirements
# This allows us to fit the model in limited GPU memory while maintaining quality
load_in_4bit = True  # Enable 4-bit quantization

# Choose the data type (None for automatic selection based on hardware)
dtype = None

# ------------ Edit AREA Start---------------------------------
# INSTRUCTIONS:
# 1. Use FastLanguageModel.from_pretrained to load the "unsloth/Llama-3.2-1B-Instruct" checkpoint.
# 2. Pass the variables max_seq_length, dtype, and load_in_4bit defined above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = "unsloth/Llama-3.2-1B-Instruct",
    max_seq_length  = max_seq_length,
    dtype           = dtype,
    load_in_4bit    = load_in_4bit,
)
# ------------ Edit AREA End----------------------------------

# Display model information
device = next(model.parameters()).device
print(f"Model loaded successfully on {device}")
print(f"Model architecture: Llama-3.2 with {model.config.num_hidden_layers} layers")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Maximum sequence length: {max_seq_length} tokens")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.1: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Model loaded successfully on cuda:0
Model architecture: Llama-3.2 with 16 layers
Vocabulary size: 128256
Maximum sequence length: 2048 tokens


## 2. PEFT: Adding LoRA Adapters

Next, you'll configure Low-Rank Adaptation (LoRA) for efficient fine-tuning. LoRA freezes the base model's weights and trains small low-rank adapter matrices added to selected layers, so only a small fraction of the parameters is updated. This reduces memory requirements and training time.

**Task:** Apply LoRA adapters to the model.
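
To make the role of the LoRA hyperparameters concrete, here is a minimal sketch of the scaling applied to the low-rank update. It assumes the standard LoRA formulation, in which the adapted weight is `W + (lora_alpha / r) * B @ A`, and the rank-stabilized variant (rsLoRA), which divides by `sqrt(r)` instead. The helper name `lora_scale` is hypothetical and is not part of Unsloth or PEFT.

```python
import math

def lora_scale(lora_alpha: float, rank: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A (hypothetical helper)."""
    if use_rslora:
        # Rank-stabilized LoRA divides by the square root of the rank
        return lora_alpha / math.sqrt(rank)
    return lora_alpha / rank

# The three lora_alpha values offered in this assignment, all at rank 16
for alpha in (4, 16, 32):
    print(f"lora_alpha={alpha:>2}, r=16 -> scale={lora_scale(alpha, 16):.2f}")
```

With the rank fixed at 16, the three parameter sets scale the adapter update by 0.25, 1.0, and 2.0. Under this formulation they differ in how strongly the adapter influences the output, not in the adapter's capacity.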

In [3]:
# We adapt the attention projections (q, k, v, o) plus the MLP up and down projections
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"]

# Set the LoRA rank (inner dimension of the low-rank adapter matrices)
lora_rank = 16

# Set dropout probability for regularization during training
lora_dropout = 0

# ------------ Edit AREA Start---------------------------------
# INSTRUCTIONS:
# 1. Choose one of the provided hyperparameter combinations for your first experiment and assign the
#    corresponding values to the variables lora_alpha and learning_rate.
# 2. Call FastLanguageModel.get_peft_model with the base `model` and the hyperparameters
#    defined above to obtain `lora_model`.

# For this fine-tuning task, we need to select hyperparameters that balance learning speed
# and stability. Here are three parameter sets with different trade-offs:

# Parameter Set 1: Low alpha, low learning rate
# lora_alpha = 4      # Lower scaling factor - more conservative adaptation
# learning_rate = 1e-5  # Slow learning rate - stable but may require more steps

# Parameter Set 2: Medium alpha, medium learning rate
# lora_alpha = 16     # Moderate scaling - balanced adaptation
# learning_rate = 1e-4  # Medium learning rate - good general purpose value

# Parameter Set 3: Higher alpha, higher learning rate
# lora_alpha = 32       # Higher scaling factor - more aggressive adaptation
# learning_rate = 1e-3  # Faster learning rate - quicker convergence but risk of overshooting

# Selected: Parameter Set 2 (balanced)
lora_alpha = 16        # Moderate scaling - balanced adaptation
learning_rate = 1e-4   # Good general-purpose LR for LoRA on 1B models

# Apply LoRA adapters to the model
lora_model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=target_modules,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
)
# ------------ Edit AREA End----------------------------------

# Print information about trainable parameters
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Percentage of parameters being trained: {100 * trainable_params / total_params:.4f}%")

Not an error, but Unsloth cannot patch MLP layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.10.1 patched 16 layers with 16 QKV layers, 16 O layers and 0 MLP layers.


Total parameters: 783,091,712
Trainable parameters: 8,650,752
Percentage of parameters being trained: 1.1047%
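
The trainable-parameter count above can be reproduced by hand. Each LoRA adapter on a linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` weights. The sketch below assumes the layer shapes of the published Llama-3.2-1B configuration (hidden size 2048, MLP size 8192, and a 512-wide key/value projection). These shapes are not stated in this notebook, so treat them as assumptions; `lora_trainable_params` is a hypothetical helper.

```python
# Assumed (d_in, d_out) shapes for Llama-3.2-1B; not taken from this notebook
LAYER_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}

def lora_trainable_params(rank, num_layers, shapes=LAYER_SHAPES):
    """Count LoRA weights: A is (rank x d_in) and B is (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

print(f"{lora_trainable_params(rank=16, num_layers=16):,}")  # 8,650,752
```

Under these assumptions the result matches the `Trainable parameters` value printed by the cell.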


## 3. Testing the Base Model

Before fine-tuning, let's test how the base model performs on conversation summarization. This will establish a baseline for comparing the improvements after fine-tuning.

In [4]:
# Load the SAMSum dataset for testing
from datasets import load_dataset

# Load a small portion of the dataset for testing
samsum_dataset = load_dataset("knkarthick/samsum", split="test[:100]")
print(f"Loaded {len(samsum_dataset)} test examples from SAMSum dataset")

README.md: 0.00B [00:00, ?B/s]

train.csv: 0.00B [00:00, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/14731 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Loaded 100 test examples from SAMSum dataset


In [5]:
from unsloth.chat_templates import get_chat_template

# Set up the chat template for our model
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",  # Llama-3 series models use this template
)

**Task:** Define a function that generates a summary with the model.

In [6]:
from transformers import TextStreamer

def generate_summary_with_model(conversation, max_new_tokens=150):
    """
    Generate a summary with the current model (the LoRA-wrapped model if it exists, otherwise the base model)

    Args:
        conversation: The conversation text to summarize
        max_new_tokens: Maximum length of the generated summary

    Returns:
        The generated summary
    """
    # Create a clear prompt asking the model to summarize the conversation
    messages = [
        {"role": "user", "content": f"Please provide a concise summary of the following conversation:\n\n{conversation}"}
    ]

    # ------------ Edit AREA Start---------------------------------
    # INSTRUCTIONS:
    # 1. Apply appropriate chat template for messages
    # 2. Create a text streamer to show generation in real-time
    # 3. Generate summary with controlled parameters, temperature=0.7, top_p=0.95
    # 4. Extract the generated summary
    # 5. Return the generated text after the prompt

    # 1) Apply appropriate chat template for messages
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    )

    # Choose model (prefer LoRA model if available)
    mdl = lora_model if "lora_model" in globals() and lora_model is not None else model
    device = next(mdl.parameters()).device
    inputs = inputs.to(device)

    # Ensure pad token is set to avoid warnings
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token

    # 2) Create a text streamer to show generation in real-time
    text_streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # 3) Generate summary with controlled parameters
    outputs = mdl.generate(
        input_ids=inputs,
        streamer=text_streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=True,
    )

    # 4) Extract the generated summary (only the new tokens after the prompt)
    gen_ids = outputs[0][inputs.shape[-1]:]

    # 5) Return the generated text after the prompt
    return tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
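
The two sampling parameters used above can be illustrated without a model. The sketch below shows the general idea of temperature scaling and nucleus (top-p) filtering on a toy list of logits. The function names are hypothetical, and the code inside `generate` is implemented differently, so read this as an explanation of the concept rather than the library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up values for four candidate tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
print(top_p_filter(probs, top_p=0.95))
```

A temperature below 1 sharpens the distribution toward the most likely token, and top-p then discards the low-probability tail before sampling.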

In [7]:
# Test the function on an example conversation
test_conversation_list = []
test_idx_list = [40, 60, 90] # You can change this to test different examples

for test_idx in test_idx_list:
    test_conversation = samsum_dataset[test_idx]

    print("CONVERSATION:")
    print(test_conversation["dialogue"])
    print("-" * 80)

    print("\nREFERENCE SUMMARY (Human-written):")
    print(test_conversation["summary"])
    print("-" * 80)

    print("\nBASE MODEL SUMMARY:")
    summary = generate_summary_with_model(test_conversation["dialogue"])
    # print(summary)

    test_conversation_list.append(test_conversation)
    print("=" * 100)
    print()

CONVERSATION:
Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it.
Kevin: I want to win lottery and I never did :D
Sebastian: If you devoted your life to analyze al

## 4. Data Pipeline

Now, we'll prepare the SAMSum dataset for fine-tuning. This involves formatting the conversations and summaries into the appropriate structure for our model.

**Task:** Create a function to format the dataset into instruction-response pairs suitable for fine-tuning.

In [8]:
# Load the full training dataset
train_dataset = load_dataset("knkarthick/samsum", split="train")
print(f"Loaded {len(train_dataset)} training examples")

# Display a sample to understand the data structure
print("\nSample example:")
print(f"Dialogue: {train_dataset[0]['dialogue'][:200]}...")
print(f"Summary: {train_dataset[0]['summary']}")

Loaded 14731 training examples

Sample example:
Dialogue: Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)...
Summary: Amanda baked cookies and will bring Jerry some tomorrow.


In [9]:
# Define a function to convert our data into the conversation format
def convert_to_conversation_format(example):
    """
    Converts a dataset example into the conversation format expected by the model.

    Args:
        example: Dictionary containing dialogue and summary fields from SAMSum.

    Returns:
        Dictionary with a `conversations` list in the required format.
    """
    # ------------ Edit AREA Start---------------------------------
    # INSTRUCTIONS:
    # 1. Create a user turn that requests a concise summary and includes the dialogue text.
    # 2. Create an assistant turn that provides the human-written summary.
    # 3. Return a dictionary with a single key `conversations` mapped to a list
    #    containing the user turn followed by the assistant turn.

    # 1) User turn requesting a concise summary, including the dialogue text
    user_turn = {
        "role": "user",
        "content": (
            "Please provide a concise summary (1–3 sentences) of the following conversation:\n\n"
            f"{example['dialogue']}"
        ),
    }

    # 2) Assistant turn providing the human-written summary
    assistant_turn = {
        "role": "assistant",
        "content": example["summary"],
    }

    # 3) Return as a list under `conversations` (user first, then assistant)
    return {"conversations": [user_turn, assistant_turn]}
    # ------------ Edit AREA End----------------------------------

# Load the predefined train and validation splits of SAMSum
from datasets import load_dataset

dataset = load_dataset("knkarthick/samsum")
train_dataset = dataset["train"]
validation_dataset = dataset["validation"]

print(f"Train set: {len(train_dataset)} examples")
print(f"Validation set: {len(validation_dataset)} examples")

print("Converting dataset to conversation format...")
formatted_dataset = train_dataset.map(convert_to_conversation_format)

# Function to apply the chat template and prepare data for training
def formatting_prompts_func(examples):
    """
    Apply the tokenizer's chat template to format the conversations properly.

    Args:
        examples: Batch of examples with conversations

    Returns:
        Dictionary with formatted text
    """
    # ------------ Edit AREA Start---------------------------------
    # INSTRUCTIONS:
    # 1. Extract the list of conversations from the input examples.
    # 2. Use tokenizer.apply_chat_template to format each conversation.
    #    - Set tokenize=False to get raw text
    #    - Set add_generation_prompt=False since we don't need the assistant prompt here
    # 3. Return a dictionary with a single key `text` mapped to the list of formatted texts.
    convos = examples["conversations"]  # 1) Extract list of conversations
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False,  # 2) Include both user + assistant in rendered text
        )
        for convo in convos
    ]
    return {"text": texts}
    # ------------ Edit AREA End----------------------------------

# Apply the formatting
print("Applying chat template...")
processed_dataset = formatted_dataset.map(formatting_prompts_func, batched=True)

# Display an example of the formatted data
if processed_dataset is not None and len(processed_dataset) > 0 and 'text' in processed_dataset[0]:
    print("\nSample formatted example:")
    print("-" * 60)
    print(processed_dataset[0]['text'][:500] + "...")
else:
    print("Warning: Formatted dataset is empty or incorrectly structured")

Train set: 14731 examples
Validation set: 818 examples
Converting dataset to conversation format...


Map:   0%|          | 0/14731 [00:00<?, ? examples/s]

Applying chat template...


Map:   0%|          | 0/14731 [00:00<?, ? examples/s]


Sample formatted example:
------------------------------------------------------------
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Please provide a concise summary (1–3 sentences) of the following conversation:

Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Amanda baked cookies and will bring Jerry some tomorrow.<|eot_id|>...
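
The layout of the formatted example can be approximated by hand, which helps when checking that the user and assistant turns land where you expect. The function `render_llama3_chat` below is a hypothetical, simplified stand-in for `tokenizer.apply_chat_template`. It reproduces only the header and end-of-turn markers visible in the sample and omits the default system block that the real template inserts.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Hand-written approximation of the Llama 3 chat layout (illustrative only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

# Shortened toy conversation in the same structure as `conversations`
conversation = [
    {"role": "user", "content": "Please summarize:\n\nAmanda: I baked cookies."},
    {"role": "assistant", "content": "Amanda baked cookies."},
]
print(render_llama3_chat(conversation))
```

For training, the assistant turn is part of the text, so `add_generation_prompt` stays `False`. For inference, only the user turn is rendered and the flag is set to `True`.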


## 5. Training

Now, you'll fine-tune the model on the conversation summarization task.

In [10]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq
import math

# Most training parameters are pre-configured; learning_rate comes from the parameter set chosen above
batch_size = 2
gradient_accumulation_steps = 4
effective_batch_size = batch_size * gradient_accumulation_steps
weight_decay = 0.01
training_steps = 100

# Calculate warmup steps (typically 10% of total steps)
warmup_ratio = 0.1
warmup_steps = int(training_steps * warmup_ratio)

print(f"Dataset size: {len(processed_dataset)} examples")
print(f"Training with effective batch size of {effective_batch_size}")
print(f"Total training steps: {training_steps}")
print(f"Learning rate: {learning_rate}")
print(f"Warmup steps: {warmup_steps}")

# Create the SFT trainer
trainer = SFTTrainer(
    model = lora_model,
    tokenizer = tokenizer,
    train_dataset = processed_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer),
    packing = False,
    args = SFTConfig(
        per_device_train_batch_size = batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        warmup_steps = warmup_steps,
        max_steps = training_steps,  # Using fixed steps instead of epochs
        learning_rate = learning_rate,
        fp16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 8,
        bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = weight_decay,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "summarization_model",
        report_to = "none",
    ),
)

# Add response-only training optimization
from unsloth.chat_templates import train_on_responses_only

# Specify the marker for the beginning of user instructions
instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n"

# Specify the marker for the beginning of assistant responses
response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n"

# Apply response-only training
trainer = train_on_responses_only(
    trainer,
    instruction_part = instruction_part,
    response_part = response_part,
)

Dataset size: 14731 examples
Training with effective batch size of 8
Total training steps: 100
Learning rate: 0.0001
Warmup steps: 10


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/14731 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/14731 [00:00<?, ? examples/s]
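
Response-only training means the loss is computed only on the assistant's tokens. A common way to achieve this, and the one assumed in this sketch, is to set the label of every prompt token to -100, the index that PyTorch's cross-entropy loss ignores by default. The function `mask_prompt_labels` and the toy token ids are hypothetical. This sketch handles a single assistant turn only and is not the code behind `train_on_responses_only`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prompt_labels(input_ids, response_marker_ids):
    """Copy input_ids to labels and mask everything up to the end of the response marker."""
    labels = list(input_ids)
    n = len(response_marker_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_marker_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # No response marker found: nothing contributes to the loss
    return [IGNORE_INDEX] * len(input_ids)

# Toy ids: 90, 91 stand in for the tokenized assistant header
toy_input_ids = [1, 10, 11, 12, 90, 91, 20, 21, 2]
print(mask_prompt_labels(toy_input_ids, response_marker_ids=[90, 91]))
# [-100, -100, -100, -100, -100, -100, 20, 21, 2]
```

Only the last three positions, the summary tokens and the end-of-turn token in this toy example, would contribute to the loss.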

In [11]:
# Check available resources before training
import torch
import time

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    memory_allocated = torch.cuda.memory_allocated(0) / (1024**3)  # Convert to GB
    memory_reserved = torch.cuda.memory_reserved(0) / (1024**3)  # Convert to GB
    max_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # Convert to GB

    print(f"GPU: {gpu_name}")
    print(f"Memory allocated: {memory_allocated:.2f} GB")
    print(f"Memory reserved: {memory_reserved:.2f} GB")
    print(f"Total available: {max_memory:.2f} GB")
else:
    print("No GPU detected. Training will be slow on CPU.")


# Start timing
start_time = time.time()

# Run training
training_results = trainer.train()

# Calculate training time
end_time = time.time()
training_time = end_time - start_time

# Display training results
print(f"\nTraining completed in {training_time:.2f} seconds ({training_time/60:.2f} minutes)")

if hasattr(training_results, "metrics"):
    print("\nTraining metrics:")
    for key, value in training_results.metrics.items():
        if isinstance(value, (int, float)):
            print(f"- {key}: {value:.4f}" if isinstance(value, float) else f"- {key}: {value}")

GPU: Tesla T4
Memory allocated: 1.11 GB
Memory reserved: 1.26 GB
Total available: 14.74 GB


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 14,731 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 8,650,752 of 1,244,465,152 (0.70% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,2.0169
20,1.6161
30,1.5058
40,1.484
50,1.4857
60,1.4286
70,1.4349
80,1.4041
90,1.2491
100,1.4655



Training completed in 143.11 seconds (2.39 minutes)

Training metrics:
- train_runtime: 141.0962
- train_samples_per_second: 5.6700
- train_steps_per_second: 0.7090
- total_flos: 1298281046704128.0000
- train_loss: 1.5091
- epoch: 0.0543
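
The reported `epoch: 0.0543` follows directly from the training configuration: 100 steps at an effective batch size of 8 cover 800 of the 14,731 training examples. The sketch below reproduces that arithmetic and also shows the usual shape of a linear schedule with warmup, as selected by `lr_scheduler_type = "linear"`. Both helper names are hypothetical, and the schedule function is an approximation of the idea rather than the trainer's own code.

```python
def epochs_covered(steps, batch_size, grad_accum, dataset_size):
    """Fraction of the dataset seen after the given number of optimizer steps."""
    return steps * batch_size * grad_accum / dataset_size

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(f"epochs: {epochs_covered(100, 2, 4, 14731):.4f}")  # epochs: 0.0543

# Learning rate at a few steps for Parameter Set 2 (base_lr = 1e-4, 10 warmup steps)
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step, 1e-4, 10, 100):.2e}")
```

Because the run covers only about 5% of one epoch, differences between parameter sets reflect early training behavior rather than converged performance.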


## 6. Evaluation

Now that we've fine-tuned our model, let's test it on some conversations to see how well it generates summaries.

In [12]:
# Configure the model for inference
FastLanguageModel.for_inference(lora_model)  # the LoRA-wrapped model is the one used for generation

# Prepare the evaluation data
eval_data = []

for test_conversation in test_conversation_list:
    # Test the function on the same example we used for the base model
    print("CONVERSATION:")
    print(test_conversation["dialogue"])
    print("-" * 80)

    print("\nREFERENCE SUMMARY (Human-written):")
    print(test_conversation["summary"])
    print("-" * 80)

    print("\nFINE-TUNED MODEL SUMMARY:")
    summary = generate_summary_with_model(test_conversation["dialogue"])
    # print(summary)

    print("=" * 100)
    print()

    eval_data.append({
        "test_conversation": test_conversation["dialogue"],
        "test_reference_summary": test_conversation["summary"],
        "test_model_summary": summary,
    })

CONVERSATION:
Sebastian: It's been already a year since we moved here.
Sebastian: This is totally the best time of my life.
Kevin: Really? 
Sebastian: Yeah! Totally maaan.
Sebastian: During this 1 year I learned more than ever. 
Sebastian: I learned how to be resourceful, I'm learning responsibility, and I literally have the power to make my dreams come true.
Kevin: It's great to hear that.
Kevin: It's great that you are satisfied with your decisions.
Kevin: And above all it's great to see that you have someone you love by your side :)
Sebastian: Exactly!
Sebastian: That's another part of my life that is going great.
Kevin: I wish I had such a person by my side.
Sebastian: Don't worry about it.
Sebastian: I have a feeling this day will come shortly.
Kevin: Haha. I don' think so, but thanks.
Sebastian: This one year proved to me that when you want something really badly, you can achieve it.
Kevin: I want to win lottery and I never did :D
Sebastian: If you devoted your life to analyze al

In [13]:
# Save the evaluation results to a file for later analysis
import os
import json
from datetime import datetime

# Create a filename that encodes the parameters used
params_str = f"lr{learning_rate}_lora_alpha{lora_alpha}"
eval_filename = f"eval_{params_str}.json"

# Save the evaluation data to a JSON file
with open(eval_filename, "w") as f:
    json.dump(eval_data, f, indent=2)

print(f"\nEvaluation results saved to {eval_filename}")
print("You can use these saved results to compare different parameter combinations.")


Evaluation results saved to eval_lr0.0001_lora_alpha16.json
You can use these saved results to compare different parameter combinations.


**Please download the eval file immediately after running the above code. Google Colab may disconnect, causing you to lose the data file.**
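
When comparing the saved files across parameter sets, a rough numeric score can complement reading the summaries. The sketch below computes a simplified unigram-overlap F1 between the reference and the model summary, in the spirit of ROUGE-1 but not the official metric (no stemming, simple regex tokenization). The function names and the toy record are hypothetical; only the dictionary keys match the `eval_data` entries saved above.

```python
import re
from collections import Counter

def unigram_f1(reference, candidate):
    """Simplified ROUGE-1-style F1 based on lowercase word overlap."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def mean_unigram_f1(eval_records):
    """Average the score over records shaped like the saved eval_data entries."""
    scores = [
        unigram_f1(r["test_reference_summary"], r["test_model_summary"])
        for r in eval_records
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy record; the model summary here is made up for illustration
toy_records = [{
    "test_conversation": "Amanda: I baked cookies. Do you want some?",
    "test_reference_summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
    "test_model_summary": "Amanda baked cookies.",
}]
print(f"mean unigram F1: {mean_unigram_f1(toy_records):.2f}")
```

To score a real run, load the saved file with `json.load` and pass the resulting list to `mean_unigram_f1`. With only three examples per run, treat the numbers as a rough guide alongside your qualitative analysis.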

## Homework Submission Instructions

Complete the following steps to successfully submit your homework:

1. **Code Implementation**: Ensure you have completed every Edit AREA in this notebook with working code. Verify that your implementation follows the specified approaches for model fine-tuning.

2. **Execute the Notebook**: Run all cells sequentially to verify your implementation works correctly. Fix any errors that arise during execution before submission.

3. **Experiment with Parameters**: Train the model with each of the three parameter sets below and compare their performance in your report:
   - Parameter Set 1: `lora_alpha = 4, learning_rate = 1e-5`
   - Parameter Set 2: `lora_alpha = 16, learning_rate = 1e-4`
   - Parameter Set 3: `lora_alpha = 32, learning_rate = 1e-3`

4. **Save Output Files**: Download all evaluation and test output JSON files generated during your experiments. These files contain valuable metrics and examples for your analysis.

5. **Written Report**: Prepare a comprehensive PDF report (maximum 2 pages) addressing the requirements detailed below.

6. **Submit Deliverables**: Upload your completed notebook (.ipynb file), your written report (.pdf), and the output JSON files from step 4 to the designated submission portal.

## Written Report Requirements (25 points total)

Your written report should demonstrate critical analysis and thoughtful evaluation of your fine-tuning experiment. Address the following sections:

### 1. Base Model Analysis (5 points)
Answer these questions about the base model's performance:

1. What specific weaknesses did you identify in the base model's summaries? Provide 1-2 concrete examples.

2. What patterns of errors did you observe across multiple summaries (e.g., missing key information, hallucinations, verbosity)?

3. How would these limitations impact the practical use of this model in real-world applications?

### 2. Fine-tuned Model Evaluation (5 points)

1. Provide specific examples of improved summaries after fine-tuning compared to the base model. What specific qualities improved?

2. What limitations still exist in your fine-tuned models? Are there consistent error patterns?

### 3. Example Analysis (15 points)

Assume we have fine-tuned two LLMs (model 1 and model 2) on the same conversation summarization dataset. Now we test them with two tasks:

***

**Conversation**

> **Alice:** "Okay, so for the Q4 product launch, we need to finalize the marketing strategy. I've been thinking about a multi-channel approach. Primarily social media, but also some email marketing. What are your thoughts, Bob?"
>
> **Bob:** "I like the multi-channel idea. For social media, we should definitely focus on Instagram and TikTok, given our target audience is 18-25. My nephew, who is 19, is *always* on TikTok. We should also run A/B tests on the email subject lines to maximize open rates. Remember how poorly the 'Project Eagle' campaign emails did last year?"
>
> **Alice:** "Good point. Let's allocate 70% of the budget to social media and 30% to email. We need to get the final budget proposal to finance by Friday. Let's aim to draft it tomorrow."

***

**Task 1**

* **Prompt**: Your task is to summarize the given dialogue. The summary should (1) be rather short, (2) extract important pieces of information, (3) include the names of the interlocutors, and (4) be written in the third person.
* **Human Answer for Reference**: Alice and Bob are preparing a multi-channel marketing strategy for the Q4 product launch. They will focus on social media and email marketing. They will allocate 70% of the budget to social media and 30% to email. They will draft the final budget proposal tomorrow.
* **Model 1 Output:** Alice and Bob discussed the Q4 product launch, focusing on a multi-channel marketing strategy. Alice suggested using social media and email marketing, while Bob recommended prioritizing Instagram and TikTok for their target demographic (18-25) and conducting A/B tests on email subject lines. They agreed to allocate 70% of the budget to social media and 30% to email, aiming to finalize the budget proposal by Friday and draft it the next day.
* **Model 2 Output:** Alice and Bob are preparing a multi-channel marketing strategy for the Q4 product launch. They will focus on social media and email marketing. They will allocate 70% of the budget to social media and 30% to email. They will draft the final budget proposal tomorrow.


***

**Task 2**

* **Prompt**: Your task is to analyze the given dialogue and find why Bob wants to focus on Instagram and TikTok.
* **Human Answer for Reference**: Bob wants to focus on Instagram and TikTok because their target audience is primarily 18-25 years old, which are the demographics most active on these platforms. He mentions his nephew, who is 19 and frequently uses TikTok, as a personal example to illustrate that these platforms are popular among their target demographic. This suggests that Bob believes marketing on Instagram and TikTok will be more effective in reaching and engaging their intended audience.
* **Model 1 Output:** Bob wants to focus on Instagram and TikTok because their target audience is 18-25 years old, which aligns with the demographics that are highly active on these platforms. Additionally, Bob mentions his nephew, who is 19 and is always on TikTok, implying that he personally observes the popularity and engagement levels of these platforms among young people. This suggests that Bob believes Instagram and TikTok are the most effective channels to reach and engage their target demographic for the product launch.
* **Model 2 Output:** Their target audience is 18-25.

***

Now, based on the two models' outputs on the two tasks, answer the following questions:

For each model:

1. Analyze the model's performance on the two tasks.

2. Based on your analysis, propose hyperparameter tuning strategies that could be explored to optimize its performance. List all related strategies and explain the rationale behind your solutions by referencing concepts discussed in class.


**Formatting Requirements**: Include appropriate section headings and clear writing. Visual aids (tables, charts) comparing results are encouraged but count toward your page limit (maximum 2 pages).