# Week 7 Assignment: Synthetic Data Generation & Fine-Tuning (QLoRA)

## Homework Introduction

Fine-tuning a language model is a pivotal step in advanced agent workflows. While large pre-trained models are powerful generalists, fine-tuning **transforms a base model into a specialized agent** that excels at specific tasks. In an academic Q\&A setting, fine-tuning injects domain-specific knowledge directly into the model, enabling it to answer scholarly questions more accurately and with appropriate depth. This is in contrast to relying solely on retrieval-augmented generation (RAG); a fine-tuned model can **replicate what RAG does (using built-in knowledge)**, but RAG alone cannot impart permanent knowledge or improved reasoning to the model. In practice, fine-tuning can update a model with new information, adjust its style, and optimize it for the Q\&A task at hand.

The goal this week is to improve the agent’s academic question-answering performance by creating **domain-aligned training data** and fine-tuning a local LLM on it. Academic papers contain complex information and terminology that a general model might not fully grasp. By curating a high-quality Q\&A dataset from scholarly sources and fine-tuning on it, we align the model with the domain’s content and style. High-quality, domain-specific data tends to reduce misinformation and hallucinations in LLM responses, although it does not eliminate them. In other words, we’ll be *teaching* our model using data tailored to academic questions, so that it can respond with greater accuracy and relevance in that setting.

## Learning Objectives

By the end of this assignment, you will be able to:

* **Generate synthetic Q\&A data using GPT-4**, designing prompts to yield high-quality question-answer pairs from academic text.
* **Understand instruction-tuning formats** for chat-style data (using special tokens like `<|system|>`, `<|user|>`, `<|assistant|>` to structure prompts and responses).
* **Fine-tune a local LLaMA 3 (7B) model using QLoRA and Unsloth**, leveraging 4-bit quantization for efficient training on limited GPU resources.
* **Evaluate the model’s performance before and after fine-tuning**, comparing accuracy on academic QA tasks and quantifying improvements.

## Project Design

This week’s project involves creating a synthetic dataset and using it to fine-tune the model for better academic Q\&A performance. The plan is as follows:

1. **Data Sampling:** Select **100 academic papers** from your Week 4–5 arXiv dataset (e.g. using their abstracts and key sections). Ensure a diverse mix of subjects or paper types to provide a broad training base.
2. **Synthetic Q\&A Generation:** Use **GPT-4** to generate \~5 question-answer pairs for each paper. Craft a prompt that provides GPT-4 with the paper’s abstract or content and asks for informative Q\&A pairs. The questions should cover important points, definitions, or insights from the paper, and the answers should be correct summaries or explanations based on the text. This yields roughly **500 Q\&A pairs** in total.
3. **Include Edge-Case Examples:** Incorporate some **edge-case questions** among the above pairs – for example, a question that reflects a misunderstanding or a **hallucinated detail** about the paper. For these, provide an answer that corrects the false premise or clarifies that the paper doesn’t contain that information. Including a few such Q\&A examples (e.g. *“Q: According to the paper, what is the value of constant XYZ?”* when XYZ is not actually in the paper, and *“A: The paper does not specify XYZ; in fact, that detail is not discussed.”*) will teach the model to handle incorrect or unanswerable queries gracefully.
4. **Format Data for Instruction Tuning:** Convert all the Q\&A pairs into the **instruction-tuning JSONL format** expected by our fine-tuning pipeline. Each line in the dataset should represent a complete prompt-response dialogue. We will use a chat-style format with explicit roles. For example, you can prepend a fixed system instruction (such as `"You are a helpful academic assistant."`) and then format each Q\&A as:

   ```
   <|system|> You are a helpful academic Q&A assistant specialized in scholarly content.
   <|user|> [Question from the dataset]
   <|assistant|> [Answer from the dataset]
   ```

   Structure each JSONL entry to contain this composite prompt. This ensures the model is trained in a conversational format where it receives a user question and produces an answer, following any system instructions (tone, style) you provided.
5. **Fine-Tune LLaMA 3 7B with QLoRA:** Run a fine-tuning job on **Google Colab** (or a local GPU) using **QLoRA** via the Unsloth library. QLoRA (Quantized LoRA) will load the 7B model in 4-bit precision and train low-rank adaptation weights. This drastically lowers memory usage, allowing even a 7B (and larger) model to be fine-tuned on a single GPU without out-of-memory errors. Using Unsloth’s tools, load the base LLaMA 3 (7B) model (preferably an instruct variant) and fine-tune it on your synthetic Q\&A dataset. We’ll use LoRA adapters so the base model weights remain fixed; the training will produce a small set of adapted weights after 1–3 epochs over the dataset. *(Expect the fine-tuning to be relatively fast given \~500 examples — on a T4 or similar GPU, a few epochs should only take minutes.)*
6. **Evaluation (Pre vs. Post-Tuning):** Finally, evaluate the model’s academic QA performance **before and after fine-tuning**. Prepare a set of **10 test questions** covering various papers or concepts (you can come up with these manually, ensuring they are challenging). Run the original base model and the fine-tuned model on each question, and compare the answers. Look for improvements such as: the fine-tuned model’s answers are more detailed, use terminology from the papers, correct mistakes the base model made, or cite relevant concepts from the training data. This comparison will let you quantify accuracy gains. You might measure accuracy as the number of questions answered correctly or with relevant info, or simply note qualitatively how the responses differ.

Throughout this design, the key idea is that **domain-aligned data** will make the model more knowledgeable in that domain. Instead of the agent relying solely on retrieval each time, the fine-tuned model will have *internalized* some academic knowledge and answer patterns. Fine-tuning on a well-structured QA dataset (as opposed to just dumping raw text) is crucial for the model to learn effectively.
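The data-sampling step (step 1) can be sketched as follows. This is a minimal sketch, assuming each paper in your Week 4–5 dataset is a dict with a `category` field; both the field name and the helper `sample_diverse` are hypothetical, so substitute whatever your dataset actually uses. It draws papers round-robin across categories so that no single subject dominates the selection.

```python
import random
from collections import defaultdict


def sample_diverse(papers, n=100, key="category", seed=0):
    """Pick up to n papers, cycling through categories for a balanced mix."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for paper in papers:
        buckets[paper[key]].append(paper)
    # Shuffle inside each category so picks are not biased by file order.
    for group in buckets.values():
        rng.shuffle(group)

    selected = []
    names = sorted(buckets)
    # Take one paper per category per round until n is reached or all are used.
    while len(selected) < n and any(buckets[name] for name in names):
        for name in names:
            if buckets[name] and len(selected) < n:
                selected.append(buckets[name].pop())
    return selected
```

Small categories are exhausted first and the remaining slots are shared among the larger ones, which is one simple way to get the "diverse mix" described above; other weighting schemes are equally valid.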

## Starter Code

To help you get started, we provide some starter code snippets for each part of the project. You can adapt these examples in your Colab notebook or Python scripts.

**1. Prompt Template for GPT-4 (Data Generation):** Use a clear prompt to get high-quality Q\&A from GPT-4. For example:

```text
You are a research assistant who reads academic papers and creates quiz questions.

Below is the abstract of a research paper. **Read the abstract and generate 5 question-answer pairs** that a student might ask after reading this paper. 
- Ensure the questions cover the key points or findings of the paper.
- Provide detailed answers based only on the information in the abstract.
- Include a mix of question types (factual, conceptual, etc.), and avoid ambiguous or trivial questions.

Abstract:
"[Paste the paper's abstract here]"

Now output 5 Q&A pairs in JSON format, as a list of objects with "question" and "answer" fields.
```

This prompt instructs GPT-4 to act as a quiz-maker for the given paper. You might need to iterate on the wording to get the best results (ensuring it follows the abstract closely). Run this for each of your 100 selected papers. Make sure to **manually review** or spot-check the generated Q\&As for quality and correctness, correcting any mistakes or irrelevant questions.
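Model replies do not always come back as clean JSON: they may be wrapped in a Markdown code fence or contain incomplete items. The sketch below shows one way to parse a reply defensively before adding it to your dataset. The function name `parse_qa_pairs` is hypothetical, and the code assumes the reply follows the "list of objects with `question` and `answer` fields" shape requested in the prompt.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_qa_pairs(raw_reply):
    """Return a list of {"question", "answer"} dicts parsed from a model reply."""
    text = raw_reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (possibly tagged "json") and the closing fence.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []  # caller can log the reply and retry the request
    if not isinstance(items, list):
        return []

    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question") or "").strip()
        answer = str(item.get("answer") or "").strip()
        # Keep only complete pairs; empty questions or answers are discarded.
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs
```

An empty result is a signal to re-run the prompt for that paper rather than a replacement for the manual review described above.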

**2. JSONL Conversion Example:** Once you have all Q\&A pairs (for example, collected in a Python list or as separate files), convert them into a single JSONL file for fine-tuning. The following code illustrates how to structure the data with the desired format:


```python
import json

system_prompt = "You are a helpful academic Q&A assistant specialized in scholarly content."
data = []

# Suppose qas_list is a list of all generated QAs, where each QA is a dict: {"question": ..., "answer": ...}
for qa in qas_list:
    user_q = qa["question"]
    assistant_a = qa["answer"]
    # Compose the prompt with system, user, assistant roles
    full_prompt = f"<|system|>{system_prompt}<|user|>{user_q}<|assistant|>{assistant_a}"
    data.append({"text": full_prompt})

# Write to JSONL file
with open("synthetic_qa.jsonl", "w") as outfile:
    for entry in data:
        outfile.write(json.dumps(entry) + "\n")
```


This code creates a JSONL file where each line has a `"text"` field containing the combined conversation. We include a system role (the same for all entries) to set the assistant’s behavior, then the user question, then the assistant answer. The markers `<|system|>`, `<|user|>`, and `<|assistant|>` delineate the roles. **Note:** Check whether these markers exist in the model’s vocabulary. If they do not, the tokenizer splits them into ordinary sub-word pieces; the model can still learn the pattern during fine-tuning as long as training and inference prompts use exactly the same strings, but the format will not match the model’s built-in chat template. LLaMA 3 instruct models, for instance, define their own header tokens, which you can apply with the tokenizer’s `apply_chat_template` method if you prefer the native format.
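Before training, it can be worth checking that every line of the JSONL file really follows this layout. The following is a minimal sketch of such a check; `check_record` is a hypothetical helper, and it assumes the three role markers and the `"text"` field used in the conversion code.

```python
import json

ROLE_MARKERS = ["<|system|>", "<|user|>", "<|assistant|>"]


def check_record(line):
    """Return a list of problems found in one JSONL line (empty if it looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    text = record.get("text") if isinstance(record, dict) else None
    if not isinstance(text, str):
        return ['missing "text" field']

    positions = [text.find(marker) for marker in ROLE_MARKERS]
    if -1 in positions:
        return ["missing role marker"]
    if positions != sorted(positions):
        return ["role markers out of order"]

    # Slice out the content that follows each marker and check it is non-empty.
    problems = []
    ends = positions[1:] + [len(text)]
    for marker, start, end in zip(ROLE_MARKERS, positions, ends):
        if not text[start + len(marker):end].strip():
            problems.append(f"empty content after {marker}")
    return problems
```

Running `check_record` over each line of `synthetic_qa.jsonl` and printing any non-empty result gives a quick structural report; it says nothing about whether the answers are factually correct.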

**3. QLoRA Fine-Tuning Script (using Unsloth in Colab):** The snippet below demonstrates how you can fine-tune the 7B model with QLoRA. It uses the 🤗 Hugging Face ecosystem along with Unsloth for convenience. Run this in a Colab cell (make sure GPU is enabled):


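```
!pip install unsloth trl transformers peft bitsandbytes datasets
```

```python
from unsloth import FastLanguageModel, is_bfloat16_supported  # import unsloth before trl/transformers
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer  # SFTTrainer lives in TRL, not in Unsloth

# Load the base model and its tokenizer in 4-bit mode.
# from_pretrained returns a (model, tokenizer) pair.
# Confirm this repo id on the Hugging Face Hub; Llama 3.1 ships as 8B, not 7B.
model_name = "unsloth/llama-3.1-7b-unsloth-bnb-4bit"
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these low-rank weights are trained.
# r and lora_alpha are common starting values, not tuned for this assignment.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Load our synthetic Q&A dataset
dataset = load_dataset("json", data_files="synthetic_qa.jsonl", split="train")

# Initialize the trainer for Supervised Fine-Tuning (SFT).
# Newer TRL releases move dataset_text_field / max_seq_length into SFTConfig;
# adjust if your installed version rejects these arguments.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="llama3-7b-qlora-finetuned",
        per_device_train_batch_size=4,   # small batch size for Colab GPU
        gradient_accumulation_steps=4,   # accumulate gradients to simulate larger batch
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # T4 GPUs fall back to fp16
        bf16=is_bfloat16_supported(),
        logging_steps=10,                # frequent enough to see the loss trend on a short run
        save_strategy="epoch",
    ),
)

trainer.train()

# Save the LoRA adapter weights and the tokenizer
model.save_pretrained("llama3-7b-qlora-finetuned")
tokenizer.save_pretrained("llama3-7b-qlora-finetuned")
```

A few notes on this script: We install **Unsloth** and related libraries first. We then load a 4-bit quantized version of LLaMA 3.1 (the model name ending in `bnb-4bit` is meant to be an Unsloth-provided quantized model). `FastLanguageModel.from_pretrained` returns both the model and its tokenizer and takes care of the 4-bit settings (under the hood it uses [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) to load 4-bit weights). `FastLanguageModel.get_peft_model` then attaches the LoRA adapters, which are the only weights that get trained. We use `SFTTrainer` (a supervised fine-tuning trainer from the TRL library) to handle the training loop. In the `TrainingArguments`, we keep the per-device batch size small and use gradient accumulation, giving an effective batch of 16 examples per optimizer step, because activations, adapter gradients, and optimizer state still need a few GB of VRAM on top of the 4-bit weights. We train for 2 epochs (you can adjust this; 1 might be sufficient if the data is small but 2–3 can help it memorize better). After training, we save the LoRA adapter weights and the tokenizer to the output directory; the base model weights stay untouched.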
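It can help to estimate how many optimizer steps a run will take before choosing a logging interval. The arithmetic below is a rough sketch: `estimate_optimizer_steps` is a hypothetical helper, and the exact count reported by the trainer may differ by a step or two depending on how the library rounds the last partial batch.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Approximate the effective batch size and total optimizer steps of a run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# About 500 examples, batch size 4, 4 accumulation steps, 2 epochs
effective_batch, total_steps = estimate_optimizer_steps(500, 4, 4, 2)
print(effective_batch, total_steps)  # 16 64
```

With roughly 64 steps in total, a logging interval of 50 would print the loss only once, while an interval of 10 prints it about six times, which is usually enough to see whether the loss is falling.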

**4. Evaluation Scaffold (Comparing Base vs Fine-Tuned):** After fine-tuning, you’ll want to evaluate how the model’s answers have improved. Below is an example of how you can generate answers from both the base model and the fine-tuned model for comparison:


```python
from unsloth import FastLanguageModel
import torch

# Define some test questions (ensure these were not exactly in training data)
test_questions = [
    "What is the main hypothesis proposed by the paper on quantum computing?",
    "How did the authors of the deep learning study evaluate their model's performance?",
    # ... (add total 10 questions)
]


def load_for_inference(name_or_path):
    """Load a (model, tokenizer) pair in 4-bit and switch it to inference mode."""
    model, tok = FastLanguageModel.from_pretrained(
        model_name=name_or_path, max_seq_length=2048, load_in_4bit=True
    )
    FastLanguageModel.for_inference(model)
    return model, tok


# model_name and system_prompt are the values defined in the earlier cells
base_model, tokenizer = load_for_inference(model_name)         # base model
ft_model, _ = load_for_inference("llama3-7b-qlora-finetuned")  # saved LoRA adapter directory


def generate_answer(model, prompt):
    """Greedy-decode an answer and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=150, do_sample=False)
    # generate() returns prompt + continuation; drop the prompt tokens
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


for q in test_questions:
    prompt_input = f"<|system|>{system_prompt}<|user|>{q}<|assistant|>"
    print(f"Q: {q}")
    print(f"Base Model Answer: {generate_answer(base_model, prompt_input)}")
    print(f"Fine-Tuned Model Answer: {generate_answer(ft_model, prompt_input)}")
    print("-" * 60)
```

This script constructs the same prompt (system + user) for each test question and uses `.generate()` with greedy decoding to get each model’s answer. Because `generate` returns the prompt tokens followed by the new tokens, we slice off the prompt before decoding so that only the answer remains. The results printed will allow you to directly compare the base model’s response with the fine-tuned model’s response for each question. When running this, pay attention to whether the fine-tuned model’s answers are more **accurate, detailed, and aligned** with academic content. For instance, does it correctly recall specific details or terminology that the base model missed? Does it avoid the base model’s mistakes? These comparisons will feed into your analysis of the fine-tuning impact.
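Once you have read both answers for each question and judged them by hand, a small tally turns those judgments into the accuracy comparison the assignment asks for. This is a minimal sketch: the `grades` structure and the `summarize_grades` helper are hypothetical, and the correctness flags are your own manual assessments, not something the code computes.

```python
def summarize_grades(grades):
    """Summarize manual grades.

    grades: list of dicts with a 'question' string and boolean
    'base_correct' and 'ft_correct' flags assigned by the grader.
    """
    total = len(grades)
    if total == 0:
        raise ValueError("no graded questions")
    base_correct = sum(1 for g in grades if g["base_correct"])
    ft_correct = sum(1 for g in grades if g["ft_correct"])
    return {
        "base_accuracy": base_correct / total,
        "ft_accuracy": ft_correct / total,
        # Questions the fine-tuned model fixed, and ones it got worse on
        "improved": [g["question"] for g in grades if g["ft_correct"] and not g["base_correct"]],
        "regressed": [g["question"] for g in grades if g["base_correct"] and not g["ft_correct"]],
    }
```

Listing regressions alongside improvements keeps the report honest: with only 10 test questions, a single changed answer moves accuracy by 10 percentage points.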

## Environment Setup

To execute this project, follow these setup guidelines:

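1. **Set Up Your Environment:** Open a Google Colab notebook (or a local machine with a GPU), enable the GPU runtime, and install the libraries listed in the fine-tuning starter code.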
2. **Model and Data Access:** Ensure you can access the base LLaMA 3 model weights. The code uses the model name `"unsloth/llama-3.1-7b-unsloth-bnb-4bit"`, which is intended to fetch a 4-bit quantized LLaMA 3.1 model from Hugging Face. Confirm the exact repository id on the Hugging Face Hub before running: Meta released Llama 3.1 in 8B, 70B, and 405B sizes, so the closest match to the "7B" model described here is an 8B checkpoint. (If the model requires authentication, provide a Hugging Face token. You can also quantize the model yourself using BitsAndBytes, but using Unsloth’s ready model is convenient.) The tokenizer will be loaded alongside the model. Verify that the model loads successfully on the GPU (watch the memory usage).
3. **Prepare Synthetic Dataset:** Upload your `synthetic_qa.jsonl` to the Colab environment. You can use Colab’s file upload, mount Google Drive, or use `wget` if the file is hosted somewhere. Use the `load_dataset("json", data_files=...)` function to load the JSONL into a Dataset object. Confirm the dataset format (print a sample) to ensure the `"text"` field contains the combined prompt as expected.
4. **Run QLoRA Fine-Tuning:** Execute the fine-tuning code. QLoRA allows the 7B model to be fine-tuned with a low memory footprint. In fact, QLoRA was shown to match full 16-bit fine-tuning performance on benchmarks while using far less memory. Monitor the training logs. You should see the training loss decreasing. Given the small dataset, the model may overfit (loss going very low); that’s okay for our purpose, since we *want* it to memorize the QA pairs. The training should complete in a reasonable time (a few minutes on Colab for 2–3 epochs). If you encounter out-of-memory errors, try reducing `per_device_train_batch_size` in the training arguments or `max_seq_length` (the context length set when loading the model and creating the trainer). Colab’s free GPU (\~15GB) is sufficient for 7B with 4-bit, as Unsloth notes fine-tuning 8B models in 3GB VRAM is possible.
5. **Save the Fine-Tuned Model:** After training, save the model artifacts. The code above saves to a directory `llama3-7b-qlora-finetuned`. You can zip this and download it, or push it to the Hugging Face Hub (if you have an account) for easy reuse. If the model consists of LoRA adapters separate from the base, ensure you save those (Unsloth’s `save_pretrained` should handle it). Having the fine-tuned model will allow you to load it later for inference without re-training.
6. **Perform Evaluation:** With the fine-tuned model saved or still in memory, run the evaluation script. Make sure to also load the base model (perhaps reuse the one from before fine-tuning for speed, since you already loaded it). Generate answers for your test questions with both models. It’s a good idea to **turn off sampling (use deterministic decoding)** so that the comparison is fair. For example, `model.generate(..., do_sample=False)` uses greedy decoding. If you do sample, fix the random seed and use the same low `temperature` for both models; a low temperature alone does not make the output deterministic. Record the outputs or print them side by side as shown in the scaffold code.

By following these steps, you will set up an environment where fine-tuning and evaluation can be done smoothly. Throughout, keep an eye on GPU utilization. QLoRA + 4-bit quantization is very efficient, but if you choose a larger model or increase sequence length, you might still hit limits. Adjust parameters as needed (Unsloth documentation suggests using 2048 tokens context during testing even if the model supports 8192, to save memory).
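A quick back-of-the-envelope calculation shows why 4-bit loading matters on a \~15GB GPU. The sketch below counts the memory for the weights alone; it deliberately ignores quantization constants, activations, LoRA adapters, and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
# 4-bit: 3.5 GB
```

At 16-bit precision the weights of a 7B-parameter model would nearly fill the GPU before training even starts, whereas the 4-bit weights leave most of the memory free for activations and the adapter's training state.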

## Deliverables

Please submit the following artifacts for Week 7:

* **Synthetic Q\&A Dataset:** The JSONL file (`synthetic_qa.jsonl`) containing your generated question-answer pairs. This should be well-formatted (one JSON object per line) and ideally human-inspected for quality. We will spot-check a few entries for correctness and variety.
* **Fine-Tuning Code/Notebook:** Your Colab notebook (or Python script) used for fine-tuning. Ensure that the code is clean and commented where necessary. Include the crucial steps: data loading, model loading, training loop (or trainer usage), and evaluation. If you used the provided starter code with minimal changes, that’s fine – but note any modifications or tuning you did (e.g., different hyperparameters, or if you had to use a smaller model due to resource constraints).
* **Evaluation Results:** A brief report or output logs comparing the base vs. fine-tuned model on the 10 test questions. This could be a table or just a list of Q, base answer, fine-tuned answer, with perhaps a one-line commentary on which is better. Highlight any significant improvements or interesting differences. If the fine-tuned model still answered something incorrectly, you can mention that too. The goal is to demonstrate you tested the models and observed the effect of the fine-tune.

As always, ensure your submission is well-organized. The JSONL and code can be attached, and the evaluation comparison can be included in your write-up or as an output snippet from the notebook.



## Exploration Tips

If you have extra time or curiosity, consider these extensions to deepen your understanding:

* **Experiment with Data Formats:** We used a single-turn Q\&A format for fine-tuning. Try creating **multi-turn dialogues** or a **dialog-style QA** (e.g., a follow-up question that depends on the previous answer) to see if the model can handle more interactive conversation. Unsloth supports multi-turn conversations by merging single turns, so you could simulate a back-and-forth discussion about a paper.
* **Add Grounding to Prompts:** To push the model’s specificity, you can include **grounding information** in the prompts. For example, prepend the question with a reference to the paper section or figure: *“According to Section 3 of the paper, ...?”* and train the model to answer using that hint. Grounding the questions in specific sections might improve the model’s ability to pinpoint answers (and discourage it from using outside knowledge).
* **Compare Different Model Sizes:** We focused on LLaMA 3 7B for fine-tuning. You could try the same fine-tuning process on a **smaller model** (for instance, a 3B or 1B variant if available) and a **larger model** (if resources allow) to see how model capacity affects learning. Does the 7B significantly outperform a 3B after fine-tuning on the same data? This can be an interesting study in how model size and fine-tune data interact.
* **Extend the Dataset:** We generated 500 Q\&A pairs. For a more robust fine-tuning, one might scale this up. If you’re interested, you could generate more data (or use the fine-tuned model itself to generate new questions) and continue fine-tuning (taking care to maintain quality). There is even a tool by Meta called the *Synthetic Data Kit* for automating QA generation. More data could further improve the model, but watch out for diminishing returns or overfitting.
* **Monitor Training and Use Checkpoints:** During fine-tuning, consider saving intermediate checkpoints or enabling evaluation if you had a dev set. Although not required for this assignment, it’s good practice to monitor whether the model is genuinely learning or just overfitting. You could, for instance, hold out a few Q\&A pairs as a validation set and see if the model gets them right after each epoch.
* **Try Different Prompts Post-tuning:** After fine-tuning, test the model with various phrasings of questions – not just the ones you wrote. See if it can handle paraphrased queries or more open-ended questions in the academic domain. This will show if the fine-tuning improved the model’s generalization in the domain (not just memorization of the Q\&A pairs).
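For the validation idea mentioned above, holding out a few Q\&A pairs can be as simple as a seeded shuffle. This is a minimal sketch with a hypothetical helper `train_val_split`; the 10% fraction and the seed are arbitrary illustrative choices.

```python
import random


def train_val_split(records, val_fraction=0.1, seed=42):
    """Shuffle records reproducibly and hold out a fraction for validation."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Note that this splits individual Q\&A pairs, so questions about the same paper can land on both sides. Splitting by paper instead would be a stricter test of generalization.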

By exploring these ideas, you can gain a richer understanding of instruction tuning and how synthetic data can be used to tailor an LLM’s capabilities. Fine-tuning is a powerful technique – it **customizes the model’s behavior and knowledge** in ways that prompt-engineering alone often cannot. We hope this assignment gives you hands-on insight into that process, and we look forward to seeing your fine-tuned academic Q\&A agents in action!