# Synthetic Data Generation for LLM Fine-Tuning (Gemini 2.5 Flash with PDF Chunks)

This notebook provides a two-stage workflow for generating synthetic data. First, it preprocesses the PDF files in a `raw_data` directory: each file is split into smaller overlapping chunks, and both the chunks and a copy of the original file are written to a `chunked_data` directory. Second, it iterates through all files in `chunked_data` (both chunks and full files), uploading each one to the Gemini model named in `MODEL_ID` (`gemini-2.5-flash-lite` in the configuration cell) to generate synthetic input-output pairs.

### Workflow:
1.  **Setup**: Install and import necessary libraries.
2.  **Environment Configuration**: Load your Gemini API key from the `GEMINI_TOKEN` environment variable and configure the Gemini client.
3.  **Model and Path Configuration**: Set the model ID, the directory paths, and the chunking and batch parameters.
4.  **Preprocessing**: A dedicated function splits source PDFs and copies the original files into the `chunked_data` directory.
5.  **Data Generation**: The script iterates through the pre-made PDF chunks and the full PDFs, requesting the same number of pairs for each.
6.  **Save Output**: All generated data is grouped by the original source file and saved to CSV files in the `generated_data` directory.

## 1. Setup

In [1]:
import os
import glob
import pandas as pd
import time
import shutil
import re
from dotenv import load_dotenv
import google.generativeai as genai
import pypdf
import json

## 2. Environment Configuration

In [2]:
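load_dotenv()  # Read variables from a local .env file, if one exists
api_key = os.getenv("GEMINI_TOKEN")
assert api_key, "GEMINI_TOKEN is not set; add it to your environment or .env file"
genai.configure(api_key=api_key)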

## 3. Model and Path Configuration

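We define our parameters: the model ID, the directory paths, and the chunking settings. Chunk size and overlap are measured in pages, so each chunk holds up to 7 pages and shares 2 pages with the previous chunk.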

In [None]:
MODEL_ID = "gemini-2.5-flash-lite"
MODEL = genai.GenerativeModel(MODEL_ID)
RAW_DATA_DIR = "data_collection/raw_data/"
CHUNKED_DATA_DIR = "data_preparation_chunked_data/"
GENERATED_DATA_DIR = "data_generation/generated_data/"

CHUNK_SIZE = 7     # Pages per chunk
CHUNK_OVERLAP = 2  # Pages shared between consecutive chunks

# Batch generation settings
PAIRS_PER_ITERATION = 30  # Number of pairs to request in a single API call
NUM_ITERATIONS = 2        # Number of API calls to make for each file
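
To see what these settings imply, the page ranges can be computed without touching any PDF. The following is a minimal sketch: `chunk_page_ranges` is a hypothetical helper, not part of the notebook, and it assumes the same stepping rule as the preprocessing function in the next section (`step = chunk_size - overlap`). The 12-page document is a made-up example.

```python
def chunk_page_ranges(num_pages, chunk_size, overlap):
    """Return (start, end) page ranges, end exclusive, for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, step)]


# A hypothetical 12-page document with the settings above (7 pages, overlap 2).
ranges = chunk_page_ranges(12, 7, 2)
print(ranges)

# Pairs requested per uploaded file: 2 API calls of 30 pairs each.
pairs_requested_per_file = 2 * 30
print(pairs_requested_per_file)
```

For 12 pages this yields `(0, 7)`, `(5, 12)`, and `(10, 12)`. The last range lies entirely inside the previous one, so a short trailing chunk can repeat pages that are already covered.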

## 4. Preprocessing Step: Splitting and Copying PDFs

This function reads all PDFs from `raw_data`, saves smaller chunks to `chunked_data`, and also copies the original full PDF to `chunked_data`. Run this cell once to prepare your data.

In [4]:
def preprocess_and_chunk_pdfs(source_dir, dest_dir, chunk_size, overlap):
    """
    Reads all PDFs from a source directory, splits them into chunks,
    saves chunks to a destination directory, and copies the original file.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    os.makedirs(dest_dir, exist_ok=True)
        
    source_pdfs = glob.glob(os.path.join(source_dir, "*.pdf"))
    print(f"Found {len(source_pdfs)} PDFs in '{source_dir}' to preprocess.")

    for pdf_path in source_pdfs:
        original_basename = os.path.splitext(os.path.basename(pdf_path))[0]
        try:
            print(f"  - Chunking {os.path.basename(pdf_path)}...")
            pdf_reader = pypdf.PdfReader(pdf_path)
            num_pages = len(pdf_reader.pages)
            step = chunk_size - overlap
            
            for i in range(0, num_pages, step):
                pdf_writer = pypdf.PdfWriter()
                start_page = i
                end_page = min(i + chunk_size, num_pages)
                
                for page_num in range(start_page, end_page):
                    pdf_writer.add_page(pdf_reader.pages[page_num])
                
                chunk_index = i // step + 1
                chunk_filename = f"{original_basename}_chunk_{chunk_index}.pdf"
                chunk_path = os.path.join(dest_dir, chunk_filename)
                
                with open(chunk_path, 'wb') as out_pdf:
                    pdf_writer.write(out_pdf)
            print("    ... completed chunking.")
        except Exception as e:
            print(f"    - Error splitting PDF {pdf_path}: {e}")
            
        try:
            print(f"  - Copying original file {os.path.basename(pdf_path)}...")
            shutil.copy(pdf_path, dest_dir)
            print("    ... completed copying.")
        except Exception as e:
            print(f"    - Error copying original file {pdf_path}: {e}")

preprocess_and_chunk_pdfs(RAW_DATA_DIR, CHUNKED_DATA_DIR, CHUNK_SIZE, CHUNK_OVERLAP)

Found 4 PDFs in 'raw_data/' to preprocess.
  - Chunking 3.1 Software Security 2.pdf...
    ... completed chunking.
  - Copying original file 3.1 Software Security 2.pdf...
    ... completed copying.
  - Chunking 4.1 Software Security 3.pdf...
    ... completed chunking.
  - Copying original file 4.1 Software Security 3.pdf...
    ... completed copying.
  - Chunking 5.1 Operating System Security 1.pdf...
    ... completed chunking.
  - Copying original file 5.1 Operating System Security 1.pdf...
    ... completed copying.
  - Chunking 6.1 Operating System Security 2.pdf...
    ... completed chunking.
  - Copying original file 6.1 Operating System Security 2.pdf...
    ... completed copying.
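
The chunks written above are named `{original_basename}_chunk_{index}.pdf`, and the unchunked copy keeps its original name. Plain lexicographic sorting places `_chunk_10` before `_chunk_2`, so the sketch below orders files numerically instead. `chunk_sort_key` is a hypothetical helper, and the file names are made-up examples.

```python
import re

_CHUNK_RE = re.compile(r"_chunk_(\d+)\.pdf$")


def chunk_sort_key(filename):
    """Sort key: the full document first (index 0), then chunks by number."""
    match = _CHUNK_RE.search(filename)
    return int(match.group(1)) if match else 0


# Made-up file names following the notebook's naming pattern.
names = ["report_chunk_10.pdf", "report_chunk_2.pdf", "report.pdf", "report_chunk_1.pdf"]
ordered = sorted(names, key=chunk_sort_key)
print(ordered)
```

Numeric ordering only makes the progress log easier to follow; each file is still processed independently.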


## 5. Data Generation Step

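This is the main generation loop. For each original PDF, it collects the matching files in `chunked_data` (the chunks and the full document), uploads each file once, and requests `PAIRS_PER_ITERATION` pairs in each of `NUM_ITERATIONS` API calls, retrying a failed call up to three times. The combined results are saved to one CSV file per original document in the `generated_data` directory.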
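
The generation cell expects each reply to be JSON with a top-level `results` list, possibly wrapped in a Markdown code fence. As a sketch of a tolerant parser, the hypothetical helper `extract_pairs` below strips an optional fence, parses the JSON, and keeps only pairs with a non-empty `input` and `output`. The sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly so it is not typed literally

# Matches an optional opening fence (with or without a "json" tag) and a closing fence.
_FENCE_RE = re.compile(r"^" + FENCE + r"(?:json)?\s*|\s*" + FENCE + r"$")


def extract_pairs(response_text):
    """Parse a model reply and return a list of {'input', 'output'} dicts."""
    json_text = _FENCE_RE.sub("", response_text.strip())
    results = json.loads(json_text).get("results", [])
    pairs = []
    for item in results:
        prompt = str(item.get("input") or "").strip()
        answer = str(item.get("output") or "").strip()
        if prompt and answer:  # Skip pairs with a missing or empty side
            pairs.append({"input": prompt, "output": answer})
    return pairs


# Made-up reply: one valid pair and one pair with an empty output.
sample = (
    FENCE + "json\n"
    '{"results": [{"input": "What is a stack canary?", "output": "A guard value."},'
    ' {"input": "Empty answer", "output": ""}]}\n'
    + FENCE
)
pairs = extract_pairs(sample)
print(pairs)
```

A truncated or badly escaped reply still raises `json.JSONDecodeError`, so the retry loop in the cell below remains necessary.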

In [None]:
original_pdf_files = glob.glob(os.path.join(RAW_DATA_DIR, "*.pdf"))
original_basenames = [os.path.splitext(os.path.basename(f))[0] for f in original_pdf_files]

print(f"Found {len(original_basenames)} original documents to process for generation.")

for basename in original_basenames:
    print(f"\n--- Generating data for original file: {basename}.pdf ---")
    all_pairs = []
    
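    # glob.escape keeps characters such as '[' in a file name from being read as wildcards
    files_to_process = sorted(glob.glob(os.path.join(CHUNKED_DATA_DIR, f"{glob.escape(basename)}*.pdf")))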

    if not files_to_process:
        print(f"  - No files found for {basename}. Skipping.")
        continue

    for i, file_path in enumerate(files_to_process):
        print(f"  - Processing file {i+1}/{len(files_to_process)}: {os.path.basename(file_path)}")
        uploaded_file = None
        try:
            uploaded_file = genai.upload_file(path=file_path, display_name=os.path.basename(file_path))
            
            # Loop for the number of API calls
            for iteration_num in range(NUM_ITERATIONS):
                retries = 3
                for attempt in range(retries):  # Retry mechanism
                    error_data = None
                    try:
                        print(f"    - Starting iteration {iteration_num + 1}/{NUM_ITERATIONS} (requesting {PAIRS_PER_ITERATION} pairs)... ")
                        
                        prompt = r"""
You are an expert data scientist tasked with creating a high-quality dataset for instruction-tuning a large language model.

Your primary goal is to generate {0} distinct and high-quality instruction-response pairs from the provided document.

## Key Instructions
1.  **Instruction (`input`):** Create a clear and specific prompt a user would ask. Vary the tasks: include direct questions, summarization requests, comparisons between concepts, and analytical prompts that require reasoning. Ensure a range of complexity.
2.  **Response (`output`):** Write a detailed, accurate, and direct answer as if you are an expert on the subject. The response must fully and exclusively satisfy the user's instruction.
3.  **Source Grounding:** Base all responses **exclusively** on the information within the provided document.
4.  **No Self-Reference:** Do not mention the source document in your responses. Avoid any phrases like "According to the document..." or "The provided text states...".

## Formatting Requirements
* Your entire output **must** be a single, valid JSON object. Do not include any text, explanations, or markdown formatting before or after the JSON structure.
* **CRITICAL:** All strings within the JSON must be properly escaped to ensure the output is parsable.
    * All backslashes (`\`) must be escaped as (`\\`). For example, the text `\x48` must be written in the JSON string as `\\x48`, and `\n` must be written as `\\n`.
    * All line breaks inside a string must be written as the escape sequence (`\n`), never as a raw newline. For example, a string containing `hi` followed by a line break must be written as `hi\n`.
    * All double quotes (`"`) must be escaped as (`\"`). For example, the text `he said "hello"` must be written as `he said \"hello\"` and `"` must be written as `\"`.
    * Do not escape any other characters as those are invalid characters. For example, the text `this 'item'` must be written as `this 'item'` and `printf("%d\n", 5);` must be written as `printf(\"%d\\n\", 5);`.

Use the following JSON structure:
```json
{{
  "results": [
    {{
      "input": "Your first generated input here.",
      "output": "Your first generated output here."
    }},
    {{
      "input": "Your second generated input here.",
      "output": "Your second generated output here."
    }},
    ... and so on for all {0} pairs. 
  ]
}}
```
"""
                        prompt = prompt.format(PAIRS_PER_ITERATION)
                        response = MODEL.generate_content([prompt, uploaded_file], generation_config={"temperature": 0.7, "top_p": 0.95})
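                        response_part = response.text
                        error_data = response_part
                        # Strip an optional Markdown code fence instead of slicing fixed offsets
                        json_text = re.sub(r"^```(?:json)?\s*|\s*```$", "", response_part.strip())
                        results = json.loads(json_text)["results"]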
                        print(f"      - Received {len(results)} pairs in response.")

                        for result in results:
                            if result.get("input") and result.get("output"):
                                all_pairs.append({
                                    "input": result["input"].strip(),
                                    "output": result["output"].strip()
                                })
                            else:
                                print("      Warning: Generated empty input or output for a pair.")
                        print(f"      - Total of {len(all_pairs)} pairs generated.")
                        break
                    except Exception as e:
                        print(f"      An error occurred during iteration {iteration_num + 1}: {e}")
                        print(f"      Error data: {error_data}")
                        if attempt < retries - 1:
                            time.sleep(2 ** attempt)  # Brief backoff before the next attempt

        except Exception as e:
            print(f"      An error occurred during generation: {e}")
        finally:
            if uploaded_file:
                genai.delete_file(uploaded_file.name)
    
    if all_pairs:
        os.makedirs(GENERATED_DATA_DIR, exist_ok=True)
        output_filename = f"{basename}_1.csv"
        output_path = os.path.join(GENERATED_DATA_DIR, output_filename)
        df = pd.DataFrame(all_pairs)
        df.to_csv(output_path, index=False)
        print(f"\nSuccessfully generated a total of {len(all_pairs)} pairs for {basename}.pdf and saved to {output_path}")
    else:
        print(f"\nNo data was generated for the file {basename}.pdf.")

print("\n--- All files processed. ---")

Found 4 original documents to process for generation.

--- Generating data for original file: 3.1 Software Security 2.pdf ---
  - Processing file 1/9: 3.1 Software Security 2.pdf
    - Starting iteration 1/2 (requesting 30 pairs)... 
      - Received 27 pairs in response.
      - Total of 27 pairs generated.
    - Starting iteration 2/2 (requesting 30 pairs)... 
      An error occurred during iteration 2: Expecting ',' delimiter: line 106 column 570 (char 20988)
      Error data: ```json
{
  "results": [
    {
      "input": "Explain the concept of format string vulnerabilities in C programming.",
      "output": "Format string vulnerabilities arise when a program uses a format string that can be controlled by user input. The `printf` function, for instance, interprets format specifiers like `%s`, `%d`, and `%x` as instructions to read data from the stack. If an attacker can manipulate the format string, they can cause `printf` to read arbitrary data from the stack, potentially reveali