# Synthetic Data Generation for LLM Fine-Tuning (Gemini 2.5 Flash with PDF Chunks)

This notebook provides a two-stage workflow for generating synthetic data. First, it preprocesses PDF files from the `raw_data` directory by splitting them into smaller, overlapping chunks and copying each original file into the `chunked_data` directory. Second, it iterates through all files in `chunked_data` (both chunks and full files), uploading each one to the Gemini model named by `MODEL_ID` (for example, `gemini-2.5-flash`) to generate synthetic input-output pairs.

### Workflow:
1.  **Setup**: Install and import necessary libraries.
2.  **Environment Configuration**: Load your Gemini API key and the other settings from a `.env` file into environment variables.
3.  **Model and Path Configuration**: Configure directories, the Gemini client, and model parameters.
4.  **Preprocessing**: A dedicated function splits source PDFs and copies the original files into the `chunked_data` directory.
5.  **Data Generation**: The script iterates through the pre-made PDF chunks and the full PDFs, requesting the same number of pairs for each.
6.  **Save Output**: Generated pairs are saved as CSV files in the `generated_data` directory, with one subdirectory per processed file and one CSV per iteration.

## 1. Setup

In [None]:
import os
import glob
import pandas as pd
import time
import shutil
import re
from dotenv import load_dotenv
from google import genai
import pypdf
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

## 2. Environment Configuration

In [None]:
load_dotenv()

## 3. Model and Path Configuration

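All parameters are read from environment variables: the model ID (`MODEL_ID`), the API key (`GEMINI_TOKEN`), the directory paths, the chunking settings (`CHUNK_SIZE` and `CHUNK_OVERLAP`, both in pages), and the generation settings (concurrency, pairs per iteration, number of iterations, and retries). Every numeric variable must be set, because `int(os.getenv(...))` raises a `TypeError` when the value is missing.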
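A missing variable can be reported more clearly by reading the numeric settings through a small helper. The sketch below is one possible approach, not part of the notebook: `get_int_env` is a hypothetical helper, and the `DEMO_*` variable names and values are made up for illustration.

```python
import os

def get_int_env(name, default=None):
    """Read an integer setting from the environment (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        if default is None:
            raise RuntimeError(f"Environment variable {name!r} is not set")
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name!r} must be an integer, got {raw!r}") from exc

# Illustrative values only; the notebook reads its real settings from a .env file.
os.environ["DEMO_CHUNK_SIZE"] = "20"
os.environ.pop("DEMO_CHUNK_OVERLAP", None)

demo_chunk_size = get_int_env("DEMO_CHUNK_SIZE")
demo_chunk_overlap = get_int_env("DEMO_CHUNK_OVERLAP", default=2)
print(demo_chunk_size, demo_chunk_overlap)
```

With a helper like this, an unset variable fails with a message that names the variable instead of a bare `TypeError`.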

In [None]:
MODEL_ID = os.getenv('MODEL_ID')
MODEL = genai.Client(api_key=os.getenv("GEMINI_TOKEN"))
RAW_DATA_DIR = os.getenv('RAW_DATA_DIR')
CHUNKED_DATA_DIR = os.getenv('CHUNKED_DATA_DIR')
GENERATED_DATA_DIR = os.getenv('GENERATED_DATA_DIR')

CHUNK_SIZE = int(os.getenv('CHUNK_SIZE'))
CHUNK_OVERLAP = int(os.getenv('CHUNK_OVERLAP'))

MAX_CONCURRENCY = int(os.getenv('MAX_CONCURRENCY'))
PAIRS_PER_ITERATION = int(os.getenv('PAIRS_PER_ITERATION'))
NUM_ITERATIONS = int(os.getenv('NUM_ITERATIONS'))
NUM_RETRIES = int(os.getenv('NUM_RETRIES'))

## 4. Preprocessing Step: Splitting and Copying PDFs

This function reads all PDFs from `raw_data`, saves smaller chunks to `chunked_data`, and also copies the original full PDF to `chunked_data`. Run this cell once to prepare your data.
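To see which pages end up in which chunk, the page arithmetic can be tried on its own. The sketch below mirrors the loop in the function that follows (start every `chunk_size - overlap` pages, end at `start + chunk_size` or the last page); `chunk_page_ranges` is a hypothetical helper and the numbers are illustrative.

```python
def chunk_page_ranges(num_pages, chunk_size, overlap):
    """Return (start, end) page index pairs, end exclusive, one per chunk."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be greater than overlap")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, step)
    ]

ranges = chunk_page_ranges(num_pages=10, chunk_size=4, overlap=1)
print(ranges)  # [(0, 4), (3, 7), (6, 10), (9, 10)]
```

Note that the last chunk can be a short tail whose pages are already covered by the previous chunk, as with `(9, 10)` here.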

In [None]:
def preprocess_and_chunk_pdfs(source_dir=RAW_DATA_DIR,
                              dest_dir=CHUNKED_DATA_DIR,
                              chunk_size=CHUNK_SIZE,
                              overlap=CHUNK_OVERLAP,
                              stage_key='PREPROCESSING'):
    
    """
    Reads all PDFs from a source directory, splits them into chunks,
    saves chunks to a destination directory, and copies the original file.
    """

    os.makedirs(dest_dir, exist_ok=True)

    # A non-positive step would make range() fail or never advance.
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be greater than overlap")

    source_pdfs = glob.glob(os.path.join(source_dir, "*.pdf"))
    print(f"[{stage_key}] Found {len(source_pdfs)} PDFs in '{source_dir}'\n")

    for pdf_path in source_pdfs:
        pdf_name = os.path.basename(pdf_path)
        # splitext removes only the trailing extension, unlike str.replace.
        original_basename = os.path.splitext(pdf_name)[0]
        try:
            print(f"[{stage_key}] Starting to chunk {pdf_name}")
            pdf_reader = pypdf.PdfReader(pdf_path)
            num_pages = len(pdf_reader.pages)

            for i in range(0, num_pages, step):
                pdf_writer = pypdf.PdfWriter()
                end_page = min(i + chunk_size, num_pages)

                for page_num in range(i, end_page):
                    pdf_writer.add_page(pdf_reader.pages[page_num])

                chunk_filename = f"{original_basename}_chunk_{(i // step) + 1}.pdf"
                chunk_path = os.path.join(dest_dir, chunk_filename)

                with open(chunk_path, 'wb') as out_pdf:
                    pdf_writer.write(out_pdf)

            # Copy the full original next to its chunks so it is processed as well.
            shutil.copy2(pdf_path, os.path.join(dest_dir, pdf_name))
            print(f"[{stage_key}] Successfully chunked {pdf_name}\n")
        except Exception as e:
            print(f"[{stage_key}] Error chunking {pdf_path} due to {e}\n")

preprocess_and_chunk_pdfs()

## 5. Data Generation Step

This is the main generation loop. It collects all files in `chunked_data` (both chunks and full documents) that belong to a PDF in `raw_data`, processes them in parallel with a thread pool, and requests `PAIRS_PER_ITERATION` pairs from the model `NUM_ITERATIONS` times per file. Each iteration's pairs are saved to a CSV file in a per-file subdirectory of `generated_data`.
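Each model reply is expected to be a JSON object with a `results` list, and it may arrive wrapped in a Markdown code fence. A minimal sketch of turning such a reply into clean pairs is shown below. `parse_pairs` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker

def parse_pairs(reply_text):
    """Extract non-empty input/output pairs from a model reply (hypothetical helper)."""
    text = reply_text.strip()
    # Drop an optional leading fence (with or without a "json" tag) and a trailing fence.
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    results = json.loads(text)["results"]
    return [
        {"input": r["input"].strip(), "output": r["output"].strip()}
        for r in results
        if r.get("input") and r.get("output")
    ]

sample_reply = (
    FENCE + "json\n"
    '{"results": [{"input": " What is X? ", "output": "X is Y."},'
    ' {"input": "", "output": "dropped"}]}\n'
    + FENCE
)
pairs = parse_pairs(sample_reply)
print(pairs)
```

Matching the fence with a pattern rather than slicing fixed character offsets keeps the parser working when the reply has no fence at all.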

In [None]:
DATA_GENERATION_PROMPT = r"""
You are an expert data scientist tasked with creating a high-quality dataset for instruction-tuning a large language model.

Your primary goal is to generate {0} distinct and high-quality instruction-response pairs from the provided document.

## Key Instructions
1.  **Instruction (`input`):** Create a clear and specific prompt a user would ask. Vary the tasks: include direct questions, summarization requests, comparisons between concepts, and analytical prompts that require reasoning. Ensure a range of complexity.
2.  **Response (`output`):** Write a detailed, accurate, and direct answer as if you are an expert on the subject. The response must fully and exclusively satisfy the user's instruction, and respond in a complete and factual manner.
3.  **Source Grounding:** Base all responses **exclusively** on the information within the provided document.
4.  **No Self-Reference:** Do not mention the source document in your responses. Avoid any phrases like "According to the document..." or "The provided text states...".

## Formatting Requirements
* Your entire output **must** be a single, valid JSON object. Do not include any text, explanations, or markdown formatting before or after the JSON structure.
* **CRITICAL:** All strings within the JSON must be properly escaped to ensure the output is parsable.
    * All backslashes (`\`) must be escaped as (`\\`). For example, the text `\x48` must be written in the JSON string as `\\x48`, and `\n` must be written as `\\n`.
    * All newlines (`\n`) must be escaped as (`\\n`). For example, the text `hi\n` must be written in the JSON string as `hi\\n`.
    * All double quotes (`"`) must be escaped as (`\"`). For example, the text `he said "hello"` must be written as `he said \"hello\"` and `"` must be written as `\"`.
    * Do not escape any other characters as those are invalid characters. For example, the text `this 'item'` must be written as `this 'item'` and `printf("%d\n", 5);` must be written as `printf(\"%d\\n\", 5);`.

Use the following JSON structure:
```json
{{
  "results": [
    {{
      "input": "Your first generated input here.",
      "output": "Your first generated output here."
    }},
    {{
      "input": "Your second generated input here.",
      "output": "Your second generated output here."
    }},
    ... and so on for all {0} pairs.
  ]
}}
```
"""

In [None]:
def get_base_name(file_name):
    return re.sub(r'\.[^.]+$', '', os.path.basename(file_name))

def generate_data_for_chunk(file_path,
                            model=MODEL,
                            model_id=MODEL_ID,
                            dest_dir=GENERATED_DATA_DIR,
                            data_generation_prompt=DATA_GENERATION_PROMPT,
                            iterations=NUM_ITERATIONS,
                            pairs_per_iteration=PAIRS_PER_ITERATION,
                            retries=NUM_RETRIES,
                            stage_key='DATA_GENERATION'):

    full_file_name = os.path.basename(file_path)
    base_file_name = get_base_name(full_file_name)
    trgt_dir = os.path.join(dest_dir, base_file_name)

    # Creates dest_dir as well if it does not exist yet.
    os.makedirs(trgt_dir, exist_ok=True)

    # Defined before the try block so the finally clause is safe if the upload fails.
    uploaded_file = None
    full_df = pd.DataFrame()
    prompt = data_generation_prompt.format(pairs_per_iteration)

    try:
        uploaded_file = model.files.upload(file=file_path, config={'display_name': full_file_name})
        print(f"[{stage_key}] Successfully uploaded '{full_file_name}'")

        for iteration_num in range(1, iterations + 1):
            print(f"[{stage_key}] Starting data generation ({iteration_num}/{iterations}) for '{full_file_name}'")
            for attempt in range(1, retries + 1):
                # Built with an f-string; concatenating str and int raises a TypeError.
                attempt_note = f" (attempt {attempt}/{retries})" if attempt > 1 else ""
                try:
                    response = model.models.generate_content(
                        model=model_id,
                        contents=[prompt, uploaded_file],
                        config={"temperature": 0.7, "top_p": 0.95}
                    )
                    # Strip an optional Markdown code fence instead of slicing fixed offsets.
                    response_text = re.sub(r'^```(?:json)?\s*|\s*```$', '', response.text.strip())
                    results = json.loads(response_text)["results"]
                    print(f"[{stage_key}] Received {len(results)} pairs in response ({iteration_num}/{iterations}) for '{full_file_name}'{attempt_note}")

                    # Keep only pairs where both fields are non-empty.
                    generated_pairs = [{
                        "input": result["input"].strip(),
                        "output": result["output"].strip()
                    } for result in results if result.get('input') and result.get('output')]

                    df = pd.DataFrame(generated_pairs)
                    # pd.concat returns a new frame, so the result must be assigned.
                    full_df = pd.concat([full_df, df], ignore_index=True)
                    df.to_csv(os.path.join(trgt_dir, f'{base_file_name}_{iteration_num}.csv'), index=False)
                    print(f"[{stage_key}] Cumulative total of {len(full_df)} pairs generated for '{full_file_name}' after generation ({iteration_num}/{iterations}){attempt_note}")
                    break
                except Exception as e:
                    print(f"[{stage_key}] Data generation ({iteration_num}/{iterations}){attempt_note} for '{full_file_name}' failed due to {e}")
    except Exception as e:
        print(f"[{stage_key}] Data generation for '{full_file_name}' failed due to {e}")
    finally:
        if uploaded_file:
            model.files.delete(name=uploaded_file.name)

    # Always return a DataFrame (possibly empty) so callers can concatenate safely.
    return full_df
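The retry loop above tries again immediately after a failure. If failures come from rate limits or transient network errors, pausing between attempts may help. The sketch below shows one way to add exponential backoff; `call_with_retries` is a hypothetical helper, the delay values are assumptions, and `flaky` is a stand-in for the API call.

```python
import time

def call_with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `retries` times, doubling the pause after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose, mirroring the notebook
            last_error = exc
            if attempt < retries:
                sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"All {retries} attempts failed") from last_error

# Stand-in for the API call: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("temporary failure")
    return "ok"

# Record the delays instead of sleeping so the demo runs instantly.
delays = []
result = call_with_retries(flaky, retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.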

In [None]:
def generate_data_for_chunks(raw_dir=RAW_DATA_DIR,
                             chunked_dir=CHUNKED_DATA_DIR,
                             dest_dir=GENERATED_DATA_DIR,
                             stage_key='DATA_GENERATION',
                             max_workers=MAX_CONCURRENCY):
    
    original_pdf_files = glob.glob(os.path.join(raw_dir, "*.pdf"))

    os.makedirs(dest_dir, exist_ok=True)

    print(f"[{stage_key}] Found {len(original_pdf_files)} original documents to process for generation\n")

    all_files_to_process = []

    for original_pdf_file in original_pdf_files:
        base_name = get_base_name(original_pdf_file)
        files_to_process = sorted(glob.glob(os.path.join(chunked_dir, f"{base_name}*.pdf")))

        if not files_to_process:
            print(f"[{stage_key}] No files found for '{base_name}'")
            continue

        print(f"[{stage_key}] Queued {len(files_to_process)} files for '{base_name}'")
        # The prefix glob can also match files of another document whose name
        # starts with the same prefix, so skip anything already queued.
        all_files_to_process.extend(f for f in files_to_process if f not in all_files_to_process)

    full_df = pd.DataFrame()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(generate_data_for_chunk, file_path, dest_dir=dest_dir): file_path
            for file_path in all_files_to_process
        }

        for future in as_completed(futures):
            file_path = futures[future]
            try:
                result_df = future.result()
            except Exception as e:
                print(f"[{stage_key}] Worker for '{os.path.basename(file_path)}' failed due to {e}")
                continue
            # pd.concat returns a new frame, so the result must be assigned.
            if result_df is not None and not result_df.empty:
                full_df = pd.concat([full_df, result_df], ignore_index=True)
            print(f"[{stage_key}] Cumulative total of {len(full_df)} pairs generated")


generate_data_for_chunks()
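The fan-out pattern used above can be tried in isolation, without any API calls. The sketch below submits one task per item to a thread pool and collects results and errors as the tasks finish. `run_all` and `fake_worker` are hypothetical names, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(worker, items, max_workers=4):
    """Run worker(item) concurrently; return (results, errors), both keyed by item."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                # result() re-raises any exception raised inside the worker.
                results[item] = future.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors

def fake_worker(name):
    """Stand-in for the per-file generation call."""
    if name == "bad.pdf":
        raise ValueError("upload failed")
    return len(name)

results, errors = run_all(fake_worker, ["a.pdf", "bad.pdf", "report.pdf"])
print(results, list(errors))
```

Catching exceptions per future means one failed file does not stop the remaining files from being collected.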
