# User Queries Generation Script

In this project, we generate user queries for each advertisement by calling the GPT-4o API. Because the dataset is large (more than 4,000 ads) and API calls can be slow, rate-limited, or prone to occasional errors, the script uses a checkpointing mechanism so that progress is not lost.

This script generates three types of user search queries (specific, vague, and conversational) for each product in the ad dataset using GPT-4o. It saves progress after each ad, so if the run is interrupted or moved to another machine, it can resume where it left off from the checkpoint file.

## Files You Need

| File                           | Purpose                                       |
|--------------------------------|-----------------------------------------------|
| `generated_ad_dataset.csv`     | The ad dataset to generate queries from       |
| `user_queries_checkpoint.json` | Stores saved progress (so you don’t restart)  |
| `user_query_generation.py`     | The script that generates the user queries    |

Run the script. It will:
- Load the dataset
- Check whether `user_queries_checkpoint.json` exists
  - If yes, skip all ads that were already processed
  - If no, start from scratch
- Call GPT-4o to generate 3 user queries per ad
- Save results to the checkpoint file after every ad
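
The steps above can be sketched as a small, self-contained pattern. This is an illustration rather than the script itself: `load_checkpoint`, `save_checkpoint`, `run`, and the `generate` callback are hypothetical names, and the temp-file-then-rename save is an assumption added here so that an interrupt during a write should not leave a truncated checkpoint (the script below writes the file directly).

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return (records, completed_indices); both empty if no checkpoint exists."""
    if not os.path.exists(path):
        return [], set()
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return records, {r["ad_index"] for r in records}


def save_checkpoint(path, records):
    """Write to a temp file in the same folder, then swap it into place."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    os.replace(tmp_path, path)


def run(ads, path, generate):
    """Process ads in order, skipping indices already in the checkpoint."""
    records, done = load_checkpoint(path)
    for idx, ad in enumerate(ads):
        if idx in done:
            continue  # already processed in an earlier run
        # `generate` stands in for the GPT-4o call and returns {query_type: query}
        for qtype, query in generate(ad).items():
            records.append({
                "query": query,
                "query_type": qtype,
                "ad_product": ad,
                "ad_index": idx,
            })
        save_checkpoint(path, records)  # save after every ad
    return records
```

Calling `run` a second time with the same checkpoint path only invokes `generate` for ads whose index is not yet in the file.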

## Resume Query Generation from Checkpoint

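1. **Ensure these 3 files are in the same folder:**
   - `user_queries_checkpoint.json`
   - `generated_ad_dataset.csv`
   - The query generation script (`user_query_generation.py`)
   - Note: as written, the script reads the dataset from its GitHub URL, so the local CSV is only used if you point `pd.read_csv` at it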

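2. **Make sure the dataset hasn’t changed**  
   - Same row order and content as before; the checkpoint identifies finished ads by row index (`ad_index`), so reordered or edited rows would be skipped or mismatched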

3. **Run the script again**  
   - It will auto-resume from where it left off using the checkpoint
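
One way to check step 2 before resuming is to compare the product name stored in each checkpoint entry with the dataset row at the same index. The helper below is a minimal sketch, not part of the script; `find_mismatches` is a hypothetical name, and it assumes the checkpoint entries carry the `ad_index` and `ad_product` fields that the script writes.

```python
def find_mismatches(checkpoint_entries, products):
    """Return checkpoint entries whose product no longer matches the dataset.

    `products` is the dataset's product column as a plain list, in row order.
    """
    mismatches = []
    for entry in checkpoint_entries:
        idx = entry["ad_index"]
        # Flag entries that point past the end of the dataset or at a different product
        if idx >= len(products) or products[idx] != entry["ad_product"]:
            mismatches.append(entry)
    return mismatches
```

With the notebook's variables this could be called as `find_mismatches(query_data, df["product"].tolist())`; an empty list suggests the row order is unchanged, at least as far as product names go.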

## Installation & Import

In [9]:
!pip install openai pandas



In [10]:
import json
import os
import time

import pandas as pd
from openai import OpenAI

## User Query Generation Implementation

In [12]:
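# Prefer the OPENAI_API_KEY environment variable so the key is not saved in the notebook;
# otherwise replace the placeholder below.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "Enter your key here")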

In [13]:
# Load ad dataset
df = pd.read_csv("https://raw.githubusercontent.com/m1chae11u/llm-ad-integration/main/generated_ad_dataset.csv")

In [14]:
import ast  # used below to parse the stringified benefits list safely

client = OpenAI(api_key=OPENAI_API_KEY)

# Checkpoint file
CHECKPOINT_FILE = "user_queries_checkpoint.json"

# Load the previous checkpoint if one exists
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
        query_data = json.load(f)
    completed_indices = {entry["ad_index"] for entry in query_data}
    print(f"Loaded {len(query_data)} previous queries. Resuming...")
else:
    query_data = []
    completed_indices = set()
    print("Starting from scratch...")

# Loop through remaining ads (df was loaded in the previous cell)
for idx, row in df.iterrows():
    if idx in completed_indices:
        continue  # skip already processed ads

    product = row["product"]
    description = row["ad_description"]
    # ad_benefits is stored as a stringified Python list; literal_eval parses it
    # without executing arbitrary code (unlike eval)
    benefits = ast.literal_eval(row["ad_benefits"]) if isinstance(row["ad_benefits"], str) else []

    prompt = (
        f"You're helping to test an AI advertisement system.\n\n"
        f"Product: {product}\n"
        f"Description: {description}\n"
        f"Benefits: {', '.join(benefits)}\n\n"
        f"Generate 3 types of user search queries that this product would be a good fit for:\n"
        f"1. A 'specific' query\n"
        f"2. A 'vague' query\n"
        f"3. A 'conversational' query\n\n"
        f"Return only a JSON object like this:\n"
        f'{{"specific": "...", "vague": "...", "conversational": "..."}}\n'
        f"Please do NOT include triple backticks or markdown formatting."
    )

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=300
        )
        
        content = response.choices[0].message.content.strip()
        
        # GPT sometimes wraps the JSON in a Markdown code fence (with or without a
        # "json" tag) despite the instruction above, so strip the fence before parsing.
        if content.startswith("```"):
            content = content[3:].strip()
            if content.startswith("json"):
                content = content[4:].strip()
        if content.endswith("```"):
            content = content[:-3].strip()

        # Safety check before parsing
        if not content:
            raise ValueError("Empty response from GPT")

        try:
            queries = json.loads(content)
        except json.JSONDecodeError:
            print(f"GPT returned malformed JSON on ad #{idx}. Content:\n{content}\nSkipping...")
            continue

        print(f"\nAd #{idx} | Product: {product}")
        for qtype, query in queries.items():
            print(f"  ➤ [{qtype}] {query}")
            query_data.append({
                "query": query,
                "query_type": qtype,
                "ad_product": product,
                "ad_index": idx
            })

        # Save checkpoint after each successful ad
        with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
            json.dump(query_data, f, indent=2, ensure_ascii=False)

    except Exception as e:
        print(f"Error on ad #{idx}: {e}")
        time.sleep(5)
        continue

    time.sleep(1)  # Rate limit buffer

print(f"\nSaved {len(query_data)} queries to '{CHECKPOINT_FILE}'")

Loaded 5322 previous queries. Resuming...

Ad #1774 | Product: ComfyCloud Recliner

Ad #1774 | Product: ComfyCloud Recliner
  ➤ [specific] best ergonomic recliner with adjustable settings
  ➤ [vague] comfortable furniture for living room
  ➤ [conversational] I'm looking for a recliner that's both stylish and supportive, any recommendations?

Ad #1775 | Product: NaturaDry Dryer

Ad #1775 | Product: NaturaDry Dryer
  ➤ [specific] energy-efficient dryer with smart sensors for optimal drying
  ➤ [vague] best quiet dryer for home
  ➤ [conversational] I'm looking for a reliable and eco-friendly dryer that doesn't make much noise. Any recommendations?

Ad #1776 | Product: SonicWash Shower Head

Ad #1776 | Product: SonicWash Shower Head
  ➤ [specific] high-pressure eco-friendly shower head
  ➤ [vague] best shower head for home use
  ➤ [conversational] What's a good shower head that feels like a spa and saves water?

Ad #1777 | Product: LuxeLight LED Strip

Ad #1777 | Product: LuxeLight LED Str

KeyboardInterrupt: 
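
The fence-stripping and JSON-parsing step in the loop above could also be factored into a helper that checks the reply actually contains all three query types. The sketch below is one possible shape for that, not the script's own code; `parse_queries` and `EXPECTED_TYPES` are hypothetical names, and rejecting replies with a missing type is an assumption about the desired behavior.

```python
import json

EXPECTED_TYPES = ("specific", "vague", "conversational")
FENCE = "`" * 3  # Markdown code-fence marker


def parse_queries(content):
    """Parse a model reply into a dict holding exactly the three expected query types."""
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-3]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [t for t in EXPECTED_TYPES if t not in data]
    if missing:
        raise ValueError(f"missing query types: {missing}")
    return {t: str(data[t]) for t in EXPECTED_TYPES}
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to skip an ad whose reply is malformed or incomplete.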

In [None]:
# --- Final Check: Ensure all ads have queries ---
all_ad_indices = set(df.index)
queried_ad_indices = {entry["ad_index"] for entry in query_data}

missing_ads = all_ad_indices - queried_ad_indices

if not missing_ads:
    print("\nAll ads have user queries!!!")

    # Save final output
    with open("user_queries.json", "w", encoding="utf-8") as f:
        json.dump(query_data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(query_data)} queries to user_queries.json")

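    # Auto-download the file when running in Google Colab; skip elsewhere
    try:
        from google.colab import files
        files.download("user_queries.json")
    except ImportError:
        print("Not running in Colab; skipping auto-download.")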

else:
    print(f"\n{len(missing_ads)} ads are missing user queries.")
    print("Missing ad indices:", sorted(missing_ads))
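
The final check above confirms that every ad has at least one query. A stricter variant, sketched here under the assumption that each ad should end up with all three query types, reports which types are missing per ad. `incomplete_ads` and `EXPECTED_TYPES` are hypothetical names, not part of the script.

```python
from collections import defaultdict

EXPECTED_TYPES = {"specific", "vague", "conversational"}


def incomplete_ads(records):
    """Map each ad_index to the set of query types it is still missing."""
    seen = defaultdict(set)
    for r in records:
        seen[r["ad_index"]].add(r["query_type"])
    return {
        idx: EXPECTED_TYPES - types
        for idx, types in seen.items()
        if EXPECTED_TYPES - types
    }
```

Running `incomplete_ads(query_data)` returns an empty dict when every ad that appears in the checkpoint has all three types. Ads with no entries at all are not listed, so it complements the `missing_ads` check rather than replacing it.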