# User Queries Generation Script

This pipeline uses GPT-4o to:
- Dynamically categorize advertisements into domains and subdomains
- Generate domain-specific “vague” user queries for each ad (LLM-style)
- Maintain progress via checkpoints

## Files You Need

| File                           | Purpose                                                                                 |
|--------------------------------|-----------------------------------------------------------------------------------------|
| `sampled_ads.csv`              | Input dataset containing a 4k-row random subset of `train_250k.tsv` ads. Used to generate LLM-style queries. |
| `dynamic_queries_checkpoint.json` | Checkpoint file that stores processed ads along with their generated **vague** LLM-style queries and metadata (ad index, ad ID, ad title, domain, subdomain). Automatically rewritten after each ad is processed. |

## Pipeline Steps
1. Load input dataset:
- `sampled_ads.csv`, a random 4,000-row sample of `train_250k.tsv`, is loaded.

2. Resume from checkpoint (if it exists):
- Previously processed ads are loaded from `dynamic_queries_checkpoint.json` and skipped.

3. Iterate over ads:
- Ask the LLM to classify each ad into a broad domain and a more specific subdomain.
- In the same request, ask the LLM to generate a *vague* user query (LLM-style) for the ad.
- Append the result to the in-memory list and rewrite `dynamic_queries_checkpoint.json`.

4. Rate limits:
- Use `time.sleep(1)` between requests to avoid hitting API rate limits. After an error, the loop waits 5 seconds and moves on to the next ad.
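
The resume, iterate, and checkpoint steps above can be sketched without any API access. This is a minimal sketch, not the notebook's actual code: `generate_query_stub`, `load_checkpoint`, and `run_pipeline` are hypothetical names, the stub stands in for the GPT-4o call, and the demo writes to a temporary file instead of `dynamic_queries_checkpoint.json`.

```python
import json
import os
import tempfile
import time


def generate_query_stub(ad):
    """Hypothetical stand-in for the GPT-4o call; returns a fixed record."""
    return {"domain": "Other", "subdomain": "General", "vague_query": "placeholder query"}


def load_checkpoint(path):
    """Return previously saved records, or an empty list if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return []


def run_pipeline(ads, checkpoint_path, generate=generate_query_stub, delay=0.0):
    """Process ads that are not in the checkpoint yet, saving after each one."""
    records = load_checkpoint(checkpoint_path)          # step 2: resume
    done = {record["ad_index"] for record in records}
    for idx, ad in enumerate(ads):                      # step 3: iterate
        if idx in done:
            continue
        records.append({"ad_index": idx, **generate(ad)})
        # Rewrite the whole checkpoint so an interrupted run loses at most one ad.
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(records, f, indent=2, ensure_ascii=False)
        time.sleep(delay)                               # step 4: the notebook uses 1 second
    return records


# Demo: the second run resumes and only processes the ad that is still missing.
ads = [{"ad_title": "Ad A"}, {"ad_title": "Ad B"}, {"ad_title": "Ad C"}]
path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.json")
first = run_pipeline(ads[:2], path)
second = run_pipeline(ads, path)
print(len(first), len(second))  # prints: 2 3
```

Rewriting the full JSON file on every ad is simple and fine at 4,000 rows. For much larger inputs, an append-only format such as JSON Lines might be a better fit.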

## Installation & Import

In [None]:
!pip install openai pandas

In [None]:
import pandas as pd
import openai
import time
import json
from openai import OpenAI
import os

## User Query Generation Implementation

## API Key
Remove your API key before uploading this notebook to GitHub or sharing it anywhere else.

In [None]:
OPENAI_API_KEY = "YOUR_KEY_HERE"
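
One way to keep the key out of the notebook entirely is to read it from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY` before the notebook starts. `get_api_key` is a hypothetical helper, not part of the `openai` package.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Read the key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key


# Usage in the notebook (left commented out so this cell runs without a key):
# OPENAI_API_KEY = get_api_key()
```

With this in place, the hard-coded placeholder can stay as `"YOUR_KEY_HERE"` in any shared copy, and a missing key produces a clear error instead of a failed API call later on.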

## Hidden cells
The cells below were used to randomly sample the ad dataset. They also assign column names to the original dataset, which has no header row.

In [None]:
# # Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# # Define file path to the TSV
# file_path = "/content/drive/MyDrive/Algoverse/train_250k.tsv"

# #Load TSV safely, skipping bad lines
# df = pd.read_csv(file_path, sep="\t", header=None, on_bad_lines='skip')

# # Assign all 10 columns
# df.columns = [
#     "product_id", "ad_id", "user_search_query", "ad_title", "ad_description",
#     "url", "seller", "brand", "label", "image_id"
# ]


# print("Loaded and cleaned. Shape:", df.shape)
# df.head()

In [None]:
# # Sample ONCE and save for reuse
# sampled_df = df.sample(n=4000, random_state=42).reset_index(drop=True)
# sampled_df.to_csv("sampled_ads.csv", index=False)  # Save it

# Implementation

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

# Load dataset
csv_file = "https://raw.githubusercontent.com/m1chae11u/llm-ad-integration/refs/heads/main/sampled_ads.csv"
df = pd.read_csv(csv_file)

# File paths
CHECKPOINT_FILE = "dynamic_queries_checkpoint.json"

# Initialize or resume query checkpoint
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
        query_data = json.load(f)
    completed_indices = {entry["ad_index"] for entry in query_data}
    print(f"Resuming from checkpoint. {len(query_data)} ads completed.")
else:
    query_data = []
    completed_indices = set()
    print("Starting from scratch...")

# Main loop
for idx, row in df.iterrows():
    if idx in completed_indices:
        continue

    user_query = row["user_search_query"]
    ad_title = row["ad_title"]
    ad_description = row["ad_description"]
    url = row["url"]
    brand = row["brand"]
    ad_id = row["ad_id"] if "ad_id" in row else None
    prompt = f"""
    You are an intelligent assistant helping with ad analysis.

    Given the product ad information below, your task is to:
    1. Identify a **broad domain** the ad belongs to (e.g., Electronics, Beauty & Personal Care).
    2. Identify a **more specific subdomain** (e.g., Smartphones, Skincare).
    3. Write a **vague, natural-sounding LLM-style user query** that someone might ask. This query should be broad enough to match the ad, but not directly name the brand or product.

    Please label your output clearly:
    DOMAIN: ...
    SUBDOMAIN: ...
    QUERY: ...

    Example of an expected output:
    DOMAIN: Beauty & Personal Care
    SUBDOMAIN: Skincare
    QUERY: What are some ways to even out my skin tone naturally?

    Now do the same for the following ad:

    Input Ad:
    - Title: {ad_title}
    - Description: {ad_description}
    - URL: {url}
    - Brand: {brand}

    Respond only with the labeled output. No markdown, no explanations.
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5,
            max_tokens=300
        )

        content = response.choices[0].message.content.strip()

        # Check if content is empty
        if not content:
            raise ValueError(f"Empty response from GPT for ad #{idx}")

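        # Strip any Markdown/code block formatting. The reply text sits between
        # the opening and closing fences, so take the segment after the first fence.
        if content.startswith("```"):
            content = content.split("```")[1].strip()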

        # Manually parse
        lines = content.splitlines()
        domain = next((line.split(":", 1)[1].strip() for line in lines if line.startswith("DOMAIN:")), "Other")
        subdomain = next((line.split(":", 1)[1].strip() for line in lines if line.startswith("SUBDOMAIN:")), "General")
        vague_query = next((line.split(":", 1)[1].strip() for line in lines if line.startswith("QUERY:")), "")
        print(f"\nAd #{idx} | Domain: {domain} | Subdomain: {subdomain}")
        print(f"Query: {vague_query}")


        # Save query data
        query_data.append({
            "ad_index": idx,
            "ad_id": ad_id,
            "ad_product": ad_title,
            "domain": domain,
            "subdomain": subdomain,
            "vague_query": vague_query
        })

        with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
            json.dump(query_data, f, indent=2, ensure_ascii=False)

    except Exception as e:
        print(f"Error on ad #{idx}: {e}")
        time.sleep(5)
        continue

    time.sleep(1)  # Rate limiting

# Wrap up
print(f"\n Done! {len(query_data)} ads saved to '{CHECKPOINT_FILE}'")

Resuming from checkpoint. 4000 ads completed.

 Done! 4000 ads saved to 'dynamic_queries_checkpoint.json'
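
The label parsing inside the loop can also be pulled out into a small function so it can be checked without calling the API. This is a sketch under a few assumptions: `parse_labeled_response` is a hypothetical name, it keeps the same defaults as the loop (`Other`, `General`, empty query), and unlike the loop it matches labels case-insensitively and ignores leading whitespace.

```python
DEFAULTS = {"DOMAIN": "Other", "SUBDOMAIN": "General", "QUERY": ""}


def parse_labeled_response(content):
    """Extract the DOMAIN / SUBDOMAIN / QUERY fields from a labeled reply."""
    found = {}
    for raw_line in content.splitlines():
        # Split on the first colon only, so colons inside the query survive.
        label, sep, value = raw_line.strip().partition(":")
        label = label.strip().upper()
        # Keep the first occurrence of each label; fence lines never match.
        if sep and label in DEFAULTS and label not in found:
            found[label] = value.strip()
    merged = {**DEFAULTS, **found}
    return {
        "domain": merged["DOMAIN"],
        "subdomain": merged["SUBDOMAIN"],
        "vague_query": merged["QUERY"],
    }


# Example reply wrapped in a code fence, which the model may add despite the prompt.
reply = "```\nDOMAIN: Electronics\nSUBDOMAIN: Smartphones\nQUERY: What matters more in a phone: camera or battery?\n```"
parsed = parse_labeled_response(reply)
print(parsed["domain"], "|", parsed["subdomain"], "|", parsed["vague_query"])
```

Because lines that carry no known label are skipped, the fence markers do not need to be stripped before parsing.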


In [None]:
# Save to file
with open("user_queries.json", "w", encoding="utf-8") as f:
    json.dump(query_data, f, indent=2, ensure_ascii=False)

print(f"Saved {len(query_data)} queries to user_queries.json")

Saved 4000 queries to user_queries.json
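
A quick sanity check on the saved records is to count ads per domain and look for empty queries, which is what the parser produces when the `QUERY:` label is missing. The sketch below runs on a small made-up sample with the same keys the notebook writes. `summarize_domains` is a hypothetical helper, and the sample records are illustrative, not taken from the real output.

```python
import json
from collections import Counter


def summarize_domains(records, top_n=5):
    """Return the most common domains and the number of records with an empty query."""
    counts = Counter(record.get("domain", "Other") for record in records)
    empty = sum(1 for record in records if not record.get("vague_query"))
    return counts.most_common(top_n), empty


# Illustrative records only; to use the real output, load it instead:
# with open("user_queries.json", "r", encoding="utf-8") as f:
#     records = json.load(f)
records = [
    {"ad_index": 0, "domain": "Electronics", "subdomain": "Smartphones", "vague_query": "Which phone has good battery life?"},
    {"ad_index": 1, "domain": "Electronics", "subdomain": "Headphones", "vague_query": ""},
    {"ad_index": 2, "domain": "Beauty & Personal Care", "subdomain": "Skincare", "vague_query": "How can I even out my skin tone?"},
]

top_domains, empty_queries = summarize_domains(records)
print(top_domains)
print("Empty queries:", empty_queries)
```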
