# User Queries Generation Script

This pipeline uses GPT-4o to:
- Dynamically categorize advertisements into domains
- Generate a domain-specific “vague” user query for each ad, phrased the way a user would ask an LLM
- Maintain progress via checkpoints
- Expand the category list if new domains emerge


## Files You Need

| File                           | Purpose                                                                                 |
|--------------------------------|-----------------------------------------------------------------------------------------|
| `sampled_ads.csv`              | Input dataset containing a 4k-row random subset of `train_250k.tsv` ads. Used to generate LLM-style queries. |
| `dynamic_queries_checkpoint.json` | Checkpoint file that stores processed ads along with their generated **vague** LLM-style queries and metadata (e.g., category, justification). Automatically updated after each ad is processed. |
| `dynamic_category_list.json`   | Stores the list of ad domain categories. Dynamically updated if the LLM proposes a new category during classification. |

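For orientation, one entry in `dynamic_queries_checkpoint.json` has roughly the shape sketched below. The field names are the ones the main loop writes; the values are made-up placeholders, and `REQUIRED_KEYS` and `is_valid_entry` are hypothetical helpers added here for illustration, not part of the notebook.

```python
import json

# Field names taken from the main loop; every value here is a placeholder.
example_entry = {
    "ad_index": 0,
    "ad_product": "Example ad title",
    "category": "Electronics",
    "is_new_category": False,
    "justification": "Placeholder justification text.",
    "llm_queries": "Placeholder vague query.",
}

# Hypothetical helper: the set of keys every checkpoint entry should carry.
REQUIRED_KEYS = {
    "ad_index", "ad_product", "category",
    "is_new_category", "justification", "llm_queries",
}


def is_valid_entry(entry):
    """Return True if entry is a dict containing all expected keys."""
    return isinstance(entry, dict) and REQUIRED_KEYS.issubset(entry)


# The checkpoint file is a JSON list of such entries.
print(json.dumps([example_entry], indent=2))
```

A check like this could be run over a loaded checkpoint before resuming, to catch a truncated or hand-edited file early.
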
## Pipeline Steps
1. Load the input dataset:
- `sampled_ads.csv`, a random sample of 4,000 rows from `train_250k.tsv`, is loaded.

2. Resume from checkpoint (if it exists):
- Previously processed ads are loaded from `dynamic_queries_checkpoint.json` and skipped.

3. Iterate over ads:
- Classify each ad into a known domain or propose a new one via the LLM (with reasoning).
- Update and save `dynamic_category_list.json` if a new domain is added.
- Ask the LLM to generate a *vague* user query (LLM-style) for the ad.
- Append the result to the checkpoint and rewrite `dynamic_queries_checkpoint.json`.

4. Rate limits:
- Call `time.sleep(1)` after each processed ad to stay under API rate limits. After a failed query-generation call, the script waits 5 seconds before moving on.

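Step 4 relies on a fixed pause. As an alternative that the notebook does not use, a failed call can be retried with exponential backoff. The sketch below is a generic helper under that assumption; `call_with_backoff` and its parameters are hypothetical names, and it retries on any exception rather than on a specific rate-limit error.

```python
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait after each attempt.

    The wait before retry number k (0-based) is base_delay * 2**k seconds.
    The last failure is re-raised once max_retries attempts are used up.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

In the main loop, an API call could be wrapped as `call_with_backoff(lambda: client.chat.completions.create(...))`, with the arguments left as they are in the notebook.
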
## Installation & Import

In [None]:
!pip install openai pandas

In [2]:
import pandas as pd
import openai
import time
import json
from openai import OpenAI
import os

## User Query Generation Implementation

## API Key
Remove your key from this cell before uploading the notebook to GitHub or any other public location.
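
One way to avoid hard-coding the key, assuming it has been exported as an environment variable beforehand, is to read it at runtime. In this sketch `load_api_key` is a hypothetical helper, and the variable name `OPENAI_API_KEY` is an assumption chosen to match the constant used in the next cell.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this in place, the assignment in the next cell could become `OPENAI_API_KEY = load_api_key()`, so the notebook itself never contains the secret.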

In [3]:
OPENAI_API_KEY = "Enter your API key here"

## Hidden cells
The cells below are used to randomly sample the ad dataset. They also assign column names to the original dataset, which has no header row.

In [4]:
# # Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# # Define file path to the TSV
# file_path = "/content/drive/MyDrive/Algoverse/train_250k.tsv"

# #Load TSV safely, skipping bad lines
# df = pd.read_csv(file_path, sep="\t", header=None, on_bad_lines='skip')

# # Assign all 10 columns
# df.columns = [
#     "product_id", "ad_id", "user_search_query", "ad_title", "ad_description",
#     "url", "seller", "brand", "label", "image_id"
# ]


# print("Loaded and cleaned. Shape:", df.shape)
# df.head()

In [5]:
# # Sample ONCE and save for reuse
# sampled_df = df.sample(n=4000, random_state=42).reset_index(drop=True)
# sampled_df.to_csv("sampled_ads.csv", index=False)  # Save it

## Implementation
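
The main loop in the next cell removes a Markdown code fence from the model's reply before parsing it as JSON. A slightly more defensive take on that step is sketched here; `parse_json_reply` is a hypothetical helper, not part of the notebook, and it returns `None` instead of raising when the reply cannot be parsed as a JSON object.

```python
import json
import re

# Matches a reply wrapped in a code fence, with or without a language tag.
_FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*(.*?)\s*```$", re.DOTALL)


def parse_json_reply(text):
    """Strip an optional code fence and parse the rest as a JSON object.

    Returns the parsed dict, or None if the text is not a valid JSON object.
    """
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return result if isinstance(result, dict) else None
```

Returning `None` keeps the caller's logic simple: the loop can skip the ad whenever the helper yields nothing, which mirrors the `continue` on malformed JSON in the notebook.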

In [None]:
client = OpenAI(api_key=OPENAI_API_KEY)

# Load dataset
csv_file = "https://raw.githubusercontent.com/m1chae11u/llm-ad-integration/refs/heads/main/sampled_ads.csv"
df = pd.read_csv(csv_file)

# Checkpoint and category files
CHECKPOINT_FILE = "dynamic_queries_checkpoint.json"
CATEGORY_FILE = "dynamic_category_list.json"

# Load or initialize category list
if os.path.exists(CATEGORY_FILE):
    with open(CATEGORY_FILE, "r", encoding="utf-8") as f:
        domain_list = json.load(f)
else:
    domain_list = [
        "Electronics", "Apparel & Fashion", "Beauty & Personal Care",
        "Home & Living", "Travel", "Tools & Hardware",
        "Health & Wellness", "Automotive"
    ]

# Load checkpoint if exists
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
        query_data = json.load(f)
    completed_indices = {entry["ad_index"] for entry in query_data}
    print(f"Resuming from checkpoint. {len(query_data)} ads completed.")
else:
    query_data = []
    completed_indices = set()
    print("Starting from scratch...")

# Main loop
for idx, row in df.iterrows():
    if idx in completed_indices:
        continue

    query = row["user_search_query"]
    title = row["ad_title"]
    description = row["ad_description"]
    url = row["url"]
    brand = row["brand"]

    # Step 1: Classify ad domain
    domain_prompt = f"""
      You are categorizing product ads into high-level domains.

      Here are the current domain categories:
      {', '.join(domain_list)}

      Please classify the following ad into one of the above domains.
      If none apply, propose a new domain and explain why.

      Ad Title: {title}
      Ad Description: {description}
      Brand: {brand}

      Return this JSON format:
      {{
        "domain": "category name",
        "is_new": true or false,
        "justification": "brief explanation"
      }}
      """

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": domain_prompt}],
            temperature=0.3,
            max_tokens=150
        )

        content = response.choices[0].message.content.strip()
        if not content:
            raise ValueError("Empty response from GPT")

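        # Strip a Markdown code fence (with or without a "json" tag) around the reply
        if content.startswith("```json"):
            content = content[len("```json"):].strip()
        elif content.startswith("```"):
            content = content[3:].strip()
        if content.endswith("```"):
            content = content[:-3].strip()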

        try:
            domain_result = json.loads(content)
        except json.JSONDecodeError:
            print(f"Malformed JSON in domain response for ad #{idx}.\nRaw content:\n{content}\n")
            continue

        category = domain_result.get("domain", "Other")
        is_new = domain_result.get("is_new", False)
        justification = domain_result.get("justification", "")

        if is_new and category not in domain_list:
            domain_list.append(category)
            print(f"New category added: {category} — {justification}")
            with open(CATEGORY_FILE, "w", encoding="utf-8") as f:
                json.dump(domain_list, f, indent=2)

    except Exception as e:
        print(f"Error on domain classification for ad #{idx}: {e}")
        continue

    # Step 2: Generate only the vague query
    query_prompt = f"""
      You're helping us test an AI advertisement system.

      Here is the information about the product:
      - User Query (from a traditional search engine): {query}
      - Ad Title: {title}
      - Description: {description}
      - Domain: {url}
      - Brand: {brand}
      - Category: {category}

      Now, return only a vague user query that would be appropriate in the context of a user interacting with a large language model (LLM), rather than a traditional search engine.

      Respond with just the vague query as a plain string. Do not return JSON, markdown, or explanations.
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query_prompt}],
            temperature=0.7,
            max_tokens=600
        )

        # Use a separate name so the original search query in `query` is not overwritten
        vague_query = response.choices[0].message.content.strip().strip('"')

        print(f"\nAd #{idx} | Ad Title: {title} | Category: {category}")
        print(f"Query: {vague_query}")

        # Save progress
        query_data.append({
            "ad_index": idx,
            "ad_product": title,
            "category": category,
            "is_new_category": is_new,
            "justification": justification,
            "llm_queries": vague_query,
        })

        with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
            json.dump(query_data, f, indent=2, ensure_ascii=False)

    except Exception as e:
        print(f"Error generating queries on ad #{idx}: {e}")
        time.sleep(5)
        continue

    time.sleep(1)

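# Done generating: write the category list once more so the file exists even if no new category was added
with open(CATEGORY_FILE, "w", encoding="utf-8") as f:
    json.dump(domain_list, f, indent=2)

print(f"\nAll done! {len(query_data)} entries saved to '{CHECKPOINT_FILE}'")
print(f"Final domain list saved to '{CATEGORY_FILE}'")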

In [None]:
# Save to file
with open("user_queries.json", "w", encoding="utf-8") as f:
    json.dump(query_data, f, indent=2, ensure_ascii=False)

print(f"Saved {len(query_data)} queries to user_queries.json")
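
As a quick sanity check on the saved output, the entries can be tallied per category. This is a minimal sketch, assuming each entry carries the `category` field written by the main loop; `count_by_category` is a hypothetical helper and the sample entries are placeholders, not real results.

```python
from collections import Counter


def count_by_category(entries):
    """Return a Counter mapping category name to number of entries."""
    return Counter(entry.get("category", "Other") for entry in entries)


# Placeholder entries standing in for the loaded contents of user_queries.json.
sample_entries = [
    {"ad_index": 0, "category": "Electronics"},
    {"ad_index": 1, "category": "Travel"},
    {"ad_index": 2, "category": "Electronics"},
]

for name, count in count_by_category(sample_entries).most_common():
    print(f"{name}: {count}")
```

On real data, `sample_entries` would be replaced by `query_data`; a category with very few ads may be a sign that the model proposed a near-duplicate domain.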