# Demo: Classify Search Queries Using GPT-4o mini

We propose three complementary, IR-grounded definitions for distinguishing narrowly focused search queries from exploratory ones:

* Search Intent Taxonomy [(Broder, 2002)](https://dl.acm.org/doi/abs/10.1145/792550.792552): navigational (i.e., aimed at reaching a website), transactional (i.e., intended to complete a task), informational (i.e., seeking to learn about a topic).

* Named-Entity Presence [(White and Roth, 2009)](https://books.google.co.kr/books?hl=ko&lr=&id=MjRr9Z8lxXkC&oi=fnd&pg=PR3&dq=exploratory+search:+beyond+the+query-response&ots=GNjB-sZE4h&sig=lf1XNgrEENlBg4boZslqeYESiJ4&redir_esc=y#v=onepage&q=exploratory%20search%3A%20beyond%20the%20query-response&f=false): Examine whether a query includes named entities such as specific product or brand names.

* Inverse Document Frequency [(Sparck Jones, 1972)](https://www.emerald.com/insight/content/doi/10.1108/eb026526/full/html): Assess whether the semantic content of a query appears in the top five Google search result snippets.
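
For reference, the term weight behind the third definition can be sketched as follows. This uses the commonly written form idf(t) = log(N / df(t)), not the snippet-matching proxy that this demo actually applies. The function name `inverse_document_frequency` and the toy corpus are illustrative assumptions, not part of the notebook.

```python
import math

def inverse_document_frequency(term, documents):
    """Return log(N / df), where df is the number of documents containing `term`.

    Returns 0.0 when no document contains the term (an illustrative convention).
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0
    return math.log(n_docs / doc_freq)

# Toy corpus, made up for illustration only.
toy_corpus = [
    "apple watch se discount",
    "apple store near me",
    "dyson vacuum cleaner noise level",
    "samsung refrigerator installation guide",
]
idf_apple = inverse_document_frequency("apple", toy_corpus)  # log(4 / 2)
idf_dyson = inverse_document_frequency("dyson", toy_corpus)  # log(4 / 1)
```

A term that occurs in fewer documents gets a higher weight, which is the intuition the snippet-based check borrows: a narrowly focused query should be echoed closely by its top results.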

# Step 1: Install packages

In [None]:
%pip install pandas requests openai python-dotenv

In [1]:
import requests
import pandas as pd
import json
import re
import os
import openai

from dotenv import load_dotenv

# Step 2: Define your OpenAI API key

In [2]:
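# Read variables from a local .env file into the process environment.
load_dotenv()
# The openai client picks up OPENAI_API_KEY from the environment on first use.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")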

# Step 3: Define system prompts

In [3]:
system_prompt_1 = """
You are an information-retrieval expert fluent in Korean. You will receive five items; each item has (1) an unspaced Korean search query and (2) an array of five title+snippet strings returned by Google.
For each item, follow these steps exactly, then return one JSON object:
1. For each snippet, think step by step in 1-2 sentences whether it semantically contains the query. Allow for minor typos and close paraphrases, but be cautious--only consider it a match if the snippet directly and clearly relates to the core meaning of the query.
2. After your rationale, output one array "matches" of five Boolean values (true / false), one per snippet.

Respond strictly in the following JSON array format (no extra text):
[
  {
    "query": "<original query>",
    "matches": [true, false, false, true, false]
  },
  ...
]
"""

system_prompt_2 = """
You will be given a list of search queries submitted to a search engine. For each query, please classify the user's primary intent into one of the following three categories, choosing the most dominant intent if multiple apply:

1. Navigational: The user wants to visit a specific website, brand page, or official service page.
   (Example: "Naver," "Coupang customer service")

2. Transactional: The user intends to perform a specific action such as purchasing, downloading, booking, or watching something.
   (Example: "Buy used AirPods," "Netflix subscription," "vacuum cleaner head")

3. Informational: The user is looking for information or answers and the query does not fit into the above two categories.
   (Example: "Symptoms of depression," "Weather in Paris," "Supplements good for thyroid")

Additional instructions:
- If the query contains typos or abbreviations, infer the user's intent based on the most common or likely interpretation in context.
- Respond strictly in the following JSON array format (no extra text):

[
  {
    "query": "<original query>",
    "reasoning": "<clear explanation of why this query falls into the chosen category, based on words, phrases, or user intent>",
    "category": "Navigational / Transactional / Informational"
  },
  ...
]
"""

system_prompt_3 = """
You will be given a list of search queries entered into a search engine. For each query, extract **specific brand or product names** and **general product category names**.

Instructions:

1. Specific Brand or Product Name:
- Extract brand names, product names, or model numbers if present.
- Examples: 'Dior Lip Glow' → 'Dior', 'Apple Watch' → 'Apple Watch', 'Emtek GeForce GTX750' → 'Emtek GeForce GTX750'

2. General Product Category Name:
- Extract general product categories (only physical products; exclude services or software).
- Examples: 'meat', 'sneakers', 'Bluetooth earphones'

3. Compound Queries:
- For brand + product name, extract as one combined name.
- For brand + general product category, extract separately by type.
- Examples: 'Dior Lip Glow' → 'Dior', 'Lip Glow'; 'Musinsa Sneakers' → 'Musinsa', 'Sneakers'

4. Do not extract descriptions of product features or functions.
- Examples: 'laptop for design' → 'laptop', 'hypoallergenic dog chew' → 'dog chew', 'bidet bolt tightening' → 'bidet', 'kettlebell 16kg' → 'kettlebell'

5. N/A case:
- If neither a specific brand/product name nor a general product category applies, return 'N/A'.
- Example: 'pirate' → 'N/A'

Respond strictly in the following JSON array format (no extra text):
[
  {
    "query": "<original query>",
    "brand_or_product": "<extracted brand or product name or 'N/A'>",
    "general_product_category": "<extracted general product category name or 'N/A'>"
  },
  ...
]
"""

# Step 4: Functions

In [4]:
# Parse response for system prompt 1
def parse_response_1(response: str) -> list:
    try:
        results = json.loads(response)
        return [item for item in results if 'query' in item and 'matches' in item]
    except json.JSONDecodeError:
        blocks = re.findall(r'\{[^{}]+\}', response)
        parsed = []
        for blk in blocks:
            try:
                obj = json.loads(blk)
                if {'query','matches'} <= obj.keys():
                    parsed.append(obj)
            except json.JSONDecodeError:
                continue
        return parsed

# Parse response for system prompt 2
def parse_response_2(response: str) -> list:
    try:
        results = json.loads(response)
        return [item for item in results if 'query' in item and 'reasoning' in item and 'category' in item]
    except json.JSONDecodeError:
        blocks = re.findall(r'\{[^{}]+\}', response)
        parsed = []
        for blk in blocks:
            try:
                obj = json.loads(blk)
                if {'query','reasoning','category'} <= obj.keys():
                    parsed.append(obj)
            except json.JSONDecodeError:
                continue
        return parsed

# Parse response for system prompt 3
def parse_response_3(response: str) -> list:
    required_keys = {'query', 'brand_or_product', 'general_product_category'}
    try:
        results = json.loads(response)
        # Return only items that contain all required keys
        return [item for item in results if required_keys <= item.keys()]
    except json.JSONDecodeError:
        # If full JSON parse fails, try to extract JSON objects via regex and parse individually
        blocks = re.findall(r'\{[^{}]+\}', response)
        parsed = []
        for blk in blocks:
            try:
                obj = json.loads(blk)
                if required_keys <= obj.keys():
                    parsed.append(obj)
            except json.JSONDecodeError:
                continue
        return parsed

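def call_gpt(queries, system_prompt, model="gpt-4o-mini", max_tokens=1000):
    """Send `queries` (a JSON string) with `system_prompt` and return the raw reply text.

    The reply is cut off once it reaches `max_tokens`, so long batches may
    come back as incomplete JSON.
    """
    response = openai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": queries}
        ],
        temperature=0,  # keep the output as repeatable as possible
        max_tokens=max_tokens
    )
    return response.choices[0].message.content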
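
The workflow in this notebook only exercises `system_prompt_2` and `system_prompt_3`. A minimal sketch of the round trip for `system_prompt_1` might look like the following, assuming the five Google title+snippet strings per query have already been collected elsewhere. The helper names `build_snippet_payload` and `match_rate`, the `"snippets"` key, the Korean placeholder query, and the stand-in reply are all hypothetical.

```python
import json

def build_snippet_payload(snippets_by_query, snippets_per_query=5):
    """Serialize {query: [snippet, ...]} into a JSON string for the user message."""
    items = []
    for query, snippets in snippets_by_query.items():
        if len(snippets) != snippets_per_query:
            raise ValueError(
                f"expected {snippets_per_query} snippets for {query!r}, got {len(snippets)}"
            )
        items.append({"query": query, "snippets": list(snippets)})
    # ensure_ascii=False keeps Korean text readable instead of \uXXXX escapes.
    return json.dumps(items, ensure_ascii=False)

def match_rate(item):
    """Return the fraction of snippets marked as matching for one parsed item.

    Accepts "matches" either as a list of booleans or as a JSON-encoded string.
    """
    matches = item["matches"]
    if isinstance(matches, str):
        matches = json.loads(matches)
    if not matches:
        return 0.0
    return sum(bool(m) for m in matches) / len(matches)

# Placeholder input: one unspaced Korean query with five dummy snippets.
example_snippets = {"다이슨청소기소음": [f"placeholder snippet {i}" for i in range(1, 6)]}
payload = build_snippet_payload(example_snippets)

# Hand-written stand-in for one parsed item, not real model output.
fake_item = {"query": "다이슨청소기소음", "matches": "[true, false, false, true, false]"}
rate = match_rate(fake_item)
```

The payload string would then be passed as `call_gpt(payload, system_prompt_1)` and the reply run through `parse_response_1`. How a match rate maps onto "narrow" versus "exploratory" is not specified here, so the sketch only computes the fraction.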

# Step 5: Workflow

In [5]:
queries = [
    "amazon apple watch se discount",
    "dyson vacuum cleaner noise level",
    "chipotle official menu",
    "nike dunk low stockx",
    "samsung refrigerator installation guide",
]
queries_json = json.dumps(queries)
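
The five queries above fit in a single request. For a longer list, one option is to split it into small batches so that each reply stays within `max_tokens`. The sketch below assumes a batch size of five, mirroring the "five items" wording in `system_prompt_1`; the helper `chunked` and the `demo_queries` list are hypothetical.

```python
import json

def chunked(items, size=5):
    """Yield consecutive slices of `items`, each containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Twelve dummy queries, split into batches of 5, 5, and 2.
demo_queries = [f"query {i}" for i in range(1, 13)]
batches = [json.dumps(batch) for batch in chunked(demo_queries, size=5)]
```

Each element of `batches` is a JSON string in the same shape as `queries_json`, so it could be sent through `call_gpt` one batch at a time and the parsed results concatenated.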

In [7]:
response2 = call_gpt(queries_json, system_prompt_2, model="gpt-4o-mini", max_tokens=1000)
print("\n--------------------------------")
print("Search intent: informational / navigational / transactional")
print("--------------------------------\n")
print(response2)


--------------------------------
Search intent: informational / navigational / transactional
--------------------------------

[
  {
    "query": "amazon apple watch se discount",
    "reasoning": "The user is looking for a discount on a specific product, indicating an intention to purchase or find a deal.",
    "category": "Transactional"
  },
  {
    "query": "dyson vacuum cleaner noise level",
    "reasoning": "The user is seeking information about the noise level of a specific product, which does not indicate a desire to buy or visit a website.",
    "category": "Informational"
  },
  {
    "query": "chipotle official menu",
    "reasoning": "The user is looking for the official menu of Chipotle, which suggests they want to navigate to a specific brand's information.",
    "category": "Navigational"
  },
  {
    "query": "nike dunk low stockx",
    "reasoning": "The user is likely looking to purchase or check the availability of a specific sneaker on StockX, indicating a transacti

In [9]:
response3 = call_gpt(queries_json, system_prompt_3, model="gpt-4o-mini", max_tokens=1000)
print("\n--------------------------------")
print("Extract brand/product names and categories")
print("--------------------------------\n")
print(response3)


--------------------------------
Extract brand/product names and categories
--------------------------------

[
  {
    "query": "amazon apple watch se discount",
    "brand_or_product": "Apple Watch SE",
    "general_product_category": "smartwatch"
  },
  {
    "query": "dyson vacuum cleaner noise level",
    "brand_or_product": "Dyson",
    "general_product_category": "vacuum cleaner"
  },
  {
    "query": "chipotle official menu",
    "brand_or_product": "N/A",
    "general_product_category": "N/A"
  },
  {
    "query": "nike dunk low stockx",
    "brand_or_product": "Nike Dunk Low",
    "general_product_category": "sneakers"
  },
  {
    "query": "samsung refrigerator installation guide",
    "brand_or_product": "Samsung",
    "general_product_category": "refrigerator"
  }
]
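
The named-entity definition reduces to a yes/no signal per query. One way to derive it from the extraction output is sketched below, under the assumption that any value other than 'N/A' in `brand_or_product` counts as a named entity. The helper `has_named_entity` is hypothetical; the two sample rows are copied from the output above.

```python
def has_named_entity(row):
    """Return True when the row carries a brand or product name other than 'N/A'."""
    value = str(row.get("brand_or_product", "")).strip()
    return value != "" and value.upper() != "N/A"

rows = [
    {"query": "amazon apple watch se discount", "brand_or_product": "Apple Watch SE"},
    {"query": "chipotle official menu", "brand_or_product": "N/A"},
]
flags = [has_named_entity(row) for row in rows]
```

Note that "chipotle official menu" is flagged as having no named entity because the prompt restricts extraction to physical products; whether that is the desired behavior for restaurant or service brands is a design choice this sketch does not settle.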


# Step 6: Final dataset

In [11]:
data2 = parse_response_2(response2)
df2 = pd.DataFrame(data2)
df2 = df2[['query','category']]
data3 = parse_response_3(response3)
df3 = pd.DataFrame(data3)

df = pd.merge(df2,df3,on='query')
print(df)

                                     query       category brand_or_product  \
0           amazon apple watch se discount  Transactional   Apple Watch SE   
1         dyson vacuum cleaner noise level  Informational            Dyson   
2                   chipotle official menu   Navigational              N/A   
3                     nike dunk low stockx  Transactional    Nike Dunk Low   
4  samsung refrigerator installation guide  Informational          Samsung   

  general_product_category  
0               smartwatch  
1           vacuum cleaner  
2                      N/A  
3                 sneakers  
4             refrigerator  
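
Because `pd.merge` performs an inner join by default, a query that is missing from either parsed response drops out of the final table without any warning. A small coverage check, sketched here with a hypothetical helper `missing_queries` and hand-written sample rows, can surface such gaps before merging.

```python
def missing_queries(expected, parsed_rows):
    """Return the expected queries that do not appear in the parsed rows, in order."""
    seen = {row.get("query") for row in parsed_rows}
    return [query for query in expected if query not in seen]

expected_queries = [
    "amazon apple watch se discount",
    "dyson vacuum cleaner noise level",
    "chipotle official menu",
]
# Hand-written sample in which one query was lost during parsing.
parsed_sample = [
    {"query": "amazon apple watch se discount", "category": "Transactional"},
    {"query": "chipotle official menu", "category": "Navigational"},
]
dropped = missing_queries(expected_queries, parsed_sample)
```

In the notebook this would be called as `missing_queries(queries, data2)` and `missing_queries(queries, data3)`; an empty list from both means every query survives the merge.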
