# Experiment 11: QA Engine - Hypothesis-Driven Research

A **hypothesis-driven** Q&A engine for researching changes in Salience (mental availability).

**IMPORTANT NOTE**
- Currently, the only supported question is "Salience fell by 6 points in Q3 2025 for new look, can you help find external reasons for decreased mental availability for fashion & apparel retail category?"
- This limitation exists because the context for Salience and New Look was added manually; it is not a constraint of the experiment design.

**Approach:** Like a human researcher, the engine generates hypotheses **separately by category**:
1. 🌍 **Market/Macro** - Industry-wide trends (NOT brand-specific)
2. 🏷️ **Brand** - What the brand did/didn't do
3. ⚔️ **Competitive** - What competitors are doing
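
One way to keep the three categories separate in code is an enum plus a small record type. This is only a sketch: `HypothesisCategory`, `Hypothesis`, and `group_by_category` are hypothetical names introduced for illustration and are not defined elsewhere in this notebook.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class HypothesisCategory(Enum):
    """Hypothetical enum mirroring the three categories listed above."""
    MARKET = "market"            # industry-wide trends, not brand-specific
    BRAND = "brand"              # what the brand did or did not do
    COMPETITIVE = "competitive"  # what competitors are doing


@dataclass
class Hypothesis:
    """Hypothetical record: one hypothesis tagged with exactly one category."""
    category: HypothesisCategory
    statement: str


def group_by_category(
    hypotheses: List[Hypothesis],
) -> Dict[HypothesisCategory, List[Hypothesis]]:
    """Bucket hypotheses so each category can be researched on its own."""
    grouped: Dict[HypothesisCategory, List[Hypothesis]] = {
        category: [] for category in HypothesisCategory
    }
    for hypothesis in hypotheses:
        grouped[hypothesis.category].append(hypothesis)
    return grouped
```

Because every category is pre-seeded with an empty list, a category with no hypotheses still appears in the result, which makes gaps in coverage easy to spot.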

**Workflow:**
1. **Parse Question** - Extract brand, metric, direction, and time period
2. **Generate Hypotheses** - Separately for market, brand, competitive
3. **Generate Search Queries** - Targeted queries per hypothesis
4. **Execute Searches** - Parallel search with Tier 1 source prioritization
5. **Return Findings** - Only RELEVANT facts that explain the metric change
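
The five steps can be sketched as a single function that takes each stage as a callable. This is an illustrative skeleton under stated assumptions, not the notebook's implementation: `run_pipeline` and every parameter name are hypothetical, the stages are injected so the flow can be exercised without any API call, and Tier 1 prioritization is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# The three hypothesis categories named in the approach section.
CATEGORIES = ("market", "brand", "competitive")


def run_pipeline(
    question: str,
    parse: Callable[[str], Dict[str, str]],
    generate_hypotheses: Callable[[Dict[str, str], str], List[str]],
    generate_queries: Callable[[str], List[str]],
    search: Callable[[str], List[Dict[str, str]]],
    is_relevant: Callable[[Dict[str, str], Dict[str, str]], bool],
    max_workers: int = 4,
) -> List[Dict[str, str]]:
    """Hypothetical driver; each stage is supplied by the caller."""
    # Step 1: parse the question into structured fields.
    parsed = parse(question)

    # Step 2: generate hypotheses separately for each category.
    hypotheses = [
        hypothesis
        for category in CATEGORIES
        for hypothesis in generate_hypotheses(parsed, category)
    ]

    # Step 3: derive targeted search queries from each hypothesis.
    queries = [q for hypothesis in hypotheses for q in generate_queries(hypothesis)]

    # Step 4: run the searches in parallel; pool.map preserves query order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(search, queries))

    # Step 5: keep only results judged relevant to the parsed question.
    return [result for batch in batches for result in batch if is_relevant(result, parsed)]
```

Passing the stages in as arguments keeps the control flow testable with simple stubs before any LLM or search call is wired in.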

**Key Rules:**
- ✅ Hypotheses separated by category
- ✅ Market = industry trends (not brand-specific)
- ✅ Only relevant findings (e.g., for decreased Salience, only news that reduces visibility)
- ✅ Tier 1 sources prioritized
- 🚫 No inferences - facts only
- 🚫 No vague "strategy" news without concrete impact
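
The "Tier 1 sources prioritized" rule could be implemented as a stable sort on a per-URL tier. The following is a minimal sketch under assumptions: `source_tier` and `prioritize` are hypothetical helpers, the two domain sets are a small subset of the notebook's source lists copied here so the block is self-contained, and results are assumed to be dictionaries with a `url` key.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Small subsets of the notebook's source tiers, repeated for self-containment.
TIER_1 = {"bloomberg.com", "ft.com", "marketingweek.com"}
TIER_2 = {"reuters.com", "cnbc.com"}


def source_tier(url: str) -> int:
    """Return 1 or 2 for a known tiered domain, 3 for everything else.

    URLs without a scheme have no parsed hostname and fall into tier 3.
    """
    host = (urlparse(url).hostname or "").lower()
    for tier, domains in ((1, TIER_1), (2, TIER_2)):
        # Match the domain itself or any subdomain, but not look-alikes
        # such as "notft.com".
        if any(host == d or host.endswith("." + d) for d in domains):
            return tier
    return 3


def prioritize(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order results by tier; sorted() is stable, so ties keep their order."""
    return sorted(results, key=lambda result: source_tier(result["url"]))
```

Matching on the parsed hostname rather than a substring of the URL avoids promoting a page that merely mentions a Tier 1 domain in its path or query string.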

In [None]:
# Cell 1: Setup and Dependencies
import os
import json
import re
from typing import Dict, List, Any, Optional, Tuple
from datetime import datetime
from dataclasses import dataclass, field
from IPython.display import display, HTML, Markdown
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

from openai import OpenAI

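# Initialize the OpenAI client; the key is read from the OPENAI_API_KEY environment variable
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
if not OPENAI_API_KEY:
    print("Warning: OPENAI_API_KEY is not set; API calls will fail until it is provided")
client = OpenAI(api_key=OPENAI_API_KEY)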

# OpenAI Web Search Configuration
SEARCH_MODEL = "gpt-4o-search-preview"

print("✓ OpenAI client initialized")
print(f"✓ Using OpenAI built-in web search with model: {SEARCH_MODEL}")

In [None]:
# Cell 2: Data Classes and Configuration

@dataclass
class ParsedQuestion:
    original_question: str
    brand: str
    metrics: List[str]
    direction: str
    time_period: Optional[str] = None
    additional_context: Optional[str] = None

@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str
    source_name: str
    date: Optional[str] = None
    relevance_score: float = 0.0

# Competitor Database
COMPETITOR_DATABASE = {
    "new look": [
        "primark", "marks and spencer", "m&s", "asos", "next",
        "h&m", "shein", "zara", "river island", "boohoo",
        "very", "amazon", "tk maxx", "george by asda", "jd sports"
    ]
}

# TIER 1 SOURCES - Premium authoritative
TIER_1_SOURCES = [
    "bloomberg.com", "ft.com", "wsj.com",
    "adweek.com", "adage.com", "thedrum.com",
    "campaignlive.com", "marketingweek.com",
    "kantar.com", "mckinsey.com", "mintel.com"
]

# TIER 2 SOURCES - Other credible
TIER_2_SOURCES = [
    "reuters.com", "cnbc.com", "forbes.com",
    "businessinsider.com", "bain.com", "bcg.com"
]

print("✓ Data classes defined")
print(f"✓ Competitor database loaded: {len(COMPETITOR_DATABASE)} brands")
print(f"✓ Source tiers: {len(TIER_1_SOURCES)} T1, {len(TIER_2_SOURCES)} T2")