# Sneaker Intelligence RAG Analyst

**What this is:** A Retrieval-Augmented Generation (RAG) system that lets you ask natural-language questions about sneaker market data and get grounded, data-cited answers from Claude.

**Why RAG and not a fine-tuned model?** The data changes (new Reddit scrapes, new market data). RAG keeps the LLM general and swaps in fresh knowledge at query time — no retraining required. It also cites sources, making answers auditable.

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   Question  ──►  Retriever (TF-IDF)  ──►  Top-k Documents      │
│                                                 │               │
│                                                 ▼               │
│                              Prompt = System + Context + Query  │
│                                                 │               │
│                                                 ▼               │
│                                      Claude (claude-haiku)      │
│                                                 │               │
│                                                 ▼               │
│                                     Grounded Answer + Sources   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
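
The flow in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `sneaker_intel.rag` implementation: `run_pipeline`, `SYSTEM_PROMPT`, `fake_retrieve`, and the `retrieve`/`generate` callables are hypothetical names, and the model call is left as a pluggable function so the sketch runs without an API key.

```python
from typing import Callable, Optional

# Hypothetical system prompt; the project's real prompt is not shown in this notebook.
SYSTEM_PROMPT = "Answer using only the provided context and cite your sources."


def run_pipeline(
    question: str,
    retrieve: Callable[[str, int], list[tuple[str, float]]],
    generate: Optional[Callable[[str], str]] = None,
    k: int = 3,
) -> dict:
    """Question -> top-k documents -> prompt -> answer (or a fallback message)."""
    hits = retrieve(question, k)  # [(text, score), ...]
    context = "\n\n".join(text for text, _ in hits)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
    if generate is None:
        # Same idea as the notebook without an API key:
        # retrieval still runs, generation is skipped.
        answer = "LLM generation disabled."
    else:
        answer = generate(prompt)
    return {"answer": answer, "prompt": prompt, "sources": hits}


# Stub retriever with made-up texts and scores, for illustration only.
def fake_retrieve(question: str, k: int) -> list[tuple[str, float]]:
    docs = [("Release timing document", 0.9), ("Hype resilience document", 0.5)]
    return docs[:k]


result = run_pipeline("When should we release?", fake_retrieve, generate=None, k=1)
print(result["answer"])
```

Passing a real `generate` function (for example, a thin wrapper around an LLM client) is the only change needed to turn on generation; the retrieval and prompt-assembly steps stay the same.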

**Knowledge base:** 17 documents (per the build output in Section 1), computed from live data and covering brand demand signals, release timing, pricing strategy, hype resilience, geographic demand, size runs, Reddit sentiment, and channel attribution.

> **Note:** Retrieval always runs and its output is always shown. LLM generation requires `ANTHROPIC_API_KEY` to be set.

In [1]:
import warnings, os, sys, textwrap
warnings.filterwarnings('ignore')
from pathlib import Path

from sneaker_intel.rag import DocumentBuilder, Retriever, Analyst
from sneaker_intel.visualization.style import apply_nike_style
apply_nike_style()

HAS_API_KEY = bool(os.getenv('ANTHROPIC_API_KEY'))
print(f'LLM generation : {"ENABLED  (ANTHROPIC_API_KEY found)" if HAS_API_KEY else "DISABLED (set ANTHROPIC_API_KEY to enable)"}')

LLM generation : DISABLED (set ANTHROPIC_API_KEY to enable)


## 1. Build the Knowledge Base

Documents are computed from live data on startup and cover brand demand signals, Reddit sentiment, pricing, timing, geography, and size runs. Each document is a structured text chunk with metadata (such as `topic` and `source`) that can be used for filtering.
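
As a rough picture of what "a structured text chunk with metadata" can look like, here is a minimal sketch. The `Document` dataclass and `filter_by_topic` helper below are hypothetical stand-ins, assuming only the three attributes the notebook actually uses (`title`, `content`, `metadata`); the real class in `sneaker_intel.rag` may differ. Titles are shortened and contents are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for the project's document type."""
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def filter_by_topic(documents: list[Document], topic: str) -> list[Document]:
    """Keep only documents whose metadata 'topic' equals the requested topic."""
    return [d for d in documents if d.metadata.get("topic") == topic]


# Placeholder contents; topics follow the inventory printed by the notebook.
docs = [
    Document("Release Timing Strategy", "(document text)", {"topic": "timing"}),
    Document("Geographic Demand Concentration", "(document text)", {"topic": "geography"}),
    Document("Nike Demand Signal", "(document text)", {"topic": "demand"}),
]

timing_docs = filter_by_topic(docs, "timing")
print([d.title for d in timing_docs])
```

Keeping the topic in metadata rather than in the text means a caller could narrow the candidate set before scoring, for instance restricting a pricing question to `pricing` documents.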

In [2]:
builder   = DocumentBuilder(portfolio_root=Path('.').resolve())
documents = builder.build()
retriever = Retriever(documents)
analyst   = Analyst(retriever)

print(f'Knowledge base: {len(documents)} documents indexed')
print()
print('Document inventory:')
for doc in documents:
    print(f'  [{doc.metadata.get("topic","?"):18s}]  {doc.title}')

Knowledge base: 17 documents indexed

Document inventory:
  [demand            ]  Adidas — Aftermarket Demand Signal
  [demand            ]  Asics — Aftermarket Demand Signal
  [demand            ]  New Balance — Aftermarket Demand Signal
  [demand            ]  Nike — Aftermarket Demand Signal
  [demand            ]  Puma — Aftermarket Demand Signal
  [timing            ]  Release Timing Strategy — Best Months and Days
  [pricing           ]  Retail Pricing Strategy — Sweet Spot Analysis
  [restock_policy    ]  Hype Resilience — How Long Does Premium Hold?
  [geography         ]  Geographic Demand Concentration
  [size_run          ]  Size Run Optimization — Production Allocation by Size
  [sentiment         ]  Adidas — Reddit Consumer Sentiment (Feb 2026)
  [sentiment         ]  Asics — Reddit Consumer Sentiment (Feb 2026)
  [sentiment         ]  Li-Ning — Reddit Consumer Sentiment (Feb 2026)
  [sentiment         ]  New Balance — Reddit Consumer Sentiment (Feb 2026)
  [sentiment     

## 2. Retrieval — How Context Is Found

Before generating anything, the system retrieves the most relevant documents for the query using TF-IDF cosine similarity. This is the core RAG mechanism — the LLM only sees what the retriever surfaces.
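
To make the mechanism concrete, the following is a small pure-Python sketch of TF-IDF scoring with cosine similarity. It is an illustration under assumptions, not the project's `Retriever`: the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `retrieve`), the smoothed IDF formula, and the three-line corpus are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Return one sparse TF-IDF vector per text, plus the IDF table."""
    tokenized = [tokenize(t) for t in texts]
    n = len(texts)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch): rarer terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, texts: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Score every text against the query and return the k best (text, score) pairs."""
    vectors, idf = tfidf_vectors(texts)
    # Query terms that never occur in the corpus carry no information and are dropped.
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(text, cosine(q_vec, vec)) for text, vec in zip(texts, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy corpus, invented for this example.
corpus = [
    "release timing best month and day for a limited release",
    "retail pricing sweet spot for resale premium",
    "geographic demand concentration by state",
]
top = retrieve("best time of year to release", corpus, k=2)
```

Note that the query word "time" does not match the corpus word "timing": TF-IDF compares exact tokens, which is the keyword-matching limitation that dense embeddings are meant to address.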

> **Production note:** TF-IDF is a strong sparse baseline. In production, replace with dense embeddings (`text-embedding-3-small` or `voyage-large-2`) stored in Chroma or Pinecone for semantic rather than keyword matching.

In [3]:
sample_query = 'What is the best time of year to release a limited Nike sneaker?'

results = retriever.retrieve(sample_query, k=3)

print(f'Query: "{sample_query}"')
print(f'Top {len(results)} retrieved documents:\n')
for i, (doc, score) in enumerate(results, 1):
    print(f'  [{i}] score={score:.4f} | {doc.title}')
    print(f'      source={doc.metadata.get("source")} | topic={doc.metadata.get("topic")}')
    snippet = doc.content[:200].replace('\n', ' ')
    print(f'      "{snippet}..."')
    print()

Query: "What is the best time of year to release a limited Nike sneaker?"
Top 3 retrieved documents:

  [1] score=0.2854 | Release Timing Strategy — Best Months and Days
      source=market_2023 | topic=timing
      "Analysis of 1,908 core-brand releases (Nike, Jordan, adidas, New Balance) shows release timing significantly affects resale premium. Best release month: Aug (0.307× median premium, +50% above annual a..."

  [2] score=0.1809 | Hype Resilience — How Long Does Premium Hold?
      source=stockx_2019 | topic=restock_policy
      "StockX 2019 data covering 70,170 Yeezy and Off-White transactions shows that investment-grade limited releases do NOT follow typical product demand decay. Median premium in the first 14 days post-rele..."

  [3] score=0.1018 | Nike — Reddit Consumer Sentiment (Feb 2026)
      source=reddit_feb2026 | topic=sentiment
      "Nike appears in 627 Reddit posts and comments across 9 sneaker subreddits (Feb 2026). Average hybrid sentiment score: +0.201 (posit

## 3. Q&A — Asking the Analyst

Six questions covering the key demand planning decisions a footwear brand faces. For each question, the titles of the retrieved source documents are listed alongside the generated answer; pass `show_context=True` to also print the full retrieved context.

In [4]:
def ask_and_display(analyst, question, show_context=False):
    """Run a query and display the result cleanly."""
    resp = analyst.ask(question, k=4)
    print('─' * 72)
    print(f'Q: {question}')
    print()
    print(f'A: {resp.answer}')
    print()
    print(f'Sources: {" | ".join(s.title for s in resp.sources)}')
    if show_context:
        print()
        print('Retrieved context:')
        print(textwrap.indent(resp.context_used, '  '))
    print()

QUESTIONS = [
    'Which brand should Nike prioritize for limited releases this quarter based on current demand signals?',
    'When is the optimal time to drop a new Jordan colorway — what month and day?',
    'What retail price range maximizes resale premium for a limited Nike release?',
    'How long should we wait before restocking a limited release, and why?',
    'Which US states should SNKRS prioritize for geofenced limited drops?',
    'What are consumers saying about Adidas on Reddit right now, and what channels are they using?',
]

for q in QUESTIONS:
    ask_and_display(analyst, q)

────────────────────────────────────────────────────────────────────────
Q: Which brand should Nike prioritize for limited releases this quarter based on current demand signals?

A: ⚠️  ANTHROPIC_API_KEY not set — LLM generation disabled.
Retrieved context is shown below. Set the env var to enable answers.

Sources: Hype Resilience — How Long Does Premium Hold? | Reddit Sneaker Market — Overall Sentiment Overview | Nike — Aftermarket Demand Signal | Release Timing Strategy — Best Months and Days

────────────────────────────────────────────────────────────────────────
Q: When is the optimal time to drop a new Jordan colorway — what month and day?

A: ⚠️  ANTHROPIC_API_KEY not set — LLM generation disabled.
Retrieved context is shown below. Set the env var to enable answers.

Sources: Release Timing Strategy — Best Months and Days | New Balance — Reddit Consumer Sentiment (Feb 2026) | New Balance — Aftermarket Demand Signal | Geographic Demand Concentration

────────────────────────────

## 4. Retrieval Transparency — Full Context for One Query

This cell shows the exact context passed to the LLM for one query. This is what makes RAG auditable: unlike a black-box fine-tuned model, you can always inspect *why* the system said what it said.
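
One way such a context block could be assembled is sketched below. The `[Source i | relevance=...]` header and `Title:` line follow the layout visible in the notebook's output, but `format_context` itself is a hypothetical helper, and the titles, contents, and scores are placeholders rather than real retrieval results.

```python
def format_context(hits: list[tuple[str, str, float]]) -> str:
    """Render (title, content, score) triples as numbered, citable source blocks."""
    blocks = []
    for i, (title, content, score) in enumerate(hits, 1):
        blocks.append(
            f"[Source {i} | relevance={score:.3f}]\n"
            f"Title: {title}\n"
            f"{content}"
        )
    # A blank line between blocks keeps the sources visually separate in the prompt.
    return "\n\n".join(blocks)


# Placeholder hits, for illustration only.
context = format_context([
    ("Document A", "(document text)", 0.5),
    ("Document B", "(document text)", 0.25),
])
print(context)
```

Numbering the sources in the prompt gives the model stable labels to cite, and gives a reader a direct way to trace a claim in the answer back to one retrieved document.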

In [5]:
ask_and_display(
    analyst,
    'What does the data say about Nike vs Adidas competitive positioning right now?',
    show_context=True
)

────────────────────────────────────────────────────────────────────────
Q: What does the data say about Nike vs Adidas competitive positioning right now?

A: ⚠️  ANTHROPIC_API_KEY not set — LLM generation disabled.
Retrieved context is shown below. Set the env var to enable answers.

Sources: Reddit Sneaker Market — Overall Sentiment Overview | Adidas — Reddit Consumer Sentiment (Feb 2026) | Hype Resilience — How Long Does Premium Hold? | Nike — Reddit Consumer Sentiment (Feb 2026)

Retrieved context:
  [Source 1 | relevance=0.129]
  Title: Reddit Sneaker Market — Overall Sentiment Overview
  Dataset: 5,796 Reddit posts and comments from 9 subreddits (r/Sneakers, r/Nike, r/Adidas, r/SneakerMarket, r/Jordans, r/Yeezy, r/malefashionadvice, r/Running, r/Basketball), collected Feb 2026. Overall average sentiment: +0.198 (positive). Total brands tracked: 6. Most mentioned brand: Nike (627 mentions). Most positive brand: Li-Ning (sentiment +0.691). The sneaker community shows broadly positi

## 5. Production Architecture

```
Current (portfolio)                  Production
─────────────────────                ──────────────────────────────────────
TF-IDF sparse retrieval     →        Dense embeddings (text-embedding-3-small)
In-memory document list     →        Chroma / Pinecone vector DB
Static documents (batch)    →        Streaming ingestion (Airflow / Kafka)
Hardcoded data paths        →        Feature store (Feast / Tecton)
Single-turn Q&A             →        Multi-turn conversation with memory
claude-haiku (fast/cheap)   →        claude-sonnet for complex synthesis
Notebook demo               →        FastAPI endpoint + Streamlit UI
```

The core pattern — retrieve relevant context, inject into prompt, generate grounded answer — is identical at both scales. The production version adds infrastructure around the same retrieval + generation loop.
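
The "same loop, different infrastructure" idea can be expressed as a small interface. The sketch below is an assumption about how one might structure it, not the project's design: `RetrieverLike`, `KeywordRetriever`, `CannedRetriever`, and `top_source` are hypothetical names, and `CannedRetriever` merely stands in for a vector-database client.

```python
from typing import Protocol


class RetrieverLike(Protocol):
    """Anything with this method can be plugged into the retrieval + generation loop."""

    def retrieve(self, query: str, k: int) -> list[tuple[str, float]]: ...


class KeywordRetriever:
    """Toy sparse retriever: scores each text by the number of shared words."""

    def __init__(self, texts: list[str]) -> None:
        self.texts = texts

    def retrieve(self, query: str, k: int) -> list[tuple[str, float]]:
        q = set(query.lower().split())
        scored = [(t, float(len(q & set(t.lower().split())))) for t in self.texts]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


class CannedRetriever:
    """Stand-in for a vector-DB-backed retriever; returns pre-computed results."""

    def __init__(self, results: list[tuple[str, float]]) -> None:
        self.results = results

    def retrieve(self, query: str, k: int) -> list[tuple[str, float]]:
        return self.results[:k]


def top_source(retriever: RetrieverLike, query: str) -> str:
    """Downstream code depends only on the interface, not on the backend."""
    return retriever.retrieve(query, k=1)[0][0]


sparse = KeywordRetriever(["pricing strategy", "release timing"])
dense = CannedRetriever([("release timing", 0.9)])
print(top_source(sparse, "release timing for nike"), top_source(dense, "anything"))
```

With an interface like this, moving from the left column to the right column of the table becomes a matter of swapping one object; the prompt assembly and generation code would not need to change.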