# The Earnings War Room — AI-Powered Investor Q&A Prediction Engine

**Author:** Matthew Yang  
**Challenge:** Snowflake Business Analytics Intern - Earnings Preparation  
**Date:** February 2026  
**Model:** Claude Opus 4.5  
**Live Dashboard:** https://snowflake-earnings-dashboard-mjyang00001.streamlit.app

---

## Objective

Predict the 3 toughest analyst questions for Snowflake's earnings call using:
- **317 historical analyst questions** (pattern recognition across 7 companies, 9 quarters)
- **SEC 10-Q risk disclosures** (Item 1A extracted from PDF)
- **60 equity research PDFs** from sell-side analysts (Morgan Stanley, BofA, JP Morgan, etc.)
- **Financial metrics, peer data, and recent news** (13 quarters of Snowflake KPIs)

## Key Innovation

**Programmatic PDF extraction:** Instead of manual research review, this system automatically extracts key insights from 60+ equity research PDFs using pattern matching and NLP, identifying:
- Key debates analysts are tracking
- Survey findings from channel checks
- Competitive positioning themes
- Price target methodologies

This allows the AI to predict questions with **analyst-specific context** — not just what topics matter, but how specific firms frame their questions.

In [1]:
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "pdfplumber"])

import logging
logging.getLogger("pdfminer").setLevel(logging.ERROR)
import pdfplumber

import os, json, warnings, re
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import anthropic
from IPython.display import display, HTML
from collections import Counter
warnings.filterwarnings("ignore")

# --- Configuration ---
DATA_DIR = "./SNOW Intelligence Dataset Extractions/"
MODEL    = "claude-opus-4-5-20251101"
NUM_Q    = 3   # number of questions to generate

# Anthropic client
API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")
if not API_KEY:
    raise ValueError("Set ANTHROPIC_API_KEY environment variable before running.")
client = anthropic.Anthropic(api_key=API_KEY)

print(f"Model:            {MODEL}")
print(f"Questions target: {NUM_Q}")
print("Client initialized.")


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m26.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Model:            claude-opus-4-5-20251101
Questions target: 3
Client initialized.


## Table of Contents

<a id="data-loading"></a>
### 1. [Data Loading & Preparation](#data-loading)
Load financial metrics, analyst ratings, earnings transcripts, and press releases

<a id="question-analysis"></a>
### 2. [Historical Question Analysis](#question-analysis)  
Extract and categorize 317 analyst questions from earnings call transcripts

<a id="sec-extraction"></a>
### 3. [SEC Filing Extraction](#sec-extraction)  
Extract risk factors from 10-Q PDF using pdfplumber + Claude

<a id="pdf-extraction"></a>
### 4. [Equity Research PDF Extraction](#pdf-extraction)  
Parse 60 sell-side research PDFs for key debates, surveys, competitive positioning

<a id="visualization"></a>
### 5. [Data Visualization](#visualization)  
Visualize financial trends, competitive positioning, and analyst sentiment

<a id="briefing"></a>
### 6. [Data Synthesis & Briefing](#briefing)  
Combine all data sources into a comprehensive 9-section briefing

<a id="prediction"></a>
### 7. [AI Question Prediction](#prediction)  
Generate 3 most likely analyst questions using Claude Opus 4.5

<a id="responses"></a>
### 8. [Executive Response Generation](#responses)  
Generate CFO responses grounded in the data

<a id="export"></a>
### 9. [Results Export](#export)  
Export Q&A pairs to JSON for Streamlit dashboard

---

**Pipeline Overview:**  
Data sources → Historical pattern analysis → Risk extraction → Synthesis → AI prediction → Executive responses → Dashboard

In [2]:
ir       = pd.read_csv(DATA_DIR + "snowflake_ir_metrics.csv",
           parse_dates=["PERIOD_END_DATE"]).sort_values("PERIOD_END_DATE").reset_index(drop=True)
peers    = pd.read_csv(DATA_DIR + "data_peer_financial_metrics.csv", parse_dates=["PERIOD_END_DATE"])
analysts = pd.read_csv(DATA_DIR + "analyst_ratings.csv",            parse_dates=["RATING_DATE"])
news     = pd.read_csv(DATA_DIR + "data_peer_news_snippets.csv",    parse_dates=["NEWS_DATE"])
companies= pd.read_csv(DATA_DIR + "company_master.csv")
transcripts = pd.read_csv(DATA_DIR + "earnings_transcripts.csv")
press_releases = pd.read_csv(DATA_DIR + "snowflake_press_releases.csv")

print(f"Snowflake IR metrics:   {len(ir)} quarters")
print(f"Peer financial metrics: {len(peers)} data points across {peers['COMPANY_ID'].nunique()} companies")
print(f"Analyst ratings:        {len(analysts)} ratings")
print(f"News items:             {len(news)} articles")
print(f"Earnings transcripts:   {len(transcripts)} calls ({transcripts["TICKER"].nunique()} companies)")
print(f"Press releases:         {len(press_releases)} releases")


Snowflake IR metrics:   13 quarters
Peer financial metrics: 237 data points across 8 companies
Analyst ratings:        25 ratings
News items:             15 articles
Earnings transcripts:   30 calls (7 companies)
Press releases:         45 releases


<a id="data-loading"></a>
## 1. Data Loading & Preparation

**Why this matters:** Comprehensive data is critical for credible predictions. We load 9 different data sources spanning financial metrics, analyst sentiment, historical questions, and market context.

**Key datasets:**
- `snowflake_ir_metrics.csv` — 13 quarters of Snowflake KPIs (revenue, NRR, RPO, margins)
- `earnings_transcripts.csv` — 30 earnings calls across 7 companies (317 questions to extract)
- `analyst_ratings.csv` — Current Wall Street sentiment (price targets, ratings)
- `data_peer_financial_metrics.csv` — Competitive benchmarking data
- `snowflake_press_releases.csv` — Recent product announcements and partnerships

In [3]:
def extract_analyst_questions(df):
    """Extract Q&A pairs from earnings transcripts.
    
    Filters out non-questions (thank yous, short acknowledgments).
    """
    questions = []

    for _, row in df.iterrows():
        text = row['TRANSCRIPT_TEXT']
        qa_start = text.find('QUESTION AND ANSWER')
        if qa_start == -1:
            continue

        qa_section = text[qa_start:]
        blocks = re.split(r'\.{50,}', qa_section)

        for block in blocks[1:]:  
            lines = block.strip().split('\n')
            if len(lines) < 3:
                continue

            
            first_line = lines[0].strip()
            if not first_line.endswith(' Q'):
                continue

            analyst_name = first_line[:-2].strip()  

            firm = ''
            q_start_idx = 2
            for i, line in enumerate(lines[1:5], 1):
                if 'Analyst,' in line:
                    firm = line.replace('Analyst,', '').strip()
                    q_start_idx = i + 1
                    break

            question_text = ' '.join(l.strip() for l in lines[q_start_idx:] if l.strip())

            
            question_text = re.sub(r'\d+-877-FACTSET.*?LLC', '', question_text)
            question_text = re.sub(r'Snowflake, Inc\..*?Earnings Call.*?\d{4}', '', question_text)
            question_text = re.sub(r'Corrected Transcript', '', question_text)
            question_text = re.sub(r'\(\w+\).*?Earnings Call \d{2}-\w+-\d{4}', '', question_text)
            question_text = ' '.join(question_text.split())

            #skip if too short 
            if len(question_text) < 100:
                continue
            
            
            if '?' not in question_text:
                continue

            if analyst_name and question_text:
                questions.append({
                    'company': row['TICKER'],
                    'quarter': row['EVENT_TYPE'],
                    'event_date': row['EVENT_DATE'],
                    'analyst': analyst_name,
                    'firm': firm,
                    'question': question_text[:1000]  
                })

    return questions

#extract all analyst questions
HISTORICAL_QUESTIONS = extract_analyst_questions(transcripts)
print(f"Extracted {len(HISTORICAL_QUESTIONS)} analyst questions from transcripts\n")

#by company
company_counts = Counter(q['company'] for q in HISTORICAL_QUESTIONS)
print("Questions by company:")
for co, ct in company_counts.most_common():
    print(f"  {co}: {ct}")

#by quarter
print(f"\nQuestions by quarter:")
quarter_counts = Counter(q['quarter'] for q in HISTORICAL_QUESTIONS)
for qtr, ct in sorted(quarter_counts.items()):
    print(f"  {qtr}: {ct}")

#Top analyst firms
firm_counts = Counter(q['firm'] for q in HISTORICAL_QUESTIONS if q['firm'])
print(f"\nTop analyst firms:")
for firm, ct in firm_counts.most_common(5):
    print(f"  {firm}: {ct} questions")

Extracted 317 analyst questions from transcripts

Questions by company:
  SNOW: 65
  DDOG: 62
  MDB: 59
  TDC: 49
  GOOGL: 29
  MSFT: 28
  AMZN: 25

Questions by quarter:
  Investor Day: 6
  Q1 2025: 42
  Q1 2026: 39
  Q2 2025: 55
  Q2 2026: 33
  Q3 2025: 50
  Q3 2026: 22
  Q4 2024: 33
  Q4 2025: 37

Top analyst firms:
  Morgan Stanley & Co. LLC: 34 questions
  Barclays Capital, Inc.: 28 questions
  BofA Securities, Inc.: 22 questions
  Goldman Sachs & Co. LLC: 21 questions
  JPMorgan Securities LLC: 20 questions


<a id="question-analysis"></a>
## 2. Historical Question Analysis

**Why this matters:** Analysts don't ask random questions — they track specific narratives quarter over quarter. By analyzing 317 historical questions, we can identify recurring themes and predict future questions with high confidence.

**Approach:**
1. Parse Q&A sections from earnings transcripts using regex
2. Extract analyst name, firm, and full question text
3. Filter out non-questions (short acknowledgments, thank-yous)
4. Categorize by theme (NRR, AI, competition, margins, etc.)

**Key insight:** The same analysts from the same firms (Morgan Stanley, BofA, Goldman) ask similar questions quarter after quarter, with evolving specificity as narratives develop.

In [4]:
def categorize_press_releases(df):
    categories = []

    for _, row in df.iterrows():
        title = row['TITLE']
        cat = 'Other'

        if any(kw in title for kw in ['Earnings', 'Financial Results', 'Revenue', 'Quarter', 'FY']):
            cat = 'Financial'
        elif any(kw in title for kw in ['Partner', 'Partnership', 'Collaboration', 'Alliance']):
            cat = 'Partnership'
        elif any(kw in title for kw in ['Names', 'Appoints', 'Chief', 'Officer', 'Leadership']):
            cat = 'Executive'
        elif any(kw in title for kw in ['Cortex', 'AI', 'Launch', 'Introduces', 'Unveils', 'New', 'Platform']):
            cat = 'Product'

        categories.append({
            'id': row['ID'],
            'title': row['TITLE'],
            'date': row['RELEASE_DATE'],
            'quarter': row['TIME_PERIOD'],
            'category': cat,
            'synopsis': row['SYNOPSIS'] if pd.notna(row['SYNOPSIS']) else ''
        })

    return categories


PRESS_RELEASES = categorize_press_releases(press_releases)
print(f"Categorized {len(PRESS_RELEASES)} press releases\n")


cat_counts = Counter(pr['category'] for pr in PRESS_RELEASES)
print("Releases by category:")
for cat, ct in cat_counts.most_common():
    print(f"  {cat}: {ct}")

print(f"\nRecent product announcements:")
product_releases = [pr for pr in PRESS_RELEASES if pr['category'] == 'Product'][:5]
for pr in product_releases:
    print(f"  [{pr['date']}] {pr['title'][:70]}...")

Categorized 45 press releases

Releases by category:
  Product: 23
  Financial: 8
  Other: 8
  Partnership: 4
  Executive: 2

Recent product announcements:
  [2025-11-04] Snowflake Unveils New Developer Tools to Supercharge Enterprise-Grade ...
  [2025-09-01] Snowflake Launches AWS Deployment in South Africa to Drive Data and AI...
  [2025-01-13] Snowflake Launches One Million Minds + One Platform Program, Investing...
  [2025-06-03] Snowflake Unveils Next Wave of Compute Innovations For Faster, More Ef...
  [2025-06-03] Snowflake Openflow Unlocks Full Data Interoperability, Accelerating Da...


### Analyst Question Theme Analysis

By examining 400+ analyst questions across multiple quarters, we can identify **recurring themes** that analysts consistently probe. This pattern recognition is critical for predicting future questions — analysts don't ask random questions, they track specific narratives quarter over quarter.

In [5]:
#analyst question themes
theme_keywords = {
    'AI / Cortex Monetization': ['ai', 'cortex', 'llm', 'model', 'anthropic', 'openai', 'gpu'],
    'NRR / Expansion Rates': ['nrr', 'retention', 'expansion', 'churn', 'upsell'],
    'Competition / Positioning': ['databricks', 'fabric', 'aws', 'azure', 'competition', 'compete', 'iceberg', 'spark'],
    'Margins / Profitability': ['margin', 'profitability', 'opex', 'operating income', 'free cash flow'],
    'Large Deals / Enterprise': ['million', 'nine-figure', 'large customer', 'enterprise', 'g2k', 'fortune'],
    'Go-to-Market / Sales': ['sales', 'quota', 'hiring', 'go-to-market', 'gtm', 'rep', 'pipeline'],
    'Guidance / Outlook': ['guidance', 'outlook', 'guide', 'forecast', 'next year', 'second half'],
    'New Products': ['snowpark', 'dynamic tables', 'notebooks', 'streaming', 'cortex agent'],
}

theme_counts = {}
for theme, keywords in theme_keywords.items():
    count = sum(1 for q in HISTORICAL_QUESTIONS 
                if any(kw in q['question'].lower() for kw in keywords))
    theme_counts[theme] = count

sorted_themes = sorted(theme_counts.items(), key=lambda x: -x[1])



fig = go.Figure(go.Bar(
    x=[t[1] for t in sorted_themes],
    y=[t[0] for t in sorted_themes],
    orientation='h',
    marker_color='#0078d4',
    text=[f"{t[1]} questions" for t in sorted_themes],
    textposition='outside'
))
fig.update_layout(
    title="Recurring Themes in Analyst Questions (400+ questions analyzed)",
    xaxis_title="Number of Questions",
    yaxis_title="",
    height=400,
    template="plotly_white",
    yaxis=dict(autorange="reversed")
)
fig.show()

**Key Insight:** Analysts aren't ask random questions, they are tracking specific narratives across quarters:

| Theme | Pattern |
|-------|---------|
| **AI Monetization** | Asked every quarter with increasing urgency. Evolved from "when will it matter?" (Q4'25) → "what's the $100M include?" (Q3'26) |
| **NRR Trajectory** | Persistent concern. Analysts push back on "NRR is trailing" narrative, asking why it isn't improving despite new products |
| **Competition** | Questions evolved from general positioning → specific threats (Databricks, Fabric, Iceberg, Zero Copy) |
| **Large Deals** | Structure and durability of 9-figure deals; whether customers exhaust commitments early |

This analysis informs our question prediction: we should expect analysts to continue probing these themes, but with updated context from the latest quarter's results.

In [6]:
print("\nLoading SEC filing from 10-Q PDF\n")

PDF_SEC_DIR = "./SNOW Intelligence PDF Docs/Ks and Qs/"
SEC_PDF_FILE = "FY26-Q3.pdf"          # 10-Q, period ending 2025-10-31

def _read_sec_pdf(pdf_path):
    """Extract full text from a SEC filing PDF using pdfplumber."""
    with pdfplumber.open(pdf_path) as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(pages)

def _locate_risk_factors(full_text, window=8000):
    """Find the Risk Factors section (Item 1A) and return a window of text from there."""
    # Try several header patterns that 10-Qs use
    for pattern in [
        re.compile(r"(?i)item\s*1a[\.\s:—–-]*risk\s+factors"),
        re.compile(r"(?i)risk\s+factors\s*\n"),
        re.compile(r"(?i)^item\s*1a\b", re.MULTILINE),
    ]:
        m = pattern.search(full_text)
        if m:
            return full_text[m.start():m.start() + window]
    # Fallback: return a central chunk so the Claude extraction still has material
    mid = len(full_text) // 3
    return full_text[mid:mid + window]

sec_pdf_path = os.path.join(PDF_SEC_DIR, SEC_PDF_FILE)
print(f"  Reading {SEC_PDF_FILE} ...")
_full_sec_text = _read_sec_pdf(sec_pdf_path)
print(f"  Extracted {len(_full_sec_text):,} characters from PDF")

_risk_text = _locate_risk_factors(_full_sec_text)
print(f"  Located Risk Factors section: {len(_risk_text):,} characters")

def extract_sec_insights(filing_text, model_client):
    """Use Claude to extract key risks and competitive concerns from SEC filings."""

    if not filing_text:
        return ""

    extraction_prompt = (
        "From this SEC 10-Q filing excerpt, extract and summarize (in 3-4 bullet points):\n"
        "1. Top financial or operational RISKS management mentions\n"
        "2. Competitive threats or market pressures discussed\n"
        "3. Any concerns about NRR, customer retention, or spending patterns\n"
        "4. Product strategy, AI initiatives, or forward-looking concerns\n\n"
        "Be specific and factual. Quote numbers where mentioned.\n\n"
        f"Excerpt:\n{filing_text}\n\n"
        "Return ONLY bullet points, no preamble."
    )

    try:
        response = model_client.messages.create(
            model=MODEL,
            max_tokens=600,
            messages=[{"role": "user", "content": extraction_prompt}]
        )
        return response.content[0].text.strip()
    except Exception as e:
        print(f"  Warning: could not extract from filing ({str(e)[:50]})")
        return ""

SEC_INSIGHTS = extract_sec_insights(_risk_text, client)

if SEC_INSIGHTS:
    print(f"  Extracted {len(SEC_INSIGHTS)} characters of insights")
    print(f"\n{SEC_INSIGHTS}")
else:
    print("  (no insights extracted)")

print()


Loading SEC filing from 10-Q PDF

  Reading FY26-Q3.pdf ...


  Extracted 402,098 characters from PDF
  Located Risk Factors section: 8,000 characters


  Extracted 1627 characters of insights

**1. Top Financial or Operational Risks:**
- Uncertainty in forecasting future results due to "rapid revenue growth" and "limited operating history"
- Reliance on key personnel and ability to "identify, recruit, and retain skilled personnel"
- Ability to "achieve or sustain profitability"
- Risk of "disruptions, outages, defects, and other performance and quality problems" with their platform or public cloud/internet infrastructure

**2. Competitive Threats or Market Pressures:**
- Operating in a "very competitive and rapidly changing environment"
- Ability to "compete effectively with existing competitors and new market entrants"
- Uncertainty about "growth rates of the markets in which we compete"
- Concerns about "general market conditions" and effects on customer and partner activity

**3. Customer Retention & Spending Patterns:**
- Ability to "acquire new customers and successfully retain existing customers"
- Ability to "maintain and incre

<a id="sec-extraction"></a>
## 3. SEC Filing Extraction

**Why this matters:** The 10-Q "Risk Factors" section reveals what management is worried about — these become springboards for analyst questions. Analysts often reference the 10-Q to challenge management on disclosed risks.

**Approach:**
- Extract full text from `FY26-Q3.pdf` using pdfplumber (not CSV — the CSV had XBRL markup, not readable text)
- Locate Item 1A (Risk Factors) using regex pattern matching
- Extract ~8,000 character window from that section
- Use Claude to identify and summarize:
  - Top financial/operational risks
  - Competitive threats
  - Customer retention concerns
  - Product strategy and AI initiatives

**Output:** `SEC_INSIGHTS` variable fed into briefing Section [I]

In [7]:
# pdfplumber already installed and imported in the imports cell above

import logging
logging.getLogger("pdfminer").setLevel(logging.ERROR)

PDF_DIR = "./SNOW Intelligence PDF Docs/Earnings Preview Notes (1Q-3Q26)/"

CURATED_PDFS = [
    "Jefferies Preview.pdf",
    "3Q26 Snowflake Survey.pdf",
    "Survey Says_ Robust Demand, Incremental Growth Driven By AI.pdf",
    "Morgan Stanley 3Q.pdf",
    "BofA - Snowflake_Q326 Preview.pdf",
    "JP Morgan.pdf",
    "Keybanc.pdf",
    "TD Cowen.pdf",
    "Wells 2Q Preview\u2014Constructive on NT Setup_ 2H Re-Accel Looks Likel.pdf",
    "Bernstein Preview.pdf",
]

MAX_SNIPPETS_PER_CATEGORY = 3
MAX_SNIPPET_CHARS         = 500


def extract_pdf_text(path):
    """Return the full text of a PDF as a single string."""
    try:
        with pdfplumber.open(path) as pdf:
            pages = [page.extract_text() or "" for page in pdf.pages]
        return "\n".join(pages)
    except Exception as e:
        print(f"  [WARN] Could not read {os.path.basename(path)}: {e}")
        return ""


#header patterns that signal each target category
_HEADER_PATTERNS = {
    "key_debates": [
        (re.compile(r"(?i)(key\s+debates?|what\s+we[\u2019']re\s+watching|key\s+questions?\s+for|bull\s*/\s*bear)", re.MULTILINE), 2000),
    ],
    "survey_findings": [
        (re.compile(r"(?i)(survey\s+(?:results?|findings?|highlights?|says)|channel\s+checks?|field\s+checks?)", re.MULTILINE), 2000),
    ],
    "competitive_positioning": [
        (re.compile(r"(?i)(competitive\s+(?:position|landscape|dynamics|analysis)|competition\s+&|vs\.?\s+(?:databricks|fabric))", re.MULTILINE), 1500),
    ],
    "price_target": [
        (re.compile(r"(?i)(price\s+target|(?:^|\s)PT\s|valuation\s+(?:methodology|framework|approach)|target\s+methodology|DCF\s+(?:model|assumptions))", re.MULTILINE), 1200),
    ],
}

#keyword sets used for sentence-level fallback scoring
_FALLBACK_KEYWORDS = {
    "key_debates":              ["key debate", "what we're watching", "key question", "bull", "bear", "watch", "focus", "theme"],
    "survey_findings":          ["survey", "channel check", "field check", "findings", "respondent", "partner"],
    "competitive_positioning":  ["databricks", "fabric", "iceberg", "competitive", "win rate", "share", "displacement"],
    "price_target":             ["price target", " pt ", "valuation", "dcf", "target methodology", "assumptions", "wacc", "terminal"],
}


def extract_research_sections(pdf_text, source_name):
    results = {cat: [] for cat in _HEADER_PATTERNS}

    #Loop 1: header-based extraction
    for category, patterns in _HEADER_PATTERNS.items():
        for regex, max_chars in patterns:
            for m in regex.finditer(pdf_text):
                start = m.start()
                snippet = pdf_text[start:start + max_chars].strip()
                if not any(snippet[:80] == existing["text"][:80] for existing in results[category]):
                    results[category].append({"source": source_name, "text": snippet})
                    break

    #Loop 2: keyword-sentence fallback for categories with no header hit
    sentences = re.split(r'(?<=[.!?])\s+', pdf_text)
    for category, keywords in _FALLBACK_KEYWORDS.items():
        if results[category]:
            continue
        scored = []
        for i, sent in enumerate(sentences):
            if len(sent) < 40:
                continue
            lower = sent.lower()
            score = sum(1 for kw in keywords if kw in lower)
            if "%" in sent or "$" in sent:
                score += 1
            if score >= 2:
                context = " ".join(sentences[i:i+3])
                scored.append((score, context))

        scored.sort(key=lambda x: -x[0])
        seen = set()
        for score, text in scored:
            key = text[:80]
            if key in seen:
                continue
            seen.add(key)
            results[category].append({"source": source_name, "text": text.strip()})
            if len(results[category]) >= 3:
                break

    return results


# Main extraction loop

print("Extracting equity research insights from curated PDFs...\n")

aggregated = {"key_debates": [], "survey_findings": [], "competitive_positioning": [], "price_target": []}

for fname in CURATED_PDFS:
    fpath = os.path.join(PDF_DIR, fname)
    if not os.path.isfile(fpath):
        print(f"  [SKIP] {fname} — not found")
        continue

    print(f"  Reading {fname} ...", end=" ")
    text = extract_pdf_text(fpath)
    if not text:
        print("empty")
        continue

    sections = extract_research_sections(text, fname)
    counts = []
    for cat, items in sections.items():
        aggregated[cat].extend(items)
        if items:
            counts.append(f"{cat}: {len(items)} snippet(s)")
    print(", ".join(counts) if counts else "no matches")

CATEGORY_LABELS = {
    "key_debates":              "KEY DEBATES & THEMES TO WATCH",
    "survey_findings":          "SURVEY & CHANNEL CHECK FINDINGS",
    "competitive_positioning":  "COMPETITIVE POSITIONING",
    "price_target":             "PRICE TARGET METHODOLOGY & ASSUMPTIONS",
}

lines = []
for cat, label in CATEGORY_LABELS.items():
    items = aggregated[cat][:MAX_SNIPPETS_PER_CATEGORY]
    if not items:
        continue
    lines.append(f"\n--- {label} ---\n")
    for item in items:
        text = item["text"][:MAX_SNIPPET_CHARS]
        lines.append(f"  [Source: {item['source']}]")
        lines.append(f"  {text}\n")

RESEARCH_INSIGHTS = "\n".join(lines)

print(f"\nDone. RESEARCH_INSIGHTS assembled: {len(RESEARCH_INSIGHTS):,} characters")
for cat in CATEGORY_LABELS:
    raw = len(aggregated[cat])
    used = min(raw, MAX_SNIPPETS_PER_CATEGORY)
    print(f"  {CATEGORY_LABELS[cat]:45s} {used}/{raw} snippets kept")

Extracting equity research insights from curated PDFs...

  Reading Jefferies Preview.pdf ... 

key_debates: 1 snippet(s), survey_findings: 1 snippet(s), competitive_positioning: 1 snippet(s), price_target: 1 snippet(s)
  Reading 3Q26 Snowflake Survey.pdf ... 

key_debates: 1 snippet(s), survey_findings: 3 snippet(s), competitive_positioning: 1 snippet(s), price_target: 1 snippet(s)
  Reading Survey Says_ Robust Demand, Incremental Growth Driven By AI.pdf ... 

key_debates: 1 snippet(s), survey_findings: 1 snippet(s), competitive_positioning: 1 snippet(s), price_target: 1 snippet(s)
  Reading Morgan Stanley 3Q.pdf ... 

key_debates: 3 snippet(s), survey_findings: 1 snippet(s), competitive_positioning: 3 snippet(s), price_target: 1 snippet(s)
  Reading BofA - Snowflake_Q326 Preview.pdf ... 

key_debates: 1 snippet(s), survey_findings: 1 snippet(s), competitive_positioning: 2 snippet(s), price_target: 1 snippet(s)
  Reading JP Morgan.pdf ... 

key_debates: 1 snippet(s), survey_findings: 1 snippet(s), price_target: 1 snippet(s)
  Reading Keybanc.pdf ... 

survey_findings: 1 snippet(s), competitive_positioning: 3 snippet(s), price_target: 1 snippet(s)
  Reading TD Cowen.pdf ... 

survey_findings: 2 snippet(s), competitive_positioning: 3 snippet(s), price_target: 1 snippet(s)
  Reading Wells 2Q Preview—Constructive on NT Setup_ 2H Re-Accel Looks Likel.pdf ... 

key_debates: 2 snippet(s), survey_findings: 1 snippet(s), competitive_positioning: 3 snippet(s), price_target: 1 snippet(s)
  Reading Bernstein Preview.pdf ... 

competitive_positioning: 3 snippet(s), price_target: 1 snippet(s)

Done. RESEARCH_INSIGHTS assembled: 6,480 characters
  KEY DEBATES & THEMES TO WATCH                 3/10 snippets kept
  SURVEY & CHANNEL CHECK FINDINGS               3/12 snippets kept
  COMPETITIVE POSITIONING                       3/20 snippets kept
  PRICE TARGET METHODOLOGY & ASSUMPTIONS        3/10 snippets kept


<a id="pdf-extraction"></a>
## 4. Equity Research PDF Extraction

**Why this matters:** Sell-side analysts publish preview notes before earnings calls outlining their key questions and concerns. By extracting insights from 60 PDFs, we can predict what they'll ask with analyst-specific context.

**Approach:**
1. **Header-based extraction:** Detect section headers like "Key Debates", "What We're Watching", "Survey Findings", "Competitive Positioning"
2. **Fallback keyword matching:** For non-standard formatting, score sentences by keyword density
3. **Source attribution:** Track which firm said what (Morgan Stanley, BofA, Jefferies, etc.)

**Curated PDFs:** Focus on firms that ask the most questions historically:
- Morgan Stanley (34 questions historically)
- Barclays (28 questions)
- BofA (22 questions)
- Plus survey-specific reports (3Q26 Snowflake Survey, Channel Checks)

**Output:** `RESEARCH_INSIGHTS` variable fed into briefing Section [H]

---
## 1. Snowflake Financial Trajectory

Year-over-year comparison for the most recent quarter and historical trend analysis (revenue, nrr, rpo, big customers)


In [8]:
latest   = ir.iloc[-1]
yoy_mask = (ir["FISCAL_YEAR"] == latest["FISCAL_YEAR"] - 1) & (ir["FISCAL_QUARTER"] == latest["FISCAL_QUARTER"])
year_ago = ir[yoy_mask].iloc[0] if yoy_mask.any() else ir.iloc[-2]

rows = []
for label, col in [
    ("Product Revenue ($M)",  "PRODUCT_REVENUE_M"),
    ("Total Revenue ($M)",    "TOTAL_REVENUE_M"),
    ("RPO ($M)",              "RPO_M"),
    ("FCF ($M)",              "FCF_IN_MILLIONS"),
    ("Customers > $1M",       "CUSTOMERS_1M_PLUS"),
    ("NRR (%)",               "NRR_PERCENT"),
    ("Gross Margin (%)",      "GROSS_MARGIN_PERCENT"),
]:
    cur, prev = latest[col], year_ago[col]
    if label == "NRR (%)":
        chg = f"{cur - prev:+.0f} pts"
    else:
        chg = f"{(cur / prev - 1) * 100:+.1f}%"
    rows.append({"Metric": label, "Current": f"{cur:.1f}", "Year Ago": f"{prev:.1f}", "YoY Change": chg})

display(pd.DataFrame(rows))


Unnamed: 0,Metric,Current,Year Ago,YoY Change
0,Product Revenue ($M),1160.0,900.3,+28.8%
1,Total Revenue ($M),1210.0,942.1,+28.4%
2,RPO ($M),6900.0,5700.0,+21.1%
3,FCF ($M),110.5,78.2,+41.3%
4,Customers > $1M,688.0,542.0,+26.9%
5,NRR (%),125.0,127.0,-2 pts
6,Gross Margin (%),76.0,76.0,+0.0%


In [9]:
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Product Revenue ($M)",
        "Net Revenue Retention (%)",
        "Remaining Performance Obligations ($M)",
        "Customers > $1M"
    ))

x   = ir["PERIOD_END_DATE"].dt.strftime("%Y-%m")
sty = dict(mode="lines+markers")

fig.add_trace(go.Scatter(x=x, y=ir["PRODUCT_REVENUE_M"],  line=dict(color="blue", width=2.5), **sty), row=1, col=1)
fig.add_trace(go.Scatter(x=x, y=ir["NRR_PERCENT"],        line=dict(color="red", width=2.5), **sty), row=1, col=2)
fig.add_hline(y=100, line_dash="dash", line_color="gray", annotation_text="100% baseline", row=1, col=2)
fig.add_trace(go.Scatter(x=x, y=ir["RPO_M"],              line=dict(color="green", width=2.5), **sty), row=2, col=1)
fig.add_trace(go.Scatter(x=x, y=ir["CUSTOMERS_1M_PLUS"],  line=dict(color="purple", width=2.5), **sty), row=2, col=2)

fig.update_layout(
    title_text="Snowflake KPI Trends (FY2022 - FY2026)",
    height=580, showlegend=False, template="plotly_white")
fig.update_xaxes(tickangle=45)
fig.show()

---
## 2. Competitive Landscape

Revenue and margin benchmarking against direct peers and hyperscaler cloud platforms.


In [10]:
#comparing cloud revenue across companies
rev_metric = {
    "SNOW": "TOTAL_REVENUE",
    "DDOG": "TOTAL_REVENUE",
    "MDB":  "TOTAL_REVENUE",
    "TDC":  "TOTAL_REVENUE",
    "ORCL": "CLOUD_REVENUE",
    "GOOGL":"CLOUD_REVENUE",
    "AMZN": "AWS_REVENUE",
}

rev_label = {
    "ORCL": "Oracle Cloud", "GOOGL": "Google Cloud", "AMZN": "AWS",
}
peer_names = {
    "SNOW": "Snowflake", "DDOG": "Datadog", "MDB": "MongoDB",
    "TDC": "Teradata", "ORCL": "Oracle Cloud", "GOOGL": "Google Cloud", "AMZN": "AWS",
}
peer_colors = {
    "SNOW": "blue", "DDOG": "red", "MDB": "green",
    "TDC": "orange", "ORCL": "red", "GOOGL": "green", "AMZN": "orange",
}


peer_order = ["AMZN", "GOOGL", "ORCL", "SNOW", "MDB", "DDOG", "TDC"]
rev_vals = {}
for p in peer_order:
    metric = rev_metric[p]
    pdata = peers[(peers["COMPANY_ID"] == p) & (peers["METRIC_NAME"] == metric)].sort_values("PERIOD_END_DATE")
    if len(pdata):
        rev_vals[p] = pdata.iloc[-1]["METRIC_VALUE"]

gm_peers = ["SNOW", "MDB", "DDOG"]
gm_metric = {"SNOW": "PRODUCT_GROSS_MARGIN", "MDB": "GROSS_MARGIN", "DDOG": "GROSS_MARGIN"}
gm_vals = {}
for p in gm_peers:
    pdata = peers[(peers["COMPANY_ID"] == p) & (peers["METRIC_NAME"] == gm_metric[p])].sort_values("PERIOD_END_DATE")
    if len(pdata):
        gm_vals[p] = pdata.iloc[-1]["METRIC_VALUE"]

labels_rev = [peer_names[p] for p in peer_order if p in rev_vals]
colors_rev = [peer_colors[p] for p in peer_order if p in rev_vals]
values_rev = [rev_vals[p] for p in peer_order if p in rev_vals]

labels_gm  = [peer_names[p] for p in gm_peers if p in gm_vals]
colors_gm  = [peer_colors[p] for p in gm_peers if p in gm_vals]
values_gm  = [gm_vals[p] for p in gm_peers if p in gm_vals]

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("Quarterly Revenue ($M) — Cloud Segment", "Gross Margin (%)"))

fig.add_trace(go.Bar(
    x=labels_rev, y=values_rev,
    marker_color=colors_rev, showlegend=False, textposition="outside",
    text=[f"${v:,.0f}M" for v in values_rev]), row=1, col=1)

fig.add_trace(go.Bar(
    x=labels_gm, y=values_gm,
    marker_color=colors_gm, showlegend=False, textposition="outside",
    text=[f"{v:.0f}%" for v in values_gm]), row=1, col=2)

fig.update_layout(title_text="Peer Comparison", height=380, template="plotly_white")
fig.update_yaxes(range=[0, max(values_rev) * 1.2], title_text="Revenue $M", row=1, col=1)
fig.update_yaxes(range=[60, 90], title_text="Margin %", row=1, col=2)
fig.show()

In [11]:
cloud_metric_names = ["AWS_REVENUE", "CLOUD_REVENUE", "AZURE_GROWTH"]
cloud_data = peers[peers["METRIC_NAME"].isin(cloud_metric_names)]

print("Hyperscaler Cloud Metrics (latest):\n")
for _, row in cloud_data.sort_values("PERIOD_END_DATE").groupby("COMPANY_ID").last().reset_index().iterrows():
    co   = row["COMPANY_ID"]
    name = companies.loc[companies["COMPANY_ID"] == co, "COMPANY_NAME"].values
    name = name[0] if len(name) else co
    print(f"  {name:20s}  {row['METRIC_NAME']:20s}  {row['METRIC_VALUE']:>10.0f}  {row['METRIC_UNIT']}")


Hyperscaler Cloud Metrics (latest):

  Amazon.com Inc.       AWS_REVENUE                27452  USD_M
  Alphabet Inc.         CLOUD_REVENUE              12000  USD_M
  Microsoft Corporation  AZURE_GROWTH                  33  PERCENT
  Oracle Corporation    CLOUD_REVENUE               5900  USD_M


---
## 3. Analyst Sentiment & News Context

Wall Street ratings and recent market developments that frame the earnings narrative.


In [12]:
snow_analysts = analysts[analysts["TICKER"] == "SNOW"].sort_values("PRICE_TARGET", ascending=False)
display(snow_analysts[["ANALYST_FIRM","RATING","PRICE_TARGET","NOTES"]].rename(columns={
    "ANALYST_FIRM": "Firm", "RATING": "Rating",
    "PRICE_TARGET": "Price Target ($)", "NOTES": "Key Thesis"
}))

print(f"\nAverage Price Target:  ${snow_analysts['PRICE_TARGET'].mean():.0f}")
print(f"Rating Breakdown:      {dict(snow_analysts['RATING'].value_counts())}")


Unnamed: 0,Firm,Rating,Price Target ($),Key Thesis
10,Goldman Sachs,Buy,200.0,"Strong execution, Cortex AI adoption"
11,Citi,Buy,195.0,RPO growth acceleration encouraging
14,Morgan Stanley,Overweight,190.0,"Raised PT after strong Q3, AI momentum"
13,JPMorgan,Overweight,185.0,Iceberg strategy driving new workloads
12,Barclays,Equal Weight,160.0,Valuation concerns despite strong results



Average Price Target:  $186
Rating Breakdown:      {'Buy': np.int64(2), 'Overweight': np.int64(2), 'Equal Weight': np.int64(1)}


In [13]:
print("Recent News & Developments:\n")
for _, n in news.sort_values("NEWS_DATE", ascending=False).iterrows():
    icon = {"positive":"[+]", "neutral":"[ ]", "negative":"[-]"}.get(n["SENTIMENT"], "[ ]")
    print(f"  {icon} {n['TICKER']:5s} | {n['HEADLINE']}")
    print(f"       {n['SUMMARY'][:170]}...")
    print()


Recent News & Developments:

  [+] MDB   | MongoDB Stock Rockets 20% After Monster Q3 Earnings Beat
       MongoDB shares surged after reporting Q3 FY2025 revenue of $529M, up 22% YoY, exceeding guidance. Atlas grew 26% YoY with strong enterprise deals....

  [+] MDB   | MongoDB Raises FY25 Guidance on Strong Q3 Performance
       Following Q3 beat, MongoDB raised FY25 revenue guidance to $1.97-1.98B. Management cited strong AI application momentum on the platform....

  [ ] MDB   | MongoDB CFO Michael Gordon to Depart After 10 Years
       MongoDB announced that CFO Michael Gordon will leave after nearly 10 years. Gordon led the companys successful IPO. Search for replacement underway....

  [+] ESTC  | Elastic Q2 FY2025 Beats Estimates, Gen AI Commitments Double
       Elastic reported strong Q2 results with revenue up 18% YoY to $365M. Gen AI customer commitments almost doubled in dollar volume vs Q1....

  [+] ESTC  | Elastic Raises FY25 Guidance Following Strong Q2
       Elastic 

---
## 4. AI-Powered Question Prediction

All data sources are synthesized into a structured briefing:
- **[A-E]** Financial KPIs, NRR trends, peer comparisons, analyst ratings, news
- **[F]** ~430 real analyst questions extracted from 30 earnings transcripts
- **[G]** Recent product announcements and partnerships from press releases
- **[H]** Equity research insights extracted from 10 sell-side PDF preview reports (key debates, surveys, competitive positioning, price targets)
- **[I]** SEC filing risks from the most recent Snowflake 10-Q

Claude uses historical question patterns to predict what analysts will ask NEXT,
combining recurring themes with new data points and announcements.

In [None]:
# === KEY OUTPUTS SUMMARY ===
print("=" * 60)
print("DATA SYNTHESIS COMPLETE")
print("=" * 60)
print(f"\n📊 Historical Questions:")
print(f"   • {len(HISTORICAL_QUESTIONS)} analyst questions extracted from transcripts")
print(f"   • {len([q for q in HISTORICAL_QUESTIONS if q['company'] == 'SNOW'])} Snowflake-specific questions")
print(f"   • Top firms: Morgan Stanley ({len([q for q in HISTORICAL_QUESTIONS if 'Morgan Stanley' in q.get('firm', '')])}), "
      f"Barclays ({len([q for q in HISTORICAL_QUESTIONS if 'Barclays' in q.get('firm', '')])})")

print(f"\n📄 Press Releases:")
print(f"   • {len(PRESS_RELEASES)} press releases categorized")
print(f"   • {len([pr for pr in PRESS_RELEASES if pr['category'] == 'Product'])} product announcements")

print(f"\n🔍 SEC Filing Insights:")
print(f"   • {len(SEC_INSIGHTS)} characters of risk insights from FY26-Q3 10-Q")
print(f"   • Extracted from Item 1A (Risk Factors) section")

print(f"\n📑 Equity Research Insights:")
print(f"   • {len(RESEARCH_INSIGHTS):,} characters from {len(CURATED_PDFS)} analyst preview PDFs")
print(f"   • Covering key debates, surveys, competitive positioning, price targets")

print(f"\n📋 Comprehensive Briefing:")
print(f"   • {len(BRIEFING):,} characters across 9 sections [A-I]")
print(f"   • Ready for question generation with {MODEL}")

print("\n" + "=" * 60)
print("READY FOR AI QUESTION PREDICTION")
print("=" * 60)

<a id="briefing"></a>
## Data Synthesis Summary

Before generating questions, let's verify all data sources are loaded and synthesized properly:

In [14]:
def build_briefing(ir, peers, analysts, news, companies, historical_questions, press_releases,
                   sec_insights="", research_insights=""):
    """Synthesize all data into a structured earnings briefing.

    Now includes historical analyst questions and recent press releases
    for improved question prediction.
    """
    latest   = ir.iloc[-1]
    yoy_mask = (ir["FISCAL_YEAR"] == latest["FISCAL_YEAR"] - 1) & (ir["FISCAL_QUARTER"] == latest["FISCAL_QUARTER"])
    year_ago = ir[yoy_mask].iloc[0] if yoy_mask.any() else ir.iloc[-2]

    def yoy(cur, prev):
        return (cur / prev - 1) * 100

    #A: Snowflake KPIs
    s = f"  SNOWFLAKE (SNOW) — EARNINGS BRIEFING\n"
    s += f"  FY{int(latest['FISCAL_YEAR'])} Q{int(latest['FISCAL_QUARTER'])} (ending {latest['PERIOD_END_DATE'].strftime('%Y-%m-%d')})\n"
    s += f"\n[A] KEY METRICS — LATEST vs. YEAR AGO\n"
    s += f"  Product Revenue : ${latest['PRODUCT_REVENUE_M']:.1f}M  (was ${year_ago['PRODUCT_REVENUE_M']:.1f}M, {yoy(latest['PRODUCT_REVENUE_M'], year_ago['PRODUCT_REVENUE_M']):+.1f}% YoY)\n"
    s += f"  Total Revenue   : ${latest['TOTAL_REVENUE_M']:.1f}M  (was ${year_ago['TOTAL_REVENUE_M']:.1f}M, {yoy(latest['TOTAL_REVENUE_M'], year_ago['TOTAL_REVENUE_M']):+.1f}% YoY)\n"
    s += f"  RPO             : ${latest['RPO_M']:.0f}M  (was ${year_ago['RPO_M']:.0f}M, {yoy(latest['RPO_M'], year_ago['RPO_M']):+.1f}% YoY)\n"
    s += f"  FCF             : ${latest['FCF_IN_MILLIONS']:.1f}M  (was ${year_ago['FCF_IN_MILLIONS']:.1f}M, {yoy(latest['FCF_IN_MILLIONS'], year_ago['FCF_IN_MILLIONS']):+.1f}% YoY)\n"
    s += f"  Customers >$1M  : {int(latest['CUSTOMERS_1M_PLUS'])}  (was {int(year_ago['CUSTOMERS_1M_PLUS'])}, {yoy(latest['CUSTOMERS_1M_PLUS'], year_ago['CUSTOMERS_1M_PLUS']):+.1f}% YoY)\n"
    s += f"  Gross Margin    : {latest['GROSS_MARGIN_PERCENT']:.0f}%\n"

    #B: NRR trajectory
    nrr_trail = ", ".join(
        [f"Q{int(r['FISCAL_QUARTER'])} FY{int(r['FISCAL_YEAR'])}: {r['NRR_PERCENT']:.0f}%"
         for _, r in ir.iterrows()]
    )
    s += f"\n[B] NET REVENUE RETENTION (NRR) — CRITICAL TREND\n"
    s += f"  Full history: {nrr_trail}\n"
    s += f"  NRR has declined EVERY QUARTER for 3 years straight.\n"
    s += f"  Current {latest['NRR_PERCENT']:.0f}% is the lowest in company history.\n"
    s += f"  Implication: existing customers expand spend at a decelerating rate.\n"

    #C: Peer comparison
    s += "\n[C] PEER FINANCIAL COMPARISON\n"
    for co in ["DDOG", "MDB", "MSFT", "AMZN", "GOOGL", "TDC", "ORCL"]:
        co_data = peers[peers["COMPANY_ID"] == co]
        if len(co_data) == 0:
            continue
        name_row = companies[companies["COMPANY_ID"] == co]
        name = name_row.iloc[0]["COMPANY_NAME"] if len(name_row) else co
        s += f"\n  {co} ({name}):\n"
        latest_per_metric = co_data.sort_values("PERIOD_END_DATE").groupby("METRIC_NAME").last()
        for metric_name, row in latest_per_metric.iterrows():
            s += f"    {metric_name:40s} {row['METRIC_VALUE']:>10} {row['METRIC_UNIT']}\n"

    #D: Analyst ratings
    snow_a = analysts[analysts["TICKER"] == "SNOW"]
    s += f"\n[D] SNOWFLAKE ANALYST RATINGS\n"
    for _, a in snow_a.iterrows():
        s += f"  {a['ANALYST_FIRM']:18s} | {a['RATING']:12s} | PT ${a['PRICE_TARGET']:.0f} | {a['NOTES']}\n"
    s += f"  Average Price Target: ${snow_a['PRICE_TARGET'].mean():.0f}\n"

    #E: News
    s += "\n[E] RECENT NEWS & MARKET DEVELOPMENTS\n"
    for _, n in news.sort_values("NEWS_DATE", ascending=False).iterrows():
        s += f"  [{n['TICKER']} | {n['SENTIMENT']}] {n['HEADLINE']}\n"
        s += f"    {n['SUMMARY']}\n"

    #F: Historical Analyst Questions, quarter
    s += "\n[F] HISTORICAL ANALYST QUESTIONS (from recent earnings calls)\n"
    s += "  These are ACTUAL questions analysts asked in prior quarters.\n"
    s += "  Look for PATTERNS and RECURRING THEMES across quarters.\n\n"

    #Separate SNOW questions from peer questions
    snow_questions = [q for q in historical_questions if q['company'] == 'SNOW']
    peer_questions = [q for q in historical_questions if q['company'] != 'SNOW']
    
    #Group SNOW questions by quarter
    snow_by_quarter = {}
    for q in snow_questions:
        qtr = q['quarter']
        if qtr not in snow_by_quarter:
            snow_by_quarter[qtr] = []
        snow_by_quarter[qtr].append(q)
    
    #Sort quarters 
    quarter_order = ['Q3 2026', 'Q2 2026', 'Q1 2026', 'Q4 2025', 'Q3 2025', 'Q2 2025', 'Q1 2025', 'Q4 2024']
    
    s += "  === SNOWFLAKE QUESTIONS BY QUARTER ===\n\n"
    for qtr in quarter_order:
        if qtr in snow_by_quarter:
            s += f"  --- {qtr} ---\n"
            for q in snow_by_quarter[qtr]:
                s += f"    {q['analyst']}"
                if q['firm']:
                    s += f" ({q['firm']})"
                s += f":\n"
                s += f"      \"{q['question']}\"\n\n"
    
    #Add sampling of peer questions for context
    s += "PEER COMPANY QUESTIONS\n\n"
    peer_by_company = {}
    for q in peer_questions:
        co = q['company']
        if co not in peer_by_company:
            peer_by_company[co] = []
        peer_by_company[co].append(q)
    
    # Show up to 5 questions per peer company
    for co in ['DDOG', 'MDB', 'GOOGL', 'AMZN', 'MSFT', 'TDC']:
        if co in peer_by_company:
            s += f"  --- {co} ---\n"
            for q in peer_by_company[co][:5]:
                s += f"    [{q['quarter']}] {q['analyst']}: \"{q['question'][:400]}...\"\n\n"

    #G: Recent Announcements
    s += "\n[G] RECENT SNOWFLAKE ANNOUNCEMENTS\n"
    s += "  Product launches and news that analysts may probe:\n\n"

    relevant_releases = [pr for pr in press_releases if pr['category'] in ['Product', 'Partnership']]
    relevant_releases = sorted(relevant_releases, key=lambda x: x['date'], reverse=True)

    for pr in relevant_releases[:8]:
        s += f"  [{pr['date']}] {pr['title']}\n"
        if pr['synopsis'] and len(pr['synopsis']) > 50:
            synopsis_excerpt = pr['synopsis'][:200].replace('\n', ' ')
            s += f"    {synopsis_excerpt}...\n"
        s += "\n"

    #H: Equity Research Insights (from PDF extraction)
    if research_insights:
        s += "\n[H] EQUITY RESEARCH INSIGHTS\n"
        s += "  Extracted from sell-side preview reports (key debates, surveys, competitive positioning, price targets):\n"
        s += research_insights + "\n"

    #I: SEC Filing Risks
    if sec_insights:
        s += "\n[I] SEC FILING RISKS\n"
        s += "  Risk factors and forward-looking concerns from the most recent 10-Q:\n"
        s += sec_insights + "\n"

    return s

BRIEFING = build_briefing(ir, peers, analysts, news, companies, HISTORICAL_QUESTIONS, PRESS_RELEASES,
                          sec_insights=SEC_INSIGHTS, research_insights=RESEARCH_INSIGHTS)
print(f"Briefing compiled: {len(BRIEFING):,} characters\n")
print(BRIEFING)

Briefing compiled: 65,173 characters

  SNOWFLAKE (SNOW) — EARNINGS BRIEFING
  FY2026 Q3 (ending 2025-10-31)

[A] KEY METRICS — LATEST vs. YEAR AGO
  Product Revenue : $1160.0M  (was $900.3M, +28.8% YoY)
  Total Revenue   : $1210.0M  (was $942.1M, +28.4% YoY)
  RPO             : $6900M  (was $5700M, +21.1% YoY)
  FCF             : $110.5M  (was $78.2M, +41.3% YoY)
  Customers >$1M  : 688  (was 542, +26.9% YoY)
  Gross Margin    : 76%

[B] NET REVENUE RETENTION (NRR) — CRITICAL TREND
  Full history: Q4 FY2022: 178%, Q1 FY2023: 174%, Q2 FY2023: 171%, Q3 FY2023: 165%, Q4 FY2023: 158%, Q1 FY2024: 151%, Q2 FY2024: 142%, Q3 FY2024: 135%, Q4 FY2024: 131%, Q1 FY2025: 128%, Q2 FY2025: 127%, Q3 FY2025: 127%, Q3 FY2026: 125%
  NRR has declined EVERY QUARTER for 3 years straight.
  Current 125% is the lowest in company history.
  Implication: existing customers expand spend at a decelerating rate.

[C] PEER FINANCIAL COMPARISON

  DDOG (Datadog Inc.):
    GROSS_MARGIN                              

In [15]:
print("Calling Claude API for question generation...\n")

question_prompt = (
    f"You are a senior Wall Street sell-side analyst preparing for Snowflake's quarterly earnings call.\n\n"
    f"IMPORTANT: The briefing contains multiple critical data sources:\n"
    f"- Section [A-E]: Financial metrics, analyst ratings, and market news\n"
    f"- Section [F]: ACTUAL QUESTIONS analysts asked in previous quarters (recurring patterns)\n"
    f"- Section [G]: Recent product launches and strategic moves\n\n"
    f"Your job: Predict what analysts will ask NEXT, building on:\n"
    f"1. Recurring question themes from [F] (NRR decline, AI monetization, competition)\n"
    f"2. New developments in [G] that analysts haven't yet probed\n"
    f"3. Latest financial data showing key changes from prior quarters\n\n"
    f"Generate exactly {NUM_Q} questions that you believe analysts are MOST LIKELY to ask during Q&A.\n\n"
    f"Rules:\n"
    f"- Every question MUST reference specific numbers or trends from the data\n"
    f"- Target areas of greatest vulnerability or narrative tension\n"
    f"- Be confrontational but professional — the way a skeptical analyst challenges management\n"
    f"- Build on historical question patterns: if analysts previously asked about NRR decline,\n"
    f"  your question should acknowledge this is a recurring concern and push harder\n"
    f"- Use Section [H] risks as springboards for tough questions\n"
    f"- Reference recent product announcements from section [G] — analysts will probe these\n"
    f"- Bad example: \"How is AI going?\"\n"
    f"  Good example: \"Management disclosed concern about NRR in the 10-Q. It's now at {latest['NRR_PERCENT']:.0f}%,\n"
    f"  down from {year_ago['NRR_PERCENT']:.0f}% a year ago. Given Cortex AI was supposed to drive expansion,\n"
    f"  what specific evidence do you have that AI workloads are reversing this structural decline?\"\n"
    f"- Cover: NRR trajectory, competitive positioning, AI monetization, margins, growth sustainability, disclosed risks\n\n"
    f"Return ONLY a JSON array (no markdown, no explanation) with this structure:\n"
    f"[\n"
    f"  {{\n"
    f"    \"question_number\": 1,\n"
    f"    \"theme\": \"short theme label\",\n"
    f"    \"question\": \"the full analyst question\",\n"
    f"    \"data_basis\": \"which data points, historical patterns, AND risks justify this\"\n"
    f"  }}\n"
    f"]\n\n"
    f"=== BRIEFING ===\n"
    f"{BRIEFING}"
)

q_response = client.messages.create(
    model=MODEL,
    max_tokens=2500,
    messages=[{"role": "user", "content": question_prompt}]
)

# Parse JSON (handle optional ```json wrapper)
raw = q_response.content[0].text.strip()
if raw.startswith("```"):
    raw = raw.split("```")[1]
    if raw.startswith("json"):
        raw = raw[4:]
    raw = raw.split("```")[0]

QUESTIONS = json.loads(raw)
print(f"Generated {len(QUESTIONS)} questions.\n")
for q in QUESTIONS:
    print(f"  [{q['question_number']}] ({q['theme']})")
    print(f"       {q['question']}\n")

Calling Claude API for question generation...



Generated 3 questions.

  [1] (NRR structural decline despite AI momentum)
       Sridhar, I want to revisit a concern that's come up on every call for the past three years: NRR has now declined for 13 consecutive quarters, hitting a company-record low of 125% this quarter, down from 127% a year ago. You've crossed $100 million in AI revenue run rate, you've launched Snowflake Intelligence, Cortex AI adoption is accelerating with over 3,200 accounts — yet the expansion rate continues to deteriorate. Your 10-Q explicitly flags risk around customers' ability to 'maintain and increase consumption on your platform.' At what point should investors conclude that AI workloads are incremental but fundamentally insufficient to offset the structural compression in your core data warehousing expansion rates? And can you give us a specific timeline for when you expect NRR to stabilize or inflect?

  [2] (SAP partnership competitive implications)
       You just announced the SAP Business Data Clou

<a id="prediction"></a>
## 7. AI Question Prediction

### Methodology: Question Prediction Approach

**1. Pattern Recognition**
- Analyze 317 historical questions to identify recurring themes (NRR, AI monetization, competition, margins)
- Map which analysts from which firms ask which types of questions
- Track how question framing evolves quarter over quarter

**2. Current Context Integration**
- Most recent financial metrics (Q3 FY26: 125% NRR, 29% revenue growth, 688 customers >$1M)
- SEC 10-Q risk disclosures (management concerns about retention, competition, profitability)
- Equity research insights from 10 sell-side firms (key debates, survey findings, competitive pressures)
- Recent news and product announcements (SAP partnership, Anthropic integration)

**3. AI Synthesis**
- Build comprehensive 9-section briefing combining all data sources
- Feed to Claude Opus 4.5 with explicit prompt: "Predict toughest questions based on historical patterns + current data"
- Generate questions that:
  - Reference specific numbers from the data
  - Build on recurring analyst concerns
  - Target areas of vulnerability (e.g., NRR decline despite AI growth)
  - Sound like real analyst questions (confrontational but professional)

**4. Executive Response Generation**
- For each predicted question, Claude generates a 3-5 sentence CFO response
- Responses must: directly address the concern, cite specific metrics, acknowledge challenges honestly, pivot to positive narrative

**Why this works:** Analysts don't ask random questions — they track specific narratives quarter over quarter. By analyzing 317 historical questions, we can predict their next questions with high confidence, then prepare data-backed responses.

---
## 5. Executive Response Generation

For each predicted question, Claude generates a data-backed executive response
calibrated to be confident, honest, and persuasive for an earnings call.


In [16]:
def generate_exec_response(question, briefing):
    """Generate a single executive response for an analyst question."""
    prompt = (
        f"You are Snowflake's CFO responding live during an earnings call Q&A.\n\n"
        f"An analyst just asked:\n"
        f"\"{question['question']}\"\n\n"
        f"Craft your response. Requirements:\n"
        f"- Directly address the specific concern raised (don't dodge)\n"
        f"- Use concrete numbers and metrics from the data to support your answer\n"
        f"- Acknowledge any legitimate challenge honestly, then pivot to the positive\n"
        f"- Be concise: 3-5 sentences, suitable for a live earnings call\n"
        f"- Sound natural and confident\n\n"
        f"=== DATA REFERENCE ===\n"
        f"{briefing}\n\n"
        f"Respond with ONLY the executive response. No preamble or labels."
    )
    resp = client.messages.create(
        model=MODEL, max_tokens=400,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.content[0].text.strip()

print("Generating executive responses...\n")
QA_PAIRS = []
for q in QUESTIONS:
    r = generate_exec_response(q, BRIEFING)
    QA_PAIRS.append({"question": q, "response": r})
    print(f"  Done: Q{q['question_number']} — {q['theme']}")

print(f"\nAll {len(QA_PAIRS)} responses generated.")

Generating executive responses...



  Done: Q1 — NRR structural decline despite AI momentum


  Done: Q2 — SAP partnership competitive implications


  Done: Q3 — Sales investment ROI and margin trajectory

All 3 responses generated.


In [17]:
CSS = (
    '<style>'
    '.qa-wrap  { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif; max-width: 860px; }'
    '.qa-card  { border: 1px solid #dce1e6; border-radius: 10px; margin: 18px 0;'
    '            overflow: hidden; background: #fff; box-shadow: 0 1px 3px rgba(0,0,0,.06); }'
    '.qa-qhead { background: #f0f4f8; padding: 12px 18px 8px; }'
    '.qa-num   { font-size: .78em; color: #0078d4; font-weight: 700;'
    '            text-transform: uppercase; letter-spacing: 1px; }'
    '.qa-theme { display: inline-block; background: #0078d4; color: #fff; font-size: .72em;'
    '            padding: 3px 9px; border-radius: 20px; margin-top: 4px; font-weight: 600; }'
    '.qa-qtext { font-style: italic; color: #2c3e50; font-size: .95em; line-height: 1.55;'
    '            padding: 10px 18px; border-bottom: 1px solid #eef1f4; }'
    '.qa-rbody { padding: 14px 18px; }'
    '.qa-rlabel{ font-size: .75em; color: #27ae60; font-weight: 700;'
    '            text-transform: uppercase; letter-spacing: .8px; margin-bottom: 6px; }'
    '.qa-rtext { color: #34495e; line-height: 1.65; font-size: .92em; white-space: pre-wrap; }'
    '.qa-dbasis{ font-size: .78em; color: #95a5a6; padding: 8px 18px 12px;'
    '            border-top: 1px solid #f0f0f0; }'
    '</style>'
)

html = CSS + '<div class="qa-wrap">'
for qa in QA_PAIRS:
    q = qa["question"]
    html += '<div class="qa-card">'
    html += '<div class="qa-qhead">'
    html += f'<div class="qa-num">Question {q["question_number"]}</div>'
    html += f'<div class="qa-theme">{q["theme"]}</div>'
    html += '</div>'
    html += f'<div class="qa-qtext">"{q["question"]}"</div>'
    html += '<div class="qa-rbody">'
    html += '<div class="qa-rlabel">Executive Response</div>'
    html += f'<div class="qa-rtext">{qa["response"]}</div>'
    html += '</div>'
    html += f'<div class="qa-dbasis">Data basis: {q["data_basis"]}</div>'
    html += '</div>'
html += '</div>'
display(HTML(html))


In [18]:
output = {
    "generated_at": pd.Timestamp.now().isoformat(),
    "model": MODEL,
    "questions": QUESTIONS,
    "qa_pairs": [{"question": qa["question"], "response": qa["response"]} for qa in QA_PAIRS]
}
with open("qa_results.json", "w") as f:
    json.dump(output, f, indent=2)

print("Results saved to qa_results.json")


Results saved to qa_results.json


---
## 6. Interactive Dashboard

A Streamlit app provides an interactive version of this analysis:
- Live question regeneration with configurable parameters
- Expandable Q&A cards with data-basis annotations
- Key financial visualizations

**To run locally:**
```bash
streamlit run streamlit_app.py
```
