# Gemini for Investment Text Analysis — Hands‑On Lab

*Uses Google AI Studio (Gemini) Free Tier*

**Author:** Your Instructor  \
**Last updated:** 2025-08-22

---
### What you'll build
1. **Earnings calls:** Compare two consecutive transcripts (FedEx) and extract **surprising changes** vs prior call.
2. **FOMC press conference:** Surface statements most likely to **surprise markets**.
3. **Forecasting pitfalls:** Demonstrate **leakage** and why naive LLM+ML pipelines can overstate accuracy.

> This notebook assumes an internet connection. It includes robust fallbacks (short excerpts) so it still runs if websites block scraping during class.

## Learning Objectives
- Use **Gemini (AI Studio)** from Python to classify/extract structured insights from financial text.
- Combine **statistical novelty** (TF‑IDF) with **LLM judgment** to find *surprising* statements.
- Run a small **FOMC surprise** detector with hawkish/dovish cues + LLM vetting.
- See a concrete **look‑ahead bias** failure case and how to fix it with time‑aware validation.

## 0) Setup
**Gemini API (AI Studio) is free for prototyping** in supported regions. Create a key in AI Studio and set it as `GEMINI_API_KEY`.

### Install (if needed)

In [1]:
# Install dependencies (comment this out if they are already installed):
%pip install google-generativeai pandas numpy matplotlib scikit-learn beautifulsoup4 requests tqdm




In [2]:
import os, re, json, time, math, warnings, textwrap
warnings.filterwarnings('ignore')
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import requests
from bs4 import BeautifulSoup

print('Versions:')
import sys; print('Python', sys.version)
try:
    import google.generativeai as genai
    print('google-generativeai', genai.__version__)
except Exception as e:
    print('google-generativeai not installed yet')


Versions:
Python 3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]
google-generativeai 0.8.5


### Configure Gemini (AI Studio)
Set your key safely (prefer environment variable). In class, students can temporarily assign it for the session, but **never commit** keys.
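
One way to keep the key out of the notebook source is to read it from the environment and fall back to a hidden prompt. The helper below is a minimal sketch; `load_gemini_key` and its parameters are hypothetical names introduced here, not part of the Gemini SDK.

```python
import os
import getpass

def load_gemini_key(env=None, ask=getpass.getpass):
    """Return the Gemini key from the environment, prompting only if it is missing."""
    env = os.environ if env is None else env
    key = env.get('GEMINI_API_KEY', '').strip()
    if not key:
        # Hidden prompt: the key is not echoed and never lands in the notebook source.
        key = ask('Gemini API key: ').strip()
    if not key:
        raise RuntimeError('No Gemini API key provided.')
    return key
```

With this in place, `genai.configure(api_key=load_gemini_key())` would work whether or not the variable was exported before Jupyter started.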

In [52]:
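# Set your key outside the notebook (e.g. `export GEMINI_API_KEY=...` in the shell) so it is never saved here.
# For a one-off class session, prompt for it instead of pasting it into the cell:
# import getpass; os.environ['GEMINI_API_KEY'] = getpass.getpass('Gemini API key: ')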

import google.generativeai as genai
if 'GEMINI_API_KEY' not in os.environ:
    raise RuntimeError('Please set GEMINI_API_KEY environment variable before running.')
genai.configure(api_key=os.environ['GEMINI_API_KEY'])

def gemini_json(system_prompt, user_prompt, extra=None, model_name='gemini-1.5-flash'):
    model = genai.GenerativeModel(
        model_name=model_name,
        system_instruction=system_prompt,
        generation_config={'temperature': 0.2, 'response_mime_type': 'application/json'}
    )
    if extra is None:
        extra = {}
    content = user_prompt if isinstance(user_prompt, str) else json.dumps(user_prompt)
    resp = model.generate_content(content)
    try:
        return json.loads(resp.text)
    except Exception:
        txt = resp.text.strip()
        start = txt.find('{'); end = txt.rfind('}')
        if start!=-1 and end!=-1 and end>start:
            return json.loads(txt[start:end+1])
        raise

def gemini_text(system_prompt, user_prompt, model_name='gemini-1.5-flash'):
    model = genai.GenerativeModel(
        model_name=model_name,
        system_instruction=system_prompt,
        generation_config={'temperature': 0.2}
    )
    resp = model.generate_content(user_prompt)
    return resp.text
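
The fallback in `gemini_json` only recovers a `{...}` object, while several prompts in this lab ask for a JSON array. A possible variant that tolerates both shapes is sketched below; `parse_json_loose` is a hypothetical helper, and the span-finding heuristic is an assumption that can misfire on text containing unrelated brackets.

```python
import json

def parse_json_loose(txt):
    """Parse model output as JSON, tolerating stray text around an object or an array."""
    txt = txt.strip()
    try:
        return json.loads(txt)
    except json.JSONDecodeError:
        pass
    # Collect the outermost {...} and [...] spans; the one that opens first is tried first.
    spans = []
    for open_ch, close_ch in (('{', '}'), ('[', ']')):
        start, end = txt.find(open_ch), txt.rfind(close_ch)
        if start != -1 and end > start:
            spans.append((start, end))
    for start, end in sorted(spans):
        try:
            return json.loads(txt[start:end + 1])
        except json.JSONDecodeError:
            continue
    raise ValueError('No JSON object or array found in model output.')

wrapped = parse_json_loose('Here you go: [{"claim": "x"}] thanks')
```

Inside `gemini_json`, the `except` branch could call a helper like this on `resp.text` instead of searching for braces only.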


## 1) Utilities: fetch & preprocess transcripts
These helpers try to download real transcripts. If blocked, they fall back to short built‑in excerpts (for demo only).

In [5]:
from bs4 import BeautifulSoup
import requests, re
HEADERS = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36'}

def fetch_text(url, min_len=2000, timeout=15):
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        if r.status_code!=200:
            return None
        soup = BeautifulSoup(r.text, 'html.parser')
        for tag in soup(['script','style','noscript']): tag.extract()
        txt = ' '.join(soup.get_text('\n').split())
        return txt if len(txt)>=min_len else None
    except Exception:
        return None

def split_sentences(text):
    s = re.split(r'(?<=[.!?])\s+(?=[A-Z\[])', text)
    return [x.strip() for x in s if len(x.strip())>0]


## 2) Earnings Call: What’s surprising vs the previous call? (FedEx)
**Method:** TF‑IDF novelty (current vs prior) → top candidates → Gemini JSON classification (surprise category/direction/relevance).
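
To make the novelty step concrete: each current sentence is scored as 1 minus its highest cosine similarity to any sentence in the prior call. The sketch below reimplements that idea with the standard library on toy sentences. It is a simplification and not the notebook's `TfidfVectorizer` pipeline (no stop-word removal, no vector normalisation step beyond the cosine itself); all function names are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(s):
    return re.findall(r'[a-z]+', s.lower())

def fit_idf(sentences):
    """Smoothed IDF learned from the prior call only."""
    n = len(sentences)
    df = Counter(t for s in sentences for t in set(tokenize(s)))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def vectorize(s, idf):
    """TF-IDF vector as a dict; words unseen in the prior call are dropped."""
    tf = Counter(t for t in tokenize(s) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_scores(prev_sents, curr_sents):
    idf = fit_idf(prev_sents)
    prev_vecs = [vectorize(s, idf) for s in prev_sents]
    return [1 - max(cosine(vectorize(s, idf), p) for p in prev_vecs) for s in curr_sents]

# Toy sentences (made up for illustration, not transcript quotes).
prev = ['We maintained margin guidance.', 'Demand was stable.']
curr = ['We maintained margin guidance.', 'We raised margin guidance.', 'Condolences to the family.']
scores = novelty_scores(prev, curr)
```

The repeated sentence scores close to 0, the changed-guidance sentence scores low but above 0, and the sentence whose words never appeared in the prior call scores exactly 1. That last case is why off-topic remarks can top a novelty ranking.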

In [None]:
# 2.1 Attempt to fetch two FedEx texts (PDF transcripts, chronological handling)

# --- PDF-based fetch for FedEx transcripts (sorted chronologically) ---

# Install PDF parsers (comment this out after the first run):
%pip install pdfminer.six pypdf

import re, io, requests



HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/123.0 Safari/537.36"
}

def fetch_pdf_bytes(url, timeout=40):
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    print(f"Fetched {url} | status={r.status_code} | bytes={len(r.content)} | type={r.headers.get('Content-Type')}")
    return r.content

def pdf_bytes_to_text(data: bytes) -> str:
    # Try pdfminer.six
    try:
        from pdfminer.high_level import extract_text
        return extract_text(io.BytesIO(data))
    except Exception as e:
        print("pdfminer.six failed:", repr(e))
    # Fallback: PyPDF
    try:
        from pypdf import PdfReader
        reader = PdfReader(io.BytesIO(data))
        return "\n".join((p.extract_text() or "") for p in reader.pages)
    except Exception as e:
        print("PyPDF failed:", repr(e))
        return ""

def parse_quarter_key(url: str):
    # Prefer explicit "Q4-FY25" pattern
    m = re.search(r'Q([1-4])[-_]?FY(\d{2,4})', url, re.I)
    if m:
        q = int(m.group(1)); fy = int(m.group(2))
        fy = fy + 2000 if fy < 100 else fy
        return (fy, q)
    # Fallback "/2025/q4/" pattern in path
    m2 = re.search(r'/(\d{4})/q([1-4])/', url, re.I)
    if m2:
        return (int(m2.group(1)), int(m2.group(2)))
    return (0, 0)  # unknown → sorts earliest

# Ensure chronological order (oldest -> newest), then take the last two

fedex_urls = [
  'https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf',
  'https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf'
]
fedex_urls_sorted = sorted(fedex_urls, key=parse_quarter_key)
prev_url, curr_url = fedex_urls_sorted[-2], fedex_urls_sorted[-1]
print("Prev URL:", prev_url)
print("Curr URL:", curr_url)

# Extract text
prev_text = pdf_bytes_to_text(fetch_pdf_bytes(prev_url))
curr_text = pdf_bytes_to_text(fetch_pdf_bytes(curr_url))

# Fallback if a PDF is image-only (no embedded text)
if not prev_text or len(prev_text) < 1200:
    print("⚠️ Previous PDF text too short; using fallback excerpt.")
    prev_text = ('In the prior quarter, we reiterated focus on cost reductions across Ground and Express. '
                 'We expected modest revenue growth with pressure on international freight. '
                 'We maintained FY operating margin guidance in the mid-single digits and capex ~5.5% of revenue. '
                 'Residential mix and yield management remained priorities. Macro uncertainty but stable U.S. demand.')

if not curr_text or len(curr_text) < 1200:
    print("⚠️ Current PDF text too short; using fallback excerpt.")
    curr_text = ('This quarter, we raised our FY operating margin outlook to high-single digits, driven by stronger Ground yields '
                 'and Express optimization. International freight improved; capex ~5.0% of revenue; repurchase auth +$2B; '
                 'B2B stabilized; easing cost inflation; positive free cash flow.')

print('Lengths -> prev:', len(prev_text), 'curr:', len(curr_text))
print("\n--- preview prev_text ---\n", prev_text[:600])
print("\n--- preview curr_text ---\n", curr_text[:600])



Note: you may need to restart the kernel to use updated packages.
Prev URL: https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf
Curr URL: https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf
Fetched https://s21.q4cdn.com/665674268/files/doc_financials/2025/q3/FDX-Q3-FY25-Earnings-Call-Transcript.pdf | status=200 | bytes=762180 | type=application/pdf
Fetched https://s21.q4cdn.com/665674268/files/doc_financials/2025/q4/FDX-Q4-FY25-Earnings-Call-Transcript_Final.pdf | status=200 | bytes=192666 | type=application/pdf
Lengths -> prev: 65222 curr: 57980

--- preview prev_text ---
 FedEx Q3 FY25 Earnings Call Transcript – March 20, 2025 

Jenifer Hollander 
Vice President-Investor Relations, FedEx Corp. 

Good afternoon, and welcome to FedEx Corporation's third quarter earnings conference call. The third quarter earnings 
release, Form 10-Q and stat book are on our website at investors.fed


In [49]:
# 2.2 Novelty scoring
prev_s = split_sentences(prev_text)
curr_s = split_sentences(curr_text)
vec = TfidfVectorizer(stop_words='english', max_features=20000)
Xp = vec.fit_transform(prev_s)
Xc = vec.transform(curr_s)
sims = cosine_similarity(Xc, Xp).max(axis=1)
nov = 1 - sims
df_curr = pd.DataFrame({'sentence': curr_s, 'novelty': nov}).sort_values('novelty', ascending=False)
df_curr.head(10)


|  | sentence | novelty |
|---|---|---|
| 527 | So, condolences to family, friends and colleag... | 1.0 |
| 487 | So, apologize for that slight delay. | 1.0 |
| 3 | It feels strange to be here with you all so \n... | 1.0 |
| 5 | Smith. | 1.0 |
| 6 | But Fred was a man grounded by a mission. | 1.0 |
| 280 | It \nruns across and is part of our culture here. | 1.0 |
| 353 | And that U.S. | 1.0 |
| 74 | We continue to apply our digital platform-base... | 0.831813 |
| 75 | These solutions support a wide range \nof stak... | 0.828557 |
| 541 | The scale of FedEx comes into play in these ki... | 0.82825 |
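
The top of this ranking is dominated by fragments produced by the sentence splitter ("Smith.", "And that U.S.") and by short remarks whose words never occurred in the prior call. A simple pre-filter on word count is one way to reduce that noise before ranking. The sketch below is an assumption-laden heuristic: `is_substantive` and the `min_words=8` threshold are hypothetical and would need tuning.

```python
import re

def is_substantive(sentence, min_words=8):
    """Heuristic: keep sentences with at least `min_words` alphabetic tokens."""
    return len(re.findall(r'[A-Za-z]+', sentence)) >= min_words

# Complete sentences taken from the output above.
sentences = [
    'Smith.',
    'And that U.S.',
    'So, apologize for that slight delay.',
    'But Fred was a man grounded by a mission.',
]
kept = [s for s in sentences if is_substantive(s)]
```

In the notebook this could be applied as `df_curr[df_curr['sentence'].map(is_substantive)]`. It removes fragments only; judging whether a full sentence is market-relevant is still left to the Gemini step.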


In [None]:
# 2.3 Gemini classification
# Send both full transcripts and let Gemini pick out claims that are surprising vs the prior call and market-relevant

user_msg = {
  'previous_call_text': prev_text,
  'current_call_text': curr_text
}
system_msg = (
  'Act as an equity analyst. Comparing with the previous_call text, for each sentence from the current earnings call, decide if it is surprising vs the prior call AND likely to be market-moving. '
  'Use categories: guidance, demand, margins, capital_allocation, network/operations, macro, costs, other. '
  'Return a JSON array of objects with fields: claim, category, direction (up/down/neutral), is_surprising (bool), market_relevance (low/med/high), rationale, confidence. '
  'Only include sentences that are both surprising AND have medium or high market_relevance.'
)
fedex_res = gemini_json(system_msg, user_msg)



In [59]:
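# Gemini may return a bare list or wrap it in a dict (in this run, under 'surprising_claims'); handle both.
claims = fedex_res if isinstance(fedex_res, list) else fedex_res.get('surprising_claims', [])
fedex_res = pd.DataFrame(claims)
fedex_res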

|  | claim | category | direction | is_surprising | market_relevance | rationale | confidence |
|---|---|---|---|---|---|---|---|
| 0 | We delivered a solid finish to FY 2025 with an... | margins | up | True | high | This contradicts the previous call's lowered E... | med |
| 1 | In FY 2025, we delivered on our $2.2 billion D... | costs | down | True | high | Successfully meeting the cost reduction target... | high |
| 2 | We achieved all of this in the face of major h... | other | neutral | True | med | Successfully navigating multiple significant h... | med |
| 3 | On a 1% increase in revenue, we grew adjusted ... | margins | up | True | high | This demonstrates significant operating levera... | high |
| 4 | We achieved this result in a weak demand envir... | demand | up | True | high | Growth in a weak demand environment, driven by... | high |
| 5 | Our performance demonstrates the flexibility o... | network/operations | up | True | high | Highlighting network flexibility and confidenc... | med |
| 6 | In the fourth quarter, we flexed our network t... | network/operations | down | True | high | The significant and rapid capacity reduction d... | high |
| 7 | We exited May with a net capacity reduction of... | network/operations | down | True | med | This shows continued proactive management of c... | med |
| 8 | On June 1, we implemented Network 2.0 on nearl... | network/operations | up | True | high | The accelerated pace of Network 2.0 implementa... | high |
| 9 | That means we exit June with roughly 2.5 milli... | network/operations | up | True | high | The significant increase in volume through opt... | high |


## 3) FOMC Press Conference: likely market‑moving lines
**Method:** novelty + hawk/dove tone → Gemini vetting to flag likely market movers.

In [60]:
# 3.1 Fetch two FOMC transcripts (or fallback) — edit URLs for specific dates when teaching
fomc_urls = ['https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf','https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf']



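# FOMC URLs carry a YYYYMMDD date rather than a fiscal quarter, so sort on that date
# (parse_quarter_key would return (0, 0) for both and leave the order to chance).
fomc_urls_sorted = sorted(fomc_urls, key=lambda u: re.search(r'(\d{8})', u).group(1))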
prev_url, curr_url = fomc_urls_sorted[-2], fomc_urls_sorted[-1]
print("Prev URL:", prev_url)
print("Curr URL:", curr_url)

# Extract text
prev_text = pdf_bytes_to_text(fetch_pdf_bytes(prev_url))
curr_text = pdf_bytes_to_text(fetch_pdf_bytes(curr_url))

# Fail loudly if a PDF is image-only (no embedded text).
# The FedEx fallback excerpts from section 2 do not apply to FOMC transcripts.
if not prev_text or len(prev_text) < 1200:
    raise RuntimeError("Previous FOMC PDF text too short; check the URL or load a local copy.")

if not curr_text or len(curr_text) < 1200:
    raise RuntimeError("Current FOMC PDF text too short; check the URL or load a local copy.")

print('Lengths -> prev:', len(prev_text), 'curr:', len(curr_text))
print("\n--- preview prev_text ---\n", prev_text[:600])
print("\n--- preview curr_text ---\n", curr_text[:600])


Prev URL: https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf
Curr URL: https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf
Fetched https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250618.pdf | status=200 | bytes=217624 | type=application/pdf
Fetched https://www.federalreserve.gov/mediacenter/files/FOMCpresconf20250730.pdf | status=200 | bytes=222261 | type=application/pdf
Lengths -> prev: 56626 curr: 47774

--- preview prev_text ---
 June 18, 2025 

  Chair Powell’s Press Conference 

FINAL 

Transcript of Chair Powell’s Press Conference 
June 18, 2025 

CHAIR POWELL.  Good afternoon.  My colleagues and I remain squarely focused on 

achieving our dual-mandate goals of maximum employment and stable prices for the benefit of 

the American people.  Despite elevated uncertainty, the economy is in a solid position.  The 

unemployment rate remains low, and the labor market is at or near maximum employment.  

Inflation has come do

In [None]:
# 3.2 Compute novelty + simple hawk/dove tone
# prev_text / curr_text hold the FOMC transcripts fetched in 3.1
prev_s = split_sentences(prev_text)
curr_s = split_sentences(curr_text)
vec = TfidfVectorizer(stop_words='english', max_features=20000)
Xp = vec.fit_transform(prev_s)
Xc = vec.transform(curr_s)
sims = cosine_similarity(Xc, Xp).max(axis=1)
nov = 1 - sims

hawk = set('tighten tightening restrictive inflation persistent upside overheating strong labor vigilantly price stability hikes higher longer'.split())
dove = set('ease easing lower cut disinflation confidence balanced downside progress softening slack'.split())
def tone(s):
    w = re.findall(r'[A-Za-z]+', s.lower())
    return sum(1 for x in w if x in hawk) - sum(1 for x in w if x in dove)
tones = np.array([tone(s) for s in curr_s])

df_f = pd.DataFrame({'sentence': curr_s, 'novelty': nov, 'tone': tones}).sort_values(['novelty','tone'], ascending=[False, False])
df_f.head(12)
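
Because the lexicons above are built with `.split()`, multi-word cues such as "price stability" are stored as separate words, so "price" alone counts as hawkish. One possible refinement is to match phrases before single words. The sketch below uses small illustrative lexicons that are assumptions, not a validated hawk/dove dictionary; `tone_phrases` is a hypothetical name.

```python
import re

# Illustrative lexicons (assumptions, not a validated dictionary).
HAWK_PHRASES = ['price stability', 'higher for longer', 'strong labor market']
HAWK_WORDS = {'tighten', 'tightening', 'restrictive', 'persistent', 'hikes'}
DOVE_WORDS = {'ease', 'easing', 'cut', 'disinflation', 'softening', 'slack'}

def tone_phrases(sentence):
    """Hawkish minus dovish hits, counting multi-word phrases before single words."""
    text = sentence.lower()
    score = 0
    for phrase in HAWK_PHRASES:
        score += len(re.findall(r'\b' + re.escape(phrase) + r'\b', text))
        # Blank out the phrase so its words are not counted again below.
        text = text.replace(phrase, ' ')
    words = re.findall(r'[a-z]+', text)
    score += sum(w in HAWK_WORDS for w in words)
    score -= sum(w in DOVE_WORDS for w in words)
    return score

hawkish = tone_phrases('Policy will stay restrictive to restore price stability.')
dovish = tone_phrases('The labor market shows some softening and slack.')
```

Positive scores lean hawkish and negative scores lean dovish, as with `tone` above. Either version is a crude cue for ranking candidates, which is why the next step hands them to Gemini for vetting.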


In [None]:
# 3.3 Gemini vetting
N = 12
cand = df_f.head(N)['sentence'].tolist()
# Named sys_msg / user_msg so the `sys` module imported during setup is not shadowed.
sys_msg = ('You are a rates strategist. Choose sentences that are LIKELY market‑moving (is_market_moving=true). '
       'Prefer: forward path of policy, balance‑sheet pace, confidence about inflation path, changes in risk balance.')
user_msg = {'sentences': cand, 'return_fields':['quote','hawk_dove','market_channels','surprise_reason','confidence','is_market_moving']}
fomc_res = gemini_json(sys_msg, user_msg)
fomc_res


In [None]:
# 3.4 Tidy
def tidify(res):
    arr = res if isinstance(res, list) else res.get('items') or res.get('results') or []
    out = []
    for it in arr:
        out.append({
            'quote': it.get('quote',''),
            'hawk_dove': it.get('hawk_dove','neutral'),
            'is_market_moving': it.get('is_market_moving', False),
            'market_channels': ', '.join(it.get('market_channels', [])) if isinstance(it.get('market_channels'), list) else it.get('market_channels',''),
            'surprise_reason': it.get('surprise_reason',''),
            'confidence': it.get('confidence','low')
        })
    return pd.DataFrame(out)

fomc_tbl = tidify(fomc_res)
fomc_tbl


## 4) Forecasting Pitfalls: look‑ahead bias & leakage demo
Two demos: (a) TF‑IDF fit leakage + random CV, (b) time‑aware split without leakage.

In [None]:
# 4.1 Simulate time series text and returns
rng = np.random.default_rng(7)
dates = pd.bdate_range('2024-01-02','2024-06-28')
n = len(dates)
base_words = ['update','plan','launch','customer','supply','cost','revenue','margin','guidance','growth']
texts = []
latent = rng.normal(0, 0.2, n)
for i in range(n):
    w = rng.choice(base_words, size=rng.integers(5,9), replace=True)
    if rng.random()<0.3:
        w = list(w) + ['good']; latent[i] += 0.05
    elif rng.random()<0.3:
        w = list(w) + ['bad']; latent[i] -= 0.05
    texts.append(' '.join(w))
ret_next = 0.001 + 0.15*latent + rng.normal(0, 0.5, n)
df = pd.DataFrame({'date':dates, 'text':texts, 'ret_next':ret_next})
df.head()


In [None]:
# 4.2 WRONG: fit on full corpus, random CV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
tf = TfidfVectorizer(stop_words='english', max_features=500)
X_all = tf.fit_transform(df['text'])
y = df['ret_next'].values
Xtr, Xte, ytr, yte = train_test_split(X_all, y, test_size=0.3, random_state=0, shuffle=True)
mdl = Ridge(alpha=1.0).fit(Xtr, ytr)
r2_bad = mdl.score(Xte, yte)
print('Leaky R^2 (random split, TF‑IDF fit on ALL data):', round(r2_bad,4))


In [None]:
# 4.3 RIGHT: time split + fit only on train
cut = int(len(df)*0.7)
train = df.iloc[:cut]; test = df.iloc[cut:]
tf2 = TfidfVectorizer(stop_words='english', max_features=500)
Xtr = tf2.fit_transform(train['text'])
Xte = tf2.transform(test['text'])
mdl2 = Ridge(alpha=1.0).fit(Xtr, train['ret_next'].values)
r2_good = mdl2.score(Xte, test['ret_next'].values)
print('Proper R^2 (time split, no leakage):', round(r2_good,4))
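
A single 70/30 cut gives one out-of-sample estimate. Walk-forward (expanding-window) evaluation repeats the same idea over several consecutive test blocks. The generator below is a minimal standard-library sketch; `walk_forward_splits` and its parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` is a library alternative.

```python
def walk_forward_splits(n, n_folds=3, min_train=0.5):
    """Yield (train_idx, test_idx) with every test block strictly after its training window."""
    start = int(n * min_train)
    fold = (n - start) // n_folds
    if start < 1 or fold < 1:
        raise ValueError('Not enough rows for the requested folds.')
    for k in range(n_folds):
        lo = start + k * fold
        # The last fold absorbs any leftover rows.
        hi = n if k == n_folds - 1 else lo + fold
        yield list(range(lo)), list(range(lo, hi))

splits = list(walk_forward_splits(10, n_folds=2, min_train=0.6))
```

For each pair, the vectorizer and model would be refit on `df.iloc[train_idx]` and scored on `df.iloc[test_idx]`, so no fold ever sees text or returns from its own future.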


**Key safeguards**
- Always **time‑split** (train on `t<=T`, test on `t>T`).
- Fit tokenizers/embeddings on **train only** (or use historical corpora).
- If using RAG, restrict retrieval to documents **published before t**.
- Log model versions and prompts; avoid tuning on the test window.
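
The retrieval safeguard can be sketched as a date filter applied before any similarity search. The document records and field names below are made up for illustration; a real store would need a reliable publication timestamp for each document.

```python
from datetime import date

def retrievable(docs, as_of):
    """Keep only documents published strictly before the prediction date `as_of`."""
    return [d for d in docs if d['published'] < as_of]

# Hypothetical document store (illustrative ids and dates).
docs = [
    {'id': 'a', 'published': date(2024, 3, 1)},
    {'id': 'b', 'published': date(2024, 3, 15)},
    {'id': 'c', 'published': date(2024, 4, 2)},
]
visible = [d['id'] for d in retrievable(docs, as_of=date(2024, 3, 15))]
```

Using a strict `<` excludes documents stamped on the prediction date itself, which is the conservative choice when intraday publication times are unknown.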

## 5) Next steps
- Replace placeholder URLs with specific accessible **FedEx** transcripts (or host text files you’re allowed to use).
- Expand the **surprise taxonomy** and evaluate on a small labeled set.
- For FOMC, track which quotes map to **2‑year yield** moves to calibrate precision.
- Consider `gemini-1.5-pro` for tougher IE; use caching/batching to control cost even on AI Studio.