# 20 · Two-Stage Retrieval — “Long Text → Summary → Edge List”  
_Last updated: 2025-05-03_

In real projects you often face **long documents** (30–100 pages) that exceed an LLM’s context window.  
A common workaround is a **two-stage pipeline**:

1. **Stage 1 – Summarize** (or *chunk* + *filter*): reduce the long text to the subset that matters.  
2. **Stage 2 – Extract**: feed the summary into a second prompt that outputs structured data.

This notebook shows the pattern on a **toy paragraph** so it runs instantly, but
the code structure is identical for full PDFs.



### If you use this code, please cite the paper: 

- Garg, P. and Fetzer, T., 2025. **Causal claims in economics**. arXiv preprint arXiv:2501.06873.


## Key handling

Identical to Notebook 00:

* Looks for `OPENAI_API_KEY` env var.  
* Else reads `key/openai_key.txt` (one line).  
* Raises an error if not found.


In [1]:
# %pip -q install --upgrade openai
import json
import os
import pathlib

import pandas as pd
from openai import OpenAI

# Locate key
key_path = pathlib.Path("key/openai_key.txt")
# Fall back to the key file when the env var is unset or empty
if not os.getenv("OPENAI_API_KEY") and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Add OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()


### 1 · Example long-ish text (placeholder)

Imagine this paragraph is one **chunk** (≈ 2500 tokens) from the first 30 pages
of a research paper.


In [3]:
page_text = (
    "This paper studies how access to microcredit affects household business creation in rural India. "
    "Using a randomized rollout, the authors show that microfinance availability increases the probability "
    "that a household starts a small enterprise by 7 percentage points. They also find evidence that "
    "female-headed households benefit disproportionately, while existing businesses do not grow. "
    "Stress related to loan repayment, however, increases among the poorest borrowers."
)


### 2 · Stage 1 — Summarize causal claims

We ask the LLM to produce a *single string* (`causal_claims`) containing every
explicit **cause → effect** statement.  
The schema has just one required field, which keeps the output predictable.


In [4]:
summary_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "stage1_summary",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "causal_claims": {"type": "string"}
            },
            "required": ["causal_claims"],
            "additionalProperties": False
        }
    }
}

stage1_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "List every explicit cause–effect statement from the text."},
        {"role": "user", "content": page_text}
    ],
    temperature=0.3,  # low: we want faithful extraction
    response_format=summary_schema
)

causal_claims_text = json.loads(stage1_resp.choices[0].message.content)["causal_claims"]
print("Stage-1 output ➜", causal_claims_text)


Stage-1 output ➜ 1. Access to microcredit increases the probability that a household starts a small enterprise by 7 percentage points.
2. Female-headed households benefit disproportionately from microfinance availability.
3. Existing businesses do not grow as a result of microfinance availability.
4. Stress related to loan repayment increases among the poorest borrowers.


### 3 · Stage 2 — Convert summary to structured edge list

Now we feed that condensed string into a second prompt that outputs
an **array of edges**, each carrying a `method` field alongside `claim`, `cause`, and `effect`.


In [6]:
edge_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "edges_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "edges": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "claim":  {"type": "string"},
                            "cause":  {"type": "string"},
                            "effect": {"type": "string"},
                            "method": {"type": "string"}
                        },
                        # strict mode: every property, including "method", must be listed here
                        "required": ["claim", "cause", "effect", "method"],
                        "additionalProperties": False
                    }
                }
            },
            "required": ["edges"],
            "additionalProperties": False
        }
    }
}


stage2_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Extract each relationship in 'A -> B' format. "
                    "Include method if explicitly stated."},
        {"role": "user", "content": causal_claims_text}
    ],
    temperature=0,
    response_format=edge_schema
)

edges = json.loads(stage2_resp.choices[0].message.content)["edges"]
pd.DataFrame(edges)


|   | claim | cause | effect | method |
|---|-------|-------|--------|--------|
| 0 | Access to microcredit increases the probabilit... | Access to microcredit | Probability of starting a small enterprise inc... | Statistical analysis |
| 1 | Female-headed households benefit disproportion... | Microfinance availability | Benefits to female-headed households | Comparative analysis |
| 2 | Existing businesses do not grow as a result of... | Microfinance availability | No growth in existing businesses | Observational study |
| 3 | Stress related to loan repayment increases amo... | Loan repayment | Increased stress among the poorest borrowers | Survey data analysis |
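
None of the `method` values in this table appear in the Stage-1 string, which suggests the model supplied them only because the schema demands a string. One possible adjustment, sketched below, is to let `method` be null. This assumes the API's strict mode accepts a string-or-null union for a required key, which is how optional fields are usually expressed there; `edge_schema_nullable`, `edges_v2`, and the sample reply are illustrative and not from the original notebook.

```python
import json

# Variant of edge_schema in which "method" may be null.
# Assumption: strict mode still needs "method" in "required",
# so "optional" is written as a string-or-null union.
edge_schema_nullable = {
    "type": "json_schema",
    "json_schema": {
        "name": "edges_v2",  # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "edges": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "claim":  {"type": "string"},
                            "cause":  {"type": "string"},
                            "effect": {"type": "string"},
                            "method": {"type": ["string", "null"]},
                        },
                        "required": ["claim", "cause", "effect", "method"],
                        "additionalProperties": False,
                    },
                }
            },
            "required": ["edges"],
            "additionalProperties": False,
        },
    },
}

# Hand-written reply (NOT real model output) showing the intended shape
sample_reply = json.dumps({
    "edges": [{
        "claim": "Existing businesses do not grow as a result of microfinance availability.",
        "cause": "Microfinance availability",
        "effect": "No growth in existing businesses",
        "method": None,
    }]
})

parsed_edges = json.loads(sample_reply)["edges"]
missing_method = [e for e in parsed_edges if e["method"] is None]
print(f"{len(missing_method)} of {len(parsed_edges)} edges have no stated method")
```

If you try this variant, it is probably worth telling the model in the system prompt to return null when no method is stated, then comparing the results against the table above.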


### 4 · Interpretation

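* **Stage 1** shrank ~80 words of prose into a concise numbered list of claims. This is useful for multi-page documents.
* **Stage 2** transformed that text into a table with `claim`, `cause`, `effect`, and `method` columns. Because the schema requires `method`, the model fills it in even when the input names no method, so treat those values with caution.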

In production:

* **Stage 1** can take other forms, e.g., *embedding search* + *LLM clean-up* instead of an LLM summary.
  * You can also pre-process the text with regex or other tools to extract sections of interest.
* **Chunking** logic decides the chunk size (e.g., 1k tokens).  
* The overall pattern stays identical: _retrieve ➜ condense ➜ parse_.
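
The loop over chunks can be sketched as follows. This is a rough illustration under stated assumptions: chunks are cut by word count as a crude stand-in for tokens (the 750-word default is only a rule-of-thumb guess at roughly 1k tokens), and `chunk_words`, `run_two_stage`, and the two stub functions are hypothetical names. In practice the stubs would be replaced by wrappers around the Stage-1 and Stage-2 API calls shown earlier.

```python
from typing import Callable

def chunk_words(text: str, max_words: int = 750) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

def run_two_stage(
    text: str,
    summarize: Callable[[str], str],
    extract: Callable[[str], list[dict]],
    max_words: int = 750,
) -> list[dict]:
    """Apply Stage 1 then Stage 2 to each chunk and pool the edges."""
    all_edges: list[dict] = []
    for chunk in chunk_words(text, max_words):
        summary = summarize(chunk)           # Stage 1: condense
        if summary.strip():                  # skip chunks with no claims
            all_edges.extend(extract(summary))  # Stage 2: parse
    return all_edges

# Stubs so the sketch runs without an API key; they do no real extraction
def stub_summarize(chunk: str) -> str:
    return chunk

def stub_extract(summary: str) -> list[dict]:
    return [{"claim": summary, "cause": None, "effect": None, "method": None}]

demo_edges = run_two_stage("a b c d e", stub_summarize, stub_extract, max_words=2)
print(len(demo_edges), "edges from", len(chunk_words("a b c d e", 2)), "chunks")
```

Passing the two stages in as callables keeps the loop unchanged when Stage 1 is swapped for embedding search or a regex pre-filter.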
