# 20 · Two-Stage Retrieval — “Long Text → Summary → Edge List”  
_Last updated: 2025-05-03_

In real projects you often face **long documents** (30–100 pages) that may not fit in an LLM’s context window.  
A common workaround is a **two-stage pipeline**:

1. **Stage 1 – Summarize** (or *chunk* + *filter*): reduce the long text to the subset that matters.  
2. **Stage 2 – Extract**: feed the summary into a second prompt that outputs structured data.

This notebook shows the pattern on a **toy paragraph** so it runs instantly, but
the code structure is identical for full PDFs.
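The control flow of the two stages can be sketched independently of any LLM call. The snippet below is a minimal illustration under stated assumptions: `two_stage_extract`, `demo_summarize`, and `demo_extract` are hypothetical names that do not appear in this notebook, and the two demo callables are crude string-based stand-ins for the real prompts.

```python
from typing import Callable, Dict, List


def two_stage_extract(
    chunks: List[str],
    summarize: Callable[[str], str],
    extract: Callable[[str], List[Dict]],
) -> List[Dict]:
    """Run Stage 1 on every chunk, join the summaries, then run Stage 2 once."""
    summaries = [summarize(chunk) for chunk in chunks]
    # Drop chunks that produced no summary so Stage 2 sees only real content.
    combined = " ".join(s.strip() for s in summaries if s.strip())
    if not combined:
        return []
    return extract(combined)


def demo_summarize(chunk: str) -> str:
    """Stand-in for the Stage 1 prompt: keep sentences containing 'increases'."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    return ". ".join(s for s in sentences if "increases" in s)


def demo_extract(summary: str) -> List[Dict]:
    """Stand-in for the Stage 2 prompt: one record per sentence."""
    return [{"claim": s.strip()} for s in summary.split(".") if s.strip()]


demo_chunks = [
    "Credit increases startups. The sample covers 50 villages.",
    "No claims here.",
]
demo_edges = two_stage_extract(demo_chunks, demo_summarize, demo_extract)
print(demo_edges)
```

Passing the stages in as callables keeps the orchestration testable without an API key; in the notebook, the two stand-ins would be replaced by functions that wrap the Stage 1 and Stage 2 requests.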



### If you use this code, please cite the paper: 

- Garg, P. and Fetzer, T., 2025. **Causal claims in economics**. arXiv preprint arXiv:2501.06873.


## Key handling

Identical to Notebook 00:

* Looks for `OPENAI_API_KEY` env var.  
* Else reads `key/openai_key.txt` (one line).  
* Raises an error if not found.


In [6]:
# %pip -q install --upgrade openai
import json
import os
import pathlib

import pandas as pd
from openai import OpenAI

# Locate key
key_path = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("Add OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()


### 1 · Example long-ish text (placeholder)

Imagine this passage is one **chunk** from the first 30 pages of a research paper.
In a real run a chunk might be ≈ 2500 tokens; the toy text below is much shorter so that it runs instantly.


In [12]:
page_text = (
    "This paper studies how access to microcredit affects household business creation and welfare outcomes "
    "in rural India. The analysis is part of a broader evaluation of financial inclusion policies implemented "
    "in the early 2000s, when microfinance institutions began expanding into underserved districts. The authors "
    "combine administrative data from participating banks with household survey responses to assess both direct "
    "and spillover effects of access to small-scale credit. \n\n"
    "Using a randomized rollout (an RCT), the study estimates that microfinance availability increases the "
    "probability that a household starts a small enterprise by 7 percentage points relative to control villages. "
    "Heterogeneity analysis shows that female-headed households experience a larger effect of +12 percentage points, "
    "while male-headed households see a smaller effect of +4 percentage points. These results are consistent with "
    "qualitative interviews suggesting that women primarily use new loans to fund self-employment activities such "
    "as tailoring and food processing, whereas men are more likely to invest in existing family businesses. \n\n"
    "For already established enterprises, the program has no detectable impact on firm growth (effect size "
    "approximately 0), suggesting that credit constraints are less binding for ongoing operations. Among the "
    "poorest borrowers, however, stress related to loan repayment increases by about 12%, measured via a survey-based "
    "stress index under the same randomized rollout. The authors note that repayment stress may reflect both higher "
    "financial responsibility and social pressure within joint-liability groups. \n\n"
    "Additional descriptive results show small but insignificant improvements in children's school attendance and "
    "no measurable change in consumption patterns within the study period. The paper concludes with a discussion "
    "of the policy relevance of these findings and the potential for targeted lending programs to promote inclusive "
    "entrepreneurship without exacerbating financial vulnerability."
)


### 2 · Stage 1 — Summarize causal claims

We ask the LLM to produce a *single string* (`causal_claims`) that states every
explicit **cause → effect** claim in prose, together with its method, direction, and effect size.  
The schema has just one required field, which keeps the output predictable.


In [13]:
summary_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "stage1_summary",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "causal_claims": {"type": "string"}
            },
            "required": ["causal_claims"],
            "additionalProperties": False
        }
    }
}

stage1_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Write a concise VERBAL summary that enumerates every explicit cause–effect claim in the text. "
                "Use short, declarative sentences. For EACH claim, explicitly state: "
                "(i) the cause and effect, (ii) the empirical method as named in the text (if given), "
                "(iii) the direction of effect (say 'increases' or 'decreases' or 'no detectable effect'), "
                "and (iv) the effect size as a numeral with the wording found (e.g., '7 percentage points', '12%'). "
                "If subgroup effects are reported (e.g., by gender or income), include SEPARATE sentences for each subgroup. "
                "Avoid bullet points, tables, or lists—just clear prose. Do not speculate beyond the text."
            )
        },
        {"role": "user", "content": page_text}
    ],
    temperature=0.3,
    response_format=summary_schema
)

causal_claims_text = json.loads(stage1_resp.choices[0].message.content)["causal_claims"]
print("Stage-1 output ➜\n", causal_claims_text)



Stage-1 output ➜
 Access to microcredit increases the probability that a household starts a small enterprise by 7 percentage points relative to control villages. (RCT) The effect is larger for female-headed households, which experience an increase of +12 percentage points. (RCT) Male-headed households see a smaller increase of +4 percentage points. (RCT) Access to microcredit has no detectable impact on firm growth for already established enterprises (effect size approximately 0). (RCT) Among the poorest borrowers, stress related to loan repayment increases by about 12%. (RCT) There are small but insignificant improvements in children's school attendance. (descriptive results) There is no measurable change in consumption patterns within the study period. (descriptive results)
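Because Stage 2 only ever sees the Stage 1 summary, any number that Stage 1 misreports is carried into the edge list. One cheap guard, sketched here as an assumption rather than as part of the original notebook, is to check that every numeral in the summary also occurs in the source text. The helper names `numbers_in` and `ungrounded_numbers` are hypothetical, and the regex is deliberately simple: it does not handle thousands separators or spelled-out numbers.

```python
import re
from typing import List


def numbers_in(text: str) -> List[str]:
    """Return every integer or decimal numeral in `text`, as strings."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def ungrounded_numbers(summary: str, source: str) -> List[str]:
    """List numerals that appear in the summary but nowhere in the source."""
    source_numbers = set(numbers_in(source))
    return [n for n in numbers_in(summary) if n not in source_numbers]


demo_source = "Startups rise by 7 percentage points; repayment stress rises by about 12%."
faithful_summary = "Credit increases startups by 7 percentage points."
drifted_summary = "Credit increases startups by 9 percentage points."

print(ungrounded_numbers(faithful_summary, demo_source))
print(ungrounded_numbers(drifted_summary, demo_source))
```

In this notebook the equivalent call would be `ungrounded_numbers(causal_claims_text, page_text)`; a non-empty result is a reason to re-run Stage 1 or inspect that chunk by hand.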


### 3 · Stage 2 — Convert summary to structured edge list

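Now we feed that condensed string into a second prompt that outputs
an **array of edges**. Every field, including `method`, is required by the strict schema;
when the summary names no method, the prompt asks for an empty string instead.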


In [15]:
edge_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "edges_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "edges": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "claim":            {"type": "string"},
                            "cause":            {"type": "string"},
                            "effect":           {"type": "string"},
                            "method":           {"type": "string"},
                            "effect_direction": {"type": "string", "enum": ["positive","negative","unclear/mixed"]},
                            "effect_size":      {"type": "number"}   # numeric only (strip symbols/units)
                        },
                        "required": ["claim","cause","effect","method","effect_direction","effect_size"],
                        "additionalProperties": False
                    }
                }
            },
            "required": ["edges"],
            "additionalProperties": False
        }
    }
}

stage2_resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "You are given a short VERBAL summary that describes causal claims (cause, effect, method, "
                "direction, and effect size) in sentences. Extract each claim and output strictly valid JSON "
                "matching the provided schema. \n\n"
                "Parsing rules:\n"
                "• Map direction words: 'increases'/'higher' → effect_direction='positive'; "
                "'decreases'/'lower' → 'negative'; 'no detectable effect'/'null' → 'unclear/mixed'.\n"
                "• Parse the main numeric magnitude in each sentence as effect_size (number only, no symbols/units). "
                "Examples: '7 percentage points' → 7; '12%' → 12; 'approximately 0' → 0.\n"
                "• If subgroup-specific effects are mentioned, emit a separate edge per subgroup and include the subgroup "
                "in the 'effect' text (e.g., 'household starts small enterprise (female-headed)').\n"
                "• 'method' should replicate the method name as stated in the sentence if available; otherwise set to an empty string.\n"
                "• 'claim' should be a human-readable concatenation, e.g., 'access to microcredit → starts a small enterprise (female-headed)'."
            )
        },
        {"role": "user", "content": causal_claims_text}
    ],
    temperature=0,
    response_format=edge_schema
)

edges = json.loads(stage2_resp.choices[0].message.content)["edges"]
pd.DataFrame(edges)


| | claim | cause | effect | method | effect_direction | effect_size |
|---|---|---|---|---|---|---|
| 0 | access to microcredit → starts a small enterprise | access to microcredit | household starts a small enterprise | RCT | positive | 7 |
| 1 | access to microcredit → starts a small enterpr... | access to microcredit | household starts a small enterprise (female-he... | RCT | positive | 12 |
| 2 | access to microcredit → starts a small enterpr... | access to microcredit | household starts a small enterprise (male-headed) | RCT | positive | 4 |
| 3 | access to microcredit → firm growth | access to microcredit | firm growth | RCT | unclear/mixed | 0 |
| 4 | stress related to loan repayment increases | access to microcredit | stress related to loan repayment | RCT | positive | 12 |
| 5 | access to microcredit → children's school atte... | access to microcredit | children's school attendance | descriptive results | unclear/mixed | 0 |
| 6 | access to microcredit → consumption patterns | access to microcredit | consumption patterns | descriptive results | unclear/mixed | 0 |
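The direction-mapping rules in the Stage 2 prompt are simple enough to re-implement by hand, which gives a cheap cross-check on the model's labels. The sketch below is an illustration and not part of the original notebook: `DIRECTION_RULES`, `rule_based_direction`, and `find_mismatches` are hypothetical names, the patterns cover only the exact words listed in the prompt, and checking the null pattern first is an assumption of this sketch.

```python
import re
from typing import List, Optional, Tuple

# Keyword patterns copied from the Stage 2 prompt's parsing rules.
# The null pattern is tried first (an assumption of this sketch).
DIRECTION_RULES = [
    (r"\bno detectable effect\b|\bnull\b", "unclear/mixed"),
    (r"\bincreases\b|\bhigher\b", "positive"),
    (r"\bdecreases\b|\blower\b", "negative"),
]


def rule_based_direction(sentence: str) -> Optional[str]:
    """Return the label of the first matching rule, or None if no rule applies."""
    lowered = sentence.lower()
    for pattern, label in DIRECTION_RULES:
        if re.search(pattern, lowered):
            return label
    return None


def find_mismatches(labelled: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (sentence, model_label, rule_label) where the two labels disagree."""
    mismatches = []
    for sentence, model_label in labelled:
        rule_label = rule_based_direction(sentence)
        # Sentences that no rule covers are skipped rather than flagged.
        if rule_label is not None and rule_label != model_label:
            mismatches.append((sentence, model_label, rule_label))
    return mismatches


labelled_demo = [
    ("Access to microcredit increases business creation by 7 percentage points.", "positive"),
    ("The program has no detectable effect on firm growth.", "unclear/mixed"),
    # Deliberately wrong label, to show what a flagged row looks like.
    ("Repayment stress is higher among the poorest borrowers.", "negative"),
]
demo_mismatches = find_mismatches(labelled_demo)
print(demo_mismatches)
```

A flagged row is not necessarily a model error, since the keyword rules are much cruder than the prompt, but it marks the edges worth reading by hand.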


### 4 · Interpretation

* **Stage 1** condensed a long piece of text into a compact prose summary of its cause–effect claims that also captured **method**, **direction**, and **effect size** for each claim.  
  This step is crucial when processing multi-page PDFs: it filters the text down to just the causal substance.

* **Stage 2** then parsed that summary into a structured **edge table** with fields  
  `claim`, `cause`, `effect`, `method`, `effect_direction`, and `effect_size`.  
  The output is ready for quantitative inspection, e.g., aggregating magnitudes, comparing positive vs. negative effects, or checking consistency across methods.
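As a rough illustration of that kind of inspection, the sketch below rebuilds three columns of the edge table shown above and summarizes them with pandas. The name `edges_df` is hypothetical. One caveat is worth stating loudly: `effect_size` mixes units here (percentage points and percent), and null results are coded as 0, so the per-method mean only demonstrates the mechanics and is not a meaningful pooled estimate.

```python
import pandas as pd

# Values copied from the Stage 2 output table (method, direction, size only).
edges_df = pd.DataFrame(
    {
        "method": ["RCT", "RCT", "RCT", "RCT", "RCT",
                   "descriptive results", "descriptive results"],
        "effect_direction": ["positive", "positive", "positive", "unclear/mixed",
                             "positive", "unclear/mixed", "unclear/mixed"],
        "effect_size": [7, 12, 4, 0, 12, 0, 0],
    }
)

# How many edges fall into each direction category.
direction_counts = edges_df["effect_direction"].value_counts()

# Number of edges and mean reported magnitude per method.
by_method = edges_df.groupby("method")["effect_size"].agg(["count", "mean"])

print(direction_counts)
print(by_method)
```

With the live notebook, `pd.DataFrame(edges)` can be used in place of the hand-built frame.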

In production:

* **Stage 1** may use hybrid retrieval (e.g., *embedding search + LLM clean-up*) to locate causal passages before summarization.  
  *Regex filters* or section headers (“Results”, “Empirical Strategy”) can further constrain the input.  
* **Chunking** logic sets the token budget per chunk (e.g., ≈ 1k tokens).  
* The overall pattern remains the same:  
  **retrieve → condense → parse → analyze**.
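The chunking and regex-filter steps listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline used in the paper: `approx_tokens`, `chunk_paragraphs`, `keep_causal`, and `CAUSAL_CUES` are hypothetical names, the token count is approximated by a whitespace word count instead of a real tokenizer, and the cue-word list is arbitrary.

```python
import re
from typing import List


def approx_tokens(text: str) -> int:
    """Rough size estimate: whitespace word count (an assumption, not a tokenizer)."""
    return len(text.split())


def chunk_paragraphs(text: str, max_tokens: int = 1000) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of ~max_tokens.

    A single paragraph larger than max_tokens is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for para in paragraphs:
        size = approx_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Illustrative cue words only; a real filter would need a broader list.
CAUSAL_CUES = re.compile(
    r"\b(increase[sd]?|decrease[sd]?|effect|impact|causes?)\b", re.IGNORECASE
)


def keep_causal(chunks: List[str]) -> List[str]:
    """Keep chunks that contain at least one causal cue word."""
    return [c for c in chunks if CAUSAL_CUES.search(c)]


demo_text = (
    "Intro paragraph with background only.\n\n"
    "Credit access increases startups by 7 points.\n\n"
    "The program has no detectable impact on firm growth."
)
demo_chunks = chunk_paragraphs(demo_text, max_tokens=8)
causal_chunks = keep_causal(demo_chunks)
print(len(demo_chunks), len(causal_chunks))
```

Splitting on paragraph boundaries avoids cutting a claim in half; the surviving chunks would then each go through the Stage 1 prompt.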
