# 10 · Single-Stage Retrieval — Abstract → Edge List  
_Last updated: 2025-05-03_

**Goal** — turn unstructured paper abstracts into a tidy *edge list* of cause-effect
statements (source → sink).  
We’ll:

1. **Build** a tiny DataFrame of three abstracts and save it as a TSV.  
2. **Define** a minimal JSON schema (`edges_v0`).  
3. **Prompt** GPT-4o-mini to extract edges that match the schema.  
4. **Parse → Normalize → Inspect** the results.

You can scale this pattern to hundreds of papers by batching requests; here we keep
everything in-memory so the demo runs instantly.



### If you use this code, please cite the paper: 

- Garg, P. and Fetzer, T., 2025. **Causal claims in economics**. arXiv preprint arXiv:2501.06873.


### API Key Handling

* The notebook **never** hard-codes an OpenAI key.  
* We first look for an environment variable `OPENAI_API_KEY`.  
* If absent, we try to read `key/openai_key.txt` (one line, your key).  
* If still missing, we raise an error with clear setup instructions.

This mirrors the pattern used in _Notebook 00 · Smoke Test_ so we only use
one way to manage secrets.


In [1]:
# Uncomment the next line to install or upgrade the SDK quietly
# %pip -q install --upgrade openai

import os, pathlib, pandas as pd, json, textwrap
from openai import OpenAI

# -------------------------------------------------------------------
# Locate API key: env var ➜ key/openai_key.txt
# -------------------------------------------------------------------
key_path = pathlib.Path("key/openai_key.txt")

if os.getenv("OPENAI_API_KEY") is None and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()

if not os.getenv("OPENAI_API_KEY"):
    raise ValueError(
        "No API key found.\n"
        "Create key/openai_key.txt (single line) or export OPENAI_API_KEY in your shell."
    )

client = OpenAI()  # SDK reads the key from the env var


### 1 · Load three sample abstracts

We manufacture a *tiny* DataFrame on the fly and dump it to
`int_data/abstracts_sample.tsv` (so you can inspect the raw file if you like).


In [None]:
sample = pd.DataFrame({
    "id":    ["paper_1", "paper_2", "paper_3"],
    "title": ["Does Sleep Affect Productivity?",
              "Trade Shocks and Wages",
              "Climate Policy and Innovation"],
    "abstract": [
        "Using survey data we show that lack of sleep causes lower productivity. Stress also leads to sleep deprivation.",
        "A sudden tariff increase reduces exports and wages in affected industries.",
        "Stronger carbon pricing increases patenting in clean technologies."
    ]
})

path = pathlib.Path("int_data")
path.mkdir(parents=True, exist_ok=True)
sample.to_csv(path / "abstracts_sample.tsv", sep="\t", index=False)
sample


|   | id | title | abstract |
|---|----|-------|----------|
| 0 | paper_1 | Does Sleep Affect Productivity? | Using survey data we show that lack of sleep c... |
| 1 | paper_2 | Trade Shocks and Wages | A sudden tariff increase reduces exports and w... |
| 2 | paper_3 | Climate Policy and Innovation | Stronger carbon pricing increases patenting in... |


### 2 · Minimal JSON schema (`edges_v0`)

We only care about four fields:

| field | type    | notes |
|-------|---------|-------|
| `claim` | string  | human-readable: `"lack of sleep → lower productivity"` |
| `source` | string | left-hand variable |
| `sink`   | string | right-hand variable |
| `is_causal` | boolean | `True` if authors claim causality |

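The schema is strict (`"strict": True`, with `additionalProperties: false` and a
full `required` list on every object), so the LLM can't slip in extra keys or
drop expected ones.  **Why strict?** Downstream code can rely on one fixed
shape instead of guarding against format drift.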
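
To make the shape concrete, here is a minimal sketch of a local check that mirrors `edges_v0`. The helper `validate_edges` and the `EDGE_FIELDS` mapping are hypothetical names introduced for illustration; they are not part of the notebook or the OpenAI SDK, and the API enforces the schema on its own side. A check like this is mainly useful for replies loaded from disk or produced by another model.

```python
# Hypothetical helper (not part of the notebook): a hand-rolled check that a
# parsed reply has exactly the shape edges_v0 describes.
EDGE_FIELDS = {"claim": str, "source": str, "sink": str, "is_causal": bool}

def validate_edges(payload):
    """Return a list of problems; an empty list means the payload conforms."""
    if not isinstance(payload, dict) or set(payload) != {"edges"}:
        return ["top level must be an object with exactly one key, 'edges'"]
    if not isinstance(payload["edges"], list):
        return ["'edges' must be an array"]
    problems = []
    for i, edge in enumerate(payload["edges"]):
        if not isinstance(edge, dict):
            problems.append(f"edge {i}: not an object")
            continue
        missing = EDGE_FIELDS.keys() - edge.keys()
        extra = edge.keys() - EDGE_FIELDS.keys()
        if missing:
            problems.append(f"edge {i}: missing {sorted(missing)}")
        if extra:
            problems.append(f"edge {i}: unexpected {sorted(extra)}")
        for name, expected in EDGE_FIELDS.items():
            if name in edge and not isinstance(edge[name], expected):
                problems.append(f"edge {i}: {name} should be {expected.__name__}")
    return problems

good = {"edges": [{"claim": "lack of sleep causes lower productivity",
                   "source": "lack of sleep", "sink": "lower productivity",
                   "is_causal": True}]}
# Missing is_causal and carrying an extra key: both violate the schema.
bad = {"edges": [{"claim": "x", "source": "a", "sink": "b", "confidence": 0.9}]}

print(validate_edges(good))
print(validate_edges(bad))
```

The first call reports no problems; the second flags the missing `is_causal` field and the extra `confidence` key.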


In [None]:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "edges_v0",
        "strict": True,
        "schema": {
            "type": "object",  # root must be an object; the array lives under "edges"
            "properties": {
                "edges": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "claim":    {"type": "string"},
                            "source":   {"type": "string"},
                            "sink":     {"type": "string"},
                            "is_causal":{"type": "boolean"}
                        },
                        "required": ["claim", "source", "sink", "is_causal"],
                        "additionalProperties": False
                    }
                }
            },
            "required": ["edges"],
            "additionalProperties": False
        }
    }
}

system_prompt = (
    "You are an expert annotator. "
    "Extract every explicit cause–effect statement from the abstract and "
    "return JSON that matches the provided schema."
)


### 3 · Call GPT-4o-mini for each abstract

* **Temperature 0.7** → leaves room for varied phrasing; for extraction you may prefer a value near 0 so that repeated runs return more consistent edges.  
* We embed the title + abstract in the prompt; length is ~70 tokens, well under limits.  
* We parse the JSON reply with `json.loads`.
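
The parsing step can be made a little more defensive. The sketch below assumes the reply string may occasionally be empty or cut off (for example, if generation stops early); `parse_edges` is a hypothetical helper, and the sample reply is hard-coded so the snippet runs without an API call.

```python
import json

def parse_edges(raw):
    """Parse a JSON reply string; return its edge list, or [] if unusable."""
    if not raw:  # None or empty string
        return []
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    edges = payload.get("edges", [])
    return edges if isinstance(edges, list) else []

# Hard-coded sample reply in the edges_v0 shape (illustrative, not a real API response).
reply = ('{"edges": [{"claim": "stress leads to sleep deprivation", '
         '"source": "stress", "sink": "sleep deprivation", "is_causal": true}]}')
truncated = reply[:40]  # simulates a reply that was cut off mid-string

print(parse_edges(reply)[0]["sink"])
print(parse_edges(truncated))
print(parse_edges(None))
```

Returning an empty list keeps the collection loop simple, at the cost of hiding failures; logging the paper id whenever the list comes back empty is a reasonable complement.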


In [5]:
def run_extract(row):
    user_prompt = f"Title: {row.title}\nAbstract: {row.abstract}\nExtract edges."
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_prompt}
        ],
        temperature=0.7,
        response_format=response_format
    )
    return json.loads(response.choices[0].message.content)["edges"]

# Collect edges for all three papers
edges = []
for _, r in sample.iterrows():
    for edge in run_extract(r):
        edge["paper_id"] = r.id
        edges.append(edge)

df_edges = pd.json_normalize(edges)
df_edges


|   | claim | source | sink | is_causal | paper_id |
|---|-------|--------|------|-----------|----------|
| 0 | lack of sleep causes lower productivity | lack of sleep | lower productivity | True | paper_1 |
| 1 | stress leads to sleep deprivation | stress | sleep deprivation | True | paper_1 |
| 2 | A sudden tariff increase reduces exports | sudden tariff increase | exports | True | paper_2 |
| 3 | A sudden tariff increase reduces wages in affe... | sudden tariff increase | wages in affected industries | True | paper_2 |
| 4 | A reduction in exports leads to a reduction in... | reduction in exports | reduction in wages in affected industries | True | paper_2 |
| 5 | Stronger carbon pricing increases patenting in... | Stronger carbon pricing | patenting in clean technologies | True | paper_3 |


### 4 · Quick sanity check

* **Are all expected pairs present?**  
  *Paper 1* → ✅ `lack of sleep → lower productivity`, ✅ `stress → sleep deprivation`.  
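  *Paper 2* → ✅ `tariff increase → lower exports`, ✅ `tariff increase → lower wages`. The model also returned a third edge, `reduction in exports → reduction in wages`, which the abstract does not state explicitly and is worth reviewing by hand.  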
  *Paper 3* → ✅ `carbon pricing → more clean-tech patents`.

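* **Do directions make sense?** Every source → sink ordering looks plausible. Note that `edges_v0` has no sign field: the direction of each effect lives in the `claim` text and only sometimes in the `sink` label (`lower productivity` vs. plain `exports`).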
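
The same check can be scripted. This is a minimal sketch that uses keyword matching, since the model's labels will not match a hand-written list verbatim. The edge records are copied from the output table above; the `expected` list and the `matches` helper are hypothetical and would need to be written per dataset.

```python
# Edge records copied from the output table above (claim text omitted).
edges = [
    {"paper_id": "paper_1", "source": "lack of sleep", "sink": "lower productivity"},
    {"paper_id": "paper_1", "source": "stress", "sink": "sleep deprivation"},
    {"paper_id": "paper_2", "source": "sudden tariff increase", "sink": "exports"},
    {"paper_id": "paper_2", "source": "sudden tariff increase",
     "sink": "wages in affected industries"},
    {"paper_id": "paper_2", "source": "reduction in exports",
     "sink": "reduction in wages in affected industries"},
    {"paper_id": "paper_3", "source": "Stronger carbon pricing",
     "sink": "patenting in clean technologies"},
]

# Hypothetical expectations: (paper_id, keyword in source, keyword in sink).
expected = [
    ("paper_1", "sleep", "productivity"),
    ("paper_1", "stress", "sleep"),
    ("paper_2", "tariff", "exports"),
    ("paper_2", "tariff", "wages"),
    ("paper_3", "carbon pricing", "patent"),
]

def matches(edge, spec):
    """True if the edge belongs to the paper and both keywords appear in its labels."""
    paper_id, source_kw, sink_kw = spec
    return (edge["paper_id"] == paper_id
            and source_kw in edge["source"].lower()
            and sink_kw in edge["sink"].lower())

# Expected pairs that no extracted edge satisfies.
missing = [s for s in expected if not any(matches(e, s) for e in edges)]
# Extracted edges that no expectation accounts for.
unexpected = [e for e in edges if not any(matches(e, s) for s in expected)]

print("missing:", missing)
print("unexpected:", [(e["source"], e["sink"]) for e in unexpected])
```

On this data nothing is missing, and the one unexpected edge is the exports → wages link for paper 2.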

> **Take-away:** with fewer than 30 lines of extraction code and a short strict
schema we turned raw English abstracts into a structured edge list ready for a
database or graph-analysis package.
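
As one illustration of the graph step, the sketch below builds a plain adjacency mapping with the standard library rather than a dedicated graph package. The `(source, sink)` pairs are copied from four rows of the output above; the variable names are illustrative.

```python
from collections import defaultdict

# (source, sink) pairs copied from four rows of the extracted edge list.
pairs = [
    ("stress", "sleep deprivation"),
    ("lack of sleep", "lower productivity"),
    ("sudden tariff increase", "exports"),
    ("sudden tariff increase", "wages in affected industries"),
]

# Adjacency list: each source maps to the sinks it points at, in input order.
graph = defaultdict(list)
for source, sink in pairs:
    graph[source].append(sink)

# Number of outgoing edges per source node.
out_degree = {node: len(sinks) for node, sinks in graph.items()}

# Sinks that also appear as a source would form causal chains.
all_sinks = {sink for _, sink in pairs}
chained = sorted(all_sinks & set(graph))

print(out_degree)
print(chained)
```

No chains are found here even though paper 1 implies one: `sleep deprivation` and `lack of sleep` are different strings. Labels would need normalizing before chains such as stress → sleep → productivity become visible.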
