# 01 — Single PDF Demo (OpenAI SDK ≥ 1.0)

This notebook runs the **AI Paper Agent** pipeline on one PDF using the **new OpenAI Python SDK**.  
Place this file in your repo at: `ai-paper-agent-starter/notebooks/01_single_pdf_demo.ipynb`.

**What it does:**
- Loads environment variables from `.env`
- Processes a single PDF → exports Markdown + JSON + CSV to `../outputs/`
- Shows a preview of the generated Markdown and JSON
- Includes an optional batch cell to process a whole folder of PDFs

> Before running, make sure `src/llm_extract.py` uses the SDK ≥ 1.0 client interface (`from openai import OpenAI`) and that `openai` ≥ 1.0 is installed.

## 0) Environment setup — load `.env` and confirm API key

In [1]:
import os
from dotenv import load_dotenv

# Load variables from .env at repo root (../.env relative to this notebook)
load_dotenv(dotenv_path="../.env")

# Optional: set/override for this session only (uncomment and paste a key if needed)
# os.environ["OPENAI_API_KEY"] = "sk-..."
# os.environ["OPENAI_MODEL"] = "gpt-5-mini"
# os.environ["TEMPERATURE"] = "0.0"

# Sensible defaults if not set
os.environ.setdefault("OPENAI_MODEL", "gpt-5-mini")
os.environ.setdefault("TEMPERATURE", "0.0")

print("API key present?", bool(os.getenv("OPENAI_API_KEY")))
print("Model:", os.getenv("OPENAI_MODEL"))
# `or ""` avoids a TypeError when the key is not set
print("Length:", len(os.getenv("OPENAI_API_KEY") or ""))

API key present? True
Model: gpt-4.1-mini
Length: 164
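
The check above prints only whether the key exists and how long it is. A slightly more informative variant is sketched below. It is not part of the repo: `describe_key` is a hypothetical helper that shows only the first three and last four characters, so the full key never lands in the notebook output.

```python
import os

def describe_key(name: str = "OPENAI_API_KEY") -> str:
    """Return a short, non-secret description of an environment variable."""
    value = os.getenv(name) or ""
    if not value:
        return f"{name}: missing"
    if len(value) <= 8:
        # Too short to mask meaningfully; report the length only.
        return f"{name}: set (length {len(value)})"
    # Show a prefix and suffix only; never print the whole key.
    return f"{name}: {value[:3]}...{value[-4:]} (length {len(value)})"

print(describe_key())
```

If the output says `missing`, re-check the `dotenv_path` passed to `load_dotenv` before going further.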


In [2]:
# Sanity check: confirm the API key (and optional project ID) can reach the model

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    project=os.getenv("OPENAI_PROJECT")
)

resp = client.chat.completions.create(
    model=os.getenv("OPENAI_MODEL"),
    messages=[{"role": "user", "content": "Say OK if the key + project ID works"}],
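    # max_completion_tokens supersedes max_tokens; it also counts reasoning
    # tokens, so raise it if a reasoning model returns an empty reply.
    max_completion_tokens=5,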
)

print(resp.choices[0].message.content)



OK


## 1) Run the pipeline on **one** PDF

In [3]:
import sys, json, pathlib
sys.path.append("../")  # allow `import src.*` from the notebook

from src.batch import process_pdf

# >>> EDIT THIS PATH to your own PDF inside ../data/...
pdf = "../data/my_papers/chatgpt_usuage.pdf"
# Example: pdf = "../data/my_papers/MyArticle.pdf"

out_dir = "../outputs"
result = process_pdf(pdf, out_dir)
result

{'md': '../outputs\\summaries\\chatgpt_usuage.md',
 'csv': '../outputs\\csv\\master_table.csv',
 'json': '../outputs\\summaries\\chatgpt_usuage.json'}
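
The paths above mix `/` and `\` because the notebook was run on Windows. They work as-is on Windows, but if you want to log or compare them with one separator, a minimal sketch using only the standard library is shown below. `to_posix` is a hypothetical helper, and `example` simply copies the dictionary printed above.

```python
from pathlib import PureWindowsPath

def to_posix(path_str: str) -> str:
    """Rewrite a path that may mix '/' and '\\' so that it uses '/' only."""
    # PureWindowsPath accepts both separators and never touches the filesystem.
    return PureWindowsPath(path_str).as_posix()

example = {
    "md": "../outputs\\summaries\\chatgpt_usuage.md",
    "csv": "../outputs\\csv\\master_table.csv",
    "json": "../outputs\\summaries\\chatgpt_usuage.json",
}
normalized = {kind: to_posix(path) for kind, path in example.items()}
print(normalized["md"])
```

One caveat: on Linux or macOS a filename may legally contain a backslash, and this helper would treat it as a separator.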

## 2) Preview the generated Markdown (first ~2000 chars)

In [4]:
from pathlib import Path

# show first ~2000 characters from the Markdown output
md_text = Path(result['md']).read_text(encoding='utf-8')
print(md_text[:2000])


# Article Summary

**Citation:** Bick et al., "The Rapid Adoption of Generative AI," National Bureau of Economic Research, 2024.

---
### 1) What is it about — Main questions
How is ChatGPT used by consumers globally and what are the patterns of usage by topic, intent, and demographics? _(pp. 1, 2, 11, 12, 25, 27, 33)_

### 1) Purpose / aim
To document the rapid growth, usage patterns, and demographic characteristics of ChatGPT users globally while preserving user privacy. _(pp. 1, 3, 5, 6)_

### 1) Theory / key concepts
ChatGPT likely improves worker output by providing decision support, particularly in knowledge-intensive jobs where productivity depends on decision quality. _(pp. 3, 35)_

### 2) Methods — Research design
Large-scale observational study using a privacy-preserving automated classification pipeline to analyze user messages and demographics from ChatGPT consumer plans between May 2024 and July 2025. _(pp. 3, 5, 6)_

### 2) Methods — Data sources
Data includes total daily

In [5]:
from IPython.display import Markdown, display

display(Markdown(md_text))

# Article Summary

**Citation:** Bick et al., "The Rapid Adoption of Generative AI," National Bureau of Economic Research, 2024.

---
### 1) What is it about — Main questions
How is ChatGPT used by consumers globally and what are the patterns of usage by topic, intent, and demographics? _(pp. 1, 2, 11, 12, 25, 27, 33)_

### 1) Purpose / aim
To document the rapid growth, usage patterns, and demographic characteristics of ChatGPT users globally while preserving user privacy. _(pp. 1, 3, 5, 6)_

### 1) Theory / key concepts
ChatGPT likely improves worker output by providing decision support, particularly in knowledge-intensive jobs where productivity depends on decision quality. _(pp. 3, 35)_

### 2) Methods — Research design
Large-scale observational study using a privacy-preserving automated classification pipeline to analyze user messages and demographics from ChatGPT consumer plans between May 2024 and July 2025. _(pp. 3, 5, 6)_

### 2) Methods — Data sources
Data includes total daily message volumes, randomly sampled de-identified user messages, and aggregated employment and education data accessed via a secure Data Clean Room. _(pp. 5, 6)_

### 2) Methods — Sample/participants
Approximately 1.1 million randomly sampled conversations from ChatGPT Free, Plus, and Pro users over May 2024 to June 2025, with additional samples for demographic analysis involving ~130,000 users. _(pp. 5, 6)_

### 2) Methods — Instruments/tools
Automated large language model classifiers deployed with privacy filters to categorize messages by work-relatedness, topic, intent (Asking, Doing, Expressing), and job-related intermediate work activities (O*NET IWAs). _(pp. 7, 11, 50)_

### 3) Analysis — Type
Descriptive and inferential analysis combining message classification frequencies with demographic regressions and validation against human annotations. _(pp. 3, 5, 27, 50)_

### 3) Analysis — Techniques/frameworks
Large Language Model-based text classification with prompt engineering, weighting to adjust for sampling, regression modeling to study demographic effects, and validation against human-coded datasets. _(pp. 5, 6, 27, 50)_

### 3) Analysis — Validation/reliability
Classifier outputs were validated by comparing with human annotations on the WildChat dataset, showing substantial agreement for most categories, with details provided on agreement statistics and biases. _(pp. 50, 51, 52, 53, 56)_

### 4) Results — Core findings
ChatGPT usage grew rapidly to 700 million weekly active users by July 2025. Non-work queries dominate (73% by June 2025). Most common conversation topics are Practical Guidance, Writing, and Seeking Information (77% combined). Gender gaps closed, with roughly equal male and female users. Younger users dominate, with nearly half under age 26. Work-related usage more common among educated users and high-paid professional occupations. Writing is the most prevalent work task (42%), mostly editing/modifying text. User intent classification finds 49% Asking, 40% Doing, and 11% Expressing. O*NET mapping indicates most work-related uses involve information processing and decision-making tasks. Interaction quality tends to be higher for Asking messages. _(pp. 1, 2, 3, 11, 12, 16, 17, 18, 19, 20, 25, 27, 33, 35)_

### 4) Results — Surprising results
The share of programming-related queries is low (4.2%), contrasting with other chatbot studies. Social-emotional or companionship queries are rare (under 2%). Gender gap in usage has narrowed dramatically over time, reversing initial male dominance. _(pp. 2, 3, 25, 26)_

### 4) Results — Contributions
Provides the first large-scale, privacy-preserving, internal analysis of ChatGPT usage globally. Develops novel classification taxonomies and links usage with rich demographic data. Identifies non-work applications' rapid growth and the predominance of decision support and writing tasks at work. Documents demographic shifts in usage patterns. _(pp. 1, 3, 5, 25, 27, 33)_

### 4) Results — Limitations
Data excludes non-consumer ChatGPT plans and logged-out users in early periods. Employment and education data limited to a subset of users and aggregated. Classifications rely on automated classifiers which, despite validation, have inherent uncertainty and potential misclassification biases. _(pp. 6, 27, 50)_

### 5) Future — Gaps
Need for further research on long-term economic impacts, generalizability beyond consumer plans, and detailed occupational adoption mechanisms. Analysis of corporate and educational plan users is missing. Improved classifier precision and ground truth validation could be enhanced. _(pp. 35)_

### 5) Future — Extensions
Extensions could include linking ChatGPT usage to productivity outcomes, exploring usage in developing countries more deeply, and evaluating interventions to increase adoption in underrepresented groups. _(pp. 35)_

### 5) Future — Your ideas
Suggest studying the causal impact of ChatGPT usage on individual and firm-level productivity, including decision quality measures. Develop finer-grained intent taxonomies to capture emerging use cases. Investigate longitudinal user behavior changes. _(pp. 35)_



## 3) Inspect JSON output (keys + example evidence pages)

In [6]:
import json, pathlib
data = json.loads(pathlib.Path(result['json']).read_text(encoding='utf-8'))
list(data.keys()), data.get("results_core", {}).get("evidence_pages", [])

(['citation',
  'about_main_questions',
  'about_purpose',
  'about_theory',
  'methods_design',
  'methods_data_sources',
  'methods_sample',
  'methods_instruments',
  'analysis_type',
  'analysis_techniques',
  'analysis_validation',
  'results_core',
  'results_surprising',
  'results_contributions',
  'results_limitations',
  'future_gaps',
  'future_extensions',
  'future_your_ideas',
  'raw_sections',
  'raw_llm_json'],
 [1, 2, 3, 11, 12, 16, 17, 18, 19, 20, 25, 27, 33, 35])
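
The cell above looks up evidence pages for `results_core` only. To see them for every field at once, something like the following sketch could be used. It assumes that each summary field is a dictionary with an `evidence_pages` list, which the output above confirms only for `results_core`. Entries with any other shape are skipped. `evidence_by_field` and the `sample` data are illustrative, not part of the repo.

```python
def evidence_by_field(data: dict) -> dict[str, list[int]]:
    """Map each field that carries an 'evidence_pages' list to its sorted pages."""
    pages = {}
    for key, value in data.items():
        # Skip entries that are not dicts or that carry no page list
        if isinstance(value, dict) and "evidence_pages" in value:
            pages[key] = sorted(set(value["evidence_pages"]))
    return pages

# Illustrative stand-in for the loaded JSON (shape assumed, see above)
sample = {
    "citation": "placeholder citation string",
    "results_surprising": {"evidence_pages": [25, 2, 3, 26, 3]},
    "future_gaps": {"evidence_pages": [35]},
}
for field, field_pages in evidence_by_field(sample).items():
    print(f"{field}: {field_pages}")
```

With the real output you would call `evidence_by_field(data)` on the dictionary loaded in the cell above.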

## 4) (Optional) Batch process all PDFs in a folder

In [7]:
import os, glob, sys
sys.path.append("../")
from src.batch import process_pdf

in_dir = "../data/my_papers"   # change this to your folder
out_dir = "../outputs"

pdfs = sorted(glob.glob(os.path.join(in_dir, "*.pdf")))
print("Found PDFs:", [os.path.basename(p) for p in pdfs])

all_results = []
for p in pdfs:
    print("Processing:", os.path.basename(p))
    all_results.append(process_pdf(p, out_dir))

print("\nWrote:")
for r in all_results:
    print(" -", r["md"])

Found PDFs: ['chatgpt_usuage.pdf']
Processing: chatgpt_usuage.pdf

Wrote:
 - ../outputs\summaries\chatgpt_usuage.md
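
The loop above stops at the first PDF that raises an exception. For larger folders it may be more convenient to record failures and continue. The sketch below shows one way to do that. `run_batch` is a hypothetical helper; it takes the processing function as an argument, so `process_pdf` can be passed in unchanged.

```python
import os

def run_batch(paths, process, out_dir):
    """Apply process(path, out_dir) to each path and collect results and failures."""
    results, failures = [], []
    for path in paths:
        try:
            results.append(process(path, out_dir))
        except Exception as exc:  # keep going; report the failure afterwards
            failures.append((os.path.basename(path), f"{type(exc).__name__}: {exc}"))
    return results, failures
```

Usage in the notebook would look like `all_results, failed = run_batch(pdfs, process_pdf, out_dir)`, followed by printing `failed` to see which files need attention.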


---

### Notes & Troubleshooting

- If the summary is empty, check that the **API key is present** in the first cell. You should see `API key present? True`.
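- If your PDF has **no selectable text** (a scanned image), the current parser will extract little or nothing. OCR is not supported yet.
- On Windows, open the Markdown files in an editor that reads UTF-8, such as **VS Code**. The notebook reads them as UTF-8, and an editor that assumes a legacy code page may show garbled characters.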
- Outputs go to `../outputs/summaries/` (Markdown + JSON) and `../outputs/csv/master_table.csv`.
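
To catch the scanned-PDF case from the notes above before spending an API call, a rough heuristic is to look at how much text was extracted per page. The sketch below assumes you already have a list of per-page text strings from whatever PDF parser the pipeline uses. `looks_scanned` and its threshold of 50 characters per page are illustrative assumptions, not something the repo provides.

```python
def looks_scanned(page_texts, min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is image-only from the text extracted per page."""
    if not page_texts:
        return True  # nothing was extracted at all
    total_chars = sum(len(text.strip()) for text in page_texts)
    # A very low average suggests the pages are images without a text layer
    return total_chars / len(page_texts) < min_chars_per_page
```

A short PDF with real text, such as a one-page abstract, will still clear the default threshold. Tune `min_chars_per_page` if your documents are unusually sparse.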