# GPT-4 API: Extract Structured Data from Papers

Before starting, create a `.env` file in the root directory of the project and add the following environment variable:

```bash
OPENAI_API_KEY=your_openai_api_key
```

## Structure

1. Create a new GPT assistant with PDF file search enabled.
2. Create a list of file paths of our PDF papers saved in the `pdfs` directory.
3. Use the assistant to extract structured data from a single PDF file.
4. Standardize GPT's response to JSON format.
5. Save the data to Airtable.
6. [Exercise] Extract data from multiple PDFs in parallel.

## Step 1: Create a new GPT Assistant with File Search Enabled

The Assistants API allows you to build AI assistants within your own applications. An Assistant has instructions and can leverage models, tools, and files to respond to user queries. The Assistants API currently supports three types of tools: Code Interpreter, File Search, and Function calling.

- [GPT Assistants API documentation](https://platform.openai.com/docs/assistants/overview)

In [1]:
!pip -q install python-dotenv openai

In [2]:
# Load OPENAI_API_KEY value from .env file
from dotenv import load_dotenv

load_dotenv()

True
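
`load_dotenv()` returning `True` only means a `.env` file was found, not that the key inside it is usable. The following is a minimal sketch of an explicit check; the helper name `require_env` is an assumption for illustration and is not part of `python-dotenv` or `openai`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example (run after load_dotenv()):
# require_env("OPENAI_API_KEY")
```

Failing here gives a more readable message than an authentication error raised later by the API client.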

In [3]:
from openai import OpenAI

client = OpenAI()

assistant_name = "Paper Data Extractor"

# List available assistants and create a new one if "Paper Data Extractor" is not in the list
assistants = client.beta.assistants.list()
assistant_names = [assistant.name for assistant in assistants.data]

# If assistant with the given name does not exist, create a new one
if assistant_name not in assistant_names:
    assistant = client.beta.assistants.create(
        name=assistant_name,
        model="gpt-4-turbo",
        tools=[{"type": "file_search"}],
    )
    print(f"New assistant created: {assistant.id}")
# If assistant with the given name exists, use it
else:
    assistant = assistants.data[assistant_names.index(assistant_name)]
    print(f"Assistant '{assistant_name}' found with ID: {assistant.id}")

Assistant 'Paper Data Extractor' found with ID: asst_WQsHaxQTTCkAPUbn6nXS1kbr


## Step 2: Read the PDF file paths

In [4]:
# Read all files in "current dir/pdfs" and save to file_paths
import os

file_paths = []

for root, dirs, files in os.walk("pdfs"):
    for file in files:
        if file.endswith(".pdf"):
            file_paths.append(os.path.join(root, file))

file_paths

['pdfs/1-Tracking Real Time Layoffs with SEC Filings - A Preliminary Investigation.pdf',
 'pdfs/2-Overnight Post-Earnings Announcement Drift and SEC Form 8-K Disclosures.pdf',
 'pdfs/3-Forecasting Stock Excess Returns With SEC 8-K Filings.pdf']
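
`os.walk` does not guarantee any particular file order, and `endswith(".pdf")` skips files named `.PDF`. If a stable order matters (for example, so that `file_paths[0]` always points to the same paper), one possible variant uses `pathlib`. The function name `list_pdf_paths` is an assumption for illustration.

```python
from pathlib import Path


def list_pdf_paths(directory: str) -> list[str]:
    """Return all PDF paths under `directory`, sorted for a stable order."""
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"  # case-insensitive match
    )


# Example:
# file_paths = list_pdf_paths("pdfs")
```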

## Step 3: Use assistant to extract data from a PDF

In [5]:
# Upload a single PDF to the assistant
file_path = file_paths[0]
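# Open the file in a context manager so the handle is closed after the upload
with open(file_path, "rb") as pdf_file:
    message_file = client.files.create(file=pdf_file, purpose="assistants")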
print(f"Uploaded '{file_path}' to OpenAI")

Uploaded 'pdfs/1-Tracking Real Time Layoffs with SEC Filings - A Preliminary Investigation.pdf' to OpenAI


In [6]:
# Here's where the magic happens!
# We ask GPT to extract data from the PDF and return it as a structured JSON object
prompt = """Extract the following data from the provided paper: Title, the research questions, the types of data used for the study, the size of the data set (i.e. how many samples were analyzed), the history of the dataset (i.e. how many years does it cover), the source of the data, the methods used to answer the research questions, the various metrics used for measuring, and the outcomes the authors found. Return the extracted structured data as a JSON object. Only respond with the JSON object, and do not respond with anything else.

Return your response as a structured JSON object using the following format:
''' 
{
  "title_of_paper": "What is the title of the paper?", // string: the title of the paper
  "research_questions": ["What is the research question?"], // array of strings: the research questions. If there are multiple research questions, list them all as separate items in the array
  "data_types": ["What types of data were used?"], // array of strings: the types of data used for the study. If there are multiple types of data, list them all as separate items in the array
  "data_size": "What is the size of the dataset?", // string: the size of the data set, i.e. number of observations, samples, etc.
  "data_history": "How many years does the dataset cover?", // string: the history of the dataset
  "data_sources": ["What are the sources of the data?"], // array of string: the sources of the data. If there are multiple sources, list them all as separate items in the array
  "methods": ["What methods were used to answer the research questions?"], // array of string: the methods used to answer the research questions. If there are multiple methods, list them all as separate items in the array
  "metrics": ["What metrics were used for measuring?"], // array of string: the various metrics used for measuring. If there are multiple metrics, list them all as separate items in the array
  "outcomes": ["What outcomes did the authors find?"] // array of string: the outcomes the authors found. If there are multiple outcomes, list them all as separate items in the array
}
'''

Response:
"""

# Create a conversation thread and attach the file to the message
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
            "attachments": [
                {"file_id": message_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)

# The thread now has a vector store with that file in its tool resources.
print(thread.tool_resources.file_search)

ToolResourcesFileSearch(vector_store_ids=['vs_PacY8x8TzTdslL6ST0p94Q7K'])


In [7]:
# Request the assistant to run the thread and create a response
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

# Stop early if the run did not finish successfully
if run.status != "completed":
    raise RuntimeError(f"Run ended with status '{run.status}'")

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
# Remove file-citation annotations from the response text
for annotation in message_content.annotations:
    message_content.value = message_content.value.replace(annotation.text, "")

print(message_content.value)

```json
{
  "title_of_paper": "Tracking Real Time Layoffs with SEC Filings: A Preliminary Investigation",
  "research_questions": ["How are 8-K filings used as a new source of data on layoffs?", "How do 8-K filed layoffs correlate with other unemployment and layoff indicators?", "How can 8-K filings be used to forecast important labor market dynamics?"],
  "data_types": ["Company names from 8-K filings", "Linked WARN notices", "Employment data from Compustat"],
  "data_size": "285 linked layoffs",
  "data_history": "Covers recent years including specific references to 2022 and 2023",
  "data_sources": ["8-K filings", "WARN Notices", "Compustat employment data"],
  "methods": ["Sentence embeddings using BERT", "Large language models (Llama 2) for natural language processing", "Analyzing filing dates and employment data correlation", "Fuzzy matching of company names between WARN and 8-K data"],
  "metrics": ["Number of reported layoff events", "Number of affected workers", "Time differen
```

## Step 4: Convert GPT's string output to JSON

In [22]:
import re
import json

# Remove the leading "```json" and trailing "```" from GPT's response
json_string = (
    message_content.value.replace("```json", "")
    .replace("```", "")
    .replace("\n", "")  # Remove newlines
)
# Remove citation annotations; the non-greedy ".*?" stops at the first "】"
# so that text between two separate annotations is preserved
json_string = re.sub(r"【.*?】", "", json_string)
# Parse the cleaned string into a Python dict
extracted_data_json = json.loads(json_string)

extracted_data_json

{'title_of_paper': 'Tracking Real Time Layoffs with SEC Filings: A Preliminary Investigation',
 'research_questions': ['How are 8-K filings used as a new source of data on layoffs?',
  'How do 8-K filed layoffs correlate with other unemployment and layoff indicators?',
  'How can 8-K filings be used to forecast important labor market dynamics?'],
 'data_types': ['Company names from 8-K filings',
  'Linked WARN notices',
  'Employment data from Compustat'],
 'data_size': '285 linked layoffs',
 'data_history': 'Covers recent years including specific references to 2022 and 2023',
 'data_sources': ['8-K filings', 'WARN Notices', 'Compustat employment data'],
 'methods': ['Sentence embeddings using BERT',
  'Large language models (Llama 2) for natural language processing',
  'Analyzing filing dates and employment data correlation',
  'Fuzzy matching of company names between WARN and 8-K data'],
 'metrics': ['Number of reported layoff events',
  'Number of affected workers',
  'Time differen
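
The cleanup above assumes the reply is exactly one fenced JSON block. A slightly more defensive sketch, shown below, strips citation markers and then parses only the span between the first `{` and the last `}`, so extra text before or after the object is ignored. The function name `parse_json_response` and the sample marker in the comment are illustrative assumptions.

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Parse a model reply that may wrap JSON in a Markdown fence and citations."""
    # Drop citation markers such as 【4:0†source】, one marker at a time
    cleaned = re.sub(r"【[^】]*】", "", text)
    # Keep only the span from the first "{" to the last "}"
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the response")
    return json.loads(cleaned[start : end + 1])


# Example:
# extracted_data_json = parse_json_response(message_content.value)
```

Because `json.loads` accepts newlines between tokens, this version does not need to remove them first.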

In [23]:
print("Title:", extracted_data_json["title_of_paper"])
print("Research Questions:\n - " + "\n - ".join(extracted_data_json["research_questions"]))
print("Outcomes:\n - " + "\n - ".join(extracted_data_json["outcomes"]))

Title: Tracking Real Time Layoffs with SEC Filings: A Preliminary Investigation
Research Questions:
 - How are 8-K filings used as a new source of data on layoffs?
 - How do 8-K filed layoffs correlate with other unemployment and layoff indicators?
 - How can 8-K filings be used to forecast important labor market dynamics?
Outcomes:
 - 8-K filings provide timely data on layoffs
 - 8-K filings data does not cover all layoffs but captures significant subset
 - 8-K filings data can be useful for forecasting unemployment rate and initial unemployment insurance claims


In [18]:
!pip -q install pandas

In [24]:
# Convert the JSON response to a pandas DataFrame
import pandas as pd

extracted_data_df = pd.DataFrame([extracted_data_json])

extracted_data_df

| | title_of_paper | research_questions | data_types | data_size | data_history | data_sources | methods | metrics | outcomes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Tracking Real Time Layoffs with SEC Filings: A... | [How are 8-K filings used as a new source of d... | [Company names from 8-K filings, Linked WARN n... | 285 linked layoffs | Covers recent years including specific referen... | [8-K filings, WARN Notices, Compustat employme... | [Sentence embeddings using BERT, Large languag... | [Number of reported layoff events, Number of a... | [8-K filings provide timely data on layoffs, 8... |
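
The model is asked to follow a fixed format, but nothing guarantees that every key is present or has the expected type. As a rough sketch, the record can be normalized before it is stored. The field names come from the prompt in Step 3; the helper name `normalize_record` and the choice of defaults (empty string, empty list) are assumptions.

```python
STRING_FIELDS = ["title_of_paper", "data_size", "data_history"]
LIST_FIELDS = [
    "research_questions",
    "data_types",
    "data_sources",
    "methods",
    "metrics",
    "outcomes",
]


def normalize_record(record: dict) -> dict:
    """Return a copy of `record` with exactly the expected keys and types."""
    normalized = {}
    for field in STRING_FIELDS:
        value = record.get(field)
        normalized[field] = "" if value is None else str(value)
    for field in LIST_FIELDS:
        value = record.get(field)
        if value is None:
            value = []
        elif isinstance(value, str):
            value = [value]  # a single string becomes a one-item list
        normalized[field] = [str(item) for item in value]
    return normalized


# Example:
# extracted_data_json = normalize_record(extracted_data_json)
```

Keys that are not in the two lists are dropped, which keeps every row compatible with the same table columns.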


## Step 5: Store extracted data to Airtable

Airtable looks like a spreadsheet but acts like a database. It's a platform that allows you to organize and structure data. To add data to Airtable, create a new base and table, then add records either through the Airtable web interface or through an API webhook. In this case, we will use a webhook.

- Create a new base and table in [Airtable](https://airtable.com/).
- Create a new automation using the "When webhook received" trigger and "Create record" action.
- Map the JSON fields to the fields in the Airtable table inside the "Create record" tab. For example, map the `title_of_paper` field to the `Title` field in the Airtable table.

Alternative storage options: a pandas DataFrame saved to a local Parquet file, MongoDB, SQL databases (MySQL, PostgreSQL, etc.), DynamoDB, Google Sheets, etc.
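
To illustrate the SQL option without running a database server, here is a minimal sketch using SQLite from the Python standard library. SQLite is a stand-in chosen for convenience and is not one of the systems named above; the table name `papers`, its two columns, and the helper `save_record` are assumptions.

```python
import json
import sqlite3


def save_record(conn: sqlite3.Connection, record: dict) -> None:
    """Store one extracted record, keyed by paper title, as a JSON string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers (title TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    # INSERT OR REPLACE overwrites an earlier extraction of the same paper
    conn.execute(
        "INSERT OR REPLACE INTO papers (title, data) VALUES (?, ?)",
        (record["title_of_paper"], json.dumps(record)),
    )
    conn.commit()


# Example:
# conn = sqlite3.connect("papers.db")
# save_record(conn, extracted_data_json)
```

Storing the whole record as JSON keeps the list-valued fields intact; a production schema would more likely split them into separate tables.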

In [20]:
!pip -q install requests

In [25]:
import requests

# The webhook URL for the table in Airtable
webhook_url = "https://hooks.airtable.com/workflows/v1/genericWebhook/apphgC5p4..."

# Make the POST request to insert the data
headers = {"Content-Type": "application/json"}
response = requests.post(webhook_url, json=extracted_data_json, headers=headers)

# Check the response
if response.status_code == 200:
    print("Data inserted successfully!")
else:
    print(f"Failed to insert data: {response.text}")

Data inserted successfully!
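
A single POST without a timeout can hang or fail on a transient error. The sketch below retries a failed request a few times with exponential backoff. The helper `post_with_retry`, the retry count, the 30-second timeout, and the backoff values are all assumptions; `post` is any callable with the same keyword interface as `requests.post`, which makes the helper easy to test without network access. Exceptions raised by `post` itself (for example, connection errors) are not caught here.

```python
import time


def post_with_retry(post, url, payload, retries=3, backoff_seconds=1.0):
    """Call `post(url, json=payload, timeout=30)` until it returns a 2xx status."""
    last_status = None
    for attempt in range(retries):
        response = post(url, json=payload, timeout=30)
        last_status = response.status_code
        if 200 <= last_status < 300:
            return response
        if attempt < retries - 1:
            # Wait 1x, 2x, 4x, ... the base delay before trying again
            time.sleep(backoff_seconds * 2**attempt)
    raise RuntimeError(
        f"Webhook still failing after {retries} attempts (last status {last_status})"
    )


# Example:
# post_with_retry(requests.post, webhook_url, extracted_data_json)
```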


Airtable table result after processing the first two PDFs:

![Airtable Table](https://i.imgur.com/0KCixZn.png)

## Step 6: Process all PDF files

In [26]:
# Exercise: Develop methods to extract structured data from multiple PDFs 
# and insert them into the Airtable table.
# Bonus: Use `pandarallel` to parallelize the extraction process and speed up the data extraction.
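
As a possible starting point for the exercise, the scaffold below runs an extraction function over several files concurrently. It uses `concurrent.futures` from the standard library rather than `pandarallel`, since the work is mostly waiting on API calls. `extract_fn` is a hypothetical function you would write yourself by wrapping Steps 3 and 4 (upload, run, parse); `extract_all` and `max_workers=3` are likewise assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_all(file_paths, extract_fn, max_workers=3):
    """Run `extract_fn(path)` for every path in parallel threads.

    Returns (results, errors): two dicts keyed by file path, so one failing
    PDF does not discard the results of the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_fn, path): path for path in file_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going and report the failure later
                errors[path] = exc
    return results, errors


# Example, with your own extraction function:
# results, errors = extract_all(file_paths, extract_fn)
```

Keeping `max_workers` small helps stay within API rate limits.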