# Reading & Resources

- [GitHub repository listing free LLM API resources](https://github.com/cheahjs/free-llm-api-resources)
- [Harnessing the power of LLMs for automated data extraction](https://www.seldon.io/harnessing-the-power-of-llms-for-automated-data-extraction/)

In [None]:
!pip install python-dotenv
!pip install pandas
!pip install openai==2.8.1

# Imports

In [None]:
import os
import requests
import json
from IPython.display import display

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

# Data Extraction

## Load env variables

The environment variables hold the API keys for the LLM providers.

In [None]:
load_dotenv()

## Variables

In [None]:
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

local_news_articles_csv = "../../../data/local_news_articles.csv"
police_press_releases_csv = "../../../data/police_press_releases.csv"
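
Reading a key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier check follows; the helper name `require_env` and its `environ` parameter are hypothetical and not part of this notebook, and the key value shown is a placeholder.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or fail with a clear message.

    `environ` is a hypothetical parameter that lets a plain dict stand in
    for os.environ, which keeps the helper easy to try out.
    """
    source = os.environ if environ is None else environ
    value = source.get(name, "")
    if not value.strip():
        raise RuntimeError(
            f"Missing environment variable {name!r}; add it to your .env file."
        )
    return value


# Example with an in-memory mapping standing in for os.environ (placeholder key).
example_env = {"OPENAI_API_KEY": "sk-placeholder"}
example_key = require_env("OPENAI_API_KEY", example_env)
print(example_key)
```

In the notebook itself, this could be called as `require_env("OPENAI_API_KEY")` after `load_dotenv()`.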

## Local News Articles

Preprocess the local news articles DataFrame. We keep the `article_id` column so that each extraction can be traced back to its source article.

In [None]:
articles_df = pd.read_csv(local_news_articles_csv)

print("Original News Articles DataFrame:")
display(articles_df) 
print(articles_df.info())

articles_df = articles_df[
    [
        "article_id", # article id to trace back
        "title",
        "subtitle",
        "content",
        "publish_date",
    ]
].copy()  # copy so that adding a column below does not trigger SettingWithCopyWarning

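# Fill missing dates before casting to str: casting first can turn NaN into
# the literal string "nan", which fillna() would then no longer catch.
articles_df["llm_input_text"] = (
    "Title: " + articles_df["title"].fillna("") + "\n" +
    "Subtitle: " + articles_df["subtitle"].fillna("") + "\n" +
    "Content: " + articles_df["content"].fillna("") + "\n" +
    "Publish Date: " + articles_df["publish_date"].fillna("none").astype(str)
)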

print("News Articles DataFrame after selecting only the columns of interest:")
display(articles_df)
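
To see how the `llm_input_text` column is assembled, here is a small sketch on a toy DataFrame. The rows are made-up illustration data and the helper name `build_llm_input` is hypothetical. It fills missing values before concatenating, so that a missing field never turns the whole string into NaN.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from the CSV file.
toy_df = pd.DataFrame(
    {
        "title": ["Crash on Main Street", None],
        "subtitle": [None, "Two cars involved"],
        "content": ["A car hit a pole.", "Details are pending."],
        "publish_date": ["2024-05-01", None],
    }
)


def build_llm_input(df):
    """Concatenate the text columns into one labelled string per row."""
    return (
        "Title: " + df["title"].fillna("") + "\n" +
        "Subtitle: " + df["subtitle"].fillna("") + "\n" +
        "Content: " + df["content"].fillna("") + "\n" +
        "Publish Date: " + df["publish_date"].fillna("none").astype(str)
    )


toy_inputs = build_llm_input(toy_df)
print(toy_inputs.iloc[1])
```

The second row has no title and no publish date, so it is a quick way to check how missing values end up in the text sent to the model.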

## Police Press Releases

Preprocess the police press releases. We add a `release_id` column as a surrogate key.

In [None]:
press_releases_df = pd.read_csv(police_press_releases_csv)
# Generate a surrogate key, following the pre-processing used by Isaac.
press_releases_df.insert(0, "release_id", range(1, len(press_releases_df) + 1))

print("Original Press Releases DataFrame:")
display(press_releases_df) 
print(press_releases_df.info())

press_releases_df = press_releases_df[
    [
        "release_id", # release_id
        "title",
        "date_published",
        "content",
    ]
]

print("Police Press Releases DataFrame after selecting only the columns of interest:")
display(press_releases_df) 

## Using an OpenAI Model

In [None]:
PROMPT = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from news articles.
The following is such a news article. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean) — true if the news article describes an actual traffic incident, false otherwise.
- If 'is_accident' is true, include the following additional keys:
    -'accident_datetime'
    -'street'
    -'city'
    -'number_injured'
    -'accident_severity'
    -'drivers' (a list of objects, each with the following keys:)
        -'vehicle_type'
        -'vehicle_damage_severity'
        -'driver_age'
        -'driver_gender'

Please ensure that:
-'accident_datetime' is in the format 'YYYY-MM-DD HH:MM' (24-hour format) if possible.
-'number_injured' is an integer greater than or equal to 0.
-'accident_severity' describes how severe the accident was and is one of: 'No Injuries', 'Minor', 'Serious' or 'Fatal'.
-'driver_gender' is either 'M' or 'F'.
-'vehicle_damage_severity' is one of: 'No damage', 'Minor' or 'Major', where 'Minor' means small damage and 'Major' means heavy damage or total loss.

Please only return JSON—do not add any other text! If values are missing, set them to the string: "none".
"""

In [None]:
text_to_extract = articles_df.iloc[0]["llm_input_text"]
text_to_extract

In [None]:
client = OpenAI(
    api_key=OPENAI_API_KEY,
)

# Both models listed here returned the same results, but gpt-4o-mini is cheaper:
# - gpt-4o-mini-2024-07-18
# - o4-mini-2025-04-16

response = client.responses.create(
    model="o4-mini-2025-04-16",
    instructions=PROMPT,
    input=text_to_extract,
)

print(response.output_text)
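
The cell above sends a single article. A minimal sketch of running the same `client.responses.create` call over several texts follows. The function name `extract_all` and the parameters `max_retries` and `pause_seconds` are hypothetical, and catching a broad `Exception` is a simplification rather than a recommendation for production use.

```python
import time


def extract_all(client, texts, prompt, model, max_retries=2, pause_seconds=1.0):
    """Call the model once per text and collect the raw output strings.

    `client` is expected to expose `responses.create(...)` like the OpenAI
    client used in this notebook. A text that still fails after all retries
    yields None, so the result list stays aligned with the input list.
    """
    outputs = []
    for position, text in enumerate(texts):
        output = None
        for attempt in range(max_retries + 1):
            try:
                response = client.responses.create(
                    model=model,
                    instructions=prompt,
                    input=text,
                )
                output = response.output_text
                break
            except Exception as error:  # broad on purpose in this sketch
                print(f"Text {position}, attempt {attempt + 1} failed: {error}")
                time.sleep(pause_seconds)
        outputs.append(output)
    return outputs
```

With the notebook's variables, a call could look like `extract_all(client, articles_df["llm_input_text"].head(5), PROMPT, "gpt-4o-mini-2024-07-18")`. Starting with a handful of rows keeps the API cost small while checking the prompt.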

In [None]:
print(type(response.output_text))

# Strip a Markdown code fence, if the model added one, before parsing the JSON.
data = json.loads(response.output_text.replace("```json", "").replace("```", ""))
print(data)
print(type(data))   # dict
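
The model's answer may not follow the prompt exactly, so it can help to check the parsed dict against the rules the prompt states. The sketch below is one way to do that. The names `parse_llm_json` and `validate_extraction` are hypothetical, the allowed values are copied from the prompt, and the sample output is made up for illustration.

```python
import json
import re
from datetime import datetime

FENCE = "`" * 3  # Markdown code fence marker
MISSING = "none"  # placeholder the prompt asks for when a value is unknown
ACCIDENT_SEVERITIES = ("No Injuries", "Minor", "Serious", "Fatal")
DAMAGE_SEVERITIES = ("No damage", "Minor", "Major")
GENDERS = ("M", "F")


def parse_llm_json(raw_text):
    """Parse model output as JSON, unwrapping a Markdown code fence if present."""
    text = raw_text.strip()
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    return json.loads(text)


def validate_extraction(record):
    """Return a list of problems found; an empty list means none were found."""
    if not isinstance(record, dict) or not isinstance(record.get("is_accident"), bool):
        return ["'is_accident' must be present and boolean"]
    problems = []
    if not record["is_accident"]:
        return problems  # the prompt requires no further keys in this case

    when = record.get("accident_datetime", MISSING)
    if when != MISSING:
        try:
            datetime.strptime(when, "%Y-%m-%d %H:%M")
        except (TypeError, ValueError):
            problems.append(f"'accident_datetime' has an unexpected format: {when!r}")

    injured = record.get("number_injured", MISSING)
    is_count = isinstance(injured, int) and not isinstance(injured, bool) and injured >= 0
    if injured != MISSING and not is_count:
        problems.append(f"'number_injured' is not a non-negative integer: {injured!r}")

    severity = record.get("accident_severity", MISSING)
    if severity != MISSING and severity not in ACCIDENT_SEVERITIES:
        problems.append(f"'accident_severity' is not an allowed value: {severity!r}")

    drivers = record.get("drivers", [])
    if not isinstance(drivers, list):
        drivers = []  # e.g. the "none" placeholder
    for index, driver in enumerate(drivers):
        if not isinstance(driver, dict):
            problems.append(f"drivers[{index}] is not an object")
            continue
        damage = driver.get("vehicle_damage_severity", MISSING)
        if damage != MISSING and damage not in DAMAGE_SEVERITIES:
            problems.append(f"drivers[{index}] has an invalid damage severity: {damage!r}")
        gender = driver.get("driver_gender", MISSING)
        if gender != MISSING and gender not in GENDERS:
            problems.append(f"drivers[{index}] has an invalid gender: {gender!r}")
    return problems


# Made-up model output with one deliberate error ("Severe" is not an allowed value).
sample_output = (
    FENCE + "json\n"
    '{"is_accident": true, "accident_datetime": "2024-05-01 14:30", '
    '"number_injured": 2, "accident_severity": "Severe", '
    '"drivers": [{"driver_gender": "M", "vehicle_damage_severity": "Major"}]}'
    "\n" + FENCE
)
sample_record = parse_llm_json(sample_output)
sample_problems = validate_extraction(sample_record)
print(sample_problems)
```

Because the prompt only asks for the datetime format "if possible", a format problem reported here is better treated as a warning than as a reason to discard the record.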