# Reading & Resources

- [GitHub Repository with a list of LLMs](https://github.com/cheahjs/free-llm-api-resources)
- [Harnessing the power of LLMs for automated data extraction](https://www.seldon.io/harnessing-the-power-of-llms-for-automated-data-extraction/)

In [None]:
!pip install python-dotenv
!pip install pandas
!pip install openai==2.8.1

# Imports

In [None]:
import os
import requests
import json
import time
from IPython.display import display

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

# Data Extraction

## Load env variables

The environment variables, loaded from a `.env` file, include the API keys for the LLMs.

In [None]:
load_dotenv()

## Variables

In [None]:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

local_news_articles_csv = "../../../data/local_news_articles.csv"
police_press_releases_csv = "../../../data/police_press_releases.csv"
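
`os.environ["OPENAI_API_KEY"]` raises a bare `KeyError` when the key is missing from the `.env` file. A minimal sketch of a lookup with a more helpful error message could look like the following. `require_env` is a hypothetical helper name introduced here for illustration; it is not part of `python-dotenv` or of this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; call it after load_dotenv().
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add it to your .env file or export it in the shell."
        )
    return value
```

Under these assumptions, the assignment above could become `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.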

## Local News Articles

Preprocessing of the local news articles DataFrame. We keep the `article_id` column so that extracted records can be traced back to their source article.

In [None]:
articles_df = pd.read_csv(local_news_articles_csv)

print("Original News Articles DataFrame:")
display(articles_df) 
print(articles_df.info())

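articles_df = articles_df[
    [
        "article_id",  # kept so extracted records can be traced back to the article
        "title",
        "subtitle",
        "content",
        "publish_date",
    ]
].copy()  # explicit copy, so adding a column below does not trigger SettingWithCopyWarning

# Fill missing dates before casting: astype(str) can turn NaN into the string "nan",
# which a later fillna no longer catches.
articles_df["llm_input_text"] = (
    "Title: " + articles_df["title"].fillna("") + "\n" +
    "Subtitle: " + articles_df["subtitle"].fillna("") + "\n" +
    "Content: " + articles_df["content"].fillna("") + "\n" +
    "Publish Date: " + articles_df["publish_date"].fillna("none").astype(str)
)

print("News Articles DataFrame after selecting only the columns of interest:")
display(articles_df)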
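
To check the missing-value handling without loading the CSV, the same template can be applied to a single record. This is only a sketch: `build_llm_input` and `_text_or` are hypothetical helper names, and the sample record is made up for illustration.

```python
import math
from typing import Any, Mapping


def _text_or(value: Any, default: str) -> str:
    """Return str(value), or `default` when value is None or a float NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    return str(value)


def build_llm_input(record: Mapping[str, Any]) -> str:
    """Build the LLM input text for one article, mirroring the column template."""
    return (
        "Title: " + _text_or(record.get("title"), "") + "\n"
        + "Subtitle: " + _text_or(record.get("subtitle"), "") + "\n"
        + "Content: " + _text_or(record.get("content"), "") + "\n"
        + "Publish Date: " + _text_or(record.get("publish_date"), "none")
    )


# Made-up sample record with a missing subtitle and a NaN publish date.
example = {
    "title": "Crash on Main Street",
    "subtitle": None,
    "content": "Two cars collided.",
    "publish_date": float("nan"),
}
print(build_llm_input(example))
```

The last line of the printed text is `Publish Date: none`, which is the placeholder the template intends for missing dates.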

## Police Press Releases

Preprocessing of police press releases.

In [None]:
press_releases_df = pd.read_csv(police_press_releases_csv)
# Generate a surrogate key, mirroring the pre-processing used by Isaac.
press_releases_df.insert(0, 'release_id', range(1, len(press_releases_df) + 1))

print("Original Press Releases DataFrame:")
display(press_releases_df) 
print(press_releases_df.info())

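press_releases_df = press_releases_df[
    [
        "release_id",  # surrogate key generated above, kept to trace records back
        "title",
        "date_published",
        "content",
    ]
].copy()  # explicit copy, so adding a column below does not trigger SettingWithCopyWarning

# Fill missing dates before casting: astype(str) can turn NaN into the string "nan",
# which a later fillna no longer catches.
press_releases_df["llm_input_text"] = (
    "Title: " + press_releases_df["title"].fillna("") + "\n" +
    "Content: " + press_releases_df["content"].fillna("") + "\n" +
    "Publish Date: " + press_releases_df["date_published"].fillna("none").astype(str)
)

print("Police Press Releases DataFrame after selecting only the columns of interest:")
display(press_releases_df)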

## Using OpenAI Model

We experimented with both o4-mini and 4o-mini and decided on 4o-mini, because it returned structured JSON and gave slightly better results.

### News Articles Prompt

Prompt used to extract data from news articles.

In [None]:
NEWS_ARTICLES_PROMPT = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from news articles.
The following is such a news article. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean) — true if the news article describes an actual traffic accident, false otherwise.
- If 'is_accident' is true, include the following additional keys:
    -'accident_datetime'
    -'street'
    -'city'
    -'number_injured'
    -'accident_severity'
    -'drivers' (a list of objects, each with the following keys:)
        -'vehicle_type'
        -'vehicle_damage_severity'
        -'driver_age'
        -'driver_gender'
        -'is_victim' (boolean)

Please ensure that:
-'accident_datetime' is in the format 'YYYY-MM-DD HH:MM' (24-hour format) if possible.
-'number_injured' is an integer greater than or equal to 0.
-'accident_severity' describes how severe the accident is in terms of human injuries and is one of: 'No Injuries', 'Minor', 'Serious' or 'Fatal'.
-'driver_gender' is either 'M' or 'F'.
-'vehicle_damage_severity' is one of: 'No damage', 'Minor' or 'Major', where 'Minor' means small damage and 'Major' means big damage or total loss.

Please only return JSON—do not add any other text! If values are missing, set them to the string: "none".
"""

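Because the prompt fixes the allowed values for several keys, the model's reply can be checked against them before it is stored. The following is a minimal sketch of such a check, not part of the original pipeline. `validate_extraction` and `_check_choice` are hypothetical names, and treating an absent key like the `"none"` placeholder is an assumption.

```python
from typing import Any

# Allowed values, copied from the prompt. Tuples are used so that membership
# tests rely on equality and never fail on unhashable values.
ACCIDENT_SEVERITIES = ("No Injuries", "Minor", "Serious", "Fatal")
DAMAGE_SEVERITIES = ("No damage", "Minor", "Major")
GENDERS = ("M", "F")
MISSING = "none"  # placeholder the prompt requests for missing values


def _check_choice(problems: list, name: str, value: Any, allowed: tuple) -> None:
    """Record a problem unless value is the placeholder or an allowed choice."""
    if value != MISSING and value not in allowed:
        problems.append(f"{name}: unexpected value {value!r}")


def validate_extraction(record: dict) -> list:
    """Return a list of problems found in one parsed reply (empty means valid)."""
    problems: list = []
    is_accident = record.get("is_accident")
    if not isinstance(is_accident, bool):
        return ["is_accident: expected a boolean"]
    if not is_accident:
        return problems  # no further keys are required

    _check_choice(problems, "accident_severity",
                  record.get("accident_severity", MISSING), ACCIDENT_SEVERITIES)

    injured = record.get("number_injured", MISSING)
    is_count = isinstance(injured, int) and not isinstance(injured, bool) and injured >= 0
    if injured != MISSING and not is_count:
        problems.append(f"number_injured: expected an integer >= 0, got {injured!r}")

    drivers = record.get("drivers", MISSING)
    if drivers == MISSING:
        drivers = []
    elif not isinstance(drivers, list):
        problems.append("drivers: expected a list")
        drivers = []
    for i, driver in enumerate(drivers):
        if not isinstance(driver, dict):
            problems.append(f"drivers[{i}]: expected an object")
            continue
        _check_choice(problems, f"drivers[{i}].driver_gender",
                      driver.get("driver_gender", MISSING), GENDERS)
        _check_choice(problems, f"drivers[{i}].vehicle_damage_severity",
                      driver.get("vehicle_damage_severity", MISSING), DAMAGE_SEVERITIES)
    return problems
```

A reply with a non-empty problem list could be logged or retried rather than appended to the results.
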
### Police Press Releases Prompt

Prompt used to extract data from police press releases using an LLM.

In [None]:
POLICE_PRESS_RELEASES_PROMPT = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from police press releases.
The following is such a press release. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean): true if the press release describes an actual traffic accident, false otherwise.
- If 'is_accident' is true, include the following additional keys:
    -'accident_datetime'
    -'street'
    -'city'
    -'number_injured'
    -'accident_severity'
    -'drivers' (a list of objects, each with the following keys:)
        -'vehicle_type'
        -'vehicle_damage_severity'
        -'driver_age'
        -'driver_gender'
        -'is_victim' (boolean)

Please ensure that:
-'accident_datetime' is in the format 'YYYY-MM-DD HH:MM' (24-hour format) if possible.
-'number_injured' is an integer greater than or equal to 0.
-'accident_severity' describes how severe the accident is in terms of human injuries and is one of: 'No Injuries', 'Minor', 'Serious' or 'Fatal'.
-'driver_gender' is either 'M' or 'F'.
-'vehicle_damage_severity' is one of: 'No damage', 'Minor' or 'Major', where 'Minor' means small damage and 'Major' means big damage or total loss.

Please only return JSON—do not add any other text! If values are missing, set them to the string: "none".
"""

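The two prompts differ only in the wording that names the source document, so they could be generated from one template to keep the key list and rules in sync. The sketch below is one possible way to do that, not how the notebook builds its prompts. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and `shared_keys_and_rules` stands for the remaining key list and constraints, which are passed in unchanged.

```python
# Only the parts that name the source document are parameterised.
PROMPT_TEMPLATE = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from {source_plural}.
The following is such a {source_singular}. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean): true if the {source_singular} describes an actual traffic accident, false otherwise.
{shared_keys_and_rules}
"""


def build_prompt(source_singular: str, source_plural: str, shared_keys_and_rules: str) -> str:
    """Fill the template for one source type (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(
        source_singular=source_singular,
        source_plural=source_plural,
        shared_keys_and_rules=shared_keys_and_rules,
    )


# Stand-in for the shared key list and rules; the real text would be pasted here.
SHARED = "- If 'is_accident' is true, include the additional keys listed in the prompt."

news_prompt = build_prompt("news article", "news articles", SHARED)
police_prompt = build_prompt("press release", "police press releases", SHARED)
```

With this layout, a correction to the shared rules only has to be made once.
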
### Methods for LLM Extraction

In [None]:
def extract_llm_output(json_str: str) -> dict:
    return json.loads(
        json_str.replace("```json", "").replace("```", "")
    )

def extract_features_from_df(
    pd_df: pd.DataFrame,
    id_column: str,
    prompt: str,
    json_save_path: str
) -> None:
    client = OpenAI(api_key=OPENAI_API_KEY)
    results = []
    
    for index, row in pd_df.iterrows():
        id_value = row[id_column]
        input_text = row["llm_input_text"]

        retry_count = 0
        success = False
    
        print(f"Processing row with {id_column} '{id_value}'...")   

        while retry_count < 3 and not success:
            try:
                response = client.responses.create(
                    model="o4-mini-2025-04-16",
                    instructions=prompt,
                    input=input_text,
                )
                
                llm_output = extract_llm_output(response.output_text)
                llm_output["id_column"] = id_value
                llm_output["input_text"] = input_text
                results.append(llm_output)

                print(f"Successfully processed row with {id_column} '{id_value}'")
                success = True
            except Exception as e:
                retry_count += 1
                print(f"Error for row with {id_column} '{id_value}' (attempt {retry_count}/3): {e}")

                if retry_count < 3:
                    time.sleep(2)  # fixed 2-second delay before the next attempt
                else:
                    print(f"Failed to process row with {id_column} '{id_value}' after 3 attempts")
        
    with open(json_save_path, 'w') as f:
        json.dump(results, f)
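
Two parts of the loop above could be made more robust: the reply parsing, which assumes at most one code fence and nothing else around the JSON, and the retry delay, which is constant. The sketch below shows one possible variant of each under stated assumptions. `parse_llm_json` and `call_with_backoff` are hypothetical names, and the doubling delay with up to one second of jitter is an illustrative choice rather than the notebook's behaviour.

```python
import json
import random
import re
import time
from typing import Callable, TypeVar

T = TypeVar("T")

FENCE = "`" * 3  # Markdown code fence, built this way to keep the source readable
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def parse_llm_json(text: str) -> dict:
    """Parse a JSON object from a reply, with or without a Markdown fence."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    parsed = json.loads(candidate)
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # base_delay, 2 * base_delay, 4 * base_delay, ... plus up to 1 s of jitter
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists only so the delays can be replaced in tests; in the notebook the default `time.sleep` would be used.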

### Extraction and Analysis of News Articles

In [None]:
raw_articles_json = "raw_articles.json"

if os.path.isfile(raw_articles_json):
    print("LLM Feature Extraction from news articles was already done")
else:
    extract_features_from_df(
        pd_df=articles_df,
        id_column="article_id",
        prompt=NEWS_ARTICLES_PROMPT,
        json_save_path=raw_articles_json,
    )

In [None]:
# processed by the LLM
processed_articles_df = pd.read_json(raw_articles_json)
processed_articles_df.to_csv("../../../data/processed/llm_local_news_articles.csv")

display(processed_articles_df)

processed_articles_df["is_accident"].value_counts(dropna=False)
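
The `drivers` value is a nested list, so `to_csv` writes it as a stringified list in a single cell. If one row per driver is easier to analyse, the raw JSON records could be flattened first. This is a sketch under assumptions: `flatten_drivers` and the `driver_position` column are hypothetical, and accidents without any listed driver produce no rows here.

```python
from typing import Any


def flatten_drivers(records: list) -> list:
    """Return one flat row per driver, carrying over the accident-level fields."""
    rows = []
    for record in records:
        if record.get("is_accident") is not True:
            continue  # skip non-accidents
        drivers = record.get("drivers")
        if not isinstance(drivers, list):
            drivers = []  # e.g. the "none" placeholder or a missing key
        # Accident-level fields, without the nested list and the long input text.
        accident = {
            key: value
            for key, value in record.items()
            if key not in ("drivers", "input_text")
        }
        for position, driver in enumerate(drivers, start=1):
            if isinstance(driver, dict):
                rows.append({**accident, "driver_position": position, **driver})
    return rows
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` to obtain a driver-level table.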

In [None]:
# rows that the LLM did not manage to process
unprocessed_articles_df = articles_df[~articles_df["article_id"].isin(processed_articles_df["id_column"])] 

display(unprocessed_articles_df)

### Extraction and Analysis of Police Press Releases

In [None]:
raw_press_releases_json = "raw_press_releases.json"

if os.path.isfile(raw_press_releases_json):
    print("LLM Feature Extraction from police press releases was already done")
else:
    extract_features_from_df(
        pd_df=press_releases_df,
        id_column="release_id",
        prompt=POLICE_PRESS_RELEASES_PROMPT,
        json_save_path=raw_press_releases_json,
    )

In [None]:
# processed by the LLM
processed_press_releases_df = pd.read_json(raw_press_releases_json)
processed_press_releases_df.to_csv("../../../data/processed/llm_press_releases.csv")

display(processed_press_releases_df)

processed_press_releases_df["is_accident"].value_counts(dropna=False)

In [None]:
# rows that the LLM did not manage to process
unprocessed_press_releases_df = press_releases_df[~press_releases_df["release_id"].isin(processed_press_releases_df["id_column"])] 

display(unprocessed_press_releases_df)
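
Because the extraction cells skip the whole run once the JSON file exists, rows that failed all three attempts are never retried. One way to find the rows to re-run is to compare the expected ids with the ids stored in the saved file. The sketch below assumes the file layout written by `extract_features_from_df` (a list of objects with an `id_column` key); `find_unprocessed_ids` is a hypothetical helper name.

```python
import json
from typing import Any, Iterable


def find_unprocessed_ids(all_ids: Iterable[Any], results_path: str) -> list:
    """Return the ids from all_ids that have no entry in the saved results file."""
    with open(results_path, encoding="utf-8") as f:
        results = json.load(f)
    done = {record["id_column"] for record in results}
    # Keep the original order of all_ids so the output is reproducible.
    return [id_value for id_value in all_ids if id_value not in done]
```

The returned ids could then be used to filter the input DataFrame and pass only those rows to `extract_features_from_df` with a separate save path.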