# Data Preparation Notebook

Notebook to prepare data and extract features for the project.

## Install required packages

In [1]:
!pip install python-dotenv
!pip install pandas
!pip install openai==2.8.1



## Imports

Import any packages required for this project.

In [2]:
import ast
import os
import requests
import json
import time
import re
import unicodedata
from datetime import datetime, timedelta

import pandas as pd
from IPython.display import display
from dotenv import load_dotenv
from openai import OpenAI

## Load env variables

Load environment variables.

In [3]:
load_dotenv()

False

## Variables

Variables required throughout the whole notebook

In [4]:
#OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

data_folder = "../../data"
raw_data_folder = f"{data_folder}/raw"
staging_data_folder = f"{data_folder}/staging"

local_news_articles_csv = f"{raw_data_folder}/local_news_articles.csv"
police_press_releases_csv = f"{raw_data_folder}/police_press_releases.csv"

## Explanation & Mini Preparation

High-level explanation of the process as follows:

1. We first extract certain features using regex from the original CSV files.
2. We use LLMs to extract other features from the original CSV files.
3. Combine everything together into one CSV file for manual auditing.
4. After we manually audit, we combine both of the audit CSV files together and deduplicate
5. On the deduplicated data, extract other features using rule-based processing and other dimensions (date, street, town).

### News Articles Preparation

In [5]:
local_news_articles_df = pd.read_csv(local_news_articles_csv)

print(f"Dataset shape: {local_news_articles_df.shape}")
print(f"\nColumn names: {local_news_articles_df.columns.tolist()}")
print(f"\nPrimary key range: {local_news_articles_df['article_id'].min()} to {local_news_articles_df['article_id'].max()}")

print("\nDataset Info:")
display(local_news_articles_df.info())

print("\nFirst few rows:")
display(local_news_articles_df.head())

Dataset shape: (321, 14)

Column names: ['article_id', 'url', 'source_name', 'source_url', 'title', 'subtitle', 'author_name', 'publish_date', 'content', 'top_image_url', 'top_image_caption', 'created_at', 'tags', 'categories']

Primary key range: 40 to 496772

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321 entries, 0 to 320
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   article_id         321 non-null    int64 
 1   url                321 non-null    object
 2   source_name        321 non-null    object
 3   source_url         321 non-null    object
 4   title              321 non-null    object
 5   subtitle           313 non-null    object
 6   author_name        321 non-null    object
 7   publish_date       321 non-null    object
 8   content            321 non-null    object
 9   top_image_url      318 non-null    object
 10  top_image_caption  312 non-null    object
 11  cre

None


First few rows:


Unnamed: 0,article_id,url,source_name,source_url,title,subtitle,author_name,publish_date,content,top_image_url,top_image_caption,created_at,tags,categories
0,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,Emma Borg,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,The broken car mirror. Photo: Frank Xerri De Caro,2025-07-03 15:14:21.554132+00,"{Accident,Lesa,National}",{}
1,4167,https://timesofmalta.com/article/pn-slams-gove...,Times of Malta,https://timesofmalta.com,PN slams government for diverting EU bus funds...,"'By encouraging the use of private cars, the g...",Times of Malta,2024-12-09,The PN on Monday slammed the government for di...,https://cdn-attachments.timesofmalta.com/d9afe...,"PN spokespeople Ryan Callus, Mark Anthony Samm...",2025-07-03 15:14:10.643172+00,"{""Climate Change"",Environment,""European Union""...",{}
2,4093,https://timesofmalta.com/article/motorcyclist-...,Times of Malta,https://timesofmalta.com,Motorcyclist seriously hurt in St Paul's Bay b...,Residents complained several times about inade...,Times of Malta,2024-12-11,A motorcyclist was rushed to hospital in a cri...,https://cdn-attachments.timesofmalta.com/633f6...,Photo: Malta Police Force,2025-07-03 15:13:50.605708+00,"{Accident,National,""St Paul’S Bay"",Traffic}",{}
3,4110,https://timesofmalta.com/article/skip-involved...,Times of Malta,https://timesofmalta.com,Skip involved in horror St Paul’s Bay bypass c...,Motorcyclist hurt in crash on Wednesday evenin...,Emma Borg,2024-12-12,A private contractor who placed a skip on St P...,https://cdn-attachments.timesofmalta.com/fc23e...,A 54-year-old man was seriously injured when h...,2025-07-03 15:13:54.812813+00,"{Accident,National,""St Paul’S Bay""}",{}
4,4066,https://timesofmalta.com/article/two-people-in...,Times of Malta,https://timesofmalta.com,"Two people, including teenage girl, critically...",Incidents in Mellieħa and Gudja on Friday even...,Times of Malta,2024-12-14,A 29-year-old man and 17-year-old girl were cr...,https://cdn-attachments.timesofmalta.com/f1761...,The Ford Fiesta involved in the Gudja collisio...,2025-07-03 15:13:43.83839+00,"{Accident,Gudja,Mellieħa,National,Traffic}",{}


### Police Press Releases Preparation

In [6]:
police_press_releases_df = pd.read_csv(police_press_releases_csv)
police_press_releases_df.insert(0, 'release_id', range(1, len(police_press_releases_df) + 1))

print(f"Dataset shape: {police_press_releases_df.shape}")
print(f"\nColumn names: {police_press_releases_df.columns.tolist()}")
print(f"\nPrimary key range: {police_press_releases_df['release_id'].min()} to {police_press_releases_df['release_id'].max()}")

print("\nDataset Info:")
display(police_press_releases_df.info())

print("\nFirst few rows:")
display(police_press_releases_df.head())

Dataset shape: (111, 5)

Column names: ['release_id', 'title', 'date_published', 'date_modified', 'content']

Primary key range: 1 to 111

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   release_id      111 non-null    int64 
 1   title           111 non-null    object
 2   date_published  111 non-null    object
 3   date_modified   111 non-null    object
 4   content         111 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.5+ KB


None


First few rows:


Unnamed: 0,release_id,title,date_published,date_modified,content
0,1,Collision between a car and a motorbike in Żur...,2025-10-09,2025-10-09,"Today, at around 0930hrs, the Police were info..."
1,2,Car-motorcycle traffic accident,2025-06-20,2025-06-20,"Yesterday, at around 1830hrs, the Police were ..."
2,3,Car-motorcycle collision in Ħal Qormi,2025-05-12,2025-05-12,"Today, at around 0800hrs, the Police were info..."
3,4,Collision between motorcycle and car in Għaxaq,2025-07-30,2025-07-30,"Yesterday, at around 1800hrs, the Police were ..."
4,5,Car-motorcycle collision,2025-04-07,2025-04-07,"Yesterday, at around quarter to nine in the ev..."


## 1. Regex Extraction

Extract datetime from both police press releases and local news articles using regex.

### Variables

In [7]:
regex_extract_data_folder = f"{staging_data_folder}/regex_extract"

### News Articles

Extract accident datetime from news articles.

#### Methods

In [8]:
def classify_article_road_accident(row):
    """Classify if article is about a road accident."""
    text = (str(row['title']) + ' ' + str(row['content'])).lower()
    tags = str(row['tags']).lower()

    # exclude policy articles (government, budget, legislation, etc.)
    policy_keywords = ['government', 'minister', 'policy', 'budget', 'funds', 'legislation', 'parliament', 'proposal', 'grant', 'incentive', 'subsidy']
    if sum(1 for k in policy_keywords if k in text) >= 3:
        return 0

    # exclude non-accident traffic incidents
    non_accident_keywords = ['speeding', 'speed gun', 'caught doing', 'clocked at', 'pothole', 'flat tyre', 'flat tire', 'road damage', 'traffic violation', 'employer']
    if any(k in text for k in non_accident_keywords):
        return 0

    person_vehicle_terms = ['motorcyclist', 'cyclist', 'pedestrian']
    accident_keywords = ['crash', 'collision', 'injured', 'grievously injured', 'seriously injured', 'hit by', 'overturned', 'lost control', 'hit-and-run', 'run over']
    vehicles = ['car', 'bus', 'truck', 'van', 'motorcycle', 'bike', 'bicycle', 'scooter', 'vehicle']

    has_person_vehicle = any(k in text for k in person_vehicle_terms)
    has_accident_keyword = any(k in text for k in accident_keywords)
    has_vehicle = any(v in text for v in vehicles)

    if 'accident' in tags and (has_vehicle or has_accident_keyword or has_person_vehicle):
        return 1
    if has_accident_keyword and (has_vehicle or has_person_vehicle):
        return 1
    if has_person_vehicle and (has_accident_keyword or 'accident' in tags):
        return 1

    return 0

In [9]:
def parse_time_to_datetime(date_obj, time_str):
    if not time_str or pd.isna(date_obj):
        return None
    try:
        time_str = time_str.lower().strip().replace('.', ':')
        time_obj = datetime.strptime(time_str, '%I:%M%p' if ':' in time_str else '%I%p').time()
        date_only = date_obj.date() if isinstance(date_obj, pd.Timestamp) else (date_obj.date() if hasattr(date_obj, 'date') else date_obj)
        return datetime.combine(date_only, time_obj)
    except:
        return None

In [10]:
def extract_article_accident_datetime(row):
    text = str(row['content']).lower()
    title = str(row['title']).lower()
    try:
        pub_date = pd.to_datetime(row['publish_date'])
    except:
        pub_date = None

    # time patterns
    time_patterns = [
        r'at (?:around |about )?(\d{1,2}(?:\.\d{2})?(?:am|pm))',
        r'at (?:around |about )?(\d{1,2}:\d{2})\s*([ap]\.?m\.?)',
        r'(?:reported|occurred|happened|took place) at (?:around |about )?(\d{1,2}:\d{2})\s*([ap]\.?m\.?)',
        r'(?:reported|occurred|happened|took place) at (?:around |about )?(\d{1,2}(?:\.\d{2})?(?:am|pm))',
        r'at (?:around |about )?(\d{1,2}:\d{2})',  # Fallback without am/pm
    ]

    extracted_time = None
    for pattern in time_patterns:
        match = re.search(pattern, text)
        if match:
            if len(match.groups()) == 2:
                # pattern with separate time and am/pm groups
                time_part = match.group(1)
                ampm_part = match.group(2).replace('.', '').replace(' ', '')
                extracted_time = time_part + ampm_part
            else:
                extracted_time = match.group(1)
            break

    # day patterns
    day_patterns = {
        r'on monday': ('Monday', 0),
        r'on tuesday': ('Tuesday', 1),
        r'on wednesday': ('Wednesday', 2),
        r'on thursday': ('Thursday', 3),
        r'on friday': ('Friday', 4),
        r'on saturday': ('Saturday', 5),
        r'on sunday': ('Sunday', 6),
        r'this (?:morning|afternoon|evening|night)': ('today', None),
        r'(?:yesterday|last night)': ('yesterday', None),
        r'last (?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)': ('last_week', -7),
    }

    extracted_day = None
    accident_date = None

    # check both title and content for day references
    combined_text = title + ' ' + text

    for pattern, (day_value, weekday) in day_patterns.items():
        if re.search(pattern, combined_text):
            extracted_day = day_value
            if pub_date is not None:
                if day_value == 'today':
                    accident_date = pub_date
                elif day_value == 'yesterday':
                    accident_date = pub_date - timedelta(days=1)
                elif day_value == 'last_week':
                    # For "last Monday", etc - go back to that day in the previous week
                    target_day = pattern.split()[-1].rstrip(r')')
                    day_map = {'monday': 0, 'tuesday': 1, 'wednesday': 2, 'thursday': 3,
                              'friday': 4, 'saturday': 5, 'sunday': 6}
                    if target_day in day_map:
                        target_weekday = day_map[target_day]
                        days_back = (pub_date.weekday() - target_weekday) % 7
                        if days_back == 0:
                            days_back = 7  # If same day, go back full week
                        accident_date = pub_date - timedelta(days=days_back)
                elif weekday is not None:
                    # Calculate days back, ensuring we go into the past
                    days_back = (pub_date.weekday() - weekday) % 7
                    if days_back == 0:
                        # If same weekday as publish date, assume it's today (0 days back)
                        days_back = 0
                    accident_date = pub_date - timedelta(days=days_back)
            break

    # if no specific day found, assume accident date = publish date
    if accident_date is None and pub_date is not None:
        accident_date = pub_date

    # combine date and time
    if extracted_time:
        accident_datetime = parse_time_to_datetime(accident_date, extracted_time)
    else:
        accident_datetime = pd.Timestamp(accident_date) if accident_date is not None else None

    # CRITICAL: Ensure accident_datetime is never after publish_date
    # compare dates only (not times) to avoid false positives on same-day accidents
    if accident_datetime is not None and pub_date is not None:
        accident_date_only = accident_datetime.date() if hasattr(accident_datetime, 'date') else accident_datetime
        pub_date_only = pub_date.date() if hasattr(pub_date, 'date') else pub_date

        if accident_date_only > pub_date_only:
            # If date is in the future, go back one week to the same weekday
            accident_datetime = accident_datetime - timedelta(days=7)
            accident_date = accident_datetime.date() if hasattr(accident_datetime, 'date') else accident_datetime

    return extracted_time, extracted_day, accident_date, accident_datetime

#### Main Logic

In [11]:
regex_extract_news_articles_csv = f"{regex_extract_data_folder}/news_articles.csv"

if os.path.isfile(regex_extract_news_articles_csv):
    print("Regex Extraction from news articles was already done")
else:
    # use regex to determine whether or not the article is a road accident
    regex_extract_news_articles_df = local_news_articles_df.copy()
    regex_extract_news_articles_df['is_road_accident'] = regex_extract_news_articles_df.apply(classify_article_road_accident, axis=1)

    # use regex to extract datetime
    (
        regex_extract_news_articles_df['accident_time'],
        regex_extract_news_articles_df['accident_day'],
        regex_extract_news_articles_df['accident_date'],
        regex_extract_news_articles_df['accident_datetime']
    ) = zip(*regex_extract_news_articles_df.apply(extract_article_accident_datetime, axis=1))

    regex_extract_news_articles_df['time_confidence'] = regex_extract_news_articles_df.apply(
        lambda row: 'High' if pd.notna(row['accident_time']) else ('Medium' if pd.notna(row['accident_day']) else 'Low'), axis=1
    )

    regex_extract_news_articles_df['accident_hour'] = regex_extract_news_articles_df['accident_datetime'].apply(lambda dt: dt.hour if pd.notna(dt) else None)

    regex_extract_news_articles_df['accident_is_weekend'] = regex_extract_news_articles_df['accident_datetime'].apply(
        lambda dt: 1 if (pd.notna(dt) and dt.weekday() >= 5) else (0 if pd.notna(dt) else None)
    )

    regex_extract_news_articles_df['publication_delay_hours'] = regex_extract_news_articles_df.apply(
        lambda row: max(0, (pd.to_datetime(row['publish_date']) - row['accident_datetime']).total_seconds() / 3600) if pd.notna(row['accident_datetime']) else None, axis=1
    )

    # show stats for accidents only
    accidents_df = regex_extract_news_articles_df[regex_extract_news_articles_df['is_road_accident'] == 1].copy()
    print(f"High: {(accidents_df['time_confidence'] == 'High').sum()} | Medium: {(accidents_df['time_confidence'] == 'Medium').sum()} | Low: {(accidents_df['time_confidence'] == 'Low').sum()}")
    print(f"\nTotal articles processed: {len(regex_extract_news_articles_df)}")
    print(f"\nAccidents identified: {len(accidents_df)} ({len(accidents_df)/len(regex_extract_news_articles_df)*100:.1f}%)")

    columns_to_save = [col for col in regex_extract_news_articles_df.columns if col not in ['accident_day', 'accident_hour']]
    regex_extract_news_articles_df[columns_to_save].to_csv(regex_extract_news_articles_csv, index=False)

    print(f"✓ Saved {len(regex_extract_news_articles_df)} articles to {regex_extract_news_articles_csv}")
    print(f"  - Road accidents: {(regex_extract_news_articles_df['is_road_accident'] == 1).sum()}")
    print(f"  - Other articles: {(regex_extract_news_articles_df['is_road_accident'] == 0).sum()}")
    print(f"  - Articles with datetime extracted: {regex_extract_news_articles_df['accident_datetime'].notna().sum()}")

regex_extract_news_articles_df = pd.read_csv(regex_extract_news_articles_csv)
display(regex_extract_news_articles_df)

Regex Extraction from news articles was already done


Unnamed: 0,article_id,url,source_name,source_url,title,subtitle,author_name,publish_date,content,top_image_url,...,created_at,tags,categories,is_road_accident,accident_time,accident_date,accident_datetime,time_confidence,accident_is_weekend,publication_delay_hours
0,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,Emma Borg,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,...,2025-07-03 15:14:21.554132+00,"{Accident,Lesa,National}",{},0,,2024-12-04,2024-12-04 00:00:00,Medium,0.0,72.0
1,4167,https://timesofmalta.com/article/pn-slams-gove...,Times of Malta,https://timesofmalta.com,PN slams government for diverting EU bus funds...,"'By encouraging the use of private cars, the g...",Times of Malta,2024-12-09,The PN on Monday slammed the government for di...,https://cdn-attachments.timesofmalta.com/d9afe...,...,2025-07-03 15:14:10.643172+00,"{""Climate Change"",Environment,""European Union""...",{},0,,2024-12-09,2024-12-09 00:00:00,Medium,0.0,0.0
2,4093,https://timesofmalta.com/article/motorcyclist-...,Times of Malta,https://timesofmalta.com,Motorcyclist seriously hurt in St Paul's Bay b...,Residents complained several times about inade...,Times of Malta,2024-12-11,A motorcyclist was rushed to hospital in a cri...,https://cdn-attachments.timesofmalta.com/633f6...,...,2025-07-03 15:13:50.605708+00,"{Accident,National,""St Paul’S Bay"",Traffic}",{},1,5pm,2024-12-11,2024-12-11 17:00:00,High,0.0,0.0
3,4110,https://timesofmalta.com/article/skip-involved...,Times of Malta,https://timesofmalta.com,Skip involved in horror St Paul’s Bay bypass c...,Motorcyclist hurt in crash on Wednesday evenin...,Emma Borg,2024-12-12,A private contractor who placed a skip on St P...,https://cdn-attachments.timesofmalta.com/fc23e...,...,2025-07-03 15:13:54.812813+00,"{Accident,National,""St Paul’S Bay""}",{},1,1pm,2024-12-11,2024-12-11 13:00:00,High,0.0,11.0
4,4066,https://timesofmalta.com/article/two-people-in...,Times of Malta,https://timesofmalta.com,"Two people, including teenage girl, critically...",Incidents in Mellieħa and Gudja on Friday even...,Times of Malta,2024-12-14,A 29-year-old man and 17-year-old girl were cr...,https://cdn-attachments.timesofmalta.com/f1761...,...,2025-07-03 15:13:43.83839+00,"{Accident,Gudja,Mellieħa,National,Traffic}",{},1,5.30pm,2024-12-13,2024-12-13 17:30:00,High,0.0,6.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
316,496574,https://timesofmalta.com/article/watch-msida-f...,Times of Malta,https://timesofmalta.com,Watch: Msida flyover to open by end of year as...,'The flyover will allow us to create a new ope...,Daniel Ellul,2025-10-12,The Msida flyover will open by the end of the ...,https://cdn-attachments.timesofmalta.com/f47de...,...,2025-10-12 16:57:00.930159+00,"{Msida,National,Traffic}",{},0,,2025-10-12,2025-10-12 00:00:00,Medium,1.0,0.0
317,496586,https://timesofmalta.com/article/today-front-p...,Times of Malta,https://timesofmalta.com,Today's front pages,The top stories in Malta's newspapers,Times of Malta,2025-10-13,The following are the top stories in Malta's n...,https://cdn-attachments.timesofmalta.com/28065...,...,2025-10-13 08:04:55.910209+00,"{Media,National,""Social Media"",Traffic}",{},0,,2025-10-13,2025-10-13 00:00:00,Medium,0.0,0.0
318,496577,https://timesofmalta.com/article/traffic-overt...,Times of Malta,https://timesofmalta.com,Traffic overtakes cost of living to become peo...,Poll data suggests frustration on Maltese road...,Bertrand Borg,2025-10-13,"Traffic, parking and public transport-related ...",https://cdn-attachments.timesofmalta.com/861ca...,...,2025-10-13 05:03:22.837075+00,"{National,Politics,Traffic}",{},0,,2025-10-13,2025-10-13 00:00:00,Low,0.0,0.0
319,496733,https://timesofmalta.com/article/employer-clea...,Times of Malta,https://timesofmalta.com,Employer cleared of responsibility for young w...,"Court raps police, OHSA for not working togeth...",Monique Agius,2025-10-14,A court has sharply criticised the police and ...,https://cdn-attachments.timesofmalta.com/2d4fb...,...,2025-10-14 14:01:03.493399+00,"{Accident,Construction,Court,National}",{},0,,2025-10-14,2025-10-14 00:00:00,Low,0.0,0.0


### Police Press Releases

#### Methods

In [12]:
def extract_press_release_accident_datetime(content: str, published_date: str) -> dict:
    """
    Extract accident date and time from police press release content.

    Returns:
        Dictionary with accident_datetime, accident_date, accident_time, time_confidence
    """
    result = {
        'accident_datetime': None,
        'accident_date': None,
        'accident_time': None,
        'time_confidence': 'low'
    }

    if pd.isna(content) or pd.isna(published_date):
        return result

    content_lower = content.lower()
    published_dt = pd.to_datetime(published_date)
    extracted_time = None

    # HIGH confidence: Exact time with "hrs"
    hrs_patterns = [
        (r'at around (\d{4})\s*hrs', 1), (r'at (\d{4})\s*hrs', 1),
        (r'around (\d{4})\s*hrs', 1), (r'\((\d{4})\s*hrs\)', 1),
        (r'\w+\s*\((\d{4})\s*hrs\)', 1),
        (r'at around (\d{1,2}):(\d{2})\s*hrs', 2), (r'at (\d{1,2}):(\d{2})\s*hrs', 2),
        (r'around (\d{1,2}):(\d{2})\s*hrs', 2),
        (r'at around (\d{1,2})\.(\d{2})\s*hrs', 2), (r'at (\d{1,2})\.(\d{2})\s*hrs', 2),
    ]

    for pattern, num_groups in hrs_patterns:
        match = re.search(pattern, content_lower)
        if match:
            if num_groups == 1:
                time_str = match.group(1)
                if len(time_str) == 4:
                    hour, minute = int(time_str[:2]), int(time_str[2:])
                    if 0 <= hour <= 23 and 0 <= minute <= 59:
                        extracted_time = f"{hour:02d}:{minute:02d}"
                        result['time_confidence'] = 'high'
                        break
            else:
                hour, minute = int(match.group(1)), int(match.group(2))
                if 0 <= hour <= 23 and 0 <= minute <= 59:
                    extracted_time = f"{hour:02d}:{minute:02d}"
                    result['time_confidence'] = 'high'
                    break

    # HIGH confidence: Standard time formats without "hrs"
    if not extracted_time:
        time_patterns = [
            (r'at around (\d{1,2}):(\d{2})', 2), (r'at (\d{1,2}):(\d{2})', 2),
            (r'around (\d{1,2}):(\d{2})', 2),
            (r'at around (\d{1,2})\.(\d{2})', 2), (r'at (\d{1,2})\.(\d{2})', 2),
        ]
        for pattern, _ in time_patterns:
            match = re.search(pattern, content_lower)
            if match:
                hour, minute = int(match.group(1)), int(match.group(2))
                if 0 <= hour <= 23 and 0 <= minute <= 59:
                    extracted_time = f"{hour:02d}:{minute:02d}"
                    result['time_confidence'] = 'high'
                    break

    # MEDIUM-HIGH confidence: Specific time markers
    if not extracted_time:
        time_markers = [
            (r'\bmidnight\b', '00:00', 'high'), (r'\bnoon\b|\bmidday\b', '12:00', 'high'),
            (r'\bdawn\b|\bsunrise\b', '06:00', 'medium'), (r'\bdusk\b|\bsunset\b', '19:00', 'medium'),
        ]
        for pattern, time, conf in time_markers:
            if re.search(pattern, content_lower):
                extracted_time, result['time_confidence'] = time, conf
                break

    # MEDIUM confidence: Time ranges (calculate midpoint)
    if not extracted_time:
        match = re.search(r'between (\d{1,2})[:\.]?(\d{2})?\s*(?:and|&|-)\s*(\d{1,2})[:\.]?(\d{2})?', content_lower)
        if match:
            hour1 = int(match.group(1))
            min1 = int(match.group(2)) if match.group(2) else 0
            hour2 = int(match.group(3))
            min2 = int(match.group(4)) if match.group(4) else 0
            if 0 <= hour1 <= 23 and 0 <= hour2 <= 23:
                mid_minutes = ((hour1 * 60 + min1) + (hour2 * 60 + min2)) // 2
                extracted_time = f"{mid_minutes // 60:02d}:{mid_minutes % 60:02d}"
                result['time_confidence'] = 'medium'

    # MEDIUM confidence: General time periods
    if not extracted_time:
        time_periods = [
            (r'\bearly hours\b', '03:00'), (r'\blate hours\b|\blate at night\b', '23:00'),
            (r'\bearly morning\b', '06:00'), (r'\bmorning\b', '09:00'),
            (r'\bafternoon\b', '15:00'), (r'\bevening\b', '19:00'), (r'\bnight\b', '22:00'),
        ]
        for pattern, time in time_periods:
            if re.search(pattern, content_lower):
                extracted_time, result['time_confidence'] = time, 'medium'
                break

    # Date extraction
    accident_date = None
    explicit_patterns = [
        (r'on (\d{1,2}(?:st|nd|rd|th)?\s+(?:january|february|march|april|may|june|july|august|september|october|november|december))', '%d %B'),
        (r'on ((?:january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{1,2}(?:st|nd|rd|th)?)', '%B %d'),
    ]

    for pattern, date_format in explicit_patterns:
        match = re.search(pattern, content_lower)
        if match:
            try:
                date_str = re.sub(r'(st|nd|rd|th)', '', match.group(1))
                accident_date = pd.to_datetime(date_str, format=date_format).replace(year=published_dt.year)
                if accident_date > published_dt + timedelta(days=30):
                    accident_date = accident_date.replace(year=published_dt.year - 1)
                break
            except:
                pass

    if accident_date is None:
        if re.search(r'^\s*today[,\s]|^today at', content_lower):
            accident_date = published_dt
        elif re.search(r'^\s*yesterday[,\s]|^yesterday at', content_lower):
            accident_date = published_dt - timedelta(days=1)
        elif re.search(r'\bthis morning\b', content_lower):
            accident_date = published_dt
        elif re.search(r'\blast night\b', content_lower):
            accident_date = published_dt - timedelta(days=1)
        elif re.search(r'\blast evening\b', content_lower):
            accident_date = published_dt - timedelta(days=1)
        elif match := re.search(r'\blast (monday|tuesday|wednesday|thursday|friday|saturday|sunday)', content_lower):
            days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
            days_back = (published_dt.weekday() - days.index(match.group(1))) % 7 or 7
            accident_date = published_dt - timedelta(days=days_back)
        else:
            accident_date = published_dt

    # Combine date and time
    if accident_date is not None:
        result['accident_date'] = accident_date.date()
        if extracted_time:
            try:
                hour, minute = map(int, extracted_time.split(':'))
                result['accident_datetime'] = accident_date.replace(hour=hour, minute=minute, second=0)
                result['accident_time'] = extracted_time
            except:
                result['accident_datetime'] = accident_date
        else:
            result['accident_datetime'] = accident_date

    return result

In [13]:
def apply_press_releases_manual_corrections(df, corrections):
    """Apply manual time/date corrections to the dataframe."""
    df = df.copy()
    corrections_applied = 0

    for release_id, correction_data in corrections.items():
        mask = df['release_id'] == release_id
        if not mask.any():
            print(f"Release ID {release_id} not found")
            continue

        idx = df[mask].index[0]

        # Determine date
        if 'date' in correction_data:
            new_date = pd.to_datetime(correction_data['date'])
        else:
            new_date = pd.to_datetime(df.loc[idx, 'accident_date'] if pd.notna(df.loc[idx, 'accident_date'])
                                     else df.loc[idx, 'date_published'])

        # Apply time
        if 'time' in correction_data:
            try:
                hour, minute = map(int, correction_data['time'].split(':'))
                if 0 <= hour <= 23 and 0 <= minute <= 59:
                    df.loc[idx, 'accident_time'] = f"{hour:02d}:{minute:02d}"
                    df.loc[idx, 'accident_datetime'] = new_date.replace(hour=hour, minute=minute, second=0)
                    df.loc[idx, 'accident_date'] = new_date.date()
                    df.loc[idx, 'time_confidence'] = 'manual'
                    delay_hours = (pd.to_datetime(df.loc[idx, 'date_published']) - df.loc[idx, 'accident_datetime']).total_seconds() / 3600
                    df.loc[idx, 'publication_delay_hours'] = max(0, delay_hours)
                    corrections_applied += 1
                    print(f"Release ID {release_id}: {correction_data['time']}")
                else:
                    print(f"Invalid time for {release_id}: {correction_data['time']}")
            except Exception as e:
                print(f"Error for {release_id}: {e}")

    return df, corrections_applied

#### Main Logic

In [14]:
regex_extract_press_releases_csv = f"{regex_extract_data_folder}/press_releases.csv"

if os.path.isfile(regex_extract_press_releases_csv):
    print("Regex Extraction from police press releases was already done")
else:
    regex_extract_press_releases_df = police_press_releases_df.copy()

    extraction_results = regex_extract_press_releases_df.apply(
        lambda row: extract_press_release_accident_datetime(row['content'], row['date_published']), axis=1
    )

    # add columns
    extraction_df = pd.DataFrame(extraction_results.tolist())
    for col in ['accident_datetime', 'accident_date', 'accident_time', 'time_confidence']:
        regex_extract_press_releases_df[col] = extraction_df[col]

    # add derived features
    regex_extract_press_releases_df['accident_is_weekend'] = pd.to_datetime(regex_extract_press_releases_df['accident_datetime']).dt.dayofweek.isin([5, 6]).astype(int)

    regex_extract_press_releases_df['publication_delay_hours'] = ((
        pd.to_datetime(regex_extract_press_releases_df['date_published']) -
        pd.to_datetime(regex_extract_press_releases_df['accident_datetime'])
    ).dt.total_seconds() / 3600).apply(lambda x: max(0, x) if pd.notna(x) else x)

    # Display low confidence records
    low_conf = regex_extract_press_releases_df[regex_extract_press_releases_df['time_confidence'] == 'low']
    print(f"Low confidence records: {len(low_conf)}")

    if len(low_conf) > 0:
        for idx, row in low_conf.head(10).iterrows():
            print(f"ID: {row['release_id']} | Published: {row['date_published']} | Current: {row['accident_time']}")
            print(f"Content: {row['content'][:150]}...")
            print("-" * 80)

    # manual corrections dictionary - add your corrections here
    manual_corrections = {
        25: {'time': '07:45'},
        76: {'time': '17:45'},
        77: {'time': '10:15'}
        # Add more: release_id: {'time': 'HH:MM', 'date': 'YYYY-MM-DD'}
    }

    if manual_corrections:
        regex_extract_press_releases_df, num_applied = apply_press_releases_manual_corrections(regex_extract_press_releases_df, manual_corrections)
        print(f"\n{num_applied} manual corrections applied")
    else:
        print("No manual corrections defined")

    # show stats
    print(f"High: {(regex_extract_press_releases_df['time_confidence'] == 'high').sum()} | Medium: {(regex_extract_press_releases_df['time_confidence'] == 'medium').sum()} | Low: {(regex_extract_press_releases_df['time_confidence'] == 'low').sum()}")
    print(f"\nTotal press releases processed: {len(regex_extract_press_releases_df)}")

    columns_to_save = [col for col in regex_extract_press_releases_df.columns if col not in ['accident_day', 'accident_hour']]
    regex_extract_press_releases_df[columns_to_save].to_csv(regex_extract_press_releases_csv, index=False)

    print(f"✓ Saved {len(regex_extract_press_releases_df)} articles to {regex_extract_press_releases_csv}")
    print(f"  - Articles with datetime extracted: {regex_extract_press_releases_df['accident_datetime'].notna().sum()}")

regex_extract_press_releases_df = pd.read_csv(regex_extract_press_releases_csv)
display(regex_extract_press_releases_df)

Regex Extraction from police press releases was already done


Unnamed: 0,release_id,title,date_published,date_modified,content,accident_datetime,accident_date,accident_time,time_confidence,accident_is_weekend,publication_delay_hours
0,1,Collision between a car and a motorbike in Żur...,2025-10-09,2025-10-09,"Today, at around 0930hrs, the Police were info...",2025-10-09 09:30:00,2025-10-09,09:30,high,0,0.00
1,2,Car-motorcycle traffic accident,2025-06-20,2025-06-20,"Yesterday, at around 1830hrs, the Police were ...",2025-06-19 18:30:00,2025-06-19,18:30,high,0,5.50
2,3,Car-motorcycle collision in Ħal Qormi,2025-05-12,2025-05-12,"Today, at around 0800hrs, the Police were info...",2025-05-12 08:00:00,2025-05-12,08:00,high,0,0.00
3,4,Collision between motorcycle and car in Għaxaq,2025-07-30,2025-07-30,"Yesterday, at around 1800hrs, the Police were ...",2025-07-29 18:00:00,2025-07-29,18:00,high,0,6.00
4,5,Car-motorcycle collision,2025-04-07,2025-04-07,"Yesterday, at around quarter to nine in the ev...",2025-04-06 20:45:00,2025-04-06,20:45,high,1,3.25
...,...,...,...,...,...,...,...,...,...,...,...
106,107,Motorcycle accident in Attard,2025-02-05,2025-02-05,"A 52-year-old man and residing in Ħaż-Żebbuġ, ...",2025-02-05 09:00:00,2025-02-05,09:00,high,0,0.00
107,108,Naxxar traffic accident,2024-12-19,2024-12-19,"Today, at around 1045hrs, the Police were info...",2024-12-19 10:45:00,2024-12-19,10:45,high,0,0.00
108,109,Żebbuġ traffic accident,2025-03-16,2025-03-16,"Today, at around 0800hrs, the Police were info...",2025-03-16 08:00:00,2025-03-16,08:00,high,1,0.00
109,110,Collision between a car and e-scooter,2025-07-18,2025-07-18,"Yesterday, at around 2215 hrs, the Police were...",2025-07-17 22:15:00,2025-07-17,22:15,high,0,1.75


## 2. LLM Extraction

Extract features from LLM.

### Variables

In [15]:
llm_extract_data_folder = f"{staging_data_folder}/llm_extract"

### Methods

In [16]:
def extract_llm_output(json_str: str) -> dict:
    return json.loads(json_str.replace("```json", "").replace("```", ""))

In [17]:
def extract_features_from_df(
    pd_df: pd.DataFrame,
    id_column: str,
    prompt: str,
    json_save_path: str
) -> None:
    client = OpenAI(api_key=OPENAI_API_KEY)
    results = []

    for index, row in pd_df.iterrows():
        id_value = row[id_column]
        input_text = row["llm_input_text"]

        retry_count = 0
        success = False

        print(f"Processing row with {id_column} '{id_value}'...")

        while retry_count < 3 and not success:
            try:
                response = client.responses.create(
                    model="o4-mini-2025-04-16",
                    instructions=prompt,
                    input=input_text,
                )

                llm_output = extract_llm_output(response.output_text)
                llm_output["id_column"] = id_value
                llm_output["input_text"] = input_text
                results.append(llm_output)

                print(f"Successfully processed row with {id_column} '{id_value}'")
                success = True
            except Exception as e:
                retry_count += 1
                print(f"Error for row with {id_column} '{id_value}' (attempt {retry_count}/3): {e}")

                if retry_count < 3:
                    time.sleep(2)  # backoff retry delay
                else:
                    print(f"Failed to process row with {id_column} '{id_value}' after 3 attempts")

    with open(json_save_path, 'w') as f:
        json.dump(results, f)

### News Articles

Extract features from news articles using LLM.

#### Prompt

In [18]:
NEWS_ARTICLES_PROMPT = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from news articles.
The following is such a news article. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean) — true if the news article describes an actual traffic accident, false otherwise.
- If 'is_accident' is true, include the following additional keys:
    -'accident_datetime'
    -'street'
    -'city'
    -'number_injured'
    -'accident_severity'
    -'drivers' (a list of objects, each with the following keys:)
        -'vehicle_type'
        -'vehicle_damage_severity'
        -'driver_age'
        -'driver_gender'
        -'is_victim' (boolean)

Please ensure that:
-'incident_datetime' is in the format 'YYYY-MM-DD HH:MM' (24-hour format) if possible.
-'number_injured' is an integer greater or equal to 0
-'accident_severity' which relates to how severe the accident in terms of human injuries and and is one of: 'No Injuries', 'Minor', 'Serious' or 'Fatal'
-'driver_gender' is either 'M' or 'F'.
-'vehicle_damage_severity' is one of: 'No damage', 'Minor' or 'Major' where 'Minor' means small damages and 'Major' means total loss or big damages

Please only return JSON—do not add any other text! If values are missing, set them to the string: "none".
"""

#### Main Logic

In [19]:
llm_extract_news_articles_csv = f"{llm_extract_data_folder}/news_articles.csv"

if os.path.isfile(llm_extract_news_articles_csv):
    print("LLM Extraction from news articles was already done")
else:
    llm_input_articles_df = local_news_articles_df[
        [
            "article_id", # article id to trace back
            "title",
            "subtitle",
            "content",
            "publish_date",
        ]
    ]

    llm_input_articles_df["llm_input_text"] = (
        "Title: " + local_news_articles_df["title"].fillna("") + "\n" +
        "Subtitle: " + local_news_articles_df["subtitle"].fillna("") + "\n" +
        "Content: " + local_news_articles_df["content"].fillna("") + "\n" +
        "Publish Date: " + local_news_articles_df["publish_date"].astype(str).fillna("none")
    )

    raw_articles_json = f"{llm_extract_data_folder}/raw_articles.json"

    if os.path.isfile(raw_articles_json):
        print("LLM Feature Extraction from news articles was already done and saved to JSON")
    else:
        extract_features_from_df(
            pd_df=llm_input_articles_df,
            id_column="article_id",
            prompt=NEWS_ARTICLES_PROMPT,
            json_save_path=raw_articles_json,
        )

    llm_extract_news_articles_df = pd.read_json(raw_articles_json)
    llm_extract_news_articles_df.to_csv(llm_extract_news_articles_csv)

    print(f"✓ Saved {len(llm_extract_news_articles_df)} articles to {llm_extract_news_articles_csv}")

llm_extract_news_articles_df = pd.read_csv(llm_extract_news_articles_csv)
display(llm_extract_news_articles_df)

LLM Extraction from news articles was already done


Unnamed: 0.1,Unnamed: 0,is_accident,accident_datetime,street,city,number_injured,accident_severity,drivers,id_column,input_text,accidents
0,0,True,2024-12-04 17:00,Regional Road,St Julian's,0,No Injuries,"[{'vehicle_type': 'Toyota Yaris', 'vehicle_dam...",4208,Title: Driver stuck in traffic says speeding L...,
1,1,False,,,,,,,4167,Title: PN slams government for diverting EU bu...,
2,2,True,2024-12-11 17:00,St Paul's Bay bypass,St Paul's Bay,1,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag...",4093,Title: Motorcyclist seriously hurt in St Paul'...,
3,3,True,2024-12-11 17:00,St Paul’s Bay bypass,St Paul's Bay,1,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag...",4110,Title: Skip involved in horror St Paul’s Bay b...,
4,4,True,,,,,,,4066,"Title: Two people, including teenage girl, cri...","[{'accident_datetime': '2024-12-14 17:30', 'st..."
...,...,...,...,...,...,...,...,...,...,...,...
313,313,False,,,,,,,496574,Title: Watch: Msida flyover to open by end of ...,
314,314,False,,,,,,,496586,Title: Today's front pages\nSubtitle: The top ...,
315,315,False,,,,,,,496577,Title: Traffic overtakes cost of living to bec...,
316,316,False,,,,,,,496733,Title: Employer cleared of responsibility for ...,


### Police Press Releases

#### Prompt

In [20]:
POLICE_PRESS_RELEASES_PROMPT = """
You are a helpful data entry assistant whose responsibility is extracting traffic accident data from police press releases.
The following is such a press release. Please extract details of the accident and return them in a JSON dict with keys:

- 'is_accident' (boolean) — true if the news article describes an actual traffic accident, false otherwise.
- If 'is_accident' is true, include the following additional keys:
    -'accident_datetime'
    -'street'
    -'city'
    -'number_injured'
    -'accident_severity'
    -'drivers' (a list of objects, each with the following keys:)
        -'vehicle_type'
        -'vehicle_damage_severity'
        -'driver_age'
        -'driver_gender'
        -'is_victim' (boolean)

Please ensure that:
-'incident_datetime' is in the format 'YYYY-MM-DD HH:MM' (24-hour format) if possible.
-'number_injured' is an integer greater or equal to 0
-'accident_severity' which relates to how severe the accident in terms of human injuries and and is one of: 'No Injuries', 'Minor', 'Serious' or 'Fatal'
-'driver_gender' is either 'M' or 'F'.
-'vehicle_damage_severity' is one of: 'No damage', 'Minor' or 'Major' where 'Minor' means small damages and 'Major' means total loss or big damages

Please only return JSON—do not add any other text! If values are missing, set them to the string: "none".
"""

#### Main Logic

In [21]:
llm_extract_press_releases_csv = f"{llm_extract_data_folder}/press_releases.csv"

if os.path.isfile(llm_extract_press_releases_csv):
    print("LLM Extraction from police press releases was already done")
else:
    llm_input_press_releases_df = police_press_releases_df[
        [
            "release_id", # release_id
            "title",
            "date_published",
            "content",
        ]
    ]

    llm_input_press_releases_df["llm_input_text"] = (
        "Title: " + police_press_releases_df["title"].fillna("") + "\n" +
        "Content: " + police_press_releases_df["content"].fillna("") + "\n" +
        "Publish Date: " + police_press_releases_df["date_published"].astype(str).fillna("none")
    )

    raw_press_releases_json = f"{llm_extract_data_folder}/raw_press_releases.json"

    if os.path.isfile(raw_press_releases_json):
        print("LLM Feature Extraction from police press releases was already done and saved to JSON")
    else:
        extract_features_from_df(
            pd_df=llm_input_press_releases_df,
            id_column="release_id",
            prompt=POLICE_PRESS_RELEASES_PROMPT,
            json_save_path=raw_press_releases_json,
        )

    llm_extract_press_releases_df = pd.read_json(raw_press_releases_json)
    llm_extract_press_releases_df.to_csv(llm_extract_press_releases_csv)

    print(f"✓ Saved {len(llm_extract_press_releases_df)} articles to {llm_extract_press_releases_csv}")

llm_extract_press_releases_df = pd.read_csv(llm_extract_press_releases_csv)
display(llm_extract_press_releases_df)

LLM Extraction from police press releases was already done


Unnamed: 0.1,Unnamed: 0,is_accident,accident_datetime,street,city,number_injured,accident_severity,drivers,id_column,input_text,accidents
0,0,True,2025-10-09 09:30,Triq il-Belt Valletta,Żurrieq,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever...",1,Title: Collision between a car and a motorbike...,
1,1,True,2025-06-19 18:30,Triq Dawret il-Gudja,Gudja,1.0,Serious,"[{'vehicle_type': 'Honda fit', 'vehicle_damage...",2,Title: Car-motorcycle traffic accident\nConten...,
2,2,True,2025-05-12 08:00,Valley Road,Qormi,1.0,Serious,"[{'vehicle_type': 'Ford Transit', 'vehicle_dam...",3,Title: Car-motorcycle collision in Ħal Qormi\n...,
3,3,True,2025-07-29 18:00,Triq Dawret Ħal Għaxaq,Għaxaq,1.0,Serious,"[{'vehicle_type': 'Volvo XC60', 'vehicle_damag...",4,Title: Collision between motorcycle and car in...,
4,4,True,2025-04-06 20:45,Triq il-Buqana,Rabat,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever...",5,Title: Car-motorcycle collision\nContent: Yest...,
...,...,...,...,...,...,...,...,...,...,...,...
106,106,True,2025-02-05 09:00,Vjal L-Istadium Nazzjonali,Attard,1.0,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag...",107,Title: Motorcycle accident in Attard\nContent:...,
107,107,True,2024-12-19 10:45,Triq il-Ġermanja,Naxxar,1.0,Serious,"[{'vehicle_type': 'Toyota Vitz', 'vehicle_dama...",108,Title: Naxxar traffic accident\nContent: Today...,
108,108,True,2025-03-16 08:00,Vjal il-Helsien,Zebbug,2.0,Serious,"[{'vehicle_type': 'Peugeot 306', 'vehicle_dama...",109,Title: Żebbuġ traffic accident \nContent: Tod...,
109,109,True,2025-07-17 22:15,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever...",110,Title: Collision between a car and e-scooter\n...,


## 3. Join Extraction Together

Join the regex extracted data and the LLM extracted data together for manual auditing.

- Datetime feature extraction using Regex.
- General feature extraction using LLM.

### Variables

In [22]:
joined_extract_data_folder = f"{staging_data_folder}/joined_extract"
og_prefix = "og_"
regex_dtime_prefix = "regxdt_"
llm_prefix = "llm_"

### Methods

In [23]:
def parse_llm_drivers(x):
    if pd.isna(x) or x.strip() == "":
        return []
    return ast.literal_eval(x)

### News Articles

In [24]:
joined_news_articles_csv = f"{joined_extract_data_folder}/news_articles.csv"

if os.path.isfile(joined_news_articles_csv):
    print("Extraction CSVs were already joined together")
else:
    to_join_news_articles_df = (
        local_news_articles_df[[
            "article_id",
            "url",
            "source_name",
            "source_url",
            "title",
            "subtitle",
            # "author_name", -> not interested in the name of the author
            "publish_date",
            "content",
            "top_image_url",
            "top_image_caption",
            "created_at",
            "tags",
            # "categories" -> always empty set, not interested in this column
        ]]
        .rename(columns={
            "article_id": "article_id",
            "url": f"{og_prefix}url",
            "source_name": f"{og_prefix}source_name",
            "source_url": f"{og_prefix}source_url",
            "title": f"{og_prefix}title",
            "subtitle": f"{og_prefix}subtitle",
            "publish_date": f"{og_prefix}publish_date",
            "content": f"{og_prefix}content",
            "top_image_url": f"{og_prefix}top_image_url",
            "top_image_caption": f"{og_prefix}top_image_caption",
            "created_at": f"{og_prefix}created_at",
            "tags": f"{og_prefix}tags",
        })
    )

    to_join_regex_news_articles_df = (
        regex_extract_news_articles_df[[
            "article_id",
            "accident_datetime",
        ]]
        .rename(columns={
            "article_id": "article_id",
            "accident_datetime": f"{regex_dtime_prefix}accident_datetime",
        })
    )

    to_join_llm_news_articles_df = (
        llm_extract_news_articles_df[[
            "id_column", # article_id
            "is_accident",
            "street",
            "city",
            "number_injured",
            "accident_severity",
            "drivers",
        ]]
        .rename(columns={
            "id_column": "article_id",
            "is_accident": f"{llm_prefix}is_accident",
            "street": f"{llm_prefix}street",
            "city": f"{llm_prefix}city",
            "number_injured": f"{llm_prefix}number_injured",
            "accident_severity": f"{llm_prefix}accident_severity",
            "drivers": f"{llm_prefix}drivers",
        })
    )

    joined_news_articles_df = (
        to_join_news_articles_df
        .merge(to_join_regex_news_articles_df, on="article_id", how="left")
        .merge(to_join_llm_news_articles_df, on="article_id", how="left")
    )

    joined_news_articles_df["llm_drivers"] = joined_news_articles_df["llm_drivers"].apply(parse_llm_drivers)
    exploded_news_articles_df = joined_news_articles_df.explode("llm_drivers", ignore_index=True)

    joined_news_articles_df = pd.concat(
        [
            exploded_news_articles_df.drop(columns=["llm_drivers"]),
            pd.json_normalize(exploded_news_articles_df["llm_drivers"])
        ],
        axis=1
    ).rename(columns={
        "vehicle_type": f"{llm_prefix}_vehicle_type",
        "vehicle_damage_severity": f"{llm_prefix}vehicle_damage_severity",
        "driver_age": f"{llm_prefix}driver_age",
        "driver_gender": f"{llm_prefix}driver_gender",
        "is_victim": f"{llm_prefix}is_victim",
    })

    joined_news_articles_df.to_csv(joined_news_articles_csv)

    print(f"✓ Saved {len(joined_news_articles_df)} articles to {joined_news_articles_csv}")


joined_news_articles_df = pd.read_csv(joined_news_articles_csv)
display(joined_news_articles_df)

Extraction CSVs were already joined together


Unnamed: 0.1,Unnamed: 0,article_id,og_url,og_source_name,og_source_url,og_title,og_subtitle,og_publish_date,og_content,og_top_image_url,...,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm__vehicle_type,llm_vehicle_damage_severity,llm_driver_age,llm_driver_gender,llm_is_victim
0,0,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,...,True,Regional Road,St Julian's,0,No Injuries,Toyota Yaris,Minor,78,M,True
1,1,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,...,True,Regional Road,St Julian's,0,No Injuries,LESA vehicle,No damage,none,none,False
2,2,4167,https://timesofmalta.com/article/pn-slams-gove...,Times of Malta,https://timesofmalta.com,PN slams government for diverting EU bus funds...,"'By encouraging the use of private cars, the g...",2024-12-09,The PN on Monday slammed the government for di...,https://cdn-attachments.timesofmalta.com/d9afe...,...,False,,,,,,,,,
3,3,4093,https://timesofmalta.com/article/motorcyclist-...,Times of Malta,https://timesofmalta.com,Motorcyclist seriously hurt in St Paul's Bay b...,Residents complained several times about inade...,2024-12-11,A motorcyclist was rushed to hospital in a cri...,https://cdn-attachments.timesofmalta.com/633f6...,...,True,St Paul's Bay bypass,St Paul's Bay,1,Serious,Motorcycle,Major,54,M,True
4,4,4110,https://timesofmalta.com/article/skip-involved...,Times of Malta,https://timesofmalta.com,Skip involved in horror St Paul’s Bay bypass c...,Motorcyclist hurt in crash on Wednesday evenin...,2024-12-12,A private contractor who placed a skip on St P...,https://cdn-attachments.timesofmalta.com/fc23e...,...,True,St Paul’s Bay bypass,St Paul's Bay,1,Serious,Motorcycle,none,54,M,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
448,448,496574,https://timesofmalta.com/article/watch-msida-f...,Times of Malta,https://timesofmalta.com,Watch: Msida flyover to open by end of year as...,'The flyover will allow us to create a new ope...,2025-10-12,The Msida flyover will open by the end of the ...,https://cdn-attachments.timesofmalta.com/f47de...,...,False,,,,,,,,,
449,449,496586,https://timesofmalta.com/article/today-front-p...,Times of Malta,https://timesofmalta.com,Today's front pages,The top stories in Malta's newspapers,2025-10-13,The following are the top stories in Malta's n...,https://cdn-attachments.timesofmalta.com/28065...,...,False,,,,,,,,,
450,450,496577,https://timesofmalta.com/article/traffic-overt...,Times of Malta,https://timesofmalta.com,Traffic overtakes cost of living to become peo...,Poll data suggests frustration on Maltese road...,2025-10-13,"Traffic, parking and public transport-related ...",https://cdn-attachments.timesofmalta.com/861ca...,...,False,,,,,,,,,
451,451,496733,https://timesofmalta.com/article/employer-clea...,Times of Malta,https://timesofmalta.com,Employer cleared of responsibility for young w...,"Court raps police, OHSA for not working togeth...",2025-10-14,A court has sharply criticised the police and ...,https://cdn-attachments.timesofmalta.com/2d4fb...,...,False,,,,,,,,,


### Police Press Releases

In [25]:
joined_press_releases_csv = f"{joined_extract_data_folder}/press_releases.csv"

if os.path.isfile(joined_press_releases_csv):
    print("Extraction CSVs were already joined together")
else:
    to_join_press_releases_df = (
        police_press_releases_df[[
            "release_id",
            "title",
            "content",
            "date_published",
            "date_modified",
        ]]
        .rename(columns={
            "release_id": "release_id",
            "title": f"{og_prefix}title",
            "content": f"{og_prefix}content",
            "date_published": f"{og_prefix}date_published",
            "date_modified": f"{og_prefix}date_modified",
        })
    )

    to_join_regex_press_releases_df = (
        regex_extract_press_releases_df[[
            "release_id",
            "accident_datetime",
        ]]
        .rename(columns={
            "release_id": "release_id",
            "accident_datetime": f"{regex_dtime_prefix}accident_datetime",
        })
    )

    to_join_llm_press_releases_df = (
        llm_extract_press_releases_df[[
            "id_column", # release_id
            "is_accident",
            "street",
            "city",
            "number_injured",
            "accident_severity",
            "drivers",
        ]]
        .rename(columns={
            "id_column": "release_id",
            "is_accident": f"{llm_prefix}is_accident",
            "street": f"{llm_prefix}street",
            "city": f"{llm_prefix}city",
            "number_injured": f"{llm_prefix}number_injured",
            "accident_severity": f"{llm_prefix}accident_severity",
            "drivers": f"{llm_prefix}drivers",
        })
    )

    joined_press_releases_df = (
        to_join_press_releases_df
        .merge(to_join_regex_press_releases_df, on="release_id", how="left")
        .merge(to_join_llm_press_releases_df, on="release_id", how="left")
    )

    joined_press_releases_df["llm_drivers"] = joined_press_releases_df["llm_drivers"].apply(parse_llm_drivers)
    exploded_press_releases_df = joined_press_releases_df.explode("llm_drivers", ignore_index=True)

    joined_press_releases_df = pd.concat(
        [
            exploded_press_releases_df.drop(columns=["llm_drivers"]),
            pd.json_normalize(exploded_press_releases_df["llm_drivers"])
        ],
        axis=1
    ).rename(columns={
        "vehicle_type": f"{llm_prefix}_vehicle_type",
        "vehicle_damage_severity": f"{llm_prefix}vehicle_damage_severity",
        "driver_age": f"{llm_prefix}driver_age",
        "driver_gender": f"{llm_prefix}driver_gender",
        "is_victim": f"{llm_prefix}is_victim",
    })

    joined_press_releases_df.to_csv(joined_press_releases_csv)

    print(f"✓ Saved {len(joined_press_releases_df)} articles to {joined_press_releases_csv}")

joined_press_releases_df = pd.read_csv(joined_press_releases_csv)
display(joined_press_releases_df)

Extraction CSVs were already joined together


Unnamed: 0.1,Unnamed: 0,release_id,og_title,og_content,og_date_published,og_date_modified,regxdt_accident_datetime,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm__vehicle_type,llm_vehicle_damage_severity,llm_driver_age,llm_driver_gender,llm_is_victim
0,0,1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",2025-10-09,2025-10-09,2025-10-09 09:30:00,True,Triq il-Belt Valletta,Żurrieq,1.0,Serious,Car,none,67,F,False
1,1,1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",2025-10-09,2025-10-09,2025-10-09 09:30:00,True,Triq il-Belt Valletta,Żurrieq,1.0,Serious,Motorbike,none,61,M,True
2,2,2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",2025-06-20,2025-06-20,2025-06-19 18:30:00,True,Triq Dawret il-Gudja,Gudja,1.0,Serious,Honda fit,none,64,M,False
3,3,2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",2025-06-20,2025-06-20,2025-06-19 18:30:00,True,Triq Dawret il-Gudja,Gudja,1.0,Serious,Kawasaki Ninja motorcycle,none,23,M,True
4,4,3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",2025-05-12,2025-05-12,2025-05-12 08:00:00,True,Valley Road,Qormi,1.0,Serious,Ford Transit,none,34,M,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,170,109,Żebbuġ traffic accident,"Today, at around 0800hrs, the Police were info...",2025-03-16,2025-03-16,2025-03-16 08:00:00,True,Vjal il-Helsien,Zebbug,2.0,Serious,Peugeot 306,none,59,M,False
171,171,110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",2025-07-18,2025-07-18,2025-07-17 22:15:00,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,Car,none,41,none,False
172,172,110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",2025-07-18,2025-07-18,2025-07-17 22:15:00,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,E-scooter,none,17,none,True
173,173,111,Traffic accident in Gwardamanġa,"Today, at around 0700hrs, the Police were info...",2025-08-12,2025-08-12,2025-08-12 07:00:00,True,St Luke’s Square,Gwardamanġa,2.0,Serious,Volkswagen Caddy,none,62,M,True


## 4. Combine datasets and deduplicate

DataFrames have been manually audited and validated. Some values have been updated.

Now, we combine both of the datasets together and deduplicate. Deduplication will happen programatically to find duplicates and manually to resolve them.

### Variables

In [26]:
audited_data_folder = f"{staging_data_folder}/audited"
deduplication_data_folder = f"{staging_data_folder}/deduplication"

dim_data_folder = f"{staging_data_folder}/dims"
dim_town_csv = f"{dim_data_folder}/dim_town.csv"
dim_street_csv = f"{dim_data_folder}/dim_street.csv"

columns_to_select = [
    "id",
    "og_title",
    "og_content",
    "og_date_published",
    "accident_datetime",
    "is_applicable_accident",
    "street",
    "city",
    "number_injured",
    "accident_severity",
    "vehicle_type",
    "driver_age",
    "driver_gender",
]

rename_columns = {
    "og_title": "title",
    "og_content": "content",
    "og_date_published": "date_published"
}

### Audited News Articles

In [27]:
audited_news_articles_csv = f"{audited_data_folder}/news_articles.csv"
audited_news_articles_df = pd.read_csv(audited_news_articles_csv)

print("Number of rows in news articles before removing non-accidents:", len(audited_news_articles_df))
audited_news_articles_df = audited_news_articles_df[audited_news_articles_df["is_applicable_accident"]]
print("Number of rows in news articles after removing non-accidents:", len(audited_news_articles_df))

audited_news_articles_df["id"] = ("article_" + audited_news_articles_df["article_id"].astype(str))

audited_news_articles_df = (
    audited_news_articles_df[columns_to_select]
    .rename(columns=rename_columns)
)

max_id_frequency_news_articles = audited_news_articles_df["id"].value_counts().max()
print(f"The maximum frequency of any 'id' in the news articles DataFrame is: {max_id_frequency_news_articles}")

display(audited_news_articles_df)

Number of rows in news articles before removing non-accidents: 489
Number of rows in news articles after removing non-accidents: 351
The maximum frequency of any 'id' in the news articles DataFrame is: 4


Unnamed: 0,id,title,content,date_published,accident_datetime,is_applicable_accident,street,city,number_injured,accident_severity,vehicle_type,driver_age,driver_gender
0,article_4208.0,Driver stuck in traffic says speeding LESA car...,A motorist claims his car mirror was shattered...,07/12/2024,04/12/2024 17:00,True,Regional Road,St Julian's,0,not injured,Car,78,M
1,article_4208.0,Driver stuck in traffic says speeding LESA car...,A motorist claims his car mirror was shattered...,07/12/2024,04/12/2024 17:00,True,Regional Road,St Julian's,0,not injured,Car,Unknown,Unknown
3,article_4093.0,Motorcyclist seriously hurt in St Paul's Bay b...,A motorcyclist was rushed to hospital in a cri...,11/12/2024,11/12/2024 17:00,True,St Paul's Bay bypass,St Paul's Bay,1,Serious,Motorbike,54,M
4,article_4110.0,Skip involved in horror St Paul’s Bay bypass c...,A private contractor who placed a skip on St P...,12/12/2024,11/12/2024 17:00,True,St Paul’s Bay bypass,St Paul's Bay,1,Serious,Motorbike,54,M
5,article_4066.0,"Two people, including teenage girl, critically...",A 29-year-old man and 17-year-old girl were cr...,14/12/2024,13/12/2024 17:30,True,Triq il-Marfa,Mellieħa,0,not injured,Car,52,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...
479,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,True,Triq Sant Anna,Floriana,0,not injured,car,none,none
480,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,1,Serious,car,35,M
481,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,1,Serious,Motorbike,43,M
482,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,0,not injured,car,67,M


### Audited Press Releases

In [28]:
audited_press_releases_csv = f"{audited_data_folder}/press_releases.csv"
audited_press_releases_df = pd.read_csv(audited_press_releases_csv)

print("Number of rows in press releases before removing non-accidents:", len(audited_press_releases_df))
audited_press_releases_df = audited_press_releases_df[audited_press_releases_df["is_applicable_accident"]]
print("Number of rows in press releases after removing non-accidents:", len(audited_press_releases_df))

audited_press_releases_df["id"] = ("release_" + audited_press_releases_df["release_id"].astype(str))

audited_press_releases_df = (
    audited_press_releases_df[columns_to_select]
    .rename(columns=rename_columns)
)

max_id_frequency_press_releases = audited_press_releases_df["id"].value_counts().max()
print(f"The maximum frequency of any 'id' in the press releases DataFrame is: {max_id_frequency_press_releases}")

display(audited_press_releases_df)

Number of rows in press releases before removing non-accidents: 191
Number of rows in press releases after removing non-accidents: 191
The maximum frequency of any 'id' in the press releases DataFrame is: 4


Unnamed: 0,id,title,content,date_published,accident_datetime,is_applicable_accident,street,city,number_injured,accident_severity,vehicle_type,driver_age,driver_gender
0,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,True,Triq il-Belt Valletta,Żurrieq,0,not injured,Car,67,F
1,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,True,Triq il-Belt Valletta,Żurrieq,1,Serious,Motorbike,61,M
2,release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,0,not injured,Car,64,M
3,release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,1,Grievious,Motorbike,23,M
4,release_3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",12/05/2025,12/05/2025 08:00,True,Valley Road,Qormi,0,not injured,Van,34,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...
186,release_109,Żebbuġ traffic accident,"Today, at around 0800hrs, the Police were info...",16/03/2025,16/03/2025 08:00,True,Vjal il-Helsien,Żebbuġ,2,Grievious,Car,59,M
187,release_110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",18/07/2025,17/07/2025 22:15,True,Triq il-Wied ta’ Birkirkara,Birkirkara,0,not injured,Car,41,M
188,release_110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",18/07/2025,17/07/2025 22:15,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1,Grievious,Bicycle,17,M
189,release_111,Traffic accident in Gwardamanġa,"Today, at around 0700hrs, the Police were info...",12/08/2025,12/08/2025 07:00,True,St Luke’s Square,Gwardamanġa,1,Grievious,Car,62,M


### Combining both news articles and press releases together

In [29]:
# function to process dynamic columns for each group
def process_dynamic_group(group):
    row_data = {}
    # Using uppercase letters for prefixes A, B, C...
    prefixes = [chr(i) for i in range(ord('A'), ord('Z') + 1)]

    for i, (_, row) in enumerate(group[dynamic_cols].iterrows()):
        prefix = prefixes[i]
        for col in dynamic_cols:
            row_data[f'{prefix}_{col}'] = row[col]
    return pd.Series(row_data)

In [30]:
MALTESE_MAP = {
    "għ": "gh",
    "ċ": "c",
    "ġ": "g",
    "ħ": "h",
    "ż": "z",
}

def normalise_maltese_text(text):
    text = str(text).strip().lower()

    # Explicit Maltese replacements
    for k, v in MALTESE_MAP.items():
        text = text.replace(k, v)

    # Unicode normalization (handles remaining accents safely)
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )

    # Keep ASCII only
    text = text.encode("ascii", "ignore").decode("ascii")

    return text

In [31]:
max_id_frequency = max(max_id_frequency_press_releases, max_id_frequency_news_articles)
print(f"The maximum frequency of any 'id' in both of the DataFrames is: {max_id_frequency}")

# define the static columns that should be unique per release_id
static_cols = [
    "id", "title", "content", "date_published",
    "accident_datetime", "street", "city"
]

# define the dynamic columns that need to be expanded with prefixes
dynamic_cols = [
    "number_injured", "accident_severity", "vehicle_type",
    "driver_age", "driver_gender"
]

all_columns = static_cols + dynamic_cols

unioned_df = pd.concat([audited_press_releases_df, audited_news_articles_df], ignore_index=True)
print("Unioned DataFrame of both news articles and press releases:")
display(unioned_df)

static_unioned_df = unioned_df[static_cols].drop_duplicates(subset=["id"]).set_index("id")
print("Unioned DataFrame of both news articles and press releases with only the 'static' columns:")
display(static_unioned_df)

dynamic_unioned_df = unioned_df.groupby("id").apply(process_dynamic_group, include_groups=False).unstack()
print("Unioned DataFrame of both news articles and press releases with only the 'dynamic' columns:")
display(static_unioned_df)


accidents_with_duplicates_df = static_unioned_df.merge(dynamic_unioned_df, left_index=True, right_index=True, how='left')
accidents_with_duplicates_df = accidents_with_duplicates_df.reset_index()

accidents_with_duplicates_df["street"] = accidents_with_duplicates_df["street"].apply(normalise_maltese_text)
accidents_with_duplicates_df["city"] = accidents_with_duplicates_df["city"].apply(normalise_maltese_text)

print("All accidents, with duplicates from both data sources:")
display(accidents_with_duplicates_df)

The maximum frequency of any 'id' in both of the DataFrames is: 4
Unioned DataFrame of both news articles and press releases:


Unnamed: 0,id,title,content,date_published,accident_datetime,is_applicable_accident,street,city,number_injured,accident_severity,vehicle_type,driver_age,driver_gender
0,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,True,Triq il-Belt Valletta,Żurrieq,0,not injured,Car,67,F
1,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,True,Triq il-Belt Valletta,Żurrieq,1,Serious,Motorbike,61,M
2,release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,0,not injured,Car,64,M
3,release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,1,Grievious,Motorbike,23,M
4,release_3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",12/05/2025,12/05/2025 08:00,True,Valley Road,Qormi,0,not injured,Van,34,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...
537,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,True,Triq Sant Anna,Floriana,0,not injured,car,none,none
538,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,1,Serious,car,35,M
539,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,1,Serious,Motorbike,43,M
540,article_496362.0,Għajnsielem mayor urges traffic safety measure...,The Mayor of Għajnsielem on Friday called on t...,10/10/2025,10/10/2025 9:00,True,Triq l-Imġarr,Għajnsielem,0,not injured,car,67,M


Unioned DataFrame of both news articles and press releases with only the 'static' columns:


Unnamed: 0_level_0,title,content,date_published,accident_datetime,street,city
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,Triq il-Belt Valletta,Żurrieq
release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,Triq Dawret il-Gudja,Gudja
release_3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",12/05/2025,12/05/2025 08:00,Valley Road,Qormi
release_4,Collision between motorcycle and car in Għaxaq,"Yesterday, at around 1800hrs, the Police were ...",30/07/2025,29/07/2025 18:00,Triq Dawret Ħal Għaxaq,Għaxaq
release_5,Car-motorcycle collision,"Yesterday, at around quarter to nine in the ev...",07/04/2025,06/04/2025 20:45,Triq il-Buqana,Rabat
...,...,...,...,...,...,...
article_495969.0,Husband of injured nurse thanks donors as fund...,A crowdfunding campaign for a nurse left in a ...,07/10/2025,16/07/2025 0:00,Mount Carmel Hospital,Ħ'Attard
article_496206.0,Watch: Traffic accident caught live on air dur...,Malta’s traffic woes were brought home to TV v...,09/10/2025,09/10/2025 0:00,Paola roundabout,Paola
article_496202.0,Motorcyclist seriously injured in Żurrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,09/10/2025 9:30,Triq il-Belt Valletta,Żurrieq
article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,Triq Sant Anna,Floriana


Unioned DataFrame of both news articles and press releases with only the 'dynamic' columns:


Unnamed: 0_level_0,title,content,date_published,accident_datetime,street,city
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,Triq il-Belt Valletta,Żurrieq
release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,Triq Dawret il-Gudja,Gudja
release_3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",12/05/2025,12/05/2025 08:00,Valley Road,Qormi
release_4,Collision between motorcycle and car in Għaxaq,"Yesterday, at around 1800hrs, the Police were ...",30/07/2025,29/07/2025 18:00,Triq Dawret Ħal Għaxaq,Għaxaq
release_5,Car-motorcycle collision,"Yesterday, at around quarter to nine in the ev...",07/04/2025,06/04/2025 20:45,Triq il-Buqana,Rabat
...,...,...,...,...,...,...
article_495969.0,Husband of injured nurse thanks donors as fund...,A crowdfunding campaign for a nurse left in a ...,07/10/2025,16/07/2025 0:00,Mount Carmel Hospital,Ħ'Attard
article_496206.0,Watch: Traffic accident caught live on air dur...,Malta’s traffic woes were brought home to TV v...,09/10/2025,09/10/2025 0:00,Paola roundabout,Paola
article_496202.0,Motorcyclist seriously injured in Żurrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,09/10/2025 9:30,Triq il-Belt Valletta,Żurrieq
article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,Triq Sant Anna,Floriana


All accidents, with duplicates from both data sources:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,C_accident_severity,C_driver_age,C_driver_gender,C_number_injured,C_vehicle_type,D_accident_severity,D_driver_age,D_driver_gender,D_number_injured,D_vehicle_type
0,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,triq il-belt valletta,zurrieq,not injured,67,F,...,,,,,,,,,,
1,release_2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",20/06/2025,19/06/2025 18:30,triq dawret il-gudja,gudja,not injured,64,M,...,,,,,,,,,,
2,release_3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",12/05/2025,12/05/2025 08:00,valley road,qormi,not injured,34,M,...,,,,,,,,,,
3,release_4,Collision between motorcycle and car in Għaxaq,"Yesterday, at around 1800hrs, the Police were ...",30/07/2025,29/07/2025 18:00,triq dawret hal ghaxaq,ghaxaq,not injured,42,M,...,,,,,,,,,,
4,release_5,Car-motorcycle collision,"Yesterday, at around quarter to nine in the ev...",07/04/2025,06/04/2025 20:45,triq il-buqana,rabat,not injured,25,M,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,article_495969.0,Husband of injured nurse thanks donors as fund...,A crowdfunding campaign for a nurse left in a ...,07/10/2025,16/07/2025 0:00,mount carmel hospital,h'attard,Serious,none,none,...,,,,,,,,,,
319,article_496206.0,Watch: Traffic accident caught live on air dur...,Malta’s traffic woes were brought home to TV v...,09/10/2025,09/10/2025 0:00,paola roundabout,paola,slight,none,none,...,,,,,,,,,,
320,article_496202.0,Motorcyclist seriously injured in Żurrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,09/10/2025 9:30,triq il-belt valletta,zurrieq,Serious,61,M,...,,,,,,,,,,
321,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,triq sant anna,floriana,not injured,none,none,...,,,,,,,,,,


### Add accident_date_id column

Add `accident_date_id` column, sort DataFrame by this column and save to perform manual deduplication.

In [32]:
accidents_with_duplicates_df["accident_date_id"] = (
    pd.to_datetime(accidents_with_duplicates_df['accident_datetime'], format='mixed', dayfirst=True, errors='coerce')
      .dt.strftime("%Y%m%d")
)

null_rows = accidents_with_duplicates_df[accidents_with_duplicates_df["accident_date_id"].isna()]
assert null_rows.empty, f"Found null accident_datetime_id values:\n{null_rows}"

accidents_with_duplicates_df = accidents_with_duplicates_df.sort_values(by=["accident_date_id"])

pre_deduplication_csv = f"{deduplication_data_folder}/pre_deduplication.csv"

accidents_with_duplicates_df.to_csv(pre_deduplication_csv, index=False)
print(f"✓ Saved {len(accidents_with_duplicates_df)} articles to {pre_deduplication_csv}")

✓ Saved 323 articles to ../../data/staging/deduplication/pre_deduplication.csv


### Data Deduplication

Scope --> De-duplication of observations (rows) referring to same incident.

Source: `pre_deduplication.csv`
Store new data in `deduplication.csv`

Logic applied:

1.   Identify observations with identical `accident datetime`
2.   Confirm that accident took place at same `city`, `street`
3.   Choose observations with worst case `accident_severity` and `max number of injured`
4.   If severity level is identical store only the police press release
5.   List any accidents with occurred at same date/time but at different city/street --> for manual evaluation, store as `duplicates_manual.csv`

#### Compute maximum accident severity

In [33]:
severity_order = {'Unknown' : 0, 'not injured': 1, 'Slight': 2, 'Grievious' : 3, 'Serious': 4, 'Fatal': 5}
def get_max_severity(row):
    severities = []
    for prefix in ['A', 'B', 'C', 'D']:
        col_name = f"{prefix}_accident_severity"
        if col_name in row and pd.notna(row[col_name]):
            severities.append(row[col_name])

    if not severities:
        return None

    # get the severity with the highest order (worst severity)
    max_severity = None
    max_severity_value = -1
    for severity in severities:
        if severity in severity_order:
            if severity_order[severity] > max_severity_value:
                max_severity_value = severity_order[severity]
                max_severity = severity
    return max_severity


accidents_with_duplicates_df['accident_severity'] = accidents_with_duplicates_df.apply(get_max_severity, axis=1)

accidents_with_duplicates_df.head()

Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,C_driver_gender,C_number_injured,C_vehicle_type,D_accident_severity,D_driver_age,D_driver_gender,D_number_injured,D_vehicle_type,accident_date_id,accident_severity
138,article_3834.0,Driver paid for damages he was never accused o...,A driver who was conditionally discharged over...,30/12/2024,17/12/2006 0:00,triq il-mithna,qormi,Grievious,Unknown,Unknown,...,,,,,,,,,20061217,Grievious
189,article_2101.0,Driver jailed for four years for killing passe...,A 48-year-old man has been jailed for four yea...,01/04/2025,20/08/2017 15:25,triq sir temi zammit,mgarr,Unknown,26,M,...,,,,,,,,,20170820,Fatal
242,article_353524.0,Elderly man acquitted of traffic fatality as v...,A man has been acquitted of causing the death ...,23/07/2025,13/03/2018 0:00,triq wied bladun,paola,Fatal,83,M,...,,,,,,,,,20180313,Fatal
203,article_1352.0,Motorcyclist injured in pothole incident award...,A 39-year-old Turkish national has been awarde...,09/05/2025,02/05/2020 8:00,triq ghar lapsi,siggiewi,Serious,39,M,...,,,,,,,,,20200502,Serious
275,article_490632.0,‘Why do we have to beg for justice?’,The brother of a food courier killed in a frea...,17/08/2025,01/02/2022 0:00,aldo moro road,marsa,Fatal,28,M,...,,,,,,,,,20220201,Fatal


#### Compute total no. of injured

In [34]:
injured_cols = ['A_number_injured', 'B_number_injured', 'C_number_injured', 'D_number_injured']
def calculate_total_injured(row):
    total = 0
    for col in injured_cols:
        if col in row.index and pd.notna(row[col]):
            # convert to numeric, handling any string values
            try:
                total += float(row[col])
            except (ValueError, TypeError):
                pass
    return total

accidents_with_duplicates_df['total_injured'] = accidents_with_duplicates_df.apply(calculate_total_injured, axis=1)

accidents_with_duplicates_df.head()

Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,C_number_injured,C_vehicle_type,D_accident_severity,D_driver_age,D_driver_gender,D_number_injured,D_vehicle_type,accident_date_id,accident_severity,total_injured
138,article_3834.0,Driver paid for damages he was never accused o...,A driver who was conditionally discharged over...,30/12/2024,17/12/2006 0:00,triq il-mithna,qormi,Grievious,Unknown,Unknown,...,,,,,,,,20061217,Grievious,1.0
189,article_2101.0,Driver jailed for four years for killing passe...,A 48-year-old man has been jailed for four yea...,01/04/2025,20/08/2017 15:25,triq sir temi zammit,mgarr,Unknown,26,M,...,,,,,,,,20170820,Fatal,3.0
242,article_353524.0,Elderly man acquitted of traffic fatality as v...,A man has been acquitted of causing the death ...,23/07/2025,13/03/2018 0:00,triq wied bladun,paola,Fatal,83,M,...,,,,,,,,20180313,Fatal,1.0
203,article_1352.0,Motorcyclist injured in pothole incident award...,A 39-year-old Turkish national has been awarde...,09/05/2025,02/05/2020 8:00,triq ghar lapsi,siggiewi,Serious,39,M,...,,,,,,,,20200502,Serious,1.0
275,article_490632.0,‘Why do we have to beg for justice?’,The brother of a food courier killed in a frea...,17/08/2025,01/02/2022 0:00,aldo moro road,marsa,Fatal,28,M,...,,,,,,,,20220201,Fatal,1.0


#### Add `delete_flag`

Set `delete_flag` as 'suspect' to all accidents having identical accident date/time. All other observations to be set as 'retain'.

In [35]:
accidents_with_duplicates_df['accident_datetime'] = pd.to_datetime(
    accidents_with_duplicates_df['accident_datetime'],
    format='mixed', dayfirst=True, errors='coerce'
)

datetime_counts = accidents_with_duplicates_df['accident_datetime'].value_counts()
datetime_counts.head()

accident_datetime
2024-12-11 17:00:00    8
2025-03-16 08:00:00    4
2025-01-15 07:45:00    4
2025-08-10 05:20:00    4
2025-08-17 16:00:00    4
Name: count, dtype: int64

In [36]:
accidents_with_duplicates_df['delete_flag'] = accidents_with_duplicates_df['accident_datetime'].map(datetime_counts)
accidents_with_duplicates_df['delete_flag'] = accidents_with_duplicates_df['delete_flag'].apply(lambda x: 'suspect' if x > 1 else 'retain')

print("Statistics for 'delete_flag' column:")
display(accidents_with_duplicates_df['delete_flag'].value_counts())

Statistics for 'delete_flag' column:


delete_flag
suspect    193
retain     130
Name: count, dtype: int64

#### Add canonical 'city', `C_city` from 'dim_town.csv'

In [37]:
dim_town_df = pd.read_csv(dim_town_csv)

print("Head of town DF:")
display(dim_town_df.head())

# prepare dim_town_df for merging
# select only 'town' and 'variant' columns.
# drop duplicates based on 'variant' to ensure each variant maps to a single town if multiple entries exist
dim_town_lookup = dim_town_df[['town', 'variant']].drop_duplicates(subset=['variant']).copy()

# perform a left merge to add the canonical 'town' name as 'C_city'
# matching accidents_with_duplicates_df['city'] with dim_town_lookup['variant']
accidents_with_duplicates_df = accidents_with_duplicates_df.merge(
    dim_town_lookup,
    left_on='city',
    right_on='variant',
    how='left',
    suffixes=('', '_dim_town') # add suffix to avoid collision with existing 'town' if it exists.
)

# rename the 'town' column from the merge to 'C_city'
accidents_with_duplicates_df.rename(columns={'town': 'C_city'}, inplace=True)

# drop the 'variant' column brought in from the merge, as it's no longer needed
accidents_with_duplicates_df.drop(columns=['variant'], inplace=True)

# display the updated DataFrame to show the new 'C_city' column
print("DataFrame with new 'C_city' column (first 5 rows of relevant columns):")
display(accidents_with_duplicates_df[['id', 'city', 'C_city']])

missing_cities_df = accidents_with_duplicates_df[accidents_with_duplicates_df['C_city'].isna()]
missing_cities = set(missing_cities_df["city"].tolist())

print(f"Number of rows where 'C_city' is null (no match found): {accidents_with_duplicates_df['C_city'].isna().sum()}")
print(f"Missing cities: {missing_cities}")

Head of town DF:


Unnamed: 0,town,variant,is_canonical
0,attard,attard,True
1,attard,attard road,False
2,attard,h attard,False
3,attard,h'attard,False
4,balzan,balzan,True


DataFrame with new 'C_city' column (first 5 rows of relevant columns):


Unnamed: 0,id,city,C_city
0,article_3834.0,qormi,qormi
1,article_2101.0,mgarr,mgarr
2,article_353524.0,paola,paola
3,article_1352.0,siggiewi,siggiewi
4,article_490632.0,marsa,marsa
...,...,...,...
318,article_496202.0,zurrieq,zurrieq
319,release_1,zurrieq,zurrieq
320,article_496274.0,floriana,floriana
321,article_496206.0,paola,paola


Number of rows where 'C_city' is null (no match found): 0
Missing cities: set()


#### Add canonical 'street', `C_street` from 'dim_street.csv'

In [38]:
# add canonical 'street', 'C_street' from 'dim_street.csv'
dim_street_df = pd.read_csv(dim_street_csv)

print("Head of street DF:")
display(dim_street_df.head())

# drepare dim_street_df for merging
# select only 'street' and 'variant' columns
# drop duplicates based on 'variant' to ensure each variant maps to a single town if multiple entries exist
dim_street_lookup = dim_street_df[['street', 'variant']].drop_duplicates(subset=['variant']).copy()

# perform a left merge to add the canonical 'street' name as 'C_street'
# matching accidents_with_duplicates_df['street'] with dim_street_lookup['variant']
accidents_with_duplicates_df = accidents_with_duplicates_df.merge(
    dim_street_lookup,
    left_on='street',
    right_on='variant',
    how='left',
    suffixes=('', '_dim_street') # Add suffix to avoid collision with existing 'street' if it exists.
)

# rename the 'street' column from the merge to 'C_street'
accidents_with_duplicates_df.rename(columns={'street_dim_street': 'C_street'}, inplace=True)

# drop the 'variant' column brought in from the merge, as it's no longer needed
accidents_with_duplicates_df.drop(columns=['variant'], inplace=True)

# display the updated DataFrame to show the new 'C_street' column
print("DataFrame with new 'C_street' column (first 5 rows of relevant columns):")
display(accidents_with_duplicates_df[['id', 'street', 'C_street']])

missing_streets_df = accidents_with_duplicates_df[accidents_with_duplicates_df['C_street'].isna()]
missing_streets = set(missing_streets_df["street"].tolist())

print(f"Number of rows where 'C_street' is null (no match found): {accidents_with_duplicates_df['C_street'].isna().sum()}")
print(f"Missing streets: {missing_streets}")

Head of street DF:


Unnamed: 0,street,variant,is_canonical,town_name,street_latitude,street_longitude,street_type
0,aldo moro road,aldo moro road,True,marsa,35.877239,14.494806,primary
1,aldo moro road,aldo moro street,False,marsa,,,
2,aldo moro road,marsa's aldo moro road,False,marsa,,,
3,aldo moro road,the busy aldo moro road,False,marsa,,,
4,aldo moro road,triq aldo moro,False,marsa,,,


DataFrame with new 'C_street' column (first 5 rows of relevant columns):


Unnamed: 0,id,street,C_street
0,article_3834.0,triq il-mithna,triq il-mithna
1,article_2101.0,triq sir temi zammit,triq sir temi zammit
2,article_353524.0,triq wied bladun,triq wied bladun
3,article_1352.0,triq ghar lapsi,triq ghar lapsi
4,article_490632.0,aldo moro road,aldo moro road
...,...,...,...
318,article_496202.0,triq il-belt valletta,triq il-belt valletta
319,release_1,triq il-belt valletta,triq il-belt valletta
320,article_496274.0,triq sant anna,triq sant anna
321,article_496206.0,paola roundabout,paola roundabout


Number of rows where 'C_street' is null (no match found): 0
Missing streets: set()


#### Mark duplicates programmatically

Create a new flag inside the DataFrame with either 'retain', 'delete', 'suspect'.

If two (or more) rows have the identical `C_city`, `C_street`, we retain the row that has the highest `accident_severity` followed by the highest `total_injured`. If both are identical, we retain only the police press release row by marking the others as 'delete'. Else, we mark them as 'suspect' for manual auditing.

In [39]:
def resolve_duplicates_in_group(group):
    # if there's only one item, or if all are already 'retain', no need to process
    if len(group) == 1 or 'suspect' not in group['delete_flag'].values:
        # ensure duplicate_of_id column exists for groups with no duplicates
        if 'duplicate_of_id' not in group.columns:
            group['duplicate_of_id'] = None
        return group

    group_copy = group.copy()

    # assign numerical rank to severity for sorting
    group_copy['severity_rank'] = group_copy['accident_severity'].map(severity_order)

    # assign a priority for 'release' IDs (lower number means higher priority)
    group_copy['id_priority'] = group_copy['id'].apply(lambda x: 0 if 'release' in str(x).lower() else 1)

    # sort to prioritize retaining the "best" record based on criteria:
    # 1. worst accident_severity (highest severity_rank)
    # 2. highest total_injured
    # 3. prefer 'release' IDs (lower id_priority)
    # 4. for news articles with identical city/street, prefer latest date_published (string comparison, descending)
    # 5. as a final tie-breaker, use the 'id' itself to ensure deterministic selection (chronological order)
    group_copy = group_copy.sort_values(
        by=['severity_rank', 'total_injured', 'id_priority', 'date_published', 'id'],
        ascending=[False, False, True, False, True] # False for severity_rank, total_injured, and date_published (descending), True for id_priority and id (ascending)
    )

    # the first row after sorting is the one to be retained
    retained_id = group_copy.iloc[0]['id']

    # ensure duplicate_of_id column exists
    if 'duplicate_of_id' not in group_copy.columns:
        group_copy['duplicate_of_id'] = None

    # mark the first row after sorting as 'retain'
    group_copy.iloc[0, group_copy.columns.get_loc('delete_flag')] = 'retain'
    group_copy.iloc[0, group_copy.columns.get_loc('duplicate_of_id')] = None  # Retained record points to nothing

    # mark all other rows in this specific group as 'delete' and point to the retained ID
    if len(group_copy) > 1:
        group_copy.iloc[1:, group_copy.columns.get_loc('delete_flag')] = 'delete'
        group_copy.iloc[1:, group_copy.columns.get_loc('duplicate_of_id')] = retained_id

    return group_copy.drop(columns=['severity_rank', 'id_priority'])

In [40]:
# add the 'duplicate_of_id' column before processing
accidents_with_duplicates_df['duplicate_of_id'] = None

# apply this resolution logic only to groups that were previously marked as 'suspect' or have potential duplicates
# group by the specified columns: accident_datetime, city (canonical), street (canonical)
accidents_with_duplicates_df = accidents_with_duplicates_df.groupby(
    ['accident_datetime', 'C_city', 'C_street'],
    group_keys=False, # to avoid adding grouping keys to the result
    dropna=False  # add this to keep records with NaN in C_city or C_street
).apply(resolve_duplicates_in_group)

print("Updated `delete_flag` after applying deduplication logic:")
display(accidents_with_duplicates_df['delete_flag'].value_counts())

print(f"Duplicates pointing to retained records: {accidents_with_duplicates_df['duplicate_of_id'].notna().sum()}")

Updated `delete_flag` after applying deduplication logic:


  ).apply(resolve_duplicates_in_group)


delete_flag
retain     199
delete      90
suspect     34
Name: count, dtype: int64

Duplicates pointing to retained records: 90


In [41]:
duplicates_manual_csv = f"{deduplication_data_folder}/duplicates_manual.csv"
accidents_with_duplicates_df.to_csv(duplicates_manual_csv, index=False)

print(f"✓ Saved {len(accidents_with_duplicates_df)} records to {duplicates_manual_csv}")
print("Now, we will need to go through this and manually review those rows that are marked as 'suspect'")
display(accidents_with_duplicates_df)

✓ Saved 323 records to ../../data/staging/deduplication/duplicates_manual.csv
Now, we will need to go through this and manually review those rows that are marked as 'suspect'


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,D_driver_gender,D_number_injured,D_vehicle_type,accident_date_id,accident_severity,total_injured,delete_flag,C_city,C_street,duplicate_of_id
0,article_3834.0,Driver paid for damages he was never accused o...,A driver who was conditionally discharged over...,30/12/2024,2006-12-17 00:00:00,triq il-mithna,qormi,Grievious,Unknown,Unknown,...,,,,20061217,Grievious,1.0,retain,qormi,triq il-mithna,
1,article_2101.0,Driver jailed for four years for killing passe...,A 48-year-old man has been jailed for four yea...,01/04/2025,2017-08-20 15:25:00,triq sir temi zammit,mgarr,Unknown,26,M,...,,,,20170820,Fatal,3.0,retain,mgarr,triq sir temi zammit,
2,article_353524.0,Elderly man acquitted of traffic fatality as v...,A man has been acquitted of causing the death ...,23/07/2025,2018-03-13 00:00:00,triq wied bladun,paola,Fatal,83,M,...,,,,20180313,Fatal,1.0,retain,paola,triq wied bladun,
3,article_1352.0,Motorcyclist injured in pothole incident award...,A 39-year-old Turkish national has been awarde...,09/05/2025,2020-05-02 08:00:00,triq ghar lapsi,siggiewi,Serious,39,M,...,,,,20200502,Serious,1.0,retain,siggiewi,triq ghar lapsi,
4,article_490632.0,‘Why do we have to beg for justice?’,The brother of a food courier killed in a frea...,17/08/2025,2022-02-01 00:00:00,aldo moro road,marsa,Fatal,28,M,...,,,,20220201,Fatal,1.0,retain,marsa,aldo moro road,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,article_496206.0,Watch: Traffic accident caught live on air dur...,Malta’s traffic woes were brought home to TV v...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,,,,20251009,,0.0,retain,paola,paola roundabout,
318,article_496202.0,Motorcyclist seriously injured in Żurrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,,,,20251009,Serious,2.0,retain,zurrieq,triq il-belt valletta,
319,release_1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,not injured,67,F,...,,,,20251009,Serious,1.0,delete,zurrieq,triq il-belt valletta,article_496202.0
320,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,,,,20251009,not injured,0.0,retain,floriana,triq sant anna,


#### Perform manual deduplication

Look through CSV file containing duplicates and create a copy.

In the copy, go through all of the rows and make sure that they are correct. Also, resolve 'suspect' flags and either mark them as 'delete' or 'retain'.

In [42]:
duplicates_manual_audit_csv = f"{deduplication_data_folder}/duplicates_manual_audit.csv"
duplicates_manual_audit_df = pd.read_csv(duplicates_manual_audit_csv)

print("Duplicates Manual Audit DF:")
display(duplicates_manual_audit_df)

# let's print out what flags where changed after the manual audit
dedup_manual_audit_df = pd.merge(
    accidents_with_duplicates_df[["id", "delete_flag"]],
    duplicates_manual_audit_df[["id", "delete_flag"]],
    on="id",
    how="inner",
    suffixes=('', '_audit') # Add suffix to avoid collision with existing 'street' if it exists.
)

dedup_manual_audit_df = dedup_manual_audit_df[
    dedup_manual_audit_df["delete_flag"] != dedup_manual_audit_df["delete_flag_audit"]
]

print("Deduplication manual audit trail:")
display(dedup_manual_audit_df)

Duplicates Manual Audit DF:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,D_driver_gender,D_number_injured,D_vehicle_type,accident_date_id,accident_severity,total_injured,delete_flag,C_city,C_street,duplicate_of_id
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,10/10/2025 09:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,M,0.0,car,20251010,Serious,2,retain,ghajnsielem,triq l-imgarr,
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,triq sant anna,floriana,not injured,none,none,...,,,,20251009,not injured,0,retain,floriana,triq sant anna,
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,09/10/2025 09:30,triq il-belt valletta,zurrieq,Serious,61,M,...,,,,20251009,Serious,2,retain,zurrieq,triq il-belt valletta,
3,release_1,Collision between a car and a motorbike in Å»u...,"Today, at around 0930hrs, the Police were info...",09/10/2025,09/10/2025 09:30,triq il-belt valletta,zurrieq,not injured,67,F,...,,,,20251009,Serious,1,delete,zurrieq,triq il-belt valletta,article_496202.0
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,09/10/2025 00:00,paola roundabout,paola,slight,none,none,...,,,,20251009,,0,retain,paola,paola roundabout,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,article_490632.0,â€˜Why do we have to beg for justice?â€™,The brother of a food courier killed in a frea...,17/08/2025,01/02/2022 00:00,aldo moro road,marsa,Fatal,28,M,...,,,,20220201,Fatal,1,retain,marsa,aldo moro road,
319,article_1352.0,Motorcyclist injured in pothole incident award...,A 39-year-old Turkish national has been awarde...,09/05/2025,02/05/2020 08:00,triq ghar lapsi,siggiewi,Serious,39,M,...,,,,20200502,Serious,1,retain,siggiewi,triq ghar lapsi,
320,article_353524.0,Elderly man acquitted of traffic fatality as v...,A man has been acquitted of causing the death ...,23/07/2025,13/03/2018 00:00,triq wied bladun,paola,Fatal,83,M,...,,,,20180313,Fatal,1,retain,paola,triq wied bladun,
321,article_2101.0,Driver jailed for four years for killing passe...,A 48-year-old man has been jailed for four yea...,01/04/2025,20/08/2017 15:25,triq sir temi zammit,mgarr,Unknown,26,M,...,,,,20170820,Fatal,3,retain,mgarr,triq sir temi zammit,


Deduplication manual audit trail:


Unnamed: 0,id,delete_flag,delete_flag_audit
15,release_101,suspect,delete
36,release_68,suspect,retain
37,article_3964.0,suspect,delete
43,release_46,suspect,delete
61,release_800,suspect,retain
67,article_3456.0,suspect,delete
68,release_105,suspect,retain
84,article_3100.0,suspect,delete
85,release_100,suspect,retain
107,article_2666.0,suspect,retain


## 5. Feature Extraction

Extract features on the final DataFrame using rule based processing & dimensions.

### Variables

In [43]:
dim_date_csv = f"{dim_data_folder}/dim_date.csv"

deduplicated_df = duplicates_manual_audit_df[duplicates_manual_audit_df["delete_flag"] == 'retain']

print("Delete Flag Counts (should be 100% retain):")
display(deduplicated_df['delete_flag'].value_counts())

dim_date_df = pd.read_csv(dim_date_csv)

# drop certain columns that are no longer needed
deduplicated_df = deduplicated_df.drop(["delete_flag", "duplicate_of_id"], axis=1)

print("Deduplicated DF:")
display(deduplicated_df)

Delete Flag Counts (should be 100% retain):


delete_flag
retain    219
Name: count, dtype: int64

Deduplicated DF:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,D_accident_severity,D_driver_age,D_driver_gender,D_number_injured,D_vehicle_type,accident_date_id,accident_severity,total_injured,C_city,C_street
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,10/10/2025 09:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,not injured,55.0,M,0.0,car,20251010,Serious,2,ghajnsielem,triq l-imgarr
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,09/10/2025 13:00,triq sant anna,floriana,not injured,none,none,...,,,,,,20251009,not injured,0,floriana,triq sant anna
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,09/10/2025 09:30,triq il-belt valletta,zurrieq,Serious,61,M,...,,,,,,20251009,Serious,2,zurrieq,triq il-belt valletta
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,09/10/2025 00:00,paola roundabout,paola,slight,none,none,...,,,,,,20251009,,0,paola,paola roundabout
5,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,06/10/2025 09:30,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,,,,,,20251006,Grievious,1,naxxar,triq il-kappella tax-xaghra
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,article_490632.0,â€˜Why do we have to beg for justice?â€™,The brother of a food courier killed in a frea...,17/08/2025,01/02/2022 00:00,aldo moro road,marsa,Fatal,28,M,...,,,,,,20220201,Fatal,1,marsa,aldo moro road
319,article_1352.0,Motorcyclist injured in pothole incident award...,A 39-year-old Turkish national has been awarde...,09/05/2025,02/05/2020 08:00,triq ghar lapsi,siggiewi,Serious,39,M,...,,,,,,20200502,Serious,1,siggiewi,triq ghar lapsi
320,article_353524.0,Elderly man acquitted of traffic fatality as v...,A man has been acquitted of causing the death ...,23/07/2025,13/03/2018 00:00,triq wied bladun,paola,Fatal,83,M,...,,,,,,20180313,Fatal,1,paola,triq wied bladun
321,article_2101.0,Driver jailed for four years for killing passe...,A 48-year-old man has been jailed for four yea...,01/04/2025,20/08/2017 15:25,triq sir temi zammit,mgarr,Unknown,26,M,...,,,,,,20170820,Fatal,3,mgarr,triq sir temi zammit


### Rule based features

Extract features based on rules applied to other columns in the DataFrame.

#### Accident Time Category

Column based on the the `accident_datetime` category, specifically the time part of the date.

- early_morning: - 06:01 to 08:00
- morning        - 08:01 to 12:00
- afternoon:     - 12:01 to 18:00
- evening:       - 18:01 to 21:00
- late_evening:  - 21:01 to 23:00
- night:         - 23:01 to 06:00

In [44]:
def categorize_time(row):
    hour = row['accident_datetime'].hour
    minute = row['accident_datetime'].minute

    # Convert time to minutes for easier comparison
    time_in_minutes = hour * 60 + minute

    if 361 <= time_in_minutes <= 480: # 06:01 (361) to 08:00 (480)
        return 'early_morning'
    elif 481 <= time_in_minutes <= 720: # 08:01 (481) to 12:00 (720)
        return 'morning'
    elif 721 <= time_in_minutes <= 1080: # 12:01 (721) to 18:00 (1080)
        return 'afternoon'
    elif 1081 <= time_in_minutes <= 1260: # 18:01 (1081) to 21:00 (1260)
        return 'evening'
    elif 1261 <= time_in_minutes <= 1380: # 21:01 (1261) to 23:00 (1380)
        return 'late_evening'
    else: # 23:01 (1381) to 06:00 (360)
        return 'night'

deduplicated_df['accident_datetime'] = pd.to_datetime(deduplicated_df['accident_datetime'], format='mixed', dayfirst=True, errors='coerce')
deduplicated_df["accident_time_category"] = deduplicated_df.apply(categorize_time, axis=1)

print("New 'accident_time_category' column added:")
display(deduplicated_df[['accident_datetime', 'accident_time_category']].head())

print("Value counts for `accident_time_category`:")
print(deduplicated_df['accident_time_category'].value_counts())

# add one-hot encoding for `accident_time_category`
accident_time_categories_df = pd.get_dummies(deduplicated_df["accident_time_category"], prefix="accident_time")
deduplicated_df = deduplicated_df.join(accident_time_categories_df)

print("Display DF after adding accident_time_category one-hot encoding:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

New 'accident_time_category' column added:


Unnamed: 0,accident_datetime,accident_time_category
0,2025-10-10 09:00:00,morning
1,2025-10-09 13:00:00,afternoon
2,2025-10-09 09:30:00,morning
4,2025-10-09 00:00:00,night
5,2025-10-06 09:30:00,morning


Value counts for `accident_time_category`:
accident_time_category
afternoon        68
morning          45
night            43
early_morning    26
evening          23
late_evening     14
Name: count, dtype: int64
Display DF after adding accident_time_category one-hot encoding:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,total_injured,C_city,C_street,accident_time_category,accident_time_afternoon,accident_time_early_morning,accident_time_evening,accident_time_late_evening,accident_time_morning,accident_time_night
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,2,ghajnsielem,triq l-imgarr,morning,False,False,False,False,True,False
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,0,floriana,triq sant anna,afternoon,True,False,False,False,False,False
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,2,zurrieq,triq il-belt valletta,morning,False,False,False,False,True,False
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,0,paola,paola roundabout,night,False,False,False,False,False,True
5,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,1,naxxar,triq il-kappella tax-xaghra,morning,False,False,False,False,True,False


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night']


#### Driver age features using rules

From driver ages, extract the following columns.

Boolean features:
- driver_under_18
- driver_18_to_24
- driver_25_to_49
- driver_50_to_64
- driver_65_plus

Numerical features:
- num_drivers_under_18
- num_drivers_18_to_24
- num_drivers_25_to_49
- num_drivers_50_to_64
- num_drivers_65_plus

In [45]:
# age feature one-hot encoding
driver_age_cols_to_check = [f"{element}_driver_age" for element in ["A", "B", "C", "D"]]
age_df = deduplicated_df[driver_age_cols_to_check].apply(pd.to_numeric, errors="coerce").astype("Int64")

deduplicated_df["driver_under_18"] = age_df.apply(lambda c: c < 18).any(axis=1).astype(bool)
deduplicated_df["driver_18_to_24"] = age_df.apply(lambda c: c.between(18, 24)).any(axis=1).astype(bool)
deduplicated_df["driver_25_to_49"] = age_df.apply(lambda c: c.between(25, 49)).any(axis=1).astype(bool)
deduplicated_df["driver_50_to_64"] = age_df.apply(lambda c: c.between(50, 64)).any(axis=1).astype(bool)
deduplicated_df["driver_65_plus"] = age_df.apply(lambda c: c >= 65).any(axis=1).astype(bool)

deduplicated_df["num_drivers_under_18"] = age_df.apply(lambda c: c < 18).sum(axis=1).astype(int)
deduplicated_df["num_drivers_18_to_24"] = age_df.apply(lambda c: c.between(18, 24)).sum(axis=1).astype(int)
deduplicated_df["num_drivers_25_to_49"] = age_df.apply(lambda c: c.between(25, 49)).sum(axis=1).astype(int)
deduplicated_df["num_drivers_50_to_64"] = age_df.apply(lambda c: c.between(50, 64)).sum(axis=1).astype(int)
deduplicated_df["num_drivers_65_plus"] = age_df.apply(lambda c: c >= 65).sum(axis=1).astype(int)

print("Display DF after adding age based features:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

Display DF after adding age based features:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,driver_under_18,driver_18_to_24,driver_25_to_49,driver_50_to_64,driver_65_plus,num_drivers_under_18,num_drivers_18_to_24,num_drivers_25_to_49,num_drivers_50_to_64,num_drivers_65_plus
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,False,False,True,True,True,0,0,2,1,1
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,False,False,False,False,False,0,0,0,0,0
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,False,False,False,True,True,0,0,0,1,1
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,False,False,False,False,False,0,0,0,0,0
5,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,False,False,False,False,True,0,0,0,0,2


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus']


#### Driver gender features using rules

From drivers genders, extract the following columns:

Boolean features:
- driver_male
- driver_female
- driver_unknown

Numerical features:
- num_drivers_male
- num_drivers_female
- num_drivers_unknown

In [46]:
# driver gender one-hot encoding
driver_gender_cols_to_check = [f"{element}_driver_gender" for element in  ["A", "B", "C", "D"]]
gender_df = deduplicated_df[driver_gender_cols_to_check].apply(lambda x: x.str.lower())

deduplicated_df["driver_male"] = gender_df.eq("m").any(axis=1).astype(bool)
deduplicated_df["driver_female"] = gender_df.eq("f").any(axis=1).astype(bool)
deduplicated_df["driver_unknown"] = gender_df.isin(["unknown", "unk", "u", "none"]).any(axis=1).astype(bool)

deduplicated_df["num_drivers_male"] = gender_df.eq("m").sum(axis=1).astype(int)
deduplicated_df["num_drivers_female"] = gender_df.eq("f").sum(axis=1).astype(int)
deduplicated_df["num_drivers_unknown"] = gender_df.isin(["unknown", "unk", "u", "none"]).sum(axis=1).astype(int)

print("Display DF after adding gender based features:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

Display DF after adding gender based features:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,num_drivers_18_to_24,num_drivers_25_to_49,num_drivers_50_to_64,num_drivers_65_plus,driver_male,driver_female,driver_unknown,num_drivers_male,num_drivers_female,num_drivers_unknown
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,0,2,1,1,True,False,False,4,0,0
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,0,0,0,0,False,False,True,0,0,1
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,0,0,1,1,True,True,False,1,1,0
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,0,0,0,0,False,False,True,0,0,2
5,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,0,0,0,2,True,True,False,1,1,0


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus', 'driver_male', 'driver_female', '

#### Vehicle Type features using rules

From vehicles, extract the following columns.

Boolean features:
- vehicle_unknown
- vehicle_pedestrian
- vehicle_bicycle
- vehicle_motorbike
- vehicle_car
- vehicle_van
- vehicle_bus

Numerical features:
- num_vehicles_unknown
- num_vehicle_pedestrian
- num_vehicle_bicycle
- num_vehicle_motorbike
- num_vehicle_car
- num_vehicle_van
- num_vehicle_bus

In [47]:
# vehicle type one-hot encoding
vehicle_type_cols_to_check = [f"{element}_vehicle_type" for element in  ["A", "B", "C", "D"]]
vehicle_df = deduplicated_df[vehicle_type_cols_to_check].apply(lambda x: x.str.lower())

deduplicated_df["vehicle_unknown"] = vehicle_df.eq("unknown").any(axis=1).astype(bool)
deduplicated_df["vehicle_pedestrian"] = vehicle_df.eq("pedestrian").any(axis=1).astype(bool)
deduplicated_df["vehicle_bicycle"] = vehicle_df.eq("bicycle").any(axis=1).astype(bool)
deduplicated_df["vehicle_motorbike"] = vehicle_df.isin(["motorbike", "motorcycle"]).any(axis=1).astype(bool)
deduplicated_df["vehicle_car"] = vehicle_df.eq("car").any(axis=1).astype(bool)
deduplicated_df["vehicle_van"] = vehicle_df.eq("van").any(axis=1).astype(bool)
deduplicated_df["vehicle_bus"] = vehicle_df.eq("bus").any(axis=1).astype(bool)

deduplicated_df["num_vehicle_unknown"] = vehicle_df.eq("unknown").sum(axis=1).astype(int)
deduplicated_df["num_vehicle_pedestrian"] = vehicle_df.eq("pedestrian").sum(axis=1).astype(int)
deduplicated_df["num_vehicle_bicycle"] = vehicle_df.eq("bicycle").sum(axis=1).astype(int)
deduplicated_df["num_vehicle_motorbike"] = vehicle_df.isin(["motorbike", "motorcycle"]).sum(axis=1).astype(int)
deduplicated_df["num_vehicle_car"] = vehicle_df.eq("car").sum(axis=1).astype(int)
deduplicated_df["num_vehicle_van"] = vehicle_df.eq("van").sum(axis=1).astype(int)
deduplicated_df["num_vehicle_bus"] = vehicle_df.eq("bus").sum(axis=1).astype(int)

print("Display DF after adding vehicle type based features:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

Display DF after adding vehicle type based features:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,vehicle_car,vehicle_van,vehicle_bus,num_vehicle_unknown,num_vehicle_pedestrian,num_vehicle_bicycle,num_vehicle_motorbike,num_vehicle_car,num_vehicle_van,num_vehicle_bus
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,True,False,False,0,0,0,1,3,0,0
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,True,False,False,0,0,0,0,1,0,0
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,True,False,False,0,0,0,1,1,0,0
4,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,True,False,False,0,0,0,0,2,0,0
5,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,True,False,False,0,1,0,0,1,0,0


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus', 'driver_male', 'driver_female', '

### Dimension based features

Adding additional information based on the date and street dimensions.

#### Date Dimension

Adding the following columns from the date dimension:

- is_weekend
- is_public_holiday_mt
- is_school_holiday_mt
- is_school_day_mt

In [48]:
# first we check that there are no missing dates
missing_in_date_dim = deduplicated_df[~deduplicated_df["accident_date_id"].isin(dim_date_df["date_key"])]
missing_date_ids = set(missing_in_date_dim["accident_date_id"].tolist())
print("Missing Date IDs:", missing_date_ids)
assert len(missing_date_ids) == 0, "Date entries in date dimension still missing. Let's add these before proceeding"

# row count before join
before_count = len(deduplicated_df)

# perform inner join
deduplicated_df = deduplicated_df.merge(
    dim_date_df[
        [
            "date_key",
            "is_weekend",
            "is_public_holiday_mt",
            "is_school_holiday_mt",
            "is_school_day_mt",
        ]
    ],
    how="inner",
    left_on="accident_date_id",
    right_on="date_key",
)

# row count after join
after_count = len(deduplicated_df)

# assert counts are equal
assert before_count == after_count, (f"Row count mismatch after join: before={before_count}, after={after_count}")

# drop the join key from the right table if not needed
deduplicated_df = deduplicated_df.drop(columns=["date_key"])

print("Display DF after adding features from date dimension:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

Missing Date IDs: set()
Display DF after adding features from date dimension:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,num_vehicle_pedestrian,num_vehicle_bicycle,num_vehicle_motorbike,num_vehicle_car,num_vehicle_van,num_vehicle_bus,is_weekend,is_public_holiday_mt,is_school_holiday_mt,is_school_day_mt
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,0,0,1,3,0,0,False,False,False,True
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,0,0,0,1,0,0,False,False,False,True
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,0,0,1,1,0,0,False,False,False,True
3,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,0,0,0,2,0,0,False,False,False,True
4,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,1,0,0,1,0,0,False,False,False,True


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus', 'driver_male', 'driver_female', '

#### Street Dimension

Adding the following columns from the street dimension:

- street_type (and perform one-hot encoding)

In [49]:
# get canonical streets only
canonical_dim_street_df = dim_street_df[dim_street_df["is_canonical"]].drop_duplicates(subset=['street']).copy()

display(canonical_dim_street_df)
print("Value count of all street types:", canonical_dim_street_df["street_type"].value_counts(dropna=False))

# to-do: add `street_type` for all canonical streets

# row count before join
before_count = len(deduplicated_df)

# perform inner join
deduplicated_df = deduplicated_df.merge(
    canonical_dim_street_df[
        [
            "street",
            "street_type",
        ]
    ],
    how="left",
    left_on="C_street",
    right_on="street",
    suffixes=("", "_street_dim")
)

# row count after join
after_count = len(deduplicated_df)

# assert counts are equal
assert before_count == after_count, (f"Row count mismatch after join: before={before_count}, after={after_count}")

# drop the join key from the right table if not needed
deduplicated_df = deduplicated_df.drop(columns=["street_street_dim"])

print("Display DF after adding features from date dimension:")
display(deduplicated_df.head())

print("DataFrame columns:", deduplicated_df.columns.tolist())

Unnamed: 0,street,variant,is_canonical,town_name,street_latitude,street_longitude,street_type
0,aldo moro road,aldo moro road,True,marsa,35.877239,14.494806,primary
5,attard road,attard road,True,zebbug,35.882870,14.440159,primary
6,balluta bay,balluta bay,True,st julian,,,
7,bella vista street,bella vista street,True,san gwann,35.907253,14.472029,secondary
8,birkirkara bypass,birkirkara bypass,True,birkirkara,35.901567,14.470050,trunk
...,...,...,...,...,...,...,...
271,vjal l-istadium nazzjonali,vjal l-istadium nazzjonali,True,attard,35.890521,14.422300,secondary
272,wignacourt aqueduct,wignacourt aqueduct,True,birkirkara,,,
273,xatt l-ghassara tal-gheneb,xatt l-ghassara tal-gheneb,True,marsa,35.886342,14.501907,tertiary
275,xemxija bypass,xemxija bypass,True,san pawl il bahar,,,


Value count of all street types: street_type
NaN            70
secondary      24
primary        22
residential    19
trunk          18
tertiary        8
pedestrian      3
bus_stop        3
Name: count, dtype: int64
Display DF after adding features from date dimension:


Unnamed: 0,id,title,content,date_published,accident_datetime,street,city,A_accident_severity,A_driver_age,A_driver_gender,...,num_vehicle_bicycle,num_vehicle_motorbike,num_vehicle_car,num_vehicle_van,num_vehicle_bus,is_weekend,is_public_holiday_mt,is_school_holiday_mt,is_school_day_mt,street_type
0,article_496362.0,GÄ§ajnsielem mayor urges traffic safety measur...,The Mayor of GÄ§ajnsielem on Friday called on ...,10/10/2025,2025-10-10 09:00:00,triq l-imgarr,ghajnsielem,Serious,35,M,...,0,1,3,0,0,False,False,False,True,
1,article_496274.0,"Car catches fire on Floriana main road, causin...",A man was driving on the Floriana main road wh...,09/10/2025,2025-10-09 13:00:00,triq sant anna,floriana,not injured,none,none,...,0,0,1,0,0,False,False,False,True,trunk
2,article_496202.0,Motorcyclist seriously injured in Å»urrieq crash,A motorcyclist was left with serious injuries ...,09/10/2025,2025-10-09 09:30:00,triq il-belt valletta,zurrieq,Serious,61,M,...,0,1,1,0,0,False,False,False,True,primary
3,article_496206.0,Watch: Traffic accident caught live on air dur...,Maltaâ€™s traffic woes were brought home to TV...,09/10/2025,2025-10-09 00:00:00,paola roundabout,paola,slight,none,none,...,0,0,2,0,0,False,False,False,True,
4,release_52,Woman grievously injured in a traffic accident,"An 84-year-old woman, a resident of Naxxar, wa...",06/10/2025,2025-10-06 09:30:00,triq il-kappella ta xaghra,naxxar,not injured,84,M,...,0,0,1,0,0,False,False,False,True,


DataFrame columns: ['id', 'title', 'content', 'date_published', 'accident_datetime', 'street', 'city', 'A_accident_severity', 'A_driver_age', 'A_driver_gender', 'A_number_injured', 'A_vehicle_type', 'B_accident_severity', 'B_driver_age', 'B_driver_gender', 'B_number_injured', 'B_vehicle_type', 'C_accident_severity', 'C_driver_age', 'C_driver_gender', 'C_number_injured', 'C_vehicle_type', 'D_accident_severity', 'D_driver_age', 'D_driver_gender', 'D_number_injured', 'D_vehicle_type', 'accident_date_id', 'accident_severity', 'total_injured', 'C_city', 'C_street', 'accident_time_category', 'accident_time_afternoon', 'accident_time_early_morning', 'accident_time_evening', 'accident_time_late_evening', 'accident_time_morning', 'accident_time_night', 'driver_under_18', 'driver_18_to_24', 'driver_25_to_49', 'driver_50_to_64', 'driver_65_plus', 'num_drivers_under_18', 'num_drivers_18_to_24', 'num_drivers_25_to_49', 'num_drivers_50_to_64', 'num_drivers_65_plus', 'driver_male', 'driver_female', '

In [50]:
# to-do: 1. one-hot encode street types (article for one-hot encoding example: https://www.kdnuggets.com/2023/07/pandas-onehot-encode-data.html)
# to-do: 2. add regions and one-hot encode regions
# to-do: 3. add total no. of vehicles/drivers involved per accident
# to-do: 4. add weather data (is_raining)
# to-do: 5. add google routes api (high, moderate, low, unknown)
# DONE: 6. add one-hot encoding for accident datetime category
# to-do: 6. save final DataFrame

In [51]:
# Random Forest -> Isaac
# SVMs -> Michael V
# KNNs/Naive Bayes -> Michael M
# Logistic/Linear Regression -> Paul