# Notebook Purpose

This notebook combines all extracted features into a single CSV file per data source, so that the data extracted programmatically (via regex or LLMs) can be validated manually.

## Download Packages Required

- Install `pandas` package.

In [1]:
!pip install pandas



## Imports

In [2]:
import ast
import pandas as pd
from IPython.display import display

## Attributes

In [3]:
data_folder_path = "../../../data"
processed_data_folder_path = f"{data_folder_path}/processed"

local_news_articles_csv = f"{data_folder_path}/local_news_articles.csv"
police_press_releases_csv = f"{data_folder_path}/police_press_releases.csv"
og_prefix = "og_"

regex_dtime_news_articles_csv = f"{processed_data_folder_path}/road_accidents_with_datetime.csv"
regex_dtime_press_releases_csv = f"{processed_data_folder_path}/police_releases_with_datetime.csv"
regex_dtime_prefix = "regxdt_"

llm_news_articles_csv = f"{processed_data_folder_path}/llm_local_news_articles.csv"
llm_press_releases_csv = f"{processed_data_folder_path}/llm_press_releases.csv"
llm_prefix = "llm_"

# to-do: wait for Paul to extract town/street of both CSVs to include in feature extraction

# csv save file paths of combined dataframes
combined_news_articles_csv = f"{processed_data_folder_path}/combined_news_articles.csv"
combined_press_releases_csv = f"{processed_data_folder_path}/combined_press_releases.csv"

## Methods

In [4]:
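def parse_llm_drivers(x):
    """Parse the stringified list of driver dicts; return [] for missing or blank values."""
    if pd.isna(x) or x.strip() == "":
        return []
    # literal_eval safely turns the Python-literal string into a list of dicts
    return ast.literal_eval(x)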
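
As a quick sanity check, the helper can be tried on a few hand-written inputs. The sample strings below are made up for illustration (they only mimic the shape of the `drivers` column), and the function is repeated so the snippet runs on its own.

```python
import ast
import pandas as pd

def parse_llm_drivers(x):
    # Same logic as the helper above, repeated so this snippet is self-contained
    if pd.isna(x) or x.strip() == "":
        return []
    return ast.literal_eval(x)

# Hypothetical sample values: a stringified list, a missing value, a blank string
sample = pd.Series([
    "[{'vehicle_type': 'Car', 'driver_age': 67}]",
    float("nan"),
    "   ",
])

parsed = sample.apply(parse_llm_drivers)
print(parsed.tolist())
```

Missing and blank entries both come back as empty lists, so every row can be passed to `explode` later without special-casing.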

## Combine all DataFrames into one

Several attempts have been made to extract features.

- Datetime feature extraction using regex.
- Town/Street feature extraction using regex.
- General feature extraction using LLM.

### Local News Articles

Combination of all dataframes for the local news articles.

#### Original CSV File

We select only the columns of importance and add the `og_` prefix to the column names.

This way, when we join the DataFrames together, we will know which DataFrame each column comes from.
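
The prefixing idea can also be sketched as a small helper that renames every column except the join key. This is only an illustration on a toy DataFrame: `prefix_columns` and the toy values are hypothetical and not part of the notebook, which spells the renames out explicitly.

```python
import pandas as pd

def prefix_columns(df, prefix, key):
    """Return a copy of df with `prefix` added to every column except `key`."""
    return df.rename(columns={c: f"{prefix}{c}" for c in df.columns if c != key})

# Toy data standing in for the original articles CSV
toy = pd.DataFrame({
    "article_id": [1, 2],
    "title": ["a", "b"],
    "url": ["u1", "u2"],
})

prefixed = prefix_columns(toy, "og_", key="article_id")
print(list(prefixed.columns))  # ['article_id', 'og_title', 'og_url']
```

Leaving the key unprefixed keeps it usable as the common `on=` column in later merges.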

In [5]:
articles_df = pd.read_csv(local_news_articles_csv)

articles_df = (
    articles_df[[
        "article_id",
        "url",
        "source_name",
        "source_url",
        "title",
        "subtitle",
        # "author_name", -> not interested in the name of the author
        "publish_date",
        "content",
        "top_image_url",
        "top_image_caption",
        "created_at",
        "tags",
        # "categories" -> always empty set, not interested in this column
    ]]
    .rename(columns={
        "article_id": "article_id",
        "url": f"{og_prefix}url",
        "source_name": f"{og_prefix}source_name",
        "source_url": f"{og_prefix}source_url",
        "title": f"{og_prefix}title",
        "subtitle": f"{og_prefix}subtitle",
        "publish_date": f"{og_prefix}publish_date",
        "content": f"{og_prefix}content",
        "top_image_url": f"{og_prefix}top_image_url",
        "top_image_caption": f"{og_prefix}top_image_caption",
        "created_at": f"{og_prefix}created_at",
        "tags": f"{og_prefix}tags",
    })
)

display(articles_df)

Unnamed: 0,article_id,og_url,og_source_name,og_source_url,og_title,og_subtitle,og_publish_date,og_content,og_top_image_url,og_top_image_caption,og_created_at,og_tags
0,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,The broken car mirror. Photo: Frank Xerri De Caro,2025-07-03 15:14:21.554132+00,"{Accident,Lesa,National}"
1,4167,https://timesofmalta.com/article/pn-slams-gove...,Times of Malta,https://timesofmalta.com,PN slams government for diverting EU bus funds...,"'By encouraging the use of private cars, the g...",2024-12-09,The PN on Monday slammed the government for di...,https://cdn-attachments.timesofmalta.com/d9afe...,"PN spokespeople Ryan Callus, Mark Anthony Samm...",2025-07-03 15:14:10.643172+00,"{""Climate Change"",Environment,""European Union""..."
2,4093,https://timesofmalta.com/article/motorcyclist-...,Times of Malta,https://timesofmalta.com,Motorcyclist seriously hurt in St Paul's Bay b...,Residents complained several times about inade...,2024-12-11,A motorcyclist was rushed to hospital in a cri...,https://cdn-attachments.timesofmalta.com/633f6...,Photo: Malta Police Force,2025-07-03 15:13:50.605708+00,"{Accident,National,""St Paul’S Bay"",Traffic}"
3,4110,https://timesofmalta.com/article/skip-involved...,Times of Malta,https://timesofmalta.com,Skip involved in horror St Paul’s Bay bypass c...,Motorcyclist hurt in crash on Wednesday evenin...,2024-12-12,A private contractor who placed a skip on St P...,https://cdn-attachments.timesofmalta.com/fc23e...,A 54-year-old man was seriously injured when h...,2025-07-03 15:13:54.812813+00,"{Accident,National,""St Paul’S Bay""}"
4,4066,https://timesofmalta.com/article/two-people-in...,Times of Malta,https://timesofmalta.com,"Two people, including teenage girl, critically...",Incidents in Mellieħa and Gudja on Friday even...,2024-12-14,A 29-year-old man and 17-year-old girl were cr...,https://cdn-attachments.timesofmalta.com/f1761...,The Ford Fiesta involved in the Gudja collisio...,2025-07-03 15:13:43.83839+00,"{Accident,Gudja,Mellieħa,National,Traffic}"
...,...,...,...,...,...,...,...,...,...,...,...,...
316,496574,https://timesofmalta.com/article/watch-msida-f...,Times of Malta,https://timesofmalta.com,Watch: Msida flyover to open by end of year as...,'The flyover will allow us to create a new ope...,2025-10-12,The Msida flyover will open by the end of the ...,https://cdn-attachments.timesofmalta.com/f47de...,The sixth and final steel piece measures 18 me...,2025-10-12 16:57:00.930159+00,"{Msida,National,Traffic}"
317,496586,https://timesofmalta.com/article/today-front-p...,Times of Malta,https://timesofmalta.com,Today's front pages,The top stories in Malta's newspapers,2025-10-13,The following are the top stories in Malta's n...,https://cdn-attachments.timesofmalta.com/28065...,File photo: Times of Malta,2025-10-13 08:04:55.910209+00,"{Media,National,""Social Media"",Traffic}"
318,496577,https://timesofmalta.com/article/traffic-overt...,Times of Malta,https://timesofmalta.com,Traffic overtakes cost of living to become peo...,Poll data suggests frustration on Maltese road...,2025-10-13,"Traffic, parking and public transport-related ...",https://cdn-attachments.timesofmalta.com/861ca...,Cars in the Santa Venera tunnel. Photo: Chris ...,2025-10-13 05:03:22.837075+00,"{National,Politics,Traffic}"
319,496733,https://timesofmalta.com/article/employer-clea...,Times of Malta,https://timesofmalta.com,Employer cleared of responsibility for young w...,"Court raps police, OHSA for not working togeth...",2025-10-14,A court has sharply criticised the police and ...,https://cdn-attachments.timesofmalta.com/2d4fb...,The site of the 2015 tragic accident.,2025-10-14 14:01:03.493399+00,"{Accident,Construction,Court,National}"


#### Datetime extraction with regex

We select only the columns of importance and add `regxdt_` prefix to the column names.

In [6]:
regex_dtime_articles_df = pd.read_csv(regex_dtime_news_articles_csv)

regex_dtime_articles_df = (
    regex_dtime_articles_df[[
        "article_id",
        "accident_datetime",
    ]]
    .rename(columns={
        "article_id": "article_id",
        "accident_datetime": f"{regex_dtime_prefix}accident_datetime",
    })
)

display(regex_dtime_articles_df)

Unnamed: 0,article_id,regxdt_accident_datetime
0,4208,2024-12-04 00:00:00
1,4167,2024-12-09 00:00:00
2,4093,2024-12-11 17:00:00
3,4110,2024-12-11 13:00:00
4,4066,2024-12-13 17:30:00
...,...,...
316,496574,2025-10-12 00:00:00
317,496586,2025-10-13 00:00:00
318,496577,2025-10-13 00:00:00
319,496733,2025-10-14 00:00:00


#### Feature Extraction with LLM

We select only the columns of importance and add `llm_` prefix to the column names.

In [7]:
llm_articles_df = pd.read_csv(llm_news_articles_csv)

llm_articles_df = (
    llm_articles_df[[
        "id_column", # article_id
        "is_accident",
        "street",
        "city",
        "number_injured",
        "accident_severity",
        "drivers",
    ]]
    .rename(columns={
        "id_column": "article_id",
        "is_accident": f"{llm_prefix}is_accident",
        "street": f"{llm_prefix}street",
        "city": f"{llm_prefix}city",
        "number_injured": f"{llm_prefix}number_injured",
        "accident_severity": f"{llm_prefix}accident_severity",
        "drivers": f"{llm_prefix}drivers",
    })
)

display(llm_articles_df)

Unnamed: 0,article_id,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm_drivers
0,4208,True,Regional Road,St Julian's,0,No Injuries,"[{'vehicle_type': 'Toyota Yaris', 'vehicle_dam..."
1,4167,False,,,,,
2,4093,True,St Paul's Bay bypass,St Paul's Bay,1,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag..."
3,4110,True,St Paul’s Bay bypass,St Paul's Bay,1,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag..."
4,4066,True,,,,,
...,...,...,...,...,...,...,...
313,496574,False,,,,,
314,496586,False,,,,,
315,496577,False,,,,,
316,496733,False,,,,,


#### Combined News Articles

Combine news articles DataFrames together.
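
Because the joins are left joins on `article_id`, articles with no matching regex or LLM row simply get empty values. The displayed tables above suggest the LLM file has slightly fewer rows than the original one, so a coverage check may be useful. The sketch below uses toy data and the `validate` and `indicator` options of `DataFrame.merge`; the frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the original and LLM DataFrames
left = pd.DataFrame({"article_id": [1, 2, 3], "og_title": ["a", "b", "c"]})
right = pd.DataFrame({"article_id": [1, 3], "llm_is_accident": [True, False]})

checked = left.merge(
    right,
    on="article_id",
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'left_only'
)

# Keys present in the original data but missing from the LLM output
unmatched = checked.loc[checked["_merge"] == "left_only", "article_id"].tolist()
print(unmatched)  # [2]
```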

In [8]:
joined_articles_df = (
    articles_df
    .merge(regex_dtime_articles_df, on="article_id", how="left")
    .merge(llm_articles_df, on="article_id", how="left")
)

joined_articles_df["llm_drivers"] = joined_articles_df["llm_drivers"].apply(parse_llm_drivers)
exploded_articles_df = joined_articles_df.explode("llm_drivers", ignore_index=True)

combined_articles_df = pd.concat(
    [
        exploded_articles_df.drop(columns=["llm_drivers"]),
        pd.json_normalize(exploded_articles_df["llm_drivers"])
    ],
    axis=1
).rename(columns={
    "vehicle_type": f"{llm_prefix}vehicle_type",
    "vehicle_damage_severity": f"{llm_prefix}vehicle_damage_severity",
    "driver_age": f"{llm_prefix}driver_age",
    "driver_gender": f"{llm_prefix}driver_gender",
    "is_victim": f"{llm_prefix}is_victim",
})

display(combined_articles_df)

combined_articles_df.to_csv(combined_news_articles_csv)

Unnamed: 0,article_id,og_url,og_source_name,og_source_url,og_title,og_subtitle,og_publish_date,og_content,og_top_image_url,og_top_image_caption,...,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm_vehicle_type,llm_vehicle_damage_severity,llm_driver_age,llm_driver_gender,llm_is_victim
0,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,The broken car mirror. Photo: Frank Xerri De Caro,...,True,Regional Road,St Julian's,0,No Injuries,Toyota Yaris,Minor,78,M,True
1,4208,https://timesofmalta.com/article/driver-stuck-...,Times of Malta,https://timesofmalta.com,Driver stuck in traffic says speeding LESA car...,‘I was shocked at that moment but more so frus...,2024-12-07,A motorist claims his car mirror was shattered...,https://cdn-attachments.timesofmalta.com/706da...,The broken car mirror. Photo: Frank Xerri De Caro,...,True,Regional Road,St Julian's,0,No Injuries,LESA vehicle,No damage,none,none,False
2,4167,https://timesofmalta.com/article/pn-slams-gove...,Times of Malta,https://timesofmalta.com,PN slams government for diverting EU bus funds...,"'By encouraging the use of private cars, the g...",2024-12-09,The PN on Monday slammed the government for di...,https://cdn-attachments.timesofmalta.com/d9afe...,"PN spokespeople Ryan Callus, Mark Anthony Samm...",...,False,,,,,,,,,
3,4093,https://timesofmalta.com/article/motorcyclist-...,Times of Malta,https://timesofmalta.com,Motorcyclist seriously hurt in St Paul's Bay b...,Residents complained several times about inade...,2024-12-11,A motorcyclist was rushed to hospital in a cri...,https://cdn-attachments.timesofmalta.com/633f6...,Photo: Malta Police Force,...,True,St Paul's Bay bypass,St Paul's Bay,1,Serious,Motorcycle,Major,54,M,True
4,4110,https://timesofmalta.com/article/skip-involved...,Times of Malta,https://timesofmalta.com,Skip involved in horror St Paul’s Bay bypass c...,Motorcyclist hurt in crash on Wednesday evenin...,2024-12-12,A private contractor who placed a skip on St P...,https://cdn-attachments.timesofmalta.com/fc23e...,A 54-year-old man was seriously injured when h...,...,True,St Paul’s Bay bypass,St Paul's Bay,1,Serious,Motorcycle,none,54,M,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
448,496574,https://timesofmalta.com/article/watch-msida-f...,Times of Malta,https://timesofmalta.com,Watch: Msida flyover to open by end of year as...,'The flyover will allow us to create a new ope...,2025-10-12,The Msida flyover will open by the end of the ...,https://cdn-attachments.timesofmalta.com/f47de...,The sixth and final steel piece measures 18 me...,...,False,,,,,,,,,
449,496586,https://timesofmalta.com/article/today-front-p...,Times of Malta,https://timesofmalta.com,Today's front pages,The top stories in Malta's newspapers,2025-10-13,The following are the top stories in Malta's n...,https://cdn-attachments.timesofmalta.com/28065...,File photo: Times of Malta,...,False,,,,,,,,,
450,496577,https://timesofmalta.com/article/traffic-overt...,Times of Malta,https://timesofmalta.com,Traffic overtakes cost of living to become peo...,Poll data suggests frustration on Maltese road...,2025-10-13,"Traffic, parking and public transport-related ...",https://cdn-attachments.timesofmalta.com/861ca...,Cars in the Santa Venera tunnel. Photo: Chris ...,...,False,,,,,,,,,
451,496733,https://timesofmalta.com/article/employer-clea...,Times of Malta,https://timesofmalta.com,Employer cleared of responsibility for young w...,"Court raps police, OHSA for not working togeth...",2025-10-14,A court has sharply criticised the police and ...,https://cdn-attachments.timesofmalta.com/2d4fb...,The site of the 2015 tragic accident.,...,False,,,,,,,,,
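
The explode-then-normalize step in the cell above can be followed on a toy example. The data here is invented and only mirrors the shape of the `llm_drivers` column after parsing: one row with two drivers and one row with none.

```python
import pandas as pd

# Toy data: article 1 has two drivers, article 2 has an empty list
toy = pd.DataFrame({
    "article_id": [1, 2],
    "llm_drivers": [
        [
            {"vehicle_type": "Car", "driver_age": 67},
            {"vehicle_type": "Motorcycle", "driver_age": 61},
        ],
        [],
    ],
})

# One row per driver; an empty list becomes a single row holding NaN
exploded = toy.explode("llm_drivers", ignore_index=True)

# Turn each driver dict into columns; the NaN row yields empty values
drivers = pd.json_normalize(exploded["llm_drivers"])

flat = pd.concat([exploded.drop(columns=["llm_drivers"]), drivers], axis=1)
print(flat)
```

This is why the combined table has more rows than the original one: article-level columns are repeated once per driver, which is worth keeping in mind when counting accidents from the combined CSV.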


### Police Press Releases

Combination of all dataframes for the police press releases.

#### Original CSV File

In [9]:
police_releases_df = pd.read_csv(police_press_releases_csv)
police_releases_df.insert(0, 'release_id', range(1, len(police_releases_df) + 1)) # use similar pre-processing used by Isaac to generate surrogate key

police_releases_df = (
    police_releases_df[[
        "release_id",
        "title",
        "content",
        "date_published",
        "date_modified",
    ]]
    .rename(columns={
        "release_id": "release_id",
        "title": f"{og_prefix}title",
        "content": f"{og_prefix}content",
        "date_published": f"{og_prefix}date_published",
        "date_modified": f"{og_prefix}date_modified",
    })
)

display(police_releases_df)

Unnamed: 0,release_id,og_title,og_content,og_date_published,og_date_modified
0,1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",2025-10-09,2025-10-09
1,2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",2025-06-20,2025-06-20
2,3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",2025-05-12,2025-05-12
3,4,Collision between motorcycle and car in Għaxaq,"Yesterday, at around 1800hrs, the Police were ...",2025-07-30,2025-07-30
4,5,Car-motorcycle collision,"Yesterday, at around quarter to nine in the ev...",2025-04-07,2025-04-07
...,...,...,...,...,...
106,107,Motorcycle accident in Attard,"A 52-year-old man and residing in Ħaż-Żebbuġ, ...",2025-02-05,2025-02-05
107,108,Naxxar traffic accident,"Today, at around 1045hrs, the Police were info...",2024-12-19,2024-12-19
108,109,Żebbuġ traffic accident,"Today, at around 0800hrs, the Police were info...",2025-03-16,2025-03-16
109,110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",2025-07-18,2025-07-18


#### Datetime extraction with regex

In [10]:
regex_dtime_police_releases_df = pd.read_csv(regex_dtime_press_releases_csv)

regex_dtime_police_releases_df = (
    regex_dtime_police_releases_df[[
        "release_id",
        "accident_datetime",
    ]]
    .rename(columns={
        "release_id": "release_id",
        "accident_datetime": f"{regex_dtime_prefix}accident_datetime",
    })
)

display(regex_dtime_police_releases_df)

Unnamed: 0,release_id,regxdt_accident_datetime
0,1,09/10/2025 9:30
1,2,19/06/2025 18:30
2,3,12/05/2025 8:00
3,4,29/07/2025 18:00
4,5,06/04/2025 20:45
...,...,...
106,107,05/02/2025 9:00
107,108,19/12/2024 10:45
108,109,16/03/2025 8:00
109,110,17/07/2025 22:15
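
Note that the press-release datetimes are displayed as `DD/MM/YYYY H:MM`, whereas the news-article datetimes are displayed in ISO style. If both need to be compared during validation, one option is to parse each with an explicit format. The sketch below uses a few values copied from the outputs above; the day-first reading is an assumption, although values such as `19/06/2025` are consistent with it.

```python
import pandas as pd

# Sample values copied from the two regex outputs shown in this notebook
iso_style = pd.Series(["2024-12-11 17:00:00", "2024-12-13 17:30:00"])
dayfirst_style = pd.Series(["09/10/2025 9:30", "19/06/2025 18:30"])

# Explicit formats avoid silently swapping day and month
iso_parsed = pd.to_datetime(iso_style, format="%Y-%m-%d %H:%M:%S")
dayfirst_parsed = pd.to_datetime(dayfirst_style, format="%d/%m/%Y %H:%M")

print(iso_parsed.tolist())
print(dayfirst_parsed.tolist())
```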


#### Feature Extraction with LLM

In [11]:
llm_police_releases_df = pd.read_csv(llm_press_releases_csv)

llm_police_releases_df = (
    llm_police_releases_df[[
        "id_column", # release_id
        "is_accident",
        "street",
        "city",
        "number_injured",
        "accident_severity",
        "drivers",
    ]]
    .rename(columns={
        "id_column": "release_id",
        "is_accident": f"{llm_prefix}is_accident",
        "street": f"{llm_prefix}street",
        "city": f"{llm_prefix}city",
        "number_injured": f"{llm_prefix}number_injured",
        "accident_severity": f"{llm_prefix}accident_severity",
        "drivers": f"{llm_prefix}drivers",
    })
)

display(llm_police_releases_df)

Unnamed: 0,release_id,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm_drivers
0,1,True,Triq il-Belt Valletta,Żurrieq,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever..."
1,2,True,Triq Dawret il-Gudja,Gudja,1.0,Serious,"[{'vehicle_type': 'Honda fit', 'vehicle_damage..."
2,3,True,Valley Road,Qormi,1.0,Serious,"[{'vehicle_type': 'Ford Transit', 'vehicle_dam..."
3,4,True,Triq Dawret Ħal Għaxaq,Għaxaq,1.0,Serious,"[{'vehicle_type': 'Volvo XC60', 'vehicle_damag..."
4,5,True,Triq il-Buqana,Rabat,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever..."
...,...,...,...,...,...,...,...
106,107,True,Vjal L-Istadium Nazzjonali,Attard,1.0,Serious,"[{'vehicle_type': 'Motorcycle', 'vehicle_damag..."
107,108,True,Triq il-Ġermanja,Naxxar,1.0,Serious,"[{'vehicle_type': 'Toyota Vitz', 'vehicle_dama..."
108,109,True,Vjal il-Helsien,Zebbug,2.0,Serious,"[{'vehicle_type': 'Peugeot 306', 'vehicle_dama..."
109,110,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,"[{'vehicle_type': 'Car', 'vehicle_damage_sever..."


#### Combined Police Press Releases

In [12]:
joined_police_releases_df = (
    police_releases_df
    .merge(regex_dtime_police_releases_df, on="release_id", how="left")
    .merge(llm_police_releases_df, on="release_id", how="left")
)

joined_police_releases_df["llm_drivers"] = joined_police_releases_df["llm_drivers"].apply(parse_llm_drivers)
exploded_police_releases_df = joined_police_releases_df.explode("llm_drivers", ignore_index=True)

combined_police_releases_df = pd.concat(
    [
        exploded_police_releases_df.drop(columns=["llm_drivers"]),
        pd.json_normalize(exploded_police_releases_df["llm_drivers"])
    ],
    axis=1
).rename(columns={
    "vehicle_type": f"{llm_prefix}vehicle_type",
    "vehicle_damage_severity": f"{llm_prefix}vehicle_damage_severity",
    "driver_age": f"{llm_prefix}driver_age",
    "driver_gender": f"{llm_prefix}driver_gender",
    "is_victim": f"{llm_prefix}is_victim",
})

display(combined_police_releases_df)

combined_police_releases_df.to_csv(combined_press_releases_csv)

Unnamed: 0,release_id,og_title,og_content,og_date_published,og_date_modified,regxdt_accident_datetime,llm_is_accident,llm_street,llm_city,llm_number_injured,llm_accident_severity,llm_vehicle_type,llm_vehicle_damage_severity,llm_driver_age,llm_driver_gender,llm_is_victim
0,1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",2025-10-09,2025-10-09,09/10/2025 9:30,True,Triq il-Belt Valletta,Żurrieq,1.0,Serious,Car,none,67,F,False
1,1,Collision between a car and a motorbike in Żur...,"Today, at around 0930hrs, the Police were info...",2025-10-09,2025-10-09,09/10/2025 9:30,True,Triq il-Belt Valletta,Żurrieq,1.0,Serious,Motorbike,none,61,M,True
2,2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",2025-06-20,2025-06-20,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,1.0,Serious,Honda fit,none,64,M,False
3,2,Car-motorcycle traffic accident,"Yesterday, at around 1830hrs, the Police were ...",2025-06-20,2025-06-20,19/06/2025 18:30,True,Triq Dawret il-Gudja,Gudja,1.0,Serious,Kawasaki Ninja motorcycle,none,23,M,True
4,3,Car-motorcycle collision in Ħal Qormi,"Today, at around 0800hrs, the Police were info...",2025-05-12,2025-05-12,12/05/2025 8:00,True,Valley Road,Qormi,1.0,Serious,Ford Transit,none,34,M,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170,109,Żebbuġ traffic accident,"Today, at around 0800hrs, the Police were info...",2025-03-16,2025-03-16,16/03/2025 8:00,True,Vjal il-Helsien,Zebbug,2.0,Serious,Peugeot 306,none,59,M,False
171,110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",2025-07-18,2025-07-18,17/07/2025 22:15,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,Car,none,41,none,False
172,110,Collision between a car and e-scooter,"Yesterday, at around 2215 hrs, the Police were...",2025-07-18,2025-07-18,17/07/2025 22:15,True,Triq il-Wied ta’ Birkirkara,Birkirkara,1.0,Serious,E-scooter,none,17,none,True
173,111,Traffic accident in Gwardamanġa,"Today, at around 0700hrs, the Police were info...",2025-08-12,2025-08-12,12/08/2025 7:00,True,St Luke’s Square,Gwardamanġa,2.0,Serious,Volkswagen Caddy,none,62,M,True
