# Notebook Purpose

The purpose of this notebook is to combine all the extracted features into one CSV file so that we can manually validate the data that was extracted programmatically via regex or LLMs.

## Install Required Packages

- Install the `pandas` package.

In [None]:
!pip install pandas

## Imports

In [None]:
import pandas as pd
from IPython.display import display

## Attributes

In [None]:
data_folder_path = "../../../data"
processed_data_folder_path = f"{data_folder_path}/processed"

local_news_articles_csv = f"{data_folder_path}/local_news_articles.csv"
police_press_releases_csv = f"{data_folder_path}/police_press_releases.csv"

regex_dtime_news_articles_csv = f"{processed_data_folder_path}/road_accidents_with_datetime.csv"
regex_dtime_press_releases_csv = f"{processed_data_folder_path}/police_releases_with_datetime.csv"

llm_news_articles_csv = f"{processed_data_folder_path}/llm_local_news_articles.csv"
llm_press_releases_csv = f"{processed_data_folder_path}/llm_press_releases.csv"

# to-do: wait for Paul to extract the town/street for both CSVs so it can be included in the feature extraction

## Combine all DataFrames into one

Several attempts have been made to extract features:

- Datetime feature extraction using regex.
- Town/street feature extraction using regex.
- General feature extraction using an LLM.
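As a rough sketch of how the outputs of these attempts could be combined, the example below left-joins toy regex and LLM results onto the original rows. The column names (`accident_datetime`, `vehicle_count`), the `regex_` and `llm_` prefixes, and the assumption that every processed CSV shares an `article_id` key are all hypothetical and are not taken from the actual files.

```python
import pandas as pd

# Toy stand-ins for the three sources; every column name here is hypothetical.
original_df = pd.DataFrame({
    "og_article_id": [1, 2, 3],
    "og_title": ["Crash in town A", "Collision in town B", "Bike accident"],
})
regex_df = pd.DataFrame({
    "article_id": [1, 2],
    "accident_datetime": ["2024-01-05 08:30", "2024-02-11 17:45"],
}).add_prefix("regex_")
llm_df = pd.DataFrame({
    "article_id": [1, 3],
    "vehicle_count": [2, 1],
}).add_prefix("llm_")

# Left joins keep every original row, even when an extraction step
# produced no result for it (those cells become NaN).
combined_df = (
    original_df
    .merge(regex_df, how="left", left_on="og_article_id", right_on="regex_article_id")
    .merge(llm_df, how="left", left_on="og_article_id", right_on="llm_article_id")
)

print(combined_df)
```

A left join is used here so that rows without an extracted value stay visible during manual validation; an inner join would silently drop them.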

### Local News Articles

Combination of all DataFrames for the local news articles.

#### Original CSV File

We select only the columns of interest and add an `og_` prefix to their names.

This way, when we join the DataFrames together, we will know which DataFrame each column comes from.
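For reference, once the columns have been selected, pandas' `DataFrame.add_prefix` produces the same names as an explicit rename mapping. The toy frame below is an illustration only and uses just two of the column names for brevity.

```python
import pandas as pd

# Toy frame for illustration; the values are made up.
toy_df = pd.DataFrame({"article_id": [1], "title": ["Example"]})

# Explicit mapping, one entry per column.
renamed_df = toy_df.rename(columns={
    "article_id": "og_article_id",
    "title": "og_title",
})

# Shorthand: prepend the same prefix to every column label.
prefixed_df = toy_df.add_prefix("og_")

print(list(renamed_df.columns))
print(list(prefixed_df.columns))
```

The explicit mapping is more verbose but lets individual columns be renamed differently, whereas `add_prefix` always applies to every column in the frame.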

In [None]:
articles_df = pd.read_csv(local_news_articles_csv)

articles_df = (
    articles_df[[
        "article_id",
        "url",
        "source_name",
        "source_url",
        "title",
        "subtitle",
        # "author_name", -> not interested in the name of the author
        "publish_date",
        "content",
        "top_image_url",
        "top_image_caption",
        "created_at",
        "tags",
        # "categories" -> always empty set, not interested in this column
    ]]
    .rename(columns={
        "article_id": "og_article_id",
        "url": "og_url",
        "source_name": "og_source_name",
        "source_url": "og_source_url",
        "title": "og_title",
        "subtitle": "og_subtitle",
        "publish_date": "og_publish_date",
        "content": "og_content",
        "top_image_url": "og_top_image_url",
        "top_image_caption": "og_top_image_caption",
        "created_at": "og_created_at",
        "tags": "og_tags",
    })
)

display(articles_df)

In [None]:
# 

### Police Press Releases

Combination of all DataFrames for the police press releases.

In [None]:
police_releases_df = pd.read_csv(police_press_releases_csv)

In [None]:
display(police_releases_df)