# 00. Data Ingestion and Initial Exploration

**Objective:** Load the dataset and perform an initial inspection to understand its structure, identify immediate data quality issues, and gather basic statistics.

**PRD References:** 
- 3.1.1 Data Ingestion
- 4.1 Data Source
- FR1: Data Loading and Preprocessing

## 1. Setup and Library Imports

In [16]:
import pandas as pd
import numpy as np

# Display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

## 2. Data Loading

Load the CSV dataset into a Pandas DataFrame.

In [17]:
import os

# Resolve the dataset path so the notebook runs both on Google Colab and locally.
if os.path.exists('/content'):
    DATA_PATH = '/content/drive/My Drive/data/raw/RTA_EDSA_2007-2016.csv'  # Google Colab
else:
    DATA_PATH = '../data/raw/RTA_EDSA_2007-2016.csv'  # local run, relative to the notebook


try:
    df = pd.read_csv(DATA_PATH)
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print(f"Error: The file was not found at {DATA_PATH}")
except Exception as e:
    print(f"An error occurred: {e}")

Dataset loaded successfully!
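
The loading cell above prints a message on failure and carries on, which is why the later cells guard with `if 'df' in locals()`. One alternative is to fail fast so a missing file stops the run immediately. The sketch below is only an illustration of that idea: `load_dataset` is a hypothetical helper, not part of the notebook, and the CSV it reads is a tiny made-up file written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, raising if the file does not exist.

    Hypothetical helper for illustration; the notebook itself uses a
    try/except around pd.read_csv.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return pd.read_csv(path)


# Demonstrate with a made-up two-row file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"
    sample_path.write_text("ROAD,SEVERITY\nEDSA,Property\nEDSA,Injury\n")
    sample_df = load_dataset(sample_path)

print(sample_df.shape)
```

With this approach the `if 'df' in locals()` guards become unnecessary, at the cost of the notebook halting on the first error.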


## 3. Initial Data Inspection

Perform basic checks to understand the dataset's structure and content.

### 3.1. DataFrame Shape

In [18]:
if 'df' in locals():
    print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

The dataset has 22072 rows and 26 columns.


### 3.2. First Few Rows (Head)

In [19]:
if 'df' in locals():
    display(df.head())

Unnamed: 0,LOCATION_TEXT,ROAD,WEATHER,LIGHT,DESC,REPORTING_AGENCY,MAIN_CAUSE,INCIDENTDETAILS_ID,DATE_UTC,TIME_UTC,ADDRESS,killed_driver,killed_passenger,killed_pedestrian,injured_driver,injured_passenger,injured_pedestrian,killed_uncategorized,injured_uncategorized,killed_total,injured_total,DATETIME_PST,SEVERITY,Y,X,COLLISION_TYPE
0,,EDSA,,,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,2414f311-e805-40c6-b8b9-116b51c35be7,2014-06-30,05:40:00,Congressional Ave. Edsa Quezon City,0,0,0,0,0,0,0,0,0,0,2014-06-30 13:40,Property,14.657714,121.019788,No Collision Stated
1,,EDSA,,,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,c359744e-e0e7-4541-8db8-5f334c2b4fd7,2014-03-17,01:00:00,Congressional Ave. near EDSA Exit Gate of SM W...,0,0,0,0,0,0,0,0,0,0,2014-03-17 9:00,Property,14.657714,121.019788,No Collision Stated
2,,EDSA,,,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,84e0c352-c029-4393-9252-7516c77f471f,2013-11-26,02:00:00,Congressional Ave. EDSA Quezon City,0,0,0,0,1,0,0,0,0,1,2013-11-26 10:00,Injury,14.657714,121.019788,No Collision Stated
3,,EDSA,,,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,33462d75-3bf8-483f-8cf2-63adab3a6924,2013-10-26,13:00:00,Congressional Ave. EDSA Bgy.Magsaysay Quezon City,0,0,0,0,0,0,0,0,0,0,2013-10-26 21:00,Property,14.657714,121.019788,No Collision Stated
4,,EDSA,,,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,2d2ef179-9c5c-4d5b-90de-8f12871a005c,2013-06-26,23:30:00,Congressional Ave. fronting EDSA N/B Quezon City,0,0,0,1,0,0,0,0,0,1,2013-06-27 7:30,Injury,14.657706,121.01966,No Collision Stated


### 3.3. Last Few Rows (Tail)

In [24]:
if 'df' in locals():
    display(df.tail())

Unnamed: 0,LOCATION_TEXT,ROAD,WEATHER,LIGHT,DESC,REPORTING_AGENCY,MAIN_CAUSE,INCIDENTDETAILS_ID,DATE_UTC,TIME_UTC,ADDRESS,killed_driver,killed_passenger,killed_pedestrian,injured_driver,injured_passenger,injured_pedestrian,killed_uncategorized,injured_uncategorized,killed_total,injured_total,DATETIME_PST,SEVERITY,Y,X,COLLISION_TYPE
22067,"DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...",EDSA,clear-day,day,BUMPED FROM BEHIND,MMDA Metrobase,,30502357-1f20-40a3-ab37-229908775f43,2016-10-29,00:47:30,,0,0,0,0,0,0,0,0,0,0,2016-10-29 8:47,Property,14.644445,121.037136,Rear-End
22068,"EDSA Quezon Avenue Flyover Northbound, South T...",EDSA,clear-night,night,Bumped from Behind,MMDA Metrobase,,845dd06b-3537-41cc-9b5c-f4d1d98c89c8,2016-10-22,11:50:47,,0,0,0,0,0,0,0,0,0,0,2016-10-22 19:50,Property,14.644503,121.037418,Rear-End
22069,"DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...",EDSA,rain,day,BUMPED FROM BEHIND,Other,,9a9ef68e-a7a6-4949-a848-c0f8557157b8,2016-10-21,02:00:00,,0,0,0,0,0,0,0,0,0,0,2016-10-21 10:00,Property,14.644524,121.037112,Rear-End
22070,"DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...",EDSA,clear-day,day,(+)MC DRIVER INJURED,MMDA Metrobase,,e89aef3c-d577-435d-b144-28abb6ed6d04,2016-10-24,23:17:32,,0,0,0,1,0,0,0,0,0,1,2016-10-25 7:17,Injury,14.644586,121.037042,Side Swipe
22071,"DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...",EDSA,rain,day,(+)MC DRIVER INJURED,Other,,c8212532-4db1-4a71-81ce-d9d22fed1765,2016-10-20,22:38:24,,0,0,0,1,0,0,0,0,0,1,2016-10-21 6:38,Injury,14.644701,121.037074,Side Swipe
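
The tail output shows the same description recorded in different letter case (`BUMPED FROM BEHIND` and `Bumped from Behind`), which inflates the unique count of `DESC`. A minimal sketch of how such variants could be collapsed before counting follows. The series holds only values copied from the output above, and whether case folding is appropriate for the whole column is an assumption that needs checking against the full data.

```python
import pandas as pd

# Values taken from the tail() output; not the full DESC column.
desc = pd.Series([
    "BUMPED FROM BEHIND",
    "Bumped from Behind",
    "(+)MC DRIVER INJURED",
])

# Strip surrounding whitespace and fold case so spelling variants compare equal.
normalized = desc.str.strip().str.casefold()

print(desc.nunique(), normalized.nunique())
```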


### 3.4. Data Types and Non-Null Counts (Info)

In [25]:
if 'df' in locals():
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22072 entries, 0 to 22071
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LOCATION_TEXT          304 non-null    object 
 1   ROAD                   22072 non-null  object 
 2   WEATHER                304 non-null    object 
 3   LIGHT                  304 non-null    object 
 4   DESC                   22072 non-null  object 
 5   REPORTING_AGENCY       22072 non-null  object 
 6   MAIN_CAUSE             304 non-null    object 
 7   INCIDENTDETAILS_ID     22072 non-null  object 
 8   DATE_UTC               22072 non-null  object 
 9   TIME_UTC               22072 non-null  object 
 10  ADDRESS                21768 non-null  object 
 11  killed_driver          22072 non-null  int64  
 12  killed_passenger       22072 non-null  int64  
 13  killed_pedestrian      22072 non-null  int64  
 14  injured_driver         22072 non-null  int64  
 15  in
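
The `df.info()` output shows that `LOCATION_TEXT`, `WEATHER`, `LIGHT`, and `MAIN_CAUSE` have only 304 non-null values out of 22072 rows, and `ADDRESS` has 21768. A per-column missing-value summary makes that easier to read at a glance. The sketch below uses a hypothetical helper, `missing_summary`, on a small made-up frame; it is one possible way to tabulate the gaps, not a step taken in the notebook.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing counts and percentages per column, most-missing first.

    Hypothetical helper for illustration. Columns with no missing values
    are omitted from the result.
    """
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up toy data that mimics the pattern seen in df.info().
toy = pd.DataFrame({
    "WEATHER": [None, None, None, "rain"],
    "ADDRESS": ["a", "b", "c", None],
    "ROAD": ["EDSA"] * 4,
})
report = missing_summary(toy)
print(report)
```

Applied to the real data, `missing_summary(df)` would list the sparsely populated columns first.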

### 3.5. Descriptive Statistics (Describe)

In [26]:
if 'df' in locals():
    display(df.describe(include='all'))

Unnamed: 0,LOCATION_TEXT,ROAD,WEATHER,LIGHT,DESC,REPORTING_AGENCY,MAIN_CAUSE,INCIDENTDETAILS_ID,DATE_UTC,TIME_UTC,ADDRESS,killed_driver,killed_passenger,killed_pedestrian,injured_driver,injured_passenger,injured_pedestrian,killed_uncategorized,injured_uncategorized,killed_total,injured_total,DATETIME_PST,SEVERITY,Y,X,COLLISION_TYPE
count,304,22072,304,304,22072,22072,304,22072,22072,22072,21768,22072.0,22072.0,22072.0,22072.0,22072.0,22072.0,22072.0,22072.0,22072.0,22072.0,22072,22072,22072.0,22072.0,22072
unique,187,1,7,3,13910,3,4,22071,3391,582,7813,,,,,,,,,,,21089,3,,,8
top,"Caltex, EDSA, Malamig, Mandaluyong, Metro Mani...",EDSA,clear-day,day,"No Accident Factor, No Collision Stated (based...",MMDA Road Safety Unit,Human error,73997ea2-e6f6-44e0-ac58-aff95b4cccdb,2016-05-06,08:00:00,EDSA Taft Ave. Pasay,,,,,,,,,,,2016-10-06 18:00,Property,,,No Collision Stated
freq,10,22072,96,229,114,21768,299,2,41,347,361,,,,,,,,,,,4,20555,,,7323
mean,,,,,,,,,,,,0.00068,0.000227,4.5e-05,0.048025,0.042497,0.000544,9.1e-05,0.002582,0.001042,0.093648,,,14.598604,121.029992,
std,,,,,,,,,,,,0.026061,0.01505,0.006731,0.219263,0.518523,0.06169,0.009519,0.052508,0.03364,0.579482,,,0.044841,0.022182,
min,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,14.535388,120.983664,
25%,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,14.554971,121.013693,
50%,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,14.597563,121.034861,
75%,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,14.644244,121.047956,
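
In the statistics above, the means of the four `injured_*` component columns add up to the mean of `injured_total` (0.048025 + 0.042497 + 0.000544 + 0.002582 = 0.093648), which suggests the total is the sum of its parts. That is an inference from the aggregates, not something the source states, so it is worth confirming row by row. A minimal sketch, assuming that relationship and using a hypothetical helper `total_mismatches` on made-up data:

```python
import pandas as pd

ROLES = ("driver", "passenger", "pedestrian", "uncategorized")


def total_mismatches(frame, prefix):
    """Return rows where <prefix>_total differs from the sum of its components.

    Hypothetical helper; assumes the total should equal the sum of the
    driver, passenger, pedestrian, and uncategorized counts.
    """
    parts = [f"{prefix}_{role}" for role in ROLES]
    return frame[frame[parts].sum(axis=1) != frame[f"{prefix}_total"]]


# Made-up toy data; the last row is deliberately inconsistent.
toy = pd.DataFrame({
    "injured_driver": [0, 1, 1],
    "injured_passenger": [1, 0, 0],
    "injured_pedestrian": [0, 0, 0],
    "injured_uncategorized": [0, 0, 0],
    "injured_total": [1, 1, 3],
})
bad_rows = total_mismatches(toy, "injured")
print(len(bad_rows))
```

The same call with `prefix="killed"` would cover the fatality columns.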


### 3.6. Check for Duplicated Rows

In [27]:
if 'df' in locals():
    num_duplicates = df.duplicated().sum()
    print(f"Number of duplicated rows: {num_duplicates}")

Number of duplicated rows: 0
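
`df.duplicated()` only flags rows that are identical in every column. The `describe` output, however, reports 22071 unique values among 22072 entries of `INCIDENTDETAILS_ID`, with a top frequency of 2, so one ID appears on two rows that differ elsewhere. A sketch of how those rows could be pulled out for inspection follows; `rows_sharing_key` is a hypothetical helper and the frame is made up.

```python
import pandas as pd


def rows_sharing_key(frame, key):
    """Return every row whose value in `key` occurs more than once.

    Hypothetical helper for illustration.
    """
    return frame[frame.duplicated(subset=[key], keep=False)]


# Made-up toy data: the ID "a1" repeats, but the two rows are not identical.
toy = pd.DataFrame({
    "INCIDENTDETAILS_ID": ["a1", "b2", "a1"],
    "SEVERITY": ["Property", "Injury", "Injury"],
})
repeats = rows_sharing_key(toy, "INCIDENTDETAILS_ID")

print(toy.duplicated().sum(), len(repeats))
```

Passing `keep=False` marks every member of a duplicate group, so both rows are returned and can be compared side by side.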


## 4. Initial Observations

Based on the initial inspection, document key observations here:

1.  **Dataset Dimensions:** (Number of rows, Number of columns)
2.  **Data Types:** (List any immediate concerns or necessary conversions, e.g., dates as objects)
3.  **Missing Values:** (Note columns with a significant number of missing values based on `df.info()`)
4.  **Categorical Features:** (Identify potential categorical columns from `df.describe(include='all')` and `df.info()`)
5.  **Numerical Features:** (Identify potential numerical columns and their basic statistics like mean, min, max)
6.  **Duplicates:** (Presence and number of duplicate rows)
7.  **Potential Target-Related Columns:** (Identify columns like `SEVERITY`, `killed_total`, `injured_total` that will be crucial for target variable definition)
8.  **Date/Time Columns:** (Identify `DATE_UTC`, `TIME_UTC`, `DATETIME_PST` and note their current format)
9.  **Text/ID Columns:** (Identify columns like `LOCATION_TEXT`, `DESC`, `INCIDENTDETAILS_ID`, `ADDRESS` and consider their potential utility or need for exclusion/processing)
10. **Initial Data Quality Concerns:** (Any other immediate red flags, e.g., inconsistent values if visible in `head()`/`tail()` or unique counts from `describe()`)
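
For item 8, the sample rows suggest that `DATETIME_PST` is the UTC timestamp shifted by eight hours and truncated to the minute (for example, `2014-06-30 05:40:00` UTC appears as `2014-06-30 13:40`). Reading "PST" as Philippine Standard Time is an assumption drawn from that offset. A minimal sketch of parsing the three columns and checking that they agree, using three rows copied from the `head()` and `tail()` output:

```python
import pandas as pd

# Rows copied from the head()/tail() output above; not the full dataset.
toy = pd.DataFrame({
    "DATE_UTC": ["2014-06-30", "2013-06-26", "2016-10-29"],
    "TIME_UTC": ["05:40:00", "23:30:00", "00:47:30"],
    "DATETIME_PST": ["2014-06-30 13:40", "2013-06-27 7:30", "2016-10-29 8:47"],
})

# Combine the UTC date and time strings into one timezone-aware timestamp.
utc = pd.to_datetime(
    toy["DATE_UTC"] + " " + toy["TIME_UTC"],
    format="%Y-%m-%d %H:%M:%S",
    utc=True,
)

# Convert to Manila time (assumed meaning of "PST"), drop the timezone,
# and floor to the minute to match the resolution of DATETIME_PST.
local = utc.dt.tz_convert("Asia/Manila").dt.tz_localize(None).dt.floor("min")

reported = pd.to_datetime(toy["DATETIME_PST"], format="%Y-%m-%d %H:%M")
toy["pst_matches"] = local == reported

print(toy["pst_matches"].all())
```

If the check holds on the full data, one of the redundant date/time representations could be dropped in preprocessing.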