# Commit 1: Initial Data Loading and Preliminary Inspection

This notebook covers the tasks for Commit 1:
1. Load the dataset.
2. Perform basic data inspection:
    * Display column names and data types.
    * Check for missing values in `SEVERITY`, `killed_total`, and `injured_total`.
    * Get a first look at unique values for `SEVERITY`.
    * Get basic descriptive statistics for `killed_total` and `injured_total`.

In [1]:
import pandas as pd
import numpy as np

## 1. Load Dataset

In [4]:
file_path = '../data/raw/RTA_EDSA_2007-2016.csv'
try:
    df = pd.read_csv(file_path)
    print(df)
except FileNotFoundError:
    print(f"File not found: {file_path}")
    df = None

                                           LOCATION_TEXT  ROAD      WEATHER  \
0                                                    NaN  EDSA          NaN   
1                                                    NaN  EDSA          NaN   
2                                                    NaN  EDSA          NaN   
3                                                    NaN  EDSA          NaN   
4                                                    NaN  EDSA          NaN   
...                                                  ...   ...          ...   
22067  DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...  EDSA    clear-day   
22068  EDSA Quezon Avenue Flyover Northbound, South T...  EDSA  clear-night   
22069  DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...  EDSA         rain   
22070  DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...  EDSA    clear-day   
22071  DILG-NAPOLCOM Center, EDSA, South Triangle, Qu...  EDSA         rain   

       LIGHT                                       

In [7]:
if df is not None:
    print(df.head())

  LOCATION_TEXT  ROAD WEATHER LIGHT  \
0           NaN  EDSA     NaN   NaN   
1           NaN  EDSA     NaN   NaN   
2           NaN  EDSA     NaN   NaN   
3           NaN  EDSA     NaN   NaN   
4           NaN  EDSA     NaN   NaN   

                                                DESC       REPORTING_AGENCY  \
0  No Accident Factor, No Collision Stated (based...  MMDA Road Safety Unit   
1  No Accident Factor, No Collision Stated (based...  MMDA Road Safety Unit   
2  No Accident Factor, No Collision Stated (based...  MMDA Road Safety Unit   
3  No Accident Factor, No Collision Stated (based...  MMDA Road Safety Unit   
4  No Accident Factor, No Collision Stated (based...  MMDA Road Safety Unit   

    MAIN_CAUSE                    INCIDENTDETAILS_ID    DATE_UTC  TIME_UTC  \
0  Human error  2414f311-e805-40c6-b8b9-116b51c35be7  2014-06-30  05:40:00   
1  Human error  c359744e-e0e7-4541-8db8-5f334c2b4fd7  2014-03-17  01:00:00   
2  Human error  84e0c352-c029-4393-9252-7516c77f471f  20
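
The `head()` output above wraps across several blocks because the frame is wide. A minimal sketch of two ways to get a more readable preview, using a small toy frame (its values are illustrative, not a sample of the real file):

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "ROAD": ["EDSA", "EDSA"],
    "REPORTING_AGENCY": ["MMDA Road Safety Unit", "MMDA Road Safety Unit"],
    "MAIN_CAUSE": ["Human error", "Human error"],
})

# Option 1: transpose the preview so each column becomes a row.
preview = toy.head().T
print(preview)

# Option 2: temporarily lift the column and width limits for a single print.
with pd.option_context("display.max_columns", None, "display.width", None):
    print(toy.head())
```

Either option could be applied to `df.head()` directly; the transposed view tends to be easier to scan when there are many columns.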

## 2. Basic Data Inspection

### 2.1 Display Column Names and Data Types

In [9]:
if df is not None:
    # info() prints its report itself and returns None, so it is not wrapped in print()
    df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22072 entries, 0 to 22071
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   LOCATION_TEXT          304 non-null    object 
 1   ROAD                   22072 non-null  object 
 2   WEATHER                304 non-null    object 
 3   LIGHT                  304 non-null    object 
 4   DESC                   22072 non-null  object 
 5   REPORTING_AGENCY       22072 non-null  object 
 6   MAIN_CAUSE             304 non-null    object 
 7   INCIDENTDETAILS_ID     22072 non-null  object 
 8   DATE_UTC               22072 non-null  object 
 9   TIME_UTC               22072 non-null  object 
 10  ADDRESS                21768 non-null  object 
 11  killed_driver          22072 non-null  int64  
 12  killed_passenger       22072 non-null  int64  
 13  killed_pedestrian      22072 non-null  int64  
 14  injured_driver         22072 non-null  int64  
 15  in
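
The report above lists `DATE_UTC` and `TIME_UTC` as `object` (string) columns. A minimal sketch of how they might be combined into one timezone-aware timestamp, using the two complete rows visible in the `head()` output. The column names `datetime_utc` and `datetime_local` are hypothetical, and the choice of `Asia/Manila` as the local zone is an assumption:

```python
import pandas as pd

# Two rows copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "DATE_UTC": ["2014-06-30", "2014-03-17"],
    "TIME_UTC": ["05:40:00", "01:00:00"],
})

# Join the date and time strings, then parse them as UTC timestamps.
# errors="coerce" turns unparseable values into NaT instead of raising.
sample["datetime_utc"] = pd.to_datetime(
    sample["DATE_UTC"] + " " + sample["TIME_UTC"],
    utc=True,
    errors="coerce",
)

# Assumption: Asia/Manila is the local zone of interest for EDSA incidents.
sample["datetime_local"] = sample["datetime_utc"].dt.tz_convert("Asia/Manila")

print(sample.dtypes)
```

Counting `NaT` values in the parsed column afterwards would show how many rows, if any, failed to parse.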

### 2.2 Check for Missing Values in `SEVERITY`, `killed_total`, and `injured_total`

In [None]:
if df is not None:
    key_cols_for_missing = ['SEVERITY', 'killed_total', 'injured_total']
    # Keep only the key columns that are actually present in the DataFrame
    existing_key_cols = [col for col in key_cols_for_missing if col in df.columns]
    if not existing_key_cols:
        print("No key columns found in the DataFrame.")
    else:
        # Number of missing values per key column
        missing_values = df[existing_key_cols].isnull().sum()
        print(missing_values)

        # Share of missing values per key column, in percent
        missing_percentage = missing_values / len(df) * 100
        print(missing_percentage)

SEVERITY         0
killed_total     0
injured_total    0
dtype: int64
SEVERITY         0.0
killed_total     0.0
injured_total    0.0
dtype: float64
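
The three key columns are complete, but the `df.info()` report in 2.1 shows other columns (for example `WEATHER`, with 304 non-null values out of 22072) that are mostly empty. A minimal sketch of a helper that summarizes missingness for every column at once. The function name `missing_summary` is hypothetical and the toy frame is illustrative only:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        "missing_pct": frame.isna().mean() * 100,
    })
    return summary.sort_values("missing_count", ascending=False)

# Toy frame for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "SEVERITY": ["Property", "Injury", "Fatal", "Property"],
    "WEATHER": [np.nan, np.nan, "rain", np.nan],
    "killed_total": [0, 0, 1, 0],
})

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the loaded frame would rank all 26 columns by how incomplete they are.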


### 2.3 Get a First Look at Unique Values for `SEVERITY`

In [12]:
if df is not None and 'SEVERITY' in df.columns:
    severity_unique_values = df['SEVERITY'].unique()
    print(severity_unique_values)
    
    severity_value_counts = df['SEVERITY'].value_counts(dropna=False)
    print(severity_value_counts)
elif df is not None:
    print("Column 'SEVERITY' not found in DataFrame.")

['Property' 'Injury' 'Fatal']
SEVERITY
Property    20555
Injury       1495
Fatal          22
Name: count, dtype: int64
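
The counts above point to a strong class imbalance. A short sketch that turns them into percentages, using the three counts copied from the output. On the full frame, `df['SEVERITY'].value_counts(normalize=True) * 100` should give the same shares:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
severity_counts = pd.Series(
    {"Property": 20555, "Injury": 1495, "Fatal": 22}, name="count"
)

# Share of each class in percent, rounded to two decimals.
severity_pct = (severity_counts / severity_counts.sum() * 100).round(2)
print(severity_pct)
```

`Property` accounts for roughly 93% of rows and `Fatal` for about 0.1%, which is worth keeping in mind for any later modeling or aggregation on `SEVERITY`.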


### 2.4 Get Basic Descriptive Statistics for `killed_total` and `injured_total`

In [None]:
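if df is not None:
    casualty_cols = ['killed_total', 'injured_total']
    # Keep only the casualty columns that are actually present in the DataFrame
    existing_casualty_cols = [col for col in casualty_cols if col in df.columns]
    if not existing_casualty_cols:
        print("No casualty columns found in the DataFrame.")
    else:
        # count, mean, std, min, quartiles, and max for each casualty column
        print(df[existing_casualty_cols].describe())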
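
A possible follow-up to the descriptive statistics is a consistency check between `SEVERITY` and the casualty totals. The sketch below assumes, without confirmation from the source, that `Fatal` rows should have `killed_total > 0` and that `Injury` rows should have `injured_total > 0`. The toy rows are illustrative only:

```python
import pandas as pd

# Toy rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "SEVERITY": ["Property", "Injury", "Fatal", "Injury"],
    "killed_total": [0, 0, 1, 0],
    "injured_total": [0, 2, 0, 0],
})

# Assumed coding rule (not confirmed by the source):
# a row is labelled Fatal exactly when at least one person was killed.
fatal_mismatch = (toy["SEVERITY"] == "Fatal") != (toy["killed_total"] > 0)

# Assumed coding rule: Injury rows record at least one injured person.
injury_without_injured = (toy["SEVERITY"] == "Injury") & (toy["injured_total"] == 0)

# Cross-tabulate severity against "any deaths recorded" for a quick overview.
print(pd.crosstab(toy["SEVERITY"], toy["killed_total"] > 0))
print("Fatal/killed_total mismatches:", int(fatal_mismatch.sum()))
print("Injury rows with injured_total == 0:", int(injury_without_injured.sum()))
```

Non-zero mismatch counts on the real data would not necessarily indicate errors; they might instead mean the assumed coding rules do not match how `SEVERITY` was assigned.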