# 00. Data Ingestion and Initial Exploration

**Objective:** Load the dataset and perform an initial inspection to understand its structure, identify immediate data quality issues, and gather basic statistics.

**PRD References:** 
- 3.1.1 Data Ingestion
- 4.1 Data Source
- FR1: Data Loading and Preprocessing

## 1. Setup and Library Imports

In [None]:
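import pandas as pd
import numpy as np
# display() is available implicitly in Jupyter; the explicit import keeps
# the cells below working if the notebook is exported and run as a script.
from IPython.display import display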

# Display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

## 2. Data Loading

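Load the raw CSV into a pandas DataFrame. `DATA_PATH` is relative to the notebook's directory. If loading fails, the error is printed and `df` stays undefined, so the guarded cells in Section 3 produce no output.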

In [None]:
DATA_PATH = '../data/raw/RTA_EDSA_2007-2016.csv'

try:
    df = pd.read_csv(DATA_PATH)
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print(f"Error: The file was not found at {DATA_PATH}")
except Exception as e:
    print(f"An error occurred: {e}")
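
An alternative to printing the error and continuing is to fail at load time. The following is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical helper name, and reporting the resolved absolute path is an assumption about what is most useful when a relative path such as `DATA_PATH` does not resolve.

```python
from pathlib import Path

import pandas as pd


def load_csv(path: str) -> pd.DataFrame:
    """Hypothetical helper: load a CSV or raise with the resolved path."""
    resolved = Path(path).resolve()
    if not resolved.is_file():
        # Showing the absolute path makes a wrong working directory obvious.
        raise FileNotFoundError(f"No file at {resolved}")
    return pd.read_csv(resolved)
```

With this helper, `df = load_csv(DATA_PATH)` either yields a DataFrame or stops the notebook at the loading cell.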

## 3. Initial Data Inspection

Perform basic checks to understand the dataset's structure and content.

### 3.1. DataFrame Shape

In [None]:
if 'df' in locals():
    print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

### 3.2. First Few Rows (Head)

In [None]:
if 'df' in locals():
    display(df.head())

### 3.3. Last Few Rows (Tail)

In [None]:
if 'df' in locals():
    display(df.tail())

### 3.4. Data Types and Non-Null Counts (Info)

In [None]:
if 'df' in locals():
    df.info()
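
`df.info()` reports non-null counts, so missing values have to be inferred by subtraction. A minimal sketch of a more direct view follows; `missing_summary` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing counts and percentages."""
    summary = pd.DataFrame({
        "missing_count": frame.isna().sum(),
        # The mean of a boolean mask is the fraction of missing values.
        "missing_pct": frame.isna().mean().mul(100).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing_count", ascending=False)
```

In this notebook it could be called as `display(missing_summary(df))` inside the same `if 'df' in locals():` guard used by the other cells.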

### 3.5. Descriptive Statistics (Describe)

In [None]:
if 'df' in locals():
    display(df.describe(include='all'))
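
The combined `describe(include='all')` table mixes numerical and non-numerical columns, which can be hard to read. One possible way to separate them is sketched below; `split_columns_by_dtype` is a hypothetical helper name, and treating low unique counts as a sign of a categorical column is a heuristic rather than a rule.

```python
import pandas as pd


def split_columns_by_dtype(frame: pd.DataFrame):
    """Hypothetical helper: numeric column names, plus unique counts for the rest."""
    numeric_cols = frame.select_dtypes(include="number").columns.tolist()
    non_numeric = frame.select_dtypes(exclude="number")
    # Low unique counts hint at categorical columns; counts close to the
    # number of rows hint at identifiers or free text.
    cardinality = non_numeric.nunique().sort_values()
    return numeric_cols, cardinality
```

The returned list and Series can feed directly into the "Categorical Features" and "Numerical Features" observations in Section 4.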

### 3.6. Check for Duplicated Rows

In [None]:
if 'df' in locals():
    num_duplicates = df.duplicated().sum()
    print(f"Number of duplicated rows: {num_duplicates}")
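
The count above only covers the second and later copies of each repeated row. To see the duplicates themselves, including the first occurrence, `keep=False` can be passed to `duplicated()`. A minimal sketch, with `duplicate_rows` as a hypothetical helper name:

```python
import pandas as pd


def duplicate_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: every row that has at least one exact copy."""
    # keep=False marks all members of a duplicate group, not just the repeats.
    return frame[frame.duplicated(keep=False)]
```

Inspecting these rows helps decide whether the copies are data-entry repeats or distinct records that happen to share every recorded value.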

## 4. Initial Observations

Based on the initial inspection, document key observations here:

1.  **Dataset Dimensions:** (Number of rows, Number of columns)
2.  **Data Types:** (List any immediate concerns or necessary conversions, e.g., dates as objects)
3.  **Missing Values:** (Note columns with many missing values, inferred from the non-null counts in `df.info()`)
4.  **Categorical Features:** (Identify potential categorical columns from `df.describe(include='all')` and `df.info()`)
5.  **Numerical Features:** (Identify potential numerical columns and their basic statistics like mean, min, max)
6.  **Duplicates:** (Presence and number of duplicate rows)
7.  **Potential Target-Related Columns:** (Identify columns like `SEVERITY`, `killed_total`, `injured_total` that will be crucial for target variable definition)
8.  **Date/Time Columns:** (Identify `DATE_UTC`, `TIME_UTC`, `DATETIME_PST` and note their current format)
9.  **Text/ID Columns:** (Identify columns like `LOCATION_TEXT`, `DESC`, `INCIDENTDETAILS_ID`, `ADDRESS` and consider their potential utility or need for exclusion/processing)
10. **Initial Data Quality Concerns:** (Any other immediate red flags, e.g., inconsistent values if visible in `head()`/`tail()` or unique counts from `describe()`)
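
For the date/time observation, a quick parse test shows how many values in a column can be read as dates at all. The sketch below is illustrative only: `date_parse_report` is a hypothetical helper name, and it relies on the default format inference of `pd.to_datetime`, which may not match the actual format of `DATE_UTC`, `TIME_UTC`, or `DATETIME_PST`.

```python
import pandas as pd


def date_parse_report(values: pd.Series) -> dict:
    """Hypothetical helper: count parsed, missing, and unparseable entries."""
    # errors="coerce" turns anything that cannot be parsed into NaT.
    parsed = pd.to_datetime(values, errors="coerce")
    originally_missing = int(values.isna().sum())
    return {
        "parsed": int(parsed.notna().sum()),
        "missing": originally_missing,
        # NaT values that were not missing in the input failed to parse.
        "unparseable": int(parsed.isna().sum()) - originally_missing,
    }
```

A call such as `date_parse_report(df['DATE_UTC'])` would give a first indication of whether the column needs an explicit format string or cleaning before conversion.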