# NYC Motor Vehicle Collisions Analysis

## 1. Introduction
This study analyzes motor vehicle collisions in New York City using NYPD incident data. The objective is to identify patterns in traffic safety through data visualization and statistical testing. The analysis covers data cleaning, exploratory trends, and probability modeling.

## 2. Data Acquisition
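The dataset is retrieved from the NYC Open Data portal (Socrata resource `h9gi-nx95`). The 100,000 most recent records, ordered by `crash_date`, are downloaded once, cached locally as a CSV file, and used for the analysis that follows.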

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
import os

# Plotting configuration
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12
sns.set_palette("viridis")

In [None]:
DATA_FILE = 'nyc_collisions_subset.csv'
if not os.path.exists(DATA_FILE):
    url = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=100000&$order=crash_date%20DESC"
    df = pd.read_csv(url, low_memory=False)
    df.to_csv(DATA_FILE, index=False)
else:
    df = pd.read_csv(DATA_FILE, low_memory=False)

print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
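
The request URL in the cell above encodes its query parameters by hand. As a minimal sketch, the same URL can be assembled with the standard library; `build_socrata_url` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.csv"

def build_socrata_url(limit=100000, order="crash_date DESC"):
    """Hypothetical helper: assemble the query string used in the cell above."""
    params = {"$limit": limit, "$order": order}
    # safe="$" leaves the "$" prefix unescaped; quote_via=quote encodes spaces as %20
    return f"{BASE_URL}?{urlencode(params, safe='$', quote_via=quote)}"

url = build_socrata_url()
print(url)
```

Keeping the limit and ordering as arguments makes it easier to change the size of the subset without editing an escaped string.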

## 3. Data Cleaning and Preparation
This section covers the handling of missing values, geographic filtering, and the creation of new features for analysis.

### 3.1 Analysis of Missing Data

In [None]:
null_pct = (df.isnull().sum() / len(df)) * 100
plt.figure(figsize=(10, 5))
null_pct[null_pct > 0].sort_values().plot(kind='barh')
plt.title('Percentage of Missing Values per Column')
plt.xlabel('Percentage (%)')
plt.show()

### 3.2 Geographic Filtering and Validation
Records are restricted to a latitude/longitude bounding box around New York City, rows without a borough are dropped, and duplicate `collision_id` values are removed.

In [None]:
# Filter for NYC coordinates
df_clean = df[
    (df['latitude'].between(40.4, 41.0)) &
    (df['longitude'].between(-74.3, -73.6))
].dropna(subset=['borough']).copy()

df_clean = df_clean.drop_duplicates(subset=['collision_id'])
print(f"Cleaned record count: {len(df_clean)}")
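
It can help to see how many rows each cleaning step removes. The sketch below applies the same three steps to a small made-up frame and records the row count after each one; the toy values and the `audit` name are assumptions for illustration only.

```python
import pandas as pd

# Toy records standing in for the real data (values are made up)
toy = pd.DataFrame({
    "collision_id": [1, 2, 2, 3, 4, 5],
    "latitude": [40.70, 40.80, 40.80, 0.0, 40.60, None],
    "longitude": [-73.90, -73.95, -73.95, 0.0, -74.00, -73.90],
    "borough": ["BROOKLYN", "MANHATTAN", "MANHATTAN", "QUEENS", None, "BRONX"],
})

# Same bounding box as the cell above; between() is False for missing values
in_box = toy["latitude"].between(40.4, 41.0) & toy["longitude"].between(-74.3, -73.6)

step1 = toy[in_box]                                     # inside the bounding box
step2 = step1.dropna(subset=["borough"])                # borough is present
step3 = step2.drop_duplicates(subset=["collision_id"])  # one row per collision

audit = pd.Series({
    "raw": len(toy),
    "inside bounding box": len(step1),
    "with borough": len(step2),
    "unique collision_id": len(step3),
})
print(audit)
```

In this toy frame the row with coordinates (0, 0) and the row with a missing latitude both fall outside the box, so they are removed in the first step.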

### 3.3 Feature Engineering
New variables are created to facilitate temporal and categorical analysis.

In [None]:
df_clean['crash_date'] = pd.to_datetime(df_clean['crash_date'])
df_clean['hour'] = pd.to_datetime(df_clean['crash_time'], format='%H:%M').dt.hour
df_clean['month'] = df_clean['crash_date'].dt.month
df_clean['day_name'] = df_clean['crash_date'].dt.day_name()

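# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are Saturday and Sunday
df_clean['is_weekend'] = df_clean['crash_date'].dt.dayofweek.isin([5, 6])
# Rush hour is taken here as 07:00-09:59 and 16:00-18:59
df_clean['is_rush_hour'] = df_clean['hour'].isin([7,8,9,16,17,18])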

def categorize_vehicle(v):
    v = str(v).lower()
    if 'sedan' in v: return 'Sedan'
    if 'utility' in v or 'suv' in v: return 'SUV'
    if 'truck' in v or 'van' in v or 'bus' in v: return 'Commercial'
    if 'bike' in v or 'motorcycle' in v: return 'Two-Wheeled'
    return 'Other'

df_clean['vehicle_group'] = df_clean['vehicle_type_code1'].apply(categorize_vehicle)
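
Because `categorize_vehicle` relies on substring matching, a quick spot check on a few strings shows where the rules are broader than they look. The function is repeated so the example runs on its own; the sample strings are illustrative and are not taken from the notebook's output.

```python
def categorize_vehicle(v):
    # Same substring rules as the cell above
    v = str(v).lower()
    if 'sedan' in v: return 'Sedan'
    if 'utility' in v or 'suv' in v: return 'SUV'
    if 'truck' in v or 'van' in v or 'bus' in v: return 'Commercial'
    if 'bike' in v or 'motorcycle' in v: return 'Two-Wheeled'
    return 'Other'

# Illustrative inputs, including a missing value
samples = [
    'Sedan',
    'Station Wagon/Sport Utility Vehicle',
    'Pick-up Truck',
    'E-Bike',
    'Taxi',
    'Minivan',
    float('nan'),
]

results = {str(s): categorize_vehicle(s) for s in samples}
for label, group in results.items():
    print(f"{label!r:40} -> {group}")
```

Two things stand out under these rules: a string such as 'Minivan' lands in 'Commercial' because it contains 'van', and missing values become the string 'nan' and fall through to 'Other'.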

## 4. Exploratory Data Analysis


### 4.1 Temporal Trends
Collision frequency by calendar month. Hourly patterns are examined in Section 4.2.

In [None]:
monthly_counts = df_clean.groupby('month').size()
plt.figure(figsize=(10, 5))
plt.plot(monthly_counts.index, monthly_counts.values, marker='o')
plt.title('Monthly Collision Frequency')
plt.xlabel('Month')
plt.ylabel('Count')
plt.xticks(range(1, 13))
plt.show()
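
Since the subset is a fixed number of the most recent records, some months may be covered by fewer days than others, which can distort raw monthly totals. A minimal sketch of one way to account for this, using made-up dates, divides each month's count by the number of distinct days observed in that month.

```python
import pandas as pd

# Toy dates (made up): January has 2 observed days, February only 1
toy = pd.DataFrame({"crash_date": pd.to_datetime([
    "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02",
    "2024-02-01", "2024-02-01", "2024-02-01",
])})
toy["month"] = toy["crash_date"].dt.month

# Count records and distinct days per month, then normalize
by_month = toy.groupby("month")["crash_date"].agg(
    collisions="count", days_observed="nunique"
)
by_month["collisions_per_day"] = by_month["collisions"] / by_month["days_observed"]
print(by_month)
```

In the toy data January has more collisions in total, but February has the higher daily rate, so the two views can rank months differently.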

### 4.2 Collision Intensity (Heatmap)
The heatmap shows the number of collisions for each combination of day of the week and hour of the day.

In [None]:
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot = df_clean.pivot_table(index='day_name', columns='hour', values='collision_id', aggfunc='count').reindex(day_order)

plt.figure(figsize=(12, 6))
sns.heatmap(pivot, cmap='Blues')
plt.title('Accident Intensity: Day vs. Hour')
plt.show()
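
If a day or hour has no records, `pivot_table` simply omits it, which leaves gaps or missing columns in the heatmap. The sketch below, on made-up rows, reindexes the pivot to all 7 days and 24 hours and fills empty cells with zero.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Toy rows (made up) covering only three days and three hours
toy = pd.DataFrame({
    "collision_id": [1, 2, 3, 4],
    "day_name": ["Monday", "Monday", "Sunday", "Friday"],
    "hour": [8, 8, 23, 17],
})

pivot = (
    toy.pivot_table(index="day_name", columns="hour",
                    values="collision_id", aggfunc="count")
       .reindex(index=day_order, columns=range(24))  # force the full 7 x 24 grid
       .fillna(0)                                     # no records means zero collisions
       .astype(int)
)
print(pivot.shape)
```

The resulting frame always has the same shape, so heatmaps built from different subsets stay directly comparable.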

### 4.3 Severity and Victim Analysis

In [None]:
victim_data = df_clean[['number_of_pedestrians_injured', 'number_of_cyclist_injured', 'number_of_motorist_injured']].sum()
plt.figure(figsize=(7, 7))
plt.pie(victim_data, labels=['Pedestrians', 'Cyclists', 'Motorists'], autopct='%1.1f%%')
plt.title('Proportion of Injuries by Victim Type')
plt.show()

## 5. Statistical Analysis and Probability


### 5.1 Probability of Injury by Contributing Factor

In [None]:
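# Flag collisions in which at least one person was injured
df_clean['has_injury'] = df_clean['number_of_persons_injured'] > 0
# Share of collisions with an injury per factor, keeping the 10 highest rates.
# Rare factors with only a few records can rank high by chance.
injury_prob = df_clean.groupby('contributing_factor_vehicle_1')['has_injury'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 5))
injury_prob.plot(kind='bar')
plt.title('Top 10 Contributing Factors by Injury Probability')
plt.ylabel('P(Injury)')
plt.show()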
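
Ranking factors by injury rate alone can put very rare factors at the top. One possible safeguard, sketched here on made-up data, is to compute the rate together with the number of records and keep only factors above a minimum count; `MIN_COUNT` and the 'Rare Factor' label are hypothetical.

```python
import pandas as pd

# Toy data (made up): one factor appears only once
toy = pd.DataFrame({
    "contributing_factor_vehicle_1":
        ["Unspecified"] * 6 + ["Driver Inattention/Distraction"] * 4 + ["Rare Factor"],
    "has_injury":
        [True, False, False, False, False, False,
         True, True, False, False,
         True],
})

MIN_COUNT = 3  # hypothetical threshold, chosen only for this toy data

# Injury rate and sample size per factor
summary = toy.groupby("contributing_factor_vehicle_1")["has_injury"].agg(
    p_injury="mean", n="count"
)
# Keep factors with enough records, then rank by injury rate
reliable = summary[summary["n"] >= MIN_COUNT].sort_values("p_injury", ascending=False)
print(reliable)
```

Without the filter, the single-record factor would rank first with a rate of 1.0; with it, the ranking reflects only factors that have enough observations to be meaningful.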

### 5.2 Poisson Distribution Analysis
Daily collision counts are compared with a Poisson distribution whose rate parameter is set to the sample mean of the daily counts.

In [None]:
daily_freq = df_clean.groupby('crash_date').size()
lam = daily_freq.mean()
plt.figure(figsize=(10, 5))
sns.histplot(daily_freq, stat='density', label='Actual')
# np.arange excludes the stop value, so add 1 to include the largest daily count
x = np.arange(daily_freq.min(), daily_freq.max() + 1)
plt.plot(x, stats.poisson.pmf(x, lam), 'r--', label='Poisson Fit')
plt.title('Daily Accident Counts vs. Poisson Fit')
plt.legend()
plt.show()
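
A Poisson distribution has equal mean and variance, so the ratio of sample variance to sample mean gives a quick numeric check alongside the plot. The sketch below uses made-up daily counts purely to show the calculation.

```python
import numpy as np

# Made-up daily counts, used only to illustrate the calculation
daily_counts = np.array([250, 270, 230, 310, 190, 260, 240])

mean = daily_counts.mean()
variance = daily_counts.var(ddof=1)  # sample variance
dispersion = variance / mean         # close to 1 if the counts are Poisson-like

print(f"mean={mean:.1f}, variance={variance:.1f}, dispersion index={dispersion:.2f}")
```

A ratio well above 1 suggests overdispersion, in which case a Poisson fit will look too narrow and a more flexible count model, such as the negative binomial, may be worth considering.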

### 5.3 Hypothesis Testing (ANOVA)
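A one-way ANOVA compares the mean number of persons injured per collision across boroughs.

In [None]:
# One sample of per-collision injury counts per borough. Missing values are
# dropped because f_oneway returns nan if any group contains them.
groups = [g.dropna() for _, g in df_clean.groupby('borough')['number_of_persons_injured']]
f_stat, p_val = stats.f_oneway(*groups)
# .4g keeps very small p-values visible instead of printing 0.0000
print(f"F-Statistic: {f_stat:.4f}, P-Value: {p_val:.4g}")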
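
With a large sample, even very small differences between boroughs can produce a small p-value, so an effect size is a useful companion to the F-statistic. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) on made-up groups; `eta_squared` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def eta_squared(groups):
    """Hypothetical helper: share of total variance explained by group membership."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group sum of squares, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Total sum of squares around the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Made-up groups used only to show the calculation
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
effect = eta_squared(toy_groups)
print(f"eta squared: {effect:.4f}")
```

The same list of per-borough samples passed to the ANOVA could be passed to this function, assuming missing values have been removed first.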

## 6. Geospatial Density Analysis

In [None]:
plt.figure(figsize=(10, 10))
plt.hexbin(df_clean['longitude'], df_clean['latitude'], gridsize=100, cmap='YlOrRd', bins='log')
plt.title('Geospatial Collision Density in NYC')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
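
The hexbin plot shows density visually but does not report where the densest area is. As a rough numeric counterpart, the sketch below bins made-up points on a rectangular grid with `np.histogram2d` (a simpler stand-in for hexagonal bins) over the same bounding box and locates the cell with the most points.

```python
import numpy as np

# Made-up points: three close together and one far away
lon = np.array([-73.970, -73.971, -73.969, -74.200])
lat = np.array([40.750, 40.751, 40.749, 40.550])

# 10 x 10 rectangular grid over the bounding box used in the cleaning step
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=10, range=[[-74.3, -73.6], [40.4, 41.0]]
)

# Index of the cell with the highest count
i, j = np.unravel_index(counts.argmax(), counts.shape)
print(f"Densest cell: lon [{lon_edges[i]:.2f}, {lon_edges[i + 1]:.2f}], "
      f"lat [{lat_edges[j]:.2f}, {lat_edges[j + 1]:.2f}], count={int(counts[i, j])}")
```

The grid size of 10 is arbitrary here; a finer grid would localize hotspots more precisely at the cost of noisier counts.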

## 7. Summary of Findings
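The analysis examines temporal patterns (by month, day of the week, and hour) and geographic patterns in New York City collisions. The one-way ANOVA in Section 5.3 tests whether the mean number of persons injured per collision differs across boroughs; differences across time periods are explored visually but are not formally tested here. These results provide a basis for further safety investigations.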