# Violation Types: Distribution and Spatial Concentration

This notebook analyzes the distribution and spatial concentration of violation types in the MTA Bus Automated Camera Enforcement Violations dataset. We visualize the most frequent types, map their locations, state a hypothesis, and summarize findings.

## 1. Import Required Libraries

We will use pandas, numpy, matplotlib, seaborn, and optionally folium/geopandas for spatial analysis.

In [11]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

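# Optional: for spatial analysis
try:
    import folium
    import geopandas as gpd
except ImportError:
    # The mapping cell falls back to a static matplotlib scatter plot
    print('folium/geopandas not installed. Interactive mapping will fall back to a static scatter plot.')

sns.set_theme(style='whitegrid')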

## Download All Datasets (Automated Script)

Below is a code cell to download all referenced datasets using Python. Large files may require manual download or special handling.

In [None]:
# Download all datasets (large files may require manual download)
import os

import requests


def download_file(url, dest_folder, filename=None):
    """Stream url into dest_folder/filename, skipping files that already exist."""
    os.makedirs(dest_folder, exist_ok=True)
    if not filename:
        # Fall back to the dataset ID, the path segment before 'rows.csv'
        filename = url.split('/')[-2] + '.csv'
    dest_path = os.path.join(dest_folder, filename)
    if os.path.exists(dest_path):
        print(f"Already exists: {dest_path}")
        return
    # Write to a temporary name first so an interrupted download is not
    # mistaken for a complete file on the next run
    tmp_path = dest_path + '.part'
    try:
        print(f"Downloading {url} ...")
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(tmp_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
        os.replace(tmp_path, dest_path)
        print(f"Saved to {dest_path}")
    except Exception as e:
        print(f"Failed to download {url}: {e}")

# List of dataset URLs and filenames (CSV endpoints where possible)
datasets = [
    # MTA Bus Automated Camera Enforcement Violations
    ('https://data.ny.gov/api/views/kh8p-hcbm/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_camera_violations.csv'),
    # MTA Bus Automated Camera Enforced Routes
    ('https://data.ny.gov/api/views/ki2b-sg5y/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_camera_enforced_routes.csv'),
    # MTA Bus Hourly Ridership: 2020-2024
    ('https://data.ny.gov/api/views/kv7t-n8in/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_hourly_ridership_2020_2024.csv'),
    # MTA Bus Hourly Ridership: Beginning 2025
    ('https://data.ny.gov/api/views/gxb3-akrn/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_hourly_ridership_2025.csv'),
    # MTA Subway Origin-Destination Ridership Estimate: 2024
    ('https://data.ny.gov/api/views/jsu2-fbtj/rows.csv?accessType=DOWNLOAD', 'data', 'mta_subway_od_ridership_2024.csv'),
    # Bus Lanes - Local Streets
    ('https://data.cityofnewyork.us/api/views/rx8t-6euq/rows.csv?accessType=DOWNLOAD', 'data', 'bus_lanes_local_streets.csv'),
    # MTA Bus Wait Assessment: 2015-2019
    ('https://data.ny.gov/api/views/bmix-dpzc/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_wait_assessment_2015_2019.csv'),
    # MTA Bus Wait Assessment: 2020 - 2024
    ('https://data.ny.gov/api/views/swky-c3v4/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_wait_assessment_2020_2024.csv'),
    # MTA Bus Route Segment Speeds: 2023 - 2024
    ('https://data.ny.gov/api/views/58t6-89vi/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_route_segment_speeds_2023_2024.csv'),
    # MTA Bus Route Segment Speeds: Beginning 2025
    ('https://data.ny.gov/api/views/kufs-yh3x/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_route_segment_speeds_2025.csv'),
    # MTA Bus Customer Journey-Focused Metrics: Beginning 2025
    ('https://data.ny.gov/api/views/k5f7-e4wr/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_customer_journey_2025.csv'),
    # MTA Bus Customer Journey-Focused Metrics: 2020 - 2024
    ('https://data.ny.gov/api/views/wrt8-4b59/rows.csv?accessType=DOWNLOAD', 'data', 'mta_bus_customer_journey_2020_2024.csv'),
    # Bus Lanes (duplicate, different endpoint)
    ('https://data.cityofnewyork.us/api/views/ycrg-ses3/rows.csv?accessType=DOWNLOAD', 'data', 'bus_lanes.csv'),
    # 2020 Community District Tabulation Areas (CDTAs)
    ('https://data.cityofnewyork.us/api/views/xn3r-zk6y/rows.csv?accessType=DOWNLOAD', 'data', 'cdtas_2020.csv'),
    # 2020 Census Tracts to 2020 NTAs and CDTAs Equivalency
    ('https://data.cityofnewyork.us/api/views/hm78-6dwm/rows.csv?accessType=DOWNLOAD', 'data', 'census_tracts_to_nta_cdta.csv'),
    # 2020 Neighborhood Tabulation Areas (NTAs)
    ('https://data.cityofnewyork.us/api/views/9nt8-h7nd/rows.csv?accessType=DOWNLOAD', 'data', 'ntas_2020.csv'),
    # 2020 Census Tracts
    ('https://data.cityofnewyork.us/api/views/63ge-mke6/rows.csv?accessType=DOWNLOAD', 'data', 'census_tracts_2020.csv'),
]

for url, folder, fname in datasets:
    download_file(url, folder, fname)

print('Note: Some datasets are extremely large and may require manual download or special handling. For GTFS, Keeping Track Online, and Commute Times, please download manually from the provided links.')

Downloading https://data.ny.gov/api/views/kh8p-hcbm/rows.csv?accessType=DOWNLOAD ...


## 2. Load the MTA Bus Automated Camera Enforcement Violations Dataset

We will load the dataset and print its columns to identify the violation type and location fields.

In [None]:
# Load the main dataset. Edit user_path if the CSV is stored elsewhere.
# Note: reading a Text widget's value in the same cell that displays it
# always returns the default, so the path is set directly instead.
import re
from pathlib import Path

from IPython.display import display

user_path = Path('../data/mta_bus_camera_violations.csv')
if not user_path.exists():
    # The download cell saves into ./data relative to the working directory
    fallback = Path('data/mta_bus_camera_violations.csv')
    if fallback.exists():
        user_path = fallback

try:
    df = pd.read_csv(user_path)
    print(f"Loaded {df.shape[0]:,} rows, {df.shape[1]} columns from {user_path}.")
    display(df.head(3))
    print("\nColumns:", list(df.columns))
    # Diagnostics: check for likely type and location columns.
    # Location keywords must start a word: a plain substring test for 'lat'
    # would also match every column whose name contains 'violation'.
    type_cols = [c for c in df.columns if 'type' in c.lower() or 'violation' in c.lower()]
    loc_pattern = re.compile(r'(?<![a-z])(lat|lon|location|coord|address)')
    loc_cols = [c for c in df.columns if loc_pattern.search(c.lower())]
    print("\nLikely type columns:", type_cols)
    print("Likely location columns:", loc_cols)
except Exception as e:
    print("Error loading data:", e)

Error loading data: [Errno 2] No such file or directory: '../data/mta_bus_camera_violations.csv'
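
Since some of these files are very large, one option is to read a random sample instead of the full table. The sketch below reads the CSV in chunks and keeps a fraction of each chunk. The helper name `read_csv_sample`, its parameters, and the demo columns are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd


def read_csv_sample(source, frac=0.01, chunksize=100_000, seed=0):
    """Return a random sample of roughly `frac` of the rows in a CSV.

    The file is read in chunks so the full table never sits in memory.
    """
    samples = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so chunks are not sampled identically
        samples.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(samples, ignore_index=True)


# Tiny in-memory CSV standing in for the real file (hypothetical columns)
demo_csv = io.StringIO(
    "violation_type,latitude\n"
    + "\n".join(f"T{i % 3},{40 + i / 1000}" for i in range(1000))
)
sample = read_csv_sample(demo_csv, frac=0.1, chunksize=250)
print(len(sample), "rows sampled")
```

For the real file, `source` would be the CSV path. A sample drawn this way is adequate for inspecting columns and rough proportions, but rare violation types may be missing from it.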


In [None]:
# Data quality diagnostics: missing values and basic stats
if 'df' in locals():
    print('Missing values by column:')
    print(df.isnull().mean().sort_values(ascending=False).head(10))
    print('\nRows with any missing values:', df.isnull().any(axis=1).sum())
    print('\nBasic stats:')
    display(df.describe(include='all').T.head(10))
else:
    print('Dataframe not loaded.')

Dataframe not loaded.


## 3. Diagnostics: Check for Type and Location Columns

If type or location columns are missing or uninformative, fall back to alternative columns or aggregate at a higher level.


In [None]:
# Guard: Only run if df is defined
import re

if 'df' in locals():
    # Try to find the best available columns for violation type and location
    violation_type_col = None
    location_cols = []
    # Prefer a 'type' column, then a 'violation' column; require more than
    # one distinct value so the column is informative
    for c in df.columns:
        if 'type' in c.lower() and df[c].nunique() > 1:
            violation_type_col = c
            break
    if not violation_type_col:
        for c in df.columns:
            if 'violation' in c.lower() and df[c].nunique() > 1:
                violation_type_col = c
                break
    if not violation_type_col:
        print('No clear violation type column found. Will aggregate by all available columns.')
    else:
        print(f'Using violation type column: {violation_type_col}')
    # Location columns. Keywords must start a word: a plain substring test
    # for 'lat' would also match every column containing 'violation'.
    loc_pattern = re.compile(r'(?<![a-z])(lat|lon|location|coord|address)')
    for c in df.columns:
        if loc_pattern.search(c.lower()):
            location_cols.append(c)
    if not location_cols:
        print('No clear location columns found. Will aggregate by type only.')
    else:
        print(f'Using location columns: {location_cols}')
else:
    print('Dataframe not loaded. Please run the data loading cell first.')
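
Keyword matching on column names can produce false positives, because the word "violation" itself contains the substring "lat". The sketch below contrasts a plain substring test with a token-aware regular expression. The column names are hypothetical and chosen only to show the effect; the real dataset's columns should be checked from the load cell's output.

```python
import re

# Hypothetical column names, for illustration only
columns = ["Violation ID", "Violation Type", "Violation Latitude", "Violation Longitude"]

# Naive substring test: "violation" itself contains "lat"
naive = [c for c in columns if "lat" in c.lower()]

# Token-aware test: the keyword must not be embedded inside a longer word
LAT_RE = re.compile(r"(?<![a-z])lat(itude)?(?![a-z])")
LON_RE = re.compile(r"(?<![a-z])(lon|lng|long|longitude)(?![a-z])")

lat_cols = [c for c in columns if LAT_RE.search(c.lower())]
lon_cols = [c for c in columns if LON_RE.search(c.lower())]

print("naive 'lat' matches:", naive)
print("latitude columns:", lat_cols)
print("longitude columns:", lon_cols)
```

With these names the naive test matches all four columns, while the token-aware patterns select only the latitude and longitude columns.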

## 4. EDA: Top Violation Types

Visualize the most frequent violation types. If no type column was found, the cell instead lists the columns with the most unique values.

In [None]:
# EDA: Top violation types
if violation_type_col:
    top_types = df[violation_type_col].value_counts().head(10)
    plt.figure(figsize=(10,5))
    sns.barplot(x=top_types.values, y=top_types.index, palette='viridis')
    plt.title('Top Violation Types')
    plt.xlabel('Count')
    plt.ylabel('Violation Type')
    plt.show()
    print(top_types)
else:
    print('No violation type column found. Showing top columns by unique values:')
    nunique = df.nunique().sort_values(ascending=False)
    print(nunique.head(10))

In [None]:
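# Time trend of top violation types (if date/time column exists)
date_col = next((c for c in df.columns if 'date' in c.lower() or 'time' in c.lower()), None)
if violation_type_col and date_col:
    df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
    # Recompute here so the cell does not depend on the bar chart cell
    top_types = df[violation_type_col].value_counts().head(10)
    # Group by calendar month; rows with unparseable dates (NaT) are dropped
    month = df[date_col].dt.to_period('M')
    trend = df.groupby([month, violation_type_col]).size().unstack(fill_value=0)
    # reindex avoids a KeyError if a top type has no rows with a valid date
    trend.reindex(columns=top_types.index, fill_value=0).plot(figsize=(12,6))
    plt.title('Monthly Trend of Top Violation Types')
    plt.ylabel('Count')
    plt.xlabel('Month')
    plt.legend(title='Violation Type')
    plt.show()
else:
    print('Time trend skipped: missing violation type or date/time column.')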

In [None]:
# Pie chart and percentage table for top violation types
if violation_type_col:
    top_types = df[violation_type_col].value_counts().head(10)
    # Percentages are shares within the top 10, not of all violations
    pct = top_types / top_types.sum() * 100
    fig, ax = plt.subplots(figsize=(7,7))
    ax.pie(top_types, labels=top_types.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('viridis', n_colors=len(top_types)))
    ax.set_title('Top Violation Types (Share)')
    plt.show()
    pct_table = pd.DataFrame({'Count': top_types, 'Percent': pct.round(2)})
    display(pct_table)
else:
    print('No violation type column for pie chart.')

## 5. Spatial Mapping of Violation Types

Map the locations of top violation types if location columns are available. If not, skip spatial mapping.

In [None]:
# Spatial mapping of top violation types (if possible)
import re

if location_cols and violation_type_col:
    # Match 'lat'/'lon' only at the start of a word: a plain substring test
    # would pick any column containing 'violation' as the latitude column
    lat_col = next((c for c in location_cols if re.search(r'(?<![a-z])lat', c.lower())), None)
    lon_col = next((c for c in location_cols if re.search(r'(?<![a-z])(lon|lng)', c.lower())), None)
    if lat_col and lon_col:
        top_type = df[violation_type_col].value_counts().idxmax()
        df_top = df[df[violation_type_col] == top_type].copy()
        df_top = df_top.dropna(subset=[lat_col, lon_col])
        if 'folium' in globals():
            # Cap the number of markers (5000 is an arbitrary choice): one
            # CircleMarker per row does not scale to large tables
            df_map = df_top.sample(n=min(len(df_top), 5000), random_state=0)
            m = folium.Map(location=[df_map[lat_col].mean(), df_map[lon_col].mean()], zoom_start=12)
            for lat, lon in zip(df_map[lat_col], df_map[lon_col]):
                folium.CircleMarker(location=[lat, lon], radius=2, color='blue', fill=True).add_to(m)
            display(m)
        else:
            plt.figure(figsize=(8,6))
            plt.scatter(df_top[lon_col], df_top[lat_col], alpha=0.3, s=10)
            plt.title(f'Spatial Distribution of Top Violation Type: {top_type}')
            plt.xlabel('Longitude')
            plt.ylabel('Latitude')
            plt.show()
    else:
        print('No latitude/longitude columns found for mapping.')
else:
    print('Spatial mapping skipped: missing type or location columns.')

## 6. Hypothesis and Summary of Findings

- **Hypothesis:** Certain violation types are more prevalent in specific locations or corridors, possibly due to street design, enforcement patterns, or bus route characteristics.
- **Summary:** Summarize the most common violation types, their spatial concentration (if mapped), and any notable patterns or outliers.
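
One way to put a number on "spatial concentration" for the hypothesis above is to bin points into a coarse latitude/longitude grid and measure how much of the total falls into the busiest cells. The sketch below is a minimal example under stated assumptions: the function name `concentration_summary`, the column names `lat` and `lon`, the 3-decimal grid, and the demo coordinates are all illustrative, not taken from the dataset.

```python
import pandas as pd


def concentration_summary(df, lat_col, lon_col, decimals=3, top_k=10):
    """Bin points into a lat/lon grid and summarize how concentrated they are."""
    pts = df[[lat_col, lon_col]].dropna()
    # Rounding to 3 decimals gives cells of roughly 100 m in latitude
    cells = pts.round(decimals).value_counts()  # counts per cell, descending
    shares = cells / cells.sum()
    return {
        "n_cells": int(len(cells)),
        # Fraction of all points that fall in the top_k busiest cells
        "top_k_share": float(shares.head(top_k).sum()),
        # Herfindahl-Hirschman index: 1/n_cells if uniform, 1.0 if one cell
        "hhi": float((shares ** 2).sum()),
    }


# Made-up coordinates: 6 points in one cell, 3 in another, 1 in a third
demo = pd.DataFrame({
    "lat": [40.7501] * 6 + [40.7602] * 3 + [40.7703],
    "lon": [-73.9901] * 6 + [-73.9802] * 3 + [-73.9703],
})
summary = concentration_summary(demo, "lat", "lon", top_k=1)
print(summary)
```

Computing this summary separately for each violation type would allow a direct comparison: a type with a much higher top-k share or index than the others is more spatially concentrated.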

## 7. Next Steps and Further Analysis Suggestions

- Explore correlations between violation types and external factors (weather, events, bus routes).
- Use clustering or anomaly detection to find unusual patterns.
- Compare violation patterns across different boroughs or time periods.
- Share findings with stakeholders and iterate based on feedback.
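
The comparison across boroughs or time periods suggested above can be sketched as a row-normalized cross-tabulation, which shows the mix of violation types within each group. This is only a sketch: the helper `type_share_by_group`, the `borough` column, and the type labels are hypothetical, and a grouping column such as borough would first have to be present in or derived for the real data.

```python
import pandas as pd


def type_share_by_group(df, group_col, type_col):
    """Row-normalized share of each violation type within each group."""
    return pd.crosstab(df[group_col], df[type_col], normalize="index")


# Hypothetical data: column names and values are illustrative only
demo = pd.DataFrame({
    "borough": ["A", "A", "A", "A", "B", "B"],
    "violation_type": [
        "bus lane", "bus lane", "bus lane", "bus stop",
        "bus stop", "double parked",
    ],
})
shares = type_share_by_group(demo, "borough", "violation_type")
print(shares.round(2))
```

Each row sums to 1, so groups of very different sizes can be compared on the same scale. The same function works for time periods by passing a month or year column as `group_col`.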