## Traffic Data Analysis Brainstorming

My area of study was "How do certain driver behaviors contribute to vehicular accidents in the Tempe area?"

The professor has asked us to bear in mind: "If a person who is under the influence of alcohol is more likely to have an accident, how is that only part of the story?"

We'll have to make sure that we're using the right statistical tests and sampling.

### Load and quick look at the dataset values

In [1]:
# display every bare expression in a cell, not only the last one (no print needed)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import pandas as pd
pd.set_option('display.max_columns', None)

df = pd.read_csv("data.csv")
#print(df)
df.tail(2)

Unnamed: 0,X,Y,OBJECTID,Incidentid,DateTime,Year,StreetName,CrossStreet,Distance,JunctionRelation,Totalinjuries,Totalfatalities,Injuryseverity,Collisionmanner,Lightcondition,Weather,SurfaceCondition,Unittype_One,Age_Drv1,Gender_Drv1,Traveldirection_One,Unitaction_One,Violation1_Drv1,AlcoholUse_Drv1,DrugUse_Drv1,Unittype_Two,Age_Drv2,Gender_Drv2,Traveldirection_Two,Unitaction_Two,Violation1_Drv2,AlcoholUse_Drv2,DrugUse_Drv2,Latitude,Longitude
51303,-111.926404,33.435576,51304,4155885.0,2024/03/14 13:20:00+00,2024.0,SR-202 Exit 7 T-Ramp,,0.0,Entrance Exit Ramp 205,0.0,0.0,No Injury,Rear End,Daylight,Clear,Dry,Driver,40.0,Male,East,Making Right Turn,Followed Too Closely,No Apparent Influence,No Apparent Influence,Driver,25.0,Female,West,Making Right Turn,No Improper Action,No Apparent Influence,No Apparent Influence,33.435576,-111.926404
51304,-111.909869,33.436621,51305,4155890.0,2024/03/15 07:11:00+00,2024.0,SR-202 Exit 8 J-Ramp,,0.0,Entrance Exit Ramp 205,0.0,0.0,No Injury,Single Vehicle,Dark Lighted,Clear,Dry,Driver,26.0,Male,North,Making Left Turn,Speed To Fast For Conditions,No Apparent Influence,No Apparent Influence,,,,,,,,,33.436621,-111.909869


#### Get some statistics around interesting values

<b><u><i>First - our numeric data</i></u></b>

Looks like we have some bad data in here:

Ages range from 2 to 255.  Are these special-use constants or typos?  Need to dig deeper later.  I'm guessing the younger ones are pedestrians (kids) hit by drivers, and that values of 111 and up are either defaults or codes.

In [2]:
# Get stats for numeric columns of interest
cols = ['Distance', 'Totalinjuries', 'Totalfatalities', 'Age_Drv1', 'Age_Drv2']

num_desc = df[cols].describe().T  # transpose for readability
num_desc

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Distance,51303.0,4.49004,269.258463,-5280.0,-69.0,0.0,75.0,5377.152
Totalinjuries,51303.0,0.457556,0.80677,0.0,0.0,0.0,1.0,9.0
Totalfatalities,51303.0,0.003138,0.058654,0.0,0.0,0.0,0.0,3.0
Age_Drv1,51256.0,47.087151,41.140888,2.0,23.0,32.0,54.0,255.0
Age_Drv2,46547.0,39.26803,22.146805,2.0,24.0,34.0,49.0,255.0
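
One way to start digging into the suspicious ages is to bucket them. This is only a minimal sketch on a made-up sample: the `flag_age` helper and both cut-offs (16 and 110) are my own assumptions for exploration, not anything documented for this dataset.

```python
import pandas as pd

# Toy sample standing in for df['Age_Drv1']; the values are made up
ages = pd.Series([2, 15, 23, 32, 54, 99, 111, 255, None], name='Age_Drv1')

def flag_age(age, min_driving_age=16, max_plausible_age=110):
    """Label an age as missing, under driving age, plausible, or a suspected code.

    Both thresholds are assumptions for exploration, not dataset rules.
    """
    if pd.isna(age):
        return 'missing'
    if age < min_driving_age:
        return 'under driving age'
    if age > max_plausible_age:
        return 'suspected code/default'
    return 'plausible'

# Apply the label to every age, then count how many fall in each bucket
age_flags = ages.apply(flag_age)
age_flag_counts = age_flags.value_counts()
age_flag_counts
```

On the real data, the same call on `df['Age_Drv1']` and `df['Age_Drv2']` would show how many rows sit in each bucket before deciding what to drop or recode.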


<b><u><i>Now - our categorical data</i></u></b>

In [3]:
# Get describe output for all categorical/object columns
cat_desc = df.describe(include='object').T  # transpose for readability

# Optional: sort by number of unique values
cat_desc = cat_desc.sort_values("unique", ascending=False)

cat_desc

Unnamed: 0,count,unique,top,freq
DateTime,51304,50842,2022/05/10 23:18:00+00,3
CrossStreet,50817,708,Rural Rd,2516
StreetName,51282,525,Rural Rd,5178
JunctionRelation,51303,35,Not Junction Related,17049
Violation1_Drv1,51256,28,Speed To Fast For Conditions,15134
Unitaction_Two,47916,24,Going Straight Ahead,22295
Violation1_Drv2,46547,23,No Improper Action,42924
Unitaction_One,51303,23,Going Straight Ahead,24779
Collisionmanner,51303,14,Rear End,18415
Traveldirection_Two,47916,10,East,12572


#### Injury/Severity is a first-class outcome

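Let's look at our counts by value first.  The counts sum to 51,303, one short of the 51,304 non-null DateTime values above, and value_counts() drops NaN by default, so nulls still need a separate check.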

In [4]:
severity_counts = df['Injuryseverity'].value_counts()
severity_counts

Injuryseverity
No Injury                    34846
Possible Injury               7962
Non Incapacitating Injury     4411
Suspected Minor Injury        3046
Incapacitating Injury          559
Suspected Serious Injury       325
Fatal                          154
Name: count, dtype: int64
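
Because `value_counts()` leaves NaN out unless told otherwise, a null check needs its own step. A minimal sketch on a toy series (the values are made up; on the real data the series would be `df['Injuryseverity']`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['Injuryseverity']; one value is deliberately missing
severity = pd.Series(['No Injury', 'Fatal', np.nan, 'No Injury'],
                     name='Injuryseverity')

# Explicit null count
n_missing = severity.isna().sum()

# dropna=False gives NaN its own row instead of silently skipping it
counts_with_nan = severity.value_counts(dropna=False)
counts_with_nan
```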

Dig a little deeper and pair these with the numeric values

In [5]:
# Group by Injuryseverity and calculate stats for injuries and fatalities
severity_stats = df.groupby("Injuryseverity").agg(
    TotalAccidents=('OBJECTID','count'),
    TotalInjuries=('Totalinjuries','sum'),
    TotalFatalities=('Totalfatalities','sum'),
    AvgInjuries=('Totalinjuries','mean'),
    AvgFatalities=('Totalfatalities','mean')
).reset_index().sort_values("TotalAccidents", ascending=False)

severity_stats

Unnamed: 0,Injuryseverity,TotalAccidents,TotalInjuries,TotalFatalities,AvgInjuries,AvgFatalities
2,No Injury,34846,0.0,0.0,0.0,0.0
4,Possible Injury,7962,10724.0,0.0,1.346898,0.0
3,Non Incapacitating Injury,4411,6583.0,0.0,1.492405,0.0
5,Suspected Minor Injury,3046,4738.0,0.0,1.555483,0.0
1,Incapacitating Injury,559,875.0,0.0,1.565295,0.0
6,Suspected Serious Injury,325,473.0,0.0,1.455385,0.0
0,Fatal,154,81.0,161.0,0.525974,1.045455


#### Collision manner could be related to behaviors

Need to dig deeper on this one.

Also need to think about how to layer in Lightcondition, Weather, SurfaceCondition

In [6]:
coll_counts = df['Collisionmanner'].value_counts()
coll_counts

# Group by Collisionmanner and calculate stats for injuries and fatalities
coll_stats = df.groupby("Collisionmanner").agg(
    TotalAccidents=('OBJECTID','count'),
    TotalInjuries=('Totalinjuries','sum'),
    TotalFatalities=('Totalfatalities','sum'),
    AvgInjuries=('Totalinjuries','mean'),
    AvgFatalities=('Totalfatalities','mean')
).reset_index().sort_values("TotalAccidents", ascending=False)

coll_stats

Collisionmanner
Rear End                                       18415
Left Turn                                       9448
Sideswipe Same Direction                        7030
ANGLE (Front To Side)(Other Than Left Turn)     5255
Angle - Other Than Left Turn 2                  3516
Single Vehicle                                  3387
Other                                           2144
Unknown                                          649
Head On                                          625
Sideswipe Opposite Direction                     407
Rear To Side                                     184
U Turn                                           155
Rear To Rear                                      61
10                                                27
Name: count, dtype: int64

Unnamed: 0,Collisionmanner,TotalAccidents,TotalInjuries,TotalFatalities,AvgInjuries,AvgFatalities
6,Rear End,18415,7405.0,6.0,0.402118,0.000326
4,Left Turn,9448,6426.0,25.0,0.680144,0.002646
10,Sideswipe Same Direction,7030,928.0,2.0,0.132006,0.000284
1,ANGLE (Front To Side)(Other Than Left Turn),5255,2999.0,11.0,0.570695,0.002093
2,Angle - Other Than Left Turn 2,3516,2326.0,7.0,0.661547,0.001991
11,Single Vehicle,3387,1188.0,32.0,0.350753,0.009448
5,Other,2144,1481.0,71.0,0.690765,0.033116
13,Unknown,649,92.0,2.0,0.141757,0.003082
3,Head On,625,445.0,5.0,0.712,0.008
9,Sideswipe Opposite Direction,407,95.0,0.0,0.233415,0.0
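
One possible way to layer in Lightcondition (and, the same way, Weather or SurfaceCondition) is a pivot table with collision manner on the rows and the condition on the columns. This is a sketch on made-up rows, not a result from the dataset.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'Collisionmanner': ['Rear End', 'Rear End', 'Left Turn', 'Left Turn', 'Rear End'],
    'Lightcondition': ['Daylight', 'Dark Lighted', 'Daylight', 'Daylight', 'Daylight'],
    'Totalinjuries': [0, 1, 2, 0, 1],
})

# Crash count and mean injuries for each manner x light-condition cell;
# combinations that never occur show up as NaN
layered = toy.pivot_table(
    index='Collisionmanner',
    columns='Lightcondition',
    values='Totalinjuries',
    aggfunc=['count', 'mean'],
)
layered
```

Keeping the count next to the mean makes it easier to spot cells where the average rests on only a handful of crashes.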


### These are the fields that have two inputs - one for each driver

They need to be treated a bit differently b/c there are two variables.

In [7]:
cols = {
    'Unit Type': ['Unittype_One', 'Unittype_Two'],
    'Gender': ['Gender_Drv1', 'Gender_Drv2'],
    'Travel Direction': ['Traveldirection_One', 'Traveldirection_Two'],
    'Unit Action': ['Unitaction_One', 'Unitaction_Two'],
    'Violation': ['Violation1_Drv1', 'Violation1_Drv2'],
    'Alcohol Use': ['AlcoholUse_Drv1', 'AlcoholUse_Drv2'],
    'Drug Use': ['DrugUse_Drv1', 'DrugUse_Drv2']
}

#### A quick peek at all categorical fields that I care about

Deeper dive below

In [8]:
cat_cols = sum(cols.values(), [])  # flatten dict values into a list
cat_desc_subset = df[cat_cols].describe(include='object').T
cat_desc_subset

Unnamed: 0,count,unique,top,freq
Unittype_One,51303,4,Driver,49680
Unittype_Two,47916,4,Driver,45396
Gender_Drv1,50375,3,Male,27168
Gender_Drv2,46447,3,Male,25807
Traveldirection_One,51303,10,East,12574
Traveldirection_Two,47916,10,East,12572
Unitaction_One,51303,23,Going Straight Ahead,24779
Unitaction_Two,47916,24,Going Straight Ahead,22295
Violation1_Drv1,51256,28,Speed To Fast For Conditions,15134
Violation1_Drv2,46547,23,No Improper Action,42924


#### Like value comparison function

Our dataset has two like categorical columns for each row, one for each participant in the accident.

This function allows me to do a quick side-by-side comparison of their value counts.

In [9]:
def compare_like_columns(category):
    """Return side-by-side value counts for the two per-driver columns of a category in cols."""

    # Pick the two columns
    col1 = cols[category][0]
    col2 = cols[category][1]
    
    # Get value counts for each column
    counts_one = df[col1].value_counts(dropna=False).rename(col1)
    counts_two = df[col2].value_counts(dropna=False).rename(col2)
    
    # Combine into one DataFrame
    summary = pd.concat([counts_one, counts_two], axis=1).fillna(0).astype(int)
    summary.index.name = category
    
    return summary

#### Unit Type Comparisons

Start with a high-level comparison of driver one and driver two involved.

Lots of missing data for unit type 2 (almost 7%).  Single-car accidents perhaps? 

How should we deal with these in general?  Need to think about how to deal with them at each study level too.

Driverless is interesting, not for any other reason than it is a novel category.

In [10]:
compare_like_columns('Unit Type')

Unnamed: 0_level_0,Unittype_One,Unittype_Two
Unit Type,Unnamed: 1_level_1,Unnamed: 2_level_1
Driver,49680,45396
Pedalcyclist,1119,623
Pedestrian,457,528
Driverless,47,1369
,2,3389
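
The single-car guess can be checked by lining up the missing Unittype_Two rows against Collisionmanner. A minimal sketch on made-up rows, assuming 'Single Vehicle' is the value that marks a one-unit crash:

```python
import numpy as np
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'Collisionmanner': ['Single Vehicle', 'Rear End', 'Single Vehicle', 'Left Turn', 'Other'],
    'Unittype_Two': [np.nan, 'Driver', np.nan, 'Driver', np.nan],
})

missing_unit_two = toy['Unittype_Two'].isna()

# Rows: is unit two missing?  Columns: collision manner
overlap = pd.crosstab(
    missing_unit_two.rename('Unittype_Two missing'),
    toy['Collisionmanner'],
)

# Share of the missing-unit-two rows that are coded as single-vehicle crashes
share_single = (toy.loc[missing_unit_two, 'Collisionmanner'] == 'Single Vehicle').mean()
overlap
```

If the share on the real data is well below 1, the remaining rows are missing for some other reason and need a separate look.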


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare how often each unit type appears in accidents for Unittype_One vs Unittype_Two. This shows whether certain types (drivers, pedalcyclists, pedestrians, driverless units) are disproportionately represented in primary or secondary roles.

2. <b>Severity by unit type.</b> Look at the average number of injuries and fatalities associated with each unit type. This can be tested with ANOVA or non-parametric equivalents to see if severity differs by unit type.

3. <b>Unit-type combinations.</b> Build a cross-tab of Unittype_One × Unittype_Two to identify which pairings (for example, car vs car, car vs pedestrian, car vs bicycle) occur most often and which are associated with more severe outcomes.

4. <b>Driver behavior by unit type.</b> Compare the distribution of violations and unsafe actions across unit types. For instance, speeding may cluster with motorcycles, while pedestrians may be more associated with crosswalk violations. This can be analyzed with chi-square tests of independence.

5. <b>Environmental context.</b> Assess how unit types appear under different light, weather, or road surface conditions. For example, pedestrian accidents may be more common at night, and motorcycles may be more sensitive to wet road conditions. Logistic regression could be used to model the probability of injury given unit type and environmental factors.

6. <b>Demographics by unit type.</b> Compare age and gender distributions across drivers of different unit types. You can test whether younger drivers are disproportionately represented in motorcycle accidents or whether severity varies by age.

7. <b>Temporal trends.</b> Examine whether unit-type distributions vary by time of day or season. For example, bicycle accidents may be more frequent during commute hours, and pedestrian accidents may peak at night.
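
The unit-type combinations study could start from a plain cross-tab plus a grouped mean. This is a sketch on made-up rows; the '(none)' label for a missing second unit is my own placeholder.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'Unittype_One': ['Driver', 'Driver', 'Driver', 'Pedalcyclist', 'Driver'],
    'Unittype_Two': ['Driver', 'Pedestrian', 'Driver', 'Driver', None],
    'Totalinjuries': [0, 1, 1, 2, 0],
})

# Fill the missing second unit so those rows stay visible in the table
pairs = toy.assign(Unittype_Two=toy['Unittype_Two'].fillna('(none)'))

# How often each pairing occurs
pair_counts = pd.crosstab(pairs['Unittype_One'], pairs['Unittype_Two'])

# Average injuries for each pairing
pair_injuries = pairs.groupby(['Unittype_One', 'Unittype_Two'])['Totalinjuries'].mean()
pair_counts
```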

#### Gender Comparisons

Again, seeing reasonable amounts of missing data.

How do we want to deal with 'Unknown'?  Treat it as its own category or liken it to missing?

In [11]:
compare_like_columns('Gender')

Unnamed: 0_level_0,Gender_Drv1,Gender_Drv2
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,27168,25807
Female,18753,20198
Unknown,4454,442
,930,4858
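
The two options for 'Unknown' give different shares, so it is worth seeing both side by side before choosing. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df['Gender_Drv1']
gender = pd.Series(
    ['Male', 'Female', 'Unknown', np.nan, 'Male', 'Female', 'Male', 'Unknown'],
    name='Gender_Drv1',
)

# Option A: keep 'Unknown' (and NaN) as their own rows in the denominator
share_as_category = gender.value_counts(normalize=True, dropna=False)

# Option B: treat 'Unknown' as missing, so shares are over known genders only
gender_clean = gender.mask(gender == 'Unknown')
share_as_missing = gender_clean.value_counts(normalize=True)
share_as_missing
```

In this toy sample the male share is 0.375 under option A and 0.6 under option B, which shows how much the choice can move a headline number.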


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare how often male and female drivers appear in accidents for Gender_Drv1 vs Gender_Drv2. This shows whether one gender is overrepresented in primary or secondary roles.

2. <b>Severity by gender.</b> Examine the average number of injuries and fatalities in crashes involving male vs female drivers. This can be tested with t-tests or non-parametric equivalents to see if severity differs systematically.

3. <b>Violation patterns.</b> Cross-tab violations against gender to see whether men and women commit different types of infractions (e.g., speeding vs right-of-way failures). A chi-square test of independence can be used here.

4. <b>Substance involvement.</b> Compare alcohol and drug use rates in crashes by gender. For instance, assess whether impaired-driving incidents are disproportionately male.

5. <b>Unit type by gender.</b> Evaluate whether gender distributions differ across unit types (e.g., motorcycles vs cars vs pedestrians). This can highlight whether risk exposure differs by gender.
	
6. <b>Age × gender interactions.</b> Assess whether young male drivers differ from young female drivers in accident severity, or whether older drivers show different patterns. Two-way ANOVA or logistic regression can model these effects.

7. <b>Environmental context.</b> Compare whether accidents involving male vs female drivers occur under different light, weather, or surface conditions. For example, male drivers may be more represented in nighttime accidents.

8. <b>Temporal trends.</b> Assess whether male- vs female-involved crashes cluster at different times of day or year. This can reveal behavioral or exposure differences tied to gender.
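
For the chi-square test of independence mentioned under violation patterns, the statistic can be worked out by hand from a cross-tab to see what the test actually compares. This is a sketch on made-up counts. It stops at the statistic and degrees of freedom; a library routine such as `scipy.stats.chi2_contingency` would be needed for the p-value (and note that it applies a continuity correction to 2×2 tables by default, so its statistic can differ from the uncorrected one here).

```python
import numpy as np
import pandas as pd

# Made-up rows: 6 male and 6 female drivers with two violation types
toy = pd.DataFrame({
    'Gender_Drv1': ['Male'] * 6 + ['Female'] * 6,
    'Violation1_Drv1': (
        ['Speed To Fast For Conditions'] * 4 + ['Failed To Yield Right Of Way'] * 2
        + ['Speed To Fast For Conditions'] * 2 + ['Failed To Yield Right Of Way'] * 4
    ),
})

# Observed counts
observed = pd.crosstab(toy['Gender_Drv1'], toy['Violation1_Drv1'])

# Expected counts if gender and violation were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic (no continuity correction) and degrees of freedom
chi2_stat = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
```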

#### Travel Direction Comparisons

Again, seeing reasonable amounts of missing data, skewed HEAVILY toward unit two, which makes me think these are single-car accidents.

How do we want to deal with 'Unknown'?  Treat it as its own category or liken it to missing?

No idea what 255 means, but it's used in 74 instances.

In [12]:
compare_like_columns('Travel Direction')

Unnamed: 0_level_0,Traveldirection_One,Traveldirection_Two
Travel Direction,Unnamed: 1_level_1,Unnamed: 2_level_1
East,12574,12572
West,11645,11238
North,11517,10644
South,11394,11681
Unknown,894,518
Northwest,861,312
Southwest,806,281
Southeast,805,345
Northeast,786,272
255,21,53


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare how often each travel direction appears in Traveldirection_One vs Traveldirection_Two to establish baseline representation.

2. <b>Directional combinations.</b> Cross-tabulate unit one vs unit two travel directions to identify high-risk conflict patterns such as head-on, rear-end, or right-angle crashes.

3. <b>Severity by direction.</b> Compare injuries and fatalities across direction categories, testing whether opposite-direction collisions are more severe than same-direction ones.

4. <b>Action × direction.</b> Examine whether certain maneuvers (turning left, going straight) combined with specific directions result in higher crash frequency or severity.

5. <b>Environmental context.</b> Assess whether direction interacts with time of day (sun glare eastbound in morning, westbound in evening) or with light/weather conditions.

6. <b>Demographics.</b> Compare whether driver age and gender distributions differ by travel direction, suggesting possible behavioral or exposure patterns.
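
For the directional combinations and severity-by-direction ideas, each pair of directions first needs a label such as same, opposite, or crossing. A minimal sketch, assuming only the four cardinal directions are classified and everything else (diagonals, 'Unknown', '255', missing) goes into a catch-all bucket; `classify_directions` and its labels are hypothetical names of mine.

```python
import pandas as pd

# Opposite of each cardinal direction
OPPOSITE = {'North': 'South', 'South': 'North', 'East': 'West', 'West': 'East'}

def classify_directions(dir_one, dir_two):
    """Classify a pair of cardinal travel directions as same, opposite, or crossing."""
    if dir_one not in OPPOSITE or dir_two not in OPPOSITE:
        return 'other/unknown'  # diagonals, 'Unknown', '255', or missing
    if dir_one == dir_two:
        return 'same'
    if OPPOSITE[dir_one] == dir_two:
        return 'opposite'
    return 'crossing'

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'Traveldirection_One': ['East', 'East', 'North', 'Unknown', 'South'],
    'Traveldirection_Two': ['East', 'West', 'East', 'North', None],
})

toy['DirectionPair'] = [
    classify_directions(a, b)
    for a, b in zip(toy['Traveldirection_One'], toy['Traveldirection_Two'])
]
toy
```

The new column can then be grouped against Totalinjuries or Totalfatalities to compare opposite-direction with same-direction crashes.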

#### Action Comparisons

Have to deal with single cars, Other, and Unknown.

What is 'Lying'?  Like lying to the officer or lying in the road?

In [13]:
compare_like_columns('Unit Action')

Unnamed: 0_level_0,Unitaction_One,Unitaction_Two
Unit Action,Unnamed: 1_level_1,Unnamed: 2_level_1
Going Straight Ahead,24779,22295
Making Left Turn,11630,3537
Changing Lanes,3952,273
Making Right Turn,3640,1986
Unknown,1919,611
Slowing In Trafficway,1346,2667
Backing,907,32
Making U Turn,830,117
Crossing Road,545,384
Other,406,192


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare how often each action (straight, turning, stopped, backing, slowing) occurs for Unitaction_One vs Unitaction_Two.

2. <b>Severity by action.</b> Measure injuries and fatalities per action to identify high-risk maneuvers, such as left turns.

3. <b>Action combinations.</b> Cross-tab unit one vs unit two actions to reveal dangerous pairings (for example, left turn vs through traffic).

4. <b>Violations by action.</b> Assess whether certain violations are associated with specific actions, such as improper turns or following too closely.

5. <b>Environmental context.</b> Evaluate whether certain actions (rear-end crashes with stopped vehicles) occur more often at peak times or under certain light and weather conditions.

6. <b>Demographics.</b> Compare age and gender distributions across actions to detect whether risky maneuvers cluster in specific groups.
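
Severity by action could be summarized with a grouped aggregation along these lines. The rows are made up, and `AnyFatalityRate` (share of crashes with at least one fatality) is a measure I am introducing for illustration.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'Unitaction_One': ['Making Left Turn', 'Going Straight Ahead', 'Making Left Turn',
                       'Backing', 'Going Straight Ahead'],
    'Totalinjuries': [2, 0, 1, 0, 1],
    'Totalfatalities': [0, 0, 1, 0, 0],
})

# Per action: number of crashes, mean injuries, and share with any fatality
action_severity = (
    toy.groupby('Unitaction_One')
       .agg(Crashes=('Totalinjuries', 'size'),
            AvgInjuries=('Totalinjuries', 'mean'),
            AnyFatalityRate=('Totalfatalities', lambda s: (s > 0).mean()))
       .sort_values('AvgInjuries', ascending=False)
)
action_severity
```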

#### Violation Comparisons

Need to deal with Unknown, Other, missing values, and numeric codes (108, 109, 49).

In [14]:
compare_like_columns('Violation')

Unnamed: 0_level_0,Violation1_Drv1,Violation1_Drv2
Violation,Unnamed: 1_level_1,Unnamed: 2_level_1
Speed To Fast For Conditions,15134,270
Failed To Yield Right Of Way,10452,158
Unsafe Lane Change,4104,129
Followed Too Closely,3722,54
Unknown,3702,2427
Disregarded Traffic Signal,3539,53
Other,2860,270
Made Improper Turn,2182,59
Failed To Keep In Proper Lane,1549,46
Inattention Distraction,1544,54
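
The numeric codes can be pulled out of a text column with a regular expression so they can be counted and looked up separately. A minimal sketch on a made-up series, assuming the codes are stored as digit-only strings:

```python
import pandas as pd

# Made-up stand-in for df['Violation1_Drv1']
violations = pd.Series(
    ['Unknown', '108', 'Other', '49', None, 'Followed Too Closely', '109', '108'],
    name='Violation1_Drv1',
)

# True where the whole value is digits; missing values count as False
is_numeric_code = violations.str.fullmatch(r'\d+', na=False)

# How often each numeric code appears
code_counts = violations[is_numeric_code].value_counts()
code_counts
```

The same mask would also catch the '255' in the travel direction columns and the '10' in Collisionmanner.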


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare how often each violation appears across both drivers to identify the most common infractions.

2. <b>Severity by violation.</b> Analyze which violations lead to more severe crashes in terms of injuries and fatalities.

3. <b>Violation combinations.</b> Cross-tab primary vs secondary driver violations to see how pairs of behaviors interact.

4. <b>Contextual patterns.</b> Test whether certain violations are more likely under specific conditions (time of day, light, weather, surface).

5. <b>Unit type by violation.</b> Determine whether different unit types are linked with particular violations (e.g., motorcycles with speeding).

6. <b>Demographics.</b> Assess whether age or gender distributions differ across violations, highlighting behavioral differences between groups.
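
Counting violations across both drivers is easier once the two per-driver columns are stacked into one. A sketch on made-up rows using `melt`; the `Driver` and `Violation` column names are my own.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'OBJECTID': [1, 2, 3],
    'Violation1_Drv1': ['Followed Too Closely', 'Unsafe Lane Change', 'Followed Too Closely'],
    'Violation1_Drv2': ['No Improper Action', None, 'Followed Too Closely'],
})

# One row per (crash, driver) instead of one row per crash
long_violations = toy.melt(
    id_vars='OBJECTID',
    value_vars=['Violation1_Drv1', 'Violation1_Drv2'],
    var_name='Driver',
    value_name='Violation',
).dropna(subset=['Violation'])

# Violation counts over both drivers combined
violation_totals = long_violations['Violation'].value_counts()
violation_totals
```

Other id columns (Injuryseverity, Lightcondition, and so on) can be added to `id_vars` so the long table still carries the crash context.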

#### Alcohol Use Comparisons

Pretty straightforward.  Mostly no apparent influence.  Missing values for driver 2 are probably single-car accidents, so no concern there.  But what about the 49 missing for driver 1?

In [15]:
compare_like_columns('Alcohol Use')

Unnamed: 0_level_0,AlcoholUse_Drv1,AlcoholUse_Drv2
Alcohol Use,Unnamed: 1_level_1,Unnamed: 2_level_1
No Apparent Influence,48803,46414
Alcohol,2453,133
,49,4758


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare proportions of alcohol involvement versus no apparent influence for both drivers.

2. <b>Severity by alcohol use.</b> Analyze whether alcohol-involved crashes result in more injuries and fatalities compared to sober crashes.

3. <b>Driver pairing.</b> Cross-tab AlcoholUse_Drv1 × AlcoholUse_Drv2 to see how often one versus both drivers are impaired.

4. <b>Temporal context.</b> Break down alcohol-involved crashes by time of day and day of week to test for nighttime or weekend concentration.

5. <b>Environmental context.</b> Assess whether alcohol-involved crashes are linked to specific light or surface conditions.

6. <b>Demographics.</b> Compare age and gender distributions across alcohol involvement.

7. <b>Violation overlap.</b> Examine whether certain violations cluster with alcohol involvement.
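
The temporal breakdown needs an hour and a day of week pulled out of DateTime first. A minimal sketch on made-up rows. Big assumption: I am reading the "+00" suffix as UTC and shifting by 7 hours to Arizona time (UTC-7 year-round, no daylight saving). If the timestamps are already local, the shift should be dropped.

```python
import pandas as pd

# Made-up rows in the dataset's DateTime format
toy = pd.DataFrame({
    'DateTime': ['2024/03/14 13:20:00+00', '2024/03/15 07:11:00+00', '2024/03/16 08:30:00+00'],
    'AlcoholUse_Drv1': ['No Apparent Influence', 'Alcohol', 'Alcohol'],
})

# Strip the "+00" suffix and parse the remaining timestamp
stamp = pd.to_datetime(
    toy['DateTime'].str.replace(r'\+00$', '', regex=True),
    format='%Y/%m/%d %H:%M:%S',
)

# ASSUMPTION: timestamps are UTC; Arizona local time is UTC-7 all year
local = stamp - pd.Timedelta(hours=7)

toy['Hour'] = local.dt.hour
toy['DayOfWeek'] = local.dt.day_name()

# Alcohol-involved crashes by local hour and by day of week
alcohol = toy[toy['AlcoholUse_Drv1'] == 'Alcohol']
by_hour = alcohol.groupby('Hour').size()
by_day = alcohol.groupby('DayOfWeek').size()
by_hour
```

One sanity check for the UTC assumption on the real data is to compare the derived local hour against Lightcondition: daylight crashes should mostly land in daytime hours.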

#### Drug Use Comparisons

Same concerns as alcohol use.

In [16]:
compare_like_columns('Drug Use')

Unnamed: 0_level_0,DrugUse_Drv1,DrugUse_Drv2
Drug Use,Unnamed: 1_level_1,Unnamed: 2_level_1
No Apparent Influence,50807,46531
Drugs,449,16
,49,4758


<b><u><i>Possible Studies</i></u></b>

1. <b>Frequency distributions.</b> Compare proportions of drug-involved crashes versus no apparent influence across both drivers.

2. <b>Severity by drug use.</b> Test whether drug-related crashes show higher injuries and fatalities compared to non-involved crashes.

3. <b>Driver pairing.</b> Cross-tab DrugUse_Drv1 × DrugUse_Drv2 to see whether crashes usually involve one or both drug-involved drivers.

4. <b>Temporal context.</b> Break down drug-related crashes by time of day and day of week to detect concentration patterns.

5. <b>Environmental context.</b> Assess whether drug-involved crashes cluster under specific light, weather, or surface conditions.

6. <b>Demographics.</b> Compare age and gender distributions in drug-related crashes.

7. <b>Violation overlap.</b> Test whether specific violations (unsafe lane changes, inattention) are associated with drug involvement.
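
The driver pairing idea is a two-way cross-tab of the driver 1 and driver 2 columns. A sketch on made-up rows; the '(missing)' label is my placeholder and does not by itself mean there was no second driver.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    'DrugUse_Drv1': ['No Apparent Influence', 'Drugs', 'Drugs', 'No Apparent Influence', 'Drugs'],
    'DrugUse_Drv2': ['No Apparent Influence', 'No Apparent Influence', 'Drugs', None, None],
})

# Rows: driver 1, columns: driver 2, with totals in the 'All' margins
pairing = pd.crosstab(
    toy['DrugUse_Drv1'],
    toy['DrugUse_Drv2'].fillna('(missing)'),
    margins=True,
)

# Crashes where both drivers were flagged
both_drug = pairing.loc['Drugs', 'Drugs']
pairing
```

Swapping in AlcoholUse_Drv1 and AlcoholUse_Drv2 gives the matching table for alcohol.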