# Q3. Car Crash Analysis
The dataset `crash.csv` consists of drivers involved in crashes for a given year. The following information was available for each accident: driver’s age cohort (agecat), sex, severity of crash (degree), road user class, and accident frequency (number). Develop a binomial model that explains the proportion of car crashes that are non-casualty versus injury or fatal. Explain your reasoning and the model diagnostics. Conduct preliminary exploratory data analysis if necessary to support your reasoning. Please note that you will have to organize the data before you are able to apply the glm function.

## Data Analysis

In [92]:
import pandas as pd
import numpy as np

df = pd.read_csv("crash.csv")
df.head()

Unnamed: 0,agecat,roaduserclass,sex,degree,number
0,17-20,0-car,0-male,fatal,53
1,21-25,0-car,0-male,fatal,37
2,26-29,0-car,0-male,fatal,19
3,30-39,0-car,0-male,fatal,44
4,040-49,0-car,0-male,fatal,34


In [94]:
count_agecat = df.groupby("agecat").size().reset_index(name="Count agecat")
print(count_agecat)

   agecat  Count agecat
0  040-49            30
1   17-20            21
2   21-25            25
3   26-29            28
4   30-39            29
5   50-59            30
6     60+            46


In [96]:
count_degree = df.groupby("degree").size().reset_index(name="Count degree")
print(count_degree)

        degree  Count degree
0        fatal            53
1       injury            79
2  noncasualty            77


In [98]:
count_roaduserclass = df.groupby("roaduserclass").size().reset_index(name="Count roaduserclass")
print(count_roaduserclass)

   roaduserclass  Count roaduserclass
0          0-car                   48
1  a light truck                   42
2      bus+truck                   81
3     motorcycle                   38
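
The counts above come from `groupby(...).size()`, which counts rows of the file. Each row also carries an accident frequency in `number`, so the number of crashes per category is the sum of `number`, not the row count. A minimal sketch of the difference, using a made-up `toy` frame in the same layout (the values are illustrative, not taken from `crash.csv`):

```python
import pandas as pd

# Toy rows in the same layout as crash.csv; the counts are made up.
toy = pd.DataFrame({
    "degree": ["fatal", "fatal", "injury", "noncasualty"],
    "number": [53, 37, 120, 400],
})

# Number of rows (covariate cells) per severity level.
rows_per_degree = toy.groupby("degree").size()

# Number of crashes per severity level: sum the per-row frequencies.
crashes_per_degree = toy.groupby("degree")["number"].sum()

print(rows_per_degree)
print(crashes_per_degree)
```

Running the same two lines on `df` would show how the crashes, rather than the rows, are split across `degree`.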


## Data Cleaning

In [101]:
# Clean Age
def recategorize_age(agecat):
    if agecat in ["17-20"]:
        return "17-20"
    elif agecat in ["21-25", "26-29"]:
        return "21-29"
    elif agecat in ["30-39"]:
        return "30-39"
    elif agecat in ["040-49"]:
        return "40-49"
    elif agecat in ["50-59"]:
        return "50-59"
    elif agecat in ["60+"]:
        return "60+"
    else:
        return "Unknown"

df["age_group"] = df["agecat"].apply(recategorize_age)
df.head()

Unnamed: 0,agecat,roaduserclass,sex,degree,number,age_group
0,17-20,0-car,0-male,fatal,53,17-20
1,21-25,0-car,0-male,fatal,37,21-29
2,26-29,0-car,0-male,fatal,19,21-29
3,30-39,0-car,0-male,fatal,44,30-39
4,040-49,0-car,0-male,fatal,34,40-49
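
The same recoding can also be written as a dictionary lookup, which makes unmapped labels easy to detect. This is only a sketch: `AGE_MAP` and the `toy` frame are names introduced here for illustration, and the mapping mirrors the `recategorize_age` function above.

```python
import pandas as pd

# Hypothetical lookup table mirroring recategorize_age:
# merges 21-25 and 26-29, and repairs the "040-49" label.
AGE_MAP = {
    "17-20": "17-20",
    "21-25": "21-29",
    "26-29": "21-29",
    "30-39": "30-39",
    "040-49": "40-49",
    "50-59": "50-59",
    "60+": "60+",
}

# Toy column with a few of the labels seen in agecat.
toy = pd.DataFrame({"agecat": ["17-20", "21-25", "26-29", "040-49", "60+"]})
toy["age_group"] = toy["agecat"].map(AGE_MAP)

# Labels missing from the dictionary become NaN, so they can be checked explicitly.
unmapped = toy.loc[toy["age_group"].isna(), "agecat"].unique()
assert len(unmapped) == 0, f"Unmapped agecat labels: {unmapped}"

print(toy)
```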


In [103]:
# Clean Road User Class
def recategorize_roaduserclass(x):
    if x in ["0-car"]:
        return 0
    elif x in ["a light truck"]:
        return 1
    elif x in ["bus+truck"]:
        return 2
    elif x in ["motorcycle"]:
        return 3
    else:
        return "Unknown"

df["roaduserclass_group"] = df["roaduserclass"].apply(recategorize_roaduserclass)
df.head(20)

Unnamed: 0,agecat,roaduserclass,sex,degree,number,age_group,roaduserclass_group
0,17-20,0-car,0-male,fatal,53,17-20,0
1,21-25,0-car,0-male,fatal,37,21-29,0
2,26-29,0-car,0-male,fatal,19,21-29,0
3,30-39,0-car,0-male,fatal,44,30-39,0
4,040-49,0-car,0-male,fatal,34,40-49,0
5,50-59,0-car,0-male,fatal,31,50-59,0
6,60+,0-car,0-male,fatal,24,60+,0
7,60+,0-car,0-male,fatal,36,60+,0
8,17-20,0-car,female,fatal,21,17-20,0
9,21-25,0-car,female,fatal,19,21-29,0


In [105]:
# Clean Sex
def recategorize_sex(x):
    if x in ["0-male"]:
        return 0
    elif x in ["female"]:
        return 1
    else:
        return "Unknown"

df["sex_group"] = df["sex"].apply(recategorize_sex)
df.head(20)

Unnamed: 0,agecat,roaduserclass,sex,degree,number,age_group,roaduserclass_group,sex_group
0,17-20,0-car,0-male,fatal,53,17-20,0,0
1,21-25,0-car,0-male,fatal,37,21-29,0,0
2,26-29,0-car,0-male,fatal,19,21-29,0,0
3,30-39,0-car,0-male,fatal,44,30-39,0,0
4,040-49,0-car,0-male,fatal,34,40-49,0,0
5,50-59,0-car,0-male,fatal,31,50-59,0,0
6,60+,0-car,0-male,fatal,24,60+,0,0
7,60+,0-car,0-male,fatal,36,60+,0,0
8,17-20,0-car,female,fatal,21,17-20,0,1
9,21-25,0-car,female,fatal,19,21-29,0,1


In [107]:
# Organize the data - add casualty
df['casualty'] = df['degree'].apply(
    lambda x: 1 if x in ['injury', 'fatal'] else 0
)
df.head(20)

Unnamed: 0,agecat,roaduserclass,sex,degree,number,age_group,roaduserclass_group,sex_group,casualty
0,17-20,0-car,0-male,fatal,53,17-20,0,0,1
1,21-25,0-car,0-male,fatal,37,21-29,0,0,1
2,26-29,0-car,0-male,fatal,19,21-29,0,0,1
3,30-39,0-car,0-male,fatal,44,30-39,0,0,1
4,040-49,0-car,0-male,fatal,34,40-49,0,0,1
5,50-59,0-car,0-male,fatal,31,50-59,0,0,1
6,60+,0-car,0-male,fatal,24,60+,0,0,1
7,60+,0-car,0-male,fatal,36,60+,0,0,1
8,17-20,0-car,female,fatal,21,17-20,0,1,1
9,21-25,0-car,female,fatal,19,21-29,0,1,1


In [109]:
# Organize the data - add dummy variables
df_w_dummies = pd.get_dummies(df, columns=["age_group","roaduserclass_group", "sex_group"], drop_first=True)
dummy_columns = df_w_dummies.select_dtypes(include='bool').columns
df_w_dummies[dummy_columns] = df_w_dummies[dummy_columns].astype(int)
df_w_dummies

Unnamed: 0,agecat,roaduserclass,sex,degree,number,casualty,age_group_21-29,age_group_30-39,age_group_40-49,age_group_50-59,age_group_60+,roaduserclass_group_1,roaduserclass_group_2,roaduserclass_group_3,sex_group_1
0,17-20,0-car,0-male,fatal,53,1,0,0,0,0,0,0,0,0,0
1,21-25,0-car,0-male,fatal,37,1,1,0,0,0,0,0,0,0,0
2,26-29,0-car,0-male,fatal,19,1,1,0,0,0,0,0,0,0,0
3,30-39,0-car,0-male,fatal,44,1,0,1,0,0,0,0,0,0,0
4,040-49,0-car,0-male,fatal,34,1,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204,17-20,motorcycle,female,noncasualty,2,0,0,0,0,0,0,0,0,1,1
205,21-25,motorcycle,female,noncasualty,2,0,1,0,0,0,0,0,0,1,1
206,26-29,motorcycle,female,noncasualty,1,0,1,0,0,0,0,0,0,1,1
207,30-39,motorcycle,female,noncasualty,3,0,0,1,0,0,0,0,0,1,1


In [115]:
count_degree_group = df_w_dummies.groupby("casualty").size().reset_index(name="Count degree group")
print(count_degree_group)

   casualty  Count degree group
0         0                  77
1         1                 132
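
To model a proportion, the data can also be organized so that each covariate combination is one row holding the number of casualty and non-casualty crashes. The sketch below does this with `pivot_table` on a made-up `toy` frame; the column names `n_casualty`, `n_noncasualty`, and `prop_noncasualty` are introduced here for illustration.

```python
import pandas as pd

# Toy rows in the cleaned layout (one row per cell and severity); counts are made up.
toy = pd.DataFrame({
    "age_group": ["17-20", "17-20", "17-20", "21-29", "21-29"],
    "sex": ["0-male"] * 5,
    "roaduserclass": ["0-car"] * 5,
    "casualty": [1, 1, 0, 1, 0],
    "number": [53, 200, 900, 56, 700],
})

# Sum the frequencies within each covariate cell, split by casualty status.
grouped = (
    toy.pivot_table(
        index=["age_group", "sex", "roaduserclass"],
        columns="casualty",
        values="number",
        aggfunc="sum",
        fill_value=0,
    )
    .rename(columns={0: "n_noncasualty", 1: "n_casualty"})
    .reset_index()
)
grouped.columns.name = None

# Observed proportion of non-casualty crashes in each cell.
total = grouped["n_noncasualty"] + grouped["n_casualty"]
grouped["prop_noncasualty"] = grouped["n_noncasualty"] / total

print(grouped)
```

In this layout the fatal and injury rows of a cell are added together, so each cell contributes a (non-casualty, casualty) pair of counts.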


## Build a Binomial Model

In [118]:
import statsmodels.api as sm

# Drop rows with NaN values in 'casualty'
df_w_dummies = df_w_dummies.dropna(subset=['casualty'])

# Response: 1 if the row's degree is injury or fatal, 0 if noncasualty
Y = df_w_dummies['casualty']
# Predictors: the row's accident frequency plus the age, road user class, and sex dummies
X = df_w_dummies[[
    'number',
    'age_group_21-29', 'age_group_30-39', 'age_group_40-49', 'age_group_50-59', 'age_group_60+',
    'roaduserclass_group_1', 'roaduserclass_group_2', 'roaduserclass_group_3',
    'sex_group_1',
]]

# Fit the logistic regression model
intercept_model = sm.GLM(
    Y,
    sm.add_constant(X),  
    family=sm.families.Binomial(link=sm.families.links.logit())
).fit()

# Print the summary of the model
print(intercept_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               casualty   No. Observations:                  209
Model:                            GLM   Df Residuals:                      198
Model Family:                Binomial   Df Model:                           10
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -122.90
Date:                Mon, 02 Dec 2024   Deviance:                       245.81
Time:                        23:05:48   Pearson chi2:                     197.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.1307
Covariance Type:            nonrobust                                         
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                     3.47



## Result Analysis
- Constant: The intercept is the log-odds of a fatal/injury outcome for the reference row (male, car, aged 17-20, with `number` equal to zero). Its positive value indicates a strong baseline likelihood of fatal/injury outcomes.
- Number (Accident Frequency): Statistically significant, but with a relatively small negative coefficient: rows with larger counts are slightly less likely to be fatal/injury rows.
- Sex Group 1 (Female): Negative coefficient, so females are associated with a lower likelihood of fatal/injury outcomes than males (the reference level).
- Roaduserclass Groups: Group 1 (Light Truck), Group 2 (Bus + Truck), and Group 3 (Motorcycle) all have negative coefficients, meaning these road users are less likely to experience fatal/injury outcomes than the baseline group (Group 0, Car).