# Weekly Project 2!

## Introduction to Road Traffic Accidents (RTA) Dataset

### Dataset Overview
The RTA Dataset provides a detailed snapshot of road traffic accidents, capturing a range of data from accident conditions to casualty details. This dataset is essential for analyzing patterns and causes of accidents to improve road safety.

### Data Characteristics
- **Entries**: The dataset contains 12,316 entries.
- **Features**: There are 32 features in the dataset, which include:
  - `Time`: Time when the accident occurred.
  - `Day_of_week`: Day of the week.
  - `Age_band_of_driver`: Age group of the driver involved.
  - `Sex_of_driver`: Gender of the driver.
  - `Educational_level`: Educational level of the driver.
  - `Type_of_vehicle`: Type of vehicle involved in the accident.
  - `Cause_of_accident`: Reported cause of the accident.
  - `Accident_severity`: Severity of the accident.
- **Target Column**: `Accident_severity` is used as the target column for modeling. This feature classifies the severity of each accident.

### Objective
Students will use this dataset to apply various data visualization, modeling, and evaluation techniques learned in class. The primary goal is to build models that can accurately predict the severity of accidents and to identify the key factors that contribute to severe accidents.

## Import Libraries
Import all the necessary libraries here. Include libraries for handling data (like pandas), visualization (like matplotlib and seaborn), and modeling (like scikit-learn).

In [1]:
# Disclaimer: I'm sorry for any typos in my comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

## Load Data
Load the dataset from the provided CSV file into a DataFrame.

In [2]:
df = pd.read_csv('RTA_Dataset.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'RTA_Dataset.csv'

## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of it.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!
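The checklist above can be sketched on a small stand-in frame. The `toy` DataFrame and its values are invented for illustration (they are not taken from the RTA file); only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the RTA data: made-up values,
# two missing entries and one duplicated row.
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Friday", "Friday", "Sunday"],
    "Number_of_casualties": [1, 2, 2, 3],
    "Type_of_vehicle": ["Automobile", np.nan, np.nan, "Taxi"],
})

print(toy.shape)                     # (rows, columns)
print(toy.dtypes)                    # data type of each column
print(toy.describe(include="all"))   # summary statistics for every column

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
null_counts = toy.isnull().sum()            # missing values per column
print(n_duplicates)
print(null_counts)
```

The same calls work unchanged on `df` once the CSV has been loaded.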


In [None]:
df.info()

## Data Preprocessing
Data preprocessing is essential for transforming raw data into a format suitable for further analysis and modeling. Follow these steps to ensure your data is ready for predictive modeling or advanced analytics:
- **Handling Missing Values**: Replace missing values with appropriate statistics (mean, median, mode) or use more complex imputation techniques.
- **Normalization/Scaling**: Scale data to a small, specified range like 0 to 1, or transform it to have a mean of zero and a standard deviation of one.
- **Label Encoding**: Convert categorical text data into model-understandable numbers where the labels are ordered.
- **One-Hot Encoding**: Use for nominal categorical data where no ordinal relationship exists to transform the data into a binary column for each category. (Be careful not to increase the dimensionality significantly)
- **Detection and Treatment of Outliers**: Use statistical tests, box plots, or scatter plots to identify outliers and then cap, trim, or use robust methods to reduce the effect of outliers, depending on the context.
- **Feature Engineering**: Enhance your dataset by creating new features and transforming existing ones. This might involve combining data from different columns, applying transformations, or reducing dimensionality with techniques like PCA to improve model performance.

Consider these steps as a foundation, and feel free to introduce additional preprocessing techniques as needed to address specific characteristics of your dataset.
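As a rough sketch of how the first four steps fit together, the snippet below applies them to a tiny invented frame. The rows, the `experience_order` mapping, and the choice of `StandardScaler` are assumptions made for illustration, not the required solution.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Driving_experience": ["Below 1yr", "5-10yr", None, "5-10yr"],
    "Light_conditions": ["Daylight", "Darkness", "Daylight", "Daylight"],
    "Number_of_casualties": [1, 2, 1, 4],
})

# 1. Missing values: fill a categorical column with its most frequent value.
toy["Driving_experience"] = toy["Driving_experience"].fillna(
    toy["Driving_experience"].mode()[0]
)

# 2. Ordinal encoding: an explicit mapping preserves the natural order.
experience_order = {"Below 1yr": 1, "5-10yr": 2}
toy["Driving_experience"] = toy["Driving_experience"].map(experience_order)

# 3. One-hot encoding: one binary column per category for nominal data.
toy = pd.get_dummies(toy, columns=["Light_conditions"], dtype=int)

# 4. Scaling: zero mean and unit standard deviation for the numeric column.
scaler = StandardScaler()
toy["Number_of_casualties"] = scaler.fit_transform(
    toy[["Number_of_casualties"]]
).ravel()
print(toy)
```

When a model is evaluated on held-out data, the scaler is usually fitted on the training split only, which is what wrapping it in a `Pipeline` achieves.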

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.sample(5)

In [None]:
df.tail()

In [None]:
# First, check whether each column's data type is appropriate:
df.dtypes

In [None]:
df['Casualty_severity'].unique()

In [None]:
df['Accident_severity'].unique()

In [None]:
# The Casualty severity feature can be turned into int:
# df['Casualty_severity'] = df['Casualty_severity'].astype(int)
# now check for null values:
df.isnull().sum()

In [None]:
df['Educational_level'].unique()

In [None]:
df['Owner_of_vehicle'].unique()

In [None]:
df['Service_year_of_vehicle'].unique()

In [None]:
df['Work_of_casuality'].unique()

In [None]:
df['Vehicle_driver_relation'].unique()

In [None]:
df['Fitness_of_casuality'].unique()

In [None]:
# Before going into the cleaning process, I'll drop certain columns.
# For a model that predicts accident severity, I deem the following unnecessary:
# The educational level of a person would not help a model predict the severity of a car crash.
df.drop('Educational_level', axis = 1, inplace = True)
# The owner of the vehicle wouldn't help the model predict the severity. (Latitude would help with plotting data on a map, but it would only distort the model when fitting.)
df.drop('Owner_of_vehicle', axis = 1, inplace = True)
# Similar to educational level, the occupation of the casualty is not directly related to the severity of the accident.
df.drop('Work_of_casuality', axis = 1, inplace = True)
# While fitness might influence the likelihood of an accident, it is less likely to affect the severity once an accident has occurred.
df.drop('Fitness_of_casuality', axis = 1, inplace = True)
# The relationship between the driver and the vehicle owner is not relevant to the severity of the accident.
df.drop('Vehicle_driver_relation', axis = 1, inplace = True)
# As a whole, accident severity is more influenced by situational factors and/or factors that may indicate how severe the accident is.

In [None]:
df.info()

In [None]:
df['Age_band_of_casualty'].isnull().sum()

In [None]:
df['Age_band_of_driver'].unique()

In [None]:
df.sample(10)

In [None]:
df.replace('na', pd.NA, inplace=True) # Replace any 'na' record with a missing value (pd.NA) so that isnull() detects it


In [None]:
df.isnull().sum()

In [None]:
df['Age_band_of_casualty'].unique()

In [None]:
df['Age_band_of_casualty'].isnull().sum()

In [None]:
df['Casualty_class'].isnull().sum()

In [None]:
df['Casualty_class'].unique()

In [None]:
df.columns

In [None]:
def fill_with_mode(series):
    mode_value = series.mode()
    if not mode_value.empty:
        return series.fillna(mode_value[0])
    return series
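A small sketch of how this helper behaves when used with `groupby(...).transform(...)`, on invented rows. The helper is repeated only so the snippet runs on its own. Note that a group whose values are all missing has no mode and stays unfilled, which is why some columns later need a second, global fill.

```python
import numpy as np
import pandas as pd

def fill_with_mode(series):
    # Same helper as in the cell above, repeated so this sketch is self-contained.
    mode_value = series.mode()
    if not mode_value.empty:
        return series.fillna(mode_value[0])
    return series

# Hypothetical toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "Age_band_of_driver": ["18-30", "18-30", "18-30", "31-50", "Over 51"],
    "Driving_experience": ["1-2yr", "1-2yr", np.nan, "5-10yr", np.nan],
})

# Fill each missing value with the mode of its own age-band group.
filled = toy.groupby("Age_band_of_driver")["Driving_experience"].transform(fill_with_mode)
remaining = int(filled.isnull().sum())  # values the group-wise fill could not reach
print(filled.tolist())
print(remaining)
```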



In [None]:
df['Casualty_class'].isnull().sum()

In [None]:
# Fill the casualty class columns with the mode:
df['Casualty_class'] = df.groupby(['Number_of_casualties'])['Casualty_class'].transform(fill_with_mode)


In [None]:
df['Casualty_class'].isnull().sum()

In [None]:
df['Type_of_vehicle'] = df.groupby(["Casualty_class"])['Type_of_vehicle'].transform(fill_with_mode)


In [None]:
df['Type_of_vehicle'].isnull().sum()

In [None]:
df['Driving_experience'] = df.groupby(['Age_band_of_driver'])['Driving_experience'].transform(fill_with_mode)


In [None]:
df['Driving_experience'].isnull().sum()

In [None]:
df['Service_year_of_vehicle'] = df.groupby(['Type_of_vehicle'])['Service_year_of_vehicle'].transform(fill_with_mode)


In [None]:
df['Service_year_of_vehicle'].isnull().sum()

In [None]:
df['Defect_of_vehicle'] = df.groupby(['Age_band_of_driver','Driving_experience'])['Defect_of_vehicle'].transform(fill_with_mode)


In [None]:
df['Defect_of_vehicle'].isnull().sum()

In [None]:
df['Road_surface_type'] = df.groupby('Road_surface_conditions')['Road_surface_type'].transform(fill_with_mode)

In [None]:
df['Road_surface_type'].isnull().sum()

In [None]:
df['Area_accident_occured'] = df['Area_accident_occured'].fillna(df['Area_accident_occured'].mode()[0]) # Since there isn't any column that would help determine the location.

In [None]:
df['Area_accident_occured'].isnull().sum()

In [None]:
df['Lanes_or_Medians'] = df.groupby(['Area_accident_occured'])['Lanes_or_Medians'].transform(fill_with_mode)


In [None]:
df['Lanes_or_Medians'].isnull().sum()

In [None]:
df['Road_allignment'] = df.groupby(['Lanes_or_Medians'])['Road_allignment'].transform(fill_with_mode)
df['Road_allignment'].isnull().sum()

In [None]:
df['Types_of_Junction'] = df.groupby(['Area_accident_occured','Road_allignment'])['Types_of_Junction'].transform(fill_with_mode)


In [None]:
df['Types_of_Junction'].isnull().sum()

In [None]:
df['Type_of_collision'].unique()

In [None]:
df['Type_of_collision'] = df.groupby(['Accident_severity','Cause_of_accident'])['Type_of_collision'].transform(fill_with_mode)

In [None]:
df['Vehicle_movement'] = df.groupby(['Type_of_collision','Pedestrian_movement'])['Vehicle_movement'].transform(fill_with_mode)


In [None]:
df['Vehicle_movement'] = df['Vehicle_movement'].fillna(df['Vehicle_movement'].mode()[0]) # fill the rest (assignment instead of chained inplace, which recent pandas versions warn about)

In [None]:
df['Age_band_of_casualty'] = df.groupby(['Age_band_of_driver'])['Age_band_of_casualty'].transform(fill_with_mode)

In [None]:
df['Age_band_of_casualty'] = df['Age_band_of_casualty'].fillna(df['Age_band_of_casualty'].mode()[0]) # fill the rest (assignment instead of chained inplace, which recent pandas versions warn about)

In [None]:
df['Sex_of_casualty'] = df.groupby(['Sex_of_driver'])['Sex_of_casualty'].transform(fill_with_mode)

In [None]:
df['Casualty_severity'] = df.groupby(['Accident_severity'])['Casualty_severity'].transform(fill_with_mode)

In [None]:
df.isnull().sum() # SUCCESS

In [None]:
# Check for the duplicates:
df.duplicated().sum()

In [None]:
# Before further preprocessing, I will store the current df in another variable for the visualization:
df_vis = df.copy()
df_vis

In [None]:
df.dtypes

In [None]:
df.columns

In [None]:
# Encoding categorical data:
# First, let's start with the ordinal data:
day_mapping = {'Sunday': 1, 'Monday': 2, 'Tuesday': 3, 'Wednesday': 4, 'Thursday': 5, 'Friday': 6, 'Saturday': 7}
df['Day_of_week'] = df['Day_of_week'].map(day_mapping)
df['Day_of_week'].isnull().sum()

In [None]:
df['Day_of_week'].unique()

In [None]:
df['Age_band_of_driver'].unique()

In [None]:
age_mapping = {'Under 18': 1, '18-30': 2, '31-50': 3, 'Over 51': 4, 'Unknown': 0}
df['Age_band_of_driver'] = df['Age_band_of_driver'].map(age_mapping)
df['Age_band_of_driver'].isnull().sum()

In [None]:
df['Driving_experience'].unique()

In [None]:
dr_mapping = {'No Licence': 1, 'Below 1yr': 2, '1-2yr': 3, '2-5yr': 4, '5-10yr': 5, 'Above 10yr': 6, 'unknown': 0}
df['Driving_experience'] = df['Driving_experience'].map(dr_mapping)
df['Driving_experience'].isnull().sum()

In [None]:
df['Service_year_of_vehicle'].unique()

In [None]:
sy_mapping = {'Below 1yr': 1, '1-2yr': 2, '2-5yrs': 3, '5-10yrs': 4, 'Above 10yr': 5, 'Unknown': 0}
df['Service_year_of_vehicle'] = df['Service_year_of_vehicle'].map(sy_mapping)
df['Service_year_of_vehicle'].isnull().sum()

In [None]:
df['Age_band_of_casualty'].unique()

In [None]:
agec_mapping = {'Under 18': 1, '18-30': 2, '31-50': 3, 'Over 51': 4, '5': 0}
df['Age_band_of_casualty'] = df['Age_band_of_casualty'].map(agec_mapping) # use the casualty mapping defined above, not the driver mapping

In [None]:
df['Accident_severity'].unique()

In [None]:
acc_mapping = {'Slight Injury': 1, 'Serious Injury': 2, 'Fatal injury': 3}
df['Accident_severity'] = df['Accident_severity'].map(acc_mapping)
df['Accident_severity'].isnull().sum()


In [None]:
df.isnull().sum()

In [None]:
df.columns

In [None]:
# Perform label encoding on the rest:
label = LabelEncoder()
df['Sex_of_driver'] = label.fit_transform(df['Sex_of_driver'])
df['Sex_of_casualty'] = label.fit_transform(df['Sex_of_casualty']) # no trailing tab in the column name, so the existing column is overwritten

In [None]:
#perform label encoding on the rest:
label = LabelEncoder()
df['Type_of_vehicle'] = label.fit_transform(df['Type_of_vehicle'])
df['Defect_of_vehicle'] = label.fit_transform(df['Defect_of_vehicle'])
df['Cause_of_accident'] = label.fit_transform(df['Cause_of_accident'])
df['Area_accident_occured'] = label.fit_transform(df['Area_accident_occured'])
df['Lanes_or_Medians'] = label.fit_transform(df['Lanes_or_Medians'])
df['Road_allignment'] = label.fit_transform(df['Road_allignment'])
df['Types_of_Junction'] = label.fit_transform(df['Types_of_Junction'])
df['Road_surface_type'] = label.fit_transform(df['Road_surface_type']) # trailing tab removed from the column name
df['Road_surface_conditions'] = label.fit_transform(df['Road_surface_conditions'])
df['Light_conditions'] = label.fit_transform(df['Light_conditions'])
df['Weather_conditions'] = label.fit_transform(df['Weather_conditions']) # trailing tab removed from the column name
df['Type_of_collision'] = label.fit_transform(df['Type_of_collision'])
df['Vehicle_movement'] = label.fit_transform(df['Vehicle_movement'])
df['Casualty_class'] = label.fit_transform(df['Casualty_class'])
df['Pedestrian_movement'] = label.fit_transform(df['Pedestrian_movement']) # trailing tab removed from the column name
df['Accident_severity'] = label.fit_transform(df['Accident_severity'])
df['Casualty_severity'] = label.fit_transform(df['Casualty_severity'])



In [None]:
df['Road_surface_type'] = label.fit_transform(df['Road_surface_type'])
df['Weather_conditions'] = label.fit_transform(df['Weather_conditions'])
df['Pedestrian_movement'] = label.fit_transform(df['Pedestrian_movement'])
df['Defect_of_vehicle'] = label.fit_transform(df['Defect_of_vehicle'])


In [None]:
df['Sex_of_casualty'] = label.fit_transform(df['Sex_of_casualty'])


In [None]:
df['Time'] = label.fit_transform(df['Time'])

In [None]:
df.dtypes

## Data Visualization
Create various plots to visualize the relationships in the data. Consider using the following to show different aspects of the data:

* Heatmap of Correlation Matrix.
* Line plots.
* Scatter plots.
* Histograms.
* Boxplots.

Use more if needed!
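For categorical columns, one option beyond the plot types listed is a stacked bar chart of class shares. The sketch below uses invented rows and plain pandas/matplotlib. The column names mirror the dataset, but the values and the resulting shares are purely illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "Light_conditions": ["Daylight", "Daylight", "Darkness",
                         "Darkness", "Daylight", "Darkness"],
    "Accident_severity": ["Slight Injury", "Slight Injury", "Serious Injury",
                          "Fatal injury", "Serious Injury", "Slight Injury"],
})

# Share of each severity class within every lighting condition (rows sum to 1).
severity_share = pd.crosstab(
    toy["Light_conditions"], toy["Accident_severity"], normalize="index"
)

ax = severity_share.plot(kind="bar", stacked=True, figsize=(8, 5))
ax.set_ylabel("Share of accidents")
ax.set_title("Accident severity by light condition (toy data)")
plt.tight_layout()
```

Normalizing by row makes conditions with very different accident counts comparable, which a raw count histogram does not.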

In [None]:
df_vis.columns

In [None]:
# One of the better ways to identify patterns in data is the correlation matrix and the heat map.
# After observing the data, we see that the only truly numeric columns are Number_of_vehicles_involved and Number_of_casualties, but before checking the correlation we should check for outliers.
fig = px.box(df_vis, y = 'Number_of_vehicles_involved', color = 'Accident_severity')
fig.show()
# We note from the graph that there are outliers in each class.

In [None]:
# check the number of casualties for outliers too:
fig = px.box(df_vis, y = 'Number_of_casualties', color = 'Accident_severity')
fig.show()
# Again we notice outliers in both of our numerical columns.

In [3]:
# Create function that deals with outliers:
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
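The function above trims rows that fall outside the IQR fences. The preprocessing notes also mention capping, which keeps every row and only limits extreme values. A minimal sketch follows, where `cap_outliers` is a hypothetical helper and the data are invented.

```python
import pandas as pd

def cap_outliers(df, column):
    """Hypothetical helper: clip values to the 1.5 * IQR fences instead of dropping rows."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    capped = df.copy()  # leave the caller's frame untouched
    capped[column] = capped[column].clip(lower=lower, upper=upper)
    return capped

# Toy column with one extreme value (invented for illustration).
toy = pd.DataFrame({"Number_of_casualties": [1, 1, 2, 2, 2, 3, 8]})
capped = cap_outliers(toy, "Number_of_casualties")
print(capped["Number_of_casualties"].tolist())
```

Capping keeps the row count unchanged, which matters when the rare severe accidents are exactly the rows with unusual counts.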


In [4]:
numeric_col = df[['Number_of_casualties', 'Number_of_vehicles_involved']]
numeric_col_corr = numeric_col.corr()
print(numeric_col_corr)

NameError: name 'df' is not defined

In [5]:
plt.figure(figsize = (8,5))
sns.heatmap(numeric_col_corr, annot = True)
plt.show()
# Shockingly the correlation between the two features is low.


NameError: name 'numeric_col_corr' is not defined

<Figure size 800x500 with 0 Axes>

In [6]:
df['Weather_conditions']

NameError: name 'df' is not defined

In [7]:
df.columns

NameError: name 'df' is not defined

In [8]:
# Check for more correlation
corr_df = df[['Day_of_week','Age_band_of_driver','Sex_of_driver','Driving_experience','Weather_conditions','Defect_of_vehicle', 'Area_accident_occured','Road_surface_conditions','Number_of_vehicles_involved','Number_of_casualties','Casualty_severity','Cause_of_accident','Accident_severity']]
plt.figure(figsize = (15,10))
sns.heatmap(corr_df.corr(), annot = True)

NameError: name 'df' is not defined

In [9]:
# Knowing the correlation between the features, the following plots will hopefully expand on them.
# The pair plot is a great visual tool for showing the relationships between features.
# pairplot creates its own figure, so the title is set on the returned grid instead of using plt.figure / plt.title.
g = sns.pairplot(corr_df)
g.figure.suptitle('Correlation between features', y = 1.02)

NameError: name 'corr_df' is not defined

<Figure size 1500x1000 with 0 Axes>

In [10]:
df_vis.columns

NameError: name 'df_vis' is not defined

In [11]:
# Using plotly for the interactive graphs. The graph shows the frequency of each severity class, split by the sex of the casualty.
# (plt.figure has no effect on plotly figures, so it is not called here.)
fig = px.histogram(df_vis, x = 'Accident_severity',color = 'Sex_of_casualty')
fig.show()

NameError: name 'df_vis' is not defined

<Figure size 1000x600 with 0 Axes>

In [12]:
# Using plotly for the interactive graphs. The graph shows the distribution of the number of casualties, split by day of the week.
# (plt.figure has no effect on plotly figures, so it is not called here.)
fig = px.histogram(df, x = 'Number_of_casualties',color = 'Day_of_week')
fig.show()

NameError: name 'df' is not defined

<Figure size 1000x600 with 0 Axes>

In [13]:
fig = px.scatter(df, x = 'Number_of_vehicles_involved', y = 'Number_of_casualties', color = 'Accident_severity')
fig.show()

NameError: name 'df' is not defined

## Feature Selection
- Choose features that you believe will most influence the outcome based on your analysis and the insights from your visualizations. Focus on those that appear most impactful to include in your modeling.

## Train-Test Split
* Divide the dataset into training and testing sets to evaluate the performance of your models.
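When one class is rare, a plain random split can leave the test set with very few positive examples. The sketch below shows a stratified split on an invented target with 10% positives. The data are hypothetical. `stratify` is a standard `train_test_split` parameter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 1 marks the rare class (for example, fatal accidents).
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```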

In [14]:
df.isnull().sum()

NameError: name 'df' is not defined

In [15]:
# Feature selection:
# SelectKBest with the chi-square score (chi2) from sklearn.feature_selection lets us select the features we want
from sklearn.feature_selection import SelectKBest, chi2
# Assuming your data is in a DataFrame called df
X = df.drop('Accident_severity', axis=1)
y = df['Accident_severity']

# Select the top 10 features with the highest Chi-square scores
chi2_selector = SelectKBest(chi2, k=10)
X_kbest = chi2_selector.fit_transform(X, y)

# Get the scores for each feature
chi2_scores = chi2_selector.scores_
chi2_pvalues = chi2_selector.pvalues_

# Create a DataFrame to view the scores and p-values
feature_scores = pd.DataFrame({'Feature': X.columns, 'Chi2 Score': chi2_scores, 'p-value': chi2_pvalues})
feature_scores = feature_scores.sort_values(by='Chi2 Score', ascending=False).head(10)
print(feature_scores)


NameError: name 'df' is not defined

In [16]:
df_vis['Accident_severity'].unique()

NameError: name 'df_vis' is not defined

In [17]:
df['Accident_severity'] = df_vis['Accident_severity']
df['Accident_severity'].unique()

NameError: name 'df_vis' is not defined

In [18]:
df.dtypes

NameError: name 'df' is not defined

In [19]:
# After analyzing and evaluating the correlation between the features, the following are passed to the model:
Featured_selected = df[['Time', 'Number_of_casualties', 'Number_of_vehicles_involved', 'Light_conditions',
    'Type_of_collision', 'Road_surface_type', 'Day_of_week', 'Age_band_of_driver', 'Types_of_Junction',
    'Lanes_or_Medians']] # The features.
# Create a binary target: 1 for fatal accidents (label 3 in acc_mapping), 0 otherwise.
# df_vis still holds the original text labels, so this comparison does not depend on how df was re-encoded;
# comparing text labels against the integer 3 would mark every row as 0.
y = (df_vis['Accident_severity'] == 'Fatal injury').astype(int)
# Make the train test split at 70% / 30%
X_train, X_test, y_train, y_test = train_test_split(Featured_selected,y,test_size = 0.3, random_state = 42) # random_state keeps the split reproducible

NameError: name 'df' is not defined

## Modeling

Once the data is split into training and testing sets, the next step is to build models to make predictions. Here, we will explore several machine learning algorithms, each with its unique characteristics and suitability for different types of data and problems. You will implement the following models:

### 1. Logistic Regression

### 2. Decision Tree Classifier

### 3. Support Vector Machine (SVM)

### 4. K-Neighbors Classifier

### Implementing the Models
- For each model, use the training data you have prepared to train the model.
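One way to keep the four models from sharing variables is to build a separate pipeline per model inside a loop. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the RTA dataset). The names `models`, `cv_results`, `X_demo`, and `y_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows, 10 features, roughly 20% positives.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "svm_rbf": SVC(kernel="rbf"),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

cv_results = {}
for name, model in models.items():
    # Each model gets its own pipeline, so no steps or fitted state are shared.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_validate(
        pipe, X_demo, y_demo, cv=3, scoring=["accuracy", "precision", "recall"]
    )
    cv_results[name] = {
        metric: scores[f"test_{metric}"].mean()
        for metric in ("accuracy", "precision", "recall")
    }
    print(name, cv_results[name])
```

Because every score is computed from the pipeline that was just built, a metric can never be reported for the wrong model by accident.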

#### Logistic Regression

In [20]:
# The following model is a logistic regression model (i.e. a binary classifier) that will predict whether the accident severity is level 3 or not.
steps = [
    ('scaler', StandardScaler()), # Transformation step
    ('model', LogisticRegression(solver = 'saga')) # Model, solver = 'saga'
]
pipeline = Pipeline(steps) # create a pipeline
pipeline.fit(X_train, y_train) # fit the model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Calculate cross-validation scores on the training set
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy')
print('Cross-validation accuracy:',total_test_scores)
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='precision')
print('Cross-validation precision:', total_test_scores)
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='recall')
print('Cross-validation recall:', total_test_scores)


NameError: name 'X_train' is not defined

#### Decision Tree Classifier

In [21]:
dec_steps = [
    ('scaler', StandardScaler()),  # Transformation step
    ('model', DecisionTreeClassifier())  # decision tree model
]

dec_pipeline = Pipeline(dec_steps) # create a pipeline from the decision tree steps (not the logistic regression steps)
dec_pipeline.fit(X_train, y_train) # fit the model

y_pred = dec_pipeline.predict(X_test) # make predictions and calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Calculate cross-validation scores on the training set, all with the decision tree pipeline
total_test_scores = cross_val_score(dec_pipeline, X_train, y_train, cv=3, scoring='accuracy')
print('Cross-validation accuracy:', total_test_scores)
total_test_scores = cross_val_score(dec_pipeline, X_train, y_train, cv=3, scoring='precision')
print('Cross-validation precision:', total_test_scores)
total_test_scores = cross_val_score(dec_pipeline, X_train, y_train, cv=3, scoring='recall')
print('Cross-validation recall:', total_test_scores)


NameError: name 'X_train' is not defined

#### Support Vector Machine (SVM)

In [22]:
# Define the steps for the pipeline
steps = [
    ('scaler', StandardScaler()),  # Transformation step
    ('model', SVC(kernel='rbf'))  # Model with RBF kernel
]

# Create the pipeline
svm_pipeline = Pipeline(steps)

# Fit the model
svm_pipeline.fit(X_train, y_train)

# Predict the test set
y_pred = svm_pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Calculate cross-validation scores on the training set, all with the SVM pipeline
total_test_scores = cross_val_score(svm_pipeline, X_train, y_train, cv=3, scoring='accuracy')
print('Cross-validation accuracy:', total_test_scores)
total_test_scores = cross_val_score(svm_pipeline, X_train, y_train, cv=3, scoring='precision')
print('Cross-validation precision:', total_test_scores)
total_test_scores = cross_val_score(svm_pipeline, X_train, y_train, cv=3, scoring='recall')
print('Cross-validation recall:', total_test_scores)


NameError: name 'X_train' is not defined

#### K-Neighbors Classifier

In [23]:
steps = [
    ('scaler', StandardScaler()),  # Transformation step
    ('model', KNeighborsClassifier(n_neighbors=5))  # Model with 5 neighbors
]

pipeline = Pipeline(steps)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Calculate cross-validation scores on the training set
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy')
print('Cross-validation accuracy:', total_test_scores)
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='precision')
print('Cross-validation precision:', total_test_scores)
total_test_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='recall')
print('Cross-validation recall:', total_test_scores)


NameError: name 'X_train' is not defined

## Model Evaluation

After training your models, it's crucial to evaluate their performance to understand their effectiveness and limitations. This section outlines various techniques and metrics to assess the performance of each model you have implemented.

### Evaluation Techniques
1. **Confusion Matrix**

2. **Accuracy**

3. **Precision and Recall**

4. **F1 Score**

5. **ROC Curve and AUC**

### Implementing Evaluation
- Calculate the metrics listed above using your test data.
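A minimal sketch of computing the listed metrics on a held-out test set, using synthetic data and a logistic regression as a stand-in for any of the fitted models. The variable names are assumptions. `zero_division=0` is a scikit-learn option that returns 0 instead of raising a warning when a model predicts no positive cases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the RTA dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)                # hard class predictions
y_score = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

cm = confusion_matrix(y_te, y_pred)       # rows: true class, columns: predicted class
metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
    "f1": f1_score(y_te, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_te, y_score),  # needs scores, not hard labels
}
print(cm)
print(metrics)
```

On imbalanced targets, accuracy alone can look high even when recall for the rare class is zero, so the confusion matrix is worth reading alongside it.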

## Project Questions:

### Comparative Analysis

- **Compare Metrics**: Examine the performance metrics (such as accuracy, precision, and recall) of each model. Document your observations on which model performs best for your dataset and the problem you're addressing.
- **Evaluate Trade-offs**: Discuss the trade-offs you encountered when choosing between models. Consider factors like computational efficiency, ease of implementation, and model interpretability.
- **Justify Your Choice**: After comparing and evaluating, explain why you believe one model is the best choice. Provide a clear rationale based on the performance metrics and trade-offs discussed.
- **Feature Importance**: Identify and discuss the most important features for the best-performing model. How do these features impact the predictions? Use the visualizations you have created to justify your answer if necessary.
- **Model Limitations**: Discuss any limitations you encountered with the models you used. Are there any aspects of the data or the problem that these models do not handle well?
- **Future Improvements**: Suggest potential improvements or further steps you could take to enhance model performance. This could include trying different algorithms, feature engineering techniques, or tuning hyperparameters.

### Answer Here:

Sorry for any typos.

1. Unfortunately, I ran into issues with the precision and recall scores for the logistic regression, decision tree, and SVC models, and I did not have enough time to identify the cause.

2. Each model has its own strengths and typical uses, so it is hard to compare trade-offs between the models for this dataset.

3. At first I thought logistic regression would be the best fit for predicting whether a crash was fatal or not. However, since the models should be judged as they currently stand, KNN comes out on top because no errors occur when calculating its recall metric.

4. Using the chi-square feature selector, the algorithm pointed out the best features, and upon rechecking the heatmap for this dataset, the algorithm did indeed select the best possible features for this classification problem.

5. Each model serves as a unique problem solver.
6. Fixing these issues and preprocessing the data better would without a doubt make the models far superior to their current form.





  



