**Predicting Severity:** Predicting the severity of accidents is particularly important because it allows for timely and appropriate responses. By assessing accident severity, responders can allocate resources, prioritize medical treatment, and dispatch appropriate personnel. Predictive models can take into account various factors such as road conditions, weather, vehicle type, and collision type to estimate the likelihood of severe outcomes, helping improve emergency response and medical care.

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected in real time for the contiguous United States from February 2016 to March 2023, using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, including the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains approximately 7.7 million accident records.

**Description of Each Column**:
*The data comprise 46 columns and 7,728,394 rows.*

**ID**: This is a unique identifier of the accident record.

**Source**: The source of the raw accident data.

**Severity**: Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).

**Start_Time**: Shows the start time of the accident in the local time zone.

**End_Time**: Shows the end time of the accident in the local time zone. End time here refers to the time at which the impact of the accident on traffic flow was dismissed.

**Start_Lat**: Shows the latitude of the start point in GPS coordinates.

**Start_Lng**: Shows the longitude of the start point in GPS coordinates.

**End_Lat**: Shows the latitude of the end point in GPS coordinates.

**End_Lng**: Shows the longitude of the end point in GPS coordinates.

**Distance(mi)**: The length of the road extent affected by the accident in miles.

**Description**: Shows a human-provided description of the accident.

**Street**: Shows the street name in the address field.

**City**: Shows the city in the address field.

**County**: Shows the county in the address field.

**State**: Shows the state in the address field.

**Zipcode**: Shows the zip code in the address field.

**Country**: Shows the country in the address field.

**Timezone**: Shows timezone based on the location of the accident (eastern, central, etc.).

**Airport_Code**: Denotes the airport-based weather station closest to the location of the accident.

**Weather_Timestamp**: Shows the timestamp of the weather observation record (in local time).

**Temperature(F)**: Shows the temperature (in Fahrenheit).

**Wind_Chill(F)**: Shows the wind chill (in Fahrenheit).

**Humidity(%)**: Shows the humidity (in percentage).

**Pressure(in)**: Shows the air pressure (in inches).

**Visibility(mi)**: Shows visibility (in miles).

**Wind_Direction**: Shows wind direction.

**Wind_Speed(mph)**: Shows wind speed (in miles per hour).

**Precipitation(in)**: Shows precipitation amount in inches, if there is any.

**Weather_Condition**: Shows the weather condition (rain, snow, thunderstorm, fog, etc.).

**Amenity**: A POI (point of interest) annotation which indicates the presence of an amenity in a nearby location.

**Bump**: A POI annotation which indicates presence of speed bump or hump in a nearby location.

**Crossing**: A POI annotation which indicates presence of crossing in a nearby location.

**Give_Way**: A POI annotation which indicates presence of give_way in a nearby location.

**Junction**: A POI annotation which indicates presence of junction in a nearby location.

**No_Exit**: A POI annotation which indicates presence of no_exit in a nearby location.

**Railway**: A POI annotation which indicates presence of railway in a nearby location.

**Roundabout**: A POI annotation which indicates presence of roundabout in a nearby location.

**Station**: A POI annotation which indicates presence of station in a nearby location.

**Stop**: A POI annotation which indicates presence of stop in a nearby location.

**Traffic_Calming**: A POI annotation which indicates presence of traffic_calming in a nearby location.

**Traffic_Signal**: A POI annotation which indicates presence of traffic_signal in a nearby location.

**Turning_Loop**: A POI annotation which indicates presence of turning_loop in a nearby location.

**Sunrise_Sunset**: Shows the period of day (i.e. day or night) based on sunrise/sunset.

**Civil_Twilight**: Shows the period of day (i.e. day or night) based on civil twilight.

**Nautical_Twilight**: Shows the period of day (i.e. day or night) based on nautical twilight.

**Astronomical_Twilight**: Shows the period of day (i.e. day or night) based on astronomical twilight.
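As a rough illustration of how the time and coordinate columns described above can be turned into numeric features, the sketch below derives an accident duration from `Start_Time` and `End_Time`, and a straight-line extent from the start and end coordinates using the haversine formula. The two rows are made up for the example and are not real records; `Duration_min`, `Extent_mi`, and `haversine_miles` are hypothetical names that do not exist in the dataset or the notebook.

```python
from math import radians, sin, cos, sqrt, atan2

import pandas as pd

# Two made-up rows that mimic the documented columns (not real records).
toy = pd.DataFrame({
    "Start_Time": ["2020-01-01 08:00:00", "2020-01-01 17:30:00"],
    "End_Time": ["2020-01-01 08:45:00", "2020-01-01 19:00:00"],
    "Start_Lat": [40.0, 34.0],
    "Start_Lng": [-83.0, -118.0],
    "End_Lat": [40.0, 34.01],
    "End_Lng": [-83.0, -118.0],
})

# Parse the timestamps and compute the duration in minutes.
toy["Start_Time"] = pd.to_datetime(toy["Start_Time"])
toy["End_Time"] = pd.to_datetime(toy["End_Time"])
toy["Duration_min"] = (toy["End_Time"] - toy["Start_Time"]).dt.total_seconds() / 60


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    earth_radius_mi = 3958.8  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_mi * atan2(sqrt(a), sqrt(1 - a))


# Straight-line distance between the start and end points of each record.
toy["Extent_mi"] = [
    haversine_miles(lat1, lon1, lat2, lon2)
    for lat1, lon1, lat2, lon2 in zip(
        toy["Start_Lat"], toy["Start_Lng"], toy["End_Lat"], toy["End_Lng"]
    )
]

print(toy[["Duration_min", "Extent_mi"]])
```

The straight-line extent is not the same quantity as `Distance(mi)`, which measures the affected road length, so the two values should only be expected to agree loosely.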

#### Libraries

In [1]:
import nltk
import os
import re
import string
import pandas as pd
import numpy as np
import requests
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import h2o
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from math import radians, sin, cos, sqrt, atan2
from shapely.geometry import Polygon
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# from evalml.automl import AutoMLSearch
# from pycaret.classification import setup, compare_models, predict_model

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from statsmodels.tools.tools import add_constant

# from evalml.objectives import get_optimization_objectives
# from evalml.problem_types import ProblemTypes

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from h2o.automl import H2OAutoML
from tpot import TPOTClassifier

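# Download NLTK's stop-word lists.
nltk.download("stopwords")
stop_words = stopwords.words("english")
# Drop the last 36 entries of the English stop-word list.
new_stopping_words = stop_words[:len(stop_words) - 36]
# Remove "not" from the list so that it is no longer treated as a stop word.
new_stopping_words.remove("not")
# Download the tokenizer models used by word_tokenize.
nltk.download('punkt')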

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USUARIO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
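The effect of removing "not" from the stop-word list can be seen in a small sketch. It assumes a tiny hand-written stop-word list rather than NLTK's English list, and `strip_stop_words` is a hypothetical helper, not a function from the notebook or from NLTK. The presumed motivation, which the notebook does not state, is that negations in accident descriptions carry meaning.

```python
# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
mini_stop_words = ["the", "a", "is", "was", "not"]

# Same idea as new_stopping_words.remove("not"): stop filtering out "not".
kept_negation = [w for w in mini_stop_words if w != "not"]


def strip_stop_words(text, stops):
    """Lower-case the text, split on whitespace, and drop tokens found in `stops`."""
    stops = set(stops)
    return " ".join(tok for tok in text.lower().split() if tok not in stops)


sentence = "The road was not blocked"
with_not_removed = strip_stop_words(sentence, mini_stop_words)
with_not_kept = strip_stop_words(sentence, kept_negation)

print(with_not_removed)  # road blocked
print(with_not_kept)     # road not blocked
```

With the unmodified list, the two opposite statements "road was blocked" and "road was not blocked" collapse to the same tokens.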

In [2]:
# Define the path to the Data folder
Data_path = '../Data'

# Load the data
file_path = os.path.join(Data_path, 'US_Accidents_March23.csv')
df = pd.read_csv(file_path)

# Take a random 10% sample of the data
sample_df = df.sample(frac=0.1, random_state=42)  # random_state ensures reproducibility

# Show the first rows of the sample
print("First rows of the sample (10% of the dataset):")
print(sample_df.head())

FileNotFoundError: [Errno 2] No such file or directory: '../Data\\US_Accidents_March23.csv'
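The error above means the CSV was not found at the relative path, which depends on the directory the notebook is run from. One possible way to make the loading step more robust is sketched below: it checks that the file exists before reading and samples chunk by chunk, so the full 7.7 million rows never need to be in memory at once. `load_sample` and its parameters are hypothetical and not part of the notebook; per-chunk sampling also yields a different (though similarly sized) sample than `df.sample(frac=0.1, random_state=42)` on the full frame. The demo uses a small temporary CSV rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_sample(csv_path, frac=0.1, random_state=42, chunksize=100_000):
    """Read a CSV in chunks and keep a random fraction of each chunk."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"Dataset not found at {csv_path.resolve()}")

    parts = []
    with pd.read_csv(csv_path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # Vary the seed per chunk so each chunk is sampled differently
            # while the overall result stays reproducible.
            parts.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(parts, ignore_index=True)


# Demo on a tiny temporary file (made-up values, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    pd.DataFrame({"ID": range(100), "Severity": [2] * 100}).to_csv(demo_path, index=False)
    demo = load_sample(demo_path, frac=0.1, chunksize=50)

print(len(demo))  # 10 rows: 5 sampled from each of the two 50-row chunks
```

For the notebook itself, the call would presumably look like `load_sample(os.path.join(Data_path, 'US_Accidents_March23.csv'))`, assuming the file has been placed in that folder.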