<h2 align="center"> Data Mining and Machine Learning </h2>
<h3 align="center"> Final Project </h3>
<h2 align="center"> <b> <i> CrashSpot </i> </b> </h2>
<h4 align="center"> Lorenzo Ceccanti matr. 564490 </h4>

### <b> Data Integration </b>

We begin this phase from the cleaned version of the dataset, in which each row describes one accident.

In [2]:
import os
import pandas as pd

df = pd.read_csv(os.path.join('editedDataset', 'CLEANED_brasilEnglishAggr.csv'))
pd.set_option('display.max_columns', None)
df.head(3)

Unnamed: 0,inverse_data,week_day,hour,state,road_id,km,city,cause_of_accident,type_of_accident,victims_condition,weather_timestamp,road_direction,wheather_condition,road_type,road_delineation,people,deaths,slightly_injured,severely_injured,uninjured,unharmed,total_injured,vehicles_involved,latitude,longitude,regional,police_station
0,2017-01-01,sunday,01:45:00,RS,116.0,349,VACARIA,Mechanical loss/defect of vehicle,Rear-end collision,With injured victims,Night,Decreasing,Clear sky,Simple,Straight,6,0,4,0,2,0,4,2,-28.50712,-50.94118,SPRF-RS,DEL05-RS
1,2017-01-01,sunday,01:00:00,PR,376.0,636,TIJUCAS DO SUL,Incompatible velocity,Run-off-road,With dead victims,Night,Increasing,Drizzle,Double,Curve,2,1,0,0,1,0,0,2,-25.754,-49.1266,SPRF-PR,DEL01-PR
2,2017-01-01,sunday,04:40:00,BA,101.0,65,ENTRE RIOS,Driver was sleeping,Head-on collision,With dead victims,Sunrise,Decreasing,Cloudy,Simple,Curve,5,1,1,1,2,0,2,2,-11.9618,-38.0953,SPRF-BA,DEL01-BA


In [3]:
pd.reset_option('display.max_columns')

# Check the coordinates against the very first file.

Let's inspect the dataset in which each row describes one occupant.

This dataset is not in English, so it first needs to be translated.

The data source, however, is the same. Each file covers a different year.

<b> Problem </b>: Importing the `BRASIL_RAW` dataset directly raises a `UnicodeDecodeError`. The files turn out not to be encoded in UTF-8, so we first detect their actual encoding.

In [4]:
import os
import chardet

for i in range(2017,2024):
    with open(os.path.join('dataset/BRASIL_RAW/acidentes', f'acidentes{i}.csv'), 'rb') as f:
        result = chardet.detect(f.read(10000))  # read the first 10k bytes
        print(f'Encoding for acidentes{i}: ' + result['encoding'])

Encoding for acidentes2017: ISO-8859-1
Encoding for acidentes2018: ISO-8859-1
Encoding for acidentes2019: ISO-8859-1
Encoding for acidentes2020: ISO-8859-1
Encoding for acidentes2021: ISO-8859-1
Encoding for acidentes2022: ISO-8859-1
Encoding for acidentes2023: ISO-8859-1
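
Given the detected encoding, the import failure can be reproduced in isolation. The snippet below is only a sketch: it uses a hand-written sample line (an assumption, not a row read from the actual files), encodes it as ISO-8859-1, shows that decoding it as UTF-8 fails, and then re-encodes it the way a converted file would be stored.

```python
# Hypothetical sample line; the real files are not read here.
raw = "1;Não;Automóvel\n".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    # 0xE3 ("ã" in ISO-8859-1) starts a multi-byte UTF-8 sequence,
    # but the next byte is plain ASCII, so the decoding fails.
    decodes_as_utf8 = False

text = raw.decode("iso-8859-1")    # correct decoding of the raw bytes
utf8_bytes = text.encode("utf-8")  # bytes a UTF-8 copy would contain

print(decodes_as_utf8, text.strip(), len(raw), len(utf8_bytes))
```

Accented characters take one byte in ISO-8859-1 and two in UTF-8, which is why the two byte strings differ in length.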


In [5]:
# Convert the original BRASIL_RAW files from ISO-8859-1 to UTF-8

# Run the conversion only if the output directory does not exist yet
out_dir = "editedDataset/UTF_acidentes"
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

    for i in range(2017, 2024):
        src = os.path.join('dataset/BRASIL_RAW/acidentes', f'acidentes{i}.csv')
        dst = os.path.join(out_dir, f'utf_acidentes{i}.csv')
        # The trailing backslash continues the `with` statement on the next line
        with open(src, "r", encoding="iso-8859-1", errors="strict") as fin, \
             open(dst, "w", encoding="utf-8", newline="") as fout:
            for line in fin:
                fout.write(line)

In [6]:
arr_df_full = []
files_to_inspect = ['2017', '2018', '2019', '2020', '2021', '2022', '2023']

for y in files_to_inspect:
    # Step 1: Importing the dataset of year y
    df_full = pd.read_csv(os.path.join('editedDataset/UTF_acidentes', f'utf_acidentes{y}.csv'), sep=";", dtype={22: "string", 23:"string", 25:"string"})

    # Step 2: Translate the column names into English
    # Reuse the names of the first columns of df, up to road_delineation
    en_attrNames_head = (df.loc[:,:'road_delineation'].columns).tolist()
    # We need the previous labels
    df_full_columns = df_full.columns.tolist()

    # With this selection we're sure to substitute only the names in the first part
    df_full_columns[2:17] = en_attrNames_head
    # Writing the translation for the central part of the columns
    df_full_columns[17:26] = ['without_passengers', 'veichle_id', 'veichle_type', 'veichle_brand', 'veichle_manufacturing_year', 'person_kind', 'person_condition', 'person_age', 'person_sex']
    df_full_columns[26:30] = ['person_is_unharmed', 'person_is_slightly_injured', 'person_is_severely_injured', 'person_is_dead']
    df_full_columns[33] = 'police_station'
    # Applying the translation to the DataFrame
    df_full.columns = df_full_columns

    # Step 3: Translating the instances values in English (attribute per attribute)

    without_passengers_map = {
        'Não': 'No',
        'Sim': 'Yes'
    }
    df_full["without_passengers"] = df_full["without_passengers"].replace(without_passengers_map)

    vehicle_type_map = {
        "Automóvel": "Car",
        "Motocicleta": "Motorcycle",
        "Semireboque": "Semi-trailer",
        "Caminhonete": "Pickup truck",
        "Caminhão-trator": "Tractor-trailer truck",
        "Caminhão": "Truck",
        "Ônibus": "Bus",
        "Camioneta": "Van",
        "Motoneta": "Scooter",
        "Utilitário": "Utility vehicle",
        "Bicicleta": "Bicycle",
        "Micro-ônibus": "Minibus",
        "Reboque": "Trailer",
        "Outros": "Others",
        "Ciclomotor": "Moped",
        "Carroça-charrete": "Cart-wagon",
        "Trator de rodas": "Wheeled tractor",
        "Motor-casa": "Motorhome",
        "Triciclo": "Tricycle",
        "Trem-bonde": "Tram",
        "Trator de esteira": "Crawler tractor",
        "Trator misto": "Backhoe loader",
        "Carro de mão": "Wheelbarrow",
        "Chassi-plataforma": "Chassis platform",
        "Quadriciclo": "Quadricycle"
    }
    df_full["veichle_type"] = df_full["veichle_type"].replace(vehicle_type_map)

    df_full['veichle_brand'] = df_full["veichle_brand"].replace({
        "Não Informado/Não Informado": pd.NA,
        "NA/NA": pd.NA
    })

    df_full['veichle_manufacturing_year'] = df_full["veichle_manufacturing_year"].replace(0,pd.NA)

    person_kind_map = {
        'Condutor': 'Driver',
        'Passageiro': 'Passenger',
        'Pedestre': 'Pedestrian',
        'Testemunha': 'Witness',
        'Cavaleiro': 'Horse rider'
    }
    df_full["person_kind"] = df_full["person_kind"].replace(person_kind_map)

    person_sex_map = {
        'Masculino': 'M',
        'Feminino': 'F',
        'Não Informado': pd.NA,
        'Ignorado': pd.NA
    }
    df_full["person_sex"] = df_full["person_sex"].replace(person_sex_map)

    person_condition_map = {
        'Ileso': 'Unharmed',
        'Lesões Leves': 'Slightly Injured',
        'Lesões Graves': 'Severely Injured',
        'Não Informado': pd.NA,
        'Óbito': 'Dead'
    }
    df_full["person_condition"] = df_full["person_condition"].replace(person_condition_map)

    # Step 4: Convert latitude and longitude from object to float64
    # We also round to 5 decimal places so that the join keys match in both tables
    attr_to_conv = ["latitude", "longitude"]
    for attr in attr_to_conv:
        df_full[attr] = df_full[attr].astype(str).str.replace(",", ".").astype(float).round(5)
        df[attr] = df[attr].round(5)

    # Step 5: We sort by inverse_data, hour, city (in place)
    df.sort_values(by=['inverse_data', 'hour', 'city'], inplace=True)
    df_full.sort_values(by=['inverse_data', 'hour', 'city'], inplace=True)

    # Step 6: From df_full we drop the accident-level attributes that are already present in df
    indexes = {0, 2}  # keep inverse_data and hour (positions 0 and 2): they are needed for the join
    en_attrNames_head = [x for i,x in enumerate(en_attrNames_head) if i not in indexes]
    df_full = df_full.drop(columns=en_attrNames_head)

    # Step 7: We join the two tables with different granularity
    df_joined = pd.merge(df, df_full, on=['latitude', 'longitude', 'inverse_data', 'hour'])
    arr_df_full.append(df_joined)

`arr_df_full` is a list of DataFrames, one per year. In each DataFrame, a row is a person (or a witness) involved in an accident, together with the details of that accident.
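
The join logic can be checked on a small scale. The following is a minimal sketch on made-up rows (the `accidents` and `persons` tables and their values are assumptions for illustration). Unlike the inner join used above, it performs a left join with `indicator=True` and `validate="one_to_many"`, which makes it easy to see whether an accident found no matching person rows and whether the accident-level keys are really unique.

```python
import pandas as pd

# Toy accident-level table (one row per accident); values are made up.
accidents = pd.DataFrame({
    "inverse_data": ["2017-01-01", "2017-01-01"],
    "hour": ["01:45:00", "01:00:00"],
    "latitude": [-28.507121, -25.754],
    "longitude": [-50.941179, -49.1266],
    "people": [2, 1],
})

# Toy person-level table (one row per occupant), with decimal commas
# in the coordinates as in the raw files.
persons = pd.DataFrame({
    "inverse_data": ["2017-01-01"] * 3,
    "hour": ["01:45:00", "01:45:00", "01:00:00"],
    "latitude": ["-28,50712", "-28,50712", "-25,754"],
    "longitude": ["-50,94118", "-50,94118", "-49,1266"],
    "person_kind": ["Driver", "Passenger", "Driver"],
})

# Bring both coordinate columns to floats rounded to 5 decimal places.
for col in ["latitude", "longitude"]:
    persons[col] = persons[col].str.replace(",", ".").astype(float).round(5)
    accidents[col] = accidents[col].round(5)

keys = ["latitude", "longitude", "inverse_data", "hour"]
joined = pd.merge(accidents, persons, on=keys, how="left",
                  indicator=True, validate="one_to_many")

# "both" means the accident found at least one person row.
print(joined["_merge"].value_counts())
```

Rows flagged `left_only` would be accidents that the inner join silently drops, so counting them gives a quick measure of how much data the key matching loses.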

Now, let's produce the file `INTEGRATION_brasilEnglishFull.csv`, which is the result of the join operation we've performed so far.

In [10]:
import os
out_dir = 'editedDataset'
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

file_path = os.path.join(out_dir, 'INTEGRATION_brasilEnglishFull.csv')
if not os.path.exists(file_path):
    df_all = pd.concat(arr_df_full, ignore_index=True)
    df_all.to_csv(file_path, index=False, encoding='utf-8')

### Road feature integration

Now, we'll use the Overpass API to gather some additional details about the road.
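
Before querying the live service, the parsing step can be tried offline. The sketch below is an illustration under assumptions: `build_query` and `extract_road_tags` are hypothetical helper names, and `sample` is a hand-written dictionary that only mimics the shape of an Overpass JSON reply (an `elements` list whose items may carry a `tags` dictionary).

```python
# Tags of interest; everything else returned by Overpass is discarded.
KEEP = {"maxspeed", "lanes", "name", "operator", "toll", "surface", "oneway"}

def build_query(latitude, longitude, radius_m=5, timeout_s=10):
    """Overpass QL: highway ways within radius_m metres of the point, tags only."""
    return (
        f"[out:json][timeout:{timeout_s}];"
        f"way(around:{radius_m},{latitude},{longitude})[highway];"
        "out tags;"
    )

def extract_road_tags(data, latitude, longitude):
    """Return one dict per tagged way, restricted to the tags in KEEP."""
    rows = []
    for elem in data.get("elements", []):
        tags = elem.get("tags")
        if not tags:
            continue  # ways without tags carry no road information
        row = {k: v for k, v in tags.items() if k in KEEP}
        row["latitude"] = latitude
        row["longitude"] = longitude
        rows.append(row)
    return rows

# Hand-written reply with made-up values: one tagged way, one untagged way.
sample = {"elements": [
    {"type": "way", "id": 1,
     "tags": {"highway": "trunk", "maxspeed": "80", "lanes": "2"}},
    {"type": "way", "id": 2},
]}

rows = extract_road_tags(sample, -28.50712, -50.94118)
print(build_query(-28.50712, -50.94118))
print(rows)
```

Keeping the parsing in a function that takes the decoded JSON makes it possible to test it without network access and without hitting the rate limits of the public endpoint.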

In [48]:
start_indexes = [0,100000]
end_indexes = [100,100100]

In [49]:
import numpy as np
import requests
from time import sleep

url = "http://overpass-api.de/api/interpreter"
overpass_dfs = []

for start_index, end_index in zip(start_indexes, end_indexes):
    # Unique coordinate pairs in the current slice of df
    df_coords_unique = df.iloc[start_index:end_index+1].loc[:,'latitude':'longitude'].drop_duplicates()

    tags = []
    for index, row in df_coords_unique.iterrows():
        latitude = row['latitude']
        longitude = row['longitude']

        # Highway ways within 5 m of the accident coordinates (tags only)
        query = f"""
        [out:json][timeout:10];
        way(around:5,{latitude},{longitude})[highway];
        out tags;
        """

        # Retry loop
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = requests.get(url, params={'data': query}, timeout=30)
                if response.status_code == 200:
                    break
                else:
                    print(f"Attempt {attempt+1}/{max_retries} failed with status {response.status_code}")
            except requests.exceptions.RequestException as e:
                print(f"Attempt {attempt+1}/{max_retries} failed with error: {e}")
            sleep(2)
        else:
            # Leaving the loop without a break means that every attempt failed
            print(f"All attempts failed at {index}, skipping...")
            continue

        print(f"Processing index: {index}")
        data = response.json()
        # Keep only: maxspeed, lanes, name, operator, toll, surface, oneway
        keep = {"maxspeed", "lanes", "name", "operator", "toll", "surface", "oneway"}
        for elem in data.get("elements", []):
            if elem.get("tags"):
                # Dict comprehension that filters the tags of interest
                road_tags = {k: v for k, v in elem["tags"].items() if k in keep}
                road_tags['latitude'] = latitude
                road_tags['longitude'] = longitude
                tags.append(road_tags)

    overpass_df = pd.DataFrame(tags)
    overpass_dfs.append(overpass_df)

Processing index: 24622
Processing index: 24623
Processing index: 24629
Processing index: 24620
Processing index: 89098
Processing index: 25311
Processing index: 24674
Processing index: 24625
Processing index: 24621
Processing index: 24671
Processing index: 24626
Processing index: 24634
Processing index: 24643
Processing index: 24628
Processing index: 26949
Processing index: 24633
Processing index: 24624
Processing index: 25213
Processing index: 24664
Processing index: 31823
Processing index: 1
Processing index: 24630
Processing index: 24635
Processing index: 24637
Processing index: 24642
Processing index: 24638
Processing index: 24631
Processing index: 24627
Processing index: 24679
Processing index: 0
Processing index: 24669
Processing index: 24632
Processing index: 24636
Processing index: 1262
Processing index: 24644
Processing index: 24639
Processing index: 24641
Processing index: 48780
Processing index: 24640
Processing index: 24646
Processing index: 24656
Processing index: 24657
...

In [50]:
overpass_dfs[0].shape[0]

67

In [51]:
overpass_dfs[1].shape[0]

62

In [52]:
overpass_dfs_clean = []
for overpass_df in overpass_dfs:
    overpass_df_clean = overpass_df.dropna(subset=['lanes', 'oneway'])
    overpass_dfs_clean.append(overpass_df_clean)

In [53]:
overpass_dfs_clean[0].shape[0]

26

In [54]:
overpass_dfs_clean[1].shape[0]

29

In both samples, starting from about 100 accidents, Overpass returned 67 and 62 road segments, of which only 26 and 29 have both `lanes` and `oneway` filled in. Given this low coverage of the road features (speed limit, number of lanes and so on), I decided not to consider this aspect in the project.
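
The coverage of each tag can also be measured directly instead of only counting the surviving rows. The snippet below is a sketch on a made-up table shaped like the Overpass result (`toy` and its values are assumptions, not the actual query output): `notna().mean()` gives the share of non-missing values per column, and `dropna` counts the rows where both `lanes` and `oneway` are present.

```python
import pandas as pd

# Made-up rows shaped like the Overpass result (one row per road segment).
toy = pd.DataFrame([
    {"name": "BR-116", "maxspeed": "80", "lanes": "2", "oneway": "yes"},
    {"name": "BR-376", "maxspeed": None, "lanes": None, "oneway": "no"},
    {"name": None, "maxspeed": None, "lanes": "1", "oneway": None},
    {"name": "BR-101", "maxspeed": None, "lanes": None, "oneway": None},
])

# Share of non-missing values for every tag.
coverage = toy.notna().mean()

# Number of segments with both lanes and oneway available.
complete = toy.dropna(subset=["lanes", "oneway"]).shape[0]

print(coverage.round(2).to_dict(), complete)
```

A per-tag view like this shows whether the low coverage comes from one sparse tag or from all of them.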