<h2 align="center"> Data Mining and Machine Learning </h2>
<h3 align="center"> Final Project </h3>
<h2 align="center"> <b> <i> CrashSpot </i> </b> </h2>
<h4 align="center"> Lorenzo Ceccanti matr. 564490 </h4>

### <b> Data Cleaning </b>

In the _dataExploration_ notebook, we saw that, among the 463152 entries, only the `police_station` attribute has a large number of null values.

However, let's check the whole dataset.

In [25]:
import os
import pandas as pd
df_clean = pd.read_csv(os.path.join('dataset/BRASIL_AGGR', 'accidents_2017_to_2023_english_BRASIL.csv'))
df_clean.rename(columns={"ignored": "unharmed"}, inplace=True)

total_tuples = df_clean.shape[0]

In [26]:
df_clean.isna().sum()

inverse_data             0
week_day                 0
hour                     0
state                    0
road_id                990
km                     990
city                     0
cause_of_accident        0
type_of_accident         0
victims_condition        0
weather_timestamp        0
road_direction           0
wheather_condition       0
road_type                0
road_delineation         0
people                   0
deaths                   0
slightly_injured         0
severely_injured         0
uninjured                0
unharmed                 0
total_injured            0
vehicles_involved        0
latitude                 0
longitude                0
regional                10
police_station        1310
dtype: int64
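
The absolute counts above are easier to compare when expressed as a share of the rows. A minimal sketch on a toy frame (the `toy` name and its values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_clean (values are illustrative only)
toy = pd.DataFrame({
    "road_id": [101.0, np.nan, 116.0, 381.0],
    "km": [12.5, np.nan, 3.0, 40.2],
    "police_station": ["DEL01", None, None, "DEL04"],
})

# isna().mean() is the fraction of nulls per column; multiply by 100 for a percentage
null_share = toy.isna().mean() * 100
print(null_share.round(2))
```

Applied to `df_clean`, the same expression would give the per-column percentage of missing values in one call.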

In [27]:
tuple_to_drop = df_clean['police_station'].isna().sum()
percentage = (tuple_to_drop / total_tuples) * 100
print(f'Total number of instances: {total_tuples}')
print(f'Number of instances lost: {tuple_to_drop}')
print(f'Estimation of the percentage of instances lost on the entire dataset: {percentage:.2f}%')


Total number of instances: 463152
Number of instances lost: 1310
Estimation of the percentage of instances lost on the entire dataset: 0.28%


Because only 0.28% of the instances would be lost, we decide to remove the instances in which `police_station` is null.

For the same reason, we also remove the instances in which `regional`, `road_id` or `km` is null.

In [28]:
# The subset parameter restricts the null check to the listed columns.
df_clean = df_clean.dropna(subset=['police_station', 'regional', 'road_id', 'km'])
print(f'Number of instances dropped: {total_tuples - df_clean.shape[0]}')
eff_percentage = ((total_tuples - df_clean.shape[0])/total_tuples)*100
print(f'Real percentage of instances lost on the total number of instances available: {eff_percentage:.2f}%')

Number of instances dropped: 2223
Real percentage of instances lost on the total number of instances available: 0.48%
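
The number of dropped instances (2223) is higher than the `police_station` count alone (1310) but lower than the sum of the four per-column null counts (3300), because a row with nulls in several columns is dropped only once. A minimal sketch on toy data (the `toy` frame and its values are illustrative assumptions) shows how to count the affected rows before dropping them:

```python
import numpy as np
import pandas as pd

# Toy frame: row 3 is null in all four columns, rows 1 and 2 in one column each
toy = pd.DataFrame({
    "police_station": ["A", None, "C", None, "E"],
    "regional": ["R1", "R2", None, None, "R5"],
    "road_id": [101.0, 116.0, 381.0, np.nan, 40.0],
    "km": [1.0, 2.0, 3.0, np.nan, 5.0],
})
cols = ["police_station", "regional", "road_id", "km"]

# Null count per column: a row can contribute to more than one column
per_column_nulls = toy[cols].isna().sum()

# Rows with at least one null among the selected columns
rows_with_any_null = toy[cols].isna().any(axis=1).sum()

# dropna(subset=...) removes exactly those rows
dropped = len(toy) - len(toy.dropna(subset=cols))

print(per_column_nulls.sum(), rows_with_any_null, dropped)
```

Using `isna().any(axis=1)` on the real columns would give the exact percentage of instances lost before calling `dropna`.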


Previously, during data exploration, we saw that 40 instances had _Not informed_ as the value of `type_of_accident`.

Let's check whether they are still present in the cleaned version of the dataset.

In [29]:
import numpy as np
df_tmp = df_clean.replace('Not informed', np.nan)
df_tmp.loc[:,'type_of_accident'].isna().sum()

40

Yes, they are still there. We proceed to remove these instances from the dataset.

In [30]:
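# Filtering returns a new DataFrame, so the result must be assigned back to df_clean.
# Only type_of_accident is inspected, so 'Not informed' in other columns is left untouched.
df_clean = df_clean[df_clean['type_of_accident'] != 'Not informed']
# Count the rows that still carry the placeholder value
(df_clean['type_of_accident'] == 'Not informed').sum()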

0

### Counting the records with out-of-range values for latitude and longitude

A latitude value is valid only if it lies between -90 and +90, while a longitude value is valid only if it lies between -180 and +180.

In [31]:
df_invalid = df_clean[
    (df_clean['latitude'] > 90) | (df_clean['latitude'] < -90) |
    (df_clean['longitude'] > 180) | (df_clean['longitude'] < -180)
]
tuple_to_drop = df_invalid.shape[0]
tuple_to_drop

32
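
The same validity rule can also be written with `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on toy data (the `toy` frame and its coordinates are illustrative assumptions). Negating the valid mask also flags missing coordinates, which the explicit comparisons would not do because comparisons with NaN evaluate to False:

```python
import numpy as np
import pandas as pd

# Toy coordinates: row 0 is valid, rows 1 and 2 are out of range, row 3 has a missing latitude
toy = pd.DataFrame({
    "latitude": [-23.55, 95.0, -15.78, np.nan],
    "longitude": [-46.63, -47.9, -190.0, -43.2],
})

# A row is valid only when both coordinates fall inside their ranges
valid = toy["latitude"].between(-90, 90) & toy["longitude"].between(-180, 180)

# Everything that is not valid, including rows with NaN coordinates
invalid_rows = toy[~valid]
print(invalid_rows.index.tolist())
```

In this dataset `latitude` and `longitude` contain no null values, so both formulations would select the same rows.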

Since only 32 instances are affected by this problem, we simply remove them from the dataset.

In [32]:
old_instances_count = df_clean.shape[0]
# df_invalid keeps the original index labels of df_clean, so they can be passed to drop()
df_clean = df_clean.drop(df_invalid.index)
new_instances_count = df_clean.shape[0]
print(f'Removed {old_instances_count - new_instances_count} instances')

Removed 32 instances


Let's write the cleaned version of BRASIL_AGGR to a file.

In [33]:
import os
out_dir = 'editedDataset'
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(out_dir, exist_ok=True)
file_path = os.path.join(out_dir, 'CLEANED_brasilEnglishAggr.csv')
df_clean.to_csv(file_path, index=False, encoding='utf-8')
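
After writing, it can be worth reading the file back to confirm that shape, column names and non-ASCII text survive the round trip. A minimal sketch on toy data, written to a temporary directory (the `toy` frame, its values and the file name `toy_cleaned.csv` are illustrative assumptions):

```python
import os
import tempfile

import pandas as pd

# Toy frame with accented city names to exercise the UTF-8 encoding
toy = pd.DataFrame({"city": ["São Paulo", "Brasília"], "deaths": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above
    toy.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

# The reloaded frame should match the original in shape and column names
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

The same comparison could be run on `df_clean` and the file stored at `file_path`.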