# Master's in Integrated Decision Support Systems

### Group Project - Development of a Scientific Poster on Data-Driven Decision Making

#### Group 5
##### Members:
##### Pedro Conceição - Student no. 129188
##### Ricardo Mororó - Student no. 94562

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("Road.csv", index_col=False)

### Choosing the variables for the analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

#### Variables describing the driver of the vehicle involved in each accident were selected, together with the accident location and the causes attributed to it. Accident severity was chosen as the target variable for applying ML techniques.

In [4]:
df_new = df[["Time", "Day_of_week", "Age_band_of_driver", "Driving_experience", "Type_of_vehicle", "Area_accident_occured",
            "Lanes_or_Medians", "Types_of_Junction", "Cause_of_accident", "Accident_severity"]]

In [5]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Time                   12316 non-null  object
 1   Day_of_week            12316 non-null  object
 2   Age_band_of_driver     12316 non-null  object
 3   Driving_experience     11487 non-null  object
 4   Type_of_vehicle        11366 non-null  object
 5   Area_accident_occured  12077 non-null  object
 6   Lanes_or_Medians       11931 non-null  object
 7   Types_of_Junction      11429 non-null  object
 8   Cause_of_accident      12316 non-null  object
 9   Accident_severity      12316 non-null  object
dtypes: object(10)
memory usage: 962.3+ KB


In [6]:
df_new.nunique()

Time                     1074
Day_of_week                 7
Age_band_of_driver          5
Driving_experience          7
Type_of_vehicle            17
Area_accident_occured      14
Lanes_or_Medians            7
Types_of_Junction           8
Cause_of_accident          20
Accident_severity           3
dtype: int64

### Handling categorical variables

#### Variable Type_of_vehicle

In [7]:
df_new['Type_of_vehicle'].unique()

array(['Automobile', 'Public (> 45 seats)', 'Lorry (41?100Q)', nan,
       'Public (13?45 seats)', 'Lorry (11?40Q)', 'Long lorry',
       'Public (12 seats)', 'Taxi', 'Pick up upto 10Q', 'Stationwagen',
       'Ridden horse', 'Other', 'Bajaj', 'Turbo', 'Motorcycle',
       'Special vehicle', 'Bicycle'], dtype=object)

In [8]:
# Recode the vehicle type to reduce the number of classes
category_mapping = {
    'Automobile': 'Car',
    'Public (> 45 seats)': 'Public Transport',
    'Public (13?45 seats)': 'Public Transport',
    'Public (12 seats)': 'Public Transport',
    'Lorry (41?100Q)': 'Truck',
    'Lorry (11?40Q)': 'Truck',
    'Long lorry': 'Truck',
    'Taxi': 'Public Transport',
    'Pick up upto 10Q': 'Truck',
    'Stationwagen': 'Car',
    'Ridden horse': 'Others',
    'Other': 'Others',
    'Bajaj': 'Motorcycle',
    'Turbo': 'Car',
    'Motorcycle': 'Motorcycle',
    'Special vehicle': 'Others',
    'Bicycle': 'Bicycle'
}

# Define a function that recodes a raw vehicle type
def categorize_vehicle_type(vehicle_type):
    """Map a raw vehicle type to its grouped category using category_mapping."""
    return category_mapping.get(vehicle_type, 'Unknown')  # Values not in the mapping, including NaN, become 'Unknown'

# Apply the function to the DataFrame
df_new['Type_of_vehicle'] = df_new['Type_of_vehicle'].apply(categorize_vehicle_type)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['Type_of_vehicle'] = df_new['Type_of_vehicle'].apply(categorize_vehicle_type)
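The warning above appears because `df_new` was built by selecting columns from `df`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` at selection time is acceptable. The frame `raw`, the name `subset`, and all values below are invented for illustration and are not the notebook's data:

```python
import pandas as pd

# Toy stand-in for the accident data (hypothetical values)
raw = pd.DataFrame({
    "Type_of_vehicle": ["Automobile", "Taxi", None],
    "Accident_severity": ["Slight Injury", "Serious Injury", "Slight Injury"],
    "Extra": [1, 2, 3],
})

# .copy() makes the selection an independent DataFrame, so later column
# assignments are unambiguous and leave `raw` untouched
subset = raw[["Type_of_vehicle", "Accident_severity"]].copy()

mapping = {"Automobile": "Car", "Taxi": "Public Transport"}

# Series.map yields NaN for values absent from the mapping;
# fillna then supplies the fallback label
subset["Type_of_vehicle"] = subset["Type_of_vehicle"].map(mapping).fillna("Unknown")

print(subset["Type_of_vehicle"].tolist())
```

Using `Series.map` followed by `fillna` is an alternative to `apply` with a helper function; for a plain dictionary lookup with a default, both give the same labels.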


In [9]:
df_new.nunique()

Time                     1074
Day_of_week                 7
Age_band_of_driver          5
Driving_experience          7
Type_of_vehicle             7
Area_accident_occured      14
Lanes_or_Medians            7
Types_of_Junction           8
Cause_of_accident          20
Accident_severity           3
dtype: int64

#### Variable Area_accident_occured

In [10]:
df_new['Area_accident_occured'].unique()

array(['Residential areas', 'Office areas', '  Recreational areas',
       ' Industrial areas', nan, 'Other', ' Church areas',
       '  Market areas', 'Unknown', 'Rural village areas',
       ' Outside rural areas', ' Hospital areas', 'School areas',
       'Rural village areasOffice areas', 'Recreational areas'],
      dtype=object)
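Several of the raw labels above differ only by leading spaces, for example `'  Recreational areas'` and `'Recreational areas'`. A possible sketch, assuming it is acceptable to normalise whitespace before recoding so that each label needs only one dictionary key. The `areas` Series is a small hand-made sample of the values shown above:

```python
import pandas as pd

# Hand-made sample of raw area labels, including a missing value
areas = pd.Series([
    'Residential areas',
    '  Recreational areas',
    ' Church areas',
    'Recreational areas',
    None,
])

# str.strip removes leading and trailing whitespace; missing values stay missing
cleaned = areas.str.strip()

print(cleaned.dropna().unique().tolist())
```

Stripping does not repair fused labels such as `'Rural village areasOffice areas'`, which would still need an explicit entry in the mapping.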

In [11]:
# Recode the accident area to reduce the number of classes
category_mapping2 = {
    'Residential areas': 'Private',
    'Office areas': 'Business',
    '  Recreational areas': 'Public',
    ' Industrial areas': 'Business',
    'Other': 'Other',
    ' Church areas': 'Public',
    '  Market areas': 'Public',
    'Unknown': 'Other',
    'Rural village areas': 'Rural',
    ' Outside rural areas': 'Rural',
    ' Hospital areas': 'Public',
    'School areas': 'Public',
    'Rural village areasOffice areas': 'Rural',
    'Recreational areas': 'Public'
}

# Define a function that recodes a raw area type
def categorize_area_type(area_type):
    """Map a raw area type to its grouped category using category_mapping2."""
    return category_mapping2.get(area_type, 'Other')  # Values not in the mapping, including NaN, become 'Other'

# Apply the function to the DataFrame
df_new['Area_accident_occured'] = df_new['Area_accident_occured'].apply(categorize_area_type)



In [12]:
df_new['Area_accident_occured'].unique()

array(['Private', 'Business', 'Public', 'Other', 'Rural'], dtype=object)

#### Variable Cause_of_accident

In [13]:
df_new['Cause_of_accident'].unique()

array(['Moving Backward', 'Overtaking', 'Changing lane to the left',
       'Changing lane to the right', 'Overloading', 'Other',
       'No priority to vehicle', 'No priority to pedestrian',
       'No distancing', 'Getting off the vehicle improperly',
       'Improper parking', 'Overspeed', 'Driving carelessly',
       'Driving at high speed', 'Driving to the left', 'Unknown',
       'Overturning', 'Turnover', 'Driving under the influence of drugs',
       'Drunk driving'], dtype=object)

In [14]:
# Recode the accident cause to reduce the number of classes
category_mapping3 = {
    'Overspeed': 'Risky driving',
    'Driving at high speed': 'Risky driving',
    'Driving to the left': 'Risky driving',
    'Driving under the influence of drugs': 'Risky driving',
    'Drunk driving': 'Risky driving',
    'Driving carelessly': 'Risky driving',
    'Moving Backward': 'Directional changes',
    'Changing lane to the left': 'Directional changes',
    'Changing lane to the right': 'Directional changes',
    'Turnover': 'Directional changes',
    'Overtaking': 'Directional changes',
    'No priority to vehicle': 'Priority violation',
    'No priority to pedestrian': 'Priority violation',
    'No distancing': 'Reckless driving',
    'Getting off the vehicle improperly': 'Reckless driving',
    'Improper parking': 'Reckless driving',
    'Overloading': 'Risky driving',
    'Unknown': 'Other',
    'Overturning': 'Reckless driving'
}

# Define a function that recodes a raw accident cause
def categorize_cause_type(cause_type):
    """Map a raw accident cause to its grouped category using category_mapping3."""
    return category_mapping3.get(cause_type, 'Other')  # Values not in the mapping (such as the raw 'Other') become 'Other'

# Apply the function to the DataFrame
df_new['Cause_of_accident'] = df_new['Cause_of_accident'].apply(categorize_cause_type)



In [15]:
df_new['Cause_of_accident'].unique()

array(['Directional changes', 'Risky driving', 'Other',
       'Priority violation', 'Reckless driving'], dtype=object)
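Because the recoding functions fall back to a default label for anything missing from their dictionary, a typo in a key would be absorbed silently into that default. A small sketch of a coverage check that could be run before applying a mapping. The helper `unmapped_values` is a hypothetical name, not part of the notebook, and the sample data is invented:

```python
import pandas as pd


def unmapped_values(series, mapping):
    """Return the sorted distinct non-null values of `series` that have no key in `mapping`."""
    present = series.dropna().unique()
    return sorted(value for value in present if value not in mapping)


# Invented sample: 'Other' is deliberately absent from the mapping
causes = pd.Series(['Overspeed', 'Other', 'Drunk driving', 'Overspeed', None])
mapping = {'Overspeed': 'Risky driving', 'Drunk driving': 'Risky driving'}

missing = unmapped_values(causes, mapping)
print(missing)
```

An empty result means every raw value has an explicit key; anything listed will receive the fallback label, which may or may not be intended.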

### Handling the Time variable

In [16]:
# Convert Time from object dtype to datetime, parsing it as hour:minute:second
df_new['Time'] = pd.to_datetime(df_new['Time'], format='%H:%M:%S', errors='coerce')


# Define the function that assigns each time to a part of the day
def categorize_time(time):
    if 6 <= time.hour < 11:
        return 'Morning'
    elif 11 <= time.hour < 14:
        return 'Afternoon'
    elif 14 <= time.hour < 19:
        return 'Evening'
    elif 19 <= time.hour < 23:
        return 'Night'
    else:
        # Hours 23 and 0-5; a time that failed to parse (NaT) also ends up here
        return 'Dawn'

# Apply the function, overwriting 'Time' with the part-of-day label
df_new['Time'] = df_new['Time'].apply(categorize_time)



In [17]:
df_new['Time'].unique()


array(['Evening', 'Dawn', 'Night', 'Morning', 'Afternoon'], dtype=object)
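With `errors='coerce'`, any time string that does not match the format becomes `NaT`, and `categorize_time` then labels it `'Dawn'` through its final `else` branch. A hedged sketch of a variant that keeps such values missing instead, using the same hour boundaries. The name `categorize_time_safe` is hypothetical and the sample times are invented:

```python
import numpy as np
import pandas as pd


def categorize_time_safe(time):
    """Return the part of the day for a timestamp, or NaN when the time is missing."""
    if pd.isna(time):
        return np.nan  # keep unparseable times missing rather than calling them 'Dawn'
    hour = time.hour
    if 6 <= hour < 11:
        return 'Morning'
    if 11 <= hour < 14:
        return 'Afternoon'
    if 14 <= hour < 19:
        return 'Evening'
    if 19 <= hour < 23:
        return 'Night'
    return 'Dawn'  # hours 23 and 0-5


# Invented sample with one value that cannot be parsed
raw_times = pd.Series(['17:02:00', '01:06:00', 'not a time', '23:30:00', '12:00:00'])
times = pd.to_datetime(raw_times, format='%H:%M:%S', errors='coerce')

parts = times.apply(categorize_time_safe)
print(parts.tolist())
```

Keeping the value missing lets a later imputation step handle it explicitly, at the cost of one more column with missing values to treat.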

### Handling missing values

#### Missing values were imputed with the modal class

In [18]:
# Proportion of missing values in each column of the new DataFrame
df_new.isnull().sum()/len(df_new)

Time                     0.000000
Day_of_week              0.000000
Age_band_of_driver       0.000000
Driving_experience       0.067311
Type_of_vehicle          0.000000
Area_accident_occured    0.000000
Lanes_or_Medians         0.031260
Types_of_Junction        0.072020
Cause_of_accident        0.000000
Accident_severity        0.000000
dtype: float64

In [19]:
# Fill the missing values of each column with that column's most frequent value
df_clean = df_new.apply(lambda x: x.fillna(x.value_counts().index[0]))
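The imputation in the cell above relies on `value_counts()` sorting by descending frequency and excluding missing values by default, so `index[0]` is the modal class. A toy illustration on an invented two-column frame, compared with `Series.mode()`, which gives the same result whenever a column has a single most frequent value:

```python
import pandas as pd

# Invented sample with one missing value per column
toy = pd.DataFrame({
    'Driving_experience': ['5-10yr', None, '5-10yr', '1-2yr'],
    'Lanes_or_Medians': ['Undivided Highway', 'Undivided Highway', None, 'other'],
})

# Same expression as in the notebook: first index of value_counts is the mode
filled = toy.apply(lambda x: x.fillna(x.value_counts().index[0]))

# Equivalent formulation using Series.mode()
filled_mode = toy.apply(lambda x: x.fillna(x.mode().iloc[0]))

print(filled)
```

When two classes tie for most frequent, the two formulations may pick different values, so the choice matters only in that edge case.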

In [20]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Time                   12316 non-null  object
 1   Day_of_week            12316 non-null  object
 2   Age_band_of_driver     12316 non-null  object
 3   Driving_experience     12316 non-null  object
 4   Type_of_vehicle        12316 non-null  object
 5   Area_accident_occured  12316 non-null  object
 6   Lanes_or_Medians       12316 non-null  object
 7   Types_of_Junction      12316 non-null  object
 8   Cause_of_accident      12316 non-null  object
 9   Accident_severity      12316 non-null  object
dtypes: object(10)
memory usage: 962.3+ KB


In [21]:
df_clean.to_csv("df_road.csv", index=False)

In [22]:
df_clean.columns.tolist()

['Time',
 'Day_of_week',
 'Age_band_of_driver',
 'Driving_experience',
 'Type_of_vehicle',
 'Area_accident_occured',
 'Lanes_or_Medians',
 'Types_of_Junction',
 'Cause_of_accident',
 'Accident_severity']

In [23]:
del df

In [24]:
del df_new