# Ponderada 3 (Graded Assignment 3)

## Building and deploying a prediction model using data on highway crashes in Brazil from 2010 to 2023

### Data Analysis and Modeling with PyCaret

This notebook performs data analysis and modeling with the PyCaret library to predict the number of deaths in traffic accidents in Brazil. The steps are:

1. Data extraction
2. Data preprocessing
3. Exploratory analysis
4. Modeling with PyCaret
5. Model evaluation
6. Saving the model and creating an API

## 1. Data Extraction

First, the required libraries are imported and the data is read from the CSV file:


In [191]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from pycaret.regression import *

In [3]:
df = pd.read_csv("Brazil: Total highway crashes 2010 - 2023/Brazil Total highway crashes 2010 - 2023.csv")
df



Unnamed: 0,data,horario,n_da_ocorrencia,tipo_de_ocorrencia,km,trecho,sentido,lugar_acidente,tipo_de_acidente,automovel,...,outros,tracao_animal,transporte_de_cargas_especiais,trator_maquinas,utilitarios,ilesos,levemente_feridos,moderadamente_feridos,gravemente_feridos,mortos
0,01/01/2010,04:21:00,18,sem vítima,167,BR-393/RJ,Norte,Rodovia do Aço,Derrapagem,1,...,,,,,,1.0,0.0,0.0,0.0,0.0
1,01/01/2010,02:13:00,20,sem vítima,2695,BR-116/PR,Sul,Autopista Regis Bittencourt,Colisão Traseira,2.0,...,,,,,,3.0,,,,
2,01/01/2010,03:35:00,000024/2010,sem vítima,77,BR-290/RS,Norte,Concepa,COLISÃO LATERAL,2.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
3,01/01/2010,07:31:00,000038/2010,sem vítima,52,BR-116/RS,Norte,Concepa,QUEDA DE MOTO,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,01/01/2010,04:57:00,000027/2010,sem vítima,33,BR-290/RS,Norte,Concepa,QUEDA DE MOTO,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864556,31/12/2022,00:08:00,4,Acidente com Danos Materiais,865000,BR-262/MG,Oeste,Concebra,Saida de Pista,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
864557,31/12/2022,03:28:00,21,Com vítima,180000,BR-50/MG,Decrescente,ECO050,Capotamento,1.0,...,0.0,0.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0
864558,31/12/2022,05:05:55,14,Sem vítima,115100,BR-116/PR,Decrescente,Autopista Planalto Sul,Colisão traseira,2.0,...,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
864559,31/12/2022,13:49:33,339,Acidente com Danos Materiais,379000,BR-262/MG,Leste,Concebra,Saida de Pista,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


- Columns of the dataframe:

In [4]:
df.columns

Index(['data', 'horario', 'n_da_ocorrencia', 'tipo_de_ocorrencia', 'km',
       'trecho', 'sentido', 'lugar_acidente', 'tipo_de_acidente', 'automovel',
       'bicicleta', 'caminhao', 'moto', 'onibus', 'outros', 'tracao_animal',
       'transporte_de_cargas_especiais', 'trator_maquinas', 'utilitarios',
       'ilesos', 'levemente_feridos', 'moderadamente_feridos',
       'gravemente_feridos', 'mortos'],
      dtype='object')

In [5]:
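df.describe  # note: without parentheses this displays the bound method (output below); call df.describe() to get summary statistics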

<bound method NDFrame.describe of               data   horario n_da_ocorrencia            tipo_de_ocorrencia  \
0       01/01/2010  04:21:00              18                    sem vítima   
1       01/01/2010  02:13:00              20                    sem vítima   
2       01/01/2010  03:35:00     000024/2010                    sem vítima   
3       01/01/2010  07:31:00     000038/2010                    sem vítima   
4       01/01/2010  04:57:00     000027/2010                    sem vítima   
...            ...       ...             ...                           ...   
864556  31/12/2022  00:08:00               4  Acidente com Danos Materiais   
864557  31/12/2022  03:28:00              21                    Com vítima   
864558  31/12/2022  05:05:55              14                    Sem vítima   
864559  31/12/2022  13:49:33             339  Acidente com Danos Materiais   
864560  31/12/2022  12:12:09             188                    Com vítima   

             km     trecho   

## 2. Data Preprocessing

Next, the data is preprocessed, which includes:

- Filling missing values with zero:

In [12]:
df = df.fillna(0)
df.isna().sum()
df

Unnamed: 0,data,horario,n_da_ocorrencia,tipo_de_ocorrencia,trecho,sentido,lugar_acidente,tipo_de_acidente,automovel,bicicleta,...,tracao_animal,transporte_de_cargas_especiais,trator_maquinas,utilitarios,ilesos,levemente_feridos,moderadamente_feridos,gravemente_feridos,mortos,max_km
0,01/01/2010,04:21:00,18,sem vítima,BR-393/RJ,Norte,Rodovia do Aço,Derrapagem,1,0.0,...,0.0,0,0.0,0,1.0,0.0,0.0,0.0,0.0,167.00
1,01/01/2010,02:13:00,20,sem vítima,BR-116/PR,Sul,Autopista Regis Bittencourt,Colisão Traseira,2.0,0.0,...,0.0,0,0.0,0,3.0,0.0,0.0,0.0,0.0,269.50
2,01/01/2010,03:35:00,000024/2010,sem vítima,BR-290/RS,Norte,Concepa,COLISÃO LATERAL,2.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,77.00
3,01/01/2010,07:31:00,000038/2010,sem vítima,BR-116/RS,Norte,Concepa,QUEDA DE MOTO,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,52.00
4,01/01/2010,04:57:00,000027/2010,sem vítima,BR-290/RS,Norte,Concepa,QUEDA DE MOTO,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,33.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864556,31/12/2022,00:08:00,4,Acidente com Danos Materiais,BR-262/MG,Oeste,Concebra,Saida de Pista,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,865.00
864557,31/12/2022,03:28:00,21,Com vítima,BR-50/MG,Decrescente,ECO050,Capotamento,1.0,0.0,...,0.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,180.00
864558,31/12/2022,05:05:55,14,Sem vítima,BR-116/PR,Decrescente,Autopista Planalto Sul,Colisão traseira,2.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,115.10
864559,31/12/2022,13:49:33,339,Acidente com Danos Materiais,BR-262/MG,Leste,Concebra,Saida de Pista,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,379.00
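The fill step above can be checked on a small scale. The following is a minimal sketch, assuming a hypothetical two-column miniature of the crash table (only the column names come from the notebook), that counts missing values before and after `fillna(0)`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the crash table; the values are made up for illustration.
exemplo = pd.DataFrame({
    "automovel": [1.0, 2.0, np.nan],
    "mortos": [0.0, np.nan, 1.0],
})

faltantes_antes = exemplo.isna().sum()      # missing values per column before filling
preenchido = exemplo.fillna(0)              # same call the notebook applies to df
faltantes_depois = preenchido.isna().sum()  # should be zero for every column

print(faltantes_antes.to_dict())
print(faltantes_depois.to_dict())
```

Filling with zero treats "not recorded" as "zero vehicles or victims", which is an assumption about the source data rather than something the file guarantees.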


- Removal of the 'tipo_de_ocorrencia' column, since it has little influence on building the model:

In [16]:
df = df.drop('tipo_de_ocorrencia', axis=1)

In [17]:
df['tipo_de_acidente'].unique()

array(['Derrapagem', 'Colisão Traseira', 'COLISÃO LATERAL',
       'QUEDA DE MOTO', 'SAÍDA DE PISTA', 'Saida de Pista',
       'Colisão Transversal', 'Queda de Moto', 'Choque - Objeto Fixo',
       'Choque em barreira New Jersey', 'Colisão Frontal',
       'Choque - Defensa, barreira ou "submarino"', 'Capotamento',
       'Colisão Lateral', 'Soterramento', 'Atropelamento de Animal',
       'Tombamento', 'Choque - Talude', 'Outros - Sequência',
       'COLISÃO TRASEIRA', 'Colisão traseira',
       'Choque - Elemento de Drenagem', 'Choque - Suporte de Sinalização',
       'Atropelamento - Morador', 'Choque em objeto fixo',
       'Queda de moto', 'Choque em objeto na pista', 'Choque em defensa',
       'Outros', 'Abalroamento longitudinal', 'Engavetamento',
       'Choque - Arvore', 'Queda de ribanceira', 'Choque Talude',
       'Não Def', 'Atropelamento de pedestre atravessando',
       'Atropelamento de animal', 'Colisão frontal', 'Choque - Poste',
       'Atropelamento - Pedestre', 'C

In [41]:
contador = df['tipo_de_acidente'].str.upper().str.startswith('COLIS').sum()
print(contador)

316968


In [36]:
df.columns

Index(['data', 'horario', 'trecho', 'sentido', 'lugar_acidente',
       'tipo_de_acidente', 'automovel', 'bicicleta', 'caminhao', 'moto',
       'onibus', 'outros', 'tracao_animal', 'transporte_de_cargas_especiais',
       'trator_maquinas', 'utilitarios', 'ilesos', 'levemente_feridos',
       'moderadamente_feridos', 'gravemente_feridos', 'mortos', 'km'],
      dtype='object')

In [31]:
df['lugar_acidente'].unique()


array(['Rodovia do Aço', 'Autopista Regis Bittencourt', 'Concepa',
       'Autopista Planalto Sul', 'Autopista Litoral Sul', 'Concer',
       'Novadutra', 'Autopista Fernão Dias', 'Autopista Fluminense',
       'Transbrasiliana', 'Crt', 'Via Bahia', 'Ecosul', 'ECO101',
       'Concebra', 'Cro', 'ECO050', 'VIA040', 'MSVIA', 'Ecoponte',
       'Via Sul', 'Ecovias do Cerrado', 'Via Costeira', 'Ecoriominas',
       'Via Brasil', 'RIOSP', 'Ecovias do Araguaia'], dtype=object)

In [37]:
df['bicicleta'].unique()

array([0., 1., 2., 3., 4., 5.])

In [35]:
df['km'].unique()

array([167.   , 269.5  ,  77.   , ..., 113.569, 146.425, 686.154])

- Here, a function creates a new column 'faixa_km', based on the 'km' column, which tells the model in which km range of the highway the accident happened:

In [40]:
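def faixa_km(km):
    # Bands are 200 km wide with inclusive upper bounds, so fractional values
    # such as 200.5 fall into band 2 instead of slipping through to band 6.
    if 0 <= km <= 200:
        return 1
    elif 200 < km <= 400:
        return 2
    elif 400 < km <= 600:
        return 3
    elif 600 < km <= 800:
        return 4
    elif 800 < km <= 1000:
        return 5
    else:
        # Negative values and anything above 1000 km.
        return 6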

df['faixa_km'] = df['km'].apply(faixa_km)

df['faixa_km']

0         1
1         2
2         1
3         1
4         1
         ..
864556    5
864557    1
864558    1
864559    2
864560    2
Name: faixa_km, Length: 864561, dtype: int64
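The same banding can also be done without a Python-level loop. The sketch below uses `pandas.cut`; the helper name `faixa_km_vetorizada` is hypothetical, the bin edges are taken from the function above, and it assumes that fractional values between bands (for example 200.5) belong to the upper band and that anything outside 0 to 1000 goes to band 6:

```python
import pandas as pd

def faixa_km_vetorizada(km: pd.Series) -> pd.Series:
    """Hypothetical vectorized counterpart of faixa_km."""
    limites = [0, 200, 400, 600, 800, 1000]
    # labels=False returns the bin position (0..4); values outside the bins become NaN.
    posicao = pd.cut(km, bins=limites, labels=False, include_lowest=True)
    # Shift to 1..5 and send out-of-range values to band 6.
    return (posicao + 1).fillna(6).astype("int64")

km_exemplo = pd.Series([167.0, 269.5, 200.5, 865.0, 1500.0, 0.0])
faixas = faixa_km_vetorizada(km_exemplo)
print(faixas.tolist())
```

On a table with several hundred thousand rows, a vectorized call like this usually runs much faster than `apply` with a Python function.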

- Transformation of the 'tipo_de_acidente' column into a new column 'categoria_acidente', which converts this categorical data into numeric codes:

In [64]:
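def tipo_de_acidente(acidente):
    # Missing values were filled with 0 earlier, so 0 means no accident type was recorded.
    if acidente == 0:
        return 0
    # Compare in upper case: the raw labels mix 'Colisão', 'COLISÃO', and similar variants.
    acidente = acidente.upper()
    if acidente.startswith('COLIS'):
        return 1  # collisions
    elif acidente.startswith('ATRO'):
        return 2  # run-overs (atropelamento)
    elif acidente.startswith('CHOQ'):
        return 3  # impacts with objects (choque)
    elif acidente.startswith('ABAL'):
        return 4  # sideswipes (abalroamento)
    else:
        return 5  # everything else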

df['categoria_acidente'] = df['tipo_de_acidente'].apply(tipo_de_acidente)
df['categoria_acidente']

0         5
1         1
2         1
3         5
4         5
         ..
864556    5
864557    5
864558    1
864559    5
864560    5
Name: categoria_acidente, Length: 864561, dtype: int64
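The prefix rules above can also be kept in a lookup table, which makes it easier to add a category later. This is a minimal sketch: the names `PREFIXOS` and `categoria_por_prefixo` are hypothetical, and it assumes that any non-string value (such as the 0 left by `fillna`) maps to category 0:

```python
import pandas as pd

# Prefix -> numeric code, mirroring the if/elif chain in the notebook.
PREFIXOS = {"COLIS": 1, "ATRO": 2, "CHOQ": 3, "ABAL": 4}

def categoria_por_prefixo(acidente):
    """Hypothetical table-driven version of tipo_de_acidente."""
    if not isinstance(acidente, str):
        return 0  # assumption: non-string values are filled-in missing data
    maiusculo = acidente.upper()
    for prefixo, codigo in PREFIXOS.items():
        if maiusculo.startswith(prefixo):
            return codigo
    return 5  # every other accident type

tipos = pd.Series(["Colisão Traseira", "QUEDA DE MOTO", 0,
                   "Atropelamento de Animal", "Choque - Talude",
                   "Abalroamento longitudinal"])
categorias = tipos.apply(categoria_por_prefixo)
print(categorias.tolist())
```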

In [67]:
df.columns

Index(['data', 'horario', 'trecho', 'sentido', 'lugar_acidente',
       'tipo_de_acidente', 'automovel', 'bicicleta', 'caminhao', 'moto',
       'onibus', 'outros', 'tracao_animal', 'transporte_de_cargas_especiais',
       'trator_maquinas', 'utilitarios', 'ilesos', 'levemente_feridos',
       'moderadamente_feridos', 'gravemente_feridos', 'mortos', 'km',
       'faixa_km', 'categoria_acidente'],
      dtype='object')

- Dropping the original columns that were transformed above and replaced by the new derived columns:

In [69]:
df = df.drop('km', axis=1)

In [70]:
df = df.drop('tipo_de_acidente', axis=1)

- Transformation of the 'data' column into month/year format:

In [76]:
df['data']=pd.to_datetime(df['data'], format='%d/%m/%Y')

def mes_ano(data):
    return data.strftime('%m/%Y')

df['mes_ano'] = df['data'].apply(lambda x: mes_ano(x))

df['mes_ano']

0         01/2010
1         01/2010
2         01/2010
3         01/2010
4         01/2010
           ...   
864556    12/2022
864557    12/2022
864558    12/2022
864559    12/2022
864560    12/2022
Name: mes_ano, Length: 864561, dtype: object
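As an alternative to `apply`, the `.dt` accessor can format the whole column at once. A minimal sketch with two hypothetical dates in the same `dd/mm/yyyy` layout as the 'data' column:

```python
import pandas as pd

# Two example dates in the notebook's day/month/year format.
datas = pd.Series(["01/01/2010", "31/12/2022"])
datas = pd.to_datetime(datas, format="%d/%m/%Y")

# .dt.strftime formats every element without a Python-level function call per row.
mes_ano = datas.dt.strftime("%m/%Y")
print(mes_ano.tolist())
```

Note that a 'mm/yyyy' string is treated as a categorical label by most models; whether that is the desired encoding depends on the modeling goal.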

In [80]:
df

Unnamed: 0,trecho,sentido,lugar_acidente,automovel,bicicleta,caminhao,moto,onibus,outros,tracao_animal,...,trator_maquinas,utilitarios,ilesos,levemente_feridos,moderadamente_feridos,gravemente_feridos,mortos,faixa_km,categoria_acidente,mes_ano
0,BR-393/RJ,Norte,Rodovia do Aço,1,0.0,0,0,0.0,0.0,0.0,...,0.0,0,1.0,0.0,0.0,0.0,0.0,1,5,01/2010
1,BR-116/PR,Sul,Autopista Regis Bittencourt,2.0,0.0,0,0,0.0,0.0,0.0,...,0.0,0,3.0,0.0,0.0,0.0,0.0,2,1,01/2010
2,BR-290/RS,Norte,Concepa,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1,1,01/2010
3,BR-116/RS,Norte,Concepa,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,5,01/2010
4,BR-290/RS,Norte,Concepa,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,5,01/2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864556,BR-262/MG,Oeste,Concebra,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,5,5,12/2022
864557,BR-50/MG,Decrescente,ECO050,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,1.0,0.0,0.0,0.0,1,5,12/2022
864558,BR-116/PR,Decrescente,Autopista Planalto Sul,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1,1,12/2022
864559,BR-262/MG,Leste,Concebra,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2,5,12/2022


In [82]:
df.columns

Index(['trecho', 'sentido', 'lugar_acidente', 'automovel', 'bicicleta',
       'caminhao', 'moto', 'onibus', 'outros', 'tracao_animal',
       'transporte_de_cargas_especiais', 'trator_maquinas', 'utilitarios',
       'ilesos', 'levemente_feridos', 'moderadamente_feridos',
       'gravemente_feridos', 'mortos', 'faixa_km', 'categoria_acidente',
       'mes_ano'],
      dtype='object')

- Here, the processed dataframe is saved to a new '.csv' file so that it can be loaded more easily into the model later:

In [83]:
caminho_arquivo = 'Brazil: Total highway crashes 2010 - 2023/Brazil Total highway crashes 2010 - 2023 - tratado.csv'

In [189]:
df.to_csv(caminho_arquivo, index=False)
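The save step can be verified with a quick round trip. The sketch below is illustrative only: it writes a hypothetical three-column frame to a temporary folder and reads it back, showing that `index=False` keeps the row index out of the file so no `Unnamed: 0` column appears on reload:

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the processed table.
exemplo = pd.DataFrame({
    "trecho": ["BR-393/RJ", "BR-116/PR"],
    "mortos": [0.0, 0.0],
    "mes_ano": ["01/2010", "01/2010"],
})

with tempfile.TemporaryDirectory() as pasta:
    caminho = os.path.join(pasta, "tratado.csv")  # temporary path, not the notebook's file
    exemplo.to_csv(caminho, index=False)
    recarregado = pd.read_csv(caminho)

print(recarregado.columns.tolist())
```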

- A few scattered records were in the wrong format; since there were only a few, they were located and dropped manually from the dataframe. The columns selected below were also converted to numeric data so that they could be normalized and fed to the model.

In [124]:
tipo = df['moto'][0]
print(type(tipo))

<class 'int'>


In [181]:
df = df.drop(df.index[759156])

In [188]:
df['categoria_acidente'] = pd.to_numeric(df['categoria_acidente'])

In [150]:
df.columns

Index(['trecho', 'sentido', 'lugar_acidente', 'automovel', 'bicicleta',
       'caminhao', 'moto', 'onibus', 'outros', 'tracao_animal',
       'transporte_de_cargas_especiais', 'trator_maquinas', 'utilitarios',
       'ilesos', 'levemente_feridos', 'moderadamente_feridos',
       'gravemente_feridos', 'mortos', 'faixa_km', 'categoria_acidente',
       'mes_ano'],
      dtype='object')
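Malformed values like the ones dropped by hand above can also be located programmatically. This is a minimal sketch, assuming a hypothetical 'moto' column that mixes numbers, numeric strings, and one bad entry; `pd.to_numeric(..., errors="coerce")` turns anything unparseable into NaN, which marks the rows to inspect:

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number.
exemplo = pd.DataFrame({"moto": [0, "1", "x1", 2]})

convertido = pd.to_numeric(exemplo["moto"], errors="coerce")
linhas_invalidas = exemplo[convertido.isna()]  # rows whose value failed to parse

# Keep only the parseable rows and store the numeric values.
validas = convertido.notna()
limpo = exemplo.loc[validas].assign(moto=convertido[validas])

print(linhas_invalidas.index.tolist())
print(limpo["moto"].tolist())
```

This only works as a detector when genuine missing values have already been filled, as the notebook does with `fillna(0)`; otherwise original NaN entries would be flagged as well.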

- Normalization of the features using Min-Max Scaling:

In [195]:
features_to_normalize = ['automovel', 'bicicleta',
       'caminhao', 'moto', 'onibus', 'outros', 'tracao_animal',
       'transporte_de_cargas_especiais', 'trator_maquinas', 'utilitarios',
       'ilesos', 'levemente_feridos', 'moderadamente_feridos',
       'gravemente_feridos', 'mortos', 'faixa_km', 'categoria_acidente']


scaler = MinMaxScaler()

df[features_to_normalize] = scaler.fit_transform(df[features_to_normalize])
df

Unnamed: 0,trecho,sentido,lugar_acidente,automovel,bicicleta,caminhao,moto,onibus,outros,tracao_animal,transporte_de_cargas_especiais,trator_maquinas,utilitarios,ilesos,levemente_feridos,moderadamente_feridos,gravemente_feridos,mortos,faixa_km,categoria_acidente,mes_ano
0,BR-393/RJ,Norte,Rodovia do Aço,0.066667,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.001988,0.000000,0.0,0.0,0.0,0.0,1.0,01/2010
1,BR-116/PR,Sul,Autopista Regis Bittencourt,0.133333,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.005964,0.000000,0.0,0.0,0.0,0.2,0.2,01/2010
2,BR-290/RS,Norte,Concepa,0.133333,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.003976,0.000000,0.0,0.0,0.0,0.0,0.2,01/2010
3,BR-116/RS,Norte,Concepa,0.000000,0.0,0.1250,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.001988,0.000000,0.0,0.0,0.0,0.0,1.0,01/2010
4,BR-290/RS,Norte,Concepa,0.000000,0.0,0.1250,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.001988,0.000000,0.0,0.0,0.0,0.0,1.0,01/2010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864556,BR-262/MG,Oeste,Concebra,0.066667,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.001988,0.000000,0.0,0.0,0.0,0.8,1.0,12/2022
864557,BR-50/MG,Decrescente,ECO050,0.066667,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.005964,0.019608,0.0,0.0,0.0,0.0,1.0,12/2022
864558,BR-116/PR,Decrescente,Autopista Planalto Sul,0.133333,0.0,0.1250,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.003976,0.000000,0.0,0.0,0.0,0.0,0.2,12/2022
864559,BR-262/MG,Leste,Concebra,0.000000,0.0,0.1875,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.001988,0.000000,0.0,0.0,0.0,0.2,1.0,12/2022
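With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below reproduces that formula with plain pandas on hypothetical values, so the numbers in the table above are easier to read; for instance, a 'faixa_km' of 2 on a 1 to 6 scale becomes 0.2:

```python
import pandas as pd

# Hypothetical values; only the column names come from the notebook.
exemplo = pd.DataFrame({
    "ilesos": [1.0, 3.0, 2.0, 5.0],
    "faixa_km": [1, 2, 1, 6],
})

minimo = exemplo.min()
amplitude = exemplo.max() - exemplo.min()

# Column-wise min-max scaling; assumes no column is constant (amplitude > 0).
normalizado = (exemplo - minimo) / amplitude
print(normalizado)
```

Because 'mortos' is part of `features_to_normalize`, the target is rescaled too, so predictions and error metrics computed afterwards are in scaled units; recovering a death count requires the original minimum and maximum of that column.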


## 3. Exploratory Data Analysis

## 4. Modeling with PyCaret

The PyCaret library is now used to build and compare regression models. First, a random sample of 12,000 rows is drawn from the data:

In [202]:
amostra_aleatoria = df.sample(n=12000, random_state=42)

Next, the environment is configured with PyCaret:

In [203]:
s = setup(data = amostra_aleatoria, target = 'mortos')

Unnamed: 0,Description,Value
0,Session id,5673
1,Target,mortos
2,Target type,Regression
3,Original data shape,"(12000, 21)"
4,Transformed data shape,"(12000, 35)"
5,Transformed train set shape,"(8400, 35)"
6,Transformed test set shape,"(3600, 35)"
7,Numeric features,16
8,Categorical features,4
9,Preprocess,True


The models are then compared to find the best one:

In [204]:
melhor_modelo = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lasso,Lasso Regression,0.0012,0.0,0.0051,-0.0025,0.0049,0.9818,0.072
en,Elastic Net,0.0012,0.0,0.0051,-0.0025,0.0049,0.9818,0.068
llar,Lasso Least Angle Regression,0.0012,0.0,0.0051,-0.0025,0.0049,0.9818,0.068
omp,Orthogonal Matching Pursuit,0.0012,0.0,0.0051,-0.0038,0.0049,0.9818,0.071
ridge,Ridge Regression,0.0012,0.0,0.0051,-0.0041,0.0049,0.9757,0.071
huber,Huber Regressor,0.0006,0.0,0.0051,-0.0151,0.005,0.9999,0.18
br,Bayesian Ridge,0.0014,0.0,0.0051,-0.0307,0.005,0.9665,0.075
lr,Linear Regression,0.0015,0.0,0.0051,-0.0406,0.005,0.9657,0.433
gbr,Gradient Boosting Regressor,0.0013,0.0,0.0051,-0.0434,0.005,0.9289,0.378
knn,K Neighbors Regressor,0.0009,0.0,0.0052,-0.0637,0.005,0.9264,0.11



KeyboardInterrupt: 
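Every R2 value in the partial comparison table is slightly negative. As a reading aid, the sketch below computes R2 by hand on a hypothetical target that is almost always zero, which is an assumption about how 'mortos' is distributed: predicting the mean gives exactly 0, and anything worse than the mean gives a negative score.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical target: four crashes without deaths, one with a death.
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

r2_media = r2(y, np.full_like(y, y.mean()))  # constant prediction at the mean
r2_zeros = r2(y, np.zeros_like(y))           # always predicting zero deaths

print(r2_media, r2_zeros)
```

Scores this close to zero suggest the compared models are doing no better than a constant prediction on this sample, although the run was interrupted before all models were evaluated.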

## 5. Model Evaluation

The performance of the best model is evaluated using the evaluate_model function:

## 6. Saving the Model and Creating an API

Finally, the best model is saved and an API is created with it:
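The notebook does not include the code for this step. The following is only a sketch of how saving and reloading could look with PyCaret's regression module: the helper name `salvar_e_recarregar` and the file name `modelo_acidentes` are hypothetical, and the function assumes it is called in the same session in which `setup()` and `compare_models()` were run.

```python
def salvar_e_recarregar(modelo, novos_dados, nome_arquivo="modelo_acidentes"):
    """Hypothetical helper: persist a trained PyCaret model and predict with the reloaded copy."""
    # Imported lazily so the sketch can be read or imported without PyCaret installed.
    from pycaret.regression import finalize_model, load_model, predict_model, save_model

    modelo_final = finalize_model(modelo)    # refit on the full data, including the holdout set
    save_model(modelo_final, nome_arquivo)   # writes nome_arquivo + ".pkl" to the working directory
    pipeline = load_model(nome_arquivo)      # reload as an API process would at startup
    # Returns novos_dados with a prediction column appended.
    return predict_model(pipeline, data=novos_dados)
```

A call such as `salvar_e_recarregar(melhor_modelo, amostra_aleatoria.drop(columns=['mortos']))` would exercise the whole path. An API would typically load the saved pipeline once and call `predict_model` for each request; the choice of web framework is left open here.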