# 01 - Data Cleaning
Author: Fernanda Baptista de Siqueira  
Course: MBA in Technology for Business – AI, Data Science and Big Data  
Topic: Analysis of Traffic Accidents in Porto Alegre (2020–2024)  
Data source: Equipe Armazém de Dados de Mobilidade - EAMOB/CIET  
https://dadosabertos.poa.br/dataset/acidentes-de-transito-acidentes (11/05/2025)  

### 1. Import libraries and functions

In [1]:
from config import (
    pd, resumo_df, checar_nulos,
    salvar_parquet, ajustar_tipos,
    PATH_RAW, PATH_CLEAN, COLS_VEICULOS
)


### 2. Reading and Initial Schema
1) Load the file `acidentes.csv`


In [2]:
arquivo = PATH_RAW + 'acidentes.csv'

try:
    # Read every column as text; types are converted explicitly in section 3
    df = pd.read_csv(
        arquivo,
        sep=';',
        encoding='utf-8',
        low_memory=False,
        dtype=str
    )
    print("Original CSV read successfully!")
except FileNotFoundError:
    print(f"Error: file '{arquivo}' not found. Check the path.")
    raise  # stop here: the following cells need df
except Exception as e:
    print(f"An error occurred: {e}")
    raise  # stop here: the following cells need df

Original CSV read successfully!
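
Reading with `dtype=str` postpones every type decision to the treatment step. The short example below uses made-up inline data (not the real file) to illustrate what pandas would otherwise infer on its own.

```python
import io

import pandas as pd

# Made-up sample in the same ';'-separated layout (not real records)
raw = "idacidente;predial1;hora\n001;598;03:00:00\n002;;23:00:00\n"

as_text = pd.read_csv(io.StringIO(raw), sep=";", dtype=str)
inferred = pd.read_csv(io.StringIO(raw), sep=";")

# With dtype=str the leading zeros survive and the empty field stays missing
print(as_text["idacidente"].tolist())

# With inference the id becomes an integer and predial1 becomes a float
print(inferred.dtypes)
```

Keeping everything as text first makes the later conversions explicit, so any value that fails to convert shows up as a null instead of silently changing the column type.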


2) Inspect and understand the DataFrame

In [3]:
resumo_df(df)
checar_nulos(df)

Dimensions: (69521, 34)

Data types:
data_extracao    object
predial1         object
queda_arr        object
data             object
feridos          object
feridos_gr       object
mortes           object
morte_post       object
fatais           object
auto             object
taxi             object
lotacao          object
onibus_urb       object
onibus_met       object
onibus_int       object
caminhao         object
moto             object
carroca          object
bicicleta        object
outro            object
cont_vit         object
ups              object
patinete         object
idacidente       object
longitude        object
latitude         object
log1             object
log2             object
tipo_acid        object
dia_sem          object
hora             object
noite_dia        object
regiao           object
consorcio        object
dtype: object

Nulls per column:
data_extracao        0
predial1          4079
queda_arr            0
data                 0
feridos            

Unnamed: 0,data_extracao,predial1,queda_arr,data,feridos,feridos_gr,mortes,morte_post,fatais,auto,taxi,lotacao,onibus_urb,onibus_met,onibus_int,caminhao,moto,carroca,bicicleta,outro,cont_vit,ups,patinete,idacidente,longitude,latitude,log1,log2,tipo_acid,dia_sem,hora,noite_dia,regiao,consorcio
0,2025-06-01 01:33:13,0,0.0,2020-10-17 00:00:00,1,0,0,0,0,3,0,0,0,0,0,0,1,0,0,0,1,5,0,190816,0.0,0.0,R MARCOS MOREIRA,R GASTON ENGLERT,ABALROAMENTO,SÁBADO,19:00:00.0000000,NOITE,NORTE,
1,2025-06-01 01:33:13,598,0.0,2020-01-01 00:00:00,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,5,0,669089,,,AV BENTO GONCALVES,,ABALROAMENTO,QUARTA-FEIRA,03:00:00.0000000,NOITE,LESTE,
2,2025-06-01 01:33:13,1271,0.0,2020-01-01 00:00:00,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,5,0,669097,,,AV INDEPENDENCIA,,ATROPELAMENTO,QUARTA-FEIRA,23:00:00.0000000,NOITE,LESTE,
3,2025-06-01 01:33:13,1901,0.0,2020-01-02 00:00:00,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,5,0,669098,,,AV EDUARDO PRADO,,ATROPELAMENTO,QUINTA-FEIRA,00:05:00.0000000,NOITE,SUL,
4,2025-06-01 01:33:13,3302,0.0,2020-01-02 00:00:00,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,5,0,669099,-51.21153485762743,-30.081535213015123,AV TERESOPOLIS,,ABALROAMENTO,QUINTA-FEIRA,09:00:00.0000000,DIA,SUL,


Percentage of null values per column (%):


consorcio       96.83
log2            72.42
latitude        14.82
longitude       14.82
predial1         5.87
hora             0.80
log1             0.07
data_extracao    0.00
queda_arr        0.00
data             0.00
taxi             0.00
lotacao          0.00
feridos          0.00
feridos_gr       0.00
mortes           0.00
morte_post       0.00
fatais           0.00
auto             0.00
carroca          0.00
moto             0.00
caminhao         0.00
onibus_int       0.00
onibus_met       0.00
onibus_urb       0.00
bicicleta        0.00
outro            0.00
patinete         0.00
idacidente       0.00
cont_vit         0.00
ups              0.00
dia_sem          0.00
tipo_acid        0.00
noite_dia        0.00
regiao           0.00
dtype: float64
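
The helpers `resumo_df` and `checar_nulos` come from `config`, and their source is not shown in this notebook. A minimal sketch of functions that would produce similar summaries follows. The names `resumo_sketch` and `nulos_pct_sketch` and their bodies are assumptions for illustration, not the author's implementation.

```python
import pandas as pd


def resumo_sketch(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Hypothetical stand-in for resumo_df: print shape, dtypes and null counts."""
    print(f"Dimensions: {df.shape}")
    print("\nData types:")
    print(df.dtypes)
    print("\nNulls per column:")
    print(df.isna().sum())
    return df.head(n)


def nulos_pct_sketch(df: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for checar_nulos: % of nulls, highest first."""
    return (df.isna().mean() * 100).round(2).sort_values(ascending=False)


# Made-up frame for illustration only
exemplo = pd.DataFrame({
    "log1": ["AV IPIRANGA", None, "AV TERESOPOLIS", "R A"],
    "log2": [None, None, None, "R B"],
})
pct = nulos_pct_sketch(exemplo)
print(pct)
```

Sorting the percentages in descending order puts the sparsest columns first, which is how `consorcio` and `log2` stand out in the listing above.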

### 3. Treatment  
1) Clean column names; drop columns

In [4]:
# Strip leading/trailing whitespace from the column names and convert them to snake_case (good practice)
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# Drop unused columns
df = df.drop(
    columns=[
        'data_extracao', 'consorcio', 'latitude',
        'longitude', 'mortes', 'morte_post'
    ],
    errors='ignore'
)

# List the column names
print(df.columns)

Index(['predial1', 'queda_arr', 'data', 'feridos', 'feridos_gr', 'fatais',
       'auto', 'taxi', 'lotacao', 'onibus_urb', 'onibus_met', 'onibus_int',
       'caminhao', 'moto', 'carroca', 'bicicleta', 'outro', 'cont_vit', 'ups',
       'patinete', 'idacidente', 'log1', 'log2', 'tipo_acid', 'dia_sem',
       'hora', 'noite_dia', 'regiao'],
      dtype='object')
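
The same renaming chain can be tried on deliberately messy headers. The column names below are made up for illustration; the real file already arrives with clean names, so there the step acts as a safeguard.

```python
import pandas as pd

# Made-up headers with stray spaces and mixed case
df_demo = pd.DataFrame(columns=[" Data Extracao ", "Tipo Acid", "regiao"])

df_demo.columns = (
    df_demo.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

# errors='ignore' skips names that are absent instead of raising KeyError
df_demo = df_demo.drop(columns=["data_extracao", "consorcio"], errors="ignore")
print(list(df_demo.columns))
```

Because of `errors='ignore'`, the cell can be re-run after the columns are already gone without failing.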


2) Convert types; remove invalid rows

In [5]:
# Convert 'data' to datetime; unparseable values become NaT
df['data'] = pd.to_datetime(df['data'], errors='coerce')

# Drop rows with missing 'data', 'hora', 'log1' or 'regiao'
df = df.dropna(subset=['data', 'hora', 'log1', 'regiao'])

# Drop dates outside the scope (2020-01-01 through 2025-04-01)
df = df[(df['data'] >= '2020-01-01') & (df['data'] <= '2025-04-01')]
print("Dates outside the scope (2020-2025) removed successfully.")

# Drop duplicated keys
df = df.drop_duplicates(subset='idacidente')

# Standardize the day-of-week names
df["dia_sem"] = (
    df["dia_sem"]
    .str.title()                            # Capitalize the first letter of each word
    .str.replace("-Feira", "", regex=False) # Remove the "-Feira" suffix
)

# Adjust column dtypes (category and others) with the helper from config
ajustar_tipos(df)

# Convert 'hora' to timedelta
df['hora'] = pd.to_timedelta(df['hora'], errors='coerce')

print("\nInformation after cleaning:")
resumo_df(df)
display(df.describe(include='all'))

Dates outside the scope (2020-2025) removed successfully.

Information after cleaning:
Dimensions: (68837, 28)

Data types:
predial1                Int32
queda_arr               Int32
data           datetime64[ns]
feridos                 Int32
feridos_gr              Int32
fatais                  Int32
auto                    Int32
taxi                    Int32
lotacao                 Int32
onibus_urb              Int32
onibus_met              Int32
onibus_int              Int32
caminhao                Int32
moto                    Int32
carroca                 Int32
bicicleta               Int32
outro                   Int32
cont_vit                Int32
ups                     Int32
patinete                Int32
idacidente              Int32
log1           string[python]
log2           string[python]
tipo_acid            category
dia_sem              category
hora          timedelta64[ns]
noite_dia            category
regiao               category
dtype: object

Nulls per column:
predial1    

Unnamed: 0,predial1,queda_arr,data,feridos,feridos_gr,fatais,auto,taxi,lotacao,onibus_urb,onibus_met,onibus_int,caminhao,moto,carroca,bicicleta,outro,cont_vit,ups,patinete,idacidente,log1,log2,tipo_acid,dia_sem,hora,noite_dia,regiao
0,0,0,2020-10-17,1,0,0,3,0,0,0,0,0,0,1,0,0,0,1,5,0,190816,R MARCOS MOREIRA,R GASTON ENGLERT,ABALROAMENTO,Sábado,0 days 19:00:00,NOITE,NORTE
1,598,0,2020-01-01,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,5,0,669089,AV BENTO GONCALVES,,ABALROAMENTO,Quarta,0 days 03:00:00,NOITE,LESTE
2,1271,0,2020-01-01,1,1,0,1,0,0,0,0,0,0,0,0,0,0,1,5,0,669097,AV INDEPENDENCIA,,ATROPELAMENTO,Quarta,0 days 23:00:00,NOITE,LESTE
3,1901,0,2020-01-02,2,0,0,0,0,0,0,0,0,0,1,0,0,0,1,5,0,669098,AV EDUARDO PRADO,,ATROPELAMENTO,Quinta,0 days 00:05:00,NOITE,SUL
4,3302,0,2020-01-02,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,5,0,669099,AV TERESOPOLIS,,ABALROAMENTO,Quinta,0 days 09:00:00,DIA,SUL


Unnamed: 0,predial1,queda_arr,data,feridos,feridos_gr,fatais,auto,taxi,lotacao,onibus_urb,onibus_met,onibus_int,caminhao,moto,carroca,bicicleta,outro,cont_vit,ups,patinete,idacidente,log1,log2,tipo_acid,dia_sem,hora,noite_dia,regiao
count,64799.0,68837.0,68837,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837.0,68837,18975,68837,68837,68837,68837,68837
unique,,,,,,,,,,,,,,,,,,,,,,3803,2017,10,7,,2,4
top,,,,,,,,,,,,,,,,,,,,,,AV PROTASIO ALVES,AV IPIRANGA,ABALROAMENTO,Sexta,,DIA,LESTE
freq,,,,,,,,,,,,,,,,,,,,,,2551,493,29316,11686,,48944,21869
mean,1757.98,0.0,2022-11-07 08:27:28.415822592,0.45,0.08,0.01,1.43,0.01,0.01,0.03,0.01,0.01,0.08,0.32,0.0,0.02,0.0,0.39,2.6,0.0,714657.38,,,,,0 days 13:41:24.096052994,,
min,0.0,0.0,2020-01-01 00:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,190816.0,,,,,0 days 00:00:00,,
25%,192.0,0.0,2021-09-06 00:00:00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,686520.0,,,,,0 days 09:50:00,,
50%,767.0,0.0,2022-12-07 00:00:00,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,704079.0,,,,,0 days 14:00:00,,
75%,2205.5,0.0,2024-01-28 00:00:00,1.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,5.0,0.0,741465.0,,,,,0 days 17:43:00,,
max,1097024.0,1.0,2025-03-31 00:00:00,34.0,5.0,4.0,10.0,2.0,2.0,2.0,2.0,2.0,3.0,4.0,1.0,2.0,2.0,1.0,13.0,2.0,768845.0,,,,,0 days 23:59:00,,
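
The dtype listing above shows nullable `Int32`, `string` and `category` columns after `ajustar_tipos` runs, but the helper's source is not part of this notebook. The sketch below is one way to get a similar result. The function `ajustar_tipos_sketch`, its parameters and the sample rows are assumptions for illustration only.

```python
import pandas as pd


def ajustar_tipos_sketch(df, cat_cols, text_cols):
    """Hypothetical in-place dtype adjustment; not the author's ajustar_tipos."""
    for col in df.columns:
        if col in cat_cols:
            df[col] = df[col].astype("category")
        elif col in text_cols:
            df[col] = df[col].astype("string")
        else:
            # Text such as "0.0" or a missing value becomes a nullable integer
            df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int32")


# Made-up rows in the raw all-text layout
demo = pd.DataFrame({
    "predial1": ["598", None],
    "queda_arr": ["0.0", "1.0"],
    "log1": ["AV BENTO GONCALVES", "AV IPIRANGA"],
    "regiao": ["LESTE", "SUL"],
})
ajustar_tipos_sketch(demo, cat_cols=["regiao"], text_cols=["log1"])
print(demo.dtypes)

# Coercion turns unparseable text into NaT instead of raising
datas = pd.to_datetime(pd.Series(["2020-10-17 00:00:00", "sem data"]), errors="coerce")
horas = pd.to_timedelta(pd.Series(["19:00:00.0000000", "??"]), errors="coerce")

# Same day-of-week normalization as in the cell above
dias = (
    pd.Series(["QUARTA-FEIRA", "SÁBADO"])
    .str.title()
    .str.replace("-Feira", "", regex=False)
)
print(dias.tolist())
```

With `errors='coerce'`, a conversion can create new missing values, so it is worth re-checking the null counts after this step.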


3) Create derived columns (feature engineering)

In [6]:
# Create column 'hora_int' (hour of the day as an integer)
df['hora_int'] = df['hora'].dt.components['hours']

# Create column 'data_hora' (date plus time of day)
df['data_hora'] = df['data'] + df['hora']

# Create the total victims column
df["total_vitimas"] = (
    df["feridos"].fillna(0) +
    df["fatais"].fillna(0)
)

# Create column 'soma_veiculos' (number of vehicles involved)
df['soma_veiculos'] = df[COLS_VEICULOS].sum(axis=1)
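
The derived columns can be checked on a tiny frame. In the sketch below, `COLS_VEICULOS_DEMO` is a hypothetical two-column subset (the real `COLS_VEICULOS` list lives in `config`), and the rows are made up.

```python
import pandas as pd

# Hypothetical subset; the real COLS_VEICULOS is defined in config
COLS_VEICULOS_DEMO = ["auto", "moto"]

demo = pd.DataFrame({
    "data": pd.to_datetime(["2020-10-17", "2020-01-02"]),
    "hora": pd.to_timedelta(["19:00:00", "00:05:00"]),
    "feridos": pd.array([1, None], dtype="Int32"),
    "fatais": pd.array([0, 1], dtype="Int32"),
    "auto": pd.array([3, 0], dtype="Int32"),
    "moto": pd.array([1, 1], dtype="Int32"),
})

# Hour component of the timedelta
demo["hora_int"] = demo["hora"].dt.components["hours"]

# datetime + timedelta yields a full timestamp
demo["data_hora"] = demo["data"] + demo["hora"]

# fillna(0) keeps a missing count from turning the whole sum into <NA>
demo["total_vitimas"] = demo["feridos"].fillna(0) + demo["fatais"].fillna(0)

# Row-wise sum over the vehicle columns
demo["soma_veiculos"] = demo[COLS_VEICULOS_DEMO].sum(axis=1)

print(demo[["hora_int", "data_hora", "total_vitimas", "soma_veiculos"]])
```

If `hora` still held missing values, `dt.components` would return floats with NaN for those rows, so the integer hour relies on `hora` being fully populated, as the `count` row of the summary above indicates.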

4) Save the cleaned DataFrames

In [None]:
# Save the cleaned DataFrame
salvar_parquet(df, PATH_CLEAN + "df_limpo.parquet")

Saved: ../dados/intermediarios/df_limpo.parquet
Saved: ../dados/intermediarios/df_20_24.parquet
Saved: ../dados/intermediarios/df_2025.parquet
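
The output lists three files, while the cell shown writes only `df_limpo.parquet`. One way the two period files could be produced, assuming they split the cleaned data by year, is sketched below. The functions `dividir_por_periodo` and `salvar_partes` are hypothetical, and the author's `salvar_parquet` is not shown in this notebook.

```python
from pathlib import Path

import pandas as pd


def dividir_por_periodo(df: pd.DataFrame) -> dict:
    """Hypothetical split: 2020-2024 versus 2025, keyed by output file stem."""
    ano = df["data"].dt.year
    return {
        "df_20_24": df[ano.between(2020, 2024)],
        "df_2025": df[ano == 2025],
    }


def salvar_partes(partes: dict, pasta: str) -> None:
    """Write each part with DataFrame.to_parquet (needs pyarrow or fastparquet)."""
    for nome, parte in partes.items():
        destino = Path(pasta) / f"{nome}.parquet"
        parte.to_parquet(destino, index=False)
        print(f"Saved: {destino}")


# Made-up dates for illustration only
demo = pd.DataFrame({"data": pd.to_datetime(["2020-01-01", "2024-12-31", "2025-03-31"])})
partes = dividir_por_periodo(demo)
print({nome: len(parte) for nome, parte in partes.items()})
```

Calling `salvar_partes(partes, PATH_CLEAN)` would then write one Parquet file per period. Keeping 2025 apart fits the stated 2020–2024 scope, since the 2025 records in this extract end at 2025-03-31.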
