# SuperStore ETL Project

This notebook contains the ETL (Extract, Transform, Load) process for the SuperStore dataset. We use pandas and NumPy to manipulate the data and then load the result into BigQuery.

## 1. Environment Setup and Library Imports

Import all the libraries the notebook needs.

In [31]:
# Import the required libraries
import pandas as pd
import numpy as np
from pathlib import Path

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 2. Loading the Data and Initializing the DataFrame

Next, we load the SuperStore dataset from the CSV file and run an initial inspection of the data.

In [32]:
# Load the dataset
df = pd.read_csv("../data/raw/superstore.csv")  # adjust the path to your environment

print(f"\n📊 Dataset shape: {df.shape}")
print(f"📊 Columns: {df.columns.tolist()}")

df.head()


📊 Dataset shape: (51290, 27)
📊 Columns: ['category', 'city', 'country', 'customer_ID', 'customer_name', 'discount', 'market', 'unknown', 'order_date', 'order_id', 'order_priority', 'product_id', 'product_name', 'profit', 'quantity', 'region', 'row_id', 'sales', 'segment', 'ship_date', 'ship_mode', 'shipping_cost', 'state', 'sub_category', 'year', 'market2', 'weeknum']


,category,city,country,customer_ID,customer_name,discount,market,unknown,order_date,order_id,...,sales,segment,ship_date,ship_mode,shipping_cost,state,sub_category,year,market2,weeknum
0,Office Supplies,Los Angeles,United States,LS-172304,Lycoris Saunders,0.0,US,1,2011-01-07 0:00:00,CA-2011-130813,...,19,Consumer,2011-01-09 0:00:00,Second Class,4.37,California,Paper,2011,North America,2
1,Office Supplies,Los Angeles,United States,MV-174854,Mark Van Huff,0.0,US,1,2011-01-21 0:00:00,CA-2011-148614,...,19,Consumer,2011-01-26 0:00:00,Standard Class,0.94,California,Paper,2011,North America,4
2,Office Supplies,Los Angeles,United States,CS-121304,Chad Sievert,0.0,US,1,2011-08-05 0:00:00,CA-2011-118962,...,21,Consumer,2011-08-09 0:00:00,Standard Class,1.81,California,Paper,2011,North America,32
3,Office Supplies,Los Angeles,United States,CS-121304,Chad Sievert,0.0,US,1,2011-08-05 0:00:00,CA-2011-118962,...,111,Consumer,2011-08-09 0:00:00,Standard Class,4.59,California,Paper,2011,North America,32
4,Office Supplies,Los Angeles,United States,AP-109154,Arthur Prichep,0.0,US,1,2011-09-29 0:00:00,CA-2011-146969,...,6,Consumer,2011-10-03 0:00:00,Standard Class,1.32,California,Paper,2011,North America,40
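Section 1 imports `Path`, but the loading cell passes a plain string to `pd.read_csv`. A minimal sketch of how the path could be checked before reading is shown here. The helper name `load_superstore` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_superstore(csv_path) -> pd.DataFrame:
    """Read the CSV, failing early with a clear message if the file is missing.

    Hypothetical helper: the original notebook calls pd.read_csv directly.
    """
    path = Path(csv_path).expanduser()
    if not path.is_file():
        # Show the resolved path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"CSV not found: {path.resolve()}")
    return pd.read_csv(path)


# Usage in the notebook (path taken from the cell above):
# df = load_superstore("../data/raw/superstore.csv")
```

Because the notebook uses a relative path, the error message prints the resolved absolute path, which makes it clearer which directory the kernel is running from.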


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   category        51290 non-null  object 
 1   city            51290 non-null  object 
 2   country         51290 non-null  object 
 3   customer_ID     51290 non-null  object 
 4   customer_name   51290 non-null  object 
 5   discount        51290 non-null  float64
 6   market          51290 non-null  object 
 7   unknown         51290 non-null  int64  
 8   order_date      51290 non-null  object 
 9   order_id        51290 non-null  object 
 10  order_priority  51290 non-null  object 
 11  product_id      51290 non-null  object 
 12  product_name    51290 non-null  object 
 13  profit          51290 non-null  float64
 14  quantity        51290 non-null  int64  
 15  region          51290 non-null  object 
 16  row_id          51290 non-null  int64  
 17  sales           51290 non-null 
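
The `df.info()` output lists `order_date` as `object`, meaning the dates are stored as plain strings (`ship_date` has the same layout in the `df.head()` preview). A possible conversion step is sketched below on a small stand-in DataFrame. The helper `parse_date_columns` is hypothetical, and using `errors="coerce"` is an assumption about how invalid dates should be treated.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame, using the date layout seen in df.head()
sample = pd.DataFrame({
    "order_date": ["2011-01-07 0:00:00", "2011-08-05 0:00:00"],
    "ship_date": ["2011-01-09 0:00:00", "2011-08-09 0:00:00"],
})


def parse_date_columns(frame, columns=("order_date", "ship_date")):
    """Return a copy of `frame` with the given columns converted to datetime64."""
    out = frame.copy()
    for col in columns:
        # errors="coerce" turns unparseable strings into NaT instead of raising
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


parsed = parse_date_columns(sample)
print(parsed.dtypes)
# Count values that could not be parsed, per column
print(parsed[["order_date", "ship_date"]].isna().sum())
```

With coercion enabled, it is worth checking the `NaT` counts after the conversion, since a silent parsing failure would otherwise look like a missing value.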

## 3. Handling Null Values

In [34]:
print("🔍 Null values per column:\n")
print(df.isnull().sum().sort_values(ascending=False))

🔍 Null values per column:

category          0
city              0
country           0
customer_ID       0
customer_name     0
discount          0
market            0
unknown           0
order_date        0
order_id          0
order_priority    0
product_id        0
product_name      0
profit            0
quantity          0
region            0
row_id            0
sales             0
segment           0
ship_date         0
ship_mode         0
shipping_cost     0
state             0
sub_category      0
year              0
market2           0
weeknum           0
dtype: int64
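
This dataset has no missing values, so the full listing above is all zeros. For sources that do contain nulls, a more compact report could show only the affected columns together with their percentage. The sketch below uses made-up sample data, and the helper name `null_report` is hypothetical.

```python
import numpy as np
import pandas as pd


def null_report(frame):
    """Count and percentage of missing values, only for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "n_null": counts,
        "pct_null": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one null, worst first
    return report[report["n_null"] > 0].sort_values("n_null", ascending=False)


# Made-up data for illustration; not taken from the SuperStore file
sample = pd.DataFrame({
    "sales": [19, 21, np.nan, 6],
    "segment": ["Consumer", None, "Consumer", None],
    "quantity": [1, 2, 3, 4],
})

report = null_report(sample)
print(report)
```

Applied to the real `df`, this report would come back empty, which is a quicker confirmation than scanning 27 rows of zeros.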


## 4. Handling Duplicate Values

In [35]:
duplicated_values = df[df.duplicated()]
print(f"Duplicate rows: {len(duplicated_values)}")

Duplicate rows: 0


In [36]:
df['row_id'].duplicated().sum()

np.int64(0)
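
Since `row_id` is unique, the full-row check with `df.duplicated()` cannot find anything: two rows always differ in at least that column. A complementary check on a subset of business columns is sketched below. Treating `order_id` plus `product_id` as the key is an assumption for illustration, not something the notebook establishes, and the sample data is made up.

```python
import pandas as pd

# Made-up rows: row_id is unique, but the last two rows repeat the same order line
sample = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "order_id": ["CA-1", "CA-1", "CA-2", "CA-2"],
    "product_id": ["P-1", "P-2", "P-3", "P-3"],
    "quantity": [2, 1, 5, 5],
})

# Hypothetical business key; adjust to whatever identifies an order line
key_columns = ["order_id", "product_id"]

# keep=False flags every member of a repeated group, not just the later copies
repeated = sample[sample.duplicated(subset=key_columns, keep=False)]
print(f"Rows sharing a key: {len(repeated)}")

# Keep the first occurrence of each key and drop the rest
deduplicated = sample.drop_duplicates(subset=key_columns, keep="first")
print(deduplicated)
```

Whether repeated keys are true duplicates or legitimate separate lines depends on the business rules, so inspecting `repeated` before dropping anything is the safer order of operations.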

## 5. Handling Categorical Variables

## 6. Handling Numerical Variables