# 🗓️ 12 - DIM_TEMPO (Gold)

- Reads the Silver files `2018_anonimizado.xlsx` and `2019_anonimizado.xlsx` (nominally under `silver/`; in this run they are loaded from the notebook's own `gold/` folder)
- Builds a date dimension from `DATA`, `DATA_DO_ULTIMO_ATO`, and `DATA_DE_ENTRADA_FASE_ATUAL`
- Exports it to `gold/output/dim_tempo.csv`


## 0) Imports

In [1]:
import pandas as pd
import numpy as np


## 1) Robust paths

In [2]:
from pathlib import Path

# ------------------------------------------------------
# Notebook runs in /gold
# Input files are also in /gold
# Output goes to /gold/output
# ------------------------------------------------------

BASE_DIR = Path().resolve()   # current working directory (gold/)
OUT_DIR = BASE_DIR / "output"
OUT_DIR.mkdir(parents=True, exist_ok=True)

INPUT_FILES = [
    BASE_DIR / "2018_anonimizado.xlsx",
    BASE_DIR / "2019_anonimizado.xlsx",
]

print("📁 BASE_DIR:", BASE_DIR)
print("📥 INPUT_FILES:")
for f in INPUT_FILES:
    print(" -", f, "| exists?", f.exists())

print("📤 OUT_DIR:", OUT_DIR)


📁 BASE_DIR: C:\Users\LeaoN\OneDrive\Documents\GitHub\data_case_analysis\gold
📥 INPUT_FILES:
 - C:\Users\LeaoN\OneDrive\Documents\GitHub\data_case_analysis\gold\2018_anonimizado.xlsx | exists? True
 - C:\Users\LeaoN\OneDrive\Documents\GitHub\data_case_analysis\gold\2019_anonimizado.xlsx | exists? True
📤 OUT_DIR: C:\Users\LeaoN\OneDrive\Documents\GitHub\data_case_analysis\gold\output
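
The overview names `silver/` as the source, while this run keeps the inputs next to the notebook. A minimal sketch of a lookup that tolerates both layouts is shown below. The helper `find_input` and the idea of a sibling `silver` folder are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path
from typing import Iterable


def find_input(name: str, search_dirs: Iterable[Path]) -> Path:
    """Return the first existing file called `name` among `search_dirs`.

    Hypothetical helper: the notebook itself only looks in BASE_DIR.
    """
    tried = []
    for folder in search_dirs:
        candidate = Path(folder) / name
        tried.append(str(candidate))
        if candidate.is_file():
            return candidate.resolve()
    # Fail early, listing every location that was checked.
    raise FileNotFoundError(f"{name} not found; tried: {tried}")
```

With the notebook's variables, a call could look like `find_input("2018_anonimizado.xlsx", [BASE_DIR, BASE_DIR.parent / "silver"])`, which prefers the local copy and falls back to the assumed Silver folder.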


## 2) Read Silver (2018 + 2019)

In [3]:
dfs = []
for f in INPUT_FILES:
    if not f.exists():
        raise FileNotFoundError(f"File not found: {f}")
    tmp = pd.read_excel(f, dtype=str)
    tmp["fonte_arquivo"] = f.name
    dfs.append(tmp)

df = pd.concat(dfs, ignore_index=True)

print("✅ Consolidated rows/columns:", df.shape)
df.head()


✅ Consolidated rows/columns: (732261, 87)


Unnamed: 0,ULTIMO_PROCESSO,SITUACAO_DO_PROCESSO,IS_SEDE_EAD,NO_DO_PROCESSO,MODALIDADE,ANO_DO_PROTOCOLO,DATA,ORGAO,ATO,CATEGORIA_ATO,...,CINE_AREA_ESPECIFICA,CODIGO_AREA_GERAL_CINE,AREA_GERAL_CINE,CODIGO_AREA_DETALHADA_CINE,AREA_DETALHADA_CINE,CODIGO_AREA_ESPECIFICA_CINE,AREA_ESPECIFICA_CINE,ROTULO_CINE,AVALIACAO_OFICIAL,fonte_arquivo
0,NÃO,Aguardando Pagamento,N,200810426,EAD,2009,2009-02-26 00:00:00,,Credenciamento EAD,Instituição,...,,,,,,,,,,2018_anonimizado.xlsx
1,NÃO,Aguardando Pagamento,N,200810426,EAD,2009,2009-02-26 00:00:00,,Credenciamento EAD,Instituição,...,,,,,,,,,,2018_anonimizado.xlsx
2,NÃO,Aguardando Pagamento,N,200810426,EAD,2009,2009-02-26 00:00:00,,Credenciamento EAD,Instituição,...,,,,,,,,,,2018_anonimizado.xlsx
3,NÃO,Aguardando Pagamento,S,200810426,EAD,2009,2009-02-26 00:00:00,,Credenciamento EAD,Instituição,...,,,,,,,,,,2018_anonimizado.xlsx
4,NÃO,Arquivado,N,20070028,PRESENCIAL,2008,2008-09-26 00:00:00,SERES/DIREG/CGRERCES,Reconhecimento de Curso,Curso,...,Humanidades (exceto línguas),2.0,Artes e humanidades,223.0,Filosofia e ética,22.0,Humanidades (exceto línguas),Filosofia,Regulação,2018_anonimizado.xlsx
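
The read loop tags every frame with its source file before concatenating. The sketch below replays that pattern on toy frames to show how the `fonte_arquivo` column allows a per-file row count after the merge. The in-memory frames and their values are illustrative stand-ins, not the real workbooks.

```python
import pandas as pd

# Toy stand-ins for the two yearly files (values are illustrative, not real data).
frames = {
    "2018_anonimizado.xlsx": pd.DataFrame(
        {"DATA": ["2009-02-26 00:00:00", "2008-09-26 00:00:00"]}
    ),
    "2019_anonimizado.xlsx": pd.DataFrame({"DATA": ["2019-01-15 00:00:00"]}),
}

parts = []
for name, frame in frames.items():
    tagged = frame.copy()
    tagged["fonte_arquivo"] = name  # same provenance column as in the notebook
    parts.append(tagged)

combined = pd.concat(parts, ignore_index=True)

# Rows contributed by each source file; the total must match the consolidated frame.
rows_per_file = combined["fonte_arquivo"].value_counts().sort_index()
assert rows_per_file.sum() == len(combined)
```

Running the same `value_counts()` on the real `df` would show how the 732,261 consolidated rows split between the two files.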


## 3) Build DIM_TEMPO

In [4]:
date_cols = [c for c in ["DATA", "DATA_DO_ULTIMO_ATO", "DATA_DE_ENTRADA_FASE_ATUAL"] if c in df.columns]
if not date_cols:
    raise KeyError("Date columns not found (DATA/DATA_DO_ULTIMO_ATO/DATA_DE_ENTRADA_FASE_ATUAL).")

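all_dates = []
for c in date_cols:
    # Values are strings (read with dtype=str); errors="coerce" turns unparseable entries into NaT.
    # normalize() sets the time to midnight so each calendar day appears only once.
    d = pd.to_datetime(df[c], errors="coerce", dayfirst=True).dt.normalize()
    all_dates.append(d)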

dates = pd.concat(all_dates, axis=0).dropna().drop_duplicates().sort_values()

dim_tempo = pd.DataFrame({"data": dates})
dim_tempo["id_data"] = dim_tempo["data"].dt.strftime("%Y%m%d").astype(int)
dim_tempo["ano"] = dim_tempo["data"].dt.year
dim_tempo["mes"] = dim_tempo["data"].dt.month
dim_tempo["dia"] = dim_tempo["data"].dt.day
dim_tempo["trimestre"] = dim_tempo["data"].dt.quarter
dim_tempo["semana_ano"] = dim_tempo["data"].dt.isocalendar().week.astype(int)
dim_tempo["dia_semana"] = dim_tempo["data"].dt.dayofweek
dim_tempo["nome_dia"] = dim_tempo["data"].dt.day_name()
dim_tempo["nome_mes"] = dim_tempo["data"].dt.month_name()

print("✅ DIM_TEMPO ready:", dim_tempo.shape)
dim_tempo.head()


✅ DIM_TEMPO ready: (6055, 10)


Unnamed: 0,data,id_data,ano,mes,dia,trimestre,semana_ano,dia_semana,nome_dia,nome_mes
4549,1969-12-31,19691231,1969,12,31,4,1,2,Wednesday,December
5667,2007-01-03,20070103,2007,1,3,1,1,2,Wednesday,January
13775,2007-01-06,20070106,2007,1,6,1,1,5,Saturday,January
12455,2007-01-08,20070108,2007,1,8,1,2,0,Monday,January
16027,2007-01-10,20070110,2007,1,10,1,2,2,Wednesday,January
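
The `DATA` values in the Silver preview are year-first strings such as `2009-02-26 00:00:00`, so `dayfirst=True` is probably not needed for them, and recent pandas versions may warn about that combination. A minimal sketch of parsing with explicit formats instead is shown below. The function name `parse_mixed_dates` and the `DD/MM/YYYY` fallback format are assumptions, since the notebook does not show which formats the other two date columns use.

```python
import pandas as pd


def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Parse DD/MM/YYYY first (assumed format), then year-first timestamps.

    Hypothetical helper; anything matching neither format becomes NaT.
    """
    day_first = pd.to_datetime(values, format="%d/%m/%Y", errors="coerce")
    year_first = pd.to_datetime(values, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Keep the day-first result where it exists, otherwise use the year-first one.
    return day_first.fillna(year_first).dt.normalize()


sample = pd.Series(["2009-02-26 00:00:00", "03/02/2009", "not a date", None])
parsed = parse_mixed_dates(sample)
```

Explicit formats make the interpretation of a value like `03/02/2009` a stated decision (3 February here) instead of something inferred per column.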


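The dimension built in this section holds only the 6,055 dates that actually occur in the source columns, so days without any event are absent. If a gap-free calendar is preferred, one possible alternative is sketched below. The function `build_calendar` and its bounds are hypothetical; the column names mirror `dim_tempo`.

```python
import pandas as pd


def build_calendar(start: str, end: str) -> pd.DataFrame:
    """Build a gap-free date dimension with the same columns as dim_tempo (illustrative)."""
    days = pd.date_range(start=start, end=end, freq="D")
    cal = pd.DataFrame({"data": days})
    cal["id_data"] = cal["data"].dt.strftime("%Y%m%d").astype(int)  # e.g. 20070103
    cal["ano"] = cal["data"].dt.year
    cal["mes"] = cal["data"].dt.month
    cal["dia"] = cal["data"].dt.day
    cal["trimestre"] = cal["data"].dt.quarter
    cal["semana_ano"] = cal["data"].dt.isocalendar().week.astype(int)  # ISO week number
    cal["dia_semana"] = cal["data"].dt.dayofweek  # Monday=0 ... Sunday=6
    cal["nome_dia"] = cal["data"].dt.day_name()
    cal["nome_mes"] = cal["data"].dt.month_name()
    return cal


calendar_2007 = build_calendar("2007-01-01", "2007-12-31")
```

The bounds deserve a deliberate choice: taking them from the observed minimum would stretch the calendar back to the isolated `1969-12-31` row, which may be a placeholder value rather than a real event date.
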
## 4) Export

In [5]:
out_file = OUT_DIR / "dim_tempo.csv"
dim_tempo.to_csv(out_file, index=False, encoding="utf-8")
print("✅ Saved to:", out_file)


✅ Saved to: C:\Users\LeaoN\OneDrive\Documents\GitHub\data_case_analysis\gold\output\dim_tempo.csv
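
CSV stores dates as plain text, so a consumer has to re-parse the `data` column when loading the file. The sketch below shows a round-trip check on a tiny stand-in frame written to a temporary folder. The two rows reuse dates from the notebook's preview, and the temporary path is an assumption made so the example does not touch `gold/output`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for dim_tempo with two of the previewed dates.
sample = pd.DataFrame(
    {
        "data": pd.to_datetime(["2007-01-03", "2007-01-06"]),
        "id_data": [20070103, 20070106],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dim_tempo.csv"
    sample.to_csv(path, index=False, encoding="utf-8")
    # parse_dates restores the datetime dtype that CSV cannot store natively.
    loaded = pd.read_csv(path, parse_dates=["data"])

# The surrogate key should stay unique and the dates should survive unchanged.
keys_unique = loaded["id_data"].is_unique
dates_match = loaded["data"].tolist() == sample["data"].tolist()
```

The same two checks can be run against the real `out_file` after the export cell.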
