# 03_serving_cleanup_gold_job

This notebook cleans up temporary files in the gold layer by removing the intermediate dataset `flights_aggregated.parquet` generated during the **Silver** -> **Gold** transition. This keeps the layer organized and frees storage space.


In [2]:
# Parameters

run_mode = "latest"
run_date = None

gold_path = "/opt/airflow/data-layer/gold"


In [1]:
from pathlib import Path
from pyspark.sql import SparkSession

from transformer.utils.logger import get_logger
from transformer.utils.spark_helpers import get_spark_session
from transformer.utils.file_io import find_partition, delete_files

log = get_logger("serving_cleanup_gold")
spark = get_spark_session("CleanupGoldLayer")
log.info("[Serving][Cleanup] SparkSession iniciada.")


2025-11-09 20:51:18 [INFO] spark_helpers | [INFO] Logger inicializado no modo standalone (INFO).
2025-11-09 20:51:18 [INFO] file_io | [INFO] Logger inicializado no modo standalone (INFO).
2025-11-09 20:51:18 [INFO] serving_cleanup_gold | [INFO] Logger inicializado no modo standalone (INFO).

In [3]:
def cleanup_gold_job(spark: SparkSession, gold_path: str, run_mode: str = "latest", run_date: str | None = None) -> None:
    """
    Remove the temporary file flights_aggregated.parquet from the gold layer.

    Args:
        spark (SparkSession): Active Spark session.
        gold_path (str): Base path of the gold layer.
        run_mode (str): 'latest' or 'date'.
        run_date (str | None): Specific date, if applicable.
    """
    log.info("[Serving][Cleanup] Iniciando job de limpeza da gold.")

    partition = find_partition(gold_path, mode=run_mode, date_str=run_date)
    partition_dir = Path(gold_path) / partition / "PARQUET"
    target_file = partition_dir / "flights_aggregated.parquet"

    if not target_file.exists():
        log.warning(f"[Serving][Cleanup][WARN] Arquivo temporário não encontrado: {target_file}.")
        return

    log.info(f"[Serving][Cleanup] Removendo arquivo: {target_file}.")
    
    delete_files(spark, [str(target_file)])
    
    log.info("[Serving][Cleanup] Limpeza concluída com sucesso.")


In [4]:
try:
    cleanup_gold_job(spark, gold_path, run_mode, run_date)
except Exception as e:
    log.exception(f"[Serving][Cleanup][ERROR] Falha durante limpeza da camada gold: {e}.")
    raise
finally:
    log.info("[Serving][Cleanup] Fim do job de limpeza da gold.")


2025-11-09 20:52:25 [INFO] serving_cleanup_gold | [Serving][Cleanup] Iniciando job de limpeza da gold.
2025-11-09 20:52:25 [INFO] serving_cleanup_gold | [Serving][Cleanup] Removendo arquivo: /opt/airflow/data-layer/gold/2025-11-08/PARQUET/flights_aggregated.parquet.
2025-11-09 20:52:25 [INFO] file_io | [INFO] Deletando 1 arquivo(s).
2025-11-09 20:52:25 [INFO] file_io | [INFO] '/opt/airflow/data-layer/gold/2025-11-08/PARQUET/flights_aggregated.parquet' deletado com sucesso.
2025-11-09 20:52:25 [INFO] file_io | [INFO] Deleção concluída.
2025-11-09 20:52:25 [INFO] serving_cleanup_gold | [Serving][Cleanup] Limpeza concluída com sucesso.
2025-11-09 20:52:25 [INFO] serving_cleanup_gold | [Serving][Cleanup] Fim do job de limpeza da gold.


In [None]:
log.info("[Serving][Cleanup] Encerrando sessão Spark.")
spark.stop()
