# 05_refinement_silver_load_job
---
This notebook loads the aggregated **Silver** layer dataset (`flights_aggregated.parquet`) into PostgreSQL, populating the `silver.flights_silver` table as defined by the layer's DDL.


In [1]:
# Parameters

run_mode = "latest"
run_date = None

silver_path = "/opt/airflow/data-layer/silver"
postgres_conn_id = "dw_postgres_conn"


In [2]:
from pathlib import Path
from pyspark.sql import DataFrame

from transformer.utils.spark_helpers import get_spark_session, load_to_postgres
from transformer.utils.file_io import find_partition
from transformer.utils.logger import get_logger

log = get_logger("refinement.flights_silver_load")

spark = get_spark_session("SilverLoad")
log.info("[Refinement][Load] Sessão Spark iniciada.")


2025-11-13 23:10:25 [INFO] spark_helpers | [INFO] Logger inicializado no modo standalone (INFO).
2025-11-13 23:10:25 [INFO] file_io | [INFO] Logger inicializado no modo standalone (INFO).
2025-11-13 23:10:25 [INFO] refinement.flights_silver_load | [INFO] Logger inicializado no modo standalone (INFO).
/usr/local/lib/python3.12/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found


:: loading settings :: url = jar:file:/usr/local/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0ffa3c4d-c547-48be-9893-a18fafc08047;1.0
	confs: [default]
	found org.postgresql#postgresql;42.7.3 in central
	found org.checkerframework#checker-qual;3.42.0 in central
:: resolution report :: resolve 139ms :: artifacts dl 5ms
	:: modules in use:
	org.checkerframework#checker-qual;3.42.0 from central in [default]
	org.postgresql#postgresql;42.7.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------

In [3]:
try:
    log.info("[Refinement][SilverLoad] Iniciando execução do job de carga.")

    # Find the most recent partition of the silver layer
    partition = find_partition(
        base_path=silver_path,
        mode=run_mode,
        date_str=run_date,
    )

    base_dir = Path(silver_path) / partition / "PARQUET"
    parquet_path = base_dir / "flights_aggregated.parquet"

    if not parquet_path.exists():
        raise FileNotFoundError(
            f"[Refinement][SilverLoad][ERROR] Arquivo não encontrado: {parquet_path}."
        )

    log.info(f"[Refinement][SilverLoad] Lendo dataset agregado: {parquet_path}.")
    df = spark.read.parquet(str(parquet_path))

    df.printSchema()

    log.info(f"[Refinement][SilverLoad] Tuplas encontradas: {df.count():,}.")

    # Load into PostgreSQL
    log.info("[Refinement][SilverLoad] Inserindo dados em silver.flights_silver.")

    load_to_postgres(
        df=df,
        db_conn_id=postgres_conn_id,
        table_name="silver.flights_silver",
    )

    log.info("[Refinement][SilverLoad] Carga concluída com sucesso.")

except Exception as e:
    log.exception(f"[Refinement][SilverLoad][ERROR] Falha na execução: {e}")
    raise
finally:
    log.info("[Refinement][SilverLoad] Finalizando execução do job.")


2025-11-13 23:10:31 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Iniciando execução do job de carga.
2025-11-13 23:10:31 [INFO] file_io | [INFO] Partição selecionada: 2025-11-12
2025-11-13 23:10:31 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Lendo dataset agregado: /opt/airflow/data-layer/silver/2025-11-12/PARQUET/flights_aggregated.parquet.

root
 |-- flight_id: long (nullable = true)
 |-- flight_year: short (nullable = true)
 |-- flight_month: short (nullable = true)
 |-- flight_day: short (nullable = true)
 |-- flight_day_of_week: short (nullable = true)
 |-- flight_date: date (nullable = true)
 |-- airline_iata_code: string (nullable = true)
 |-- airline_name: string (nullable = true)
 |-- flight_number: integer (nullable = true)
 |-- tail_number: string (nullable = true)
 |-- origin_airport_iata_code: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- origin_state: string (nullable = true)
 |-- origin_latitude: double (nullable = true)
 |-- origin_longitude: double (nullable = true)
 |-- dest_airport_iata_code: string (nullable = true)
 |-- dest_airport_name: string (nullable = true)
 |-- dest_city: string (nullable = true)
 |-- dest_state: string (nullable = true)
 |-- dest_latitude: double (nullable = true)
 |-- dest_longitude: double (nullable = true)

2025-11-13 23:10:36 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Tuplas encontradas: 5,208,259.
2025-11-13 23:10:36 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Inserindo dados em silver.flights_silver.
2025-11-13 23:10:36 [WARN] spark_helpers | [WARN] Airflow indisponível, utilizando variáveis de ambiente para conexão PostgreSQL.
2025-11-13 23:10:36 [INFO] spark_helpers | [LOAD] Limpando tabela 'silver.flights_silver'.
25/11/13 23:10:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
2025-11-13 23:18:43 [INFO] spark_helpers | [LOAD] Dados carregados em 'silver.flights_silver' com sucesso (modo: append).
2025-11-13 23:18:43 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Carga concluída com sucesso.
2025-11-13 23:18:43 [INFO] refinement.flights_silver_load | [Refinement][SilverLoad] Finalizando execução do job.
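
The log above shows that `load_to_postgres` fell back to environment variables because Airflow was unavailable, cleared the target table, and then wrote the rows in `append` mode. The helper's source is not part of this notebook, so the following is only a minimal sketch of the fallback and append steps. The environment variable names (`DW_POSTGRES_HOST` and so on) and both function names are hypothetical. The table-clearing step is omitted because the log does not reveal how it is performed.

```python
import os
from typing import Any, Mapping, Optional


def build_jdbc_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build JDBC settings from environment variables (variable names are assumed)."""
    env = os.environ if env is None else env
    host = env.get("DW_POSTGRES_HOST", "localhost")
    port = env.get("DW_POSTGRES_PORT", "5432")
    database = env["DW_POSTGRES_DB"]
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "properties": {
            "user": env["DW_POSTGRES_USER"],
            "password": env["DW_POSTGRES_PASSWORD"],
            # Driver class shipped in the org.postgresql:postgresql artifact.
            "driver": "org.postgresql.Driver",
        },
    }


def append_to_postgres(df: Any, table_name: str, config: dict) -> None:
    """Append a Spark DataFrame to a PostgreSQL table through the JDBC writer."""
    df.write.jdbc(
        url=config["url"],
        table=table_name,
        mode="append",
        properties=config["properties"],
    )
```

Under these assumptions, the call in the load cell would roughly correspond to `append_to_postgres(df, "silver.flights_silver", build_jdbc_config())`.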

In [4]:
%%script false --no-raise-error # Comment out this line when debugging or if you want to run the cell.

df.printSchema()

df.limit(5).show(truncate=False)


root
 |-- flight_id: long (nullable = true)
 |-- flight_year: short (nullable = true)
 |-- flight_month: short (nullable = true)
 |-- flight_day: short (nullable = true)
 |-- flight_day_of_week: short (nullable = true)
 |-- flight_date: date (nullable = true)
 |-- airline_iata_code: string (nullable = true)
 |-- airline_name: string (nullable = true)
 |-- flight_number: integer (nullable = true)
 |-- tail_number: string (nullable = true)
 |-- origin_airport_iata_code: string (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- origin_state: string (nullable = true)
 |-- origin_latitude: double (nullable = true)
 |-- origin_longitude: double (nullable = true)
 |-- dest_airport_iata_code: string (nullable = true)
 |-- dest_airport_name: string (nullable = true)
 |-- dest_city: string (nullable = true)
 |-- dest_state: string (nullable = true)
 |-- dest_latitude: double (nullable = true)
 |-- dest_longitude: double (nullable = true)



+-----------+-----------+------------+----------+------------------+-----------+-----------------+---------------------------+-------------+-----------+------------------------+---------------------------------------+-----------------+------------+---------------+----------------+----------------------+-----------------------------------------------------------------------------+----------+----------+-------------+--------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------+-------------+--------+-------+--------+------------+--------------+--------+----------------+--------------+-------------+-------------------+-------------+
|flight_id  |flight_year|flight_month|flight_day|flight_day_of_week|flight_date|airline_iata_code|airline_name               |flight_number|tail_number|origin_airport_iata_code|origin_airport_name                    |origin_city      |origin_state|origin_latitude|origin_lo

In [5]:
# Stop the Spark session
spark.stop()
log.info("[Refinement][Load] Sessão Spark finalizada.")

2025-11-13 23:21:38 [INFO] refinement.flights_silver_load | [Refinement][Load] Sessão Spark finalizada.
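
The timestamps recorded by the load cell give a rough idea of throughput. The sketch below takes the moment the insert was announced (23:10:36) and the moment the load was reported as complete (23:18:43), together with the logged row count of 5,208,259. The interval also includes clearing the table, so the figure is an approximation rather than a measurement of the JDBC write alone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps and row count copied from the load cell's log output.
started = datetime.strptime("2025-11-13 23:10:36", FMT)
finished = datetime.strptime("2025-11-13 23:18:43", FMT)
rows = 5_208_259

elapsed_seconds = (finished - started).total_seconds()
rows_per_second = rows / elapsed_seconds

print(f"{elapsed_seconds:.0f} s elapsed, about {rows_per_second:,.0f} rows/s")
# prints "487 s elapsed, about 10,695 rows/s"
```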
