### Importação das bibliotecas necessárias

Esta seção consolida todas as importações de bibliotecas necessárias

In [0]:
from pyspark.sql import functions as F

In [0]:
spark.sql("USE CATALOG mvp")
spark.sql("USE SCHEMA gold")

DataFrame[]

In [0]:
df_silver_stations = spark.table(f"mvp.silver.stations")

### Criação de recursos
Inclusão de novas colunas representando características obtidas indiretamente, ou seja, por meio de cálculo. Dentro as colunas, destaca-se o period_size, que é o intervalo entre a data final e inicial de operação de uma estação. Essa feature permite uma série de conclusões posteriores a respeito da operacionalidade da rede.

In [0]:
df_silver_stations = (
    df_silver_stations
    .withColumn(
        "period_size",
        F.datediff(F.col("last_record"), F.col("first_record"))
    )
)

### Criação de indicadores de status das estações

Nesta célula, são adicionadas colunas derivadas à tabela Gold para classificar o status das estações. As estações são categorizadas como **New** ou **Old** com base no ano do primeiro registro, identificadas como **Active** ou **Inactive** a partir do último registro disponível, e consolidadas em um campo único (`combined_status`) que resume essas condições de forma descritiva. Assim é possível constatar se a estação é recente ou antiga na rede nas análises futuras.

In [0]:
df_silver_stations = (
    df_silver_stations
    .withColumn("status", F.when(F.year("first_record") > 2015, F.lit("New")).otherwise(F.lit("Old")))
    .withColumn("inactive", F.year("last_record") < 2025)
    .withColumn(
        "combined_status",
        F.concat(
            F.col("status"),
            F.lit(" & "),
            F.when(F.col("inactive"), F.lit("Inactive"))
             .otherwise(F.lit("Active"))
        )
    )
)

### Persistência da tabela Gold

Nesta célula, o DataFrame é salvo no schema gold, utilizando o modo **overwrite** para substituir qualquer versão existente. Isso garante que os dados tratados e padronizados estejam disponíveis de forma consistente para consultas e análises posteriores.


In [0]:
df_silver_stations.write.format("delta").mode("overwrite").saveAsTable("stations")

### Consulta de validação da tabela `stations`

Nesta célula, é executada uma consulta SQL para exibir uma amostra dos registros da tabela `stations`, verificando se os dados foram persistidos corretamente após as transformações aplicadas.


In [0]:
%sql
select * from stations limit 10

region,uf,city,code,first_record,last_record,period_size,status,inactive,combined_status
CO,DF,BRASILIA,A001,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z,9155,Old,False,Old & Active
NE,BA,SALVADOR,A401,2000-05-13T00:00:00.000Z,2025-05-31T00:00:00.000Z,9149,Old,False,Old & Active
N,AM,MANAUS,A101,2000-05-09T00:00:00.000Z,2025-05-31T00:00:00.000Z,9153,Old,False,Old & Active
SE,RJ,ECOLOGIA AGRICOLA,A601,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z,9155,Old,False,Old & Active
S,RS,PORTO ALEGRE,A801,2000-09-22T00:00:00.000Z,2025-05-31T00:00:00.000Z,9017,Old,False,Old & Active
CO,GO,GOIANIA,A002,2001-05-29T00:00:00.000Z,2025-05-19T00:00:00.000Z,8756,Old,False,Old & Active
CO,GO,MORRINHOS,A003,2001-05-25T00:00:00.000Z,2025-05-31T00:00:00.000Z,8772,Old,False,Old & Active
CO,MS,CAMPO GRANDE,A702,2001-09-10T00:00:00.000Z,2025-05-31T00:00:00.000Z,8664,Old,False,Old & Active
CO,MS,PONTA PORA,A703,2001-09-07T00:00:00.000Z,2025-05-31T00:00:00.000Z,8667,Old,False,Old & Active
CO,MS,TRES LAGOAS,A704,2001-09-03T00:00:00.000Z,2025-05-31T00:00:00.000Z,8671,Old,False,Old & Active


### Documentação da tabela Gold `mvp.gold.stations`

Nesta célula, é definido o comentário descritivo da tabela `mvp.gold.stations`, explicando seu conteúdo e finalidade analítica. Em seguida, são adicionados comentários às colunas derivadas, documentando indicadores calculados relacionados ao período de registros e ao status operacional das estações, garantindo melhor entendimento e uso dos dados na camada Gold.


In [0]:
spark.sql("""
    comment on table mvp.gold.stations is
    'The table contains information about various stations, including their geographical locations and operational details. It can be used for mapping station locations, analyzing regional coverage, and tracking the operational history of each station. Key data points include the region, state, city, and latitude/longitude coordinates.'
""")

COLUMN_COMMENTS = [
    ("period_size", "Number of records in the recorded period"),
    ("status", "Current operational condition represented as text"),
    ("inactive", "Indicates whether the station is currently active or not"),
    ("combined_status", "Aggregated operational and activity condition of the station"),
]

for column, comment in COLUMN_COMMENTS:
    spark.sql(f"comment on column mvp.gold.stations.`{column}` is '{comment}'")

### Verificação do schema e metadata da tabela Gold

Nesta célula, é utilizado o comando `DESCRIBE EXTENDED` para inspecionar os metadados completos da tabela `mvp.gold.stations`. Um identificador auxiliar é criado para permitir a extração de um trecho específico das informações, iniciando a partir da seção **Catalog**. Em seguida, o comando `DESCRIBE` padrão é executado para exibir apenas o schema da tabela, possibilitando a validação final das colunas e tipos de dados.


In [0]:
df_describe = spark.sql("describe extended mvp.gold.stations")
df_describe = df_describe.withColumn("_id", F.monotonically_increasing_id())
target_id = df_describe.filter("col_name = 'Catalog'").select("_id").first()._id

table_describe = df_describe.filter(f"_id >= {target_id}").limit(9)
display(table_describe.drop("_id"))

display(spark.sql("describe mvp.gold.stations"))

col_name,data_type,comment
Catalog,mvp,
Database,gold,
Table,stations,
Created Time,Sun Dec 21 20:28:53 UTC 2025,
Last Access,UNKNOWN,
Created By,Spark,
Statistics,"12921 bytes, 615 rows",
Type,MANAGED,
Comment,"The table contains information about various stations, including their geographical locations and operational details. It can be used for mapping station locations, analyzing regional coverage, and tracking the operational history of each station. Key data points include the region, state, city, and latitude/longitude coordinates.",


col_name,data_type,comment
region,string,The geographical area or zone where the station is located
uf,string,State where the station is located
city,string,The city where the station is located
code,string,Unique identifier assigned to each station
first_record,timestamp,Timestamp of the first recorded data entry from the station.
last_record,timestamp,Timestamp of the most recent data entry
period_size,int,Number of records in the recorded period
status,string,Current operational condition represented as text
inactive,boolean,Indicates whether the station is currently active or not
combined_status,string,Aggregated operational and activity condition of the station


### Criação da tabela integrada de dados de estações e clima

Nesta célula, é criada (ou recriada) a tabela `stations_data` a partir da junção entre as tabelas Silver de estações (`mvp.silver.stations`) e dados meteorológicos (`mvp.silver.weather_data`). A junção é realizada pela chave da estação, resultando em um conjunto de dados unificado que combina informações temporais e meteorológicas com os atributos cadastrais das estações, pronto para análises integradas.


In [0]:
%sql
create or replace table stations_data as
select
    wd.datetime,
    wd.temperature,
    wd.dew_point,
    wd.wind_speed,
    wd.wind_direction,
    wd.precipitation,
    wd.pressure,
    wd.relative_humidity,
    wd.wind_gust,
    wd.radiation,
    s.code,
    s.city,
    s.uf,
    s.region,
    s.first_record,
    s.last_record
from mvp.silver.stations as s
inner join mvp.silver.weather_data as wd
on s.code = wd.station_code

num_affected_rows,num_inserted_rows


### Verificação do schema e metadata da tabela Gold

In [0]:
spark.sql("""
    comment on table mvp.gold.stations_data is
    'The table contains meteorological data collected from various weather stations. It includes hourly observations such as temperature, dew point, wind speed and direction, precipitation, atmospheric pressure, relative humidity, wind gusts, and radiation levels. This data can be used for weather analysis, climate research, and understanding local weather patterns over time.'
""")
df_describe = spark.sql("describe extended mvp.gold.stations_data")
df_describe = df_describe.withColumn("_id", F.monotonically_increasing_id())
target_id = df_describe.filter("col_name = 'Catalog'").select("_id").first()._id

table_describe = df_describe.filter(f"_id >= {target_id}").limit(9)
display(table_describe.drop("_id"))

col_name,data_type,comment
Catalog,mvp,
Database,gold,
Table,stations_data,
Created Time,Sun Dec 21 20:29:05 UTC 2025,
Last Access,UNKNOWN,
Created By,Spark,
Type,MANAGED,
Comment,"The table contains meteorological data collected from various weather stations. It includes hourly observations such as temperature, dew point, wind speed and direction, precipitation, atmospheric pressure, relative humidity, wind gusts, and radiation levels. This data can be used for weather analysis, climate research, and understanding local weather patterns over time.",
Collation,UTF8_BINARY,


### Consulta de validação da tabela `stations_data`

Nesta célula, é executada uma consulta SQL para exibir uma amostra dos registros da tabela `stations_data`, verificando se os dados foram persistidos corretamente após as transformações aplicadas.


In [0]:
%sql
select * from stations_data limit 10

datetime,temperature,dew_point,wind_speed,wind_direction,precipitation,pressure,relative_humidity,wind_gust,radiation,code,city,uf,region,first_record,last_record
2000-05-07T12:00:00.000Z,22.6,14.7,1.8,126.0,0.0,888.2,61.0,3.8,1506.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T13:00:00.000Z,24.2,14.7,2.7,75.0,0.0,888.4,55.0,4.7,2230.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T14:00:00.000Z,25.0,14.1,2.0,117.0,0.0,888.1,51.0,4.9,2675.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T15:00:00.000Z,26.2,13.2,2.5,58.0,0.0,887.4,44.0,5.8,2915.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T16:00:00.000Z,26.7,14.0,2.4,167.0,0.0,886.5,46.0,5.8,2523.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T17:00:00.000Z,26.6,13.6,1.8,178.0,0.0,885.9,45.0,4.3,2435.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T18:00:00.000Z,28.0,12.4,1.8,125.0,0.0,885.5,38.0,6.3,2530.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T19:00:00.000Z,26.6,12.5,1.1,53.0,0.0,885.6,41.0,3.8,1412.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T20:00:00.000Z,25.8,12.7,1.5,109.0,0.0,885.9,44.0,3.0,540.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
2000-05-07T21:00:00.000Z,24.1,13.4,1.3,197.0,0.0,886.2,51.0,3.2,34.0,A001,BRASILIA,DF,CO,2000-05-07T00:00:00.000Z,2025-05-31T00:00:00.000Z
