### Real-time Data, Hourly Average

In [2]:
from pyspark.sql.functions import col, to_timestamp, date_trunc, avg

# Read Bronze table
df_fingrid = spark.read.format("delta").load("Tables/bronze_fingrid_hydro")

# Convert timestamps and compute hourly averages
silver_fingrid = (
    df_fingrid
    .withColumn("startTime", to_timestamp(col("startTime")))
    .withColumn("hour", date_trunc("hour", col("startTime")))
    .groupBy("hour")
    .agg(avg("value").alias("generation_mw"))
    .orderBy("hour")
)

# Save as Silver table
silver_fingrid.write.mode("overwrite").format("delta").save("Tables/silver_fingrid_hourly")
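
To make the aggregation concrete, here is a small plain-Python sketch of what the `date_trunc("hour", ...)` and `avg` steps do to a handful of readings. The sample timestamps and values are made up for illustration, and `hourly_average` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict
from datetime import datetime


def hourly_average(readings):
    """Average (timestamp, value) readings per hour, like date_trunc('hour') + avg."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the start of its hour
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    # Return the hours in ascending order with their mean value
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


# Illustrative 3-minute readings (made-up values, in MW)
sample = [
    (datetime(2024, 1, 1, 10, 0), 1000.0),
    (datetime(2024, 1, 1, 10, 3), 1100.0),
    (datetime(2024, 1, 1, 10, 6), 1200.0),
    (datetime(2024, 1, 1, 11, 0), 900.0),
]

hourly = hourly_average(sample)
```

The three readings in the 10:00 hour collapse into one row holding their mean, which is the shape the Silver table ends up with.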



### Clean and Filter Zenodo (Historical Modeled Data)

In [1]:
# Extract the Finnish columns from the Europe-wide dataset

from pyspark.sql import functions as F

df_zenodo = spark.read.format("delta").load("Tables/bronze_zenodo_hydro")

df_zenodo = df_zenodo.withColumnRenamed("T1h", "timestamp").withColumn(
    "timestamp", F.to_timestamp("timestamp")
)

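# Find all Finnish columns by plant type (run-of-river and storage)
ror_cols = [c for c in df_zenodo.columns if c.startswith("FI") and c.endswith("_ROR")]
sto_cols = [c for c in df_zenodo.columns if c.startswith("FI") and c.endswith("_STO")]
# Fail early if nothing matched; the averages below divide by the column counts
if not ror_cols or not sto_cols:
    raise ValueError("No Finnish ROR/STO columns found")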

silver_zenodo = (
    df_zenodo
    .select("timestamp", *ror_cols, *sto_cols)
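    # Row-wise mean per plant type: Python's built-in sum() adds the Column objects;
    # a null in any input column makes that row's average null
    .withColumn("avg_ror_cf", sum(F.col(c) for c in ror_cols) / len(ror_cols))
    .withColumn("avg_sto_cf", sum(F.col(c) for c in sto_cols) / len(sto_cols))
    # Unweighted mean of the two plant types
    .withColumn("avg_cf", (F.col("avg_ror_cf") + F.col("avg_sto_cf")) / 2)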
    .orderBy("timestamp")
)

silver_zenodo.write.mode("overwrite").format("delta").save("Tables/silver_zenodo_fi_cf")



### Summarize Hydropower Metadata

In [3]:
from pyspark.sql.functions import col, sum as _sum

df_meta = spark.read.format("delta").load("Tables/bronze_hydro_metadata")

silver_meta = (
    df_meta
    .filter(col("country_code") == "FI")
    .groupBy("type")
    .agg(_sum(col("installed_capacity_mw")).alias("total_capacity_mw"))
)

silver_meta.write.mode("overwrite").format("delta").save("Tables/silver_meta_capacity_fi")



In [4]:
df = spark.sql("SELECT * FROM hydro_energy_lakehouse.silver_meta_capacity_fi LIMIT 1000")
display(df)



- HDAM → Hydro DAM = Storage / Reservoir plants

- HROR → Hydro Run-of-River = Continuous-flow plants


Finland has roughly 51% storage-type and 49% run-of-river hydropower capacity.
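
The split can be checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the two Finnish totals quoted in this notebook (1345 MW for HDAM and 1289.6 MW for HROR); the variable names are illustrative only.

```python
# Finnish totals per plant type as quoted in this notebook (MW)
capacity_mw = {"HDAM": 1345.0, "HROR": 1289.6}

# Total installed capacity and each type's share of it
total_mw = sum(capacity_mw.values())
shares = {plant_type: mw / total_mw for plant_type, mw in capacity_mw.items()}

for plant_type, share in shares.items():
    print(f"{plant_type}: {share:.1%}")
```

With these inputs the shares come out at about 51% for HDAM and 49% for HROR.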

### How all the datasets work together

1. Fingrid → “Actual”

- Actual hourly production, obtained by averaging the 3-minute readings into hourly values.

2. JRC Metadata → “Capacity Context”

- Total Finnish installed hydropower capacity = 1345 MW (HDAM) + 1289.6 MW (HROR) = 2634.6 MW total.

3. Zenodo → “Historical Potential”

- Historical modeled capacity factors (the share of installed capacity that could have been producing power in each hour).

Next step is to join them in the Gold layer:

- Compute Observed Capacity Factor = hourly Fingrid generation (MW) / 2634.6 MW

- Compute Deviation = Observed CF – Zenodo CF
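
The two Gold-layer formulas can be sketched in plain Python before writing the Spark join. This is an illustration only: `observed_capacity_factor` and `deviation_by_hour` are hypothetical helpers, the sample numbers are made up, and the sketch assumes both inputs are keyed by the same hour label. How hours from the two datasets are actually matched is left open here.

```python
TOTAL_CAPACITY_MW = 2634.6  # HDAM + HROR total from the metadata summary


def observed_capacity_factor(generation_mw, capacity_mw=TOTAL_CAPACITY_MW):
    """Observed CF = hourly generation (MW) / installed capacity (MW)."""
    return generation_mw / capacity_mw


def deviation_by_hour(observed_mw, modeled_cf):
    """Inner-join the two inputs on hour and return observed CF minus modeled CF."""
    result = {}
    for hour, generation_mw in observed_mw.items():
        if hour not in modeled_cf:
            continue  # hours missing from either side are dropped (assumed inner join)
        result[hour] = observed_capacity_factor(generation_mw) - modeled_cf[hour]
    return result


# Made-up sample values, keyed by an hour label (assumption)
observed = {"10:00": 1317.3, "11:00": 1000.0}
modeled = {"10:00": 0.45}

deviation = deviation_by_hour(observed, modeled)
```

A positive deviation means the observed output was above the modeled historical level for that hour. In Spark the same logic would be a join on the hour column followed by two derived columns.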