# Petrinex Volumetrics - Load and Display

Load Alberta volumetric data from Petrinex into Spark DataFrames.

**Features:**
- ✅ Unity Catalog compatible (no ANY FILE privilege needed)
- ✅ Direct repo import (no pip install needed)
- ✅ Auto ZIP extraction (handles nested ZIPs)
- ✅ Memory efficient (incremental union + checkpointing)
- ✅ Robust error handling (skips missing files)
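
To illustrate what "handles nested ZIPs" can mean in practice, here is a minimal sketch using only the standard library. It is not the `petrinex` package's actual implementation: the function name `extract_csv_members` and the `max_depth` guard are hypothetical, chosen for illustration.

```python
import io
import zipfile
from typing import Dict


def extract_csv_members(zip_bytes: bytes, max_depth: int = 5) -> Dict[str, bytes]:
    """Return {member_name: csv_bytes} for every CSV found, descending into nested ZIPs.

    Hypothetical helper for illustration only.
    """
    found: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            lower = name.lower()
            if lower.endswith(".zip") and max_depth > 0:
                # The member is itself an archive: recurse into it in memory
                found.update(extract_csv_members(data, max_depth - 1))
            elif lower.endswith(".csv"):
                found[name] = data
    return found
```

Everything stays in memory, so no files are written to disk. Note that CSV members with the same name in different inner archives would overwrite each other in this sketch.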


## Setup - Import from Repo


In [None]:
import sys, os

# Add the notebook's folder to the Python path so the local `petrinex` package can be imported
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
sys.path.insert(0, os.path.dirname(notebook_path))

from petrinex import PetrinexVolumetricsClient
from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Initialize client
client = PetrinexVolumetricsClient(spark=spark, jurisdiction="AB", file_format="CSV")
print("✓ Ready")


## List Available Files (Optional)


In [None]:
# List files updated in the last 30 days
cutoff = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d")
files = client.list_updated_after(cutoff)

print(f"Found {len(files)} files updated after {cutoff}\n")
for f in files[:10]:
    print(f"{f.production_month} | {f.updated_ts}")


## Load Data

**Memory efficient:** Each file is unioned into the running DataFrame as it loads, and the result is checkpointed every 10 files.

**Progress:** Loading status is printed for each file as it is processed.
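
The incremental-union pattern can be sketched roughly as follows. This is an assumption about the general technique, not the client's actual code: the helper name `union_incrementally` is hypothetical, and `localCheckpoint()` is used here only because it needs no checkpoint directory; the client may use a different checkpointing call.

```python
from typing import Iterable


def union_incrementally(frames: Iterable, checkpoint_every: int = 10):
    """Union Spark DataFrames one at a time, truncating lineage periodically.

    Hypothetical helper for illustration; returns None if `frames` is empty.
    """
    combined = None
    for i, frame in enumerate(frames, start=1):
        if combined is None:
            combined = frame
        else:
            # Match columns by name; columns missing on either side are filled with nulls
            combined = combined.unionByName(frame, allowMissingColumns=True)
        if i % checkpoint_every == 0:
            # Materialize the result and cut the query plan so it does not grow with every file
            combined = combined.localCheckpoint()
    return combined
```

Without the periodic checkpoint, the query plan would contain one union node per file, which tends to slow down planning as the number of files grows.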


In [None]:
# Load data - will show progress for each file
df = client.read_updated_after_as_spark_df_via_pandas(
    "2025-12-01",  # Change date as needed
    pandas_read_kwargs={"dtype": str, "encoding": "latin1"}
)

# Cache the final result; the count() below materializes the cache
df.cache()
row_count = df.count()
print(f"\n✅ Final DataFrame: {row_count:,} rows × {len(df.columns)} columns")


## Display Data


In [None]:
# Show schema
df.printSchema()


In [None]:
# Show sample data
display(df.limit(100))


In [None]:
# Records by production month
display(
    df.groupBy("production_month")
    .agg(F.count("*").alias("records"))
    .orderBy("production_month")
)


## Optional: Save to Delta

Uncomment to persist data to a Delta table:


In [None]:
# Uncomment to save:
# table_name = "main.petrinex.volumetrics_raw"
#
# # Derive partition columns (assumes production_month is formatted "YYYY-MM")
# df_with_parts = df.withColumn("year", F.substring("production_month", 1, 4)) \
#                   .withColumn("month", F.substring("production_month", 6, 2))
# 
# df_with_parts.write.format("delta") \
#     .mode("overwrite") \
#     .partitionBy("year", "month") \
#     .saveAsTable(table_name)
# 
# print(f"✓ Saved to {table_name}")
