# Petrinex Volumetrics Data - Databricks Example

This notebook demonstrates how to use the `petrinex` Python package to:
1. Fetch Alberta volumetric data from Petrinex
2. Load it into Spark DataFrames
3. Explore and display the data
4. Perform basic analysis

**✨ Features:**
- ✅ **Unity Catalog Compatible** - No `ANY FILE` privilege required
- ✅ **Direct import from repo** - No pip install needed
- ✅ **Read-only workflow** - Just load and display (save to Delta is optional)
- ✅ **Production ready** - Handles schema drift, encoding, and provenance


## 1. Setup - Import from Repo Directory

This notebook imports directly from the Databricks Repo directory.

**Requirements:**
- This notebook must be in the same repo directory as the `petrinex/` package
- Works with Databricks Repos integration (Git sync)
- No pip installation needed


In [None]:
# Import directly from Databricks Repo directory
import sys
import os

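# Get the current notebook's workspace path (e.g. /Repos/<user>/<repo>/<notebook>)
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
repo_root = os.path.dirname(notebook_path)

# notebookPath() typically omits the /Workspace filesystem prefix, so add it before using the path for imports
if not repo_root.startswith("/Workspace"):
    repo_root = "/Workspace" + repo_root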

# Add repo root to Python path so we can import petrinex package
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

print(f"✓ Added repo directory to Python path: {repo_root}")
print("✓ Ready to import petrinex package")
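
Once the repo root is on `sys.path`, it can be worth confirming that the package actually resolves before moving on. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, and `"json"` stands in for `"petrinex"` so the snippet runs outside Databricks.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if a top-level package resolves on the current sys.path."""
    return importlib.util.find_spec(package_name) is not None


# In the notebook the call would be is_importable("petrinex");
# "json" is used here only so the sketch runs anywhere.
print(is_importable("json"))
print(is_importable("package_that_does_not_exist_xyz"))
```

If the check returns `False` for `petrinex`, the most likely cause is that `repo_root` does not point at the directory containing the `petrinex/` folder.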


## 2. Initialize Client


In [None]:
from petrinex import PetrinexVolumetricsClient
from pyspark.sql import functions as F
from datetime import datetime, timedelta

# Initialize the Petrinex client
client = PetrinexVolumetricsClient(
    spark=spark,
    jurisdiction="AB",      # Alberta
    file_format="CSV"
)

print("✓ Petrinex client initialized successfully")


## 3. Explore Available Files

First, let's see what files have been updated recently.


In [None]:
# Check what files have been updated in the last 30 days
cutoff_date = (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d")

files = client.list_updated_after(cutoff_date)

print(f"Found {len(files)} file(s) updated after {cutoff_date}\n")
print("Production Month | Updated Date        | URL")
print("-" * 100)

for f in files[:10]:  # Show first 10
    print(f"{f.production_month:15} | {str(f.updated_ts):19} | {f.url}")

if len(files) > 10:
    print(f"\n... and {len(files) - 10} more files")
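
To make the filtering step concrete, here is a rough, standard-library sketch of what an "updated after" selection might look like. `FileEntry` is a hypothetical stand-in that only mirrors the attributes printed above (`production_month`, `updated_ts`, `url`); the strict `>` comparison, the sort order, and the example URLs are assumptions, not the package's documented behaviour.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one listed file (not the petrinex type)."""
    production_month: str
    updated_ts: datetime
    url: str


def updated_after(entries, cutoff: str):
    """Keep entries updated strictly after `cutoff` (YYYY-MM-DD), sorted by month."""
    cutoff_dt = datetime.strptime(cutoff, "%Y-%m-%d")
    kept = [e for e in entries if e.updated_ts > cutoff_dt]
    return sorted(kept, key=lambda e: e.production_month)


entries = [
    FileEntry("2025-11", datetime(2025, 12, 20, 8, 0), "https://example.invalid/2025-11.csv"),
    FileEntry("2025-12", datetime(2026, 1, 15, 8, 0), "https://example.invalid/2025-12.csv"),
    FileEntry("2025-10", datetime(2026, 1, 3, 9, 30), "https://example.invalid/2025-10.csv"),
]

recent = updated_after(entries, "2026-01-01")
print([e.production_month for e in recent])
```

Note that an older production month can reappear in the result when its file is amended later, which is why the selection keys on the update timestamp rather than the production month.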


## 4. Load Data (Recommended for Unity Catalog)

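This method downloads the files on the driver, parses them with pandas, and converts the result to a Spark DataFrame, which avoids the file permission issues Spark readers can hit under Unity Catalog.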

**Best Practices:**
- Use `dtype=str` to avoid mixed-type column issues
- Use `encoding="latin1"` for special characters
- Data is automatically unioned across months with schema alignment
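
The first two recommendations can be illustrated without pandas. The sketch below uses only the standard library and a made-up two-column CSV (`FacilityID` and `OperatorName` are illustrative names, not confirmed Petrinex columns) to show why Latin-1 bytes fail under UTF-8 and why identifiers should stay as strings.

```python
import csv
import io

# Made-up sample encoded as Latin-1: accented characters and a zero-padded identifier.
raw = b"FacilityID,OperatorName\n0012345,Soci\xe9t\xe9 \xc9nergie\n"

# Decoding as UTF-8 fails because 0xE9 / 0xC9 are not valid UTF-8 sequences here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as Latin-1 succeeds, and csv keeps every field as a string.
rows = list(csv.DictReader(io.StringIO(raw.decode("latin1"))))
facility_id = rows[0]["FacilityID"]

print(utf8_ok)
print(facility_id)            # leading zeros preserved as a string
print(str(int(facility_id)))  # leading zeros lost once parsed as a number
```

Passing `dtype=str` to pandas plays the same role as the string-only `csv` reader here: values are kept exactly as written, and numeric casts can be applied later on the columns that need them.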


In [None]:
# Define the cutoff date (e.g., load data updated in 2026)
updated_after = "2026-01-01"

print(f"Loading data updated after {updated_after}...")
print("This may take a few minutes depending on the number of files...\n")

# Read data using the pandas-based method (UC-friendly)
df = client.read_updated_after_as_spark_df_via_pandas(
    updated_after,
    pandas_read_kwargs={
        "dtype": str,           # Force all columns to string (avoid mixed types)
        "encoding": "latin1"    # Handle special characters properly
    },
    add_provenance_columns=True,  # Add tracking columns
    union_by_name=True            # Handle schema drift across months
)

# Cache the DataFrame for better performance
df.cache()

row_count = df.count()
print(f"✓ Loaded {row_count:,} rows")
print(f"✓ Columns: {len(df.columns)}")
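
As a conceptual illustration of the `union_by_name=True` option, the sketch below aligns rows from two months whose columns differ, filling missing values with `None`. It uses plain dictionaries rather than Spark; `union_by_name` here is a hypothetical helper, the column names are invented, and how the package implements this internally is an assumption. In Spark itself the comparable operation is `DataFrame.unionByName(other, allowMissingColumns=True)`.

```python
def union_by_name(batches):
    """Union lists of row dicts, aligning columns by name and filling gaps with None."""
    columns = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    aligned = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, aligned


# Invented example: the later month gains an extra "Hours" column (schema drift).
january = [{"ProductionMonth": "2025-01", "Volume": "10.5"}]
february = [{"ProductionMonth": "2025-02", "Volume": "11.0", "Hours": "720"}]

columns, rows = union_by_name([january, february])
print(columns)
print(rows[0]["Hours"], rows[1]["Hours"])
```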


## 5. Explore the Data


In [None]:
# Display schema
print("DataFrame Schema:")
print("=" * 80)
df.printSchema()


In [None]:
# Show sample data
print("\nSample Data (first 10 rows):")
print("=" * 80)
display(df.limit(10))


In [None]:
# Check provenance columns
print("Data Provenance:")
print("=" * 80)

provenance_df = df.select(
    "production_month",
    "file_updated_ts"
).distinct().orderBy("production_month")

display(provenance_df)


## 6. Data Quality Checks


In [None]:
# Count records by production month
print("Records by Production Month:")
print("=" * 80)

monthly_counts = df.groupBy("production_month") \
    .agg(F.count("*").alias("record_count")) \
    .orderBy("production_month")

display(monthly_counts)


## 7. Additional Analysis (Optional)

You can perform additional analysis, transformations, or save to Delta tables as needed.


In [None]:
# Example: Show column statistics
print("Column Statistics:")
print("=" * 80)

# Show distinct value counts for key columns
key_columns = ["OperatorBAID", "ProductionMonth", "ReportingFacilityType"]

for col in key_columns:
    if col in df.columns:
        distinct_count = df.select(col).distinct().count()
        print(f"{col:30} : {distinct_count:,} distinct values")

# Example: Filter and analyze specific data
print("\n" + "=" * 80)
print("Sample Analysis: Records by Facility Type")
print("=" * 80)

if "ReportingFacilityType" in df.columns:
    facility_summary = df.groupBy("ReportingFacilityType") \
        .agg(F.count("*").alias("record_count")) \
        .orderBy(F.desc("record_count"))
    
    display(facility_summary.limit(10))


## 8. Summary


In [None]:
# Cleanup: Unpersist cached DataFrames
df.unpersist()

print("✅ Notebook execution complete!")
print("=" * 80)
print("\n📊 Data loaded and displayed successfully")
# Reuse row_count from step 4: calling df.count() after unpersist() would recompute the DataFrame
print(f"   Total rows: {row_count:,}")
print(f"   Columns: {len(df.columns)}")
print("\n💡 Next Steps:")
print("   - Save to Delta table if needed")
print("   - Create visualizations")
print("   - Export to other formats")
print("   - Perform additional analysis")


## 9. Optional: Save to Delta Table

If you need to persist the data, copy the code below into a new cell and run it. Note that `mode("overwrite")` replaces any existing contents of the target table.

```python
# Define your target table
catalog_name = "main"
schema_name = "petrinex"
table_name = "volumetrics_raw"
full_table_name = f"{catalog_name}.{schema_name}.{table_name}"

# Create schema if needed
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")

# Add partitioning columns
df_to_save = df.withColumn(
    "year", F.substring(F.col("production_month"), 1, 4)
).withColumn(
    "month", F.substring(F.col("production_month"), 6, 2)
)

# Write to Delta table
df_to_save.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .option("overwriteSchema", "true") \
    .saveAsTable(full_table_name)

print(f"✓ Data saved to {full_table_name}")
```
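
The `substring` calls above only produce sensible partitions if `production_month` is formatted as `YYYY-MM`, which is an assumption worth checking against the loaded data. A small sketch of that check in plain Python follows; `split_production_month` is a hypothetical helper, not part of the package.

```python
import re

# Assumed format: four-digit year, hyphen, two-digit month (01-12).
_MONTH_RE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])")


def split_production_month(value: str):
    """Split 'YYYY-MM' into (year, month) strings, raising on any other shape."""
    match = _MONTH_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"Unexpected production_month format: {value!r}")
    return match.group(1), match.group(2)


print(split_production_month("2025-03"))

try:
    split_production_month("202503")
except ValueError as exc:
    print(exc)
```

Running a check like this over the distinct `production_month` values before writing helps avoid silently creating partitions such as `month=03` from differently formatted values.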


In [None]:
# This cell is intentionally empty - you can add your own code here
