# Utility Functions & Project Libraries

## 1. Overview
Before we begin ingesting data, we need to set up a framework to handle:
1.  **Job Control:** Tracking which data has been loaded (Full vs. Incremental load) using a MySQL/Delta table log.
2.  **Configuration:** Managing the "Run Date" (Batch Date).
3.  **Common Utilities:** Generating date data, casting columns, etc.
4.  **AWS S3 Management:** Moving files from "Landing" to "Archive" after processing.

In this notebook, we will create the necessary Python library files in a `lib` folder.

---

## 2. Setup Library Folder
First, let's create a directory to store our helper scripts.

In [None]:
import os

# Create a 'lib' directory if it doesn't exist
if not os.path.exists('lib'):
    os.makedirs('lib')
    print("Created 'lib' directory.")
else:
    print("'lib' directory already exists.")
    
# Create an empty __init__.py to make it a package
with open('lib/__init__.py', 'w') as f:
    pass

## 3. Configuration (Run Date)

We need a way to tell our pipeline which "Business Date" or "Batch Date" we are processing. We will store it in a small file, `run_config.txt`, which holds a JSON object with a single `rundate` key.

*   **Format:** `YYYYMMDD`
*   **Initial Full Load Date:** `20220101`
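
The format above can be checked before a batch starts. The following is a minimal sketch and not part of the project files: `parse_rundate` and `next_rundate` are hypothetical helper names, and advancing the date by one day assumes a daily batch, which this notebook does not state.

```python
import json
from datetime import datetime, timedelta


def parse_rundate(config_text):
    """Parse the JSON config text and return the run date as a date object.

    Raises ValueError if the value is not a valid YYYYMMDD date.
    """
    config = json.loads(config_text)
    return datetime.strptime(config["rundate"], "%Y%m%d").date()


def next_rundate(rundate):
    """Return the following day as a YYYYMMDD string (assumes daily batches)."""
    return (rundate + timedelta(days=1)).strftime("%Y%m%d")


initial = parse_rundate('{"rundate": "20220101"}')
print(initial.isoformat())
print(next_rundate(initial))
```

Parsing with `strptime` rejects values such as `2022-01-01` or `20221301` early, instead of letting a malformed date reach the load logic.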

In [None]:
%%writefile lib/run_config.txt
{"rundate": "20220101"}

## 4. General Utilities (`utils.py`)

This script will contain general helper functions:
1.  **`get_rundate`**: Reads the config file created above.
2.  **`get_string_cols`**: Builds a list of column expressions that cast every column of a DataFrame to String (useful for raw ingestion). Pass the list to `df.select(...)` to apply the casts.
3.  **`date_data`**: A specific function to generate Date Dimension data (since we don't have a source file for dates).

In [None]:
%%writefile lib/utils.py
from datetime import datetime, timedelta
from pyspark.sql.functions import col, lit, concat
import json

def get_rundate():
    """Reads the Run Date from the config file"""
    try:
        with open("lib/run_config.txt", "r") as f:
            config = json.load(f)
            return config.get("rundate")
    except Exception as e:
        print(f"Error reading config: {e}")
        return None

def get_string_cols(spark, df):
    """Casts all columns to String type"""
    return [col(c).cast("string").alias(c) for c in df.columns]

def date_data(spark, start_run_dt, num_years=1):
    """
    Generates a sequence of dates for the Date Dimension.
    """
    data = []
    start_date = datetime.strptime(str(start_run_dt), "%Y%m%d")
    
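    # Generate data for X years. Each year is counted as 365 days, so a
    # range that includes February 29 ends one day early per leap day.
    for i in range(0, 365 * num_years):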
        next_date = start_date + timedelta(days=i)
        data.append((
            next_date.strftime("%Y-%m-%d"), # date
            next_date.strftime("%Y%m%d"),   # date_key
            next_date.strftime("%Y"),       # year
            next_date.strftime("%m"),       # month
            next_date.strftime("%d")        # day
        ))

    schema = ["full_date", "date_key", "year", "month", "day"]
    return spark.createDataFrame(data, schema=schema)
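
`date_data` steps through `365 * num_years` days, so a range that includes February 29 stops short of a full calendar year. The sketch below shows one possible calendar-accurate alternative in plain Python, without Spark. `date_rows` is a hypothetical name, and the fallback to March 1 for a February 29 start date is an assumption made for this example.

```python
from datetime import datetime, timedelta


def date_rows(start_run_dt, num_years=1):
    """Return date-dimension tuples covering exactly num_years calendar years."""
    start = datetime.strptime(str(start_run_dt), "%Y%m%d").date()
    try:
        end = start.replace(year=start.year + num_years)
    except ValueError:
        # Start date is Feb 29 and the target year is not a leap year:
        # end the range at Mar 1 instead (assumption for this sketch).
        end = start.replace(year=start.year + num_years, month=3, day=1)

    rows = []
    current = start
    while current < end:  # the end date itself is excluded
        rows.append((
            current.strftime("%Y-%m-%d"),  # full_date
            current.strftime("%Y%m%d"),    # date_key
            current.strftime("%Y"),        # year
            current.strftime("%m"),        # month
            current.strftime("%d"),        # day
        ))
        current += timedelta(days=1)
    return rows


print(len(date_rows("20220101")))  # non-leap year
print(len(date_rows("20240101")))  # leap year
```

The tuples use the same field order as the `schema` list in `date_data`, so they could be passed to `spark.createDataFrame` in the same way.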

## 5. Job Control (`job_control.py`)

This is the most critical logic in the pipeline, because it decides whether a table receives a full or an incremental load. This script manages the **Job Control Table**.

**Table Schema:**
*   `schema_name`: Database name.
*   `table_name`: Table name.
*   `max_timestamp`: The maximum timestamp of data loaded.
*   `rundate`: The batch date.
*   `insert_dt`: The time at which the log row was written.

**Key Logic:**
*   **`get_max_timestamp`**: Checks the log table. 
    *   If a record exists, return the max timestamp (Incremental Load).
    *   If **NO** record exists, return `1900-01-01 00:00:00` (default low watermark, so every source row qualifies and the result is a Full Load).
*   **`insert_log`**: Updates the table after a successful load.
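
The lookup rule above can be sketched in plain Python, with a list of dictionaries standing in for the log table. `resolve_watermark` and `DEFAULT_WATERMARK` are hypothetical names used only for this illustration. The comparison relies on timestamps being stored as `YYYY-MM-DD HH:MM:SS` strings, which sort chronologically when compared as text.

```python
DEFAULT_WATERMARK = "1900-01-01 00:00:00"


def resolve_watermark(log_rows, schema_name, table_name):
    """Return the latest logged timestamp for a table, or the default."""
    matches = [
        row["max_timestamp"]
        for row in log_rows
        if row["schema_name"] == schema_name and row["table_name"] == table_name
    ]
    # No history means a full load, signalled by the default watermark
    return max(matches) if matches else DEFAULT_WATERMARK


log = [
    {"schema_name": "sales", "table_name": "orders", "max_timestamp": "2022-01-01 10:00:00"},
    {"schema_name": "sales", "table_name": "orders", "max_timestamp": "2022-01-02 09:30:00"},
]

print(resolve_watermark(log, "sales", "orders"))     # incremental load
print(resolve_watermark(log, "sales", "customers"))  # full load
```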

In [None]:
%%writefile lib/job_control.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, max as spark_max

# Returned when a table has no load history, which triggers a full load
DEFAULT_TIMESTAMP = "1900-01-01 00:00:00"

def get_max_timestamp(spark, schema_name, table_name):
    """
    Gets the max timestamp for a specific table to determine incremental logic.
    Returns '1900-01-01 00:00:00' if no history exists.
    """
    try:
        # Read the Job Control Table (we will create this table in later videos).
        # On the very first run it does not exist; the except branch handles that.
        df = spark.read.format("delta").load("spark-warehouse/job_control")

        # Filter for specific table
        df_filtered = df.filter((df.schema_name == schema_name) & (df.table_name == table_name))

        # The aggregate yields None when no rows match, so one query is enough
        max_ts = df_filtered.agg(spark_max("max_timestamp")).collect()[0][0]
        return str(max_ts) if max_ts is not None else DEFAULT_TIMESTAMP

    except Exception as e:
        # If the table doesn't exist yet (first run ever), return the default.
        # Other read errors land here too, so print the reason.
        print(f"Job control table not readable, using default: {e}")
        return DEFAULT_TIMESTAMP

def insert_log(spark, schema_name, table_name, max_timestamp, rundate):
    """Logs the execution status"""
    try:
        # Row data must hold plain Python values. current_timestamp() is a
        # Spark Column expression, so it is added with withColumn below.
        data = [(schema_name, table_name, str(max_timestamp), str(rundate))]
        columns = ["schema_name", "table_name", "max_timestamp", "rundate"]

        # Column types are inferred from the data (all strings here)
        df = spark.createDataFrame(data, schema=columns) \
            .withColumn("insert_dt", current_timestamp())

        # Write to Job Control table
        df.write.format("delta").mode("append").save("spark-warehouse/job_control")
        return True
    except Exception as e:
        print(f"Error logging job: {e}")
        return False
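
To show how the two functions fit into a load, here is a plain-Python sketch of one incremental cycle: keep only source rows newer than the watermark, then compute the value to log afterwards. `select_incremental` is a hypothetical helper and `updated_at` is an assumed source column name. Neither appears in the project files.

```python
def select_incremental(source_rows, watermark, ts_field="updated_at"):
    """Return rows newer than the watermark and the new watermark to log."""
    new_rows = [row for row in source_rows if row[ts_field] > watermark]
    # If nothing is new, the watermark stays where it was
    new_watermark = max((row[ts_field] for row in new_rows), default=watermark)
    return new_rows, new_watermark


source = [
    {"id": 1, "updated_at": "2022-01-01 08:00:00"},
    {"id": 2, "updated_at": "2022-01-02 12:00:00"},
    {"id": 3, "updated_at": "2022-01-03 07:15:00"},
]

rows, watermark = select_incremental(source, "2022-01-01 23:59:59")
print([row["id"] for row in rows])
print(watermark)
```

With the default `1900-01-01 00:00:00` as the watermark, every row passes the filter, which is what makes the first run a full load.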

## 6. AWS S3 Utilities (`aws_s3.py`)

This script handles file movements. After we process a CSV file from the "Landing" zone, we don't want to process it again. We move it to an "Archive" folder.

*   **Prerequisite:** Requires `boto3` library.

In [None]:
%%writefile lib/aws_s3.py
import boto3

def archive_landing_object(bucket_name, file_key):
    """
    Moves a file from Landing to Archive within S3.
    Structure: s3://bucket/landing/file.csv -> s3://bucket/archive/file.csv
    """
    s3 = boto3.client('s3')
    
    try:
        # Define Copy Source
        copy_source = {'Bucket': bucket_name, 'Key': file_key}
        
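        # Define New Key (replace only the first 'landing' with 'archive')
        new_key = file_key.replace("landing", "archive", 1)
        if new_key == file_key:
            # A key without 'landing' would make source and destination identical
            raise ValueError("Key does not contain 'landing'")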
        
        # Copy Object
        s3.copy_object(Bucket=bucket_name, CopySource=copy_source, Key=new_key)
        
        # Delete Original Object
        s3.delete_object(Bucket=bucket_name, Key=file_key)
        
        print(f"Archived: {file_key}")
        return True
        
    except Exception as e:
        print(f"Error archiving file {file_key}: {e}")
        return False
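
The key mapping can be tested on its own, without an S3 connection. The sketch below is a stricter variant that swaps only a leading prefix. `to_archive_key` is a hypothetical helper, and it assumes the `landing/` folder sits at the root of the bucket, as in the docstring above.

```python
def to_archive_key(file_key, landing_prefix="landing/", archive_prefix="archive/"):
    """Map a Landing key to its Archive key by swapping the leading prefix."""
    if not file_key.startswith(landing_prefix):
        raise ValueError(f"Key is not under '{landing_prefix}': {file_key}")
    # Only the prefix changes; the rest of the key is kept as-is
    return archive_prefix + file_key[len(landing_prefix):]


print(to_archive_key("landing/orders.csv"))
print(to_archive_key("landing/landing_report.csv"))
```

Swapping only the prefix keeps a file name such as `landing_report.csv` intact, and a key outside `landing/` raises an error instead of being mapped onto itself.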

## 7. Spark Session Wrapper (`spark_session.py`)

A simple utility that creates the Spark Session with the same configuration in every notebook: Delta Lake support and the `hadoop-aws` package for S3 connectivity.

In [None]:
%%writefile lib/spark_session.py
from pyspark.sql import SparkSession

def get_spark_session(app_name="PetFood_DW"):
    spark = SparkSession.builder \
        .appName(app_name) \
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0,org.apache.hadoop:hadoop-aws:3.3.2") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()
    return spark

## 8. Verification

Let's verify that our files were created successfully in the `lib` folder.

In [None]:
import os
print("Files in 'lib' directory:")
print(os.listdir('lib'))
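
Listing the folder relies on reading the output by eye. As a possible complement, the sketch below reports which expected files are missing. `missing_lib_files` is a hypothetical helper, and `EXPECTED_FILES` simply repeats the file names written by the cells in this notebook.

```python
import os

# File names written by the cells above
EXPECTED_FILES = [
    "__init__.py",
    "run_config.txt",
    "utils.py",
    "job_control.py",
    "aws_s3.py",
    "spark_session.py",
]


def missing_lib_files(lib_dir="lib", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from lib_dir, sorted."""
    present = set(os.listdir(lib_dir)) if os.path.isdir(lib_dir) else set()
    return sorted(set(expected) - present)
```

Calling `missing_lib_files()` after running all cells should return an empty list. A non-empty result names the cells that need to be re-run.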