#Silver To Gold

The Silver to Gold code refines the data into a star schema model, enhancing usability for analytics and reporting by creating fact and dimension tables. Starting from cleansed data in the Silver layer, it proceeds as follows:

- **Data Reading:** Reads the Silver data for a specified partition year and month into a Spark DataFrame.

- **Fact Table Creation:** Adds a unique TripID to each record and extracts date-related fields (e.g., Date, PartitionYear, PartitionMonth) to enable efficient querying and partitioning. It also captures the source file name for audit purposes, allowing data lineage tracking. This fact table, `FactTrips`, contains detailed trip information for analysis.

- **Dimension Tables:**

  - **Date Dimension:** Extracts date-related attributes (day of the week, day of the month, month, and year) to support time-based aggregations and trend analysis.

  - **Location Dimension:** Incorporates external reference data to associate location IDs with descriptive zones, enhancing the contextual value of location data.

  - **Rate Code and Payment Type Dimensions:** Maps categorical codes to descriptive labels for rate types and payment methods, ensuring clarity and consistency in the trip data.

- **Data Loading:** Writes the `FactTrips`, `DimDate`, `DimPickupLocation`, `DimDropoffLocation`, `DimRateCode`, and `DimPaymentType` tables to the Gold layer in Delta format. `FactTrips` and `DimDate` are partitioned by year and month so that queries can skip partitions they do not need.

This structured transformation enables more efficient and insightful analysis by simplifying joins and making dimensions reusable across queries.
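
To make the star-schema idea concrete, the following plain-Python sketch (no Spark required) shows a fact row carrying only a key, with the descriptive label resolved from a dimension at query time. The rows, IDs, and labels are made up for illustration and are not taken from the pipeline's data.

```python
# Illustrative rows only; the IDs and labels below are hypothetical.
fact_trips_sample = [
    {"TripID": "t1", "PaymentTypeID": 1, "TotalAmount": 12.5},
    {"TripID": "t2", "PaymentTypeID": 2, "TotalAmount": 30.0},
    {"TripID": "t3", "PaymentTypeID": 1, "TotalAmount": 52.0},
]
dim_payment_type_sample = {1: "Type A", 2: "Type B"}


def total_by_payment_type(facts, payment_dim):
    """Sum TotalAmount per payment label by resolving each fact key in the dimension."""
    totals = {}
    for row in facts:
        # Keys missing from the dimension fall back to a placeholder label.
        label = payment_dim.get(row["PaymentTypeID"], "Unknown")
        totals[label] = totals.get(label, 0.0) + row["TotalAmount"]
    return totals


totals = total_by_payment_type(fact_trips_sample, dim_payment_type_sample)
```

The same dimension can serve any query that groups by payment type, which is what makes dimensions reusable.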

In [0]:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
from calendar import monthrange

##1. Data Reading

Reads the `Year` and `Month` notebook widgets. These values select the Silver partition to process, so only one month of data is read per run.

In [0]:
dbutils.widgets.text("Year", "")
dbutils.widgets.text("Month", "")

partition_year = dbutils.widgets.get("Year")
partition_month = dbutils.widgets.get("Month")

Reads the Silver layer data for the specified partition month and year.

In [0]:
gold_df = spark.read.format("delta").load(f"/mnt/silver/PartitionYear={partition_year}/PartitionMonth={partition_month}")
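
Both widgets default to an empty string, so a run without parameters would build a path such as `PartitionYear=/PartitionMonth=`. One way to guard against that is to validate the values before building the path. The helper names `parse_partition` and `silver_partition_path` below are hypothetical, and the sketch assumes partition directories use unpadded integers (for example `PartitionMonth=3`), which the source does not confirm.

```python
from typing import Tuple


def parse_partition(year_text: str, month_text: str) -> Tuple[int, int]:
    """Convert widget strings to (year, month), rejecting empty or out-of-range values."""
    if not year_text.strip() or not month_text.strip():
        raise ValueError("Both Year and Month must be provided")
    # int() raises ValueError for non-numeric input such as "March".
    year, month = int(year_text), int(month_text)
    if year < 1:
        raise ValueError(f"Year must be positive, got {year}")
    if not 1 <= month <= 12:
        raise ValueError(f"Month must be between 1 and 12, got {month}")
    return year, month


def silver_partition_path(year: int, month: int, root: str = "/mnt/silver") -> str:
    """Build the partition path in the same layout the read step uses."""
    return f"{root}/PartitionYear={year}/PartitionMonth={month}"
```

In the notebook this could be called as `parse_partition(partition_year, partition_month)` before the read, so that a bad parameter fails fast with a clear message.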

##2. Fact Table

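Adds a unique `TripID` so that each trip record is identifiable for analysis, and creates a `Date` column for joining with the Date dimension table. The partition columns are derived from the pickup timestamp, and the source file name is recorded for lineage.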

In [0]:
gold_df = (gold_df
    # Assumption: a random UUID per row serves as the unique TripID described above.
    .withColumn("TripID", expr("uuid()"))
    .withColumn("Date", to_date(col("tpep_pickup_datetime")))
    .withColumn("PartitionYear", year(col("tpep_pickup_datetime")).cast(IntegerType()))
    .withColumn("PartitionMonth", month(col("tpep_pickup_datetime")).cast(IntegerType()))
    .withColumn("SourceFile", input_file_name())
)

fact_trips = gold_df.select(
    col("TripID"),
    col("Date"),
    col("PULocationID").alias("PickupLocationID"),
    col("DOLocationID").alias("DropoffLocationID"),
    col("payment_type").alias("PaymentTypeID"),
    col("RatecodeId").alias("RateCodeID"),
    col("store_and_fwd_flag").alias("StoreAndFwdFlag"),
    col("passenger_count").alias("PassengerCount"),
    col("trip_distance").alias("TripDistance"),
    col("fare_amount").alias("FareAmount"),
    col("extra").alias("Extra"),
    col("mta_tax").alias("MtaTax"),
    col("tip_amount").alias("TipAmount"),
    col("tolls_amount").alias("TollsAmount"),
    col("improvement_surcharge").alias("ImprovementSurcharge"),
    col("congestion_surcharge").alias("CongestionSurcharge"),
    col("Airport_fee").alias("AirportFee"),
    col("total_amount").alias("TotalAmount"),
    col("PartitionYear"),
    col("PartitionMonth"),
    col("PipelineRunID"),
    col("PipelineRunDate"),
    col("SourceFile")
) # Columns are selected depending on fact table design requirements.
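
If `TripID` values need to stay the same when a month is reprocessed, one option is a deterministic key derived from the record's content rather than a generated one. The plain-Python sketch below illustrates the idea with a SHA-256 hash; the function name and the choice of key fields are assumptions for illustration, not the pipeline's actual design. Two genuinely identical records would receive the same ID under this scheme.

```python
import hashlib


def deterministic_trip_id(record: dict, key_fields: list) -> str:
    """Hash the chosen fields into a stable 64-character hex identifier."""
    # repr() keeps None distinct from the string "None"; the unit separator
    # prevents ("ab", "c") and ("a", "bc") from producing the same payload.
    payload = "\x1f".join(repr(record.get(field)) for field in key_fields)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical key fields and values, using column names from the Silver data.
key_fields = ["SourceFile", "tpep_pickup_datetime", "PULocationID", "DOLocationID", "total_amount"]
trip = {
    "SourceFile": "example.parquet",
    "tpep_pickup_datetime": "2024-01-06 08:15:00",
    "PULocationID": 132,
    "DOLocationID": 1,
    "total_amount": 52.0,
}
trip_id = deterministic_trip_id(trip, key_fields)
```

In Spark, a comparable expression could be built from the `sha2` and `concat_ws` functions, keeping in mind that `concat_ws` skips null values.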

##3. Dimensions

####Datetime

Creates the Date dimension to provide additional temporal insights (e.g., weekday vs. weekend trends).

In [0]:
dim_date = fact_trips.select(col("Date"))

dim_date = (dim_date
    .withColumn("DayOfWeek", dayofweek(col("Date"))) # 1 = Sunday ... 7 = Saturday; enables analysis by day of week (e.g., busy days).
    .withColumn("Day", dayofmonth(col("Date"))) # Allows aggregation by specific day.
    .withColumn("Month", month(col("Date"))) # Facilitates monthly trend analysis.
    .withColumn("Year", year(col("Date")))  # Supports year-over-year comparisons.
)
dim_date = dim_date.dropDuplicates()
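
Because the dimension is derived from the fact table, it only contains dates on which at least one trip occurred. If a complete calendar is preferred, the rows for a month can be generated directly. The sketch below does this in plain Python with `monthrange` and mirrors Spark's `dayofweek` numbering (1 = Sunday, 7 = Saturday). The function names and the `IsWeekend` column are hypothetical additions for illustration.

```python
from calendar import monthrange
from datetime import date


def spark_dayofweek(d: date) -> int:
    """Mirror Spark's dayofweek numbering: 1 = Sunday, 2 = Monday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def month_calendar(year: int, month: int) -> list:
    """Return one row per calendar day of the given month."""
    days_in_month = monthrange(year, month)[1]
    rows = []
    for day in range(1, days_in_month + 1):
        d = date(year, month, day)
        dow = spark_dayofweek(d)
        rows.append({
            "Date": d,
            "DayOfWeek": dow,
            "Day": day,
            "Month": month,
            "Year": year,
            "IsWeekend": dow in (1, 7),  # hypothetical extra column
        })
    return rows


calendar_rows = month_calendar(2024, 2)
```

A list like this could be turned into a DataFrame with `spark.createDataFrame` if the generated calendar were used in place of the derived one.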

####Location

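Loads reference data that maps location IDs to readable names and zones, enriching the trips data with location context. The same lookup file is loaded twice so that pickup and drop-off locations each get their own dimension table.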

In [0]:
dim_pickup_location = spark.read.option("header", "true").csv("/mnt/lookup-data/location.csv")
dim_pickup_location = dim_pickup_location.withColumnRenamed("service_zone", "ServiceZone")

dim_dropoff_location = spark.read.option("header", "true").csv("/mnt/lookup-data/location.csv")
dim_dropoff_location = dim_dropoff_location.withColumnRenamed("service_zone", "ServiceZone")
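
The sketch below illustrates, in plain Python, how one location lookup can serve both the pickup and the drop-off role. The source only confirms the `service_zone` column, so the `LocationID`, `Borough`, and `Zone` columns, the sample rows, and the helper names are assumptions made for the example.

```python
import csv
import io

# Illustrative sample; column names other than service_zone are assumed.
SAMPLE_CSV = """LocationID,Borough,Zone,service_zone
1,EWR,Newark Airport,EWR
132,Queens,JFK Airport,Airports
"""


def load_location_lookup(text: str) -> dict:
    """Parse the CSV text and key each row by its integer location ID."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Same rename as in the notebook: service_zone -> ServiceZone.
        row["ServiceZone"] = row.pop("service_zone")
        lookup[int(row["LocationID"])] = row
    return lookup


def describe_trip(trip: dict, locations: dict) -> dict:
    """Resolve both location roles of a trip against the same lookup."""
    pickup = locations.get(trip["PickupLocationID"], {})
    dropoff = locations.get(trip["DropoffLocationID"], {})
    return {
        "PickupZone": pickup.get("Zone", "Unknown"),
        "DropoffZone": dropoff.get("Zone", "Unknown"),
    }


locations = load_location_lookup(SAMPLE_CSV)
described = describe_trip({"PickupLocationID": 132, "DropoffLocationID": 999}, locations)
```

Note that a CSV read without an explicit schema yields string columns, which is why the sketch converts the ID to an integer before using it as a key.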

####Rate Code

The Rate Code dimension maps rate code IDs to descriptive names.

In [0]:
dim_rate_code = spark.read.json("/mnt/lookup-data/rate_code.json")

dim_rate_code = (dim_rate_code
  .withColumn("RateCodeID", col("id").cast(IntegerType()))
  .withColumn("RateCode", col("type").cast(StringType()))
  .select(
    "RateCodeID",
    "RateCode"
  )
)

####Payment Type

The Payment Type dimension maps payment type IDs to descriptive names.

In [0]:
dim_payment_type = spark.read.json("/mnt/lookup-data/payment_type.json")

dim_payment_type = (dim_payment_type
  .withColumn("PaymentTypeID", col("id").cast(IntegerType()))
  .withColumn("PaymentType", col("type").cast(StringType()))
  .select(
    "PaymentTypeID",
    "PaymentType"
  )
)
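
The Rate Code and Payment Type dimensions follow the same shape: read records with `id` and `type` fields, cast them, and rename them. The plain-Python sketch below expresses that shared pattern once. It assumes the lookup files hold one JSON object per line (the default layout that `spark.read.json` expects); the helper name and sample labels are hypothetical.

```python
import json


def load_code_dimension(json_lines: str, id_name: str, label_name: str) -> list:
    """Turn JSON-lines records with 'id' and 'type' into typed, renamed rows."""
    rows = []
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        rows.append({
            id_name: int(record["id"]),        # same cast as IntegerType()
            label_name: str(record["type"]),   # same cast as StringType()
        })
    return rows


# Made-up sample content; the real labels live in the lookup files.
sample_lines = '{"id": "1", "type": "Label A"}\n{"id": "2", "type": "Label B"}\n'
rate_code_rows = load_code_dimension(sample_lines, "RateCodeID", "RateCode")
payment_type_rows = load_code_dimension(sample_lines, "PaymentTypeID", "PaymentType")
```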

##4. Data Loading

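Saves the `FactTrips`, `DimDate`, `DimPickupLocation`, `DimDropoffLocation`, `DimRateCode`, and `DimPaymentType` tables in Gold for analytics, using Delta format for efficient querying. `FactTrips` and `DimDate` are appended one partition at a time, while the lookup-based dimensions are overwritten on every run.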

In [0]:
fact_trips.write.mode("append").format("delta").partitionBy("PartitionYear", "PartitionMonth").save("/mnt/gold/fact_tables/FactTrips")
dim_date.write.mode("append").format("delta").partitionBy("Year", "Month").save("/mnt/gold/dimensions/DimDate")
dim_pickup_location.write.mode("overwrite").format("delta").save("/mnt/gold/dimensions/DimPickupLocation")
dim_dropoff_location.write.mode("overwrite").format("delta").save("/mnt/gold/dimensions/DimDropoffLocation")
dim_rate_code.write.mode("overwrite").format("delta").save("/mnt/gold/dimensions/DimRateCode")
dim_payment_type.write.mode("overwrite").format("delta").save("/mnt/gold/dimensions/DimPaymentType")
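
One consequence of append mode is that processing the same month twice adds the same rows again. The plain-Python sketch below contrasts appending with a key-based upsert to show the difference; it is an illustration of the behavior, not the pipeline's write logic, and the function names are hypothetical. In Delta Lake, the comparable operation is a `MERGE` on the key column.

```python
def append_rows(table: list, new_rows: list) -> list:
    """Append mode: keep everything, including rows that already exist."""
    return table + new_rows


def upsert_rows(table: list, new_rows: list, key: str) -> list:
    """Key-based upsert: a new row replaces an existing row with the same key."""
    merged = {row[key]: row for row in table}
    for row in new_rows:
        merged[row[key]] = row
    return list(merged.values())


batch = [
    {"Date": "2024-01-01", "Year": 2024, "Month": 1},
    {"Date": "2024-01-02", "Year": 2024, "Month": 1},
]

# Simulate running the same month twice.
appended = append_rows(append_rows([], batch), batch)
upserted = upsert_rows(upsert_rows([], batch, "Date"), batch, "Date")
```

After two runs, `appended` holds four rows while `upserted` still holds two, which is why reruns of an append-mode step need either a prior cleanup or a merge.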