Skip to content

jpatel207/spark-transform-sql-emulation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 

Repository files navigation

Athena SQL: Mimicking Spark Denesting Logic for Reservation Events

This SQL script recreates the transformation logic used in a Spark job to denest and reshape deeply nested hotel reservation event data from S3. It is designed to be run in Amazon Athena to enable inspection and validation of raw event-layer data prior to its ingestion into Redshift.

Purpose

This SQL serves as a utility for:

  • Reproducing Spark transformations in SQL for direct querying
  • Validating data in S3 against Redshift by showing what the output should look like after Spark
  • Enabling quick investigation of mismatches, missing records, or transformation logic issues
  • Providing a hands-on reference for how complex Spark data pipelines can be expressed in SQL

Features

  • Complex flattening of arrays of structs (such as sleeprooms, roomrates, occupancyprices)
  • Conditional currency handling and transformation logic (including fallback behavior for custom codes like 'PNTS')
  • Derivation of business-critical fields such as cost metrics, customer data, and booking keys
  • Deduplication of event data using ROW_NUMBER() window functions
  • Approximate reconstruction of logic from Spark source code for use in Athena

Technical Context

  • Execution Environment: Amazon Athena (Presto/Trino SQL dialect)
  • Source Data: Raw reservation event data stored in S3
  • Transformation Logic Reference: Recreated from Spark ETL behavior using Athena SQL functions

Spark Logic Mirrors

This SQL was designed to replicate specific transformation logic originally implemented in a Spark-based ETL pipeline. The goal was to provide a SQL-accessible view of raw event-layer data that closely matches the structure and derived values produced by the Spark job.

While the original Spark logic used custom Java UDFs, this SQL version approximates their behavior using standard Athena (PrestoSQL) functions such as FILTER, TRANSFORM, FLATTEN, and ELEMENT_AT.

Key translated logic includes:

  • deriveOccupancyCurrencyFromNestedRates
    Reconstructs the pricing currency from nested arrays of sleeprooms → roomrates → occupancyprices, used when reservations are made using non-currency methods (e.g., points).

  • calculatePointRedemptionEstimate
    Estimates the monetary value of loyalty point-based reservations (coded as 'PNTS') using a flat nightly rate and ISO 8601 duration parsing. This logic is simplified in SQL but aligns with the core behavior of the Spark implementation.

  • Event Deduplication
    Simulates getOrderedDataset logic by applying a ROW_NUMBER() partitioned by event ID and ordered by timestamp fields in joined enrichment tables, retaining only the most recent record.

These approximations support use cases such as:

  • Validating Spark-transformed Redshift records against raw S3-layer data
  • Debugging record-level issues in event ingestion or transformation
  • Exploring nested data structures without running Spark jobs

Refer to the inline SQL comments for detailed mappings of each logic block.

Example Use Case

This script can be used by developers, analysts, or engineers to:

  • Simulate the output of Spark transformation logic directly within Athena
  • Compare raw S3-layer data against records found in Redshift
  • Investigate whether specific records were transformed or ingested correctly
  • Avoid rerunning full Spark pipelines when debugging record-level issues

Disclaimer

  • This script is not intended for production workloads. It is designed for ad hoc debugging, learning, and data validation.
  • Proprietary and identifying information has been removed or anonymized to ensure compliance with internal and external sharing policies.
  • The logic is based on real transformation patterns but is not guaranteed to replicate all Spark edge cases.

Author: jpatel207 License: MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published