### Notes on Compression Techniques

- **Serialization**:
    - Turning data into binary
    - Converting structured data, such as Java objects or a Python dictionary, into a compact binary format

- **Deserialization**:
    - Turning binary back into data
    - Converting binary back into a structured data object
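
A minimal sketch of the round trip described above, using only the Python standard library. The `record` contents and the fixed `RECORD_LAYOUT` are made up for illustration; schema-driven formats such as Avro keep the layout in a schema rather than in hand-written code.

```python
import json
import struct

# Hypothetical record, used only for illustration
record = {"user_id": 42, "score": 98.5, "active": True}

# Text serialization: human-readable, but field names are repeated in every record
json_bytes = json.dumps(record).encode("utf-8")

# Binary serialization with a fixed layout:
# "<" = little-endian without padding, "i" = int32, "d" = float64, "?" = bool
RECORD_LAYOUT = struct.Struct("<id?")
binary_bytes = RECORD_LAYOUT.pack(record["user_id"], record["score"], record["active"])

# Deserialization: turn the bytes back into a structured object
user_id, score, active = RECORD_LAYOUT.unpack(binary_bytes)
restored = {"user_id": user_id, "score": score, "active": active}

print(len(json_bytes), len(binary_bytes))
print(restored == record)
```

The binary form is smaller because the field names and types live in the layout instead of being repeated in the data.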

- **Data Formats**:
    - `CSV`:
        - Plain-text file with comma-separated values
        - Data types are not stored
        - Not storage efficient
        - Good for small data sets and quick data exports
    - `JSON`:
        - JavaScript Object Notation, a flexible format designed for modern web APIs, NoSQL, and semi-structured data
        - No schema is stored; only basic types (string, number, boolean, null) are distinguished
        - File sizes tend to be larger
        - Querying in Spark requires extra parsing, which can slow performance
    - `Avro`:
        - Row-oriented data framework designed for efficiency, flexibility and cross-language compatibility
        - Schema driven
        - Compact and efficient
        - Language neutral
        - Splittable and compressible
        - Schema evolution: handles data serialized with different versions of the schema without breaking compatibility (e.g. in Kafka pipelines)
    - `ORC`:
        - Optimized Row Columnar
        - Built to speed up analytics workflows by storing data column by column
        - Skips irrelevant data without reading
        - Commonly used in Hadoop systems
        - File structure:
            - Header
            - Body: data stored in stripes, divided into index data, row data and stripe footer
            - Footer: metadata about the file
        - Trade-offs to consider: write performance and schema evolution
    - `Parquet`:
        - Columnar format like `ORC`, used for big data storage in Apache Spark, cloud data lakes, and ML pipelines
        - Broad compatibility
        - Used with Spark, Snowflake, AWS and Google BigQuery
        - Optimized for reading, not writing
        - Supports Run-Length Encoding (RLE) for repeated values
        - Flexible schema evolution
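
To illustrate why storing data column by column pairs well with Run-Length Encoding, here is a simplified sketch in plain Python. The `rows` data and both helper functions are hypothetical; Parquet and ORC use more elaborate encodings (Parquet, for instance, combines RLE with bit-packing), so this shows the idea rather than either format's implementation.

```python
from itertools import groupby

# Hypothetical rows, used only for illustration
rows = [
    ("DE", "active"),
    ("DE", "active"),
    ("DE", "inactive"),
    ("US", "active"),
    ("US", "active"),
]

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

# Columnar layout: keep each column's values together instead of row by row
country_column, status_column = (list(column) for column in zip(*rows))

encoded_country = run_length_encode(country_column)
print(encoded_country)  # [('DE', 3), ('US', 2)]
```

In a row layout the repeated country values are interleaved with other fields, so the runs that RLE depends on never form.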

- **Compression Types**:
    - `Zstandard` (zstd): developed by Facebook; high compression ratios at fast speeds, a balance of efficiency and performance
    - `Snappy`: developed by Google; prioritizes speed over compression ratio, good for real-time processing
    - `LZ4`: developed by Yann Collet; it's all about speed, with ultra-fast compression and decompression
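
The speed versus ratio trade-off can be sketched without any third-party packages. Bindings for Zstandard, Snappy, and LZ4 are not assumed to be installed, so this example swaps in `zlib` from the standard library and varies its compression level instead of the codec. The `payload` and the `benchmark` helper are hypothetical, and absolute numbers will differ per machine.

```python
import time
import zlib

# Hypothetical, highly repetitive payload used only for illustration
payload = b"user_id,country,status\n" + b"42,DE,active\n" * 50_000

def benchmark(level: int) -> dict:
    """Compress the payload at the given zlib level and time it."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must return the original bytes
    assert zlib.decompress(compressed) == payload
    return {"level": level, "size_bytes": len(compressed), "seconds": elapsed}

for level in (1, 6, 9):
    stats = benchmark(level)
    ratio = len(payload) / stats["size_bytes"]
    print(f"level={stats['level']} ratio={ratio:.1f}x time={stats['seconds']:.4f}s")
```

Lower levels favor speed and higher levels favor size, which is the same axis along which the three codecs above are usually compared.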


In [None]:
import time
import os
from dataclasses import dataclass
from functools import wraps
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame as PysparkDataFrame

In [9]:
# Initialize Spark session
spark = SparkSession.builder.appName("CompressionBenchmark").getOrCreate()

# Set log level to ERROR to reduce verbosity
spark.sparkContext.setLogLevel("ERROR")

In [12]:
@dataclass
class TestCompression:
    """
    Class to test compression and decompression benchmarks in Apache Spark.
    """
    file_name: str
    df: PysparkDataFrame


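    # Plain function (no @staticmethod) so it can be applied as a decorator
    # inside the class body; staticmethod objects are only callable on Python 3.10+
    def measure_execution_time(func):
        """
        Decorator that returns a (result, elapsed_seconds) tuple
        """
        @wraps(func)
        def wrapper(*args, **kwargs):
            # perf_counter is a monotonic, high-resolution clock suited to timing
            start_time = time.perf_counter()
            result = func(*args, **kwargs)
            execution_time = time.perf_counter() - start_time

            return result, execution_time

        return wrapper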


    @measure_execution_time
    def test_compression(self, format: str, compression: str) -> float:
        """
        Write the DataFrame to disk and return the output size in MB.
        The decorator wraps this into a (size_mb, elapsed_seconds) tuple.
        """
        output_path = f"output/{format}_{compression}"

        # mode("overwrite") replaces any existing output, so no manual cleanup is needed
        self.df.write.mode("overwrite").format(format).option("compression", compression).save(output_path)

        # Ensure directory exists before calculating size
        file_size_mb = 0.0
        if os.path.exists(output_path):
            file_size = sum(
                os.path.getsize(os.path.join(output_path, f))
                for f in os.listdir(output_path) if os.path.isfile(os.path.join(output_path, f))
            )
            # Convert to MB
            file_size_mb = round(file_size / (1024 * 1024), 2)

        return file_size_mb


    @measure_execution_time
    def test_decompression(self, format: str, compression: str) -> None:
        """
        Test decompression by reading a DataFrame from disk.
        """
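        input_path = f"output/{format}_{compression}"
        # count() forces a read, but Spark may answer it largely from Parquet
        # metadata, so this can understate the cost of decompressing all columns
        _ = spark.read.format(format).load(input_path).count()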


    def test_compression_benchmarks(self) -> dict:
        """
        Run compression and decompression benchmarks for Parquet using Zstd, Snappy, and LZ4.
        """
        results = {}
        codecs = ["zstd", "snappy", "lz4"]

        for codec in codecs:
            # Test compression
            size, compression_time = self.test_compression("parquet", codec)
            
            # Test decompression
            _, decompression_time = self.test_decompression("parquet", codec)
            
            results[f"parquet_{codec}"] = {
                "size_mb": size,
                "compression_time_seconds": compression_time,
                "decompression_time_seconds": decompression_time
            }

        return results


    def get_original_file_size(self) -> float:
        """
        Get the size of the original CSV file.
        """
        if os.path.exists(self.file_name):
            file_size = os.path.getsize(self.file_name)
            return round(file_size / (1024 * 1024), 2)  # Convert to MB
        return 0.0
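
When Spark writes to a local directory it typically leaves bookkeeping files (hidden `.crc` checksums and a `_SUCCESS` marker) next to the data files, and a plain directory sum counts them too. The helper below is a hypothetical alternative, not part of the class above, that counts only the data files; the name `data_file_size_mb` and the prefix rule are assumptions made for this sketch.

```python
from pathlib import Path

def data_file_size_mb(output_dir: str) -> float:
    """Sum the sizes of data files in output_dir, skipping Spark bookkeeping files."""
    total_bytes = sum(
        path.stat().st_size
        for path in Path(output_dir).iterdir()
        # Skip hidden checksum files (".*.crc") and markers such as "_SUCCESS"
        if path.is_file() and not path.name.startswith((".", "_"))
    )
    return round(total_bytes / (1024 * 1024), 2)
```

For example, `data_file_size_mb("output/parquet_zstd")` would report the size of the `part-*` files only. The function raises `FileNotFoundError` if the directory does not exist.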

In [13]:
# Path to CSV file
FILE_PATH = "../data/test_data.csv"

# Read CSV into DataFrame; without inferSchema every column is read as a string
df = spark.read.option("header", "true").csv(FILE_PATH)

# Ensure DataFrame is not empty
if df.count() == 0:
    print("❌ ERROR: DataFrame is empty. Please check the input CSV file.")
    spark.stop()
    exit(1)

# Get original file size before compression
test_compression = TestCompression(file_name=FILE_PATH, df=df)
original_size_mb = test_compression.get_original_file_size()

# Run compression and decompression benchmarks
results = test_compression.test_compression_benchmarks()

# Print original file size and benchmark results
print(f"\nOriginal CSV file size: {original_size_mb} MB\n")
print("🔥 Compression and Decompression Benchmark Results 🔥 \n")
for algorithm, metrics in results.items():
    print(f"{algorithm}: \n ")
    print(f"  Compression Time: {metrics['compression_time_seconds']:.2f}s")
    print(f"  Decompression Time: {metrics['decompression_time_seconds']:.2f}s")
    print(f"  Size: {metrics['size_mb']} MB\n")



Original CSV file size: 70.91 MB

🔥 Compression and Decompression Benchmark Results 🔥 

parquet_zstd: 
 
  Compression Time: 0.52s
  Decompression Time: 0.09s
  Size: 10.94 MB

parquet_snappy: 
 
  Compression Time: 0.51s
  Decompression Time: 0.07s
  Size: 22.77 MB

parquet_lz4: 
 
  Compression Time: 0.49s
  Decompression Time: 0.08s
  Size: 23.14 MB
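
The sizes above are easier to compare as ratios. This short sketch recomputes them from the printed numbers; the `summarize` helper is hypothetical, and the ratios reflect the switch from CSV to Parquet as well as the codec.

```python
# Sizes copied from the benchmark output above (MB)
csv_size_mb = 70.91
compressed_sizes_mb = {
    "parquet_zstd": 10.94,
    "parquet_snappy": 22.77,
    "parquet_lz4": 23.14,
}

def summarize(original_mb: float, compressed_mb: float) -> tuple:
    """Return (compression ratio, percent of space saved)."""
    ratio = original_mb / compressed_mb
    saved_pct = (1 - compressed_mb / original_mb) * 100
    return round(ratio, 2), round(saved_pct, 1)

for name, size_mb in compressed_sizes_mb.items():
    ratio, saved_pct = summarize(csv_size_mb, size_mb)
    print(f"{name}: {ratio}x smaller, {saved_pct}% saved")
```

Zstd comes out at roughly 6.5x smaller than the CSV, against roughly 3.1x for Snappy and LZ4. The timings differ by only a few hundredths of a second, so size is the clearer signal in this run.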



In [14]:
# Stop Spark session
spark.stop()