# Spark Handbook
## Apache Spark: A Comprehensive Guide for Data Engineers
This handbook provides a comprehensive overview of [Apache Spark](glossary.md#apache-spark), a powerful [distributed data processing framework](glossary.md#distributed-data-processing-framework) designed for handling [big data](glossary.md#big-data) workloads with speed, ease of use, and flexibility.

## Table of Contents
- [What is Apache Spark?](#what-is-apache-spark)
- [Main Abstractions of Apache Spark](#main-abstractions-of-apache-spark)
- [Spark Architecture](#spark-architecture)
- [Spark Core Components and Libraries](#spark-core-components-and-libraries)
- [Why Do Data Engineers Need Spark?](#why-do-data-engineers-need-spark)
- [Typical Use Cases](#typical-use-cases)
- [RDDs and DataFrames in Apache Spark](#rdds-and-dataframes-in-apache-spark)
  - [Introduction](#1-introduction)
  - [RDD: Resilient Distributed Dataset](#2-rdd-resilient-distributed-dataset)
  - [DataFrames](#3-dataframes)
  - [Conversion Between RDD and DataFrame](#4-conversion-between-rdd-and-dataframe)
  - [RDD vs. DataFrame - Comparison](#5-rdd-vs-dataframe---comparison)
  - [Use Case Summary](#6-use-case-summary)
  - [Conclusion](#7-conclusion)
- [Apache Spark Local Setup](#apache-spark-local-setup)
  - [Installing Spark Locally (Native Installation)](#1-installing-spark-locally-native-installation)
  - [Using Docker to Set Up Spark](#2-using-docker-to-set-up-spark)
- [Extracting and Transforming Data with Apache Spark](#extracting-and-transforming-data-with-apache-spark)

## What is Apache Spark?

**Apache Spark** is an open-source, distributed analytics engine designed for large-scale data processing and machine learning. It is renowned for its speed, versatility, and ability to scale from a single machine to large clusters of computers. Spark offers APIs in several popular languages, including Python (using PySpark), Scala, Java, and R, making it accessible to a wide audience of data professionals.

## Main Abstractions of Apache Spark

### Resilient Distributed Datasets (RDDs)

RDDs are the fundamental data structure in Spark. They represent immutable, distributed collections of objects partitioned across the cluster. RDDs support two types of operations:

- **Transformations:** Lazy operations that define a new RDD from an existing one (e.g., `map`, `filter`).
- **Actions:** Operations that trigger computation and return results (e.g., `collect`, `count`).

RDDs enable fault tolerance by tracking lineage, allowing the system to recompute lost data partitions in case of node failure.
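
The difference between lazy transformations and actions, and the role of lineage, can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `MiniRDD` is a hypothetical class that only records its steps until an action replays them from the source.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a source plus a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = list(source)
        self._lineage = tuple(lineage)  # immutable record of transformations

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the lineage from the source. A lost result can be rebuilt
        # the same way, which is the idea behind lineage-based recovery.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []

def double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2

numbers = MiniRDD([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
ran_before_action = len(calls)    # 0: nothing has executed yet
result = evens_doubled.collect()  # the action runs the whole chain
print(ran_before_action, result)
```

Each transformation returns a new object and leaves `numbers` unchanged, which mirrors the immutability described above.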

### Directed Acyclic Graph (DAG)

Spark uses a DAG to represent the sequence of transformations applied to RDDs. When an action triggers a job, Spark’s DAG scheduler splits the computation into stages at shuffle boundaries, and each stage consists of tasks that run in parallel, one per partition. This DAG-based execution plan enables optimization and efficient job scheduling.
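
As a rough illustration of how a chain of operations becomes stages, the sketch below cuts a linear list of operations wherever an operation needs a shuffle. It is a simplified model: the `(name, needs_shuffle)` pairs and `split_into_stages` are hypothetical, and the real scheduler works on a graph of RDD dependencies rather than a flat list.

```python
def split_into_stages(ops):
    """Group operations into stages; a shuffle operation starts a new stage."""
    stages, current = [], []
    for name, needs_shuffle in ops:
        if needs_shuffle and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages


# Hypothetical job: narrow operations are pipelined, wide ones force a new stage.
job = [
    ("textFile", False),
    ("map", False),
    ("filter", False),
    ("reduceByKey", True),
    ("mapValues", False),
    ("repartition", True),
]

stages = split_into_stages(job)
for i, stage in enumerate(stages):
    print(f"Stage {i}: {' -> '.join(stage)}")
```

Operations inside one stage can be applied to a partition back to back without moving data between nodes, which is why the scheduler groups them.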

## Spark Architecture

Apache Spark follows a **master-worker** architecture composed of several key components:

### 1. Spark Driver

The driver is the central coordinator of a Spark application. It:
- Maintains the lifecycle of the application.
- Converts user code into a logical execution plan (DAG).
- Schedules tasks and monitors their execution.
- Communicates with the cluster manager to request resources.
- Collects and aggregates results from executors.

The driver contains components such as SparkContext, DAG Scheduler, Task Scheduler, and Block Manager.

### 2. Cluster Manager

The cluster manager oversees resource allocation in the cluster. It manages CPUs, memory, and executors across worker nodes. Spark can run on various cluster managers such as:
- Apache YARN (Hadoop ecosystem)
- Apache Mesos (deprecated since Spark 3.2)
- Kubernetes
- Spark's standalone cluster manager

### 3. Executors

Executors are worker processes launched on cluster nodes. Each executor:
- Executes tasks assigned by the driver.
- Performs computations on partitions of data.
- Caches data in memory or on disk as needed.
- Reports task status and results back to the driver.

Executors live for the duration of a Spark application and enable parallel task execution.
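
The division of labor between driver and executors can be sketched with the standard library. This is an analogy only: a thread pool stands in for the executors, and `partition` and `task` are hypothetical helpers, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into contiguous, roughly equal slices."""
    size, rem = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts


def task(part):
    """Work done by one task on one partition: a partial sum of squares."""
    return sum(x * x for x in part)


data = list(range(1, 11))
partitions = partition(data, 3)

# Worker threads play the role of executors running tasks in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(task, partitions))

# The "driver" combines the partial results reported by the tasks.
total = sum(partial_results)
print(partitions, partial_results, total)
```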

### 4. Worker Nodes

Worker nodes are the physical or virtual machines in the cluster where executors run. Each worker node hosts one or more executors, which run tasks in parallel.

### 5. SparkContext

SparkContext is the entry point through which the driver interacts with the cluster. It:
- Connects to the cluster manager.
- Creates RDDs and manages their lifecycle.
- Coordinates job execution.

---

## Spark Core Components and Libraries

### Spark SQL

Spark SQL is Spark’s module for working with structured data. It supports:
- Queries in standard SQL.
- Queries in Hive Query Language (HiveQL).
- Numerous data sources, including Hive tables, Parquet, and JSON.

Spark SQL integrates SQL queries with Spark’s programmatic APIs (RDDs, DataFrames) in Python, Scala, and Java. This tight integration supports complex analytics and interactive querying within a unified application framework.

### MLlib

MLlib is Spark’s scalable machine learning library. It provides:
- Algorithms for classification, regression, clustering, and collaborative filtering.
- Utilities for model evaluation and data import.
- Low-level primitives such as a generic gradient descent optimization algorithm.

### GraphX

GraphX is Spark’s graph processing library, enabling:
- Creation and manipulation of graphs with properties on vertices and edges.
- Graph-parallel computations like PageRank and triangle counting.
- Operators such as subgraph extraction and vertex mapping.

## Why Do Data Engineers Need Spark?

### 1. Speed and Performance
- Spark performs in-memory computing, reducing costly disk read/write operations.
- It can be up to 100× faster than Hadoop MapReduce for iterative and interactive workloads.

### 2. Scalability
- Spark scales from a single machine to thousands of cluster nodes.
- Handles petabyte-scale data through distributed processing.

### 3. Unified Processing Engine
- Supports batch processing, real-time streaming, SQL querying, machine learning, and graph analytics all within one platform.

### 4. Language Flexibility and Ease of Use
- Provides APIs in Python, Scala, Java, and R.
- Abstractions at several levels (low-level RDDs, higher-level DataFrames and Datasets) simplify complex data transformations.

### 5. Ecosystem and Integration
- Integrates with Hadoop HDFS, Amazon S3, Apache Kafka, and other platforms.
- Supports multiple cluster managers for flexible deployment.

### 6. Essential for Modern Workloads
- Enables ETL pipelines, real-time dashboards, machine learning workflows, and large-scale interactive queries.

---

## Typical Use Cases
- ETL pipelines for big data ingestion and transformation
- Scalable machine learning model training and deployment
- Real-time data stream processing (e.g., fraud detection, log analysis)
- Graph analytics for social network analysis and recommendations

---

# RDDs and DataFrames in Apache Spark

## 1. Introduction
Apache Spark has two core abstractions for working with distributed data:
- **RDD (Resilient Distributed Dataset):** The original low-level distributed data structure.
- **DataFrame:** A high-level abstraction built on top of RDDs, offering a tabular data structure similar to a database table or a pandas DataFrame.

## 2. RDD: Resilient Distributed Dataset

### 2.1 What is an RDD?
An RDD is an immutable distributed collection of objects that can be processed in parallel.

### 2.2 Key Features
- Fault-tolerant
- Lazy evaluation
- Supports transformations (`map`, `filter`, etc.) and actions (`collect`, `count`, etc.)
- Type-safe (in Scala/Java)
- No built-in schema

### 2.3 Creating or Loading Data into an RDD

#### Creating an RDD (PySpark):

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])

#### Loading Data into an RDD

In [None]:
# Load file (skip header)
rdd = sc.textFile("./data/customers.csv")
header = rdd.first()
rdd_data = rdd.filter(lambda line: line != header)

### 2.4 RDD Transformation and Actions

In [None]:
# Split CSV into fields
customers_rdd = rdd_data.map(lambda line: line.split(","))
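
One caveat with `line.split(",")`: it breaks on quoted fields that contain commas. The sketch below compares it with Python's `csv` module on a hypothetical row (the values are made up for illustration and are not taken from `customers.csv`). A function like `parse_line` could be passed to `rdd_data.map(...)` in place of the lambda.

```python
import csv


def parse_line(line):
    """Parse one CSV line, honoring quoted fields."""
    return next(csv.reader([line]))


# Hypothetical row whose last_name contains a comma inside quotes.
line = '7,Alex,"Smith, Jr.",alex@example.com,2023-05-01'

naive = line.split(",")    # splits inside the quotes: 6 fields
parsed = parse_line(line)  # keeps the quoted field whole: 5 fields

print(len(naive), naive)
print(len(parsed), parsed)
```

If the file never contains quoted commas, the simpler `split(",")` gives the same result.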

In [None]:
# View Sample
customers_rdd.take(3)

In [None]:
# Count missing join dates
missing_dates = customers_rdd.filter(lambda x: x[4] == "").count()
print(f"Missing join dates: {missing_dates}")

In [None]:
# Extract customer names
names = customers_rdd.map(lambda x: f"{x[1]} {x[2]}").collect()
print(names)

## 3. DataFrames

### 3.1 What is a DataFrame?
A DataFrame is a distributed collection of data organized into named columns, like a SQL table.

### 3.2 Key Features
- Schema-aware (columns and types)
- Optimized by Catalyst optimizer
- Supports SQL queries via `spark.sql()`
- Interoperable with RDDs and Pandas
- Better performance than RDD for most use cases

### 3.3 Creating or Loading Data into a DataFrame

#### Reading CSV into DataFrame

In [None]:
df = spark.read.option("header", True).csv("./data/customers.csv")
df.show()

### 3.4 Common DataFrame Operations

In [None]:
# Print schema
df.printSchema()

In [None]:
# Select specific columns
df.select("first_name", "email").show()

In [None]:
# Filter customers with missing join dates
df.filter(df.join_date.isNull()).show()

In [None]:
# Count customers with a recorded join date
df.filter(df.join_date.isNotNull()).count()

In [None]:
# Build full customer names with the DataFrame API
from pyspark.sql.functions import concat_ws

df.select(concat_ws(" ", "first_name", "last_name").alias("full_name")).show()

## 4. Conversion Between RDD and DataFrame

### From RDD to DataFrame

In [None]:
from pyspark.sql import Row

# Convert RDD to Row RDD
row_rdd = customers_rdd.map(lambda x: Row(
    customer_id=int(x[0]),
    first_name=x[1],
    last_name=x[2],
    email=x[3],
    join_date=x[4] if x[4] != "" else None
))

df_from_rdd = spark.createDataFrame(row_rdd)
df_from_rdd.show()

### From DataFrame to RDD

In [None]:
rdd_from_df = df.rdd
rdd_from_df.take(3)

## 5. RDD vs. DataFrame - Comparison

| Feature           | RDD                        | DataFrame               |
| ----------------- | -------------------------- | ----------------------- |
| Abstraction Level | Low                        | High                    |
| API Style         | Functional                 | SQL-like                |
| Schema            | Not enforced               | Schema-aware            |
| Performance       | Lower                      | Optimized with Catalyst |
| Best for          | Custom, fine-grained logic | Queries, aggregations   |

## 6. Use Case Summary

| Task                                     | Recommended |
| ---------------------------------------- | ----------- |
| Load structured CSV data                 | DataFrame   |
| Filter or select fields efficiently      | DataFrame   |
| Custom parsing, transformation, or logic | RDD         |
| SQL-like querying and grouping           | DataFrame   |

## 7. Conclusion

- Use DataFrames when working with structured data like CSV, JSON, or Parquet.
- Use RDDs when you need custom logic, low-level transformations, or direct control over how data is partitioned and processed.

The examples in this section all run against the same `customers.csv` file, so you can compare how each abstraction handles the same data and decide when to use which.

# Apache Spark Local Setup

In this section, we'll cover two common ways to set up Apache Spark on a local development machine:

1. **Installing Spark Locally (Native Installation)**
2. **Using Docker to Set Up Spark**

## 1. Installing Spark Locally (Native Installation)

This method involves manually installing Spark and its dependencies on your machine.

### 1.1 Prerequisites
- **Java (JDK 8, 11, or 17):** Spark runs on the JVM. Spark 3.4, the version used below, supports these three releases.
- **Python 3.7 or later:** Required for PySpark.

### 1.2 Download and Install Spark
- Download Spark from the [Official Apache Spark website](https://spark.apache.org/downloads.html).
    - Choose a version (e.g., Spark 3.4.1) and a pre-built package for Hadoop (e.g., "Pre-built for Apache Hadoop 3.3 and later").

#### Extract the archive to a directory of your choice

**On Linux:**

In [None]:
tar -xzf spark-3.4.1-bin-hadoop3.tgz -C /path/to/your/directory

**On Windows:**
- Use a tool like 7-Zip or WinRAR.
    - Right-click the downloaded `.tgz` file
    - Select "Extract Here" or "Extract to spark-3.4.1-bin-hadoop3"
    - Move the extracted folder to your desired location

### 1.3 Set Environment Variables
Set the following environment variables so your system can find Spark and Java.

In [None]:
# Linux (add to ~/.bashrc or ~/.zshrc)
export SPARK_HOME=/path/to/your/directory/spark-3.4.1-bin-hadoop3
export PATH=$PATH:$SPARK_HOME/bin
export JAVA_HOME=/path/to/your/java

On Windows, set environment variables via System Properties > Environment Variables.
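
Before moving on, it can help to confirm that the variables are visible to Python. The check below is a minimal sketch using only the standard library; `check_spark_env` is a hypothetical helper, not part of Spark or findspark. Note that a `pip`-installed PySpark can work without `SPARK_HOME`, so a missing entry is a hint rather than a definite error.

```python
import os
import shutil


def check_spark_env(env=None):
    """Report whether the variables from this section are set and usable."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    java_home = env.get("JAVA_HOME")
    path = env.get("PATH", "")

    def on_path(cmd):
        # Look the command up only on the PATH we were given.
        return bool(path) and shutil.which(cmd, path=path) is not None

    return {
        "SPARK_HOME set": bool(spark_home),
        "SPARK_HOME exists": bool(spark_home) and os.path.isdir(spark_home),
        "JAVA_HOME set": bool(java_home),
        "java on PATH": on_path("java"),
        "spark-submit on PATH": on_path("spark-submit"),
    }


for name, ok in check_spark_env().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```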

### 1.4 Install Required Python Libraries

In [None]:
!pip install pyspark findspark

### 1.5 Test Your Installation
Start the PySpark shell from a terminal (the shell is interactive, so it is not suited to a notebook cell):

In [None]:
pyspark

Or test with a small script:

In [None]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
print(spark.range(5).collect())

---
## 2. Using Docker to Set Up Spark

An alternative way is to run Spark inside Docker containers. This avoids manual setup and ensures a clean environment.

### 2.1 Prerequisites
- Docker installed on your system ([Install Docker](https://docs.docker.com/get-docker/))

### 2.2 Standalone Setup
#### 2.2.1 Pull a Spark Docker image
You can use an existing image from Docker Hub or customize it using a Dockerfile.

In [None]:
docker pull bitnami/spark

#### 2.2.2 Run a Spark Container
Start a Spark standalone container:

In [None]:
docker run -it bitnami/spark pyspark

### 2.3 Set Up Spark Cluster
You can create a local Spark cluster with [`docker-compose.yaml`](./docker-compose.yaml).

#### 2.3.1 Start the Cluster
Run the following command to start the cluster:

In [None]:
docker compose up -d

#### 2.3.2 Access the Spark Web UI
- Master: [http://localhost:8080](http://localhost:8080)
- Worker: [http://localhost:8081](http://localhost:8081)

#### 2.3.3 Submit Jobs
You can submit jobs using the spark-submit tool or run a PySpark shell inside the container:

In [None]:
docker exec -it spark-master pyspark --master spark://spark-master:7077

#### 2.3.4 Setting Up Jupyter Notebook Container for Spark (Optional)

Running a Jupyter Notebook container alongside your Spark services is a great way to interactively test Spark code using PySpark.

- Uncomment the `jupyter` service block in the [`docker-compose.yaml`](./docker-compose.yaml) file.
- Ensure the `notebooks` directory exists in the same location as your `docker-compose.yaml`:
    ```bash
    mkdir notebooks
    ```
  This directory will be mounted into the Jupyter container so that your notebooks are saved persistently.

- To start the whole cluster (including Jupyter):
    ```bash
    docker compose up -d
    ```
- To start only the Jupyter container (after the cluster is running):
    ```bash
    docker compose up -d jupyter
    ```
- You can now access the notebook UI at: [http://localhost:8888](http://localhost:8888)

  Use the token shown in the terminal (when the Jupyter container starts) to log in.

#### 2.3.5 Test Notebook Code

In a new notebook, run:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NotebookSpark") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

spark.range(5).show()

# Extracting and Transforming Data with Apache Spark

This section provides a comprehensive guide to extracting and transforming data using Apache Spark, focusing on Spark SQL and DataFrame APIs. It includes detailed explanations, practical scenarios, code examples, and best practices to help you master these critical aspects of Spark.

### Extracting Data with Spark

#### Overview
Extracting data in Apache Spark involves loading data from various sources into DataFrames for further processing. Spark’s DataFrame API provides a unified interface to read data from file formats like JSON, CSV, and Parquet, as well as databases and cloud storage.

#### Supported Data Sources
Spark supports a wide range of data sources, including:
- **File formats:** CSV, JSON, Parquet, ORC, Avro, Text
- **Databases:** JDBC/ODBC (MySQL, PostgreSQL, SQL Server, etc.)
- **Big data systems:** Hadoop HDFS, Apache Hive, Apache HBase
- **Cloud storage:** AWS S3, Google Cloud Storage, Azure Blob Storage
- **Other:** Kafka, NoSQL databases like Cassandra

#### Reading Data
Spark provides the `spark.read` API to load data into DataFrames. Common methods include:
- `spark.read.csv(path)`: Reads CSV files.
- `spark.read.json(path)`: Reads JSON files.
- `spark.read.parquet(path)`: Reads Parquet files.
- `spark.read.jdbc(url, table, properties)`: Reads from JDBC databases.

#### Key Options:
- `header=True`: Treats the first row as column names (CSV).
- `inferSchema=True`: Automatically infers column data types.
- `schema=StructType`: Specifies a custom schema to avoid inference overhead.
- `mode`: Controls how malformed records are handled (`permissive`, `dropmalformed`, or `failfast`).
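
The three `mode` values differ in what happens to a record that does not fit the schema. The plain-Python sketch below models that behavior under simplifying assumptions: `parse_rows` is a hypothetical function, the schema is a list of `(name, converter)` pairs, and details such as Spark's corrupt-record column are left out.

```python
def parse_rows(rows, schema, mode="permissive"):
    """Convert raw string rows using schema, a list of (name, converter) pairs."""
    mode = mode.lower()
    if mode not in {"permissive", "dropmalformed", "failfast"}:
        raise ValueError(f"Unknown mode: {mode}")
    out = []
    for raw in rows:
        record = {name: None for name, _ in schema}
        malformed = False
        for (name, convert), value in zip(schema, raw):
            if value == "":
                continue  # empty field stays null
            try:
                record[name] = convert(value)
            except ValueError:
                malformed = True  # field stays null in permissive mode
        if malformed and mode == "failfast":
            raise ValueError(f"Malformed record: {raw}")
        if malformed and mode == "dropmalformed":
            continue  # skip the whole row
        out.append(record)
    return out


schema = [("customer_id", int), ("first_name", str)]
rows = [["1", "John"], ["oops", "Jane"], ["3", ""]]  # hypothetical input

permissive = parse_rows(rows, schema, "permissive")
dropped = parse_rows(rows, schema, "dropmalformed")
print(permissive)
print(dropped)
```

A missing value (the empty `first_name` in the third row) is not malformed, so it survives in every mode as a null.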

#### Scenario: Extracting Data from JSON, CSV, and Parquet
**Problem**: Load the `products.json`, `customers.csv`, and `orders.parquet` datasets into Spark DataFrames, ensuring proper schema handling and error management for missing or inconsistent data.

**Solution**: <br>
- Define explicit schemas to ensure correct data types.
- Handle missing or malformed data during extraction.
- Cache DataFrames for repeated use and save them in Parquet for unified storage.

**Code Example**:

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# Initialize Spark session
spark = SparkSession.builder \
    .appName("PySparkHandbook") \
    .getOrCreate()

# Define schema for products.json
products_schema = StructType([
    StructField("product_id", IntegerType(), False),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True)
])

# Define schema for customers.csv
customers_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("join_date", DateType(), True)
])

# Read products.json
products_df = spark.read \
    .schema(products_schema) \
    .json("./data/products.json")
 

# Read customers.csv
customers_df = spark.read \
    .schema(customers_schema) \
    .option("header", "true") \
    .option("mode", "dropmalformed") \
    .csv("./data/customers.csv")

# Read orders.parquet
orders_df = spark.read.parquet("./data/orders_parquet/orders.parquet")

# Cache DataFrames for performance
products_df.cache()
customers_df.cache()
orders_df.cache()

# Show sample data
print("Products:")
products_df.show(5, truncate=False)
print("Customers:")
customers_df.show(5, truncate=False)
print("Orders:")
orders_df.show(5, truncate=False)

# Save to Parquet for unified storage
products_df.write.mode("overwrite").parquet("./data/cleaned/products_clean.parquet")
customers_df.write.mode("overwrite").parquet("./data/cleaned/customers_clean.parquet")
orders_df.write.mode("overwrite").parquet("./data/cleaned/orders_clean.parquet")

# Stop Spark session
spark.stop()

Products:
+----------+------------+-----------+------+
|product_id|product_name|category   |price |
+----------+------------+-----------+------+
|101       |Laptop      |Electronics|1200.5|
|102       |Mouse       |Accessories|25.0  |
|103       |Keyboard    |Accessories|75.75 |
|104       |Monitor     |Electronics|300.0 |
|105       |USB Cable   |accessories|10.0  |
+----------+------------+-----------+------+
only showing top 5 rows
Customers:
+-----------+----------+---------+-----------------------+----------+
|customer_id|first_name|last_name|email                  |join_date |
+-----------+----------+---------+-----------------------+----------+
|1          |John      |Doe      |john.doe@example.com   |2023-01-15|
|2          |Jane      |Smith    |jane.smith@example.com |2023-02-20|
|3          |Peter     |Jones    |peter.jones@example.com|2023-03-10|
|4          |Sarah     |Lee      |sarah.lee@example.com  |NULL      |
|5          |Mike      |NULL     |mike.brown@example.com |20

**Explanation**:
- Defines explicit schemas to ensure correct data types and avoid inference overhead.
- Uses `dropmalformed` mode for the CSV to skip any malformed rows.
- Reads `orders.parquet` directly, as Parquet files include schema metadata.
- Caches DataFrames to improve performance for subsequent transformations.
- Saves cleaned DataFrames to Parquet for efficient storage and querying.

### Transforming Data with Spark
**Overview** <br>
Transforming data in Spark involves manipulating DataFrames to clean, enrich, or aggregate data for analysis. Spark’s DataFrame API and SQL queries support operations like joins, aggregations, filtering, and window functions, optimized for distributed execution. The provided datasets will be used to demonstrate these transformations in practical scenarios.

### Common Transformation Operations
**Joins**: <br>
Joins combine DataFrames based on a key. Common types include:
- **Inner join:** Returns only matching rows.
- **Left outer join:** Includes all rows from the left DataFrame, with nulls for non-matching rows.
- **Right outer join:** Includes all rows from the right DataFrame, with nulls for non-matching rows.
- **Full outer join:** Includes all rows from both DataFrames.

**Syntax**:
```python 
result_df = df1.join(df2, df1.key == df2.key, "inner")
```
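
To make the row-level difference between join types concrete without a cluster, here is a small plain-Python sketch. The `join` function is a hypothetical helper over lists of dicts, and order 1005 with product 999 is made-up data chosen so that one row has no match.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on key; supports 'inner' and 'left_outer'."""
    if how not in {"inner", "left_outer"}:
        raise ValueError(f"Unsupported join type: {how}")
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_cols = sorted({c for row in right for c in row if c != key})
    result = []
    for l_row in left:
        matches = index.get(l_row[key], [])
        if matches:
            result.extend({**l_row, **r_row} for r_row in matches)
        elif how == "left_outer":
            # Keep the left row and fill the right-hand columns with nulls.
            result.append({**l_row, **{c: None for c in right_cols}})
    return result


orders = [
    {"order_id": 1001, "product_id": 101},
    {"order_id": 1005, "product_id": 999},  # hypothetical: no such product
]
products = [{"product_id": 101, "product_name": "Laptop"}]

inner = join(orders, products, "product_id", "inner")
left_outer = join(orders, products, "product_id", "left_outer")
print(inner)
print(left_outer)
```

The inner join silently drops the unmatched order, while the left outer join keeps it with a null `product_name`.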

**Aggregations**: <br>
Aggregations summarize data using functions like `sum`, `avg`, `count`, `min`, and `max`, typically after `groupBy`.

**Syntax**: <br>
```python 
agg_df = df.groupBy("column").agg({"other_column": "sum"})
```
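
What a grouped aggregation computes can be shown in a few lines of plain Python. This is a sketch with a hypothetical `group_agg` helper and four illustrative rows; it keeps a running sum and count per group and derives the average from them.

```python
from collections import defaultdict


def group_agg(rows, key, value):
    """Return sum, avg, and count of value for each distinct key."""
    acc = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for row in rows:
        group = acc[row[key]]
        group["sum"] += row[value]
        group["count"] += 1
    return {
        k: {"sum": g["sum"], "avg": g["sum"] / g["count"], "count": g["count"]}
        for k, g in acc.items()
    }


# Illustrative sample rows.
rows = [
    {"category": "ELECTRONICS", "total_price": 1200.5},
    {"category": "ACCESSORIES", "total_price": 50.0},
    {"category": "ACCESSORIES", "total_price": 75.75},
    {"category": "ELECTRONICS", "total_price": 300.0},
]

summary = group_agg(rows, "category", "total_price")
for category, stats in summary.items():
    print(category, stats)
```

Because sums and counts from separate partitions can simply be added together, this kind of aggregation distributes well.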

**Filtering and Selecting**: <br>
- **Filtering**: Select rows with `filter` or `where`.
- **Selecting**: Choose columns with `select`.

**Syntax**:
```python 
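from pyspark.sql.functions import col

# `df` and `value` are placeholders for your DataFrame and comparison value
filtered_df = df.filter(col("column") > value)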
selected_df = df.select("column1", "column2")
```

**Handling Missing Data**
- **Drop nulls**: `df.dropna(subset=["column"])`
- **Fill nulls**: `df.fillna(value, subset=["column"])`
- **Replace values**: `df.replace(old_value, new_value, subset=["column"])`
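
The effect of these three operations on individual rows can be sketched in plain Python. The functions below are hypothetical stand-ins that share names with the DataFrame methods but operate on lists of dicts, and the rows are made-up examples.

```python
def dropna(rows, subset):
    """Keep only rows where every column in subset is non-null."""
    return [r for r in rows if all(r.get(c) is not None for c in subset)]


def fillna(rows, value, subset):
    """Replace nulls in the subset columns with value."""
    return [
        {**r, **{c: value for c in subset if r.get(c) is None}}
        for r in rows
    ]


def replace(rows, old_value, new_value, subset):
    """Replace exact matches of old_value in the subset columns."""
    return [
        {**r, **{c: new_value for c in subset if r.get(c) == old_value}}
        for r in rows
    ]


# Hypothetical customer rows with missing values.
customers = [
    {"id": 1, "last_name": "Doe", "join_date": "2023-01-15"},
    {"id": 4, "last_name": "Lee", "join_date": None},
    {"id": 5, "last_name": None, "join_date": "2023-04-05"},
]

with_dates = dropna(customers, subset=["join_date"])
named = fillna(customers, "Unknown", subset=["last_name"])
renamed = replace(customers, "Doe", "DOE", subset=["last_name"])
print([r["id"] for r in with_dates])
print([r["last_name"] for r in named])
print([r["last_name"] for r in renamed])
```

Each function returns new rows and leaves the input untouched, in the same way that DataFrame operations return a new DataFrame.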


### Scenario: Joining and Aggregating Sales Data
**Problem**:
Combine the `products`, `customers`, and `orders` datasets to calculate total sales and average order amount per product category. Standardize the category column in products and handle missing data in customers.

**Solution**: <br>
- Join the three DataFrames on appropriate keys.
- Standardize `category` in `products`.
- Handle missing `last_name` and `join_date` in customers.
- Aggregate by `category` to compute sales metrics.

#### Code Example:

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_, avg, count, coalesce, lit, upper
from pyspark.sql.types import DateType  # needed for the join_date cast below

# Initialize Spark session
spark = SparkSession.builder.appName("SparkTransformation").getOrCreate()

# Load data
products_df = spark.read.parquet("./data/cleaned/products_clean.parquet")
customers_df = spark.read.parquet("./data/cleaned/customers_clean.parquet")
orders_df = spark.read.parquet("./data/cleaned/orders_clean.parquet")

# Standardize category in products (convert to uppercase)
products_df = products_df.withColumn("category", upper(col("category")))

# Handle missing data in customers
customers_df = customers_df \
    .withColumn("last_name", coalesce(col("last_name"), lit("Unknown"))) \
    .withColumn("join_date", coalesce(col("join_date"), lit("2023-01-01").cast(DateType())))

# Join DataFrames
joined_df = orders_df \
    .join(customers_df, "customer_id", "left_outer") \
    .join(products_df, "product_id", "inner")

# Print joined dataframe
print("Joined DataFrame:")
joined_df.show(truncate=False)

# Aggregate by category
summary_df = joined_df.groupBy("category") \
    .agg(
        sum_("total_price").alias("total_sales"),
        avg("total_price").alias("avg_order_amount"),
        count("order_id").alias("order_count")
    )

# Format numerical columns
summary_df = summary_df.select(
    col("category"),
    col("total_sales").cast("decimal(10,2)"),
    col("avg_order_amount").cast("decimal(10,2)"),
    col("order_count")
)

# Save and show results
summary_df.write.mode("overwrite").parquet("./data/cleaned/sales_summary.parquet")
print("Sales Summary:")
summary_df.show(truncate=False)

# Stop Spark session
spark.stop()


Joined DataFrame:
+----------+-----------+--------+--------+-----------+----------+---------+-----------------------+----------+------------+-----------+------+
|product_id|customer_id|order_id|quantity|total_price|first_name|last_name|email                  |join_date |product_name|category   |price |
+----------+-----------+--------+--------+-----------+----------+---------+-----------------------+----------+------------+-----------+------+
|101       |1          |1001    |1       |1200.5     |John      |Doe      |john.doe@example.com   |2023-01-15|Laptop      |ELECTRONICS|1200.5|
|102       |2          |1002    |2       |50.0       |Jane      |Smith    |jane.smith@example.com |2023-02-20|Mouse       |ACCESSORIES|25.0  |
|103       |3          |1003    |1       |75.75      |Peter     |Jones    |peter.jones@example.com|2023-03-10|Keyboard    |ACCESSORIES|75.75 |
|104       |1          |1004    |1       |300.0      |John      |Doe      |john.doe@example.com   |2023-01-15|Monitor     |E

**Explanation**: <br>
- Standardizes `category` in `products_df` to uppercase to fix any inconsistencies (e.g., “accessories” vs. “Accessories”).
- Fills missing `last_name` with “Unknown” and `join_date` with a default date in `customers_df`.
- Performs a left outer join for `orders` and `customers` to include all orders, and an inner join with `products` to ensure valid products.
- Aggregates by `category`, computing total sales, average order amount, and order count.
- Formats numerical columns to `decimal(10,2)` for readability.
- Saves results to Parquet.


## Best Practices
- **Explicit schemas:** Always define schemas for JSON and CSV to ensure correct data types and avoid inference costs.
- **Handle inconsistencies:** Standardize case-sensitive fields (e.g., `category`) early in the pipeline.
- **Null handling:** Address missing data before joins or aggregations to prevent unexpected results.
- **Join optimization:** Use inner joins when possible; use left outer joins to preserve data when needed.
- **Columnar storage:** Use Parquet for intermediate and output data to leverage compression and columnar access.
- **SQL for readability:** Use Spark SQL for complex transformations when it improves clarity.