# Spark Optimization

### Partitioning

- Partitioning refers to dividing data into logical chunks (partitions) across nodes.
- Effective partitioning improves parallelism, reduces shuffle, and enhances query performance.

#### Partitioning in memory

- Repartition
    - lets you specify the target number of partitions, the columns to partition by, or both
    - performs a full shuffle to redistribute the data, so it can either increase or decrease the number of partitions

- Coalesce
    - reduces the number of partitions by merging existing ones, without a full shuffle
    - useful when you want fewer partitions at low cost, for example to write a small number of output files
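
The difference between the two operations can be modeled in plain Python, without a Spark session. This is a minimal sketch rather than Spark's implementation: partitions are plain lists, `repartition_by_key` and `coalesce` are hypothetical helpers, the sample rows are made up, and `zlib.crc32` stands in for Spark's internal hash function.

```python
import zlib


def repartition_by_key(partitions, num_partitions, key):
    """Model of a hash repartition: any row may move to any partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            # crc32 is a stand-in for Spark's internal hash (an assumption).
            idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
            out[idx].append(row)
    return out


def coalesce(partitions, num_partitions):
    """Model of coalesce: merge whole partitions, never split one."""
    current = len(partitions)
    if num_partitions >= current:
        # Coalesce only reduces the count; asking for more is a no-op.
        return [list(p) for p in partitions]
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        out[i * num_partitions // current].extend(part)
    return out


# Made-up sample rows, starting in 3 arbitrary partitions of 2 rows each.
rows = [{"country": c, "total_spent": s} for c, s in [
    ("Germany", 10.0), ("Brazil", 20.0), ("Germany", 30.0),
    ("Chile", 40.0), ("Brazil", 50.0), ("Israel", 60.0),
]]
initial = [rows[0:2], rows[2:4], rows[4:6]]

by_country = repartition_by_key(initial, 4, "country")  # rows are reshuffled
merged = coalesce(initial, 2)                           # partitions are glued together
```

After the hash repartition, all rows for a given country sit in the same partition, which is what makes a later per-country operation cheap. The coalesce model leaves every row where it was and only concatenates neighbouring partitions, which is why it is cheaper but cannot rebalance skewed data.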

#### Partitioning on disk


- The `partitionBy()` method partitions the data on the file system at write time, producing one sub-directory per distinct value of the partition column (for example `country=Germany/`).
- This improves read performance for downstream queries that filter on the partition column, because Spark can skip the directories that do not match.
- It can be applied to one or more columns when writing a DataFrame to disk.
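
The `column=value/` directory layout can be reproduced with the standard library alone, which shows why a filter on the partition column only needs to touch one directory. The sketch below is an illustration under assumptions: `write_partitioned` and `read_partition` are hypothetical helpers, the rows are made up, and the `part-0.csv` file name is a placeholder (Spark writes Parquet part files with generated names).

```python
import csv
import os
import tempfile


def write_partitioned(rows, base_dir, partition_col):
    """Write rows into <base_dir>/<partition_col>=<value>/part-0.csv."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_col], []).append(row)
    for value, group in groups.items():
        part_dir = os.path.join(base_dir, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        # The partition column is left out of the data file:
        # its value is already encoded in the directory name.
        fields = [k for k in group[0] if k != partition_col]
        with open(os.path.join(part_dir, "part-0.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            for row in group:
                writer.writerow({k: row[k] for k in fields})


def read_partition(base_dir, partition_col, value):
    """Read only the directory that matches the filter value."""
    part_dir = os.path.join(base_dir, f"{partition_col}={value}")
    with open(os.path.join(part_dir, "part-0.csv"), newline="") as f:
        # Restore the partition column from the directory name.
        return [dict(r, **{partition_col: value}) for r in csv.DictReader(f)]


# Made-up sample rows.
rows = [
    {"name": "A", "country": "Germany"},
    {"name": "B", "country": "Brazil"},
    {"name": "C", "country": "Germany"},
]
base_dir = tempfile.mkdtemp()
write_partitioned(rows, base_dir, "country")

layout = sorted(os.listdir(base_dir))
german_rows = read_partition(base_dir, "country", "Germany")
```

Here `layout` lists one directory per country, and `read_partition` never opens the Brazil directory. The trade-off is that a high-cardinality partition column produces many small directories and files.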


### Bucketing

- Bucketing organizes data into a fixed number of buckets using the hash of a column.

**Benefits:**
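- Reduces shuffle during joins and aggregations on the bucketing column.
- Supports efficient bucketed joins: a sort-merge join between tables bucketed the same way on the join key can avoid shuffling them.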
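
To see why matching bucket layouts remove the need for a shuffle, the plain-Python sketch below assigns rows from two tables to buckets with the same hash and the same bucket count, then joins bucket `i` only against bucket `i`. It is a model, not Spark's implementation: `bucket_of`, `to_buckets`, and `bucketed_join` are hypothetical helpers, the rows are made up, and `zlib.crc32` stands in for Spark's hash function.

```python
import zlib

NUM_BUCKETS = 8  # same bucket count for both tables


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # crc32 stands in for Spark's hash function (an assumption of this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def to_buckets(rows, key, num_buckets=NUM_BUCKETS):
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Inner join that only ever compares bucket i with bucket i."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {}
        for r in right:
            lookup.setdefault(r[key], []).append(r)
        for l in left:
            for r in lookup.get(l[key], []):
                joined.append({**l, **r})
    return joined


# Made-up sample rows.
customers = [
    {"customer_id": "c1", "name": "Ann"},
    {"customer_id": "c2", "name": "Bob"},
    {"customer_id": "c3", "name": "Cy"},
]
orders = [
    {"order_id": "o1", "customer_id": "c1"},
    {"order_id": "o2", "customer_id": "c1"},
    {"order_id": "o3", "customer_id": "c3"},
    {"order_id": "o4", "customer_id": "c9"},  # no matching customer
]

result = bucketed_join(to_buckets(orders, "customer_id"),
                       to_buckets(customers, "customer_id"),
                       "customer_id")
matched_orders = sorted(row["order_id"] for row in result)
```

Because equal keys always hash to the same bucket number, no row ever needs to be compared with a row from a different bucket. The guarantee only holds when both sides use the same key, hash, and bucket count.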

[Spark performance tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)


In [22]:
# setup

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CustomersOrdersExample").getOrCreate()


In [24]:
# load the data

# import only the functions used below instead of a wildcard import
from pyspark.sql.functions import col, spark_partition_id


orders_df = spark.read.csv("file:///workspace/TRNG-2224-data-engineering/week2/datasets/orders.csv", inferSchema=True, header=True)

orders_df.show(4)





+--------------------+--------------------+----------+----------+------+
|            order_id|         customer_id|order_date|product_id|amount|
+--------------------+--------------------+----------+----------+------+
|02a777e0-5571-42c...|0e99a07c-c7a5-43d...|2023-04-21|     P1031|375.94|
|1c5a3e4d-f8de-47b...|3a69ac3e-6726-431...|2021-09-25|     P1086|373.51|
|a5b65d4d-3ac0-45d...|3a69ac3e-6726-431...|2024-01-04|     P1054| 61.73|
|b752df2c-aa68-41e...|c63cab5f-dc06-484...|2024-01-16|     P1029| 64.97|
+--------------------+--------------------+----------+----------+------+
only showing top 4 rows


In [25]:
customers_df = spark.read.parquet("file:///workspace/TRNG-2224-data-engineering/week2/cutomer_oputput.parquet")

customers_df.show(4)

+--------------------+---------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+--------------+
|         customer_id|           name|               email|age|gender|             country|signup_date|         last_login|is_active|total_spent|spend_category|
+--------------------+---------------+--------------------+---+------+--------------------+-----------+-------------------+---------+-----------+--------------+
|20780d38-901f-450...| Michael Malone|    dhart@haynes.com| 58|  Male|    Saint Barthelemy| 2021-04-29|2024-10-20 15:56:26|     true|     3733.6|          High|
|a2c56b05-acdc-4a7...|     Edwin Wall| bradley08@yahoo.com| 33|  Male|United Arab Emirates| 2025-01-02|2025-06-19 22:44:59|     true|    3708.71|          High|
|2fe8ff2e-19ea-493...|  Rachel Strong|heather15@schmidt...| 61| Other|              Israel| 2023-02-13|2025-04-12 21:14:26|     true|    2993.41|        Medium|
|5fd9f4a6-2134-41b...|Eddie Rodrig

In [None]:
# partitioning in memory - repartition into 4 partitions by country, then write as parquet


customers_df.repartition(4, "country").write.mode("overwrite").parquet("customers_partitioned")


In [29]:
partitioned_customer_df = customers_df.repartitionByRange(4, "country").sortWithinPartitions("total_spent")
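
The combination used above, range partitioning on one column followed by a sort inside each partition on another, can be modeled in plain Python. This is a sketch under assumptions: `range_boundaries`, `repartition_by_range`, and `sort_within_partitions` are hypothetical helpers, the rows are made up, and the boundaries are computed from all keys, whereas Spark estimates them by sampling.

```python
from bisect import bisect_left


def range_boundaries(keys, num_partitions):
    """Upper bounds for the first num_partitions - 1 ranges.

    Assumes there are at least num_partitions distinct keys.
    """
    ordered = sorted(set(keys))
    step = len(ordered) / num_partitions
    return [ordered[int(step * i) - 1] for i in range(1, num_partitions)]


def repartition_by_range(rows, num_partitions, key):
    bounds = range_boundaries([r[key] for r in rows], num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # A key equal to a bound goes into the lower partition.
        partitions[bisect_left(bounds, row[key])].append(row)
    return partitions


def sort_within_partitions(partitions, key):
    # Each partition is sorted on its own; there is no global order.
    return [sorted(p, key=lambda r: r[key]) for p in partitions]


# Made-up sample rows.
rows = [{"country": c, "total_spent": s} for c, s in [
    ("Germany", 300.0), ("Armenia", 50.0), ("Brazil", 120.0), ("Chile", 80.0),
    ("Israel", 10.0), ("Australia", 200.0), ("Peru", 70.0), ("Spain", 5.0),
]]

partitions = sort_within_partitions(
    repartition_by_range(rows, 4, "country"), "total_spent")
countries_per_partition = [[r["country"] for r in p] for p in partitions]
```

Each partition holds a contiguous alphabetical range of countries, while the rows inside it are ordered by `total_spent`, not by country.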

In [32]:

partitioned_customer_df.withColumn("partition_id", spark_partition_id()) \
    .select("partition_id", "country", "total_spent") \
    .orderBy("partition_id", "total_spent") \
    .show(50, truncate=False)


+------------+--------------------------------------------+-----------+
|partition_id|country                                     |total_spent|
+------------+--------------------------------------------+-----------+
|0           |Cape Verde                                  |115.9      |
|0           |Brazil                                      |378.71     |
|0           |Brunei Darussalam                           |383.83     |
|0           |Cameroon                                    |435.75     |
|0           |Chile                                       |562.2      |
|0           |Bhutan                                      |900.67     |
|0           |Australia                                   |1097.63    |
|0           |Burundi                                     |1129.38    |
|0           |Cayman Islands                              |1269.56    |
|0           |Congo                                       |1411.93    |
|0           |Armenia                                     |1496.

In [33]:
# partitioning in memory - coalesce into a single partition before writing

customers_df.coalesce(1).write.mode("overwrite").parquet("customers_coalesce")

In [34]:
# partitioning on disk - one sub-directory per country

customers_df.write.mode("overwrite").partitionBy("country").parquet("customers_by_country")

                                                                                

In [35]:
df_partitioned = spark.read.parquet("file:///workspace/TRNG-2224-data-engineering/week2/customers_by_country")

df_partitioned.filter(col("country") == "Germany").explain(True)

== Parsed Logical Plan ==
'Filter '`=`('country, Germany)
+- Relation [customer_id#226,name#227,email#228,age#229,gender#230,signup_date#231,last_login#232,is_active#233,total_spent#234,spend_category#235,country#236] parquet

== Analyzed Logical Plan ==
customer_id: string, name: string, email: string, age: int, gender: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, spend_category: string, country: string
Filter (country#236 = Germany)
+- Relation [customer_id#226,name#227,email#228,age#229,gender#230,signup_date#231,last_login#232,is_active#233,total_spent#234,spend_category#235,country#236] parquet

== Optimized Logical Plan ==
Filter (isnotnull(country#236) AND (country#236 = Germany))
+- Relation [customer_id#226,name#227,email#228,age#229,gender#230,signup_date#231,last_login#232,is_active#233,total_spent#234,spend_category#235,country#236] parquet

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [customer_id#226,name#227,ema

In [38]:
# optimizing a join with bucketing


spark.sql("DROP TABLE IF EXISTS bucketed_customers")

customers_df.write.bucketBy(8, "customer_id") \
    .sortBy("age") \
    .mode("overwrite") \
    .saveAsTable("bucketed_customers")


spark.sql("DROP TABLE IF EXISTS bucketed_orders")

orders_df.write.bucketBy(8, "customer_id") \
    .sortBy("order_date") \
    .mode("overwrite") \
    .saveAsTable("bucketed_orders")





In [40]:
bucketed_customers = spark.table("bucketed_customers")
bucketed_orders = spark.table("bucketed_orders")


bucketed_orders.join(bucketed_customers, "customer_id").explain(True)

== Parsed Logical Plan ==
'Join UsingJoin(Inner, [customer_id])
:- SubqueryAlias spark_catalog.default.bucketed_orders
:  +- Relation spark_catalog.default.bucketed_orders[order_id#254,customer_id#255,order_date#256,product_id#257,amount#258] parquet
+- SubqueryAlias spark_catalog.default.bucketed_customers
   +- Relation spark_catalog.default.bucketed_customers[customer_id#243,name#244,email#245,age#246,gender#247,country#248,signup_date#249,last_login#250,is_active#251,total_spent#252,spend_category#253] parquet

== Analyzed Logical Plan ==
customer_id: string, order_id: string, order_date: date, product_id: string, amount: double, name: string, email: string, age: int, gender: string, country: string, signup_date: date, last_login: timestamp, is_active: boolean, total_spent: double, spend_category: string
Project [customer_id#255, order_id#254, order_date#256, product_id#257, amount#258, name#244, email#245, age#246, gender#247, country#248, signup_date#249, last_login#250, is_activ