Turning off AQE (Adaptive Query Execution)

In [0]:
spark.conf.set("spark.sql.adaptive.enabled",False)

In [0]:
spark.conf.get("spark.sql.adaptive.enabled")

Reading Data

In [0]:
df=spark.read.csv("/FileStore/tables/BigMart_Sales.csv",header=True,inferSchema=True)

In [0]:
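# Import only the functions this notebook uses; a wildcard import shadows Python builtins such as sum, min and max
from pyspark.sql.functions import broadcast, col, concat, floor, lit, rand, spark_partition_id, udf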

Spark divides the file into partitions. By default, each partition holds at most 128 MB (spark.sql.files.maxPartitionBytes).

In [0]:
df.rdd.getNumPartitions()
# The file is only ~800 KB, so it fits in a single partition

In [0]:
spark.conf.set("spark.sql.files.maxPartitionBytes",131072)
# Set the maximum partition size to 128 KB (131072 bytes) explicitly

In [0]:
spark.conf.set("spark.sql.files.maxPartitionBytes",134217728)
# Restore the default of 128 MB (134217728 bytes)
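
How a file size turns into a partition count can be approximated in plain Python. The sketch below is a simplified model of Spark's split sizing, not its actual implementation. It assumes a single uncompressed (splittable) file, the default spark.sql.files.openCostInBytes of 4 MB, and a hypothetical default parallelism of 8 cores. The function name `estimate_file_splits` is made up for this illustration.

```python
import math

def estimate_file_splits(file_bytes,
                         max_partition_bytes=128 * 1024 * 1024,
                         open_cost_bytes=4 * 1024 * 1024,
                         parallelism=8):
    """Rough estimate of how many splits one splittable file is cut into."""
    # The file is padded with an "open cost" when sizing the work per core.
    bytes_per_core = (file_bytes + open_cost_bytes) // parallelism
    # The split size is capped by maxPartitionBytes and floored by the open cost.
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_bytes / max_split_bytes)

file_bytes = 800 * 1024  # ~800 KB, as in this notebook

default_splits = estimate_file_splits(file_bytes)
small_splits = estimate_file_splits(file_bytes, max_partition_bytes=131072)

print(default_splits)  # 1 with the default 128 MB cap
print(small_splits)    # 7 with the 128 KB cap
```

Under these assumptions the ~800 KB file yields one split with the default cap and seven with the 128 KB cap. Spark may pack small splits together, so treat the result as an estimate rather than a guarantee.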

Repartitioning: changing the number of partitions without changing the default partition size

In [0]:
df=df.repartition(10)

In [0]:
df.rdd.getNumPartitions()

Getting partition info

In [0]:
df.withColumn("partition_id",spark_partition_id()).display()

Saving into parquet files

In [0]:
df.write.format("parquet").mode("append").save("/FileStore/parquet/")
# Writes 10 Parquet files, one per partition

In [0]:
df_new=spark.read.parquet("/FileStore/parquet/")
# Reads the 10 Parquet files from that folder
df_new.display()

In [0]:
df_new.filter(col("Outlet_Location_Type")=="Tier 2").display()
# The filter returns roughly a third of the rows, but Spark still has to scan all 10 files

### Scanning Optimization

In [0]:
df.write.format("parquet").mode("append").partitionBy("Outlet_Location_Type").save("/FileStore/parquet/opt/")
# Stores the Parquet files in one sub-folder per Outlet_Location_Type value

In [0]:
df_new_2=spark.read.parquet("/FileStore/parquet/opt")
# Reads the partitioned dataset; Outlet_Location_Type is recovered from the folder names
df_new_2.display()

In [0]:
df_new_2.filter(col("Outlet_Location_Type")=="Tier 2").display()
# Spark can now prune partitions: only the Outlet_Location_Type=Tier 2 folder is scanned
# Date columns (e.g. year, month) are a common and usually good choice of partition key
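
Why the partitioned layout helps can be shown without Spark. The sketch below models the folder structure that a partitioned write produces (one sub-folder per value, named column=value) and the pruning step as a simple path filter. The file names and the helper `prune_partitions` are invented for this illustration. Real part files have longer generated names.

```python
# Hypothetical listing of a dataset partitioned by Outlet_Location_Type.
files = [
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 1/part-0.parquet",
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 1/part-1.parquet",
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 2/part-0.parquet",
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 2/part-1.parquet",
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 3/part-0.parquet",
    "/FileStore/parquet/opt/Outlet_Location_Type=Tier 3/part-1.parquet",
]

def prune_partitions(paths, column, value):
    """Keep only the files whose folder matches column=value."""
    marker = f"/{column}={value}/"
    return [path for path in paths if marker in path]

to_scan = prune_partitions(files, "Outlet_Location_Type", "Tier 2")
print(len(to_scan), "of", len(files), "files need to be scanned")
```

The filter value is matched against folder names only, so the files of the other tiers are never opened.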

Partitioning refers to the division of your data into smaller, manageable chunks that can be processed in parallel across your cluster.

Number of Partitions: Directly impacts parallelism. Too few, and you underutilize your cluster. Too many, and overhead from task scheduling and managing small files outweighs benefits.

repartition() vs. coalesce():

repartition(): performs a full shuffle (data is redistributed across the cluster). It can increase or decrease the number of partitions.

coalesce(): merges existing partitions and avoids a full shuffle. It can only decrease the number of partitions.

Use coalesce() when:
You have too many partitions (e.g., after filtering) and want to reduce the number of output files (the "small file problem").

Shuffle: an expensive operation in Spark in which data is redistributed across partitions. It occurs when a transformation requires data from different partitions to be grouped or aggregated together.

Examples: groupBy, join, orderBy, repartition

Why are Shuffles Expensive?

Disk I/O: Data is written to disk by mappers and read from disk by reducers.
Network I/O: Data is transferred across the network between worker nodes.
Serialization/Deserialization: Data needs to be serialized before sending and deserialized upon receipt.
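
The redistribution step itself can be modelled in a few lines of plain Python. This is a toy sketch, not Spark's implementation: partitions are lists, `zlib.crc32` stands in for Spark's own hash function, and the helper name `shuffle_by_key` is made up. It shows why a groupBy needs a shuffle: only after every record with the same key sits in the same partition can each partition aggregate on its own.

```python
import zlib
from collections import defaultdict

def shuffle_by_key(input_partitions, num_output_partitions):
    """Send every (key, value) record to the partition chosen by its key hash."""
    output = defaultdict(list)
    for records in input_partitions:
        for key, value in records:
            # crc32 is only a stand-in for Spark's hash function.
            target = zlib.crc32(key.encode("utf-8")) % num_output_partitions
            output[target].append((key, value))
    return dict(output)

# Before the shuffle, each key is scattered over several partitions.
input_partitions = [
    [("Tier 1", 10), ("Tier 2", 20), ("Tier 3", 30)],
    [("Tier 2", 5), ("Tier 1", 15)],
    [("Tier 3", 25), ("Tier 2", 40)],
]
shuffled = shuffle_by_key(input_partitions, num_output_partitions=4)

# After the shuffle, each partition can sum its own keys independently.
per_partition_sums = {}
for partition_id, records in shuffled.items():
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    per_partition_sums[partition_id] = dict(sums)

# No key appears in two partitions, so merging the results is a plain union.
totals = {key: total
          for sums in per_partition_sums.values()
          for key, total in sums.items()}
print(sorted(totals.items()))  # [('Tier 1', 25), ('Tier 2', 65), ('Tier 3', 55)]
```

In a real cluster, every record that changes partition is serialized, written to disk and sent over the network, which is where the costs listed above come from.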

### Joining Optimization

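In a regular join, the records of both DataFrames are shuffled into spark.sql.shuffle.partitions partitions (200 by default) before the executors perform the join. This is a very expensive operation.

Solution: Broadcast Join

This only works when one DataFrame is small (roughly 5-10 MB; Spark's automatic threshold, spark.sql.autoBroadcastJoinThreshold, defaults to 10 MB), so that the driver can easily broadcast it to the executors.

The partitions of the large DataFrame stay distributed among the executors.

The driver then broadcasts the small DataFrame to every executor.

Each executor joins its own partitions against the local copy, so the large DataFrame is never shuffled.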
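
The mechanism can be sketched in plain Python, without Spark. This is a toy model under stated assumptions: each inner list plays the role of one partition of the large DataFrame, the dictionary plays the role of the broadcast copy, and the helper name `join_partition` is made up for this illustration.

```python
# Small side: customers (customer_id, name, country).
customers = [(1, "Alice", "USA"), (2, "Bob", "UK"), (3, "Charlie", "Canada")]

# Large side: orders (order_id, customer_id, amount), already split into partitions.
order_partitions = [
    [(101, 1, 250), (102, 2, 450)],
    [(103, 1, 300), (104, 9, 120)],  # customer 9 has no match
]

# "Broadcast": build one hash table from the small side; every executor gets a copy.
lookup = {cust_id: (name, country) for cust_id, name, country in customers}

def join_partition(orders, lookup):
    """Left join one partition against the broadcast table, with no data movement."""
    return [
        (order_id, cust_id, amount) + lookup.get(cust_id, (None, None))
        for order_id, cust_id, amount in orders
    ]

# Each partition is joined where it already lives; the orders are never shuffled.
joined = [join_partition(partition, lookup) for partition in order_partitions]
for partition in joined:
    print(partition)
```

The only data that travels is the small lookup table, once per executor. The trade-off is memory: the table has to fit comfortably on the driver and on every executor.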

In [0]:
#small df
data_customers = [
    (1, "Alice", "USA"),
    (2, "Bob", "UK"),
    (3, "Charlie", "Canada")
]
columns_customers = ["customer_id", "name", "country"]

df_customers = spark.createDataFrame(data_customers, columns_customers)

#Big df
data_orders = [
    (101, 1, 250),
    (102, 2, 450),
    (103, 1, 300)
]
columns_orders = ["order_id", "customer_id", "amount"]

df_orders = spark.createDataFrame(data_orders, columns_orders)

In [0]:
df_customers.display()
df_orders.display()

In [0]:
#normal join
df_orders.join(df_customers,df_customers["customer_id"]==df_orders["customer_id"],"left").display()

In [0]:
#optimised join
df_orders.join(broadcast(df_customers),df_customers["customer_id"]==df_orders["customer_id"],"left").display()

In [0]:
df_customers.createOrReplaceTempView("customers")
df_orders.createOrReplaceTempView("orders")

In [0]:
# SQL hint: asks the optimizer to broadcast the table aliased as c
spark.sql(
    """
    SELECT /*+ BROADCAST(c) */ 
    * FROM
    customers c
    JOIN orders o
    ON c.customer_id = o.customer_id
    """
).display()

### Caching and persistence

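cache(): For a DataFrame, shorthand for persist() with the default storage level MEMORY_AND_DISK. Spark tries to keep the data in memory and spills to disk if memory is insufficient. For an RDD, cache() defaults to MEMORY_ONLY. Caching is lazy: the data is stored the first time an action runs on it.

Common storage levels: MEMORY_AND_DISK, MEMORY_ONLY, DISK_ONLY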

In [0]:
df2=spark.read.csv("/FileStore/tables/BigMart_Sales.csv",header=True,inferSchema=True).cache()

### Dynamic resource allocation

Instead of holding a fixed set of resources for the whole job, Spark can scale the number of executors up or down according to the workload.

spark.dynamicAllocation.enabled = true

spark.dynamicAllocation.minExecutors = 1

spark.dynamicAllocation.maxExecutors = 10

spark.dynamicAllocation.initialExecutors = 2

spark.shuffle.service.enabled = true


### Adaptive Query Execution

AQE is a performance optimization feature that re-optimizes query plans at runtime, based on statistics collected from the actual data.

Characteristics: 

1. Dynamic Join Strategy Selection :  Automatically switches join strategies (e.g., broadcast vs. shuffle join) based on actual data sizes.

2. Skew Join Optimization : Detects skewed partitions and breaks them into smaller ones to balance work across executors.

3. Coalescing Shuffle Partitions : Dynamically reduces the number of shuffle partitions at runtime to avoid small or empty partitions.
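
The third characteristic, coalescing shuffle partitions, can be illustrated with a small model. The sketch below is a simplification and not Spark's actual rule: it merges neighbouring shuffle partitions greedily until a target size would be exceeded. The 64 MB target mirrors the default of spark.sql.adaptive.advisoryPartitionSizeInBytes; the partition sizes and the helper name `coalesce_shuffle_partitions` are invented for this illustration.

```python
def coalesce_shuffle_partitions(sizes, target_size):
    """Group adjacent partitions so that each group stays within target_size."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical shuffle partition sizes in MB, including some empty ones.
sizes_mb = [10, 0, 5, 40, 30, 0, 0, 70, 2]
groups = coalesce_shuffle_partitions(sizes_mb, target_size=64)

print(groups)  # [[0, 1, 2, 3], [4, 5, 6], [7], [8]]
print(len(sizes_mb), "partitions ->", len(groups), "tasks")
```

Nine shuffle partitions become four tasks here. The oversized partition (70 MB) is left alone by this step; splitting it is the job of the skew handling described in point 2.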

### Dynamic Partition Pruning

Dynamic Partition Pruning (DPP) is an optimization feature in Apache Spark that automatically prunes unnecessary partitions at runtime, reducing I/O and improving query performance.

When you join a fact table (large, partitioned) with a dimension table (filtering values), Spark dynamically determines which partitions to read based on the actual join keys during execution, instead of reading all partitions.

SELECT *
FROM sales s
JOIN regions r
  ON s.region = r.region
WHERE r.country = 'USA'

**IMPORTANT**
The fact table must be partitioned on the same column that is used in the join condition.

In [0]:
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", True)

In [0]:
df_not_partitioned=spark.read.csv("/FileStore/tables/BigMart_Sales.csv",header=True,inferSchema=True)

In [0]:
df_new_2.join(df_not_partitioned.filter(col("Outlet_Location_Type")=="Tier 3"),df_not_partitioned["Outlet_Location_Type"]==df_new_2["Outlet_Location_Type"],"inner").display()

In [0]:
df.groupBy("Outlet_Location_Type").count().display()

### Broadcast Variable

Broadcast variables allow you to efficiently distribute a read-only variable (such as a lookup table or another small dataset) to all worker nodes.

In [0]:
# look-up dictionary
product_lookup_data = {
"prod_0":"Laptop",
"prod_1":"Mouse",
"prod_2":"Shirt"
}

df_products=spark.createDataFrame([
    ("prod_0",),
    ("prod_1",),
    ("prod_2",),
],["product_id"])

df_products.display()


In [0]:
broad_products=spark.sparkContext.broadcast(product_lookup_data)

In [0]:
broad_products.value.get("prod_0")

In [0]:
def my_func(x):
    return broad_products.value.get(x)

my_udf=udf(my_func)

In [0]:
df_products.withColumn("product_name",my_udf(col("product_id"))).display()

### Salting OOM

Salting is a data engineering technique used to mitigate data skew, where a small number of keys dominate the dataset and cause certain tasks to consume too much memory (OOM = Out Of Memory).

It works by applying a random or deterministic "salt" value to the skewed key, so that the load is spread more evenly across partitions.

In [0]:
df_skew=spark.createDataFrame([
  (1,700),
  (2,200),
  (3,300),
  (2,600),
  (1,150),
  (1,250),
  (1,300),
],["cust_id","amount"])

df_skew.display()

In [0]:
df_skew=df_skew.withColumn("salt",floor(rand()*3))
df_skew=df_skew.withColumn("cust_id_salt",concat("cust_id",lit("-"),"salt"))

In [0]:
df_skew.display()

In [0]:
df_skew.groupBy("cust_id_salt").sum("amount").display()
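
Grouping by the salted key gives only partial sums: customer 1 may now appear in up to three rows. A second aggregation on the original key is needed to get the final totals. The plain-Python sketch below walks through both stages on the same sample rows. It is a model of the idea rather than Spark code, and the names `NUM_SALTS`, `partial_sums` and `totals` are chosen for this illustration.

```python
import random
from collections import defaultdict

# Same sample rows as df_skew: (cust_id, amount). Customer 1 is the skewed key.
rows = [(1, 700), (2, 200), (3, 300), (2, 600), (1, 150), (1, 250), (1, 300)]

NUM_SALTS = 3
rng = random.Random(42)  # seeded only to make the run repeatable

# Stage 1: aggregate on (cust_id, salt), so the skewed key is spread over several groups.
partial_sums = defaultdict(int)
for cust_id, amount in rows:
    salt = rng.randrange(NUM_SALTS)
    partial_sums[(cust_id, salt)] += amount

# Stage 2: drop the salt and aggregate the much smaller partial result on cust_id.
totals = defaultdict(int)
for (cust_id, _salt), subtotal in partial_sums.items():
    totals[cust_id] += subtotal

print(sorted(totals.items()))  # [(1, 1400), (2, 800), (3, 300)]
```

The final totals do not depend on which salts were drawn. In the notebook, the second stage would correspond to grouping the salted result by cust_id and summing the partial sums again.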