# Deeper Dive Into Optimization Techniques

## Option 1 : Avoid Shuffling and Minimize Data Movement
 - 1.1 : Using Broadcast Joins Instead of Shuffling for Joins
 - 1.2 : Using Bucketing Instead of Shuffling for Joins

## Option 2 : Optimize Data Partitioning
 - 2.1 : Repartition for Parallelism (be careful: too many partitions increase task-scheduling overhead. Also, a dataset with 10,000 rows is not large; 10M to 1B rows is.)
 - 2.2 : Reduce Partitions for Writing - if you write data as a single file, reduce the number of partitions

## Option 3 : Use Efficient Data Formats
 - 3.1 : Use Parquet Over CSV

## Option 4 : Optimize Spark Execution Plan

## Option 5 : Use Caching and Persistence Wisely

## Option 6 : Skew Handling for Large Datasets

## Option 7 : Optimize Joins for Large Datasets

## Option 8 : Parallel Processing and UDF Optimization

## Option 9 : Optimize Read Performance With Indexing

## Option 1 : Avoid Shuffling and Minimize Data Movement
So, what are the alternatives? Here they are:

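- 1.1 Use `Broadcast Joins` for small tables - when joining a small DataFrame with a large DataFrame, broadcast the small one so the large one does not have to be shuffled. This only works when the small DataFrame fits in memory on every executor
- 1.2 Use `Bucketing` instead of shuffling for joins on a column you join on repeatedly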
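
To make the difference concrete, here is a minimal plain-Python sketch of the two join strategies. It is a simulation, not Spark's implementation: `shuffle_join`, `broadcast_join`, the sample rows, and the "rows moved" counters are all made up for illustration, and the shuffle variant simply counts every row as moved.

```python
from collections import defaultdict


def shuffle_join(large, small, num_partitions):
    """Join two lists of (key, value) rows by hash-partitioning BOTH sides."""
    large_parts = defaultdict(list)
    small_parts = defaultdict(list)
    rows_moved = 0
    # Every row of both tables is sent to the partition that owns its key.
    for key, value in large:
        large_parts[key % num_partitions].append((key, value))
        rows_moved += 1
    for key, value in small:
        small_parts[key % num_partitions].append((key, value))
        rows_moved += 1
    joined = []
    for p in range(num_partitions):
        lookup = dict(small_parts[p])
        for key, value in large_parts[p]:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return sorted(joined), rows_moved


def broadcast_join(large_partitions, small):
    """Copy the small side to every partition; rows of the large side stay put."""
    lookup = dict(small)
    rows_moved = len(small) * len(large_partitions)  # one copy per partition
    joined = []
    for partition in large_partitions:
        for key, value in partition:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return sorted(joined), rows_moved


# Hypothetical data: 1,000 "large" rows and 10 "small" rows.
large = [(i, f"name_{i}") for i in range(1000)]
small = [(i, i / 100) for i in range(0, 1000, 100)]
large_partitions = [large[i::4] for i in range(4)]  # large side already split 4 ways

shuffled, moved_shuffle = shuffle_join(large, small, 4)
broadcasted, moved_broadcast = broadcast_join(large_partitions, small)
print(f"rows moved with shuffle:   {moved_shuffle}")
print(f"rows moved with broadcast: {moved_broadcast}")
```

Both functions return the same joined rows; only the amount of data that has to travel differs, and that gap grows with the size of the large table.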

### 1.1 Using Broadcast Joins Instead of Shuffling for Joins


In [12]:
# Using Broadcast joins for Small Tables
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
import time

spark = SparkSession.builder.appName('BroadcastJoinExample').getOrCreate()

df_large = spark.read.csv('./resources/8_large_file.csv', header=True, inferSchema=True)
df_small = spark.read.csv('./resources/9_small_file.csv', header=True, inferSchema=True)

start_time = time.time()
df_joined = df_large.join(broadcast(df_small), 'id')
df_joined.count()
end_time = time.time()
print(f"Execution Time (With Broadcast): {end_time - start_time} seconds")



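# Without an explicit broadcast hint.
# Note: Spark may still broadcast df_small1 on its own when it is smaller than
# spark.sql.autoBroadcastJoinThreshold (10 MB by default), which can make both timings look alike.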
df_large1 = spark.read.csv('./resources/8_large_file.csv', header=True, inferSchema=True)
df_small1 = spark.read.csv('./resources/9_small_file.csv', header=True, inferSchema=True)

start_time = time.time()
df_joined1 = df_large1.join(df_small1, 'id')
df_joined1.count()
end_time = time.time()
print(f"Execution Time (Without Broadcast): {end_time - start_time} seconds")


# use .explain() to view query plans
print('----------------NOW WITH BROADCAST (Look for (strategy=broadcast))--------------')
df_joined.explain(True)
print('=================================================================================')
print('----------------NOW WITH NO BROADCAST (Look for Shuffle Operations)--------------')
df_joined1.explain(True)

spark.stop()

25/03/29 21:45:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/03/29 21:45:54 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/03/29 21:45:54 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/03/29 21:45:54 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
25/03/29 21:45:54 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.


Execution Time (With Broadcast): 0.1299881935119629 seconds
Execution Time (Without Broadcast): 0.13207006454467773 seconds
----------------NOW WITH BROADCAST (Look for (strategy=broadcast))--------------
== Parsed Logical Plan ==
'Join UsingJoin(Inner, [id])
:- Relation [id#957,name#958,salary#959,department#960] csv
+- ResolvedHint (strategy=broadcast)
   +- Relation [id#982,bonus_percentage#983] csv

== Analyzed Logical Plan ==
id: int, name: string, salary: int, department: string, bonus_percentage: double
Project [id#957, name#958, salary#959, department#960, bonus_percentage#983]
+- Join Inner, (id#957 = id#982)
   :- Relation [id#957,name#958,salary#959,department#960] csv
   +- ResolvedHint (strategy=broadcast)
      +- Relation [id#982,bonus_percentage#983] csv

== Optimized Logical Plan ==
Project [id#957, name#958, salary#959, department#960, bonus_percentage#983]
+- Join Inner, (id#957 = id#982), rightHint=(strategy=broadcast)
   :- Filter isnotnull(id#957)
   :  +- Rela

### 1.2 Using Bucketing Instead of Shuffling for Joins

- If you frequently join on the same column, use bucketing to pre-partition (and optionally pre-sort) the data on that column at write time, so later joins on it can skip the shuffle
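
The idea can be sketched in plain Python before looking at the Spark version. This is only an illustration under stated assumptions: `write_bucketed`, `merge_join`, and `bucketed_join` are hypothetical helpers, `key % num_buckets` stands in for Spark's own hash function, keys are assumed unique on each side, and the sample rows are invented.

```python
def write_bucketed(rows, num_buckets):
    """Simulate bucketing + sorting by key. This cost is paid once, at write time."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))  # stand-in for Spark's hash
    return [sorted(bucket, key=lambda row: row[0]) for bucket in buckets]


def merge_join(left, right):
    """Two-pointer merge of two key-sorted lists (keys assumed unique per side)."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key == right_key:
            out.append((left_key, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1
    return out


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i with bucket i only; no row has to change buckets."""
    assert len(left_buckets) == len(right_buckets), "bucket counts must match"
    out = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        out.extend(merge_join(left_bucket, right_bucket))
    return out


# Hypothetical rows shaped like the two CSV files: (id, name) and (id, bonus_percentage).
employees = [(i, f"name_{i}") for i in range(20)]
bonuses = [(i, i * 0.5) for i in range(0, 20, 3)]  # ids 0, 3, 6, ..., 18

employee_buckets = write_bucketed(employees, 8)
bonus_buckets = write_bucketed(bonuses, 8)
joined = sorted(bucketed_join(employee_buckets, bonus_buckets))
print(joined[:3])
```

Because both tables use the same key and the same bucket count, matching ids always land in the same bucket number, which is why the join never has to look across buckets.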


In [18]:
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('BucketOptimization').getOrCreate()

df_large = spark.read.csv('./resources/8_large_file.csv', header=True, inferSchema=True)
df_small = spark.read.csv('./resources/9_small_file.csv', header=True, inferSchema=True)

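# Write both tables bucketed and sorted by the join key (lets later joins avoid shuffling).
# Using the same key and bucket count (8) on both sides makes their buckets line up.
df_large.write.mode('overwrite').bucketBy(8, 'id').sortBy('id').saveAsTable('large_bucketed_table')
df_small.write.mode('overwrite').bucketBy(8, 'id').sortBy('id').saveAsTable('small_bucketed_table')

# Baseline: join the CSV-backed DataFrames, which are not bucketed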
start_time = time.time()
df_joined = df_large.join(df_small, 'id')
df_joined.count()
end_time = time.time()
print(f'Execution Time {end_time - start_time:.2f} seconds')
df_joined.explain(True)

# Now perform joins (No Shuffle) - Optimized join using Bucketed tables
df_large_bucketed = spark.table('large_bucketed_table')
df_small_bucketed = spark.table('small_bucketed_table')
start_time = time.time()
df_joined = df_large_bucketed.join(df_small_bucketed, 'id')
df_joined.count()
end_time = time.time()
print(f'Execution Time {end_time - start_time:.2f} seconds')
df_joined.explain(True)

Execution Time 0.11 seconds
== Parsed Logical Plan ==
'Join UsingJoin(Inner, [id])
:- Relation [id#1435,name#1436,salary#1437,department#1438] csv
+- Relation [id#1460,bonus_percentage#1461] csv

== Analyzed Logical Plan ==
id: int, name: string, salary: int, department: string, bonus_percentage: double
Project [id#1435, name#1436, salary#1437, department#1438, bonus_percentage#1461]
+- Join Inner, (id#1435 = id#1460)
   :- Relation [id#1435,name#1436,salary#1437,department#1438] csv
   +- Relation [id#1460,bonus_percentage#1461] csv

== Optimized Logical Plan ==
Project [id#1435, name#1436, salary#1437, department#1438, bonus_percentage#1461]
+- Join Inner, (id#1435 = id#1460)
   :- Filter isnotnull(id#1435)
   :  +- Relation [id#1435,name#1436,salary#1437,department#1438] csv
   +- Filter isnotnull(id#1460)
      +- Relation [id#1460,bonus_percentage#1461] csv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [id#1435, name#1436, salary#1437, department#1438, bonus_

## Option 2 : Optimize Data Partitioning

### 2.1 : Repartition for Parallelism

In [31]:
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('RepartitionOptimization').getOrCreate()

df_large = spark.read.csv('./resources/8_large_file.csv', header=True, inferSchema=True)
# Spark automatically assigns a default number of partitions


start_time = time.time()
df_large.count()
end_time = time.time()
print(f'Execution Time (for Default Partitions) : {end_time - start_time:.2f} seconds')

# Now, increase partitions for parallel processing
# (repartition() itself triggers a full shuffle, which adds cost on small data)
df_repartitioned = df_large.repartition(10) # Increase to 10 partitions


start_time = time.time()
df_repartitioned.count()
end_time = time.time()
print(f'Execution Time (for Repartitioned) : {end_time - start_time:.2f} seconds')

Execution Time (for Default Partitions) : 0.07 seconds
Execution Time (for Repartitioned) : 0.09 seconds
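
On a file this small, the extra shuffle makes the repartitioned count slower, not faster. One common rule of thumb, sketched below, is to size partitions by data volume instead of picking a fixed number. `suggest_partitions` is a hypothetical helper, the input sizes are invented, and the 128 MB target is an assumption that mirrors the default of `spark.sql.files.maxPartitionBytes`; tune it for your cluster.

```python
import math


def suggest_partitions(total_bytes, target_partition_bytes=128 * 1024**2, min_partitions=1):
    """Rule-of-thumb partition count: enough partitions that each holds roughly
    target_partition_bytes, but never fewer than min_partitions."""
    if total_bytes <= 0:
        return min_partitions
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


small_csv_bytes = 500 * 1024       # ~500 KB, hypothetical size of a 10,000-row file
large_table_bytes = 50 * 1024**3   # 50 GB, hypothetical

print(suggest_partitions(small_csv_bytes))                    # 1
print(suggest_partitions(large_table_bytes))                  # 400
print(suggest_partitions(small_csv_bytes, min_partitions=8))  # 8
```

Setting `min_partitions` to the number of available cores is one way to keep every core busy on mid-sized data, at the cost of smaller partitions.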


### 2.2 : Reduce Partitions for Writing Using coalesce()
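Each partition is written as its own file, so reduce the number of partitions before writing when you want fewer output files (or a single one). `coalesce()` merges existing partitions without a full shuffle, while `repartition()` always shuffles.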

In [39]:
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName('ReducePartitionsForWriting').getOrCreate()

df_large = spark.read.csv('./resources/8_large_file.csv', header=True, inferSchema = True)

start_time = time.time()
df_repartitioned = df_large.repartition(10) # Increase to 10 partitions
df_repartitioned.write.mode('overwrite').parquet('./output/writeWith10Partitions')
end_time = time.time()
print(f'Initial partitions: {df_large.rdd.getNumPartitions()}')
print(f'Execution Time (for Repartitioned) : {end_time - start_time:.2f} seconds')


# Reduce partitions using coalesce()
start_time = time.time()
df_coalesced = df_large.coalesce(1) # reducing to 1 partition
df_coalesced.write.mode('overwrite').parquet('./output/writeWith1Partition')
end_time = time.time()
print(f'Partitions after coalesce: {df_coalesced.rdd.getNumPartitions()}')
print(f'Execution Time (for coalesced - reducing to 1 partition) : {end_time - start_time:.2f} seconds')


Initial partitions: 1
Execution Time (for Repartitioned) : 0.25 seconds
Partitions after coalesce: 1
Execution Time (for coalesced - reducing to 1 partition) : 0.16 seconds
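
The difference between the two calls can be pictured with a small plain-Python simulation. It is a sketch only, not how Spark implements either operation: the `repartition` and `coalesce` functions below are stand-ins written for this example, and the partitions are ordinary lists.

```python
def repartition(partitions, n):
    """Full shuffle: every row is dealt round-robin into n new partitions."""
    rows = [row for part in partitions for row in part]
    return [rows[i::n] for i in range(n)]


def coalesce(partitions, n):
    """Merge whole existing partitions into at most n groups; rows are not re-dealt."""
    n = min(n, len(partitions))  # coalescing cannot increase the partition count
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index % n].extend(part)
    return merged


# Four hypothetical partitions of five rows each.
partitions = [list(range(start, start + 5)) for start in range(0, 20, 5)]

ten_parts = repartition(partitions, 10)
one_part = coalesce(partitions, 1)
two_parts = coalesce(partitions, 2)

print(len(ten_parts), [len(p) for p in ten_parts])
print(len(one_part), one_part[0][:5])
print(two_parts[0])
```

In the coalesced result each original partition stays intact inside its new group, which is why reducing the partition count this way is cheaper than a full repartition.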
