# Performance Optimization (Continued)
# Broadcast Joins
- When joining a large DataFrame with one small enough to fit in each executor's memory, use a broadcast join to improve performance.


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('TestApp').getOrCreate()

df_large = spark.read.csv('./resources/5_person.csv', header=True, inferSchema=True)
df_small = spark.read.csv('./resources/6_country.csv', header=True, inferSchema=True)

df_joined = df_large.join(broadcast(df_small), "id")
df_joined.show()

+---+-------+---+---------+
| id|   name|age|  country|
+---+-------+---+---------+
|  1|  Alice| 25|      USA|
|  2|    Bob| 30|   Canada|
|  3|Charlie| 35|       UK|
|  4|  David| 40|Australia|
|  5|    Eva| 45|  Germany|
+---+-------+---+---------+



### Partitioning for Performance
- Repartitioning or coalescing a DataFrame changes how its rows are spread across partitions, which can significantly improve performance on large datasets.

In [10]:
# Repartition: performs a full shuffle into the requested number of partitions (more or fewer);
# useful for raising parallelism before wide transformations
df_repartitioned = df_joined.repartition(10)
df_repartitioned.show()

+---+-------+---+---------+
| id|   name|age|  country|
+---+-------+---+---------+
|  5|    Eva| 45|  Germany|
|  1|  Alice| 25|      USA|
|  2|    Bob| 30|   Canada|
|  4|  David| 40|Australia|
|  3|Charlie| 35|       UK|
+---+-------+---+---------+



In [12]:
# Coalesce: reduces the number of partitions without a full shuffle, useful when writing out data
df_coalesced = df_joined.coalesce(1) # Merge into 1 partition before saving
df_coalesced.show()

+---+-------+---+---------+
| id|   name|age|  country|
+---+-------+---+---------+
|  1|  Alice| 25|      USA|
|  2|    Bob| 30|   Canada|
|  3|Charlie| 35|       UK|
|  4|  David| 40|Australia|
|  5|    Eva| 45|  Germany|
+---+-------+---+---------+



# Handling Large Datasets
- When working with big data, the main strategies are partitioning, caching, and broadcasting.
- Avoid shuffling data. A shuffle, in which Spark moves data between partitions across the cluster, is costly. Minimize it by:
    - Using `broadcast joins` when one DataFrame is small
    - Using `partitionBy` when saving data to disk, so later reads can skip partitions they do not need

- Broadcasting the smaller table sends a copy to every executor, so the larger table does not need to be shuffled.


# Summary: The Top 10 PySpark Optimizations

| # | Optimization | Why? |
|---|---|---|
| 1️⃣ | Broadcast small DataFrames | Avoids expensive shuffles |
| 2️⃣ | Use bucketing for joins | Reduces data movement |
| 3️⃣ | Repartition wisely | Balances parallelism and efficiency |
| 4️⃣ | Use Parquet instead of CSV | Faster reads and better compression |
| 5️⃣ | Check `.explain()` before running queries | Exposes performance bottlenecks in the query plan |
| 6️⃣ | Cache DataFrames carefully | Avoids recomputation |
| 7️⃣ | Handle data skew using salting | Distributes data evenly |
| 8️⃣ | Use sort-merge joins for large tables | Scales to joins where neither table fits in memory |
| 9️⃣ | Use `pandas_udf` for custom functions | Boosts UDF performance |
| 🔟 | Z-Order and indexing for faster reads | Speeds up queries |
