 
 	
## **1. Cache a DataFrame and measure performance before and after**

**What is Caching?**

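Caching keeps a DataFrame in memory so that later actions can reuse it instead of recomputing it from the source. cache() itself is lazy: the data is stored when the first action after it (like count()) runs, and the actions that follow run faster.
It improves performance when the same DataFrame is used multiple times.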

In [0]:
# Read CSV with header and infer schema
df = spark.read.csv("dbfs:/FileStore/tables/employees-3.csv", header=True, inferSchema=True)

# Show sample data
df.show(3)

# Optional: Check schema
df.printSchema()

+-----------+----------+---------+--------+------------+---------+--------+------+--------------+----------+-------------+
|EMPLOYEE_ID|FIRST_NAME|LAST_NAME|   EMAIL|PHONE_NUMBER|HIRE_DATE|  JOB_ID|SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|
+-----------+----------+---------+--------+------------+---------+--------+------+--------------+----------+-------------+
|        198|    Donald| OConnell|DOCONNEL|650.507.9833|21-JUN-07|SH_CLERK|  2600|            - |       124|           50|
|        199|   Douglas|    Grant|  DGRANT|650.507.9844|13-JAN-08|SH_CLERK|  2600|            - |       124|           50|
|        200|  Jennifer|   Whalen| JWHALEN|515.123.4444|17-SEP-03| AD_ASST|  4400|            - |       101|           10|
+-----------+----------+---------+--------+------------+---------+--------+------+--------------+----------+-------------+
only showing top 3 rows

root
 |-- EMPLOYEE_ID: integer (nullable = true)
 |-- FIRST_NAME: string (nullable = true)
 |-- LAST_NAME: string 

In [0]:
# Measure performance before caching
import time

start = time.time()
df.count()  # Action triggers reading
end = time.time()

print("⏱ Time before caching:", end - start)

⏱ Time before caching: 2.3866641521453857


In [0]:
# Cache the DataFrame
df.cache()

Out[4]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]

In [0]:
# Measure performance after caching
start = time.time()
df.count()  # First action after cache(): computes the result and fills the cache; later actions reuse it
end = time.time()

print("⚡ Time after caching:", end - start)


⚡ Time after caching: 0.8273754119873047
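
Because cache() is lazy, a single timing taken right after it mixes the cost of computing the result with the cost of filling the cache. The sketch below separates three measurements: an uncached count, the count that fills the cache, and a count served from the cache. It assumes df is a Spark DataFrame that has not been cached yet; timed() and compare_cache_timings() are helper names made up for this illustration, not part of PySpark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_cache_timings(df):
    """Time count() uncached, while filling the cache, and from the cache.

    `df` is assumed to be a Spark DataFrame; only count(), cache() and
    unpersist() are called on it.
    """
    _, uncached = timed(df.count)   # reads from the source
    df.cache()                      # lazy: only marks df for caching
    _, filling = timed(df.count)    # computes the result and populates the cache
    _, cached = timed(df.count)     # served from the cached data
    df.unpersist()                  # release the cached blocks
    return {"uncached": uncached, "filling": filling, "cached": cached}
```

In the notebook this could be called as compare_cache_timings(df). On a file as small as the one used here the differences may be dominated by job start-up overhead, so the numbers are only indicative.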


## **Repartition a DataFrame to improve processing speed**

**What is repartition() in PySpark?**

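repartition() increases or decreases the number of partitions (splits of your data) by redistributing the rows with a full shuffle.

More partitions allow more parallelism, provided the cluster has enough cores to process them at the same time.

It helps when data is skewed or unevenly distributed across partitions.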

**When to Use**

Use repartition() if:

- You read a small number of large files (few partitions).

- You're preparing data for parallel processing (joins, aggregations).

- Some stages are taking too long because of data skew.
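
Whether any of these conditions applies can be checked before repartitioning by looking at how many rows each partition holds. The sketch below is a rough way to do that, assuming df is a Spark DataFrame. partition_row_counts() and skew_ratio() are hypothetical helper names, and dividing the largest partition by the mean size is just one simple way to put a number on skew.

```python
def partition_row_counts(df):
    """Return the number of rows in each partition of a Spark DataFrame.

    glom() turns every partition into one list on the executors, so only
    one integer per partition is collected to the driver.
    """
    return df.rdd.glom().map(len).collect()


def skew_ratio(row_counts):
    """Largest partition divided by the mean partition size (1.0 means even)."""
    if not row_counts or sum(row_counts) == 0:
        return 1.0
    mean = sum(row_counts) / len(row_counts)
    return max(row_counts) / mean
```

A ratio close to 1.0 suggests the rows are spread evenly; a ratio close to the number of partitions means almost everything sits in one partition. Note that glom() materializes each partition as a list, so this check is better suited to exploration than to very large partitions.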





In [0]:
df = spark.read.csv("dbfs:/FileStore/tables/employees.csv", header=True, inferSchema=True)


In [0]:
df

Out[4]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]

In [0]:
print("👉 Initial partitions:", df.rdd.getNumPartitions())

👉 Initial partitions: 1


In [0]:
# Repartition into 4 partitions
df_repart = df.repartition(4)

# Check again
print("✅ After repartitioning:", df_repart.rdd.getNumPartitions())

✅ After repartitioning: 4


In [0]:
df_by_dept = df.repartition("DEPARTMENT_ID")
print("📊 Repartitioned by DEPARTMENT_ID:", df_by_dept.rdd.getNumPartitions())

📊 Repartitioned by DEPARTMENT_ID: 1
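
The result of 1 here is probably (an assumption, not verified on this cluster) the effect of Adaptive Query Execution, which can merge small shuffle partitions when repartition() is given only a column. Passing an explicit partition count together with the column is one way to request a fixed number. A small sketch of that call follows; repartition_by_column() is a hypothetical wrapper name.

```python
def repartition_by_column(df, column, num_partitions):
    """Hash-partition `df` on `column` into a fixed number of partitions.

    Uses DataFrame.repartition(numPartitions, *cols); rows with the same
    value in `column` end up in the same partition.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return df.repartition(num_partitions, column)
```

For example, repartition_by_column(df, "DEPARTMENT_ID", 4).rdd.getNumPartitions() would be expected to report 4, although some of those partitions can be empty when several departments hash to the same partition.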


## **Prepare a list of common optimization techniques**

**Common Spark Optimization Techniques**
1. Use DataFrame API over RDDs

- DataFrames are optimized via Catalyst Optimizer.

- Avoid using low-level RDDs unless needed.
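
As an illustration of this first point, an aggregation written against the DataFrame API leaves the choice of physical plan to Catalyst. A minimal sketch, assuming df has the DEPARTMENT_ID and SALARY columns shown earlier; the function name is made up for this example.

```python
def average_salary_by_department(df):
    """Average SALARY per DEPARTMENT_ID using the DataFrame API.

    The same result via df.rdd would need hand-written map and
    reduceByKey steps that the Catalyst optimizer cannot inspect.
    """
    return df.groupBy("DEPARTMENT_ID").avg("SALARY")
```

Calling .show() on the returned DataFrame triggers the computation.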

2. Enable Caching & Persistence

- Use .cache() or .persist() to reuse DataFrames.

- Reduces recomputation in iterative workloads.

In [0]:
df.cache()

Out[3]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]
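
Cached data occupies executor memory until it is released, so it can help to limit the cache to the work that actually reuses it. The sketch below shows one way to do that, assuming df is a Spark DataFrame. The cached() context manager is a hypothetical helper, not part of PySpark.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache `df` for the duration of a with-block, then release it."""
    df.cache()          # lazy: filled by the first action inside the block
    try:
        yield df
    finally:
        df.unpersist()  # free the cached blocks even if the block raises
```

Inside a block opened with "with cached(df) as c:", several actions such as c.count() and c.filter(c.SALARY > 5000).count() would reuse the same cached data.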

3. Use select() Instead of *

- Select only the columns you need to reduce data shuffling and memory usage.



In [0]:
df.select("employee_id", "salary")

Out[5]: DataFrame[employee_id: int, salary: int]

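4. Filter Early (Predicate Pushdown)

- Use filter() as early as possible so that fewer rows flow through the later stages.

- For data sources that support it, Spark can push the filter down to the scan so that unneeded rows are skipped while reading.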

In [0]:
df.filter(df.SALARY > 5000)


Out[7]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]
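
Points 3 and 4 are usually combined: narrow the rows and columns first, then do the expensive work. A minimal sketch, assuming the column names from the employees file; the function name and the default threshold are illustrative only.

```python
def high_earner_counts(df, min_salary=5000):
    """Count employees per department among those earning above min_salary."""
    return (
        df.filter(df["SALARY"] > min_salary)    # drop rows first
          .select("DEPARTMENT_ID", "SALARY")    # keep only the needed columns
          .groupBy("DEPARTMENT_ID")
          .count()
    )
```

Calling .explain() on the returned DataFrame is one way to see whether the filter and the column list reached the scan; whether they do depends on the file format and the Spark version.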

5. Repartition and Coalesce

- Use .repartition(n) to increase parallelism or rebalance data; it performs a full shuffle.

- Use .coalesce(n) to reduce the number of partitions without a full shuffle.

In [0]:
df.repartition("department_id")

Out[8]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]
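
The difference between the two calls can be captured in a small helper: coalesce() merges existing partitions without a full shuffle, so it is only useful for reducing the count, while repartition() shuffles and can go either way. A sketch under that assumption follows; resize_partitions() is a hypothetical name.

```python
def resize_partitions(df, target):
    """Return `df` with `target` partitions, avoiding a shuffle when shrinking."""
    if target < 1:
        raise ValueError("target must be at least 1")
    current = df.rdd.getNumPartitions()
    if target < current:
        return df.coalesce(target)      # merges partitions, no full shuffle
    if target > current:
        return df.repartition(target)   # full shuffle, spreads rows evenly
    return df                           # already at the requested count
```

Because coalesce() only merges partitions, the result can be uneven; if balanced partitions matter more than avoiding the shuffle, calling repartition() directly may be the better choice.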

6. Broadcast Join (for Small Tables)

- Avoid shuffling large data during joins.

- Use broadcast() for small lookup tables.

In [0]:
from pyspark.sql.functions import broadcast

# Self-join used only to demonstrate the syntax; in practice, broadcast a small lookup table
df.join(broadcast(df), "EMPLOYEE_ID")


Out[15]: DataFrame[EMPLOYEE_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int, FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: string, JOB_ID: string, SALARY: int, COMMISSION_PCT: string, MANAGER_ID: string, DEPARTMENT_ID: int]
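
In practice the broadcast side is a separate, small table, not the DataFrame itself. The sketch below assumes a hypothetical lookup DataFrame (for example a departments table) that shares the DEPARTMENT_ID column. It uses the DataFrame.hint("broadcast") form, which requests the same join strategy as wrapping the table in broadcast().

```python
def join_with_small_lookup(df, lookup_df, key="DEPARTMENT_ID", how="left"):
    """Join `df` to a small `lookup_df`, hinting Spark to broadcast the lookup.

    The hint is a request: Spark ships `lookup_df` to every executor so the
    large side does not need to be shuffled. The lookup table is assumed to
    be small enough to fit in executor memory.
    """
    return df.join(lookup_df.hint("broadcast"), on=key, how=how)
```

Joining on the key name, as done here, keeps a single DEPARTMENT_ID column in the result instead of one copy from each side.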