# DAY 1 - NOTEBOOK 2: TRANSFORMATIONS vs ACTIONS & LAZY EVALUATION

---

## 🎯 Learning Objectives:
1. Understand Transformations vs Actions
2. Master the concept of Lazy Evaluation
3. Learn Narrow vs Wide Transformations
4. Read and understand Execution Plans
5. Optimize query performance

---

## SETUP: CREATE SPARK SESSION

In [33]:
from pyspark.sql import SparkSession
# NOTE: the wildcard import shadows Python built-ins such as sum, min, and max
# with Spark column functions for the rest of the notebook
from pyspark.sql.functions import *
from pyspark.sql.functions import col, count, avg, min, max
from pyspark.sql.types import *
import time

# Stop any existing session so the new configuration takes effect
try:
    spark.stop()
    print("✅ Stopped existing session")
    time.sleep(2)
except Exception:
    # No session defined yet (NameError) or it could not be stopped
    pass

# Create new session
spark = SparkSession.builder \
    .appName("Day1-TransformationsActions") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "512m") \
    .config("spark.executor.cores", "1") \
    .config("spark.cores.max", "2") \
    .config("spark.sql.shuffle.partitions", "2") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"   App ID: {spark.sparkContext.applicationId}")
# 4040 is the default UI port; Spark tries 4041, 4042, ... when it is already taken
print("   UI: http://localhost:4040")

✅ Stopped existing session
✅ Spark Session Created
   App ID: app-20251222155214-0002
   UI: http://localhost:4040


25/12/22 15:52:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


---

## PART 1: TRANSFORMATIONS vs ACTIONS

### 📚 Theory:

```
┌──────────────────────────────────────────────────────────────┐
│                  TRANSFORMATIONS                             │
├──────────────────────────────────────────────────────────────┤
│  • LAZY (not executed immediately)                           │
│  • Return a new DataFrame                                    │
│  • Build the execution plan                                  │
│  • Do not trigger computation                                │
│                                                              │
│  Examples:                                                   │
│    - select()      - filter()       - withColumn()           │
│    - groupBy()     - join()         - orderBy()              │
│    - distinct()    - drop()         - union()                │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│                     ACTIONS                                  │
├──────────────────────────────────────────────────────────────┤
│  • EAGER (executed immediately)                              │
│  • Trigger computation                                       │
│  • Return results to the driver                              │
│  • Run all preceding transformations                         │
│                                                              │
│  Examples:                                                   │
│    - show()        - count()        - collect()              │
│    - take()        - first()        - write.save()           │
│    - head()        - foreach()      - rdd.reduce()           │
└──────────────────────────────────────────────────────────────┘
```
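To make the lazy/eager split concrete, here is a toy model in plain Python. It is a sketch of the idea only, not how Spark is implemented: the class `LazyPipeline` and the helper `value_gt_50` are hypothetical names invented for this illustration. Transformations just record a step; only an action walks the recorded steps.

```python
# Toy model of lazy evaluation (hypothetical class, NOT Spark's implementation).
class LazyPipeline:
    def __init__(self, rows, plan=()):
        self._rows = rows   # source data, never modified
        self._plan = plan   # recorded steps, not yet executed

    # "Transformations": record a step and return a new pipeline
    def filter(self, predicate):
        return LazyPipeline(self._rows, self._plan + (("filter", predicate),))

    def select(self, *names):
        pick = lambda row: {n: row[n] for n in names}
        return LazyPipeline(self._rows, self._plan + (("map", pick),))

    # "Actions": run every recorded step, then return a result
    def collect(self):
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every time the predicate really runs

def value_gt_50(row):
    calls.append(row["id"])
    return row["value"] > 50

data = [{"id": i, "name": f"Name_{i}", "value": i * 10} for i in range(1, 101)]
pipeline = LazyPipeline(data).filter(value_gt_50).select("id", "name")

calls_before_action = len(calls)
print(calls_before_action)   # 0: nothing has been evaluated yet
print(pipeline.count())      # 95: the action runs filter + select
print(len(calls))            # 100: the predicate ran once per input row
```

The predicate is never called while the pipeline is being defined, which mirrors why defining transformations in the notebook returns almost instantly.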

### 🔬 Demo: Lazy Evaluation

In [34]:
# Create sample data
data = [(i, f"Name_{i}", i * 10) for i in range(1, 101)]
df = spark.createDataFrame(data, ["id", "name", "value"])

print("✅ Created DataFrame with 100 rows")
print(f"   Partitions: {df.rdd.getNumPartitions()}")

✅ Created DataFrame with 100 rows
   Partitions: 2


In [35]:
print("📌 Step 1: Define transformations (LAZY - No execution yet)")
print("⏱️  Starting...")
start = time.time()

# These are transformations - NOT executed yet!
df_filtered = df.filter(col("value") > 50)
df_selected = df_filtered.select("id", "name")
df_renamed = df_selected.withColumnRenamed("name", "employee_name")

elapsed = time.time() - start
print(f"✅ Transformations defined in {elapsed:.4f}s")
print("   ⚠️  No actual computation happened yet!")
print("   ⚠️  Data was NOT processed!")
print("   ⚠️  Only execution plan was created!")

📌 Step 1: Define transformations (LAZY - No execution yet)
⏱️  Starting...
✅ Transformations defined in 0.0338s
   ⚠️  No actual computation happened yet!
   ⚠️  Data was NOT processed!
   ⚠️  Only execution plan was created!


In [36]:
print("📌 Step 2: Trigger action (EAGER - Execution happens now!)")
print("⏱️  Executing...")
start = time.time()

# This is an action - triggers execution of ALL transformations above
result_count = df_renamed.count()

elapsed = time.time() - start
print(f"✅ Action completed in {elapsed:.4f}s")
print(f"   Result: {result_count} rows")
print("   ✅ NOW all transformations were executed!")
print("   ✅ Data was actually processed!")

📌 Step 2: Trigger action (EAGER - Execution happens now!)
⏱️  Executing...
✅ Action completed in 2.9687s
   Result: 95 rows
   ✅ NOW all transformations were executed!
   ✅ Data was actually processed!

### 💡 Key Insight:

```
Transformations:  ~0.03s (near-instant) → Just planning
Action:           ~3s (slow)            → Actual execution
(Timings from the run above; the first job of a session typically also pays one-time startup cost.)

Why?
• Transformations only build the execution plan
• Actions trigger execution of that plan
• Spark optimizes the whole plan before executing it
```

---

## PART 2: EXECUTION PLANS

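In [37]:
print("📌 Physical Plan, simple mode (How to do it)")
# explain(extended=False) prints only the physical plan;
# pass extended=True to also see the parsed, analyzed, and optimized logical plans
df_renamed.explain(extended=False)

📌 Physical Plan, simple mode (How to do it)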
== Physical Plan ==
*(1) Project [id#608L, name#609 AS employee_name#616]
+- *(1) Filter (isnotnull(value#610L) AND (value#610L > 50))
   +- *(1) Scan ExistingRDD[id#608L,name#609,value#610L]




In [38]:
print("📌 Physical Plan, formatted mode (operator details)")
df_renamed.explain(mode="formatted")

📌 Physical Plan, formatted mode (operator details)
== Physical Plan ==
* Project (3)
+- * Filter (2)
   +- * Scan ExistingRDD (1)


(1) Scan ExistingRDD [codegen id : 1]
Output [3]: [id#608L, name#609, value#610L]
Arguments: [id#608L, name#609, value#610L], MapPartitionsRDD[4] at applySchemaToPythonRDD at <unknown>:0, ExistingRDD, UnknownPartitioning(0)

(2) Filter [codegen id : 1]
Input [3]: [id#608L, name#609, value#610L]
Condition : (isnotnull(value#610L) AND (value#610L > 50))

(3) Project [codegen id : 1]
Output [2]: [id#608L, name#609 AS employee_name#616]
Input [3]: [id#608L, name#609, value#610L]




### 📖 Reading Execution Plans:

```
Read an execution plan from BOTTOM to TOP:

== Conceptual plan (before optimization) ==
Project [id, name AS employee_name]     ← Step 3: Rename column
+- Project [id, name]                   ← Step 2: Select columns
   +- Filter (value > 50)               ← Step 1: Filter rows
      +- Scan                           ← Step 0: Read data

Spark reads the plan bottom-up and optimizes it:
1. Scan the data
2. Filter as early as possible (predicate pushdown, where the source supports it)
3. Keep only the needed columns (column pruning)
4. Rename (the two Project steps collapse into the single Project shown in the real output above)
```
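The bottom-up reading order and the collapsing of adjacent `Project` steps can be sketched with a tiny plan tree in plain Python. Everything here (`Node`, `render`, `execution_order`, `collapse_projects`) is a hypothetical illustration, not Spark's Catalyst API, and the collapse rule is only valid in this example because the inner projection merely selects existing columns.

```python
# Toy plan tree (hypothetical structure, NOT Spark's Catalyst classes).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # operator name, e.g. "Filter"
    detail: str                      # human-readable arguments
    child: Optional["Node"] = None   # the operator that feeds this one

def render(node, depth=0):
    """Return plan lines top-down, in the layout explain() uses."""
    prefix = "" if depth == 0 else "   " * (depth - 1) + "+- "
    lines = [f"{prefix}{node.op} {node.detail}".rstrip()]
    if node.child is not None:
        lines += render(node.child, depth + 1)
    return lines

def execution_order(node):
    """Return operator names in the order they run: bottom of the plan first."""
    order = []
    while node is not None:
        order.append(node.op)
        node = node.child
    return order[::-1]

def collapse_projects(node):
    """Merge a Project that sits directly on another Project, keeping the outer column list."""
    if node is None:
        return None
    child = collapse_projects(node.child)
    if node.op == "Project" and child is not None and child.op == "Project":
        return Node("Project", node.detail, child.child)
    return Node(node.op, node.detail, child)

plan = Node("Project", "[id, name AS employee_name]",
            Node("Project", "[id, name]",
                 Node("Filter", "(value > 50)",
                      Node("Scan", ""))))

print("\n".join(render(plan)))
print(execution_order(plan))        # ['Scan', 'Filter', 'Project', 'Project']

optimized = collapse_projects(plan)
print(execution_order(optimized))   # ['Scan', 'Filter', 'Project']
```

The optimized toy plan has the same Scan, Filter, Project shape as the physical plan printed by `explain()` earlier in this part.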

---

## PART 3: NARROW vs WIDE TRANSFORMATIONS

### 📚 Theory:

```
┌──────────────────────────────────────────────────────────────┐
│              NARROW TRANSFORMATIONS                          │
├──────────────────────────────────────────────────────────────┤
│  • No data shuffle needed                                    │
│  • Each partition is processed independently                 │
│  • Fast and efficient                                        │
│  • 1 input partition → 1 output partition                    │
│                                                              │
│  Examples:                                                   │
│    - select()      - filter()       - withColumn()           │
│    - map()         - flatMap()      - union()                │
│    (map/flatMap are RDD operations)                          │
│                                                              │
│  Diagram:                                                    │
│    Partition 1 → [filter] → Partition 1'                     │
│    Partition 2 → [filter] → Partition 2'                     │
│    Partition 3 → [filter] → Partition 3'                     │
│    (No data movement between partitions)                     │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│               WIDE TRANSFORMATIONS                           │
├──────────────────────────────────────────────────────────────┤
│  • Data must be shuffled between partitions                  │
│  • Network I/O intensive                                     │
│  • Slower than narrow transformations                        │
│  • N input partitions → M output partitions                  │
│                                                              │
│  Examples:                                                   │
│    - groupBy()     - join()         - orderBy()              │
│    - distinct()    - repartition()                           │
│    (coalesce() merges partitions without a full shuffle)     │
│                                                              │
│  Diagram:                                                    │
│    Partition 1 ──┐                                           │
│    Partition 2 ──┼→ [shuffle] → Partition 1'                 │
│    Partition 3 ──┘            → Partition 2'                 │
│    (Data moves across the network)                           │
└──────────────────────────────────────────────────────────────┘
```
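The difference can be simulated with plain Python lists standing in for partitions. This is a rough sketch under simplifying assumptions, not Spark's shuffle: the functions `narrow_filter` and `wide_group_sum` are hypothetical, keys are small integers, and `key % num_out` stands in for hash partitioning.

```python
# Toy partition model (plain Python lists, NOT Spark's shuffle implementation).
from collections import defaultdict

def narrow_filter(partitions, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in partitions]

def wide_group_sum(partitions, num_out):
    """Wide: rows are routed by key, so any input partition can feed any output partition."""
    buckets = [defaultdict(int) for _ in range(num_out)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = key % num_out   # stand-in for hash partitioning on integer keys
            if dst != src:
                moved += 1        # this row would change partitions in a real shuffle
            buckets[dst][key] += value
    return [dict(b) for b in buckets], moved

# Three input partitions of (key, value) pairs
parts = [[(1, 10), (2, 20)], [(1, 30), (3, 40)], [(2, 50), (3, 60)]]

filtered = narrow_filter(parts, lambda row: row[1] > 25)
print(filtered)   # [[], [(1, 30), (3, 40)], [(2, 50), (3, 60)]]

grouped, moved = wide_group_sum(parts, num_out=2)
print(grouped)    # [{2: 70}, {1: 40, 3: 100}]
print(moved)      # 3
```

The filter never looks outside its own partition, while the grouping has to bring all rows with the same key together, which is what forces data movement.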

### 🔬 Demo: Narrow Transformation

In [39]:
print("📌 Narrow Transformation: filter + select")
print("   (No shuffle, fast)")

start = time.time()
df_narrow = df.filter(col("value") > 50).select("id", "name")
result = df_narrow.count()
elapsed = time.time() - start

print(f"✅ Result: {result} rows")
print(f"⚡ Time: {elapsed:.4f}s")
print("\n📊 Sample:")
df_narrow.show(5)

📌 Narrow Transformation: filter + select
   (No shuffle, fast)
✅ Result: 95 rows
⚡ Time: 0.5208s

📊 Sample:
+---+-------+
| id|   name|
+---+-------+
|  6| Name_6|
|  7| Name_7|
|  8| Name_8|
|  9| Name_9|
| 10|Name_10|
+---+-------+
only showing top 5 rows



### 🔬 Demo: Wide Transformation

In [40]:
print("📌 Wide Transformation: groupBy + agg")
print("   (Requires shuffle, slower)")

start = time.time()
df_wide = df.groupBy("id").agg(sum("value").alias("total"))
result = df_wide.count()
elapsed = time.time() - start

print(f"✅ Result: {result} rows")
print(f"⚡ Time: {elapsed:.4f}s")
print("   ⚠️  Notice: Slower than narrow transformation!")
print("\n📊 Sample:")
df_wide.show(5)

📌 Wide Transformation: groupBy + agg
   (Requires shuffle, slower)
✅ Result: 100 rows
⚡ Time: 0.9512s
   ⚠️  Notice: Slower than narrow transformation!

📊 Sample:
+---+-----+
| id|total|
+---+-----+
|  2|   20|
|  4|   40|
|  5|   50|
|  8|   80|
| 12|  120|
+---+-----+
only showing top 5 rows



### 📊 Performance Comparison

In [41]:
# Create larger dataset for better comparison
large_data = [(i, f"Name_{i}", i * 10) for i in range(1, 10001)]
df_large = spark.createDataFrame(large_data, ["id", "name", "value"])

print(f"✅ Created dataset with {df_large.count()} rows")
print(f"   Partitions: {df_large.rdd.getNumPartitions()}")

✅ Created dataset with 10000 rows
   Partitions: 2


In [42]:
print("🏁 Performance Test: Narrow vs Wide")
print("="*60)

# Test 1: Narrow transformation
print("\n📌 Test 1: Narrow (filter + select)")
start = time.time()
result_narrow = df_large.filter(col("value") > 5000).select("id", "name").count()
time_narrow = time.time() - start
print(f"   Result: {result_narrow} rows")
print(f"   Time: {time_narrow:.4f}s")

# Test 2: Wide transformation
print("\n📌 Test 2: Wide (groupBy + agg)")
start = time.time()
result_wide = df_large.groupBy("id").agg(sum("value").alias("total")).count()
time_wide = time.time() - start
print(f"   Result: {result_wide} rows")
print(f"   Time: {time_wide:.4f}s")

# Comparison
print("\n" + "="*60)
print("📊 COMPARISON:")
print(f"   Narrow: {time_narrow:.4f}s")
print(f"   Wide:   {time_wide:.4f}s")
print(f"   Wide is {time_wide/time_narrow:.2f}x slower")
print("\n💡 Why? Wide transformation requires shuffle (network I/O)")

🏁 Performance Test: Narrow vs Wide

📌 Test 1: Narrow (filter + select)
   Result: 9500 rows
   Time: 0.2735s

📌 Test 2: Wide (groupBy + agg)
   Result: 10000 rows
   Time: 0.4837s

📊 COMPARISON:
   Narrow: 0.2735s
   Wide:   0.4837s
   Wide is 1.77x slower

💡 Why? Wide transformation requires shuffle (network I/O)
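Single-shot timings like the ones above are noisy: the same filter-and-count pattern took 2.9687s in the lazy evaluation demo and 0.5208s in the narrow demo. One hedged way to get steadier numbers is to repeat each action and report the median. The helper `time_action` below is a hypothetical name, and the sorting workload is only a stand-in so the sketch runs without a cluster.

```python
import statistics
import time

def time_action(action, repeats=5, warmup=1):
    """Run `action` several times and return (median_seconds, all_timings)."""
    for _ in range(warmup):
        action()   # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings

# Stand-in workload so the sketch runs anywhere
median_s, timings = time_action(lambda: sorted(range(100_000, 0, -1)))
print(f"median: {median_s:.4f}s over {len(timings)} runs")

# With the notebook's DataFrames the call might look like this
# (assumes `df_large` and `col` from the cells above):
# narrow_s, _ = time_action(lambda: df_large.filter(col("value") > 5000).count())
```

The median is less sensitive than a single run to one-off delays such as scheduling or warm-up.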


---

## PART 4: COMMON ACTIONS

In [43]:
# Sample DataFrame
sample_df = df.limit(10)

print("📌 Common Actions Demo")
print("="*60)

📌 Common Actions Demo


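In [44]:
print("\n1️⃣ count() - Count rows")
# Use a distinct variable name: assigning to `count` would shadow the
# count() function imported from pyspark.sql.functions
row_count = sample_df.count()
print(f"   Total rows: {row_count}")


1️⃣ count() - Count rows
   Total rows: 10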


In [45]:
print("\n2️⃣ show() - Display data")
sample_df.show(5)


2️⃣ show() - Display data
+---+------+-----+
| id|  name|value|
+---+------+-----+
|  1|Name_1|   10|
|  2|Name_2|   20|
|  3|Name_3|   30|
|  4|Name_4|   40|
|  5|Name_5|   50|
+---+------+-----+
only showing top 5 rows



In [46]:
print("\n3️⃣ collect() - Get all rows as list")
rows = sample_df.limit(3).collect()
print(f"   Collected {len(rows)} rows")
for row in rows:
    print(f"   {row}")


3️⃣ collect() - Get all rows as list
   Collected 3 rows
   Row(id=1, name='Name_1', value=10)
   Row(id=2, name='Name_2', value=20)
   Row(id=3, name='Name_3', value=30)


In [47]:
print("\n4️⃣ take(n) - Get first n rows")
first_3 = sample_df.take(3)
print(f"   First 3 rows: {first_3}")


4️⃣ take(n) - Get first n rows
   First 3 rows: [Row(id=1, name='Name_1', value=10), Row(id=2, name='Name_2', value=20), Row(id=3, name='Name_3', value=30)]


In [48]:
print("\n5️⃣ first() - Get first row")
first_row = sample_df.first()
print(f"   First row: {first_row}")


5️⃣ first() - Get first row
   First row: Row(id=1, name='Name_1', value=10)


In [49]:
print("\n6️⃣ head(n) - Get first n rows")
head_3 = sample_df.head(3)
print(f"   Head 3: {head_3}")


6️⃣ head(n) - Get first n rows
   Head 3: [Row(id=1, name='Name_1', value=10), Row(id=2, name='Name_2', value=20), Row(id=3, name='Name_3', value=30)]
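
The practical difference between `take(n)`-style and `collect()`-style actions is how much data has to be produced before the driver gets its answer. The plain-Python sketch below models only that idea with a generator; `row_source` and `stats` are hypothetical names, and the sketch says nothing about how Spark actually schedules partitions.

```python
from itertools import islice

stats = {"produced": 0}   # how many rows the source really generated

def row_source(n):
    """Yield rows one at a time and count how many were actually produced."""
    for i in range(1, n + 1):
        stats["produced"] += 1
        yield {"id": i, "name": f"Name_{i}", "value": i * 10}

# take(3)-style: stop as soon as 3 rows are available
first_3 = list(islice(row_source(1000), 3))
produced_by_take = stats["produced"]
print(len(first_3), produced_by_take)        # 3 3

# collect()-style: materialize every row
stats["produced"] = 0
all_rows = list(row_source(1000))
print(len(all_rows), stats["produced"])      # 1000 1000
```

This is why `collect()` on a large DataFrame can overwhelm the driver, while `take(n)`, `first()`, and `head(n)` stay cheap for small `n`.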


---

## PART 5: MULTIPLE ACTIONS PROBLEM

### ⚠️ Problem: Multiple actions re-compute everything!

In [50]:
print("📌 Test: Multiple actions WITHOUT cache")
print("   (Each action re-computes from scratch)")
print("="*60)

# Create expensive transformation
df_expensive = df_large.filter(col("value") > 5000)

start = time.time()
count1 = df_expensive.count()  # Action 1: compute
count2 = df_expensive.count()  # Action 2: compute again!
count3 = df_expensive.count()  # Action 3: compute again!
elapsed = time.time() - start

print("✅ 3 actions completed")
print(f"   Time: {elapsed:.4f}s")
print("   ⚠️  Each action re-computed the filter!")

📌 Test: Multiple actions WITHOUT cache
   (Each action re-computes from scratch)
✅ 3 actions completed
   Time: 0.5464s
   ⚠️  Each action re-computed the filter!


### ✅ Solution: Use cache()!

In [51]:
print("📌 Test: Multiple actions WITH cache")
print("   (First action computes + caches, rest use cache)")
print("="*60)

# Mark the DataFrame for caching (lazy: nothing is stored until the first action)
df_cached = df_large.filter(col("value") > 5000).cache()

start = time.time()
count1 = df_cached.count()  # Action 1: compute + fill the cache
count2 = df_cached.count()  # Action 2: read from cache
count3 = df_cached.count()  # Action 3: read from cache
elapsed = time.time() - start

print("✅ 3 actions completed")
print(f"   Time: {elapsed:.4f}s")
print("   ℹ️  2nd and 3rd actions read from the cache; compare with the uncached time")

# Clean up
df_cached.unpersist()
print("\n✅ Cache cleared")

📌 Test: Multiple actions WITH cache
   (First action computes + caches, rest use cache)
✅ 3 actions completed
   Time: 1.6130s
   ℹ️  2nd and 3rd actions read from the cache; compare with the uncached time

✅ Cache cleared

In this run the cached version took 1.6130s against 0.5464s without cache. On 10,000 rows the filter is cheaper to re-run than the cache is to build, so caching only pays off when the cached result is expensive to recompute and is reused several times.
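
What caching changes is the number of times the underlying work runs, independent of how long each run takes. The toy class below counts recomputations in plain Python; `ToyFrame` is a hypothetical model invented for this sketch and does not reflect Spark's storage layer.

```python
# Toy model of recompute-vs-cache (hypothetical class, NOT Spark's storage layer).
class ToyFrame:
    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._use_cache = False
        self._cached = None
        self.runs = 0             # how many times the rows were really computed

    def cache(self):
        self._use_cache = True    # lazy: nothing is stored yet
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows   # the first action fills the cache
        return rows

    def count(self):
        return len(self._rows())

    def unpersist(self):
        self._use_cache = False
        self._cached = None
        return self

source = [i * 10 for i in range(1, 10001)]
expensive = lambda: [v for v in source if v > 5000]

plain = ToyFrame(expensive)
cached = ToyFrame(expensive).cache()
for _ in range(3):
    plain.count()
    cached.count()

print(plain.runs)    # 3: every action recomputed the filter
print(cached.runs)   # 1: only the first action computed it
```

Whether skipping those recomputations saves wall-clock time depends on how costly the computation is compared with building and reading the cache.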


---

## PART 6: CHAINING OPERATIONS

### ❌ Bad Practice: Step by step (verbose)

In [52]:
print("❌ Bad Practice: Step by step")
df1 = df.filter(col("value") > 30)
df2 = df1.select("id", "name", "value")
df3 = df2.withColumn("value_doubled", col("value") * 2)
df4 = df3.orderBy("value", ascending=False)
df4.show(5)

print("\n⚠️  Problems:")
print("   • Too many intermediate variables")
print("   • Hard to read")
print("   • Clutters namespace")

❌ Bad Practice: Step by step
+---+--------+-----+-------------+
| id|    name|value|value_doubled|
+---+--------+-----+-------------+
|100|Name_100| 1000|         2000|
| 99| Name_99|  990|         1980|
| 98| Name_98|  980|         1960|
| 97| Name_97|  970|         1940|
| 96| Name_96|  960|         1920|
+---+--------+-----+-------------+
only showing top 5 rows


⚠️  Problems:
   • Too many intermediate variables
   • Hard to read
   • Clutters namespace

### ✅ Good Practice: Chaining (recommended)

In [53]:
print("✅ Good Practice: Chaining")
result = df \
    .filter(col("value") > 30) \
    .select("id", "name", "value") \
    .withColumn("value_doubled", col("value") * 2) \
    .orderBy("value", ascending=False)

result.show(5)

print("\n✅ Benefits:")
print("   • Clean and readable")
print("   • No intermediate variables")
print("   • Easy to understand flow")
print("   • Professional code style")

✅ Good Practice: Chaining
+---+--------+-----+-------------+
| id|    name|value|value_doubled|
+---+--------+-----+-------------+
|100|Name_100| 1000|         2000|
| 99| Name_99|  990|         1980|
| 98| Name_98|  980|         1960|
| 97| Name_97|  970|         1940|
| 96| Name_96|  960|         1920|
+---+--------+-----+-------------+
only showing top 5 rows


✅ Benefits:
   • Clean and readable
   • No intermediate variables
   • Easy to understand flow
   • Professional code style


---

## PART 7: PRACTICAL EXAMPLES

In [54]:
# Create realistic employee dataset
from faker import Faker
import random

fake = Faker()
Faker.seed(42)
random.seed(42)

employee_data = []
for i in range(1, 1001):
    employee_data.append({
        "id": i,
        "name": fake.name(),
        "age": random.randint(22, 65),
        "department": random.choice(["Engineering", "Sales", "HR", "Marketing"]),
        "salary": random.randint(40000, 150000)
    })

df_emp = spark.createDataFrame(employee_data)
print(f"✅ Created employee dataset: {df_emp.count()} rows")

✅ Created employee dataset: 1000 rows


### Example 1: Find high earners in Engineering

In [55]:
print("📌 Example 1: High earners in Engineering (salary > 100k)")

result = df_emp \
    .filter(col("department") == "Engineering") \
    .filter(col("salary") > 100000) \
    .select("name", "age", "salary") \
    .orderBy(col("salary").desc())

print(f"\nFound {result.count()} high earners")
result.show(10)

📌 Example 1: High earners in Engineering (salary > 100k)

Found 102 high earners
+---------------+---+------+
|           name|age|salary|
+---------------+---+------+
|Margaret Harper| 33|149602|
|   Jose Schultz| 31|149392|
|  Tammie Bright| 52|147756|
| Nicholas Payne| 40|147343|
|   Sherry Tapia| 24|147216|
|   Joseph Smith| 29|146099|
|      Glen Wood| 38|145464|
|  Sara Johnston| 41|144363|
|Casey Hernandez| 34|143692|
|  Julie Herrera| 60|143445|
+---------------+---+------+
only showing top 10 rows



### Example 2: Department statistics

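In [56]:
print("📌 Example 2: Department statistics")

# Call the aggregates through the F namespace. A bare count(...) raises
# "TypeError: 'int' object is not callable" once the name `count` has been
# rebound to an integer anywhere earlier in the session.
from pyspark.sql import functions as F

result = df_emp \
    .groupBy("department") \
    .agg(
        F.count("name").alias("employee_count"),
        F.avg("salary").alias("avg_salary"),
        F.min("salary").alias("min_salary"),
        F.max("salary").alias("max_salary")
    ) \
    .orderBy(F.col("avg_salary").desc())

result.show()

📌 Example 2: Department statistics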
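
A `TypeError: 'int' object is not callable` on a call such as `count('name')` usually means the name `count` no longer points at a function, for example after an assignment like `count = some_df.count()` earlier in the session. The plain-Python sketch below reproduces the mechanism without Spark; the stand-in `count` function and the variable `saved` are hypothetical.

```python
# Plain-Python reproduction of the name-shadowing failure (no Spark needed).
def count(column):
    """Stand-in for an imported aggregate function such as pyspark's count()."""
    return f"count({column})"

print(count("name"))       # count(name)

saved = count              # keep a reference so the function can be restored
count = 10                 # e.g. storing a row count under the same name

try:
    count("name")          # the name now points at an int
except TypeError as exc:
    error_message = str(exc)
    print(error_message)   # 'int' object is not callable

count = saved              # restoring the function fixes later calls
print(count("name"))       # count(name)
```

Importing the module under an alias (`from pyspark.sql import functions as F`) and calling `F.count(...)` sidesteps the problem, because local variables can no longer collide with the function names.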

### Example 3: Complex query with multiple transformations

In [57]:
print("📌 Example 3: Senior employees (age > 40) with above-average salary")

# Calculate average salary
avg_salary = df_emp.agg(avg("salary")).collect()[0][0]
print(f"   Average salary: ${avg_salary:,.2f}")

# Find senior employees with above-average salary
result = df_emp \
    .filter(col("age") > 40) \
    .filter(col("salary") > avg_salary) \
    .withColumn("salary_vs_avg", col("salary") - avg_salary) \
    .select("name", "age", "department", "salary", "salary_vs_avg") \
    .orderBy(col("salary_vs_avg").desc())

print(f"\nFound {result.count()} senior high earners")
result.show(10)

📌 Example 3: Senior employees (age > 40) with above-average salary
   Average salary: $94,912.75

Found 288 senior high earners
+--------------------+---+----------+------+-----------------+
|                name|age|department|salary|    salary_vs_avg|
+--------------------+---+----------+------+-----------------+
|    Curtis Wilkerson| 49|     Sales|149926|55013.25199999999|
|Elizabeth Oliver DDS| 48|        HR|149433|54520.25199999999|
|       Curtis Taylor| 61| Marketing|149055|54142.25199999999|
|        Gregory Peck| 50|     Sales|148926|54013.25199999999|
|        Karen Graham| 50| Marketing|148864|53951.25199999999|
|        Ashley Hicks| 54|        HR|148851|53938.25199999999|
|    Matthew Schwartz| 43|     Sales|148840|53927.25199999999|
|        Derek Zuniga| 57|        HR|148711|53798.25199999999|
|       Tracey Wagner| 61| Marketing|148395|53482.25199999999|
|         Mary Miller| 57|        HR|148175|53262.25199999999|
+--------------------+---+----------+------+-----------------+

---

## ✅ SUMMARY

### 📚 What you learned:

**1️⃣ Transformations vs Actions:**
- Transformations: Lazy, return a new DataFrame
- Actions: Eager, trigger computation

**2️⃣ Lazy Evaluation:**
- Transformations build the execution plan
- Actions trigger execution
- Spark optimizes the entire plan before execution

**3️⃣ Narrow vs Wide Transformations:**
- Narrow: No shuffle (fast) - filter, select, map
- Wide: Shuffle required (slower) - groupBy, join, orderBy

**4️⃣ Execution Plans:**
- Read from bottom to top
- Understand optimization steps

**5️⃣ Best Practices:**
- Chain operations for readability
- Use cache() when several actions reuse an expensive result
- Minimize wide transformations
- Call unpersist() once the cached data is no longer needed

### 🎯 Key Takeaways:
- ✅ Understand when computation happens
- ✅ Minimize wide transformations
- ✅ Use caching for repeated actions
- ✅ Chain operations for clarity
- ✅ Read execution plans to optimize

### 📝 Next: Partitioning & Caching Deep Dive