**Step 1: Load sales.csv as a DataFrame**

In [0]:
# Load the CSV as a DataFrame (no schema is supplied, so every column is read as a string)
sales_df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/parveen.r@live.com/sales.csv")

In [0]:
# Peek at the data
sales_df.show(5)

+----------------+--------------------+----------+----------+---------------+--------+---------+---------+
|SalesOrderNumber|SalesOrderLineNumber| OrderDate|CustomerID|           Item|Quantity|UnitPrice|TaxAmount|
+----------------+--------------------+----------+----------+---------------+--------+---------+---------+
|         SO20000|                   2|2024-08-24|  CUST2373| Running Shorts|       4|   278.63|    55.73|
|         SO20001|                   2|2024-10-06|  CUST2779| Vacuum Cleaner|       4|   111.58|    22.32|
|         SO20002|                   2|2024-08-02|  CUST2732|  Tennis Racket|       6|   152.03|    45.61|
|         SO20003|                   3|2024-09-21|  CUST1815|Children's Book|       3|     6.62|     0.99|
|         SO20004|                   2|2025-03-18|  CUST2913|           Doll|       7|    31.13|     10.9|
+----------------+--------------------+----------+----------+---------------+--------+---------+---------+
only showing top 5 rows


In [0]:
display(sales_df)

SalesOrderNumber,SalesOrderLineNumber,OrderDate,CustomerID,Item,Quantity,UnitPrice,TaxAmount
SO20000,2,2024-08-24,CUST2373,Running Shorts,4,278.63,55.73
SO20001,2,2024-10-06,CUST2779,Vacuum Cleaner,4,111.58,22.32
SO20002,2,2024-08-02,CUST2732,Tennis Racket,6,152.03,45.61
SO20003,3,2024-09-21,CUST1815,Children's Book,3,6.62,0.99
SO20004,2,2025-03-18,CUST2913,Doll,7,31.13,10.9
SO20005,3,2024-12-30,CUST1241,Vitamin Supplements,2,247.52,24.75
SO20006,2,2024-09-17,CUST2392,Pruning Shears,9,307.9,138.56
SO20007,1,2024-08-05,CUST2277,Remote Car,7,6.8,2.38
SO20008,3,2024-07-04,CUST1977,Coffee Maker,2,493.13,49.31
SO20009,4,2025-03-30,CUST1598,Car Vacuum,4,98.41,19.68



_**What happens?**_
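- The driver program receives the command and plans how Spark will read the file.
- Spark splits the input into partitions based on file size. The split size is capped by `spark.sql.files.maxPartitionBytes` (128 MB by default), so a small file usually ends up in a single partition.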
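
To make the 128 MB figure concrete, the sketch below estimates how many read partitions a single file might produce. It is a deliberate simplification: it assumes a splittable file and a split size equal to the 128 MB default, whereas Spark's real calculation also takes the file open cost and the number of available cores into account, so actual counts can be higher. The function name `estimate_partitions` is hypothetical.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128 MB).
MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough estimate: one partition per max_partition_bytes chunk, at least one."""
    if file_size_bytes < 0:
        raise ValueError("file size must not be negative")
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))


small_csv = estimate_partitions(2 * 1024 * 1024)      # a 2 MB file
large_csv = estimate_partitions(1024 * 1024 * 1024)   # a 1 GB file
print(small_csv, large_csv)
```

Under these assumptions a small CSV such as a 2 MB file maps to one partition, while a 1 GB file maps to eight.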

**Step 2: Check the Number of Partitions**

In [0]:
print("Number of partitions:", sales_df.rdd.getNumPartitions())


Number of partitions: 1


_**What does this show?**_
- Spark split your file into this many chunks (here just one, because the file is small).
- Within each stage, every partition becomes one task that is sent to an executor.

**Step 3: Transformation – Filter for Large Sales**

In [0]:
# Transformation: prepare a filter for big orders (Quantity > 4)
# Quantity was loaded as a string; Spark casts it implicitly for the comparison.
large_sales = sales_df.filter(sales_df.Quantity > 4)

# No work happens yet: Spark just builds a plan!
print(large_sales)


DataFrame[SalesOrderNumber: string, SalesOrderLineNumber: string, OrderDate: string, CustomerID: string, Item: string, Quantity: string, UnitPrice: string, TaxAmount: string]


_**Spark is planning what to do, but hasn’t done it yet.**_

**Step 4: Action – Count the Results**

In [0]:
# Action: actually count the large sales rows
count_large_sales = large_sales.count()
print("Number of large sales:", count_large_sales)


Number of large sales: 1146


_Now Spark executes:_

- The action makes the driver turn the plan into a job.
- The driver's scheduler creates one task per partition; the cluster manager only provides the executor resources.
- Each executor gets a task (processes a chunk).
- Executors filter their chunk and produce partial counts.
- The partial counts are combined, and the driver shows you the final count.
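
The flow above can be imitated in plain Python. This is only a conceptual stand-in, not Spark code: each inner list plays the role of one partition, `count_large` (a hypothetical name) plays the role of one task, and the final `sum` plays the role of combining the partial counts. The quantities are made-up sample values.

```python
# Each inner list stands in for one partition of Quantity values.
partitions = [
    [4, 4, 6],
    [3, 7, 2],
    [9, 7, 2, 4],
]


def count_large(partition, threshold=4):
    """Work done by one task: filter its own partition and return a partial count."""
    return sum(1 for quantity in partition if quantity > threshold)


# One task per partition, each producing a partial count.
partial_counts = [count_large(p) for p in partitions]

# The partial counts are combined into the final result.
total = sum(partial_counts)
print(partial_counts, total)
```

No task ever sees another task's partition; only the small partial counts travel onward, which is why a count scales well.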

**Step 5: See Tasks in the Spark UI**
- In Databricks, after you run the above cell, click on the “View Spark UI” link at the top of the cell.
- Go to the Stages tab:
  - You’ll see as many tasks as there are partitions!
  - Each task corresponds to a partition, run by an executor.
- You can see:
  - DAG visualization (the plan Spark builds)
  - Task distribution (how work is divided)

**Step 6: Repartition and See the Effect**

In [0]:
# Repartition to 8 partitions (triggers a shuffle and allows more parallelism)
sales_df2 = sales_df.repartition(8)
print("Partitions after repartition:", sales_df2.rdd.getNumPartitions())

# Trigger an action
sales_df2.filter(sales_df2.Quantity > 4).count()
# Check Spark UI: you’ll see 8 tasks for the stage that runs after the repartition shuffle!


Partitions after repartition: 8


1146

_You control parallelism and task distribution using repartitioning._

**Step 7: Aggregation Example (Group By Customer)**

In [0]:
# Calculate total sales amount per customer
from pyspark.sql.functions import col, sum as _sum

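# Quantity and UnitPrice were loaded as strings; Spark casts them to double here,
# which explains floating-point tails such as 14294.189999999999 in the output.
sales_df = sales_df.withColumn("TotalAmount", col("Quantity") * col("UnitPrice"))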
customer_sales = sales_df.groupBy("CustomerID").agg(_sum("TotalAmount").alias("TotalSpent"))

customer_sales.orderBy("TotalSpent", ascending=False).show(5)


+----------+------------------+
|CustomerID|        TotalSpent|
+----------+------------------+
|  CUST1323|           14577.5|
|  CUST1351|          14420.77|
|  CUST1985|14294.189999999999|
|  CUST1417|          14037.34|
|  CUST2357|13178.220000000001|
+----------+------------------+
only showing top 5 rows


- Each task computes partial sums per customer within its own partition.
- Shuffle: the partial results are then redistributed so that all rows for the same CustomerID land in the same partition.
- A second set of tasks merges the partial sums, and the top rows are returned to the driver.
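
The three phases of a grouped aggregation can be imitated in plain Python. This is a conceptual sketch, not how Spark is implemented: the data, the function names, and the use of `zlib.crc32` as the partitioning hash are all illustrative assumptions.

```python
import zlib
from collections import defaultdict

# Each inner list stands in for one input partition of (customer_id, amount) rows.
input_partitions = [
    [("CUST1", 100.0), ("CUST2", 50.0), ("CUST1", 25.0)],
    [("CUST2", 10.0), ("CUST3", 40.0), ("CUST1", 5.0)],
]
NUM_SHUFFLE_PARTITIONS = 2


def partial_sum(partition):
    """Phase 1: each task sums amounts per customer within its own partition."""
    totals = defaultdict(float)
    for customer, amount in partition:
        totals[customer] += amount
    return dict(totals)


def shuffle(partials, num_partitions):
    """Phase 2: route every partial sum to a bucket chosen by hashing the key."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for customer, subtotal in partial.items():
            index = zlib.crc32(customer.encode("utf-8")) % num_partitions
            buckets[index].append((customer, subtotal))
    return buckets


def final_sum(bucket):
    """Phase 3: merge the partial sums that arrived in one bucket."""
    totals = defaultdict(float)
    for customer, subtotal in bucket:
        totals[customer] += subtotal
    return dict(totals)


partials = [partial_sum(p) for p in input_partitions]
buckets = shuffle(partials, NUM_SHUFFLE_PARTITIONS)

result = {}
for bucket in buckets:
    result.update(final_sum(bucket))
print(sorted(result.items()))
```

Because the bucket is chosen from the key alone, every partial sum for a given customer ends up in the same bucket, so the final merge never needs data from another bucket.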