# RDD

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark. It is a fault-tolerant, distributed collection of elements that can be operated on in parallel.

**Key Characteristics:**

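- Immutable (transformations return a new RDD instead of modifying the existing one)
- Lazily evaluated
- Fault tolerant (lost partitions are recomputed from lineage information)
- Partitioned across cluster nodes
- Can be cached in memory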
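
To make "partitioned" and "operated on in parallel" concrete, here is a minimal plain-Python sketch (not Spark code). It assumes a hypothetical helper, `split_into_partitions`, and a made-up list of numbers; Spark does the equivalent work across executors rather than in a single process.

```python
from functools import reduce

def split_into_partitions(items, num_partitions):
    """Slice a list into contiguous chunks (hypothetical helper, not a Spark API)."""
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

data = list(range(1, 11))  # made-up data: 1..10
partitions = split_into_partitions(data, 3)

# Each partition is reduced independently (in Spark, possibly on different nodes).
partial_sums = [sum(part) for part in partitions]

# The partial results are then merged into a single value.
total = reduce(lambda x, y: x + y, partial_sums)

print(partitions)    # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(partial_sums)  # [10, 26, 19]
print(total)         # 55
```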

### SparkContext and SparkConf


`SparkContext` is the entry point to Spark's RDD functionality. It represents the connection to a cluster and is used to create RDDs.

#### `SparkConf`

- Holds the configuration for a Spark application as key-value pairs

**Common settings:**

- `setMaster("local[*]")`: run in local mode with as many worker threads as there are cores
- `setAppName("RDDExample")`: set the application name

### Transformations

Transformations create a new RDD from an existing one. They are lazy – not executed until an action is triggered.

| Transformation      | Description                                           |
| ------------------- | ----------------------------------------------------- |
| `map(func)`         | Returns a new RDD by applying `func` to each element  |
| `filter(func)`      | Keeps only the elements for which `func` returns true |
| `flatMap(func)`     | Like `map`, but flattens the per-element results      |
| `distinct()`        | Removes duplicates                                    |
| `union(rdd)`        | Combines two RDDs, keeping duplicates                 |
| `groupByKey()`      | Groups the values that share a key                    |
| `reduceByKey(func)` | Merges the values that share a key using `func`       |
| `sortBy(func)`      | Sorts the RDD by the key that `func` computes         |
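
The difference between `groupByKey()` and `reduceByKey(func)` is easiest to see on a tiny input. The following is a plain-Python model of what each one computes, not Spark code: `group_by_key`, `reduce_by_key`, and the sample pairs are hypothetical stand-ins, and the results are dictionaries rather than RDDs.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs standing in for a pair RDD.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    """Collect every value for a key into a list, modelling groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(func, pairs):
    """Fold the values of each key with func, modelling reduceByKey(func)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

grouped = group_by_key(pairs)
summed = reduce_by_key(lambda x, y: x + y, pairs)

print(grouped)  # {'a': [1, 3, 5], 'b': [2, 4]}
print(summed)   # {'a': 9, 'b': 6}
```

In Spark itself, `reduceByKey` combines values within each partition before the shuffle, so it typically moves less data across the network than `groupByKey` followed by a separate aggregation.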


### Actions

Actions trigger computation and return results or write data.

| Action                 | Description                                                        |
| ---------------------- | ------------------------------------------------------------------ |
| `collect()`            | Returns all elements to the driver                                 |
| `count()`              | Returns the number of elements                                     |
| `first()`              | Returns the first element                                          |
| `take(n)`              | Returns the first `n` elements                                     |
| `reduce(func)`         | Aggregates elements using a commutative, associative binary `func` |
| `saveAsTextFile(path)` | Writes the RDD as text files under `path`                          |
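
How an action triggers work that transformations only describe can be shown with a plain-Python analogy (not Spark code). Python's built-in `map` is also lazy, so it plays the role of a transformation here, and `list(...)` plays the role of an action such as `collect()`. The `double` function and the `calls` log are made up for the illustration.

```python
calls = []  # records every time the mapping function actually runs

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3, 4]

# "Transformation": building the pipeline runs nothing yet.
pipeline = map(double, data)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)  # 0 4 [2, 4, 6, 8]
```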



Reference: [Spark RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)


In [1]:
! pip install pyspark

Defaulting to user installation because normal site-packages is not writeable


In [1]:
# SparkContext and SparkConf

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("salesDemo").setMaster("local[*]")

sc = SparkContext(conf=conf)




Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport -XX:ActiveProcessorCount=1
Picked up JAVA_TOOL_OPTIONS: -XX:+UseContainerSupport -XX:ActiveProcessorCount=1
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/06/24 18:51:42 WARN Utils: Your hostname, krishnagopi-trng2224dat-g3q9nc1wf47, resolves to a loopback address: 127.0.0.1; using 10.0.5.2 instead (on interface eth0)
25/06/24 18:51:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/24 18:51:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
sc.defaultParallelism

1

In [6]:
# Step 1: Load the file into RDD

sales_raw = sc.textFile("file:///workspace/TRNG-2224-data-engineering/week1/datasets/sales.txt")



In [9]:
# Step 2: Convert each line into a tuple (ProductID, Category, Amount)

records = sales_raw.map(lambda x: x.split(",")).map(lambda x: (int(x[0]), x[1], int(x[2])))
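
The parsing in Step 2 can be checked on a single line without Spark. This sketch assumes each line of the file looks like `ProductID,Category,Amount`; the sample line is an assumption reconstructed from the tuple layout, and `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split a 'ProductID,Category,Amount' line into a typed tuple."""
    product_id, category, amount = line.split(",")
    return (int(product_id), category, int(amount))

sample_line = "1014,Electronics,987"  # assumed file format
parsed = parse_line(sample_line)

print(parsed)  # (1014, 'Electronics', 987)
```

Because `map` accepts any callable, a named function like this could replace the two chained lambdas, for example `sales_raw.map(parse_line)`.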




In [10]:
# Step 3: Create a (Category, Amount) RDD

category_sales = records.map(lambda x: (x[1], x[2]))


In [12]:
# Step 4: Total sales by category

total_sales_by_category = category_sales.reduceByKey(lambda x, y: x + y)



In [17]:
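# Step 5: Average sale per category
# Pair each amount with a count of 1, sum both per key, then divide the sum by the count.

average_sales_by_category = (
    category_sales.mapValues(lambda x: (x, 1))
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    .mapValues(lambda x: x[0] / x[1])
)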

average_sales_by_category.take(4)


[('Furniture', 538.4090909090909),
 ('Clothing', 443.0),
 ('Electronics', 547.4166666666666),
 ('Books', 407.3636363636364)]
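
The `(sum, count)` pattern in Step 5 can be traced by hand for one category. This is a plain-Python walk-through with made-up amounts, using `functools.reduce` as a stand-in for what `reduceByKey` does to the values of a single key.

```python
from functools import reduce

amounts = [100, 200, 600]  # hypothetical amounts for one category

# mapValues(lambda x: (x, 1)): pair each amount with a count of 1
pairs = [(amount, 1) for amount in amounts]

# reduceByKey(...): add the sums and the counts component-wise
def combine(x, y):
    return (x[0] + y[0], x[1] + y[1])

total, count = reduce(combine, pairs)

# mapValues(lambda x: x[0] / x[1]): divide the sum by the count
average = total / count

print(total, count, average)  # 900 3 300.0
```

Carrying the count next to the sum matters because partial results may be merged in any order. Component-wise addition is associative and commutative, whereas averaging two averages directly is not.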

In [18]:
# Step 6: Highest transaction

max_tran = records.max(key=lambda x: x[2])

max_tran


(1014, 'Electronics', 987)
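
The `key` argument used in Step 6 behaves like the `key` argument of Python's built-in `max`: elements are compared by the value the function returns, but the whole element is returned. A small plain-Python illustration follows; only the `(1014, 'Electronics', 987)` tuple comes from the output above, the other two records are made up.

```python
# Sample (ProductID, Category, Amount) tuples; the first and last are hypothetical.
sample_records = [
    (1001, "Books", 250),
    (1014, "Electronics", 987),
    (1020, "Toys", 640),
]

# Compare by amount (index 2) but return the full tuple.
highest = max(sample_records, key=lambda x: x[2])

print(highest)  # (1014, 'Electronics', 987)
```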

In [19]:
# Highest-selling category (by total sales)

highest_selling_category = total_sales_by_category.max(key=lambda x: x[1])

highest_selling_category

('Electronics', 13138)

In [None]:
# Step 7: Categories with sales above 5000

high_selling_cat_5k = total_sales_by_category.filter(lambda x: x[1] > 5000)

high_selling_cat_5k.collect()




[('Furniture', 11845), ('Electronics', 13138), ('Toys', 11794)]

In [23]:
# print final results

print("Total sales by cat")
print(total_sales_by_category.collect())

print("Avg sales per cat")

print(average_sales_by_category.collect())



Total sales by cat
[('Furniture', 11845), ('Clothing', 9303), ('Electronics', 13138), ('Books', 4481), ('Toys', 11794)]
Avg sales per cat
[('Furniture', 538.4090909090909), ('Clothing', 443.0), ('Electronics', 547.4166666666666), ('Books', 407.3636363636364), ('Toys', 536.0909090909091)]


**Assignment:**

1. Find all product IDs where the amount is greater than 900.
2. Find all transactions that belong to the “Furniture” category.
3. Count how many transactions belong to the “Electronics” category.
4. Find the average amount for each category.
5. Find the highest amount and the corresponding product ID.
6. Find the total number of unique categories.
7. For each category, find the product ID with the highest sale.
8. Count how many products were sold for less than 300.
9. Sort the transactions by amount in descending order.
