# 04_PySpark_RDD_Pair_RDDs

## Objective
In this notebook, we will:
- Load **orders.csv** and **customers.csv** into RDDs
- Transform each dataset into **key-value pairs**
- Perform a key-based operation to find customers who **haven’t placed any orders**

This example demonstrates the **map()** and **subtractByKey()** RDD transformations in PySpark, together with Python's **split()** string method for parsing each line.


In [None]:
from pyspark.sql import SparkSession

# Reuse the active SparkSession, or create one if none exists
spark = SparkSession.builder.getOrCreate()

In [None]:
ordersDataPath = 'orders.csv'
# Read the file as an RDD of strings, one element per line
ordersRdd = spark.sparkContext.textFile(ordersDataPath)

# Preview the first 10 raw records
for i in ordersRdd.take(10):
    print(i)

### Step 1: Inspect the first record
Retrieve the first record from the RDD for inspection.

In [None]:
x = ordersRdd.first()
x

### Step 2: Split the record
Split the record into columns using the comma as a delimiter.

In [None]:
x.split(',')

### Step 3: Extract customer ID
Extract the third field (index 2), which holds the customer ID.

In [None]:
x.split(',')[2]

### Step 4: Convert to integer
Convert the extracted customer ID into an integer.

In [None]:
int(x.split(',')[2])

### Step 5: Create a key-value pair
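Represent the data as a tuple `(custId, 1)`. The value `1` is only a placeholder: the key-based operation used later compares keys and ignores the values.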

In [None]:
(int(x.split(',')[2]), 1)

### Step 6: Define a lambda transformation
Wrap the expression in a lambda function so it can be applied to every record. On its own, the lambda only defines the function; nothing runs until it is passed to `map()`.

In [None]:
lambda x: (int(x.split(',')[2]), 1)

### Step 7: Apply transformation
Apply the lambda function across the RDD to generate key-value pairs.

In [None]:
ordersMapRdd = ordersRdd.map(lambda x: (int(x.split(',')[2]), 1))
for i in ordersMapRdd.take(10): 
    print(i)
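
The lambda above can also be written as a named function, which is easier to test outside Spark. The following is a minimal sketch: `order_to_pair` is a hypothetical name, and the sample line is made up for illustration, assuming only that the customer ID is the third comma-separated field.

```python
def order_to_pair(line):
    """Turn one CSV line into a (customer_id, 1) pair."""
    fields = line.split(',')
    # Assumption: the customer ID is the third field (index 2)
    return (int(fields[2]), 1)

# Made-up sample record, not taken from orders.csv
sample = '1,2013-07-25,11599,CLOSED'
pair = order_to_pair(sample)
print(pair)  # (11599, 1)
```

Under the same assumption, `ordersRdd.map(order_to_pair)` should produce the same pairs as the lambda version.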

### Step 8: Load and transform customers.csv
Load customers.csv and generate key-value pairs where the key is the customer ID, taken from the first field (index 0).

In [None]:
customersDataPath = 'customers.csv'
customersRdd = spark.sparkContext.textFile(customersDataPath)

customersMapRdd = customersRdd.map(lambda x: (int(x.split(',')[0]), 1))
for i in customersMapRdd.take(10): 
    print(i)

### Step 9: Find customers without orders
`subtractByKey()` returns the pairs from `customersMapRdd` whose key does not appear in `ordersMapRdd`, that is, the customers who are in the customer list but have no orders.

In [None]:
# Find customers who haven’t placed any orders
customersMapRdd.subtractByKey(ordersMapRdd).collect()
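
To make the semantics concrete, here is a plain-Python model of what `subtractByKey()` computes. It is a sketch for illustration rather than Spark's implementation; `subtract_by_key` and the sample pairs are hypothetical.

```python
def subtract_by_key(left, right):
    """Keep the (key, value) pairs from `left` whose key never appears in `right`."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]

# Made-up sample data: customers 1-4, orders placed by customers 2 and 4
customers = [(1, 1), (2, 1), (3, 1), (4, 1)]
orders = [(2, 1), (4, 1), (2, 1)]

no_orders = subtract_by_key(customers, orders)
print(no_orders)  # [(1, 1), (3, 1)]
```

Only the keys are compared, so the values on the right-hand side do not affect the result. Unlike this list-based model, Spark does not guarantee the order of the collected pairs.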

### Step 10: Final Compact Version
The cell below combines all the steps above into a single compact version.

In [None]:
# Compact version combining all steps
ordersRdd = spark.sparkContext.textFile('orders.csv')
customersRdd = spark.sparkContext.textFile('customers.csv')

ordersMapRdd = ordersRdd.map(lambda x: (int(x.split(',')[2]), 1))
customersMapRdd = customersRdd.map(lambda x: (int(x.split(',')[0]), 1))

result = customersMapRdd.subtractByKey(ordersMapRdd).collect()
result
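
Since every value in `result` is the placeholder `1`, it is often more convenient to keep only the customer IDs. A minimal sketch, with a made-up `sample_result` list standing in for the collected output:

```python
# Made-up stand-in for the list returned by collect()
sample_result = [(7, 1), (3, 1), (12, 1)]

# Drop the placeholder values and sort the IDs for stable output
customer_ids = sorted(cust_id for cust_id, _ in sample_result)
print(customer_ids)  # [3, 7, 12]
```

On the Spark side, calling `.keys()` on the pair RDD before `collect()` achieves the same projection.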