# RDDs and DataFrames in Apache Spark

Dataset: `./data/customers.csv`

## Table of Contents
- [1. Introduction](#1-introduction)
- [2. RDD: Resilient Distributed Dataset](#2-rdd-resilient-distributed-dataset)
  - [2.1 What is an RDD?](#21-what-is-an-rdd)
  - [2.2 Key Features](#22-key-features)
  - [2.3 Creating or Loading Data into an RDD](#23-creating-or-loading-data-into-an-rdd)
  - [2.4 RDD Transformation and Actions](#24-rdd-transformation-and-actions)
- [3. DataFrames](#3-dataframes)
  - [3.1 What is a DataFrame?](#31-what-is-a-dataframe)
  - [3.2 Key Features](#32-key-features)
  - [3.3 Creating or Loading Data into a DataFrame](#33-creating-or-loading-data-into-a-dataframe)
  - [3.4 Common DataFrame Operations](#34-common-dataframe-operations)
- [4. Conversion Between RDD and DataFrame](#4-conversion-between-rdd-and-dataframe)
- [5. RDD vs. DataFrame - Comparison](#5-rdd-vs-dataframe---comparison)
- [6. Use Case Summary](#6-use-case-summary)
- [7. Conclusion](#7-conclusion)

## 1. Introduction
Apache Spark has two core abstractions for working with distributed data:
- **RDD (Resilient Distributed Dataset):** The original low-level distributed data structure.
- **DataFrame:** A high-level abstraction built on top of RDDs, offering a tabular data structure similar to a database table or a pandas DataFrame.

## 2. RDD: Resilient Distributed Dataset

### 2.1 What is an RDD?
An RDD is an immutable distributed collection of objects that can be processed in parallel.

### 2.2 Key Features
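- Fault-tolerant: lost partitions are recomputed from the RDD's lineage
- Lazy evaluation: transformations are only recorded until an action requests a result
- Supports transformations (`map`, `filter`, etc.) and actions (`collect`, `count`, etc.)
- Compile-time type safety in Scala/Java (PySpark RDDs hold plain Python objects, so type errors only surface at runtime)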
- No built-in schema

### 2.3 Creating or Loading Data into an RDD

#### Creating an RDD (PySpark):

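```python
from pyspark.sql import SparkSession

# Reuse the active session if one exists; otherwise start a new one
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
sc = spark.sparkContext
# Distribute a local Python list as an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])
```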

#### Loading Data into an RDD

```python
# Load the file as an RDD of text lines
rdd = sc.textFile("./data/customers.csv")
# The first line is the header; drop every line identical to it
header = rdd.first()
rdd_data = rdd.filter(lambda line: line != header)
```

### 2.4 RDD Transformation and Actions

```python
# Split each CSV line into fields (plain split; does not handle quoted commas)
customers_rdd = rdd_data.map(lambda line: line.split(","))
```
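
A plain `line.split(",")` breaks when a field contains a comma inside quotes. Whether `customers.csv` has such fields is not shown here, so treat the following as an optional sketch: it uses Python's standard `csv` module to parse one line at a time. The helper name `parse_csv_line` and the sample line are hypothetical.

```python
import csv


def parse_csv_line(line):
    """Parse a single CSV line into a list of fields, honoring quotes."""
    # csv.reader expects an iterable of lines, so wrap the line in a list
    # and take the first (and only) parsed record.
    return next(csv.reader([line]))


# Hypothetical line with a comma inside a quoted field.
fields = parse_csv_line('1,"Smith, Jr.",Ann')
print(fields)
```

In the notebook it could replace the lambda, for example `rdd_data.map(parse_csv_line)`. Records that span several physical lines would still not be handled, because `textFile` splits the input on newlines first.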

```python
# View Sample
customers_rdd.take(3)
```

```python
# Count missing join dates (join_date is the fifth field, index 4)
missing_dates = customers_rdd.filter(lambda x: x[4] == "").count()
print(f"Missing join dates: {missing_dates}")
```

```python
# Extract customer names; collect() brings every result back to the driver
names = customers_rdd.map(lambda x: f"{x[1]} {x[2]}").collect()
print(names)
```

## 3. DataFrames

### 3.1 What is a DataFrame?
A DataFrame is a distributed collection of data organized into named columns, like a SQL table.

### 3.2 Key Features
- Schema-aware (columns and types)
- Optimized by Catalyst optimizer
- Supports SQL queries via `spark.sql()`
- Interoperable with RDDs and Pandas
- Better performance than RDD for most use cases

### 3.3 Creating or Loading Data into a DataFrame

#### Reading CSV into DataFrame

```python
# Use the first line as column names; without inferSchema every column is read as a string
df = spark.read.option("header", True).csv("./data/customers.csv")
df.show()
```

### 3.4 Common DataFrame Operations

```python
# Print schema
df.printSchema()
```

```python
# Select specific columns
df.select("first_name", "email").show()
```

```python
# Filter customers with missing join dates
df.filter(df.join_date.isNull()).show()
```

```python
# Count customers that have a join date
df.filter(df.join_date.isNotNull()).count()
```

```python
# Extract customer names (uses customers_rdd from section 2.4, not the DataFrame)
names = customers_rdd.map(lambda x: f"{x[1]} {x[2]}").collect()
print(names)
```

## 4. Conversion Between RDD and DataFrame

### From RDD to DataFrame

```python
from pyspark.sql import Row

# Convert RDD to Row RDD
row_rdd = customers_rdd.map(lambda x: Row(
    customer_id=int(x[0]),
    first_name=x[1],
    last_name=x[2],
    email=x[3],
    join_date=x[4] if x[4] != "" else None
))

df_from_rdd = spark.createDataFrame(row_rdd)
df_from_rdd.show()
```

### From DataFrame to RDD

```python
rdd_from_df = df.rdd
rdd_from_df.take(3)
```

## 5. RDD vs. DataFrame - Comparison

| Feature           | RDD                        | DataFrame               |
| ----------------- | -------------------------- | ----------------------- |
| Abstraction Level | Low                        | High                    |
| API Style         | Functional                 | SQL-like                |
| Schema            | Not enforced               | Schema-aware            |
| Performance       | No automatic optimization  | Optimized with Catalyst |
| Best for          | Custom, fine-grained logic | Queries, aggregations   |

## 6. Use Case Summary

| Task                                     | Recommended |
| ---------------------------------------- | ----------- |
| Load structured CSV data                 | DataFrame   |
| Filter or select fields efficiently      | DataFrame   |
| Custom parsing, transformation, or logic | RDD         |
| SQL-like querying and grouping           | DataFrame   |

## 7. Conclusion

- Use DataFrames when working with structured data like CSV, JSON, or Parquet.
- Use RDDs when you need custom logic, low-level transformations, or direct control over how data is partitioned and processed.

The examples above, all based on `customers.csv`, show how both abstractions work and when to use each.