# Data-Intensive Applications & Apache Spark Core

## Run in GitHub Codespaces

**Course**: DSCC 202-402 - Data Science at Scale  
**Topics**: Big Data Fundamentals, Team Roles, Spark Architecture, DataFrames, Operations, & Advanced Topics

## Setup: Sample Data for Demonstrations

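**Note**: Databricks notebooks have `spark` pre-initialized. In GitHub Codespaces or a local environment, the cell below creates the session; `getOrCreate()` simply returns the existing session if one is already running.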

In [None]:
from pyspark.sql import SparkSession
# Note: sum, max, and min imported here shadow Python's built-ins of the same name
from pyspark.sql.functions import col, sum, avg, count, when, max, min, expr, lit, datediff, current_date
from pyspark.sql.types import *
from pyspark.sql.window import Window
import time

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("Spark Core Lecture - Review") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Get Spark Context
sc = spark.sparkContext

print(f"Spark version: {spark.version}")
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI available at: {ui_url}")
print("💡 In GitHub Codespaces: Check the 'PORTS' tab below for forwarded port 4040 to access Spark UI")

# In Databricks, spark is already available
# For local testing: spark = SparkSession.builder.appName("SparkCoreLecture").getOrCreate()

# Sample customer data
customers_data = [
    ("C001", "Alice", 25, "CA"), ("C002", "Bob", 30, "NY"),
    ("C003", "Carol", 28, "CA"), ("C004", "Dave", 35, "TX"),
    ("C005", "Eve", 22, "NY"), ("C006", "Frank", 45, "CA")
]
customers_df = spark.createDataFrame(customers_data, ["customer_id", "name", "age", "state"])

# Sample transaction data
transactions_data = [
    ("T001", "C001", 100.0, "Electronics"), ("T002", "C001", 50.0, "Books"),
    ("T003", "C002", 200.0, "Electronics"), ("T004", "C003", 150.0, "Books"),
    ("T005", "C002", 75.0, "Home"), ("T006", "C004", 300.0, "Electronics"),
    ("T007", "C005", 25.0, "Books"), ("T008", "C001", 120.0, "Home")
]
transactions_df = spark.createDataFrame(transactions_data, ["txn_id", "customer_id", "amount", "category"])

print("✅ Sample data created for demonstrations")
print(f"   Customers: {customers_df.count()} rows")
print(f"   Transactions: {transactions_df.count()} rows")

---

# Section 1: Course Introduction

## What are Data-Intensive Applications?

### Key Characteristics:
- Processes data that doesn't fit on a single machine
- Uses distributed computing frameworks
- Automates or augments decision-making
- Requires specialized infrastructure and tools

### Examples:
- Recommendation systems (Netflix, Amazon)
- Fraud detection (banks, credit cards)
- Search engines (Google, Bing)
- Real-time analytics dashboards

## Goals: Augment vs Replace Human Decision-Making

### The Reality:
- Data-intensive applications typically **augment** human capabilities
- Humans provide context, judgment, and domain expertise
- Systems provide scale, consistency, and data-driven insights
- Best outcomes come from human-machine collaboration

### Data Scale Considerations:
- Data-intensive applications handle data that exceeds single-machine capacity
- Require distributed processing across multiple machines

## Lecture Roadmap

### Part 1: Foundations (Sections 1-4)
1. Big Data fundamentals (4 V's, CRISP-DM)
2. Team roles and collaboration
3. Development best practices

### Part 2: Spark Architecture (Sections 5-6)
4. Cluster architecture (Driver, Executors)
5. Core concepts (RDDs, lazy evaluation, immutability)

### Part 3: Practical Skills (Sections 7-8)
6. DataFrame operations
7. Advanced topics (SQL, UDFs, caching)

### Part 4: Integration (Section 9)
8. Real-world examples and best practices

---

# Section 2: Big Data Fundamentals

## The 4 V's of Big Data

![4 V's of Big Data](../illustrations/4vs_of_big_data.png)

### 1. **Volume** - The size of the data being processed
- Terabytes, petabytes, exabytes of data
- Example: Walmart processes 2.5 petabytes of data hourly

### 2. **Velocity** - The speed at which data is generated and processed
- Real-time or near real-time processing requirements
- Example: Twitter handles 500 million tweets per day

### 3. **Variety** - The different forms and sources of data
- Structured (databases), semi-structured (JSON), unstructured (text, images)
- Example: Social media data includes text, images, videos, metadata

### 4. **Veracity** - The uncertainty or quality of the data
- Data accuracy, trustworthiness, reliability
- Example: Sensor data may have noise or missing values

## CRISP-DM: Cross-Industry Standard Process for Data Mining

### The 6 Phases:
![CRISP-DM](../illustrations/crispdm.png)

1. **Business Understanding**: Define objectives and requirements
2. **Data Understanding**: Collect and explore initial data
3. **Data Preparation**: Clean, transform, and format data
4. **Modeling**: Build and train models
5. **Evaluation**: Assess model performance against business goals
6. **Deployment**: Put model into production

**Key Insight**: This is an iterative cycle, not a linear process!

## Scale Considerations

### When Do You Need Distributed Computing?

### Indicators You Need Distributed Processing:
- Data doesn't fit in memory on a single machine (a typical laptop has 16-32 GB of RAM)
- Processing time exceeds acceptable limits on single machine
- Need for fault tolerance and high availability
- Multiple data sources requiring parallel ingestion

### Enter Apache Spark:
- Processes terabytes to petabytes of data
- Distributes computation across hundreds/thousands of machines
- Provides fault tolerance through lineage tracking
- Offers unified APIs for batch and streaming data

## Decision Frequency/Impact Matrix

### The Four Quadrants:

| Impact \ Frequency | **High Frequency** | **Low Frequency** |
|-------------------|-------------------|------------------|
| **High Impact** | 🌟 **Highest Value** <br> Many high-stakes decisions <br> Example: Automated trading | ⚠️ **Strategic Decisions** <br> Infrequent but critical <br> Example: Merger analysis |
| **Low Impact** | ✅ **Valuable Target** <br> Small individual impact, <br> large cumulative value <br> Example: Product recommendations | ❌ **Not Worth Building** <br> Low frequency, low impact <br> Example: One-off reports |

### Key Insights:
- **High Frequency/High Impact**: Maximum ROI for data applications
- **High Frequency/Low Impact**: Still valuable due to cumulative effect
- **Low Frequency/High Impact**: Less common, may need human oversight
- **Low Frequency/Low Impact**: Generally not worth automation cost
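
The cumulative-value argument behind these quadrants comes down to simple arithmetic. The sketch below uses made-up numbers (the decision names, yearly volumes, and per-decision dollar values are illustrative assumptions, not course figures) to show how a low-impact decision made millions of times can outweigh a rare high-impact one.

```python
# Hypothetical decisions: (name, decisions per year, value gained per decision in dollars)
decisions = [
    ("Product recommendation", 5_000_000, 0.02),
    ("Merger analysis", 2, 20_000.0),
    ("One-off report", 3, 50.0),
]

def annual_value(per_year, value_each):
    """Cumulative yearly value = frequency x per-decision impact."""
    return per_year * value_each

annual_values = {name: annual_value(n, v) for name, n, v in decisions}
for name, total in annual_values.items():
    print(f"{name:<24} ${total:>12,.2f} per year")
```

With these assumed inputs, the high-frequency, low-impact decision produces the largest yearly total, which is the reasoning behind calling that quadrant a valuable target.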

---

# Section 3: Data Team Roles

## The Five Key Roles in Data-Intensive Projects

![Roles](../illustrations/roles.png)


## Role 1: Data Analyst

### Key Activities:
- Explore datasets to understand patterns and trends
- Create visualizations and dashboards
- Generate insights from data
- Answer business questions with data
- Communicate findings to stakeholders

### Tools:
- SQL, Python/R
- Tableau, Power BI
- Jupyter notebooks

## Role 2: Data Engineer

### Key Activities:
- Design and implement data pipelines (ETL/ELT)
- Manage data infrastructure
- Ensure data quality and reliability
- Optimize data storage and access
- Handle data at scale

### Tools:
- Apache Spark, Airflow
- Delta Lake, Databricks
- Cloud platforms (AWS, Azure, GCP)

## Role 3: Machine Learning Engineer

### Key Activities:
- Design and train ML models
- Feature engineering
- Model optimization and tuning
- Deploy models to production
- Monitor model performance

### Tools:
- Scikit-learn, TensorFlow, PyTorch
- MLflow, Kubeflow
- Model serving platforms

## Role 4: Evaluation Engineer

### Key Activities:
- Design evaluation metrics
- Test model accuracy and reliability
- Monitor models in production
- Detect model drift and degradation
- Ensure models meet business requirements

### Tools:
- A/B testing frameworks
- Monitoring systems
- Statistical analysis tools

## Role 5: Domain Expert

### Why Domain Experts Are Crucial:
- Provide business context for data
- Validate insights and models
- Identify relevant features and patterns
- Understand limitations and biases in data
- Translate between technical and business stakeholders

### Without Domain Experts:
- ❌ Technical solutions may not address real business needs
- ❌ Important patterns may be missed
- ❌ Models may make unrealistic assumptions
- ❌ Results may be misinterpreted

## Team Collaboration Patterns

### Effective Data Teams:
- **Cross-functional**: All roles work together
- **Iterative**: Follow CRISP-DM cycle
- **Communicative**: Regular sync between roles
- **Adaptable**: Respond to new insights and challenges

### Typical Workflow:
1. **Domain Expert** + **Data Analyst**: Define problem and explore data
2. **Data Engineer**: Build pipelines to prepare data
3. **ML Engineer**: Build and train models
4. **Evaluation Engineer**: Test model performance
5. **Data Engineer**: Deploy to production
6. **All roles**: Monitor, iterate, and improve

---

# Section 4: Development Best Practices

## Common Pitfalls in Data-Intensive Applications

![10 Ways Data Projects Fail](../illustrations/10_ways_data_projects_fail.jpeg)

### Top Pitfalls to Avoid:
1. **Lack of clear objectives** - Not defining success criteria
2. **Insufficient data quality** - Garbage in, garbage out
3. **Ignoring ethical considerations** - Bias, privacy, fairness
4. **Poor team collaboration** - Silos between roles
5. **Inadequate infrastructure** - Can't handle scale
6. **No deployment plan** - Models never reach production
7. **Lack of monitoring** - Models degrade over time
8. **Overfitting** - Model doesn't generalize
9. **Underestimating complexity** - Technical debt accumulates
10. **Ignoring business context** - Solutions don't address real needs

## Development Best Practices

### What Makes a Good Development Process:
- ✅ **Well-defined**: Clear stages and deliverables
- ✅ **Flexible**: Adapts to new insights and changes
- ✅ **Iterative**: Follows CRISP-DM cycle
- ✅ **Collaborative**: Involves all stakeholders

### Other Best Practices:
- Clear objectives and success metrics
- Data quality checks and validation
- Ethical considerations (bias, privacy, transparency)
- Version control for code and data
- Automated testing and monitoring
- Documentation and knowledge sharing

## When Models Fail: Next Steps

### Why More Data Is Often the Answer:
- More training examples improve model generalization
- Additional features may capture important patterns
- More diverse data helps handle edge cases
- Larger datasets reduce overfitting

### The CRISP-DM Response to Failure:
1. **Evaluate**: Understand why the model failed
2. **Data Understanding**: Identify data gaps or quality issues
3. **Data Preparation**: Acquire more data or improve existing data
4. **Modeling**: Retrain with enhanced dataset
5. **Evaluation**: Test again

### Not Good First Steps:
- ❌ Immediately change business goals (addresses symptoms, not cause)
- ❌ Accept poor performance (defeats the purpose)
- ❌ Abandon the project (too drastic without investigation)

## Ethical Considerations and Data Quality

### Ethical Principles:
- **Fairness**: Avoid bias and discrimination
- **Transparency**: Explain how decisions are made
- **Privacy**: Protect sensitive information
- **Accountability**: Take responsibility for outcomes

### Data Quality Dimensions:
- **Accuracy**: Data correctly represents reality
- **Completeness**: No critical missing values
- **Consistency**: Data is consistent across sources
- **Timeliness**: Data is up-to-date
- **Validity**: Data conforms to expected formats
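
Several of these dimensions can be turned into automated checks. The following is a minimal plain-Python sketch, assuming hypothetical records, a plausible age range of 0-120, two-letter state codes, and a one-year freshness threshold; none of these rules come from the course material. Consistency across sources is not covered here because it needs a second dataset to compare against.

```python
from datetime import date

# Hypothetical records; None marks a missing value.
records = [
    {"customer_id": "C001", "age": 25, "state": "CA", "updated": date(2024, 1, 5)},
    {"customer_id": "C002", "age": None, "state": "NY", "updated": date(2024, 1, 6)},
    {"customer_id": "C003", "age": 280, "state": "California", "updated": date(2019, 3, 1)},
]

def quality_issues(record, as_of, max_age_days=365):
    """Return a list of human-readable issues found in one record."""
    issues = []
    if record["age"] is None:
        issues.append("completeness: age is missing")
    elif not 0 <= record["age"] <= 120:
        issues.append("accuracy: age outside plausible range")
    state = record["state"]
    if not (len(state) == 2 and state.isalpha() and state.isupper()):
        issues.append("validity: state is not a two-letter code")
    if (as_of - record["updated"]).days > max_age_days:
        issues.append("timeliness: record is stale")
    return issues

report = {r["customer_id"]: quality_issues(r, as_of=date(2024, 2, 1)) for r in records}
for customer_id, issues in report.items():
    print(customer_id, issues or "OK")
```

In a Spark pipeline the same rules would typically be expressed as DataFrame filters so they run in parallel across partitions.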

## Key Success Factors Summary

### To Build Successful Data-Intensive Applications:

1. **Clear Vision**: Well-defined objectives aligned with business goals
2. **Strong Team**: All 5 roles working collaboratively
3. **Quality Data**: Accurate, complete, and relevant data
4. **Robust Process**: Iterative development following CRISP-DM
5. **Appropriate Technology**: Tools that scale (like Apache Spark!)
6. **Ethical Framework**: Consider fairness, privacy, and accountability
7. **Continuous Monitoring**: Track performance and iterate
8. **Production Focus**: Plan for deployment from day one

---

# Section 5: Apache Spark Architecture

Now we transition from general principles to Apache Spark specifics!

## What is Apache Spark?

### Definition:
Apache Spark is a **unified analytics engine for large-scale data processing**.

### Key Features:
- **Speed**: Up to 100x faster than Hadoop MapReduce for in-memory workloads
- **Ease of Use**: High-level APIs in Python, Scala, Java, R
- **Unified**: Supports batch processing, streaming, ML, graph processing
- **Fault Tolerant**: Automatically recovers from failures
- **Scalable**: Runs on single machine to thousands of nodes

### When to Use Spark:
- Processing terabytes to petabytes of data
- Need for distributed computing
- Complex data transformations and analytics
- Machine learning at scale
- Real-time streaming applications

## Spark API Hierarchy

![Spark API](../illustrations/spark_api.png)

### The API Stack (Bottom to Top):

1. **RDD** (Resilient Distributed Dataset) - Foundational abstraction
   - Low-level API
   - Full control over data and operations
   - All higher APIs build on RDDs

2. **DataFrame** - Structured data with named columns
   - High-level API
   - Automatic optimizations (Catalyst)
   - Similar to database tables or pandas DataFrames

3. **Dataset** - Type-safe DataFrames (Scala/Java only)
   - Compile-time type checking
   - Object-oriented programming interface

4. **Specialized APIs**:
   - Spark SQL: SQL queries on structured data
   - MLlib: Machine learning library
   - GraphX: Graph processing
   - Structured Streaming: Real-time stream processing

## Spark Cluster Architecture

![Spark Cluster](../illustrations/spark_cluster.png)

### Components:

1. **Driver Program**: Your application code
2. **Cluster Manager**: Allocates resources (YARN, Mesos, Kubernetes, Standalone)
3. **Executors**: Processes that run computations and store data

### How They Work Together:
1. Driver requests resources from the cluster manager
2. Cluster manager allocates executors on worker nodes
3. Driver sends tasks to executors
4. Executors run tasks and return results to driver
5. Driver coordinates the entire application
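
The division of labor described above can be imitated in plain Python. This is only an analogy: a thread pool stands in for the executors (real Spark executors are separate JVM processes, usually on other machines), and the function and variable names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    """Work done on an 'executor': square and add up one partition."""
    subtotal = 0
    for x in partition:
        subtotal += x * x
    return subtotal

# The 'driver' splits the data into partitions, one task per partition.
data = list(range(1, 101))
num_partitions = 4
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# The thread pool stands in for executors granted by a cluster manager.
with ThreadPoolExecutor(max_workers=2) as executors:
    partial_results = list(executors.map(run_task, partitions))

# The driver combines the partial results into the final answer.
total = 0
for subtotal in partial_results:
    total += subtotal
print(f"Partial results: {partial_results}")
print(f"Sum of squares 1..100: {total}")
```

Note that only small partial results travel back to the driver; the bulk of the data stays with the workers, which is the same reason `collect()` on a large DataFrame is risky.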

## The Driver Node

### Driver Responsibilities:
- Runs the `main()` function of your application
- Creates the SparkContext/SparkSession
- Converts user program into tasks
- Schedules tasks on executors
- Tracks executor status
- Returns results to the user

### In Databricks:
- The notebook runs on the driver
- SparkSession (`spark`) is pre-initialized
- You interact with the driver when running cells

In [None]:
# Example: Driver operations
# In Databricks, 'spark' is already created (this is the driver's SparkSession)

print(f"Spark Version: {spark.version}")
print(f"Application Name: {spark.sparkContext.appName}")
print(f"Master: {spark.sparkContext.master}")

# The driver coordinates all operations
print("\nDriver is ready to execute your Spark code!")

## Executor Nodes

### Executor Responsibilities:
- Execute tasks sent by the driver
- Store data for caching/persistence
- Report computation results back to driver
- Report metrics (memory usage, task duration)

### Key Characteristics:
- Multiple executors per application
- Each executor runs in its own JVM process
- Executors are long-lived (for the application duration)
- Failed executors are automatically restarted

### Executor vs Driver:
- **Driver**: ONE per application, orchestrates everything
- **Executors**: MANY per application, do the actual work

## Execution Hierarchy: Job, Stage, Task, Partition

![Spark Execution](../illustrations/spark_execution.png)

### The Four Levels:

1. **Job** - A sequence of stages triggered by an action
   - One job per action (count, collect, save)
   - Example: `df.count()` triggers one job

2. **Stage** - A set of tasks in a DAG
   - Stages are divided at shuffle boundaries
   - All tasks in a stage can run in parallel
   - Example: Stage 1 (read + filter), Stage 2 (groupBy + aggregate)

3. **Task** - A unit of work sent to an executor
   - One task per partition
   - Tasks are the smallest unit of execution
   - Example: Process partition 1 of 200

4. **Partition** - An atomic chunk of data (logical division) stored on a node
   - Data is split into partitions for parallel processing
   - Each partition is processed independently
   - More partitions = more parallelism

### Relationship:
```
Action → Job → Stages → Tasks → Partitions
  1       1      N        M         M
```
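
To make the hierarchy concrete, here is a simplified plain-Python model of how a chain of operations is cut into stages at shuffle boundaries. It is a sketch under stated assumptions: the set of "wide" operations is an illustrative subset, the 8 input partitions are assumed, and real Spark plans differ in detail (for example, partial aggregation before the shuffle, or adaptive execution merging shuffle partitions).

```python
# Wide transformations need a shuffle; this set is a simplified, illustrative list.
WIDE_OPS = {"groupBy", "join", "orderBy", "repartition", "distinct"}

def plan_stages(operations):
    """Split a chain of operations into stages at each shuffle boundary."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPS:
            stages.append([])   # shuffle boundary: the wide op starts a new stage
        stages[-1].append(op)
    return [s for s in stages if s]

pipeline = ["read", "filter", "groupBy", "agg", "orderBy"]
input_partitions = 8      # assumed number of input partitions
shuffle_partitions = 200  # Spark's default for spark.sql.shuffle.partitions

stages = plan_stages(pipeline)
for i, ops in enumerate(stages):
    tasks = input_partitions if i == 0 else shuffle_partitions
    print(f"Stage {i}: {ops} -> {tasks} tasks (one per partition)")
```

The actual stage and task breakdown for a job is visible in the Spark UI under the Jobs and Stages tabs.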

In [None]:
# Example: Understanding partitions

print("Default partitions:")
print(f"  Customers: {customers_df.rdd.getNumPartitions()} partitions")
print(f"  Transactions: {transactions_df.rdd.getNumPartitions()} partitions")

# Create a larger dataset to demonstrate partitioning
large_df = spark.range(0, 1000000)
print(f"\nLarge dataset: {large_df.rdd.getNumPartitions()} partitions")
print(f"Total rows: {large_df.count():,}")
print(f"Rows per partition: ~{large_df.count() // large_df.rdd.getNumPartitions():,}")

## RDD Deep Dive

### RDD = Resilient Distributed Dataset

**Resilient**: Fault-tolerant, automatically recovers from failures
- Uses lineage information to recompute lost partitions

**Distributed**: Data is split across multiple nodes
- Enables parallel processing

**Dataset**: Collection of objects
- Can contain any type of data (rows, tuples, custom objects)

### RDD Properties:
- **Immutable**: Cannot be changed after creation
- **Lazy**: Transformations are not computed until an action is called
- **Partitioned**: Divided for parallel processing
- **Typed**: Contains elements of a specific type

In [None]:
# Example: Creating and using RDDs

# Create RDD from a Python list
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print("RDD created from list:")
print(f"  Type: {type(numbers_rdd)}")
print(f"  Partitions: {numbers_rdd.getNumPartitions()}")
print(f"  Elements: {numbers_rdd.collect()}")

# RDD transformations (lazy)
squared_rdd = numbers_rdd.map(lambda x: x ** 2)
even_rdd = squared_rdd.filter(lambda x: x % 2 == 0)

print("\nAfter transformations (map and filter):")
print(f"  Even squares: {even_rdd.collect()}")

# Convert DataFrame to RDD
customers_rdd = customers_df.rdd
print(f"\nDataFrame to RDD:")
print(f"  Type: {type(customers_rdd)}")
print(f"  First row: {customers_rdd.first()}")

---

# Section 6: Core Spark Concepts

## Immutability in Spark

### What Does Immutability Mean?
- Once created, a DataFrame **cannot be changed**
- Transformations create **new** DataFrames
- Original DataFrame remains unchanged

### Why Immutability?
1. **Fault Tolerance**: Can recreate data from lineage if node fails
2. **Consistency**: Multiple operations see the same data
3. **Optimization**: Engine can safely reorder operations
4. **Parallelism**: Safe concurrent access without locks

In [None]:
# Example: Demonstrating immutability

print("Original customers DataFrame:")
customers_df.show()

# Apply transformation - creates NEW DataFrame
customers_with_category = customers_df.withColumn(
    "age_category",
    when(col("age") < 25, "Young")
    .when(col("age") < 35, "Adult")
    .otherwise("Senior")
)

print("\nNew DataFrame with age_category:")
customers_with_category.show()

print("\nOriginal DataFrame is UNCHANGED (immutability):")
customers_df.show()

print(f"\nOriginal DataFrame columns: {customers_df.columns}")
print(f"New DataFrame columns: {customers_with_category.columns}")

## RDD Lineage and Fault Tolerance

![RDD Lineage](../illustrations/rdd_lineage.png)

### How Lineage Works:
1. Spark tracks the **sequence of transformations** applied to data
2. Creates a **Directed Acyclic Graph (DAG)** of dependencies
3. If a partition is lost (node failure), Spark:
   - Traces back through the lineage
   - Recomputes only the lost partition
   - Continues execution

### Example Lineage:
```
Original Data → filter() → map() → groupBy() → Action
```

If a partition is lost during `groupBy()`, Spark recomputes:  
`filter()` → `map()` → `groupBy()` for that partition only

### Benefits:
- No need for expensive data replication
- Automatic recovery without user intervention
- Efficient - only recomputes what's needed
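
The recovery idea can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in, not a Spark API: it records transformations instead of applying them and can replay that record for a single partition. It handles only narrow transformations (`map`, `filter`); recovering across a shuffle such as `groupBy()` involves more machinery than this model shows.

```python
class ToyRDD:
    """A tiny stand-in for an RDD: source partitions plus a list of recorded steps."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions
        self.lineage = tuple(lineage)   # immutable record of transformations

    def map(self, fn):
        return ToyRDD(self.source_partitions, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source_partitions, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Replay the lineage for ONE partition, starting from the source data."""
        rows = list(self.source_partitions[index])
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rdd = ToyRDD(source).filter(lambda x: x % 2 == 1).map(lambda x: x * 10)

# Materialize every partition, then pretend the node holding partition 1 died.
materialized = {i: rdd.compute_partition(i) for i in range(len(source))}
del materialized[1]

# Recovery: replay the lineage for the lost partition only.
materialized[1] = rdd.compute_partition(1)
print(materialized)
```

Because each transformation returns a new object and the source is never modified, replaying the lineage always yields the same partition contents, which is where immutability and fault tolerance meet.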

## Transformations vs Actions Overview

### Transformations vs Actions Comparison Table

| **Transformations (Lazy)** | **Actions (Eager)** |
|---------------------------|---------------------|
| Return new RDD/DataFrame | Return value to driver or write to storage |
| Not computed immediately | Trigger immediate execution |
| Build execution plan (DAG) | Execute the DAG |
| Examples: | Examples: |
| - `map()` | - `count()` |
| - `filter()` | - `collect()` |
| - `groupBy()` | - `show()` |
| - `join()` | - `take()` |
| - `select()` | - `first()` |
| - `where()` | - `write.save()` |

## Transformations: Lazy Operations

### Common DataFrame Transformations:

1. **select()**: Choose specific columns
2. **filter() / where()**: Filter rows by condition
3. **withColumn()**: Add or modify columns
4. **groupBy()**: Group data for aggregation
5. **join()**: Combine DataFrames
6. **union()**: Concatenate DataFrames
7. **orderBy() / sort()**: Sort data
8. **distinct()**: Remove duplicates

### Key Characteristic:
- **Nothing is computed** when you call a transformation
- Spark just records the operation in the execution plan
- Actual computation happens when an action is called

In [None]:
# Example: Chaining transformations (all lazy)

print("Defining transformation chain (no computation yet)...\n")

# Chain of transformations
result_df = transactions_df \
    .filter(col("amount") > 50) \
    .groupBy("category") \
    .agg(count("*").alias("transaction_count"),
         sum("amount").alias("total_amount")) \
    .orderBy(col("total_amount").desc())

print("✅ Transformations defined (not executed yet)")
print(f"Result type: {type(result_df)}")
print("\nNo actual data processing has occurred until we call an action!")

## Actions: Eager Operations

### Common DataFrame Actions:

1. **show()**: Display rows in console
2. **count()**: Count number of rows
3. **collect()**: Return all rows to driver
4. **take(n)**: Return first n rows
5. **first()**: Return first row
6. **write.save()**: Save to storage
7. **foreach()**: Apply function to each row

### Key Characteristic:
- **Triggers immediate execution** of all pending transformations
- Returns results to the driver program
- Can see output or write to storage

In [None]:
# Example: Actions trigger execution

print("Now calling an ACTION - this triggers execution...\n")

# This action triggers the entire transformation chain
result_df.show()

print("\n✅ Action completed - all transformations were executed!")

# Other actions
print(f"\nNumber of categories: {result_df.count()}")
print(f"First row: {result_df.first()}")

## Lazy Evaluation Explained

### How Lazy Evaluation Works:

1. **User writes code** with transformations
2. **Spark builds DAG** (Directed Acyclic Graph) of operations
3. **No computation** happens yet
4. **User calls action**
5. **Spark optimizes** the DAG
6. **Execution begins** and data is processed
7. **Results returned** to user

### Benefits:
- **Optimization**: Spark can rearrange operations for efficiency
- **Performance**: Avoids unnecessary intermediate results
- **Resource Efficiency**: Only loads data that's needed

In [None]:
# Example: Demonstrating lazy evaluation with timing

import time

print("=" * 60)
print("LAZY EVALUATION DEMONSTRATION")
print("=" * 60)

# Create a larger dataset for timing
large_data = spark.range(0, 1000000)

print("\n1. Defining transformations (should be instant)...")
start = time.time()

# Chain of transformations - all lazy
transformed = large_data \
    .filter(col("id") % 2 == 0) \
    .withColumn("squared", col("id") * col("id")) \
    .filter(col("squared") < 1000000)

transform_time = time.time() - start
print(f"   Time to define transformations: {transform_time:.4f} seconds")
print("   ✓ No computation occurred (lazy!)")

print("\n2. Calling action (triggers execution)...")
start = time.time()

# This action triggers all transformations
result_count = transformed.count()

action_time = time.time() - start
print(f"   Time to execute count(): {action_time:.4f} seconds")
print(f"   Result: {result_count:,} rows")
print("   ✓ All transformations executed!")

print(f"\n📊 Comparison:")
print(f"   Transformation definition: {transform_time:.4f}s (instant - lazy)")
print(f"   Action execution: {action_time:.4f}s (actual work - eager)")
if transform_time > 0:  # guard against a zero reading from a coarse timer
    print(f"   Speed ratio: {action_time/transform_time:.0f}x slower for action")

## Benefits of Lazy Evaluation

### 1. Query Optimization
Spark can analyze the entire chain and optimize:
- **Predicate pushdown**: Move filters earlier
- **Column pruning**: Only read needed columns
- **Join reordering**: Optimize join sequence

### 2. Avoiding Unnecessary Work
```python
df.filter(...).filter(...).take(10)  # Only processes enough data for 10 rows
```
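
Python generators behave in a loosely similar way and make this effect easy to observe. The sketch below is an analogy in plain Python, not Spark code: the `stats` counter and helper names are invented for illustration, and Spark's own mechanism (scanning partitions incrementally for `take`) differs in detail.

```python
from itertools import islice

stats = {"rows_read": 0}

def read_source(n):
    """Yield rows one at a time, recording how many were actually read."""
    for i in range(n):
        stats["rows_read"] += 1
        yield i

# Building the pipeline does no work: generators are lazy, like transformations.
source_rows = read_source(1_000_000)
even_rows = (x for x in source_rows if x % 2 == 0)
selected_rows = (x for x in even_rows if x % 3 == 0)
rows_before_action = stats["rows_read"]

# The "action": pull only the first 10 results through the pipeline.
first_ten = list(islice(selected_rows, 10))
print(f"Rows read before the action: {rows_before_action}")
print(f"Rows read to produce 10 results: {stats['rows_read']:,} of 1,000,000")
print(first_ten)
```

Only a tiny prefix of the million-row source is touched, because nothing downstream asks for more once ten results exist.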

### 3. Pipeline Optimization
Multiple operations can be combined into single stage:
```python
df.select(...).filter(...).withColumn(...)  # Narrow operations may execute in a single pass
```

### 4. Fault Tolerance
Lineage graph enables recomputation without storing intermediate data

## DataFrames vs RDDs

### Why DataFrames Are Optimized:

1. **Catalyst Optimizer**
   - Analyzes and optimizes query plans
   - Pushes filters and projections down
   - Reorders operations for efficiency

2. **Tungsten Execution Engine**
   - Off-heap memory management
   - Cache-aware computation
   - Code generation at runtime

3. **Schema Information**
   - DataFrames have schema (column names and types)
   - Enables better optimization decisions
   - RDDs are opaque (Spark doesn't know structure)

### When to Use What:
- **DataFrames**: The recommended default for the vast majority of use cases
- **RDDs**: Complex custom logic, low-level control needed

In [None]:
# Example: DataFrame vs RDD performance

print("Comparing DataFrame vs RDD operations:\n")

# Create test data
test_data = spark.range(0, 100000)

# DataFrame approach (optimized)
print("1. DataFrame approach (with Catalyst optimization):")
start = time.time()
df_result = test_data \
    .filter(col("id") % 2 == 0) \
    .filter(col("id") < 50000) \
    .count()
df_time = time.time() - start
print(f"   Result: {df_result:,} rows")
print(f"   Time: {df_time:.4f} seconds")

# RDD approach (no optimization)
print("\n2. RDD approach (no Catalyst optimization):")
start = time.time()
rdd_result = test_data.rdd \
    .filter(lambda row: row[0] % 2 == 0) \
    .filter(lambda row: row[0] < 50000) \
    .count()
rdd_time = time.time() - start
print(f"   Result: {rdd_result:,} rows")
print(f"   Time: {rdd_time:.4f} seconds")

print(f"\n📊 Performance:")
print(f"   DataFrame: {df_time:.4f}s (optimized)")
print(f"   RDD: {rdd_time:.4f}s (no optimization)")
if rdd_time > df_time:
    print(f"   DataFrame is {rdd_time/df_time:.1f}x faster! ✓")

print("\n💡 DataFrames provide automatic optimizations!")

## Understanding Partitions

### Why Partitions Matter:
- Each partition is processed by a **single task**
- More partitions = more parallelism
- But too many partitions = overhead

### Rules of Thumb:
- Aim for 2-4 partitions per CPU core
- Partition size: 100MB - 1GB
- For 8-core cluster: 16-32 partitions

### Repartitioning:
- **repartition(n)**: Increase or decrease partitions (full shuffle)
- **coalesce(n)**: Decrease partitions (minimize shuffle)
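
The rules of thumb above can be combined into a quick back-of-the-envelope calculation. This helper is a hypothetical sketch, not a Spark API; the defaults of 3 partitions per core and a 128 MB target partition size are assumptions chosen from within the ranges listed above.

```python
import math

def suggest_partitions(total_cores, data_size_mb,
                       partitions_per_core=3, target_partition_mb=128):
    """Suggest a partition count from core count and data size (rule of thumb only)."""
    by_cores = total_cores * partitions_per_core
    by_size = math.ceil(data_size_mb / target_partition_mb)
    # Take the larger value so partitions stay near the target size
    # while still giving every core several tasks.
    return by_cores if by_cores >= by_size else by_size

print(suggest_partitions(total_cores=8, data_size_mb=1_024))    # small data: core count dominates
print(suggest_partitions(total_cores=8, data_size_mb=51_200))   # large data: partition size dominates
```

Treat the result as a starting point for `repartition(n)` and adjust after checking task durations in the Spark UI.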

In [None]:
# Example: Working with partitions

# Check current partitions
print("Current partitioning:")
print(f"  Customers: {customers_df.rdd.getNumPartitions()} partitions")
print(f"  Transactions: {transactions_df.rdd.getNumPartitions()} partitions")

# Create data with specific number of partitions
large_df = spark.range(0, 1000000, numPartitions=8)
print(f"\nLarge dataset: {large_df.rdd.getNumPartitions()} partitions")
print(f"Total rows: {large_df.count():,}")

# Repartition to increase parallelism
repartitioned = large_df.repartition(16)
print(f"\nAfter repartition(16): {repartitioned.rdd.getNumPartitions()} partitions")

# Coalesce to reduce partitions
coalesced = repartitioned.coalesce(4)
print(f"After coalesce(4): {coalesced.rdd.getNumPartitions()} partitions")

print("\n💡 More partitions = more parallel tasks!")

---

# Section 7: DataFrame Operations

## Common Transformations and Actions

This section summarizes the most frequently used DataFrame operations. Several of them (`filter()`, `withColumn()`, `groupBy()`, `orderBy()`) already appeared in earlier examples that use the sample customer and transaction data.

### Key DataFrame Transformations:
- **select()**: Choosing columns from DataFrames
- **filter()/where()**: Filtering rows based on conditions
- **withColumn()**: Adding/modifying columns
- **withColumnRenamed()**: Renaming columns
- **orderBy()**: Sorting DataFrames
- **groupBy()**: Aggregating data
- **union()**: Combining DataFrames with same schema
- **join()**: Joining DataFrames on keys

### Key DataFrame Actions:
- **show()**: Displaying DataFrame contents (triggers execution)
- **collect()**: Retrieving all rows to driver (triggers execution)
- **count()**: Counting rows (triggers execution)

---

**Next**: We'll explore Advanced Topics including Spark SQL, UDFs, and Caching strategies.

## Section 8 Summary: Advanced Topics

### What We Covered:

1. **Spark SQL**
   - Register DataFrames as temporary views with `createOrReplaceTempView()`
   - Write SQL queries with `spark.sql(query)`
   - Mix SQL and DataFrame API operations
   - Same performance optimizations as DataFrame API

2. **User-Defined Functions (UDFs)**
   - **SQL UDFs**: `spark.udf.register('name', function, returnType)` for SQL queries
   - **DataFrame API UDFs**: `udf()` decorator or function for DataFrame operations
   - Enable custom business logic not available in built-in functions
   - Note: Python UDFs are opaque to the Catalyst optimizer and add serialization overhead, so prefer built-in functions when one exists

3. **Caching**
   - Use `.cache()` to store DataFrames in memory
   - Cache **after expensive operations**, **before multiple uses**
   - Dramatically improves performance for iterative workflows
   - Remember to `.unpersist()` when done to free memory
   - Best for: ML pipelines, interactive analysis, repeated queries

### Key Takeaways:
- Spark SQL provides familiar syntax for analysts
- UDFs extend Spark with custom logic
- Caching is essential for performance when reusing DataFrames
- All three techniques are commonly used in production applications

---

**Next**: Final summary and resources

In [None]:
# Example: Caching Performance Demonstration

import time

print("=" * 60)
print("CACHING PERFORMANCE DEMONSTRATION")
print("=" * 60)

# Create a larger dataset with more expensive transformations
print("\nCreating a large dataset with expensive transformations...")

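# Generate larger dataset - 10 million rows
# "squared" is cast to double: the per-category sums (~1.3e19) would overflow a 64-bit long
large_data = spark.range(0, 10000000) \
    .withColumn("value", col("id") * 2) \
    .withColumn("squared", (col("value") * col("value")).cast("double")) \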
    .withColumn("category", (col("id") % 100).cast(StringType())) \
    .withColumn("subcategory", (col("id") % 1000).cast(StringType()))

# Add expensive aggregation that will benefit from caching
expensive_df = large_data \
    .groupBy("category") \
    .agg(
        count("*").alias("count"),
        sum("squared").alias("sum_squared"),
        avg("value").alias("avg_value"),
        max("squared").alias("max_squared"),
        min("squared").alias("min_squared")
    )

print("✅ Complex transformation chain defined (lazy - not executed yet)\n")

# Scenario 1: WITHOUT caching - Multiple actions on aggregated data
print("=" * 60)
print("SCENARIO 1: WITHOUT CACHING")
print("=" * 60)

print("\nPerforming 4 operations on the SAME aggregated dataset WITHOUT caching:")

start = time.time()
count1 = expensive_df.count()
time1 = time.time() - start
print(f"  Operation 1 (count categories): {count1} rows - {time1:.3f} seconds")

start = time.time()
max_sum1 = expensive_df.agg(max("sum_squared")).collect()[0][0]
time2 = time.time() - start
print(f"  Operation 2 (max sum_squared): {max_sum1:,} - {time2:.3f} seconds")

start = time.time()
avg_count1 = expensive_df.agg(avg("count")).collect()[0][0]
time3 = time.time() - start
print(f"  Operation 3 (avg count): {avg_count1:.1f} - {time3:.3f} seconds")

start = time.time()
top_categories1 = expensive_df.orderBy(col("sum_squared").desc()).take(5)
time4 = time.time() - start
print(f"  Operation 4 (top 5 categories): retrieved - {time4:.3f} seconds")

total_time_no_cache = time1 + time2 + time3 + time4
print(f"\n  TOTAL TIME WITHOUT CACHE: {total_time_no_cache:.3f} seconds")
print("  ⚠️  Each operation recomputes the entire expensive aggregation!")

# Scenario 2: WITH caching - Multiple actions
print("\n" + "=" * 60)
print("SCENARIO 2: WITH CACHING")
print("=" * 60)

print("\nCaching the aggregated dataset before multiple operations...")
cached_data = expensive_df.cache()

# First action: triggers computation AND caching
print("\n  Performing first operation (triggers caching)...")
start = time.time()
count2 = cached_data.count()
time_cache_load = time.time() - start
print(f"  Operation 1 (count categories): {count2} rows - {time_cache_load:.3f} seconds")
print("  ✓ Data is now cached in memory")

# Subsequent actions: use cached data (should be much faster)
print("\n  Performing subsequent operations (using cached data)...")

start = time.time()
max_sum2 = cached_data.agg(max("sum_squared")).collect()[0][0]
time_cached_2 = time.time() - start
print(f"  Operation 2 (max sum_squared): {max_sum2:,} - {time_cached_2:.3f} seconds")

start = time.time()
avg_count2 = cached_data.agg(avg("count")).collect()[0][0]
time_cached_3 = time.time() - start
print(f"  Operation 3 (avg count): {avg_count2:.1f} - {time_cached_3:.3f} seconds")

start = time.time()
top_categories2 = cached_data.orderBy(col("sum_squared").desc()).take(5)
time_cached_4 = time.time() - start
print(f"  Operation 4 (top 5 categories): retrieved - {time_cached_4:.3f} seconds")

total_time_with_cache = time_cache_load + time_cached_2 + time_cached_3 + time_cached_4
print(f"\n  TOTAL TIME WITH CACHE: {total_time_with_cache:.3f} seconds")
print("  ✓ Operations 2-4 used cached data (much faster!)")

# Performance comparison
print("\n" + "=" * 60)
print("PERFORMANCE COMPARISON")
print("=" * 60)

speedup = total_time_no_cache / total_time_with_cache
print(f"\nWithout caching: {total_time_no_cache:.3f} seconds")
print(f"With caching:    {total_time_with_cache:.3f} seconds")

if speedup > 1:
    print(f"Speedup:         {speedup:.2f}x faster with caching! ✓")
    print("\nBreakdown of cached operations:")
    print(f"  Operation 2 speedup: {time2/time_cached_2:.2f}x faster")
    print(f"  Operation 3 speedup: {time3/time_cached_3:.2f}x faster")
    print(f"  Operation 4 speedup: {time4/time_cached_4:.2f}x faster")
else:
    print("Note: Dataset may be small enough that Spark's optimization")
    print("      makes recomputation competitive with caching overhead.")
    print("      Caching shows more benefit with larger datasets and")
    print("      more expensive transformations (joins, complex aggregations).")

# Best practice: Unpersist when done
cached_data.unpersist()
print("\n✅ Cache cleared with unpersist() to free memory")

# Real-world example: Machine learning scenario
print("\n" + "=" * 60)
print("REAL-WORLD EXAMPLE: ML Feature Preparation")
print("=" * 60)

print("\nScenario: Preparing features for multiple ML models")

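# Expensive feature preparation - join is expensive!
# Joining on the column name keeps a single customer_id column,
# so it is grouped unqualified (portable across Spark versions).
features_df = transactions_df.alias("t") \
    .join(customers_df.alias("c"), "customer_id") \
    .groupBy("customer_id", "c.age", "c.state") \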
    .agg(
        count("t.txn_id").alias("txn_count"),
        sum("t.amount").alias("total_spent"),
        avg("t.amount").alias("avg_transaction"),
        max("t.amount").alias("max_transaction")
    ) \
    .cache()  # Cache after expensive join and aggregation

print("✅ Features prepared and cached")
print("\nNow you can use these features for multiple models:")
print("  - Model 1: Predict customer churn")
print("  - Model 2: Customer lifetime value prediction")
print("  - Model 3: Segmentation analysis")
print("\nEach model uses the SAME cached features (no recomputation)!")

# Trigger caching
features_df.show(5)

# Clean up
features_df.unpersist()

print("\nüí° Key Takeaways:")
print("   1. Cache DataFrames that are used MULTIPLE times")
print("   2. Cache AFTER expensive operations (joins, aggregations)")
print("   3. Caching benefits are most visible with:")
print("      - Large datasets")
print("      - Expensive transformations (joins, complex aggregations)")
print("      - Iterative algorithms (ML training)")
print("      - Interactive analysis with repeated queries")

## Caching: Performance Optimization

### What Is Caching?
Caching stores a DataFrame in memory (or disk) so it doesn't need to be recomputed when used multiple times.

### When to Use Cache:
✅ **GOOD Use Cases:**
- DataFrame used multiple times in your workflow
- After expensive transformations (joins, aggregations)
- Iterative algorithms (machine learning)
- Interactive analysis where you query the same data repeatedly

❌ **BAD Use Cases:**
- DataFrame only used once
- Very large DataFrames that don't fit in memory
- Simple transformations that are cheap to recompute

### How to Cache:
```python
df.cache()  # or df.persist()
```

### Best Practice:
Cache **after** expensive operations, **before** multiple uses.

### Memory Considerations:
- Cached data uses cluster memory
- Spark can evict older cached partitions when memory is full (LRU)
- Use `unpersist()` to free memory when done

In [None]:
# Example: User-Defined Functions (UDFs)

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, DoubleType, IntegerType

print("=" * 60)
print("USER-DEFINED FUNCTIONS (UDF) DEMONSTRATION")
print("=" * 60)

# Register DataFrames as temporary views (needed for the SQL queries below)
customers_df.createOrReplaceTempView("customers")
transactions_df.createOrReplaceTempView("transactions")

# Method 1: SQL UDF Registration
print("\n1. SQL UDF - Register for use in SQL queries")
print("-" * 60)

# Define and register a UDF for age category
def categorize_age(age):
    """Categorize customers by age group"""
    if age < 25:
        return "Young Adult"
    elif age < 35:
        return "Adult"
    elif age < 50:
        return "Middle Age"
    else:
        return "Senior"

# Register the UDF for SQL
spark.udf.register("categorize_age", categorize_age, StringType())
print("✅ Registered 'categorize_age' UDF for SQL")

# Define and register a UDF for discount calculation
def calculate_discount(amount):
    """Calculate discount based on purchase amount"""
    if amount >= 200:
        return amount * 0.15  # 15% discount
    elif amount >= 100:
        return amount * 0.10  # 10% discount
    else:
        return amount * 0.05  # 5% discount

spark.udf.register("calculate_discount", calculate_discount, DoubleType())
print("✅ Registered 'calculate_discount' UDF for SQL\n")

# Use UDFs in SQL query
query = """
    SELECT 
        name,
        age,
        categorize_age(age) as age_category,
        SUM(amount) as total_spent,
        calculate_discount(SUM(amount)) as total_discount
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, name, age
    ORDER BY total_spent DESC
"""

print("SQL Query with UDFs:")
result = spark.sql(query)
result.show()

# Method 2: DataFrame API UDF
print("\n2. DataFrame API UDF - Use with withColumn()")
print("-" * 60)

# Define UDF for DataFrame API
@udf(returnType=StringType())
def format_customer_label(name, state):
    """Create a formatted customer label"""
    return f"{name} ({state})"

# Alternative syntax without decorator
spending_tier_udf = udf(
    lambda amount: "High Spender" if amount > 150 else "Regular" if amount > 75 else "Low Spender",
    StringType()
)

print("✅ Created UDFs for DataFrame API\n")

# Apply UDFs to DataFrame
customer_summary = customers_df.alias("c") \
    .join(transactions_df.alias("t"), "customer_id") \
    .groupBy("c.customer_id", "c.name", "c.state") \
    .agg(sum("t.amount").alias("total_spent")) \
    .withColumn("customer_label", format_customer_label(col("name"), col("state"))) \
    .withColumn("spending_tier", spending_tier_udf(col("total_spent")))

print("DataFrame with UDF results:")
customer_summary.select("customer_label", "total_spent", "spending_tier").show(truncate=False)

# Example 3: More complex UDF
print("\n3. Complex UDF Example - Multiple inputs")
print("-" * 60)

def calculate_loyalty_score(transaction_count, total_spent, avg_amount):
    """
    Calculate customer loyalty score based on multiple factors
    Score ranges from 0-100
    """
    # Base score from transaction frequency (max 40 points)
    frequency_score = transaction_count * 10
    if frequency_score > 40:
        frequency_score = 40
    
    # Volume score from total spending (max 40 points)
    volume_score = total_spent / 10
    if volume_score > 40:
        volume_score = 40
    
    # Consistency score from average transaction (max 20 points)
    consistency_score = avg_amount / 10
    if consistency_score > 20:
        consistency_score = 20
    
    return int(frequency_score + volume_score + consistency_score)

# Register for SQL
spark.udf.register("loyalty_score", calculate_loyalty_score, IntegerType())

# Use in SQL query
loyalty_query = """
    SELECT 
        c.name,
        COUNT(t.txn_id) as txn_count,
        SUM(t.amount) as total_spent,
        AVG(t.amount) as avg_amount,
        loyalty_score(COUNT(t.txn_id), SUM(t.amount), AVG(t.amount)) as loyalty_score
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY loyalty_score DESC
"""

print("Customer Loyalty Scores:")
spark.sql(loyalty_query).show()

print("\nüí° UDFs enable custom business logic in Spark transformations!")

## User-Defined Functions (UDFs)

### What Are UDFs?
User-Defined Functions allow you to define custom logic that can be applied to DataFrame columns.

### Two Ways to Use UDFs:

1. **SQL UDFs**: Register with `spark.udf.register()` for use in SQL queries
2. **DataFrame API UDFs**: Use `udf()` decorator/function for DataFrame operations

### When to Use UDFs:
- Custom business logic not available in built-in functions
- Complex string parsing or transformations
- Domain-specific calculations
- Integration with external libraries

### Important Notes:
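- UDFs are opaque to the Catalyst optimizer, so they are less optimized than built-in functions
- Prefer built-in functions whenever an equivalent exists
- Python UDFs serialize/deserialize every row between the JVM and a Python worker (performance overhead)
- Reach for a UDF when it makes custom logic clearer and the overhead is acceptable for your data size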

In [None]:
# Example: Using Spark SQL

print("=" * 60)
print("SPARK SQL DEMONSTRATION")
print("=" * 60)

# Step 1: Register DataFrames as temporary views
customers_df.createOrReplaceTempView("customers")
transactions_df.createOrReplaceTempView("transactions")

print("\n✅ Registered DataFrames as SQL views: 'customers' and 'transactions'\n")

# Step 2: Run SQL queries
print("Query 1: Customer transaction summary")
query1 = """
    SELECT 
        c.customer_id,
        c.name,
        c.state,
        COUNT(t.txn_id) as transaction_count,
        SUM(t.amount) as total_spent,
        AVG(t.amount) as avg_transaction
    FROM customers c
    LEFT JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.name, c.state
    ORDER BY total_spent DESC
"""

result1 = spark.sql(query1)
result1.show()

# Query 2: Category performance by state
print("\nQuery 2: Category performance by state")
query2 = """
    SELECT 
        c.state,
        t.category,
        COUNT(*) as num_transactions,
        SUM(t.amount) as revenue
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.state, t.category
    ORDER BY c.state, revenue DESC
"""

result2 = spark.sql(query2)
result2.show()

# Query 3: High-value customers (spending > $200)
print("\nQuery 3: High-value customers")
query3 = """
    SELECT 
        c.name,
        c.age,
        SUM(t.amount) as total_spent
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.name, c.age
    HAVING SUM(t.amount) > 200
    ORDER BY total_spent DESC
"""

result3 = spark.sql(query3)
result3.show()

print("\nüí° Spark SQL provides familiar syntax with DataFrame performance!")

## Spark SQL: Querying DataFrames with SQL

### Why Use Spark SQL?
- Familiar SQL syntax for data analysts
- Can mix SQL and DataFrame API operations
- Same optimizations as DataFrame API (Catalyst)
- Great for ad-hoc analysis and exploration

### How It Works:
1. Register DataFrame as a temporary view
2. Run SQL queries against the view
3. Results returned as DataFrames

### Benefits:
- Leverage existing SQL knowledge
- Easy integration with BI tools
- Readable queries for complex logic

---

# Section 8: Advanced Topics

Now let's explore advanced Spark features that are essential for real-world applications!

## Final Summary

This lecture covered the core Spark concepts:

### 🎯 Core Learning Objectives Achieved:

1. **Data-Intensive Applications Foundations**
   - 5 V's of Big Data and CRISP-DM methodology
   - Team roles and collaboration patterns
   - Development best practices and pitfalls
   - Decision-making frameworks

2. **Apache Spark Architecture**
   - Driver and Executor roles
   - Job, Stage, Task, and Partition hierarchy
   - RDD as the fundamental data structure
   - Fault tolerance through lineage

3. **Core Spark Concepts**
   - Immutability principles
   - Lazy evaluation and optimization
   - Transformations vs Actions
   - DataFrames vs RDDs

4. **Practical DataFrame Operations**
   - Selection, filtering, and projection
   - Column operations and schema modification
   - Aggregations and grouping
   - Joins and unions
   - Actions that trigger execution

5. **Advanced Techniques**
   - Spark SQL integration
   - User-Defined Functions (UDFs)
   - Strategic caching for performance
   - End-to-end pipeline design

### 📚 Study Recommendations:

- **Review this notebook** section by section
- **Run all code examples** in Databricks to see outputs
- **Practice with the labs** to reinforce concepts
- **Understand the 'why'** behind each concept, not just memorize syntax

### 🚀 Ready for Success!

You now have a complete reference with executable examples,  
conceptual explanations, and best practices for Apache Spark.

---

**Format**: üìì Jupyter Notebook with markdown and code cells  
