# Databricks Notebooks: PySpark Data Processing

**Databricks Notebooks** are interactive, collaborative documents for data engineering, data science, and machine learning. They support Python, SQL, Scala, and R.

## What You'll Learn

✅ Read data from Unity Catalog tables  
✅ Perform transformations with PySpark  
✅ Use DataFrame API for data manipulation  
✅ Display and visualize results  
✅ Write processed data back to tables  

---

## Use Case: IoT Data Processing

We'll use PySpark to:
- Read sensor data from Unity Catalog
- Clean and transform the data
- Calculate aggregations and metrics
- Create visualizations
- Save results for downstream use

---

## Table of Contents

1. Notebook Basics
2. Reading Data
3. DataFrame Operations
4. Aggregations and Window Functions
5. Joining Tables
6. Visualizations
7. Writing Data

---

**References:**
- [Notebooks Documentation](https://docs.databricks.com/aws/en/notebooks/)
- [Notebooks Code](https://docs.databricks.com/aws/en/notebooks/notebooks-code)


In [0]:
# Configuration
CATALOG = 'default'
SCHEMA = 'db_crash_course'

print(f"Using: {CATALOG}.{SCHEMA}")


## 1. Notebook Basics <a id="basics"></a>

### What are Databricks Notebooks?

Notebooks are interactive documents containing:
- **Code cells** - Execute Python, SQL, Scala, or R code
- **Markdown cells** - Documentation and explanations
- **Visualizations** - Built-in plotting capabilities
- **Results** - Output from code execution

### Key Features:

✅ **Multi-language support** - Switch between languages in the same notebook  
✅ **Collaboration** - Real-time co-editing with teammates  
✅ **Version control** - Git integration for tracking changes  
✅ **Scheduling** - Run notebooks as automated jobs  
✅ **Interactive visualizations** - Built-in charting  

### Magic Commands:

- `%python` - Python code (default)
- `%sql` - SQL queries
- `%scala` - Scala code
- `%r` - R code
- `%md` - Markdown for documentation
- `%sh` - Shell commands
- `%pip` - Install Python packages

### Keyboard Shortcut:

- `Shift + Enter` - Run cell and move to next

## 2. Reading Data <a id="reading"></a>

### Reading from Unity Catalog Tables

The simplest way to read data is using `spark.table()`:


In [0]:
# Read a table from Unity Catalog
sensors_df = spark.table(f"{CATALOG}.{SCHEMA}.sensor_bronze")

# Show schema
print("Schema:")
sensors_df.printSchema()

# Display first few rows
sensors_df.limit(5).display()


### Reading from Files in Volumes


In [0]:
# Read CSV files from a volume
csv_path = f"/Volumes/{CATALOG}/{SCHEMA}/sensor_data/"

df_from_volume = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(csv_path)
)

print(f"Records read from volume: {df_from_volume.count():,}")
df_from_volume.limit(3).display()


### Basic DataFrame Info


In [0]:
# Get DataFrame information
print(f"Total rows: {sensors_df.count():,}")
print(f"Total columns: {len(sensors_df.columns)}")
print(f"\nColumns: {sensors_df.columns}")

# Show summary statistics
sensors_df.select("temperature", "rotation_speed", "air_pressure").summary().display()


## 3. DataFrame Operations <a id="dataframe"></a>

### Selecting Columns


In [0]:
from pyspark.sql.functions import col

# Select specific columns
selected_df = sensors_df.select(
    "device_id",
    "timestamp",
    "temperature",
    "rotation_speed",
    "factory_id"
)

selected_df.limit(5).display()


### Filtering Data


In [0]:
# Filter for high temperatures
high_temp_df = sensors_df.filter(col("temperature") > 80)

print(f"High temperature readings: {high_temp_df.count():,}")
high_temp_df.limit(5).display()

# Multiple conditions
critical_df = sensors_df.filter(
    (col("temperature") > 80) & 
    (col("rotation_speed") > 600)
)

print(f"\nCritical readings (high temp AND high speed): {critical_df.count():,}")
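
The parentheses around each condition in the combined filter are required, not stylistic. In Python, `&` binds more tightly than comparison operators such as `>`, so an unparenthesized expression is grouped differently from how it reads. The sketch below shows the same precedence rule with plain integers (the values are arbitrary and chosen only for illustration). With Spark columns, the unparenthesized form typically raises an error rather than returning a wrong result.

```python
# Plain-Python illustration of operator precedence; the numbers are arbitrary.
temperature = 90
rotation_speed = 700

# Without parentheses, Python evaluates `80 & rotation_speed` first and then
# chains the comparisons: temperature > (80 & rotation_speed) > 600.
without_parens = temperature > 80 & rotation_speed > 600

# With parentheses, each comparison is evaluated first and the two
# boolean results are then combined with `&`.
with_parens = (temperature > 80) & (rotation_speed > 600)

print(without_parens)
print(with_parens)
```

The same rule applies to `|` (or) and `~` (not) when building column expressions.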


### Adding and Transforming Columns


In [0]:
from pyspark.sql.functions import col, when, round as spark_round

# Add temperature in Celsius
transformed_df = sensors_df.withColumn(
    "temperature_celsius",
    spark_round((col("temperature") - 32) * 5/9, 2)
)

# Add a status flag
transformed_df = transformed_df.withColumn(
    "temperature_status",
    when(col("temperature") > 85, "Critical")
    .when(col("temperature") > 75, "Warning")
    .otherwise("Normal")
)

# Calculate derived metric
transformed_df = transformed_df.withColumn(
    "performance_index",
    spark_round(col("rotation_speed") / col("air_pressure") * 100, 2)
)

transformed_df.select(
    "device_id",
    "temperature",
    "temperature_celsius",
    "temperature_status",
    "performance_index"
).limit(10).display()
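
To sanity-check the column logic above without a cluster, the same rules can be written as plain Python functions. This is a minimal sketch, assuming the `temperature` column is in Fahrenheit (as the conversion formula implies); the function names `fahrenheit_to_celsius` and `temperature_status` are hypothetical and not part of the notebook.

```python
from typing import Optional


def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert Fahrenheit to Celsius, rounded to 2 decimal places."""
    return round((temp_f - 32) * 5 / 9, 2)


def temperature_status(temp_f: Optional[float]) -> str:
    """Mirror the when/otherwise chain: the first matching branch wins."""
    # In Spark a comparison against null is not true, so a missing
    # reading falls through every when() branch to otherwise().
    if temp_f is None:
        return "Normal"
    if temp_f > 85:
        return "Critical"
    if temp_f > 75:
        return "Warning"
    return "Normal"


print(fahrenheit_to_celsius(212))
print(temperature_status(85), temperature_status(85.1), temperature_status(None))
```

Two details are worth noting. The thresholds are strict, so a reading of exactly 85 is labelled "Warning", not "Critical". And rows with a null temperature end up as "Normal", which may or may not be the intended behavior. Python's `round` and Spark's `round` can also differ on exact ties, so treat this as a check of the branching logic rather than of the last decimal.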


### Handling Null Values


In [0]:
from pyspark.sql.functions import count, when, col

# Check for nulls: count(when(...)) counts only rows where the condition is true

null_counts = sensors_df.select([
    count(when(col(c).isNull(), c)).alias(c) 
    for c in sensors_df.columns
])

print("Null counts per column:")
null_counts.display()

# Drop rows with any nulls
clean_df = sensors_df.na.drop()

# Fill nulls with specific values
filled_df = sensors_df.fillna({
    "temperature": 0,
    "air_pressure": sensors_df.agg({"air_pressure": "mean"}).first()[0]
})

print(f"\nOriginal: {sensors_df.count():,} rows")
print(f"After dropping nulls: {clean_df.count():,} rows")
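
The effect of each null-handling choice is easier to see on a few rows. The following is a plain-Python sketch on hypothetical data (three made-up rows using the notebook's `temperature` and `air_pressure` column names); it imitates the counting, dropping, and filling steps above rather than calling Spark.

```python
# Hypothetical sample rows; None stands in for a Spark null.
rows = [
    {"temperature": 70.0, "air_pressure": 30.0},
    {"temperature": None, "air_pressure": 32.0},
    {"temperature": 80.0, "air_pressure": None},
]

# Null count per column (the count(when(isNull)) step).
null_counts = {c: sum(r[c] is None for r in rows) for c in rows[0]}

# Drop any row containing a null (the na.drop() step).
dropped = [r for r in rows if all(v is not None for v in r.values())]

# Mean of the known air_pressure values; nulls are skipped.
known_pressure = [r["air_pressure"] for r in rows if r["air_pressure"] is not None]
mean_pressure = sum(known_pressure) / len(known_pressure)

# Fill temperature with 0 and air_pressure with the mean (the fillna() step).
filled = [
    {
        "temperature": 0 if r["temperature"] is None else r["temperature"],
        "air_pressure": mean_pressure if r["air_pressure"] is None else r["air_pressure"],
    }
    for r in rows
]

avg_temp_dropped = sum(r["temperature"] for r in dropped) / len(dropped)
avg_temp_filled = sum(r["temperature"] for r in filled) / len(filled)
print(null_counts, len(dropped), mean_pressure, avg_temp_filled)
```

Filling a measurement with a constant such as 0 changes downstream statistics: here the average temperature of the filled rows is 50.0, while the two known readings average 75.0. Whether that is acceptable depends on how the filled column is used later.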


## 4. Aggregations and Window Functions <a id="aggregations"></a>

### Basic Aggregations


In [0]:
# Note: importing max and min here shadows Python's built-in max() and min()
# for the rest of the notebook.
from pyspark.sql.functions import avg, max, min, count, stddev

# Simple aggregations
overall_stats = sensors_df.agg(
    count("*").alias("total_readings"),
    avg("temperature").alias("avg_temperature"),
    max("temperature").alias("max_temperature"),
    min("temperature").alias("min_temperature"),
    stddev("temperature").alias("stddev_temperature")
)

print("Overall Statistics:")
overall_stats.display()


### Group By Aggregations


In [0]:
# Group by factory
factory_stats = (
    sensors_df
    .groupBy("factory_id")
    .agg(
        count("*").alias("reading_count"),
        avg("temperature").alias("avg_temperature"),
        avg("rotation_speed").alias("avg_rotation_speed"),
        max("air_pressure").alias("max_air_pressure")
    )
    .orderBy(col("avg_temperature").desc())
)

print("Statistics by Factory:")
factory_stats.display()

# Group by device
device_stats = (
    sensors_df
    .groupBy("device_id")
    .agg(
        count("*").alias("reading_count"),
        spark_round(avg("temperature"), 2).alias("avg_temp"),
        spark_round(max("temperature"), 2).alias("max_temp")
    )
    .orderBy(col("max_temp").desc())
    .limit(10)
)

print("\nTop 10 Devices by Max Temperature:")
device_stats.display()


### Window Functions


In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, lag, lead, avg as avg_func

# Define window specifications
device_window = Window.partitionBy("device_id").orderBy("timestamp")

# Add row number within each device
windowed_df = sensors_df.withColumn(
    "reading_number",
    row_number().over(device_window)
)

# Add previous and next temperatures
windowed_df = windowed_df.withColumn(
    "prev_temperature",
    lag("temperature", 1).over(device_window)
).withColumn(
    "next_temperature",
    lead("temperature", 1).over(device_window)
)

# Calculate moving average
windowed_df = windowed_df.withColumn(
    "temp_moving_avg_3",
    spark_round(
        avg_func("temperature").over(
            device_window.rowsBetween(-1, 1)  # Window of 3 rows
        ),
        2
    )
)

# Show temperature changes
windowed_df.select(
    "device_id",
    "timestamp",
    "temperature",
    "prev_temperature",
    "temp_moving_avg_3",
    "reading_number"
).filter(col("device_id") == 1).limit(10).display()
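
The frame `rowsBetween(-1, 1)` is centered: each average uses the previous row, the current row, and the next row, and the frame shrinks at the start and end of each device's partition. The plain-Python sketch below imitates that behavior on a hypothetical list of readings for one device; `centered_moving_average` is an illustrative helper, not a Spark function, and it does not model null handling.

```python
def centered_moving_average(values, before=1, after=1):
    """Average each value with up to `before` preceding and `after` following rows."""
    result = []
    for i in range(len(values)):
        start = max(0, i - before)             # frame is clipped at the partition start
        end = min(len(values), i + after + 1)  # and at the partition end
        frame = values[start:end]
        result.append(round(sum(frame) / len(frame), 2))
    return result


# Hypothetical temperatures for a single device, already ordered by timestamp.
readings = [70.0, 80.0, 90.0, 100.0]
print(centered_moving_average(readings))
```

Because the frame includes the following row, each value depends on a later reading. For a trailing average that only looks backwards, a frame such as `rowsBetween(-2, 0)` would be the usual choice.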


## 5. Joining Tables <a id="joins"></a>

### Inner Join with Dimension Tables


In [0]:
# Load dimension tables
dim_factories = spark.table(f"{CATALOG}.{SCHEMA}.dim_factories")
dim_models = spark.table(f"{CATALOG}.{SCHEMA}.dim_models")

# Join sensor data with factories
enriched_df = (
    sensors_df
    .join(dim_factories, "factory_id", "inner")
    .join(dim_models, "model_id", "inner")
    .select(
        "device_id",
        "timestamp",
        "temperature",
        "rotation_speed",
        "factory_name",
        "region",
        "city",
        "model_name",
        "model_category"
    )
)

print("Enriched sensor data:")
enriched_df.limit(10).display()
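
An inner join keeps only rows whose key appears on both sides, so sensor readings with an unknown `factory_id` or `model_id` silently disappear from `enriched_df`. The sketch below shows this with plain Python dictionaries; the rows are hypothetical and only the column names `device_id`, `factory_id`, and `factory_name` come from the notebook.

```python
# Hypothetical rows for illustration only.
sensor_rows = [
    {"device_id": 1, "factory_id": "A", "temperature": 71.5},
    {"device_id": 2, "factory_id": "B", "temperature": 88.0},
    {"device_id": 3, "factory_id": "Z", "temperature": 64.2},  # no matching factory
]
factory_rows = [
    {"factory_id": "A", "factory_name": "Factory A"},
    {"factory_id": "B", "factory_name": "Factory B"},
]


def inner_join(left, right, key):
    """Keep only left rows whose key exists on the right, merging the columns."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in lookup.get(l[key], [])]


joined = inner_join(sensor_rows, factory_rows, "factory_id")
print(len(sensor_rows), "->", len(joined))
```

Comparing the row count before and after the join is a quick way to detect dropped readings. If unmatched readings should be kept with null dimension columns, a `"left"` join is the usual alternative.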


### Aggregate Enriched Data


In [0]:
# Performance by region and model category
region_model_stats = (
    enriched_df
    .groupBy("region", "model_category")
    .agg(
        count("*").alias("reading_count"),
        spark_round(avg("temperature"), 2).alias("avg_temperature"),
        spark_round(avg("rotation_speed"), 2).alias("avg_rotation_speed")
    )
    .orderBy("region", col("avg_temperature").desc())
)

print("Performance by Region and Model Category:")
region_model_stats.display()


## 6. Visualizations <a id="visualizations"></a>

### Built-in Display Visualizations

Databricks notebooks have built-in visualization capabilities:


In [0]:
# Prepare data for visualization
factory_temp_viz = (
    enriched_df
    .groupBy("factory_name")
    .agg(
        spark_round(avg("temperature"), 2).alias("avg_temperature"),
        spark_round(avg("rotation_speed"), 2).alias("avg_rotation_speed")
    )
    .orderBy(col("avg_temperature").desc())
)

# Display - click the chart icon to create visualizations
factory_temp_viz.display()

print("""
📊 Try These Visualizations:
1. Click the bar chart icon below the table
2. Drag 'factory_name' to Keys
3. Drag 'avg_temperature' to Values
4. Try different chart types: bar, line, pie, scatter
""")


## 7. Writing Data <a id="writing"></a>

### Save Transformed Data as a New Table


In [0]:
from pyspark.sql.functions import col, when, current_timestamp, round as spark_round

# Create a processed dataset
processed_df = (
    enriched_df
    .withColumn("temperature_celsius", spark_round((col("temperature") - 32) * 5/9, 2))
    .withColumn(
        "temperature_category",
        when(col("temperature") > 85, "Critical")
        .when(col("temperature") > 75, "High")
        .when(col("temperature") > 65, "Normal")
        .otherwise("Low")
    )
    .withColumn("processing_timestamp", current_timestamp())
)

# Save as Delta table
output_table = f"{CATALOG}.{SCHEMA}.sensor_processed"

processed_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(output_table)

print(f"✅ Data saved to: {output_table}")
print(f"   Total records: {processed_df.count():,}")


### Append vs Overwrite Modes


In [0]:
from pyspark.sql.functions import current_timestamp

# Example: Append mode (adds new records)
# processed_df.write.format("delta").mode("append").saveAsTable(output_table)

# Example: Overwrite mode (replaces all data)
# processed_df.write.format("delta").mode("overwrite").saveAsTable(output_table)

# Example: Write to a specific location
# processed_df.write.format("delta").mode("overwrite").save(f"/Volumes/{CATALOG}/{SCHEMA}/processed_data/")

print("""
Write Modes:
- overwrite: Replace existing data
- append: Add new records
- ignore: Skip if table exists
- error: Fail if table exists (default)
""")

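The four write modes differ only in what happens when the target table already exists. The toy model below captures that decision in plain Python, treating a table as a list of rows; `apply_save_mode` is a hypothetical helper written for illustration and is not part of the Spark API.

```python
def apply_save_mode(existing, new_rows, mode="error"):
    """Toy model: return the table contents after a write.

    `existing` is None when the table does not exist yet.
    """
    if existing is None:
        return list(new_rows)  # every mode creates the table when it is absent
    if mode == "overwrite":
        return list(new_rows)  # replace all existing rows
    if mode == "append":
        return list(existing) + list(new_rows)  # keep old rows, add new ones
    if mode == "ignore":
        return list(existing)  # leave the table untouched
    if mode in ("error", "errorifexists"):
        raise ValueError("table already exists")
    raise ValueError(f"unknown mode: {mode}")


print(apply_save_mode(["old"], ["new"], "append"))
print(apply_save_mode(["old"], ["new"], "overwrite"))
print(apply_save_mode(["old"], ["new"], "ignore"))
```

One practical consequence: re-running a cell that uses `append` adds the same rows again, so `overwrite` is the safer choice for notebook cells that are meant to be re-run.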

## Summary

In this notebook, you learned:

✅ **Notebook basics** - Magic commands, keyboard shortcuts, collaboration  
✅ **Reading data** - From Unity Catalog tables and volumes  
✅ **DataFrame operations** - Select, filter, transform columns  
✅ **Aggregations** - GroupBy, window functions, statistics  
✅ **Joins** - Enrich data with dimension tables  
✅ **Visualizations** - Built-in charts  
✅ **Writing data** - Save transformed data to Delta tables  

### Key Takeaways:

1. **DataFrames are immutable** - Each transformation returns a new DataFrame
2. **Lazy evaluation** - Transformations are planned, not executed until an action
3. **display()** - Best way to view results with automatic visualizations
4. **Chaining operations** - Use parentheses for readable multi-line transformations
5. **Unity Catalog** - Simplifies data access with three-level namespace

### PySpark Best Practices:

**Performance:**
- Use `filter()` early to reduce data volume
- Avoid `collect()` on large datasets (brings all data to driver)
- Use `cache()` for DataFrames accessed multiple times
- Partition output data appropriately

**Code Quality:**
- Use explicit column references with `col()`
- Chain transformations for readability
- Add comments for complex logic
- Use meaningful variable names

**Data Quality:**
- Check for nulls and handle them explicitly
- Validate data types match expectations
- Use `na.drop()` or `fillna()` strategically
- Add data quality checks before writing

### Common DataFrame Operations:

**Transformations (lazy):**
- `select()`, `filter()`, `withColumn()`, `groupBy()`, `join()`

**Actions (trigger execution):**
- `display()`, `show()`, `count()`, `collect()`, `write.saveAsTable()`
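
The split between lazy transformations and actions can be illustrated with a Python generator, which also defers work until something consumes it. This is an analogy only, on hypothetical readings: Spark builds and optimizes a query plan, which a generator does not do.

```python
executed = []  # records which rows were actually processed


def read_rows():
    """Yield hypothetical temperature readings, logging each one as it is read."""
    for temperature in [70, 90, 95]:
        executed.append(temperature)
        yield temperature


# "Transformation": describes the pipeline but processes nothing yet.
high_temps = (t for t in read_rows() if t > 80)
steps_before_action = len(executed)

# "Action": consuming the pipeline triggers the work.
result = list(high_temps)
steps_after_action = len(executed)

print(steps_before_action, steps_after_action, result)
```

In the notebook, this is why a cell containing only `filter()` or `withColumn()` returns almost immediately, while the following `count()` or `display()` takes the time.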

### Try These:

- Explore **SQL magic** (`%sql`) for SQL queries in notebooks
- Learn about **Delta Lake** features (time travel, MERGE, OPTIMIZE)
- Try **Structured Streaming** for real-time data processing

---

**Additional Resources:**
- [Notebooks Documentation](https://docs.databricks.com/aws/en/notebooks/)
- [PySpark API Reference](https://spark.apache.org/docs/latest/api/python/)
- [Delta Lake Guide](https://docs.databricks.com/aws/en/delta/)
