# Getting Started with PySpark

This tutorial walks through the essentials of PySpark, the Python API for Apache Spark: creating a session, loading a DataFrame, and running transformations, aggregations, and SQL queries.

## Prerequisites

- A Python environment with the `pyspark` package available.
- Access to a Spark cluster or a local installation (when no master is configured, the session builder starts a local Spark instance).
- Familiarity with basic Python data structures.

In [None]:
# Start or reuse a SparkSession
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName('PySparkTutorial')
    .getOrCreate()
)

spark

## Creating a DataFrame

PySpark DataFrames are distributed tables with named columns. You can create them from Python objects, files, or external systems. Here we load one from the shared demo CSV file `orders_demo.csv`, reading the header row for column names and letting Spark infer the column types.

In [None]:
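# Load the shared demo dataset from CSV
from pathlib import Path
from pyspark.sql import functions as F

# When run from the repo root, the CSV lives under notebooks/data;
# otherwise assume the data folder sits one level above the working directory.
repo_root = Path.cwd()
if (repo_root / 'notebooks').exists():
    data_path = repo_root / 'notebooks' / 'data' / 'orders_demo.csv'
else:
    data_path = Path('..') / 'data' / 'orders_demo.csv'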

df = (
    spark.read
    .option('header', True)
    .option('inferSchema', True)
    .csv(str(data_path))
)

df.show()


## Transformations

Transformations such as `filter`, `withColumn`, and `orderBy` only build a logical plan. Spark evaluates them lazily and runs the plan when an action such as `show` is called.

In [None]:
# Filter and enrich the DataFrame
north_orders = df.filter(F.col('region') == 'north')
with_levels = (
    north_orders
      .withColumn(
          'demand_level',
          F.when(F.col('orders') >= 14, 'high').otherwise('steady'),
      )
      .orderBy('order_date')
)
with_levels.show()


## Aggregations

`groupBy` followed by `agg` collapses the rows of each group into a single summary row. Here we compute the total and average number of orders for each region.

In [None]:
# Summaries by region
summary = (
    df.groupBy('region')
      .agg(
          F.sum('orders').alias('total_orders'),
          F.avg('orders').alias('avg_orders'),
      )
      .orderBy('region')
)
summary.show()


## Using Spark SQL

Spark lets you mix SQL queries with the DataFrame API. Register a DataFrame as a temporary view, then query it from Python with `spark.sql`, which returns a new DataFrame.

In [None]:
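# Expose the DataFrame to SQL as a temporary view named 'orders'
df.createOrReplaceTempView('orders')
# Same demand_level rule as in the Transformations section, expressed in SQL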
spark.sql(
    '''
    SELECT order_date,
           region,
           orders,
           CASE WHEN orders >= 14 THEN 'high' ELSE 'steady' END AS demand_level
    FROM orders
    WHERE order_date = '2024-01-02'
    ORDER BY region
    '''
).show()


## Clean Up

Stop the SparkSession when you are finished with the notebook to release resources.

In [None]:
spark.stop()


## Exercises

- Load the shared orders dataset, then add a new column that marks weekends versus weekdays.
- Use `groupBy` to compute the maximum and minimum daily orders per region.
- Write a short Markdown summary explaining how you would adapt the dataset loader to read from a data lake path.
