# Python Essentials for PySpark Users

PySpark applications are written in Python, so fluency with core language patterns makes your Spark code easier to read and maintain. This tutorial highlights Python concepts you will lean on while building PySpark jobs.


## Prerequisites

- Basic familiarity with Python syntax (`if`, `for`, functions).
- A Spark environment capable of running PySpark notebooks.
- Access to the shared `orders_demo.csv` dataset under `notebooks/data/`.


## Using Python Collections to Stage Data

Lists and dictionaries are common when assembling small lookup tables or configuration that feeds into Spark jobs.


In [None]:
# Create a list of dictionaries describing regions
regions = [
    {"region": "north", "timezone": "CST"},
    {"region": "south", "timezone": "CST"},
    {"region": "east", "timezone": "EST"},
    {"region": "west", "timezone": "PST"},
]
regions


### List Comprehensions

You can reshape Python collections concisely with comprehensions before handing them to Spark.


In [None]:
# Add a label to each region using a comprehension
labeled_regions = [
    {**entry, "label": f"{entry['region'].title()} Region"} for entry in regions
]
labeled_regions
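
Comprehensions can also build dictionaries, which suits the small lookup tables mentioned earlier. The following is a minimal sketch that assumes the same `regions` structure; the name `timezone_by_region` is illustrative and not part of the tutorial's dataset.

```python
# Same sample data as the tutorial's `regions` list
regions = [
    {"region": "north", "timezone": "CST"},
    {"region": "south", "timezone": "CST"},
    {"region": "east", "timezone": "EST"},
    {"region": "west", "timezone": "PST"},
]

# Dict comprehension: map each region name to its timezone (illustrative lookup table)
timezone_by_region = {entry["region"]: entry["timezone"] for entry in regions}

print(timezone_by_region["east"])
# EST
```

A dictionary keyed by region gives constant-time lookups in plain Python code that runs before any data reaches Spark.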


## Loading Data with Path Helpers

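Python's `pathlib` is handy for pointing Spark to datasets without hard-coding absolute paths. The reader's `csv()` method takes a string path, so the example converts the `Path` object with `str()` before passing it in.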


In [None]:
from pathlib import Path
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('PythonForPySpark').getOrCreate()

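# The working directory is treated as the repo root if it contains notebooks/
repo_root = Path.cwd()
if (repo_root / 'notebooks').exists():
    data_path = repo_root / 'notebooks' / 'data' / 'orders_demo.csv'
else:
    # Otherwise assume the working directory is a sibling of the data/ folder
    data_path = Path('..') / 'data' / 'orders_demo.csv'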

orders_df = (
    spark.read
    .option('header', True)
    .option('inferSchema', True)
    .csv(str(data_path))
)
orders_df.show()
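
If the notebook might be launched from several different folders, one hedged alternative to the two-branch check is to walk up the directory tree until the file is found. This is a sketch only: `locate_dataset` is a hypothetical helper, not part of PySpark or this tutorial's repository.

```python
from pathlib import Path


def locate_dataset(relative_path, start=None):
    """Search `start` and each of its parent directories for `relative_path`.

    Hypothetical helper for illustration. Returns the first existing match
    or raises FileNotFoundError if no directory on the way up contains it.
    """
    start = Path(start).resolve() if start is not None else Path.cwd()
    # Check the starting directory first, then every ancestor up to the filesystem root
    for base in (start, *start.parents):
        candidate = base / relative_path
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"{relative_path} not found from {start} upward")
```

A call such as `locate_dataset(Path('notebooks') / 'data' / 'orders_demo.csv')` would return a `Path`, which still needs `str()` before it is passed to `spark.read.csv`.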


## Mapping Python Functions to Transform Data

Before creating Spark DataFrames, you may need to normalize raw Python lists. Functions keep that logic reusable.


In [None]:
def annotate_region(entry):
    """Return a tuple that Spark can ingest, tagging coastal regions."""
    coastal = entry['region'] in {'east', 'west'}
    return entry['region'], entry['timezone'], coastal

region_tuples = [annotate_region(item) for item in labeled_regions]
region_tuples
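
The comprehension above is one way to apply a function to every element. The built-in `map` does the same job, as this small sketch shows. It redefines `annotate_region` so the block runs on its own, and `sample_regions` is an illustrative subset of the data.

```python
def annotate_region(entry):
    """Return (region, timezone, is_coastal) for one region dictionary."""
    coastal = entry["region"] in {"east", "west"}
    return entry["region"], entry["timezone"], coastal


# Illustrative subset of the tutorial's regions
sample_regions = [
    {"region": "north", "timezone": "CST"},
    {"region": "east", "timezone": "EST"},
]

# map() applies the function lazily; list() materializes the results
region_tuples_via_map = list(map(annotate_region, sample_regions))
print(region_tuples_via_map)
# [('north', 'CST', False), ('east', 'EST', True)]
```

Both forms produce the same list; comprehensions tend to read better when the transformation also needs a filter or an inline expression.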


## Creating DataFrames from Python Objects

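Once your Python data is structured, `createDataFrame` turns it into a Spark DataFrame. Passing a list of column names as the schema, as below, lets Spark infer each column's type from the data.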


In [None]:
schema = ['region', 'timezone', 'is_coastal']
regions_df = spark.createDataFrame(region_tuples, schema)
regions_df.show()
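
When only column names are supplied, a malformed tuple may surface as a confusing error or an unexpected column type. One hedged option is to check the rows in plain Python first. In this sketch, `EXPECTED_COLUMNS` and `validate_rows` are hypothetical names, and the expected types are an assumption that mirrors the tuples built earlier.

```python
# Expected column names and Python types (an assumption mirroring the tutorial's tuples)
EXPECTED_COLUMNS = [("region", str), ("timezone", str), ("is_coastal", bool)]


def validate_rows(rows, expected=EXPECTED_COLUMNS):
    """Raise ValueError or TypeError if a row has the wrong length or a wrongly typed value."""
    for index, row in enumerate(rows):
        if len(row) != len(expected):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(expected)}"
            )
        for value, (name, expected_type) in zip(row, expected):
            if not isinstance(value, expected_type):
                raise TypeError(
                    f"row {index}: {name!r} should be {expected_type.__name__}"
                )
    return rows


checked = validate_rows([("north", "CST", False), ("east", "EST", True)])
```

The validated list could then be passed to `spark.createDataFrame(checked, [name for name, _ in EXPECTED_COLUMNS])`.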


## Combining Python Logic with Spark SQL Functions

Use Python conditionals to choose Spark expressions dynamically, keeping complex business rules readable.


In [None]:
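def demand_category_expression(threshold):
    """Return a Column expression labeling each row 'high' or 'steady' by order count."""
    if threshold < 12:
        # Below 12, only counts strictly above the threshold are 'high'
        return F.when(F.col('orders') > threshold, 'high').otherwise('steady')
    # At 12 and above, counts equal to the threshold are also 'high'
    return F.when(F.col('orders') >= threshold, 'high').otherwise('steady')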

demand_expr = demand_category_expression(threshold=13)
orders_with_category = orders_df.withColumn('demand_level', demand_expr)
orders_with_category.orderBy('order_date', 'region').show()
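
The same pattern also works with SQL expression strings: Python picks the comparison operator, and the resulting text could be handed to `F.expr`. This is a minimal sketch. `demand_category_sql` is a hypothetical helper, and the `orders` column name is taken from the example above.

```python
def demand_category_sql(threshold):
    """Build a CASE expression string mirroring demand_category_expression."""
    # Strict comparison below 12, inclusive comparison otherwise (same rule as above)
    operator = ">" if threshold < 12 else ">="
    return f"CASE WHEN orders {operator} {threshold} THEN 'high' ELSE 'steady' END"


print(demand_category_sql(13))
# CASE WHEN orders >= 13 THEN 'high' ELSE 'steady' END
```

With a live session, the string could be applied as `orders_df.withColumn('demand_level', F.expr(demand_category_sql(13)))`. Because the threshold is interpolated into SQL text, only trusted numeric values should be passed in.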


## Clean Up

Always stop the SparkSession when your notebook run is complete.


In [None]:
spark.stop()


## Exercises

- Write a comprehension that filters the `regions` list to only coastal entries and prints the result.
- Implement a helper function that normalizes region names (uppercase and trimmed) before building the Spark DataFrame.
- Save your Python helpers to a separate `.py` module and import them back into the notebook to reinforce reuse patterns.
