# Joining DataFrames with Spark

Learn how to combine multiple DataFrames in PySpark using different join strategies while keeping data lineage clear.

## Setup

Start or reuse a SparkSession. We'll continue using the shared orders dataset stored under `notebooks/data/orders_demo.csv`.

In [None]:
from pathlib import Path
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('SparkJoinsTutorial').getOrCreate()

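# Resolve the CSV path from the current working directory.
repo_root = Path.cwd()
if (repo_root / 'notebooks').exists():
    # Running from the repository root.
    orders_path = repo_root / 'notebooks' / 'data' / 'orders_demo.csv'
else:
    # Otherwise assume the working directory is a sibling of data/.
    orders_path = Path('..') / 'data' / 'orders_demo.csv'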

orders_df = (
    spark.read
    .option('header', True)
    .option('inferSchema', True)
    .csv(str(orders_path))
)
orders_df.show()


## Prepare Dimension Data

A join needs a key column that both DataFrames share. Here we create a small region dimension, keyed on `region`, whose extra attributes enrich the orders dataset.

In [None]:
region_info = [
    ('east', 'East Coast', 'EST'),
    ('north', 'Northern Region', 'CST'),
    ('south', 'Southern Region', 'CST'),
    ('west', 'Western Region', 'PST'),
]
columns = ['region', 'region_name', 'timezone']
regions_df = spark.createDataFrame(region_info, columns)
regions_df.show()


## Inner Join

An inner join returns only the rows whose key appears in both DataFrames; rows without a match on either side are dropped.

In [None]:
inner_joined = orders_df.join(regions_df, on='region', how='inner')
inner_joined.orderBy('order_date', 'region').show()


## Left Join

A left join keeps every row from the left DataFrame. Where the right DataFrame has no matching key, the right-side columns are filled with nulls.

In [None]:
left_joined = orders_df.join(regions_df, on='region', how='left')
left_joined.orderBy('order_date', 'region').show()


## Handling Missing Matches

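If a region appears in the orders data but not in the dimension table, a left join exposes the gap as nulls in the dimension columns, whereas an inner join would drop those orders entirely. Let's simulate this by adding an order for a region the dimension table does not contain.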

In [None]:
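# Append one order for 'central', a region absent from regions_df.
# The tuple must supply one value per column of orders_df, in the same order.
augmented_orders_df = orders_df.unionByName(
    spark.createDataFrame([('2024-01-04', 'central', 8)], orders_df.columns)
)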
left_with_missing = augmented_orders_df.join(regions_df, on='region', how='left')
left_with_missing.orderBy('order_date', 'region').show()


## Using Broadcast Joins

When the dimension table is small enough to fit in memory on each executor, broadcasting it lets Spark join without shuffling the larger DataFrame, which usually speeds up the join.

In [None]:
broadcast_joined = orders_df.join(F.broadcast(regions_df), on='region', how='inner')
broadcast_joined.explain(mode='formatted')


## Clean Up

Stop the SparkSession to release resources when you finish experimenting.

In [None]:
spark.stop()


## Exercises

- Create a small product dimension DataFrame and join it to the orders dataset using an inner join.
- Demonstrate a left anti join to identify regions that appear in the dimension table but not in the orders data.
- Convert one of the joins to use an explicit join condition involving multiple columns (e.g., region plus day part).
