
# Chapter 3: Joining DataFrames & Data Integration

Welcome to Chapter 3! In this notebook, you'll master all types of joins and data integration techniques in PySpark using realistic datasets. This chapter is standalone: all data is freshly loaded and cleaned before you begin practicing.

## What You'll Do
- Practice all major join types and integration scenarios on real-world data
- Learn with generic sample syntax, then apply concepts to your actual DataFrames
- Tackle interview-style and real-world analytics questions



## Important Instructions
- Sample syntax is for illustration only and uses generic DataFrame names (e.g., `df1`, `df2`, `input_df`).
- Always use the actual DataFrame names provided in the practice questions (e.g., `customers_df`, `orders_df`).
- Do not copy-paste the sample code for the practice question. Try to solve it yourself using the actual DataFrame.
- This is for your own practice, so type the commands even if the question is similar to the example.
- Don't run the sample syntax as written: the generic DataFrame names are placeholders, and reassigning real DataFrames may modify the data used in later exercises.
- Avoid using AI for code completion.
- Play around and try out a few more for your understanding.



## Data Preparation (Run This First)
This section downloads, loads, and cleans all datasets so you can start joining and integrating data without running previous chapters.


In [None]:

# Download the data
!wget -O customers.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/customers.csv
!wget -O products.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/products.csv
!wget -O orders.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/orders.csv
!wget -O order_items.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/order_items.csv
!wget -O employees.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/employees.csv
!wget -O transactions.csv https://raw.githubusercontent.com/icyanide9/de-practice/refs/heads/main/transactions.csv

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.appName("PySpark Join Practice").getOrCreate()

# Load DataFrames
df_map = {}
for name in ["customers", "products", "orders", "order_items", "employees", "transactions"]:
    df_map[name] = spark.read.csv(f"{name}.csv", header=True, inferSchema=True)

customers_df = df_map["customers"]
products_df = df_map["products"]
orders_df = df_map["orders"]
order_items_df = df_map["order_items"]
employees_df = df_map["employees"]
transactions_df = df_map["transactions"]

# Data cleaning (minimal, for joins): drop exact duplicate rows, then rows with nulls in required columns
customers_df = customers_df.dropDuplicates().dropna(subset=["customer_id", "name", "email"])
products_df = products_df.dropDuplicates().dropna(subset=["product_id", "product_name", "price"])
orders_df = orders_df.dropDuplicates().dropna(subset=["order_id", "customer_id", "order_amount"])
order_items_df = order_items_df.dropDuplicates().dropna(subset=["order_item_id", "order_id", "product_id"])
employees_df = employees_df.dropDuplicates().dropna(subset=["employee_id", "name"])
transactions_df = transactions_df.dropDuplicates().dropna(subset=["transaction_id", "customer_id", "amount"])



## Table Overview
- **customers_df**: customer_id, name, email, phone, address, registration_date, status
- **products_df**: product_id, product_name, category, price, stock_quantity
- **orders_df**: order_id, customer_id, order_date, order_amount, order_status, payment_method
- **order_items_df**: order_item_id, order_id, product_id, quantity, item_total
- **employees_df**: employee_id, name, department, hire_date, salary, manager_id
- **transactions_df**: transaction_id, customer_id, transaction_date, amount, transaction_type, location, created_at



### 1. Introduction to Joins

**Concept:** Joins combine rows from two or more DataFrames based on a related column. PySpark supports inner, left outer, right outer, full outer, left semi, left anti, and cross joins.

**Sample Syntax (Generic):**
```python
df1.join(df2, df1.col1 == df2.col2, "inner")
```

**Practice:**
- List all join types available in PySpark and briefly describe each.

**Expected Output:**
- A markdown/text cell with join type descriptions.

**Additional Challenge:**
- Give a real-world example for each join type using your data tables.


In [None]:
#practice here


### 2. Basic Join Syntax

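**Concept:** Use `df1.join(df2, on, how)` to combine DataFrames. `on` is the join condition (a column name, a list of names, or a Column expression) and `how` is the join type, which defaults to `inner`.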

**Sample Syntax (Generic):**
```python
joined_df = df1.join(df2, df1.col1 == df2.col2, "inner")
```

**Practice:**
- Join `orders_df` with `customers_df` to get customer details for each order.

**Expected Output:**
- A DataFrame with order and customer columns.

**Additional Challenge:**
- Join `order_items_df` with `products_df` to get product details for each order item.


In [None]:
#practice here


### 3. Inner Join

**Concept:** Returns only the rows whose keys match in both DataFrames; unmatched rows from either side are dropped.

**Sample Syntax (Generic):**
```python
result_df = df1.join(df2, df1.key == df2.key, "inner")
```

**Practice:**
- Get all orders with valid customers (inner join `orders_df` and `customers_df`).

**Expected Output:**
- A DataFrame with only orders that have a matching customer.

**Additional Challenge:**
- Get all order items with valid products (inner join `order_items_df` and `products_df`).


In [None]:
#practice here


### 4. Left, Right, and Full Outer Joins

**Concept:**
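- Left: All rows from the left DataFrame; right-side columns are null where no match exists
- Right: All rows from the right DataFrame; left-side columns are null where no match exists
- Full: All rows from both DataFrames, matched where possible and null-filled elsewhere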

**Sample Syntax (Generic):**
```python
left_df = df1.join(df2, df1.key == df2.key, "left")
right_df = df1.join(df2, df1.key == df2.key, "right")
full_df = df1.join(df2, df1.key == df2.key, "outer")
```

**Practice:**
- Find all customers and their orders (left join `customers_df` and `orders_df`).
- Find all products and their order items (right join `order_items_df` and `products_df`).

**Expected Output:**
- DataFrames showing all left/right/full join results.

**Additional Challenge:**
- Find all orders and their order items (full outer join `orders_df` and `order_items_df`).


In [None]:
#practice here


### 5. Semi and Anti Joins

**Concept:**
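- Semi: Returns each row from left that has at least one match in right; only left columns are kept, and left rows are not duplicated when several right rows match
- Anti: Returns rows from left where no match exists in right; only left columns are kept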

**Sample Syntax (Generic):**
```python
semi_df = df1.join(df2, df1.key == df2.key, "left_semi")
anti_df = df1.join(df2, df1.key == df2.key, "left_anti")
```

**Practice:**
- Find customers who have placed at least one order (semi join `customers_df` and `orders_df`).
- Find customers who have never placed an order (anti join `customers_df` and `orders_df`).

**Expected Output:**
- DataFrames with only the relevant customers.

**Additional Challenge:**
- Find products that have never been ordered (anti join `products_df` and `order_items_df`).


In [None]:
#practice here


### 6. Joining Multiple DataFrames

**Concept:** Chain multiple joins to combine more than two DataFrames.

**Sample Syntax (Generic):**
```python
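# key1 and key2 are placeholder column names shared by the joined DataFrames
joined_df = df1.join(df2, "key1", "inner").join(df3, "key2", "inner")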
```

**Practice:**
- Get order details with product names and customer info (join `orders_df`, `order_items_df`, `products_df`, and `customers_df`).

**Expected Output:**
- A DataFrame with order, product, and customer columns.

**Additional Challenge:**
- For each transaction, get the customer name and all related orders.


In [None]:
#practice here


### 7. Handling Duplicate Columns and Name Conflicts

**Concept:** Use `alias`, `select`, and `drop` to resolve duplicate column names after joins.

**Sample Syntax (Generic):**
```python
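from pyspark.sql.functions import col

df1 = df1.alias("a")
df2 = df2.alias("b")
joined_df = df1.join(df2, col("a.key") == col("b.key"))
# Use select to pick/rename columns; `value` stands in for any name present on both sides
result_df = joined_df.select(col("a.key"), col("a.value").alias("a_value"), col("b.value").alias("b_value"))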
```

**Practice:**
- Join `orders_df` and `customers_df`, then select only one `customer_id` column and rename as needed.

**Expected Output:**
- A DataFrame with no duplicate columns.

**Additional Challenge:**
- Join `order_items_df` and `products_df`, then select and rename columns for reporting.


In [None]:
#practice here


### 8. Join Conditions Beyond Equality

**Concept:** Joins can use expressions, not just equality (e.g., date ranges).

**Sample Syntax (Generic):**
```python
from pyspark.sql.functions import expr
# Alias both sides so the SQL expression can refer to them by name
joined_df = df1.alias("a").join(df2.alias("b"), expr("a.col1 >= b.col2"), "inner")
```

**Practice:**
- Join `transactions_df` to `orders_df` on `customer_id`, keeping only pairs where `transaction_date` is within 7 days of `order_date`.

**Expected Output:**
- A DataFrame with joined transactions and orders.

**Additional Challenge:**
- Join `employees_df` to itself to get each employee and their manager's name.


In [None]:
#practice here


### 9. Broadcast Joins

**Concept:** `broadcast()` hints that a DataFrame is small enough to copy to every executor, so Spark can join the large side without shuffling it.

**Sample Syntax (Generic):**
```python
from pyspark.sql.functions import broadcast
joined_df = df1.join(broadcast(df2), df1.key == df2.key)
```

**Practice:**
- Broadcast join `products_df` (if small) to `order_items_df` for performance.

**Expected Output:**
- The joined DataFrame; calling `explain()` on it should show a broadcast hash join in the physical plan.

**Additional Challenge:**
- Broadcast join a small lookup table (e.g., department info) to `employees_df`.


In [None]:
#practice here


### 10. Dealing with Nulls and Missing Keys

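**Concept:** In an equality join, a null key never matches anything, including another null. Rows with null keys are dropped by inner joins and appear as unmatched in outer joins, so check for nulls and missing keys before and after joining.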

**Sample Syntax (Generic):**
```python
# Filter out null keys before join
df1 = df1.filter(df1.key.isNotNull())
```

**Practice:**
- Find orders with missing customer references (left join and filter where customer columns are null).

**Expected Output:**
- A DataFrame with orders that have no matching customer.

**Additional Challenge:**
- Find order items with missing product references.


In [None]:
#practice here


### 11. Self Joins

**Concept:** Join a DataFrame to itself to relate rows (e.g., employees and managers).

**Sample Syntax (Generic):**
```python
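from pyspark.sql.functions import col
df1 = df.alias("a")
df2 = df.alias("b")
# Refer to columns through the aliases; df1.manager_id and df2.employee_id can resolve to the same underlying DataFrame
joined_df = df1.join(df2, col("a.manager_id") == col("b.employee_id"))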
```

**Practice:**
- Join `employees_df` to itself to get each employee and their manager's name.

**Expected Output:**
- A DataFrame with employee and manager columns.

**Additional Challenge:**
- Find employees who are not managers of anyone.


In [None]:
#practice here


### 12. Real-World Scenarios & Challenges

- Many-to-many joins: Customers and products via orders and order_items
- Joining with aggregated data: Customer with their total spend
- Data enrichment: Adding external reference data (e.g., product categories)

**Try to solve these using the concepts above!**


In [None]:
#practice here