
# PySpark Data Engineering Practice: End-to-End Hands-On

Welcome to this hands-on PySpark data engineering practice series! This notebook (and the series) is designed to help you master real-world data engineering skills using PySpark, with a focus on practical, interview-style scenarios.

## What You'll Do

- Work with realistic, joinable datasets (Customers, Orders, Products, Order Items, Employees, Transactions) containing intentional data quality issues.
- Practice a wide range of data engineering concepts, from data cleaning and transformation to advanced aggregations, joins, UDFs, Spark SQL, and performance optimization.
- Learn by doing: each topic includes an explanation, sample code, a practice question, the expected output, and an additional challenge.


In [None]:

# Download the data (replace URLs with your actual GitHub repo links)
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/customers.csv
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/products.csv
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/orders.csv
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/order_items.csv
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/employees.csv
!wget https://raw.githubusercontent.com/yourusername/yourrepo/main/transactions.csv


In [None]:

# Install PySpark if not already installed
!pip install pyspark

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Create (or reuse) the Spark session used throughout the notebook
spark = (
    SparkSession.builder
    .appName("PySpark Data Engineering Practice")
    .getOrCreate()
)



## Table Overview

**customers**
Stores customer details: customer_id, name, email, phone, address, registration_date, status.
*Issues included: missing values, duplicates, invalid formats, inconsistent phone numbers.*

**products**
Product catalog: product_id, product_name, category, price, stock_quantity.
*Issues included: duplicates, missing values, zero/invalid values, inconsistent capitalization.*

**orders**
Customer orders: order_id, customer_id, order_date, order_amount, order_status, payment_method.
*Issues included: duplicates, unknown/missing values, zero/invalid values, foreign key issues.*

**order_items**
Items in each order: order_item_id, order_id, product_id, quantity, item_total.
*Issues included: duplicates, zero/invalid values, foreign key issues.*

**employees**
Employee details: employee_id, name, department, hire_date, salary, manager_id.
*Issues included: duplicates, missing values, zero/invalid values, inconsistent department names.*

**transactions**
Customer transactions: transaction_id, customer_id, transaction_date, amount, transaction_type, location, created_at (timestamp).
*Issues included: duplicates, missing/unknown/invalid values, foreign key issues. The `created_at` timestamp is provided for time-based operations.*



## Topics Covered in This Series

This series is organized into six main chapters, each focusing on a core area of PySpark data engineering:

1. **Data Transformation**
   Use PySpark functions for cleaning, transforming, and enriching data.
   *Topics: data cleaning, type conversion, string/date manipulation, filtering, new columns, renaming, sorting, sampling, etc.*

2. **Data Aggregation**
   Practice using `groupBy`, `agg`, `pivot`, and window functions to perform aggregations and derive insights.

3. **Joining Data**
   Master different types of joins (inner, outer, left, right) to combine data from multiple DataFrames.

4. **User-Defined Functions (UDFs)**
   Learn how to create and apply UDFs for custom data transformations.

5. **Spark SQL**
   Practice writing SQL queries against Spark DataFrames using `spark.sql`.

6. **Performance Optimization**
   Explore techniques like partitioning, caching, and broadcasting to improve query performance.

---

Each chapter will follow this structure for every concept:
- **Concept Explanation**
- **Sample Syntax**
- **Practice Question**
- **Expected Output**
- **Additional Challenge**

---

**Let’s get started!**


# Chapter 1: Data Transformation


## 1. Data Cleaning

**Concept Explanation:**  
Data cleaning is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality. Common tasks include handling missing values, removing duplicates, standardizing formats, and correcting typos.


In [None]:

# Sample Syntax
# Remove duplicate customers
customers_df = customers_df.dropDuplicates()

# Drop rows with missing email
customers_df = customers_df.dropna(subset=["email"])

# Fill missing phone numbers with 'Unknown'
customers_df = customers_df.fillna({"phone": "Unknown"})



**Practice Question:**  
Clean the `customers` DataFrame by:
- Removing duplicate records (based on all columns)
- Dropping rows where the `name` or `email` is missing
- Filling missing `phone` values with "Unknown"



**Expected Output:**  
A cleaned DataFrame with no duplicate customers, no missing names or emails, and all phone numbers filled (no nulls).



**Additional Challenge:**  
Standardize all email addresses to lowercase and remove any leading/trailing spaces in the `name` column.



## 2. Data Type Conversion

**Concept Explanation:**  
Data type conversion ensures that each column in your DataFrame has the correct data type for analysis and processing. For example, converting a string date to a DateType, or a string number to IntegerType.


In [None]:

from pyspark.sql.functions import col, to_date

# Convert registration_date from string to date
customers_df = customers_df.withColumn("registration_date", to_date(col("registration_date"), "yyyy-MM-dd"))

# Convert price from string to float
products_df = products_df.withColumn("price", col("price").cast("float"))



**Practice Question:**  
Convert the following columns to appropriate types:
- `registration_date` in `customers` to DateType
- `price` in `products` to FloatType
- `order_amount` in `orders` to FloatType



**Expected Output:**  
The specified columns should have the correct data types (date or float) in their respective DataFrames.



**Additional Challenge:**  
Convert the `created_at` column in `transactions` to a TimestampType.



## 3. String Manipulation

**Concept Explanation:**  
String manipulation involves extracting, replacing, or transforming text data. PySpark provides functions for trimming, changing case, extracting substrings, and more.


In [None]:

from pyspark.sql.functions import col, lower, split, substring, trim, upper

# Extract email domain
customers_df = customers_df.withColumn("email_domain", split(col("email"), "@")[1])

# Trim spaces from name
customers_df = customers_df.withColumn("name", trim(col("name")))

# Convert product_name to uppercase
products_df = products_df.withColumn("product_name", upper(col("product_name")))



**Practice Question:**  
For the `customers` DataFrame:
- Create a new column `email_domain` that contains only the domain part of the email.
- Trim any leading/trailing spaces from the `name` column.



**Expected Output:**  
A DataFrame with a new `email_domain` column and all names properly trimmed.



**Additional Challenge:**  
For the `products` DataFrame, create a new column `short_name` that contains the first 5 characters of the product name in uppercase.



## 4. Filtering and Selection

**Concept Explanation:**  
Filtering and selection allow you to retrieve specific rows or columns based on conditions, such as selecting only active customers or orders above a certain amount.


In [None]:

# Select only active customers
active_customers = customers_df.filter(col("status") == "Active")

# Select orders with amount greater than 100
large_orders = orders_df.filter(col("order_amount") > 100)

# Select specific columns
selected_customers = customers_df.select("customer_id", "name", "email")



**Practice Question:**  
From the `orders` DataFrame, select all orders with `order_amount` greater than 100 and `order_status` as "Completed".



**Expected Output:**  
A DataFrame containing only completed orders with an amount greater than 100.



**Additional Challenge:**  
From the `customers` DataFrame, select all customers who registered after "2022-01-01" and have status "Active".



## 5. Creating New Columns

**Concept Explanation:**  
Creating new columns allows you to derive additional information from existing data, such as calculating tenure, categorizing values, or combining fields.


In [None]:

from pyspark.sql.functions import col, year, when

# Calculate customer tenure in years
customers_df = customers_df.withColumn("tenure_years", 2025 - year(col("registration_date")))

# Categorize order amount
orders_df = orders_df.withColumn(
    "order_size",
    when(col("order_amount") > 200, "Large")
    .when(col("order_amount") > 50, "Medium")
    .otherwise("Small")
)



**Practice Question:**  
Add a new column `tenure_years` to the `customers` DataFrame, representing the number of years since registration (assume current year is 2025).



**Expected Output:**  
A DataFrame with a new `tenure_years` column showing the correct tenure for each customer.



**Additional Challenge:**  
In the `orders` DataFrame, add a column `is_high_value` that is `True` if `order_amount` is greater than 250, otherwise `False`.



## 6. Renaming Columns

**Concept Explanation:**  
Renaming columns improves clarity and consistency in your DataFrames, especially when preparing data for downstream processes.


In [None]:

# Rename 'name' to 'customer_name'
customers_df = customers_df.withColumnRenamed("name", "customer_name")

# Rename multiple columns
products_df = products_df.withColumnRenamed("product_name", "name").withColumnRenamed("stock_quantity", "stock")



**Practice Question:**  
Rename the `name` column in the `customers` DataFrame to `customer_name`.



**Expected Output:**  
A DataFrame where the column is now called `customer_name`.



**Additional Challenge:**  
Rename the `order_amount` column in the `orders` DataFrame to `total_amount`.



## 7. Sorting Data

**Concept Explanation:**  
Sorting arranges your data in a specific order based on one or more columns, which is useful for reporting, ranking, or preparing data for further analysis.


In [None]:

# Sort customers by registration_date descending
customers_df = customers_df.orderBy(col("registration_date").desc())

# Sort products by price ascending
products_df = products_df.orderBy("price")



**Practice Question:**  
Sort the `products` DataFrame by `price` in descending order.



**Expected Output:**  
A DataFrame of products sorted from highest to lowest price.



**Additional Challenge:**  
Sort the `customers` DataFrame by `tenure_years` (created in section 5, Creating New Columns) in ascending order.



## 8. Sampling Data

**Concept Explanation:**  
Sampling allows you to work with a subset of your data, which is useful for testing, prototyping, or when working with very large datasets.


In [None]:

# Take a random 10% sample of customers
sample_customers = customers_df.sample(fraction=0.1, seed=42)

# Take the first 5 rows (a fixed count, but not a random sample)
sample_products = products_df.limit(5)



**Practice Question:**  
Take a random sample of 5 customers from the `customers` DataFrame.



**Expected Output:**  
A DataFrame containing 5 randomly selected customers.



**Additional Challenge:**  
Take a 20% random sample of the `orders` DataFrame.



## 9. Date Handling and Transformations

**Concept Explanation:**  
Date handling includes parsing strings to dates, extracting date parts, performing date arithmetic, formatting, and handling time zones or nulls.


In [None]:

from pyspark.sql.functions import col, to_date, year, month, datediff, current_date

# Parse registration_date to date
customers_df = customers_df.withColumn("registration_date", to_date(col("registration_date"), "yyyy-MM-dd"))

# Extract year and month
customers_df = customers_df.withColumn("reg_year", year(col("registration_date")))
customers_df = customers_df.withColumn("reg_month", month(col("registration_date")))

# Calculate days since registration
customers_df = customers_df.withColumn("days_since_registration", datediff(current_date(), col("registration_date")))



**Practice Question:**  
For the `transactions` DataFrame, extract the year and month from the `created_at` timestamp into new columns `year` and `month`.



**Expected Output:**  
A DataFrame with two new columns: `year` and `month`, showing the year and month of each transaction.



**Additional Challenge:**  
Calculate the number of days between `transaction_date` and `created_at` for each transaction.



## 10. Date Range Filtering

**Concept Explanation:**  
Date range filtering allows you to select records that fall within a specific date or timestamp range, which is common in time-based analyses.


In [None]:

from pyspark.sql.functions import col, to_date

# Filter orders between two dates
start_date = '2023-01-15'
end_date = '2023-02-10'
filtered_orders = orders_df.filter((col('order_date') >= start_date) & (col('order_date') <= end_date))



**Practice Question:**  
Filter the `orders` DataFrame to include only orders placed between '2023-01-20' and '2023-02-05' (inclusive).



**Expected Output:**  
A DataFrame containing only orders within the specified date range.



**Additional Challenge:**  
Filter the `transactions` DataFrame to include only transactions that occurred in February 2023.
