# PySpark Basics - Practice Notebook

This notebook covers fundamental PySpark concepts for ETL operations.

## Topics Covered:
1. Creating SparkSession
2. Reading Data
3. Basic DataFrame Operations
4. Filtering and Selecting
5. Aggregations

In [None]:
# Import required libraries
import sys
from pathlib import Path

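# Add the project's src directory to the import path.
# Assumes the notebook is run from a folder two levels below the project root.
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root / 'src'))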

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

## 1. Create Spark Session

In [None]:
# Create Spark Session
spark = SparkSession.builder \
    .appName("PySpark Basics Practice") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "2") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

## 2. Read Sample Data

In [None]:
# Read customers data
customers_df = spark.read.csv(
    str(project_root / 'data/sample/customers.csv'),
    header=True,
    inferSchema=True
)

# Display schema
customers_df.printSchema()

# Show first few rows
customers_df.show(5)

## 3. Basic DataFrame Operations

In [None]:
# Count rows
print(f"Total rows: {customers_df.count()}")

# Get column names
print(f"Columns: {customers_df.columns}")

# Summary statistics (count, mean, stddev, min, max) for numeric and string columns
customers_df.describe().show()

## 4. Filtering and Selecting

In [None]:
# Select specific columns
customers_df.select("customer_id", "first_name", "last_name", "city").show(5)

# Filter rows
print("\nCustomers with age > 30:")
customers_df.filter(F.col("age") > 30).show()

# Multiple conditions
print("\nCustomers with age > 30 AND from USA:")
customers_df.filter(
    (F.col("age") > 30) & (F.col("country") == "USA")
).show()

## 5. Aggregations

In [None]:
# Group by and aggregate
print("Customers by city:")
customers_df.groupBy("city").count().orderBy("count", ascending=False).show()

# Multiple aggregations
print("\nAge statistics by city:")
customers_df.groupBy("city").agg(
    F.avg("age").alias("avg_age"),
    F.min("age").alias("min_age"),
    F.max("age").alias("max_age"),
    F.count("*").alias("customer_count")
).show()

## Practice Exercises

Try these exercises on your own:

1. Find all customers with email addresses containing 'gmail'
2. Calculate the average age of all customers
3. Find customers who signed up in 2023
4. Create a new column 'full_name' by concatenating first_name and last_name
5. Find the top 3 cities with the most customers

In [None]:
# Your code here


In [None]:
# Stop Spark session
spark.stop()