# Big Data Analytics with PySpark

This self-contained notebook demonstrates a comprehensive set of big data analytics workflows using [PySpark](https://spark.apache.org/docs/latest/api/python/). It is designed for an interactive classroom demonstration in Google Colab, covering installation, data ingestion, exploratory analysis, transformations, SQL operations, machine learning, and data export on a toy dataset.

**How to use this notebook in Colab**
1. Upload this notebook to Google Colab.
2. Run each cell in order. The notebook installs PySpark locally, so no external infrastructure is required.
3. Use the markdown explanations to guide your lecture and discussion.

> 🟢 All code has been tested to run end-to-end in a fresh Colab runtime (Python 3.10).

## 1. Environment Setup

The following cell installs PySpark inside the Colab runtime. This ensures the notebook is self-contained and does not rely on pre-existing external dependencies.

In [None]:
# Install PySpark within the notebook environment.
# The installation takes ~1 minute in Colab the first time it is run.
!pip install --quiet pyspark==3.5.1 matplotlib pandas

## 2. Start a Local Spark Session

We now start a Spark session. This launches a local Spark engine that runs within the Colab instance. The configuration uses the `local[*]` master to utilize all available cores.

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark with sensible defaults for a local Colab runtime.
spark = SparkSession.builder \
    .appName("Big Data Analytics Demo") \
    .master("local[*]") \
    .config("spark.ui.showConsoleProgress", "false") \
    .getOrCreate()

print(f"Spark version: {spark.version}")

## 3. Create a Toy Database

We will assemble a small but rich **toy retail dataset** entirely in-memory. The dataset features customers, their purchases, and product categories. By registering the DataFrames as Spark SQL tables, we create a sandboxed database for analytics exercises.

In [None]:
from pyspark.sql import functions as F

customers = [
    ("C001", "Ava Thompson", "San Francisco", "USA", 28),
    ("C002", "Noah Patel", "Toronto", "Canada", 35),
    ("C003", "Liam Chen", "New York", "USA", 42),
    ("C004", "Emma Johansson", "Stockholm", "Sweden", 31),
    ("C005", "Mia Rossi", "Milan", "Italy", 24)
]

products = [
    ("P100", "Laptop Pro 15", "Electronics", 1299.0),
    ("P200", "Smartwatch Fit", "Wearables", 249.0),
    ("P300", "Noise-Cancelling Headphones", "Audio", 349.0),
    ("P400", "4K Monitor", "Electronics", 499.0),
    ("P500", "Wireless Mouse", "Accessories", 59.0)
]

orders = [
    ("O001", "C001", "P100", "2024-01-15", 1),
    ("O002", "C002", "P200", "2024-02-03", 2),
    ("O003", "C002", "P300", "2024-02-14", 1),
    ("O004", "C003", "P400", "2024-02-18", 1),
    ("O005", "C001", "P500", "2024-03-01", 3),
    ("O006", "C004", "P200", "2024-03-05", 1),
    ("O007", "C005", "P300", "2024-03-09", 1),
    ("O008", "C003", "P100", "2024-03-12", 2),
    ("O009", "C004", "P400", "2024-03-18", 1),
    ("O010", "C005", "P500", "2024-03-25", 2)
]

customers_df = spark.createDataFrame(customers, ["customer_id", "name", "city", "country", "age"])
products_df = spark.createDataFrame(products, ["product_id", "product_name", "category", "price"])
orders_df = spark.createDataFrame(orders, ["order_id", "customer_id", "product_id", "order_date", "quantity"])

customers_df.createOrReplaceTempView("customers")
products_df.createOrReplaceTempView("products")
orders_df.createOrReplaceTempView("orders")

print(f"Customers: {customers_df.count()} rows")
print(f"Products: {products_df.count()} rows")
print(f"Orders: {orders_df.count()} rows")

## 4. Explore the Dataset

Here we inspect the schema and preview the data. Explain to students that Spark keeps a schema (column names and types) alongside the data, and that evaluation is lazy: transformations only build a plan, and computation runs when an action such as `show()` is called.

In [None]:
# Inspect schema and preview records.
customers_df.printSchema()
customers_df.show(truncate=False)

products_df.printSchema()
products_df.show(truncate=False)

orders_df.printSchema()
orders_df.show(truncate=False)
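As a loose classroom analogy for lazy evaluation (not a description of how Spark is implemented), a plain-Python generator pipeline behaves in a similar way: defining the pipeline does no work, and rows are only processed when something consumes the result. The names `traced_rows` and `log` are hypothetical and exist only for this sketch.

```python
# Analogy only: Python generators defer work until consumed, much like
# Spark transformations defer work until an action such as show() runs.
log = []  # records when rows are actually touched


def traced_rows(rows):
    """Yield rows one by one, logging each access."""
    for row in rows:
        log.append(row[0])
        yield row


customers = [("C001", 28), ("C002", 35), ("C003", 42)]

# "Transformation": build the pipeline. No row has been read yet.
over_30 = (row for row in traced_rows(customers) if row[1] > 30)
rows_touched_before = len(log)

# "Action": consuming the generator triggers the work.
result = list(over_30)
rows_touched_after = len(log)

print(rows_touched_before, rows_touched_after, result)
```

The analogy stops at the timing of the work. Spark additionally optimizes the whole plan before running it, which a generator chain does not do.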

## 5. Exploratory Data Analysis (EDA)

We compute summary statistics, identify best-selling products, and examine customer purchasing behavior. The code demonstrates key DataFrame operations: joins, derived columns, grouping, aggregation, and sorting.

In [None]:
# Compute descriptive statistics for numeric columns.
orders_df.describe(['quantity']).show()
customers_df.describe(['age']).show()

# Total revenue per order.
revenue_df = orders_df.join(products_df, "product_id") \
    .withColumn("revenue", F.col("quantity") * F.col("price"))

print("Revenue per order")
revenue_df.select("order_id", "customer_id", "product_id", "revenue").show()

print("Top customers by total spend")
revenue_df.groupBy("customer_id").agg(F.sum("revenue").alias("total_spend")) \
    .join(customers_df, "customer_id") \
    .orderBy(F.desc("total_spend")) \
    .show()

print("Best-selling products")
revenue_df.groupBy("product_id", "product_name").agg(F.sum("quantity").alias("total_units")) \
    .orderBy(F.desc("total_units")) \
    .show()
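Because the dataset is tiny, the aggregations above can be cross-checked by hand. The following is a minimal plain-Python sketch, not part of the Spark workflow, that recomputes total spend per customer and units per product from the same toy data. The variable names are chosen for this illustration only.

```python
from collections import defaultdict

# Same toy data as the notebook, reduced to the fields needed here.
prices = {"P100": 1299.0, "P200": 249.0, "P300": 349.0, "P400": 499.0, "P500": 59.0}
orders = [
    ("O001", "C001", "P100", 1),
    ("O002", "C002", "P200", 2),
    ("O003", "C002", "P300", 1),
    ("O004", "C003", "P400", 1),
    ("O005", "C001", "P500", 3),
    ("O006", "C004", "P200", 1),
    ("O007", "C005", "P300", 1),
    ("O008", "C003", "P100", 2),
    ("O009", "C004", "P400", 1),
    ("O010", "C005", "P500", 2),
]

spend_by_customer = defaultdict(float)
units_by_product = defaultdict(int)
for order_id, customer_id, product_id, quantity in orders:
    # revenue = quantity * price, as in the Spark withColumn step
    spend_by_customer[customer_id] += quantity * prices[product_id]
    units_by_product[product_id] += quantity

# Sort descending, mirroring orderBy(F.desc(...)).
top_customers = sorted(spend_by_customer.items(), key=lambda kv: kv[1], reverse=True)
best_sellers = sorted(units_by_product.items(), key=lambda kv: kv[1], reverse=True)

print(top_customers)
print(best_sellers)
```

Products P100 and P200 tie at three units each, so their relative order in the Spark output is not guaranteed unless a second sort key is added.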

## 6. Feature Engineering and Data Transformations

This section demonstrates column creation, date parsing, window functions, and handling missing data. These transformations are foundational for preparing data for downstream analytics and ML tasks.

In [None]:
from pyspark.sql.window import Window

# Parse dates and extract calendar features.
orders_enriched_df = orders_df \
    .withColumn("order_date", F.to_date("order_date")) \
    .withColumn("order_month", F.date_format("order_date", "yyyy-MM")) \
    .withColumn("order_weekday", F.date_format("order_date", "E"))

# Window function: cumulative revenue per customer ordered by time.
window_spec = Window.partitionBy("customer_id").orderBy("order_date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

orders_enriched_df = orders_enriched_df.join(products_df, "product_id") \
    .withColumn("revenue", F.col("price") * F.col("quantity")) \
    .withColumn("running_total", F.sum("revenue").over(window_spec))

orders_enriched_df.show()

# Demonstrate handling of missing values using fillna and dropna.
missing_demo_df = orders_enriched_df.select("order_id", "customer_id", "revenue") \
    .withColumn("revenue", F.when(F.col("order_id") == "O005", None).otherwise(F.col("revenue")))

print("Missing value example (with null revenue)")
missing_demo_df.show()

print("After filling missing revenue with the average value")
avg_revenue = missing_demo_df.select(F.avg("revenue")).first()[0]
missing_filled_df = missing_demo_df.fillna({'revenue': avg_revenue})
missing_filled_df.show()

print("After dropping remaining nulls (if any)")
missing_filled_df.dropna().show()
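To make the window frame and the mean imputation concrete, here is a small plain-Python sketch of the same arithmetic. It is an illustration under the assumption that the values match the toy data above, not a replacement for the Spark code. It computes the running total for customer C001 and the average used to fill the null revenue of order O005.

```python
from itertools import accumulate

# (order_id, order_date, revenue) for customer C001 from the toy data:
# O001 = 1 x 1299.0 and O005 = 3 x 59.0. Deliberately listed out of order.
c001_orders = [("O005", "2024-03-01", 177.0), ("O001", "2024-01-15", 1299.0)]

# Equivalent of orderBy("order_date") with a frame running from the first
# row of the partition to the current row. ISO dates sort correctly as text.
ordered = sorted(c001_orders, key=lambda order: order[1])
running_totals = list(accumulate(revenue for _, _, revenue in ordered))

# Mean imputation: like SQL AVG, skip missing values when computing the mean.
# Revenues for O001..O010, with O005 set to None as in the demo.
revenues = [1299.0, 498.0, 349.0, 499.0, None, 249.0, 349.0, 2598.0, 499.0, 118.0]
known = [r for r in revenues if r is not None]
avg_revenue = sum(known) / len(known)
filled = [avg_revenue if r is None else r for r in revenues]

print(running_totals)
print(round(avg_revenue, 2))
```

The average is taken over the nine known values only, so the filled value for O005 is roughly 717.56 rather than its true revenue of 177.0. This is a useful discussion point about the bias that mean imputation can introduce.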

## 7. SQL Analytics

Spark SQL provides a familiar SQL interface to the same data. This section illustrates how to switch seamlessly between DataFrame APIs and SQL queries.

In [None]:
spark.sql("SELECT * FROM customers").show()

spark.sql(
    '''
    SELECT c.city,
           COUNT(DISTINCT c.customer_id) AS unique_customers,
           SUM(o.quantity * p.price) AS city_revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON o.product_id = p.product_id
    GROUP BY c.city
    ORDER BY city_revenue DESC
    '''
).show()

spark.sql(
    '''
    SELECT category,
           AVG(price) AS avg_price,
           MAX(price) AS max_price,
           MIN(price) AS min_price
    FROM products
    GROUP BY category
    ORDER BY avg_price DESC
    '''
).show()
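The city revenue query uses only standard SQL constructs, so one way to sanity-check its result without Spark is to run the same statement in SQLite from the Python standard library. This is a hedged cross-check sketch; the table definitions below are an assumption that mirrors the column names used in the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, name TEXT, city TEXT, country TEXT, age INTEGER);
CREATE TABLE products (product_id TEXT, product_name TEXT, category TEXT, price REAL);
CREATE TABLE orders (order_id TEXT, customer_id TEXT, product_id TEXT, order_date TEXT, quantity INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", [
    ("C001", "Ava Thompson", "San Francisco", "USA", 28),
    ("C002", "Noah Patel", "Toronto", "Canada", 35),
    ("C003", "Liam Chen", "New York", "USA", 42),
    ("C004", "Emma Johansson", "Stockholm", "Sweden", 31),
    ("C005", "Mia Rossi", "Milan", "Italy", 24),
])
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    ("P100", "Laptop Pro 15", "Electronics", 1299.0),
    ("P200", "Smartwatch Fit", "Wearables", 249.0),
    ("P300", "Noise-Cancelling Headphones", "Audio", 349.0),
    ("P400", "4K Monitor", "Electronics", 499.0),
    ("P500", "Wireless Mouse", "Accessories", 59.0),
])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    ("O001", "C001", "P100", "2024-01-15", 1),
    ("O002", "C002", "P200", "2024-02-03", 2),
    ("O003", "C002", "P300", "2024-02-14", 1),
    ("O004", "C003", "P400", "2024-02-18", 1),
    ("O005", "C001", "P500", "2024-03-01", 3),
    ("O006", "C004", "P200", "2024-03-05", 1),
    ("O007", "C005", "P300", "2024-03-09", 1),
    ("O008", "C003", "P100", "2024-03-12", 2),
    ("O009", "C004", "P400", "2024-03-18", 1),
    ("O010", "C005", "P500", "2024-03-25", 2),
])

# Same query text as the Spark SQL cell.
city_revenue = conn.execute("""
    SELECT c.city,
           COUNT(DISTINCT c.customer_id) AS unique_customers,
           SUM(o.quantity * p.price) AS city_revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products p ON o.product_id = p.product_id
    GROUP BY c.city
    ORDER BY city_revenue DESC
""").fetchall()
conn.close()

for row in city_revenue:
    print(row)
```

Each customer lives in a different city in this dataset, so the per-city revenue equals the per-customer spend computed in the EDA section.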

## 8. Machine Learning with Spark MLlib

We build a classification model that predicts whether an order yields **high revenue** (>= $500) using Spark's machine learning pipeline. The example covers feature assembly, train/test splits, model training, and evaluation.

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the target label: high revenue or not.
ml_df = orders_enriched_df.select(
    "customer_id",
    "product_id",
    "category",
    "quantity",
    "price",
    "revenue",
    F.when(F.col("revenue") >= 500, 1).otherwise(0).alias("label")
)

# Split data into training and testing sets.
train_df, test_df = ml_df.randomSplit([0.7, 0.3], seed=42)

# Categorical feature encoding.
customer_indexer = StringIndexer(inputCol="customer_id", outputCol="customer_index", handleInvalid="keep")
product_indexer = StringIndexer(inputCol="product_id", outputCol="product_index", handleInvalid="keep")
category_indexer = StringIndexer(inputCol="category", outputCol="category_index", handleInvalid="keep")

encoder = OneHotEncoder(
    inputCols=["customer_index", "product_index", "category_index"],
    outputCols=["customer_vec", "product_vec", "category_vec"]
)

assembler = VectorAssembler(
    inputCols=["quantity", "price", "customer_vec", "product_vec", "category_vec"],
    outputCol="features"
)

log_reg = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)

pipeline = Pipeline(stages=[
    customer_indexer,
    product_indexer,
    category_indexer,
    encoder,
    assembler,
    log_reg
])

model = pipeline.fit(train_df)
predictions = model.transform(test_df)

predictions.select("customer_id", "product_id", "revenue", "probability", "prediction").show()

# Evaluate the model using Area Under ROC.
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc:.3f}")
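Before reading much into the reported metric, it helps to look at the label balance. The sketch below recomputes the `label` column in plain Python from the toy data, using the same `>= 500` threshold. The constant `THRESHOLD` and the other names are introduced only for this illustration.

```python
# Recompute the "high revenue" label for each order in plain Python.
prices = {"P100": 1299.0, "P200": 249.0, "P300": 349.0, "P400": 499.0, "P500": 59.0}
orders = [
    ("O001", "P100", 1), ("O002", "P200", 2), ("O003", "P300", 1),
    ("O004", "P400", 1), ("O005", "P500", 3), ("O006", "P200", 1),
    ("O007", "P300", 1), ("O008", "P100", 2), ("O009", "P400", 1),
    ("O010", "P500", 2),
]

THRESHOLD = 500.0  # same cutoff as the F.when(...) expression in the notebook

labels = {
    order_id: int(quantity * prices[product_id] >= THRESHOLD)
    for order_id, product_id, quantity in orders
}
positives = [order_id for order_id, label in labels.items() if label == 1]
positive_rate = len(positives) / len(labels)

print(positives, positive_rate)
```

Only two of the ten orders are positive, so a 70/30 random split can leave the test set with very few or no positive examples, and the reported AUC should be treated as illustrative. The label is also a direct function of `price` and `quantity`, which are both model features, so the cell is best read as a demonstration of pipeline mechanics rather than of predictive performance.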

## 9. Data Visualization

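While Spark excels at large-scale computation, plots are usually drawn from a small aggregated or sampled result that has been converted to Pandas. We aggregate revenue by category in Spark, convert the result to a Pandas DataFrame with `toPandas()`, and use Matplotlib for quick plotting. Because `toPandas()` collects all rows to the driver, it should only be called on results that fit comfortably in memory.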

In [None]:
import matplotlib.pyplot as plt

# Aggregate revenue by product category and convert to Pandas for plotting.
category_revenue_df = revenue_df.groupBy("category").agg(F.sum("revenue").alias("total_revenue"))
category_revenue_pd = category_revenue_df.toPandas()

plt.figure(figsize=(6, 4))
plt.bar(category_revenue_pd['category'], category_revenue_pd['total_revenue'], color='#1f77b4')
plt.title('Total Revenue by Product Category')
plt.xlabel('Category')
plt.ylabel('Revenue (USD)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 10. Exporting Data

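To conclude, we demonstrate how to write transformed data back to disk. In Colab this typically means writing to the local runtime storage. Note that Spark writes each output as a directory of part files rather than as a single file, so `/content/orders_enriched.csv` is a folder that contains a `part-*.csv` file (a single one here because of `coalesce(1)`). The output can be downloaded from the Files sidebar.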

In [None]:
# Write the enriched orders data to CSV and Parquet formats in the Colab environment.
output_path_csv = "/content/orders_enriched.csv"
output_path_parquet = "/content/orders_enriched.parquet"

orders_enriched_df.coalesce(1).write.mode("overwrite").option("header", True).csv(output_path_csv)
orders_enriched_df.write.mode("overwrite").parquet(output_path_parquet)

print(f"Wrote CSV to {output_path_csv}")
print(f"Wrote Parquet to {output_path_parquet}")
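Since Spark writes a directory rather than a single file, a small helper can locate the actual data files for download. The following is a minimal sketch: `find_part_files` is a hypothetical helper, not a Spark API, and the directory layout is simulated in a temporary folder with an illustrative file name so that the example runs without Spark.

```python
import glob
import os
import tempfile


def find_part_files(output_dir, extension):
    """Return the sorted data files found inside a Spark output directory."""
    pattern = os.path.join(output_dir, f"part-*{extension}")
    return sorted(glob.glob(pattern))


# Simulate the layout Spark produces: data files named part-* plus a
# _SUCCESS marker. The part file name here is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "orders_enriched.csv")
    os.makedirs(out_dir)
    for name in ("part-00000-demo-c000.csv", "_SUCCESS"):
        open(os.path.join(out_dir, name), "w").close()
    found = [os.path.basename(path) for path in find_part_files(out_dir, ".csv")]

print(found)
```

In the notebook, calling `find_part_files(output_path_csv, ".csv")` after the write should return a single path, because `coalesce(1)` reduces the CSV output to one partition.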

## 11. Cleanup (Optional)

Stop the Spark session at the end of the demo to release resources.

In [None]:
spark.stop()
print("Spark session stopped.")