# Assignment Lesson 6. Pagila Database Analysis

## Requirements.

1. Using Spark, compute monthly revenue by film category.
2. Define customer lifetime value (CLV) using Spark.
3. Identify the top 1% of customers generating 80% of revenue.
4. Propose a partitioning strategy for the payment table:
    - by date?
    - by store?
    - by customer?
    
    Explain trade-offs.

5. The following join is very slow at scale:

    `payment -> rental -> inventory -> film -> film_category -> category`

    Propose:

    - join order optimization
    - indexing strategies
    - caching or materialized views

## My Solution

### Set Up Spark Session and JDBC Connection

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, count, avg, round as _round, max as _max, min as _min, \
    stddev, percentile_approx, \
    date_format, datediff,  \
    countDistinct, concat_ws, \
    row_number

from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("PagilaAnalysis") \
    .config("spark.jars", "../jars/postgresql-42.7.8.jar") \
    .config("spark.driver.extraClassPath", "../jars/postgresql-42.7.8.jar") \
    .master("local[*]") \
    .getOrCreate()
    
jdbc_url = "jdbc:postgresql://localhost:5432/analytics"
db_properties = {
    "user": "spark_user",
    "password": "spark_password",
    "driver": "org.postgresql.Driver"
}

25/12/28 23:22:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


### Load Data

In [3]:
print("\\n=== READING PAGILA DATA FROM POSTGRESQL ===")

# Read all tables needed
df_film = spark.read.jdbc(url=jdbc_url, table="film", properties=db_properties)
df_film_category = spark.read.jdbc(url=jdbc_url, table="film_category", properties=db_properties)
df_category = spark.read.jdbc(url=jdbc_url, table="category", properties=db_properties)
df_inventory = spark.read.jdbc(url=jdbc_url, table="inventory", properties=db_properties)
df_rental = spark.read.jdbc(url=jdbc_url, table="rental", properties=db_properties)
df_payment = spark.read.jdbc(url=jdbc_url, table="payment", properties=db_properties)
df_store = spark.read.jdbc(url=jdbc_url, table="store", properties=db_properties)
df_customer = spark.read.jdbc(url=jdbc_url, table="customer", properties=db_properties)

print("\\n=== All tables loaded! ===")

\n=== READING PAGILA DATA FROM POSTGRESQL ===
\n=== All tables loaded! ===


### Exercise 1. Monthly Revenue by Film Category

In [4]:
print("\\n=== EXERCISE 1. Monthly Revenue by Film Category")

monthly_revenue_by_category = df_payment \
    .join(df_rental, "rental_id") \
    .join(df_inventory, "inventory_id") \
    .join(df_film, "film_id") \
    .join(df_film_category, "film_id") \
    .join(df_category, "category_id") \
    .withColumn("year_month", date_format(col("payment_date"), "yyyy-MM")) \
    .groupBy("year_month", col("name").alias("category_name")) \
    .agg(
        _sum("amount").alias("total_revenue")
    ) \
    .orderBy("year_month", col("total_revenue").desc())

# Show result
print("\\nMonthly Revenue by Category:")
monthly_revenue_by_category.show(truncate=False)

pivot_revenue = (
    monthly_revenue_by_category
    .groupBy("category_name")      
    .pivot("year_month")           
    .agg(_round(_sum("total_revenue"), 2))
    .orderBy("category_name")
)
    
print("\\n Pivot View (Month x Category):")
pivot_revenue.show(truncate=False)

\n=== EXERCISE 1. Monthly Revenue by Film Category


\nMonthly Revenue by Category:


                                                                                

+----------+-------------+-------------+
|year_month|category_name|total_revenue|
+----------+-------------+-------------+
|2022-01   |Sci-Fi       |244.43       |
|2022-01   |Action       |237.44       |
|2022-01   |New          |236.47       |
|2022-01   |Sports       |235.48       |
|2022-01   |Comedy       |226.57       |
|2022-01   |Drama        |218.55       |
|2022-01   |Foreign      |210.48       |
|2022-01   |Documentary  |195.51       |
|2022-01   |Family       |189.53       |
|2022-01   |Classics     |168.58       |
|2022-01   |Animation    |164.54       |
|2022-01   |Travel       |150.68       |
|2022-01   |Horror       |147.68       |
|2022-01   |Music        |136.63       |
|2022-01   |Games        |129.71       |
|2022-01   |Children     |123.68       |
|2022-02   |Sports       |821.20       |
|2022-02   |Games        |728.34       |
|2022-02   |Family       |709.12       |
|2022-02   |Action       |684.34       |
+----------+-------------+-------------+
only showing top

                                                                                

\n Pivot View (Month x Category):


                                                                                

+-------------+-------+-------+-------+-------+-------+-------+-------+
|category_name|2022-01|2022-02|2022-03|2022-04|2022-05|2022-06|2022-07|
+-------------+-------+-------+-------+-------+-------+-------+-------+
|Action       |237.44 |684.34 |839.90 |652.34 |628.36 |741.06 |592.41 |
|Animation    |164.54 |623.48 |802.02 |779.04 |882.86 |757.02 |647.34 |
|Children     |123.68 |520.76 |620.36 |698.27 |611.35 |636.28 |444.85 |
|Classics     |168.58 |541.59 |604.43 |579.46 |643.37 |621.44 |480.72 |
|Comedy       |226.57 |640.66 |736.31 |633.57 |749.44 |730.42 |666.61 |
|Documentary  |195.51 |585.46 |756.24 |646.48 |697.38 |752.02 |584.43 |
|Drama        |218.55 |678.36 |768.24 |748.39 |809.11 |720.35 |644.39 |
|Family       |189.53 |709.12 |684.28 |744.12 |657.30 |688.20 |553.52 |
|Foreign      |210.48 |663.48 |736.24 |623.40 |756.20 |609.50 |671.37 |
|Games        |129.71 |728.34 |740.31 |589.64 |753.42 |678.40 |661.51 |
|Horror       |147.68 |589.76 |730.41 |561.65 |587.67 |553.71 |5

### Exercise 2. Define Customer Lifetime Value (CLV)

In the basic background: `CLV = Total Revenue from Customer`.

In [7]:
print("\\n=== EXERCISE 2. Customer Lifetime Value")

clv_summary = df_payment \
    .join(df_customer, "customer_id") \
    .groupBy(
        "customer_id",
        concat_ws(" ", col("first_name"), col("last_name")).alias("customer_name")
    ) \
    .agg(
        # Total Revenue (CLV)
        _sum("amount").alias("clv"),
        # Total Transactions
        count("payment_id").alias("total_transactions"),
        # Number of active months
        countDistinct(date_format(col("payment_date"), "yyyy-MM")).alias("total_active_months"),
        # Average revenue of each transaction value
        _round(avg("amount"), 2).alias("avg_transaction_value"),
        # Tenure days
        datediff(_max("payment_date"), _min("payment_date")).alias("tenure_days"),
        # _max("payment_date").alias("last_purchase_date"),
        # _min("payment_date").alias("first_purchase_date")
    ) \
    .withColumn(
        "avg_monthly_revenue",
        _round(col("clv")/col("total_active_months"), 2)
    ) \
    .orderBy(col("clv").desc())

# Show result
print("\\n Top 20 Customers by CLV:")
clv_summary.show(20, truncate=False)

# Statistic Summary
clv_summary.select(
    _round(avg("clv"), 2).alias("avg_clv"),
    _round(stddev("clv"), 2).alias("stddev_clv"),
    _round(_min("clv"), 2).alias("min_clv"),
    _round(_max("clv"), 2).alias("max_clv"),
    _round(percentile_approx("clv", 0.25), 2).alias("25th_percentile_clv"),
    _round(percentile_approx("clv", 0.5), 2).alias("median_clv"),
    _round(percentile_approx("clv", 0.75), 2).alias("75th_percentile_clv")
).show()

\n=== EXERCISE 2. Customer Lifetime Value


\n Top 20 Customers by CLV:
+-----------+--------------+------+------------------+-------------------+---------------------+-----------+-------------------+
|customer_id|customer_name |clv   |total_transactions|total_active_months|avg_transaction_value|tenure_days|avg_monthly_revenue|
+-----------+--------------+------+------------------+-------------------+---------------------+-----------+-------------------+
|526        |KARL SEAL     |221.55|45                |6                  |4.92                 |171        |36.93              |
|148        |ELEANOR HUNT  |216.54|46                |7                  |4.71                 |180        |30.93              |
|144        |CLARA SHAW    |195.58|42                |7                  |4.66                 |179        |27.94              |
|137        |RHONDA KENNEDY|194.61|39                |7                  |4.99                 |180        |27.80              |
|178        |MARION SNYDER |194.61|39                |7              

### Exercise 3. Identify The Top 1% of Customers Generating 80% of Revenue.

In [9]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

In [10]:
print("\\n=== EXERCISE 3. Top 1% of Customers Generating 80% of Revenue")

# Compute total revenue
total_revenue = df_payment.agg(_sum("amount")).collect()[0][0]
print(f"Total Revenue: ${total_revenue}")

# Compute cumulative revenue
windowSpec = Window.orderBy(col("clv").desc())

# Pareto 80/20
## Ranking customer by clv
## Cumulative revenue from customer #1 to present
## % revenue at this present row
## % number of customers at this present row
customer_pareto = clv_summary \
    .withColumn("row_num", row_number().over(windowSpec)) \
    .withColumn("cumulative_revenue", _sum("clv").over(windowSpec.rowsBetween(Window.unboundedPreceding, 0))) \
    .withColumn("revenue_percentage", _round(col("cumulative_revenue") / total_revenue * 100, 2)) \
    .withColumn("customer_percentage", _round(col("row_num") / clv_summary.count() * 100, 2))
    
top_customers_80 = customer_pareto.filter(col("revenue_percentage") <= 0.80)
top_customers_80.show(truncate=False)

\n=== EXERCISE 3. Top 1% of Customers Generating 80% of Revenue
Total Revenue: $67416.51


25/12/28 23:47:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 23:47:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 23:47:10 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 23:47:11 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 23:47:11 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 23:47:11 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
25/12/28 2

+-----------+-------------+------+------------------+-------------------+---------------------+-----------+-------------------+-------+------------------+------------------+-------------------+
|customer_id|customer_name|clv   |total_transactions|total_active_months|avg_transaction_value|tenure_days|avg_monthly_revenue|row_num|cumulative_revenue|revenue_percentage|customer_percentage|
+-----------+-------------+------+------------------+-------------------+---------------------+-----------+-------------------+-------+------------------+------------------+-------------------+
|526        |KARL SEAL    |221.55|45                |6                  |4.92                 |171        |36.93              |1      |221.55            |0.33              |0.17               |
|148        |ELEANOR HUNT |216.54|46                |7                  |4.71                 |180        |30.93              |2      |438.09            |0.65              |0.33               |
+-----------+-------------+---