## 1596 - The Most Frequently Ordered Products for Each Customer

### Table: Customers

| Column Name | Type    |
|-------------|---------|
| customer_id | int     |
| name        | varchar |

customer_id is the column with unique values for this table.  
This table contains information about the customers.

---

### Table: Orders

| Column Name | Type |
|-------------|------|
| order_id    | int  |
| order_date  | date |
| customer_id | int  |
| product_id  | int  |

order_id is the column with unique values for this table.  
This table contains information about the orders made by customer_id.  
No customer will order the same product more than once in a single day.

---

### Table: Products

| Column Name   | Type    |
|---------------|---------|
| product_id    | int     |
| product_name  | varchar |
| price         | int     |

product_id is the column with unique values for this table.  
This table contains information about the products.

---

Write a solution to find the most frequently ordered product(s) for each customer.

The result table should contain the product_id and product_name for each customer_id that has placed at least one order.

Return the result table in any order.

---

### Example 1:

#### Input:

**Customers table:**

| customer_id | name  |
|-------------|-------|
| 1           | Alice |
| 2           | Bob   |
| 3           | Tom   |
| 4           | Jerry |
| 5           | John  |

**Orders table:**

| order_id | order_date | customer_id | product_id |
|----------|------------|-------------|------------|
| 1        | 2020-07-31 | 1           | 1          |
| 2        | 2020-07-30 | 2           | 2          |
| 3        | 2020-08-29 | 3           | 3          |
| 4        | 2020-07-29 | 4           | 1          |
| 5        | 2020-06-10 | 1           | 2          |
| 6        | 2020-08-01 | 2           | 1          |
| 7        | 2020-08-01 | 3           | 3          |
| 8        | 2020-08-03 | 1           | 2          |
| 9        | 2020-08-07 | 2           | 3          |
| 10       | 2020-07-15 | 1           | 2          |

**Products table:**

| product_id | product_name | price |
|------------|--------------|-------|
| 1          | keyboard     | 120   |
| 2          | mouse        | 80    |
| 3          | screen       | 600   |
| 4          | hard disk    | 450   |

---

#### Output:

| customer_id | product_id | product_name |
|-------------|------------|--------------|
| 1           | 2          | mouse        |
| 2           | 1          | keyboard     |
| 2           | 2          | mouse        |
| 2           | 3          | screen       |
| 3           | 3          | screen       |
| 4           | 1          | keyboard     |

---

**Explanation:**  
Alice (customer 1) ordered the mouse three times and the keyboard one time, so the mouse is the most frequently ordered product for them.  
Bob (customer 2) ordered the keyboard, the mouse, and the screen once each, so all three are tied as the most frequently ordered products for them.  
Tom (customer 3) only ordered the screen (two times), so that is the most frequently ordered product for them.  
Jerry (customer 4) only ordered the keyboard (one time), so that is the most frequently ordered product for them.  
John (customer 5) did not order anything, so we do not include them in the result table.
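
The counting in the explanation above can be checked with a few lines of plain Python. This is only a sketch over the example rows, with each order reduced to a (customer_id, product_id) pair; the names `orders`, `product_names`, and `most_frequent_products` are made up for illustration and are not part of the problem statement.

```python
from collections import Counter, defaultdict

# (customer_id, product_id) pairs copied from the example Orders table
orders = [(1, 1), (2, 2), (3, 3), (4, 1), (1, 2),
          (2, 1), (3, 3), (1, 2), (2, 3), (1, 2)]
product_names = {1: "keyboard", 2: "mouse", 3: "screen", 4: "hard disk"}


def most_frequent_products(orders, product_names):
    """Return sorted (customer_id, product_id, product_name) rows, ties included."""
    # Count orders per product, separately for each customer
    per_customer = defaultdict(Counter)
    for customer_id, product_id in orders:
        per_customer[customer_id][product_id] += 1

    rows = []
    for customer_id, counts in per_customer.items():
        top = max(counts.values())
        # Keep every product that reaches this customer's maximum count
        for product_id, freq in counts.items():
            if freq == top:
                rows.append((customer_id, product_id, product_names[product_id]))
    return sorted(rows)


result = most_frequent_products(orders, product_names)
for row in result:
    print(row)
```

Customers with no orders never appear in `per_customer`, which is why John (customer 5) is absent from the result without any explicit filter.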

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

# Start Spark session
spark = SparkSession.builder.appName("MostFrequentProduct").getOrCreate()

# Define schemas
customers_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("name", StringType(), True)
])

orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True)
])

products_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("price", IntegerType(), True)
])

# Sample data
from datetime import date

customers_data = [
    (1, "Alice"), (2, "Bob"), (3, "Tom"), (4, "Jerry"), (5, "John")
]

# order_date is a DateType column, so plain date objects are sufficient
orders_data = [
    (1, date(2020, 7, 31), 1, 1),
    (2, date(2020, 7, 30), 2, 2),
    (3, date(2020, 8, 29), 3, 3),
    (4, date(2020, 7, 29), 4, 1),
    (5, date(2020, 6, 10), 1, 2),
    (6, date(2020, 8, 1), 2, 1),
    (7, date(2020, 8, 1), 3, 3),
    (8, date(2020, 8, 3), 1, 2),
    (9, date(2020, 8, 7), 2, 3),
    (10, date(2020, 7, 15), 1, 2)
]

products_data = [
    (1, "keyboard", 120),
    (2, "mouse", 80),
    (3, "screen", 600),
    (4, "hard disk", 450)
]

# Create DataFrames
customers_df = spark.createDataFrame(customers_data, customers_schema)
orders_df = spark.createDataFrame(orders_data, orders_schema)
products_df = spark.createDataFrame(products_data, products_schema)

customers_df.createOrReplaceTempView("customers")
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")

In [0]:
%sql
WITH ProductFrequency AS (
    SELECT 
        customer_id,
        product_id,
        COUNT(*) AS freq
    FROM Orders
    GROUP BY customer_id, product_id
),
MaxFrequency AS (
    SELECT 
        customer_id,
        MAX(freq) AS max_freq
    FROM ProductFrequency
    GROUP BY customer_id
)
SELECT 
    pf.customer_id,
    pf.product_id,
    p.product_name
FROM ProductFrequency pf
JOIN MaxFrequency mf
    ON pf.customer_id = mf.customer_id AND pf.freq = mf.max_freq
JOIN Products p
    ON pf.product_id = p.product_id
ORDER BY pf.customer_id, pf.product_id;
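
The join against `MaxFrequency` can also be expressed with a window function: rank each customer's products by order count and keep rank 1. The sketch below shows that variant. To stay self-contained it runs against an in-memory SQLite database through Python's `sqlite3` module rather than Spark; that choice, the lowercase table names, and the trimmed column set are assumptions made for illustration. Spark SQL also provides `RANK()`, so the query should carry over with little or no change.

```python
import sqlite3

# In-memory database holding only the columns the query needs
# (window functions require SQLite 3.25 or newer)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, product_id INTEGER);
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    INSERT INTO orders VALUES
        (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 1), (5, 1, 2),
        (6, 2, 1), (7, 3, 3), (8, 1, 2), (9, 2, 3), (10, 1, 2);
    INSERT INTO products VALUES
        (1, 'keyboard'), (2, 'mouse'), (3, 'screen'), (4, 'hard disk');
""")

query = """
WITH product_frequency AS (
    SELECT customer_id, product_id, COUNT(*) AS freq
    FROM orders
    GROUP BY customer_id, product_id
),
ranked AS (
    SELECT
        customer_id,
        product_id,
        RANK() OVER (PARTITION BY customer_id ORDER BY freq DESC) AS rnk
    FROM product_frequency
)
SELECT r.customer_id, r.product_id, p.product_name
FROM ranked r
JOIN products p ON r.product_id = p.product_id
WHERE r.rnk = 1
ORDER BY r.customer_id, r.product_id;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Using `RANK()` rather than `ROW_NUMBER()` is the important choice here: `ROW_NUMBER()` would keep a single arbitrary product per customer and drop Bob's three-way tie.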

In [0]:
# Import under an alias so Spark's max/count do not shadow Python builtins
from pyspark.sql import functions as F

# Step 1: count how many times each customer ordered each product
freq_df = orders_df.groupBy("customer_id", "product_id").agg(
    F.count("*").alias("freq")
)

# Step 2: find each customer's highest order count
max_freq_df = freq_df.groupBy("customer_id").agg(
    F.max("freq").alias("max_freq")
)

# Step 3: keep the products whose count equals that maximum (ties included),
# then attach the product name
result_df = (
    freq_df.join(max_freq_df, on="customer_id", how="inner")
    .where(F.col("freq") == F.col("max_freq"))
    .join(products_df, on="product_id", how="inner")
    .select("customer_id", "product_id", "product_name")
    .orderBy("customer_id", "product_id")
)
result_df.display()  # display() is Databricks-specific; use show() elsewhere