# 🧠 LeetCode 586: Customer Placing the Largest Number of Orders (Databricks Edition)

---

## 📘 Problem Statement

### Table: Orders

| Column Name     | Type |
|-----------------|------|
| order_number    | int  |
| customer_number | int  |

- `order_number` is the primary key.
- Each row records one order ID together with the ID of the customer who placed it.

---

## 🎯 Objective

Write a query to find the `customer_number` for the customer who has placed the **largest number of orders**.

- The test cases are generated so that **exactly one customer** will have placed more orders than any other customer.

---

## 🧾 Example

### Input

**Orders Table**

| order_number | customer_number |
|--------------|-----------------|
| 1            | 1               |
| 2            | 2               |
| 3            | 3               |
| 4            | 3               |

### Output

| customer_number |
|-----------------|
| 3               |

### Explanation

- Customer 3 has placed 2 orders.
- Customers 1 and 2 have placed only 1 order each.
- So the result is `customer_number = 3`.
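
The counting in this explanation can be reproduced outside Spark as a minimal plain-Python sketch. The names `orders`, `order_counts`, and `top_customer` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Illustrative copy of the example Orders table as (order_number, customer_number) pairs
orders = [(1, 1), (2, 2), (3, 3), (4, 3)]

# Count how many orders each customer placed
order_counts = Counter(customer for _, customer in orders)

# most_common(1) returns the single (customer, count) pair with the highest count
top_customer, top_count = order_counts.most_common(1)[0]

print(top_customer, top_count)
```

Taking only the first entry mirrors the base problem, where exactly one customer is guaranteed to have the most orders.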

---

## 🧩 Follow-Up

If **more than one customer** has the largest number of orders, return **all such customer_number(s)**.
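
One way to think about the follow-up is to compute the maximum count first and then keep every customer that matches it. The sketch below shows that idea in plain Python; the helper `customers_with_most_orders` and the `tied_orders` data are hypothetical, made up here to produce a tie.

```python
from collections import Counter

def customers_with_most_orders(orders):
    """Return every customer_number tied for the highest order count, sorted."""
    counts = Counter(customer for _, customer in orders)
    if not counts:
        return []  # no orders, so no customer qualifies
    max_count = max(counts.values())
    # Keep all customers whose count equals the maximum, not just the first one
    return sorted(c for c, n in counts.items() if n == max_count)

# Hypothetical data where customers 2 and 3 tie with two orders each
tied_orders = [(1, 1), (2, 2), (3, 3), (4, 3), (5, 2)]
print(customers_with_most_orders(tied_orders))
```

Comparing against the maximum, rather than sorting and taking the first row, is what lets all tied customers through.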

---

## 🧱 PySpark DataFrame Creation

```python
from pyspark.sql import Row

# Sample data
orders_data = [
    Row(order_number=1, customer_number=1),
    Row(order_number=2, customer_number=2),
    Row(order_number=3, customer_number=3),
    Row(order_number=4, customer_number=3)
]

# Create DataFrame
orders_df = spark.createDataFrame(orders_data)

# Register temp view
orders_df.createOrReplaceTempView("Orders")
```

---

## ✅ SQL Solution

This query relies on the guarantee that exactly one customer has the most orders. If several customers tie, `LIMIT 1` returns only one of them.

```sql
SELECT customer_number
FROM Orders
GROUP BY customer_number
ORDER BY COUNT(*) DESC
LIMIT 1;
```

---

## 🧪 PySpark Solution

```python
from pyspark.sql.functions import count

# Group by customer_number and count orders
order_counts = orders_df.groupBy("customer_number") \
                        .agg(count("*").alias("order_count"))

# Find max order count
max_count = order_counts.agg({"order_count": "max"}).collect()[0][0]

# Keep every customer whose count equals the max, so ties (the follow-up) are all returned
result_df = order_counts.filter(order_counts["order_count"] == max_count) \
                        .select("customer_number")

result_df.show()
```

---

📘 *This notebook is part of DataGym’s SQL-to-PySpark transition series.*


```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import col, count, row_number, rank
from pyspark.sql.window import Window

# 1️⃣ Sample Data
orders_data = [
    (1, 1),
    (2, 2),
    (3, 3),
    (4, 3)
]

# 2️⃣ Schema Definition
orders_schema = StructType([
    StructField("order_number", IntegerType(), True),
    StructField("customer_number", IntegerType(), True)
])

# 3️⃣ Create DataFrame
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)

# 4️⃣ Register Temp View
orders_df.createOrReplaceTempView("Orders")
```

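```python
# Count orders per customer (the original cell stopped at an empty groupBy())
orders_df.groupBy("customer_number").count().show()
```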

```python
result = spark.sql("""
    SELECT customer_number
    FROM Orders
    GROUP BY customer_number
    ORDER BY COUNT(*) DESC
    LIMIT 1
""")

result.show()
```