# Compressed Mean  
**Alibaba SQL Interview Question**

---

### Question  
You're trying to find the mean number of items per order on Alibaba, rounded to 1 decimal place, using a table that records the number of items in an order (`item_count` column) and the number of orders with that item count (`order_occurrences` column).

---

### items_per_order Table:

| Column Name        | Type     |
|--------------------|----------|
| item_count         | integer  |
| order_occurrences  | integer  |

---

### Example Input:

| item_count | order_occurrences |
|------------|-------------------|
| 1          | 500               |
| 2          | 1000              |
| 3          | 800               |
| 4          | 1000              |

> There are a total of 500 orders with one item per order, 1000 orders with two items per order, 800 orders with three items per order, and 1000 orders with four items per order.

---

### Example Output:

| mean |
|------|
| 2.7  |

---

### Explanation  
Let's calculate the arithmetic average:

- **Total items** = (1×500) + (2×1000) + (3×800) + (4×1000) = **8900**
- **Total orders** = 500 + 1000 + 800 + 1000 = **3300**

So,  
**Mean = 8900 / 3300 ≈ 2.697, which rounds to 2.7**
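
The arithmetic above can be sketched in plain Python as a quick sanity check. This is a minimal illustration on the example input; the variable names are assumptions made for this sketch and are not part of the original solution. It also expands the compressed table into one entry per order to confirm that the weighted formula agrees with an ordinary mean.

```python
from statistics import fmean

# Example input as (item_count, order_occurrences) pairs
rows = [(1, 500), (2, 1000), (3, 800), (4, 1000)]

# Weighted sums: each item count contributes once per order that has it
total_items = sum(item_count * occurrences for item_count, occurrences in rows)
total_orders = sum(occurrences for _, occurrences in rows)
mean = round(total_items / total_orders, 1)

# Cross-check: expand to one entry per order and take an ordinary mean
expanded = [item_count for item_count, occurrences in rows for _ in range(occurrences)]
expanded_mean = round(fmean(expanded), 1)

print(total_items, total_orders, mean, expanded_mean)
```

Both approaches give the same result, but the weighted form never materializes one row per order, which is the point of keeping the data compressed.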


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType
# Import only what is used; note that round and sum shadow the Python builtins here
from pyspark.sql.functions import col, round, sum

# Initialize Spark session
spark = SparkSession.builder.master('local[1]').appName("CompressedMean").getOrCreate()

# Define schema
schema = StructType([
    StructField("item_count", IntegerType(), True),
    StructField("order_occurrences", IntegerType(), True)
])

# Create data
data = [
    (1, 500),
    (2, 1000),
    (3, 800),
    (4, 1000)
]

# Create DataFrame
items_per_order_df = spark.createDataFrame(data, schema)

# Show DataFrame
items_per_order_df.show()


+----------+-----------------+
|item_count|order_occurrences|
+----------+-----------------+
|         1|              500|
|         2|             1000|
|         3|              800|
|         4|             1000|
+----------+-----------------+



In [2]:
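# Weighted mean: total items / total orders, rounded to 1 decimal place
total_items = sum(col('item_count') * col('order_occurrences'))
total_orders = sum('order_occurrences')
items_per_order_df\
    .agg(round(total_items / total_orders, 1).alias('mean'))\
    .show()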

+----+
|mean|
+----+
| 2.7|
+----+



In [3]:
items_per_order_df.createOrReplaceTempView('items_per_order')
spark.sql(
"""
SELECT ROUND(SUM(item_count * order_occurrences) / SUM(order_occurrences), 1) AS mean
FROM items_per_order
"""
).show()

+----+
|mean|
+----+
| 2.7|
+----+
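
One caveat when porting this query: Spark SQL's `/` operator returns a double even for integer operands, which is why the query above yields 2.7. Some other engines divide integers as integers, so the same expression can silently truncate. The sketch below illustrates this with Python's built-in `sqlite3` module, used here purely as a stand-in engine (an assumption for this example, not part of the original solution); multiplying by `1.0` is one common way to force floating-point division.

```python
import sqlite3

# In-memory SQLite database standing in for a non-Spark engine (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_per_order (item_count INTEGER, order_occurrences INTEGER)")
conn.executemany(
    "INSERT INTO items_per_order VALUES (?, ?)",
    [(1, 500), (2, 1000), (3, 800), (4, 1000)],
)

# Same expression as the Spark query: SQLite divides integers as integers
naive = conn.execute(
    "SELECT ROUND(SUM(item_count * order_occurrences) / SUM(order_occurrences), 1) "
    "FROM items_per_order"
).fetchone()[0]

# Multiplying by 1.0 forces floating-point division before rounding
portable = conn.execute(
    "SELECT ROUND(1.0 * SUM(item_count * order_occurrences) / SUM(order_occurrences), 1) "
    "FROM items_per_order"
).fetchone()[0]

print(naive, portable)
conn.close()
```

In SQLite the first query truncates the quotient before rounding, while the second matches the Spark result.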

