# Average Review Ratings  
**Amazon SQL Interview Question**

---

## Question

Given the `reviews` table, write a query to retrieve the **average star rating** for each product, **grouped by month**.  

The output should display:
- The **month** as a numerical value (`mth`)
- The **product ID**
- The **average star rating** rounded to two decimal places  

Sort the output first by **month**, then by **product ID**.

---

## Schema

### `reviews` Table:
| Column Name  | Type      |
|--------------|-----------|
| review_id    | integer   |
| user_id      | integer   |
| submit_date  | datetime  |
| product_id   | integer   |
| stars        | integer (1-5) |

---

## Example Input:
| review_id | user_id | submit_date         | product_id | stars |
|-----------|---------|---------------------|------------|--------|
| 6171      | 123     | 06/08/2022 00:00:00 | 50001      | 4      |
| 7802      | 265     | 06/10/2022 00:00:00 | 69852      | 4      |
| 5293      | 362     | 06/18/2022 00:00:00 | 50001      | 3      |
| 6352      | 192     | 07/26/2022 00:00:00 | 69852      | 3      |
| 4517      | 981     | 07/05/2022 00:00:00 | 69852      | 2      |

---

## Example Output:
| mth | product | avg_stars |
|-----|---------|-----------|
| 6   | 50001   | 3.50      |
| 6   | 69852   | 4.00      |
| 7   | 69852   | 2.50      |

---

## Explanation

- Product **50001** received ratings of **4** and **3** in **June**, resulting in an average of **(4 + 3)/2 = 3.5**.
- Product **69852** received:
  - A single rating of **4** in **June**, so its June average is **4.0**.
  - Ratings of **3** and **2** in **July**, resulting in an average of **(3 + 2)/2 = 2.5**.
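
The arithmetic above can be checked without Spark. The following is a minimal plain-Python sketch, not part of the original solution: the helper name `average_stars_by_month` is hypothetical, and the rows are copied from the example input.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the example input:
# (review_id, user_id, submit_date, product_id, stars)
reviews = [
    (6171, 123, datetime(2022, 6, 8), 50001, 4),
    (7802, 265, datetime(2022, 6, 10), 69852, 4),
    (5293, 362, datetime(2022, 6, 18), 50001, 3),
    (6352, 192, datetime(2022, 7, 26), 69852, 3),
    (4517, 981, datetime(2022, 7, 5), 69852, 2),
]


def average_stars_by_month(rows):
    """Hypothetical helper: return (mth, product_id, avg_stars) tuples,
    sorted by month and then product ID, with the average rounded to 2 places."""
    stars_by_group = defaultdict(list)
    for _review_id, _user_id, submit_date, product_id, stars in rows:
        # Group on the numeric month and the product ID, as the question asks.
        stars_by_group[(submit_date.month, product_id)].append(stars)
    return [
        (mth, product_id, round(sum(values) / len(values), 2))
        for (mth, product_id), values in sorted(stars_by_group.items())
    ]


result = average_stars_by_month(reviews)
for row in result:
    print(row)
```

Sorting the `(mth, product_id)` keys as tuples gives the required ordering: month first, then product ID.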

---  


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
from pyspark.sql.functions import *
from datetime import datetime

# Initialize Spark session
spark = SparkSession.builder.master('local[1]').appName("AmazonReviewRatings").getOrCreate()

# Define schema for reviews table
reviews_schema = StructType([
    StructField("review_id", IntegerType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("submit_date", TimestampType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("stars", IntegerType(), True)
])

# Sample data based on the question
reviews_data = [
    (6171, 123, datetime(2022, 6, 8, 0, 0, 0), 50001, 4),
    (7802, 265, datetime(2022, 6, 10, 0, 0, 0), 69852, 4),
    (5293, 362, datetime(2022, 6, 18, 0, 0, 0), 50001, 3),
    (6352, 192, datetime(2022, 7, 26, 0, 0, 0), 69852, 3),
    (4517, 981, datetime(2022, 7, 5, 0, 0, 0), 69852, 2)
]

# Create the DataFrame
reviews_df = spark.createDataFrame(reviews_data, schema=reviews_schema)

# Show the DataFrame
reviews_df.show()


+---------+-------+-------------------+----------+-----+
|review_id|user_id|        submit_date|product_id|stars|
+---------+-------+-------------------+----------+-----+
|     6171|    123|2022-06-08 00:00:00|     50001|    4|
|     7802|    265|2022-06-10 00:00:00|     69852|    4|
|     5293|    362|2022-06-18 00:00:00|     50001|    3|
|     6352|    192|2022-07-26 00:00:00|     69852|    3|
|     4517|    981|2022-07-05 00:00:00|     69852|    2|
+---------+-------+-------------------+----------+-----+



In [5]:
# round() here is pyspark.sql.functions.round, brought in by the wildcard import above
reviews_df.withColumn('mth', month('submit_date'))\
    .groupBy('mth', 'product_id')\
    .agg(round(avg('stars'), 2).alias('avg_stars'))\
    .orderBy('mth', 'product_id').show()

+---+----------+---------+
|mth|product_id|avg_stars|
+---+----------+---------+
|  6|     50001|      3.5|
|  6|     69852|      4.0|
|  7|     69852|      2.5|
+---+----------+---------+



In [8]:
reviews_df.createOrReplaceTempView('reviews')
spark.sql(
    """
select
    month(submit_date) as mth,
    product_id,
    round(avg(stars), 2) as avg_stars
from reviews
group by mth, product_id
order by mth, product_id
    """
).show()

+---+----------+---------+
|mth|product_id|avg_stars|
+---+----------+---------+
|  6|     50001|      3.5|
|  6|     69852|      4.0|
|  7|     69852|      2.5|
+---+----------+---------+
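
To try the same query logic without a Spark session, here is a rough sketch using Python's built-in `sqlite3` module as a stand-in engine. This is an assumption for illustration only, not the original solution: SQLite has no `month()` function, so `strftime('%m', ...)` cast to an integer is swapped in, and `submit_date` is stored as ISO-formatted text.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the Spark temp view.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "review_id INTEGER, user_id INTEGER, submit_date TEXT, "
    "product_id INTEGER, stars INTEGER)"
)

# Rows copied from the example input, with dates written as ISO text.
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [
        (6171, 123, "2022-06-08 00:00:00", 50001, 4),
        (7802, 265, "2022-06-10 00:00:00", 69852, 4),
        (5293, 362, "2022-06-18 00:00:00", 50001, 3),
        (6352, 192, "2022-07-26 00:00:00", 69852, 3),
        (4517, 981, "2022-07-05 00:00:00", 69852, 2),
    ],
)

# strftime('%m', ...) returns the month as text ('06'); the cast makes it numeric.
query = """
SELECT
    CAST(strftime('%m', submit_date) AS INTEGER) AS mth,
    product_id,
    ROUND(AVG(stars), 2) AS avg_stars
FROM reviews
GROUP BY mth, product_id
ORDER BY mth, product_id
"""

rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Only the month extraction differs from the Spark SQL version; the grouping, rounding, and ordering clauses carry over unchanged.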

