## Question 1: Calculate the average rating for each movie

We use the `rating_df` DataFrame and group it by `movie_id` to compute the average rating.
Then, we join the result with `item_df` to retrieve the movie titles.
Finally, we display the first 10 movies with their average ratings.

### Spark Code:

```python
# Average rating per movie
avg_rating_df = rating_df.groupBy("movie_id").avg("rating") \
    .withColumnRenamed("avg(rating)", "avg_rating")

# Attach titles, then sort: a join does not guarantee that an earlier ordering survives
joined_df = avg_rating_df.join(item_df, on="movie_id").orderBy("movie_id")
joined_df.select("movie_id", "title", "avg_rating").show(10)
```

![Q1 Result](https://raw.githubusercontent.com/loerish/assignment3/main/images_assignment3/q1_result.png)






### Question 1 Result Interpretation
The result lists the first 10 movies by `movie_id` together with their average viewer ratings. The averages range from low (1.0) to moderate (around 3.2), so these ten movies received mixed reviews. An average says nothing about how many ratings it is based on, so a value such as 1.0 may come from very few viewers. This aggregation shows which movies are better received based on user feedback and lays the foundation for recommendation systems or further analysis.
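
The group, average, and join steps can be traced by hand on a small sample. The following is a plain-Python sketch rather than Spark code; the rows, titles, and variable names are made up for illustration and are not taken from the assignment's dataset.

```python
from collections import defaultdict

# Hypothetical toy data, not taken from the assignment's dataset.
ratings = [(1, 4), (1, 5), (2, 1), (3, 3), (3, 4), (3, 2)]  # (movie_id, rating)
titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}          # movie_id -> title

# Group ratings by movie_id (the groupBy step).
by_movie = defaultdict(list)
for movie_id, rating in ratings:
    by_movie[movie_id].append(rating)

# Average each group (the avg step).
avg_rating = {
    movie_id: sum(values) / len(values) for movie_id, values in by_movie.items()
}

# Attach the title and sort by movie_id (the join and orderBy steps).
joined = [
    (movie_id, titles[movie_id], avg_rating[movie_id])
    for movie_id in sorted(avg_rating)
    if movie_id in titles  # inner join: keep only movies that have a title
]

for row in joined:
    print(row)
```

Each output row has the same shape as a row of `joined_df.select("movie_id", "title", "avg_rating")`.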

## Question 2: Top 10 Movies with the Highest Average Ratings

We extract the top 10 movies with the highest average viewer ratings by sorting the `avg_rating` column in descending order. This helps us identify the most favored movies based on viewer feedback.

### Spark Code:
```python
# Sort all movies by average rating, highest first, and show the top 10
top10_df = joined_df.orderBy("avg_rating", ascending=False)
top10_df.select("movie_id", "title", "avg_rating").show(10)
```

![Q2 Result](https://raw.githubusercontent.com/loerish/assignment3/main/images_assignment3/q2_result.png)




### Question 2 Result Interpretation
The result shows that all top 10 movies have a perfect average score of 5.0, meaning every viewer who rated them gave the maximum score. A perfect average is only possible when a movie received nothing but 5s, so these movies were likely rated by few users. The ranking therefore reflects niche appeal or small samples as much as high quality, and it is most useful for identifying standout content when read together with the number of ratings per movie.
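
One common way to reduce the influence of sparsely rated movies is to require a minimum number of ratings before ranking. The sketch below illustrates the idea in plain Python on made-up data; the threshold `MIN_RATINGS` and all rows are assumptions, not values from the assignment. In Spark, the same idea would amount to aggregating a count alongside the average and filtering on it.

```python
from collections import defaultdict

# Hypothetical toy data: (movie_id, rating). Not from the assignment's dataset.
ratings = [
    (10, 5),                                      # a single rating: perfect average
    (20, 5), (20, 4), (20, 5), (20, 4), (20, 5),  # five ratings
    (30, 3), (30, 4), (30, 3),                    # three ratings
]
MIN_RATINGS = 3  # assumed threshold; pick one that suits the data

by_movie = defaultdict(list)
for movie_id, rating in ratings:
    by_movie[movie_id].append(rating)

# Build (movie_id, average, count) tuples and rank by average, highest first.
stats = [(m, sum(v) / len(v), len(v)) for m, v in by_movie.items()]
ranked_all = sorted(stats, key=lambda s: s[1], reverse=True)

# Same ranking, restricted to movies with enough ratings.
ranked_filtered = [s for s in ranked_all if s[2] >= MIN_RATINGS]

print("top without threshold:", ranked_all[0])
print("top with threshold:   ", ranked_filtered[0])
```

Without the threshold, the movie with a single 5 leads the ranking; with it, the lead goes to a movie whose average rests on more ratings.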

## Question 3: Join Rating and User Data

We join `rating_df` with `user_df` on the `user_id` column. This enriches each rating record with the user's demographic information: age, gender, occupation, and zip code.

This is essential for analyzing how user characteristics affect movie preferences and enables user-based recommendation analysis.

### Spark Code:

```python
# Inner join (the default): each rating row gains the user's demographic columns
rating_user_df = rating_df.join(user_df, on="user_id")
rating_user_df.show(10)
```

![Q3 Result](https://raw.githubusercontent.com/loerish/assignment3/main/images_assignment3/q3_result.png)




### Question 3 Result Interpretation
The result combines rating and user demographic data, so each rating row now carries the attributes of the user who gave it. For example, all records shown are from user 148, a 33-year-old male engineer from zip code 97006. This allows us to examine how specific user groups (e.g., by age or occupation) rate movies, and it forms the basis for personalized or group-based movie recommendations.
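
Because `join` defaults to an inner join, a rating whose `user_id` has no match in the user table would be dropped from the result. The plain-Python sketch below shows that behavior on made-up rows; the data and variable names are illustrative assumptions, not taken from the assignment.

```python
# Hypothetical toy data, not from the assignment's dataset.
ratings = [
    {"user_id": 1, "movie_id": 10, "rating": 4},
    {"user_id": 2, "movie_id": 10, "rating": 2},
    {"user_id": 9, "movie_id": 20, "rating": 5},  # no matching user
]
users = {
    1: {"age": 33, "gender": "M", "occupation": "engineer"},
    2: {"age": 27, "gender": "F", "occupation": "librarian"},
}

# Inner join on user_id: keep a rating only if its user exists,
# and merge the user's columns into the rating row.
rating_user = [
    {**r, **users[r["user_id"]]} for r in ratings if r["user_id"] in users
]

dropped = len(ratings) - len(rating_user)
print(f"{len(rating_user)} rows joined, {dropped} dropped")
```

Comparing row counts before and after the join is a quick check that no ratings were lost.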

## Question 4: Analyze Rating Trends by Occupation
We aim to explore whether users from different occupations exhibit varying rating patterns. This analysis can reveal how professional background may influence movie rating behavior.

### Spark Code:
```python
# Average rating per occupation, highest first
occupation_rating_df = rating_user_df.groupBy("occupation").avg("rating")
occupation_rating_df = occupation_rating_df.withColumnRenamed("avg(rating)", "avg_rating")
occupation_rating_df = occupation_rating_df.orderBy("avg_rating", ascending=False)
occupation_rating_df.show(10)
```

![Q4 Result](https://raw.githubusercontent.com/loerish/assignment3/main/images_assignment3/q4_result.png)




### Question 4 Result Interpretation
The result displays the average movie rating by occupation. Users with the occupations `none`, `lawyer`, and `doctor` tend to give relatively higher ratings, possibly indicating a more lenient or enthusiastic viewer group. Users with the occupations `programmer` and `librarian` gave lower average ratings, suggesting they may be more critical or have more specific preferences. These are differences in group averages only; they do not show that occupation causes the rating behavior. Even so, they can inform targeted content recommendations in real-world applications.

## Question 5: Analyze rating differences by gender
We investigate whether male and female users differ in their average movie ratings. We use the `rating_user_df` DataFrame (which combines rating and user information) and group the data by the `gender` column to compute the average rating for each gender.

### Spark Code:
```python
# Average rating per gender, sorted by gender label
gender_rating_df = rating_user_df.groupBy("gender").avg("rating")
gender_rating_df = gender_rating_df.withColumnRenamed("avg(rating)", "avg_rating")
gender_rating_df = gender_rating_df.orderBy("gender")
gender_rating_df.show()
```

![Q5 Result](https://raw.githubusercontent.com/loerish/assignment3/main/images_assignment3/q5_result.png)




### Question 5 Result Interpretation
The result shows that female users (F) have a slightly higher average rating (≈ 3.5315) than male users (M), whose average is ≈ 3.5293. The difference is only about 0.002, so at most it suggests a marginally more generous rating tendency among female users. Further statistical testing would be needed to confirm whether this difference is significant.
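
One possible form of such a test is Welch's t-test, which compares two group means without assuming equal variances. The sketch below computes only the t statistic, in plain Python on made-up ratings; the sample values and the helper `welch_t` are illustrative assumptions, not the assignment's data or method. Turning the statistic into a p-value additionally requires the t distribution, for example from a statistics library.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy ratings, not from the assignment's dataset.
female = [4, 3, 5, 4, 3, 4]
male = [3, 4, 4, 3, 5, 3]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    # Standard error of the difference in means, using sample variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


t_stat = welch_t(female, male)
print(f"difference in means: {mean(female) - mean(male):.3f}")
print(f"Welch t statistic:   {t_stat:.3f}")
```

A t statistic close to zero means the observed difference is small relative to the variation within the groups.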