# Assignment 2: Data wrangling and data exploring with pandas

All questions are weighted the same in this assignment. You are encouraged to check out the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/). 

What to submit: HTML version of this notebook (use File -> Download as -> HTML from the menu) with solutions and answers to the questions. Please rename the file as follows: Assignment_2_*Name*_*Surname*.html.

In [3]:
import pandas as pd

In [4]:
pd.set_option('display.max_rows', 15)
pd.set_option('display.precision', 2)
pd.options.display.float_format = '{:,.2f}'.format

In this assignment, use our version of the [MovieLens](https://grouplens.org/datasets/movielens/) data set -- the same data set that we used in the lecture.

In [6]:
df = pd.read_csv('movie_lens_1M.csv')

![MovieLens](https://grouplens.org/site-content/uploads/visual_for_blog_post_cscw2018.png)

### Exercise 1

When did the first rating occur? When did the latest rating occur?

Solution: `['2000-04-25 23:05:32', '2003-02-28 17:49:50']`

In [8]:
#print(df.head())

# Ensure the 'timestamp' column is in datetime format
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Find the first rating (earliest timestamp)
first_rating = df['timestamp'].min()

# Find the latest rating (latest timestamp)
latest_rating = df['timestamp'].max()

print(f"The first rating occurred on: {first_rating}")
print(f"The latest rating occurred on: {latest_rating}")



The first rating occurred on: 2000-04-25 23:05:32
The latest rating occurred on: 2003-02-28 17:49:50
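
As a minimal sketch of the same idea on toy data (the three rows below are made up for illustration, not taken from `movie_lens_1M.csv`), both extremes can be read off in a single `agg` call, and `idxmin`/`idxmax` give access to the full rating rows. The last line is only relevant under the assumption that a file stores timestamps as Unix seconds instead of strings; in that case `unit="s"` would be needed.

```python
import pandas as pd

# Toy ratings table; the values are illustrative only.
toy = pd.DataFrame({
    "rating": [4, 5, 3],
    "timestamp": ["2000-04-25 23:05:32", "2003-02-28 17:49:50", "2001-07-01 12:00:00"],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

# Earliest and latest timestamp in one call.
first_and_last = toy["timestamp"].agg(["min", "max"])

# Row labels of the extremes, so the complete rating rows can be inspected.
first_row = toy.loc[toy["timestamp"].idxmin()]
last_row = toy.loc[toy["timestamp"].idxmax()]

# Assumption: if timestamps were stored as Unix seconds, pass unit="s".
epoch_example = pd.to_datetime(pd.Series([956703932]), unit="s")
```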


We are now going to save the titles of all movies that were rated at least 250 times as an index called `active_titles`. In the exercises that follow, use this index for quick access to movie titles.

In [9]:
# Number of ratings per title
ratings_by_title = df.groupby('title').size()
# Keep only the titles with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]
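
To see what this filter does, here is a small sketch on a made-up table. The threshold `min_ratings = 2` and the titles `A`, `B`, `C` are hypothetical stand-ins for the 250 ratings and the real movie titles.

```python
import pandas as pd

# Toy ratings table: A has 3 ratings, B has 1, C has 2.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})
min_ratings = 2  # hypothetical threshold; the assignment uses 250

# Count ratings per title, then keep the index labels that pass the threshold.
counts = toy.groupby("title").size()
active = counts.index[counts >= min_ratings]

# The index can then be used to restrict the ratings to active titles.
subset = toy[toy["title"].isin(active)]
```

The boolean mask `counts >= min_ratings` is aligned with `counts.index`, which is why indexing the index with it returns exactly the qualifying titles.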

### Exercise 2

What are the top 10 (best rated) movies according to scientists?

What are the top 10 (best rated) movies according to artists?

Solution (first rows): `[Guess Who's Coming to Dinner (1967), Manchurian Candidate, The (1962)]`

In [10]:
# Filter data for active titles
filtered_df = df[df['title'].isin(active_titles)]

# Calculate average ratings by title
average_ratings = filtered_df.groupby('title')['rating'].mean()

# Filter users who are scientists and calculate their top 10 rated movies
scientist_df = filtered_df[filtered_df['occupation'] == 'scientist']
scientist_ratings = scientist_df.groupby('title')['rating'].mean()
top_10_scientist_movies = scientist_ratings.sort_values(ascending=False).head(10)

# Filter users who are artists and calculate their top 10 rated movies
artist_df = filtered_df[filtered_df['occupation'] == 'artist']
artist_ratings = artist_df.groupby('title')['rating'].mean()
top_10_artist_movies = artist_ratings.sort_values(ascending=False).head(10)

# Display results
print("Top 10 movies according to scientists:")
print(top_10_scientist_movies)

print("\nTop 10 movies according to artists:")
print(top_10_artist_movies)

## COMMENT

# My solution differs from the given one. This may be due to a different approach,
# or simply to tie-breaking: the top two scientist titles both average 5.00, so
# their order depends on how the sort handles ties.
# I restricted the data to the movies in the active_titles index, filtered it by
# the users' occupation (scientist or artist), and computed the mean rating per title.
# Finally, I sorted the results to get the top 10 movies for each group.
# This seemed the most logical approach to me.


Top 10 movies according to scientists:
title
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)   5.00
Guess Who's Coming to Dinner (1967)             5.00
Midnight Express (1978)                         4.83
Modern Times (1936)                             4.75
Roman Holiday (1953)                            4.75
M (1931)                                        4.75
Monty Python and the Holy Grail (1974)          4.68
American History X (1998)                       4.67
Charade (1963)                                  4.67
Mumford (1999)                                  4.67
Name: rating, dtype: float64

Top 10 movies according to artists:
title
Manchurian Candidate, The (1962)                                      4.77
Close Shave, A (1995)                                                 4.77
When We Were Kings (1996)                                             4.62
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)   4.61
Rear Window (1954)                                   
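
When several titles share the same mean rating, their order in a top-10 list is not determined by the mean alone. The sketch below, on made-up data, shows one possible way to make the ranking deterministic: compute the mean together with the number of ratings and use the count as a tie-breaker. Breaking ties by count is an assumption made for illustration, not something the assignment prescribes.

```python
import pandas as pd

# Toy data: X and Y both average 5.0, but X has more ratings.
toy = pd.DataFrame({
    "title": ["X", "X", "X", "Y", "Z", "Z"],
    "rating": [5, 5, 5, 5, 4, 5],
})

# Mean rating and number of ratings per title in one pass.
stats = toy.groupby("title")["rating"].agg(["mean", "count"])

# Sort by mean first; among equal means, prefer the title with more ratings.
ranked = stats.sort_values(["mean", "count"], ascending=[False, False])
```

Keeping the `count` column next to the mean also makes it visible when a perfect average rests on only a handful of ratings.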

### Exercise 3

What are the top 10 (best rated) movies according to people in the state of California (CA)?

What are the top 10 (best rated) movies according to people in the state of New York (NY)?

Solution (first rows): `[Sunset Blvd. (a.k.a. Sunset Boulevard) (1950), Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)]`

In [11]:
# Filter data for active titles
filtered_df = df[df['title'].isin(active_titles)]

# Filter for people in California and calculate top 10 rated movies
ca_df = filtered_df[filtered_df['state'] == 'CA']
ca_ratings = ca_df.groupby('title')['rating'].mean()
top_10_ca_movies = ca_ratings.sort_values(ascending=False).head(10)

# Filter for people in New York and calculate top 10 rated movies
ny_df = filtered_df[filtered_df['state'] == 'NY']
ny_ratings = ny_df.groupby('title')['rating'].mean()
top_10_ny_movies = ny_ratings.sort_values(ascending=False).head(10)

# Display results
print("Top 10 movies according to people in California:")
print(top_10_ca_movies)

print("\nTop 10 movies according to people in New York:")
print(top_10_ny_movies)


Top 10 movies according to people in California:
title
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)                                 4.62
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)           4.61
Wallace & Gromit: The Best of Aardman Animation (1996)                        4.53
Wrong Trousers, The (1993)                                                    4.53
Lawrence of Arabia (1962)                                                     4.52
Treasure of the Sierra Madre, The (1948)                                      4.52
Godfather, The (1972)                                                         4.51
City Lights (1931)                                                            4.51
When We Were Kings (1996)                                                     4.50
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)   4.50
Name: rating, dtype: float64

Top 10 movies according to people in New York:
title
Seven Samurai (The Magnificent S
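
Since the California and New York steps are identical apart from the state code, they could be wrapped in a small helper. The function `top_n_by_state` below is a hypothetical name introduced for this sketch, and the data is made up; `Series.nlargest` is used in place of sorting and taking the head.

```python
import pandas as pd

# Toy ratings with a state column; values are illustrative only.
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "NY", "NY", "NY"],
    "title": ["A", "A", "B", "A", "B", "B"],
    "rating": [5, 4, 3, 2, 5, 4],
})

def top_n_by_state(frame, state, n=10):
    """Hypothetical helper: mean rating per title for one state, best n first."""
    means = frame.loc[frame["state"] == state].groupby("title")["rating"].mean()
    return means.nlargest(n)

top_ca = top_n_by_state(toy, "CA", n=1)
top_ny = top_n_by_state(toy, "NY", n=1)
```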

### Exercise 4

Which occupations are most represented by male users? List top five.

Solution: `['college/grad student', 'other', 'executive/managerial', 'technician/engineer', 'academic/educator']`

In [12]:
# Filter the data for male users
male_users = df[df['gender'] == 'M']

# Count the number of occurrences of each occupation
occupation_counts = male_users['occupation'].value_counts()

# Get the top 5 most represented occupations
top_5_occupations = occupation_counts.head(5)

# Display results
print("Top 5 occupations most represented by male users:")
print(top_5_occupations)


Top 5 occupations most represented by male users:
occupation
college/grad student    97585
other                   94009
executive/managerial    84842
technician/engineer     64399
academic/educator       50955
Name: count, dtype: int64
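
Note that `value_counts` on a ratings table counts rating rows, so an occupation whose users rate many movies is weighted more heavily. If the number of distinct users is wanted instead, the rows can be deduplicated first. The sketch below uses made-up data and assumes a user identifier column, here called `user_id`; the real column name may differ.

```python
import pandas as pd

# Toy ratings: user 1 (student) rated four movies, users 2 and 3 (engineers) one each.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 3],
    "gender": ["M", "M", "M", "M", "M", "M"],
    "occupation": ["student", "student", "student", "student", "engineer", "engineer"],
})
males = toy[toy["gender"] == "M"]

# Counts rating rows: one heavy rater dominates.
ratings_per_occupation = males["occupation"].value_counts()

# Counts distinct users: keep one row per user before counting.
users_per_occupation = males.drop_duplicates("user_id")["occupation"].value_counts()
```

On this toy table the two rankings disagree, which is why it is worth stating which of the two the phrase "most represented" refers to.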


### Exercise 5

Which are the top 10 movies with the least fluctuations in rating? (*hint: use standard deviation*)

Only take into account the `"active titles"` movies.

Solution (first rows): `[Close Shave, A (1995), Rear Window (1954), Great Escape, The (1963)]`

In [14]:
# Filter the DataFrame for active titles
filtered_df = df[df['title'].isin(active_titles)]

# Calculate the standard deviation of ratings for each title
rating_std = filtered_df.groupby('title')['rating'].std()

# Sort the movies by standard deviation in ascending order
least_fluctuations = rating_std.sort_values().head(10)

# Display results
print("Top 10 movies with the least fluctuations in rating:")
print(least_fluctuations)


Top 10 movies with the least fluctuations in rating:
title
Close Shave, A (1995)                           0.66
Rear Window (1954)                              0.69
Great Escape, The (1963)                        0.69
Shawshank Redemption, The (1994)                0.70
Wrong Trousers, The (1993)                      0.71
Raiders of the Lost Ark (1981)                  0.73
North by Northwest (1959)                       0.73
Hustler, The (1961)                             0.74
Double Indemnity (1944)                         0.74
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)   0.74
Name: rating, dtype: float64
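
For reference, pandas computes the sample standard deviation by default (`ddof=1`). A minimal sketch on made-up ratings, using `nsmallest` as an alternative to sorting and taking the head:

```python
import pandas as pd

# Toy ratings: A is rated identically by everyone, B is polarising, C is in between.
toy = pd.DataFrame({
    "title": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "rating": [4, 4, 4, 4, 1, 5, 1, 5, 3, 4, 3, 4],
})

# Sample standard deviation (ddof=1 is the pandas default) per title.
rating_std = toy.groupby("title")["rating"].std()

# The two titles with the smallest spread.
least = rating_std.nsmallest(2)
```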


### Exercise 6

List 20 movies where the ratings of female and male users differ the least on average. (*hint: consider changing display settings*)

Only take into account the `"active titles"` movies.

Solution (first rows): `[Bob Roberts (1992), American Movie (1999), Jerry Maguire (1996), Cape Fear (1991), Serpico (1973)]`

In [15]:
# Filter for active titles
filtered_df = df[df['title'].isin(active_titles)]

# Calculate mean ratings for male and female users by title
mean_ratings_by_gender = filtered_df.groupby(['title', 'gender'])['rating'].mean().unstack()

# Calculate the absolute difference between male and female ratings
mean_ratings_by_gender['difference'] = abs(mean_ratings_by_gender['M'] - mean_ratings_by_gender['F'])

# Sort by the smallest difference
least_diff_movies = mean_ratings_by_gender.sort_values(by='difference').head(20)

# Display the results
print("20 movies where the ratings of female and male users differ the least:")
print(least_diff_movies[['difference']])


20 movies where the ratings of female and male users differ the least:
gender                        difference
title                                   
Bob Roberts (1992)                  0.00
American Movie (1999)               0.00
Jerry Maguire (1996)                0.00
Cape Fear (1991)                    0.00
Serpico (1973)                      0.00
...                                  ...
Batman Returns (1992)               0.00
Cat on a Hot Tin Roof (1958)        0.00
Perfect Murder, A (1998)            0.00
Executive Decision (1996)           0.01
Hamlet (1996)                       0.01

[20 rows x 1 columns]
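
The output above is truncated to 15 rows and rounded to two decimals because of the display options set at the top of the notebook. One way to follow the hint about display settings without changing them globally is `pd.option_context`, which applies options only inside a `with` block. The values below are made up for illustration.

```python
import pandas as pd

# Toy differences that would all print as 0.00 with two decimals.
diffs = pd.Series([0.00012, 0.00134, 0.00456], index=["A", "B", "C"], name="difference")

# Temporarily allow 20 rows and show four decimals; settings revert after the block.
with pd.option_context("display.max_rows", 20,
                       "display.float_format", "{:.4f}".format):
    rendered = repr(diffs)

print(rendered)
```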


### Exercise 7

From the previously shown 20 movies where the ratings of female and male users differ the least, list the top 4 with the highest ratings.

Only take into account the `"active titles"` movies.

Solution (first rows): `['Roger & Me (1989)',  'Cat on a Hot Tin Roof (1958)', 'American Movie (1999)', 'Serpico (1973)']`

In [16]:
# Filter for active titles
filtered_df = df[df['title'].isin(active_titles)]

# Calculate mean ratings for male and female users by title
mean_ratings_by_gender = filtered_df.groupby(['title', 'gender'])['rating'].mean().unstack()

# Calculate the absolute difference between male and female ratings
mean_ratings_by_gender['difference'] = abs(mean_ratings_by_gender['M'] - mean_ratings_by_gender['F'])

# Sort by the smallest difference and take the top 20 movies
# (copy so that the new column below is added to an independent DataFrame)
least_diff_movies = mean_ratings_by_gender.sort_values(by='difference').head(20).copy()

# Calculate the overall mean rating for these 20 movies
least_diff_movies['overall_rating'] = filtered_df.groupby('title')['rating'].mean()

# Sort the 20 movies by overall rating in descending order
top_4_highest_rated = least_diff_movies.sort_values(by='overall_rating', ascending=False).head(4)

# Display the results
print("Top 4 movies with the highest ratings among the 20 with the least gender differences:")
print(top_4_highest_rated[['overall_rating']])


Top 4 movies with the highest ratings among the 20 with the least gender differences:
gender                        overall_rating
title                                       
Roger & Me (1989)                       4.07
Cat on a Hot Tin Roof (1958)            4.05
American Movie (1999)                   4.01
Serpico (1973)                          3.99
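
The whole chain for this exercise can be condensed as follows. This is a sketch on made-up data, not the reference solution: `pivot_table` replaces the `groupby(...).unstack()` step, and `nsmallest`/`nlargest` replace the sort-and-head steps. The sizes 2 and 1 stand in for 20 and 4.

```python
import pandas as pd

# Toy ratings by gender; values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C", "C"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "rating": [4, 4, 5, 3, 2, 3],
})

# Mean rating per title and gender, one column per gender.
by_gender = toy.pivot_table(values="rating", index="title", columns="gender", aggfunc="mean")
by_gender["difference"] = (by_gender["M"] - by_gender["F"]).abs()

# Titles with the smallest gender gap; copy before adding a column.
closest = by_gender.nsmallest(2, "difference").copy()

# Overall mean per title; assignment aligns on the title index.
closest["overall_rating"] = toy.groupby("title")["rating"].mean()

# Best rated title among the closest ones.
best = closest.nlargest(1, "overall_rating")
```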
