**T1-2659 Data Curation for Business Analytics**

**October 19th, 2023**

**Final Exam: Part II Pandas**

“What movie should I watch this evening?” You have probably asked yourself this question often. I certainly have, more than once. From Netflix to Hulu, building robust movie recommendation systems is extremely important because modern consumers demand personalized content.

We are going to examine a dataset from MovieLens, a non-commercial service that provides personalized movie recommendations. The dataset contains 5-star ratings that users gave to movies.


The data are contained in the files *movies_dc.csv* and *ratings_dc.csv*. More details about the contents and use of these files follow.


> Ratings Data File Structure (*ratings_dc.csv*) All ratings are contained in the file *ratings_dc.csv*. Each line of this file after the header row represents one rating of one movie by one user, and has the following format: userId, movieId, rating. The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

> Movies Data File Structure (*movies_dc.csv*) Movie information is contained in the file *movies_dc.csv*. Each line of this file after the header row represents one movie, and has the following format: movieId, title, genres, movie_description.




**Question 1 (10 points)**

Load the *movies_dc.csv* data as a pandas dataframe. Fix the following problems:



*   The “movieId” column is mistakenly named "movieId_". Please rename the column to "movieId".
*   The “movie_description” column has irrelevant values. Please delete this column.
*   The “title” column contains each movie’s release year. Please extract the year from the “title” column and use it to generate a new column “year”. (You can still keep the release year in the original "title" column.)
*   Note that the data type of the new column “year” should be converted to **int**.

**You should work on this updated dataframe for this exam.**


In [None]:
## Insert your Q1 solution
import pandas as pd
import numpy as np

movie_df = pd.read_csv('/content/movies_dc.csv')
movie_df.head(10)


Unnamed: 0,movieId,title,genres,movie_description
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Hello World!
1,2,Jumanji (1995),Adventure|Children|Fantasy,Hello World!
2,3,Grumpier Old Men (1995),Comedy|Romance,Hello World!
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Hello World!
4,5,Father of the Bride Part II (1995),Comedy,Hello World!
5,6,Heat (1995),Action|Crime|Thriller,Hello World!
6,7,Sabrina (1995),Comedy|Romance,Hello World!
7,8,Tom and Huck (1995),Adventure|Children,Hello World!
8,9,Sudden Death (1995),Action,Hello World!
9,10,GoldenEye (1995),Action|Adventure|Thriller,Hello World!


In [None]:

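## Rename the mistakenly encoded column
movie_df.rename(columns={'movieId_': 'movieId'}, inplace=True)
## Drop the irrelevant description column
movie_df.drop(["movie_description"], axis=1, inplace=True)
## Titles end with "(YYYY)", so the four characters before the closing parenthesis are the year
movie_df['year'] = movie_df.title.astype(str).str[-5:-1].astype(int)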
movie_df.dtypes


movieId     int64
title      object
genres     object
year        int64
dtype: object
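
The slice `str[-5:-1]` above relies on every title ending exactly with "(YYYY)". As a hedged alternative, the sketch below extracts the year with a regular expression, which tolerates trailing whitespace. The sample titles and the variable names `titles` and `year` are illustrative assumptions, not part of the exam data.

```python
import pandas as pd

# Hypothetical sample titles; the last one has trailing whitespace.
titles = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Heat (1995)", "Sabrina (1995) "]}
)

# Capture a four-digit group in parentheses at the end of the title.
year = titles["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Titles without a match would become NaN, so check before casting to int.
assert year.notna().all()
titles["year"] = year.astype(int)
print(titles)
```

If some titles had no year at all, the `notna()` check would fail and those rows would need to be handled before the cast.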

**Question 2 (5 points)**

Show the top 3 years with the highest number of movies.


In [None]:
## Insert your Q2 solution
## value_counts() already sorts counts in descending order
movie_df.year.value_counts().head(3)

2002    311
2006    295
2001    294
Name: year, dtype: int64

**Question 3 (5 points)**

Load the *ratings_dc.csv* as a pandas dataframe. Show the mean, max, and min values of the “rating” column.


In [None]:
## Insert your Q3 solution
rating_df = pd.read_csv('/content/ratings_dc.csv')
rating_df.describe()

Unnamed: 0,userId,movieId,rating
count,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557
std,182.618491,35530.987199,1.042529
min,1.0,1.0,0.5
25%,177.0,1199.0,3.0
50%,325.0,2991.0,3.5
75%,477.0,8122.0,4.0
max,610.0,193609.0,5.0
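
`describe()` reports statistics for every numeric column. If only the three requested statistics of “rating” are wanted, one option is to aggregate that column directly. The sketch below uses a small made-up frame (`ratings`) in the same userId, movieId, rating layout rather than the exam file.

```python
import pandas as pd

# Hypothetical ratings in the same layout as ratings_dc.csv.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 3, 1, 6],
    "rating": [4.0, 0.5, 5.0, 3.5],
})

# Restrict the summary to mean, max, and min of a single column.
summary = ratings["rating"].agg(["mean", "max", "min"])
print(summary)
```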


**Question 4 (10 points)**

Use the rating dataframe to generate a dataframe that contains each movie’s average rating and number of ratings.

Then, inner join the movie table with the table generated in the previous step to create a new dataframe, *movie_rating*, which has the following six columns: movieId, title, genres, year, avg_rating, num_rating. Each row presents information about one movie. "avg_rating" is the mean value of a movie's ratings; "num_rating" measures how many times a movie has been rated.



In [None]:
## Insert your Q4 solution
mean_rating = rating_df.groupby('movieId')['rating'].agg(['mean','count']) ## intermediate table
movie_rating = pd.merge(movie_df, mean_rating, on="movieId")
movie_rating.rename(columns={'mean': 'avg_rating', 'count':'num_rating'}, inplace=True)
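
As a variation on the solution above, named aggregation can produce the final column names in one step, and passing `how="inner"` makes the join type explicit (it is also the default of `merge`). The sketch below runs on made-up frames; `movies`, `ratings`, `stats`, and `movie_stats` are illustrative names only.

```python
import pandas as pd

# Hypothetical movie and rating tables; movie 3 has no ratings.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1996)", "C (1997)"],
})
ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
})

# Named aggregation assigns the output column names directly.
stats = (
    ratings.groupby("movieId")["rating"]
    .agg(avg_rating="mean", num_rating="count")
    .reset_index()
)

# An inner join keeps only movies that appear in both tables.
movie_stats = movies.merge(stats, on="movieId", how="inner")
print(movie_stats)
```

Because the join is inner, a movie that nobody rated (movie 3 here) does not appear in the result.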

**Question 5 (5 points)**

Show the number of movies with an average rating of 4.0.


In [None]:
## Insert your Q5 solution
movie_rating[movie_rating['avg_rating'] == 4.0].shape

(766, 6)

**Question 6 (5 points)**

Show the titles of the two movies with the highest number of ratings (NOT average ratings).

In [None]:
## Insert your Q6 solution
movie_rating.sort_values(by='num_rating', ascending=False)[:2]

Unnamed: 0,movieId,title,genres,year,avg_rating,num_rating
310,356,Forrest Gump (1994),Comedy|Drama|Romance|War,1994,4.164134,329
273,318,"Shawshank Redemption, The (1994)",Crime|Drama,1994,4.429022,317
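
The question asks only for titles, while the cell above displays whole rows. One way to return just the titles is `nlargest` followed by a column selection. This is a sketch on made-up data; the frame `sample` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical movie table with rating counts.
sample = pd.DataFrame({
    "title": ["A (1994)", "B (1994)", "C (1995)"],
    "num_rating": [30, 25, 12],
})

# Keep the two rows with the largest num_rating, then select the titles.
top_titles = sample.nlargest(2, "num_rating")["title"]
print(top_titles.tolist())
```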


**Question 7 (10 points)**

Group the release years of movies into the following categories: “before 2000”, “2000-2009”, “since 2010”. The bins should be [1979, 1999, 2009, 2020]. Add the generated category to the *movie_rating* dataframe as a new column "time_interval" and display it.

Show the number of movies in each time_interval.


In [None]:
## Insert your Q7 solution
movie_rating['time_interval'] = pd.cut(movie_rating['year'], bins = [1979, 1999, 2009, 2020], labels=["before 2000", "2000-2009", "since 2010"])
movie_rating['time_interval'].value_counts()

before 2000    3381
2000-2009      2846
since 2010     1926
Name: time_interval, dtype: int64
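
With its default `right=True`, `pd.cut` treats each bin as open on the left and closed on the right, so the bins above mean (1979, 1999], (1999, 2009], and (2009, 2020]. The sketch below checks the boundary years on a small made-up series (`years`). A year at or below 1979 falls outside all bins and gets NaN, which `value_counts()` does not count by default.

```python
import pandas as pd

# Hypothetical years chosen to sit on and around the bin edges.
years = pd.Series([1975, 1980, 1999, 2000, 2009, 2010, 2020])

# Default right=True: (1979, 1999], (1999, 2009], (2009, 2020].
intervals = pd.cut(
    years,
    bins=[1979, 1999, 2009, 2020],
    labels=["before 2000", "2000-2009", "since 2010"],
)
print(pd.DataFrame({"year": years, "time_interval": intervals}))
```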

**Question 8 (challenging and extra 5 bonus points)**

Now implement a recommender system using the collaborative filtering method. The idea is simple: recommend movies on the basis that "people who like this movie also like these movies".

For example, people who like Star Wars are very likely to watch Star Trek.
To do so, find the users who like one movie (i.e., gave it a rating of 5), then
count which other movies these users also like, ranked by the number of likes.

**Task: Show the top 10 recommended movies that users who like Forrest Gump (1994) may also like.**

Congratulations! You have just built your first recommender system, one that is worth 1 million dollars :D


In [None]:
## Insert your Q8 solution
## According to Q6, we know that the movieId of Forrest Gump (1994) is 356.
## Step 1, find users who like Forrest Gump (1994)
gump_user = rating_df.loc[(rating_df['movieId'] == 356) & (rating_df['rating'] == 5), 'userId']
## Step 2, find what other movies these users also liked
liked_movies = rating_df.loc[(rating_df['userId'].isin(gump_user)) & (rating_df['rating'] == 5)
                                & (rating_df['movieId'] != 356)]
## Step 3, count the likes of each movie and sort by number of likes
movies_num_likes = liked_movies.groupby('movieId').agg(num_likes = ('userId', 'count')).sort_values(by='num_likes', ascending=False)
## Step 4, recommend the 10 movies with the most likes
pd.merge(movies_num_likes, movie_rating, on='movieId').head(10).title


0                     Shawshank Redemption, The (1994)
1                                    Braveheart (1995)
2                     Silence of the Lambs, The (1991)
3                                  Pulp Fiction (1994)
4                                   Matrix, The (1999)
5                              Schindler's List (1993)
6                                    Fight Club (1999)
7                    Terminator 2: Judgment Day (1991)
8    Star Wars: Episode V - The Empire Strikes Back...
9                           Saving Private Ryan (1998)
Name: title, dtype: object
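
The four steps above can be wrapped in a reusable function so the same logic applies to any movie. The sketch below is one possible arrangement, not the official solution: the function name `also_liked`, its parameters, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def also_liked(ratings, movie_id, like_rating=5.0, top_n=10):
    """Count likes of other movies among users who liked movie_id."""
    # Keep only the "like" ratings.
    liked = ratings[ratings["rating"] == like_rating]
    # Users who liked the target movie.
    fans = liked.loc[liked["movieId"] == movie_id, "userId"]
    # Other movies those users also liked.
    others = liked[liked["userId"].isin(fans) & (liked["movieId"] != movie_id)]
    # value_counts() sorts by number of likes, descending.
    return others["movieId"].value_counts().head(top_n)

# Hypothetical toy data: users 1, 2, and 3 like movie 10.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 20],
    "rating":  [5.0, 5.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})
recs = also_liked(toy, movie_id=10)
print(recs)
```

In the toy data, user 4 also likes movie 20 but is not counted, because only users who liked the target movie contribute likes.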