**T1-2659 Data Curation for Business Analytics**

**October 19th, 2023**

**Final Exam: Part II Pandas**

“What movie should I watch this evening?” You have probably asked yourself this question often; I certainly have, more than once. From Netflix to Hulu, building robust movie recommendation systems is extremely important given modern consumers' huge demand for personalized content.

We are going to examine a dataset from MovieLens, a service that provides non-commercial, personalized movie recommendations. The dataset describes 5-star ratings that users gave to movies.


The data are contained in the files *movies_dc.csv* and *ratings_dc.csv*. More details about the contents and use of these files follow.


> Ratings Data File Structure (*ratings_dc.csv*): All ratings are contained in the file *ratings_dc.csv*. Each line of this file after the header row represents one rating of one movie by one user, and has the following format: userId, movieId, rating. The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
>
> Movies Data File Structure (*movies_dc.csv*): Movie information is contained in the file *movies_dc.csv*. Each line of this file after the header row represents one movie, and has the following format: movieId, title, genres, movie_description.




**Question 1 (10 points)**

Load the *movies_dc.csv* data as a pandas dataframe. Fix the following problems:



*   The “movieId” column is mistakenly encoded as "movieId_". Please rename the column to movieId.
*   The “movie_description” column has irrelevant values. Please delete this column.
*   The “title” column contains each movie’s release year. Please extract the year from the “title” column and use it to generate a new column, “year”. (You can still keep the release year in the original "title" column.)
*   Note that the data type of the new column “year” should be converted to **int**.

**You should work on this updated dataframe for this exam.**


In [1]:
import pandas as pd

In [43]:
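df_movies = pd.read_csv('movies_dc.csv')

# Fix the misnamed key column and drop the irrelevant description column
df_movies.rename(columns={'movieId_':'movieId'}, inplace=True)
df_movies.drop(columns='movie_description', inplace=True)
# Titles end with "(YYYY)", so the four characters before the closing parenthesis are the year
df_movies['year'] = df_movies['title'].apply(lambda x: int(x[-5:-1]))

display(df_movies)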

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
...,...,...,...,...
8155,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017
8156,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017
8157,193585,Flint (2017),Drama,2017
8158,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018
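
The positional slice above relies on every title ending exactly with `(YYYY)`. As a hedged alternative, the sketch below extracts the year with a regular expression, which tolerates trailing whitespace and titles without a year. The toy titles are made up for illustration and are not taken from *movies_dc.csv*.

```python
import pandas as pd

# Toy titles (hypothetical), including trailing whitespace and a title with no year
toy = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995) ",
        "Blade Runner 2049 (2017)",
        "Untitled Project",
    ]
})

# Capture a four-digit year in parentheses at the end of the title,
# allowing optional trailing whitespace
year_text = toy["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Titles without a year yield a missing value; the nullable Int64 dtype
# stores it as <NA> instead of raising during the integer conversion
toy["year"] = pd.to_numeric(year_text).astype("Int64")

print(toy)
```

If every title is known to carry a year, a plain `astype(int)` would satisfy the **int** requirement; the nullable dtype is only a safeguard for messy data.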


**Question 2 (5 points)**

Show the top 3 years with the highest number of movies.


In [44]:
movies_year = df_movies.groupby('year')['movieId'].size()
display(movies_year.nlargest(3))

year
2002    311
2006    295
2001    294
Name: movieId, dtype: int64
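
An equivalent way to count movies per year, shown here as a small sketch on made-up data, is `value_counts()`, which already returns counts sorted in descending order. The toy dataframe below is an assumption for illustration only.

```python
import pandas as pd

# Toy movie table (hypothetical values)
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5, 6],
    "year": [1995, 1995, 2002, 2002, 2002, 2017],
})

# value_counts() counts rows per distinct year and sorts by count, largest first
top_years = toy["year"].value_counts().head(2)

print(top_years)
```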

**Question 3 (5 points)**

Load the *ratings_dc.csv* as a pandas dataframe. Show the mean, max, and min values of the “rating” column.


In [45]:
df_ratings = pd.read_csv('ratings_dc.csv')

print('mean:', df_ratings['rating'].mean())
print('max:', df_ratings['rating'].max())
print('min:', df_ratings['rating'].min())

mean: 3.501556983616962
max: 5.0
min: 0.5
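
The three statistics can also be computed in a single call with `Series.agg`. The sketch below uses a made-up ratings table with the same column names as *ratings_dc.csv*; the values are assumptions for illustration.

```python
import pandas as pd

# Toy ratings (hypothetical values) with the same columns as ratings_dc.csv
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 2, 1, 3],
    "rating": [4.0, 0.5, 5.0, 3.5],
})

# agg with a list of function names returns a Series indexed by those names
summary = toy_ratings["rating"].agg(["mean", "max", "min"])

print(summary)
```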


**Question 4 (10 points)**

Use the rating dataframe to generate a dataframe that contains each movie’s average rating and number of ratings.

Then, you need to inner join the movie table with the table generated in the previous step to create a new dataframe, *movie_rating*, which has the following six columns: MovieID, title, genres, year, avg_rating, num_rating. Each row presents the information for one movie. "avg_rating" is the mean value of a movie's ratings; "num_rating" measures how many times a movie has been rated.



In [46]:
# Count ratings per movie, then add the mean rating per movie (both indexed by movieId)
df_avg = df_ratings.groupby('movieId').size().rename('num_rating').to_frame()
df_avg['avg_rating'] = df_ratings.groupby('movieId')['rating'].mean()

# Inner join keeps only movies that appear in both tables
movie_rating = df_avg.join(df_movies.set_index('movieId'), how='inner').reset_index()
movie_rating.rename(columns={'movieId':'MovieId'}, inplace=True)
# Reorder the columns to match the requested layout
movie_rating = movie_rating[['MovieId','title','genres','year','avg_rating','num_rating']]

display(movie_rating)

Unnamed: 0,MovieId,title,genres,year,avg_rating,num_rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.920930,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.431818,110
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.259615,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.357143,7
4,5,Father of the Bride Part II (1995),Comedy,1995,3.071429,49
...,...,...,...,...,...,...
8148,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,4.000000,1
8149,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,3.500000,1
8150,193585,Flint (2017),Drama,2017,3.500000,1
8151,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,3.500000,1
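
The same result can be sketched with named aggregation followed by `merge`. The toy tables below are made up for illustration; they include one movie without ratings and one rating for an unknown movie, to show what an inner join drops. The displayed indices above suggest 8,159 rows before the join and 8,152 after, which would be consistent with a few movies having no ratings.

```python
import pandas as pd

# Toy tables (hypothetical values); movie 3 has no ratings, movieId 9 has no movie row
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (2001)", "C (2010)"],
    "genres": ["Comedy", "Drama", "Action"],
    "year": [1995, 2001, 2010],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1, 2],
    "movieId": [1, 1, 2, 9],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Named aggregation: one output column per (input column, function) pair.
# "count" ignores missing ratings; none are missing in this toy data.
stats = toy_ratings.groupby("movieId", as_index=False).agg(
    avg_rating=("rating", "mean"),
    num_rating=("rating", "count"),
)

# Inner merge keeps only movieIds present in both tables
toy_movie_rating = toy_movies.merge(stats, on="movieId", how="inner")

print(toy_movie_rating)
```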


**Question 5 (5 points)**

Show the number of movies with an average rating above 4.0.


In [47]:
print('Number of Movies above 4.0:', movie_rating[movie_rating['avg_rating']>4]['MovieId'].count())

Number of Movies above 4.0: 935
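
Whether the threshold itself is included changes the count, so it is worth being explicit about `>` versus `>=`. A minimal sketch on made-up averages, assuming a column named `avg_rating` as in the exam:

```python
import pandas as pd

# Toy averages (hypothetical values); two movies sit exactly on the threshold
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "avg_rating": [4.0, 4.5, 3.9, 4.0],
})

# Summing a boolean mask counts the True entries
strictly_above = int((toy["avg_rating"] > 4.0).sum())
at_least = int((toy["avg_rating"] >= 4.0).sum())

print("above 4.0:", strictly_above)
print("4.0 or higher:", at_least)
```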


**Question 6 (5 points)**

Show the titles of the two movies with the highest number of ratings (NOT the highest average rating).

In [48]:
display(movie_rating.nlargest(2, 'num_rating')[['title', 'num_rating']])

Unnamed: 0,title,num_rating
310,Forrest Gump (1994),329
273,"Shawshank Redemption, The (1994)",317


**Question 7 (10 points)**

In the *movie_rating* dataframe, with movie release years ranging from 1980 to 2018, please add a new column called "time_interval." Specifically, use year bins [1979, 1999, 2009, 2020] to categorize movies into three groups: "before 2000", "2000-2009", and "since 2010".

After creating this new column, display the count of movies in each time interval.


In [51]:
movie_rating['time_interval'] = pd.cut(movie_rating['year'], bins=[1979,1999,2009,2020], labels=["before 2000", "2000-2009","since 2010"])
display(movie_rating['time_interval'].value_counts())

time_interval
before 2000    3381
2000-2009      2846
since 2010     1926
Name: count, dtype: int64
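
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins [1979, 1999, 2009, 2020] correspond to (1979, 1999], (1999, 2009], and (2009, 2020]. The sketch below checks the boundary years on a made-up series.

```python
import pandas as pd

# Boundary years chosen to show which side of each bin edge is included
years = pd.Series([1980, 1999, 2000, 2009, 2010, 2018])
bins = [1979, 1999, 2009, 2020]
labels = ["before 2000", "2000-2009", "since 2010"]

# Without labels, pd.cut shows the raw right-closed intervals
intervals = pd.cut(years, bins=bins)
# With labels, each interval is replaced by its name
labeled = pd.cut(years, bins=bins, labels=labels)

print(pd.DataFrame({"year": years, "interval": intervals, "label": labeled}))

# A value equal to the lowest edge falls outside every bin and becomes NaN
outside = pd.cut(pd.Series([1979]), bins=bins, labels=labels)
print(outside.isna().tolist())
```

This is why the lowest edge is 1979 rather than 1980: a first edge of 1980 would leave movies released in 1980 unassigned.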

**Question 8 (challenging and extra 5 bonus points)**

Now you need to implement a recommender system using the collaborative filtering method. The idea is simply to recommend movies on the basis that "people who like this movie also like these movies".

For example, people who like Star Wars are very likely to also watch Star Trek. To do so, you need to find the users who like a given movie (i.e., posted a rating of 5), and then count which other movies these users also like, ranked by the number of likes.

**Task: Show a recommended list of the top 10 movies that users who like Forrest Gump (1994) may also like.**

Congratulations! You have just built your first recommender system, one worth 1 million dollars :D


In [84]:
# Look up the movie's ID by its exact title
forrest_id = movie_rating.loc[movie_rating['title']=='Forrest Gump (1994)', 'MovieId'].values[0]

# Users who gave the movie 5 stars, and every other movie those users also rated 5 stars
forrest_userId = df_ratings.loc[(df_ratings['movieId']==forrest_id) & (df_ratings['rating']==5),'userId'].unique()
forrest_other_movies = df_ratings.loc[(df_ratings['userId'].isin(forrest_userId)) & (df_ratings['rating']==5) &(df_ratings['movieId']!=forrest_id)]

# Count likes per movie, attach titles, and rank by number of likes
recommended_movies = forrest_other_movies.groupby('movieId').size().rename('Number of Likes').to_frame().join(movie_rating[['MovieId','title']].set_index('MovieId'), how='inner')
recommended_movies.sort_values('Number of Likes', ascending=False, inplace=True)

display(recommended_movies.head(10))

Unnamed: 0,Number of Likes,title
318,48,"Shawshank Redemption, The (1994)"
110,38,Braveheart (1995)
593,34,"Silence of the Lambs, The (1991)"
296,34,Pulp Fiction (1994)
2571,31,"Matrix, The (1999)"
527,30,Schindler's List (1993)
2959,25,Fight Club (1999)
589,24,Terminator 2: Judgment Day (1991)
1196,23,Star Wars: Episode V - The Empire Strikes Back...
480,22,Jurassic Park (1993)
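
The same "people who like this also like" counting can be wrapped in a reusable function. The sketch below is one possible generalization, not the exam's reference solution: the function name `also_liked` and its parameters `like_threshold` and `top_n` are hypothetical, and the ratings are made up. With a maximum rating of 5.0, a threshold of `>= 5.0` selects the same rows as `== 5`.

```python
import pandas as pd

def also_liked(ratings, movie_id, like_threshold=5.0, top_n=10):
    """Count other movies liked by users who liked `movie_id` (hypothetical helper)."""
    liked = ratings["rating"] >= like_threshold
    # Users who liked the target movie
    fans = ratings.loc[(ratings["movieId"] == movie_id) & liked, "userId"].unique()
    # Other movies those same users also liked
    other_likes = ratings[
        ratings["userId"].isin(fans) & liked & (ratings["movieId"] != movie_id)
    ]
    # value_counts() returns likes per movieId, sorted from most to fewest
    return other_likes["movieId"].value_counts().head(top_n)

# Toy ratings (hypothetical values): users 1 and 2 like movie 10, user 3 does not
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 20],
    "rating":  [5.0, 5.0, 4.0, 5.0, 5.0, 5.0, 3.0, 5.0],
})

result = also_liked(toy_ratings, movie_id=10)
print(result)
```

Here movie 20 is liked by both fans of movie 10 and movie 40 by one of them, while user 3's like of movie 20 is not counted because that user did not like movie 10.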
