# Movie Recommendation

“What movie should I watch this evening?” You have probably asked yourself this question many times; I certainly have. From Netflix to Hulu, streaming services need robust movie recommendation systems to meet modern consumers' huge demand for personalized content.

We are going to examine a dataset from MovieLens, a service that provides non-commercial, personalized movie recommendations. The dataset describes 5-star ratings that users gave to movies.


The data are contained in the files *movies_dc.csv* and *ratings_dc.csv*. More details about the contents and use of these files follow.


> Ratings Data File Structure (*ratings_dc.csv*): All ratings are contained in the file *ratings_dc.csv*. Each line of this file after the header row represents one rating of one movie by one user, and has the following format: userId, movieId, rating. The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
>
> Movies Data File Structure (*movies_dc.csv*): Movie information is contained in the file *movies_dc.csv*. Each line of this file after the header row represents one movie, and has the following format: movieId, title, genres, movie_description.




Load the *movies_dc.csv* data as a pandas dataframe. Fix the following problems:



*   The “movieId” column is mistakenly encoded as "movieId_". Please revise the column name as movieId.
*   The “movie_description” column has irrelevant values. Please delete this column.

*   The “title” column contains each movie’s release year. Please extract the year information from the “title” column and use it to generate a new column “year”. (You can still keep the release year in the original "title" column.)

In [1]:
# importing libraries
import pandas as pd

In [2]:
## Insert your Q1 solution

# load movies_dc
moviesDF = pd.read_csv("movies_dc.csv")
moviesDF

Unnamed: 0,movieId_,title,genres,movie_description
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Hello World!
1,2,Jumanji (1995),Adventure|Children|Fantasy,Hello World!
2,3,Grumpier Old Men (1995),Comedy|Romance,Hello World!
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Hello World!
4,5,Father of the Bride Part II (1995),Comedy,Hello World!
...,...,...,...,...
8155,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,Hello World!
8156,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,Hello World!
8157,193585,Flint (2017),Drama,Hello World!
8158,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,Hello World!


In [3]:
# 1.1
moviesDF.rename(columns = {"movieId_": "movieId"}, inplace = True)

In [4]:
# 1.2
moviesDF.drop("movie_description", axis = 1, inplace = True)

In [5]:
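# 1.3: take the four characters before the closing parenthesis
# (assumes every title ends with "(YYYY)" and has no trailing whitespace)
moviesDF["year"] = moviesDF.title.str[-5:-1]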

In [6]:
# 1.3 (continued): convert the extracted year strings to integers
moviesDF["year"] = moviesDF.year.astype(int)
moviesDF.year.dtype

dtype('int64')
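
The slice used in step 1.3 relies on every title ending in exactly "(YYYY)". As a hedged alternative, the sketch below pulls the year out with a regular expression, so trailing whitespace or a missing year does not break the integer conversion. The toy titles and the names `titles`, `year_text`, and `years` are invented for illustration and are not taken from *movies_dc.csv*.

```python
import pandas as pd

# Hypothetical titles: one clean, one with trailing whitespace, one without a year
titles = pd.Series(["Toy Story (1995)", "Flint (2017) ", "Untitled Project"])

# Capture the four digits of a trailing "(YYYY)", allowing whitespace after it
year_text = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Titles without a year become <NA> instead of raising during conversion
years = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(years)
```

The nullable `Int64` dtype is used here only so that missing years can coexist with integers; if every title is known to carry a year, a plain `astype(int)` is enough.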

In [7]:
# Looking at the transformed dataset
moviesDF

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995
...,...,...,...,...
8155,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017
8156,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017
8157,193585,Flint (2017),Drama,2017
8158,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018


Show the top 3 years with the highest number of movies.


In [8]:
moviesDF.groupby("year").size().nlargest(3)

year
2002    311
2006    295
2001    294
dtype: int64

Load the *ratings_dc.csv* as a pandas dataframe. Show the mean, max, and min values of the “rating” column.


In [9]:
ratingsDF = pd.read_csv("ratings_dc.csv")
ratingsDF

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0
...,...,...,...
100831,610,166534,4.0
100832,610,168248,5.0
100833,610,168250,5.0
100834,610,168252,5.0


In [10]:
ratingsDF.rating.agg(["mean", "max", "min"])

mean    3.501557
max     5.000000
min     0.500000
Name: rating, dtype: float64

Use the ratings dataframe to generate a dataframe that contains each movie’s average rating and number of ratings.

Then inner join the movie table with the table generated in the previous step to create a new dataframe, *movie_rating*, which has the following six columns: movieId, title, genres, year, avg_rating, num_rating. Each row presents information about one movie. "avg_rating" is the mean of a movie's ratings; "num_rating" measures how many times a movie has been rated.



In [11]:
# aggregate ratingsDF per movie (mean rating and number of ratings), then flatten the column names
ratingDF2 = ratingsDF.groupby("movieId").agg({"rating":["mean", "count"]}).reset_index()
ratingDF2.columns = ratingDF2.columns.droplevel(0)
ratingDF2 = ratingDF2.set_axis(["movieId", "avg_rating", "num_rating"], axis="columns")

# checking the result
ratingDF2

Unnamed: 0,movieId,avg_rating,num_rating
0,1,3.920930,215
1,2,3.431818,110
2,3,3.259615,52
3,4,2.357143,7
4,5,3.071429,49
...,...,...,...
9719,193581,4.000000,1
9720,193583,3.500000,1
9721,193585,3.500000,1
9722,193587,3.500000,1


In [12]:
# joining dataframes
joinedDF = moviesDF[["movieId", "title", "genres", "year"]].set_index("movieId").join(ratingDF2.set_index("movieId"), how = "inner").reset_index()
joinedDF

Unnamed: 0,movieId,title,genres,year,avg_rating,num_rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,3.920930,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,3.431818,110
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,3.259615,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,2.357143,7
4,5,Father of the Bride Part II (1995),Comedy,1995,3.071429,49
...,...,...,...,...,...,...
8148,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,4.000000,1
8149,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,3.500000,1
8150,193585,Flint (2017),Drama,2017,3.500000,1
8151,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,3.500000,1
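
The same aggregate-then-join step can also be written with named aggregation and `merge`, which avoids flattening the column index by hand. This is only a sketch on invented toy data; the frames `movies`, `ratings`, and `stats` are stand-ins and not the notebook's own variables.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration)
movies = pd.DataFrame({
    "movieId": [1, 2, 4],
    "title": ["Movie A (1995)", "Movie B (2001)", "Movie D (2012)"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 2, 1, 3, 1],
    "rating": [4.0, 3.0, 5.0, 2.5, 3.0],
})

# Named aggregation yields flat, already-renamed columns
stats = ratings.groupby("movieId", as_index=False).agg(
    avg_rating=("rating", "mean"),
    num_rating=("rating", "count"),
)

# The inner join keeps only movies present in both tables
movie_rating = movies.merge(stats, on="movieId", how="inner")
print(movie_rating)
```

In this toy example, movie 4 has no ratings and movie 3 has no entry in `movies`, so the inner join drops both.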


Show the number of movies with an average rating of exactly 4.0.


In [13]:
# Having a look at the filtered dataset
joinedDF[joinedDF.avg_rating == 4.0]

Unnamed: 0,movieId,title,genres,year,avg_rating,num_rating
50,55,Georgia (1995),Drama,1995,4.0,1
66,74,Bed of Roses (1996),Drama|Romance,1996,4.0,8
69,77,Nico Icon (1995),Documentary,1995,4.0,1
72,80,"White Balloon, The (Badkonake sefid) (1995)",Children|Drama,1995,4.0,2
94,106,Nobody Loves Me (Keiner liebt mich) (1994),Comedy|Drama,1994,4.0,1
...,...,...,...,...,...,...
8137,190209,Jeff Ross Roasts the Border (2017),Comedy,2017,4.0,1
8145,193571,Silver Spoon (2014),Comedy|Drama,2014,4.0,1
8146,193573,Love Live! The School Idol Movie (2015),Animation,2015,4.0,1
8148,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,4.0,1


In [14]:
# printing out the number of movies fulfilling this condition
joinedDF[joinedDF.avg_rating == 4.0].shape[0]

766

Show the titles of the 2 movies with the highest number of ratings (NOT average ratings).

In [15]:
# join ratingsDF and moviesDF
joinedDF_total = moviesDF.set_index("movieId").join(ratingsDF.set_index("movieId"), how = "left")

# count() ignores the NaN ratings that the left join creates for movies without any rating
joinedDF_total.groupby("title")["rating"].count().nlargest(2)

title
Forrest Gump (1994)                 329
Shawshank Redemption, The (1994)    317
Name: rating, dtype: int64
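
Because a per-movie table already stores the rating count, the same question can be answered without joining the raw ratings again. A minimal sketch, assuming a dataframe shaped like *movie_rating* (the rows below are invented, and `top2` is a hypothetical name):

```python
import pandas as pd

# Toy per-movie table with one row per movieId (values are invented)
movie_rating = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A (1995)", "Movie B (2001)", "Movie C (1994)", "Movie D (2012)"],
    "num_rating": [215, 110, 329, 7],
})

# Pick the two rows with the largest rating counts
top2 = movie_rating.nlargest(2, "num_rating")[["title", "num_rating"]]
print(top2)
```

Ranking rows that are keyed by movieId also avoids pooling the counts of two distinct movies that happen to share a title, which grouping by title would do.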

For the movies in the *movie_rating* dataframe released between 1980 and 2018, add a new column called "time_interval". Specifically, use the year bins [1979, 1999, 2009, 2020] to categorize movies into three groups: "before 2000", "2000-2009", and "since 2010".

After creating this new column, display the count of movies in each time interval.


In [16]:
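# keep movies released between 1980 and 2018; .copy() avoids a
# SettingWithCopyWarning when the new column is added below
filt_movieDF = moviesDF[(moviesDF.year >= 1980) & (moviesDF.year <= 2018)].copy()

bins = [1979, 1999, 2009, 2020]
labels = ["before 2000", "2000-2009", "since 2010"]

# bins are right-closed by default: (1979, 1999], (1999, 2009], (2009, 2020]
filt_movieDF["time_interval"] = pd.cut(filt_movieDF.year, bins = bins, labels = labels)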
filt_movieDF

Unnamed: 0,movieId,title,genres,year,time_interval
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,before 2000
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,before 2000
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,before 2000
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,before 2000
4,5,Father of the Bride Part II (1995),Comedy,1995,before 2000
...,...,...,...,...,...
8155,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,2017,since 2010
8156,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,2017,since 2010
8157,193585,Flint (2017),Drama,2017,since 2010
8158,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,2018,since 2010


In [17]:
filt_movieDF.groupby("time_interval").size()

time_interval
before 2000    3386
2000-2009      2847
since 2010     1927
dtype: int64
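
To see why the bins start at 1979 and not 1980, it helps to check how `pd.cut` treats the edges. With the default `right=True`, each bin excludes its left edge and includes its right edge. The sketch below uses made-up years that sit on the boundaries; `years` and `intervals` are illustrative names.

```python
import pandas as pd

bins = [1979, 1999, 2009, 2020]
labels = ["before 2000", "2000-2009", "since 2010"]

# Hypothetical years chosen to sit on the bin edges
years = pd.Series([1980, 1999, 2000, 2009, 2010, 2018])

# Default right=True gives (1979, 1999], (1999, 2009], (2009, 2020]
intervals = pd.cut(years, bins=bins, labels=labels)
print(intervals.tolist())
```

A year of 1979 itself would fall outside every bin and come back as NaN, which is why the lowest edge sits one year below the earliest release year of interest.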

Now you need to implement a recommender system using the collaborative filtering method. The idea is simply to recommend movies on the basis that "people who like this movie also like these movies".

For example, people who like Star Wars are very likely to also watch Star Trek.
To do so, you need to find the users who like one movie (i.e., gave it a rating of 5) and
count which other movies these users also like, ranked by the number of likes.

**Task: Show the recommended movie list with the top 10 movies that users who like Forrest Gump (1994) may also like.**

Congratulations! You have just built your first recommender system, one that is worth 1 million dollars :D


In [18]:
# identifying people who like the movie Forrest Gump
FG_likers = list(joinedDF_total[(joinedDF_total.title == "Forrest Gump (1994)") & (joinedDF_total.rating == 5)].userId)

# filtering the dataframe such that we only see other movies rated by people who like Forrest Gump
recDF = joinedDF_total[(joinedDF_total.userId.isin(FG_likers)) & (joinedDF_total.title != "Forrest Gump (1994)")]

# identify the movies these users also like
recDF_likes = recDF[recDF.rating == 5]

# rank movies based on the number of likes
recDF_likes.groupby("title")["rating"].count().sort_values(ascending = False).head(10)

title
Shawshank Redemption, The (1994)                         48
Braveheart (1995)                                        38
Pulp Fiction (1994)                                      34
Silence of the Lambs, The (1991)                         34
Matrix, The (1999)                                       31
Schindler's List (1993)                                  30
Fight Club (1999)                                        25
Terminator 2: Judgment Day (1991)                        24
Star Wars: Episode V - The Empire Strikes Back (1980)    23
Apollo 13 (1995)                                         22
Name: rating, dtype: int64
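
The steps above can be wrapped into a reusable function. The following is a sketch under stated assumptions, not the notebook's own solution: the function name `also_liked`, its parameters `like_threshold` and `top_n`, and the toy data are all hypothetical, and "liking" a movie is taken to mean a rating at or above the threshold.

```python
import pandas as pd

def also_liked(ratings, movies, title, like_threshold=5.0, top_n=10):
    """Count the movies liked by users who also liked `title` (hypothetical helper)."""
    merged = ratings.merge(movies[["movieId", "title"]], on="movieId", how="inner")
    # Keep only "likes", i.e. ratings at or above the threshold
    liked = merged[merged["rating"] >= like_threshold]
    # Users who liked the seed movie
    fans = liked.loc[liked["title"] == title, "userId"].unique()
    # Other movies liked by those users, ranked by number of likes
    others = liked[liked["userId"].isin(fans) & (liked["title"] != title)]
    return others.groupby("title").size().nlargest(top_n)

# Toy data, invented for illustration
movies = pd.DataFrame({"movieId": [1, 2, 3, 4], "title": ["A", "B", "C", "D"]})
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "movieId": [1, 2, 3, 1, 2, 3, 1, 2, 4, 1, 2, 4],
    "rating":  [5, 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5],
})

result = also_liked(ratings, movies, "A", top_n=3)
print(result)
```

In the toy data, user 3 rated movie "A" with a 4, so that user's likes are not counted; only users 1, 2, and 4 contribute to the ranking.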