<a href="https://www.kaggle.com/code/kursatdinc/hybrid-recommender-system?scriptVersionId=182548393" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Business Problem

Make 10 movie recommendations for the user whose ID is given, using the item-based and user-based recommender methods.

# Dataset Story

The dataset was provided by MovieLens, a movie recommendation service. It contains the movies as well as the rating points made for these movies. Contains 2,000,0263 ratings across 27,278 movies. This data set was created on October 17, 2016. It includes 138,493 users and data between January 09, 1995 and March 31, 2015. Users were randomly selected. It is known that all selected users voted for at least 20 movies.

**movie.csv**

3 Variables, 27278 Observation

* **movieId :** Unique movie ID. 
* **title :** Movie name.
* **genres :** Movie category.

**rating.csv**

4 Variables, 20000263 Observation

* **userid :** Unique user ID.
* **movieId :** Unique movie ID.
* **rating :** Rating given to the movie by the user.
* **timestamp :** Rating date.

# Importings & Load Dataset


In [1]:
import pandas as pd
import numpy as np

movie_df = pd.read_csv("/kaggle/input/d/kursatdinc/movie-lens-dataset/movie.csv")
rating_df = pd.read_csv("/kaggle/input/d/kursatdinc/movie-lens-dataset/rating.csv")

df = pd.merge(movie_df, rating_df, on="movieId")

df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11,4.5,2009-01-02 01:13:41
...,...,...,...,...,...,...
20000258,131254,Kein Bund für's Leben (2007),Comedy,79570,4.0,2015-03-30 19:32:59
20000259,131256,"Feuer, Eis & Dosenbier (2002)",Comedy,79570,4.0,2015-03-30 19:48:08
20000260,131258,The Pirates (2014),Adventure,28906,2.5,2015-03-30 19:56:32
20000261,131260,Rentun Ruusu (2001),(no genres listed),65409,3.0,2015-03-30 19:57:46


# Overview & Preprocessing

In [2]:
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())

check_df(df, 10)

##################### Shape #####################
(20000263, 6)
##################### Types #####################
movieId        int64
title         object
genres        object
userId         int64
rating       float64
timestamp     object
dtype: object
##################### Head #####################
   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
5        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
6        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
7        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
8   

In [3]:
df.describe()

Unnamed: 0,movieId,userId,rating
count,20000260.0,20000260.0,20000260.0
mean,9041.567,69045.87,3.525529
std,19789.48,40038.63,1.051989
min,1.0,1.0,0.5
25%,902.0,34395.0,3.0
50%,2167.0,69141.0,3.5
75%,4770.0,103637.0,4.0
max,131262.0,138493.0,5.0


We calculate how many people voted for each movie and remove movies with less than 1000 votes from the data set. Then we calculate how many people voted for each movie.

In [4]:
comment_count = df.groupby("title").size().sort_values(ascending=False)
popular_movies = comment_count[comment_count >= 1000].index

df = df[df["title"].isin(popular_movies)]

In [5]:
df.groupby("title").size().sort_values(ascending=False)

title
Pulp Fiction (1994)                  67310
Forrest Gump (1994)                  66172
Shawshank Redemption, The (1994)     63366
Silence of the Lambs, The (1991)     63299
Jurassic Park (1993)                 59715
                                     ...  
Return to Paradise (1998)             1003
Pet Sematary II (1992)                1003
Scanners (1981)                       1003
Wristcutters: A Love Story (2006)     1001
Lincoln Lawyer, The (2011)            1001
Length: 3159, dtype: int64

We create a pivot table for the dataframe, which contains userIDs in the index, movie names in the columns, and ratings as values.

In [6]:
user_movie_df = df.pivot_table(index="userId", columns="title", values="rating")

user_movie_df

title,"'burbs, The (1989)",(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),102 Dalmatians (2000),12 Angry Men (1957),...,Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zulu (1964),[REC] (2007),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489,,,,,,,,,,4.5,...,,,,,,,,,,
138490,,,,,,,,,,,...,,,,,,,,,,
138491,,,,,,,,2.5,,,...,,,,,,,,,,
138492,,,,,,,,,,,...,,,,,,,,,,


# USER BASED RECOMMENDATION

# Determining the Movies Watched by the User to Make a Recommendation

Create random user

In [7]:
np.random.seed(42)
random_user = np.random.choice(df['userId'])

print(f"random_user_id : {random_user}")

random_user_id : 75268


We create a new dataframe called random_user_df, which consists of observation units belonging to the selected user.

In [8]:
random_user_df = user_movie_df.iloc[random_user]
random_user_df.isnull().sum()
random_user_df.dropna(inplace=True)

random_user_df

title
10 Things I Hate About You (1999)                         3.0
Airplane! (1980)                                          5.0
American Beauty (1999)                                    3.0
American Pie (1999)                                       5.0
Austin Powers: The Spy Who Shagged Me (1999)              3.0
Bachelor, The (1999)                                      2.0
Blair Witch Project, The (1999)                           3.0
Bowfinger (1999)                                          2.0
Civil Action, A (1998)                                    4.0
Clerks (1994)                                             4.0
Clockers (1995)                                           3.0
Creepshow (1982)                                          3.0
Deep Blue Sea (1999)                                      4.0
Detroit Rock City (1999)                                  3.0
Dr. Dolittle (1998)                                       4.0
Fight Club (1999)                                         5.0
Fr

We add the movies voted by the selected user to a list called movies_watched.

In [9]:
movies_watched = random_user_df.index.tolist()

movies_watched

['10 Things I Hate About You (1999)',
 'Airplane! (1980)',
 'American Beauty (1999)',
 'American Pie (1999)',
 'Austin Powers: The Spy Who Shagged Me (1999)',
 'Bachelor, The (1999)',
 'Blair Witch Project, The (1999)',
 'Bowfinger (1999)',
 'Civil Action, A (1998)',
 'Clerks (1994)',
 'Clockers (1995)',
 'Creepshow (1982)',
 'Deep Blue Sea (1999)',
 'Detroit Rock City (1999)',
 'Dr. Dolittle (1998)',
 'Fight Club (1999)',
 'From Dusk Till Dawn (1996)',
 'Full Metal Jacket (1987)',
 "General's Daughter, The (1999)",
 'Grumpier Old Men (1995)',
 'Grumpy Old Men (1993)',
 'Inspector Gadget (1999)',
 'Jackie Brown (1997)',
 'Kingpin (1996)',
 'L.A. Confidential (1997)',
 'Man on the Moon (1999)',
 'Men in Black (a.k.a. MIB) (1997)',
 'Mod Squad, The (1999)',
 'Mystery Men (1999)',
 'Naked Gun 2 1/2: The Smell of Fear, The (1991)',
 'Naked Gun 33 1/3: The Final Insult (1994)',
 'Naked Gun: From the Files of Police Squad!, The (1988)',
 'Never Been Kissed (1999)',
 'Nightmare Before Christm

# Accessing the Data and IDs of Other Users Watching the Same Movies

Select the columns of the movies watched by the selected user from user_movie_df and create a new dataframe called movies_watched_df.

In [10]:
movies_watched_df = user_movie_df[movies_watched]

movies_watched_df

title,10 Things I Hate About You (1999),Airplane! (1980),American Beauty (1999),American Pie (1999),Austin Powers: The Spy Who Shagged Me (1999),"Bachelor, The (1999)","Blair Witch Project, The (1999)",Bowfinger (1999),"Civil Action, A (1998)",Clerks (1994),...,Payback (1999),Powder (1995),Pushing Tin (1999),Romy and Michele's High School Reunion (1997),Rushmore (1998),Schindler's List (1993),Stigmata (1999),Superstar (1999),"Thomas Crown Affair, The (1999)",Wild Wild West (1999)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,3.5,,,,,4.0,...,,,,,,,,,,
2,,2.0,3.0,,,,,,,,...,,,,,,,,,,
3,,5.0,,,,,5.0,,,5.0,...,,3.0,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489,,,4.0,,,,,,,,...,,,,,,,,,,
138490,,,3.0,,,,,,4.0,,...,,,,,,,,,,
138491,,,,,,,,,,,...,,,,,,,,,,
138492,,5.0,5.0,,,,,,,,...,,,,,,,,,,


We create a new dataframe called user_movie_count, which contains information about how many of the movies each user watched that the selected user watched.

In [11]:
user_movie_count = movies_watched_df.T.notna().sum().reset_index()
user_movie_count.columns = ["userId", "movie_count"]

user_movie_count

Unnamed: 0,userId,movie_count
0,1,4
1,2,6
2,3,6
3,4,1
4,5,1
...,...,...
138488,138489,2
138489,138490,2
138490,138491,1
138491,138492,2


We consider those who watched 60 percent or more of the movies voted by the selected user as similar users. We create a list called users_same_movies from the ids of these users.

In [12]:
th = len(movies_watched) * 0.6

user_same_movies = user_movie_count[user_movie_count["movie_count"] > th]["userId"].tolist()

def user_list_control(user, list):

    if user in list:
        print(f"{user} in the list.")
    else:
        print(f"{user} not in the list.")

user_list_control(random_user, user_same_movies)

75268 not in the list.


In [13]:
user_same_movies.append(random_user)

user_list_control(random_user, user_same_movies)

75268 in the list.


# Determining the Users Most Similar to the User to Make a Recommendation

We filter the movies_watched_df dataframe to find the IDs of users that are similar to the selected user in the user_same_movies list.

In [14]:
final_df = movies_watched_df[movies_watched_df.index.isin(user_same_movies)]

final_df

title,10 Things I Hate About You (1999),Airplane! (1980),American Beauty (1999),American Pie (1999),Austin Powers: The Spy Who Shagged Me (1999),"Bachelor, The (1999)","Blair Witch Project, The (1999)",Bowfinger (1999),"Civil Action, A (1998)",Clerks (1994),...,Payback (1999),Powder (1995),Pushing Tin (1999),Romy and Michele's High School Reunion (1997),Rushmore (1998),Schindler's List (1993),Stigmata (1999),Superstar (1999),"Thomas Crown Affair, The (1999)",Wild Wild West (1999)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
54,,5.0,5.0,2.0,4.0,,,3.0,,5.0,...,,,,3.0,,5.0,3.0,,3.0,2.0
116,2.0,2.0,4.5,2.0,3.5,,,1.0,,3.5,...,3.0,1.0,,,,4.0,2.5,,1.5,2.0
156,,5.0,5.0,3.0,4.0,3.0,1.0,4.0,4.0,,...,5.0,4.0,4.0,4.0,2.0,5.0,3.0,5.0,5.0,3.0
298,4.0,,5.0,5.0,5.0,,3.0,4.0,3.0,4.0,...,4.0,3.0,,3.0,1.0,3.0,4.0,2.0,4.0,3.0
586,1.5,3.5,3.5,3.5,3.0,,2.0,3.5,,4.0,...,3.5,,,2.0,2.5,5.0,3.5,,2.5,1.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137563,5.0,3.0,5.0,4.0,3.0,,4.0,4.0,3.0,5.0,...,5.0,3.0,4.0,4.0,5.0,5.0,3.0,3.0,5.0,
137677,3.0,,4.0,4.0,5.0,,3.0,2.0,3.0,4.0,...,4.0,3.0,3.0,2.0,3.0,,,2.0,4.0,2.0
137686,3.5,4.0,5.0,4.0,4.0,,3.5,4.0,,4.0,...,4.0,3.0,,3.0,4.0,5.0,2.5,,3.0,3.0
137885,3.0,4.0,2.0,4.0,4.0,1.0,4.0,4.0,,4.0,...,3.0,2.0,2.0,3.0,3.0,5.0,3.0,,2.0,1.0


We create a new corr_df dataframe in which the correlations of users with each other will be found.

In [15]:
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()

corr_df

Unnamed: 0,user_id_1,user_id_2,corr
0,39579,75268,-0.816497
1,9145,119067,-0.773879
2,47235,55556,-0.764685
3,9545,134774,-0.761063
4,58213,8647,-0.757140
...,...,...,...
547873,130191,75268,0.984063
547874,68063,32344,0.987858
547875,75268,122995,0.988105
547876,54,54,1.000000


We create a new dataframe called top_users by filtering out users with a high correlation (over 0.65) with the selected user.

In [16]:
corr_df[corr_df["user_id_1"] == random_user].sort_values("corr", ascending=False)

top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][
    ["user_id_2", "corr"]].sort_values(by='corr', ascending=False).reset_index(drop=True)

top_users.rename(columns={"user_id_2": "userId"}, inplace=True)

top_users

Unnamed: 0,userId,corr
0,122995,0.988105
1,134866,0.980469
2,112572,0.979332
3,56520,0.964710
4,982,0.963321
...,...,...
179,25411,0.661451
180,4529,0.659365
181,12200,0.655506
182,85640,0.650182


In [17]:
top_users_ratings = top_users.merge(rating_df[["userId", "movieId", "rating"]], how='inner')
top_users_ratings = top_users_ratings[top_users_ratings["userId"] != random_user]

top_users_ratings

Unnamed: 0,userId,corr,movieId,rating
0,122995,0.988105,1,5.0
1,122995,0.988105,2,3.0
2,122995,0.988105,3,3.0
3,122995,0.988105,4,3.0
4,122995,0.988105,7,4.0
...,...,...,...,...
252692,54745,0.650015,5016,3.0
252693,54745,0.650015,5066,2.0
252694,54745,0.650015,5102,3.0
252695,54745,0.650015,5103,4.0


# Calculation of Weighted Average Recommendation Score and Recommendations

We create a new variable called weighted_rating, which is the product of each user's corr and rating values.

In [18]:
top_users_ratings["weighted_rating"] = top_users_ratings["corr"] * top_users_ratings["rating"]

top_users_ratings

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
0,122995,0.988105,1,5.0,4.940525
1,122995,0.988105,2,3.0,2.964315
2,122995,0.988105,3,3.0,2.964315
3,122995,0.988105,4,3.0,2.964315
4,122995,0.988105,7,4.0,3.952420
...,...,...,...,...,...
252692,54745,0.650015,5016,3.0,1.950044
252693,54745,0.650015,5066,2.0,1.300029
252694,54745,0.650015,5102,3.0,1.950044
252695,54745,0.650015,5103,4.0,2.600059


We create a new dataframe called recommendation_df, which contains the movie id and the average value of the weighted ratings of all users for each movie.

In [19]:
recommendation_df = top_users_ratings.groupby("movieId").agg({"weighted_rating" : "mean"}).reset_index()

recommendation_df

Unnamed: 0,movieId,weighted_rating
0,1,3.210756
1,2,2.223708
2,3,2.287418
3,4,1.871852
4,5,1.928175
...,...,...
13071,129937,2.266331
13072,130071,2.056680
13073,130073,3.113383
13074,130578,2.266331


Movies to be recommend

In [20]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].sort_values("weighted_rating", ascending=False).head(10)
movies_to_be_recommend.merge(movie_df[["movieId", "title"]])

Unnamed: 0,movieId,weighted_rating,title
0,4892,4.940525,Maze (2000)
1,5098,4.940525,Dimples (1936)
2,5097,4.940525,Bright Eyes (1934)
3,124,4.740949,"Star Maker, The (Uomo delle stelle, L') (1995)"
4,6463,4.698366,Divine Trash (1998)
5,4281,4.698366,Candy (1968)
6,7385,4.623335,Twentynine Palms (2003)
7,45412,4.622747,"Hidden Blade, The (Kakushi ken oni no tsume) (..."
8,52078,4.622747,Love and Honor (2006)
9,82447,4.547144,"Tomorrow, When the War Began (2010)"


# ITEM BASED RECOMMENDATION

We get the ID of the movie with the most current score from the movies that the user to be recommended (random_user) gave 5 points.

In [21]:
film_select = df[(df["userId"] == random_user) & (df["rating"] == 5)].sort_values("timestamp", ascending=False).head(1)["movieId"].values[0]

film_select

6502

We filter the user_movie_df dataframe created in the User based recommendation section according to the selected movie id.

In [22]:
film_select_title = movie_df[movie_df["movieId"] == film_select]["title"].values[0]

film_select_title

'28 Days Later (2002)'

In [23]:
film_select_title = user_movie_df[film_select_title]

film_select_title

userId
1         3.5
2         NaN
3         NaN
4         NaN
5         NaN
         ... 
138489    NaN
138490    NaN
138491    NaN
138492    NaN
138493    NaN
Name: 28 Days Later (2002), Length: 138493, dtype: float64

Using the filtered dataframe, we find the correlation between the selected movie and other movies and rank them and apart from the selected movie itself, we give the first 10 movies as suggestions.

In [24]:
recommend_film = pd.DataFrame(user_movie_df.corrwith(film_select_title).sort_values(ascending=False).head(10))
recommend_film.reset_index(inplace=True)
recommend_film.columns = ["title", "corr"]

recommend_film.iloc[1:11]

Unnamed: 0,title,corr
1,28 Weeks Later (2007),0.490047
2,Vanya on 42nd Street (1994),0.397725
3,Shaft (1971),0.396805
4,Dawn of the Dead (2004),0.380016
5,Shaun of the Dead (2004),0.370421
6,"7th Voyage of Sinbad, The (1958)",0.363955
7,Invasion of the Body Snatchers (1978),0.35639
8,[REC] (2007),0.346691
9,"Fog, The (1980)",0.340775
