In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from itertools import islice

<h1>Content-based movie recommender</h1>

In this notebook, I will use ratings and genres to recommend movies to a user.
<br>
<h5>Method</h5>
The general idea is to predict the ratings a user will give based on the genre.
<br>
I will be using linear regression to learn a user's preferences from the movies they have rated, then use the learned preferences to predict their ratings for other movies. Based on those predicted ratings, I will then suggest the top few movies for them to watch.
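In symbols, the per-user model described above can be written roughly as follows. The notation is introduced here for illustration and does not appear elsewhere in the notebook: $x_{m,g}$ is 1 if movie $m$ has genre $g$ and 0 otherwise, $w_{u,g}$ is the weight user $u$'s model learns for that genre, and $b_u$ is the intercept.

```latex
\hat{r}_{u,m} = b_u + \sum_{g \in G} w_{u,g}\, x_{m,g},
\qquad x_{m,g} \in \{0, 1\}
```

The weights and intercept are chosen to minimise the squared difference between $\hat{r}_{u,m}$ and the ratings the user actually gave.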

<h3>Getting and sorting the data</h3>

Read in the files and check if they are clean.

In [2]:
df_ratings = pd.read_csv('ratings.csv')
df_movies = pd.read_csv('movies.csv')

In [3]:
print(df_ratings.isnull().sum())
df_ratings.head()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
print(df_movies.isnull().sum())
df_movies.head()

movieId    0
title      0
genres     0
dtype: int64


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
df_ratings.shape

(100836, 4)

<h5> Merge the data </h5>

In [6]:
df = pd.merge(df_movies, df_ratings, on='movieId')
df = df.sort_values(by=['userId'])
df.shape

(100836, 6)
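The merged frame has the same number of rows as the ratings table, which is what an inner merge should give when every `movieId` appears exactly once in the movies table. A minimal sketch of making that check explicit, using small made-up frames (the rows below are illustrative, not MovieLens data):

```python
import pandas as pd

# Tiny stand-ins for movies.csv and ratings.csv (made-up rows, same column names)
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (1995)', 'B (1996)', 'C (1997)'],
    'genres': ['Comedy', 'Drama', 'Action|Sci-Fi'],
})
ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.5, 5.0],
})

# validate='one_to_many' raises a MergeError if movieId is duplicated in `movies`
merged = pd.merge(movies, ratings, on='movieId', validate='one_to_many')

# Every rating matched exactly one movie, so no rows were lost or duplicated
assert len(merged) == len(ratings)
print(merged.shape)  # (3, 5)
```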

In [7]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
35548,1777,"Wedding Singer, The (1998)",Comedy|Romance,1,4.0,964981230
35249,1732,"Big Lebowski, The (1998)",Comedy|Crime,1,5.0,964981125
34348,1676,Starship Troopers (1997),Action|Sci-Fi,1,3.0,964982620
2379,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1,5.0,964982931


<h5> I won't be needing timestamp, so I'll delete that </h5>

In [8]:
df = df.drop(columns=['timestamp'])

<h5>Now I want to create a vector for each of the genres: 1 if the movie has the genre and 0 if not</h5>

In [9]:
# From the movielens information
genres = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy',
         'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
         'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

def to_genre_vec(genre, gen_df_row):
    # Return 1 if `genre` is among the pipe-separated genres in the row, else 0
    return int(genre in gen_df_row.split("|"))

In [10]:
for genre in genres:
    df[genre] = df['genres'].apply(lambda x: to_genre_vec(genre,x))
df.shape

(100836, 23)
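The same 0/1 genre columns can also be built without a Python-level loop. The sketch below is an alternative, not the method used in this notebook: it uses pandas' `Series.str.get_dummies` on a small made-up frame with a shortened genre list.

```python
import pandas as pd

genres = ['Action', 'Comedy', 'Romance', 'Sci-Fi']  # shortened list for the demo

toy = pd.DataFrame({
    'title': ['Starship Troopers (1997)', 'Wedding Singer, The (1998)'],
    'genres': ['Action|Sci-Fi', 'Comedy|Romance'],
})

# Split on '|' and one-hot encode in a single vectorised call
dummies = toy['genres'].str.get_dummies(sep='|')

# reindex guarantees every expected genre column exists, in a fixed order,
# and drops any label that is not in the list
dummies = dummies.reindex(columns=genres, fill_value=0)

toy = toy.join(dummies)
print(toy[genres])
```

The `reindex` step matters on real data: a genre that no movie in the frame has would otherwise be missing as a column, and the model expects the same columns every time.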

In [11]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,Action,Adventure,Animation,Children,Comedy,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,0,1,1,1,1,...,1,0,0,0,0,0,0,0,0,0
35548,1777,"Wedding Singer, The (1998)",Comedy|Romance,1,4.0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
35249,1732,"Big Lebowski, The (1998)",Comedy|Crime,1,5.0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
34348,1676,Starship Troopers (1997),Action|Sci-Fi,1,3.0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2379,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1,5.0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


<h3>Training a user's preferences</h3>

Now, I will train a user's preferences by using the genre_vec of movies the user has reviewed to learn to predict ratings. Essentially, the goal is to have my model (corresponding to a user) take in a movie's genres and predict the rating the user would give it.

In [12]:
def get_user_model(userId):
    # get x variables
    df_user_x = df.loc[df['userId'] == userId][genres]
    
    # get y variables
    df_user_y = df.loc[df['userId'] == userId]['rating']
    
    # Create and train the model
    rating_model = LinearRegression()
    rating_model.fit(df_user_x, df_user_y)
    
    # Get in-sample predictions for the error calculations below
    predict = rating_model.predict(df_user_x)
    
    # Print the mean squared error (scikit-learn expects y_true first, then y_pred)
    print('Mean Square Error: {}'.format(round(mean_squared_error(df_user_y, predict),3)))
    
    # Print the mean absolute error
    print('Average Rating Error: {}\n'.format(round(sum(abs(predict - df_user_y))/len(df_user_y),3)))
    
    
    return rating_model
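Because the model is linear, its fitted coefficients can be read as per-genre preferences: a positive weight means the genre raises the predicted rating, a negative one lowers it. A minimal sketch on made-up ratings for one hypothetical user, with a shortened genre list (none of these numbers come from the dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

genres = ['Action', 'Comedy', 'Drama']  # shortened list for the demo

# Made-up ratings for one hypothetical user who likes Drama and dislikes Action
X = pd.DataFrame(
    [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 1]],
    columns=genres,
)
y = pd.Series([2.0, 2.5, 3.5, 4.5, 5.0, 3.5])

model = LinearRegression().fit(X, y)

# One weight per genre; the intercept is the baseline rating with no genres set
weights = pd.Series(model.coef_, index=genres).sort_values(ascending=False)
print('baseline:', round(model.intercept_, 3))
print(weights.round(3))
```

The same `pd.Series(model.coef_, index=genres)` pattern should work on a model returned by `get_user_model`, since it is fitted on the columns in `genres` in that order.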

<h3>Example on userId 1</h3>

In [13]:
user1_model = get_user_model(1)

Mean Square Error: 0.524
Average Rating Error: 0.585



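The errors above are computed on the same ratings the model was trained on, since we don't have any other ratings from this user to test on. They are therefore training errors and probably understate the error on movies the model has not seen. <br><br>
The mean squared error (0.524) looks fairly high, but the mean absolute error is easier to interpret and really isn't too bad: on average, the predicted rating is off by 0.585 from the user's actual rating, using genres alone.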
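One way to get an error estimate that is not measured on the training ratings is to hold out part of a user's ratings before fitting. The sketch below shows the idea on synthetic data. The data, the 25% split, and the shortened genre list are all assumptions for illustration; with real users who have rated few movies, the held-out set may be too small to be reliable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
genres = ['Action', 'Comedy', 'Drama', 'Horror']  # shortened list for the demo

# Synthetic stand-in for one user's rated movies (not real MovieLens data)
X = pd.DataFrame(rng.integers(0, 2, size=(200, len(genres))), columns=genres)
true_w = np.array([-0.5, 0.3, 0.8, -1.0])
noise = rng.normal(0, 0.5, size=200)
y = np.clip(3.5 + X.to_numpy() @ true_w + noise, 0.5, 5.0)

# Hold out 25% of the ratings; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compare the error on seen (train) and unseen (test) ratings
for name, X_part, y_part in [('train', X_train, y_train), ('test', X_test, y_test)]:
    pred = model.predict(X_part)
    mse = mean_squared_error(y_part, pred)
    mae = mean_absolute_error(y_part, pred)
    print(f'{name}: MSE={mse:.3f}  MAE={mae:.3f}')
```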

<h3>Recommendations</h3>

Now, let's take the rest of the movies and sort them by predicted ratings. We will take the best few and recommend them.

In [14]:
# Get a list of movies and their respective genre variables
# Note: we use df_movies because we want to predict on each movie once, and df_movies
#       is a df containing each movie only once

# Copy so that adding the genre columns does not also modify df_movies
df_predict = df_movies.copy()
for genre in genres:
    df_predict[genre] = df_predict['genres'].apply(lambda x: to_genre_vec(genre,x))
df_predict.head()

Unnamed: 0,movieId,title,genres,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,1,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# get a df of only the genres variables
# Note: I am using the genres variable previously defined, rather than the 'genres' column in df_predict
df_predict_genres = df_predict[genres]

In [16]:
def get_user_recommendations(user_model, userId):
    
    # predict the ratings using the user model we have
    user_predictions = user_model.predict(df_predict_genres)
    
    # map each movie title to its predicted rating
    user_dict = dict(zip(df_predict['title'], user_predictions))
    
    # We don't want to recommend movies already watched, so let's delete those
    user_watched = df.loc[df['userId']==userId]['title'].to_numpy()
    
    for i in user_watched:
        # pop with a default avoids a KeyError if a title is missing or repeated
        user_dict.pop(i, None)
     
    # sort the results by predicted rating, highest first
    user_reccomendations = sorted(user_dict.items(), key=lambda item: item[1], reverse = True)
    
    
    # print recommendations
    print('Based on User{}\'s rated movies, below are more movies User{} should watch'.format(userId, userId))

    for i in user_reccomendations[:10]:
        print(('- ' + i[0]))

Testing the recommendations on User 1

In [17]:
get_user_recommendations(user1_model, 1)

Based on User1's rated movies, below are more movies User1 should watch
- Mildred Pierce (1945)
- Strange Love of Martha Ivers, The (1946)
- Sweet Smell of Success (1957)
- Harder They Fall, The (1956)
- While the City Sleeps (1956)
- Fury (1936)
- Letter, The (1940)
- Grifters, The (1990)
- Hoodlum (1997)
- This World, Then the Fireworks (1997)
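The function above keys its dictionary on the movie title, so two different movies that share a title would collapse into one entry. A hedged alternative is to filter on `movieId` and let pandas pick the top rows. The helper name `recommend`, the toy catalogue, and the ratings below are all made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

genres = ['Action', 'Comedy', 'Drama']  # shortened list for the demo

# Made-up catalogue; movies 3 and 4 are different films sharing one title
catalogue = pd.DataFrame({
    'movieId': [1, 2, 3, 4, 5],
    'title': ['A (1990)', 'B (1991)', 'Remake (2005)', 'Remake (2005)', 'E (1999)'],
    'Action': [1, 0, 0, 1, 0],
    'Comedy': [0, 1, 0, 0, 1],
    'Drama':  [0, 0, 1, 0, 1],
})
rated = pd.DataFrame({'movieId': [1, 2, 3], 'rating': [2.0, 3.0, 5.0]})

# Fit on the rated movies only
train = catalogue.merge(rated, on='movieId')
model = LinearRegression().fit(train[genres], train['rating'])

def recommend(model, catalogue, rated_ids, n=10):
    """Return the n unrated movies with the highest predicted rating."""
    unseen = catalogue.loc[~catalogue['movieId'].isin(rated_ids)].copy()
    unseen['predicted'] = model.predict(unseen[genres])
    return unseen.nlargest(n, 'predicted')[['movieId', 'title', 'predicted']]

top = recommend(model, catalogue, rated['movieId'], n=2)
print(top)
```

Here movie 4 stays eligible even though the user rated movie 3 with the same title; a title-based filter would have removed both.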


<h3> Some more predictions</h3>
Recommending movies to users 2-7

In [18]:
for i in range(2, 8):
    print('User' + str(i) + '\n')
    usermodel = get_user_model(i)
    get_user_recommendations(usermodel,i)
    print('\n')

User2

Mean Square Error: 0.498
Average Rating Error: 0.533

Based on User2's rated movies, below are more movies User2 should watch
- White Sun of the Desert, The (Beloe solntse pustyni) (1970)
- Operation Petticoat (1959)
- Hot Shots! (1991)
- El Cid (1961)
- Robin Hood (2010)
- Very Long Engagement, A (Un long dimanche de fiançailles) (2004)
- Joe Dirt (2001)
- Mortdecai (2015)
- Rob Roy (1995)
- Henry V (1989)


User3

Mean Square Error: 0.677
Average Rating Error: 0.681

Based on User3's rated movies, below are more movies User3 should watch
- American Psycho (2000)
- Book of Shadows: Blair Witch 2 (2000)
- From Hell (2001)
- Identity (2003)
- House of Wax (1953)
- Testament of Dr. Mabuse, The (Das Testament des Dr. Mabuse) (1933)
- Bird with the Crystal Plumage, The (Uccello dalle piume di cristallo, L') (1970)
- Saw VI (2009)
- Ninth Gate, The (1999)
- Heartless (2009)


User4

Mean Square Error: 1.611
Average Rating Error: 1.066

Based on User4's rated movies, below are more mo

<h3> Conclusion</h3>
The model was built solely from genres and ratings, so its recommendations are fairly generic: any two movies with the same genre combination receive the same predicted rating. Next time, I could try a model that also uses the cast and director (via a similarity vector), the year the movie was produced, and whether it was animated or not.
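Of those ideas, the production year is the easiest to get, because the titles in this dataset carry it in parentheses. A small sketch of extracting it, assuming the year is always the last parenthesised four-digit number in the title (the third row is a made-up title without a year):

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Usual Suspects, The (1995)', 'Untitled Film'],
})

# Capture a four-digit year in parentheses at the end of the title;
# titles without one get NaN
movies['year'] = (
    movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype('float')
)
print(movies)
```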