## Lauren Thompson
## Recommender System

Exercise: Using the small MovieLens data set, create a recommender system that allows users to input a movie they like (in the data set) and recommends ten other movies for them to watch. In your write-up, clearly explain the recommender system process and all steps performed. If you are using a method found online, be sure to reference the source.

In [1]:
# Imports and Read-ins
import pandas as pd
from sklearn.model_selection import train_test_split
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate


ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')


In [2]:
# Checking for nulls and duplicates
ratings.info()
movies.info()
ratings[ratings.duplicated()]
movies[movies.duplicated()]

# Data frame prep
movies['title without year'] = movies['title'].apply(lambda x: x.split(' (')[0])
same_name_titles = movies[movies['title without year'].duplicated()]

data = pd.merge(ratings, movies, on='movieId')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
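
The `title without year` column in the cell above is built by splitting on the first ` (`, which also drops any parenthesized alternate title (for example, `Seven Samurai (Shichinin no samurai) (1954)` becomes `Seven Samurai`). If only the trailing year should be removed, a regular expression is one option. The sketch below is a possible alternative, not the method used in this notebook; the helper name `strip_year` is hypothetical.

```python
import re

# Hypothetical helper: remove a trailing four-digit year such as " (1995)"
# while leaving any other parenthesized text in the title untouched.
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")

def strip_year(title: str) -> str:
    return YEAR_SUFFIX.sub("", title)

print(strip_year("Sabrina (1995)"))                               # Sabrina
print(strip_year("Seven Samurai (Shichinin no samurai) (1954)"))  # Seven Samurai (Shichinin no samurai)
print(strip_year("Some Title Without Year"))                      # Some Title Without Year
```

With pandas, a helper like this could be applied as `movies['title'].apply(strip_year)`.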


In [3]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

In [4]:
# Create a Reader object to parse the ratings data, build SVD model, cross validate
reader = Reader(rating_scale=(0.5, 5))
train_dataset = Dataset.load_from_df(train_data[['userId', 'movieId', 'rating']], reader)
model = SVD()
cross_validate(model, train_dataset, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8725  0.8848  0.8827  0.8791  0.8878  0.8814  0.0053  
MAE (testset)     0.6707  0.6831  0.6827  0.6765  0.6812  0.6788  0.0047  
Fit time          0.60    0.64    0.56    0.64    0.65    0.62    0.03    
Test time         0.12    0.06    0.11    0.06    0.11    0.09    0.03    


{'test_rmse': array([0.87247018, 0.88483864, 0.88274021, 0.87912735, 0.88780115]),
 'test_mae': array([0.67073439, 0.683131  , 0.68269739, 0.67645237, 0.68120867]),
 'fit_time': (0.6043682098388672,
  0.6388342380523682,
  0.5597271919250488,
  0.6359059810638428,
  0.6534168720245361),
 'test_time': (0.12056422233581543,
  0.05529665946960449,
  0.11032509803771973,
  0.06118512153625488,
  0.11045193672180176)}
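
For reference, the RMSE and MAE rows in the table above measure the gap between predicted and actual ratings on each held-out fold. The sketch below computes both by hand. The ratings in it are made-up values for illustration only and are not taken from the MovieLens data; the function names `rmse` and `mae` are hypothetical helpers, not Surprise functions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in rating points."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)

# Made-up ratings, for illustration only
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

Both metrics are in rating points, so a mean MAE of about 0.68 means the predictions miss by roughly two thirds of a rating point on average on the 0.5 to 5 scale.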

In [5]:
# Parameters: liked movie title, fitted SVD model, movies df, merged df, user ID, number of recommendations (default 10)
# Looks up the movie ID from the title (with or without year); raises IndexError if the title is not found
# Predicts the user's rating for every other movie in the merged data, sorts descending, and takes the top n
# Returns a list of movie titles
def movie_recommendations(movie_title, model, movies, data, user, n=10):
    title_match = (movies['title'] == movie_title) | (movies['title without year'] == movie_title)
    liked_movie_id = movies[title_match]['movieId'].iloc[0]

    predictions = []
    for movie_id in data['movieId'].unique():
        if movie_id == liked_movie_id:
            continue  # recommend other movies, not the one entered
        prediction = model.predict(user, movie_id)
        predictions.append((movie_id, prediction.est))
    predictions.sort(key=lambda x: x[1], reverse=True)

    recommended_movies = []
    for movie_id, _ in predictions[:n]:
        recommended_movie = movies[movies['movieId'] == movie_id]['title'].iloc[0]
        recommended_movies.append(recommended_movie)
    return recommended_movies
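
The sort-then-slice step in the function above can also be written with `heapq.nlargest`, which avoids sorting every prediction when only the top few are needed. The sketch below is a stand-alone illustration, not part of the notebook: `top_n_movies` and the `estimates` dictionary are hypothetical stand-ins for the `(movie_id, prediction.est)` pairs built in the loop, and the movie IDs and scores are made up.

```python
import heapq

def top_n_movies(estimates, n=10, exclude=None):
    """Return the n movie IDs with the highest estimated rating.

    estimates: dict mapping movie ID -> estimated rating
    exclude:   optional movie ID to leave out (e.g. the liked movie)
    """
    candidates = ((mid, est) for mid, est in estimates.items() if mid != exclude)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [mid for mid, _ in best]

# Made-up movie IDs and estimated ratings, for illustration only
estimates = {1: 4.2, 2: 3.1, 3: 4.8, 4: 4.5, 5: 2.9}
print(top_n_movies(estimates, n=3))             # [3, 4, 1]
print(top_n_movies(estimates, n=3, exclude=3))  # [4, 1, 2]
```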

In [6]:
# User Input needed
user = int(input('Enter user ID: '))
liked_movie = input('Enter Title: ').title()

if (same_name_titles['title without year'] == liked_movie).any():
    liked_movie = input('''Multiple titles with that name, year needed
    \nExample: Sabrina (1995)
    \nPlease enter title and year: ''').title()
    
try:
    recommended_movies = movie_recommendations(liked_movie, model, movies, data, user)
    print(f"Top 10 recommended movies based on '{liked_movie}':")
    for movie in recommended_movies:
        print(f"\n{movie}")
except IndexError:
    print(f"{liked_movie} not found")

Enter user ID: 1
Enter Title: Sabrina
Multiple titles with that name, year needed
    
Example: Sabrina (1995)
    
Please enter title and year: Sabrina (1995)
Top 10 recommended movies based on 'Sabrina (1995)':

Monty Python and the Holy Grail (1975)

Princess Bride, The (1987)

Goodfellas (1990)

Back to the Future (1985)

Shawshank Redemption, The (1994)

Departed, The (2006)

Seven Samurai (Shichinin no samurai) (1954)

Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)

Spirited Away (Sen to Chihiro no kamikakushi) (2001)

Graduate, The (1967)


SVD is a matrix factorization algorithm. It factors the sparse user-by-movie ratings matrix into latent factors for users and latent factors for items (movies), then uses those factors to predict the ratings that are missing from the matrix. Here it predicts how a given user would rate each movie. The data is the small (100k) MovieLens data set, which includes a movies file and a ratings file.

After the ratings and movies are merged and split 80/20 into training and test sets, a Reader is created for the training ratings, followed by the SVD model and 5-fold cross-validation. The recommendation function takes the given movie title, the model, the movies data frame, the merged movies and ratings data frame, the given user ID, and the number of recommendations (10 in this case).

Some precautions were taken. A column was added to the movies data frame that holds only the title, so the user can type the movie title without the year. If multiple movies share the same title, the year is then requested. The entered title is looked up in the movies data frame, and a "not found" message is printed if there is no match. For each movie in the merged data set, the model predicts the rating from the user ID and movie ID. The recommended movies are the top 10 predictions after sorting by predicted rating in descending order.
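
To make the latent-factor idea concrete, the sketch below shows how a biased matrix factorization model of this kind forms a single estimate: a global mean, plus a user bias and a movie bias, plus the dot product of the user and movie factor vectors, clipped to the rating scale. All numbers are made up for illustration and the function name `estimate_rating` is hypothetical; a fitted model learns these values from the ratings.

```python
def estimate_rating(global_mean, user_bias, item_bias, user_factors, item_factors,
                    rating_scale=(0.5, 5.0)):
    """Estimate = mean + user bias + item bias + dot(user factors, item factors)."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    # Keep the estimate inside the valid rating range
    return min(high, max(low, raw))

# Made-up values: two latent factors per user and per movie
est = estimate_rating(3.5, 0.2, 0.4, [0.5, -0.2], [0.6, 0.1])
print(round(est, 2))  # 4.38
```

A user and a movie whose factor vectors point in similar directions get a positive interaction term, which raises the estimate above what the mean and biases alone would give.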