This project uses the MovieLens data set (the files loaded below contain roughly 25 million ratings) to build a movie recommendation system in Python. A user enters a movie they like (one that appears in the data set), and the system recommends ten other movies to watch.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re

#Import libraries for text frequency vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#Import libraries to create and display text box widget
import ipywidgets as widgets
from IPython.display import display

import warnings
warnings.filterwarnings('ignore')



In [2]:
# Import movie data frame from https://grouplens.org/datasets/movielens/
movies = pd.read_csv('/Users/mksis/Documents/Data Science/DS630 Predictive Analytics/Data Sets/movies/movies.csv')
movies.head(3)

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [3]:
#Viewing shape of the movies dataframe
movies.shape

(62423, 3)

In [4]:
#Import ratings data from https://grouplens.org/datasets/movielens/
ratings = pd.read_csv('/Users/mksis/Documents/Data Science/DS630 Predictive Analytics/Data Sets/movies/ratings.csv')
ratings.head(3)

,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828


In [5]:
#Viewing shape of the ratings dataframe
ratings.shape

(25000095, 4)

In [6]:
ratings.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [7]:
#Finding null values in movies df
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [8]:
#Finding null values in ratings df
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [9]:
#Finding statistics of the movie data set
movie_num = len(movies['movieId'].unique())

print("Number of movies in movies data set: ", movie_num)

Number of movies in movies data set:  62423


In [10]:
#Finding statistics of the ratings data set
rating_num = len(ratings['movieId'])
rating_movies = len(ratings['movieId'].unique())
rating_users = len(ratings['userId'].unique())
print('Number of movie ratings: ', rating_num)
print('Number of movies in rating data set: ', rating_movies)
print('Number of unique users who rated movies: ', rating_users)

Number of movie ratings:  25000095
Number of movies in rating data set:  59047
Number of unique users who rated movies:  162541


In [11]:
#Average movie rating
mean_rating = ratings.groupby('movieId')[['rating']].mean()
mean_rating

movieId,rating
1,3.893708
2,3.251527
3,3.142028
4,2.853547
5,3.058434
...,...
209157,1.500000
209159,3.000000
209163,4.500000
209169,3.000000


In [12]:
#Highest rated movie
highest_rated = mean_rating['rating'].idxmax()
highest_rated = movies.loc[movies['movieId'] == highest_rated]
highest_rated

,movieId,title,genres
9416,27914,"Hijacking Catastrophe: 9/11, Fear & the Sellin...",Documentary


In [13]:
#Lowest rated movie
lowest_rated = mean_rating['rating'].idxmin()
lowest_rated = movies.loc[movies['movieId'] == lowest_rated]
lowest_rated

,movieId,title,genres
5693,5805,Besotted (2001),Drama


In [14]:
#Create a function to clean the titles
#This function removes any characters that are not letters, numbers, or spaces
def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title

In [15]:
#Run the clean_title function
#show the head of the movie dataframe
movies['clean_title'] = movies['title'].apply(clean_title)
movies.head(3)

,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995


In [16]:
#Create a TF-IDF matrix from the cleaned titles (single words and two-word phrases) - this will help identify similarities between a searched title and the titles in the data set

vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movies['clean_title'])

In [17]:
#Creating a search function

def search(title):
    #cleaning the title that was searched
    title = clean_title(title) 
    #turning the searched title into a TF-IDF vector
    query_vec = vectorizer.transform([title])
    #computing the cosine similarity between the query vector and every title in the TF-IDF matrix
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    #finding the five most similar titles; argpartition does not sort them,
    #so they are ordered by similarity with the best match first
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])[::-1]]
    results = movies.iloc[indices]
    
    return results

In [18]:
#Creating a function that finds similar movies based on the movieId searched
def find_similar_movies(movie_id):
    
    #creating similar users array
    #similar_users holds the unique users who rated the searched movie above 4 (4.5 or 5)
    similar_users = ratings[(ratings['movieId'] == movie_id) & (ratings['rating'] > 4)]['userId'].unique()
    #similar_user_recs holds the movies that those similar users also rated above 4;
    #the count for each movie is divided by the number of similar users,
    #and only movies liked by more than 10% of the similar users are kept
    similar_user_recs = ratings[(ratings['userId'].isin(similar_users)) & (ratings['rating'] > 4)]['movieId']
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .1]
    
    #Create all_users array
    #This finds the share of users who rated each candidate movie above 4,
    #among all users who rated at least one candidate movie above 4
    all_users = ratings[(ratings['movieId'].isin(similar_user_recs.index)) & (ratings['rating'] > 4)]
    all_user_recs = all_users['movieId'].value_counts() / len(all_users['userId'].unique())
    
    #Comparing similar_user_recs with all_user_recs; the score is the ratio of the two shares
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ['similar user recs', 'all user recs']
    rec_percentages['score'] = rec_percentages['similar user recs'] / rec_percentages['all user recs']
    rec_percentages = rec_percentages.sort_values('score', ascending=False)
    #returns the top 10 movies and only showing the score, title, and genre
    return rec_percentages.head(10).merge(movies, left_index=True, right_on = 'movieId')[['score', 'title', 'genres']]

In [19]:
#Creating input box to type movie title you want to view similar movies to.
movie_name_input = widgets.Text(
    value='Input Title You Want To Find Similar Titles To',
    description='Movie Title:',
    disabled=False)
recommendation_list = widgets.Output()

#This function will make a recommendation list based on the name of the movie title entered
def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data['new']
        if len(title) > 5:
            results = search(title)
            movie_id = results.iloc[0]['movieId']
            display(find_similar_movies(movie_id)) #Will display the results from the find_similar_movies function

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='Input Title You Want To Find Similar Titles To', description='Movie Title:')

Output()

### Write Up:
The first step was to import the two data sets; each CSV file holds different information about the movies. The movies dataframe contains the movieId, title, and genres of each movie. The ratings dataframe contains the userId, movieId, rating, and the timestamp at which the rating was recorded.

After importing the two dataframes, I ran a quick analysis to find the shape of each dataframe, the number of movies, the number of ratings, the number of unique users who rated movies, and the average rating of each movie. I then found the movies with the highest and lowest average ratings.

The next step was to clean the two dataframes. I checked for null observations, and neither dataframe had any null values. I then created a function to clean the 'title' column of the movies dataframe, removing any characters that were not letters, numbers, or spaces.
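As a small illustration of the title cleaning, the sketch below applies the same regular expression to a few sample titles. The sample titles are made up for illustration and are not taken from the data set. It also shows one side effect worth knowing about: accented letters are removed along with punctuation, because the pattern only keeps ASCII letters.

```python
import re

def clean_title(title):
    # Keep only ASCII letters, digits, and spaces; drop everything else
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# Hypothetical sample titles (assumed for illustration only)
samples = ["Toy Story (1995)", "Am\u00e9lie (2001)", "Se7en: Director's Cut"]
cleaned = [clean_title(t) for t in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
# 'Toy Story (1995)' -> 'Toy Story 1995'
# 'Amélie (2001)' -> 'Amlie 2001'
# "Se7en: Director's Cut" -> 'Se7en Directors Cut'
```

Because the same function is applied to both the stored titles and the searched title, the two stay comparable even when characters are dropped.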

I then built a TF-IDF (term frequency-inverse document frequency) matrix from the clean_title column of the movies dataframe, using both single words and two-word phrases. The TF-IDF weights identify similarities between a searched title and the titles in the data set, and they are used to rank the titles by how closely they match.
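To make the vectorization step more concrete, here is a minimal sketch on three cleaned titles, using the same `ngram_range=(1,2)` setting as the notebook. The short title list is an assumption for illustration. With scikit-learn's default tokenizer, tokens shorter than two characters are dropped, so the "2" in "Toy Story 2 1999" does not become a feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative sample of cleaned titles (not the full data set)
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

# ngram_range=(1, 2) builds features from single words and two-word phrases
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(titles)

# One row per title, one column per word or two-word phrase
print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Two-word phrases such as "toy story" give extra weight to titles that share a run of words with the query, rather than only isolated words.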

Next, I created a search function. This function receives a movie title as input and runs the clean_title function to remove unnecessary characters. It then converts the cleaned title into a TF-IDF vector, computes its cosine similarity to every title in the data set, and returns the five closest matches.
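The search step can be sketched on a handful of titles as follows. The helper name `top_matches`, the parameter `k`, and the sample titles are assumptions for illustration. One detail the sketch makes explicit is that `np.argpartition` only selects the top `k` entries without ordering them, so an extra sort is needed if the first result should be the best match.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small illustrative sample of cleaned titles (not the full data set)
titles = np.array([
    "Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995",
    "Grumpier Old Men 1995", "Heat 1995", "Sabrina 1995",
])
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(titles)

def top_matches(query, k=3):
    # Hypothetical helper: return the k most similar titles, best match first
    query_vec = vectorizer.transform([query])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    top = np.argpartition(similarity, -k)[-k:]       # top k, in no particular order
    top = top[np.argsort(similarity[top])[::-1]]     # order them by similarity
    return titles[top], similarity[top]

names, scores = top_matches("Jumanji 1995")
print(names[0], scores)
```

An exact title match has a cosine similarity of 1, so it appears first once the selected indices are sorted.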

Then I created a function that finds similar movies based on the movieId searched. First, it builds an array of similar users: the users who rated the searched movie higher than 4 (4.5 or 5). Next, the similar_user_recs variable collects the movies that those similar users also rated higher than 4. The count for each movie is divided by the number of similar users, and I kept only the movies liked by more than 10% of the similar users. I then created an all_users variable, which measures how often users across the whole ratings data set rated each of those movies higher than 4. The score for each movie is the ratio of its share among similar users to its share among all users, so a high score means the movie is liked disproportionately by similar users. The movies are sorted by this score, and only the 10 movies with the highest scores are returned.
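The scoring logic can be traced by hand on a tiny ratings table. The sketch below uses made-up ratings, and the names `score_movies` and `min_share` are hypothetical. It follows the same steps as the function described above, including the fact that the "all users" share is computed over the users who liked at least one candidate movie.

```python
import pandas as pd

# Made-up ratings for illustration only
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 30, 20, 30],
    "rating":  [5.0, 5.0, 4.5, 4.5, 5.0, 3.0, 5.0, 5.0, 2.0, 4.5],
})

def score_movies(movie_id, min_share=0.1):
    liked = ratings[ratings["rating"] > 4]
    # Users who rated the searched movie above 4
    similar_users = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    # Share of similar users who liked each movie
    similar_share = (liked[liked["userId"].isin(similar_users)]["movieId"]
                     .value_counts() / len(similar_users))
    similar_share = similar_share[similar_share > min_share]
    # Share of users who liked each candidate, among users who liked any candidate
    candidates = liked[liked["movieId"].isin(similar_share.index)]
    overall_share = candidates["movieId"].value_counts() / candidates["userId"].nunique()
    scores = pd.concat([similar_share, overall_share], axis=1)
    scores.columns = ["similar", "overall"]
    scores["score"] = scores["similar"] / scores["overall"]
    return scores.sort_values("score", ascending=False)

scores = score_movies(10)
print(scores)
```

In this toy table, users 1 and 2 liked movie 10. Both also liked movie 20 (share 1.0), while only two of the five users overall liked it (share 0.4), giving a score of 2.5. Movie 30 is liked by half of the similar users but by four of the five users overall, so its score is 0.625 and it ranks last. The searched movie itself also appears in the result, since the similar users liked it by definition.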

Lastly, I created a widget that lets a user enter a movie they liked, or a movie for which they want to see similar movies. The movie_name_input variable is a text widget into which the title is typed, and the recommendation_list variable is the output area for the results. Once the typed title is longer than five characters, the search function finds the titles that are most similar in the TF-IDF matrix, and the find_similar_movies function is run on the first result. The output is a display of the 10 movies with the highest recommendation scores.