Introduction


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import regex as re


In [2]:
movies = pd.read_csv('ml-25m/movies.csv')


In [3]:
movies.head()


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Clean the titles (remove everything that's not a letter, number, or space)


In [4]:
def clean_titles(title):
    cleaned_title = re.sub('[^A-Za-z0-9 ]+', '', title)
    return cleaned_title


movies['clean_title'] = movies['title'].apply(clean_titles)
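To see what the pattern keeps and drops, here is a small sketch that applies the same regular expression to a few sample titles. The sample titles are illustrative assumptions, and the standard-library `re` module is used here, which behaves the same as `regex` for this pattern.

```python
import re


def clean_titles(title):
    # Drop every character that is not an ASCII letter, a digit, or a space
    return re.sub('[^A-Za-z0-9 ]+', '', title)


# Illustrative titles (assumed examples, chosen to show edge cases)
samples = ["Toy Story (1995)", "Schindler's List (1993)", "Amélie (2001)"]
cleaned = [clean_titles(t) for t in samples]

for original, result in zip(samples, cleaned):
    print(f"{original!r} -> {result!r}")
```

Note that accented letters are removed as well, because the character class only covers ASCII letters.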


In [5]:
movies.head()


Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995


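Now, let's create a TF-IDF matrix. TF-IDF (term frequency-inverse document frequency) turns each title into a vector of numbers, giving more weight to terms that are rare across all titles. With ngram_range=(1, 2), both single words and pairs of adjacent words are used as terms.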


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [7]:
vectorize = TfidfVectorizer(ngram_range=(1, 2))
titles_vector = vectorize.fit_transform(movies['clean_title'])


Next, we'll compute the similarity between our search term and all the titles in our data. To do this, we're going to use cosine_similarity. It is available in scikit-learn, so we don't need to implement it ourselves.

We'll then write a function called search, which takes in a search term; in this case, the term is a title we want to search. The function will then do the following:

Clean the title

Convert the title into a set of numbers

Use cosine_similarity to find the similarity between our search term and all the titles in our data
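For intuition about the last step: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it is 1 for vectors pointing the same way and 0 for vectors with no terms in common. The sketch below computes it by hand on made-up vectors. The helper name `cosine` and the numbers are assumptions, not part of the notebook.

```python
import math


def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_a * norm_b)


query = [1.0, 1.0, 0.0]       # hypothetical weights for "toy", "story", "jumanji"
same_terms = [2.0, 2.0, 0.0]  # same direction, different length
no_overlap = [0.0, 0.0, 3.0]  # shares no terms with the query

print(cosine(query, same_terms))
print(cosine(query, no_overlap))
```

Because only the direction matters, a long title and a short title that use the same terms in the same proportions score as identical.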


In [8]:
from sklearn.metrics.pairwise import cosine_similarity


In [9]:
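def search(search_term):
    # Clean the query the same way the titles were cleaned
    cleaned_search_term = clean_titles(search_term)
    # Vectorize the query with the already-fitted vectorizer
    search_vector = vectorize.transform([cleaned_search_term])
    # Compare the query against the title matrix computed above
    similarity = cosine_similarity(titles_vector, search_vector).flatten()
    # Indices of the 5 highest similarities, best match first
    ind = np.argpartition(similarity, -5)[-5:]
    ind = ind[np.argsort(similarity[ind])[::-1]]
    return movies.iloc[ind]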
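`np.argpartition` only guarantees that the last k positions hold the indices of the k largest values; it does not order them. The small sketch below, on made-up scores, shows one way to get the top matches in best-first order.

```python
import numpy as np

# Made-up similarity scores for seven titles
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.80, 0.20])
k = 3

# Indices of the k largest scores, in no particular order
top_k = np.argpartition(similarity, -k)[-k:]

# Sort those k indices by score, highest first
top_k_sorted = top_k[np.argsort(similarity[top_k])[::-1]]
print(top_k_sorted)
```

Partitioning first and sorting only the k survivors avoids sorting the whole array, which matters when there are tens of thousands of titles.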


In [27]:
test = search('Toy')


Now, let's make it interactive with ipywidgets and IPython's display!


In [11]:
import ipywidgets as widget
from IPython.display import display


In [12]:

movie_input = widget.Text(
    value='Toy Story',
    description='Movie titles: ',
    disabled=False
)
movie_list = widget.Output()


def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data['new']
        if len(title) > 5:
            display(search(title))


movie_input.observe(on_type, names='value')
display(movie_input, movie_list)


Text(value='Toy Story', description='Movie titles: ')

Output()

In the ratings.csv file, we have userId, movieId, and rating. Each row records how one user rated one movie. We'll create a function to find all the users who also liked the movie that we typed in. For example, if we type "Hulk", we want to find all users who liked Hulk. Then we want to see the other movies they liked, because those will probably be good recommendations for us.


In [13]:
ratings = pd.read_csv('ml-25m/ratings.csv')
ratings.head()


Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


So now, let's create a new DataFrame with all the users who liked the same movie as us. The code below counts a rating above 4 (that is, 4.5 or 5.0) as a "like". Let's start with Star Wars: Episode I (movie ID = 2628).


In [14]:
def find_similar_users(movie_id, ratings):
    # Keep the ratings of the given movie that count as a "like" (above 4)
    movie_df = ratings[(ratings['movieId'] == movie_id)
                       & (ratings['rating'] > 4)]
    return movie_df


similar_ratings = find_similar_users(2628, ratings)


Find the movies that they liked


In [15]:
def find_similar_movies(user_df, ratings_df=ratings):
    user_id_list = list(user_df['userId'])
    # Use the ratings_df argument rather than the global ratings
    similar_user_ratings = ratings_df[(ratings_df['userId'].isin(
        user_id_list)) & (ratings_df['rating'] > 4)]
    return similar_user_ratings


similar_user_ratings = find_similar_movies(similar_ratings, ratings)
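The two steps above can be traced on a tiny made-up ratings table. The data and variable names below are assumptions for illustration; the filtering mirrors the `> 4` threshold used in the notebook.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20],
    "rating":  [5.0, 4.5, 4.5, 5.0, 3.0, 5.0],
})

# Users who liked movie 10 (rating above 4)
liked_target = toy_ratings[(toy_ratings["movieId"] == 10)
                           & (toy_ratings["rating"] > 4)]
similar_users = liked_target["userId"].unique()

# Everything those users liked, including movie 10 itself
liked_by_similar = toy_ratings[toy_ratings["userId"].isin(similar_users)
                               & (toy_ratings["rating"] > 4)]
print(sorted(similar_users))
print(liked_by_similar)
```

User 3 rated movie 10 with a 3.0, so that user is not treated as similar and their 5.0 for movie 20 is ignored.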


If more than 10% of our similar users liked a movie, keep it.


In [16]:
similar_user_list = list(similar_user_ratings['userId'].unique())
ten_percent = len(similar_user_list)*0.1
test = similar_user_ratings['movieId'].value_counts() > ten_percent
movies_list = list(similar_user_ratings['movieId'].value_counts()[test].index)
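The threshold step can be checked on a small made-up table of "liked" ratings. The data and names here are assumptions; the comparison is the same strict `>` used above.

```python
import pandas as pd

# Made-up "liked" ratings from 10 similar users
liked = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3],
    "movieId": [100] * 10 + [200, 200, 300],
})

num_users = liked["userId"].nunique()
counts = liked["movieId"].value_counts()

# Keep movies liked by more than 10% of the similar users
threshold = 0.1 * num_users
kept = counts[counts > threshold].index.tolist()
print(kept)
```

Movie 300 is liked by exactly 10% of the users here and is dropped, because the comparison is strict.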


Now, we're going to find how many of the users in our dataset like these movies. We need to find movies that are specific to our niche. For example, if someone likes the Avengers, you want to find other movies they like that are similar to the Avengers. You don't just want all of the movies they like because they probably like many movies that don't have anything to do with the Avengers.

First, let's find the percentage of all users who like each of the movies and add it as a column in the movies DataFrame.


# Share of all users who liked each movie (rating above 4)
number_of_users = ratings['userId'].nunique()
liked_counts = ratings.loc[ratings['rating'] > 4, 'movieId'].value_counts()
# Movies nobody liked get a share of 0
movies['all_users_rating'] = (
    movies['movieId'].map(liked_counts).fillna(0) / number_of_users)


Second, let's see what percentage of the people who liked Star Wars also liked each of the other movies.


In [32]:
# Assumed to be the movies DataFrame from above, saved to CSV with its all_users_rating column
new_movies = pd.read_csv('new_movies.csv')


Now, let's only recommend movies that share a genre with the picked movie. To do that, we first need to turn the "genres" column into a list, where | acts as the separator. Let's create a function that does the conversion and then keeps only the movies that share at least one genre with the picked movie.

In [181]:
def keep_same_genres(movie_df, movie_id):
    # Split genres to a list
    movie_df['genres_list'] = movie_df['genres'].astype(str).str.split('|')
    # Get the genres of the picked movie
    movie_id_genres = movie_df[movie_df['movieId']
                               == movie_id]['genres_list'].values[0]
    # Convert the list of genres in movie_df to a new dataframe, with every genre as a value. The df will be expanded to 9 columns in this case.
    genres_df = movie_df['genres_list'].apply(pd.Series).astype(str)
    # Check if any value is in the movie_id list of genres, and then check if any value in the row is true.
    genres_bool_df = genres_df.isin(movie_id_genres).any(axis=1)

    # Finally, return the new dataframe which only contains movies that have the same genres as the picked movie
    return movie_df[genres_bool_df]


movies_same_genres = keep_same_genres(new_movies, 1)
movies_same_genres.head(5)


Unnamed: 0.1,Unnamed: 0,movieId,title,genres,clean_title,all_users_rating,genres_list
0,0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995,0.115878,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995,0.016408,"[Adventure, Children, Fantasy]"
2,2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995,0.007155,"[Comedy, Romance]"
3,3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995,0.00112,"[Comedy, Drama, Romance]"
4,4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995,0.005955,[Comedy]
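A lighter-weight way to express the same filter, sketched here on made-up data, is a set-overlap test per row instead of expanding the genre lists into separate columns. The toy movies and variable names are assumptions for illustration.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Horror"],
})

# Split the pipe-separated string into a list of genres
toy_movies["genres_list"] = toy_movies["genres"].str.split("|")

# Genres of the picked movie (movieId 1 here)
picked = set(toy_movies.loc[toy_movies["movieId"] == 1, "genres_list"].iloc[0])

# True where a movie shares at least one genre with the picked movie
shares_genre = toy_movies["genres_list"].apply(
    lambda genres: not picked.isdisjoint(genres))
kept_ids = toy_movies.loc[shares_genre, "movieId"].tolist()
print(kept_ids)
```

Movie 2 is kept because it shares "Comedy" with the picked movie, while movie 3 has no genre in common and is dropped.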


In [18]:
def rate_similar_movies(num_similar_users, movies_list, similar_user_ratings):

    similar_movies_dict = {}
    for movie in movies_list:
        similar_movies_dict[movie] = len(similar_user_ratings[(
            similar_user_ratings['movieId'] == movie)])/num_similar_users
    new_movies_df = pd.DataFrame(data=similar_movies_dict.values(
    ), index=similar_movies_dict.keys(), columns=['similar_users_rating'])
    return new_movies_df


num_similar_users = len(similar_user_list)
similar_movies_perc_df = rate_similar_movies(
    num_similar_users, movies_list, similar_user_ratings)
similar_movies_perc_df


Unnamed: 0,similar_users_rating
2628,1.000000
260,0.723475
1196,0.680677
1210,0.679651
2571,0.582778
...,...
1127,0.101230
2617,0.100974
54001,0.100718
1923,0.100718
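The same shares can also be computed without a Python loop. The sketch below does this on made-up data with `value_counts`; it is an alternative formulation under the assumption that the input contains only "liked" ratings, as `similar_user_ratings` does.

```python
import pandas as pd

# Made-up liked ratings from 4 similar users
liked = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 1, 2, 3],
    "movieId": [10, 10, 10, 10, 20, 20, 30],
})
num_similar_users = liked["userId"].nunique()

# Share of similar users who liked each movie, indexed by movieId
shares = (liked["movieId"].value_counts() / num_similar_users
          ).rename("similar_users_rating").to_frame()
print(shares)
```

As in the output above, the picked movie itself (movie 10 in this toy data) gets a share of 1.0, since every similar user liked it by definition.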


Now, let's merge the DataFrame of scores from our similar users with the scores from all users. Then, let's divide the similar-user score by the all-user score for each movie. Finally, we select the top 10 movies by this ratio and return a new DataFrame containing only those.


In [19]:
def find_top_10(movies_df, similar_movies_df):
    merged_df = pd.merge(left=movies_df, right=similar_movies_df,
                         how='left', left_on='movieId', right_on=similar_movies_df.index)
    merged_df = merged_df.dropna()
    merged_df['Score'] = merged_df['similar_users_rating'] / \
        merged_df['all_users_rating']
    top_10_movies = merged_df.sort_values(
        by='Score', ascending=False).iloc[:10, :]
    return top_10_movies[['title', 'genres', 'Score']]


In [20]:
top_10_movies = find_top_10(new_movies, similar_movies_perc_df)
top_10_movies.head(3)


Unnamed: 0,title,genres,Score
2537,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Sci-Fi,41.655818
5270,Star Wars: Episode II - Attack of the Clones (...,Action|Adventure|Sci-Fi|IMAX,20.663292
9952,Star Wars: Episode III - Revenge of the Sith (...,Action|Adventure|Sci-Fi,12.80997
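The score is a ratio of two shares, so it reads as "how many times more popular is this movie among similar users than among everyone". The numbers below are made up to show the arithmetic and one caveat: a movie liked by very few users overall has a tiny denominator, which inflates its score. The helper name `score` is an assumption for this sketch.

```python
def score(similar_share, overall_share):
    # Share of similar users who liked the movie, relative to all users
    return similar_share / overall_share


# Hypothetical shares, not taken from the dataset
mainstream = score(similar_share=0.60, overall_share=0.30)  # liked widely by everyone
niche = score(similar_share=0.60, overall_share=0.02)       # liked mostly by similar users
rare = score(similar_share=0.10, overall_share=0.0001)      # liked by almost nobody overall

print(mainstream, niche, rare)
```

The rare movie ends up with the highest score even though only 10% of similar users liked it, so very large scores are worth a second look before being trusted as recommendations.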


Before we create our simple search engine, we need to slightly modify the "search" function. Currently, it returns the 5 most similar titles, but we want it to return only one. Furthermore, we only want the movieId of the matched movie.


In [21]:
def search_one(search_term, df=new_movies):
    # Clean the query and the titles the same way the vectorizer was fitted
    cleaned_search_term = clean_titles(search_term)
    titles_vector = vectorize.transform(df['title'].apply(clean_titles))
    search_vector = vectorize.transform([cleaned_search_term])
    similarity = cosine_similarity(titles_vector, search_vector).flatten()
    # Position of the single best match
    ind = int(np.argmax(similarity))
    return int(df['movieId'].iloc[ind])


Now, let's put it all together and make a search engine!


In [182]:
def movie_recommendation(search_term, movie_df=new_movies, rating_df=ratings):
    # Find movie_id of the movie most similar to the search term
    movie_id = search_one(search_term, movie_df)

    # Find similar users
    similar_user_df = find_similar_users(movie_id, rating_df)

    # Find what movies similar users like
    similar_movies_df = find_similar_movies(similar_user_df, rating_df)

    # Make a list of similar users
    similar_user_list = list(similar_movies_df['userId'].unique())

    # Calculate how many unique users there are (number of users)
    num_similar_users = len(similar_user_list)

    # Only include movies that 10% or more of similar users liked
    ten_percent = len(similar_user_list)*0.1
    test = similar_movies_df['movieId'].value_counts() > ten_percent
    movies_list = list(similar_movies_df['movieId'].value_counts()[test].index)
    
    # Calculate how many of the similar users liked the movies in the list
    similar_movies_rated = rate_similar_movies(
        num_similar_users, movies_list, similar_movies_df)
    # Remove all the movies that do not share a genre with the picked movie
    movies_same_genre_df = keep_same_genres(movie_df, movie_id)

    # Retrieve the top 10 movies by score
    recommended_df = find_top_10(movies_same_genre_df, similar_movies_rated)

    return recommended_df


In [194]:
recommended_movies = movie_recommendation('Star Wars')


In [195]:
recommended_movies

Unnamed: 0,title,genres,Score
3290,Star Wars: Dresca,Sci-Fi,162541.0
2976,Star Wars Downunder (2013),Sci-Fi,162541.0
1754,The Star Wars Holiday Special (1978),Adventure|Children|Comedy|Sci-Fi,27090.166667
1271,Star Trek: Of Gods and Men (2007),Action|Adventure|Sci-Fi,20317.625
1918,Star Trek: Renegades (2015),Action|Adventure|Sci-Fi,20317.625
1787,Robot Chicken: Star Wars (2007),Animation|Comedy|Sci-Fi,3317.163265
855,Star Wars: The Clone Wars (2008),Action|Adventure|Animation|Sci-Fi,820.914141
3042,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,738.822727
447,Star Trek: Nemesis (2002),Action|Drama|Sci-Fi|Thriller,359.603982
78,Star Trek V: The Final Frontier (1989),Action|Sci-Fi,297.694139


Now, let's make an interactive search box as before, this time with our new search engine.


In [185]:
movie_input = widget.Text(
    value='',
    description='Movie titles: ',
    disabled=False
)
movie_list = widget.Output()


def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data['new']
        if len(title) > 5:
            display(movie_recommendation(title))


movie_input.observe(on_type, names='value')
display(movie_input, movie_list)


Text(value='', description='Movie titles: ')

Output()