# Movie Recommendation System

![img](images/Movies.png)

1. Importing the necessary libraries:

In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import ipywidgets as widgets
from IPython.display import display
import os


2. Reading movie data from a CSV file into a pandas DataFrame:

In [2]:
movies = pd.read_csv("data/movies.csv")


In [3]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


3. Extracting the year from the movie titles using regular expressions:

In [4]:
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)')


4. Converting the extracted year to numeric format:

In [5]:
movies['year'] = pd.to_numeric(movies['year'])


5. Finding the oldest and most recent years in the dataset:

In [6]:
oldest_year = movies['year'].min()
most_recent_year = movies['year'].max()


In [7]:
print("Oldest Year:", oldest_year)
print("Most Recent Year:", most_recent_year)


Oldest Year: 1874.0
Most Recent Year: 2019.0
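
To make steps 3 to 5 concrete, the sketch below runs the same extraction on a few hand-made titles (the sample titles are illustrative and not taken from the dataset). A title without a parenthesised four-digit year yields NaN, which would explain why the column is stored as float and the years print as 1874.0 and 2019.0.

```python
import pandas as pd

# Hypothetical sample titles; the last one has no year in parentheses.
sample = pd.DataFrame(
    {"title": ["Toy Story (1995)", "Jumanji (1995)", "Untitled Project"]}
)

# expand=False returns a Series instead of a one-column DataFrame.
sample["year"] = sample["title"].str.extract(r"\((\d{4})\)", expand=False)

# Convert the extracted strings to numbers; missing years stay NaN.
sample["year"] = pd.to_numeric(sample["year"])

print(sample)
print("Missing years:", sample["year"].isna().sum())
```

Counting the NaN values this way is a quick check of how many titles the pattern failed to match before relying on `min()` and `max()`.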


6. Defining a function to clean movie titles by removing special characters:

In [8]:
def clean_title(title):
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title


7. Applying the clean_title function to the "title" column of the movies DataFrame and storing the cleaned titles in a new column called "clean_title":

In [9]:
movies["clean_title"] = movies["title"].apply(clean_title)


In [10]:
movies

Unnamed: 0,movieId,title,genres,year,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995.0,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995.0,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995.0,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,1995.0,Father of the Bride Part II 1995
...,...,...,...,...,...
62418,209157,We (2018),Drama,2018.0,We 2018
62419,209159,Window of the Soul (2001),Documentary,2001.0,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,2018.0,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),2001.0,A Girl Thing 2001
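
A short sketch of what `clean_title` does to individual strings. The second title is an illustrative example chosen to show a side effect of the pattern: because only ASCII letters, digits and spaces are kept, accented letters are removed along with the punctuation.

```python
import re

def clean_title(title):
    # Keep only ASCII letters, digits and spaces (same pattern as in step 6).
    return re.sub("[^a-zA-Z0-9 ]", "", title)

print(clean_title("Avengers, The (2012)"))  # Avengers The 2012
print(clean_title("Amélie (2001)"))         # Amlie 2001
```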


8. Creating an instance of the TfidfVectorizer class with an n-gram range of (1,2) to consider both single words and bigrams as features:

In [11]:
vectorizer = TfidfVectorizer(ngram_range=(1,2))


9. Fitting the TfidfVectorizer on the "clean_title" column and transforming the titles into a TF-IDF matrix:

In [12]:
tfidf = vectorizer.fit_transform(movies["clean_title"])
tfidf

<62423x170073 sparse matrix of type '<class 'numpy.float64'>'
	with 446566 stored elements in Compressed Sparse Row format>

10. Defining a function to search for similar movies based on a given title:

In [13]:
def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
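    # argpartition finds the 5 best matches but leaves them unordered,
    # so sort them by similarity, best match first.
    indices = np.argpartition(similarity, -5)[-5:]
    indices = indices[np.argsort(similarity[indices])[::-1]]
    results = movies.iloc[indices]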
    return results
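
`np.argpartition` only guarantees that the k largest values end up in the last k positions; it does not order them among themselves. A minimal sketch of ranking the top matches, using a made-up similarity vector:

```python
import numpy as np

# Hypothetical cosine similarities between a query and seven titles.
similarity = np.array([0.10, 0.90, 0.30, 0.70, 0.05, 0.80, 0.20])
k = 3

# Indices of the k largest values, in no particular order.
top_unordered = np.argpartition(similarity, -k)[-k:]

# Order those k indices by their similarity, best match first.
top_sorted = top_unordered[np.argsort(similarity[top_unordered])[::-1]]

print(top_sorted)  # [1 5 3]
```

Sorting only the k selected entries keeps the cost low compared with sorting the whole similarity vector.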


11. Creating an input widget for the user to enter a movie title:

In [14]:
movie_input = widgets.Text(
    placeholder='Input Movie',
    description='Movie Title:',
    disabled=False
)


12. Creating an output widget to display the search results:

In [15]:
movie_list = widgets.Output()


13. Defining a callback function that is triggered when the user types in the movie input widget:

In [16]:
def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 0:
            display(search(title))
            

# Observing changes in the value of the movie input widget and calling the on_type function:
movie_input.observe(on_type, names='value') 

# Displaying the movie input widget and the movie list widget:
display(movie_input, movie_list)

Text(value='', description='Movie Title:', placeholder='Input Movie')

Output()

14. Accessing a specific movie by its movie ID:

In [17]:
movie_id = 89745
movie = movies[movies["movieId"] == movie_id]


15. Reading ratings data from a CSV file into a pandas DataFrame:

In [18]:
ratings = pd.read_csv("data/ratings.csv")


In [19]:
ratings = ratings.drop('timestamp', axis=1)

In [20]:
ratings

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5
...,...,...,...
25000090,162541,50872,4.5
25000091,162541,55768,2.5
25000092,162541,56176,2.0
25000093,162541,58559,4.0


16. Selecting the unique user IDs who rated the specific movie with a rating higher than 4:

In [21]:
similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
similar_users

array([    21,    187,    208, ..., 162469, 162485, 162532])
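
The same selection on a tiny, invented ratings table makes the filter easier to follow. Note that the comparison is strict, so a rating of exactly 4.0 does not count as a like.

```python
import pandas as pd

# Toy ratings; all IDs and values are made up for illustration.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20],
    "rating":  [5.0, 4.5, 4.0, 5.0, 4.5, 3.0],
})
target = 10

# Rows where the target movie was rated strictly above 4.
liked_target = (toy_ratings["movieId"] == target) & (toy_ratings["rating"] > 4)
toy_similar_users = toy_ratings.loc[liked_target, "userId"].unique()

print(toy_similar_users)  # [1 3]  (user 2 gave exactly 4.0)
```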

17. Finding the movies that the similar users also rated above 4 and calculating the fraction of similar users who liked each one:

In [22]:
similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
similar_user_recs

movieId
89745     1.000000
58559     0.573393
59315     0.530649
79132     0.519715
2571      0.496687
            ...   
160402    0.000166
161642    0.000166
158950    0.000166
199648    0.000166
198609    0.000166
Name: count, Length: 16553, dtype: float64
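
A small sketch of the same calculation on invented data. The values are fractions between 0 and 1, and the target movie itself always comes out at 1.0 because every similar user liked it by construction.

```python
import pandas as pd

# Toy ratings; all IDs and values are made up for illustration.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 3, 3, 2],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.5, 5.0, 4.5, 5.0, 5.0],
})
toy_similar_users = [1, 3]  # assumed: the users who liked movie 10

# Everything the similar users rated above 4.
liked = toy_ratings[
    toy_ratings["userId"].isin(toy_similar_users) & (toy_ratings["rating"] > 4)
]

# Count likes per movie and divide by the number of similar users.
fractions = liked["movieId"].value_counts() / len(toy_similar_users)
print(fractions)
```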

18. Filtering the recommended movies with a threshold (0.10 in this case). Only movies rated above 4 by more than 10% of similar users are kept:

In [23]:
similar_user_recs = similar_user_recs[similar_user_recs > .10]
similar_user_recs 

movieId
89745    1.000000
58559    0.573393
59315    0.530649
79132    0.519715
2571     0.496687
           ...   
47610    0.103545
780      0.103380
88744    0.103048
1258     0.101226
1193     0.100895
Name: count, Length: 193, dtype: float64

19. Selecting all ratings above 4 for the recommended movies, from any user:

In [24]:
all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
all_users

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
29,1,4973,4.5
48,1,7361,5.0
72,2,110,5.0
76,2,260,5.0
...,...,...,...
25000065,162541,5952,5.0
25000078,162541,7153,5.0
25000081,162541,7361,4.5
25000086,162541,31658,4.5


20. Calculating the frequency of the recommended movies among all users. The result, all_user_recs, contains the movie IDs and the fraction of these users who rated each movie above 4:

In [25]:
all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
all_user_recs

movieId
318       0.346395
296       0.288146
2571      0.247010
356       0.238136
593       0.228665
            ...   
86332     0.010142
91630     0.009324
122900    0.008573
122926    0.008070
106072    0.005289
Name: count, Length: 193, dtype: float64

21. Concatenating the similar user recommendations and all user recommendations into a DataFrame:

In [26]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
rec_percentages.columns = ["similar", "all"]


In [27]:
rec_percentages

movieId,similar,all
89745,1.000000,0.040459
58559,0.573393,0.148256
59315,0.530649,0.054931
79132,0.519715,0.132987
2571,0.496687,0.247010
...,...,...
47610,0.103545,0.022770
780,0.103380,0.054723
88744,0.103048,0.010383
1258,0.101226,0.083887
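
`pd.concat` with `axis=1` aligns the two Series on their index labels (the movie IDs), not on their position. A minimal sketch with invented values, where the second Series is deliberately stored in a different order:

```python
import pandas as pd

# Made-up fractions, indexed by movie ID.
similar = pd.Series({10: 1.0, 20: 0.6, 30: 0.2}, name="similar")
overall = pd.Series({30: 0.1, 10: 0.05, 20: 0.3}, name="all")

# Rows are matched by label, so movie 30 gets 0.2 and 0.1 side by side.
combined = pd.concat([similar, overall], axis=1)
print(combined)
```

A movie ID present in only one of the two Series would appear with NaN in the other column.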


22. Calculating a score for each recommended movie by dividing the similar user recommendations by the all user recommendations:

In [28]:
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
rec_percentages["score"]

movieId
89745    24.716368
58559     3.867590
59315     9.660345
79132     3.908027
2571      2.010791
           ...    
47610     4.547463
780       1.889149
88744     9.924843
1258      1.206688
1193      0.839081
Name: score, Length: 193, dtype: float64
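
The score is a ratio of two shares, so a value above 1 means the movie is liked more often by the similar users than by users in general, and a value below 1 means the opposite. The sketch below recomputes three scores from the rounded figures printed above; small differences in the last digits come from that rounding.

```python
import pandas as pd

# Shares copied from the rec_percentages output above (rounded values).
sample = pd.DataFrame(
    {"similar": [1.000000, 0.573393, 0.496687],
     "all":     [0.040459, 0.148256, 0.247010]},
    index=pd.Index([89745, 58559, 2571], name="movieId"),
)

# Ratio of the share among similar users to the share among all users.
sample["score"] = sample["similar"] / sample["all"]
print(sample.round(3))
```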

23. Sorting the recommended movies by the score in descending order:

In [29]:
rec_percentages = rec_percentages.sort_values("score", ascending=False)
rec_percentages

movieId,similar,all,score
89745,1.000000,0.040459,24.716368
106072,0.103711,0.005289,19.610199
122892,0.241054,0.012367,19.491770
102125,0.216534,0.012119,17.867419
88140,0.215043,0.012052,17.843074
...,...,...,...
296,0.288933,0.288146,1.002730
593,0.222830,0.228665,0.974483
527,0.199967,0.217833,0.917984
1193,0.100895,0.120244,0.839081


24. Merging the ranked recommendations with the movie details:

In [30]:
rec_percentages.merge(movies, left_index=True, right_on="movieId")


Unnamed: 0,similar,all,score,movieId,title,genres,year,clean_title
17067,1.000000,0.040459,24.716368,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX,2012.0,Avengers The 2012
20513,0.103711,0.005289,19.610199,106072,Thor: The Dark World (2013),Action|Adventure|Fantasy|IMAX,2013.0,Thor The Dark World 2013
25058,0.241054,0.012367,19.491770,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi,2015.0,Avengers Age of Ultron 2015
19678,0.216534,0.012119,17.867419,102125,Iron Man 3 (2013),Action|Sci-Fi|Thriller|IMAX,2013.0,Iron Man 3 2013
16725,0.215043,0.012052,17.843074,88140,Captain America: The First Avenger (2011),Action|Adventure|Sci-Fi|Thriller|War,2011.0,Captain America The First Avenger 2011
...,...,...,...,...,...,...,...,...
292,0.288933,0.288146,1.002730,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,1994.0,Pulp Fiction 1994
585,0.222830,0.228665,0.974483,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,1991.0,Silence of the Lambs The 1991
522,0.199967,0.217833,0.917984,527,Schindler's List (1993),Drama|War,1993.0,Schindlers List 1993
1164,0.100895,0.120244,0.839081,1193,One Flew Over the Cuckoo's Nest (1975),Drama,1975.0,One Flew Over the Cuckoos Nest 1975


25. Defining a function to find similar movies based on a given movie ID:

This function wraps steps 16 to 24. It finds the users who rated the given movie above 4, computes for each movie the share of those similar users and the share of all users who also rated it above 4, scores each movie by the ratio of the two shares, and returns the 10 highest-scoring movies with their titles and genres.





In [31]:
def find_similar_movies(movie_id):
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)

    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]
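
For experimenting away from the full dataset, here is a hedged variant of the same logic run on toy data. The function name `find_similar_movies_sketch`, its parameters (`min_rating`, `min_share`, `top_n`) and the guard for a movie nobody liked are assumptions added for illustration; they are not part of the notebook.

```python
import pandas as pd

def find_similar_movies_sketch(ratings, movies, movie_id,
                               min_rating=4, min_share=0.10, top_n=10):
    # Keep only "likes": ratings strictly above the threshold.
    liked = ratings[ratings["rating"] > min_rating]
    similar_users = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    if len(similar_users) == 0:
        # Assumed behaviour: return an empty result instead of dividing by zero.
        return pd.DataFrame(columns=["score", "title", "genres"])

    # Share of similar users who liked each movie.
    by_similar = liked.loc[liked["userId"].isin(similar_users), "movieId"]
    similar_recs = by_similar.value_counts() / len(similar_users)
    similar_recs = similar_recs[similar_recs > min_share]

    # Share of all users (who liked any of those movies) who liked each movie.
    all_liked = liked[liked["movieId"].isin(similar_recs.index)]
    all_recs = all_liked["movieId"].value_counts() / all_liked["userId"].nunique()

    rec = pd.concat([similar_recs, all_recs], axis=1)
    rec.columns = ["similar", "all"]
    rec["score"] = rec["similar"] / rec["all"]
    rec = rec.sort_values("score", ascending=False).head(top_n)
    return rec.merge(movies, left_index=True, right_on="movieId")[
        ["score", "title", "genres"]
    ]

# Toy data; all IDs, titles and ratings are made up.
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title": ["A", "B", "C", "D"],
    "genres": ["Action", "Action", "Drama", "Comedy"],
})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3, 4, 4],
    "movieId": [10, 20, 10, 20, 30, 30, 30, 40],
    "rating":  [5.0, 5.0, 4.5, 4.5, 5.0, 5.0, 4.5, 5.0],
})

result = find_similar_movies_sketch(toy_ratings, toy_movies, movie_id=10)
print(result)
```

In this toy run, movies A and B score 2.0 because both users who liked A also liked B, while C scores below 1 because it is liked more widely by everyone else.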

26. Creating an input widget that takes a movie title, finds the best-matching movie, and displays recommendations for it:

In [32]:

movie_name_input = widgets.Text(
    placeholder='Input Movie',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 0:
            results = search(title)
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')

display(movie_name_input, recommendation_list)

Text(value='', description='Movie Title:', placeholder='Input Movie')

Output()

27. Merging movies and ratings, then averaging the rating per movie:

In [33]:
df = pd.merge(movies, ratings, on='movieId')
df = df.drop(['title'], axis=1)
df = df.rename(columns={'clean_title': 'title'})



df

Unnamed: 0,movieId,genres,year,title,userId,rating
0,1,Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995,2,3.5
1,1,Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995,3,4.0
2,1,Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995,4,3.0
3,1,Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995,5,4.0
4,1,Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story 1995,8,4.0
...,...,...,...,...,...,...
25000090,209157,Drama,2018.0,We 2018,119571,1.5
25000091,209159,Documentary,2001.0,Window of the Soul 2001,115835,3.0
25000092,209163,Comedy|Drama,2018.0,Bad Poems 2018,6964,4.5
25000093,209169,(no genres listed),2001.0,A Girl Thing 2001,119571,3.0


In [34]:
df2 = df.groupby(['movieId', 'title', 'genres', 'year'])['rating'].mean().reset_index()



df2


Unnamed: 0,movieId,title,genres,year,rating
0,1,Toy Story 1995,Adventure|Animation|Children|Comedy|Fantasy,1995.0,3.893708
1,2,Jumanji 1995,Adventure|Children|Fantasy,1995.0,3.251527
2,3,Grumpier Old Men 1995,Comedy|Romance,1995.0,3.142028
3,4,Waiting to Exhale 1995,Comedy|Drama|Romance,1995.0,2.853547
4,5,Father of the Bride Part II 1995,Comedy,1995.0,3.058434
...,...,...,...,...,...
58670,209157,We 2018,Drama,2018.0,1.500000
58671,209159,Window of the Soul 2001,Documentary,2001.0,3.000000
58672,209163,Bad Poems 2018,Comedy|Drama,2018.0,4.500000
58673,209169,A Girl Thing 2001,(no genres listed),2001.0,3.000000
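
One thing to keep in mind with this `groupby`: by default pandas drops groups whose key contains NaN, so any movie whose year could not be extracted would be left out of the averages. A minimal sketch on invented rows, assuming pandas 1.1 or later for the `dropna` parameter:

```python
import numpy as np
import pandas as pd

# Toy data; the second movie has no year.
toy = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title":   ["Toy Story 1995", "Toy Story 1995", "Untitled Project"],
    "year":    [1995.0, 1995.0, np.nan],
    "rating":  [4.0, 3.0, 5.0],
})
keys = ["movieId", "title", "year"]

# Default behaviour: the group with a NaN year disappears.
default = toy.groupby(keys)["rating"].mean().reset_index()

# dropna=False keeps it, with NaN as the year.
keep_nan = toy.groupby(keys, dropna=False)["rating"].mean().reset_index()

print(len(default), len(keep_nan))  # 1 2
```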


28. Splitting the "genres" column and keeping the first three genres in separate columns:

In [35]:



df2['genres'] = df2['genres'].str.split('|')
df2['genres1'] = df2['genres'].str[0].str.strip()
df2['genres2'] = df2['genres'].str[1].str.strip()
df2['genres3'] = df2['genres'].str[2].str.strip()

df3 = df2[['movieId', 'title', 'year', 'rating', 'genres1', 'genres2', 'genres3']]

df3



Unnamed: 0,movieId,title,year,rating,genres1,genres2,genres3
0,1,Toy Story 1995,1995.0,3.893708,Adventure,Animation,Children
1,2,Jumanji 1995,1995.0,3.251527,Adventure,Children,Fantasy
2,3,Grumpier Old Men 1995,1995.0,3.142028,Comedy,Romance,
3,4,Waiting to Exhale 1995,1995.0,2.853547,Comedy,Drama,Romance
4,5,Father of the Bride Part II 1995,1995.0,3.058434,Comedy,,
...,...,...,...,...,...,...,...
58670,209157,We 2018,2018.0,1.500000,Drama,,
58671,209159,Window of the Soul 2001,2001.0,3.000000,Documentary,,
58672,209163,Bad Poems 2018,2018.0,4.500000,Comedy,Drama,
58673,209169,A Girl Thing 2001,2001.0,3.000000,(no genres listed),,
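
An alternative way to get the same three columns, shown here on invented rows, is `str.split` with `expand=True`. This is a sketch rather than a replacement for the cell above. Either way, a fourth or later genre (such as the "Comedy" in the first row) is discarded, and movies with fewer than three genres get missing values.

```python
import pandas as pd

# Toy genre strings in the same pipe-separated format.
toy = pd.DataFrame(
    {"genres": ["Adventure|Animation|Children|Comedy", "Comedy|Romance", "Drama"]}
)

# One column per genre; reindex keeps exactly the first three positions.
parts = toy["genres"].str.split("|", expand=True).reindex(columns=range(3))
parts.columns = ["genres1", "genres2", "genres3"]

print(parts)
```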


29. Exporting the DataFrame to a CSV file:

In [36]:



path = os.path.expanduser('~/Desktop/IRONHACK/Projects/Final-Project/data')
os.makedirs(path, exist_ok=True)

df3.to_csv(os.path.join(path, 'df3.csv'), index=False)
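
To confirm an export like this round-trips cleanly, the sketch below writes a small invented DataFrame to a temporary directory and reads it back. The temporary location and the toy rows are assumptions for illustration; the notebook itself writes to the Desktop path above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy rows with a few of the exported columns.
toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story 1995", "Jumanji 1995"],
    "rating": [3.9, 3.25],
})

# Write into a throwaway directory instead of a fixed user path.
out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "df3.csv"

# index=False keeps the row index out of the file.
toy.to_csv(out_file, index=False)

restored = pd.read_csv(out_file)
print(restored)
```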


