# Project title: Movie Recommendation System with Machine Learning

Recommendation systems are among the most popular applications of data science. They are used to predict the rating or preference that a user would give to an item. These systems are integral to modern applications, helping businesses and platforms tailor their offerings to individual user tastes.

## Real-World Applications

- Amazon: Suggests products to customers based on their browsing and purchasing history.
- YouTube: Decides which video to play next on autoplay by analyzing viewing patterns.
- Facebook: Recommends pages to like and people to follow based on user interactions and interests.

## Objective

In this Data Science project, we will build a basic recommendation system to understand the fundamentals of how these systems work. We will explore both simple and content-based recommendation systems. While our models may not match industry standards in terms of complexity, quality, or accuracy, they will serve as a solid foundation for developing more sophisticated models in the future.

## Project Outline

1. Data Preparation
   - Load and Clean Data: Import the dataset, handle missing values, and ensure consistency.
   - Feature Engineering: Extract and format relevant features, such as movie titles and descriptions.
2. Recommendation Approaches
   - Content-Based Filtering: Recommend items similar to those the user has liked, based on item features.
   - Collaborative Filtering: Suggest items based on user-item interactions and similarities.
3. Model Training and Evaluation
   - Similarity Calculation: Use techniques like TF-IDF and cosine similarity to measure item similarity.
   - Build and Evaluate: Create models and evaluate their performance using metrics like precision and recall.
4. Generate Recommendations
   - Create Recommendation Function: Implement functions to generate and display recommendations.
   - Test the Function: Verify the function with sample inputs to ensure accuracy.
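
The outline names precision and recall as evaluation metrics. As a rough illustration only (the cells in this notebook do not compute them), the sketch below shows one common way to score a top-k recommendation list. The function name `precision_recall_at_k` and the two sample lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items."""
    top_k = recommended[:k]
    # Count recommended items the user actually found relevant
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: an ordered recommendation list and a set of liked titles
recommended = ["Alien", "Heat", "Solaris", "Up"]
relevant = {"Alien", "Solaris", "Moon"}

p, r = precision_recall_at_k(recommended, relevant, k=3)
print(f"precision@3 = {p:.2f}, recall@3 = {r:.2f}")
```

Precision asks how many of the k suggestions were good; recall asks how many of the good items made it into the k suggestions.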

In [1]:
import pandas as pd
import numpy as np

In [5]:
credits = pd.read_csv(r"C:\Users\User\Desktop\MRS\tmdb_5000_credits.csv")
movies = pd.read_csv(r"C:\Users\User\Desktop\MRS\tmdb_5000_movies.csv")


In [6]:
credits.head()

Unnamed: 0,movie_id,title,cast,crew,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 1255,Unnamed: 1256,Unnamed: 1257,Unnamed: 1258,Unnamed: 1259,Unnamed: 1260,Unnamed: 1261,Unnamed: 1262,Unnamed: 1263,Unnamed: 1264
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",,,,,,,...,,,,,,,,,,
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de...",,,,,,,...,,,,,,,,,,
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de...",,,,,,,...,,,,,,,,,,
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de...",,,,,,,...,,,,,,,,,,
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de...",,,,,,,...,,,,,,,,,,


In [7]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12/10/2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",5/19/2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",10/26/2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",7/16/2012,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",3/7/2012,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [8]:
print("Credits:",credits.shape)
print("Movies Dataframe:",movies.shape)

Credits: (4813, 1265)
Movies Dataframe: (4803, 20)
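
The credits frame reports 1265 columns, almost all of them named `Unnamed: N` and filled with NaN. One plausible explanation, offered here as an assumption since the file itself was not inspected, is that the CSV carries trailing empty fields. A minimal sketch of restricting the load to the four named columns with `usecols`, using a small in-memory stand-in for the file:

```python
import io

import pandas as pd

# Toy stand-in for a credits file whose header and rows end in empty fields
raw = (
    "movie_id,title,cast,crew,,,\n"
    "19995,Avatar,[],[],,,\n"
    "285,Pirates of the Caribbean: At World's End,[],[],,,\n"
)

# A plain read keeps the empty fields as 'Unnamed: N' columns
wide = pd.read_csv(io.StringIO(raw))
print(wide.shape)

# Restricting the read to the named columns drops them at load time
expected_cols = ["movie_id", "title", "cast", "crew"]
slim = pd.read_csv(io.StringIO(raw), usecols=expected_cols)
print(slim.shape)
```

This only trims the empty columns. If rows in the real file are also shifted or split, they would need separate cleaning.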


In [9]:
credits_column_renamed = credits.rename(index=str, columns={"movie_id": "id"})

In [10]:
movies_merge = movies.merge(credits_column_renamed, on='id')
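
An inner merge silently drops every movie whose `id` has no match in the other frame. The `info()` output further down reports 3276 rows, against 4803 in `movies`. The sketch below, on hypothetical toy frames, shows how a left merge with `indicator=True` could be used to list the rows that would be lost.

```python
import pandas as pd

# Hypothetical toy frames; id 2 has no credits row and id 4 has no movie row
movies_toy = pd.DataFrame({"id": [1, 2, 3], "original_title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"id": [1, 3, 4], "cast": ["x", "y", "z"]})

# The default inner merge keeps only ids present in both frames
inner = movies_toy.merge(credits_toy, on="id")
print(len(movies_toy), "->", len(inner))

# A left merge with indicator=True adds a '_merge' column naming each row's origin
audit = movies_toy.merge(credits_toy, on="id", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "original_title"]
print(unmatched.tolist())
```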

In [11]:
print(movies_merge.head())

     budget                                             genres  \
0         0  [{"id": 35, "name": "Comedy"}, {"id": 14, "nam...   
1  48000000  [{"id": 14, "name": "Fantasy"}, {"id": 35, "na...   
2  58000000  [{"id": 10402, "name": "Music"}, {"id": 18, "n...   
3  55000000                    [{"id": 37, "name": "Western"}]   
4  48000000  [{"id": 53, "name": "Thriller"}, {"id": 28, "n...   

                           homepage      id  \
0                               NaN   11232   
1                               NaN    1636   
2                               NaN    2148   
3  http://www.310toyumathefilm.com/    5176   
4       http://www.taken3movie.com/  260346   

                                            keywords original_language  \
0  [{"id": 1808, "name": "lover (female)"}, {"id"...                en   
1  [{"id": 2038, "name": "love of one's life"}, {...                en   
2  [{"id": 1416, "name": "jazz"}, {"id": 3017, "n...                en   
3  [{"id": 1582, "name":

In [19]:
movies_cleaned = movies_merge.drop(columns=['homepage', 'title_x', 'title_y', 'status','production_countries'])

In [20]:
print(movies_cleaned.head())


     budget                                             genres      id  \
0         0  [{"id": 35, "name": "Comedy"}, {"id": 14, "nam...   11232   
1  48000000  [{"id": 14, "name": "Fantasy"}, {"id": 35, "na...    1636   
2  58000000  [{"id": 10402, "name": "Music"}, {"id": 18, "n...    2148   
3  55000000                    [{"id": 37, "name": "Western"}]    5176   
4  48000000  [{"id": 53, "name": "Thriller"}, {"id": 28, "n...  260346   

                                            keywords original_language  \
0  [{"id": 1808, "name": "lover (female)"}, {"id"...                en   
1  [{"id": 2038, "name": "love of one's life"}, {...                en   
2  [{"id": 1416, "name": "jazz"}, {"id": 3017, "n...                en   
3  [{"id": 1582, "name": "saloon"}, {"id": 1701, ...                en   
4  [{"id": 9748, "name": "revenge"}, {"id": 9826,...                en   

    original_title                                           overview  \
0   Kate & Leopold  When her scientis

In [21]:
print(movies_cleaned.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3276 entries, 0 to 3275
Columns: 1279 entries, budget to Unnamed: 1264
dtypes: float64(3), int64(3), object(1273)
memory usage: 32.0+ MB
None


In [22]:
print(movies_cleaned.head(1)['overview'])

0    When her scientist ex-boyfriend discovers a po...
Name: overview, dtype: object


## Content-Based Recommendation System

Now let's make recommendations based on the movies' plot summaries, given in the `overview` column. If a user gives us a movie title, our goal is to recommend movies with similar plot summaries.
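
To make the idea concrete before working on the full dataset, here is a small sketch on three invented plot summaries (the texts are hypothetical, not taken from the TMDB data). TF-IDF turns each summary into a weighted word vector, and cosine similarity compares those vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three invented summaries: two about an alien planet, one unrelated
plots = [
    "A marine explores an alien planet and its jungle",
    "Explorers land on a distant alien planet",
    "A chef opens a small restaurant in Paris",
]

# Each row of the matrix is one summary, each column one vocabulary term
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(plots)

# Entry [i, j] is the cosine similarity between summaries i and j
scores = cosine_similarity(matrix)
print(scores.round(2))
```

The first two summaries share the terms "alien" and "planet", so they score higher with each other than either does with the third, which shares no terms with them.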

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
tfv = TfidfVectorizer(min_df=3,  max_features=None,
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3),
            stop_words = 'english')

In [28]:
# Fitting the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(movies_cleaned['overview'])
print(tfv_matrix)
print(tfv_matrix.shape)

ValueError: np.nan is an invalid document, expected byte or unicode string.

In [29]:
print(movies_cleaned['overview'].isna().sum())

1


In [30]:
# Fill NaN values with an empty string (assign back instead of using inplace on a column selection)
movies_cleaned['overview'] = movies_cleaned['overview'].fillna('')

In [31]:
# Drop rows with NaN values in the 'overview' column (a no-op here, since NaN values were filled in the previous cell)
movies_cleaned.dropna(subset=['overview'], inplace=True)

In [32]:
print(movies_cleaned['overview'].apply(type).value_counts())

<class 'str'>    3276
Name: overview, dtype: int64


In [33]:
movies_cleaned['overview'] = movies_cleaned['overview'].astype(str)

In [35]:
# Re-create tfv with default settings (this replaces the configured vectorizer from cell 17, so min_df, ngram_range and stop_words no longer apply)
tfv = TfidfVectorizer()

In [36]:
# Handle missing values
movies_cleaned['overview'] = movies_cleaned['overview'].fillna('')


In [37]:
# Ensure all entries are strings
movies_cleaned['overview'] = movies_cleaned['overview'].astype(str)

In [38]:
# Fit and transform the data
tfv_matrix = tfv.fit_transform(movies_cleaned['overview'])

In [39]:
# Output results
print(tfv_matrix)
print(tfv_matrix.shape)

  (0, 7424)	0.11298420645874493
  (0, 710)	0.13951473057861022
  (0, 6207)	0.11604170845739324
  (0, 17130)	0.07907700711009275
  (0, 322)	0.15603475637625414
  (0, 11272)	0.0817642040770832
  (0, 7418)	0.07876978933225338
  (0, 13089)	0.1044798686374166
  (0, 4670)	0.10869666323512633
  (0, 7223)	0.04860069018620823
  (0, 7757)	0.0972452726461527
  (0, 2330)	0.0543773397848142
  (0, 7406)	0.06285791678734366
  (0, 5751)	0.09679544412618696
  (0, 14034)	0.06551701283116608
  (0, 7127)	0.1599302507011789
  (0, 17197)	0.041021976995716934
  (0, 14628)	0.13645722857996193
  (0, 10323)	0.08683694042671004
  (0, 189)	0.15274116285385456
  (0, 15600)	0.07411277678419216
  (0, 14137)	0.1451204226386621
  (0, 6185)	0.09198206331879683
  (0, 13042)	0.1599302507011789
  (0, 15371)	0.08813225006748676
  :	:
  (3275, 12157)	0.13331677664517405
  (3275, 5490)	0.10656630990980957
  (3275, 8429)	0.13819446416234352
  (3275, 6039)	0.08898355110258703
  (3275, 8978)	0.10656630990980957
  (3275, 7028)	0

In [40]:
from sklearn.metrics.pairwise import sigmoid_kernel

In [41]:
# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
print(sig[0])

[0.7616182  0.76159524 0.76159469 ... 0.76159521 0.76159474 0.76159503]
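
Every score printed above is close to 0.7616, which is tanh(1). scikit-learn's `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)` with defaults `gamma = 1 / n_features` and `coef0 = 1`. TF-IDF rows are L2-normalized by default, so each dot product is at most 1. Judging by the printed column indices, the matrix has more than 17,000 features, so `gamma` is tiny and the scores barely move away from tanh(1). The sketch below checks the formula on a small hand-made matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

# Two orthogonal unit vectors and one that overlaps with both
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
])

# Library result with default gamma (1 / n_features) and coef0 (1)
K = sigmoid_kernel(X)

# The same quantity written out by hand
manual = np.tanh((X @ X.T) / X.shape[1] + 1.0)

print(np.allclose(K, manual))
# Orthogonal rows have a zero dot product, so their score is exactly tanh(1)
print(K[0, 1], np.tanh(1.0))
```

Because tanh is increasing, ranking by these scores gives the same order as ranking by the raw dot products. The values are simply compressed into a very narrow range.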


## Reverse mapping of indices and movie titles

In [42]:
# Reverse mapping of indices and movie titles
indices = pd.Series(movies_cleaned.index, index=movies_cleaned['original_title']).drop_duplicates()
print(indices)

original_title
Kate & Leopold                  0
Bedazzled                       1
The Cotton Club                 2
3:10 to Yuma                    3
Taken 3                         4
                             ... 
El Mariachi                  3271
Newlyweds                    3272
Signed, Sealed, Delivered    3273
Shanghai Calling             3274
My Date with Drew            3275
Length: 3276, dtype: int64


In [43]:
print(indices['Newlyweds'])

3272
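
One detail worth noting: `Series.drop_duplicates()` removes repeated values. The values here are row labels, which are already unique, so the call does not remove repeated titles. If two films shared a title, `indices[title]` would return a Series instead of a single integer. A possible alternative, sketched on a hypothetical toy frame, deduplicates on the index:

```python
import pandas as pd

# Hypothetical toy frame with one repeated title
toy = pd.DataFrame({"original_title": ["Heat", "Solaris", "Heat"]})

# Map title -> row position; positions are what a similarity matrix is indexed by
positions = pd.Series(range(len(toy)), index=toy["original_title"])

# Keep the first occurrence of each title so a lookup returns one integer
positions = positions[~positions.index.duplicated(keep="first")]

print(positions["Heat"], positions["Solaris"])
```

Keeping the first occurrence is an arbitrary choice made for this sketch. Disambiguating by release year would be another option.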


In [45]:
indices = pd.Series(movies_cleaned.index, index=movies_cleaned['original_title']).drop_duplicates()
print(indices)
print(indices['Newlyweds'])
print(sig[4799])
print(list(enumerate(sig[indices['Newlyweds']])))
print(sorted(list(enumerate(sig[indices['Newlyweds']])), key=lambda x: x[1], reverse=True))

original_title
Kate & Leopold                  0
Bedazzled                       1
The Cotton Club                 2
3:10 to Yuma                    3
Taken 3                         4
                             ... 
El Mariachi                  3271
Newlyweds                    3272
Signed, Sealed, Delivered    3273
Shanghai Calling             3274
My Date with Drew            3275
Length: 3276, dtype: int64
3272


IndexError: index 4799 is out of bounds for axis 0 with size 3276

In [46]:
indices = pd.Series(movies_cleaned.index, index=movies_cleaned['original_title']).drop_duplicates()
print(indices)

title = 'Newlyweds'
if title in indices:
    idx = indices[title]
    if idx < sig.shape[0]:
        print(sig[idx])
        sim_scores = list(enumerate(sig[idx]))
        print(sim_scores)
        sorted_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        print(sorted_scores)
    else:
        print(f"Index {idx} is out of bounds for sig matrix.")
else:
    print(f"Title '{title}' not found in indices.")

original_title
Kate & Leopold                  0
Bedazzled                       1
The Cotton Club                 2
3:10 to Yuma                    3
Taken 3                         4
                             ... 
El Mariachi                  3271
Newlyweds                    3272
Signed, Sealed, Delivered    3273
Shanghai Calling             3274
My Date with Drew            3275
Length: 3276, dtype: int64
[0.76159426 0.7615944  0.7615945  ... 0.76159486 0.76159458 0.76159439]
[(0, 0.7615942631812533), (1, 0.7615944012876541), (2, 0.7615944980079399), (3, 0.761594461091149), (4, 0.761594700548365), (5, 0.7615945657923853), (6, 0.7615943016298923), (7, 0.7615946000145005), (8, 0.7615943656596972), (9, 0.7615944710041125), (10, 0.7615945098571912), (11, 0.7615947829626022), (12, 0.7615947669319096), (13, 0.761594748746191), (14, 0.7615941559557649), (15, 0.7615943281059172), (16, 0.761595644529789), (17, 0.7615943939732779), (18, 0.7615945622793554), (19, 0.7615945861661553), (20, 

In [47]:
def give_recomendations(title, sig=sig):
    # Get the index corresponding to original_title
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return movies_cleaned['original_title'].iloc[movie_indices]

In [81]:
print(give_recomendations('Avatar'))

KeyError: 'Avatar'

In [84]:
from sklearn.metrics.pairwise import cosine_similarity

In [85]:
# Compute similarity matrix
cosine_sim = cosine_similarity(tfv_matrix, tfv_matrix)

In [86]:
def get_recommendations(title, cosine_sim=cosine_sim):
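    # Get the index of the movie. Caution: this position comes from `movies` (4803 rows),
    # while cosine_sim was built from `movies_cleaned` (3276 rows), so the two do not line up.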
    idx = movies.index[movies['original_title'] == title].tolist()[0]
    
    # Get similarity scores for the movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the indices of the top 10 most similar movies
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return movies['original_title'].iloc[movie_indices]

# Test the function
print(get_recommendations('Avatar'))

1598                                Drag Me to Hell
2038                                  Summer of Sam
2536                                The Deer Hunter
451                                    The Haunting
3115    An Alan Smithee Film: Burn, Hollywood, Burn
1635                        The Replacement Killers
630                                 What Women Want
703                                Two Weeks Notice
258                                    The Smurfs 2
1846          Anchorman: The Legend of Ron Burgundy
Name: original_title, dtype: object
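
The list above should be read with caution. The earlier `KeyError` arose because 'Avatar' is not among the 3276 merged rows. In `get_recommendations`, the row position is then looked up in `movies`, while `cosine_sim` was computed from `movies_cleaned`. Row 0 of that matrix belongs to "Kate & Leopold", not "Avatar". A minimal sketch of a lookup that takes titles and similarities from the same frame and returns an empty result for unknown titles is shown below. The toy data and the name `recommend` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for movies_cleaned
films = pd.DataFrame({
    "original_title": ["Space Colony", "Alien Planet", "Paris Kitchen", "Bistro Nights"],
    "overview": [
        "Settlers build a colony on an alien planet",
        "Explorers crash on an alien planet",
        "A chef opens a restaurant in Paris",
        "A young chef saves a failing restaurant",
    ],
})

# The similarity matrix is built from the same frame the titles are read from
matrix = TfidfVectorizer(stop_words="english").fit_transform(films["overview"])
sim = cosine_similarity(matrix)


def recommend(title, frame=films, sim=sim, top_n=2):
    # Positional lookup, independent of the frame's index labels
    matches = np.flatnonzero(frame["original_title"].to_numpy() == title)
    if len(matches) == 0:
        return pd.Series(dtype=object)  # unknown title: empty result, no exception
    idx = matches[0]
    # Sort positions by descending similarity and skip the query film itself
    order = [i for i in np.argsort(sim[idx])[::-1] if i != idx][:top_n]
    return frame["original_title"].iloc[order]


print(recommend("Space Colony").tolist())
print(recommend("Avatar").tolist())
```

Applied to the real data, the same pattern would use `movies_cleaned` for both the TF-IDF matrix and the title lookup.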


## Summary

> Data Preparation: Clean and format your data for analysis.

> Recommendation Approaches: Understand and implement different recommendation techniques.

> Model Training and Evaluation: Build, test, and refine your recommendation models.