3. Develop a movie recommendation model using the scikit-learn library in Python. Refer to the dataset.


Data Exploration 

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns 

In [4]:
df = pd.read_csv("movie_dataset.csv")

In [5]:
df.head(5)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


Data Preprocessing

In [6]:
# Fill missing values
df['overview'] = df['overview'].fillna('')
df['genres'] = df['genres'].fillna('')
df['title'] = df['title'].fillna('')

# Combine fields for TF-IDF
df['combined'] = df['title'] + ' ' + df['genres'] + ' ' + df['overview']


Data Splitting
This is unsupervised learning: the dataset has no labels to predict. Instead of a train-test split, we'll evaluate the quality of the similarity results.

A supervised model that predicts user ratings would need a split. For now, we'll simulate "query & response".

4. Algorithm Selection (TF-IDF + Cosine Similarity)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF vectorization
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['combined'])

# Cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
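
To make the TF-IDF + cosine similarity step easier to inspect, here is a minimal sketch on a tiny corpus. The three strings in `toy_docs` are hypothetical stand-ins for `df['combined']`, not rows from the dataset; the `toy_` names are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for df['combined']
toy_docs = [
    "space war on a distant colony",
    "a war hero travels through space",
    "a romantic comedy set in Paris",
]

# Each document becomes one row of TF-IDF weights (sparse matrix)
toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Entry [i, j] is the cosine similarity between documents i and j
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The first two documents share the terms "space" and "war", so their similarity should be positive, while the third shares no terms with the first and should score zero against it. Each document has similarity 1 with itself.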


5. Model Training
No explicit training step is needed here. Fitting the TF-IDF vectorizer and computing the pairwise cosine similarities is the whole model.

6. Model Evaluation
You can simulate evaluation by checking whether the recommendations for a known movie make sense.


In [8]:
# Map each title to its row index, keeping the first row for duplicate titles
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

# Recommend function
def recommend_movies(title, num_recommendations=5):
    if title not in indices:
        return f"Movie '{title}' not found in dataset."
    idx = indices[title]
    # Pair every other movie with its similarity to the query movie
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:num_recommendations]
    movie_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movie_indices]


# Try it with a known title

In [9]:
print(recommend_movies("The Dark Knight"))


3                         The Dark Knight Rises
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
299                              Batman Forever
428                              Batman Returns
Name: title, dtype: object
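
Eyeballing the list can be complemented with a rough numeric check. The sketch below scores genre overlap (Jaccard) between a query movie and a list of recommendations. The helper names `genre_jaccard` and `mean_genre_overlap` are hypothetical, and the example recommendation list is made up from genre strings in the `df.head(5)` output, not produced by `recommend_movies`.

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard overlap between two space-separated genre strings.

    Note: multi-word genres such as "Science Fiction" are split into
    separate tokens, because the dataset stores genres space-separated.
    """
    a, b = set(genres_a.split()), set(genres_b.split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_genre_overlap(query_genres, recommended_genres):
    """Average Jaccard overlap between the query and each recommendation."""
    scores = [genre_jaccard(query_genres, g) for g in recommended_genres]
    return sum(scores) / len(scores) if scores else 0.0


# Genre strings copied from the df.head(5) output; the pairing is illustrative
query = "Action Adventure Fantasy Science Fiction"
hypothetical_recs = [
    "Action Adventure Science Fiction",
    "Action Adventure Crime",
]
score = mean_genre_overlap(query, hypothetical_recs)
print(round(score, 3))
```

With the real data, the same function could be fed `df['genres']` for the query row and for the rows returned by `recommend_movies`. A higher average suggests the recommendations stay close to the query's genres, though it says nothing about plot or tone.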


7. Comparison & Selection
To compare, you could:

Try CountVectorizer instead of TF-IDF.

Try using genre only vs full text.

Try including the cast and director columns, which this dataset provides.

We’ll assume TF-IDF with cosine similarity is our best model here.
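
The first comparison listed above could be sketched as follows. This is an illustration under assumptions: `top_matches` is a hypothetical helper, and `toy_docs` is a made-up corpus rather than the real `df['combined']` column.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_matches(vectorizer, docs, query_idx, k=2):
    """Return indices of the k documents most similar to docs[query_idx]."""
    matrix = vectorizer.fit_transform(docs)
    # Slice (not index) so the query stays a 2-D, single-row matrix
    scores = cosine_similarity(matrix[query_idx:query_idx + 1], matrix).ravel()
    ranked = [int(i) for i in scores.argsort()[::-1] if i != query_idx]
    return ranked[:k]


# Hypothetical mini-corpus for illustration only
toy_docs = [
    "dark knight batman gotham crime",
    "batman returns to gotham",
    "pirates sail the ocean",
    "a spy receives a cryptic message",
]

for name, vec in [
    ("CountVectorizer", CountVectorizer(stop_words='english')),
    ("TfidfVectorizer", TfidfVectorizer(stop_words='english')),
]:
    print(name, top_matches(vec, toy_docs, query_idx=0, k=1))
```

On a corpus this small both vectorizers should pick the same nearest document. Differences would be expected to show up on the real dataset, where TF-IDF down-weights terms that appear in many overviews.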

8. Storing the Model

In [16]:
import pickle

# Save the model and data
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

with open('cosine_sim_matrix.pkl', 'wb') as f:
    pickle.dump(cosine_sim, f)

# Note: this overwrites the source CSV, which now also contains the 'combined' column
df.to_csv('movie_dataset.csv', index=False)


9. Loading the Model

In [17]:
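import pickle
import pandas as pd

# Load saved model and data
with open('tfidf_vectorizer.pkl', 'rb') as f:
    tfidf = pickle.load(f)

with open('cosine_sim_matrix.pkl', 'rb') as f:
    cosine_sim = pickle.load(f)

df = pd.read_csv('movie_dataset.csv')
# Rebuild the title lookup, keeping the first row for duplicate titles
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]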
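
One possible alternative to pickling the full similarity matrix, shown here only as a sketch, is to store the sparse TF-IDF matrix and compute a single row of similarities on demand. The file name `tfidf_matrix.npz`, the temporary directory, and the `toy_` corpus are assumptions for this illustration and are not part of the notebook above.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for df['combined']
toy_docs = [
    "space war on a distant colony",
    "a war hero travels through space",
    "a romantic comedy set in Paris",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_docs)

# Save the sparse matrix to a temporary location and load it back
path = os.path.join(tempfile.mkdtemp(), 'tfidf_matrix.npz')
sparse.save_npz(path, toy_matrix)
loaded = sparse.load_npz(path)

# Similarities for one query row only, instead of the full n x n matrix
row_scores = cosine_similarity(loaded[0:1], loaded).ravel()
print(row_scores.round(2))
```

The trade-off is a small amount of computation per query in exchange for a file that grows with the number of non-zero TF-IDF entries rather than with the square of the number of movies.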


10. Interacting with the Model 

In [18]:
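movie_name = input("Enter a movie you like: ").strip()
recommendations = recommend_movies(movie_name, 5)
if isinstance(recommendations, str):
    # recommend_movies returns a message string when the title is unknown
    print(recommendations)
else:
    print("\nTop 5 recommendations:")
    print(recommendations)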



Top 5 recommendations:
3                         The Dark Knight Rises
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
299                              Batman Forever
428                              Batman Returns
Name: title, dtype: object
