This Jupyter notebook builds a movie recommendation system from movies watched between 2021 and 2024.
The recommendations are content-based: they are derived from the content of the movies a person watches.

Let's begin by installing all the tools and libraries we are going to need.

In [None]:
!pip install pandas scikit-learn streamlit
!pip install scikit-surprise       # For collaborative filtering
!pip install matplotlib seaborn    # For any visualization
!pip install requests              # For fetching movie posters from TMDb API


In [3]:
# loading data.
import pandas as pd

movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
tags = pd.read_csv("tags.csv")


In [4]:
# Load the links data.
links = pd.read_csv("links.csv")

In [5]:
# Display the first five rows of each dataset
print("🎬 Movies:")
print(movies.head())

print("\n⭐ Ratings:")
print(ratings.head())

print("\n🏷️ Tags:")
print(tags.head())

print("\n🔗 Links:")
print(links.head())


🎬 Movies:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

⭐ Ratings:
   userId  movieId  rating  timestamp
0       1       17     4.0  944249077
1       1       25     1.0  944250228
2       1       29     2.0  943230976
3       1       30     5.0  944249077
4       1       32     5.0  943228858

🏷️ Tags:
   userId  movieId          tag   timestamp
0      22    26479  Kevin Kline  1583038886
1      22    79592     misogyny  1581476297
2      22   2

The next step is to merge and prepare the data for content-based filtering, since we are going to
recommend movies based on their content.

In [7]:
# Clean genres by replacing "|" with spaces
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

# Merge tags into a single string per movie,
# replacing null (NaN) tags with empty strings first
tags['tag'] = tags['tag'].fillna('').astype(str)
tags_grouped = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x)).reset_index()

# Merge tags into movies
movies = pd.merge(movies, tags_grouped, on='movieId', how='left')

# Fill missing tags with empty string
movies['tag'] = movies['tag'].fillna('')

# Combine genres and tags into a new 'content' column
movies['content'] = movies['genres'] + ' ' + movies['tag']

# Preview the new data
print("\n🎬 Movies with Combined Content:")
print(movies[['movieId', 'title', 'genres', 'tag', 'content']].head())



🎬 Movies with Combined Content:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  \
0  Adventure Animation Children Comedy Fantasy   
1                   Adventure Children Fantasy   
2                               Comedy Romance   
3                         Comedy Drama Romance   
4                                       Comedy   

                                                 tag  \
0  children Disney animation children Disney Disn...   
1  Robin Williams fantasy Robin Williams time tra...   
2  comedinha de velhinhos engraçada comedinha ...   
3  characters slurs based on novel or book chick ...   
4  Fantasy pregnancy remake family Steve Martin s...   
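
The printed preview cuts off before the `content` column, so here is a minimal sketch of the same preparation steps on two made-up rows. The `toy_movies` and `toy_tags` frames are hypothetical stand-ins for `movies.csv` and `tags.csv`, and the final `.str.strip()` is an extra tidy-up that the cell above does not do.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads movies.csv and tags.csv.
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation", "Adventure|Children"],
})
toy_tags = pd.DataFrame({
    "movieId": [1, 1],
    "tag": ["Disney", None],  # one missing tag, as can occur in tags.csv
})

# Same steps as the cell above: clean genres, fill missing tags, group per movie.
toy_movies["genres"] = toy_movies["genres"].str.replace("|", " ", regex=False)
toy_tags["tag"] = toy_tags["tag"].fillna("").astype(str)
grouped = toy_tags.groupby("movieId")["tag"].agg(" ".join).reset_index()

# The left join keeps movie 2 even though it has no tags at all.
merged = toy_movies.merge(grouped, on="movieId", how="left")
merged["tag"] = merged["tag"].fillna("")

# Assumption: strip the stray space left behind when tags are empty.
merged["content"] = (merged["genres"] + " " + merged["tag"]).str.strip()

print(merged[["title", "content"]])
```

A movie without tags ends up with a `content` string made of its genres only, which is why the left join and the second `fillna('')` matter.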

                            

In [8]:
# 1️⃣ Import TF-IDF and similarity tools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Use only the first 10,000 rows to save memory.
movies = movies.head(10000)
# 2️⃣ Create a TF-IDF vectorizer and transform the content
tfidf = TfidfVectorizer(stop_words='english')        # Removes common English words like 'the', 'is', etc.
tfidf_matrix = tfidf.fit_transform(movies['content'])  # Converts movie descriptions into vectors

# 3️⃣ Check the shape of the matrix (num_movies x num_features)
print("\n🔢 TF-IDF Matrix Shape:", tfidf_matrix.shape)

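# 4️⃣ Compute cosine similarity between all movies
# (TF-IDF rows are L2-normalised by default, so the dot product from linear_kernel equals cosine similarity)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)  # Each value [i][j] tells how similar movie i is to movie j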

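# 5️⃣ Create a reverse lookup map: title → index
# (drop repeated titles so each lookup returns a single row index)
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]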

# 6️⃣ Select a movie title to test the recommender
movie_title = "House of Cards (1993)"      # Change this to test with other movies
idx = indices[movie_title]

# 7️⃣ Get pairwise similarity scores of that movie with all others
sim_scores = list(enumerate(cosine_sim[idx]))

# 8️⃣ Sort movies by similarity score (most similar first)
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

# 9️⃣ Get the indices of the top 5 most similar movies (excluding itself)
sim_scores = sim_scores[1:6]    # Skip the first one (the movie itself)
movie_indices = [i[0] for i in sim_scores]

# 🔟 Finally, print the recommended movie titles
print("\n🎯 Because you liked:", movie_title)
print("👉 You might also like:")
for i, title in enumerate(movies['title'].iloc[movie_indices], 1):
    print(f"{i}. {title}")



🔢 TF-IDF Matrix Shape: (10000, 31369)

🎯 Because you liked: House of Cards (1993)
👉 You might also like:
1. Isn't She Great? (2000)
2. Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (2002)
3. Black Moon Rising (1986)
4. Man of the House (2005)
5. U.S. Marshals (1998)
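
The cell above calls `linear_kernel` but describes the result as cosine similarity. The sketch below, on a hypothetical three-document corpus standing in for `movies['content']`, checks why that works: with the default `norm='l2'` setting of `TfidfVectorizer`, every row has unit length, so the plain dot product and the cosine similarity should coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for movies['content'].
docs = [
    "Adventure Animation Children Disney",
    "Adventure Children Fantasy Robin Williams",
    "Comedy Romance",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Euclidean length of each TF-IDF row (expected to be 1 with norm='l2').
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)))

# Dot product versus explicit cosine similarity on the same matrix.
dot = linear_kernel(X, X)
cos = cosine_similarity(X, X)

print("Rows have unit length:", np.allclose(row_norms, 1.0))
print("Dot product equals cosine similarity:", np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the safer call.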


Instead of computing the full cosine similarity matrix, we can run NearestNeighbors on the same TF-IDF matrix, which is more memory-efficient for a large dataset like ours.

In [24]:

# 4️⃣ Use Nearest Neighbors instead of full cosine similarity matrix
from sklearn.neighbors import NearestNeighbors

# Use cosine distance (1 - cosine similarity)
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(tfidf_matrix)

# -------------------------------
# 6️⃣ Pick a movie and find its index
# -------------------------------
movie_title = "House of Cards (1993)"  # Change this to a movie you like.
idx = indices[movie_title]
# 7️⃣ Find top 5 nearest neighbors (excluding itself)

distances, indices_recommended = nn.kneighbors(tfidf_matrix[idx], n_neighbors=6)

# 8️⃣ Finally, print the top recommended titles
print("\n🎯 Because you liked:", movie_title)
print("👉 You might also like:")
# Use a separate loop variable so the query index idx is not overwritten
for i, rec_idx in enumerate(indices_recommended[0][1:], 1):  # Skip the first (the movie itself)
    print(f"{i}. {movies['title'].iloc[rec_idx]}")



🎯 Because you liked: House of Cards (1993)
👉 You might also like:
1. Isn't She Great? (2000)
2. Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (2002)
3. Black Moon Rising (1986)
4. Man of the House (2005)
5. U.S. Marshals (1998)
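
Both approaches return the same five titles, so the difference is mainly memory. As a rough back-of-the-envelope sketch, assuming the similarity matrix is stored as a dense float64 array (8 bytes per value) and ignoring any temporary buffers scikit-learn may allocate:

```python
n_movies = 10_000          # rows kept by movies.head(10000)
bytes_per_float64 = 8      # assumption: dense float64 similarity values

# Full pairwise matrix: one value for every (movie, movie) pair.
full_matrix_bytes = n_movies * n_movies * bytes_per_float64
print(f"Full similarity matrix: {full_matrix_bytes / 1e9:.1f} GB")

# A single nearest-neighbour query compares one movie against all others.
one_row_bytes = n_movies * bytes_per_float64
print(f"One row of distances:   {one_row_bytes / 1e3:.1f} kB")
```

The full matrix grows with the square of the number of movies, while a per-query lookup grows only linearly, which is why the nearest-neighbour version scales better once the 10,000-row limit is lifted.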


Let's now build the UI for the movie recommender.

In [9]:
# Write Streamlit app to app.py
with open("app.py", "w", encoding="utf-8") as f:
    f.write('''\
import streamlit as st
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Load data
movies = pd.read_csv("movies.csv")
tags = pd.read_csv("tags.csv")

# Clean and prepare
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)
tags['tag'] = tags['tag'].fillna('').astype(str)
tags_grouped = tags.groupby('movieId')['tag'].apply(lambda x: ' '.join(x)).reset_index()
movies = pd.merge(movies, tags_grouped, on='movieId', how='left')
movies['tag'] = movies['tag'].fillna('')
movies['content'] = movies['genres'] + ' ' + movies['tag']

# TF-IDF
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['content'])

# NearestNeighbors
nn = NearestNeighbors(metric='cosine', algorithm='brute')
nn.fit(tfidf_matrix)

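# Reverse index (drop repeated titles so each lookup returns one row index)
title_to_index = pd.Series(movies.index, index=movies['title'])
title_to_index = title_to_index[~title_to_index.index.duplicated(keep='first')]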

# Streamlit UI
st.title("🎬 Movie Recommendation System")
st.write("Select a movie to get similar recommendations:")

selected_movie = st.selectbox("Choose a movie", movies['title'].sort_values())

if st.button("Recommend"):
    idx = title_to_index[selected_movie]
    distances, indices_recommended = nn.kneighbors(tfidf_matrix[idx], n_neighbors=6)

    st.subheader("🎯 Because you liked: " + selected_movie)
    st.write("👉 You might also like:")
    for i, rec_idx in enumerate(indices_recommended[0][1:], 1):
        st.write(f"{i}. {movies['title'].iloc[rec_idx]}")
''')


In [11]:
!streamlit run app.py

^C
