<a href="https://colab.research.google.com/github/jaydeepika73/Movie-Recommendation-System/blob/main/Movie_Recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Movie Recommendation System**

**Objective**

The objective of a movie recommendation system is to predict and suggest movies that a user is likely to enjoy based on their past viewing history and preferences, as well as similarities with other users' behaviors. This aims to enhance user satisfaction and engagement by providing personalized content.

**Data Source**

The data behind a movie recommendation system typically includes user ratings, movie metadata (such as genre, director, cast, and release year), and user demographics. The MovieLens datasets are a common choice because they provide extensive user-movie rating data; the code below assumes MovieLens-style `ratings.csv` and `movies.csv` files.
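
As a minimal sketch of what such rating data looks like in practice, the (userId, movieId, rating) triples can be arranged into a user-item matrix with one row per user and one column per movie. The toy triples and the helper name `build_user_item_matrix` are hypothetical and serve only as an illustration; real MovieLens data is far larger and would normally be stored in a sparse format.

```python
import numpy as np

# Hypothetical toy ratings as (userId, movieId, rating) triples
ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0), (3, 30, 2.0)]

def build_user_item_matrix(triples):
    """Return a dense user-item matrix with NaN for missing ratings."""
    users = sorted({u for u, _, _ in triples})
    items = sorted({m for _, m, _ in triples})
    user_pos = {u: i for i, u in enumerate(users)}
    item_pos = {m: j for j, m in enumerate(items)}
    matrix = np.full((len(users), len(items)), np.nan)
    for u, m, r in triples:
        matrix[user_pos[u], item_pos[m]] = r
    return matrix, users, items

matrix, users, items = build_user_item_matrix(ratings)

# Fraction of user-movie pairs that have no rating
sparsity = np.isnan(matrix).mean()
print(matrix)
print(f"Sparsity: {sparsity:.2f}")
```

Most entries are missing even in this tiny example, which is why recommenders predict the empty cells rather than read them off.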

**Import library**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn utilities (used for the pandas-based split and the MSE metric)
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Surprise (scikit-surprise) for collaborative filtering.
# Note: surprise.model_selection also defines train_test_split; later cells
# re-import whichever version they need.
from surprise import Dataset, Reader, SVD, accuracy


In [None]:
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# Load the movie rating data
# Assumes ratings.csv (the same file loaded in the Import Dataset section) has columns: userId, movieId, rating
df = pd.read_csv('ratings.csv')
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], Reader(rating_scale=(0.5, 5.0)))

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.25)

# Use the SVD algorithm for recommendations
algo = SVD()

# Train the algorithm on the trainset
algo.fit(trainset)

# Test the algorithm on the testset
predictions = algo.test(testset)

# Compute and print the RMSE
accuracy.rmse(predictions)
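
To make the idea behind `SVD()` more concrete: the Surprise documentation describes its prediction as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors. The following is a rough NumPy sketch of that model trained with stochastic gradient descent. It is not Surprise's implementation; the toy triples, the function names, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

# Toy (user_index, item_index, rating) triples with zero-based indices (illustrative data)
triples = [(0, 0, 5.0), (0, 1, 3.0), (0, 2, 4.0), (1, 0, 4.0),
           (1, 3, 2.0), (2, 1, 5.0), (2, 2, 3.0), (3, 4, 4.0)]

def train_mf(triples, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on the regularized squared error."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))  # global mean rating
    bu = np.zeros(n_users)                           # user biases
    bi = np.zeros(n_items)                           # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def rmse(model, triples):
    errors = [r - predict(model, u, i) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))

model = train_mf(triples, n_users=4, n_items=5)
baseline_rmse = float(np.sqrt(np.mean([(r - model[0]) ** 2 for _, _, r in triples])))
train_rmse = rmse(model, triples)
print(f"Mean-only RMSE: {baseline_rmse:.3f}, training RMSE after SGD: {train_rmse:.3f}")
```

The comparison is made on the training data only, so it shows that the model fits, not that it generalizes; the held-out RMSE in the cell above is the number to trust.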


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

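# Assume you have a pandas dataframe 'movies' with columns: 'title' and 'description'
# (a MovieLens movies.csv has no 'description' column, so it must come from another source)

# Create TF-IDF matrix for movie descriptions (missing descriptions become empty strings)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['description'].fillna(''))

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Function to get movie recommendations based on the cosine similarity score
def get_recommendations(title, cosine_sim=cosine_sim):
    # Find the first movie with this title and fail clearly if there is none
    matches = movies.index[movies['title'] == title]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title}")
    idx = movies.index.get_loc(matches[0])  # positional row in cosine_sim
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip position 0, normally the movie itself
    movie_indices = [i[0] for i in sim_scores]
    return movies['title'].iloc[movie_indices]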

# Example usage
print(get_recommendations('The Godfather'))
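
The cell above calls `linear_kernel` (a plain dot product) yet labels the result cosine similarity. That works because `TfidfVectorizer` L2-normalizes each row by default, and the dot product of unit-length vectors equals their cosine similarity. The small NumPy check below illustrates the identity on a made-up matrix; the values are arbitrary and stand in for TF-IDF weights.

```python
import numpy as np

# Arbitrary non-negative "term weight" rows, standing in for TF-IDF vectors
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 4.0]])

# L2-normalize each row, as TfidfVectorizer does with its default norm='l2'
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of normalized rows (what a linear kernel computes)
dot_sim = X_norm @ X_norm.T

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cos_sim = np.array([[cosine(a, b) for b in X] for a in X])

print(np.allclose(dot_sim, cos_sim))
```

If the vectorizer were created with `norm=None`, the two would no longer agree and `cosine_similarity` from scikit-learn would be the safer call.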


**Import Dataset**

In [None]:
# Import necessary libraries
import pandas as pd

# Define the file paths (assuming the files are in the same directory as your script)
ratings_file = 'ratings.csv'
movies_file = 'movies.csv'

# Load the datasets
ratings = pd.read_csv(ratings_file)
movies = pd.read_csv(movies_file)

# Display the first few rows of each dataset to verify
print("Ratings Data:")
print(ratings.head())

print("\nMovies Data:")
print(movies.head())


**Describe Data**

In [None]:
# Example of data description for a movie recommendation system

import pandas as pd

# Load the data
movies_df = pd.read_csv('movies.csv')        # Information about the movies
ratings_df = pd.read_csv('ratings.csv')      # User ratings for the movies
users_df = pd.read_csv('users.csv')          # Information about the users

# Display the first few rows of each DataFrame to understand their structure
print("Movies DataFrame:")
print(movies_df.head())

print("\nRatings DataFrame:")
print(ratings_df.head())

print("\nUsers DataFrame:")
print(users_df.head())

# Movies DataFrame columns (MovieLens-style names, matching the merges on 'movieId' elsewhere in this notebook):
# - movieId: unique identifier for each movie
# - title: title of the movie
# - genres: genres associated with the movie (separated by | if multiple)

# Ratings DataFrame columns:
# - userId: unique identifier for each user
# - movieId: unique identifier for each movie
# - rating: rating given by the user to the movie (0.5 to 5.0 on the scale passed to Reader earlier; some MovieLens releases use 1 to 5)
# - timestamp: time at which the rating was given

# Users DataFrame columns:
# - user_id: unique identifier for each user
# - age: age of the user
# - gender: gender of the user (e.g., M or F)
# - occupation: occupation of the user
# - zip_code: ZIP code of the user


**Data Visualization**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

# Merge datasets
data = pd.merge(ratings, movies, on='movieId')

# Visualization 1: Distribution of movie ratings
plt.figure(figsize=(10, 6))
sns.histplot(data['rating'], bins=10, kde=False)
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

# Visualization 2: Top 10 Most Rated Movies
top_movies = data['title'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_movies.values, y=top_movies.index, palette='viridis')
plt.title('Top 10 Most Rated Movies')
plt.xlabel('Number of Ratings')
plt.ylabel('Movie Title')
plt.show()

# Visualization 3: Top 10 Movies by Average Rating (computed over all movies, regardless of how many ratings each has)
top_rated_movies = data.groupby('title')['rating'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_rated_movies.values, y=top_rated_movies.index, palette='viridis')
plt.title('Top 10 Highest Rated Movies')
plt.xlabel('Average Rating')
plt.ylabel('Movie Title')
plt.show()

# Visualization 4: Rating Distribution by Genre
data['genres'] = data['genres'].str.split('|')
genres_data = data.explode('genres')

plt.figure(figsize=(14, 8))
sns.boxplot(x='rating', y='genres', data=genres_data, palette='Set2')
plt.title('Rating Distribution by Genre')
plt.xlabel('Rating')
plt.ylabel('Genre')
plt.show()

# Visualization 5: Number of Ratings per Genre
genre_counts = genres_data['genres'].value_counts()
plt.figure(figsize=(14, 8))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='Set3')
plt.title('Number of Ratings per Genre')
plt.xlabel('Number of Ratings')
plt.ylabel('Genre')
plt.show()
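
A caveat on ranking movies by average rating: when the mean is taken over all titles, movies with only one or two ratings can dominate the list. One common remedy is to require a minimum number of ratings before a title is ranked. The sketch below shows the idea on a made-up DataFrame; the threshold `min_ratings = 3` and the toy data are assumptions, and a sensible cutoff depends on the dataset.

```python
import pandas as pd

# Made-up merged ratings: movie A has a single perfect score
data = pd.DataFrame({
    'title':  ['A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'rating': [5.0, 4.0, 5.0, 4.0, 3.0, 4.0, 3.0, 4.0],
})

# Mean rating and number of ratings per title
stats = data.groupby('title')['rating'].agg(['mean', 'count'])

# Keep only titles with enough ratings, then rank by mean
min_ratings = 3  # hypothetical threshold
top_rated = (stats[stats['count'] >= min_ratings]
             .sort_values('mean', ascending=False))

print(top_rated)
```

The filtered `top_rated['mean']` Series can be passed to `sns.barplot` in the same way as the unfiltered one.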


**Data Preprocessing**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load datasets
movies = pd.read_csv('movies.csv')  # Movies metadata
ratings = pd.read_csv('ratings.csv')  # User ratings

# Display first few rows of the datasets
print("Movies DataFrame:")
print(movies.head())

print("\nRatings DataFrame:")
print(ratings.head())

# Merge the datasets on 'movieId'
merged_df = pd.merge(ratings, movies, on='movieId')

# Display first few rows of the merged dataset
print("\nMerged DataFrame:")
print(merged_df.head())

# Drop unnecessary columns (if any)
# For example, if there's a 'timestamp' column in the ratings data
merged_df = merged_df.drop(columns=['timestamp'])

# Display the data after dropping unnecessary columns
print("\nDataFrame after dropping unnecessary columns:")
print(merged_df.head())

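# Encode categorical data (e.g., genres)
# Multi-hot encoding for 'genres': one indicator column per genre, still one row per rating
# (exploding the genres first would duplicate each rating once per genre and leak rows across the split)
genre_dummies = merged_df['genres'].str.get_dummies(sep='|').add_prefix('genres_')
merged_df = pd.concat([merged_df.drop(columns=['genres']), genre_dummies], axis=1)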

# Display the data after encoding
print("\nDataFrame after encoding genres:")
print(merged_df.head())

# Split the data into training and testing sets
train_data, test_data = train_test_split(merged_df, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print(f"\nTraining data shape: {train_data.shape}")
print(f"Testing data shape: {test_data.shape}")

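# Standardize the ratings (fit the scaler on the training set only)
# Work on explicit copies to avoid pandas' SettingWithCopyWarning
train_data = train_data.copy()
test_data = test_data.copy()
scaler = StandardScaler()
train_data['rating'] = scaler.fit_transform(train_data[['rating']]).ravel()
test_data['rating'] = scaler.transform(test_data[['rating']]).ravel()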

# Display the first few rows of the standardized data
print("\nTraining data after standardization:")
print(train_data.head())

print("\nTesting data after standardization:")
print(test_data.head())


**Define Target Variable (y) and Feature Variables (X)**



In [None]:
import pandas as pd

# Example dataset with user ratings
data = {
    'user_id': [1, 1, 1, 2, 2, 3, 3, 4],
    'movie_id': [101, 102, 103, 101, 104, 102, 103, 105],
    'rating': [5, 3, 4, 4, 2, 5, 3, 4],
    'timestamp': [964982703, 964981247, 964982224, 964983815, 964982931, 964982791, 964981632, 964982176]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define the target variable (y) - Ratings
y = df['rating']

# Define feature variables (X) - user_id, movie_id, and timestamp
X = df[['user_id', 'movie_id', 'timestamp']]

print("Feature Variables (X):")
print(X.head())

print("\nTarget Variable (y):")
print(y.head())


**Train Test Split**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming you have a DataFrame 'ratings' with columns: 'userId', 'movieId', 'rating'
# Sample data loading, replace this with actual data loading
data = {
    'userId': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'movieId': [1, 2, 2, 3, 3, 4, 4, 5, 5, 1],
    'rating': [4, 5, 5, 3, 4, 2, 3, 4, 5, 3]
}
ratings = pd.DataFrame(data)

# Perform train-test split
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

# Display the results
print("Train set:")
print(train)
print("\nTest set:")
print(test)
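
With a purely random split, a user can end up with all of their ratings in the test set, leaving the model nothing to learn about them. One possible alternative, sketched here in plain Python, holds out a single rating per user and keeps the rest for training. The helper name `leave_one_out_split` is hypothetical, and the triples mirror the sample data above.

```python
import random
from collections import defaultdict

# (userId, movieId, rating) triples matching the sample DataFrame above
triples = [(1, 1, 4), (1, 2, 5), (2, 2, 5), (2, 3, 3), (3, 3, 4),
           (3, 4, 2), (4, 4, 3), (4, 5, 4), (5, 5, 5), (5, 1, 3)]

def leave_one_out_split(triples, seed=42):
    """Hold out one randomly chosen rating per user; users with one rating stay in train."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in triples:
        by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        rows = rows[:]            # copy so the input is not mutated
        rng.shuffle(rows)
        if len(rows) > 1:
            test.append(rows.pop())
        train.extend(rows)
    return train, test

train_rows, test_rows = leave_one_out_split(triples)
train_users = {u for u, _, _ in train_rows}
test_users = {u for u, _, _ in test_rows}
print(f"train: {len(train_rows)} rows, test: {len(test_rows)} rows")
print("every test user also appears in train:", test_users <= train_users)
```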


**Modeling**

In [None]:
# Install Surprise library if you haven't already
# !pip install scikit-surprise

from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import SVD  # Example algorithm, you can choose others like KNN, etc.
from surprise import accuracy

# Load the movielens-100k dataset (you can replace with your own dataset)
data = Dataset.load_builtin('ml-100k')

# Split the dataset into train and test sets.
# Surprise's train_test_split expects the Dataset itself, not a Trainset built
# with build_full_trainset(). The built-in ml-100k loader already carries its
# 1-5 rating scale, so no separate Reader is needed here.
trainset, testset = train_test_split(data, test_size=0.2)

# Choose an algorithm (SVD here) and train it on the dataset
algo = SVD()
algo.fit(trainset)

# Predict ratings for the testset
predictions = algo.test(testset)

# Evaluate the model by computing Root Mean Squared Error (RMSE)
accuracy.rmse(predictions)

# Make recommendations for a user
user_id = str(196)  # Example user ID
items_to_predict = [str(item) for item in range(1, 100)]  # Example items to predict
predicted_ratings = {}
for item_id in items_to_predict:
    predicted_ratings[item_id] = algo.predict(user_id, item_id).est

# Get top n recommendations
n = 10
top_n = sorted(predicted_ratings.items(), key=lambda x: x[1], reverse=True)[:n]
for movie_id, rating in top_n:
    print(f'Movie ID: {movie_id}, Estimated Rating: {rating}')
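
The loop above scores every candidate item, including movies the user has already rated. A recommender usually filters those out before ranking. The sketch below shows that step in isolation with a stand-in scoring function; `top_n_unseen`, `fake_predict`, and the toy scores are hypothetical. With Surprise, the scoring function could be something like `lambda u, i: algo.predict(u, i).est`.

```python
def top_n_unseen(predict, user_id, all_items, seen_items, n=10):
    """Rank items the user has not rated yet by predicted score, highest first."""
    candidates = [item for item in all_items if item not in seen_items]
    scored = [(item, predict(user_id, item)) for item in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Stand-in for a trained model: fixed scores per item ID (illustrative values)
toy_scores = {"1": 4.2, "2": 3.1, "3": 4.8, "4": 2.5}

def fake_predict(user_id, item_id):
    return toy_scores[item_id]

recommendations = top_n_unseen(
    fake_predict,
    user_id="196",
    all_items=["1", "2", "3", "4"],
    seen_items={"3"},   # the user already rated item "3"
    n=2,
)
for item_id, score in recommendations:
    print(f"Movie ID: {item_id}, Estimated Rating: {score}")
```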


**Model Evaluation**

In [None]:
from sklearn.metrics import mean_squared_error

# Example actual ratings and predicted ratings
actual_ratings = [4, 3, 5, 2, 4, 1, 5, 3, 4, 2]
predicted_ratings = [3.8, 2.9, 4.5, 2.1, 3.7, 1.2, 4.8, 3.1, 3.9, 2.3]

# Evaluate using Mean Squared Error (MSE)
mse = mean_squared_error(actual_ratings, predicted_ratings)
print(f"Mean Squared Error (MSE): {mse}")
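
The Surprise cells earlier report RMSE, while this cell reports MSE. The two are directly related: RMSE is the square root of MSE and is expressed in the same units as the ratings. Mean absolute error (MAE) is another common choice that is less sensitive to large individual errors. A short NumPy sketch using the same example values:

```python
import numpy as np

# Same example ratings as in the cell above
actual_ratings = np.array([4, 3, 5, 2, 4, 1, 5, 3, 4, 2], dtype=float)
predicted_ratings = np.array([3.8, 2.9, 4.5, 2.1, 3.7, 1.2, 4.8, 3.1, 3.9, 2.3])

errors = actual_ratings - predicted_ratings

mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in rating units
mae = np.mean(np.abs(errors))    # mean absolute error

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```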


**Prediction**

In [None]:
# Importing necessary libraries
from surprise import Dataset, Reader, SVD

# Define the format of the data: one "user,item,rating" triple per line.
# If the CSV has a header row, pass skip_lines=1 to Reader as well.
reader = Reader(line_format='user item rating', sep=',')

# Load the dataset
data = Dataset.load_from_file('path_to_your_dataset.csv', reader=reader)

# Use the SVD algorithm
algo = SVD()

# Train the algorithm on the dataset
trainset = data.build_full_trainset()
algo.fit(trainset)

# Define a function to predict ratings
def predict_rating(user_id, item_id):
    prediction = algo.predict(user_id, item_id)
    return prediction.est

# Example usage:
user_id = 'user1'
item_id = 'item1'
predicted_rating = predict_rating(user_id, item_id)
print(f'Predicted rating for user {user_id} on item {item_id}: {predicted_rating}')


**Explanation**

A movie recommendation system suggests movies to users based on their preferences and past viewing history. Three families of techniques are common: collaborative filtering, which learns from the ratings of many users; content-based filtering, which compares the characteristics of the movies themselves; and hybrid approaches, which combine the two. This notebook illustrates the first two, using SVD from the Surprise library on user ratings and TF-IDF similarity on movie descriptions. On streaming platforms and movie databases, such systems help users find relevant titles among very large catalogs.
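
As a rough illustration of the hybrid approach mentioned above, one simple option is a weighted blend of a collaborative-filtering score and a content-based score for each candidate movie. The function names, the min-max scaling step, and the weight `alpha = 0.7` are assumptions for this sketch; other ways of combining the two signals are equally valid.

```python
import numpy as np

def min_max(values):
    """Scale scores to [0, 1] so the two signals are comparable."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def hybrid_scores(cf_scores, content_scores, alpha=0.7):
    """Blend the two score lists; alpha is the weight on collaborative filtering."""
    return alpha * min_max(cf_scores) + (1 - alpha) * min_max(content_scores)

# Illustrative scores for three candidate movies
cf_scores = [4.5, 3.0, 4.0]        # e.g. predicted ratings from a model such as SVD
content_scores = [0.1, 0.9, 0.5]   # e.g. cosine similarity to a movie the user liked

scores = hybrid_scores(cf_scores, content_scores)
ranking = np.argsort(-scores).tolist()  # candidate indices, best first
print(scores)
print(ranking)
```

Raising `alpha` favors what similar users rated highly, while lowering it favors movies whose descriptions resemble what the user already liked.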