# **Movie Recommendation System**

-------------

## **Objective**

To build a system that recommends movies to users based on their preferences and past ratings.

## **Data Source**

The data is sourced from the MovieLens dataset, which contains millions of movie ratings from users.

## **Import Library**

In [None]:
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Recommender tools from the Surprise library
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split as surprise_train_test_split


## **Import Data**

In [None]:
# Load MovieLens data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')


## **Describe Data**

In [None]:
# Display first few rows of the datasets
print(movies.head())
print(ratings.head())

# Display summary statistics of the ratings dataset
print(ratings.describe())

# Display information about the datasets
# (info() prints its report directly and returns None, so it is not wrapped in print)
movies.info()
ratings.info()


## **Data Visualization**

In [None]:
# Distribution of movie ratings
plt.figure(figsize=(8, 6))
sns.histplot(ratings['rating'], bins=20, kde=False)
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

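# Number of ratings per movie (top 10, most-rated first)
ratings_per_movie = ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)
top_movies = ratings_per_movie.head(10)
plt.figure(figsize=(12, 6))
# Cast the IDs to strings so seaborn keeps the count-sorted order instead of sorting numeric IDs
sns.barplot(x=top_movies.index.astype(str), y=top_movies.values)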
plt.title('Top 10 Movies by Number of Ratings')
plt.xlabel('Movie ID')
plt.ylabel('Number of Ratings')
plt.show()


## **Data Preprocessing**

In [None]:
# Merge movies and ratings dataframes
data = pd.merge(ratings, movies, on='movieId')

# Check for missing values
print(data.isnull().sum())

# Drop any rows with missing values if necessary
data = data.dropna()
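
`pd.merge` performs an inner join by default, so any rating whose `movieId` is missing from the movies table is dropped without a message. The sketch below, on made-up toy frames (`movies_demo` and `ratings_demo` are illustrative names, not part of the notebook), shows one way to count such rows before deciding how to handle them.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (values are made up for illustration)
movies_demo = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Movie A", "Movie B"],
})
ratings_demo = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 2, 3],  # movieId 3 has no entry in movies_demo
    "rating": [4.0, 3.5, 5.0],
})

# validate="many_to_one" raises an error if movieId is duplicated in movies_demo;
# indicator=True adds a _merge column recording where each row came from
merged_demo = pd.merge(
    ratings_demo, movies_demo,
    on="movieId", how="left",
    validate="many_to_one", indicator=True,
)

# Ratings whose movieId is absent from the movies table
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(merged_demo)} ratings have no matching movie")
```

The same two keyword arguments could be added to the notebook's own merge if silent row loss is a concern.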


## **Define Target Variable (y) and Feature Variables (X)**

In [None]:
# For a recommendation system, we don't have a traditional target variable and feature variables
# We will use userId, movieId, and rating for training our model
reader = Reader(rating_scale=(0.5, 5.0))
dataset = Dataset.load_from_df(data[['userId', 'movieId', 'rating']], reader)


## **Train Test Split**

In [None]:
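# Using Surprise library to split the data (80% train, 20% test)
# A fixed random_state makes the split reproducible between runs
trainset, testset = surprise_train_test_split(dataset, test_size=0.2, random_state=42)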


## **Modeling**

In [None]:
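# Using Surprise's SVD, a matrix-factorization model that learns latent
# factors and bias terms for users and movies from the observed ratings
model = SVD(random_state=42)  # fixed seed so training is reproducible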
model.fit(trainset)
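
To make the model less of a black box: Surprise's `SVD` estimates a rating as the global mean plus a user bias, a movie bias, and the dot product of the user's and movie's latent factor vectors. The sketch below reproduces that rule with made-up numbers; the function name `predict_rating` and all parameter values are assumptions for illustration, not values taken from the trained model.

```python
import numpy as np

# Made-up parameters for one user and one movie (illustration only)
global_mean = 3.5                      # mu: average of all training ratings
user_bias = 0.2                        # b_u: this user rates slightly above average
item_bias = -0.1                       # b_i: this movie is rated slightly below average
user_factors = np.array([0.5, -1.0])   # p_u: latent factors of the user
item_factors = np.array([0.4, 0.3])    # q_i: latent factors of the movie


def predict_rating(mu, b_u, b_i, p_u, q_i, low=0.5, high=5.0):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(high, max(low, raw))


estimate = predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors)
print(f"Estimated rating: {estimate:.2f}")
```

Training amounts to adjusting these biases and factor vectors so that the estimates match the observed ratings in the training set as closely as possible.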


## **Model Evaluation**

In [None]:
# Predictions on the test set
predictions = model.test(testset)

# Evaluating the model (verbose=False stops accuracy.mse from printing the score a second time)
mse = accuracy.mse(predictions, verbose=False)
print(f"Mean Squared Error: {mse}")
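
For reference, MSE is simply the average squared gap between the true and the estimated rating. The sketch below computes it by hand on three made-up predictions; `DemoPrediction` is a hypothetical stand-in that mirrors the `r_ui` (true rating) and `est` (estimate) fields of Surprise's prediction objects.

```python
import math
from collections import namedtuple

# Hypothetical stand-in mirroring the r_ui (true) and est (estimated) fields
DemoPrediction = namedtuple("DemoPrediction", ["r_ui", "est"])

# Made-up predictions for illustration
demo_predictions = [
    DemoPrediction(r_ui=4.0, est=3.5),
    DemoPrediction(r_ui=2.0, est=3.0),
    DemoPrediction(r_ui=5.0, est=4.5),
]


def mean_squared_error_of(preds):
    """Average of the squared differences between true and estimated ratings."""
    return sum((p.r_ui - p.est) ** 2 for p in preds) / len(preds)


demo_mse = mean_squared_error_of(demo_predictions)
demo_rmse = math.sqrt(demo_mse)  # back in rating units, easier to interpret
print(f"MSE: {demo_mse:.4f}, RMSE: {demo_rmse:.4f}")
```

Because MSE is in squared rating units, the RMSE is often reported alongside it: it reads directly as a typical error in rating points.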


## **Prediction**

In [None]:
# Making a prediction for a specific user and movie
user_id = 1  # example userId
movie_id = 50  # example movieId
prediction = model.predict(user_id, movie_id)
print(f"Predicted rating for user {user_id} and movie {movie_id}: {prediction.est}")


## **Explanation**

This notebook builds a movie recommendation system on the MovieLens dataset, which contains millions of user ratings. Preprocessing merges the movie information with the ratings and drops rows with missing values. The data is then split into training (80%) and test (20%) sets, and the SVD algorithm from the Surprise library is trained on the training set. The model is evaluated on the test set using Mean Squared Error (MSE), where lower values mean more accurate predictions. Finally, the model can predict the rating a user might give to a specific movie, which is the basis for recommending movies that match the user's preferences.

The model achieved an MSE of [your value] on the test set. Because MSE is measured in squared rating units, its square root (RMSE) is easier to interpret: it is the typical prediction error in rating points on the 0.5 to 5.0 scale. Future improvements could include incorporating additional features such as genres and user demographics to enhance the recommendations.
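
The notebook stops at a single predicted rating. Turning predictions into recommendations means scoring the movies a user has not yet rated and keeping the highest-scoring ones. The following is a minimal sketch of that step, not part of the original notebook: `recommend_top_n`, `toy_scores`, and `toy_estimate` are hypothetical names, and the scores are made up.

```python
def recommend_top_n(user_id, all_movie_ids, rated_movie_ids, estimate, n=5):
    """Return the n unrated movies with the highest estimated rating.

    estimate is any callable (user_id, movie_id) -> float.
    """
    candidates = [m for m in all_movie_ids if m not in rated_movie_ids]
    scored = [(m, estimate(user_id, m)) for m in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy estimator standing in for a trained model (made-up scores)
toy_scores = {(1, 10): 4.2, (1, 20): 3.1, (1, 30): 4.8, (1, 40): 2.5}


def toy_estimate(user_id, movie_id):
    return toy_scores[(user_id, movie_id)]


# User 1 has already rated movie 20, so it is excluded from the candidates
top_movies_for_user = recommend_top_n(1, [10, 20, 30, 40], {20}, toy_estimate, n=2)
print(top_movies_for_user)
```

With the trained model from this notebook, `estimate` could be `lambda u, m: model.predict(u, m).est`, and the set of rated movies could be taken from the `ratings` dataframe for the user in question.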