<a href="https://www.kaggle.com/code/manishkr1754/movie-recommendation-system?scriptVersionId=144425487" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Movie Recommendation System</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

In today's era of digital entertainment, the vast array of available movies and TV shows can overwhelm viewers when choosing what to watch. This project aims to tackle this issue through the development of a movie recommendation system, leveraging the power of data science and machine learning.

The problem can be classified as a **Recommendation System Machine Learning Problem**. The primary goal is **to build a model that suggests movies similar to one the user already likes**. Recommendation systems are commonly built with **Collaborative Filtering** (learning from historical user ratings and viewing habits), **Content-Based Filtering** (comparing the attributes of the items themselves), or hybrid approaches. This notebook takes the content-based route: it compares movies by their genres, keywords, tagline, cast, and director.

By combining TF-IDF text features with cosine similarity, this project seeks to simplify the decision-making process for viewers while demonstrating a practical use of machine learning in content recommendation.

## 2) Understanding Data
---

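The project uses **Movies Data**, a table of movie metadata. Because this is a similarity-based recommender rather than a supervised model, there is no outcome (dependent) variable. The columns used in this notebook are `index`, `title`, `genres`, `keywords`, `tagline`, `cast`, and `director`.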

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import numpy as np
import pandas as pd

# for text data preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import difflib

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

### Downloading stop words for text preprocessing

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

## 4) Data Eyeballing
---

### Loading Data

In [None]:
movies_data = pd.read_csv('Datasets/Day18_Movies_Data.csv') 

In [None]:
movies_data

In [None]:
print('The size of Dataframe is: ', movies_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
movies_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in movies_data.columns if movies_data[feature].dtype != 'O']
categorical_features = [feature for feature in movies_data.columns if movies_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=movies_data.isnull().sum().sort_values(ascending=False)
percent=(movies_data.isnull().sum()/movies_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
movies_data.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
movies_data.describe(include='object')

## 5) Data Cleaning and Preprocessing
---

### Selecting the relevant features for recommendation

In [None]:
selected_features = ['genres','keywords','tagline','cast','director']
selected_features

### Replace the null values with an empty string in the selected features

In [None]:
for feature in selected_features:
    movies_data[feature] = movies_data[feature].fillna('')

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=movies_data.isnull().sum().sort_values(ascending=False)
percent=(movies_data.isnull().sum()/movies_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

### Combining all the 5 selected features

In [None]:
combined_features = movies_data['genres']+' '+movies_data['keywords']+' '+movies_data['tagline']+' '+movies_data['cast']+' '+movies_data['director']

In [None]:
combined_features

### Stemming

In [None]:
porter_stemmer = PorterStemmer()

In [None]:
# build the stop-word set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stop words and reduce the remaining words to their stems
    stemmed_words = [porter_stemmer.stem(word) for word in words if word not in english_stopwords]
    return ' '.join(stemmed_words)

In [None]:
combined_features = combined_features.apply(stemming)

In [None]:
combined_features
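
To see what the cleaning steps inside `stemming` do before the stemmer runs, here is a minimal sketch using only the standard library. The stop-word set `TINY_STOPWORDS`, the helper `clean_tokens`, and the sample string are made up for illustration (the notebook itself uses NLTK's English list), and the stemming step is left out.

```python
import re

# Hypothetical mini stop-word set, used only so this sketch runs without NLTK data.
TINY_STOPWORDS = {"the", "of", "a", "and", "in"}

def clean_tokens(text):
    """Apply the same regex / lowercase / split steps as stemming(), minus the stemmer."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)   # every non-ASCII-letter becomes a space
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in TINY_STOPWORDS]

sample_text = "The Lord of the Rings 2: Zo\u00eb Saldana"   # made-up sample input
print(clean_tokens(sample_text))
# ['lord', 'rings', 'zo', 'saldana']
```

Note that the `[^a-zA-Z]` pattern removes digits and accented letters, so a name such as "Zoë" is reduced to "zo" and sequel numbers disappear.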

## 6) Model Building
---

### Feature Extraction

#### Transform the text data into TF-IDF feature vectors that can be used as input to cosine similarity

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
# learn the vocabulary and transform the text in one step
combined_features = vectorizer.fit_transform(combined_features)

In [None]:
combined_features

In [None]:
print(combined_features)
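
To make the TF-IDF weighting more concrete, the following is a hand-rolled sketch in plain Python. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is how the scikit-learn documentation describes the `TfidfVectorizer` defaults. It is a simplification: tokens are split on whitespace only, and the function name `tfidf_vectors` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        # term count multiplied by the smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# made-up, already-stemmed sample documents
sample_docs = ["action space alien", "action spi", "romanc drama"]
sample_vectors = tfidf_vectors(sample_docs)
for doc, vec in zip(sample_docs, sample_vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

In the second sample document, "spi" receives a higher weight than "action" because "action" also appears in another document and is therefore less distinctive.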

### Cosine Similarity

#### Getting the similarity scores using cosine similarity

In [None]:
similarity = cosine_similarity(combined_features)

In [None]:
print(similarity)

In [None]:
similarity.shape
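
For intuition, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes it for small made-up vectors in plain Python. The helper name `cosine` is hypothetical, and returning 0.0 for an all-zero vector is a choice made here for the sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # assumption: treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))                   # 1.0  (same direction)
print(cosine([1, 0], [0, 1]))                   # 0.0  (no shared terms)
print(round(cosine([1, 1, 0], [1, 0, 0]), 3))   # 0.707
```

Applied to every pair of rows, this gives a square matrix with one row and one column per movie, which is why `similarity.shape` has equal dimensions.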

### Movie Recommendation Systems Sub-Steps

#### `Step 1` Getting Movie name from the User

In [None]:
movie_name = input(' Enter your favourite movie name : ')

#### `Step 2` Creating a list with all the movie names given in the dataset

In [None]:
list_of_all_titles = movies_data['title'].tolist()
print(list_of_all_titles)

#### `Step 3` Finding the close match for the movie name given by the user

In [None]:
find_close_match = difflib.get_close_matches(movie_name, list_of_all_titles)
print(find_close_match)

In [None]:
if not find_close_match:
    raise ValueError(f'No close match found for {movie_name!r}')
close_match = find_close_match[0]
print(close_match)
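
`difflib.get_close_matches` is case-sensitive and returns an empty list when no title clears its `cutoff` (0.6 by default). The sketch below shows one possible way to handle both points. The helper `find_title` and the sample titles are hypothetical, and titles that differ only by case would collide in the lowercase lookup.

```python
import difflib

def find_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None when nothing clears the cutoff."""
    # compare in lowercase so that 'avatar' can still match 'Avatar'
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

sample_titles = ['Avatar', 'Iron Man', 'Titanic']   # made-up stand-in for list_of_all_titles
print(find_title('avatr', sample_titles))     # Avatar
print(find_title('zzzzqqq', sample_titles))   # None
```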

#### `Step 4` Finding the index of the movie with title

In [None]:
index_of_the_movie = movies_data[movies_data.title == close_match]['index'].values[0]
print(index_of_the_movie)

#### `Step 5` Getting a list of similar movies

In [None]:
similarity_score = list(enumerate(similarity[index_of_the_movie]))
print(similarity_score)

In [None]:
len(similarity_score)

#### `Step 6` Sorting the movies based on their similarity score

In [None]:
sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True) 
print(sorted_similar_movies)
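
Sorting the full list works, but when only a handful of recommendations is needed, selecting the top entries directly is an alternative. A minimal sketch with the standard library's `heapq.nlargest` follows. The helper `top_k` and the sample scores are made up for illustration.

```python
import heapq

def top_k(scores, k):
    """Return the k (index, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

sample_scores = [0.1, 1.0, 0.5, 0.9]   # made-up similarity row
print(top_k(sample_scores, 2))          # [(1, 1.0), (3, 0.9)]
```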

#### `Step 7` Print the name of similar movies based on the index

In [None]:
print('Movies suggested for you : \n')

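# sorted_similar_movies is ordered best-first, so slicing avoids scanning every movie;
# the first entry is usually the selected movie itself (similarity 1.0)
for i, (index, score) in enumerate(sorted_similar_movies[:29], start=1):
    title_from_index = movies_data.iloc[index]['title']
    print(i, '.', title_from_index)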

## 7) Movie Recommendation System Demonstration
---

In [None]:
movie_name = input(' Enter your favourite movie name : ')

list_of_all_titles = movies_data['title'].tolist()

find_close_match = difflib.get_close_matches(movie_name, list_of_all_titles)

if not find_close_match:
    raise ValueError(f'No close match found for {movie_name!r}')
close_match = find_close_match[0]

index_of_the_movie = movies_data[movies_data.title == close_match]['index'].values[0]

similarity_score = list(enumerate(similarity[index_of_the_movie]))

sorted_similar_movies = sorted(similarity_score, key = lambda x:x[1], reverse = True) 

print('Movies suggested for you : \n')

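# sorted_similar_movies is ordered best-first, so slicing avoids scanning every movie;
# the first entry is usually the selected movie itself (similarity 1.0)
for i, (index, score) in enumerate(sorted_similar_movies[:29], start=1):
    title_from_index = movies_data.iloc[index]['title']
    print(i, '.', title_from_index)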
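
The steps in this cell could also be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's own code: the name `recommend`, its parameters, and the toy data are hypothetical. It assumes `titles` is in the same order as the rows of `similarity`, and it skips the selected movie itself.

```python
import difflib

def recommend(movie_name, titles, similarity, top_n=10):
    """Return up to top_n titles most similar to the closest match of movie_name."""
    matches = difflib.get_close_matches(movie_name, titles, n=1)
    if not matches:
        return []                          # nothing resembled the input
    idx = titles.index(matches[0])         # row position of the matched title
    ranked = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    # skip the selected movie itself, then keep the best top_n
    return [titles[i] for i, _ in ranked if i != idx][:top_n]

# made-up toy data standing in for list_of_all_titles and the similarity matrix
toy_titles = ['Avatar', 'Aliens', 'Titanic']
toy_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(recommend('Avatar', toy_titles, toy_similarity, top_n=2))   # ['Aliens', 'Titanic']
print(recommend('qqqq', toy_titles, toy_similarity))              # []
```

With the notebook's objects, a call might look like `recommend(movie_name, list_of_all_titles, similarity, top_n=29)`.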