# Movie Recommendation System

This project builds a movie recommendation system using machine learning and natural language processing (NLP). The goal is to recommend movies similar to a given movie based on its genre and overview, which are combined into a single text field called tags. The system uses **cosine similarity** to compare movies on these textual features.

---

## Step 1: Importing Libraries

The first step is to import the necessary libraries for data processing and manipulation.

- pandas: For data manipulation and analysis, particularly useful for working with structured data like CSV files.
- numpy: For numerical operations, used here for array manipulations.
- files from google.colab: Handles file uploads in a Google Colab environment. It is imported in Step 2, where it is first used.


In [None]:
import pandas as pd
import numpy as np


## Step 2: Uploading the Dataset
The next step is to upload the dataset from your local machine into the Colab environment.
- uploaded = files.upload(): Opens a file picker so the user can upload the movie dataset. The call returns a dictionary that maps each uploaded filename to its contents as bytes, and Colab also saves the file to the working directory, which is why Step 3 can read it by name.

In [2]:
# prompt: upload file

from google.colab import files
uploaded = files.upload()
# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))
#
# df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))




Saving dataset.csv to dataset.csv
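
The commented-out lines in the cell above hint at reading the upload straight from memory instead of from disk. A minimal sketch of that idea follows, assuming the dictionary maps filenames to raw bytes; the helper name load_uploaded_csv and the stand-in dictionary fake_upload are hypothetical and only there so the snippet runs outside Colab.

```python
import io

import pandas as pd


def load_uploaded_csv(uploaded, filename):
    """Read one uploaded file (raw bytes) into a DataFrame."""
    return pd.read_csv(io.BytesIO(uploaded[filename]))


# Stand-in for the dict returned by files.upload(): {filename: bytes}.
fake_upload = {"dataset.csv": b"id,title\n278,The Shawshank Redemption\n"}

df = load_uploaded_csv(fake_upload, "dataset.csv")
print(df.shape)
```

In Colab, the real uploaded dictionary would take the place of fake_upload.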


## Step 3: Loading the Dataset
After uploading the dataset, I read it into a pandas DataFrame for easy manipulation.

- pandas.read_csv(): This method is used to read the uploaded CSV file into the movies DataFrame. The file is assumed to contain various columns such as id, title, genre, and overview.



In [5]:
movies = pd.read_csv('dataset.csv')

## Step 4: Data Inspection
After loading the dataset, I inspect the first few rows, columns, and basic info of the dataset.

- movies.head(): Displays the first 5 rows of the dataset to get a quick look at the data.
- movies.columns: Lists the column names in the dataset.
- movies.info(): Provides a concise summary of the DataFrame, including the number of non-null values and data types.

In [6]:
movies.head()

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


In [7]:
movies.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

In [8]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 10000 non-null  int64  
 1   title              10000 non-null  object 
 2   genre              9997 non-null   object 
 3   original_language  10000 non-null  object 
 4   overview           9987 non-null   object 
 5   popularity         10000 non-null  float64
 6   release_date       10000 non-null  object 
 7   vote_average       10000 non-null  float64
 8   vote_count         10000 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 703.2+ KB
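
The info() output shows that genre has 9997 non-null values and overview has 9987, so 3 and 13 rows respectively are missing those fields. One way to count missing values directly is sketched below on a small made-up frame (the sample data is an assumption, not part of the dataset).

```python
import pandas as pd

# Tiny illustrative frame with one missing genre and one missing overview.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Drama", None, "Comedy"],
    "overview": ["Some plot.", "Another plot.", None],
})

# isna() marks missing cells; sum() counts them per column.
missing = sample[["genre", "overview"]].isna().sum()
print(missing)
```

On the real data the same call would be movies[['genre', 'overview']].isna().sum().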


## Step 5: Preprocessing the Data
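In this step, I create a new column, tags, by concatenating the genre and overview columns.

- The tags column represents the textual content of each movie and is what the similarity comparison uses.
- The + operator joins the two strings with no separator, so the last genre runs into the first word of the overview (for example, "Drama,CrimeFramed in the ..."). A row that is missing either genre or overview ends up with a missing tags value.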

In [9]:
movies['tags'] = movies['genre'] + movies['overview']

In [10]:
movies.head()

Unnamed: 0,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count,tags
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862,"Drama,CrimeFramed in the 1940s for the double ..."
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731,"Comedy,Drama,RomanceRaj is a rich, carefree, h..."
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280,"Drama,CrimeSpanning the years 1945 to 1955, a ..."
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959,"Drama,History,WarThe true story of how busines..."
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811,"Drama,CrimeIn the continuing saga of the Corle..."
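
Because the last genre fuses with the first overview word, a tokenizer sees a single token such as "CrimeFramed" rather than "Crime" and "Framed". A possible variant, shown here as a sketch on made-up rows rather than as the notebook's method, fills missing values with empty strings and joins the columns with a space.

```python
import pandas as pd

# Illustrative rows; the second one has no genre.
sample = pd.DataFrame({
    "genre": ["Drama,Crime", None],
    "overview": ["Framed in the 1940s", "A plot with no genre"],
})

# fillna("") keeps rows with a missing field; " " separates genre from overview;
# str.strip() removes the stray space left when one side is empty.
tags = (sample["genre"].fillna("") + " " + sample["overview"].fillna("")).str.strip()
print(tags.tolist())
```

Applying this to the full dataset would change the tags values, and therefore the similarity scores and recommendations shown later in the notebook.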


## Step 6: Selecting Relevant Columns
I now narrow the data down to the columns the recommendation system needs: id, title, and tags. This is done in two passes.

- I first create a new DataFrame, new_df, with the id, title, genre, overview, and tags columns.
- I then drop the genre and overview columns, since their information is already combined in the tags column.

In [None]:
new_df = movies[['id', 'title', 'genre', 'overview', 'tags']]

In [12]:
new_df.head()

Unnamed: 0,id,title,genre,overview,tags
0,278,The Shawshank Redemption,"Drama,Crime",Framed in the 1940s for the double murder of h...,"Drama,CrimeFramed in the 1940s for the double ..."
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second...","Comedy,Drama,RomanceRaj is a rich, carefree, h..."
2,238,The Godfather,"Drama,Crime","Spanning the years 1945 to 1955, a chronicle o...","Drama,CrimeSpanning the years 1945 to 1955, a ..."
3,424,Schindler's List,"Drama,History,War",The true story of how businessman Oskar Schind...,"Drama,History,WarThe true story of how busines..."
4,240,The Godfather: Part II,"Drama,Crime",In the continuing saga of the Corleone crime f...,"Drama,CrimeIn the continuing saga of the Corle..."


In [13]:
new_df = new_df.drop(columns=['genre','overview'])

In [14]:
new_df.head()

Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,"Drama,CrimeFramed in the 1940s for the double ..."
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,RomanceRaj is a rich, carefree, h..."
2,238,The Godfather,"Drama,CrimeSpanning the years 1945 to 1955, a ..."
3,424,Schindler's List,"Drama,History,WarThe true story of how busines..."
4,240,The Godfather: Part II,"Drama,CrimeIn the continuing saga of the Corle..."
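
The select-then-drop sequence above can also be written as a single selection. The sketch below uses a made-up one-row frame; the .copy() call is an addition that gives new_df its own data so later edits do not trigger pandas' chained-assignment warnings.

```python
import pandas as pd

# Illustrative stand-in for the movies DataFrame.
movies = pd.DataFrame({
    "id": [278],
    "title": ["The Shawshank Redemption"],
    "genre": ["Drama,Crime"],
    "overview": ["Framed in the 1940s ..."],
    "tags": ["Drama,CrimeFramed in the 1940s ..."],
})

# Pick the three needed columns directly and take an independent copy.
new_df = movies[["id", "title", "tags"]].copy()
print(list(new_df.columns))
```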


## Step 7: Vectorizing the Tags
Next, I use CountVectorizer from sklearn to convert the textual tags data into numerical vectors.

- CountVectorizer: Builds a vocabulary from the words in tags and represents each movie as a vector of word counts, with one column per vocabulary word.
- fit_transform(): Learns the vocabulary from the data and returns the count matrix in a single call.
- max_features=10000: Keeps only the 10,000 most frequent words across all movies, which bounds the width of the matrix.
- stop_words='english': Excludes common words like "the", "and", etc., to focus on more meaningful words.
- .values.astype('U'): Converts every entry to a Unicode string, so a missing tags value becomes the literal text "nan" instead of raising an error.
- .toarray(): Converts the sparse result into a dense NumPy array.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
cv = CountVectorizer(max_features=10000, stop_words='english')

In [17]:
cv

In [18]:
vec = cv.fit_transform(new_df['tags'].values.astype('U')).toarray()

In [20]:
vec

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [21]:
vec.shape

(10000, 10000)
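
To make the count vectors concrete, here is a small sketch on two made-up sentences (the sentences and the name toy_cv are illustrative). Each column corresponds to one vocabulary word, in alphabetical order, and each row holds that document's word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the prison crime drama",
    "the crime family crime drama",
]

toy_cv = CountVectorizer(stop_words="english")
counts = toy_cv.fit_transform(docs).toarray()

# The learned vocabulary, sorted alphabetically to match the column order.
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
print(counts)
```

The stop word "the" is dropped, and "crime" is counted twice in the second row.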

## Step 8: Calculating Cosine Similarity
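I now compute the cosine similarity between every pair of movies based on their tags vectors.

- Cosine Similarity: Measures the similarity between two vectors as the cosine of the angle between them. Because word counts are never negative, the values here range from 0 (no words in common) to 1 (vectors pointing in the same direction).
- sim is a 10000 x 10000 matrix in which sim[i][j] is the similarity between movie i and movie j. The diagonal entries in the output are 1 because each movie is compared with itself.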


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
sim = cosine_similarity(vec)

In [24]:
sim

array([[1.        , 0.06253054, 0.05802589, ..., 0.07963978, 0.07597372,
        0.03798686],
       [0.06253054, 1.        , 0.08980265, ..., 0.        , 0.        ,
        0.        ],
       [0.05802589, 0.08980265, 1.        , ..., 0.02541643, 0.03636965,
        0.        ],
       ...,
       [0.07963978, 0.        , 0.02541643, ..., 1.        , 0.03327792,
        0.03327792],
       [0.07597372, 0.        , 0.03636965, ..., 0.03327792, 1.        ,
        0.04761905],
       [0.03798686, 0.        , 0.        , ..., 0.03327792, 0.04761905,
        1.        ]])
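
For a single pair of vectors, the score is the dot product divided by the product of the two vector lengths. The sketch below computes it by hand with NumPy for two small made-up count vectors; the helper name cosine is illustrative, and it does not handle all-zero vectors.

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


u = [1, 1, 0, 1]
v = [2, 1, 1, 0]

# dot = 3, |u| = sqrt(3), |v| = sqrt(6), so the score is 3 / sqrt(18).
score = cosine(u, v)
print(round(score, 4))
```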

## Step 9: Recommending Movies
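I can now find similar movies by using cosine similarity. For instance, I can find movies similar to "The Shawshank Redemption". The first step is to locate that movie in new_df, because its row position selects the matching row of sim.

- This query returns the row in new_df corresponding to "The Shawshank Redemption", which sits at index 0.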

In [25]:
new_df[new_df['title']=='The Shawshank Redemption']

Unnamed: 0,id,title,tags
0,278,The Shawshank Redemption,"Drama,CrimeFramed in the 1940s for the double ..."


## Step 10: Sorting Movies Based on Similarity
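I take the similarity scores between the movie and every other movie and sort them to find the five highest-scoring entries.

- enumerate(sim[0]): Pairs each movie's position with its similarity to the first movie (i.e., "The Shawshank Redemption").
- sorted(..., reverse=True, key=...): Orders those pairs by the similarity score, highest first.
- new_df.iloc[i[0]].title: Looks up the title of the movie at each position in the sorted list.
- The first result is the queried movie itself, since its similarity to itself is 1, so the list contains four other movies.

In [26]:
dist = sorted(enumerate(sim[0]), reverse=True, key=lambda pair: pair[1])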

In [27]:
for i in dist[0:5]:
    print(new_df.iloc[i[0]].title)

The Shawshank Redemption
Anything for Her
The Woodsman
The Getaway
Pusher II
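
If the queried movie should not appear in its own results, one option is to rank with NumPy and filter out the query's position. This is a sketch on a made-up row of scores; the function name top_k_similar is illustrative and not part of the notebook.

```python
import numpy as np


def top_k_similar(sim_row, query_index, k=5):
    """Positions of the k highest scores in sim_row, skipping the query itself."""
    order = np.argsort(sim_row)[::-1]        # positions by descending score
    order = order[order != query_index]      # drop the query's own position
    return order[:k].tolist()


row = np.array([1.0, 0.2, 0.9, 0.0, 0.5])
top = top_k_similar(row, query_index=0, k=3)
print(top)
```

With the notebook's data, the call would be top_k_similar(sim[0], 0).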


## Step 11: Creating a Function for Recommendations
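I wrap the recommendation logic in a function called recommend that accepts a movie title and prints similar movies.

- recommend(title): Takes a movie title as input and prints the five highest-scoring titles based on cosine similarity. The first of these is the queried movie itself.
- It looks up the index of the movie in the DataFrame and then uses that row of the cosine similarity matrix to find the closest matches.
- The title must match an entry in new_df exactly; an unknown title raises an IndexError.

In [28]:
def recommend(title):
    # Locate the row for the requested title (raises IndexError if absent).
    index = new_df[new_df['title'] == title].index[0]
    # Rank every movie by its similarity to the requested one, highest first.
    distance = sorted(enumerate(sim[index]), reverse=True, key=lambda pair: pair[1])
    # The top entry is normally the movie itself (similarity 1.0).
    for i in distance[0:5]:
        print(new_df.iloc[i[0]].title)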

In [29]:
recommend('Iron Man')

Iron Man
Mazinger Z: Infinity
Justice League Dark
Iron Man 3
The Colony


In [30]:
recommend('The Shawshank Redemption')

The Shawshank Redemption
Anything for Her
The Woodsman
The Getaway
Pusher II


In [31]:
recommend('The Godfather')

The Godfather
The Godfather: Part II
Felon
House of Gucci
Gotti
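
A variant that returns the titles instead of printing them, and that tolerates unknown titles, may be easier to reuse. The sketch below is one possible shape, not the notebook's implementation: the function name recommend_titles, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def recommend_titles(title, frame, sim_matrix, k=5):
    """Return up to k titles most similar to `title`, excluding the title itself."""
    matches = frame.index[frame["title"] == title]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    pos = frame.index.get_loc(matches[0])
    ranked = sorted(enumerate(sim_matrix[pos]), key=lambda pair: pair[1], reverse=True)
    return [frame.iloc[i]["title"] for i, _ in ranked if i != pos][:k]


# Toy data: three movies and a symmetric similarity matrix.
toy_df = pd.DataFrame({"title": ["A", "B", "C"]})
toy_sim = np.array([
    [1.0, 0.1, 0.8],
    [0.1, 1.0, 0.3],
    [0.8, 0.3, 1.0],
])

print(recommend_titles("A", toy_df, toy_sim, k=2))
print(recommend_titles("Missing", toy_df, toy_sim))
```

With the notebook's objects, the equivalent call would be recommend_titles('Iron Man', new_df, sim).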
