In [32]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [33]:
df = pd.read_csv("movie_dataset.csv")

If you inspect the dataset, you will see that it holds a lot of information about each movie, most of which we do not need. We use only the keywords, cast, genres and director columns as our feature set (the so-called "content" of the movie).

In [34]:
features = ['keywords', 'cast', 'genres', 'director']

Our next task is to create a function for combining the values of these columns into a single string.

In [38]:
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

Now we need to call this function on each row of our dataframe. Before doing that, we have to clean the data: a NaN in any of the four columns would make the string concatenation fail, so we replace every NaN in those columns with an empty string.

In [39]:
for feature in features:
    df[feature] = df[feature].fillna('') # filling all NaNs with blank string

df["combined_features"] = df.apply(combine_features, axis=1) # applying combine_features() to every row

In [40]:
df.iloc[0].combined_features

'culture clash future space war space colony society Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Michelle Rodriguez Action Adventure Fantasy Science Fiction James Cameron'
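
The fill-and-combine step can be tried on a small frame before running it on the full dataset. The sketch below is only an illustration: the two-row frame `toy` and its values are made up, and it joins the columns with `apply(" ".join, axis=1)` instead of the `combine_features()` function used above, which gives the same strings.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for movie_dataset.csv
toy = pd.DataFrame({
    "keywords": ["space war", np.nan],
    "cast": ["Zoe Saldana", "Tom Hardy"],
    "genres": ["Action", "Thriller"],
    "director": ["James Cameron", np.nan],
})
features = ["keywords", "cast", "genres", "director"]

# Replace NaN with '' so that joining strings does not fail
toy[features] = toy[features].fillna("")

# Join the four columns of each row with single spaces
toy["combined_features"] = toy[features].apply(" ".join, axis=1)

print(toy["combined_features"].tolist())
```

Note that an empty field still contributes its separating space, so the second row starts and ends with a space. This does not matter for the next step, because the vectorizer ignores whitespace.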

Now that we have the combined strings, we can feed them to a CountVectorizer() object to get the count matrix.

In [41]:
cv = CountVectorizer() # creating new CountVectorizer() object
count_matrix = cv.fit_transform(df["combined_features"]) # feeding the combined strings (movie contents) to the CountVectorizer() object

At this point, most of the work is done. Next, we compute the cosine similarity matrix from the count matrix. Entry (i, j) of this matrix is the similarity between movie i and movie j.

In [42]:
cosine_sim = cosine_similarity(count_matrix)
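
For two count vectors a and b, cosine similarity is the dot product divided by the product of the vector lengths: cos(a, b) = (a . b) / (|a| |b|). The sketch below works this out by hand with NumPy for two hypothetical count vectors; it is meant to show what a single entry of cosine_sim means, not how scikit-learn computes the matrix internally.

```python
import numpy as np

# Two hypothetical rows of a count matrix
a = np.array([1, 0, 1, 1])
b = np.array([2, 1, 1, 0])

dot = a @ b                                   # 1*2 + 0*1 + 1*1 + 1*0 = 3
lengths = np.linalg.norm(a) * np.linalg.norm(b)  # sqrt(3) * sqrt(6)
cos = dot / lengths

print(round(float(cos), 4))  # 0.7071
```

Since counts are never negative, the score always lies between 0 (no words in common) and 1 (same word proportions).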

Now we define two helper functions: one to get the movie title from the movie index, and one to get the index from the title.

In [43]:
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

In [44]:
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]
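
Both helpers raise an IndexError when nothing matches, for example when the title is misspelled or typed in a different case. A more forgiving lookup could look like the sketch below. The function `find_index` and the frame `toy` are hypothetical names introduced here, and the sketch assumes the data has an "index" column, as get_index_from_title() does.

```python
import pandas as pd

# Hypothetical three-row stand-in for the movie dataframe
toy = pd.DataFrame({"index": [0, 1, 2], "title": ["Avatar", "Aliens", "Alien"]})

def find_index(frame, title):
    """Return the 'index' value of the first case-insensitive title match, or None."""
    matches = frame.loc[frame["title"].str.lower() == title.lower(), "index"]
    return None if matches.empty else int(matches.iloc[0])

print(find_index(toy, "avatar"))   # 0
print(find_index(toy, "Titanic"))  # None
```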

In [45]:
movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index])) # accessing the row corresponding to this movie as (index, score) pairs

Now comes the most vital step. We sort the list similar_movies by similarity score in descending order. Since the movie most similar to a given movie is the movie itself, we discard the first element after sorting.

In [46]:
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
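
Dropping the first element relies on the chosen movie being ranked first. If another movie has exactly the same score of 1.0 and a lower index, that movie ends up first and the chosen movie stays in the list. A minimal sketch of an alternative that removes the chosen movie by its index is shown below; the array `scores` and the value of `query` are made-up stand-ins for one row of cosine_sim and for movie_index.

```python
import numpy as np

scores = np.array([0.2, 1.0, 0.7, 1.0, 0.5])  # hypothetical row of cosine_sim
query = 3                                     # hypothetical index of the chosen movie

order = np.argsort(-scores, kind="stable")  # indices sorted by descending score
order = order[order != query]               # remove the chosen movie explicitly
top = order[:3].tolist()

print(top)  # [1, 2, 4]
```

Here index 1 ties with the chosen movie at 1.0, so slicing with [1:] would have discarded index 1 and kept index 3 in the results.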

Now we run a loop to print the first 5 entries of the sorted_similar_movies list.

In [47]:
print(f'Top 5 Similar Movies to {movie_user_likes} are \n')
for element in sorted_similar_movies[:5]: # slicing keeps exactly the first 5 entries
    print(get_title_from_index(element[0]))

Top 5 Similar Movies to Avatar are 

Guardians of the Galaxy
Aliens
Star Wars: Clone Wars: Volume 1
Star Trek Into Darkness
Star Trek Beyond
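
The whole pipeline can also be wrapped in one reusable function. The following is a rough sketch under a few assumptions: `recommend` is a hypothetical helper, the frame passed in already has "title" and "combined_features" columns, and the three-row frame `toy` is made up for the demonstration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(frame, title, top_n=5):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    frame = frame.reset_index(drop=True)  # row labels now equal row positions
    counts = CountVectorizer().fit_transform(frame["combined_features"])
    sim = cosine_similarity(counts)
    query = frame.index[frame["title"] == title][0]  # IndexError if title is missing
    ranked = sorted(enumerate(sim[query]), key=lambda pair: pair[1], reverse=True)
    # Skip the chosen movie by index, then keep the first top_n titles
    return [frame.at[i, "title"] for i, _ in ranked if i != query][:top_n]

# Hypothetical data, only for demonstration
toy = pd.DataFrame({
    "title": ["Avatar", "Aliens", "Notebook"],
    "combined_features": [
        "space war alien action james cameron",
        "space alien action james cameron",
        "romance drama love",
    ],
})

print(recommend(toy, "Avatar", top_n=2))  # ['Aliens', 'Notebook']
```

Recomputing the similarity matrix on every call keeps the sketch short. For a real dataset it would make more sense to compute the matrix once and reuse it across queries.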
