In [None]:
# Import Pandas
import pandas as pd

# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)


$$\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$$

v is the number of votes for the movie;

m is the minimum votes required to be listed in the chart;

R is the average rating of the movie;

C is the mean vote across the whole report.
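
To see how the formula behaves, here is a minimal sketch that evaluates it for two made-up movies. The function name `weighted_rating_scalar` and all numbers (`m_example`, `C_example`, the vote counts and ratings) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def weighted_rating_scalar(v, R, m, C):
    """Blend a movie's own average R with the global mean C, weighted by votes."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
m_example = 160   # minimum votes required to be listed
C_example = 5.6   # mean vote across all movies

# A movie with few votes and a very high average rating
few_votes = weighted_rating_scalar(v=200, R=9.0, m=m_example, C=C_example)

# A movie with many votes and a slightly lower average rating
many_votes = weighted_rating_scalar(v=20000, R=8.5, m=m_example, C=C_example)

print(round(few_votes, 2))   # pulled strongly toward C
print(round(many_votes, 2))  # stays close to its own average
```

With few votes the score is pulled toward the global mean C, so the heavily voted movie ranks higher even though its raw average is lower.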

In [None]:
# Calculate mean of vote average column
C = metadata['vote_average'].mean()
print(C)
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

# m is 160, so a movie needs at least 160 votes to qualify

# Filter out all qualified movies into a new DataFrame
# (copy after filtering so that adding columns later does not touch metadata)
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
q_movies.shape

# Gives the shape (4555, 24)
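
As a rough illustration of what the 90th-percentile cutoff does, the sketch below computes a quantile by linear interpolation between the two nearest ranks, which is assumed here to correspond to the default behaviour of pandas' `quantile`. The helper `linear_quantile` and the vote counts are hypothetical.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical vote counts for ten movies
vote_counts = [3, 8, 12, 20, 45, 60, 150, 400, 900, 5000]

# Cutoff: the vote count that about 90% of the movies fall below
m_toy = linear_quantile(vote_counts, 0.90)

# Only movies at or above the cutoff qualify for the chart
qualified = [v for v in vote_counts if v >= m_toy]
print(m_toy, qualified)
```

Because vote counts are heavily skewed, a high percentile keeps only the small group of movies with enough votes for their average to be trustworthy.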


The next step is to calculate the weighted rating for each movie.

Define a function, weighted_rating(), that takes a movie row together with m and C as arguments.

Inside the function, read vote_count (v) and vote_average (R) from the row, then apply the function to every row of the q_movies DataFrame.


In [None]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

# Sort movies based on the score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

# Print the top 20 movies if needed
# q_movies[['title', 'vote_count', 'vote_average', 'score']].head(20)

Content-Based Recommender
--------------------------
Plot Description Based Recommender
----------------------------------
Here we build a system that recommends movies that are similar to a particular movie. To achieve this, we compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest scores.

The plot description is available in the overview feature of the metadata dataset. To suggest movies based on their content, we first turn each overview into a TF-IDF vector:
* Import the TfidfVectorizer class from scikit-learn;
* Remove stop words like 'the', 'an', etc., since they give no useful information about the topic;
* Replace not-a-number values with a blank string;
* Construct the TF-IDF matrix on the data.

In [None]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

#Output the shape of tfidf_matrix if necessary 
#tfidf_matrix.shape
#(45466, 75827)
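
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is meant to mirror scikit-learn's defaults; the whitespace tokenizer, the helper `tfidf_vectors`, and the three toy documents are simplifications invented for this example, and no stop words are removed.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf per term, list of L2-normalized TF-IDF vectors as dicts)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed to match sklearn's default)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize; an empty document keeps an empty vector
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

# Hypothetical plot snippets
toy_docs = ["batman fights crime", "mafia family crime", "toy cowboy friends"]
toy_idf, toy_vectors = tfidf_vectors(toy_docs)

# 'crime' appears in two documents, so it is weighted less than 'batman'
print(toy_idf["crime"] < toy_idf["batman"])
```

Terms that occur in many overviews receive a lower weight, so rare, distinctive words dominate each movie's vector.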

We can use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies.
The cosine similarity score is independent of magnitude and is relatively easy and fast to calculate
(especially when used in conjunction with TF-IDF scores). Since TfidfVectorizer L2-normalizes each vector by default,
the dot product between any two vectors directly yields their cosine similarity score. Hence we use sklearn's
linear_kernel() instead of cosine_similarity(), since it is faster.
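
The claim that a plain dot product is enough can be checked on two small vectors. This is a minimal sketch in pure Python; the helpers `l2_normalize`, `dot`, and `cosine` and the vectors `a` and `b` are hypothetical and exist only for this illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity from its definition: a.b / (|a| * |b|)."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine(a, b)
dot_unit = dot(l2_normalize(a), l2_normalize(b))

# The two values agree up to floating-point rounding
print(abs(cos_ab - dot_unit) < 1e-12)
```

Once every vector has unit length, the denominator of the cosine formula is 1, so skipping it saves work without changing the result.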

In [None]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

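# Compute the cosine similarity matrix
# (a dense n x n array; for the full dataset this needs a lot of memory)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Construct a reverse map from movie titles to row indices,
# keeping the first entry when a title appears more than once
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]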

We are almost done. The recommender function performs the following steps:

Get the index of the movie given its title.

Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

Return the titles corresponding to the indices of the top elements.

In [None]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
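
Sorting every score is simple but does more work than needed when only the top 10 are returned. As a possible alternative, the sketch below selects the best matches with `heapq.nlargest` and skips the query movie by position instead of assuming it sorts first. The helper `top_k_similar` and the toy similarity row are hypothetical and not part of the original notebook.

```python
import heapq

def top_k_similar(sim_row, idx, k=10):
    """Return positions of the k highest scores in sim_row, skipping idx itself."""
    # Pair each score with its position and leave out the query movie
    candidates = ((score, pos) for pos, score in enumerate(sim_row) if pos != idx)
    # nlargest keeps only k items in memory; ties are broken by position
    best = heapq.nlargest(k, candidates)
    return [pos for score, pos in best]

# Hypothetical similarity scores of movie 1 against five movies
toy_row = [0.10, 1.00, 0.75, 0.00, 0.40]
top_two = top_k_similar(toy_row, idx=1, k=2)
print(top_two)
```

Excluding the query by index also behaves sensibly when another movie has an identical overview and therefore the same similarity score of 1.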

EXAMPLE


In [None]:
get_recommendations('The Godfather')
1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen