In [3]:
import pandas as pd
import gc
import warnings; warnings.simplefilter('ignore')

In [25]:
# read original data
metadata = pd.read_csv("data/movies_metadata.csv")
keywords = pd.read_csv("data/keywords.csv")

metadata.shape, keywords.shape

((45466, 24), (46419, 2))

Let's remove movies with too few votes from our catalog and score the rest. This keeps the demo clean and fast, and it makes little sense to recommend rarely rated movies anyway.

We will use the IMDB weighted rating formula to score each movie:

$$
\text{Weighted Rating (WR)} = \left(\frac{v}{v+m}\right) R + \left(\frac{m}{v+m}\right) C
$$

where,

- v is the number of votes for the movie
- m is the minimum votes required to be listed in the chart
- R is the average rating of the movie
- C is the mean vote across the whole catalog
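
To get a feel for the formula, here is a minimal sketch with made-up numbers. The function name `weighted_rating` and the values of `m`, `C`, `v` and `R` are assumptions for illustration only, not values from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDB-style weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical threshold and catalog mean
m, C = 50, 6.0

# A 9.0 average backed by only 60 votes is pulled strongly toward C
few_votes = weighted_rating(v=60, R=9.0, m=m, C=C)

# An 8.0 average backed by 5000 votes barely moves
many_votes = weighted_rating(v=5000, R=8.0, m=m, C=C)

print(round(few_votes, 3), round(many_votes, 3))
```

With these numbers the 60-vote movie lands near 7.64 while the 5000-vote movie stays near 7.98, so the better-supported rating ranks higher even though its raw average is lower.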

In [26]:
# Keep only movies that have rating data
metadata = metadata[metadata["vote_average"].notna()]

# C: mean vote across the whole catalog (computed before the vote-count filter)
C = metadata["vote_average"].mean()

# m: minimum number of votes required, here the 80th percentile of vote counts
m = metadata["vote_count"].astype(int).quantile(0.80)

# Keep only movies with more votes than the threshold.
# .copy() gives an independent DataFrame, so adding a column below is safe.
metadata = metadata[metadata["vote_count"] > m].copy()

# v: number of votes, R: average rating, both taken from the filtered movies
v = metadata["vote_count"]
R = metadata["vote_average"]

# Weighted rating (WR) for each movie using the IMDB formula
metadata["WR"] = (v / (v + m)) * R + (m / (v + m)) * C

metadata.shape

(9048, 25)
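
The effect of the percentile threshold is easier to see on a toy table. The sketch below uses a hypothetical five-row DataFrame (`toy`, with invented titles and votes) and the same `quantile(0.80)` cut as above; it is an illustration, not part of the pipeline.

```python
import pandas as pd

# Hypothetical catalog: movie "A" has a high average but very few votes
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_count": [10, 20, 30, 40, 1000],
    "vote_average": [9.5, 5.0, 6.0, 7.0, 8.0],
})

# 80th percentile of vote counts (pandas interpolates linearly by default)
m_toy = toy["vote_count"].quantile(0.80)

# Strictly-greater filter, as in the cell above
kept = toy[toy["vote_count"] > m_toy]

print(m_toy)
print(kept["title"].tolist())
```

Only the heavily voted movie survives here; the 9.5-rated movie with 10 votes is dropped, which is the intended behaviour of the threshold.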

In [31]:
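# Likewise, keep only the keywords for the movies that survived the filter.
# metadata['id'] holds strings, so cast keywords['id'] to str before matching.
keywords['id'] = keywords['id'].astype(str)

# keywords.csv repeats some ids, so the result can have more rows than metadata
keywords = keywords[keywords['id'].isin(metadata['id'])]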

keywords.shape

(9137, 2)
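
The cast to `str` matters because `isin` compares values as they are, without converting between integers and strings. A minimal sketch with hypothetical ids (the Series names `keyword_ids` and `movie_ids` are made up for this example):

```python
import pandas as pd

# Hypothetical ids: integers on one side, strings on the other
keyword_ids = pd.Series([862, 8844, 15602])
movie_ids = pd.Series(["862", "15602"])

# Integer values never equal string values, so nothing matches
mismatched = keyword_ids.isin(movie_ids)

# After casting to str, the two shared ids are found
matched = keyword_ids.astype(str).isin(movie_ids)

print(mismatched.tolist())
print(matched.tolist())
```

Without the cast the filter would silently return an empty DataFrame rather than raise an error, so checking `.shape` afterwards is a cheap sanity check.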

In [34]:
# Save the filtered metadata and keywords DataFrames to CSV files
metadata.to_csv("metadata_filtered.csv", index=False)
keywords.to_csv("keywords_filtered.csv", index=False)
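
One thing to keep in mind when loading these files later: by default `pd.read_csv` infers numeric-looking ids as integers again. The sketch below, which uses an in-memory buffer and a hypothetical two-row DataFrame instead of the real files, shows how passing `dtype` keeps the ids as strings.

```python
import io
import pandas as pd

# Hypothetical stand-in for the filtered metadata
df = pd.DataFrame({"id": ["862", "8844"], "WR": [7.6, 6.9]})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df.to_csv(buf, index=False)

# Default parsing: the id column comes back as integers
buf.seek(0)
default = pd.read_csv(buf)

# Explicit dtype: the id column stays as strings
buf.seek(0)
as_str = pd.read_csv(buf, dtype={"id": str})

print(default["id"].tolist())
print(as_str["id"].tolist())
```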