## Sorting IMDB Top 250 Movies
In this notebook I try to work out how IMDb scores and ranks its movies.


# Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import math
from sklearn.preprocessing import MinMaxScaler

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.width", 500)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# Import Dataset
This section loads the movie data set and keeps only the title, average vote, and vote count columns.


In [3]:
df = pd.read_csv("/content/drive/MyDrive/Data_Sets/movies_metadata.csv", low_memory=False)
df = df[["title", "vote_average", "vote_count"]]
df.head()

Unnamed: 0,title,vote_average,vote_count
0,Toy Story,7.7,5415.0
1,Jumanji,6.9,2413.0
2,Grumpier Old Men,6.5,92.0
3,Waiting to Exhale,6.1,34.0
4,Father of the Bride Part II,5.7,173.0


# Sorting by Vote Average

Sorting by average vote alone puts movies with a perfect score of 10 at the top, but each of them has received only one vote.



In [4]:
df.sort_values("vote_average", ascending=False).head()

Unnamed: 0,title,vote_average,vote_count
21642,Ice Age Columbus: Who Were the First Americans?,10.0,1.0
15710,If God Is Willing and da Creek Don't Rise,10.0,1.0
22396,Meat the Truth,10.0,1.0
22395,Marvin Hamlisch: What He Did For Love,10.0,1.0
35343,Elaine Stritch: At Liberty,10.0,1.0


vote_count matters just as much here, because a ranking only makes sense for movies that have received at least a certain number of votes. To choose that minimum, we first look at the summary statistics of vote_count.

In [5]:
df["vote_count"].describe([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99, 1]).T

count   45460.00
mean      109.90
std       491.31
min         0.00
10%         1.00
20%         2.00
30%         4.00
40%         6.00
50%        10.00
60%        15.00
70%        25.00
80%        50.00
90%       160.00
95%       434.00
99%      2183.82
100%    14075.00
max     14075.00
Name: vote_count, dtype: float64

In [6]:
df[df["vote_count"] > 434 ].sort_values("vote_average", ascending=False).head()

Unnamed: 0,title,vote_average,vote_count
10309,Dilwale Dulhania Le Jayenge,9.1,661.0
40251,Your Name.,8.5,1030.0
314,The Shawshank Redemption,8.5,8358.0
834,The Godfather,8.5,6024.0
1176,Psycho,8.3,2405.0


Based on the statistics above, it makes more sense to rank only the movies above the 95th percentile of vote_count, that is, movies that have received more than 434 votes.
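The percentile cutoff can also be computed rather than read off the `describe` output. The sketch below uses made-up toy data and a hypothetical helper `filter_by_vote_quantile`; neither is part of the original notebook, and the 0.50 quantile is used only because the toy frame has five rows.

```python
import pandas as pd

# Hypothetical toy data; titles and numbers are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "vote_average": [10.0, 8.5, 7.9, 9.1, 6.0],
    "vote_count": [1, 8000, 1200, 3, 500],
})

def filter_by_vote_quantile(frame, q=0.95):
    """Keep rows whose vote_count is strictly above the q-th quantile."""
    threshold = frame["vote_count"].quantile(q)
    return frame[frame["vote_count"] > threshold], threshold

# Use the median as the cutoff for this tiny frame.
popular, threshold = filter_by_vote_quantile(toy, q=0.50)
ranked = popular.sort_values("vote_average", ascending=False)
print(threshold)
print(ranked)
```

On the real data, calling the helper with `q=0.95` should give the same cutoff of 434 votes that was typed in by hand above.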

In [7]:
df["vote_count_score"] = MinMaxScaler(feature_range=(1,10)).fit(df[["vote_count"]]).transform(df[["vote_count"]])

In [10]:
df.sort_values("vote_count_score", ascending=False).head()

Unnamed: 0,title,vote_average,vote_count,vote_count_score
15480,Inception,8.1,14075.0,10.0
12481,The Dark Knight,8.3,12269.0,8.85
14551,Avatar,7.2,12114.0,8.75
17818,The Avengers,7.4,12000.0,8.67
26564,Deadpool,7.4,11444.0,8.32


Sorting by the scaled vote count improves the ranking slightly, but it is not enough on its own because it ignores the ratings themselves. Now that vote_count_score is on a 1 to 10 scale comparable to vote_average, multiplying the two should bring the ranking a little closer to reality.

In [11]:
df["average_count_score"] = df["vote_average"] * df["vote_count_score"]

In [12]:
df.sort_values("average_count_score", ascending=False).head()

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score
15480,Inception,8.1,14075.0,10.0,81.0
12481,The Dark Knight,8.3,12269.0,8.85,73.42
22879,Interstellar,8.1,11187.0,8.15,66.04
17818,The Avengers,7.4,12000.0,8.67,64.18
14551,Avatar,7.2,12114.0,8.75,62.97
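To make the scaling step less of a black box, here is a minimal sketch of the same min-max mapping done by hand. The helper `scale_to_range` is a hypothetical name introduced for illustration; the input numbers (minimum 0, maximum 14075, and The Dark Knight's 12269 votes and 8.3 average) are taken from the outputs above.

```python
import pandas as pd

def scale_to_range(values, low=1.0, high=10.0):
    """Linearly map values so that the minimum becomes low and the maximum becomes high."""
    vmin, vmax = values.min(), values.max()
    return low + (values - vmin) / (vmax - vmin) * (high - low)

# Minimum vote count, The Dark Knight, and the maximum vote count (Inception).
votes = pd.Series([0.0, 12269.0, 14075.0])
scores = scale_to_range(votes)

# Multiply The Dark Knight's average vote by its scaled vote count.
dark_knight_product = 8.3 * scores.iloc[1]
print(scores.round(2).tolist())
print(round(dark_knight_product, 2))
```

The scaled value for 12269 votes comes out at about 8.85 and the product at about 73.42, in line with the table above.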


# Weighted Rating

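Until 2015, IMDb published the formula it used for its ranking. How the ranking has been calculated since then is not known, but we will build a close approximation to it later. For now, let's look at the pre-2015 formula:

weighted_rating = $\frac{v}{v+M} \cdot r + \frac{M}{v+M} \cdot C$

Here r is a movie's average vote, v is its number of votes, M is a vote-count threshold (the minimum number of votes required to be listed), and C is the mean vote across all movies.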


# IMDB Weighted Rating

In [13]:
M = 2500
C = df["vote_average"].mean()
def imdb_weighted_rating(r, v, M, C):
  # r: average vote, v: vote count, M: vote threshold, C: mean vote over all movies
  return (v / (v + M) * r) + (M / (v + M) * C)

df["imdb_weighted_rating"] = imdb_weighted_rating(df["vote_average"],df["vote_count"],M,C)

In [14]:
df.sort_values("imdb_weighted_rating", ascending=False).head()

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score,imdb_weighted_rating
12481,The Dark Knight,8.3,12269.0,8.85,73.42,7.85
314,The Shawshank Redemption,8.5,8358.0,6.34,53.93,7.84
2843,Fight Club,8.3,9678.0,7.19,59.66,7.75
15480,Inception,8.1,14075.0,10.0,81.0,7.73
292,Pulp Fiction,8.3,8670.0,6.54,54.31,7.7
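A small numeric sketch shows why this formula fixes the one-vote problem from earlier: the fewer votes a movie has, the more its score is pulled toward the global mean C. The value `C_ASSUMED = 5.6` is an assumption chosen for illustration, not the mean computed from the data set, and `weighted_rating` is a standalone copy of the formula.

```python
def weighted_rating(r, v, M, C):
    """Blend a movie's own average r (based on v votes) with the global mean C."""
    return (v / (v + M)) * r + (M / (v + M)) * C

C_ASSUMED = 5.6  # assumption: illustrative global mean, not taken from the data
M = 2500         # same threshold as in the cell above

# A perfect 10 from a single vote barely moves away from the global mean.
few_votes = weighted_rating(10.0, 1, M, C_ASSUMED)

# An 8.3 backed by 12269 votes keeps most of its own rating.
many_votes = weighted_rating(8.3, 12269, M, C_ASSUMED)

print(round(few_votes, 2), round(many_votes, 2))
```

With zero votes the function returns C itself, and as v grows far beyond M the result approaches r, so M controls how much evidence a movie needs before its own rating dominates.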


# Bayesian Average Rating Score

Why do we need another method? IMDb has changed its approach since 2015, so the ranking currently shown on the IMDb website differs from the one we just produced. IMDb may be using a Bayesian method, so we will try to adapt the Bayesian average rating ourselves. The data set loaded below contains, for each movie, the number of votes at each star level from one to ten.

In [15]:
df = pd.read_csv("/content/drive/MyDrive/Data_Sets/imdb_ratings.csv")
df = df.iloc[:, 1:]
df.head()

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318


In [16]:
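def bayesian_average_rating(n, confidence=0.95):
    # n: vote counts per star level, ordered from one star to ten stars
    if sum(n) == 0:
        return 0
    K = len(n)  # number of star levels
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # two-sided z-score for the confidence level
    N = sum(n)  # total number of votes
    first_part = 0.0   # expected rating under the smoothed vote distribution
    second_part = 0.0  # expected squared rating
    for k, n_k in enumerate(n):
        # use n_k rather than n[k]: integer indexing on a label-indexed Series is deprecated in pandas
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    # lower bound of the interval: expected rating minus z standard errors
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score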

In [17]:
df["bar_score"] = df.apply(lambda x: bayesian_average_rating(x[["one", "two", "three", "four", "five",
                                                                "six", "seven", "eight", "nine", "ten"]]), axis=1)

In [18]:
df.sort_values("bar_score", ascending=False).head(10)

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one,bar_score
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733,9.15
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128,8.94
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345,8.9
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469,8.81
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318,8.77
6,167260,7. The Lord of the Rings: The Return of ...,8.9,703093,433087,270113,117411,44760,21818,10873,7987,6554,28990,8.75
5,108052,6. Schindler's List (1993),8.9,453906,383584,220586,82367,27219,12922,6234,4572,4289,19328,8.74
11,109830,12. Forrest Gump (1994),8.8,622104,553654,373644,151284,51140,22720,11692,7647,5941,12110,8.7
12,1375666,13. Inception (2010),8.7,724798,627987,408686,174229,60668,26910,13436,8703,6932,17621,8.69
10,137523,11. Fight Club (1999),8.8,637087,572654,371752,152295,53059,24755,12648,8606,6948,17435,8.67
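The effect of the vote distribution on this score is easier to see with a compact, vectorized version of the same calculation. This is a sketch, not the notebook's own function: `bar_score` is a hypothetical name, and the `small` and `large` vote distributions are invented. Only the Shawshank counts are copied from the table above.

```python
import math
import numpy as np
from scipy import stats

def bar_score(counts, confidence=0.95):
    """Lower bound on the expected star rating; counts are ordered one star .. K stars."""
    counts = np.asarray(counts, dtype=float)
    N, K = counts.sum(), len(counts)
    if N == 0:
        return 0.0
    stars = np.arange(1, K + 1)
    probs = (counts + 1) / (N + K)          # add-one smoothed share of each star level
    mean = (stars * probs).sum()            # expected rating
    second = (stars ** 2 * probs).sum()     # expected squared rating
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return mean - z * math.sqrt((second - mean ** 2) / (N + K + 1))

# Invented distributions with identical proportions but different sample sizes.
small = [0, 0, 0, 0, 0, 0, 0, 1, 2, 7]
large = [c * 1000 for c in small]

# Vote counts for The Shawshank Redemption, ordered one..ten, from the table above.
shawshank = [34733, 4355, 4704, 6561, 13515, 26184, 87368, 273091, 600284, 1295382]

print(round(bar_score(small), 2), round(bar_score(large), 2))
print(round(bar_score(shawshank), 2))
```

The two invented movies have the same share of votes at each star level, yet the one with more votes scores higher because its interval is narrower. The Shawshank score comes out at about 9.15, in line with the table above.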
