IMDb publishes weighted vote averages rather than raw averages. All user votes are accepted and considered, but not every vote has the same influence (or 'weight') on the final rating.

When unusual voting activity is detected, IMDb may apply an alternate weighting calculation to preserve the reliability of its system. To keep the rating mechanism effective, IMDb does not disclose the exact method used to compute the rating.

### Explanation of IMDb Weighted Rating Formula

The weighted rating formula associated with IMDb's Top 250 list balances a movie's own average rating against the average rating across all movies. Here's a breakdown of the formula:


- ***Weighted_Rating = (v/(v+M) * r) + (M/(v+M) * C)***

Where:
- \( v \) is the number of votes for the movie.
- \( M \) is the minimum number of votes required to be listed in the Top 250.
- \( r \) is the average rating of the movie.
- \( C \) is the mean vote across the whole report (the average rating across all movies).

### Term-by-Term Explanation

1. **(v/(v+M) * r)**:
   - This part of the formula gives more weight to the movie's own rating (\( r \)) if it has received a large number of votes (\( v \)). The fraction \( \frac{v}{v+M} \) determines how much weight is given to the movie's own average rating.

2. **(M/(v+M) * C)**:
   - This part of the formula adds a baseline of the overall average rating (\( C \)) into the calculation. The fraction \( \frac{M}{v+M} \) determines how much weight is given to the average rating across all movies. This ensures that movies with very few votes don't have disproportionately high or low ratings affecting the overall ranking.

### Key Points

- **Balancing Act**: The formula is a weighted average that balances the specific movie's rating with the general average rating across all movies. This approach ensures that a movie with a small number of votes doesn't disproportionately affect the ranking due to extreme ratings.
  
- **Minimum Votes (M)**: The inclusion of \( M \) (minimum number of votes) helps to mitigate the impact of movies with only a few votes. Movies need to have a certain level of engagement (votes) to be compared fairly against other movies.

- **Weighted Influence**: The more votes a movie has (\( v \)), the more influence its own rating (\( r \)) has on the final weighted rating. Conversely, if a movie has fewer votes, the general average rating (\( C \)) has more influence.

This method of calculating ratings helps IMDb maintain a fair and balanced rating system, ensuring that highly rated movies with only a few votes don't unjustly rank higher than movies with a more substantial and representative number of votes.
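
One way to see the balance described above is to rewrite the formula so that the pull toward the overall average is explicit. This is only an algebraic rearrangement of the same expression, using the symbols already defined:

\[
\text{Weighted\_Rating} = \frac{v}{v+M}\, r + \frac{M}{v+M}\, C = C + \frac{v}{v+M}\,(r - C)
\]

With \( v = 0 \) the result is \( C \); with \( v = M \) it is the midpoint \( (r + C)/2 \); and as \( v \) grows far beyond \( M \) it approaches \( r \).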


### Importing Libraries & Modules & Dataset

In [1]:
import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

df = pd.read_csv("movies_metadata.csv",
                 low_memory=False)  # read the file in one pass to avoid the mixed-type DtypeWarning

In [2]:
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


### Data Understanding and Preparing

In [3]:
## Select the necessary columns

df = df[["title", "vote_average", "vote_count"]]

df.head()

Unnamed: 0,title,vote_average,vote_count
0,Toy Story,7.7,5415.0
1,Jumanji,6.9,2413.0
2,Grumpier Old Men,6.5,92.0
3,Waiting to Exhale,6.1,34.0
4,Father of the Bride Part II,5.7,173.0


In [4]:
# How many movies are in the data set?

df.shape

(45466, 3)

In [5]:
## Sorting by Average of Votes

df.sort_values("vote_average", ascending=False).head(20)

Unnamed: 0,title,vote_average,vote_count
21642,Ice Age Columbus: Who Were the First Americans?,10.0,1.0
15710,If God Is Willing and da Creek Don't Rise,10.0,1.0
22396,Meat the Truth,10.0,1.0
22395,Marvin Hamlisch: What He Did For Love,10.0,1.0
35343,Elaine Stritch: At Liberty,10.0,1.0
186,Reckless,10.0,1.0
45047,The Human Surge,10.0,1.0
22377,The Guide,10.0,1.0
22346,هیچ کجا هیچ کس,10.0,1.0
1634,Other Voices Other Rooms,10.0,1.0


Sorting by average rating alone is not acceptable: every movie at the top has a perfect 10.0 based on a single vote. To improve this ranking, we should establish a minimum threshold for vote counts.

In [6]:
# First filter on vote_count, then rank by average rating.

# Start by inspecting the distribution of vote_count with describe()

df["vote_count"].describe([0.10, 0.25, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99]).T

count   45460.00000
mean      109.89734
std       491.31037
min         0.00000
10%         1.00000
25%         3.00000
50%        10.00000
70%        25.00000
80%        50.00000
90%       160.00000
95%       434.00000
99%      2183.82000
max     14075.00000
Name: vote_count, dtype: float64

The mean vote count is about 110, while the median is only 10, so the distribution is heavily skewed. Given the large dataset and our goal of identifying the top 250 movies, we set the minimum vote count at 400, close to the 95th percentile (434), so that only roughly the most-voted 5% of movies remain.

In [7]:
# Sort by average rating, keeping only movies with more than 400 votes

df[df["vote_count"] > 400].sort_values("vote_average", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count
10309,Dilwale Dulhania Le Jayenge,9.1,661.0
40251,Your Name.,8.5,1030.0
834,The Godfather,8.5,6024.0
314,The Shawshank Redemption,8.5,8358.0
1152,One Flew Over the Cuckoo's Nest,8.3,3001.0
1176,Psycho,8.3,2405.0
1178,The Godfather: Part II,8.3,3418.0
292,Pulp Fiction,8.3,8670.0
1184,Once Upon a Time in America,8.3,1104.0
5481,Spirited Away,8.3,3968.0


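The outcome is still unsatisfactory: a movie with 661 votes ranks above titles with thousands of votes, because the vote count plays no role once the cutoff is passed. Considering both the average rating and the number of votes when sorting might yield better results.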

In [8]:
## Scale vote_count to the range 1-10

df["vote_count_score"] = MinMaxScaler(feature_range=(1, 10)). \
    fit(df[["vote_count"]]). \
    transform(df[["vote_count"]])

In [9]:
df.head(10)

Unnamed: 0,title,vote_average,vote_count,vote_count_score
0,Toy Story,7.7,5415.0,4.46252
1,Jumanji,6.9,2413.0,2.54295
2,Grumpier Old Men,6.5,92.0,1.05883
3,Waiting to Exhale,6.1,34.0,1.02174
4,Father of the Bride Part II,5.7,173.0,1.11062
5,Heat,7.7,1886.0,2.20597
6,Sabrina,6.2,141.0,1.09016
7,Tom and Huck,5.4,45.0,1.02877
8,Sudden Death,5.5,174.0,1.11126
9,GoldenEye,6.6,1194.0,1.76348
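
The vote_count_score values above are consistent with the standard min-max formula, score = 1 + 9 * (x - min) / (max - min). The sketch below reproduces three of them in plain Python, taking min = 0 and max = 14075 from the describe() output. The helper name `min_max_scale` is hypothetical, and this only illustrates the arithmetic; it is not a replacement for MinMaxScaler.

```python
def min_max_scale(x, x_min, x_max, lo=1.0, hi=10.0):
    """Linearly map x from [x_min, x_max] onto [lo, hi]."""
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)


# min and max of vote_count, taken from the describe() output
VOTE_MIN, VOTE_MAX = 0.0, 14075.0

toy_story = min_max_scale(5415.0, VOTE_MIN, VOTE_MAX)
jumanji = min_max_scale(2413.0, VOTE_MIN, VOTE_MAX)
inception = min_max_scale(14075.0, VOTE_MIN, VOTE_MAX)

print(round(toy_story, 5), round(jumanji, 5), round(inception, 5))
# 4.46252 2.54295 10.0
```

Because the mapping is linear in vote_count, a single heavily voted movie (the maximum) stretches the scale and leaves most movies close to 1.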


In [10]:
# Create a score column named average_count_score (vote_average * vote_count_score)

df["average_count_score"] = df["vote_average"] * df["vote_count_score"]

In [11]:
## Sort by average_count_score

df.sort_values("average_count_score", ascending=False).head(20)

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score
15480,Inception,8.1,14075.0,10.0,81.0
12481,The Dark Knight,8.3,12269.0,8.84519,73.41505
22879,Interstellar,8.1,11187.0,8.15332,66.0419
17818,The Avengers,7.4,12000.0,8.67318,64.18153
14551,Avatar,7.2,12114.0,8.74607,62.97174
26564,Deadpool,7.4,11444.0,8.31766,61.55065
2843,Fight Club,8.3,9678.0,7.18842,59.66388
20051,Django Unchained,7.8,10297.0,7.58423,59.15697
23753,Guardians of the Galaxy,7.9,10014.0,7.40327,58.48582
292,Pulp Fiction,8.3,8670.0,6.54387,54.31414


### IMDB Weighted Rating


- **weighted_rating** = (v/(v+M) * r) + (M/(v+M) * C)
- **r** = vote average
- **v** = vote count
- **M** = minimum votes required to be listed in the Top 250
- **C** = the mean vote across the whole report (currently 7.0)

##### Understanding the formula

**(v/(v+M) * r)**

Film 1: r = 8, M = 500, v = 1000

(1000 / (1000 + 500)) * 8 = 5.33

Film 2: r = 8, M = 500, v = 3000

(3000 / (3000 + 500)) * 8 = 6.86

The further a movie's vote count exceeds the required minimum, the smaller the correction applied to its own average.

**(M/(v+M) * C)**, with C = 7

Film 1: r = 8, M = 500, v = 1000

(500 / (1000 + 500)) * 7 = 2.33

Film 2: r = 8, M = 500, v = 3000

(500 / (3000 + 500)) * 7 = 1.00

#### Calculating the formula

Film 1: weighted_rating = 5.333 + 2.333 ≈ 7.67

Film 2: weighted_rating = 6.857 + 1.000 ≈ 7.86
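
The Film 1 and Film 2 arithmetic can be checked with a few lines of plain Python. This is a minimal sketch using the example's own values (r = 8, M = 500, C = 7); it shows that two films with the same average end up with different weighted ratings purely because of their vote counts.

```python
def weighted_rating(r, v, M, C):
    """Blend the movie's own mean r with the global mean C, weighted by vote count v."""
    return (v / (v + M)) * r + (M / (v + M)) * C


# Values from the worked example: same average, different vote counts
M, C = 500, 7.0
film_1 = weighted_rating(r=8, v=1000, M=M, C=C)
film_2 = weighted_rating(r=8, v=3000, M=M, C=C)

print(round(film_1, 2), round(film_2, 2))  # 7.67 7.86
```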



#### Creating a function

In [17]:
M = 2500
C = df['vote_average'].mean()

def weighted_rating(r, v, M, C):
    return (v / (v + M) * r) + (M / (v + M) * C)

In [18]:
## Create a weighted_rating column in the dataframe

df["weighted_rating"] = weighted_rating(df["vote_average"],
                                        df["vote_count"], M, C)


## Sort the weighted_rating

df.sort_values("weighted_rating", ascending=False).head(10)

Unnamed: 0,title,vote_average,vote_count,vote_count_score,average_count_score,weighted_rating
12481,The Dark Knight,8.3,12269.0,8.84519,73.41505,7.84604
314,The Shawshank Redemption,8.5,8358.0,6.34437,53.92714,7.83648
2843,Fight Club,8.3,9678.0,7.18842,59.66388,7.74946
15480,Inception,8.1,14075.0,10.0,81.0,7.72567
292,Pulp Fiction,8.3,8670.0,6.54387,54.31414,7.69978
834,The Godfather,8.5,6024.0,4.85194,41.24146,7.6548
22879,Interstellar,8.1,11187.0,8.15332,66.0419,7.64669
351,Forrest Gump,8.2,8147.0,6.20945,50.91748,7.59377
7000,The Lord of the Rings: The Return of the King,8.1,8226.0,6.25996,50.70571,7.52155
4863,The Lord of the Rings: The Fellowship of the Ring,8.0,8892.0,6.68583,53.48661,7.47731


This ranking is based on IMDb's own formula.

### Bayesian Average Rating Score

In [19]:
## Score movies from their star-rating distribution (vote counts per star level) using the Bayesian average rating (BAR) score

def bayesian_average_rating(n, confidence=0.95):
    if sum(n) == 0:
        return 0
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0
    second_part = 0.0
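    for k, n_k in enumerate(n):
        # n_k is the vote count for star level k + 1; using it avoids positional n[k] lookups on a label-indexed Series
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)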
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score
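
Read term by term, the score returned by `bayesian_average_rating` can be written as follows, where \( n_k \) is the number of votes for star level \( k \), \( K \) is the number of star levels, \( N = \sum_k n_k \), and \( z \) is the normal quantile computed from `confidence` (about 1.96 for 0.95):

\[
\text{score} = \sum_{k=1}^{K} k\,\frac{n_k + 1}{N + K} \;-\; z \sqrt{\frac{\displaystyle\sum_{k=1}^{K} k^2\,\frac{n_k + 1}{N + K} - \left(\sum_{k=1}^{K} k\,\frac{n_k + 1}{N + K}\right)^{2}}{N + K + 1}}
\]

The first sum is a smoothed mean rating with one extra vote added to each star level, and the subtracted term shrinks as \( N \) grows. Interpreting the result as the lower bound of an interval around the mean rating is an assumption here; the notebook itself does not state it.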

In [23]:
## Read the IMDb ratings dataset (vote counts for each rating from one to ten)

df1 = pd.read_csv("imdb_ratings.csv")
df1 = df1.iloc[0:, 1:]
df1.head()

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318


In [24]:
df1["bar_score"] = df1.apply(lambda x: bayesian_average_rating(x[["one", "two", "three", "four", "five",
                                                                "six", "seven", "eight", "nine", "ten"]]), axis=1)

df1.sort_values("bar_score", ascending=False).head(20)

Unnamed: 0,id,movieName,rating,ten,nine,eight,seven,six,five,four,three,two,one,bar_score
0,111161,1. The Shawshank Redemption (1994),9.2,1295382,600284,273091,87368,26184,13515,6561,4704,4355,34733,9.14539
1,68646,2. The Godfather (1972),9.1,837932,402527,199440,78541,30016,16603,8419,6268,5879,37128,8.94002
3,468569,4. The Dark Knight (2008),9.0,1034863,649123,354610,137748,49483,23237,11429,8082,7173,30345,8.89596
2,71562,3. The Godfather: Part II (1974),9.0,486356,324905,175507,70847,26349,12657,6210,4347,3892,20469,8.8125
4,50083,5. 12 Angry Men (1957),8.9,246765,225437,133998,48341,15773,6278,2866,1723,1478,8318,8.76793
6,167260,7. The Lord of the Rings: The Return of ...,8.9,703093,433087,270113,117411,44760,21818,10873,7987,6554,28990,8.75204
5,108052,6. Schindler's List (1993),8.9,453906,383584,220586,82367,27219,12922,6234,4572,4289,19328,8.74361
11,109830,12. Forrest Gump (1994),8.8,622104,553654,373644,151284,51140,22720,11692,7647,5941,12110,8.69915
12,1375666,13. Inception (2010),8.7,724798,627987,408686,174229,60668,26910,13436,8703,6932,17621,8.69315
10,137523,11. Fight Club (1999),8.8,637087,572654,371752,152295,53059,24755,12648,8606,6948,17435,8.67448


In [25]:
# Weighted Average Ratings
# IMDb publishes weighted vote averages rather than raw data averages.
# The simplest way to explain it is that although we accept and consider all votes received by users,
# not all votes have the same impact (or ‘weight’) on the final rating.

# When unusual voting activity is detected,
# an alternate weighting calculation may be applied in order to preserve the reliability of our system.
# To ensure that our rating mechanism remains effective,
# we do not disclose the exact method used to generate the rating.
#
# See also the complete FAQ for IMDb ratings.