Link to Kaggle notebook: www.kaggle.com/code/junginger/imdb-ratings/

This dataset contains data for IMDB's 1000 top movies/TV shows. The following features are included:
* Poster_Link
* Series_Title
* Released_Year
* Certificate
* Runtime
* Genre
* IMDB_Rating
* Overview
* Meta_score
* Director
* Star1
* Star2
* Star3
* Star4
* No_of_Votes
* Gross

When first exploring the dataset I created a correlation matrix of the numerical features and noticed a modest correlation between metascores and IMDB ratings. Let's have a closer look at the relationship between these two features:

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

# Import data
df = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")
df['IMDB_Rating'] = df['IMDB_Rating'] * 10 # Rescale 0-10 ratings to the 0-100 range of Meta_score

# Clean data
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')
df = df.dropna(subset=['Released_Year', 'Runtime', 'Gross', 'No_of_Votes','IMDB_Rating', 'Meta_score']).copy() # Remove NaNs; copy so later assignments don't hit a view
df['Gross'] = df['Gross'].str.replace(',', '').astype(float) # Cast Gross to float ($)
df['Runtime'] = df['Runtime'].str.extract(r'(\d+)', expand=False).astype(int) # Cast Runtime to int (min)

# Calculate correlation
correlation = df['IMDB_Rating'].corr(df['Meta_score'])

# Scatterplot
plt.figure(figsize=(8, 5))
plt.scatter(df['IMDB_Rating'], df['Meta_score'], alpha=0.2)
plt.xlabel('IMDB_Rating')
plt.ylabel('Meta_score')
plt.title(f'Ratings Comparison (n={df.shape[0]})')
plt.text(0.75, 0.35, f'R = {correlation:.3f}', transform=plt.gca().transAxes, fontsize=12)
plt.show()
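
As a side note on the correlation value shown in the plot, here is a minimal sketch of what `Series.corr` computes by default (Pearson's r), written out by hand. The four rows in `toy` are made-up numbers, not values from the dataset, and the names `r_manual`, `r_pandas` and `r_unscaled` are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "IMDB_Rating": [7.6, 8.0, 8.4, 9.0],
    "Meta_score": [70, 85, 80, 96],
})

# Correlation on the original 0-10 scale
r_unscaled = toy["IMDB_Rating"].corr(toy["Meta_score"])

# Same rescaling as in the notebook: 0-10 -> 0-100
toy["IMDB_Rating"] = toy["IMDB_Rating"] * 10

# Pearson's r by hand: covariance term divided by the product of the spreads
x = toy["IMDB_Rating"] - toy["IMDB_Rating"].mean()
y = toy["Meta_score"] - toy["Meta_score"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas' default method is also Pearson
r_pandas = toy["IMDB_Rating"].corr(toy["Meta_score"])

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}, unscaled: {r_unscaled:.4f}")
```

The three values agree, which shows that multiplying the IMDB rating by 10 does not affect the correlation; the rescaling only matters once the two scores are subtracted from each other.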

The main difference between the two rating systems is that Metacritic collects reviews from **established critics and publications**, assigns each review a score and aggregates them, whereas IMDb ratings come from registered users of the IMDb website and are **audience-based**.

From the plot above we can see that most films receive similar ratings on both scales, though a significant portion do not. Next I compute the delta between the two scores and plot a histogram. The resulting distribution is skewed.

In [None]:
df['delta'] = df['Meta_score'] - df['IMDB_Rating']
plt.hist(df['delta'], bins=30, color='royalblue')
plt.title('Histogram of Delta (Meta_score - IMDB_Rating)')
plt.show()
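
One possible way to put a number on the skew, rather than reading it off the histogram, is the sample skewness from `Series.skew()` together with the gap between mean and median. The sketch below uses a made-up series `toy_delta`, so it says nothing about which way the real data leans; on the notebook's data the same two lines could be applied to `df['delta']`.

```python
import pandas as pd

# Hypothetical deltas (Meta_score - IMDB_Rating), not taken from the dataset:
# most values sit near zero, with a couple of strongly negative ones.
toy_delta = pd.Series([-30, -22, -5, -2, 0, 1, 2, 3, 4, 5], name="delta")

# Sample skewness: negative means a longer left tail, positive a longer right tail
skewness = toy_delta.skew()

# A mean below the median points the same way as a negative skewness
mean_minus_median = toy_delta.mean() - toy_delta.median()

print(f"skewness = {skewness:.2f}")
print(f"mean - median = {mean_minus_median:.2f}")
```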

Let's also glance at the films with the largest discrepancies in each direction as a sanity check.

In [None]:
selected_features = ['Meta_score', 'IMDB_Rating', 'Series_Title', 'Released_Year'] 
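sorted_df = df.sort_values(by='delta', ascending=True)
top_10 = sorted_df.head(10) # Lowest delta: IMDB rating furthest above Meta score
print('\nTop 10 discrepancies (lowest delta)')
print(top_10[selected_features])
print('\nBottom 10 discrepancies (highest delta)')
bot_10 = sorted_df.tail(10) # Highest delta: Meta score furthest above IMDB rating
print(bot_10[selected_features])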

Let's output the correlation of each remaining numerical feature with delta. The strongest correlation is a negative relationship with release year.

In [None]:
X = df.select_dtypes(include=['number'])
X = X.drop(columns=['IMDB_Rating', 'Meta_score'])
print('Correlation vs. Delta')
print(X.corr()['delta'])
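
As an aside, the same "feature vs. delta" numbers can be obtained without building the full matrix by using `DataFrame.corrwith`. This is only a sketch of an alternative, not what the cell above does; the `toy` frame holds made-up values, and `corr_vs_delta` and `ranked` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Released_Year": [1950, 1970, 1990, 2010],
    "Runtime": [100, 150, 120, 140],
    "delta": [10.0, 5.0, -5.0, -10.0],
})

# Pearson correlation of every other numeric column with delta
corr_vs_delta = toy.drop(columns="delta").corrwith(toy["delta"])

# Order by absolute strength so the strongest relationship comes first
order = corr_vs_delta.abs().sort_values(ascending=False).index
ranked = corr_vs_delta.reindex(order)

print(ranked)
```

Sorting by absolute value keeps strong negative relationships at the top instead of burying them at the bottom of the list.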

Next, delta is plotted against the remaining numerical features.

Everything looks fairly even. We can see the negative correlation with release year and can even draw an insight that **most movies released before ~1970 have higher IMDB ratings**. At first glance this negative trend continues into the 2000s.

In [None]:
# Scatterplots of delta vs numerical features
features = [col for col in X.columns if col != 'delta']
num_features = len(features)
cols = 4 
rows = (num_features // cols) + int(num_features % cols > 0)
fig, axs = plt.subplots(rows, cols, figsize=(15, 5 * rows))
axs = axs.flatten()
for i, feature in enumerate(features):
    axs[i].scatter(df[feature], df['delta'], alpha=0.5)
    axs[i].set_title(f'delta vs {feature}')
    axs[i].set_xlabel(feature)
    axs[i].set_ylabel('delta')
for i in range(num_features, rows * cols):
    axs[i].axis('off')
plt.tight_layout()
plt.show()

To get a closer look at the low-delta samples, I ran k-means clustering with 3 groups using only the delta feature. 

The scatterplot and descriptive stats show the dataset was split into roughly the following groups:
* Cluster 0 - Films very underrated by critics
* Cluster 1 - Films with similar ratings between critics and viewers
* Cluster 2 - Films somewhat underrated by critics

Cluster 0 contains 123 films with a median delta of -21.7 and a max of -14 (compared to cluster 1's median of 8.5 and min of 2).

The silhouette score measures how well each point fits its own cluster compared with the nearest other cluster; it ranges from -1 (worst) to 1 (best separation).
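
To give a feel for how the silhouette score behaves, here is a small sketch on two made-up 1-D arrays standing in for delta: `tight` has three well-separated groups and `spread` has values scattered fairly evenly. The helper `kmeans_silhouette` and both arrays are hypothetical and exist only for this illustration; the K-means settings mirror the ones used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 1-D values (not from the dataset), shaped (n_samples, 1) for scikit-learn
tight = np.array([-20.0, -19.0, -21.0, 0.0, 1.0, -1.0, 20.0, 19.0, 21.0]).reshape(-1, 1)
spread = np.array([-20.0, -12.0, -6.0, -2.0, 1.0, 4.0, 9.0, 14.0, 21.0]).reshape(-1, 1)

def kmeans_silhouette(values, k=3):
    """Cluster the values into k groups and return the mean silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(values)
    return silhouette_score(values, labels)

tight_score = kmeans_silhouette(tight)
spread_score = kmeans_silhouette(spread)

print(f"well-separated groups: {tight_score:.3f}")
print(f"evenly spread values:  {spread_score:.3f}")
```

The well-separated array scores close to 1, while the evenly spread one scores noticeably lower. A moderate score on a single continuous feature like delta is therefore not surprising, since K-means is cutting a continuum into bands rather than finding isolated groups.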

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Select features and run K-means
selected_features = ['delta']
X = df[selected_features].copy()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Plot each cluster
X['cluster'] = labels
plt.figure(figsize=(8, 5))
colors = plt.cm.rainbow(np.linspace(0, 1, kmeans.n_clusters)) # Define colormap

for i, color in enumerate(colors):
    subset = df[X['cluster'] == i]
    plt.scatter(subset['IMDB_Rating'], subset['Meta_score'], c=[color], label=f'Cluster {i}')

sil_score = silhouette_score(X[selected_features], labels)
plt.text(0.60, 0.30, f'Silhouette Score = {sil_score:.3f}', transform=plt.gca().transAxes, fontsize=12)
plt.legend()
plt.show()

# Descriptive stats on delta per cluster
cluster_stats = X.groupby('cluster').agg({
    'delta': ['count', 'mean', 'std', 'median', 'max', 'min']
})

print('\nDescriptive statistics on delta per cluster')
print(cluster_stats)

Next let's break down cluster 0 by genre and see if any genres are over- or underrepresented. Initially I plotted this information in a bar graph, but it seems better suited to a table.

Note: Genres are grouped in the original dataset (i.e. movies have multiple genres), which produces too many categories, so I handled this by one-hot encoding the unique genres.
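
The splitting step can be sketched on a tiny example. The three rows in `toy` are made up (the titles are placeholders), and `genre_flags` and `genre_counts` are illustrative names; the call itself, `Series.str.get_dummies(sep=', ')`, is the same one used in the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; titles are placeholders
toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Genre": ["Crime, Drama", "Drama", "Action, Crime, Drama"],
})

# Split each comma-separated string and create one 0/1 column per unique genre
genre_flags = toy["Genre"].str.get_dummies(sep=", ")

# Column sums give the number of films carrying each genre
genre_counts = genre_flags.sum()

print(genre_flags)
print(genre_counts)
```

Because a film with several genres is counted once per genre, the per-genre counts add up to more than the number of films, which is worth keeping in mind when reading the percentages in the summary table.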



In [None]:
# Create one-hot encodings for genres and count
df['cluster'] = X['cluster']
genre_split = df['Genre'].str.get_dummies(sep=', ')
df_onehot = pd.concat([df, genre_split], axis=1)
total_genre_counts = genre_split.sum()
cluster0_genre_counts = genre_split[df['cluster'] == 0].sum()
summary_df = pd.DataFrame({
    'Total_Films': total_genre_counts,
    'Cluster_0_Films': cluster0_genre_counts
}).fillna(0)

# Output summary
summary_df['Percentage'] = ((summary_df['Cluster_0_Films'] / summary_df['Total_Films']) * 100).round(2) # Calculate % of films in Cluster 0 for each genre
summary_df = summary_df.sort_values(by='Percentage', ascending=False) # Sort by %, highest first
print(f"\nTotal number of films in the dataset: {len(df)}")
print(f"Total number of films in Cluster 0: {len(df[df['cluster'] == 0])}")
print(f"Percentage of films in Cluster 0: {100*len(df[df['cluster'] == 0])/len(df):.2f}%\n")
print(summary_df)