Link to Kaggle notebook: www.kaggle.com/code/junginger/imdb-ratings/

This dataset contains data for the 1,000 top-rated movies and TV shows on IMDb. The following features are included:
* Poster_Link
* Series_Title
* Released_Year
* Certificate
* Runtime
* Genre
* IMDB_Rating
* Overview
* Meta_score
* Director
* Star1
* Star2
* Star3
* Star4
* No_of_Votes
* Gross

When first exploring the dataset, I created a correlation matrix of the numerical features and noticed a modest correlation between Metascores and IMDb ratings. Let's take a closer look at the relationship between these two features:

In [None]:
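import pandas as pd
import matplotlib.pyplot as plt

# Import data
df = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")

# Rescale IMDB_Rating from a 0-10 scale to 0-100 so it is comparable with Meta_score
df['IMDB_Rating'] = df['IMDB_Rating'] * 10

# Calculate the Pearson correlation (rows missing either value are ignored)
correlation = df['IMDB_Rating'].corr(df['Meta_score'])

# Count the rows that have both ratings, i.e. the points actually plotted
n_pairs = len(df[['IMDB_Rating', 'Meta_score']].dropna())

# Scatterplot
plt.figure(figsize=(8, 5))
plt.scatter(df['IMDB_Rating'], df['Meta_score'], alpha=0.2)
plt.xlabel('IMDB_Rating (rescaled to 0-100)')
plt.ylabel('Meta_score')
plt.title(f'Ratings Comparison (n={n_pairs})')
plt.text(0.75, 0.35, f'r = {correlation:.3f}', transform=plt.gca().transAxes, fontsize=12)
plt.show()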
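
As a sanity check on the correlation value annotated on the plot, here is a minimal sketch of what `Series.corr` computes by default (Pearson's r), and how it treats rows where one of the two scores is missing. The `toy` DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the IMDb dataset); one Meta_score is missing
toy = pd.DataFrame({
    "IMDB_Rating": [93.0, 88.0, 85.0, 80.0, 76.0],
    "Meta_score": [80.0, 94.0, np.nan, 66.0, 90.0],
})

# pandas computes Pearson's r by default and skips rows where either value is NaN
r_pandas = toy["IMDB_Rating"].corr(toy["Meta_score"])

# The same calculation by hand, using only the complete rows
complete = toy.dropna(subset=["IMDB_Rating", "Meta_score"])
x = complete["IMDB_Rating"] - complete["IMDB_Rating"].mean()
y = complete["Meta_score"] - complete["Meta_score"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

n_pairs = len(complete)
print(f"r (pandas) = {r_pandas:.3f}, r (manual) = {r_manual:.3f}, pairs used = {n_pairs}")
```

If the real data has gaps in `Meta_score`, the number of pairs behind the correlation is smaller than the number of rows in `df`, which is worth keeping in mind when reading the plot.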

The main difference between the two rating systems is their source: Metacritic aggregates reviews from **established critics and publications** and assigns each review a score, whereas IMDb ratings come from registered users of the IMDb website and are **audience-based**.

From the plot above we can see that most titles receive similar scores from both systems, though a significant portion do not. Let's look at the difference between the two scores to visualize this better.

In [None]:
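# Difference between critic and audience scores (both on a 0-100 scale)
df['delta'] = df['Meta_score'] - df['IMDB_Rating']
# Drop missing values so the histogram only sees finite numbers
plt.hist(df['delta'].dropna(), bins=30, color='royalblue')
plt.title('Histogram of Delta (Meta_score - IMDB_Rating)')
plt.xlabel('Delta (points)')
plt.ylabel('Number of titles')
plt.show()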
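
To put a number on how many titles are rated "similarly", one option is to summarize the delta column and count the share of titles that fall within some tolerance. The sketch below uses a made-up `toy` DataFrame and an arbitrary 10-point `threshold`; both are assumptions for illustration, not values from the dataset or part of the original analysis.

```python
import pandas as pd

# Hypothetical scores on a 0-100 scale (not taken from the IMDb dataset)
toy = pd.DataFrame({
    "IMDB_Rating": [93, 88, 85, 80, 76],
    "Meta_score": [80, 94, 84, 66, 90],
})
toy["delta"] = toy["Meta_score"] - toy["IMDB_Rating"]

# Hypothetical cutoff: treat scores within 10 points of each other as "similar"
threshold = 10
similar_share = (toy["delta"].abs() <= threshold).mean()

# Basic summary of the differences
summary = toy["delta"].describe()
print(summary[["mean", "std", "min", "max"]])
print(f"Share of titles within {threshold} points: {similar_share:.0%}")
```

On the real data, the same two lines could be applied to `df['delta']`; `abs()` and `mean()` skip or propagate missing values differently, so dropping rows without a `Meta_score` first would be the safer choice.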

In [None]:
# import numpy as np
# import os
# import matplotlib.pyplot as plt
# import seaborn as sns

# # List all files
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# # Load dataset and describe
# file_path = 


# # Clean data
# data['Released_Year'] = pd.to_numeric(data['Released_Year'], errors='coerce')
# data = data.dropna(subset=['Released_Year']) # Remove entries with non-numeric Released_Year
# data['Gross'] = data['Gross'].str.replace(',', '').astype(float) # Cast Gross to float ($)
# data['Runtime'] = data['Runtime'].str.extract(r'(\d+)', expand=False).astype(int) # Cast Runtime to int (min)

# # Describe data
# print(f"\nNumber of observations: {len(data)}") # Num of observations
# print("\nFeatures:") # Display features
# for column in data.columns:
#     print(column)
# print("\nDescriptive statistics on numerical features:") #Display statistics
# numerical_stats = data.describe()
# print(numerical_stats)

# # Create a 2x3 grid of boxplots
# fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

# for idx, col in enumerate(numerical_stats.columns):
#     row, col_idx = divmod(idx, 3)
#     ax = axes[row, col_idx]
#     ax.boxplot(data[col].dropna().values)
#     ax.set_title(f'{col}')
#     ax.set_ylabel(col)
#     ax.set_xticks([])
    
# plt.tight_layout()
# plt.show()

# # Compute correlation matrix
# correlation_matrix = data[numerical_stats.columns].corr()

# # Visualize with a heatmap
# plt.figure(figsize=(10, 7))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', cbar=True, square=True, fmt='.2f')
# plt.title('Correlation Matrix')
# plt.show()