# Exploratory Data Analysis of IMDb's Top Global Movies (1950-2020)

This notebook contains the exploratory data analysis (EDA) for the dataset of IMDb's top global movies from 1950 to 2020. The analysis includes data loading, preprocessing, statistical analysis, and visualizations.

In [214]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In this section, we load the processed data from the `data/` directory, located one level above this notebook.

In [215]:
# Load the processed data
data_path = '../data/imdb_top_movies.csv'
df = pd.read_csv(data_path)

# Display the first few rows of the dataset
df.head()

Unnamed: 0,Title,Year,Rating,Genre,Director(s),Box Office Revenue,Lead Actors
0,1. The Shawshank Redemption,1994,9.3 (3M),"Epic, Period Drama, Prison Drama, Drama","Bob Gunton, Frank Darabont, Morgan Freeman, Ti...","Gross worldwide$29,332,133","Bob Gunton, Tim Robbins, Morgan Freeman"
1,2. The Godfather,1972,9.2 (2.1M),"Epic, Gangster, Tragedy, Crime, Drama","Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...","Gross worldwide$250,342,198","Al Pacino, Marlon Brando, James Caan"
2,3. The Dark Knight,2008,9.0 (3M),"Action Epic, Epic, Superhero, Tragedy, Action,...","Salvatore Maroni, Michael Caine, Christian Bal...","Gross worldwide$1,009,057,329","Christian Bale, Aaron Eckhart, Heath Ledger"
3,4. The Godfather Part II,1974,9.0 (1.4M),"Epic, Gangster, Tragedy, Crime, Drama","Livio Giorgi, Al Pacino, Mario Puzo, Francis F...","Gross worldwide$47,964,222","Al Pacino, Robert De Niro, Robert Duvall"
4,5. 12 Angry Men,1957,9.0 (917K),"Legal Drama, Psychological Drama, Crime, Drama","Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...","Gross worldwide$2,945","Henry Fonda, Martin Balsam, Lee J. Cobb"


## Data Overview

Let's take a look at the basic statistics and structure of the dataset.

In [216]:
# Display basic statistics
df.describe(include='all')

# Display the data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Title               250 non-null    object
 1   Year                250 non-null    int64 
 2   Rating              250 non-null    object
 3   Genre               250 non-null    object
 4   Director(s)         250 non-null    object
 5   Box Office Revenue  250 non-null    object
 6   Lead Actors         250 non-null    object
dtypes: int64(1), object(6)
memory usage: 13.8+ KB


In [217]:
df.iloc[0]

Title                                       1. The Shawshank Redemption
Year                                                               1994
Rating                                                         9.3 (3M)
Genre                           Epic, Period Drama, Prison Drama, Drama
Director(s)           Bob Gunton, Frank Darabont, Morgan Freeman, Ti...
Box Office Revenue                           Gross worldwide$29,332,133
Lead Actors                     Bob Gunton, Tim Robbins, Morgan Freeman
Name: 0, dtype: object

In [218]:
# Check for duplicates
df.duplicated().sum()

0

## Data Cleaning

### 1. Handling Missing Values
Missing values can distort summary statistics and break type conversions. We first count the null entries in each column, then handle any that appear.


In [219]:
print(df.isnull().sum())

Title                 0
Year                  0
Rating                0
Genre                 0
Director(s)           0
Box Office Revenue    0
Lead Actors           0
dtype: int64


In [220]:
# Rename all columns
df.columns = ['title', 'year', 'rating', 'genre', 'directors', 'revenue', 'lead_actors']
df

Unnamed: 0,title,year,rating,genre,directors,revenue,lead_actors
0,1. The Shawshank Redemption,1994,9.3 (3M),"Epic, Period Drama, Prison Drama, Drama","Bob Gunton, Frank Darabont, Morgan Freeman, Ti...","Gross worldwide$29,332,133","Bob Gunton, Tim Robbins, Morgan Freeman"
1,2. The Godfather,1972,9.2 (2.1M),"Epic, Gangster, Tragedy, Crime, Drama","Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...","Gross worldwide$250,342,198","Al Pacino, Marlon Brando, James Caan"
2,3. The Dark Knight,2008,9.0 (3M),"Action Epic, Epic, Superhero, Tragedy, Action,...","Salvatore Maroni, Michael Caine, Christian Bal...","Gross worldwide$1,009,057,329","Christian Bale, Aaron Eckhart, Heath Ledger"
3,4. The Godfather Part II,1974,9.0 (1.4M),"Epic, Gangster, Tragedy, Crime, Drama","Livio Giorgi, Al Pacino, Mario Puzo, Francis F...","Gross worldwide$47,964,222","Al Pacino, Robert De Niro, Robert Duvall"
4,5. 12 Angry Men,1957,9.0 (917K),"Legal Drama, Psychological Drama, Crime, Drama","Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...","Gross worldwide$2,945","Henry Fonda, Martin Balsam, Lee J. Cobb"
...,...,...,...,...,...,...,...
245,246. A Silent Voice: The Movie,2016,8.1 (117K),"Anime, Coming-of-Age, Psychological Drama, Shō...","Reiko Yoshida, Pete Townshend, Lexi Marman, Mi...","Gross worldwide$30,819,442","Saori Hayami, Miyu Irino, Aoi Yûki"
246,247. The Help,2011,8.1 (510K),"Period Drama, Drama","Emma Stone, Octavia Spencer, Johnny Cash, Hill...","Gross worldwide$221,802,186","Octavia Spencer, Emma Stone, Viola Davis"
247,248. Amores Perros,2000,8.0 (261K),"Tragedy, Drama, Thriller","Emilio Echevarría, Goya Toledo, Guillermo Arri...","Gross worldwide$20,908,467","Emilio Echevarría, Gael García Bernal, Goya To..."
248,249. Rebecca,1940,8.1 (153K),"Dark Romance, Psychological Drama, Psychologic...","Laurence Olivier, The Second Mrs. de Winter, J...","Gross worldwide$113,328","Laurence Olivier, Joan Fontaine, George Sanders"


In [221]:
df.columns

Index(['title', 'year', 'rating', 'genre', 'directors', 'revenue',
       'lead_actors'],
      dtype='object')

### Rearranging and Filtering Dataset Values

To ensure the dataset is clean and meets the requirements for analysis, we will perform the following steps:

1. **Filter the `title` Column**: Ensure that the `title` column contains only the movie name by removing the leading rank prefix (for example, `1. `).
2. **Filter the `rating` Column**: Ensure that the `rating` column contains only numeric values by separating the score from the vote count in parentheses (for example, `9.3 (3M)`).
3. **Filter the `revenue` Column**: Ensure that the `revenue` column contains only numeric values by removing the `Gross worldwide` label, dollar signs, and commas.
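
The raw `title` values carry a rank prefix such as `1. ` in front of the movie name. A minimal sketch of one way to separate the two is shown below. It runs on a small hand-made Series rather than the full dataset, and the column names `rank` and `title` in the result are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy sample copied from the title values shown earlier in this notebook
titles = pd.Series([
    "1. The Shawshank Redemption",
    "5. 12 Angry Men",
    "246. A Silent Voice: The Movie",
])

# Split "<rank>. <name>" into two named columns. The pattern anchors on a
# leading run of digits followed by a dot, so digits and punctuation that
# belong to the movie name itself (e.g. "12 Angry Men") are left untouched.
parts = titles.str.extract(r"^(?P<rank>\d+)\.\s+(?P<title>.+)$")
parts["rank"] = parts["rank"].astype(int)

print(parts)
```

Stripping every non-alphabetic character would also damage legitimate titles, which is why this sketch only targets the prefix.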

In [222]:
# Split the rating column at "(" into the score ('rating') and the vote count ('votes')
df[['rating', 'votes']] = df['rating'].str.split("(", expand=True)
print(df.head())

                         title  year rating  \
0  1. The Shawshank Redemption  1994   9.3    
1             2. The Godfather  1972   9.2    
2           3. The Dark Knight  2008   9.0    
3     4. The Godfather Part II  1974   9.0    
4              5. 12 Angry Men  1957   9.0    

                                               genre  \
0            Epic, Period Drama, Prison Drama, Drama   
1              Epic, Gangster, Tragedy, Crime, Drama   
2  Action Epic, Epic, Superhero, Tragedy, Action,...   
3              Epic, Gangster, Tragedy, Crime, Drama   
4     Legal Drama, Psychological Drama, Crime, Drama   

                                           directors  \
0  Bob Gunton, Frank Darabont, Morgan Freeman, Ti...   
1  Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...   
2  Salvatore Maroni, Michael Caine, Christian Bal...   
3  Livio Giorgi, Al Pacino, Mario Puzo, Francis F...   
4  Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...   

                         revenue           

In [223]:
df.columns

Index(['title', 'year', 'rating', 'genre', 'directors', 'revenue',
       'lead_actors', 'votes'],
      dtype='object')

In [224]:
df.head()

Unnamed: 0,title,year,rating,genre,directors,revenue,lead_actors,votes
0,1. The Shawshank Redemption,1994,9.3,"Epic, Period Drama, Prison Drama, Drama","Bob Gunton, Frank Darabont, Morgan Freeman, Ti...","Gross worldwide$29,332,133","Bob Gunton, Tim Robbins, Morgan Freeman",3M)
1,2. The Godfather,1972,9.2,"Epic, Gangster, Tragedy, Crime, Drama","Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...","Gross worldwide$250,342,198","Al Pacino, Marlon Brando, James Caan",2.1M)
2,3. The Dark Knight,2008,9.0,"Action Epic, Epic, Superhero, Tragedy, Action,...","Salvatore Maroni, Michael Caine, Christian Bal...","Gross worldwide$1,009,057,329","Christian Bale, Aaron Eckhart, Heath Ledger",3M)
3,4. The Godfather Part II,1974,9.0,"Epic, Gangster, Tragedy, Crime, Drama","Livio Giorgi, Al Pacino, Mario Puzo, Francis F...","Gross worldwide$47,964,222","Al Pacino, Robert De Niro, Robert Duvall",1.4M)
4,5. 12 Angry Men,1957,9.0,"Legal Drama, Psychological Drama, Crime, Drama","Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...","Gross worldwide$2,945","Henry Fonda, Martin Balsam, Lee J. Cobb",917K)
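
The `votes` values above still carry a closing parenthesis and a magnitude suffix (`3M)`, `917K)`). If the vote counts were needed later, they could be converted to integers along the following lines. This is only a sketch under the assumption that every value is a number followed by `K` or `M`. The names `multipliers` and `vote_counts` are hypothetical, and values without a suffix would need extra handling.

```python
import pandas as pd

# Toy sample copied from the votes column shown above
votes = pd.Series(["3M)", "2.1M)", "917K)"])

# Assumed suffix meanings: K = thousand, M = million
multipliers = {"K": 1_000, "M": 1_000_000}

# Capture the numeric part and the suffix; the trailing ")" is ignored
parsed = votes.str.extract(r"(?P<number>[\d.]+)(?P<suffix>[KM])")

# Scale the number by its suffix and round away floating-point noise
vote_counts = (
    parsed["number"].astype(float) * parsed["suffix"].map(multipliers)
).round().astype(int)

print(vote_counts)
```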


In [225]:
#remove the votes column from dataframe
df = df.drop('votes', axis=1)
df

Unnamed: 0,title,year,rating,genre,directors,revenue,lead_actors
0,1. The Shawshank Redemption,1994,9.3,"Epic, Period Drama, Prison Drama, Drama","Bob Gunton, Frank Darabont, Morgan Freeman, Ti...","Gross worldwide$29,332,133","Bob Gunton, Tim Robbins, Morgan Freeman"
1,2. The Godfather,1972,9.2,"Epic, Gangster, Tragedy, Crime, Drama","Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...","Gross worldwide$250,342,198","Al Pacino, Marlon Brando, James Caan"
2,3. The Dark Knight,2008,9.0,"Action Epic, Epic, Superhero, Tragedy, Action,...","Salvatore Maroni, Michael Caine, Christian Bal...","Gross worldwide$1,009,057,329","Christian Bale, Aaron Eckhart, Heath Ledger"
3,4. The Godfather Part II,1974,9.0,"Epic, Gangster, Tragedy, Crime, Drama","Livio Giorgi, Al Pacino, Mario Puzo, Francis F...","Gross worldwide$47,964,222","Al Pacino, Robert De Niro, Robert Duvall"
4,5. 12 Angry Men,1957,9.0,"Legal Drama, Psychological Drama, Crime, Drama","Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...","Gross worldwide$2,945","Henry Fonda, Martin Balsam, Lee J. Cobb"
...,...,...,...,...,...,...,...
245,246. A Silent Voice: The Movie,2016,8.1,"Anime, Coming-of-Age, Psychological Drama, Shō...","Reiko Yoshida, Pete Townshend, Lexi Marman, Mi...","Gross worldwide$30,819,442","Saori Hayami, Miyu Irino, Aoi Yûki"
246,247. The Help,2011,8.1,"Period Drama, Drama","Emma Stone, Octavia Spencer, Johnny Cash, Hill...","Gross worldwide$221,802,186","Octavia Spencer, Emma Stone, Viola Davis"
247,248. Amores Perros,2000,8.0,"Tragedy, Drama, Thriller","Emilio Echevarría, Goya Toledo, Guillermo Arri...","Gross worldwide$20,908,467","Emilio Echevarría, Gael García Bernal, Goya To..."
248,249. Rebecca,1940,8.1,"Dark Romance, Psychological Drama, Psychologic...","Laurence Olivier, The Second Mrs. de Winter, J...","Gross worldwide$113,328","Laurence Olivier, Joan Fontaine, George Sanders"


In [226]:
#datatypes of columns
df.dtypes

title          object
year            int64
rating         object
genre          object
directors      object
revenue        object
lead_actors    object
dtype: object

In [227]:
#convert the rating column to float
df['rating'] = df['rating'].astype(float)
df.dtypes

title           object
year             int64
rating         float64
genre           object
directors       object
revenue         object
lead_actors     object
dtype: object

In [228]:
# Check for placeholder strings such as 'Unknown' or 'NA', which isnull() does not count as missing
df.isin(['Unknown', 'NA']).sum()

title          0
year           0
rating         0
genre          0
directors      0
revenue        4
lead_actors    1
dtype: int64

In [None]:
# Replace 'Unknown' in the 'revenue' column with a zero placeholder in the same 'Gross worldwide$...' format
df['revenue'] = df['revenue'].replace('Unknown', 'Gross worldwide$0')
df.isin(['Unknown', 'NA']).sum()

title          0
year           0
rating         0
genre          0
directors      0
revenue        0
lead_actors    1
dtype: int64
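
Whether an unknown revenue should count as zero or as missing affects later statistics. The sketch below compares the two choices on three hand-made values. It is an illustration of the trade-off, not the approach taken in this notebook, and the names `as_zero` and `as_nan` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy sample: two real values from this dataset plus one placeholder
revenue = pd.Series(["Gross worldwide$2,945", "Unknown", "Gross worldwide$113,328"])

# Option A: treat 'Unknown' as zero revenue
as_zero = pd.to_numeric(
    revenue.replace("Unknown", "0").str.replace(r"\D", "", regex=True)
)

# Option B: treat 'Unknown' as missing, so it is skipped by mean(), sum(), etc.
as_nan = pd.to_numeric(
    revenue.replace("Unknown", np.nan).str.replace(r"\D", "", regex=True)
)

print("mean with zeros:", as_zero.mean())
print("mean with NaN:  ", as_nan.mean())
```

With zeros, the placeholder pulls the mean down. With `NaN`, pandas excludes it from the calculation by default.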

In [230]:
df.dtypes

title           object
year             int64
rating         float64
genre           object
directors       object
revenue         object
lead_actors     object
dtype: object

In [231]:
print(df[['revenue']])

                           revenue
0       Gross worldwide$29,332,133
1      Gross worldwide$250,342,198
2    Gross worldwide$1,009,057,329
3       Gross worldwide$47,964,222
4            Gross worldwide$2,945
..                             ...
245     Gross worldwide$30,819,442
246    Gross worldwide$221,802,186
247     Gross worldwide$20,908,467
248        Gross worldwide$113,328
249                              0

[250 rows x 1 columns]


In [232]:
# Remove non-numeric characters from the 'revenue' column
df['revenue'] = (
    df['revenue']
    .str.replace('Gross worldwide', '', regex=False)  # Remove the 'Gross worldwide' text
    .str.replace(r'[\$,]', '', regex=True)            # Remove dollar signs and commas
)

# Convert the 'revenue' column to numeric
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

# Display the updated DataFrame
df

Unnamed: 0,title,year,rating,genre,directors,revenue,lead_actors
0,1. The Shawshank Redemption,1994,9.3,"Epic, Period Drama, Prison Drama, Drama","Bob Gunton, Frank Darabont, Morgan Freeman, Ti...",2.933213e+07,"Bob Gunton, Tim Robbins, Morgan Freeman"
1,2. The Godfather,1972,9.2,"Epic, Gangster, Tragedy, Crime, Drama","Al Pacino, Marlon Brando, Mario Puzo, Peter Cl...",2.503422e+08,"Al Pacino, Marlon Brando, James Caan"
2,3. The Dark Knight,2008,9.0,"Action Epic, Epic, Superhero, Tragedy, Action,...","Salvatore Maroni, Michael Caine, Christian Bal...",1.009057e+09,"Christian Bale, Aaron Eckhart, Heath Ledger"
3,4. The Godfather Part II,1974,9.0,"Epic, Gangster, Tragedy, Crime, Drama","Livio Giorgi, Al Pacino, Mario Puzo, Francis F...",4.796422e+07,"Al Pacino, Robert De Niro, Robert Duvall"
4,5. 12 Angry Men,1957,9.0,"Legal Drama, Psychological Drama, Crime, Drama","Jack Warden, Lee J. Cobb, Sidney Lumet, Regina...",2.945000e+03,"Henry Fonda, Martin Balsam, Lee J. Cobb"
...,...,...,...,...,...,...,...
245,246. A Silent Voice: The Movie,2016,8.1,"Anime, Coming-of-Age, Psychological Drama, Shō...","Reiko Yoshida, Pete Townshend, Lexi Marman, Mi...",3.081944e+07,"Saori Hayami, Miyu Irino, Aoi Yûki"
246,247. The Help,2011,8.1,"Period Drama, Drama","Emma Stone, Octavia Spencer, Johnny Cash, Hill...",2.218022e+08,"Octavia Spencer, Emma Stone, Viola Davis"
247,248. Amores Perros,2000,8.0,"Tragedy, Drama, Thriller","Emilio Echevarría, Goya Toledo, Guillermo Arri...",2.090847e+07,"Emilio Echevarría, Gael García Bernal, Goya To..."
248,249. Rebecca,1940,8.1,"Dark Romance, Psychological Drama, Psychologic...","Laurence Olivier, The Second Mrs. de Winter, J...",1.133280e+05,"Laurence Olivier, Joan Fontaine, George Sanders"


In [233]:
print(df[['revenue']])

          revenue
0    2.933213e+07
1    2.503422e+08
2    1.009057e+09
3    4.796422e+07
4    2.945000e+03
..            ...
245  3.081944e+07
246  2.218022e+08
247  2.090847e+07
248  1.133280e+05
249           NaN

[250 rows x 1 columns]
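
Row 249 is printed as `0` before cleaning and as `NaN` afterwards. One possible explanation, assuming that entry was stored as the integer `0` rather than as a string, is that pandas `.str` methods return `NaN` for non-string elements. The sketch below reproduces this on toy data and shows a cast to `str` as one way around it. The helper `clean_revenue` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy data: two strings in the source format plus one non-string zero
revenue = pd.Series(
    ["Gross worldwide$29,332,133", "Gross worldwide$2,945", 0], dtype=object
)

def clean_revenue(s: pd.Series) -> pd.Series:
    """Strip the label, dollar signs and commas, then convert to numbers."""
    cleaned = (
        s.astype(str)  # make every element a string first
        .str.replace("Gross worldwide", "", regex=False)
        .str.replace(r"[\$,]", "", regex=True)
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Without the cast, the integer element becomes NaN inside .str.replace
without_cast = pd.to_numeric(
    revenue.str.replace("Gross worldwide", "", regex=False)
    .str.replace(r"[\$,]", "", regex=True),
    errors="coerce",
)

# With the cast, the zero survives the cleaning step
with_cast = clean_revenue(revenue)

print(without_cast)
print(with_cast)
```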

