We will analyze a data set containing the top 1000 movies on IMDB, released between 1920 and 2020.

In [36]:
%matplotlib inline
import pandas as pd


In [47]:
# load data
imdb_df = pd.read_csv('imdb_top_1000.csv')
imdb_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


Looking at the first few rows, we see that the data set is already sorted by IMDB rating, which is on a scale from 1 to 10. It includes basic movie information such as the title, genre, release year, cast, and director, along with numerical information such as the runtime and gross earnings. Note that Runtime is stored as text (for example, "142 min") rather than as a number.

In [38]:
imdb_df.info()
imdb_df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


(1000, 16)

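We can see that three columns contain null values: Certificate (899 non-null), Meta_score (843 non-null), and Gross (831 non-null). One noticeable issue is that the Gross column has the object data type instead of integer or float. Released_Year and Runtime are stored as objects as well. Gross has to be converted to a numeric type before it can be sorted and analyzed properly.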
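The type problems described above can be handled with one small helper. This is a minimal sketch, not the notebook's own cleaning step: the `sample` frame and the `clean_numeric_columns` function are hypothetical, and the malformed "PG" year is an assumed example of why `errors="coerce"` is useful.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the raw text formats of the columns.
sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "PG"],       # "PG" stands in for a malformed entry
    "Runtime": ["142 min", "175 min", "96 min"],
    "Gross": ["28,341,469", None, "4,360,000"],   # None stands in for a missing value
})

def clean_numeric_columns(df):
    """Return a copy with the three text columns converted to numbers."""
    out = df.copy()
    # Anything that cannot be parsed becomes NaN instead of raising an error.
    out["Released_Year"] = pd.to_numeric(out["Released_Year"], errors="coerce")
    # Drop the " min" suffix before parsing.
    out["Runtime"] = pd.to_numeric(
        out["Runtime"].str.replace(" min", "", regex=False), errors="coerce"
    )
    # Drop the thousands separators before parsing.
    out["Gross"] = pd.to_numeric(
        out["Gross"].str.replace(",", "", regex=False), errors="coerce"
    )
    return out

cleaned = clean_numeric_columns(sample)
print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Working on a copy keeps the raw text available, so the cell can be rerun without the `.str` accessor failing on a column that has already been converted.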

In [48]:
# Top grossing movies
# Gross is stored as text with comma separators; strip them and convert to float
imdb_df['Gross'] = imdb_df['Gross'].str.replace(',', '', regex=False).astype(float)
sorted_gross = imdb_df.sort_values(by='Gross', ascending=False)
gross_df = sorted_gross[['Series_Title', 'Genre', 'Director', 'Gross', 'Released_Year']]
gross_df.head(10)


Unnamed: 0,Series_Title,Genre,Director,Gross,Released_Year
477,Star Wars: Episode VII - The Force Awakens,"Action, Adventure, Sci-Fi",J.J. Abrams,936662225.0,2015
59,Avengers: Endgame,"Action, Adventure, Drama",Anthony Russo,858373000.0,2019
623,Avatar,"Action, Adventure, Fantasy",James Cameron,760507625.0,2009
60,Avengers: Infinity War,"Action, Adventure, Sci-Fi",Anthony Russo,678815482.0,2018
652,Titanic,"Drama, Romance",James Cameron,659325379.0,1997
357,The Avengers,"Action, Adventure, Sci-Fi",Joss Whedon,623279547.0,2012
891,Incredibles 2,"Animation, Action, Adventure",Brad Bird,608581744.0,2018
2,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,534858444.0,2008
582,Rogue One,"Action, Adventure, Sci-Fi",Gareth Edwards,532177324.0,2016
63,The Dark Knight Rises,"Action, Adventure",Christopher Nolan,448139099.0,2012


Here are the top 10 movies in the data set by gross earnings. A few things stand out about this table. Titanic is the only movie in the table released before 2000, and it is the only one that does not list Action in its genre. Seven of the ten movies belong to three superhero or sci-fi franchises (Avengers, The Dark Knight, Star Wars). Eight of the ten were released in 2009 or later.
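The observations about this table can be checked with a few boolean masks. The sketch below retypes the ten rows by hand into a hypothetical `top_gross` frame, with years entered as integers. In `imdb_df` itself Released_Year is still text and would need converting first.

```python
import pandas as pd

# The ten rows of the top-grossing table, retyped by hand for illustration.
top_gross = pd.DataFrame({
    "Series_Title": [
        "Star Wars: Episode VII - The Force Awakens", "Avengers: Endgame", "Avatar",
        "Avengers: Infinity War", "Titanic", "The Avengers", "Incredibles 2",
        "The Dark Knight", "Rogue One", "The Dark Knight Rises",
    ],
    "Genre": [
        "Action, Adventure, Sci-Fi", "Action, Adventure, Drama", "Action, Adventure, Fantasy",
        "Action, Adventure, Sci-Fi", "Drama, Romance", "Action, Adventure, Sci-Fi",
        "Animation, Action, Adventure", "Action, Crime, Drama", "Action, Adventure, Sci-Fi",
        "Action, Adventure",
    ],
    "Released_Year": [2015, 2019, 2009, 2018, 1997, 2012, 2018, 2008, 2016, 2012],
})

# Genre is a comma-separated string, so a substring test is enough here.
is_action = top_gross["Genre"].str.contains("Action", regex=False)
since_2009 = top_gross["Released_Year"] >= 2009

n_action = int(is_action.sum())
n_recent = int(since_2009.sum())
non_action_titles = top_gross.loc[~is_action, "Series_Title"].tolist()

print(n_action, n_recent, non_action_titles)
```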

In [49]:
# Top IMDB rated movies
sorted_rating = imdb_df.sort_values(by='IMDB_Rating', ascending=False)
rating_df = sorted_rating[['Series_Title', 'Genre', 'Director', 'IMDB_Rating', 'Released_Year']]
rating_df.head(10)


Unnamed: 0,Series_Title,Genre,Director,IMDB_Rating,Released_Year
0,The Shawshank Redemption,Drama,Frank Darabont,9.3,1994
1,The Godfather,"Crime, Drama",Francis Ford Coppola,9.2,1972
2,The Dark Knight,"Action, Crime, Drama",Christopher Nolan,9.0,2008
3,The Godfather: Part II,"Crime, Drama",Francis Ford Coppola,9.0,1974
4,12 Angry Men,"Crime, Drama",Sidney Lumet,9.0,1957
5,The Lord of the Rings: The Return of the King,"Action, Adventure, Drama",Peter Jackson,8.9,2003
6,Pulp Fiction,"Crime, Drama",Quentin Tarantino,8.9,1994
7,Schindler's List,"Biography, Drama, History",Steven Spielberg,8.9,1993
10,The Lord of the Rings: The Fellowship of the Ring,"Action, Adventure, Drama",Peter Jackson,8.8,2001
11,Forrest Gump,"Drama, Romance",Robert Zemeckis,8.8,1994


These are the top 10 movies in the data set by IMDB rating. The Dark Knight is the only movie on this list that is also in the top 10 by gross. Just as almost all of the top grossing movies are action films, every top rated movie here lists Drama among its genres. The index jumps from 7 to 10, so rows 8 and 9 are presumably tied at 8.8 with the last two entries, and which of the tied films make the cut is arbitrary. No film in this table was released after 2008. Based on the two tables, it seems Hollywood may have traded quality for money, although two top 10 lists are thin evidence for that.
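Because several films share the same rating near the bottom of this table, a plain top 10 depends on how ties are broken. One way to make the cut explicit is `DataFrame.nlargest` with `keep="all"`, which keeps every row tied with the last place. The `ratings` frame below is hypothetical: the film names are placeholders and the ratings are chosen only to produce a tie at 8.8.

```python
import pandas as pd

# Hypothetical ratings with a four-way tie at 8.8 around the tenth place.
ratings = pd.DataFrame({
    "Series_Title": [f"Film {i}" for i in range(13)],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7],
})

# Exactly ten rows; ties at the cutoff are broken by order of appearance.
top10_strict = ratings.nlargest(10, "IMDB_Rating")

# Every row tied with the tenth place is kept, so more than ten rows can come back.
top10_with_ties = ratings.nlargest(10, "IMDB_Rating", keep="all")

print(len(top10_strict), len(top10_with_ties))
```

With this toy data the second frame has twelve rows, which makes it visible that two more films have the same claim to the list as the ninth and tenth entries.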

In [42]:
#Director Count
dir_count = imdb_df.groupby('Director')['Series_Title'].count()
sort_dir_count = dir_count.sort_values(ascending=False)
dir_df = pd.DataFrame(sort_dir_count)
dir_df.head(10)

Director,Series_Title
Alfred Hitchcock,14
Steven Spielberg,13
Hayao Miyazaki,11
Martin Scorsese,10
Akira Kurosawa,10
Stanley Kubrick,9
Woody Allen,9
Billy Wilder,9
Quentin Tarantino,8
Christopher Nolan,8


This table lists the directors with the most movies in the top 1000 on IMDB. Now that we know who has directed the most movies, we can restrict the data to these ten directors for further analysis.
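The list of directors can also be derived from the counts rather than typed out by hand. The following is a minimal sketch on a hypothetical `toy` frame; `top_directors` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for imdb_df.
toy = pd.DataFrame({
    "Director": ["A", "A", "A", "B", "B", "C"],
    "Series_Title": ["a1", "a2", "a3", "b1", "b2", "c1"],
    "IMDB_Rating": [8.0, 8.2, 8.4, 9.0, 8.0, 9.3],
})

def top_directors(df, n):
    """Names of the n directors with the most films, most prolific first."""
    return df["Director"].value_counts().head(n).index.tolist()

names = top_directors(toy, 2)

# Keep only the films made by those directors.
subset = toy[toy["Director"].isin(names)]
mean_rating = subset.groupby("Director")["IMDB_Rating"].mean().sort_values(ascending=False)

print(names)
print(mean_rating)
```

Note that `head(n)` cuts ties at the boundary arbitrarily, so directors with the same count as the last name kept may be left out.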

In [50]:
# Average IMDB rating per director
# The ten directors with the most movies, copied from the table above
dir_names = ['Alfred Hitchcock', 'Steven Spielberg', 'Hayao Miyazaki', 'Martin Scorsese', 'Akira Kurosawa',
             'Stanley Kubrick', 'Woody Allen', 'Billy Wilder', 'Quentin Tarantino', 'Christopher Nolan']

# Filter the DataFrame to include only the selected names
subset = imdb_df[imdb_df['Director'].isin(dir_names)]

# Compute the average rating for the selected names
dir_rating = subset.groupby('Director')['IMDB_Rating'].mean()

sort_dir_rating = dir_rating.sort_values(ascending=False)
dir_rate_df = pd.DataFrame(sort_dir_rating)
dir_rate_df.head(10)


Director,IMDB_Rating
Christopher Nolan,8.4625
Stanley Kubrick,8.233333
Akira Kurosawa,8.22
Quentin Tarantino,8.175
Martin Scorsese,8.17
Billy Wilder,8.144444
Steven Spielberg,8.030769
Hayao Miyazaki,8.018182
Alfred Hitchcock,8.007143
Woody Allen,7.788889


This is the average IMDB rating for the ten directors with the most movies in the data set. Subsetting on the number of movies matters because we want to compare directors with proven track records. The averages are very close, with nine of the ten between 8.0 and 8.5 and only Woody Allen falling below 8. Although Christopher Nolan hasn't made the most movies, he is at the top of this list because his eight films average 8.46. The Dark Knight, at 9.0, is the highest rated single film by any of these directors, but one film alone does not explain the gap. Nolan is also the only director on this list with films in the top 10 by gross (The Dark Knight and The Dark Knight Rises).
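To see how much a single standout film drives a director's average, the mean can be put next to the count and the maximum in one aggregation. The directors and ratings below are hypothetical and were chosen so that the director with the best single film does not have the best average.

```python
import pandas as pd

# Hypothetical ratings: X has the single best film, Y is more consistent.
films = pd.DataFrame({
    "Director": ["X", "X", "X", "Y", "Y", "Y"],
    "IMDB_Rating": [9.0, 8.0, 8.0, 8.5, 8.5, 8.5],
})

# One row per director with the number of films, the mean, and the best film.
summary = (
    films.groupby("Director")["IMDB_Rating"]
    .agg(["count", "mean", "max"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

Applied to the subset of ten directors, the same aggregation would show whether a ranking by mean agrees with a ranking by best film.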

In [44]:
#Actor Count
act_count = imdb_df.groupby('Star1')['Series_Title'].count()
sort_act_count = act_count.sort_values(ascending=False)
count_df = pd.DataFrame(sort_act_count)
count_df.head(10)

Star1,Series_Title
Tom Hanks,12
Robert De Niro,11
Clint Eastwood,10
Al Pacino,10
Leonardo DiCaprio,9
Humphrey Bogart,9
Johnny Depp,8
James Stewart,8
Christian Bale,8
Toshirô Mifune,7


As with the directors, we would like to see which actors appear in the most films in the top 1000. The original data set includes columns for four of the actors in each film (Star1 through Star4). We will assume that Star1 is the lead of the film and only count each actor's Star1 credits.
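If the Star1 assumption is too restrictive, credits can be counted across all four billing columns by reshaping them into one long column. This sketch uses three rows taken from the first table in this notebook. The names `films`, `long_df`, `any_billing`, and `lead_only` are illustrative.

```python
import pandas as pd

# Three films from the first table, with all four billed actors.
films = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II"],
    "Star1": ["Tim Robbins", "Marlon Brando", "Al Pacino"],
    "Star2": ["Morgan Freeman", "Al Pacino", "Robert De Niro"],
    "Star3": ["Bob Gunton", "James Caan", "Robert Duvall"],
    "Star4": ["William Sadler", "Diane Keaton", "Diane Keaton"],
})

star_cols = ["Star1", "Star2", "Star3", "Star4"]

# One row per (film, actor) pair instead of four actor columns per film.
long_df = films.melt(
    id_vars="Series_Title", value_vars=star_cols,
    var_name="billing", value_name="actor",
)

any_billing = long_df["actor"].value_counts()   # credits in any of the four slots
lead_only = films["Star1"].value_counts()       # credits as Star1 only

print(any_billing.head())
print(lead_only)
```

In this small sample Al Pacino has two credits overall but only one as Star1, which shows how the two counting rules can diverge.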

In [45]:
# Average IMDB rating per lead actor
# The ten actors with the most Star1 credits, copied from the table above
selected_names = ['Tom Hanks', 'Robert De Niro', 'Clint Eastwood', 'Al Pacino', 'Leonardo DiCaprio',
                  'Humphrey Bogart', 'Johnny Depp', 'James Stewart', 'Christian Bale', 'Toshirô Mifune']
subset = imdb_df[imdb_df['Star1'].isin(selected_names)]
average_rating = subset.groupby('Star1')['IMDB_Rating'].mean()
sort_act_rating = average_rating.sort_values(ascending=False)
act_rate_df = pd.DataFrame(sort_act_rating)
act_rate_df.head(10)


Star1,IMDB_Rating
Toshirô Mifune,8.242857
James Stewart,8.175
Leonardo DiCaprio,8.133333
Christian Bale,8.1125
Robert De Niro,8.072727
Tom Hanks,8.041667
Al Pacino,8.01
Clint Eastwood,7.97
Humphrey Bogart,7.955556
Johnny Depp,7.7375


The first two names in this list may not be familiar to people who aren't die-hard movie lovers. Toshirô Mifune and James Stewart both had great careers in the 1950s-1970s. Neither actor has a film in the top 10 by rating, yet their track records show that they have been a part of some of the greatest movies in cinema. As with the directors, most of these actors have an average rating around 8.