# IMDB Top 1000 Movies

### Why the Data was Chosen

The data was chosen to build an intuitive yet informative interactive app based around movie data. The data set features interesting variables that support a rich and meaningful analysis for my film student persona. Understanding cinema goes beyond viewership; it requires an analytical lens to fully appreciate its complexity. This data set will help provide insight into trends in overall gross and ratings when taking into account a film's actors, directors, and release dates. This will help the film student draw conclusions about why some films are more successful than others.

### Data Provenance

The data set was taken from Kaggle (https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows). The author of this data set has indicated that the data was scraped from the IMDB site (https://www.imdb.com/) to make analysis easy for the public. I will be using this data to look for trends and commonalities in the top IMDB movies. The dataset is released under the CC0 license, meaning the author has relinquished all copyright and similar rights on the work and dedicated those rights to the public domain.

### Data Cleaning 

In [275]:
# Import Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [276]:
# Load in Data and view first couple rows
df = pd.read_csv("./imdb_top_1000.csv")
df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [277]:
# Drop the Poster_Link and Overview columns, then view information on the remaining columns
df.drop(['Poster_Link', 'Overview'], axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   1000 non-null   object 
 1   Released_Year  1000 non-null   int64  
 2   Certificate    899 non-null    object 
 3   Runtime        1000 non-null   object 
 4   Genre          1000 non-null   object 
 5   IMDB_Rating    1000 non-null   float64
 6   Meta_score     843 non-null    float64
 7   Director       1000 non-null   object 
 8   Star1          1000 non-null   object 
 9   Star2          1000 non-null   object 
 10  Star3          1000 non-null   object 
 11  Star4          1000 non-null   object 
 12  No_of_Votes    1000 non-null   int64  
 13  Gross          831 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 109.5+ KB


In [278]:
# Dealing with Missing Values in Certificate, Meta_score, and Gross
df = df.dropna(subset=['Certificate', 'Gross', 'Meta_score'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 997
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   714 non-null    object 
 1   Released_Year  714 non-null    int64  
 2   Certificate    714 non-null    object 
 3   Runtime        714 non-null    object 
 4   Genre          714 non-null    object 
 5   IMDB_Rating    714 non-null    float64
 6   Meta_score     714 non-null    float64
 7   Director       714 non-null    object 
 8   Star1          714 non-null    object 
 9   Star2          714 non-null    object 
 10  Star3          714 non-null    object 
 11  Star4          714 non-null    object 
 12  No_of_Votes    714 non-null    int64  
 13  Gross          714 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 83.7+ KB


In [279]:
# Convert Released_Year to numerical data
df['Released_Year'] = pd.to_numeric(df['Released_Year'], errors='coerce')

# Extract the numerical component of Runtime as numerical data
df['Runtime'] = df['Runtime'].str.extract('(\\d+)').astype(float)

# Remove commas from Gross and convert to numerical data
df['Gross'] = df['Gross'].str.replace(',', '').astype(float)

df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 997
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   714 non-null    object 
 1   Released_Year  714 non-null    int64  
 2   Certificate    714 non-null    object 
 3   Runtime        714 non-null    float64
 4   Genre          714 non-null    object 
 5   IMDB_Rating    714 non-null    float64
 6   Meta_score     714 non-null    float64
 7   Director       714 non-null    object 
 8   Star1          714 non-null    object 
 9   Star2          714 non-null    object 
 10  Star3          714 non-null    object 
 11  Star4          714 non-null    object 
 12  No_of_Votes    714 non-null    int64  
 13  Gross          714 non-null    float64
dtypes: float64(4), int64(2), object(8)
memory usage: 83.7+ KB
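
The string-to-number conversions above can be tried on a few rows in isolation. The following is a minimal sketch, assuming the raw `Runtime` values look like `"142 min"` and the raw `Gross` values are comma-separated strings such as `"28,341,469"`. The `sample` frame and the names `runtime_minutes` and `gross_dollars` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy rows mimicking the assumed raw string formats (not the real file)
sample = pd.DataFrame({
    "Runtime": ["142 min", "175 min", "96 min"],
    "Gross": ["28,341,469", "134,966,411", None],
})

# A raw string avoids the doubled backslash; expand=False returns a Series
# instead of a one-column DataFrame
runtime_minutes = sample["Runtime"].str.extract(r"(\d+)", expand=False).astype(float)

# regex=False treats the comma as a literal character; errors="coerce"
# turns anything that cannot be parsed (including missing values) into NaN
gross_dollars = pd.to_numeric(
    sample["Gross"].str.replace(",", "", regex=False), errors="coerce"
)

print(runtime_minutes.tolist())
print(gross_dollars.tolist())
```

Using `pd.to_numeric(..., errors="coerce")` instead of `.astype(float)` is one possible design choice: it would keep the cell from failing if an unexpected string slipped through, at the cost of silently producing NaN values that then need to be checked.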


In [280]:
# Split the Genre column on comma separators and create a new dummy column for each genre
genre_dummies = df['Genre'].str.get_dummies(sep=', ')

# Concatenate the original DataFrame with the new dummies DataFrame. Now we have a column for every genre, where a 1 indicates the movie belongs to that genre
final_df = pd.concat([df.drop('Genre', axis=1), genre_dummies], axis=1)

final_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 997
Data columns (total 34 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   714 non-null    object 
 1   Released_Year  714 non-null    int64  
 2   Certificate    714 non-null    object 
 3   Runtime        714 non-null    float64
 4   IMDB_Rating    714 non-null    float64
 5   Meta_score     714 non-null    float64
 6   Director       714 non-null    object 
 7   Star1          714 non-null    object 
 8   Star2          714 non-null    object 
 9   Star3          714 non-null    object 
 10  Star4          714 non-null    object 
 11  No_of_Votes    714 non-null    int64  
 12  Gross          714 non-null    float64
 13  Action         714 non-null    int64  
 14  Adventure      714 non-null    int64  
 15  Animation      714 non-null    int64  
 16  Biography      714 non-null    int64  
 17  Comedy         714 non-null    int64  
 18  Crime          

In [281]:
# Export to data.csv
final_df.to_csv('data.csv', index=False)

### Exploratory Analysis

#### Number of Observations

In [282]:
final_df.shape

(714, 34)

There are 714 total observations, each with 34 variables.

#### Number of Categories for Categorical Variables

In [283]:
# Get the number of unique genres
unique_genres = set(df['Genre'].str.split(', ').explode().unique())
print(unique_genres)
print(f"Number of unique genres: {len(unique_genres)}")

{'Comedy', 'War', 'Western', 'Music', 'History', 'Action', 'Biography', 'Horror', 'Fantasy', 'Animation', 'Sci-Fi', 'Film-Noir', 'Musical', 'Crime', 'Family', 'Mystery', 'Thriller', 'Drama', 'Romance', 'Adventure', 'Sport'}
Number of unique genres: 21


In the original dataframe, Genre was a categorical variable that could contain any combination of the 21 unique genres. In data.csv, the genres are split into separate columns to maintain the tidy data format.
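
To illustrate how the genre split behaves, here is a small sketch using three `Genre` strings taken from the `df.head()` output above. The variable names `genres` and `dummies` are chosen only for this example.

```python
import pandas as pd

# Three Genre strings as they appear in the first rows of the raw data
genres = pd.Series(["Drama", "Crime, Drama", "Action, Crime, Drama"], name="Genre")

# One 0/1 column per distinct genre, split on the ", " separator
dummies = genres.str.get_dummies(sep=", ")
print(dummies)

# Column sums give the number of films tagged with each genre
print(dummies.sum().to_dict())
```

Because each genre becomes its own 0/1 column, the column mean is the share of films carrying that genre, which is how the genre columns in the `describe()` output can be read.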

In [284]:
# Get the number of unique stars
all_stars = pd.concat([final_df['Star1'], final_df['Star2'], final_df['Star3'], final_df['Star4']]).unique()
unique_stars = set(all_stars)
print(unique_stars)
print(f"Number of unique stars: {len(unique_stars)}")

{'Babak Karimi', 'Anthony Michael Hall', 'Tony Moran', 'Caroline Goodall', 'Saif Ali Khan', 'Pamela Adlon', 'Sandra Bullock', 'Ben Johnson', 'Jeff Garlin', 'Dan Hicks', 'Joanna Cassidy', 'Lee Pace', 'Eddie Murphy', 'Anne Le Ny', 'Holly Hunter', 'Izabela Vidovic', 'Lee Sun-kyun', 'Anna Kendrick', 'Katrin Cartlidge', 'Thomas Bo Larsen', 'Pernilla Allwin', 'Alan Ruck', 'Barry Bostwick', 'Emmy Rossum', 'Roy Dotrice', 'Gastone Moschin', 'Olivia Cooke', 'Ellen Burstyn', 'Jan Sterling', "Paige O'Hara", 'Katrin Saß', 'Niels Arestrup', 'Matvey Novikov', 'Rory Cochrane', 'Sally Hawkins', 'Jane Galloway Heitz', 'Donald Sutherland', 'Jim Carrey', 'Stephen Boyd', 'Samuel L. Jackson', 'Anne-Marie Duff', 'Gérard Jugnot', 'Andrew Garfield', 'Meat Loaf', 'Brian Cox', 'Graham Greene', 'Julia Ormond', 'Ariane Labed', 'Donny Alamsyah', 'Patrick McGoohan', "Auli'i Cravalho", 'Greg Kinnear', 'Suzanne Pleshette', 'Natalie Dessay', 'Philip Seymour Hoffman', 'Jay Baruchel', 'Emma Thompson', 'Dexter Fletcher', 

Each movie lists four stars, drawn from a pool of 1914 unique stars.
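
An alternative way to count stars, sketched here as one option rather than what the notebook does, is to reshape the four star columns into a single long column with `melt`. The long form also gives per-star appearance counts, which could feed an actor-level chart later. The `movies` frame below holds two rows copied from the `df.head()` output; `stars_long` and `appearances` are illustrative names.

```python
import pandas as pd

# Two rows copied from the head of the data set
movies = pd.DataFrame({
    "Series_Title": ["The Godfather", "The Godfather: Part II"],
    "Star1": ["Marlon Brando", "Al Pacino"],
    "Star2": ["Al Pacino", "Robert De Niro"],
    "Star3": ["James Caan", "Robert Duvall"],
    "Star4": ["Diane Keaton", "Diane Keaton"],
})

# One row per (movie, billing position) pair
stars_long = movies.melt(
    id_vars="Series_Title",
    value_vars=["Star1", "Star2", "Star3", "Star4"],
    var_name="Billing",
    value_name="Star",
)

print(f"Number of unique stars: {stars_long['Star'].nunique()}")

# How many of the listed films each star appears in
appearances = stars_long["Star"].value_counts()
print(appearances)
```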

In [285]:
# Get the number of unique titles
unique_titles = final_df['Series_Title'].unique()
print(unique_titles)
print(f"Number of unique titles: {len(unique_titles)}")

['The Shawshank Redemption' 'The Godfather' 'The Dark Knight'
 'The Godfather: Part II' '12 Angry Men'
 'The Lord of the Rings: The Return of the King' 'Pulp Fiction'
 "Schindler's List" 'Inception' 'Fight Club'
 'The Lord of the Rings: The Fellowship of the Ring' 'Forrest Gump'
 'Il buono, il brutto, il cattivo' 'The Lord of the Rings: The Two Towers'
 'The Matrix' 'Goodfellas'
 'Star Wars: Episode V - The Empire Strikes Back'
 "One Flew Over the Cuckoo's Nest" 'Gisaengchung' 'Interstellar'
 'Cidade de Deus' 'Sen to Chihiro no kamikakushi' 'Saving Private Ryan'
 'The Green Mile' 'La vita è bella' 'Se7en' 'The Silence of the Lambs'
 'Star Wars' 'Shichinin no samurai' 'Joker' 'Whiplash' 'The Intouchables'
 'The Prestige' 'The Departed' 'The Pianist' 'Gladiator'
 'American History X' 'The Usual Suspects' 'Léon' 'The Lion King'
 'Terminator 2: Judgment Day' 'Nuovo Cinema Paradiso' 'Back to the Future'
 'Once Upon a Time in the West' 'Psycho' 'Casablanca' 'Modern Times'
 'City Lights' 'Cap

There are 714 unique film titles

In [286]:
# Get the number of unique directors
unique_directors = final_df['Director'].unique()
print(unique_directors)
print(f"Number of unique directors: {len(unique_directors)}")

['Frank Darabont' 'Francis Ford Coppola' 'Christopher Nolan'
 'Sidney Lumet' 'Peter Jackson' 'Quentin Tarantino' 'Steven Spielberg'
 'David Fincher' 'Robert Zemeckis' 'Sergio Leone' 'Lana Wachowski'
 'Martin Scorsese' 'Irvin Kershner' 'Milos Forman' 'Bong Joon Ho'
 'Fernando Meirelles' 'Hayao Miyazaki' 'Roberto Benigni' 'Jonathan Demme'
 'George Lucas' 'Akira Kurosawa' 'Todd Phillips' 'Damien Chazelle'
 'Olivier Nakache' 'Roman Polanski' 'Ridley Scott' 'Tony Kaye'
 'Bryan Singer' 'Luc Besson' 'Roger Allers' 'James Cameron'
 'Giuseppe Tornatore' 'Alfred Hitchcock' 'Michael Curtiz'
 'Charles Chaplin' 'Nadine Labaki' 'Makoto Shinkai' 'Bob Persichetti'
 'Anthony Russo' 'Lee Unkrich' 'Rajkumar Hirani' 'Andrew Stanton'
 'Florian Henckel von Donnersmarck' 'Chan-wook Park' 'Stanley Kubrick'
 'Sam Mendes' 'Thomas Vinterberg' 'Asghar Farhadi' 'Denis Villeneuve'
 'Michel Gondry' 'Jean-Pierre Jeunet' 'Guy Ritchie' 'Darren Aronofsky'
 'Gus Van Sant' 'Majid Majidi' 'John Lasseter' 'Mel Gibson'
 'Bri

There are 402 unique directors

#### Missing Data

In [287]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 997
Data columns (total 34 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Series_Title   714 non-null    object 
 1   Released_Year  714 non-null    int64  
 2   Certificate    714 non-null    object 
 3   Runtime        714 non-null    float64
 4   IMDB_Rating    714 non-null    float64
 5   Meta_score     714 non-null    float64
 6   Director       714 non-null    object 
 7   Star1          714 non-null    object 
 8   Star2          714 non-null    object 
 9   Star3          714 non-null    object 
 10  Star4          714 non-null    object 
 11  No_of_Votes    714 non-null    int64  
 12  Gross          714 non-null    float64
 13  Action         714 non-null    int64  
 14  Adventure      714 non-null    int64  
 15  Animation      714 non-null    int64  
 16  Biography      714 non-null    int64  
 17  Comedy         714 non-null    int64  
 18  Crime          

Since there are 714 entries and every column has 714 non-null entries, there is no missing data in any column.
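
The same conclusion can be checked directly instead of being read off the `info()` output. The sketch below uses a made-up three-row frame (`toy`) to show the pattern; on the real data the equivalent check would be `final_df.isna().sum()`.

```python
import pandas as pd

# Made-up rows with gaps, standing in for the raw data before cleaning
toy = pd.DataFrame({
    "Certificate": ["A", None, "UA"],
    "Meta_score": [80.0, 100.0, None],
    "Gross": ["28,341,469", "134,966,411", "534,858,444"],
})

# Count of missing values per column
missing_before = toy.isna().sum()
print(missing_before)

# Drop rows with a gap in any of the three columns, as in the cleaning step
cleaned = toy.dropna(subset=["Certificate", "Gross", "Meta_score"])
missing_after = cleaned.isna().sum()
print(missing_after)
```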

#### Distribution of Continuous Variables

In [288]:
# Get statistical summary for numerical columns
final_df.describe()

| | Released_Year | Runtime | IMDB_Rating | Meta_score | No_of_Votes | Gross | Action | Adventure | Animation | Biography | ... | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | ... | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 | 714.0 |
| mean | 1995.735294 | 123.715686 | 7.937115 | 77.158263 | 356134.8 | 78513590.0 | 0.196078 | 0.228291 | 0.088235 | 0.123249 | ... | 0.02521 | 0.037815 | 0.015406 | 0.098039 | 0.123249 | 0.078431 | 0.02381 | 0.138655 | 0.040616 | 0.022409 |
| std | 18.585196 | 25.887535 | 0.293278 | 12.401144 | 353901.1 | 114978000.0 | 0.397307 | 0.420026 | 0.283836 | 0.328954 | ... | 0.156873 | 0.190883 | 0.123248 | 0.297576 | 0.328954 | 0.269038 | 0.152562 | 0.345829 | 0.197538 | 0.148113 |
| min | 1930.0 | 72.0 | 7.6 | 28.0 | 25229.0 | 1305.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1987.0 | 104.25 | 7.7 | 70.0 | 96009.75 | 6157408.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2001.0 | 120.0 | 7.9 | 78.0 | 236602.5 | 34850150.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2009.75 | 136.0 | 8.1 | 86.0 | 507792.2 | 102464100.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 2019.0 | 238.0 | 9.3 | 100.0 | 2343110.0 | 936662200.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


This is a summary of the numerical columns, showing how each variable is distributed across the films. Each column needs its own interpretation. For example, Runtime has a mean of about 124 minutes and a median of 120 minutes, so a typical film runs roughly two hours. Its standard deviation of about 26 minutes, together with a range from 72 to 238 minutes, shows that runtimes are fairly spread out.
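
Comparing the mean with the median is a quick way to judge the shape of a distribution. In the table above, the mean of Gross (about 78.5 million) is more than double its median (about 34.9 million), which suggests a strong right skew. The sketch below shows the same comparison on the five Gross values from the `df.head()` output. It is only an illustration on a tiny sample, and `gross` and `shape_summary` are names introduced here.

```python
import pandas as pd

# The five Gross values shown in the df.head() output
gross = pd.Series(
    [28341469, 134966411, 534858444, 57300000, 4360000], name="Gross"
)

shape_summary = pd.Series({
    "mean": gross.mean(),
    "median": gross.median(),
    "skew": gross.skew(),  # sample skewness; positive means a long right tail
})
print(shape_summary)
```

A mean well above the median, as here, is driven by a few very high-grossing films, so the median is usually the safer "typical value" to report for Gross.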

#### Outliers

In [289]:
# Define the variables we are interested in finding outliers for
# (named outlier_columns to avoid shadowing the built-in vars function)
outlier_columns = ['Runtime', 'IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross']


def outlier(column):
    # Calculate the IQR and the 1.5 * IQR bounds
    Q1 = final_df[column].quantile(0.25)
    Q3 = final_df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify rows whose value falls outside the bounds
    outliers = final_df[(final_df[column] < lower_bound) | (final_df[column] > upper_bound)]

    print(f"Outliers for {column}: {outliers.shape[0]}")

# Loop through the variables we are interested in
for column in outlier_columns:
    outlier(column)

Outliers for Runtime: 22
Outliers for IMDB_Rating: 13
Outliers for Meta_score: 10
Outliers for No_of_Votes: 31
Outliers for Gross: 60


As the print statements above indicate, all five continuous variables have outliers under the 1.5 × IQR rule, ranging from 10 for Meta_score to 60 for Gross.
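
To make the 1.5 × IQR rule concrete, the sketch below works through the bounds for Runtime using the quartiles from the `describe()` table (Q1 = 104.25, Q3 = 136.0), then applies the same rule to a small made-up series. The helpers `iqr_bounds` and `flag_outliers` and the `toy_runtimes` values are hypothetical and exist only for this illustration.

```python
import pandas as pd


def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the entries of a Series that fall outside the fences."""
    lower, upper = iqr_bounds(values.quantile(0.25), values.quantile(0.75), k)
    return values[(values < lower) | (values > upper)]


# Quartiles for Runtime taken from the describe() table
runtime_bounds = iqr_bounds(104.25, 136.0)
print(runtime_bounds)  # (56.625, 183.625)

# Made-up runtimes, only to show the rule in action
toy_runtimes = pd.Series([90, 100, 104, 120, 136, 140, 238], name="Runtime")
toy_outliers = flag_outliers(toy_runtimes)
print(toy_outliers.tolist())  # [238]
```

With these fences, any film shorter than about 57 minutes or longer than about 184 minutes counts as a Runtime outlier. Since the shortest film in the data runs 72 minutes, all 22 Runtime outliers are long films.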

### Data Dictionary

| Column Name         | Data Type | Description                                                                                          |
|---------------------|-----------|------------------------------------------------------------------------------------------------------|
| Series_Title        | object    | The title of the movie                                                                               |
| Released_Year       | int64     | The year the movie was released                                                                      |
| Certificate         | object    | The certification of the movie                                                                       |
| Runtime             | float64   | The runtime of the movie in minutes                                                                  |
| [All Genre Columns] | int64     | A genre the movie may fall under (A 0 indicates the movie does not fall under this genre, a 1 means it does) |
| IMDB_Rating         | float64   | The IMDB rating of the movie                                                                         |
| Meta_score          | float64   | The Metacritic score of the movie                                                                    |
| Director            | object    | The director of the movie                                                                            |
| Star1               | object    | The primary star of the movie                                                                        |
| Star2               | object    | The secondary star of the movie                                                                      |
| Star3               | object    | Additional star of the movie                                                                         |
| Star4               | object    | Additional star of the movie                                                                         |
| No_of_Votes         | int64     | The number of IMDB votes for the movie                                                               |
| Gross               | float64   | The gross earnings of the movie in the US market, in dollars                                         |


### Potential UI Components

1. Interactive data charts and graphs that allow the user to select individual data points and view a summary of each one
2. Dropdown filters for categorical columns so the user can see movies that match the selected values
3. Slider filters for numerical columns so the user can see movies within the selected ranges
4. Radio buttons for users to select what type of visualization they want
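
Whatever UI framework is used, the dropdown and slider filters above reduce to boolean masks on the dataframe. The following is a rough sketch of that filtering logic, independent of any specific dashboard library. The function `filter_movies` and its parameters are hypothetical, and the `movies` frame holds values from the `df.head()` output.

```python
import pandas as pd

# Five rows from the head of the data set
movies = pd.DataFrame({
    "Series_Title": [
        "The Shawshank Redemption", "The Godfather", "The Dark Knight",
        "The Godfather: Part II", "12 Angry Men",
    ],
    "Released_Year": [1994, 1972, 2008, 1974, 1957],
    "Certificate": ["A", "A", "UA", "A", "U"],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0, 9.0],
})


def filter_movies(frame, certificates=None, year_range=None):
    """Apply a dropdown-style and a slider-style filter to the frame."""
    mask = pd.Series(True, index=frame.index)
    if certificates:
        # Dropdown: keep rows whose certificate is among the selected values
        mask &= frame["Certificate"].isin(certificates)
    if year_range is not None:
        # Slider: keep rows inside the inclusive [low, high] range
        low, high = year_range
        mask &= frame["Released_Year"].between(low, high)
    return frame[mask]


selected = filter_movies(movies, certificates=["A"], year_range=(1970, 1980))
print(selected["Series_Title"].tolist())
```

Passing `None` for a filter leaves it inactive, which matches a dropdown with nothing selected or a slider left at its full range.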


### Potential Data Visualizations

1. Number of movies over time in a line chart or bar chart. This would show how the number of movies has changed over the years, to identify trends in movie production and highlight significant periods in film history. Hover for exact counts; click on a year to see a list of movies released that year.
2. Top directors by average rating in a horizontal bar chart. Rank directors based on the average rating of their movies. Hovering will show the average rating and some of the movies they directed.
3. Actors' genre preference in a stacked bar chart or radar chart. For selected actors, display the distribution of their movie genres. Hovering will show exact counts or percentages of movies an actor has in each genre. Select an actor in the dropdown to filter.
4. Genre popularity over time in a stacked area chart or multi-line chart. Display the popularity of different genres over time by plotting the number of movies in each genre by year. Click on a genre in the dropdown or on a line to filter the dashboard to show only movies from that genre.
5. Ratings versus Gross in a scatter plot. Plot each movie with its rating on one axis and its gross earnings on the other. Hover to display the movie title, director, and year.
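
Several of these charts rest on simple aggregations that can be prepared before any plotting. The sketch below shows one possible way to compute the inputs for visualizations 1, 2, and 4 with `groupby`. It uses five rows from the `df.head()` output with only three genre columns, and the names `movies_per_year`, `director_rating`, and `genre_by_year` are made up for this example.

```python
import pandas as pd

# Five rows from the head of the data set, with a subset of the genre dummies
movies = pd.DataFrame({
    "Series_Title": [
        "The Shawshank Redemption", "The Godfather", "The Dark Knight",
        "The Godfather: Part II", "12 Angry Men",
    ],
    "Released_Year": [1994, 1972, 2008, 1974, 1957],
    "Director": [
        "Frank Darabont", "Francis Ford Coppola", "Christopher Nolan",
        "Francis Ford Coppola", "Sidney Lumet",
    ],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0, 9.0],
    "Action": [0, 0, 1, 0, 0],
    "Crime": [0, 1, 1, 1, 1],
    "Drama": [1, 1, 1, 1, 1],
})

# 1. Number of movies released per year
movies_per_year = movies.groupby("Released_Year").size()

# 2. Average rating and film count per director, best first
director_rating = (
    movies.groupby("Director")["IMDB_Rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# 4. Number of movies per genre per year (sum of the 0/1 dummy columns)
genre_by_year = movies.groupby("Released_Year")[["Action", "Crime", "Drama"]].sum()

print(movies_per_year)
print(director_rating)
print(genre_by_year)
```

Keeping the film `count` next to the `mean` is a deliberate choice here: a director with a single highly rated film would otherwise outrank directors with many films, so the chart may need a minimum-count filter.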

