In [None]:
import numpy as np
import pandas as pd

# Load IMDB Data

A data set of the 1,000 most popular movies on IMDB from 2006 to 2016. The data fields included are:

Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore

1. Download the dataset from [Kaggle](https://www.kaggle.com/datasets/PromptCloudHQ/imdb-data?resource=download) - you may need to create a free account with your Google account
2. Upload the file to your Colab session by clicking the Files menu in the left toolbar (see the image below for where to find it)

![img](https://drive.google.com/uc?export=view&id=1P4YcZK7g_1gl5XyLClD0StYLo9u9w2w4)

3. Read into a pandas dataframe as below



In [None]:
df = pd.read_csv('IMDB-Movie-Data.csv')

In [None]:
df.head(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0


## Exercises

1. Check how many missing values there are in each column of the data set

In [None]:
df.isna().sum()

Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [None]:
# other options
# (df.isnull().sum() / len(df)).sort_values(ascending = False) --> shows the fraction of missing values per column (multiply by 100 for a percentage)
# df.info() --> shows general info on the dataframe, including the number of non-missing values per column
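
The counting options above can be tried on a small stand-in frame. This is a minimal sketch: the `toy` dataframe and its values are made up for illustration and are not taken from the IMDB file. It relies on `isna().mean()` giving the same per-column fraction as dividing the sum by `len(df)`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for the IMDB data
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue (Millions)": [10.0, np.nan, np.nan, 40.0],
    "Metascore": [70.0, 65.0, np.nan, 80.0],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_counts)
print(missing_share)
```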

2. Check how many unique directors are in the list of top 1000 movies

In [None]:
df['Director'].nunique()

644

In [None]:
# other options (more custom)
# len(df['Director'].unique()) --> works too, but prefer built-in pandas functionality when it is available and you are aware of it
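
One difference between the two options is worth knowing: `nunique()` ignores missing values by default, while `unique()` keeps them. The `Director` column has no missing values, so both give the same count here. The sketch below uses a made-up `directors` series to show where they diverge.

```python
import numpy as np
import pandas as pd

# Made-up series with one repeated name and one missing value
directors = pd.Series(["James Gunn", "Ridley Scott", "Ridley Scott", np.nan])

n_default = directors.nunique()               # missing values are not counted
n_with_nan = directors.nunique(dropna=False)  # missing value counted as its own entry
n_unique_len = len(directors.unique())        # unique() keeps the missing value

print(n_default, n_with_nan, n_unique_len)
```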

3. Drop all rows where the Metascore (average critics' score) is missing

In [None]:
df['Metascore'].isnull().sum()

64

In [None]:
df_with_dropped_mts = df.dropna(subset = ['Metascore'])

In [None]:
df_with_dropped_mts['Metascore'].isnull().sum()

0

In [None]:
# note that the original dataframe did not change! (dropna returns a new dataframe, which is usually best practice)
df['Metascore'].isnull().sum()

64

4. Replace the missing values in the `Revenue (Millions)` column with the average Revenue for other movies in the list for the same `Year`





In [None]:
# rename the revenue column for simplicity
df = df.rename(columns = {'Revenue (Millions)': 'revenue_mill'})

In [None]:
df['revenue_mill'].isnull().sum()

128

In [None]:
avg_by_year = df.loc[~df['revenue_mill'].isna()]\
    .groupby('Year')['revenue_mill']\
    .mean()\
    .round(3) # for creating completely accurate results, better to leave out rounding (rounding is useful for displaying the results)

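# look up each row's yearly average via its Year, then fill only the missing revenues
df['revenue_mill'] = df['revenue_mill'].fillna(df['Year'].map(avg_by_year))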

In [None]:
df['revenue_mill'].isna().sum()

0
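
The per-year fill in exercise 4 can also be written with `groupby(...).transform('mean')`, which returns the group mean aligned to every row, so no separate lookup table is needed. This is a sketch on a made-up `toy` frame, not the notebook's own solution. It assumes every year has at least one known revenue; a year with none would stay missing.

```python
import numpy as np
import pandas as pd

# Made-up data: two years, one missing revenue in each
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2016, 2016],
    "revenue_mill": [100.0, 200.0, np.nan, 50.0, np.nan],
})

# Mean revenue of each row's year (missing values are skipped by mean)
year_mean = toy.groupby("Year")["revenue_mill"].transform("mean")

# Fill only the missing entries with their year's mean
filled = toy["revenue_mill"].fillna(year_mean)

print(filled)
```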

Some extra tasks you can ponder before going to the cinema. Reach out if you have questions!

5. Find all the movies where Leonardo DiCaprio is listed as an actor.



In [None]:
leonardo_movies = df[df['Actors'].str.contains('Leonardo DiCaprio')]  ## all movies that list Leonardo DiCaprio as an actor
leonardo_movies.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
80,81,Inception,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
82,83,The Wolf of Wall Street,"Biography,Comedy,Crime","Based on the true story of Jordan Belfort, fro...",Martin Scorsese,"Leonardo DiCaprio, Jonah Hill, Margot Robbie,M...",2013,180,8.2,865134,116.87,75.0
99,100,The Departed,"Crime,Drama,Thriller",An undercover cop and a mole in the police att...,Martin Scorsese,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...",2006,151,8.5,937414,132.37,85.0
129,130,The Revenant,"Adventure,Drama,Thriller",A frontiersman on a fur trading expedition in ...,Alejandro González Iñárritu,"Leonardo DiCaprio, Tom Hardy, Will Poulter, Do...",2015,156,8.0,499424,183.64,76.0
137,138,The Great Gatsby,"Drama,Romance","A writer and wall street trader, Nick, finds h...",Baz Luhrmann,"Leonardo DiCaprio, Carey Mulligan, Joel Edgert...",2013,143,7.3,386102,144.81,55.0


In [None]:
leonardo_movies.shape[0]  ## the number of movies with DiCaprio

10
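
`str.contains` treats its pattern as a regular expression by default and returns a missing value for rows where the text itself is missing. The `Actors` column has no missing values, so the filter above works as written. For columns that do have gaps, a more defensive variant might look like this sketch; the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Actors entry
toy = pd.DataFrame({
    "Title": ["M1", "M2", "M3"],
    "Actors": ["Leonardo DiCaprio, Tom Hardy", "Chris Pratt, Vin Diesel", np.nan],
})

# regex=False matches the literal text; na=False treats missing rows as non-matches
mask = toy["Actors"].str.contains("Leonardo DiCaprio", regex=False, na=False)
matches = toy[mask]

print(matches)
```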

6. In which genres did he appear most often?


In [None]:
pd.DataFrame(leonardo_movies.groupby('Genre').size().sort_values(ascending=False))

Genre,0
"Adventure,Drama,Thriller",2
"Drama,Romance",2
"Action,Adventure,Sci-Fi",1
"Action,Drama,Romance",1
"Biography,Comedy,Crime",1
"Crime,Drama,Thriller",1
"Drama,Western",1
"Mystery,Thriller",1

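Grouping on `Genre` counts each full combination, such as "Drama,Romance", as its own category. To count individual genres instead, one option is to split the comma-separated string and explode it into one row per genre. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Made-up rows using the same comma-separated Genre format as the data set
toy = pd.DataFrame({
    "Title": ["M1", "M2", "M3"],
    "Genre": ["Adventure,Drama,Thriller", "Drama,Romance", "Crime,Drama,Thriller"],
})

# Split each string into a list, expand to one row per genre, then count
genre_counts = toy["Genre"].str.split(",").explode().value_counts()

print(genre_counts)
```

Applied to `leonardo_movies['Genre']`, the same chain would give per-genre counts rather than per-combination counts.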

7. And what is the average rating of his movies? Is it higher or lower than the average rating of all movies?

In [None]:
leonardo_avg_rating = leonardo_movies['Rating'].mean()  ## average rating of DiCaprio's movies

overall_avg_rating = df['Rating'].mean()  ## average rating of all movies

print("Leonardo's rating:", leonardo_avg_rating)
print("Rating of all movies:", overall_avg_rating)

Leonardo's rating: 7.969999999999999
Rating of all movies: 6.723199999999999
