# Data Source
This project uses the IMDb Top 1000 dataset from Kaggle: [Kaggle Dataset Link](https://www.kaggle.com/datasets/ramjasmaurya/top-250s-in-imdb?resource=download&select=imdb+%281000+movies%29+in+june+2022.csv)


# Data Understanding

### 📊 IMDb Movies Dataset: Column Descriptions

| **Column Name** | **Description** |
|-----------------|-----------------|
| `ranking of movie` | The movie’s position in the IMDb Top 1000 list. |
| `movie name` | The title of the movie. |
| `Year` | The year the movie was released. |
| `certificate` | The movie’s age certification (e.g., PG, R, 12A, 15). |
| `runtime` | Duration of the movie in minutes, stored as text (e.g., “142 min”). |
| `genre` | The main genres of the movie (e.g., Drama, Action). |
| `RATING` | IMDb user rating out of 10. |
| `metascore` | A score from Metacritic (0–100) based on critic reviews. |
| `DETAIL ABOUT MOVIE` | A short summary or plot of the movie. |
| `DIRECTOR` | The name of the movie’s director. |
| `ACTOR 1` | The name of the first main actor. |
| `ACTOR 2` | The name of the second main actor. |
| `ACTOR 3` | The name of the third main actor. |
| `ACTOR 4` | The name of the fourth main actor. |
| `votes` | Number of IMDb user votes the movie received. |
| `GROSS COLLECTION` | Box office gross earnings in millions of USD, stored as text (e.g., “$134.97M”). |

# Data Exploration

In [570]:
import numpy as np
import pandas as pd

In [571]:
df = pd.read_csv('imdb (1000 movies) in june 2022.csv')
df.head(2)

Unnamed: 0,ranking of movie\r\n,movie name\r\n,Year,certificate,runtime,genre,RATING,metascore,DETAIL ABOUT MOVIE\n,DIRECTOR\r\n,ACTOR 1\n,ACTOR 2\n,ACTOR 3,ACTOR 4,votes,GROSS COLLECTION\r\n
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,-1972,X,175 min,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M


## Check Data Types

In [None]:
df.info()

In [None]:
df.columns.to_list()

In [574]:
'ranking of movie\r\n'.strip()

'ranking of movie'

In [575]:
# Handle column names
df.columns = df.columns.str.strip().str.lower()
df.columns

Index(['ranking of movie', 'movie name', 'year', 'certificate', 'runtime',
       'genre', 'rating', 'metascore', 'detail about movie', 'director',
       'actor 1', 'actor 2', 'actor 3', 'actor 4', 'votes',
       'gross collection'],
      dtype='object')

In [576]:
df.head(1)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime,genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M


In [None]:
df.info()

## Check Summary Statistics for Numerical Columns

In [578]:
df.describe().round(2)

Unnamed: 0,rating,metascore
count,1000.0,837.0
mean,7.96,78.63
std,0.28,12.05
min,7.6,28.0
25%,7.7,71.0
50%,7.9,80.0
75%,8.1,88.0
max,9.3,100.0


In [579]:
df.metascore.isna().mean() * 100

np.float64(16.3)

## Check Summary Statistics for Categorical Columns

In [None]:

df.select_dtypes(include='object').columns.to_list() 


In [581]:
df.describe(include=['object']).drop(['ranking of movie', 'year', 'runtime', 'votes', 'gross collection'], axis=1)

Unnamed: 0,movie name,certificate,genre,detail about movie,director,actor 1,actor 2,actor 3,actor 4
count,1000,995,1000,1000,1000,1000,1000,1000,1000
unique,997,15,202,1000,554,661,836,896,934
top,Scarface,15,Drama,"A scientist finds a way of becoming invisible,...",Alfred Hitchcock,Tom Hanks,Emma Watson,Carrie Fisher,Michael Caine
freq,2,287,84,1,13,12,6,4,4


In [582]:
df.certificate.isna().mean()*100

np.float64(0.5)

## Check Duplicates

In [583]:
df.duplicated().sum()

np.int64(0)

## Check Missing Values

In [None]:
(df.isna().mean() * 100).sort_values(ascending=False)

# Data Cleaning 

## In-depth check for categorical columns

In [585]:
cat_cols = df.select_dtypes(include= 'object').columns
cat_cols

Index(['ranking of movie', 'movie name', 'year', 'certificate', 'runtime',
       'genre', 'detail about movie', 'director', 'actor 1', 'actor 2',
       'actor 3', 'actor 4', 'votes', 'gross collection'],
      dtype='object')

In [None]:
for col in cat_cols:
    print(col)
    print(df[col].nunique())
    print(df[col].unique())
    print('\n', '*' * 100, '\n')

### Cleaning the **year** column 

In [None]:
df.year.sample(10)

In [588]:
'-1972'.strip('-')

'1972'

In [589]:
'(III) (2016)'.split()[1][1:-1]

'2016'

In [None]:
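def clean_year(x):
    """Convert a raw year string such as '-1994' or '(III) (2016)' to an int."""
    # Most years carry a leading dash, e.g. '-1972'.
    if x[0] == '-':
        return int(x.replace('-', ''))
    # Some years carry a Roman-numeral prefix, e.g. '(III) (2016)':
    # take the second token and drop its parentheses.
    elif 'I' in x:
        return int(x.split()[1][1:-1])
    else:
        return int(x)

df.year = df.year.apply(clean_year)
print(df.year.unique())
print(df['year'].dtype)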
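
As an alternative to branching on each format, the year can be pulled out with a regular expression. This is only a sketch: `extract_year` is a hypothetical helper, and it assumes every raw value contains exactly one four-digit run, which matches the two patterns inspected above but has not been checked against the full column.

```python
import re

def extract_year(raw: str) -> int:
    """Return the first four-digit run found in a raw year string."""
    match = re.search(r"\d{4}", raw)
    if match is None:
        raise ValueError(f"no four-digit year in {raw!r}")
    return int(match.group())

# The raw patterns seen in the cells above, plus a plain year.
samples = ["-1994", "(III) (2016)", "1972"]
years = [extract_year(s) for s in samples]
print(years)  # [1994, 2016, 1972]
```

Raising on a non-matching value makes unexpected formats visible instead of silently producing a wrong year.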

### Cleaning the **runtime** column

In [591]:
df.runtime

0      142 min
1      175 min
2      152 min
3      201 min
4      195 min
        ...   
995    113 min
996    118 min
997     83 min
998     86 min
999     71 min
Name: runtime, Length: 1000, dtype: object

In [592]:
int('113 min'.split()[0])

113

In [593]:
df.runtime = df.runtime.apply(lambda runtime: runtime.split()[0]).astype('int')
print(df.runtime)
print(df.runtime.dtype)

0      142
1      175
2      152
3      201
4      195
      ... 
995    113
996    118
997     83
998     86
999     71
Name: runtime, Length: 1000, dtype: int64
int64


In [None]:
print(df.runtime.to_list())

In [595]:
df = df.rename(columns= {'runtime' : 'runtime (min)'})
df.head(2)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection
0,1,The Shawshank Redemption,1994,15,142,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,1972,X,175,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M
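
The same conversion can be written without `apply` by using pandas string methods. The sketch below runs on a toy Series rather than the real column and assumes every value starts with the number of minutes, as in the samples shown above.

```python
import pandas as pd

# Toy values in the same "<number> min" shape as the raw column.
raw_runtime = pd.Series(["142 min", "175 min", "83 min"])

# Capture the first run of digits and convert it to an integer.
minutes = raw_runtime.str.extract(r"(\d+)", expand=False).astype(int)
print(minutes.to_list())  # [142, 175, 83]
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so the result can be assigned straight back to the column.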


### Cleaning the **votes** column

In [None]:
df.votes 

In [597]:
'2,603,314'.replace(',' , '')

'2603314'

In [None]:
df.votes = df.votes.str.replace(',', '').astype('int')
print(df.votes.unique())
print(df.votes.dtype)
print(df.votes.to_list())


### Cleaning the **gross collection** column

In [599]:
'$28.34M'.replace('$' , '').replace('M' , '')

'28.34'

In [600]:
'$28.34M'[1:-1]

'28.34'

In [None]:
def clean_gross_collection_col(revenue):
    # Missing values arrive as float NaN; pass them through unchanged.
    if isinstance(revenue, float):
        return revenue
    # Strip the leading '$' and the trailing 'M', e.g. '$28.34M' -> '28.34'.
    elif '$' in revenue:
        return revenue[1:-1]
    else:
        return revenue

df['gross collection'] = df['gross collection'].apply(clean_gross_collection_col).astype('float')
print(df['gross collection'].unique())
print(df['gross collection'].dtype)
print(df['gross collection'].to_list())
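
A vectorized variant of this cleaning step is sketched below on a toy Series. It assumes every non-missing value uses the `$<number>M` shape seen above; with `errors="coerce"`, anything that does not parse becomes `NaN` instead of raising.

```python
import pandas as pd

# Toy values: two in the "$<number>M" shape and one missing entry.
raw_gross = pd.Series(["$28.34M", "$134.97M", None])

# Drop the '$' and 'M' characters, then convert the remainder to numbers.
gross_m = pd.to_numeric(
    raw_gross.str.replace(r"[$M]", "", regex=True),
    errors="coerce",
)
print(gross_m.to_list())
```

Counting `gross_m.isna().sum()` before and after the conversion is a quick way to check whether coercion hid any unexpected formats.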


In [602]:
gross_mean = df['gross collection'].mean().round(2)

df['gross collection'] = df['gross collection'].fillna(gross_mean)
df['gross collection']

0       28.34
1      134.97
2      534.86
3      377.85
4       96.90
        ...  
995     70.16
996     30.50
997    184.93
998     70.16
999     70.16
Name: gross collection, Length: 1000, dtype: float64

In [None]:
df = df.rename(columns= {'gross collection' : 'gross collection (M)'})
df.head(2)

In [604]:
gross_mean = df['gross collection (M)'].mean().round(2)

df['gross collection (M)'] = df['gross collection (M)'].fillna(gross_mean)
df['gross collection (M)']

0       28.34
1      134.97
2      534.86
3      377.85
4       96.90
        ...  
995     70.16
996     30.50
997    184.93
998     70.16
999     70.16
Name: gross collection (M), Length: 1000, dtype: float64

In [605]:
df.head(1)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)
0,1,The Shawshank Redemption,1994,15,142,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,28.34


### Cleaning the **certificate** column

In [None]:
df[df.certificate.isna()]

In [607]:
df.dropna(subset= 'certificate', inplace= True, ignore_index= True)  

## In-depth check for Numerical columns

In [608]:
df.describe().round(2)

Unnamed: 0,year,runtime (min),rating,metascore,votes,gross collection (M)
count,995.0,995.0,995.0,836.0,995.0,995.0
mean,1991.18,123.73,7.96,78.61,303822.55,70.23
std,23.86,28.58,0.28,12.05,361475.09,103.13
min,1920.0,45.0,7.6,28.0,25277.0,0.0
25%,1975.0,103.0,7.7,71.0,60302.0,5.01
50%,1999.0,120.0,7.9,80.0,155549.0,44.79
75%,2010.0,138.0,8.1,88.0,419512.5,70.16
max,2022.0,321.0,9.3,100.0,2603314.0,936.66


In [609]:
drop_index=df[df['gross collection (M)']==0].index
df.drop(drop_index , axis=0 , inplace = True )
df.reset_index(drop =True , inplace=True)
df[df['gross collection (M)']==0]

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)


In [None]:
num_cols = df.select_dtypes(include='number').columns
for col in num_cols:
    print(col)
    print(df[col].isna().mean() * 100)
    print('\n', '*' * 100, '\n')

In [616]:
metascore_median = df.metascore.median()

df.metascore = df.metascore.fillna(metascore_median)
df.metascore.isna().sum()

np.int64(0)
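
Once missing values are filled, imputed rows can no longer be told apart from observed ones. One optional safeguard is to record a flag before filling. The sketch below uses a toy frame, and the `metascore_was_missing` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only, not taken from the dataset).
toy = pd.DataFrame({"metascore": [81.0, np.nan, 100.0, 70.0]})

# Record which rows were missing before they are overwritten.
toy["metascore_was_missing"] = toy["metascore"].isna()

# Fill with the median of the observed values (81.0 for this toy column).
toy["metascore"] = toy["metascore"].fillna(toy["metascore"].median())
print(toy)
```

The flag makes it possible to repeat later analyses with the imputed rows excluded.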

In [None]:
(df.isna().mean() * 100).sort_values(ascending=False)

In [618]:
df.shape

(993, 16)

## Drop Unnecessary columns 

In [619]:
df.drop(['ranking of movie', 'detail about movie'], axis= 1, inplace= True)

In [620]:
df.duplicated().sum()

np.int64(0)

# Data Analysis 

## What are the minimum and maximum ratings?

In [621]:
print(F"Minimum Rating : {df.rating.min()} , Maximum Rating : {df.rating.max()}")

Minimum Rating : 7.6 , Maximum Rating : 9.3


## Which movies have a rating above 9?

In [626]:
df[df.rating>9][['movie name' , 'rating']]

Unnamed: 0,movie name,rating
0,The Shawshank Redemption,9.3
1,The Godfather,9.2


## Top 10 Movies by Metascore

In [638]:
df.sort_values(by='metascore' , ascending=False ).head(10)[['movie name' , 'metascore' , 'rating']]

Unnamed: 0,movie name,metascore,rating
444,Sweet Smell of Success,100.0,8.0
433,The Leopard,100.0,8.0
125,Vertigo,100.0,8.3
121,Lawrence of Arabia,100.0,8.3
272,Three Colours: Red,100.0,8.1
131,Citizen Kane,100.0,8.3
285,Fanny and Alexander,100.0,8.1
492,Boyhood,100.0,7.9
572,Notorious,100.0,7.9
55,Rear Window,100.0,8.5


In [639]:
df.sort_values(by=['metascore' , 'rating'] , ascending=[False , False] ).head(10)[['movie name' , 'metascore' , 'rating']]

Unnamed: 0,movie name,metascore,rating
1,The Godfather,100.0,9.2
55,Rear Window,100.0,8.5
56,Casablanca,100.0,8.5
121,Lawrence of Arabia,100.0,8.3
125,Vertigo,100.0,8.3
131,Citizen Kane,100.0,8.3
272,Three Colours: Red,100.0,8.1
285,Fanny and Alexander,100.0,8.1
433,The Leopard,100.0,8.0
444,Sweet Smell of Success,100.0,8.0


## Top 10 Genres

In [642]:
df.genre.value_counts().head(10)

genre
Drama                           84
Drama, Romance                  37
Comedy, Drama, Romance          33
Comedy, Drama                   33
Action, Crime, Drama            32
Crime, Drama                    30
Animation, Adventure, Comedy    30
Crime, Drama, Mystery           29
Crime, Drama, Thriller          26
Biography, Drama, History       25
Name: count, dtype: int64
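
The counts above treat each combination, such as `Crime, Drama`, as its own category. To count individual genres instead, the strings can be split and exploded. The following sketch runs on a toy frame and assumes genres are separated by commas, as in the output above.

```python
import pandas as pd

# Toy data in the same comma-separated shape as the `genre` column.
toy = pd.DataFrame({
    "genre": ["Drama", "Crime, Drama", "Comedy, Drama, Romance", "Crime, Drama"]
})

single_genre_counts = (
    toy["genre"]
    .str.split(",")   # one list of genres per movie
    .explode()        # one row per (movie, genre) pair
    .str.strip()      # remove the space left after each comma
    .value_counts()
)
print(single_genre_counts)
```

With this approach a movie tagged with three genres contributes to three counts, so the totals exceed the number of movies.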

## Top 10 Directors by Gross Collection

In [646]:
df.groupby('director')['gross collection (M)'].sum().sort_values(ascending=False).head(10)

director
Steven Spielberg     2478.13
Anthony Russo        2205.04
Christopher Nolan    1937.45
James Cameron        1748.24
Peter Jackson        1597.31
J.J. Abrams          1423.17
Brad Bird            1099.63
Robert Zemeckis      1049.44
Pete Docter          1009.54
Jon Watts             804.75
Name: gross collection (M), dtype: float64

## Top 10 First Actors from 2002-2022

In [653]:
# The dataset ends in 2022, so a lower bound on the year is enough.
df_filtered = df[df.year >= 2002]
df_filtered['actor 1'].value_counts().head(10)

actor 1
Leonardo DiCaprio     8
Christian Bale        6
Tom Cruise            5
Tom Hanks             5
Matt Damon            4
Irrfan Khan           4
Akshay Kumar          4
Shah Rukh Khan        4
Daniel Radcliffe      4
Ayushmann Khurrana    4
Name: count, dtype: int64

## Is there a relationship between revenue (`gross collection (M)`) and `rating` or `metascore`?

In [655]:
df.corr(numeric_only=True).round(2)

Unnamed: 0,year,runtime (min),rating,metascore,votes,gross collection (M)
year,1.0,0.2,-0.09,-0.31,0.25,0.2
runtime (min),0.2,1.0,0.26,-0.04,0.16,0.12
rating,-0.09,0.26,1.0,0.24,0.5,0.1
metascore,-0.31,-0.04,0.24,1.0,-0.07,-0.07
votes,0.25,0.16,0.5,-0.07,1.0,0.55
gross collection (M),0.2,0.12,0.1,-0.07,0.55,1.0


In [656]:
df[['rating' , 'metascore' , 'gross collection (M)']].corr(numeric_only=True).round(2)

Unnamed: 0,rating,metascore,gross collection (M)
rating,1.0,0.24,0.1
metascore,0.24,1.0,-0.07
gross collection (M),0.1,-0.07,1.0
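
`DataFrame.corr` computes Pearson correlation by default, which measures linear association and is sensitive to extreme values. The gross figures are right-skewed (mean 70.23 against a median of 44.79 and a maximum of 936.66 in the summary above), so a rank-based check may be a useful complement. The sketch below uses toy numbers and computes Spearman correlation as Pearson correlation on ranks.

```python
import pandas as pd

# Toy data: gross rises with rating, but far from linearly.
toy = pd.DataFrame({
    "rating": [7.6, 7.8, 8.0, 8.5, 9.3],
    "gross collection (M)": [1.0, 4.0, 9.0, 16.0, 900.0],
})

# Pearson on the raw values.
pearson = toy.corr().loc["rating", "gross collection (M)"]

# Spearman is Pearson applied to the ranks of each column.
spearman = toy.rank().corr().loc["rating", "gross collection (M)"]

print(round(pearson, 2), round(spearman, 2))
```

On the real data, `df[['rating', 'metascore', 'gross collection (M)']].rank().corr()` would give the corresponding rank-based matrix.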


## What is the average `runtime` per `genre`?

In [661]:
df.groupby('genre')['runtime (min)'].mean().sort_values(ascending=False).head(10).round(2)


genre
Drama, Musical, Sport        224.00
Adventure, Drama, Family     220.00
Crime, Drama, Fantasy        189.00
Drama, Family, Musical       181.00
Biography, Comedy, Crime     180.00
Adventure, Drama, History    174.67
Biography, Drama, War        172.00
Comedy, Drama, Musical       170.50
Adventure, Drama, War        166.33
Drama, Musical               166.00
Name: runtime (min), dtype: float64
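
A group mean says little when the group holds only one or two films, and the output above does not show group sizes. A possible refinement is to compute the film count alongside the mean and keep only groups above a threshold. This is a sketch on toy data; the threshold of 3 is an arbitrary assumption.

```python
import pandas as pd

# Toy data (illustrative only, not taken from the dataset).
toy = pd.DataFrame({
    "genre": ["Drama, Musical, Sport", "Drama", "Drama", "Drama"],
    "runtime (min)": [224, 120, 130, 140],
})

stats = toy.groupby("genre")["runtime (min)"].agg(
    mean_runtime="mean",
    films="count",
)

# Keep only genre combinations backed by at least 3 films.
reliable = stats[stats["films"] >= 3].sort_values("mean_runtime", ascending=False)
print(reliable)
```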