# Data Source
This project uses the IMDb Top 1000 dataset from Kaggle: [Kaggle Dataset Link](https://www.kaggle.com/datasets/ramjasmaurya/top-250s-in-imdb?resource=download&select=imdb+%281000+movies%29+in+june+2022.csv)


# Data Understanding

### 📊 IMDb Movies Dataset: Column Descriptions

| **Column Name** | **Description** |
|-----------------|-----------------|
| `ranking of movie` | The movie’s position in the IMDb top 1000 list. |
| `movie name` | The title of the movie. |
| `Year` | The year the movie was released. |
| `certificate` | The movie’s age certification (e.g., PG, R, 12A, 15). |
| `runtime` | Duration of the movie in minutes (e.g., “142 min”). |
| `genre` | The main genres of the movie (e.g., Drama, Action). |
| `RATING` | IMDb user rating out of 10. |
| `metascore` | A score from Metacritic (0–100) based on critic reviews. |
| `DETAIL ABOUT MOVIE` | A short summary or plot of the movie. |
| `DIRECTOR` | The name of the movie’s director. |
| `ACTOR 1` | The name of the first main actor. |
| `ACTOR 2` | The name of the second main actor. |
| `ACTOR 3` | The name of the third main actor. |
| `ACTOR 4` | The name of the fourth main actor. |
| `votes` | Number of IMDb user votes the movie received. |
| `GROSS COLLECTION` | Box office gross earnings (in USD, e.g., “$134.97M”). |

# Data Exploration

In [570]:
import numpy as np
import pandas as pd

In [571]:
df = pd.read_csv('imdb (1000 movies) in june 2022.csv')
df.head(2)

Unnamed: 0,ranking of movie\r\n,movie name\r\n,Year,certificate,runtime,genre,RATING,metascore,DETAIL ABOUT MOVIE\n,DIRECTOR\r\n,ACTOR 1\n,ACTOR 2\n,ACTOR 3,ACTOR 4,votes,GROSS COLLECTION\r\n
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,-1972,X,175 min,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M


## Check Data Types

In [572]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ranking of movie
   1000 non-null   object 
 1   movie name
         1000 non-null   object 
 2   Year                 1000 non-null   object 
 3   certificate          995 non-null    object 
 4   runtime              1000 non-null   object 
 5   genre                1000 non-null   object 
 6   RATING               1000 non-null   float64
 7   metascore            837 non-null    float64
 8   DETAIL ABOUT MOVIE
  1000 non-null   object 
 9   DIRECTOR
           1000 non-null   object 
 10  ACTOR 1
             1000 non-null   object 
 11  ACTOR 2
             1000 non-null   object 
 12  ACTOR 3              1000 non-null   object 
 13  ACTOR 4              1000 non-null   object 
 14  votes                1000 non-null   object 
 15  GROSS COLLECTION
   820 non-null    object

In [573]:
df.columns.to_list()

['ranking of movie\r\n',
 'movie name\r\n',
 'Year',
 'certificate',
 'runtime',
 'genre',
 'RATING',
 'metascore',
 'DETAIL ABOUT MOVIE\n',
 'DIRECTOR\r\n',
 'ACTOR 1\n',
 'ACTOR 2\n',
 'ACTOR 3',
 'ACTOR 4',
 'votes',
 'GROSS COLLECTION\r\n']

In [574]:
'ranking of movie\r\n'.strip()

'ranking of movie'

In [575]:
# Handle column names
df.columns = df.columns.str.strip().str.lower()
df.columns

Index(['ranking of movie', 'movie name', 'year', 'certificate', 'runtime',
       'genre', 'rating', 'metascore', 'detail about movie', 'director',
       'actor 1', 'actor 2', 'actor 3', 'actor 4', 'votes',
       'gross collection'],
      dtype='object')

In [576]:
df.head(1)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime,genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection
0,1,The Shawshank Redemption,-1994,15,142 min,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M


In [577]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ranking of movie    1000 non-null   object 
 1   movie name          1000 non-null   object 
 2   year                1000 non-null   object 
 3   certificate         995 non-null    object 
 4   runtime             1000 non-null   object 
 5   genre               1000 non-null   object 
 6   rating              1000 non-null   float64
 7   metascore           837 non-null    float64
 8   detail about movie  1000 non-null   object 
 9   director            1000 non-null   object 
 10  actor 1             1000 non-null   object 
 11  actor 2             1000 non-null   object 
 12  actor 3             1000 non-null   object 
 13  actor 4             1000 non-null   object 
 14  votes               1000 non-null   object 
 15  gross collection    820 non-null    object 
dtypes: floa

## Check Summary Statistics for Numerical Columns

In [578]:
df.describe().round(2)

Unnamed: 0,rating,metascore
count,1000.0,837.0
mean,7.96,78.63
std,0.28,12.05
min,7.6,28.0
25%,7.7,71.0
50%,7.9,80.0
75%,8.1,88.0
max,9.3,100.0


In [579]:
df.metascore.isna().mean() * 100

np.float64(16.3)

## Check Summary Statistics for Categorical Columns

In [580]:

df.select_dtypes(include='object').columns.to_list() 


['ranking of movie',
 'movie name',
 'year',
 'certificate',
 'runtime',
 'genre',
 'detail about movie',
 'director',
 'actor 1',
 'actor 2',
 'actor 3',
 'actor 4',
 'votes',
 'gross collection']

In [581]:
df.describe(include=['object']).drop(columns=['ranking of movie', 'year', 'runtime', 'votes', 'gross collection'])

Unnamed: 0,movie name,certificate,genre,detail about movie,director,actor 1,actor 2,actor 3,actor 4
count,1000,995,1000,1000,1000,1000,1000,1000,1000
unique,997,15,202,1000,554,661,836,896,934
top,Scarface,15,Drama,"A scientist finds a way of becoming invisible,...",Alfred Hitchcock,Tom Hanks,Emma Watson,Carrie Fisher,Michael Caine
freq,2,287,84,1,13,12,6,4,4


In [582]:
df.certificate.isna().mean()*100

np.float64(0.5)

## Check Duplicates

In [583]:
df.duplicated().sum()

np.int64(0)
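
`df.duplicated()` only flags rows that match in every column, while the categorical summary above reports a title that appears twice. A minimal sketch of a title-level check follows; the three sample rows are hypothetical stand-ins, not rows copied from the dataset.

```python
import pandas as pd

# Hypothetical sample: two films share a title but differ in year
sample = pd.DataFrame({
    'movie name': ['Scarface', 'Scarface', 'Vertigo'],
    'year': [1932, 1983, 1958],
})

# keep=False marks every member of a duplicate group, not just the later ones
same_title = sample[sample.duplicated(subset=['movie name'], keep=False)]
print(same_title)

# Full-row duplicates are still zero because the years differ
print(sample.duplicated().sum())
```

Rows that share a title but differ in other columns are usually distinct films (remakes, for example), so they would normally be kept rather than dropped.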

## Check Missing Values

In [584]:
(df.isna().mean() * 100).sort_values(ascending=False)

gross collection      18.0
metascore             16.3
certificate            0.5
movie name             0.0
year                   0.0
runtime                0.0
genre                  0.0
ranking of movie       0.0
rating                 0.0
detail about movie     0.0
actor 1                0.0
director               0.0
actor 2                0.0
actor 3                0.0
actor 4                0.0
votes                  0.0
dtype: float64

# Data Cleaning 

## In-depth check for categorical columns

In [585]:
cat_cols = df.select_dtypes(include= 'object').columns
cat_cols

Index(['ranking of movie', 'movie name', 'year', 'certificate', 'runtime',
       'genre', 'detail about movie', 'director', 'actor 1', 'actor 2',
       'actor 3', 'actor 4', 'votes', 'gross collection'],
      dtype='object')

In [586]:
for col in cat_cols:
    print(col)
    print(df[col].nunique())
    print(df[col].unique())
    print('\n', '*' * 100, '\n')

ranking of movie
1000
['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16'
 '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30'
 '31' '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44'
 '45' '46' '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58'
 '59' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72'
 '73' '74' '75' '76' '77' '78' '79' '80' '81' '82' '83' '84' '85' '86'
 '87' '88' '89' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99' '100'
 '101' '102' '103' '104' '105' '106' '107' '108' '109' '110' '111' '112'
 '113' '114' '115' '116' '117' '118' '119' '120' '121' '122' '123' '124'
 '125' '126' '127' '128' '129' '130' '131' '132' '133' '134' '135' '136'
 '137' '138' '139' '140' '141' '142' '143' '144' '145' '146' '147' '148'
 '149' '150' '151' '152' '153' '154' '155' '156' '157' '158' '159' '160'
 '161' '162' '163' '164' '165' '166' '167' '168' '169' '170' '171' '172'
 '173' '174' '175' '176' '177' '178' '179

### Cleaning the **year** column 

In [587]:
df.year.sample(10)

591    -2017
123    -1960
658    -2004
491    -2009
770    -2020
8      -1994
386    -1999
930    -2009
99     -2021
230    -2019
Name: year, dtype: object

In [588]:
'-1972'.strip('-')

'1972'

In [589]:
'(III) (2016)'.split()[1][1:-1]

'2016'

In [590]:
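def clean_year(x):
    """Convert a raw year string such as '-1994' or '(III) (2016)' to an int."""
    if x[0] == '-':
        # Values like '-1994' carry a stray leading dash
        return int(x.replace('-', ''))
    elif 'I' in x:
        # Values like '(III) (2016)' put a Roman-numeral tag before the year
        return int(x.split()[1][1:-1])
    else:
        return int(x)

df.year = df.year.apply(clean_year)
print(df.year.unique())
print(df['year'].dtype)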

[1994 1972 2008 2003 1993 1974 1957 2021 2010 2002 1999 2001 1966 2020
 1990 1980 1975 2022 2014 1998 1997 1995 1991 1977 1962 1954 1946 2019
 2011 2006 2000 1988 1985 1979 1968 1960 1942 1936 1931 2018 2016 2017
 2012 2009 1986 1984 1981 1963 1964 1950 1940 2013 2007 2004 1992 1987
 1983 1973 1971 1961 1959 1958 1955 1952 1948 1944 1941 1927 1921 2015
 2005 1989 1982 1976 1969 1965 1953 1939 1928 1926 1925 1924 1996 1978
 1967 1951 1949 1937 1934 1930 1956 1947 1945 1920 1970 1943 1938 1933
 1932 1922 1935]
int64
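
The branching in `clean_year` could also be expressed as one vectorized step. The sketch below is an alternative, not the method used in this notebook: it pulls the first run of four digits out of each string with `Series.str.extract`. The sample values are hypothetical and only mirror the formats tested above.

```python
import pandas as pd

# Hypothetical sample covering the formats handled by clean_year
raw_years = pd.Series(['-1994', '(III) (2016)', '2008'])

# Capture the first four consecutive digits, then convert to integers
clean_years = raw_years.str.extract(r'(\d{4})', expand=False).astype(int)
print(clean_years.to_list())
```

A value with no four-digit run would come back as NaN and make `astype(int)` raise, which surfaces unexpected formats instead of silently mis-parsing them.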


### Cleaning the **runtime** column

In [591]:
df.runtime

0      142 min
1      175 min
2      152 min
3      201 min
4      195 min
        ...   
995    113 min
996    118 min
997     83 min
998     86 min
999     71 min
Name: runtime, Length: 1000, dtype: object

In [592]:
int('113 min'.split()[0])

113

In [593]:
df.runtime = df.runtime.apply(lambda runtime: runtime.split()[0]).astype('int')
print(df.runtime)
print(df.runtime.dtype)

0      142
1      175
2      152
3      201
4      195
      ... 
995    113
996    118
997     83
998     86
999     71
Name: runtime, Length: 1000, dtype: int64
int64


In [594]:
print(df.runtime.to_list())

[142, 175, 152, 201, 195, 202, 96, 164, 154, 148, 179, 139, 178, 142, 148, 153, 136, 145, 124, 133, 130, 169, 130, 125, 169, 189, 116, 127, 137, 118, 121, 133, 207, 130, 141, 132, 106, 112, 168, 164, 130, 151, 150, 155, 119, 106, 110, 88, 155, 89, 116, 147, 117, 165, 109, 112, 102, 87, 87, 160, 126, 122, 158, 106, 117, 181, 149, 105, 165, 164, 152, 170, 135, 98, 137, 101, 113, 122, 134, 178, 137, 160, 115, 149, 146, 143, 95, 116, 88, 110, 125, 170, 139, 125, 161, 160, 115, 123, 131, 148, 96, 165, 103, 153, 108, 122, 104, 102, 126, 81, 170, 99, 116, 142, 229, 170, 131, 129, 136, 149, 129, 218, 179, 125, 136, 128, 125, 103, 143, 89, 107, 119, 117, 153, 68, 143, 119, 138, 104, 139, 156, 130, 147, 167, 163, 186, 95, 321, 135, 129, 140, 138, 117, 132, 130, 97, 180, 122, 112, 158, 118, 132, 140, 156, 119, 135, 111, 100, 107, 107, 103, 138, 89, 178, 127, 130, 127, 132, 162, 109, 129, 124, 114, 91, 142, 130, 127, 132, 172, 110, 121, 161, 105, 136, 131, 89, 88, 138, 126, 99, 238, 110, 67, 95, 4
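
For comparison, the same conversion can be written without `apply`. This is a sketch on hypothetical sample values, assuming every entry ends with the literal suffix ` min` as in the output above.

```python
import pandas as pd

# Hypothetical sample in the same '<number> min' format
raw_runtime = pd.Series(['142 min', '83 min', '71 min'])

# Remove the literal suffix, then let pandas infer an integer dtype
runtime_min = pd.to_numeric(raw_runtime.str.replace(' min', '', regex=False))
print(runtime_min.to_list())
print(runtime_min.dtype)
```

`pd.to_numeric` raises by default on anything it cannot parse, so a stray unit such as hours would be reported rather than silently converted.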

In [595]:
df = df.rename(columns= {'runtime' : 'runtime (min)'})
df.head(2)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection
0,1,The Shawshank Redemption,1994,15,142,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,$28.34M
1,2,The Godfather,1972,X,175,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,$134.97M


### Cleaning the **votes** column

In [596]:
df.votes 

0      2,603,314
1      1,798,731
2      2,574,810
3      1,787,701
4      1,323,776
         ...    
995       64,498
996       46,804
997      196,361
998       56,586
999       34,632
Name: votes, Length: 1000, dtype: object

In [597]:
'2,603,314'.replace(',' , '')

'2603314'

In [598]:
df.votes = df.votes.str.replace(',', '').astype('int')
print(df.votes)
print(df.votes.dtype)
print(df.votes.to_list())


0      2603314
1      1798731
2      2574810
3      1787701
4      1323776
        ...   
995      64498
996      46804
997     196361
998      56586
999      34632
Name: votes, Length: 1000, dtype: int64
int64
[2603314, 1798731, 2574810, 1787701, 1323776, 1239027, 769113, 191329, 1995346, 2284252, 1614489, 2050591, 1808760, 2011517, 746445, 111090, 1868779, 1125939, 1260107, 988024, 197691, 1745808, 744265, 738076, 1355358, 1265684, 679930, 1598334, 1073997, 1393581, 1332341, 54658, 340610, 447135, 26935, 753815, 817783, 836550, 109453, 39965, 1300776, 1293859, 810102, 1462476, 1101298, 1065395, 1133091, 1031212, 255977, 269013, 1168856, 654463, 859025, 324579, 656848, 482791, 562032, 238449, 182312, 88965, 85804, 1207441, 29086, 254299, 498204, 1067806, 1027286, 483043, 1505535, 1662365, 35266, 388873, 117759, 1090345, 384675, 565266, 1212647, 1135911, 384644, 1021071, 703019, 395117, 947846, 247943, 991472, 43400, 482516, 122776, 194279, 218593, 22103
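
Thousands separators can also be handled when the file is loaded. The sketch below uses the `thousands` parameter of `pd.read_csv` on a small in-memory CSV; the two rows and the placeholder titles are hypothetical, and this is an alternative to the `str.replace` step above rather than what the notebook does.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with quoted, comma-grouped vote counts
csv_text = 'movie name,votes\nFilm A,"2,603,314"\nFilm B,"34,632"\n'

# thousands=',' tells the parser to treat commas inside numbers as grouping marks
sample = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(sample['votes'].to_list())
print(sample['votes'].dtype)
```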

### Cleaning the **gross collection** column

In [599]:
'$28.34M'.replace('$' , '').replace('M' , '')

'28.34'

In [600]:
'$28.34M'[1:-1]

'28.34'

In [601]:
def clean_gross_collection_col(revenue):
    # Missing values arrive as float NaN; pass them through unchanged
    if isinstance(revenue, float):
        return revenue
    # Strip the leading '$' and trailing 'M' from strings like '$28.34M'
    elif '$' in revenue:
        return revenue[1:-1]
    else:
        return revenue

df['gross collection'] = df['gross collection'].apply(clean_gross_collection_col).astype('float')
print(df['gross collection'])
print(df['gross collection'].dtype)
print(df['gross collection'].to_list())


0       28.34
1      134.97
2      534.86
3      377.85
4       96.90
        ...  
995       NaN
996     30.50
997    184.93
998       NaN
999       NaN
Name: gross collection, Length: 1000, dtype: float64
float64
[28.34, 134.97, 534.86, 377.85, 96.9, 57.3, 4.36, nan, 107.93, 292.58, 342.55, 37.03, 315.54, 330.25, 6.1, nan, 171.48, 46.84, 290.48, 112.0, nan, 188.02, 7.56, 10.06, 216.54, 136.8, 57.6, 100.13, 204.84, 130.74, 322.74, nan, 0.27, nan, nan, 53.37, 13.09, 13.18, nan, nan, 53.09, 132.38, 32.57, 187.71, 6.72, 23.34, 19.5, 422.78, 11.99, nan, 210.61, 83.47, 78.9, 5.32, 32.0, 36.76, 1.02, 0.16, 0.02, nan, 1.66, 335.45, nan, 5.02, 190.24, 858.37, 678.82, 209.73, 162.81, 448.14, nan, 6.53, nan, 223.81, 11.29, 0.71, 25.54, 130.1, 2.38, 75.6, 85.16, 51.97, 248.16, 11.49, 44.02, nan, 0.28, 8.18, nan, nan, 0.29, nan, nan, nan, 12.39, nan, 0.69, 7.1, 6.86, 804.75, 293.0, 1.22, 415.0, 120.54, 34.4, 33.23, 30.33, 3.64, 138.43, 191.8, 67.44, 2.83, 46.36, na
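
A vectorized variant of the same cleanup is sketched below on hypothetical values. It assumes every non-missing entry has the form `$<number>M`, and uses `errors='coerce'` so anything else becomes NaN instead of raising.

```python
import pandas as pd

# Hypothetical sample, including one missing value
raw_gross = pd.Series(['$28.34M', None, '$134.97M', '$0.27M'])

# str.strip('$M') removes those characters from both ends; missing entries stay missing
gross_musd = pd.to_numeric(raw_gross.str.strip('$M'), errors='coerce')
print(gross_musd.to_list())
```

Because coercion hides malformed entries, comparing `gross_musd.isna().sum()` with `raw_gross.isna().sum()` is a quick way to confirm nothing was lost.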

In [602]:
gross_mean = df['gross collection'].mean().round(2)

df['gross collection'] = df['gross collection'].fillna(gross_mean)
df['gross collection']

0       28.34
1      134.97
2      534.86
3      377.85
4       96.90
        ...  
995     70.16
996     30.50
997    184.93
998     70.16
999     70.16
Name: gross collection, Length: 1000, dtype: float64
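
Once the gaps are filled, an imputed value is indistinguishable from a reported one, so later sums over this column will include the filled figure. One option, sketched here with hypothetical rows and a hypothetical `gross_was_missing` flag column, is to record which rows were missing before filling.

```python
import pandas as pd

# Hypothetical sample: director A has one reported and one missing gross
sample = pd.DataFrame({
    'director': ['A', 'A', 'B'],
    'gross collection': [100.0, None, 50.0],
})

# Flag the gaps first (hypothetical column name), then fill with the mean as above
sample['gross_was_missing'] = sample['gross collection'].isna()
gross_mean = sample['gross collection'].mean().round(2)
sample['gross collection'] = sample['gross collection'].fillna(gross_mean)

# Totals restricted to reported figures only
reported_only = sample.loc[~sample['gross_was_missing']]
totals = reported_only.groupby('director')['gross collection'].sum()
print(totals)
```

With the flag in place, aggregations can be run both ways to see how much of a total comes from imputed values.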

In [603]:
df = df.rename(columns= {'gross collection' : 'gross collection (M)'})
df.head(2)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)
0,1,The Shawshank Redemption,1994,15,142,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,28.34
1,2,The Godfather,1972,X,175,"Crime, Drama",9.2,100.0,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1798731,134.97


In [604]:
gross_mean = df['gross collection (M)'].mean().round(2)

df['gross collection (M)'] = df['gross collection (M)'].fillna(gross_mean)
df['gross collection (M)']

0       28.34
1      134.97
2      534.86
3      377.85
4       96.90
        ...  
995     70.16
996     30.50
997    184.93
998     70.16
999     70.16
Name: gross collection (M), Length: 1000, dtype: float64

In [605]:
df.head(1)

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)
0,1,The Shawshank Redemption,1994,15,142,Drama,9.3,81.0,Two imprisoned men bond over a number of years...,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2603314,28.34


### Cleaning the **certificate** column

In [606]:
df[df.certificate.isna()]

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)
258,259,Everything's Gonna Be Great,1998,,107,"Comedy, Drama, Thriller",8.1,,When Altan swipes prescription drugs from his ...,√ñmer Vargi,Cem Yilmaz,Mazhar Alanson,Ceyda D√ºvenci,Selim Nasit,25360,70.16
382,383,My Sassy Girl,2001,,137,"Comedy, Drama, Romance",8.0,,"A young man sees a drunk, cute woman standing ...",Jae-young Kwak,Tae-Hyun Cha,Jun Ji-hyun,In-mun Kim,Song Wok-suk,47937,70.16
526,527,Knockin' on Heaven's Door,1997,,87,"Action, Crime, Comedy",7.9,,Two terminally ill patients escape from a hosp...,Thomas Jahn,Til Schweiger,Jan Josef Liefers,Thierry van Werveke,Moritz Bleibtreu,30288,0.0
729,730,Bringing Up Baby,1938,,102,Comedy,7.8,91.0,While trying to secure a $1 million donation f...,Howard Hawks,Katharine Hepburn,Cary Grant,Charles Ruggles,Walter Catlett,61077,70.16
741,742,Perfect Strangers,2016,,96,"Comedy, Drama",7.7,,Seven long-time friends meet for dinner. They ...,Paolo Genovese,Giuseppe Battiston,Anna Foglietta,Marco Giallini,Edoardo Leo,64462,70.16


In [607]:
df.dropna(subset= 'certificate', inplace= True, ignore_index= True)  

## In-depth check for numerical columns

In [608]:
df.describe().round(2)

Unnamed: 0,year,runtime (min),rating,metascore,votes,gross collection (M)
count,995.0,995.0,995.0,836.0,995.0,995.0
mean,1991.18,123.73,7.96,78.61,303822.55,70.23
std,23.86,28.58,0.28,12.05,361475.09,103.13
min,1920.0,45.0,7.6,28.0,25277.0,0.0
25%,1975.0,103.0,7.7,71.0,60302.0,5.01
50%,1999.0,120.0,7.9,80.0,155549.0,44.79
75%,2010.0,138.0,8.1,88.0,419512.5,70.16
max,2022.0,321.0,9.3,100.0,2603314.0,936.66


In [609]:
drop_index=df[df['gross collection (M)']==0].index
df.drop(drop_index , axis=0 , inplace = True )
df.reset_index(drop =True , inplace=True)
df[df['gross collection (M)']==0]

Unnamed: 0,ranking of movie,movie name,year,certificate,runtime (min),genre,rating,metascore,detail about movie,director,actor 1,actor 2,actor 3,actor 4,votes,gross collection (M)


In [615]:
num_cols = df.select_dtypes(include= 'number').columns
for col in num_cols:
    print(col)
    print(df[col].isna().mean()*100)
    print('\n', '*' * 100, '\n')

year
0.0

 ****************************************************************************************************

runtime (min)
0.0

 ****************************************************************************************************

rating
0.0

 ****************************************************************************************************

metascore
16.012084592145015

 ****************************************************************************************************

votes
0.0

 ****************************************************************************************************

gross collection (M)
0.0

 ****************************************************************************************************


In [616]:
metascore_median = df.metascore.median()

df.metascore = df.metascore.fillna(metascore_median)
df.metascore.isna().sum()

np.int64(0)

In [617]:
(df.isna().mean() * 100).sort_values(ascending=False)

ranking of movie        0.0
movie name              0.0
year                    0.0
certificate             0.0
runtime (min)           0.0
genre                   0.0
rating                  0.0
metascore               0.0
detail about movie      0.0
director                0.0
actor 1                 0.0
actor 2                 0.0
actor 3                 0.0
actor 4                 0.0
votes                   0.0
gross collection (M)    0.0
dtype: float64

In [618]:
df.shape

(993, 16)

## Drop Unnecessary columns 

In [619]:
df.drop(['ranking of movie', 'detail about movie'], axis= 1, inplace= True)

In [620]:
df.duplicated().sum()

np.int64(0)

# Data Analysis 

## What are the minimum and maximum ratings?

In [621]:
print(f"Minimum Rating : {df.rating.min()} , Maximum Rating : {df.rating.max()}")

Minimum Rating : 7.6 , Maximum Rating : 9.3


## Which movies have a rating above 9?

In [626]:
df[df.rating>9][['movie name' , 'rating']]

Unnamed: 0,movie name,rating
0,The Shawshank Redemption,9.3
1,The Godfather,9.2


## Top 10 movies by metascore

In [638]:
df.sort_values(by='metascore' , ascending=False ).head(10)[['movie name' , 'metascore' , 'rating']]

Unnamed: 0,movie name,metascore,rating
444,Sweet Smell of Success,100.0,8.0
433,The Leopard,100.0,8.0
125,Vertigo,100.0,8.3
121,Lawrence of Arabia,100.0,8.3
272,Three Colours: Red,100.0,8.1
131,Citizen Kane,100.0,8.3
285,Fanny and Alexander,100.0,8.1
492,Boyhood,100.0,7.9
572,Notorious,100.0,7.9
55,Rear Window,100.0,8.5


In [639]:
df.sort_values(by=['metascore' , 'rating'] , ascending=[False , False] ).head(10)[['movie name' , 'metascore' , 'rating']]

Unnamed: 0,movie name,metascore,rating
1,The Godfather,100.0,9.2
55,Rear Window,100.0,8.5
56,Casablanca,100.0,8.5
121,Lawrence of Arabia,100.0,8.3
125,Vertigo,100.0,8.3
131,Citizen Kane,100.0,8.3
272,Three Colours: Red,100.0,8.1
285,Fanny and Alexander,100.0,8.1
433,The Leopard,100.0,8.0
444,Sweet Smell of Success,100.0,8.0


## Top 10 genre combinations

In [642]:
df.genre.value_counts().head(10)

genre
Drama                           84
Drama, Romance                  37
Comedy, Drama, Romance          33
Comedy, Drama                   33
Action, Crime, Drama            32
Crime, Drama                    30
Animation, Adventure, Comedy    30
Crime, Drama, Mystery           29
Crime, Drama, Thriller          26
Biography, Drama, History       25
Name: count, dtype: int64
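
The counts above treat each comma-separated combination as its own category. To count individual genres instead, the strings can be split and exploded first. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical sample of combined genre strings
genres = pd.Series(['Drama', 'Crime, Drama', 'Action, Crime, Drama'])

# Split on ', ' into lists, give each genre its own row, then count
single_genre_counts = genres.str.split(', ').explode().value_counts()
print(single_genre_counts)
```

A film with three genres contributes three rows after `explode`, so these counts sum to more than the number of films.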

## Top 10 directors by gross collection

In [646]:
df.groupby('director')['gross collection (M)'].sum().sort_values(ascending=False).head(10)

director
Steven Spielberg     2478.13
Anthony Russo        2205.04
Christopher Nolan    1937.45
James Cameron        1748.24
Peter Jackson        1597.31
J.J. Abrams          1423.17
Brad Bird            1099.63
Robert Zemeckis      1049.44
Pete Docter          1009.54
Jon Watts             804.75
Name: gross collection (M), dtype: float64

## Top 10 first actors from 2002-2022

In [653]:
df_filtered = df[df.year >= 2002]
df_filtered['actor 1'].value_counts().head(10)

actor 1
Leonardo DiCaprio     8
Christian Bale        6
Tom Cruise            5
Tom Hanks             5
Matt Damon            4
Irrfan Khan           4
Akshay Kumar          4
Shah Rukh Khan        4
Daniel Radcliffe      4
Ayushmann Khurrana    4
Name: count, dtype: int64

## Is there a relationship between `gross collection (M)` and `rating` or `metascore`?

In [655]:
df.corr(numeric_only=True).round(2)

Unnamed: 0,year,runtime (min),rating,metascore,votes,gross collection (M)
year,1.0,0.2,-0.09,-0.31,0.25,0.2
runtime (min),0.2,1.0,0.26,-0.04,0.16,0.12
rating,-0.09,0.26,1.0,0.24,0.5,0.1
metascore,-0.31,-0.04,0.24,1.0,-0.07,-0.07
votes,0.25,0.16,0.5,-0.07,1.0,0.55
gross collection (M),0.2,0.12,0.1,-0.07,0.55,1.0


In [656]:
df[['rating' , 'metascore' , 'gross collection (M)']].corr(numeric_only=True).round(2)

Unnamed: 0,rating,metascore,gross collection (M)
rating,1.0,0.24,0.1
metascore,0.24,1.0,-0.07
gross collection (M),0.1,-0.07,1.0


## What is the average `runtime` per `genre`?

In [661]:
df.groupby('genre')['runtime (min)'].mean().sort_values(ascending=False).head(10).round(2)


genre
Drama, Musical, Sport        224.00
Adventure, Drama, Family     220.00
Crime, Drama, Fantasy        189.00
Drama, Family, Musical       181.00
Biography, Comedy, Crime     180.00
Adventure, Drama, History    174.67
Biography, Drama, War        172.00
Comedy, Drama, Musical       170.50
Adventure, Drama, War        166.33
Drama, Musical               166.00
Name: runtime (min), dtype: float64
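
Several of the averages above are whole numbers, which may indicate that the group contains a single film. Reporting the group size next to the mean makes that visible. This sketch uses hypothetical rows and a hypothetical `min_films` threshold.

```python
import pandas as pd

# Hypothetical sample: one well-populated genre and one single-film genre
sample = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Drama, Musical, Sport'],
    'runtime (min)': [120, 140, 100, 224],
})

# Mean runtime and number of films per genre, longest first
stats = (
    sample.groupby('genre')['runtime (min)']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)

# Keep only genres with enough films for the mean to be informative
min_films = 3  # hypothetical threshold
reliable = stats[stats['count'] >= min_films]
print(stats)
print(reliable)
```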