# IMDB Movies Dataset
Top 1000 Movies by IMDB Rating.

## About Dataset
***
### IMDb
IMDb (Internet Movie Database) is an online database of information related to films, television series, podcasts, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
***
#### Context
IMDb dataset of the top 1000 movies and TV shows.
Dataset URL: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

***
#### Content
Columns:
- **Poster_Link** - Link to the poster image that IMDb uses
- **Series_Title** - Name of the movie
- **Released_Year** - Year in which the movie was released
- **Certificate** - Certificate awarded to the movie
- **Runtime** - Total runtime of the movie
- **Genre** - Genre of the movie
- **IMDB_Rating** - Rating of the movie on the IMDb site
- **Overview** - Short summary of the plot
- **Meta_score** - Score earned by the movie
- **Director** - Name of the director
- **Star1, Star2, Star3, Star4** - Names of the stars
- **No_of_Votes** - Total number of votes
- **Gross** - Money earned by the movie

#### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly as py
import seaborn as sns
import plotly.express as px
%matplotlib inline

In [2]:
from plotly.offline import iplot
py.offline.init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()

#### Importing Dataset & Cleaning the Data

In [3]:
df = pd.read_csv('imdb_top_1000.csv')

In [4]:
df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB
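
The `df.info()` output lists `Released_Year`, `Runtime` and `Gross` as `object` columns, so they stay out of numeric summaries until they are converted. Below is a minimal sketch of one way to convert the first two. The rows are made up, and the `Runtime_min` column name and the malformed `"PG"` year are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd

# Toy rows shaped like the dataset; the values are illustrative only.
sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "PG"],  # "PG" stands in for a malformed entry (assumption)
    "Runtime": ["142 min", "175 min", "96 min"],
})

# Non-numeric years become NaN instead of raising an error.
sample["Released_Year"] = pd.to_numeric(sample["Released_Year"], errors="coerce")

# Pull the leading digits out of strings such as "142 min".
# Runtime_min is a hypothetical column name.
sample["Runtime_min"] = sample["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)

print(sample)
```

Coercing to `NaN` keeps the affected row in the frame, so it can be inspected or dropped in a later step.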


In [6]:
df.isnull().sum()

Poster_Link        0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

In [7]:
df.dropna(subset=['Certificate', 'Meta_score','Gross'], inplace=True)

In [8]:
df.isnull().sum()

Poster_Link      0
Series_Title     0
Released_Year    0
Certificate      0
Runtime          0
Genre            0
IMDB_Rating      0
Overview         0
Meta_score       0
Director         0
Star1            0
Star2            0
Star3            0
Star4            0
No_of_Votes      0
Gross            0
dtype: int64
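
Dropping every row that lacks `Certificate`, `Meta_score` or `Gross` shrinks the data noticeably: the notebook's `df.info()` calls report 1000 entries before and 714 after. A small sketch of how that loss could be measured before committing to it, using a toy frame with made-up values:

```python
import pandas as pd

# Toy frame: only the first row has all three required fields.
toy = pd.DataFrame({
    "Certificate": ["A", None, "U", "UA"],
    "Meta_score": [80.0, 90.0, None, 84.0],
    "Gross": ["28341469", "134966411", "4360000", None],
})

required = ["Certificate", "Meta_score", "Gross"]

# Without inplace=True, dropna returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna(subset=required)

rows_lost = len(toy) - len(cleaned)
share_lost = rows_lost / len(toy)
print(f"dropped {rows_lost} of {len(toy)} rows ({share_lost:.0%})")
```

Keeping the original frame around makes it easy to compare results with and without the incomplete rows.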

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 997
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    714 non-null    object 
 1   Series_Title   714 non-null    object 
 2   Released_Year  714 non-null    object 
 3   Certificate    714 non-null    object 
 4   Runtime        714 non-null    object 
 5   Genre          714 non-null    object 
 6   IMDB_Rating    714 non-null    float64
 7   Overview       714 non-null    object 
 8   Meta_score     714 non-null    float64
 9   Director       714 non-null    object 
 10  Star1          714 non-null    object 
 11  Star2          714 non-null    object 
 12  Star3          714 non-null    object 
 13  Star4          714 non-null    object 
 14  No_of_Votes    714 non-null    int64  
 15  Gross          714 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 94.8+ KB


In [10]:
df.describe()

| | IMDB_Rating | Meta_score | No_of_Votes |
|---|---|---|---|
| count | 714.0 | 714.0 | 714.0 |
| mean | 7.937115 | 77.158263 | 356134.8 |
| std | 0.293278 | 12.401144 | 353901.1 |
| min | 7.6 | 28.0 | 25229.0 |
| 25% | 7.7 | 70.0 | 96009.75 |
| 50% | 7.9 | 78.0 | 236602.5 |
| 75% | 8.1 | 86.0 | 507792.2 |
| max | 9.3 | 100.0 | 2343110.0 |


In [11]:
df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [12]:
# Drop the columns that are not needed for the analysis
df.drop(['Poster_Link', 'Overview'], axis=1, inplace=True)

In [13]:
df.head()

| | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [14]:
# Strip thousands separators, then cast the cleaned strings to integers
df['Gross'] = df['Gross'].str.replace(',', '', regex=False)
df['Gross'] = df['Gross'].astype(int)
df.describe()

| | IMDB_Rating | Meta_score | No_of_Votes | Gross |
|---|---|---|---|---|
| count | 714.0 | 714.0 | 714.0 | 714.0 |
| mean | 7.937115 | 77.158263 | 356134.8 | 78513590.0 |
| std | 0.293278 | 12.401144 | 353901.1 | 114978000.0 |
| min | 7.6 | 28.0 | 25229.0 | 1305.0 |
| 25% | 7.7 | 70.0 | 96009.75 | 6157408.0 |
| 50% | 7.9 | 78.0 | 236602.5 | 34850150.0 |
| 75% | 8.1 | 86.0 | 507792.2 | 102464100.0 |
| max | 9.3 | 100.0 | 2343110.0 | 936662200.0 |
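
The `astype(int)` cast in the cell above works here because rows with a missing `Gross` were dropped earlier; it would raise on a missing or non-numeric value. A hedged alternative, sketched on an illustrative series (the values, including `"n/a"`, are made up), is to let `pd.to_numeric` coerce anything unparseable:

```python
import pandas as pd

# Illustrative values only: two valid amounts, one missing, one unparseable.
gross_raw = pd.Series(["28,341,469", "134,966,411", None, "n/a"])

# regex=False treats the comma as a literal character rather than a pattern.
gross_clean = gross_raw.str.replace(",", "", regex=False)

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
gross_num = pd.to_numeric(gross_clean, errors="coerce")

print(gross_num)
```

The result is a float series because `NaN` cannot be stored in a plain integer column.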


### Top 20 Most Voted Movies

In [15]:
top20 = df[['Series_Title', 'No_of_Votes']]
# Sort by vote count and plot the 20 most voted titles as a bar chart
top20.sort_values(by='No_of_Votes', ascending=False).head(20).iplot(
    x='Series_Title', y='No_of_Votes', kind='bar',
    xTitle='Title of Movies', yTitle='Vote Count', title='Top 20 Most Voted Movies')

- 'The Shawshank Redemption' received the most votes, about 2.34 million.
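
The ranking behind the chart can also be checked without plotting. A minimal sketch using `DataFrame.nlargest`, which returns the same rows as sorting in descending order and taking the head; the titles and counts below are placeholders, not data from the notebook:

```python
import pandas as pd

votes = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],  # placeholder titles
    "No_of_Votes": [120_000, 950_000, 430_000, 75_000],        # made-up counts
})

# Keep the 2 rows with the highest vote counts, ordered from high to low.
top2 = votes.nlargest(2, "No_of_Votes")

print(top2)
```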

### Movie Count per Certificate

In [16]:
# color_discrete_sequence expects a flat list of colors, so pass the palette directly
px.histogram(df, x='Certificate', color_discrete_sequence=px.colors.qualitative.Set2)

- "U" is the most common certificate.
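
The counts that the histogram draws can be read off directly with `value_counts`. A short sketch on a made-up series of certificates:

```python
import pandas as pd

# Made-up certificates; the real column is df['Certificate'].
certs = pd.Series(["U", "UA", "A", "U", "R", "U", "UA"], name="Certificate")

# value_counts sorts from the most to the least frequent value.
counts = certs.value_counts()

print(counts)
print("most common:", counts.idxmax())
```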

### Genre with Highest IMDb Rating and Revenue

In [71]:
# numeric_only=True restricts the mean to numeric columns and avoids the pandas deprecation warning
t_rates = df.groupby('Genre').mean(numeric_only=True).sort_values('IMDB_Rating', ascending=False).head(10).reset_index()
t_rates.iplot(x='Genre', y='IMDB_Rating', kind='bar', xTitle='Genre', yTitle='Rating')



- Genre 'Crime, Mystery, Thriller' has the highest average IMDb rating, 8.5.

In [73]:
# numeric_only=True restricts the mean to numeric columns and avoids the pandas deprecation warning
t_rates = df.groupby('Genre').mean(numeric_only=True).sort_values('Gross', ascending=False).head(10).reset_index()
t_rates.iplot(x='Genre', y='Gross', kind='bar', xTitle='Genre', yTitle='Gross')



- Genre 'Family, Sci-Fi' generated the highest average revenue, almost 435 million.
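
Grouping on the raw `Genre` string treats each combination, such as 'Crime, Drama', as its own category, so a combination that appears only once or twice can top the ranking. One possible refinement, sketched here on made-up rows and not part of the original analysis, is to report the group size next to the mean and to split the combined strings into single genres with `explode`:

```python
import pandas as pd

# Made-up rows that mimic the comma-separated Genre format.
films = pd.DataFrame({
    "Genre": ["Crime, Drama", "Drama", "Action, Crime, Drama", "Drama"],
    "IMDB_Rating": [9.2, 9.3, 9.0, 8.5],
})

# Mean rating per genre combination, with the number of films behind each mean.
by_combo = films.groupby("Genre")["IMDB_Rating"].agg(["mean", "count"])

# Split "Crime, Drama" into ["Crime", "Drama"] and give each genre its own row.
single = films.assign(Genre=films["Genre"].str.split(", ")).explode("Genre")
by_genre = single.groupby("Genre")["IMDB_Rating"].agg(["mean", "count"])

print(by_combo)
print(by_genre.sort_values("mean", ascending=False))
```

After exploding, a film counts once for every genre it lists, so the per-genre counts add up to more than the number of films.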

In [28]:
gross_by_cat = df.groupby('Certificate')[['Gross']].sum().sort_values(by='Gross', ascending=False)
gross_by_cat.iplot(kind='bar')

- Movies with a UA certificate earned more in total than those in any other certificate category.
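
A total per certificate depends on how many movies carry that certificate as well as on how much each one earned. A minimal sketch, using made-up numbers, of showing the sum, the mean and the count side by side so the two effects can be told apart:

```python
import pandas as pd

# Made-up figures: three modest UA films and one large A film.
films = pd.DataFrame({
    "Certificate": ["UA", "UA", "UA", "A", "U"],
    "Gross": [100, 200, 300, 500, 50],
})

summary = films.groupby("Certificate")["Gross"].agg(["sum", "mean", "count"])

print(summary.sort_values("sum", ascending=False))
```

In this toy data "UA" has the largest total while "A" has the largest average, which shows why the two rankings can differ.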

#### Top 10 Movies by Gross Income for Each Certificate

In [65]:
top_10_a = df.groupby(['Certificate', 'Series_Title'])[['Gross']].sum().reset_index()
top_10_a = top_10_a[top_10_a['Certificate'] == 'A'].sort_values(by=['Gross'], ascending=False).head(10)
top_10_a.iplot(x='Series_Title', y='Gross', kind='bar')


- 'Joker' generated the highest revenue among "A" rated movies.

In [58]:
top_10_UA = df.groupby(['Certificate', 'Series_Title'])[['Gross']].sum().reset_index()
top_10_UA = top_10_UA[top_10_UA['Certificate'] == 'UA'].sort_values(by=['Gross'], ascending=False)[:10]
top_10_UA.iplot(x='Series_Title', y='Gross', kind='bar')

- 'Avengers: Endgame' generated the highest revenue among "UA" rated movies.

In [86]:
top_10_U = df.groupby(['Certificate', 'Series_Title'])[['Gross']].sum().reset_index()
top_10_U = top_10_U[top_10_U['Certificate'] == 'U'].sort_values(by=['Gross'], ascending=False)[:10]
top_10_U.iplot(x='Series_Title', y='Gross', kind='bar')

- 'Star Wars: Episode VII - The Force Awakens' generated the highest revenue among "U" rated movies.

In [60]:
top_10_R = df.groupby(['Certificate', 'Series_Title'])[['Gross']].sum().reset_index()
top_10_R = top_10_R[top_10_R['Certificate'] == 'R'].sort_values(by=['Gross'], ascending=False)[:10]
top_10_R.iplot(x='Series_Title', y='Gross', kind='bar')

- 'Deadpool' generated the highest revenue among "R" rated movies.
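
The four cells above repeat the same filter, sort and slice with a different certificate each time. A sketch of how that pattern could be wrapped in a function; `top_n_by_certificate` is a hypothetical helper and the rows are placeholders, neither taken from the notebook:

```python
import pandas as pd

# Placeholder rows with made-up gross figures.
films = pd.DataFrame({
    "Certificate": ["A", "A", "A", "UA", "UA", "U"],
    "Series_Title": ["Film A", "Film B", "Film C", "Film D", "Film E", "Film F"],
    "Gross": [300, 100, 200, 500, 900, 50],
})


def top_n_by_certificate(frame, certificate, n=10):
    """Return the n highest-grossing rows for one certificate (hypothetical helper)."""
    subset = frame[frame["Certificate"] == certificate]
    return subset.nlargest(n, "Gross")[["Series_Title", "Gross"]]


for cert in ["A", "UA"]:
    print(cert)
    print(top_n_by_certificate(films, cert, n=2))
```

This version filters rows directly instead of grouping by title first, which gives the same result when each title appears once per certificate.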

### Relationship of Gross Income to Meta Score and IMDb Rating

In [80]:
df.iplot(kind = 'scatter', x='Meta_score',y='Gross', mode='markers', xTitle='Meta score', yTitle='Gross', title='Meta score and Gross income relation')

- The plot suggests that movies with higher Meta scores tend to earn higher gross revenue.

In [83]:
df.iplot(kind = 'scatter', x='IMDB_Rating',y='Gross', mode='markers', xTitle='IMDb Rating', yTitle='Gross', title='IMDb Rating and Gross income relation')

- A high IMDb rating does not necessarily mean higher gross revenue.
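
Both observations are read off scatter plots by eye. One way to put a number on them is a Pearson correlation matrix; the sketch below runs on made-up values, so its output says nothing about the real dataset:

```python
import pandas as pd

# Made-up values: Gross rises in step with Meta_score but not with IMDB_Rating.
toy = pd.DataFrame({
    "Meta_score": [60, 70, 80, 90, 100],
    "IMDB_Rating": [7.6, 8.0, 7.8, 8.4, 8.1],
    "Gross": [10, 20, 30, 40, 50],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr(method="pearson")

print(corr["Gross"])
```

On the cleaned notebook data, the same call on `df[['Meta_score', 'IMDB_Rating', 'Gross']]` would give the corresponding figures. A correlation describes how two columns move together and does not show that one causes the other.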