<a href="https://colab.research.google.com/github/mottaquikarim/OMDB_analysis/blob/master/OMDB_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OMDB Exploratory Analysis

In this notebook, we explore a few key characteristics of an OMDB dataset. The first step is to import the **movies_rated** dataset from a CSV file and load it into a pandas `DataFrame` for processing.


In [0]:
# First, find out where we are ...
!ls

movies_rated.csv  sample_data




A copy of this dataset is available **[here](https://raw.githubusercontent.com/mottaquikarim/PythonProgramming/master/raw_data/movies_rated.csv)** on GitHub.

In [0]:
import pandas as pd

omdb_df = pd.read_csv('movies_rated.csv')
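`pd.read_csv` accepts a local path, a URL, or a file-like object, so the same call can also read the GitHub copy directly. The following is a minimal sketch: `RAW_URL` and the helper `load_movies` are names introduced here for illustration, and the in-memory sample assumes the CSV has the nine columns shown later in the notebook, with no separate index column.

```python
import io

import pandas as pd

# The raw GitHub link given above; passing it to load_movies() needs network access.
RAW_URL = (
    "https://raw.githubusercontent.com/mottaquikarim/PythonProgramming/"
    "master/raw_data/movies_rated.csv"
)


def load_movies(source="movies_rated.csv"):
    """Hypothetical helper: read the movies CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)


# Offline demonstration on two rows copied from the head() output in this notebook.
sample_csv = io.StringIO(
    "title,year,content_rating,genre,duration,gross,"
    "Internet Movie Database,Rotten Tomatoes,Metacritic\n"
    "The Shawshank Redemption,1994,R,Drama,142,1963330,9.3,9.1,8.0\n"
    "The Godfather,1972,R,Crime,175,28341469,9.2,9.8,10.0\n"
)
sample_df = load_movies(sample_csv)
print(sample_df.shape)  # (2, 9)
```

Calling `load_movies(RAW_URL)` should then return the full dataset without needing a local copy of the file.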

## Observing the first 5 rows 

In [5]:
omdb_df.head()

| | title | year | content_rating | genre | duration | gross | Internet Movie Database | Rotten Tomatoes | Metacritic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | R | Drama | 142 | 1963330 | 9.3 | 9.1 | 8.0 |
| 1 | The Godfather | 1972 | R | Crime | 175 | 28341469 | 9.2 | 9.8 | 10.0 |
| 2 | The Dark Knight | 2008 | PG-13 | Action | 152 | 1344258 | 9.0 | 9.4 | 8.2 |
| 3 | The Godfather: Part II | 1974 | R | Crime | 202 | 134966411 | 9.0 | 9.7 | 9.0 |
| 4 | Pulp Fiction | 1994 | R | Crime | 154 | 1935047 | 8.9 | 9.4 | 9.4 |


## Num Rows / Cols in this dataset

In [0]:
omdb_df.shape

(79, 9)

## List of column names

In [0]:
omdb_df.columns

Index(['title', 'year', 'content_rating', 'genre', 'duration', 'gross',
       'Internet Movie Database', 'Rotten Tomatoes', 'Metacritic'],
      dtype='object')

## Column datatypes

In [0]:
omdb_df.dtypes

title                       object
year                         int64
content_rating              object
genre                       object
duration                     int64
gross                        int64
Internet Movie Database    float64
Rotten Tomatoes            float64
Metacritic                 float64
dtype: object
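The dtypes can also be used to split the columns programmatically, which helps when applying numeric summaries only to numeric columns. This is a small sketch on a hypothetical four-column sample, not the full dataset.

```python
import pandas as pd

# Hypothetical sample with a subset of the notebook's columns.
sample_df = pd.DataFrame(
    {
        "title": ["The Shawshank Redemption", "The Godfather"],
        "year": [1994, 1972],
        "genre": ["Drama", "Crime"],
        "Rotten Tomatoes": [9.1, 9.8],
    }
)

# Integer and float columns.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
# Everything else (here, the text columns).
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['year', 'Rotten Tomatoes']
print(other_cols)    # ['title', 'genre']
```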

## Number of unique genres in the dataset

In [0]:
omdb_df["genre"].nunique()

12

## Movies per genre

In [0]:
omdb_df["genre"].value_counts()

Crime                  16
Drama                  14
Action                 11
Adventure               9
Drama                   7
Biography               5
Animation               5
Comedy                  4
Western                 3
Mystery                 2
Horror                  2
Comedy                  1
Name: genre, dtype: int64
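`Drama` and `Comedy` each appear twice in the counts above, which suggests that some genre labels differ by characters that are invisible when printed. A likely but unconfirmed cause is leading or trailing whitespace. The sketch below uses a hypothetical sample in which the trailing space is an assumption, and shows how `str.strip()` merges such labels.

```python
import pandas as pd

# Hypothetical sample: the trailing spaces are an assumed cause of the duplicates.
genres = pd.Series(["Drama", "Drama ", "Crime", "Comedy", "Comedy "], name="genre")

print(genres.nunique())  # 5
print(genres.unique())   # the raw values make stray whitespace visible

# Remove leading and trailing whitespace before counting.
cleaned = genres.str.strip()
cleaned_counts = cleaned.value_counts()

print(cleaned.nunique())  # 3
```

If whitespace is indeed the cause, running `omdb_df["genre"].str.strip().nunique()` on the real data would return 10 rather than 12.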

In [0]:
omdb_df[["genre", "title"]].groupby("genre").count()

| genre | title |
|---|---|
| Action | 11 |
| Adventure | 9 |
| Animation | 5 |
| Biography | 5 |
| Comedy | 4 |
| Comedy | 1 |
| Crime | 16 |
| Drama | 14 |
| Drama | 7 |
| Horror | 2 |
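The `groupby` call above and `value_counts()` answer the same question. As a rough sketch on a hypothetical sample with placeholder titles, `groupby("genre").size()` counts rows per group, and sorting it in descending order gives the same numbers as `value_counts()`.

```python
import pandas as pd

# Hypothetical sample; the titles are placeholders.
sample_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "genre": ["Crime", "Drama", "Crime", "Action", "Crime"],
    }
)

# size() counts rows per group; count() would count non-null titles instead.
per_genre = sample_df.groupby("genre").size().sort_values(ascending=False)
counts = sample_df["genre"].value_counts()

print(per_genre["Crime"])  # 3
```

The two approaches differ only when the counted column contains missing values: `count()` skips them, while `size()` and `value_counts()` on the genre column do not depend on the title column at all.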


## Top 5 R-rated movies by IMDb rating

In [0]:
all_R = omdb_df[omdb_df["content_rating"] == "R"]
all_R.sort_values(by="Internet Movie Database", ascending=False).head()

| | title | year | content_rating | genre | duration | gross | Internet Movie Database | Rotten Tomatoes | Metacritic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | R | Drama | 142 | 1963330 | 9.3 | 9.1 | 8.0 |
| 1 | The Godfather | 1972 | R | Crime | 175 | 28341469 | 9.2 | 9.8 | 10.0 |
| 3 | The Godfather: Part II | 1974 | R | Crime | 202 | 134966411 | 9.0 | 9.7 | 9.0 |
| 5 | Schindler's List | 1993 | R | Biography | 195 | 534858444 | 8.9 | 9.7 | 9.3 |
| 7 | The Good, the Bad and the Ugly | 1966 | R | Western | 178 | 57300000 | 8.9 | 9.7 | 9.0 |
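Several R-rated films share the 8.9 rating at the cutoff. Pulp Fiction, for instance, is R-rated with 8.9 in the `head()` output but is absent from the five rows above, so which tied films make the list depends on how the sort happens to order them. The sketch below, on a hypothetical sample built from ratings shown in this notebook, uses `DataFrame.nlargest`, whose `keep` argument makes the tie handling explicit.

```python
import pandas as pd

# Sample built from rows shown in this notebook (a subset of columns).
sample_df = pd.DataFrame(
    {
        "title": [
            "The Shawshank Redemption",
            "The Godfather",
            "The Dark Knight",
            "Pulp Fiction",
            "Schindler's List",
            "The Good, the Bad and the Ugly",
        ],
        "content_rating": ["R", "R", "PG-13", "R", "R", "R"],
        "Internet Movie Database": [9.3, 9.2, 9.0, 8.9, 8.9, 8.9],
    }
)

r_rated = sample_df[sample_df["content_rating"] == "R"]

# keep="first" (the default) breaks ties by original row order.
top3_first = r_rated.nlargest(3, "Internet Movie Database")
# keep="all" returns every row tied with the last place, so it may exceed n rows.
top3_all = r_rated.nlargest(3, "Internet Movie Database", keep="all")

print(top3_first["title"].tolist())
# ['The Shawshank Redemption', 'The Godfather', 'Pulp Fiction']
print(len(top3_all))  # 5
```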


## Average Rotten Tomatoes score for all available films

In [0]:
omdb_df["Rotten Tomatoes"].mean()

9.087341772151895
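The same `mean()` call works on several columns at once, which allows a side-by-side comparison of the three rating sources. This sketch uses only the five rows shown in the `head()` output, so its numbers describe that sample and not the full dataset.

```python
import pandas as pd

# The three rating columns from the five head() rows shown in this notebook.
ratings = pd.DataFrame(
    {
        "Internet Movie Database": [9.3, 9.2, 9.0, 9.0, 8.9],
        "Rotten Tomatoes": [9.1, 9.8, 9.4, 9.7, 9.4],
        "Metacritic": [8.0, 10.0, 8.2, 9.0, 9.4],
    }
)

# Column-wise mean: one value per rating source, rounded for display.
source_means = ratings.mean().round(2)
print(source_means)
```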

## Average Rotten Tomatoes score for the top 5 films

In [10]:
omdb_df.sort_values(by="Rotten Tomatoes", ascending=False).head()[["Rotten Tomatoes"]].mean()

Rotten Tomatoes    10.0
dtype: float64
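The result above is a one-element `Series` because the double brackets `[["Rotten Tomatoes"]]` select a `DataFrame`. Selecting the column with single brackets yields a plain number instead. A minimal sketch with hypothetical scores:

```python
import pandas as pd

# Hypothetical scores, not the full dataset.
scores = pd.DataFrame({"Rotten Tomatoes": [9.1, 9.8, 9.4, 9.7, 10.0]})

# Single brackets select a Series, so mean() returns a scalar.
top2_mean = scores["Rotten Tomatoes"].nlargest(2).mean()

# Double brackets select a DataFrame, so mean() returns a Series.
top2_mean_series = (
    scores.sort_values(by="Rotten Tomatoes", ascending=False)
    .head(2)[["Rotten Tomatoes"]]
    .mean()
)

print(round(top2_mean, 2))  # 9.9
```

A mean of exactly 10.0 for the top five in the real data implies that at least five films have a Rotten Tomatoes score of 10.0.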