# Project: Investigate a Dataset - TMDb Movie Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
    <ul>
    <li><a href="#question1">WHAT PROPERTIES CAN BE GOOD INDICATORS OF HIGH REVENUE MOVIES?</a></li>
    </ul>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description  

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.


### Question(s) for Analysis
>**Question One**: WHAT PROPERTIES CAN BE GOOD INDICATORS OF HIGH REVENUE MOVIES?

In [None]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html


In [None]:
# Upgrade pandas to use dataframe.explode() function. 
#!pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling

### General Properties


In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

df = pd.read_csv('../input/tmdbmovies/tmdb-movies.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.head(7)

## Checking for null values

In [None]:
df.isnull().sum()

## Data Cleaning

Dropping the following columns:

* ``cast``
* ``homepage``
* ``director``
* ``tagline``
* ``keywords``
* ``overview``
* ``production_companies``

because they are specific to individual movies and do not add value to the question(s) for analysis.

In [None]:
df.drop(columns=['cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'production_companies'], inplace=True)

In [None]:
df.info()

The column ``genres`` has some missing values. We are going to check the pattern of the rows with the null values.

In [None]:
# Creating a filter variable to select the rows with null values in the genres column.
filt = (df['genres'].isnull())
# Checking the rows
df.loc[filt]

We will drop the rows with ``NaN`` in the ``genres`` column, since removing them does not affect the quality of the data set.
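
As a minimal sketch of the null check and row drop described above, the toy frame below (made-up values, not taken from the TMDb file) counts missing values per column and then removes only the rows whose ``genres`` value is missing. The names `toy`, `missing` and `cleaned` are illustrative, and `dropna(subset=...)` is shown as one possible alternative to dropping by index.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (all values are made up).
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Action|Drama", np.nan, "Comedy", np.nan],
    "revenue_adj": [100.0, 0.0, 50.0, 0.0],
})

# Count missing values per column, as df.isnull().sum() does.
missing = toy.isnull().sum()
print(missing)

# Drop only the rows whose genres value is missing; other columns are ignored.
cleaned = toy.dropna(subset=["genres"])
print(len(toy), "->", len(cleaned))
```

Restricting the drop to one column keeps rows that are missing values only in columns the analysis does not use.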

In [None]:
df.drop(index=df[filt].index, inplace=True)

The data set quality is good, so we will proceed to answering our analysis question.

In [None]:
df.info()

<a id='eda'></a>
## Exploratory Data Analysis



<a id="question1"></a>

### WHAT PROPERTIES CAN BE GOOD INDICATORS OF HIGH REVENUE MOVIES?

We are going to investigate the correlation of:

* `` popularity ``
* `` vote_average ``
* `` vote_count ``
* `` budget_adj``
* `` revenue_adj``
* `` runtime ``

with the returns of the movies.

We are going to check whether these properties have a positive correlation with returns and how many points fall close to the line of best fit.

We will check these properties on groups of high-earning movies:

<ul>
<li><a href="#10">Top 10</a></li>
<li><a href="#1000">Top 1000</a></li>
<li><a href="#10000">Top 10000</a></li>
<li><a href="#all">All Sorted Movies</a></li>
</ul>
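
To make the two checks above concrete, here is a small sketch on synthetic data (not the movie data). It assumes a simple linear relationship with noise, computes the Pearson correlation coefficient with `Series.corr`, fits a degree-1 line with `np.polyfit`, and uses the residuals as a rough measure of how close the points lie to that line. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic data: y rises with x plus some noise (not taken from the data set).
rng = np.random.default_rng(0)
x = pd.Series(np.linspace(0, 10, 50))
y = 2.0 * x + 1.0 + pd.Series(rng.normal(0, 1.0, 50))

# Pearson correlation coefficient; positive values mean y tends to rise with x.
r = y.corr(x)

# Degree-1 least-squares fit: slope and intercept of the line of best fit.
slope, intercept = np.polyfit(x, y, 1)

# Residuals: vertical distance of each point from the fitted line.
residuals = y - (slope * x + intercept)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"spread of points around the line (std of residuals) = {residuals.std():.3f}")
```

A coefficient near 1 together with small residuals means most points sit close to the line; a coefficient near 0 means the line says little about the data.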


In [None]:
# Helper function for plotting correlation graph.
def corr_plot(x, y, title, x_label, y_label):
    plt.title(title)
    plt.scatter(x, y);
    # This method plots a line of best-fit onto the scatter plot.
    plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))
            (np.unique(x)), color="red", label=("corr = " + str(y.corr(x))));
    plt.xlabel(x_label);
    plt.ylabel(y_label);
    plt.legend(loc=1)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.
df.describe()

We are going to add a new column:

1. ``returns`` 

$$
\text{returns} = \text{revenue\_adj} - \text{budget\_adj}
$$

which will show the returns of the movies.
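
As a worked example of the formula above, the sketch below applies it to three made-up movies and ranks them by the result. The figures and the names `toy` and `ranked` are purely illustrative.

```python
import pandas as pd

# Made-up figures, for illustration only.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget_adj": [100.0, 50.0, 200.0],
    "revenue_adj": [300.0, 40.0, 250.0],
})

# returns = revenue_adj - budget_adj; a negative value means the movie lost money.
toy["returns"] = toy["revenue_adj"] - toy["budget_adj"]

# Rank the movies with the highest returns first.
ranked = toy.sort_values(by="returns", ascending=False)
print(ranked[["original_title", "returns"]])
```

Movie C has the second-highest revenue but also the largest budget, so its returns are well below those of movie A.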

In [None]:
# Creating a new column called [returns].
df['returns'] = df['revenue_adj'] - df['budget_adj']
df.head(10)

In [None]:
# Create a data set sorted by the returns column in descending order.
df_sorted = df.sort_values(by=['returns'], ascending=False)

In [None]:
# The top 10 highest earners
df_top10 = df_sorted.head(10)
# The top 1000 highest earners
df_top1000 = df_sorted.head(1000)
# The top 10000 highest earners
df_top10000 = df_sorted.head(10000)

<a id='10'></a>
# TOP 10

We are going to investigate the following properties and see what correlation they have with high earnings in movies (Top 10):

* `` popularity ``
* `` vote_average ``
* `` vote_count ``
* `` budget_adj``
* `` revenue_adj``
* `` runtime ``
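
The correlations for all of the listed properties can also be computed in one step. The sketch below is one possible way to do this with `DataFrame.corrwith`, run on a synthetic frame whose column names mirror the data set; the values are random, so the numbers it prints say nothing about real movies.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the movie data (random values, same column names).
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "popularity": rng.random(n),
    "vote_average": rng.uniform(1, 10, n),
    "vote_count": rng.integers(10, 5000, n).astype(float),
    "budget_adj": rng.uniform(1e6, 2e8, n),
    "runtime": rng.integers(60, 200, n).astype(float),
})
toy["revenue_adj"] = 2.5 * toy["budget_adj"] + rng.normal(0, 5e7, n)
toy["returns"] = toy["revenue_adj"] - toy["budget_adj"]

properties = ["popularity", "vote_average", "vote_count",
              "budget_adj", "revenue_adj", "runtime"]

# Take the 10 rows with the highest returns, then correlate each property with returns.
top = toy.sort_values("returns", ascending=False).head(10)
corrs = top[properties].corrwith(top["returns"])
print(corrs.sort_values(ascending=False))
```

Two points are worth keeping in mind when reading such a table: with only 10 rows a single movie can shift a coefficient noticeably, and ``revenue_adj`` is one of the two terms used to compute ``returns``, so some correlation between them is expected by construction.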


In [None]:
df_top10.describe()

In [None]:
# Scatter plot of Returns and Popularity
x = df_top10['returns']
y = df_top10['popularity']
title ='Correlation Between Returns and Popularity'
x_label = 'Returns'
y_label = 'Popularity'

corr_plot(x, y, title, x_label, y_label)

In [None]:
# Scatter plot of Returns and Vote average
x = df_top10['returns']
y = df_top10['vote_average']
title ='Correlation Between Returns and Vote Average'
x_label = 'Returns'
y_label = 'Vote Average'
corr_plot(x, y, title, x_label, y_label)

In [None]:
# Scatter plot of Returns and Budget_adj
x = df_top10['returns']
y = df_top10['budget_adj']
title ='Correlation Between Returns and Budget Adj'
x_label = 'Returns'
y_label = 'Budget Adj.'
corr_plot(x, y, title, x_label, y_label)

In [None]:
# Scatter plot of Returns and Vote count
x = df_top10['returns']
y = df_top10['vote_count']
title ='Correlation Between Returns and Vote Count'
x_label = 'Returns'
y_label = 'Vote Count'
corr_plot(x, y, title, x_label, y_label)

In [None]:
# Scatter plot of Returns and Revenue_adj
x = df_top10['returns']
y = df_top10['revenue_adj']
title ='Correlation Between Returns and Revenue Adj'
x_label = 'Returns'
y_label = 'Revenue Adj'
corr_plot(x, y, title, x_label, y_label)

In [None]:
# Scatter plot of Returns and Runtime
x = df_top10['returns']
y = df_top10['runtime']
title ='Correlation Between Returns and Runtime'
x_label = 'Returns'
y_label = 'Runtime'
corr_plot(x, y, title, x_label, y_label)

<a id='1000'></a>
# TOP 1000

We are going to investigate the following properties and see what correlation they have with high earnings in movies (Top 1000):

* `` popularity ``
* `` vote_average ``
* `` vote_count ``
* `` budget_adj``
* `` revenue_adj``
* `` runtime ``


In [None]:
df_top1000.describe()

In [None]:
x = df_top1000['returns']
y = df_top1000['popularity']
title ='Correlation Between Returns and Popularity'
x_label = 'Returns'
y_label = 'Popularity'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top1000['returns']
y = df_top1000['vote_average']
title ='Correlation Between Returns and Vote average'
x_label = 'Returns'
y_label = 'Vote Average'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top1000['returns']
y = df_top1000['budget_adj']
title ='Correlation Between Returns and Budget Adj'
x_label = 'Returns'
y_label = 'Budget Adj.'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top1000['returns']
y = df_top1000['vote_count']
title ='Correlation Between Returns and Vote Count'
x_label = 'Returns'
y_label = 'Vote Count'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top1000['returns']
y = df_top1000['revenue_adj']
title ='Correlation Between Returns and Revenue Adj'
x_label = 'Returns'
y_label = 'Revenue Adj'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top1000['returns']
y = df_top1000['runtime']
title ='Correlation Between Returns and Runtime'
x_label = 'Returns'
y_label = 'Runtime'
corr_plot(x, y, title, x_label, y_label)

<a id='10000'></a>
# TOP 10000

We are going to investigate the following properties and see what correlation they have with high earnings in movies (Top 10000):

* `` popularity ``
* `` vote_average ``
* `` vote_count ``
* `` budget_adj``
* `` revenue_adj``
* `` runtime ``


In [None]:
df_top10000.describe()

In [None]:
x = df_top10000['returns']
y = df_top10000['popularity']
title ='Correlation Between Returns and Popularity'
x_label = 'Returns'
y_label = 'Popularity'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top10000['returns']
y = df_top10000['vote_average']
title ='Correlation Between Returns and Vote Average'
x_label = 'Returns'
y_label = 'Vote Average'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top10000['returns']
y = df_top10000['budget_adj']
title ='Correlation Between Returns and Budget Adj'
x_label = 'Returns'
y_label = 'Budget Adj.'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top10000['returns']
y = df_top10000['vote_count']
title ='Correlation Between Returns and Vote Count'
x_label = 'Returns'
y_label = 'Vote Count'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top10000['returns']
y = df_top10000['revenue_adj']
title ='Correlation Between Returns and Revenue Adj'
x_label = 'Returns'
y_label = 'Revenue Adj'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_top10000['returns']
y = df_top10000['runtime']
title ='Correlation Between Returns and Runtime'
x_label = 'Returns'
y_label = 'Runtime'
corr_plot(x, y, title, x_label, y_label)

<a id='all'></a>
# ALL SORTED MOVIES

We are going to investigate the following properties and see what correlation they have with high earnings in all movies:

* `` popularity ``
* `` vote_average ``
* `` vote_count ``
* `` budget_adj``
* `` revenue_adj``
* `` runtime ``
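
To compare the groups side by side, the coefficients can be collected into a single table. The sketch below shows one way to do that on synthetic data; the group sizes, the names `groups` and `summary`, and the assumed link between ``popularity`` and ``returns`` are all illustrative and not derived from the data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: returns is built to depend on popularity only.
rng = np.random.default_rng(2)
n = 500
toy = pd.DataFrame({
    "popularity": rng.random(n),
    "vote_count": rng.integers(10, 5000, n).astype(float),
    "budget_adj": rng.uniform(1e6, 2e8, n),
})
toy["returns"] = 1e8 * toy["popularity"] + rng.normal(0, 2e7, n)
toy_sorted = toy.sort_values("returns", ascending=False)

# Groups of top earners, from the smallest group to all rows.
groups = {
    "top 10": toy_sorted.head(10),
    "top 100": toy_sorted.head(100),
    "all": toy_sorted,
}
properties = ["popularity", "vote_count", "budget_adj"]

# One column of correlation coefficients per group.
summary = pd.DataFrame({
    name: group[properties].corrwith(group["returns"])
    for name, group in groups.items()
})
print(summary.round(3))
```

Because each group is selected by its ``returns`` value, a coefficient measured on a small top group can differ from the one measured on all rows, which is why the groups are reported separately.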


In [None]:
x = df_sorted['returns']
y = df_sorted['popularity']
title ='Correlation Between Returns and Popularity'
x_label = 'Returns'
y_label = 'Popularity'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['vote_average']
title ='Correlation Between Returns and Vote Average'
x_label = 'Returns'
y_label = 'Vote Average'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['budget_adj']
title ='Correlation Between Returns and Budget Adj'
x_label = 'Returns'
y_label = 'Budget Adj.'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['vote_count']
title ='Correlation Between Returns and Vote Count'
x_label = 'Returns'
y_label = 'Vote Count'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['revenue_adj']
title ='Correlation Between Returns and Revenue Adj'
x_label = 'Returns'
y_label = 'Revenue Adj'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['runtime']
title ='Correlation Between Returns and Runtime'
x_label = 'Returns'
y_label = 'Runtime'
corr_plot(x, y, title, x_label, y_label)

In [None]:
x = df_sorted['returns']
y = df_sorted['runtime']

print(y.corr(x))

<a id='conclusions'></a>
## Conclusions

   After checking the properties in the data set, we found that:
   
   * ``popularity``
   * ``vote_average``
   * ``vote_count``
   * ``revenue_adj``
        
are good indicators of high-earning movies, since they show how the fans of the movies have received them.
    
   We also experienced limitations such as:
   > **1**: Not all rows had values for ``genres`` and ``imdb_id``, so we had to drop the rows with missing ``genres``. They may be few, but when analysing earnings/returns, each movie has a specific amount associated with it.
        

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])