# Project: Investigate a Dataset - tmdb-movies.csv

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> This dataset contains over 10,000 movies from the TMDb movie platform. Its columns include `cast`, `budget`, `revenue`, and `runtime`, along with many others that help in analysing the data.


### Question(s) for Analysis
##### Q1: Does a higher budget translate into higher revenue?
##### Q2: Will movies generate bigger revenues in the future?

In [4]:
# Import the packages used throughout the analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Jupyter magic so that visualizations are plotted inline with the notebook.
# See: http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline

movies = pd.read_csv('tmdb-movies.csv')
movies.head(3)


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0


<a id='wrangling'></a>
## Data Wrangling and Cleaning
We first inspect the data to pinpoint the cleaning it needs.
The cleaning could involve some of these steps:
<ul>
    <li>Removing duplicate rows</li>
    <li>Converting `release_date` to a datetime type</li>
    <li>Removing rows where `budget_adj` or `revenue_adj` equals 0</li>
    <li>Removing columns that are not useful for the analysis</li>
</ul>
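
The first step in the list, removing duplicates, could look roughly like the sketch below. The `toy` frame and its values are made up for illustration and only stand in for `movies`; the same two calls would apply to the real DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for `movies`; titles and numbers are made up.
toy = pd.DataFrame({
    "original_title": ["A", "B", "B", "C"],
    "budget_adj": [10.0, 20.0, 20.0, 0.0],
    "revenue_adj": [15.0, 5.0, 5.0, 0.0],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

Counting the duplicates before dropping them makes it easy to confirm that only the expected rows disappear.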



In [3]:
# displaying the first 3 rows of the dataset
movies.head(3)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0


In [None]:
movies.shape

Some columns are not needed for this analysis, so the next step is to remove them.

In [None]:
# Dropping columns that are not needed for this analysis
movies.drop(['id', 'imdb_id', 'overview', 'tagline', 'keywords', 'homepage', 'production_companies'], axis=1, inplace=True)

In [None]:
#checking for null values using the .info() function
movies.info()

Some columns still contain null values, so we drop the rows where `genres`, `director`, or `cast` is missing.

In [None]:
movies.dropna(subset = ['genres', 'director','cast'], inplace = True)
movies.info()
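
As a complement to `.info()`, a per-column count of missing values makes it explicit which columns still contain nulls. This is a minimal sketch on a made-up `toy` frame (the values are hypothetical), not the output for the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for `movies`.
toy = pd.DataFrame({
    "genres": ["Action", None, "Drama"],
    "director": ["X", "Y", None],
    "cast": ["a|b", "c", "d"],
    "runtime": [100, 90, 80],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only rows where all three text columns are present.
cleaned = toy.dropna(subset=["genres", "director", "cast"])

print(missing_per_column)
print(cleaned.shape)
```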

In [None]:
# changing release_date from object (string) to datetime
movies['release_date'] = pd.to_datetime(movies['release_date'])
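
One caveat with two-digit years such as `6/9/15`: under the `%y` directive, a year like `66` is typically read as 2066 rather than 1966, so older films can land in the wrong century. A minimal sketch of one possible fix, assuming the `release_year` column is reliable, is to keep the parsed month and day and take the year from `release_year`. The `toy` frame is made up and stands in for `movies`.

```python
import pandas as pd

# Hypothetical toy data standing in for `movies`.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "1/1/66"],
    "release_year": [2015, 1966],
})

# Parse month/day/two-digit-year strings; the century may be wrong for old films.
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Rebuild each date from the trusted year plus the parsed month and day.
fixed = pd.to_datetime(pd.DataFrame({
    "year": toy["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))

print(fixed)
```

Whether this correction is needed depends on how far back the release dates go; comparing `parsed.dt.year` with `release_year` is a quick way to check.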

Removing the rows with 0 values in `budget_adj` and `revenue_adj` will drastically reduce the dataset, so I perform this operation on a copy of `movies` and leave the original intact.

In [None]:
# previewing the rows with a non-zero revenue_adj
movies[movies['revenue_adj'] != 0]

In [None]:
mv_copy = movies.copy()
#removing the rows with 0 values 
mv_copy = mv_copy[(mv_copy['revenue_adj'] > 0) & (mv_copy['budget_adj'] > 0)]
mv_copy.head()

In [None]:
# counting the remaining rows with a 0 revenue_adj (should be 0 after the filter)
(mv_copy['revenue_adj'] == 0).sum()

In [None]:
# checking for remaining 0 values in budget_adj (should return no rows)
mv_copy[mv_copy['budget_adj'] == 0]

In [None]:
# safety net: the > 0 filter already excludes NaN, because NaN > 0 evaluates to False
mv_copy.dropna(subset = ['revenue_adj', 'budget_adj'], inplace = True)

In [None]:
#checking the size of mv_copy dataset
mv_copy.shape
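
To put a number on how much the zero filter shrinks the data, the share of rows that survive can be computed from the boolean mask. This is a sketch on a made-up `toy` frame; the figures are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for `movies`.
toy = pd.DataFrame({
    "budget_adj": [10.0, 0.0, 30.0, 0.0],
    "revenue_adj": [15.0, 5.0, 0.0, 0.0],
})

# True only where both adjusted values are positive.
keep = (toy["revenue_adj"] > 0) & (toy["budget_adj"] > 0)

# Filter on a copy so the original frame is left untouched.
filtered = toy[keep].copy()

# The mean of a boolean mask is the fraction of rows kept.
share_kept = keep.mean()

print(len(toy), len(filtered), share_kept)
```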