# Project: An investigation into a TMDB dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

For this project I have chosen to look at the TMDB dataset. This data was provided as part of the Udacity Data Analysis Nanodegree and originated from [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata). It was originally sourced from IMDB but was replaced after a takedown request.

_Questions_:

1. Which genres are most popular from year to year?
1. What kinds of properties are associated with movies that have high revenues?

In [186]:
import pandas as pd
import numpy as np
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
# Stop floats from displaying as scientific notation
pd.options.display.float_format = '{:20,.2f}'.format

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [187]:
# Load the TMDB dataset; the cells below inspect data types and look for
# missing or possibly errant values.
df = pd.read_csv('tmdb-movies.csv')

#### High level view of the dataset
----
In this section I'll take an initial look at the data: the first few rows, the number of rows, the datatypes and the number of non-null values in each column. I'll also look at the number of unique values in each column.

In [188]:
df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.99,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999939.28,1392445892.52
1,76341,tt1392190,28.42,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999939.28,348161292.49
2,262500,tt2908446,13.11,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101199955.47,271619025.41
3,140607,tt2488496,11.17,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999919.04,1902723129.8
4,168259,tt2820852,9.34,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799923.09,1385748801.47


In [189]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

The next cell was used to sort by different column names (just replace the column name to look at the NaN data or to see whether there are any 0s).

In [190]:
df.sort_values(by=['vote_count'], ascending=False)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
1919,27205,tt1375666,9.36,160000000,825500000,Inception,Leonardo DiCaprio|Joseph Gordon-Levitt|Ellen P...,http://inceptionmovie.warnerbros.com/,Christopher Nolan,Your mind is the scene of the crime.,...,"Cobb, a skilled thief who commits corporate es...",148,Action|Thriller|Science Fiction|Mystery|Adventure,Legendary Pictures|Warner Bros.|Syncopy,7/14/10,9767,7.90,2010,160000000.00,825500000.00
4361,24428,tt0848228,7.64,220000000,1519557910,The Avengers,Robert Downey Jr.|Chris Evans|Mark Ruffalo|Chr...,http://marvel.com/avengers_movie/,Joss Whedon,Some assembly required.,...,When an unexpected enemy emerges and threatens...,143,Science Fiction|Action|Adventure,Marvel Studios,4/25/12,8903,7.30,2012,208943741.90,1443191435.21
1386,19995,tt0499549,9.43,237000000,2781505847,Avatar,Sam Worthington|Zoe Saldana|Sigourney Weaver|S...,http://www.avatarmovie.com/,James Cameron,Enter the World of Pandora.,...,"In the 22nd century, a paraplegic Marine is di...",162,Action|Adventure|Fantasy|Science Fiction,Ingenious Film Partners|Twentieth Century Fox ...,12/10/09,8458,7.10,2009,240886902.89,2827123750.41
2875,155,tt0468569,8.47,185000000,1001921825,The Dark Knight,Christian Bale|Michael Caine|Heath Ledger|Aaro...,http://thedarkknight.warnerbros.com/dvdsite/,Christopher Nolan,Why So Serious?,...,Batman raises the stakes in his war on crime. ...,152,Drama|Action|Crime|Thriller,DC Comics|Legendary Pictures|Warner Bros.|Syncopy,7/16/08,8432,8.10,2008,187365527.25,1014733032.48
4364,68718,tt1853728,5.94,100000000,425368238,Django Unchained,Jamie Foxx|Christoph Waltz|Leonardo DiCaprio|K...,http://unchainedmovie.com/,Quentin Tarantino,"Life, liberty and the pursuit of vengeance.",...,"With the help of a German bounty hunter, a fre...",165,Drama|Western,Columbia Pictures|The Weinstein Company,12/25/12,7375,7.70,2012,94974428.14,403991051.51
4382,70160,tt1392170,2.57,75000000,691210692,The Hunger Games,Jennifer Lawrence|Josh Hutcherson|Liam Hemswor...,http://www.thehungergames.movie/,Gary Ross,May The Odds Be Ever In Your Favor.,...,Every year in the ruins of what was once North...,142,Science Fiction|Adventure|Fantasy,Lionsgate|Color Force,3/12/12,7080,6.70,2012,71230821.10,656473401.94
5425,68721,tt1300854,4.95,200000000,1215439994,Iron Man 3,Robert Downey Jr.|Gwyneth Paltrow|Guy Pearce|D...,http://marvel.com/ironman3,Shane Black,Unleash the power behind the armor.,...,When Tony Stark's world is torn apart by a for...,130,Action|Adventure|Science Fiction,Marvel Studios,4/18/13,6882,6.90,2013,187206670.55,1137692372.64
4363,49026,tt1345836,6.59,250000000,1081041287,The Dark Knight Rises,Christian Bale|Michael Caine|Gary Oldman|Anne ...,http://www.thedarkknightrises.com/,Christopher Nolan,The Legend Ends,...,Following the death of District Attorney Harve...,165,Action|Crime|Drama|Thriller,Legendary Pictures|Warner Bros.|DC Entertainme...,7/16/12,6723,7.50,2012,237436070.34,1026712780.23
629,157336,tt0816692,24.95,165000000,621752480,Interstellar,Matthew McConaughey|Jessica Chastain|Anne Hath...,http://www.interstellarmovie.net/,Christopher Nolan,Mankind was born on Earth. It was never meant ...,...,Interstellar chronicles the adventures of a gr...,169,Adventure|Drama|Science Fiction,Paramount Pictures|Legendary Pictures|Warner B...,11/5/14,6498,8.00,2014,151980023.38,572690645.12
4367,49051,tt0903624,4.22,250000000,1017003568,The Hobbit: An Unexpected Journey,Ian McKellen|Martin Freeman|Richard Armitage|A...,http://www.thehobbit.com/,Peter Jackson,From the smallest beginnings come the greatest...,...,"Bilbo Baggins, a hobbit enjoying his quiet lif...",169,Adventure|Fantasy|Action,WingNut Films|New Line Cinema|Metro-Goldwyn-Ma...,11/26/12,6417,6.90,2012,237436070.34,965893322.82


A view of the number of unique values in the columns:

In [191]:
df.nunique()

id                      10865
imdb_id                 10855
popularity              10814
budget                    557
revenue                  4702
original_title          10571
cast                    10719
homepage                 2896
director                 5067
tagline                  7997
keywords                 8804
overview                10847
runtime                   247
genres                   2039
production_companies     7445
release_date             5909
vote_count               1289
vote_average               72
release_year               56
budget_adj               2614
revenue_adj              4840
dtype: int64

#### Confirm ambiguous datatypes
----
A confirmation of the datatypes for all columns described above as 'object'.

In [192]:
print("I am imdb_id: ", type(df['imdb_id'][0]))
print("I am original_title: ", type(df['original_title'][0]))
print("I am cast: ", type(df['cast'][0]))
print("I am homepage: ", type(df['homepage'][0]))
print("I am director: ", type(df['director'][0]))
print("I am tagline: ", type(df['tagline'][0]))
print("I am keywords: ", type(df['keywords'][0]))
print("I am overview: ", type(df['overview'][0]))
print("I am genres: ", type(df['genres'][0]))
print("I am production_companies: ", type(df['production_companies'][0]))
print("I am release_date: ", type(df['release_date'][0]))

I am imdb_id:  <class 'str'>
I am original_title:  <class 'str'>
I am cast:  <class 'str'>
I am homepage:  <class 'str'>
I am director:  <class 'str'>
I am tagline:  <class 'str'>
I am keywords:  <class 'str'>
I am overview:  <class 'str'>
I am genres:  <class 'str'>
I am production_companies:  <class 'str'>
I am release_date:  <class 'str'>
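The repeated `print` calls above could also be written as a loop. The following is only a sketch on a made-up toy frame (`toy` and its values are illustrative, not taken from the dataset). It looks at the first non-null value of each non-numeric column, since a column such as homepage may start with a missing value.

```python
import pandas as pd

# Toy frame standing in for the TMDB data (illustrative values only)
toy = pd.DataFrame({
    'id': [1, 2],
    'original_title': ['Film A', 'Film B'],
    'release_date': ['6/9/15', '5/13/15'],
    'vote_average': [6.5, 7.1],
})

# Pick out the columns that are not numeric
text_cols = [col for col in toy.columns
             if not pd.api.types.is_numeric_dtype(toy[col])]

# Record the Python type of the first non-null value in each of those columns
object_types = {col: type(toy[col].dropna().iloc[0]).__name__ for col in text_cols}

for col, type_name in object_types.items():
    print("I am {}: {}".format(col, type_name))
```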


#### Initial thoughts on the data so far
---
_Null Values_

While looking for null rows, I noticed that several columns contain 0 values:
* popularity
* budget
* revenue
* runtime
* budget_adj
* revenue_adj
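
Rather than sorting column by column, the zeros and nulls could be counted directly. This is a rough sketch on a made-up toy frame; `toy` and `numeric_cols` are names introduced here for illustration only.

```python
import pandas as pd

# Toy frame standing in for the TMDB data (illustrative values only)
toy = pd.DataFrame({
    'runtime': [124, 0, 93],
    'budget_adj': [137999939.28, 0.0, 0.0],
    'revenue_adj': [1392445892.52, 0.0, 0.0],
    'genres': ['Action', None, 'Drama'],
})

numeric_cols = ['runtime', 'budget_adj', 'revenue_adj']

# Number of 0 values per numeric column
zero_counts = (toy[numeric_cols] == 0).sum()

# Number of null values per column
null_counts = toy.isnull().sum()

print(zero_counts)
print(null_counts)
```

On the real dataframe the same two expressions would be applied to `df` instead of `toy`.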

_Uniques_

I would expect columns like 'id' and 'original_title' to be unique for every row, but the counts above show 10865 unique ids and 10571 unique titles across 10866 rows. A closer inspection of these rows is needed to decide on the action.

_Datatypes_

On the face of it, the only real issue is with release_date. For proper interrogation, this will need converting from a string to a datetime object.

#### Closer look at the duplications
---

This section looks at the several hundred duplicated film titles and the single duplicated ID.

In [193]:
# create an extra column and mark a row as True where a duplicate title is found
df['is_duplicate_title'] = df.duplicated(['original_title'])

In [194]:
# filter anything that is True
df_dupe_title_filter = df[df['is_duplicate_title']]

In [195]:
df_dupe_title_filter

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,is_duplicate_title
1133,281778,tt3297792,0.19,0,0,Survivor,Danielle Chuchran|Kevin Sorbo|Rocky Myers|Ruby...,http://www.arrowstormentertainment.com/#!survi...,John Lyde,Alone. Stranded. Deadly,...,93,Science Fiction|Action|Fantasy,Arrowstorm Entertainment,7/22/14,23,4.90,2014,0.00,0.00,True
1194,296626,tt3534842,0.13,0,0,Finders Keepers,Jaime Pressly|Kylie Rogers|Tobin Bell|Patrick ...,,Alexander Yellen,,...,85,Mystery|Thriller|Horror,HFD Productions|Hybrid LLC,10/18/14,33,4.80,2014,0.00,0.00,True
1349,42222,tt0076245,0.40,0,0,Julia,Jane Fonda|Vanessa Redgrave|Jason Robards|Maxi...,,Fred Zinnemann,"Through It All, Friendship Prevailed.",...,117,Drama,Twentieth Century Fox Film Corporation,10/2/77,10,5.00,1977,0.00,0.00,True
1440,7445,tt0765010,1.22,26000000,43318349,Brothers,Tobey Maguire|Jake Gyllenhaal|Natalie Portman|...,,Jim Sheridan,There are two sides to every family.,...,104,Drama|Thriller|War,Lionsgate|Relativity Media|Sighvatsson Films|M...,1/27/09,381,6.70,2009,26426411.29,44028788.73,True
1513,62320,tt1014762,0.69,0,0,Home,Glenn Close|Yann Arthus-Bertrand|Jacques Gambl...,http://www.homethemovie.org/,Yann Arthus-Bertrand,A Stunning Visual Portrayal of Earth,...,95,Documentary,Europa Corp.|ElzÃ©vir Films|France 2 (FR2),6/3/09,109,7.80,2009,0.00,0.00,True
1707,79896,tt1336006,0.31,0,0,The Revenant,Chris Wylde|David Anders|Louise Griffiths|Jacy...,http://www.therevenantmovie.com/,D. Kerry Prior,What could be worse than having your best frie...,...,110,Comedy|Horror,Putrefactory Limited|Wanko Toys,8/16/09,30,5.50,2009,0.00,0.00,True
1753,36465,tt0992993,0.25,0,0,Into the Storm,Brendan Gleeson|Iain Glen|James D'Arcy|Janet M...,http://www.hbo.com/movies/into-the-storm/index...,Thaddeus O'Sullivan,,...,100,Drama|History|Foreign,,5/31/09,13,5.80,2009,0.00,0.00,True
1757,21398,tt1220213,0.32,5000000,0,Grace,Jordan Ladd|Samantha Ferris|Gabrielle Rose|Ste...,,Paul Solet,Love. Undying.,...,94,Horror|Thriller,ArieScope Pictures|Dark Eye Entertainment|Leom...,8/14/09,21,4.90,2009,5082002.17,0.00,True
1865,220903,tt1533395,0.10,0,0,Life,David Attenborough|Oprah Winfrey,http://www.bbc.co.uk/programmes/b00lbpcy,Martha Holmes|Simon Blakeney|Stephen Lyle,From the Makers of Planet Earth,...,500,Documentary,British Broadcasting Corporation (BBC),12/14/09,24,7.00,2009,0.00,0.00,True
2036,41505,tt1179069,0.79,22000000,851517,Shelter,Julianne Moore|Jonathan Rhys Meyers|Jeffrey De...,http://www.shelter-movie.jp/index.html,BjÃ¶rn Stein|MÃ¥ns MÃ¥rlind,Evil will rise.,...,112,Horror|Mystery|Thriller,NALA Films|IM Global|Maraci/Edelstein Films|Sh...,3/27/10,112,5.50,2010,22000000.00,851517.00,True


In [196]:
# use this cell to spot check titles for differences
df_2 = df[df['original_title'] == 'Robin Hood']
df_2.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,is_duplicate_title
1938,20662,tt0955308,2.12,200000000,310669540,Robin Hood,Russell Crowe|Cate Blanchett|Mark Strong|Oscar...,http://www.robinhoodthemovie.com/,Ridley Scott,"Rise and rise again, until lambs become lions.",...,140,Action,Imagine Entertainment|Universal Pictures|Scott...,5/12/10,844,6.1,2010,200000000.0,310669540.0,False
10593,11886,tt0070608,2.27,15000000,32056467,Robin Hood,Brian Bedford|Phil Harris|Peter Ustinov|Pat Bu...,,Wolfgang Reitherman,Meet Robin Hood and his MERRY MENagerie!,...,83,Animation|Family,Walt Disney Productions,11/8/73,641,6.9,1973,73667393.68,157434424.97,True


On inspection, the duplicated film titles appear to be different films that share a name (for example, the 1973 and 2010 versions of Robin Hood).

The same process is applied for the duplicate ID. Since there is only one duplicate ID, there is no need to create a csv.

In [197]:
df['is_duplicate_id'] = df.duplicated(['id'])

In [198]:
df_dupe_id_filter = df[df['is_duplicate_id']]

In [199]:
df_dupe_id_filter.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,is_duplicate_title,is_duplicate_id
2090,42194,tt0411951,0.6,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0,True,True


With the duplicated ID identified (and given that the original_title duplicates appear to be different films with the same title), a final sense check confirms that the entire row is a duplicate.

In [200]:
df_3 = df[df['id'] == 42194]
df_3.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,is_duplicate_title,is_duplicate_id
2089,42194,tt0411951,0.6,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0,False,False
2090,42194,tt0411951,0.6,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0,True,True


Drop the duplicated ID row from the main dataframe:

In [201]:
df.drop_duplicates(subset=['id'],inplace=True)
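
The manual sense check above could also be done programmatically. The sketch below uses a made-up toy frame (only the ids, titles and years are borrowed from the outputs above). It separates rows that are duplicated in full from rows that merely share a title.

```python
import pandas as pd

# Toy frame: one fully duplicated row and one pair of same-named films
toy = pd.DataFrame({
    'id': [42194, 42194, 20662, 11886],
    'original_title': ['TEKKEN', 'TEKKEN', 'Robin Hood', 'Robin Hood'],
    'release_year': [2010, 2010, 2010, 1973],
})

# keep=False flags every member of a duplicate group, not just the later copies
is_full_dupe = toy.duplicated(keep=False)
is_title_dupe = toy.duplicated(['original_title'], keep=False)

full_dupes = toy[is_full_dupe]
title_only_dupes = toy[is_title_dupe & ~is_full_dupe]

# Dropping on all columns removes only rows that match in every column
deduped = toy.drop_duplicates()
```

With no `subset`, `drop_duplicates` only removes rows that match in every column, so it cannot accidentally drop two different films that happen to share an id or a title.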

### Data Cleaning

#### Drop columns:
In this section, I've decided to drop columns that are extraneous to the analysis: 
* **imdb_id**: this appears to relate to the previous IMDB data. My assumption is that Kaggle left it in to map the IMDB and TMDB ids together
* **popularity**: I was initially unclear what this measured. After some research on the [TMDB website](https://developers.themoviedb.org/3/getting-started/popularity) it seems that this value is derived from a number of factors and is used to assist TMDB's internal search UX, e.g. boosting results
* **budget** and **revenue**: since budget_adj and revenue_adj have already been normalised to 2010 levels for more direct comparison, these two columns are no longer required
* **homepage**, **tagline**, **overview** and **keywords**: these seem unnecessary for the type of analysis intended
* **is_duplicate_title**: is no longer necessary
* **is_duplicate_id**: is no longer necessary

In [203]:
df.drop(['imdb_id', 'popularity', 'budget', 'revenue', 'homepage', 'tagline', 'overview', 'keywords', 'is_duplicate_title', 'is_duplicate_id'], axis=1, inplace=True)

In [204]:
df.head(1)

Unnamed: 0,id,original_title,cast,director,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,Colin Trevorrow,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999939.28,1392445892.52


#### Update the datatypes
---

This section converts release_date from a string to a datetime object.

In [205]:
df['release_date'] = pd.to_datetime(df['release_date'])
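
One thing worth checking after this conversion: the dates carry two-digit years, and the usual convention maps the values 00 to 68 to 2000 to 2068. Since the data goes back to 1960, films from 1960 to 1968 may end up dated in the 2060s. A possible workaround, sketched here on made-up rows, rebuilds each date from its month/day part and the release_year column (this assumes release_year is correct).

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'release_date': ['6/9/15', '10/2/77', '8/6/60'],
    'release_year': [2015, 1977, 1960],
})

# Naive parse: the two-digit year has to be expanded by convention
naive = pd.to_datetime(toy['release_date'], format='%m/%d/%y')
mismatch = naive.dt.year != toy['release_year']

# Rebuild the date string with the four-digit year taken from release_year
month_day = toy['release_date'].str.rsplit('/', n=1).str[0]
full_dates = month_day + '/' + toy['release_year'].astype(str)
fixed = pd.to_datetime(full_dates, format='%m/%d/%Y')

print(pd.DataFrame({'naive': naive, 'fixed': fixed, 'mismatch': mismatch}))
```

Comparing `release_date.dt.year` with `release_year` on the real dataframe would show whether any rows are affected.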

In [206]:
# check it's worked
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 13 columns):
id                      10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
director                10821 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null datetime64[ns]
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(4), object(5)
memory usage: 1.2+ MB


#### Deal with null values
---

This section will look to drop rows with null values. For **runtime**, **budget_adj** and **revenue_adj**, 0 values are also considered null.

For the first, genre-based question in the introduction, I will create a dataframe that cleans out null genres and drops extraneous columns (including budget_adj and revenue_adj, so that their 0 values do not cost any rows). This should leave the largest number of intact rows suitable for that exploration.

For the question that looks at the types of properties associated with high revenue movies, I will want to keep some of the dropped columns for that analysis. I will therefore use a separate dataframe.
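
Treating 0 as null for the second dataframe might look something like the sketch below. The toy frame and the name `zero_as_null` are made up for illustration. Zeros in the listed columns are masked to NaN so that a single `dropna` removes them.

```python
import pandas as pd

# Toy frame standing in for the TMDB data (illustrative values only)
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C'],
    'runtime': [124, 0, 93],
    'budget_adj': [137999939.28, 5082002.17, 0.0],
    'revenue_adj': [1392445892.52, 0.0, 0.0],
})

zero_as_null = ['runtime', 'budget_adj', 'revenue_adj']

cleaned = toy.copy()
# mask() replaces values with NaN wherever the condition is True
cleaned[zero_as_null] = cleaned[zero_as_null].mask(cleaned[zero_as_null] == 0)
# Drop any row that is missing one of those values
cleaned = cleaned.dropna(subset=zero_as_null)
```

Working on a copy leaves the original frame untouched, which matters when two analyses need different cleaning rules.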

In [207]:
df_genres = df.copy()
df_genres.drop(['cast', 'director', 'runtime', 'production_companies', 'budget_adj', 'revenue_adj'], axis=1, inplace=True)
df_genres.head(1)

Unnamed: 0,id,original_title,genres,release_date,vote_count,vote_average,release_year
0,135397,Jurassic World,Action|Adventure|Science Fiction|Thriller,2015-06-09,5562,6.5,2015


In [208]:
# drop NaN values - only targets genres at this stage
df_genres.dropna(axis=0, how='any', inplace=True)

Next, I want to:
* find out the individual genres that are in the genres column; and
* split these genres out into separate columns for analysis of the first question
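
As an alternative sketch of the two steps listed above, pandas' `Series.str.get_dummies` can split a delimited string column into one indicator column per distinct value in a single call. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame with pipe-delimited genres (illustrative values only)
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C'],
    'genres': ['Action|Adventure', 'Drama', 'Action|Drama|Thriller'],
})

# One column per distinct genre; cast the 0/1 indicators to booleans
genre_flags = toy['genres'].str.get_dummies(sep='|').astype(bool)

# The column names double as the list of individual genres
genres_found = sorted(genre_flags.columns)

# Attach the flags to the original rows
toy_flags = toy.join(genre_flags)
```

Because the split is on the delimiter, a genre only matches as a whole name, and there is no need to type out the genre list by hand.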

In [209]:
df_genres_split = df_genres['genres'].str.split('|', expand=True)

In [210]:
df_genres_split.head(10)

Unnamed: 0,0,1,2,3,4
0,Action,Adventure,Science Fiction,Thriller,
1,Action,Adventure,Science Fiction,Thriller,
2,Adventure,Science Fiction,Thriller,,
3,Action,Adventure,Science Fiction,Fantasy,
4,Action,Crime,Thriller,,
5,Western,Drama,Adventure,Thriller,
6,Science Fiction,Action,Thriller,Adventure,
7,Drama,Adventure,Science Fiction,,
8,Family,Animation,Adventure,Comedy,
9,Comedy,Animation,Family,,


In [211]:
# group by the first-listed genre; columns 1-4 count films that also list a 2nd to 5th genre
df_genres_split.groupby(0).count()

Unnamed: 0_level_0,1,2,3,4
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Action,1527,1225,624,193
Adventure,564,486,266,109
Animation,371,233,92,22
Comedy,1607,705,154,35
Crime,377,259,89,9
Documentary,120,17,2,1
Drama,1741,866,278,45
Family,128,69,23,5
Fantasy,261,217,111,44
Foreign,8,7,1,1


In [212]:
genres_list = ['Action',
               'Adventure',
               'Animation',
               'Comedy', 
               'Crime', 
               'Documentary', 
               'Drama', 
               'Family', 
               'Fantasy', 
               'Foreign', 
               'History', 
               'Horror', 
               'Music', 
               'Mystery', 
               'Romance', 
               'Science Fiction', 
               'TV Movie', 
               'Thriller', 
               'War', 
               'Western']

# add one boolean column per genre: True where the genre name appears in the film's genres string
for genre in genres_list:
    df_genres[genre] = df_genres['genres'].str.contains(genre, regex=False)

In [213]:
df_genres.head(1)

Unnamed: 0,id,original_title,genres,release_date,vote_count,vote_average,release_year,Action,Adventure,Animation,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,135397,Jurassic World,Action|Adventure|Science Fiction|Thriller,2015-06-09,5562,6.5,2015,True,True,False,...,False,False,False,False,False,True,False,True,False,False


In [214]:
df_genres.drop(['id', 'original_title', 'genres', 'release_date'], axis=1, inplace=True)

In [215]:
df_genres.head(1)

Unnamed: 0,vote_count,vote_average,release_year,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,5562,6.5,2015,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False


In [216]:
# df_genres_test = df_genres[df_genres.release_year != 2015]
# use this later for dropping specific 0 values from budget_adj

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: Which genres are most popular from year to year?

This section addresses the question: "Which genres are the most popular from year to year?". To approach this, we need to set some definitions.

**Most popular**: within our cleaned dataset, we have the columns 'vote_count' and 'vote_average'. These are the number of votes a film received and the average score out of 10 across everyone who voted for that film. As a baseline for popularity, a film must have more votes than the average vote_count AND a vote_average above the 3rd quartile.

In [217]:
df_genres.describe()

Unnamed: 0,vote_count,vote_average,release_year
count,10842.0,10842.0,10842.0
mean,217.82,5.97,2001.31
std,576.18,0.93,12.81
min,10.0,1.5,1960.0
25%,17.0,5.4,1995.0
50%,38.0,6.0,2006.0
75%,146.0,6.6,2011.0
max,9767.0,9.2,2015.0


As we can see in the cell above, the vote_count has an average of **217.82**, while vote_average has a 3rd quartile of **6.60**. We will use these as our popularity baseline before looking at what are the _most_ popular genres from year to year.
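
The two baseline values could also be computed rather than typed in, so the filter stays correct if the cleaning steps change. This is a sketch on a made-up toy frame; `min_votes` and `min_score` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'vote_count': [10, 20, 30, 40, 400],
    'vote_average': [5.0, 6.0, 6.5, 7.0, 8.0],
})

# Baselines: mean number of votes and 3rd quartile of the average score
min_votes = toy['vote_count'].mean()
min_score = toy['vote_average'].quantile(0.75)

# Keep films strictly above both baselines
popular = toy[(toy['vote_count'] > min_votes) & (toy['vote_average'] > min_score)]
```

Both conditions are combined in one boolean mask. The parentheses around each comparison are needed because `&` binds more tightly than `>`.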

In [218]:
df_genres = df_genres[df_genres['vote_count'] > 217.82]
df_genres = df_genres[df_genres['vote_average'] > 6.60]
df_genres

Unnamed: 0,vote_count,vote_average,release_year,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1,6185,7.10,2015,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
3,5292,7.50,2015,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,2947,7.30,2015,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
5,3929,7.20,2015,False,True,False,False,False,False,True,...,False,False,False,False,False,False,False,True,False,True
7,4572,7.60,2015,False,True,False,False,False,False,True,...,False,False,False,False,False,True,False,False,False,False
9,3935,8.00,2015,False,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12,2854,7.60,2015,False,False,False,False,False,False,True,...,False,False,False,False,False,True,False,False,False,False
14,4304,7.40,2015,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
15,2389,7.40,2015,False,False,False,False,True,False,True,...,False,False,False,True,False,False,False,False,False,True
17,3779,7.00,2015,True,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


In [230]:
# create action dataframe: popular Action films, keeping the shared columns and the genre flag
df_action = df_genres.loc[df_genres['Action'],
                          ['vote_count', 'vote_average', 'release_year', 'Action']].copy()
df_action.head(1)

Unnamed: 0,vote_count,vote_average,release_year,Action
1,6185,7.1,2015,True


In [231]:
# create adventure dataframe: popular Adventure films, keeping the shared columns and the genre flag
df_adventure = df_genres.loc[df_genres['Adventure'],
                             ['vote_count', 'vote_average', 'release_year', 'Adventure']].copy()
df_adventure.head(1)

Unnamed: 0,vote_count,vote_average,release_year,Adventure
1,6185,7.1,2015,True


In [232]:
# create animation dataframe: popular Animation films, keeping the shared columns and the genre flag
df_animation = df_genres.loc[df_genres['Animation'],
                             ['vote_count', 'vote_average', 'release_year', 'Animation']].copy()
df_animation.head(1)

Unnamed: 0,vote_count,vote_average,release_year,Animation
9,3935,8.0,2015,True


In [111]:
# number of popular films (after the baseline filter) in the 'Crime' genre
df_crime = df_genres[df_genres['Crime'] == True]
crime_counts = df_crime['Crime'].count()
crime_counts

145
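
One possible way to get from per-genre counts like the one above to a year-by-year answer is to sum the boolean genre columns within each release year and take the column with the largest count. This is only a sketch on made-up data, not a result from the dataset. Note that ties resolve to whichever genre column comes first.

```python
import pandas as pd

# Toy frame of already-filtered "popular" films (illustrative values only)
toy = pd.DataFrame({
    'release_year': [2014, 2014, 2014, 2015, 2015, 2015],
    'Action': [True, True, False, False, True, False],
    'Drama':  [False, True, False, True, True, True],
    'Crime':  [False, False, True, True, False, False],
})

genre_cols = ['Action', 'Drama', 'Crime']

# Summing booleans counts the True values, i.e. popular films per genre per year
counts_by_year = toy.groupby('release_year')[genre_cols].sum()

# For each year, the genre with the highest count
top_genre = counts_by_year.idxmax(axis=1)

print(counts_by_year)
print(top_genre)
```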

### Research Question 2: What kinds of properties are associated with movies that have high revenues?

<a id='conclusions'></a>
## Conclusions
