# Project: TMDB 5000 Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This dataset is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

SOURCE: https://www.kaggle.com/tmdb/tmdb-movie-metadata

First, we import the libraries to be used.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [2]:
#Imports dataset to analyse

df = pd.read_csv("tmdb-movies.csv")

In [3]:
#prints a few lines from the top

df.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [4]:
#prints a few lines from the end

df.tail()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
10861,21,tt0060371,0.080598,0,0,The Endless Summer,Michael Hynson|Robert August|Lord 'Tally Ho' B...,,Bruce Brown,,...,"The Endless Summer, by Bruce Brown, is one of ...",95,Documentary,Bruce Brown Films,6/15/66,11,7.4,1966,0.0,0.0
10862,20379,tt0060472,0.065543,0,0,Grand Prix,James Garner|Eva Marie Saint|Yves Montand|Tosh...,,John Frankenheimer,Cinerama sweeps YOU into a drama of speed and ...,...,Grand Prix driver Pete Aron is fired by his te...,176,Action|Adventure|Drama,Cherokee Productions|Joel Productions|Douglas ...,12/21/66,20,5.7,1966,0.0,0.0
10863,39768,tt0060161,0.065141,0,0,Beregis Avtomobilya,Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...,,Eldar Ryazanov,,...,An insurance agent who moonlights as a carthie...,94,Mystery|Comedy,Mosfilm,1/1/66,11,6.5,1966,0.0,0.0
10864,21449,tt0061177,0.064317,0,0,"What's Up, Tiger Lily?",Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...,,Woody Allen,WOODY ALLEN STRIKES BACK!,...,"In comic Woody Allen's film debut, he took the...",80,Action|Comedy,Benedict Pictures Corp.,11/2/66,22,5.4,1966,0.0,0.0
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren|Tom Neyman|John Reynolds|Dian...,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,127642.279154,0.0


A few columns hold multiple entries separated by a `|`. These are `cast`, `genres` and `production_companies`.

In [5]:
df.cast

0        Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
1        Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...
2        Shailene Woodley|Theo James|Kate Winslet|Ansel...
3        Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...
4        Vin Diesel|Paul Walker|Jason Statham|Michelle ...
                               ...                        
10861    Michael Hynson|Robert August|Lord 'Tally Ho' B...
10862    James Garner|Eva Marie Saint|Yves Montand|Tosh...
10863    Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z...
10864    Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh...
10865    Harold P. Warren|Tom Neyman|John Reynolds|Dian...
Name: cast, Length: 10866, dtype: object
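Rather than spotting pipe-delimited columns by eye, they can be flagged programmatically. The following is a minimal sketch on a toy frame; the titles and names in `sample` are illustrative placeholders, not rows from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
sample = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "cast": ["Actor 1|Actor 2", "Actor 3"],
    "genres": ["Action|Comedy", "Drama"],
    "runtime": [100, 95],
})

# Cast every column to text and flag those where any value contains a literal '|'
pipe_cols = [
    col for col in sample.columns
    if sample[col].astype(str).str.contains("|", regex=False).any()
]
print(pipe_cols)
```

Passing `regex=False` matters here, because `|` is the alternation operator in a regular expression and would otherwise match every string.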

All column names seem descriptive enough and follow the `snake_case` format.

In [6]:
df.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

Missing values (`NaN`) are found in `imdb_id`, `cast`, `homepage`, `director`, `tagline`, `keywords`, `overview`, `genres` and `production_companies`.

In [7]:
df.isna().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

There is just one duplicate present in the entire dataset

In [8]:
df.duplicated().sum()

1

It corresponds to `index` 2090, with `original_title` TEKKEN.

In [9]:
df[df.duplicated()]

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
2090,42194,tt0411951,0.59643,30000000,967000,TEKKEN,Jon Foo|Kelly Overton|Cary-Hiroyuki Tagawa|Ian...,,Dwight H. Little,Survival is no game,...,"In the year of 2039, after World Wars destroy ...",92,Crime|Drama|Action|Thriller|Science Fiction,Namco|Light Song Films,3/20/10,110,5.0,2010,30000000.0,967000.0
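By default `duplicated()` flags only the later copies of a repeated row, which is why a single row is shown above. A small sketch on made-up data shows how `keep=False` displays every copy, and how `drop_duplicates()` removes the extras without referring to a specific index label.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["Film A", "Film B", "Film B", "Film C"],
})

# keep=False marks every member of a duplicated group, not just the later copies
both_copies = toy[toy.duplicated(keep=False)]
print(both_copies)

# drop_duplicates keeps the first occurrence and needs no hard-coded index label
deduped = toy.drop_duplicates()
print(len(deduped))
```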


### Data Cleaning

In [10]:
#Creates an independent copy of the dataset to be cleaned
#(plain assignment would only create a second name for the same DataFrame)
dff = df.copy()

As there is just one duplicated row, we can drop it immediately.

In [11]:
dff = dff.drop(2090)

We will start by separating columns `cast`, `genres` and `production_companies`

In a wide-table format this could be done by using `split`:

```python
cast = dff.cast.str.split('|', expand = True)
cast = cast.rename(columns={0: 'cast_1', 
                            1: 'cast_2', 
                            2: 'cast_3',
                            3: 'cast_4',
                            4: 'cast_5',
                           })
cast
```

To keep the data tidy, however, it will be stored in long format, with one row per cast member.

In [12]:
dff.cast = dff.cast.str.split('|')
dff = dff.explode('cast')
dff

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Bryce Dallas Howard,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Irrfan Khan,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Vincent D'Onofrio,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Nick Robinson,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Tom Neyman,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,John Reynolds,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Diane Mahree,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
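The split-and-explode step can be illustrated on a toy frame, where the effect on the index is easier to see. The titles and names below are placeholders.

```python
import pandas as pd

# Two films, the first with three pipe-separated cast members (illustrative values)
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "cast": ["Actor 1|Actor 2|Actor 3", "Actor 4"],
})

# Split each string into a list, then give every list element its own row
long_cast = (
    toy.assign(cast=toy["cast"].str.split("|"))
       .explode("cast")
)
print(long_cast)
```

Each exploded row keeps the index label of the film it came from, so the first film appears three times under label `0`. This is why the index is reset later in the notebook.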


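The same will be done for `genres`. The filtered view below comes from a cell run after the later cleaning steps (hence its higher number and the extra `index` column) and shows the resulting layout, with one row per genre: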

In [36]:
dff[dff['cast']=='Chris Pratt']

Unnamed: 0,index,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Action,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Adventure,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
2,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Science Fiction,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
3,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Thriller,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
6697,630,118340,tt2015381,14.311205,170000000,773312399,Guardians of the Galaxy,Chris Pratt,James Gunn,"Light years from Earth, 26 years after being a...",121,Action,7/30/14,5612,7.9,2014,156585500.0,712291100.0
6698,630,118340,tt2015381,14.311205,170000000,773312399,Guardians of the Galaxy,Chris Pratt,James Gunn,"Light years from Earth, 26 years after being a...",121,Science Fiction,7/30/14,5612,7.9,2014,156585500.0,712291100.0
6699,630,118340,tt2015381,14.311205,170000000,773312399,Guardians of the Galaxy,Chris Pratt,James Gunn,"Light years from Earth, 26 years after being a...",121,Adventure,7/30/14,5612,7.9,2014,156585500.0,712291100.0
15486,1454,10521,tt0901476,1.074072,30000000,114663461,Bride Wars,Chris Pratt,Gary Winick,Two best friends become rivals when their resp...,89,Comedy,1/9/09,501,5.8,2009,30492010.0,116544000.0
18824,1709,21862,tt1078885,0.30936,0,0,Deep in the Valley,Chris Pratt,Christian Forte,"Best friends, Carl and Lester, find themselves...",87,Comedy,8/28/09,13,4.2,2009,0.0,0.0
40042,3448,63492,tt0770703,1.120851,20000000,30426096,What's Your Number?,Chris Pratt,Mark Mylod,Ally Darling (Anna Faris) is realizing she's a...,106,Comedy,9/30/11,390,6.2,2011,19387960.0,29495000.0


In [13]:
dff.genres = dff.genres.str.split('|')
dff = dff.explode('genres')
dff

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Adventure,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Science Fiction,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Bryce Dallas Howard,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Tom Neyman,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,John Reynolds,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Diane Mahree,,Harold P. Warren,It's Shocking! It's Beyond Your Imagination!,...,A family gets lost on the road and stumbles up...,74,Horror,Norm-Iris,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00


`production_companies` is not exploded, as this column will be dropped.

Now that the columns of interest hold a single value per cell, we can drop the columns with many `NaN` values (only `homepage` is missing in a majority of rows), as they will not be used for the analysis. These are 

`homepage`, `tagline`, `keywords` and `production_companies`

In [14]:
dff = dff.drop(columns=['homepage', 'tagline', 'keywords', 'production_companies'])
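The share of missing values per column can guide this kind of decision. Below is a minimal sketch on a toy frame; the 0.5 threshold and the names `missing_share` and `mostly_missing` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (illustrative values)
toy = pd.DataFrame({
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
    "tagline": [np.nan, "A tagline", "Another", "Third"],
    "runtime": [90, 100, 110, 120],
})

# isna().mean() gives the fraction of missing values in each column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical threshold: columns where more than half of the values are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```

Applied to the counts shown earlier for the 10866 original rows, `homepage` is about 73% missing, while `tagline`, `keywords` and `production_companies` are roughly 26%, 14% and 9% missing.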

To avoid issues with the repeated index labels created by exploding, the index is reset and the old index is kept as a column.

In [15]:
dff = dff.reset_index()

In [16]:
dff

Unnamed: 0,index,id,imdb_id,popularity,budget,revenue,original_title,cast,director,overview,runtime,genres,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Action,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
1,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Adventure,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
2,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Science Fiction,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
3,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Thriller,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
4,0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Bryce Dallas Howard,Colin Trevorrow,Twenty-two years after the events of Jurassic ...,124,Action,6/9/15,5562,6.5,2015,1.379999e+08,1.392446e+09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131849,10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Harold P. Warren,Harold P. Warren,A family gets lost on the road and stumbles up...,74,Horror,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
131850,10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Tom Neyman,Harold P. Warren,A family gets lost on the road and stumbles up...,74,Horror,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
131851,10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,John Reynolds,Harold P. Warren,A family gets lost on the road and stumbles up...,74,Horror,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00
131852,10865,22293,tt0060666,0.035919,19000,0,Manos: The Hands of Fate,Diane Mahree,Harold P. Warren,A family gets lost on the road and stumbles up...,74,Horror,11/15/66,15,1.5,1966,1.276423e+05,0.000000e+00


Now we can drop the 43 duplicated rows generated by exploding.

In [17]:
dff = dff.drop_duplicates()

In [18]:
dff.duplicated().sum()

0
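One plausible source of duplicates after exploding (an assumption, not verified against the dataset) is a name that appears twice within a single film's list. The sketch below reproduces that situation on a made-up row.

```python
import pandas as pd

# A single film whose cast string repeats one name (illustrative values)
toy = pd.DataFrame({
    "original_title": ["Film A"],
    "cast": ["Actor 1|Actor 1|Actor 2"],
})

exploded = (
    toy.assign(cast=toy["cast"].str.split("|"))
       .explode("cast")
       .reset_index()   # keeps the old label in an 'index' column
)

# The repeated name yields two identical rows, one of which is flagged
n_dupes = int(exploded.duplicated().sum())
clean = exploded.drop_duplicates()
print(n_dupes, len(clean))
```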

In [19]:
dff.isna().sum()

index               0
id                  0
imdb_id           105
popularity          0
budget              0
revenue             0
original_title      0
cast              125
director          347
overview           30
runtime             0
genres             94
release_date        0
vote_count          0
vote_average        0
release_year        0
budget_adj          0
revenue_adj         0
dtype: int64

<a id='eda'></a>
## Exploratory Data Analysis



We can start by checking some simple statistics.

In [20]:
dff.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
index,131811.0,5586.059,3133.116,0.0,2877.0,5618.0,8325.0,10865.0
id,131811.0,57641.87,85929.21,5.0,10142.0,17803.0,56590.0,417859.0
popularity,131811.0,0.7165645,1.124384,6.5e-05,0.229734,0.419114,0.786159,32.98576
budget,131811.0,17888330.0,34898710.0,0.0,0.0,200000.0,20000000.0,425000000.0
revenue,131811.0,48493010.0,133495900.0,0.0,0.0,0.0,32482680.0,2781506000.0
runtime,131811.0,103.4122,29.00881,0.0,90.0,100.0,113.0,900.0
vote_count,131811.0,254.7582,644.2871,10.0,18.0,45.0,179.0,9767.0
vote_average,131811.0,5.942857,0.9052944,1.5,5.4,6.0,6.6,9.2
release_year,131811.0,2000.576,12.80028,1960.0,1994.0,2005.0,2010.0,2015.0
budget_adj,131811.0,21496250.0,38499860.0,0.0,0.0,284923.284406,27914080.0,425000000.0


We can check which numeric variables could be analysed. These are `popularity`, `budget`, `revenue`, `runtime`, `vote_count`, `vote_average`, `release_year`, `budget_adj` and `revenue_adj`.

In [21]:
dff.dtypes

index               int64
id                  int64
imdb_id            object
popularity        float64
budget              int64
revenue             int64
original_title     object
cast               object
director           object
overview           object
runtime             int64
genres             object
release_date       object
vote_count          int64
vote_average      float64
release_year        int64
budget_adj        float64
revenue_adj       float64
dtype: object

We separate all numeric columns for later analysis

In [22]:
numeric = dff[['popularity', 'budget', 'revenue', 'runtime', 'vote_count', 'vote_average', 'release_year',
              'budget_adj', 'revenue_adj']]
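As an alternative to typing the column list by hand, numeric columns can be selected by dtype and the identifier columns removed afterwards. This is a sketch on a toy frame with a subset of the notebook's column names and made-up values.

```python
import pandas as pd

# Toy frame mixing identifier, text and numeric columns (illustrative values)
toy = pd.DataFrame({
    "index": [0, 1],
    "id": [1, 2],
    "original_title": ["Film A", "Film B"],
    "popularity": [32.98, 28.41],
    "budget_adj": [1.0e8, 1.5e8],
    "revenue_adj": [3.0e8, 2.0e8],
})

# Select numeric dtypes, then drop identifiers that are numeric but not measurements
numeric_toy = toy.select_dtypes(include="number").drop(columns=["index", "id"])
print(numeric_toy.columns.tolist())
```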


### Which genres are most popular from year to year?

All release years present in the dataset:

In [23]:
dff.release_year.unique()

array([2015, 2014, 1977, 2009, 2010, 1999, 2001, 2008, 2011, 2002, 1994,
       2012, 2003, 1997, 2013, 1985, 2005, 2006, 2004, 1972, 1980, 2007,
       1979, 1984, 1983, 1995, 1992, 1981, 1996, 2000, 1982, 1998, 1989,
       1991, 1988, 1987, 1968, 1974, 1975, 1962, 1964, 1971, 1990, 1961,
       1960, 1976, 1993, 1967, 1963, 1986, 1973, 1970, 1965, 1969, 1978,
       1966], dtype=int64)

Find the index label of the row with the highest `popularity` in each release year.

In [29]:
index_max = dff.groupby(['release_year'])['popularity'].idxmax()

index_max = pd.DataFrame(index_max).reset_index()

This allows us to filter the dataframe by joining on those index labels.

In [30]:
index_max.set_index('popularity')

popularity,release_year
122430,1960
122055,1961
118640,1962
126280,1963
119050,1964
129532,1965
131262,1966
125794,1967
116993,1968
130002,1969


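By joining both tables, we can find the most popular genre per year. Because `dff` holds one row per film, cast member and genre, `idxmax` returns the first row of that year's most popular film, so the genre shown is the first genre listed for that film.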

In [34]:
most_popular = index_max.merge(dff, 
                               how='left', 
                               left_on ="popularity", 
                               right_on = dff.index)[['release_year_x','genres']]

most_popular

Unnamed: 0,release_year_x,genres
0,1960,Drama
1,1961,Adventure
2,1962,Adventure
3,1963,Action
4,1964,Adventure
5,1965,Adventure
6,1966,Animation
7,1967,Family
8,1968,Science Fiction
9,1969,Adventure


We can also see which genre most often comes out on top by counting the number of years in which each genre appears.

In [None]:
most_popular['genres'].value_counts()

### What kinds of properties are associated with movies that have high revenues?
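
One possible starting point for this question, sketched here on made-up data, is to go back to one row per film and look at how the numeric columns correlate with `revenue_adj`. The values and the name `corr_with_revenue` are illustrative assumptions, and a correlation on its own says nothing about causation.

```python
import pandas as pd

# Long-format toy data where film 1 appears twice because it has two cast rows
toy = pd.DataFrame(
    [
        (1, "Actor A", 100.0, 300.0, 6.5),
        (1, "Actor B", 100.0, 300.0, 6.5),
        (2, "Actor C", 20.0, 30.0, 7.0),
        (3, "Actor D", 50.0, 120.0, 5.5),
    ],
    columns=["id", "cast", "budget_adj", "revenue_adj", "vote_average"],
)

# One row per film, so films with many cast or genre rows are not over-counted
films = toy.drop_duplicates(subset="id")

# Pearson correlation of each numeric column with adjusted revenue
corr_with_revenue = (
    films[["budget_adj", "revenue_adj", "vote_average"]]
    .corr()["revenue_adj"]
    .drop("revenue_adj")
)
print(corr_with_revenue)
```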

<a id='conclusions'></a>
## Conclusions
