# DAND Term 1 Project 3
Jesse Fredrickson

8/15/18

## TMDb movie data

[Link](https://www.google.com/url?q=https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv&sa=D&ust=1534386214169000)

In this project, I will conduct my own data analysis and document my findings in this jupyter notebook. Please note that all findings are tentative and based simply on observations, not on inferential statistics or machine learning. Correlation does not imply causation.

### Brainstorming

In [35]:
# imports, loads, and magics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

path = 'C:/Users/Jesse/Documents/PyData/DAND_t1_p3/'
df = pd.read_csv('tmdb-movies.csv')
df.head(2)

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0


In [46]:
# don't care about some columns immediately
dropcols = ['homepage', 'tagline', 'overview']
df.drop(dropcols, axis=1, inplace=True)

In [47]:
# check for duplicates in all columns
df.drop_duplicates(inplace = True)
for col in df.columns:
    print('Duplicates in ' + col + ': ' + str(df.duplicated(col).sum()))

Duplicates in id: 0
Duplicates in imdb_id: 9
Duplicates in popularity: 48
Duplicates in budget: 10020
Duplicates in revenue: 5981
Duplicates in original_title: 0
Duplicates in cast: 141
Duplicates in director: 5571
Duplicates in keywords: 2024
Duplicates in runtime: 10328
Duplicates in genres: 8571
Duplicates in production_companies: 3315
Duplicates in release_date: 4814
Duplicates in vote_count: 9294
Duplicates in vote_average: 10499
Duplicates in release_year: 10515
Duplicates in budget_adj: 8016
Duplicates in revenue_adj: 5853


In [42]:
# examine duplicates
df[df['imdb_id'].duplicated()]; # reveals that imdb_id duplicates are all Nan... that's fine

# want to drop all original title duplicates while preserving all relevant data.
# Saving only the first occurrance might throw away valuable data
# on way to do this would be to groupby original_title, replace all 0s with np.nan, ffill and bfill
# then drop duplicates.
def fillmeupdaddy(df, col):
    df.groupby([col]).ffill().bfill(inplace=True)
    df.drop_duplicates(['original_title'], inplace=True)
    return df

df.replace(0,np.nan)
clean_df = fillmeupdaddy(df, 'original_title')

In [48]:
for col in df.columns:
    print('Duplicates in ' + col + ': ' + str(df.duplicated(col).sum()))

Duplicates in id: 0
Duplicates in imdb_id: 9
Duplicates in popularity: 48
Duplicates in budget: 10020
Duplicates in revenue: 5981
Duplicates in original_title: 0
Duplicates in cast: 141
Duplicates in director: 5571
Duplicates in keywords: 2024
Duplicates in runtime: 10328
Duplicates in genres: 8571
Duplicates in production_companies: 3315
Duplicates in release_date: 4814
Duplicates in vote_count: 9294
Duplicates in vote_average: 10499
Duplicates in release_year: 10515
Duplicates in budget_adj: 8016
Duplicates in revenue_adj: 5853


In [49]:
# Now count Nans
df.isnull().sum()

id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      74
director                  43
keywords                1467
runtime                    0
genres                    23
production_companies    1019
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

It is worth noting that although the `clean_df` dataframe is clean of duplicates, it has a number of NaNs in some columns. It may be necessary to have to remove NaNs for some analyses