## Data Cleaning  ##

This notebook narrows down the set of input files, then cleans and merges them. The goal is to organize the data so that we can work with the following:
    - genre (TMDB)
    - studio (from the gross file)
    - cost (TN, i.e. The Numbers)
    - profit (BOM, i.e. Box Office Mojo)
    

In [7]:
import pandas as pd

movie_gross_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
# imdb_name_basics_df = pd.read_csv('zippedData/imdb.name.basics.csv.gz')
# imdb_title_akas_df = pd.read_csv('zippedData/imdb.title.akas.csv.gz')
imdb_title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz')
# imdb_title_crew_df = pd.read_csv('zippedData/imdb.title.crew.csv.gz')
# imdb_title_principals_df = pd.read_csv('zippedData/imdb.title.principals.csv.gz')
# imdb_title_ratings_df = pd.read_csv('zippedData/imdb.title.ratings.csv.gz')
# rt_movie_info_df = pd.read_csv('zippedData/rt.movie_info.tsv.gz', delimiter='\t')
# rt_reviews_df = pd.read_csv('zippedData/rt.reviews.tsv.gz', delimiter = '\t', encoding='latin1')
tmdb_movies_df = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tn_movie_budgets_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

In [8]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


### Cleaning To Dos for Movie Gross ###
- Drop rows with null values in `studio` or `domestic_gross`; there are only a few of them (31 of 3387 rows)
- Ignore the nulls in `foreign_gross` for now; the figures may be listed in other files, or could possibly be calculated from them
- Check the unique values of `year` to make sure nothing looks odd
- Check the unique values of `studio`; can the abbreviations be replaced with proper names?
- Sort by `domestic_gross` and check the tail
- Strip "(YEAR)" from movie titles (via a regular expression?)

In [11]:
movie_gross_df.dropna(subset=['studio','domestic_gross'], inplace = True)

In [12]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2007 non-null   object 
 4   year            3356 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 157.3+ KB
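
The `info()` output shows `foreign_gross` stored as `object` rather than as a number. A minimal sketch of one way to find out why, using a toy series in place of the real column: the values below are made up for illustration, and the assumption that some entries carry non-numeric characters is a guess to be checked against the actual data.

```python
import pandas as pd

# Toy stand-in for movie_gross_df['foreign_gross']; these values are illustrative only.
foreign = pd.Series(['171400000', '90400000', None, '1,131.6'], name='foreign_gross')

# Coerce to numbers; anything that does not parse becomes NaN instead of raising.
as_number = pd.to_numeric(foreign, errors='coerce')

# Entries that were present but failed to parse are the ones to inspect by hand.
unparsed = foreign[foreign.notna() & as_number.isna()]
print(unparsed.tolist())
```

Listing the unparsed entries first, rather than stripping characters blindly, makes it possible to decide per case whether a value is a formatting quirk or a different unit.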


In [13]:
movie_gross_df['year'].unique()

array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=int64)

**NOTE THAT THIS ONLY RUNS FROM 2010-2018**

In [14]:
movie_gross_df['studio'].unique()

array(['BV', 'WB', 'P/DW', 'Sum.', 'Par.', 'Uni.', 'Fox', 'Wein.', 'Sony',
       'FoxS', 'SGem', 'WB (NL)', 'LGF', 'MBox', 'CL', 'W/Dim.', 'CBS',
       'Focus', 'MGM', 'Over.', 'Mira.', 'IFC', 'CJ', 'NM', 'SPC', 'ParV',
       'Gold.', 'JS', 'RAtt.', 'Magn.', 'Free', '3D', 'UTV', 'Rela.',
       'Zeit.', 'Anch.', 'PDA', 'Lorb.', 'App.', 'Drft.', 'Osci.', 'IW',
       'Rog.', 'Eros', 'Relbig.', 'Viv.', 'Hann.', 'Strand', 'NGE',
       'Scre.', 'Kino', 'Abr.', 'CZ', 'ATO', 'First', 'GK', 'FInd.',
       'NFC', 'TFC', 'Pala.', 'Imag.', 'NAV', 'Arth.', 'CLS', 'Mont.',
       'Olive', 'CGld', 'FOAK', 'IVP', 'Yash', 'ICir', 'FM', 'Vita.',
       'WOW', 'Truly', 'Indic.', 'FD', 'Vari.', 'TriS', 'ORF', 'IM',
       'Elev.', 'Cohen', 'NeoC', 'Jan.', 'MNE', 'Trib.', 'Rocket',
       'OMNI/FSR', 'KKM', 'Argo.', 'SMod', 'Libre', 'FRun', 'WHE', 'P4',
       'KC', 'SD', 'AM', 'MPFT', 'Icar.', 'AGF', 'A23', 'Da.', 'NYer',
       'Rialto', 'DF', 'KL', 'ALP', 'LG/S', 'WGUSA', 'MPI', 'RTWC', 'FIP',
  

We are not going to replace all of these with proper names since there are too many. Instead, we may end up looking at the top few studios and correcting only those.
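
If only the largest studios get corrected, a small lookup table is probably enough. The sketch below runs on a toy series; the `studio_names` mapping is hypothetical, and each expansion is an assumption that should be verified against the source before use.

```python
import pandas as pd

# Hypothetical mapping for a few high-volume studio codes; verify each expansion.
studio_names = {
    'Uni.': 'Universal',
    'WB': 'Warner Bros.',
    'Par.': 'Paramount',
    'LGF': 'Lionsgate',
}

# Toy stand-in for movie_gross_df['studio'].
studios = pd.Series(['WB', 'LGF', 'IFC', 'Par.'], name='studio')

# replace() swaps mapped codes and leaves every other code untouched.
cleaned = studios.replace(studio_names)
print(cleaned.tolist())
```

Using `replace` rather than `map` keeps unmapped abbreviations as they are instead of turning them into nulls.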

In [17]:
v_count = movie_gross_df['studio'].value_counts()

In [19]:
v_count[:11]

IFC      166
Uni.     147
WB       140
Magn.    136
Fox      136
SPC      123
Sony     109
BV       106
LGF      102
Par.     101
Eros      89
Name: studio, dtype: int64

In [28]:
movie_gross_df.loc[movie_gross_df['studio'] == "LGF"] #plug in different studio values to see movie titles 

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
26,The Expendables,LGF,103100000.0,171400000,2010
51,Saw 3D,LGF,45700000.0,90400000,2010
64,Killers,LGF,47100000.0,51100000,2010
65,Kick-Ass,LGF,48100000.0,48100000,2010
87,The Last Exorcism,LGF,41000000.0,26700000,2010
...,...,...,...,...,...
3207,Hell Fest,LGF,11100000.0,7100000,2018
3229,Kin,LGF,5700000.0,4300000,2018
3231,Traffik,LGF,9200000.0,336000,2018
3235,Condorito: La Pelicula,LGF,448000.0,8000000,2018


In [31]:
movie_gross_df.sort_values('domestic_gross').head(10)

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1476,Storage 24,Magn.,100.0,,2013
2321,The Chambermaid,FM,300.0,,2015
2757,Satanic,Magn.,300.0,,2016
2756,News From Planet Mars,KL,300.0,,2016
1018,Apartment 143,Magn.,400.0,426000.0,2012
3078,2:22,Magn.,400.0,,2017
3077,Max & Leon,Distrib.,500.0,,2017
1126,Death of a Superhero,Trib.,600.0,,2012
2920,Amityville: The Awakening,W/Dim.,700.0,7700000.0,2017
1475,Into the White,Magn.,700.0,,2013


Domestic gross looks plausible at both the high and the low end.
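
Both ends of the range can also be checked without sorting the whole frame. A minimal sketch on a toy frame built from four rows shown in the outputs above; the frame name `toy` is only for illustration.

```python
import pandas as pd

# Four rows copied from the outputs above, standing in for movie_gross_df.
toy = pd.DataFrame({
    'title': ['The Expendables', 'Kick-Ass', 'Storage 24', 'Satanic'],
    'domestic_gross': [103100000.0, 48100000.0, 100.0, 300.0],
})

# nsmallest / nlargest return the n extreme rows, already ordered.
lowest = toy.nsmallest(2, 'domestic_gross')
highest = toy.nlargest(2, 'domestic_gross')
print(lowest['title'].tolist(), highest['title'].tolist())
```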

In [32]:
#importing regex so I can more easily find the films with the year added to their titles

import re

In [36]:


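# Find titles that carry a four-digit year in parentheses, e.g. "(2010)".
# The first attempt used a JavaScript-style pattern ("/.../g") with an escaped "]",
# which raised "error: unterminated character set at position 4".
movie_gross_df[movie_gross_df['title'].str.contains(r'\(\d{4}\)')]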
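
The to-do list calls for removing "(YEAR)" from titles. A minimal sketch of that step on a toy series, assuming the year always appears as four digits in parentheses at the end of the title; the titles and the name `year_suffix` are illustrative, not taken from the data.

```python
import pandas as pd

# Toy titles; '(500) Days' checks that non-year parentheses are left alone.
titles = pd.Series(['Kick-Ass', 'Robin Hood (2010)', 'Saw 3D', '(500) Days'], name='title')

# Optional whitespace, a parenthesised four-digit number, then end of string.
year_suffix = r'\s*\(\d{4}\)\s*$'

# Flag the affected rows first, then strip the suffix.
has_year = titles.str.contains(year_suffix)
cleaned = titles.str.replace(year_suffix, '', regex=True)
print(cleaned.tolist())
```

Python patterns are plain strings, so the `/.../g` delimiters used in JavaScript are not needed; a raw string (`r'...'`) keeps the backslashes intact.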

### Cleaning To Dos for tmdb_movies_df ###
- add genre columns
- check mins/maxes
- note: no nulls
- drop rows with empty lists in `genre_ids` (2,479 of 26,517 rows, about 9% of the data set; more than I'd like, but genre can't be filled in with an average and there's no way to enter that many manually)

In [39]:
tmdb_movies_df.tail()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


In [38]:
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


### Dropping rows with empty genre lists ###

In [44]:
tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
517,517,[],31059,ru,Наша Russia: Яйца судьбы,3.867,2010-01-21,Nasha Russia: Yaytsa sudby,4.3,25
559,559,[],151316,en,Shrek’s Yule Log,3.424,2010-12-07,Shrek’s Yule Log,4.7,9
589,589,[],75828,en,Erratum,3.154,2010-09-16,Erratum,6.6,7
689,689,[],150782,en,Bikini Frankenstein,2.625,2010-01-18,Bikini Frankenstein,6.0,4
731,731,[],200946,en,Weakness,2.451,2010-10-24,Weakness,4.5,2
...,...,...,...,...,...,...,...,...,...,...
26495,26495,[],556601,en,Recursion,0.600,2018-08-28,Recursion,2.0,1
26497,26497,[],514045,en,The Portuguese Kid,0.600,2018-02-14,The Portuguese Kid,2.0,1
26498,26498,[],497839,en,The 23rd Annual Critics' Choice Awards,0.600,2018-01-11,The 23rd Annual Critics' Choice Awards,2.0,1
26500,26500,[],561932,en,Two,0.600,2018-02-04,Two,1.0,1


In [47]:

tmdb_movies_df.drop(tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]'].index, inplace = True)

In [48]:
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24038 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         24038 non-null  int64  
 1   genre_ids          24038 non-null  object 
 2   id                 24038 non-null  int64  
 3   original_language  24038 non-null  object 
 4   original_title     24038 non-null  object 
 5   popularity         24038 non-null  float64
 6   release_date       24038 non-null  object 
 7   title              24038 non-null  object 
 8   vote_average       24038 non-null  float64
 9   vote_count         24038 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [49]:
tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count


In [41]:
tmdb_movies_df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0,26517.0
mean,13258.0,295050.15326,3.130912,5.991281,194.224837
std,7654.94288,153661.615648,4.355229,1.852946,960.961095
min,0.0,27.0,0.6,0.0,1.0
25%,6629.0,157851.0,0.6,5.0,2.0
50%,13258.0,309581.0,1.374,6.0,5.0
75%,19887.0,419542.0,3.694,7.0,28.0
max,26516.0,608444.0,80.773,10.0,22186.0


In [50]:
# Create one boolean column per genre, based on each film's genre_ids list.
# Start with a dictionary of the genre codes; a later loop creates and populates one column per entry.

genre_dict = {28:'Action', 12: 'Adventure', 16: 'Animation', 35: 'Comedy', 80:'Crime', 99:'Documentary',18:'Drama', 10751:'Family',
              14:'Fantasy', 36: 'History', 27:'Horror', 10402:'Music', 9648:'Mystery', 10749:'Romance', 878:'Science Fiction',
              53:'Thriller', 10752:'War',37:'Western'}
# The TV Movie category (10770) is left out since we don't care about those (may want to filter them out...)

In [51]:
tmdb_movies_df[tmdb_movies_df['genre_ids'].str.contains('10770')]

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
64,64,"[10770, 16, 14, 10751]",47626,en,Legend of the BoneKnapper Dragon,13.013,2010-10-15,Legend of the BoneKnapper Dragon,6.4,116
89,89,"[35, 10402, 10770]",44244,en,Camp Rock 2: The Final Jam,11.542,2010-09-03,Camp Rock 2: The Final Jam,6.1,878
101,101,"[16, 10751, 10770]",50393,en,Kung Fu Panda Holiday,11.083,2010-11-26,Kung Fu Panda Holiday,6.6,110
158,158,"[35, 10749, 10770, 10402]",35558,en,Starstruck,9.406,2010-02-14,Starstruck,6.7,627
235,235,"[10770, 18, 10751, 14]",50479,en,Avalon High,8.098,2010-11-12,Avalon High,6.1,313
...,...,...,...,...,...,...,...,...,...,...
26381,26381,"[10770, 99]",560930,en,Gypsy's Revenge,0.600,2018-11-06,Gypsy's Revenge,6.7,3
26387,26387,"[99, 10770]",525846,en,Casey Anthony's Parents Speak,0.600,2018-05-28,Casey Anthony's Parents Speak,6.5,2
26398,26398,"[80, 10770]",571692,en,Nightmare Best Friend,0.600,2018-12-29,Nightmare Best Friend,6.0,1
26402,26402,"[18, 10749, 10770]",562466,en,Christmas on the Coast,0.600,2018-11-25,Christmas on the Coast,6.0,1


### Drop TV movies, then add one True/False column per genre based on genre_ids ###

In [52]:
# clears out the tv movies 
tmdb_movies_df.drop(tmdb_movies_df.loc[tmdb_movies_df['genre_ids'].str.contains('10770')].index, inplace = True)

In [55]:
tmdb_movies_df['genre_ids'].str.contains('28')

0        False
1        False
2         True
3        False
4         True
         ...  
26512    False
26513    False
26514     True
26515     True
26516    False
Name: genre_ids, Length: 22954, dtype: bool

In [57]:
for key, value in genre_dict.items():
    tmdb_movies_df[value] = tmdb_movies_df['genre_ids'].str.contains(str(key))
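
The loop above tests each id as a substring of the `genre_ids` text. That works here only because no id in `genre_dict` appears inside another; an id such as 128 would match both '12' and '28'. A more defensive sketch, assuming every `genre_ids` value is a string holding a Python-style list of integers: parse it with `ast.literal_eval` and test membership. The toy frame and the id 128 are made up to show the difference.

```python
import ast
import pandas as pd

# Subset of the genre codes, enough for the illustration.
genre_dict = {28: 'Action', 12: 'Adventure', 878: 'Science Fiction'}

# Toy stand-in for tmdb_movies_df; 128 is a made-up id that exposes the pitfall.
toy = pd.DataFrame({'genre_ids': ['[12, 28, 878]', '[16, 35, 10751]', '[128]']})

# Turn each string such as '[12, 28, 878]' into a real list of ints.
parsed = toy['genre_ids'].apply(ast.literal_eval)

# One boolean column per genre, using exact membership instead of substring search.
for genre_id, name in genre_dict.items():
    toy[name] = parsed.apply(lambda ids, g=genre_id: g in ids)

# For comparison: the substring test wrongly flags the '[128]' row.
substring_hit = toy['genre_ids'].str.contains('28')
print(toy[['Action', 'Adventure', 'Science Fiction']])
```

Binding `genre_id` through the default argument `g=genre_id` keeps each lambda tied to its own id, which avoids the late-binding surprise that closures in loops can cause.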

In [58]:
tmdb_movies_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count,...,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,Thriller,War,Western
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,...,True,False,False,False,False,False,False,False,False,False
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,...,True,False,False,False,False,False,False,False,False,False
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368,...,False,False,False,False,False,False,True,False,False,False
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174,...,False,False,False,False,False,False,False,False,False,False
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186,...,False,False,False,False,False,False,True,False,False,False
