## Data Cleaning  ##

We will reduce the number of files in use and clean and merge them within this notebook. The goal is to organize the data so that we can work with the following:

- genre (TMDB)
- studio (gross profits)
- cost (TN, The Numbers)
- profit (BOM, Box Office Mojo)
    

In [58]:
!ls zippedData

InitialDataExploration.ipynb
Untitled.ipynb
bom.movie_gross.csv.gz
imdb.name.basics.csv.gz
imdb.title.akas.csv.gz
imdb.title.basics.csv.gz
imdb.title.crew.csv.gz
imdb.title.principals.csv.gz
imdb.title.ratings.csv.gz
rt.movie_info.tsv.gz
rt.reviews.tsv.gz
tmdb.movies.csv.gz
tn.movie_budgets.csv.gz


In [59]:
import pandas as pd
#importing regex so I can more easily find the films with the year added to their titles and to remove punctuation 
# from title column 
import re

movie_gross_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
imdb_title_basics_df = pd.read_csv('zippedData/imdb.title.basics.csv.gz')
imdb_title_ratings_df = pd.read_csv('zippedData/imdb.title.ratings.csv.gz')
tmdb_movies_df = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tn_movie_budgets_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

## Suppress scientific notation ##

In [60]:
# display floats to two decimal places
# to undo this change: pd.reset_option('^display.', silent=True)

pd.options.display.float_format = '{:.2f}'.format
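
As a minimal sketch of what this setting does, the snippet below formats a couple of made-up values and then restores the default. The `sample` series is hypothetical and only there for illustration.

```python
import pandas as pd

# Show floats with two decimal places instead of scientific notation.
pd.options.display.float_format = '{:.2f}'.format

# Hypothetical sample values, only to illustrate the formatting.
sample = pd.Series([415000000.0, 2400.0])
formatted = sample.to_string()
print(formatted)

# Undo the change and return to the pandas default for this one option.
pd.reset_option('display.float_format')
```

Resetting only `display.float_format` leaves any other display options untouched, whereas the `'^display.'` pattern resets all of them.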

### Cleaning To Dos for Movie Gross ###
- Get rid of rows with null values in the studio field and in the domestic gross field since there aren't many of them 
- ignore the foreign gross nulls, info may be listed on other sheets iirc. Alternately possibly could be calculated w/ info from other sheets
- do a unique value check on year, make sure nothing weird
- do a unique value check on studio, can we replace w/proper names?
- sort by domestic gross and check the tail
- clean out "(YEAR)" from movie titles (via regular expression?)
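
The unique-value and tail checks in the list above could look roughly like this. It is a sketch on a hypothetical four-row stand-in (`sample_gross`) rather than the real `movie_gross_df`.

```python
import pandas as pd

# Hypothetical miniature of movie_gross_df; the real frame comes from bom.movie_gross.csv.gz.
sample_gross = pd.DataFrame({
    'title': ['Toy Story 3', 'Inception', 'The Swan', 'El Pacto'],
    'studio': ['BV', 'WB', 'Synergetic', 'Sony'],
    'domestic_gross': [415000000.0, 292600000.0, 2400.0, 2500.0],
    'year': [2010, 2010, 2018, 2018],
})

# Unique-value check on year: anything outside the expected range stands out.
years = sorted(sample_gross['year'].unique())

# Unique-value check on studio: how many distinct abbreviations are there?
n_studios = sample_gross['studio'].nunique()

# Sort by domestic gross (descending) and inspect the smallest earners.
lowest = sample_gross.sort_values('domestic_gross', ascending=False).tail(2)

print(years, n_studios)
print(lowest)
```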

In [61]:
#remove rows with null values in studio and domestic gross columns
movie_gross_df.dropna(subset=['studio','domestic_gross'], inplace = True)

We are not going to replace these with proper names; there are too many. We may end up looking at the top few studios and correcting just those.

In [62]:
v_count = movie_gross_df['studio'].value_counts()

In [63]:
v_count[:11]

IFC      166
Uni.     147
WB       140
Fox      136
Magn.    136
SPC      123
Sony     109
BV       106
LGF      102
Par.     101
Eros      89
Name: studio, dtype: int64
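
If we do end up correcting only the top few studios, a dictionary lookup is probably enough. The sketch below is an assumption-laden example: the `studio_names` mapping is our own guess at what a few of the abbreviations stand for, not something taken from the data files.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be movie_gross_df['studio'].
studios = pd.Series(['BV', 'WB', 'Uni.', 'Eros', 'WB'])

# Assumed expansions of a few abbreviations (not taken from the data).
studio_names = {
    'BV': 'Buena Vista',
    'WB': 'Warner Bros.',
    'Uni.': 'Universal',
}

# replace() only touches values that appear as keys in the mapping,
# so unmapped studios keep their abbreviation.
proper = studios.replace(studio_names)
print(proper.tolist())
```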

In [64]:
reg_expression = r'\([0-9]{4}\)' #raw string; looks for 4-digit numeric string between '(' and ')'

# found the below format online, returns all movies that meet the reg expression

titles_need_formatting = movie_gross_df[movie_gross_df['title'].str.count(reg_expression)>0]
titles_need_formatting

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1,Alice in Wonderland (2010),BV,334200000.00,691300000,2010
10,Clash of the Titans (2010),WB,163200000.00,330000000,2010
55,A Nightmare on Elm Street (2010),WB (NL),63100000.00,52600000,2010
85,Legion (2010),SGem,40200000.00,27800000,2010
106,Death at a Funeral (2010),SGem,42700000.00,6300000,2010
...,...,...,...,...,...
3326,The Little Mermaid (2018),Conglomerate,147000.00,,2018
3340,Revenge (2018),Neon,102000.00,,2018
3341,Unstoppable (2018),WGUSA,101000.00,,2018
3365,The Apparition (2018),MBox,28300.00,,2018


In [65]:
# removes the "(YEAR)" match from each title, then strips the trailing space left behind


movie_gross_df['title'] = movie_gross_df['title'].apply(lambda x: re.sub(reg_expression,"",x)).str.rstrip()
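
For reference, the same cleanup can be tried on a few sample titles with the vectorized string methods. This is a sketch on hypothetical input, with `year_pattern` standing in for `reg_expression`.

```python
import pandas as pd

# Hypothetical sample titles in the same style as the BOM data.
titles = pd.Series(['Alice in Wonderland (2010)', 'Inception', 'Legion (2010)'])

# Raw string so the backslashes reach the regex engine unchanged.
year_pattern = r'\([0-9]{4}\)'

# Remove the "(YYYY)" suffix, then strip the space it leaves behind.
cleaned = titles.str.replace(year_pattern, '', regex=True).str.rstrip()
print(cleaned.tolist())
```

The pattern requires the closing parenthesis right after the four digits, so titles such as "(2014 re-release)" are not touched by this step.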



In [66]:
# finds all the re-releases

substring = 're-release'

movie_gross_df[movie_gross_df['title'].str.contains(substring, regex=False)]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
1823,The Conformist (2014 re-release),KL,58700.0,,2014
1833,Alphaville (2013 re-release),Rialto,47700.0,,2014
2139,The Third Man (2015 re-release),Rialto,449000.0,,2015
2604,Only Yesterday (2016 re-release),GK,453000.0,,2016
3264,2001: A Space Odyssey (2018 re-release),WB,3200000.0,,2018
3289,Schindler's List (2018 re-release),Uni.,833000.0,,2018
3296,The Sound of Music (2018 re-release),Fathom,616000.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018


In [67]:
#drop the re-releases from the data set as they were not made in the listed year

movie_gross_df.drop(movie_gross_df[movie_gross_df['title'].str.contains(substring, regex=False)].index, inplace=True)

In [68]:
#Clean up column titles 


movie_gross_df.rename(columns = {'title':'Title', 'studio':'Studio', 'domestic_gross':'Domestic Gross',
          'foreign_gross':'Foreign Gross', 'year':'Year'}, inplace = True)
movie_gross_df

Unnamed: 0,Title,Studio,Domestic Gross,Foreign Gross,Year
0,Toy Story 3,BV,415000000.00,652000000,2010
1,Alice in Wonderland,BV,334200000.00,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010
3,Inception,WB,292600000.00,535700000,2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010
...,...,...,...,...,...
3381,Beauty and the Dogs,Osci.,8900.00,,2018
3382,The Quake,Magn.,6200.00,,2018
3384,El Pacto,Sony,2500.00,,2018
3385,The Swan,Synergetic,2400.00,,2018


In [69]:
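#create cleaned-up version of the title column by removing all punctuation and converting to lowercase, for use in the link
#regex=True is needed on pandas 2.0+, where str.replace treats the pattern as literal text by default
movie_gross_df['Clean Title'] = movie_gross_df['Title'].str.replace(r'[^\w\s]+', '', regex=True).str.lower()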


In [70]:
#create Link column 
movie_gross_df['Link'] = movie_gross_df['Clean Title'] + movie_gross_df['Year'].astype(str)


Unnamed: 0,Title,Studio,Domestic Gross,Foreign Gross,Year,Clean Title,Link
0,Toy Story 3,BV,415000000.00,652000000,2010,toy story 3,toy story 32010
1,Alice in Wonderland,BV,334200000.00,691300000,2010,alice in wonderland,alice in wonderland2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010,harry potter and the deathly hallows part 1,harry potter and the deathly hallows part 12010
3,Inception,WB,292600000.00,535700000,2010,inception,inception2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010,shrek forever after,shrek forever after2010
...,...,...,...,...,...,...,...
3381,Beauty and the Dogs,Osci.,8900.00,,2018,beauty and the dogs,beauty and the dogs2018
3382,The Quake,Magn.,6200.00,,2018,the quake,the quake2018
3384,El Pacto,Sony,2500.00,,2018,el pacto,el pacto2018
3385,The Swan,Synergetic,2400.00,,2018,the swan,the swan2018
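
To make the Clean Title and Link steps concrete, here is a small sketch on hypothetical rows (the `sample` frame is made up; the column names follow the notebook).

```python
import pandas as pd

# Hypothetical rows, only to illustrate the two steps.
sample = pd.DataFrame({
    'Title': ['Toy Story 3', 'Spider-Man: Homecoming'],
    'Year': [2010, 2017],
})

# Strip punctuation with a regex, then lowercase.
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
sample['Clean Title'] = (
    sample['Title'].str.replace(r'[^\w\s]+', '', regex=True).str.lower()
)

# Concatenate the cleaned title and the year (cast to str) into one key.
sample['Link'] = sample['Clean Title'] + sample['Year'].astype(str)
print(sample['Link'].tolist())
```

Because the year is appended with no separator, a title ending in a digit runs straight into it, as in `toy story 32010`.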


In [83]:
#export file
movie_gross_df.to_csv('movie_gross_df_CLEAN_UPDATED.csv')
movie_gross_df

Unnamed: 0,Title,Studio,Domestic Gross,Foreign Gross,Year,Clean Title,Link
0,Toy Story 3,BV,415000000.00,652000000,2010,toy story 3,toy story 32010
1,Alice in Wonderland,BV,334200000.00,691300000,2010,alice in wonderland,alice in wonderland2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.00,664300000,2010,harry potter and the deathly hallows part 1,harry potter and the deathly hallows part 12010
3,Inception,WB,292600000.00,535700000,2010,inception,inception2010
4,Shrek Forever After,P/DW,238700000.00,513900000,2010,shrek forever after,shrek forever after2010
...,...,...,...,...,...,...,...
3381,Beauty and the Dogs,Osci.,8900.00,,2018,beauty and the dogs,beauty and the dogs2018
3382,The Quake,Magn.,6200.00,,2018,the quake,the quake2018
3384,El Pacto,Sony,2500.00,,2018,el pacto,el pacto2018
3385,The Swan,Synergetic,2400.00,,2018,the swan,the swan2018


## Cleaning To Dos for tmdb_movies_df ##
- add genre columns
- check min/maxs
- note no nulls
- clean out rows w/empty lists in genre category (about 10% of the data set, 2.6k rows; more than I'd like, but filling in an average isn't going to cut it and there's no way to manually enter that many)
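
One possible way to add genre columns is sketched below. It assumes `genre_ids` is stored as a string (as the `== '[]'` comparison in the next cell suggests), and the `genre_names` lookup is an assumed subset of the TMDB genre ids, not something shipped with the data files.

```python
import ast
import pandas as pd

# Hypothetical rows; genre_ids is stored as a string such as "[12, 14, 10751]".
sample = pd.DataFrame({
    'title': ['Inception', 'Toy Story', 'Two'],
    'genre_ids': ['[28, 878, 12]', '[16, 35, 10751]', '[]'],
})

# Assumed subset of the TMDB genre-id lookup (not included in the data files).
genre_names = {12: 'Adventure', 16: 'Animation', 28: 'Action',
               35: 'Comedy', 878: 'Science Fiction', 10751: 'Family'}

# Parse each string into a real Python list of ints.
sample['genre_list'] = sample['genre_ids'].apply(ast.literal_eval)

# Drop rows whose list is empty.
sample = sample[sample['genre_list'].apply(len) > 0].copy()

# One boolean column per genre.
for genre_id, name in genre_names.items():
    sample[name] = sample['genre_list'].apply(lambda ids, g=genre_id: g in ids)

print(sample[['title', 'Action', 'Family']])
```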

### Remove rows with empty genre lists ###

In [71]:
tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]']

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
517,517,[],31059,ru,Наша Russia: Яйца судьбы,3.87,2010-01-21,Nasha Russia: Yaytsa sudby,4.30,25
559,559,[],151316,en,Shrek’s Yule Log,3.42,2010-12-07,Shrek’s Yule Log,4.70,9
589,589,[],75828,en,Erratum,3.15,2010-09-16,Erratum,6.60,7
689,689,[],150782,en,Bikini Frankenstein,2.62,2010-01-18,Bikini Frankenstein,6.00,4
731,731,[],200946,en,Weakness,2.45,2010-10-24,Weakness,4.50,2
...,...,...,...,...,...,...,...,...,...,...
26495,26495,[],556601,en,Recursion,0.60,2018-08-28,Recursion,2.00,1
26497,26497,[],514045,en,The Portuguese Kid,0.60,2018-02-14,The Portuguese Kid,2.00,1
26498,26498,[],497839,en,The 23rd Annual Critics' Choice Awards,0.60,2018-01-11,The 23rd Annual Critics' Choice Awards,2.00,1
26500,26500,[],561932,en,Two,0.60,2018-02-04,Two,1.00,1


In [72]:

tmdb_movies_df.drop(tmdb_movies_df[tmdb_movies_df['genre_ids'] == '[]'].index, inplace = True)

## Clears out TV Movies from df ##

In [73]:
# clears out the tv movies 
tmdb_movies_df.drop(tmdb_movies_df.loc[tmdb_movies_df['genre_ids'].str.contains('10770')].index, inplace = True)
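
The two drops above (empty genre lists and TV movies) can also be expressed as a single boolean mask. This is a sketch on hypothetical rows, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical rows covering the three cases.
sample = pd.DataFrame({
    'title': ['Inception', 'Two', 'Hypothetical TV Film'],
    'genre_ids': ['[28, 878, 12]', '[]', '[18, 10770]'],
})

# True where the genre list is empty.
no_genre = sample['genre_ids'] == '[]'

# True where the TV Movie id appears; regex=False treats '10770' as plain text.
tv_movie = sample['genre_ids'].str.contains('10770', regex=False)

# Keep only rows that are neither.
kept = sample[~(no_genre | tv_movie)]
print(kept)
```

Note that this is a substring match, so it would also catch any longer id that happens to contain `10770`.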

In [74]:
tmdb_movies_df.columns = [x.title() for x in tmdb_movies_df.columns]
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22954 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         22954 non-null  int64  
 1   Genre_Ids          22954 non-null  object 
 2   Id                 22954 non-null  int64  
 3   Original_Language  22954 non-null  object 
 4   Original_Title     22954 non-null  object 
 5   Popularity         22954 non-null  float64
 6   Release_Date       22954 non-null  object 
 7   Title              22954 non-null  object 
 8   Vote_Average       22954 non-null  float64
 9   Vote_Count         22954 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 1.9+ MB


## Remove punctuation, turn lowercase ##

In [75]:
#uses regular expressions to remove punctuation

tmdb_movies_df['Clean Title'] = tmdb_movies_df['Title'].str.replace(r'[^\w\s]+', '', regex=True)



In [76]:
#lowercase string
tmdb_movies_df['Clean Title'] = tmdb_movies_df['Clean Title'].str.lower()
tmdb_movies_df



Unnamed: 0.1,Unnamed: 0,Genre_Ids,Id,Original_Language,Original_Title,Popularity,Release_Date,Title,Vote_Average,Vote_Count,Clean Title
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.53,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.70,10788,harry potter and the deathly hallows part 1
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.73,2010-03-26,How to Train Your Dragon,7.70,7610,how to train your dragon
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.52,2010-05-07,Iron Man 2,6.80,12368,iron man 2
3,3,"[16, 35, 10751]",862,en,Toy Story,28.00,1995-11-22,Toy Story,7.90,10174,toy story
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.30,22186,inception
...,...,...,...,...,...,...,...,...,...,...,...
26512,26512,"[27, 18]",488143,en,Laboratory Conditions,0.60,2018-10-13,Laboratory Conditions,0.00,1,laboratory conditions
26513,26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.60,2018-05-01,_EXHIBIT_84xxx_,0.00,1,_exhibit_84xxx_
26514,26514,"[14, 28, 12]",381231,en,The Last One,0.60,2018-10-01,The Last One,0.00,1,the last one
26515,26515,"[10751, 12, 28]",366854,en,Trailer Made,0.60,2018-06-22,Trailer Made,0.00,1,trailer made


## Create link column ##

In [79]:
#link up the clean title and year to create unique ID
#start by creating a Year column (kept as a string, not a number)
tmdb_movies_df['Year'] = tmdb_movies_df['Release_Date'].str[:4]

tmdb_movies_df['Link']=tmdb_movies_df['Clean Title']+tmdb_movies_df['Year']
tmdb_movies_df.to_csv('tmdb_movies_df_CLEAN UPDATED.csv')
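
Since both frames now carry a Link column, a merge along the following lines should be possible. This is a sketch with two hypothetical miniature frames (`gross` and `tmdb`), and the join type is our assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned BOM and TMDB frames.
gross = pd.DataFrame({
    'Title': ['Inception', 'The Swan'],
    'Domestic Gross': [292600000.0, 2400.0],
    'Link': ['inception2010', 'the swan2018'],
})
tmdb = pd.DataFrame({
    'Genre_Ids': ['[28, 878, 12]', '[12, 28, 878]'],
    'Link': ['inception2010', 'iron man 22010'],
})

# Left join keeps every BOM row; indicator=True records whether a TMDB match was found.
merged = gross.merge(tmdb, on='Link', how='left', indicator=True)
print(merged[['Title', 'Genre_Ids', '_merge']])
```

The `_merge` column makes it easy to count how many titles failed to match before deciding whether the Link key is good enough.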

### Cleaning Up imdb_title_ratings_df ###

No need to add a Link column here, since tconst can already be used as the key.

In [82]:
#cleans up column titles

imdb_title_ratings_df.rename(columns = {'averagerating':'Average Rating', 'numvotes':'Number of Votes'}, inplace = True)
imdb_title_ratings_df

Unnamed: 0,tconst,Average Rating,Number of Votes
0,tt10356526,8.30,31
1,tt10384606,8.90,559
2,tt1042974,6.40,20
3,tt1043726,4.20,50352
4,tt1060240,6.50,21
...,...,...,...
73851,tt9805820,8.10,25
73852,tt9844256,7.50,24
73853,tt9851050,4.70,14
73854,tt9886934,7.00,5


In [81]:
imdb_title_ratings_df.to_csv('imdb_title_ratings_df_CLEAN UPDATED.csv')
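
Because `tconst` serves as the key, joining the ratings to another IMDb table needs no Link column. The sketch below uses hypothetical rows, and the `primary_title` column name is an assumption about `imdb.title.basics`.

```python
import pandas as pd

# Hypothetical stand-in for imdb_title_basics_df (column name assumed).
basics = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'primary_title': ['Hypothetical Film A', 'Hypothetical Film B'],
})

# Hypothetical stand-in for the renamed imdb_title_ratings_df.
ratings = pd.DataFrame({
    'tconst': ['tt0000001'],
    'Average Rating': [8.3],
    'Number of Votes': [31],
})

# Inner join keeps only titles that have a rating.
rated = basics.merge(ratings, on='tconst', how='inner')
print(rated)
```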