# Movie Analysis: Data Scrubbing

## About:
In the data scrubbing phase I will focus on cleaning up the columns I plan on using, and building up the data frame I will use for the EDA phase:

1. US Gross Revenue
2. Genre
3. Actors
4. Time of Year (date)
5. Keywords (content)
6. Combined

### Project imports:

In [1]:
# imports for the entire data scrubbing phase
import pandas as pd 
import os

## 1. US Gross Revenue
This column will be how we measure the other columns, so we will start here and drop any rows that don't have this information.

In [2]:
revenue_path = os.path.join(os.pardir, 'data', 'interim', 'money.csv')
revenue_df = pd.read_csv(revenue_path)

In [3]:
revenue_df.head()

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
0,tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,[US],519,$245MM,$937MM
1,tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,[US],111,$356MM,$858MM
2,tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,[US],533,$237MM,$761MM
3,tt1825683,Black Panther,2018,Ryan Coogler,Marvel Studios,[US],269,$200MM,$700MM
4,tt4154756,Avengers: Infinity War,2018,Anthony Russo,Marvel Studios,[US],376,$321MM,$679MM


In [4]:
revenue_df.loc[revenue_df['imdb_id'] == 'tt0091605']

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
5194,tt0091605,The Name of the Rose,1986,Jean-Jacques Annaud,Constantin Film,[DE],2485,ITL 30B,$7.2MM


In [5]:
# drop the last row of the dataframe
revenue_df = revenue_df[:-1]

### Changes:

1. Convert 'us_gross' and 'budget_usd' values into floats. That means stripping the non-numeric characters and expanding the magnitude suffixes ('K', 'MM', 'B'); for example, 'MM' means the number is multiplied by 1,000,000.
2. Convert year column to int, so the years don't have the trailing .0.
3. region_code does not need the brackets around the abbreviations.

In [6]:
# Created 3/22/2020 with the exchange rates current on that date. Values are not adjusted for the date the movie was made.
def get_conversion_rate(value):
    """Get exchange rate for given currency code
    
    Arguments:
        value (string): String with currency code or symbol in it

    Returns:
        rate (float): Units of the currency per 1 USD (divide an amount by it to get USD)
    """
    if '£' in value:
        return 0.854
    elif '€' in value:
        return 0.9334
    elif 'AUD' in value:
        return 1.7229
    elif 'CAD' in value:
        return 1.435
    elif 'FRF' in value:
        return 6.55957 * 0.9334
    elif 'INR' in value:
        return 75.394
    elif 'THB' in value:
        return 32.68
    elif 'EM' in value:
        return 1 # can't find info on EM
    elif 'JPY' in value:
        return 110.75
    elif 'SKW' in value:
        return 1254.45
    elif 'HUF' in value:
        return 327.94
    elif 'NGN' in value:
        return 364
    elif 'CNY' in value:
        return 7.0950
    elif 'ESP' in value:
        return 155.42826
    elif 'RUR' in value:
        return 79.87
    elif 'HKD' in value:
        return 7.7570
    elif 'ISK' in value:
        return 140.490
    elif 'PHP' in value:
        return 51.19
    elif 'DKK' in value:
        return 6.9716
    elif 'CZK' in value:
        return 25.5620
    elif 'SKK' in value:
        return 10.3753
    elif 'NOK' in value:
        return 11.7890
    elif 'MXN' in value:
        return 24.4215
    elif 'JMD' in value:
        return 135.07
    elif 'PLN' in value:
        return 4.23
    elif 'KRW' in value:
        return 1228.97
    elif 'ITL' in value:
        return 1804.64
    else:
        return 1
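
The long `elif` chain can also be written as a lookup table. The sketch below is one possible alternative, not the code used in this notebook: `RATES` and `lookup_rate` are hypothetical names, and only a handful of the rates listed above are copied in for illustration.

```python
# Hypothetical table-driven variant of get_conversion_rate.
# Only a few of the rates from the function above are copied here.
RATES = {
    '£': 0.854,
    '€': 0.9334,
    'CAD': 1.435,
    'JPY': 110.75,
    'ITL': 1804.64,
}

def lookup_rate(value, rates=RATES, default=1.0):
    """Return the rate for the first currency marker found in value."""
    for marker, rate in rates.items():
        if marker in value:
            return rate
    # no known marker: treat the amount as already being in USD
    return default
```

With a table like this, adding or correcting a currency becomes a one-line change to the dictionary.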

In [7]:
def strip_currency_code(value):
    """Strips currency code from front of currency string

    Arguments: 
        value (string): currency amount prefaced with currency code
    
    Returns:
        value (string): value without the currency code
    """
    if value[:1] in '$£€':
        return value[1:]
    else:
        return value[3:]

In [8]:
def convert_money(value):
    """Takes currency string and parses it into correct amount in USD
    
    Arguments:
        value (string): currency in form: CAD 345.3B 

    Returns:
        value (float): amount converted to USD, or None for non-string input
    """
    # non-string input (e.g. NaN for a missing value) yields None
    if not isinstance(value, str):
        return None
    
    # check currency sign and get coefficient
    coef = get_conversion_rate(value)
    value = strip_currency_code(value)
    if 'K' in value:
        value = (float(value.strip('K')) * 1000) / coef
    elif 'MM' in value:
        value = (float(value.strip('MM')) * 1000000) / coef
    elif 'B' in value:
        value = (float(value.strip('B')) * 1000000000) / coef
    else:
        value = float(value.strip()) / coef
    return value
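
To make the arithmetic concrete, here is a small self-contained sketch that parses 'ITL 30B' (the budget of The Name of the Rose shown earlier) and divides by the ITL rate from `get_conversion_rate`. The helper `parse_amount` and the table `SUFFIX_MULTIPLIER` are hypothetical names used only for this illustration, and the regular expression assumes the strings look like the samples in this notebook.

```python
import re

# Multipliers for the magnitude suffixes seen in the data
SUFFIX_MULTIPLIER = {'K': 1e3, 'MM': 1e6, 'B': 1e9, '': 1.0}

def parse_amount(text):
    """Split a string such as 'ITL 30B' into (currency prefix, numeric amount)."""
    match = re.fullmatch(r'\s*(\D*?)\s*([\d.,]+)\s*(K|MM|B)?\s*', text)
    if match is None:
        raise ValueError(f'unrecognised money string: {text!r}')
    prefix, number, suffix = match.groups()
    amount = float(number.replace(',', '')) * SUFFIX_MULTIPLIER[suffix or '']
    return prefix, amount

prefix, amount = parse_amount('ITL 30B')   # 30 billion lire
usd = amount / 1804.64                     # ITL rate taken from get_conversion_rate
```

Under these assumptions the budget works out to roughly 16.6 million USD.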

In [9]:
revenue_df['us_gross'] = revenue_df['us_gross'].apply(convert_money)

In [10]:
revenue_df['us_gross']

0        937000000.0
1        858000000.0
2        761000000.0
3        700000000.0
4        679000000.0
            ...     
14696            NaN
14697            NaN
14698            NaN
14699            NaN
14700            NaN
Name: us_gross, Length: 14701, dtype: float64

In [11]:
revenue_df['budget_usd'] = revenue_df['budget_usd'].apply(convert_money)

In [12]:
revenue_df['budget_usd'].isna().sum()

6705

In [13]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
13854,tt1723126,Last Days Here,2011,Don Argott,9.14 Pictures,[US],114022,,8000.0
11117,tt0110115,Imaginary Crimes,1994,Anthony Drazan,Morgan Creek Entertainment,[US],47710,,90000.0
6685,tt0389326,Riding Giants,2004,Stacy Peralta,Agi Orsi Productions,[US],55948,,2300000.0
4650,tt0088395,Where the Boys Are,1984,Hy Averback,Incorporated Television Company (ITC),[GB],15228,,11000000.0
14559,tt2537176,I Spit on Your Grave 2,2013,Steven R. Monroe,Cinetel Films,[US],4001,,809.0


Now for region code. We actually don't need this column so we will drop it.

In [14]:
revenue_df.drop(columns='region_code', inplace=True)

For the 'year' column we will go ahead and drop the rows with missing values, because there are only 16 of them.

In [15]:
revenue_df.isna().sum()

imdb_id             1
title               1
year               16
director           30
production_co     356
rank               13
budget_usd       6705
us_gross          103
dtype: int64

Cleaning up NaN values:

In [16]:
# first, change the missing budget values to -1, so we don't drop the 6705 rows that lack a budget.
revenue_df['budget_usd'] = revenue_df['budget_usd'].fillna(-1)

In [17]:
# also, fill in the production_co missing values with an 'Unknown'
revenue_df['production_co'] = revenue_df['production_co'].fillna('Unknown')

In [18]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14701 entries, 0 to 14700
Data columns (total 8 columns):
imdb_id          14700 non-null object
title            14700 non-null object
year             14685 non-null object
director         14671 non-null object
production_co    14701 non-null object
rank             14688 non-null object
budget_usd       14701 non-null float64
us_gross         14598 non-null float64
dtypes: float64(2), object(6)
memory usage: 918.9+ KB


In [19]:
revenue_df = revenue_df.dropna()
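
An alternative to the `-1` placeholder, sketched here on a made-up three-row frame, is to leave `NaN` in `budget_usd` and restrict `dropna` to the columns that must be present through its `subset` parameter. The frame `toy` and the list `required` are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# made-up rows, for illustration only
toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'year': [2004, None, 2012],
    'budget_usd': [np.nan, 5e6, 250e6],
    'us_gross': [2.3e6, 1e6, 73e6],
})

# only rows missing a required column are dropped;
# a missing budget no longer removes the row
required = ['imdb_id', 'year', 'us_gross']
kept = toy.dropna(subset=required)
```

The trade-off is that later steps must tolerate `NaN` budgets instead of filtering on `budget_usd > 0`.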

In [20]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
8061,tt1002617,Between Me and My Mind,2019,Steven Cantor,Believe Entertainment Group,96620,-1.0,875000.0
12226,tt3374966,The Mafia Kills Only in Summer,2013,Pif,Wildside,29230,-1.0,37000.0
10058,tt3228088,Spark: A Space Tail,2016,Aaron Woodley,ToonBox Entertainment,32181,-1.0,196000.0
12624,tt2435458,Last Weekend,2014,Tom Dolby,Gran Via Productions,58207,-1.0,27000.0
8015,tt6257174,The Miseducation of Cameron Post,2018,Desiree Akhavan,Beachside Films,5354,-1.0,905000.0


Now for dropping duplicates:

In [21]:
revenue_df = revenue_df.drop_duplicates()

In [22]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14431 entries, 0 to 14598
Data columns (total 8 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null object
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
dtypes: float64(2), object(6)
memory usage: 1014.7+ KB


In [23]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
14233,tt0117132,A Girl Called Rosemary,1996,Bernd Eichinger,Constantin Film,84954,-1.0,4000.0
1464,tt0289879,The Butterfly Effect,2004,Eric Bress,BenderSpink,1627,13000000.0,58000000.0
14398,tt0439467,The Big Question,2004,Francesco Cabras,Ganga,285077,-1.0,2000.0


In [24]:
def calc_revenue(df):
    return df['us_gross'] - df['budget_usd'] 

In [25]:
revenue_df['revenue'] = revenue_df[revenue_df['budget_usd'] > 0].apply(calc_revenue, axis=1)

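I added a revenue column after being advised that it would make a better metric. Here 'revenue' is us_gross minus budget_usd, so it is only computed for rows with a known budget (budget_usd > 0); the remaining rows are left as NaN.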
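
The same calculation can be done without a row-wise `apply`. The following is a minimal sketch on three rows (two taken from the samples in this notebook), assuming the `-1` placeholder marks an unknown budget; the name `known` is a hypothetical helper.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['John Carter', 'Only Yesterday', 'That Awkward Moment'],
    'budget_usd': [250e6, -1.0, 8e6],
    'us_gross': [73e6, 453e3, 26e6],
})

# True where the budget is known (the -1 placeholder means unknown)
known = toy['budget_usd'] > 0

# subtract whole columns at once, then blank out rows with an unknown budget
toy['revenue'] = (toy['us_gross'] - toy['budget_usd']).where(known)
```

`Series.where` keeps values where the condition holds and writes `NaN` elsewhere, which matches the result of the filtered `apply`.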

In [26]:
revenue_df[revenue_df['budget_usd'] > 0].sort_values('revenue').head(2)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue
1107,tt0401729,John Carter,2012,Andrew Stanton,Walt Disney Pictures,1817,250000000.0,73000000.0,-177000000.0
1271,tt1440129,Battleship,2012,Peter Berg,Universal Pictures,3013,209000000.0,65000000.0,-144000000.0


Change rank type from string to int

In [27]:
def fix_rank(value):
    value = value.replace(',', '')
    return int(value)

In [28]:
revenue_df['rank'] = revenue_df['rank'].apply(fix_rank)
# compute the 10th-percentile cutoff once instead of once per row
rank_cutoff = revenue_df['rank'].quantile(.1)
revenue_df['popular'] = revenue_df['rank'] < rank_cutoff

In [29]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
8893,tt0102587,Only Yesterday,1991,Isao Takahata,Nippon Television Network (NTV),5706,-1.0,453000.0,,False
7478,tt5974460,The Women's Balcony,2016,Emil Ben-Shimon,Pie Films,77579,-1.0,1200000.0,,False
2974,tt1800246,That Awkward Moment,2014,Tom Gormican,Treehouse Pictures,4811,8000000.0,26000000.0,18000000.0,False


### Save as CSV

In [30]:
revenue_save_path = os.path.join(os.pardir, 'data', 'processed', 'revenue.csv')
revenue_df.to_csv(revenue_save_path, index=False)

In [31]:
test_revenue_save = pd.read_csv(revenue_save_path)
test_revenue_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
8863,tt6335734,"The Night Is Short, Walk on Girl",2017,Masaaki Yuasa,Science Saru,16510,-1.0,406000.0,,False
3012,tt0884328,The Mist,2007,Frank Darabont,Dimension Films,2216,18000000.0,26000000.0,8000000.0,True
2734,tt2247476,When the Game Stands Tall,2014,Thomas Carter,Affirm Films,16032,15000000.0,30000000.0,15000000.0,False


## 2. Genre:
For genre we will need a dataset that lists each movie and its genre. To analyze the success of a genre, we will need to examine the relationship between genre and the revenue earned.

Bringing in the list of movie titles:

In [32]:
titles_path = os.path.join(os.pardir, 'data', 'raw', 'movies.csv')

In [33]:
genres_df = pd.read_csv(titles_path)
genres_df.head()

Unnamed: 0,tconst,primaryTitle,startYear,genres
0,tt0000009,Miss Jerry,1894,Romance
1,tt0000147,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport"
2,tt0000335,Soldiers of the Cross,1900,"Biography,Drama"
3,tt0000502,Bohemios,1905,\N
4,tt0000574,The Story of the Kelly Gang,1906,"Biography,Crime,Drama"
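
Since the genres column stores several genres in one comma-separated string, relating a single genre to revenue will probably require one row per movie and genre at some point. A minimal sketch of that reshaping, assuming a pandas version that provides `DataFrame.explode` (0.25 or later); `toy` and `long_genres` are illustrative names.

```python
import pandas as pd

# two rows copied from the head() output above
toy = pd.DataFrame({
    'tconst': ['tt0000147', 'tt0000335'],
    'genres': ['Documentary,News,Sport', 'Biography,Drama'],
})

# split the comma-separated string into a list, then emit one row per list item
long_genres = (
    toy.assign(genre=toy['genres'].str.split(','))
       .explode('genre')
       .drop(columns='genres')
       .reset_index(drop=True)
)
```

This is not done in the scrubbing phase here; it is only a sketch of what the EDA phase might need.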


In [34]:
genres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545821 entries, 0 to 545820
Data columns (total 4 columns):
tconst          545821 non-null object
primaryTitle    545821 non-null object
startYear       545821 non-null object
genres          545821 non-null object
dtypes: object(4)
memory usage: 16.7+ MB


### Changes:
Looking at the initial dataframe, these are the things I would like to change:
1. Change column names
2. Drop original_title and runtime_minutes columns (neither appears in this file, so there is nothing to drop)

In [35]:
genres_df = genres_df.rename(columns={'tconst': 'imdb_id', 'primaryTitle': 'title', 'startYear': 'year'})

In [36]:
genres_df.sample(3)

Unnamed: 0,imdb_id,title,year,genres
466688,tt6259456,Dismembering the Band,\N,Horror
159692,tt0286702,The House Next Door,2002,"Drama,Thriller"
373770,tt3185948,Flow,2011,Documentary


That looks good. Let me deal with NaNs:

In [37]:
genres_df.isna().sum()

imdb_id    0
title      0
year       0
genres     0
dtype: int64
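
Note that the zero counts do not mean the data is complete: the samples above show the literal string `\N` where a year or genre is missing, and `isna()` does not recognise it. One way to surface those values, sketched on an in-memory copy of two rows, is the `na_values` parameter of `pd.read_csv`. The inline CSV text is an assumption standing in for the real file.

```python
import io
import pandas as pd

# two rows from the head() output, written as CSV text
raw = io.StringIO(
    'tconst,primaryTitle,startYear,genres\n'
    'tt0000009,Miss Jerry,1894,Romance\n'
    'tt0000502,Bohemios,1905,\\N\n'
)

# treat the literal string \N as a missing value while reading
toy = pd.read_csv(raw, na_values=['\\N'])
missing = toy.isna().sum()
```

After this, `missing['genres']` counts the placeholder row instead of reporting zero.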

### Save as CSV

In [38]:
genres_save_path = os.path.join(os.pardir, 'data', 'processed', 'genres.csv')
genres_df.to_csv(genres_save_path, index=False)

In [39]:
test_genres_save = pd.read_csv(genres_save_path)
test_genres_save.sample(3)

Unnamed: 0,imdb_id,title,year,genres
345745,tt2332544,Darwin et la science de l'évolution,2002,Documentary
397594,tt3854280,Baby Lu,\N,Drama
96658,tt0148842,Shikast,1953,Drama


## 3. Actors
These columns will be key in identifying the people who have the ability to produce high quality work on a consistent basis.

In [40]:
people_path = os.path.join(os.pardir, 'data', 'raw', 'imdb.name.basics.csv')
people_df = pd.read_csv(people_path)

In [41]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
499694,nm7808566,Joanne Dawn,,,actor,"tt8318000,tt5291938,tt5300206,tt5440598"
367882,nm4630906,José Balado,,,"producer,sound_department,director","tt7439428,tt5532044,tt4537412,tt5532036"
243162,nm5163011,Sherice Griffiths,,,"miscellaneous,assistant_director,director","tt3969108,tt6664774,tt0974015,tt2249786"


### Changes:
Some cleanup tasks:
1. Change name of primary_name column to 'name'
2. Select all the actors and actresses
3. Drop birth_year, death_year, known_for_titles

In [42]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
323402,nm5061485,Javier Martín-Domínguez,,,"director,actor","tt0390699,tt0085880,tt0407419,tt2292850"
310989,nm3276178,Dominique Keller,,,"producer,director,writer","tt1398920,tt2121824,tt5618812,tt7343524"
410235,nm7193938,Josefin Finér,,,,tt1781874


In [43]:
people_df = people_df.rename(columns={'primary_name': 'name'})

In [44]:
def can_act(professions):
    if type(professions) != str:
        return False
    if 'actor' in professions or 'actress' in professions:
        return True
    else:
        return False

In [45]:
people_df['can_act'] = people_df['primary_profession'].apply(can_act)
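
For comparison, the same flag can be built with pandas string methods rather than a Python function. A minimal sketch on a made-up four-row frame, assuming professions are stored as comma-separated strings as in the samples above:

```python
import pandas as pd

toy = pd.DataFrame({
    'primary_profession': ['actor', 'producer,director,writer', None, 'actress,writer'],
})

# regex alternation matches either word; na=False treats missing professions as False
toy['can_act'] = toy['primary_profession'].str.contains('actor|actress', na=False)
```

The `na=False` argument plays the role of the type check in `can_act`.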

In [46]:
people_df.sample(3)

Unnamed: 0,nconst,name,birth_year,death_year,primary_profession,known_for_titles,can_act
274845,nm4300329,Jason 'Bamba' Anderson,,,"producer,actor,writer","tt3586562,tt4458480,tt3239564,tt1832973",True
472010,nm7425076,Gilberto Martins,,,actor,tt4815352,True
474027,nm9845078,Boycut,,,actor,,True


Okay, we will grab all the actors and actresses and put them in their own dataframe:

In [47]:
actors_df = people_df[people_df['can_act']]

And now we can drop the unwanted columns:

In [48]:
drop_columns = ['primary_profession', 'can_act', 'birth_year', 'death_year', 'known_for_titles']
actors_df = actors_df.drop(columns=drop_columns)

In [49]:
actors_df.sample(3)

Unnamed: 0,nconst,name
168547,nm2178489,Sorab Wadia
299700,nm4467597,Stuart J. Prowse
484884,nm8497364,Joe Oviguian


Let's check for missing values:

In [50]:
actors_df.isna().sum()

nconst    0
name      0
dtype: int64

There we go. A very large list of actors and actresses. We can join them to the titles and see if there are any patterns amongst the top performing titles.

### Save as CSV

In [51]:
actors_save_path = os.path.join(os.pardir, 'data', 'processed', 'actors.csv')
actors_df.to_csv(actors_save_path, index=False)

In [52]:
test_actors_save = pd.read_csv(actors_save_path)
test_actors_save.sample(3)

Unnamed: 0,nconst,name
255707,nm8847811,Vicky Chen
38102,nm0603268,Anders Mordal
198945,nm4758886,Jewels Lubin


## 4. Time of Year (date)
Time of year will be an important metric to discover the most opportune time to release a film.

In [53]:
date_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_movies.csv')
date_df = pd.read_csv(date_path)

In [54]:
date_df.sample(3)

Unnamed: 0,imdbId,budget,revenue,originalTitle,releaseDate
23005,tt0250202,0,1051948.0,All Over the Guy,2001-08-10
40169,tt1483756,0,0.0,The Trouble with Bliss,2011-10-01
9333,tt0386032,9000000,24538513.0,Sicko,2007-05-18


### Changes:
We only need a couple of columns from this set:
1. imdb_id
2. release_date

The column names only need converting from camelCase to snake_case, so this will be very simple.

In [55]:
date_df = date_df.drop_duplicates()

In [56]:
date_df = date_df.rename(columns={'imdbId': 'imdb_id', 'originalTitle': 'title', 'releaseDate': 'date'})

In [57]:
date_df = date_df[['imdb_id', 'date']]

In [58]:
date_df = date_df.dropna()

In [59]:
date_df.sample(3)

Unnamed: 0,imdb_id,date
2227,tt1302067,2010-12-17
42401,tt1348318,2009-09-22
34438,tt1780762,2012-09-12


In [60]:
date_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14466 entries, 0 to 43495
Data columns (total 2 columns):
imdb_id    14466 non-null object
date       14466 non-null object
dtypes: object(2)
memory usage: 339.0+ KB


In [61]:
date_df.isna().sum()

imdb_id    0
date       0
dtype: int64
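
The date column is still a string at this point. To study time of year, the EDA phase will likely need a real datetime and a month. A minimal sketch of that conversion on three dates from the sample above; the `month` column is a hypothetical name, and the format string assumes every date looks like `YYYY-MM-DD`.

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_id': ['tt1302067', 'tt1348318', 'tt1780762'],
    'date': ['2010-12-17', '2009-09-22', '2012-09-12'],
})

# parse the strings into datetimes, then pull out the calendar month
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')
toy['month'] = toy['date'].dt.month
```

Because the CSV round trip stores dates as text again, the parsing would need to be repeated (or `parse_dates` passed to `pd.read_csv`) when the file is reloaded.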

### Save to CSV

In [62]:
date_save_path = os.path.join(os.pardir, 'data', 'processed', 'date.csv')
date_df.to_csv(date_save_path, index=False)

In [63]:
test_date_save = pd.read_csv(date_save_path)
test_date_save.sample(3)

Unnamed: 0,imdb_id,date
9977,tt2205948,2012-10-05
4144,tt1129445,2009-10-22
8518,tt0285879,2003-08-17


## 5. Keywords (content)

In [64]:
keywords_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_keywords.csv')
keywords_df = pd.read_csv(keywords_path)

In [65]:
keywords_df.sample(3)

Unnamed: 0,imdbId,keywordId,keyword
142874,tt0118900,9826,murder
72286,tt0046247,10520,slave auction
169387,tt0096180,1416,jazz


In [66]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257984 entries, 0 to 257983
Data columns (total 3 columns):
imdbId       257984 non-null object
keywordId    257984 non-null int64
keyword      257984 non-null object
dtypes: int64(1), object(2)
memory usage: 5.9+ MB


This is a simple dataframe; when I created it I knew exactly which columns I would use.

I do need to change the column names from camelCase to snake_case (Node.js uses camelCase):

In [67]:
keywords_df = keywords_df.rename(columns={'imdbId': 'imdb_id', 'keywordId': 'keyword_id'})

In [68]:
keywords_df.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
202552,tt1814836,2282,love of animals
176139,tt1974419,14818,model
71029,tt3231054,3571,crucifixion


In [69]:
keywords_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
257979     True
257980     True
257981     True
257982     True
257983     True
Length: 257984, dtype: bool
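
The output above shows that some rows are flagged as duplicates, yet no cell in this notebook removes them. If they should be dropped before saving, a minimal sketch on a made-up three-row frame could look like this (`n_dupes` and `deduped` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    'imdb_id': ['tt0118900', 'tt0096180', 'tt0118900'],
    'keyword_id': [9826, 1416, 9826],
    'keyword': ['murder', 'jazz', 'murder'],
})

# count fully repeated rows, then keep only the first copy of each
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Counting first makes it easy to confirm how many rows the cleanup removes.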

In [70]:
keywords_df.isna().sum()

imdb_id       0
keyword_id    0
keyword       0
dtype: int64

### Save to CSV

In [71]:
keywords_save_path = os.path.join(os.pardir, 'data', 'processed', 'keywords.csv')
keywords_df.to_csv(keywords_save_path, index=False)

In [72]:
test_keywords_save = pd.read_csv(keywords_save_path)
test_keywords_save.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
79686,tt2053463,9665,cover-up
221625,tt0089034,10714,serial killer
230190,tt0120263,236316,anarchic comedy


## Building Dataset
In this section I will combine all the individual datasets into one large dataframe that I can explore in the EDA phase. 

I will keep the actors and keywords separate for now so they don't explode the dataframe.

In [73]:
# joining revenue with genres:
combined_df = revenue_df.set_index('imdb_id').join(genres_df.set_index('imdb_id'), rsuffix='_rev')
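
The join above is a left join on the index: every revenue row is kept and the genre columns are filled with NaN where no match exists. An equivalent formulation with `DataFrame.merge`, sketched on toy frames, selects only the needed column from the right side so that no suffixed duplicates appear. The second id and title in `revenue_toy` are made up to show the unmatched case.

```python
import pandas as pd

revenue_toy = pd.DataFrame({
    'imdb_id': ['tt0499549', 'tt9999999'],   # second id is made up
    'title': ['Avatar', 'Unmatched Title'],
})
genres_toy = pd.DataFrame({
    'imdb_id': ['tt0499549'],
    'title': ['Avatar'],
    'genres': ['Action,Adventure,Fantasy'],
})

# left merge keeps every revenue row; validate raises if an id repeats on the right
merged = revenue_toy.merge(
    genres_toy[['imdb_id', 'genres']],
    on='imdb_id', how='left', validate='many_to_one',
)
```

The `validate` argument is optional, but it guards against a silent row blow-up if the genre table ever contains repeated ids.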

In [74]:
combined_df.head(3)

Unnamed: 0_level_0,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,title_rev,year_rev,genres
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,519,245000000.0,937000000.0,692000000.0,True,Star Wars: Episode VII - The Force Awakens,2015,"Action,Adventure,Sci-Fi"
tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,111,356000000.0,858000000.0,502000000.0,True,Avengers: Endgame,2019,"Action,Adventure,Drama"
tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,533,237000000.0,761000000.0,524000000.0,True,Avatar,2009,"Action,Adventure,Fantasy"


In [75]:
combined_df = combined_df.drop(columns=['title_rev', 'year_rev'])

In [76]:
combined_df = combined_df.reset_index()

In [77]:
# adding in time of year next:
combined_df = combined_df.set_index('imdb_id').join(date_df.set_index('imdb_id')).reset_index()

In [78]:
combined_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date
3142,tt0327437,Around the World in 80 Days,2004,Frank Coraci,Walden Media,7403,110000000.0,24000000.0,-86000000.0,False,"Action,Adventure,Comedy",2004-06-16
13366,tt1308165,The Taqwacores,2010,Eyad Zahra,Rumanni Filmworks,119019,-1.0,11000.0,,False,"Drama,Music",2010-01-24
9758,tt5866930,The Adventurers,2017,Stephen Fung,Gravity Pictures,31684,-1.0,217000.0,,False,"Action,Adventure,Crime",2017-08-11
5447,tt0401383,The Diving Bell and the Butterfly,2007,Julian Schnabel,Pathé Renn Productions,4053,-1.0,6000000.0,,False,"Biography,Drama",2007-05-23
7622,tt0990413,Sugar,2008,Anna Boden,HBO Films,32329,-1.0,1100000.0,,False,"Drama,Sport",2008-04-03


In [79]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14431 entries, 0 to 14430
Data columns (total 12 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null int64
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
revenue          7896 non-null float64
popular          14431 non-null bool
genres           14317 non-null object
date             14375 non-null object
dtypes: bool(1), float64(3), int64(1), object(7)
memory usage: 1.2+ MB


In [80]:
combined_df = combined_df.dropna()

In [81]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 12 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null object
dtypes: bool(1), float64(3), int64(1), object(7)
memory usage: 745.0+ KB


### Save to CSV

In [82]:
combined_save_path = os.path.join(os.pardir, 'data', 'processed', 'combined.csv')
combined_df.to_csv(combined_save_path, index=False)

In [83]:
test_combined_save = pd.read_csv(combined_save_path)
test_combined_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date
2201,tt0191754,28 Days,2000,Betty Thomas,Columbia Pictures,5530,43000000.0,37000000.0,-6000000.0,False,"Comedy,Drama",2000-04-06
5854,tt1313092,Animal Kingdom,2010,David Michôd,Porchlight Films,3901,2902084.0,1000000.0,-1902084.0,False,"Crime,Drama",2010-06-03
1695,tt0299172,Home on the Range,2004,Will Finn,Walt Disney Pictures,8010,110000000.0,50000000.0,-60000000.0,False,"Adventure,Animation,Comedy",2004-04-02


In [84]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 12 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null object
dtypes: bool(1), float64(3), int64(1), object(7)
memory usage: 745.0+ KB
