# Movie Analysis: Data Scrubbing

## About:
In the data scrubbing phase I will focus on cleaning up the columns I plan on using, and building up the data frame I will use for the EDA phase:

1. US Gross Revenue
2. Genre
3. Actors
4. Time of Year (date)
5. Keywords (content)
6. Combined

### Project imports:

In [1]:
# imports for the entire data scrubbing phase
import pandas as pd
import os

## 1. US Gross Revenue
This column is the yardstick we will measure the other columns against, so we will start here and drop any rows that don't have this information.

In [2]:
revenue_path = os.path.join(os.pardir, 'data', 'interim', 'money.csv')
revenue_df = pd.read_csv(revenue_path)

In [3]:
revenue_df.head()

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
0,tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,[US],519,$245MM,$937MM
1,tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,[US],111,$356MM,$858MM
2,tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,[US],533,$237MM,$761MM
3,tt1825683,Black Panther,2018,Ryan Coogler,Marvel Studios,[US],269,$200MM,$700MM
4,tt4154756,Avengers: Infinity War,2018,Anthony Russo,Marvel Studios,[US],376,$321MM,$679MM


In [4]:
revenue_df.loc[revenue_df['imdb_id'] == 'tt0091605']

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
5194,tt0091605,The Name of the Rose,1986,Jean-Jacques Annaud,Constantin Film,[DE],2485,ITL 30B,$7.2MM


In [5]:
revenue_df = revenue_df[:-1]

### Changes:

1. Convert 'us_gross' and 'budget_usd' values into floats. That means stripping out the non-numeric characters and expanding magnitude suffixes (for example, 'MM' becomes ',000,000').
2. Convert year column to int, so the years don't have the trailing .0.
3. region_code does not need the brackets around the abbreviations.

In [6]:
# Created 3/22/2020 with the exchange rates current on that date. Values are not adjusted for the date the movie was created.
def get_conversion_rate(value):
    """Get exchange rate for given currency code
    
    Arguments:
        value (string): String with currency code or symbol in it

    Returns:
        rate (float): Conversion rate to usd
    """
    if '£' in value:
        return 0.854
    elif '€' in value:
        return 0.9334
    elif 'AUD' in value:
        return 1.7229
    elif 'CAD' in value:
        return 1.435
    elif 'FRF' in value:
        return 6.55957 * 0.9334
    elif 'INR' in value:
        return 75.394
    elif 'THB' in value:
        return 32.68
    elif 'EM' in value:
        return 1 # cant find info on EM
    elif 'JPY' in value:
        return 110.75
    elif 'SKW' in value:
        return 1254.45
    elif 'HUF' in value:
        return 327.94
    elif 'NGN' in value:
        return 364
    elif 'CNY' in value:
        return 7.0950
    elif 'ESP' in value:
        return 155.42826
    elif 'RUR' in value:
        return 79.87
    elif 'HKD' in value:
        return 7.7570
    elif 'ISK' in value:
        return 140.490
    elif 'PHP' in value:
        return 51.19
    elif 'DKK' in value:
        return 6.9716
    elif 'CZK' in value:
        return 25.5620
    elif 'SKK' in value:
        return 10.3753
    elif 'NOK' in value:
        return 11.7890
    elif 'MXN' in value:
        return 24.4215
    elif 'JMD' in value:
        return 135.07
    elif 'PLN' in value:
        return 4.23
    elif 'KRW' in value:
        return 1228.97
    elif 'ITL' in value:
        return 1804.64
    else:
        return 1

In [7]:
def strip_currency_code(value):
    """Strips currency code from front of currency string

    Arguments: 
        value (string): currency amount prefaced with currency code
    
    Returns:
        value (string): value without the currency code
    """
    if value[:1] in '$£€':
        return value[1:]
    else:
        return value[3:]

In [8]:
def convert_money(value):
    """Takes currency string and parses it into correct amount in USD
    
    Arguments:
        value (string): currency in form: CAD 345.3B 

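    Returns:
        value (float): currency converted to USD and in standard numeric form,
            or None if the input is not a string (e.g. NaN)
    """
    # type check: missing values arrive as float NaN, so skip anything that is not a string
    if not isinstance(value, str):
        return None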
    
    # check currency sign and get coefficient
    coef = get_conversion_rate(value)
    value = strip_currency_code(value)
    if 'K' in value:
        value = (float(value.strip('K')) * 1000) / coef
    elif 'MM' in value:
        value = (float(value.strip('MM')) * 1000000) / coef
    elif 'B' in value:
        value = (float(value.strip('B')) * 1000000000) / coef
    else:
        value = float(value.strip()) / coef
    return value
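
As an aside, the three helpers above could be condensed into one table-driven parser. The following is only a sketch under my own assumptions: `parse_money`, `RATES_PER_USD`, `SUFFIXES` and `MONEY_RE` are hypothetical names, the rate table copies just a handful of the values from `get_conversion_rate`, and unlike the original it returns `None` for an unknown currency instead of treating it as USD.

```python
import re

# Hypothetical rate table (units of foreign currency per 1 USD).
# Only a few of the rates from get_conversion_rate are copied here.
RATES_PER_USD = {'$': 1.0, '£': 0.854, '€': 0.9334, 'CAD': 1.435, 'ITL': 1804.64}

# Multipliers for the magnitude suffixes used in the scraped strings.
SUFFIXES = {'': 1, 'K': 1e3, 'MM': 1e6, 'B': 1e9}

# Currency prefix, then a plain decimal number, then an optional suffix.
MONEY_RE = re.compile(r'^\s*(\D+?)\s*(\d+(?:\.\d+)?)\s*(K|MM|B)?\s*$')


def parse_money(value):
    """Return the amount in USD as a float, or None if it cannot be parsed."""
    if not isinstance(value, str):
        return None  # NaN and other non-strings
    match = MONEY_RE.match(value)
    if match is None:
        return None  # string does not look like '<code><number><suffix>'
    code, number, suffix = match.groups()
    rate = RATES_PER_USD.get(code)
    if rate is None:
        return None  # unknown currency: refuse to guess rather than assume USD
    return float(number) * SUFFIXES[suffix or ''] / rate


print(parse_money('$245MM'))   # a USD amount with the 'MM' suffix
print(parse_money('ITL 30B'))  # a three-letter code followed by a space
```

The trade-off is that unparseable or unknown inputs surface as missing values, which makes silent mis-conversions easier to spot later with `isna()`.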

In [9]:
revenue_df['us_gross'] = revenue_df['us_gross'].apply(convert_money)

In [10]:
revenue_df['us_gross']

0        937000000.0
1        858000000.0
2        761000000.0
3        700000000.0
4        679000000.0
            ...     
14696            NaN
14697            NaN
14698            NaN
14699            NaN
14700            NaN
Name: us_gross, Length: 14701, dtype: float64

In [11]:
revenue_df['budget_usd'] = revenue_df['budget_usd'].apply(convert_money)

In [12]:
revenue_df['budget_usd'].isna().sum()

6705

In [13]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
3426,tt0085334,A Christmas Story,1983,Bob Clark,Metro-Goldwyn-Mayer (MGM),[US],7231,3300000.0,21000000.0
10010,tt0473753,Angel-A,2005,Luc Besson,EuropaCorp,[FR],13972,16070280.0,203000.0
3209,tt0088885,The Care Bears Movie,1985,Arna Selznick,Nelvana,[CA],28304,2000000.0,23000000.0
4290,tt0376136,The Rum Diary,2011,Bruce Robinson,GK Films,[GB],5713,45000000.0,13000000.0
8999,tt1214962,Seeking Justice,2011,Roger Donaldson,Endgame Entertainment,[US],12847,17000000.0,412000.0


Now for region code. We actually don't need this column so we will drop it.

In [14]:
revenue_df.drop(columns='region_code', inplace=True)

For the 'year' column we will go ahead and drop the rows with missing values, because there are only 16 of them.

In [15]:
revenue_df.isna().sum()

imdb_id             1
title               1
year               16
director           30
production_co     356
rank               13
budget_usd       6705
us_gross          103
dtype: int64

Cleaning up NaN values:

In [16]:
# first, change the missing budget values to -1, so we don't drop the 6705 rows that lack a budget.
revenue_df['budget_usd'] = revenue_df['budget_usd'].fillna(-1)

In [17]:
# also, fill in the production_co missing values with an 'Unknown'
revenue_df['production_co'] = revenue_df['production_co'].fillna('Unknown')

In [18]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14701 entries, 0 to 14700
Data columns (total 8 columns):
imdb_id          14700 non-null object
title            14700 non-null object
year             14685 non-null object
director         14671 non-null object
production_co    14701 non-null object
rank             14688 non-null object
budget_usd       14701 non-null float64
us_gross         14598 non-null float64
dtypes: float64(2), object(6)
memory usage: 918.9+ KB


In [19]:
revenue_df = revenue_df.dropna()
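
An alternative to the -1 sentinel would be to leave the budget as NaN and tell `dropna` which columns are actually required. This is a sketch on a toy frame, not what the notebook does; `toy`, `required` and `cleaned` are made-up names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for revenue_df: one row lacks a budget, one lacks id and gross.
toy = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', None],
    'budget_usd': [1.0e6, np.nan, 2.0e6],
    'us_gross': [5.0e6, 3.0e6, np.nan],
})

# Only these columns must be present; budget_usd is allowed to stay missing.
required = ['imdb_id', 'us_gross']
cleaned = toy.dropna(subset=required)

print(cleaned)
```

Keeping NaN avoids the risk of the -1 placeholder leaking into sums or averages during EDA, at the cost of having to remember the `subset` argument.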

In [20]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
8620,tt3110960,Jimmy's Hall,2014,Ken Loach,Sixteen Films,20527,-1.0,561000.0
13381,tt0148421,Looking for an Echo,2000,Martin Davidson,Echo Productions,110129,-1.0,13000.0
9211,tt0118843,"Black Cat, White Cat",1998,Emir Kusturica,CiBy 2000,10546,-1.0,351000.0
10922,tt0446747,Mutual Appreciation,2005,Andrew Bujalski,Mutual Appreciation LLC,101865,-1.0,104000.0
10988,tt0341569,Self Medicated,2005,Monty Lapica,Promise Pictures,179570,400000.0,101000.0


Now for dropping duplicates:

In [21]:
revenue_df = revenue_df.drop_duplicates()

In [22]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14431 entries, 0 to 14598
Data columns (total 8 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null object
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
dtypes: float64(2), object(6)
memory usage: 1014.7+ KB


In [23]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
3690,tt0462519,School for Scoundrels,2006,Todd Phillips,Dimension Films,19341,35000000.0,18000000.0
11592,tt7690016,Never Goin' Back,2018,Augustine Frizzell,Sailor Bear,15528,-1.0,61000.0
1458,tt0364751,Without a Paddle,2004,Steven Brill,Paramount Pictures,5268,19000000.0,58000000.0


In [24]:
def calc_revenue(df):
    return df['us_gross'] - df['budget_usd'] 

In [25]:
revenue_df['revenue'] = revenue_df[revenue_df['budget_usd'] > 0].apply(calc_revenue, axis=1)

Added a revenue column (US gross minus budget, calculated only for rows with a known budget) after being advised that it would make a much better metric.
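
The same calculation can be written without a row-wise `apply`. A minimal sketch on a toy frame (the frame and the `known_budget` name are mine): subtract the columns directly and use `where` to blank out rows whose budget is the -1 placeholder.

```python
import pandas as pd

# Toy rows: a known budget, the -1 placeholder, and another known budget.
toy = pd.DataFrame({
    'budget_usd': [250_000_000.0, -1.0, 19_000_000.0],
    'us_gross': [73_000_000.0, 48_000_000.0, 58_000_000.0],
})

# Rows with a real budget; the rest get NaN in the new column.
known_budget = toy['budget_usd'] > 0
toy['revenue'] = (toy['us_gross'] - toy['budget_usd']).where(known_budget)

print(toy)
```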

In [26]:
revenue_df[revenue_df['budget_usd'] > 0].sort_values('revenue').head(2)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue
1107,tt0401729,John Carter,2012,Andrew Stanton,Walt Disney Pictures,1817,250000000.0,73000000.0,-177000000.0
1271,tt1440129,Battleship,2012,Peter Berg,Universal Pictures,3013,209000000.0,65000000.0,-144000000.0


Change the rank type from string to int:

In [27]:
def fix_rank(value):
    value = value.replace(',', '')
    return int(value)

In [28]:
revenue_df['rank'] = revenue_df['rank'].apply(fix_rank)
# popular = rank value in the lowest 10% of all ranks; the threshold is computed once
revenue_df['popular'] = revenue_df['rank'] < revenue_df['rank'].quantile(.1)

In [29]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
10281,tt4383288,"Polina, danser sa vie",2016,Valérie Müller,Everybody on the Deck,41395,-1.0,165000.0,,False
1808,tt0100050,Look Who's Talking Too,1990,Amy Heckerling,TriStar Pictures,12804,-1.0,48000000.0,,False
3837,tt0488658,Unaccompanied Minors,2006,Paul Feig,Warner Bros.,19415,25000000.0,17000000.0,-8000000.0,False


### Save as CSV

In [30]:
revenue_save_path = os.path.join(os.pardir, 'data', 'processed', 'revenue.csv')
revenue_df.to_csv(revenue_save_path, index=False)

In [31]:
test_revenue_save = pd.read_csv(revenue_save_path)
test_revenue_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
1164,tt0162650,Shaft,2000,John Singleton,Paramount Pictures,3704,46000000.0,70000000.0,24000000.0,False
6145,tt0102913,Shout,1991,Jeffrey Hornaday,Robert Simonds Productions,27536,-1.0,3500000.0,,False
1674,tt1179891,My Bloody Valentine,2009,Patrick Lussier,Lionsgate,5156,15000000.0,52000000.0,37000000.0,False


## 2. Genre:
For genre we will need a dataset that lists each movie and its genre. To analyze the success of the genre, we will need to examine the relationship of genre to the revenue earned.

Bringing in the list of movie titles:

In [32]:
titles_path = os.path.join(os.pardir, 'data', 'raw', 'movies.csv')

In [33]:
genres_df = pd.read_csv(titles_path)
genres_df.head()

Unnamed: 0,tconst,primaryTitle,startYear,genres
0,tt0000009,Miss Jerry,1894,Romance
1,tt0000147,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport"
2,tt0000335,Soldiers of the Cross,1900,"Biography,Drama"
3,tt0000502,Bohemios,1905,\N
4,tt0000574,The Story of the Kelly Gang,1906,"Biography,Crime,Drama"


In [34]:
genres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545821 entries, 0 to 545820
Data columns (total 4 columns):
tconst          545821 non-null object
primaryTitle    545821 non-null object
startYear       545821 non-null object
genres          545821 non-null object
dtypes: object(4)
memory usage: 16.7+ MB


### Changes:
Looking at the initial dataframe, these are the things I would like to change:
1. Change column names
2. Drop the original_title and runtime_minutes columns (this extract only has four columns and no longer contains them, so only the rename is needed)

In [35]:
genres_df = genres_df.rename(columns={'tconst': 'imdb_id', 'primaryTitle': 'title', 'startYear': 'year'})

In [36]:
genres_df.sample(3)

Unnamed: 0,imdb_id,title,year,genres
211020,tt0469695,You Bet Your Life,2005,"Drama,Thriller"
320428,tt1871282,Gotta Have Faith,\N,Comedy
536397,tt9381904,Tough Guy,\N,Drama


That looks good. Let me deal with NaNs:

In [37]:
genres_df.isna().sum()

imdb_id    0
title      0
year       0
genres     0
dtype: int64
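
The zero counts above deserve a second look: the samples show `\N` placeholders in both year and genres, and pandas reads those as ordinary strings, so `isna()` does not see them. One way to surface them, sketched here on a small in-memory CSV (the rows are copied from the samples above; `raw`, `toy` and `missing` are my names), is to pass `na_values` to `read_csv`.

```python
import io

import pandas as pd

# A few rows in the same shape as movies.csv, including the \N placeholder.
raw = io.StringIO(
    'tconst,primaryTitle,startYear,genres\n'
    'tt0000009,Miss Jerry,1894,Romance\n'
    'tt0000502,Bohemios,1905,\\N\n'
    'tt1871282,Gotta Have Faith,\\N,Comedy\n'
)

# Treat the literal two-character string \N as a missing value.
toy = pd.read_csv(raw, na_values=[r'\N'])
missing = toy.isna().sum()

print(missing)
```

Whether to drop or keep those rows afterwards is a separate decision; the point is only that they become visible to the usual missing-value checks.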

### Save as CSV

In [38]:
genres_save_path = os.path.join(os.pardir, 'data', 'processed', 'genres.csv')
genres_df.to_csv(genres_save_path, index=False)

In [39]:
test_genres_save = pd.read_csv(genres_save_path)
test_genres_save.sample(3)

Unnamed: 0,imdb_id,title,year,genres
434896,tt5139324,Peter Pacifica,2015,"Documentary,Sport"
521160,tt8637464,The Rise of Modern Cooking,\N,Documentary
116087,tt0189817,Nos va la marcha,1979,"Documentary,Musical"


## 3. Actors
These columns will be key in identifying the people who have the ability to produce high quality work on a consistent basis.

In [40]:
people_path = os.path.join(os.pardir, 'data', 'raw', 'imdb.name.basics.csv')
people_df = pd.read_csv(people_path)

In [41]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
90001,nm1519079,Guy Davidi,1978.0,,"director,cinematographer,editor","tt2127300,tt1811387,tt5131742,tt2125423"
89722,nm1303407,Deniz Özerman,1969.0,,actress,"tt0470883,tt5104676,tt4321754,tt1815871"
517456,nm6510424,Gregory Melville,,,producer,tt2505700


### Changes:
Some cleanup tasks:
1. Change name of primary_name column to 'name'
2. Select all the actors and actresses
3. Drop birth_year, death_year, known_for_titles

In [42]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
432766,nm5163762,Ramiro Ariza Castillo,,,"cinematographer,editor",tt2273418
502223,nm7649017,Ronnie Shafer,,,"actor,producer",tt4795624
328285,nm4425411,Skyler Vallo,,,"actress,soundtrack","tt4163224,tt0844441,tt1758795,tt0460627"


In [43]:
people_df = people_df.rename(columns={'primary_name': 'name'})

In [44]:
def can_act(professions):
    if type(professions) != str:
        return False
    if 'actor' in professions or 'actress' in professions:
        return True
    else:
        return False

In [45]:
people_df['can_act'] = people_df['primary_profession'].apply(can_act)
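
The substring test in `can_act` works for this data, but it would also match any profession that merely contains "actor". A stricter variant, sketched on a toy frame with hypothetical names (`acting_roles`, `tokens`), splits the comma-separated string and checks for exact tokens.

```python
import pandas as pd

# Toy stand-in for people_df, including a missing profession.
toy = pd.DataFrame({
    'primary_profession': ['actor,producer', 'director,writer', None, 'actress'],
})

acting_roles = {'actor', 'actress'}

# Missing values become '' so that every row splits into a list of tokens.
tokens = toy['primary_profession'].fillna('').str.split(',')

# True when at least one token is exactly 'actor' or 'actress'.
toy['can_act'] = tokens.apply(lambda roles: not acting_roles.isdisjoint(roles))

print(toy)
```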

In [46]:
people_df.sample(3)

Unnamed: 0,nconst,name,birth_year,death_year,primary_profession,known_for_titles,can_act
497105,nm8566085,Federico López-Schaper,,,"composer,actor,music_department","tt5609672,tt4947588,tt5609554,tt6487296",True
260751,nm4093330,James Cathcart,,,"director,actor,writer","tt1730311,tt2357469,tt2006791",True
386460,nm7707127,Maria Muro,,,actor,tt5175426,True


Okay, we will grab all the actors and actresses and make a separate dataframe for them:

In [47]:
actors_df = people_df[people_df['can_act']]

And now we can drop the unwanted columns:

In [48]:
drop_columns = ['primary_profession', 'can_act', 'birth_year', 'death_year', 'known_for_titles']
actors_df = actors_df.drop(columns=drop_columns)

In [49]:
actors_df.sample(3)

Unnamed: 0,nconst,name
370954,nm7069097,Norman Provost
568055,nm8812055,Shiu-Hang Kwok
581597,nm7953852,John David Givhan


Let's check for missing values:

In [50]:
actors_df.isna().sum()

nconst    0
name      0
dtype: int64

There we go. A very large list of actors and actresses. We can join them to the titles and see if there are any patterns amongst the top performing titles.

### Save as CSV

In [51]:
actors_save_path = os.path.join(os.pardir, 'data', 'processed', 'actors.csv')
actors_df.to_csv(actors_save_path, index=False)

In [52]:
test_actors_save = pd.read_csv(actors_save_path)
test_actors_save.sample(3)

Unnamed: 0,nconst,name
88962,nm4464725,Chetanya Adib
100744,nm1286701,Mohan Bhandari
116662,nm4598055,Matteo Lucchi


## 4. Time of Year (date)
Time of year will be an important metric to discover the most opportune time to release a film.

In [53]:
date_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_movies.csv')
date_df = pd.read_csv(date_path)

In [54]:
date_df.sample(3)

Unnamed: 0,imdbId,budget,revenue,originalTitle,releaseDate
12839,tt2042568,11000000,32935319.0,Inside Llewyn Davis,2013-10-13
5257,tt0431021,14000000,85446075.0,The Possession,2012-08-30
12893,tt0462244,6000000,18197398.0,Daddy Day Camp,2007-08-08


### Changes:
We only need a couple of columns from this set, plus one derived column:
1. imdb_id
2. release_date
3. month (derived from the release date)

The columns only need renaming from camelCase to snake_case, so this will be very simple.

In [55]:
date_df = date_df.drop_duplicates()

In [56]:
date_df = date_df.rename(columns={'imdbId': 'imdb_id', 'originalTitle': 'title', 'releaseDate': 'date'})

In [57]:
date_df = date_df[['imdb_id', 'date']]

In [58]:
date_df = date_df.dropna()

In [59]:
date_df.sample(3)

Unnamed: 0,imdb_id,date
23089,tt0125794,1997-11-27
10830,tt0071115,1974-12-09
30096,tt1528071,2013-09-06


In [60]:
# the dates are ISO formatted (YYYY-MM-DD), so no format hint is needed;
# infer_datetime_format is deprecated in pandas 2.x
date_df['date'] = pd.to_datetime(date_df['date'])

In [61]:
date_df['month'] = date_df['date'].dt.month_name()

In [62]:
date_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14466 entries, 0 to 43495
Data columns (total 3 columns):
imdb_id    14466 non-null object
date       14466 non-null datetime64[ns]
month      14466 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 452.1+ KB


In [63]:
date_df.isna().sum()

imdb_id    0
date       0
month      0
dtype: int64

### Save to CSV

In [64]:
date_save_path = os.path.join(os.pardir, 'data', 'processed', 'date.csv')
date_df.to_csv(date_save_path, index=False)

In [65]:
test_date_save = pd.read_csv(date_save_path)
test_date_save.sample(3)

Unnamed: 0,imdb_id,date,month
6268,tt1189073,2011-08-17,August
221,tt2381249,2015-07-23,July
2275,tt1616195,2011-11-09,November
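
One thing to keep in mind when reading this file back in the EDA phase: CSV has no date type, so a plain `read_csv` returns the `date` column as text. A small sketch, using an in-memory CSV in place of `date.csv` (the variable names are mine), shows the difference `parse_dates` makes.

```python
import io

import pandas as pd

# One row in the same shape as the saved date.csv.
csv_text = 'imdb_id,date,month\ntt1189073,2011-08-17,August\n'

# Default read: the date column comes back as strings.
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetime64 on load.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_text.dtypes)
print(as_dates.dtypes)
```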


## 5. Keywords (content)

In [66]:
keywords_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_keywords.csv')
keywords_df = pd.read_csv(keywords_path)

In [67]:
keywords_df.sample(3)

Unnamed: 0,imdbId,keywordId,keyword
208734,tt0118742,212642,abusive father
99707,tt0086197,207883,1940s
235668,tt4085084,1543,war veteran


In [68]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257984 entries, 0 to 257983
Data columns (total 3 columns):
imdbId       257984 non-null object
keywordId    257984 non-null int64
keyword      257984 non-null object
dtypes: int64(1), object(2)
memory usage: 5.9+ MB


This is a simple dataframe; when I created it, I knew exactly which columns I would use.

I do need to change the column names from camelCase to snake_case (node.js uses camelCase):

In [69]:
keywords_df = keywords_df.rename(columns={'imdbId': 'imdb_id', 'keywordId': 'keyword_id'})

In [70]:
keywords_df.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
121841,tt0089092,14906,slave
176695,tt4827986,818,based on novel or book
123663,tt1232783,12339,slasher


In [71]:
keywords_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
257979     True
257980     True
257981     True
257982     True
257983     True
Length: 257984, dtype: bool
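
The tail of the output above flags duplicate rows, and as far as this section shows nothing removes them before the file is saved. A minimal sketch on a toy frame (names are mine) of counting the duplicates and dropping them:

```python
import pandas as pd

# Toy stand-in for keywords_df; the last row repeats the first exactly.
toy = pd.DataFrame({
    'imdb_id': ['tt1', 'tt1', 'tt2', 'tt1'],
    'keyword_id': [818, 1543, 818, 818],
    'keyword': ['based on novel or book', 'war veteran',
                'based on novel or book', 'based on novel or book'],
})

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Summing the boolean series gives a single number to check, which is easier to read than the full True/False listing.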

In [72]:
keywords_df.isna().sum()

imdb_id       0
keyword_id    0
keyword       0
dtype: int64

### Save to CSV

In [73]:
keywords_save_path = os.path.join(os.pardir, 'data', 'processed', 'keywords.csv')
keywords_df.to_csv(keywords_save_path, index=False)

In [74]:
test_keywords_save = pd.read_csv(keywords_save_path)
test_keywords_save.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
5442,tt1408101,156395,imax
108018,tt0963794,208421,american tourist
122633,tt0105483,162710,culture


## Building Dataset
In this section I will combine all the individual datasets into one large dataframe that I can explore in the EDA phase. 

I will keep the actors and keywords separate for now so they don't explode the dataframe.

In [75]:
# joining revenue with genres:
combined_df = revenue_df.set_index('imdb_id').join(genres_df.set_index('imdb_id'), rsuffix='_rev')

In [76]:
combined_df.head(3)

imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,title_rev,year_rev,genres
tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,519,245000000.0,937000000.0,692000000.0,True,Star Wars: Episode VII - The Force Awakens,2015,"Action,Adventure,Sci-Fi"
tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,111,356000000.0,858000000.0,502000000.0,True,Avengers: Endgame,2019,"Action,Adventure,Drama"
tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,533,237000000.0,761000000.0,524000000.0,True,Avatar,2009,"Action,Adventure,Fantasy"


In [77]:
combined_df = combined_df.drop(columns=['title_rev', 'year_rev'])
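
The join-then-drop steps could also be done as a single `merge` that only brings over the column that is needed. This is a sketch with toy frames, not the notebook's method: `revenue` and `genres` here are stand-ins, and `validate='many_to_one'` is an extra safety check I am assuming is wanted, which raises if an `imdb_id` appears more than once on the genres side.

```python
import pandas as pd

# Toy stand-ins for revenue_df and genres_df.
revenue = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2'],
    'title': ['A', 'B'],
    'us_gross': [1.0, 2.0],
})
genres = pd.DataFrame({
    'imdb_id': ['tt1', 'tt3'],
    'title': ['A', 'C'],
    'genres': ['Drama', 'Comedy'],
})

# Left merge keeps every revenue row; selecting columns up front avoids
# the duplicated title/year columns that would otherwise need dropping.
combined = revenue.merge(
    genres[['imdb_id', 'genres']],
    on='imdb_id',
    how='left',
    validate='many_to_one',
)

print(combined)
```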

In [78]:
combined_df = combined_df.reset_index()

In [79]:
# adding in time of year next:
combined_df = combined_df.set_index('imdb_id').join(date_df.set_index('imdb_id')).reset_index()

In [80]:
combined_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date,month
6828,tt0098659,Winter People,1989,Ted Kotcheff,Castle Rock Entertainment,36360,-1.0,2000000.0,,False,Drama,1989-04-14,April
8914,tt7690638,Soorma,2018,Shaad Ali,CS Films',21047,2652731.0,390000.0,-2262731.0,False,"Biography,Drama,Sport",2018-07-13,July
8059,tt0084264,Lonely Hearts,1982,Paul Cox,Adam Packer Film Productions,172569,-1.0,777000.0,,False,"Comedy,Drama,Romance",1982-11-01,November
7770,tt0248617,Yaadein...,2001,Subhash Ghai,Mukta Arts,62607,2122185.0,1000000.0,-1122185.0,False,"Drama,Musical,Romance",2001-07-27,July
2504,tt0918927,Doubt,2008,John Patrick Shanley,Goodspeed Productions,5994,20000000.0,33000000.0,13000000.0,False,"Drama,Mystery",2008-11-27,November


In [81]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14431 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null int64
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
revenue          7896 non-null float64
popular          14431 non-null bool
genres           14317 non-null object
date             14375 non-null datetime64[ns]
month            14375 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 1.3+ MB


In [82]:
combined_df = combined_df.dropna()

In [83]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null datetime64[ns]
month            7865 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 806.5+ KB


### Save to CSV

In [84]:
combined_save_path = os.path.join(os.pardir, 'data', 'processed', 'combined.csv')
combined_df.to_csv(combined_save_path, index=False)

In [85]:
test_combined_save = pd.read_csv(combined_save_path)
test_combined_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date,month
3760,tt0115571,The Arrival,1996,David Twohy,Live Entertainment,5850,25000000.0,14000000.0,-11000000.0,False,"Sci-Fi,Thriller",1996-05-31,May
4027,tt0083190,Thief,1981,Michael Mann,Mann/Caan Productions,9013,5500000.0,11000000.0,5500000.0,False,"Action,Crime,Drama",1981-03-27,March
3061,tt0097142,Dad,1989,Gary David Goldberg,Amblin Entertainment,19326,19000000.0,22000000.0,3000000.0,False,"Comedy,Drama",1989-10-27,October


In [86]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null datetime64[ns]
month            7865 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 806.5+ KB
