# Movie Analysis: Data Scrubbing

## About:
In the data scrubbing phase I will focus on cleaning up the columns I plan on using, and building up the data frame I will use for the EDA phase:

1. US Gross Revenue
2. Genre
3. Actors
4. Time of Year (date)
5. Keywords (content)
6. Combined

### Project imports:

In [1]:
# imports for the entire data scrubbing phase
import pandas as pd
import os

## 1. US Gross Revenue
This column will be how we measure the other columns, so we will start here and drop any rows that don't have this information.

In [2]:
revenue_path = os.path.join(os.pardir, 'data', 'interim', 'money.csv')
revenue_df = pd.read_csv(revenue_path)

In [3]:
revenue_df.head()

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
0,tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,[US],519,$245MM,$937MM
1,tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,[US],111,$356MM,$858MM
2,tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,[US],533,$237MM,$761MM
3,tt1825683,Black Panther,2018,Ryan Coogler,Marvel Studios,[US],269,$200MM,$700MM
4,tt4154756,Avengers: Infinity War,2018,Anthony Russo,Marvel Studios,[US],376,$321MM,$679MM


In [4]:
revenue_df.loc[revenue_df['imdb_id'] == 'tt0091605']

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
5194,tt0091605,The Name of the Rose,1986,Jean-Jacques Annaud,Constantin Film,[DE],2485,ITL 30B,$7.2MM


In [5]:
revenue_df = revenue_df[:-1]

### Changes:

1. Convert 'us_gross', and 'budget_usd' values into floats. That means stripping the non-number characters out as well as changing 'MM' to ',000,000'.
2. Convert year column to int, so the years don't have the trailing .0.
3. region_code does not need the brackets around the abbreviations.

In [6]:
# Created 3/22/2020 with the exchange rates current on that date. Values are not adjusted for the year each movie was made.
def get_conversion_rate(value):
    """Get exchange rate for given currency code
    
    Arguments:
        value (string): String with currency code or symbol in it

    Returns:
        rate (float): Units of the given currency per 1 USD (divide an amount by it to get USD)
    """
    if '£' in value:
        return 0.854
    elif '€' in value:
        return 0.9334
    elif 'AUD' in value:
        return 1.7229
    elif 'CAD' in value:
        return 1.435
    elif 'FRF' in value:
        return 6.55957 * 0.9334
    elif 'INR' in value:
        return 75.394
    elif 'THB' in value:
        return 32.68
    elif 'EM' in value:
        return 1  # can't find info on EM, so it is treated as 1:1 with USD
    elif 'JPY' in value:
        return 110.75
    elif 'SKW' in value:
        return 1254.45
    elif 'HUF' in value:
        return 327.94
    elif 'NGN' in value:
        return 364
    elif 'CNY' in value:
        return 7.0950
    elif 'ESP' in value:
        return 155.42826
    elif 'RUR' in value:
        return 79.87
    elif 'HKD' in value:
        return 7.7570
    elif 'ISK' in value:
        return 140.490
    elif 'PHP' in value:
        return 51.19
    elif 'DKK' in value:
        return 6.9716
    elif 'CZK' in value:
        return 25.5620
    elif 'SKK' in value:
        return 10.3753
    elif 'NOK' in value:
        return 11.7890
    elif 'MXN' in value:
        return 24.4215
    elif 'JMD' in value:
        return 135.07
    elif 'PLN' in value:
        return 4.23
    elif 'KRW' in value:
        return 1228.97
    elif 'ITL' in value:
        return 1804.64
    else:
        return 1

In [7]:
def strip_currency_code(value):
    """Strips currency code from front of currency string

    Arguments: 
        value (string): currency amount prefaced with currency code
    
    Returns:
        value (string): value without the currency code
    """
    if value[:1] in '$£€':
        return value[1:]
    else:
        return value[3:]

In [8]:
def convert_money(value):
    """Takes currency string and parses it into correct amount in USD
    
    Arguments:
        value (string): currency in form: CAD 345.3B 

    Returns:
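        value (float or None): amount converted to USD, or None if the input is not a string
    """
    # type check: missing values arrive as float NaN, so skip anything that is not a string
    if not isinstance(value, str):
        return None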
    
    # check currency sign and get coefficient
    coef = get_conversion_rate(value)
    value = strip_currency_code(value)
    if 'K' in value:
        value = (float(value.strip('K')) * 1000) / coef
    elif 'MM' in value:
        value = (float(value.strip('MM')) * 1000000) / coef
    elif 'B' in value:
        value = (float(value.strip('B')) * 1000000000) / coef
    else:
        value = float(value.strip()) / coef
    return value
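
As a quick sanity check of the parsing rules above, here is a minimal, self-contained sketch that applies the same steps (strip the currency marker, expand the K/MM/B suffix, divide by the rate) to a few strings of the kind seen in this dataset. The names `RATES`, `MULTIPLIERS` and `parse_amount` are illustrative assumptions rather than part of the notebook, and only a few of the rates from `get_conversion_rate` are repeated.

```python
import re

# Hypothetical, trimmed-down rate table (units of currency per 1 USD),
# copied from get_conversion_rate above for illustration only.
RATES = {'$': 1.0, 'ITL': 1804.64, 'CAD': 1.435}
MULTIPLIERS = {'K': 1e3, 'MM': 1e6, 'B': 1e9, '': 1.0}

def parse_amount(text):
    """Parse strings such as '$937MM' or 'ITL 30B' into a USD float."""
    match = re.fullmatch(r'\s*(\$|[A-Z]{3})\s*([\d.]+)\s*(K|MM|B)?\s*', text)
    if match is None:
        return None  # unrecognised format
    code, number, suffix = match.groups()
    # An unknown currency code raises KeyError here instead of silently using 1.
    return float(number) * MULTIPLIERS[suffix or ''] / RATES[code]

for raw in ['$937MM', '$7.2MM', 'ITL 30B', '$15K']:
    print(raw, '->', parse_amount(raw))
```

Unlike the notebook's helper, this sketch fails loudly on a currency it does not know, which makes gaps in the rate table easier to spot.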

In [9]:
revenue_df['us_gross'] = revenue_df['us_gross'].apply(convert_money)

In [10]:
revenue_df['us_gross']

0        937000000.0
1        858000000.0
2        761000000.0
3        700000000.0
4        679000000.0
            ...     
14696            NaN
14697            NaN
14698            NaN
14699            NaN
14700            NaN
Name: us_gross, Length: 14701, dtype: float64

In [11]:
revenue_df['budget_usd'] = revenue_df['budget_usd'].apply(convert_money)

In [12]:
revenue_df['budget_usd'].isna().sum()

6705

In [13]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,region_code,rank,budget_usd,us_gross
3441,tt1430612,Brick Mansions,2014,Camille Delamarre,Relativity Media,[US],8992,28000000.0,20000000.0
10132,tt3458236,Lambert & Stamp,2014,James D. Cooper,Motocinema,[US],158522,,183000.0
6536,tt1127715,Sin Nombre,2009,Cary Joji Fukunaga,Scion Films,[GB],9926,,2500000.0
13256,tt2063819,The Olivia Experiment,2012,Sonja Schenk,Mansfield Films,[US],110963,,15000.0
4631,tt0089629,Moving Violations,1985,Neal Israel,SLM Production Group,[US],15521,,11000000.0


Now for region code. We actually don't need this column so we will drop it.

In [14]:
revenue_df.drop(columns='region_code', inplace=True)

For the 'year' column we will drop the rows with missing values, because there are only 16 of them (see the counts below):

In [15]:
revenue_df.isna().sum()

imdb_id             1
title               1
year               16
director           30
production_co     356
rank               13
budget_usd       6705
us_gross          103
dtype: int64

Cleaning up NaN values:

In [16]:
# first, change the missing budget values to -1, so we don't drop the 6705 rows that lack a budget.
revenue_df['budget_usd'] = revenue_df['budget_usd'].fillna(-1)

In [17]:
# also, fill in the production_co missing values with an 'Unknown'
revenue_df['production_co'] = revenue_df['production_co'].fillna('Unknown')

In [18]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14701 entries, 0 to 14700
Data columns (total 8 columns):
imdb_id          14700 non-null object
title            14700 non-null object
year             14685 non-null object
director         14671 non-null object
production_co    14701 non-null object
rank             14688 non-null object
budget_usd       14701 non-null float64
us_gross         14598 non-null float64
dtypes: float64(2), object(6)
memory usage: 918.9+ KB


In [19]:
revenue_df = revenue_df.dropna()
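
The -1 sentinel works, but it can leak into later arithmetic such as averages or sums. A minimal sketch of an alternative, on a toy frame with invented rows: keep missing budgets as NaN and tell `dropna` which columns are required through its `subset` parameter. The column names mirror this notebook; the `required` list is an assumption about which columns matter.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; columns mirror revenue_df.
toy = pd.DataFrame({
    'imdb_id': ['tt1', 'tt2', 'tt3', None],
    'year': ['2001', None, '2010', '1999'],
    'budget_usd': [1e6, 2e6, np.nan, np.nan],
    'us_gross': [5e6, 3e6, 4e5, 1e5],
})

# Only rows missing a required column are dropped; a missing budget is allowed.
required = ['imdb_id', 'year', 'us_gross']
kept = toy.dropna(subset=required)
print(kept)
```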

In [20]:
revenue_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
63,tt2250912,Spider-Man: Homecoming,2017,Jon Watts,Columbia Pictures,624,175000000.0,334000000.0
8239,tt0109655,Double Happiness,1994,Mina Shum,British Columbia Film Commission,46162,-1.0,759000.0
2235,tt0146336,Urban Legend,1998,Jamie Blanks,Phoenix Pictures,3984,14000000.0,38000000.0
13724,tt4031126,Lycan,2017,Bev Land,1 Bullet in the Gun Productions,32164,250000.0,9000.0
3874,tt0315733,21 Grams,2003,Alejandro G. Iñárritu,This Is That Productions,5120,20000000.0,16000000.0


Now for dropping duplicates:

In [21]:
revenue_df = revenue_df.drop_duplicates()

In [22]:
revenue_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14431 entries, 0 to 14598
Data columns (total 8 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null object
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
dtypes: float64(2), object(6)
memory usage: 1014.7+ KB


In [23]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross
2558,tt0063442,Planet of the Apes,1968,Franklin J. Schaffner,APJAC Productions,2942,5800000.0,33000000.0
10633,tt3626442,Björk: Biophilia Live,2014,Nick Fenton,Unknown,242834,-1.0,128000.0
4021,tt0089421,King Solomon's Mines,1985,J. Lee Thompson,Cannon Group,10361,12000000.0,15000000.0


In [24]:
def calc_revenue(df):
    return df['us_gross'] - df['budget_usd'] 

In [25]:
revenue_df['revenue'] = revenue_df[revenue_df['budget_usd'] > 0].apply(calc_revenue, axis=1)

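I added a revenue column (US gross minus budget) after being advised that it would be a better metric than gross alone. Rows with an unknown budget (stored as -1) are excluded, so their revenue is NaN.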
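
The row-wise `apply` can also be written as a column-level subtraction. The sketch below uses invented numbers and assumes the same -1 convention for unknown budgets; `Series.where` keeps a value only where the condition holds and leaves NaN elsewhere.

```python
import pandas as pd

# Toy data, invented for illustration; -1 marks an unknown budget as in revenue_df.
toy = pd.DataFrame({
    'budget_usd': [250e6, -1.0, 10e6],
    'us_gross': [73e6, 5e5, 77e6],
})

# Subtract whole columns at once, then blank out rows whose budget is unknown.
toy['revenue'] = (toy['us_gross'] - toy['budget_usd']).where(toy['budget_usd'] > 0)
print(toy)
```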

In [26]:
revenue_df[revenue_df['budget_usd'] > 0].sort_values('revenue').head(2)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue
1107,tt0401729,John Carter,2012,Andrew Stanton,Walt Disney Pictures,1817,250000000.0,73000000.0,-177000000.0
1271,tt1440129,Battleship,2012,Peter Berg,Universal Pictures,3013,209000000.0,65000000.0,-144000000.0


Change the rank type from string to int:

In [27]:
def fix_rank(value):
    value = value.replace(',', '')
    return int(value)

In [28]:
revenue_df['rank'] = revenue_df['rank'].apply(fix_rank)
# compute the 10th-percentile cutoff once, then compare the whole column against it
popular_cutoff = revenue_df['rank'].quantile(.1)
revenue_df['popular'] = revenue_df['rank'] < popular_cutoff

In [29]:
revenue_df.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
8753,tt0101991,Rhapsody in August,1991,Akira Kurosawa,Feature Film Enterprise II,22493,-1.0,516000.0,,False
11469,tt2358456,I Am Big Bird: The Caroll Spinney Story,2014,Dave LaMattina,Copper Pot Pictures,76677,100000.0,68000.0,-32000.0,False
9994,tt7014234,The Iron Orchard,2018,Ty Roberts,Santa Rita Film Co.,37375,-1.0,205000.0,,False


### Save as CSV

In [30]:
revenue_save_path = os.path.join(os.pardir, 'data', 'processed', 'revenue.csv')
revenue_df.to_csv(revenue_save_path, index=False)

In [31]:
test_revenue_save = pd.read_csv(revenue_save_path)
test_revenue_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular
1038,tt0088011,Romancing the Stone,1984,Robert Zemeckis,Twentieth Century Fox,3899,10000000.0,77000000.0,67000000.0,False
5071,tt2170299,Bad Words,2013,Jason Bateman,Aggregate Films,9677,10000000.0,7800000.0,-2200000.0,False
9667,tt0398971,Down and Derby,2005,Eric Hendershot,Stonehaven Media,58796,-1.0,232000.0,,False


## 2. Genre:
For genre we will need a dataset that lists each movie and its genre. To analyze the success of a genre, we will need to examine the relationship between genre and the revenue earned.

Bringing in the list of movie titles:

In [32]:
titles_path = os.path.join(os.pardir, 'data', 'raw', 'movies.csv')

In [33]:
genres_df = pd.read_csv(titles_path)
genres_df.head()

Unnamed: 0,tconst,primaryTitle,startYear,genres
0,tt0000009,Miss Jerry,1894,Romance
1,tt0000147,The Corbett-Fitzsimmons Fight,1897,"Documentary,News,Sport"
2,tt0000335,Soldiers of the Cross,1900,"Biography,Drama"
3,tt0000502,Bohemios,1905,\N
4,tt0000574,The Story of the Kelly Gang,1906,"Biography,Crime,Drama"


In [34]:
genres_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545821 entries, 0 to 545820
Data columns (total 4 columns):
tconst          545821 non-null object
primaryTitle    545821 non-null object
startYear       545821 non-null object
genres          545821 non-null object
dtypes: object(4)
memory usage: 16.7+ MB


### Changes:
Looking at the initial dataframe, these are the things I would like to change:
1. Change the column names to match the revenue dataset
2. Drop the original_title and runtime_minutes columns (neither appears in this extract, so nothing needs dropping)

In [35]:
genres_df = genres_df.rename(columns={'tconst': 'imdb_id', 'primaryTitle': 'title', 'startYear': 'year'})

In [36]:
genres_df.sample(3)

Unnamed: 0,imdb_id,title,year,genres
480705,tt6789128,Untitled Bank Project,\N,Action
307702,tt1679313,The Problem Solver,2010,"Comedy,Mystery"
139518,tt0241625,A keresztes vitézek,1921,\N


That looks good. Let me check for NaNs:

In [37]:
genres_df.isna().sum()

imdb_id    0
title      0
year       0
genres     0
dtype: int64
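
One caveat: the zero counts above only reflect true NaN values. The sample output shows that this file marks missing entries with the literal string `\N` (for example the genres of tt0000502 and the year of tt6789128), which `isna()` does not detect. A minimal sketch, using three rows copied from the `head()` output, of how `read_csv` could be told to treat `\N` as missing through its `na_values` parameter:

```python
import io
import pandas as pd

# Three rows copied from the genres_df.head() output above.
raw = io.StringIO(
    'tconst,primaryTitle,startYear,genres\n'
    'tt0000009,Miss Jerry,1894,Romance\n'
    'tt0000502,Bohemios,1905,\\N\n'
    'tt0000574,The Story of the Kelly Gang,1906,"Biography,Crime,Drama"\n'
)

# Treat the literal two-character string \N as missing while reading.
sample = pd.read_csv(raw, na_values=['\\N'])
print(sample.isna().sum())
```

Whether those rows should then be dropped or kept is a separate decision; the point of the sketch is only to make them visible to `isna()`.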

### Save as CSV

In [38]:
genres_save_path = os.path.join(os.pardir, 'data', 'processed', 'genres.csv')
genres_df.to_csv(genres_save_path, index=False)

In [39]:
test_genres_save = pd.read_csv(genres_save_path)
test_genres_save.sample(3)

Unnamed: 0,imdb_id,title,year,genres
331384,tt2085866,Death Is My Profession,2011,Drama
320712,tt1876486,The Faces Behind the Dolls,2011,Documentary
457465,tt5925288,Futari no tougekyou,2016,Documentary


## 3. Actors
These columns will be key in identifying the people who have the ability to produce high quality work on a consistent basis.

In [40]:
people_path = os.path.join(os.pardir, 'data', 'raw', 'imdb.name.basics.csv')
people_df = pd.read_csv(people_path)

In [41]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
41757,nm0524900,Dariela Ludlow,,,"cinematographer,camera_department,director","tt0416496,tt0485851,tt1532583,tt1127826"
124023,nm2015126,Elena Kagan,,,actress,"tt0449519,tt5815100,tt0431022,tt0476978"
228439,nm3301282,Kelley J. Jackson,,,"actress,writer,director","tt0840979,tt2402207,tt4039658,tt0203259"


### Changes:
Some cleanup tasks:
1. Change name of primary_name column to 'name'
2. Select all the actors and actresses
3. Drop birth_year, death_year, known_for_titles

In [42]:
people_df.sample(3)

Unnamed: 0,nconst,primary_name,birth_year,death_year,primary_profession,known_for_titles
356145,nm6591978,Pierre-Emmanuel Barré,,,"actor,writer","tt7033196,tt6431316,tt5180488,tt5492624"
299202,nm4260382,Sean Warner,,,actor,"tt2224532,tt1814876,tt1696514,tt1835919"
236901,nm3952307,Michael Faner,,,"sound_department,cinematographer,producer","tt9248032,tt8560814,tt8367846,tt1430047"


In [43]:
people_df = people_df.rename(columns={'primary_name': 'name'})

In [44]:
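def can_act(professions):
    """Return True if the comma-separated professions include actor or actress."""
    # missing professions come through as float NaN rather than a string
    if not isinstance(professions, str):
        return False
    return 'actor' in professions or 'actress' in professions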
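
The same flag can be computed without a Python-level function by using the pandas string accessor. This is a sketch on invented values, not the notebook's method; `na=False` is what makes a missing profession count as "cannot act".

```python
import numpy as np
import pandas as pd

# Toy values modelled on the primary_profession column; NaN marks a missing entry.
professions = pd.Series(['actress', 'actor,writer', 'sound_department,producer', np.nan])

# na=False makes missing professions evaluate to False instead of propagating NaN.
can_act_flags = professions.str.contains('actor|actress', na=False)
print(can_act_flags.tolist())
```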

In [45]:
people_df['can_act'] = people_df['primary_profession'].apply(can_act)

In [46]:
people_df.sample(3)

Unnamed: 0,nconst,name,birth_year,death_year,primary_profession,known_for_titles,can_act
196940,nm2157208,Golnaz Farmani,,,actress,"tt0499537,tt2575080",True
603285,nm7819257,J.T. Lewis,2000.0,,,"tt7406432,tt3758314,tt7989888",False
59080,nm10021381,Pooja Malekar,,,actress,,True


Okay, we will grab all the actors and actresses and make a separate dataframe for them:

In [47]:
actors_df = people_df[people_df['can_act']]

And now we can drop the unwanted columns:

In [48]:
drop_columns = ['primary_profession', 'can_act', 'birth_year', 'death_year', 'known_for_titles']
actors_df = actors_df.drop(columns=drop_columns)

In [49]:
actors_df.sample(3)

Unnamed: 0,nconst,name
372670,nm9021148,David Engler
420343,nm6626945,Raul Walder
200269,nm2525861,Edip Tutal


Let's check for missing values:

In [50]:
actors_df.isna().sum()

nconst    0
name      0
dtype: int64

There we go. A very large list of actors and actresses. We can join them to the titles and see if there are any patterns amongst the top performing titles.

### Save as CSV

In [51]:
actors_save_path = os.path.join(os.pardir, 'data', 'processed', 'actors.csv')
actors_df.to_csv(actors_save_path, index=False)

In [52]:
test_actors_save = pd.read_csv(actors_save_path)
test_actors_save.sample(3)

Unnamed: 0,nconst,name
207975,nm5528958,J.P. Valenti
161009,nm4531335,Noah Gillett
85370,nm2913613,Bob Moore


## 4. Time of Year (date)
Time of year will be an important metric to discover the most opportune time to release a film.

In [53]:
date_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_movies.csv')
date_df = pd.read_csv(date_path)

In [54]:
date_df.sample(3)

Unnamed: 0,imdbId,budget,revenue,originalTitle,releaseDate
30781,tt2204371,0,0.0,Somm,2013-06-21
33163,tt8956390,0,0.0,Rojo,2018-10-26
19560,tt0238247,0,0.0,God's Army,2000-03-10


### Changes:
We only need a couple of columns from this set:
1. imdb_id
2. release_date
3. add month column

The columns only need renaming to snake_case, so this will be very simple.

In [55]:
date_df = date_df.drop_duplicates()

In [56]:
date_df = date_df.rename(columns={'imdbId': 'imdb_id', 'originalTitle': 'title', 'releaseDate': 'date'})

In [57]:
date_df = date_df[['imdb_id', 'date']]

In [58]:
date_df = date_df.dropna()

In [59]:
date_df.sample(3)

Unnamed: 0,imdb_id,date
32071,tt0088135,1983-12-01
7129,tt1298644,2019-05-09
15601,tt0264761,2002-03-13


In [60]:
date_df['date'] = pd.to_datetime(date_df['date'])  # infer_datetime_format is deprecated in pandas 2.0+ and no longer needed

In [61]:
date_df['month'] = date_df['date'].dt.month_name()
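
For reference, a small sketch of the date handling on invented values. It assumes that malformed dates should become missing rather than stop the notebook, which is what `errors='coerce'` does; the notebook itself does not state how bad dates should be handled.

```python
import pandas as pd

# Toy dates; the last entry is deliberately malformed (invented for illustration).
dates = pd.Series(['2013-06-21', '2018-10-26', 'not a date'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(dates, errors='coerce')

# The .dt accessor derives the month name for the whole column at once.
months = parsed.dt.month_name()
print(months.tolist())
```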

In [62]:
date_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14466 entries, 0 to 43495
Data columns (total 3 columns):
imdb_id    14466 non-null object
date       14466 non-null datetime64[ns]
month      14466 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 452.1+ KB


In [63]:
date_df.isna().sum()

imdb_id    0
date       0
month      0
dtype: int64

### Save to CSV

In [64]:
date_save_path = os.path.join(os.pardir, 'data', 'processed', 'date.csv')
date_df.to_csv(date_save_path, index=False)

In [65]:
test_date_save = pd.read_csv(date_save_path)
test_date_save.sample(3)

Unnamed: 0,imdb_id,date,month
6411,tt0091954,1986-10-24,October
12685,tt0881909,2007-06-29,June
9199,tt0279064,2001-03-16,March


## 5. Keywords (content)

In [66]:
keywords_path = os.path.join(os.pardir, 'data', 'raw', 'tmdb_keywords.csv')
keywords_df = pd.read_csv(keywords_path)

In [67]:
keywords_df.sample(3)

Unnamed: 0,imdbId,keywordId,keyword
10064,tt0458339,242,new york city
197940,tt0303830,18266,south america
127232,tt0457513,549,prostitute


In [68]:
keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257984 entries, 0 to 257983
Data columns (total 3 columns):
imdbId       257984 non-null object
keywordId    257984 non-null int64
keyword      257984 non-null object
dtypes: int64(1), object(2)
memory usage: 5.9+ MB


This is a simple dataframe. When I created it, I knew exactly which columns I would use.

I do need to change the column names from camelCase to snake_case (Node.js uses camelCase):

In [69]:
keywords_df = keywords_df.rename(columns={'imdbId': 'imdb_id', 'keywordId': 'keyword_id'})

In [70]:
keywords_df.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
221836,tt1787816,6832,environmental protection agency
166265,tt0106966,239161,havana
13313,tt3783958,33928,audition


In [71]:
keywords_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
257979     True
257980     True
257981     True
257982     True
257983     True
Length: 257984, dtype: bool
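
The output above shows `True` for the rows at the tail of the frame, meaning they repeat earlier rows, yet `keywords_df` is saved without removing them. A minimal sketch on invented rows of how the duplicates could be counted and dropped before saving. Whether dropping them is appropriate depends on how the EDA counts keywords, so treat this as an assumption.

```python
import pandas as pd

# Toy keyword rows, invented for illustration; the last row repeats the first.
toy = pd.DataFrame({
    'imdb_id': ['tt0458339', 'tt0303830', 'tt0458339'],
    'keyword_id': [242, 18266, 242],
    'keyword': ['new york city', 'south america', 'new york city'],
})

# duplicated() flags rows that repeat an earlier row; summing counts them.
n_duplicates = toy.duplicated().sum()
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```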

In [72]:
keywords_df.isna().sum()

imdb_id       0
keyword_id    0
keyword       0
dtype: int64

### Save to CSV

In [73]:
keywords_save_path = os.path.join(os.pardir, 'data', 'processed', 'keywords.csv')
keywords_df.to_csv(keywords_save_path, index=False)

In [74]:
test_keywords_save = pd.read_csv(keywords_save_path)
test_keywords_save.sample(3)

Unnamed: 0,imdb_id,keyword_id,keyword
250057,tt5700176,212642,abusive father
40624,tt0118615,886,movie business
163552,tt1226774,10589,warmongering


## Building Dataset
In this section I will combine all the individual datasets into one large dataframe that I can explore in the EDA phase. 

I will keep the actors and keywords separate for now so they don't explode the dataframe.

In [75]:
# joining revenue with genres (left join on imdb_id; rsuffix is applied to the overlapping genre columns):
combined_df = revenue_df.set_index('imdb_id').join(genres_df.set_index('imdb_id'), rsuffix='_rev')

In [76]:
combined_df.head(3)

imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,title_rev,year_rev,genres
tt2488496,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams,Lucasfilm,519,245000000.0,937000000.0,692000000.0,True,Star Wars: Episode VII - The Force Awakens,2015,"Action,Adventure,Sci-Fi"
tt4154796,Avengers: Endgame,2019,Anthony Russo,Marvel Studios,111,356000000.0,858000000.0,502000000.0,True,Avengers: Endgame,2019,"Action,Adventure,Drama"
tt0499549,Avatar,2009,James Cameron,Twentieth Century Fox,533,237000000.0,761000000.0,524000000.0,True,Avatar,2009,"Action,Adventure,Fantasy"


In [77]:
combined_df = combined_df.drop(columns=['title_rev', 'year_rev'])

In [78]:
combined_df = combined_df.reset_index()

In [79]:
# adding in time of year next:
combined_df = combined_df.set_index('imdb_id').join(date_df.set_index('imdb_id')).reset_index()
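
`DataFrame.join` performs a left join by default, so every revenue row is kept and unmatched rows get missing values. The sketch below shows an equivalent `merge` on invented frames; the `indicator` and `validate` arguments are optional extras, assumed here to be useful for checking how many rows matched.

```python
import pandas as pd

# Toy frames, invented for illustration; columns mirror revenue_df and date_df.
revenue = pd.DataFrame({'imdb_id': ['tt1', 'tt2', 'tt3'], 'us_gross': [5e6, 3e6, 4e5]})
dates = pd.DataFrame({'imdb_id': ['tt1', 'tt3'], 'month': ['June', 'May']})

# how='left' keeps every revenue row; indicator adds a _merge column with the match status;
# validate raises if an imdb_id appears more than once on the right-hand side.
merged = revenue.merge(dates, on='imdb_id', how='left', indicator=True, validate='many_to_one')
print(merged)
```

Counting the `left_only` values in `_merge` gives the number of movies without a release date before any rows are dropped.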

In [80]:
combined_df.sample(5)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date,month
11720,tt2611390,Difret,2014,Zeresenay Mehari,Haile Addis Pictures,80624,-1.0,50000.0,,False,"Biography,Crime,Drama",2014-01-19,January
3706,tt0110006,Heavyweights,1995,Steven Brill,Caravan Pictures,3223,-1.0,18000000.0,,False,"Comedy,Drama,Family",1995-02-17,February
7106,tt0097635,Jesus of Montreal,1989,Denys Arcand,Max Films Productions,35905,-1.0,1600000.0,,False,Drama,1989-05-17,May
9583,tt0843358,My Dog Tulip,2009,Paul Fierlinger,Norman Twain Productions,96677,-1.0,247000.0,,False,"Animation,Drama",2009-01-01,January
11291,tt3498950,Take Me to the River,2014,Martin Shore,EGBA Entertainment,134126,-1.0,69000.0,,False,"Documentary,Music",2014-03-11,March


In [81]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14431 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          14431 non-null object
title            14431 non-null object
year             14431 non-null object
director         14431 non-null object
production_co    14431 non-null object
rank             14431 non-null int64
budget_usd       14431 non-null float64
us_gross         14431 non-null float64
revenue          7896 non-null float64
popular          14431 non-null bool
genres           14317 non-null object
date             14375 non-null datetime64[ns]
month            14375 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 1.3+ MB


In [82]:
combined_df = combined_df.dropna()

In [83]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null datetime64[ns]
month            7865 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 806.5+ KB


### Save to CSV

In [84]:
combined_save_path = os.path.join(os.pardir, 'data', 'processed', 'combined.csv')
combined_df.to_csv(combined_save_path, index=False)

In [85]:
test_combined_save = pd.read_csv(combined_save_path)
test_combined_save.sample(3)

Unnamed: 0,imdb_id,title,year,director,production_co,rank,budget_usd,us_gross,revenue,popular,genres,date,month
100,tt0198781,"Monsters, Inc.",2001,Pete Docter,Pixar Animation Studios,1265,115000000.0,290000000.0,175000000.0,True,"Adventure,Animation,Comedy",2001-11-01,November
1461,tt0096446,Willow,1988,Ron Howard,Metro-Goldwyn-Mayer (MGM),1463,35000000.0,57000000.0,22000000.0,True,"Action,Adventure,Drama",1988-05-01,May
447,tt0129290,Patch Adams,1998,Tom Shadyac,Universal Pictures,4797,90000000.0,135000000.0,45000000.0,False,"Biography,Comedy,Drama",1998-12-25,December
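
CSV stores no type information, so the `date` column is read back as text unless it is parsed again on load. A minimal sketch of the round trip with an in-memory buffer and invented rows; `parse_dates` is a standard `read_csv` parameter, and using it in the EDA notebook is a suggestion rather than something this notebook does.

```python
import io
import pandas as pd

# Toy frame, invented for illustration, with a real datetime column.
toy = pd.DataFrame({
    'imdb_id': ['tt0198781', 'tt0096446'],
    'date': pd.to_datetime(['2001-11-01', '1988-05-01']),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Plain read: the date column comes back as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['date'])
print(plain['date'].dtype, restored['date'].dtype)
```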


In [86]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 0 to 14430
Data columns (total 13 columns):
imdb_id          7865 non-null object
title            7865 non-null object
year             7865 non-null object
director         7865 non-null object
production_co    7865 non-null object
rank             7865 non-null int64
budget_usd       7865 non-null float64
us_gross         7865 non-null float64
revenue          7865 non-null float64
popular          7865 non-null bool
genres           7865 non-null object
date             7865 non-null datetime64[ns]
month            7865 non-null object
dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(7)
memory usage: 806.5+ KB
