# Download and clean Dataset

###### downloading and cleaning.py #####

## Movie industry analysis ##

#### The main purpose of this data analysis is to find out whether there is a correlation between a movie's rating and its box-office earnings. Are highly rated movies always the ones that make the most money?

#### Note that, to make a fair analysis of box-office earnings, we keep only movies released from 2000 onwards whose earnings are reported in dollars

### HYPOTHESIS
#### Other hypotheses we want to test:
####  1) Movie genre vs rating
####  2) Movie genre vs awards - Why is it hard for comedy movies to get to the Oscars?


### DATA SOURCE
#### For this project we are going to work with two different data sources: Kaggle and the IMDb API
#### KAGGLE: Originally this data was downloaded from the IMDb API, but we need more information from the API that is missing from it. From this data we are going to use mainly imdb_title_id, original_title, year, country, language, budget and worlwide_gross_income.
#### IMDb API: we use imdb_title_id to make requests to the API and get the information about awards.



In [1]:
###### downloading and cleaning.py #####

#Import libraries 

import numpy as np
import pandas as pd

import re

import warnings

warnings.filterwarnings('ignore')

# https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset -> Kaggle URL Dataset

In [2]:
#Download Kaggle dataset

# !kaggle datasets download -d stefanoleone992/imdb-extensive-dataset

In [3]:
# Find downloaded zip file from Kaggle

# !ls

In [4]:
#Decompress zip file

# !unzip imdb-extensive-dataset.zip
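
A portable alternative to shelling out is Python's standard `zipfile` module. The sketch below is only an illustration: `extract_zip` is a hypothetical helper, and the demo builds a tiny stand-in archive instead of touching the real Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(archive_path, target_dir):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Self-contained demo: build a tiny stand-in archive, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "demo.zip"  # hypothetical stand-in for the Kaggle zip
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("IMDb movies.csv", "imdb_title_id,title\ntt0000009,Miss Jerry\n")
    extracted = extract_zip(archive_path, Path(tmp) / "data")
    extracted_exists = (Path(tmp) / "data" / "IMDb movies.csv").exists()
```

For the real download, the call would presumably be `extract_zip("imdb-extensive-dataset.zip", ".")`.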

In [5]:
#Delete downloaded zip file

# !rm -rf imdb-extensive-dataset.zip

In [6]:
# !rm -rf "IMDb names.csv"
# Keep "IMDb ratings.csv": it is read in the next cell
# !rm -rf title_principals.csv

In [7]:
# Read and convert the csv source data into a pandas dataframe.

pd.set_option('display.max_columns', None)
kaggle_movie_ratings = pd.read_csv("IMDb ratings.csv",encoding = "ISO-8859-1")

In [8]:
kaggle_movie_ratings.head()

Unnamed: 0,imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,votes_5,votes_4,votes_3,votes_2,votes_1,allgenders_0age_avg_vote,allgenders_0age_votes,allgenders_18age_avg_vote,allgenders_18age_votes,allgenders_30age_avg_vote,allgenders_30age_votes,allgenders_45age_avg_vote,allgenders_45age_votes,males_allages_avg_vote,males_allages_votes,males_0age_avg_vote,males_0age_votes,males_18age_avg_vote,males_18age_votes,males_30age_avg_vote,males_30age_votes,males_45age_avg_vote,males_45age_votes,females_allages_avg_vote,females_allages_votes,females_0age_avg_vote,females_0age_votes,females_18age_avg_vote,females_18age_votes,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000009,5.9,154,5.9,6.0,12,4,10,43,28,28,9,1,5,14,7.2,4.0,6.0,38.0,5.7,50.0,6.6,35.0,6.2,97.0,7.0,1.0,5.9,24.0,5.6,36.0,6.7,31.0,6.0,35.0,7.3,3.0,5.9,14.0,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0
1,tt0000574,6.1,589,6.3,6.0,57,18,58,137,139,103,28,20,13,16,6.0,1.0,6.1,114.0,6.0,239.0,6.3,115.0,6.1,425.0,6.0,1.0,6.2,102.0,6.0,210.0,6.2,100.0,6.2,50.0,,,5.9,12.0,6.2,23.0,6.6,14.0,6.4,66.0,6.0,96.0,6.2,331.0
2,tt0001892,5.8,188,6.0,6.0,6,6,17,44,52,32,16,5,6,4,,,5.5,25.0,5.8,72.0,6.2,62.0,5.9,146.0,,,5.5,21.0,5.9,67.0,6.2,55.0,5.7,15.0,,,5.8,4.0,5.8,4.0,6.8,7.0,5.4,32.0,6.2,31.0,5.9,123.0
3,tt0002101,5.2,446,5.3,5.0,15,8,16,62,98,117,63,26,25,16,,,5.3,23.0,5.0,111.0,5.3,193.0,5.1,299.0,,,5.2,20.0,4.9,96.0,5.2,171.0,5.9,39.0,,,5.7,3.0,5.5,14.0,6.1,21.0,4.9,57.0,5.5,207.0,4.7,105.0
4,tt0002130,7.0,2237,6.9,7.0,210,225,436,641,344,169,66,39,20,87,7.5,4.0,7.0,402.0,7.0,895.0,7.1,482.0,7.0,1607.0,8.0,2.0,7.0,346.0,7.0,804.0,7.0,396.0,7.2,215.0,7.0,2.0,7.0,52.0,7.3,82.0,7.4,77.0,6.9,139.0,7.0,488.0,7.0,1166.0


In [9]:
# Read and convert the csv source data into a pandas dataframe.

pd.set_option('display.max_columns', None)
kaggle_movie_dataset = pd.read_csv("IMDb movies.csv",encoding = "ISO-8859-1")
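
If accented titles or names come out as two-character sequences (for example `Ã¸` in place of `ø`), the file is probably UTF-8 being decoded as ISO-8859-1. Whether that applies to the Kaggle CSVs is an assumption worth checking on the raw file. The sketch below reproduces the effect on a single string and shows how to reverse it; `looks_like_utf8` is a hypothetical helper.

```python
# Simulate what happens when UTF-8 bytes are decoded as ISO-8859-1.
original = "Markens grøde"
garbled = original.encode("utf-8").decode("iso-8859-1")

# Reverse the mistake: re-encode as ISO-8859-1, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")


def looks_like_utf8(raw_bytes):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

If the raw bytes of the CSV pass a check like `looks_like_utf8`, passing `encoding="utf-8"` to `pd.read_csv` would likely avoid the problem at the source.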

In [10]:
# Shows the first 2 rows of the dataset.

kaggle_movie_dataset.head(2)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0


In [11]:
kaggle_movie_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85855 entries, 0 to 85854
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_title_id          85855 non-null  object 
 1   title                  85855 non-null  object 
 2   original_title         85855 non-null  object 
 3   year                   85855 non-null  object 
 4   date_published         85855 non-null  object 
 5   genre                  85855 non-null  object 
 6   duration               85855 non-null  int64  
 7   country                85791 non-null  object 
 8   language               85022 non-null  object 
 9   director               85768 non-null  object 
 10  writer                 84283 non-null  object 
 11  production_company     81400 non-null  object 
 12  actors                 85786 non-null  object 
 13  description            83740 non-null  object 
 14  avg_vote               85855 non-null  float64
 15  vo

In [12]:
# Generate various summary statistics, excluding NaN values.

kaggle_movie_dataset.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
duration,85855.0,100.351418,22.553848,41.0,88.0,96.0,108.0,808.0
avg_vote,85855.0,5.898656,1.234987,1.0,5.2,6.1,6.8,9.9
votes,85855.0,9493.489605,53574.359543,99.0,205.0,484.0,1766.5,2278845.0
metascore,13305.0,55.896881,17.784874,1.0,43.0,57.0,69.0,100.0
reviews_from_users,78258.0,46.040826,178.511411,1.0,4.0,9.0,27.0,10472.0
reviews_from_critics,74058.0,27.479989,58.339158,1.0,3.0,8.0,23.0,999.0


It seems the majority of the variables in this dataset are discrete/categorical. We might be able to use "count" methods on them after some data cleaning and consolidation. We will get clearer information later on. For this analysis we are not going to use the continuous variables.

In [13]:
# Calculates the percentage of null records for each variable

percent_missing = round(kaggle_movie_dataset.isnull().sum() * 100 / len(kaggle_movie_dataset), 2)
percent_missing

imdb_title_id             0.00
title                     0.00
original_title            0.00
year                      0.00
date_published            0.00
genre                     0.00
duration                  0.00
country                   0.07
language                  0.97
director                  0.10
writer                    1.83
production_company        5.19
actors                    0.08
description               2.46
avg_vote                  0.00
votes                     0.00
budget                   72.38
usa_gross_income         82.15
worlwide_gross_income    63.87
metascore                84.50
reviews_from_users        8.85
reviews_from_critics     13.74
dtype: float64
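
The same percentages can be obtained a little more directly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for kaggle_movie_dataset (values are made up).
toy = pd.DataFrame(
    {
        "budget": ["$ 2250", np.nan, np.nan, "$ 18000"],
        "genre": ["Drama", "Comedy", "Horror", "Drama"],
    }
)

# isnull() gives booleans; their mean is the fraction of missing values.
percent_missing_toy = (toy.isnull().mean() * 100).round(2)

# Sort so the sparsest columns are listed first.
percent_missing_toy = percent_missing_toy.sort_values(ascending=False)
```

Sorting makes the columns with the most missing data easy to spot at the top of the output.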

We definitely need the "budget" and "worlwide_gross_income" columns for this analysis, even though they have some of the highest percentages of missing data. This shouldn't be a big inconvenience, since we can only make 500 requests to the IMDb API (we need one per movie).

###### movie_functions.py #####

In [14]:
# Remove rows where the following columns values are missing

get_notnulls_columns = ["budget", "worlwide_gross_income", "genre", "country"]

def notNulls(columns, dataset):
    for col in columns:
        dataset = dataset[pd.notnull(dataset[col])]  
    return dataset
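
For reference, pandas offers `DataFrame.dropna(subset=...)`, which should give the same result as the loop above in a single call. A small sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import numpy as np
import pandas as pd

# Toy frame with the four required columns (values are made up).
toy = pd.DataFrame(
    {
        "budget": ["$ 2250", np.nan, "$ 18000"],
        "worlwide_gross_income": ["$ 100", "$ 200", np.nan],
        "genre": ["Drama", "Comedy", "Horror"],
        "country": ["USA", "USA", "Germany"],
    }
)
required = ["budget", "worlwide_gross_income", "genre", "country"]

# Keep only rows where every required column has a value.
complete_rows = toy.dropna(subset=required)
```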

In [15]:
# import movie_functions as mf
# kaggle_movie_dataset2 = mf.notNulls(get_notnulls_columns, kaggle_movie_dataset)

In [16]:
kaggle_movie_dataset2 = notNulls(get_notnulls_columns, kaggle_movie_dataset)

In [17]:
# kaggle_movie_dataset2.info()

In [18]:
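# Calculates the percentage of null records for each variable after removing the rows above.
# Note: the denominator is still the full dataset (kaggle_movie_dataset), so the values
# are percentages of the original row count, not of the filtered rows.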

percent_missing2 = round(kaggle_movie_dataset2.isnull().sum() * 100 / len(kaggle_movie_dataset), 2)
percent_missing2

imdb_title_id            0.00
title                    0.00
original_title           0.00
year                     0.00
date_published           0.00
genre                    0.00
duration                 0.00
country                  0.00
language                 0.04
director                 0.00
writer                   0.08
production_company       0.16
actors                   0.00
description              0.19
avg_vote                 0.00
votes                    0.00
budget                   0.00
usa_gross_income         5.40
worlwide_gross_income    0.00
metascore                6.60
reviews_from_users       0.65
reviews_from_critics     0.58
dtype: float64

#### Let's check if there are null values within the columns we want to use after deleting null values from "budget"

In [19]:
# Gets the number of rows and columns of this dataset

kaggle_movie_dataset.shape

(85855, 22)

In [20]:
kaggle_movie_dataset2.shape

(12761, 22)

#### We need "worlwide_gross_income" to be an integer and we are going to select only $ currency

In [21]:
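# Keep only rows whose worldwide gross income is reported in dollars
get_dollars = kaggle_movie_dataset2["worlwide_gross_income"].str.startswith('$', na=False)

# Assign the filtered result back (otherwise the filter has no effect)
kaggle_movie_dataset2 = kaggle_movie_dataset2[get_dollars].copy()

kaggle_movie_dataset2.head()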

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
165,tt0010323,Il gabinetto del dottor Caligari,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,Robert Wiene,"Carl Mayer, Hans Janowitz",Decla-Bioscop AG,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...","Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.1,55601,$ 18000,$ 8811,$ 8811,,237.0,160.0
210,tt0011440,Markens grøde,Markens grøde,1921,1921-12-02,Drama,107,Norway,,Gunnar Sommerfeldt,"Knut Hamsun, Gunnar Sommerfeldt",Christiana Film,"Amund Rydland, Karen Poulsen, Ragna Wettergree...",After the Nobel prize winning Knut Hamsun-nove...,6.6,195,NOK 250000,,$ 4272,,3.0,3.0
245,tt0012190,I quattro cavalieri dell'Apocalisse,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,Rex Ingram,"Vicente Blasco Ibáñez, June Mathis",Metro Pictures Corporation,"Pomeroy Cannon, Josef Swickard, Bridgetta Clar...",An extended family split up in France and Germ...,7.2,3058,$ 800000,$ 9183673,$ 9183673,,45.0,16.0
251,tt0012349,Il monello,The Kid,1921,1923-11-26,"Comedy, Drama, Family",68,USA,"English, None",Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Carl Miller, Edna Purviance, Jackie Coogan, Ch...","The Tramp cares for an abandoned child, but ev...",8.3,109038,$ 250000,,$ 26916,,173.0,105.0
348,tt0014624,La donna di Parigi,A Woman of Paris: A Drama of Fate,1923,1927-06-06,"Drama, Romance",82,USA,"None, English",Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Edna Purviance, Clarence Geldart, Carl Miller,...",A kept woman runs into her former fiancé and ...,7.0,4735,$ 351000,,$ 11233,,37.0,24.0


In [22]:
# regex=False treats '$' as a literal character rather than an end-of-string anchor
kaggle_movie_dataset2['worlwide_gross_income'] = kaggle_movie_dataset2['worlwide_gross_income'].str.replace('$', "", regex=False)
kaggle_movie_dataset2.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
165,tt0010323,Il gabinetto del dottor Caligari,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,Robert Wiene,"Carl Mayer, Hans Janowitz",Decla-Bioscop AG,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...","Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.1,55601,$ 18000,$ 8811,8811,,237.0,160.0
210,tt0011440,Markens grøde,Markens grøde,1921,1921-12-02,Drama,107,Norway,,Gunnar Sommerfeldt,"Knut Hamsun, Gunnar Sommerfeldt",Christiana Film,"Amund Rydland, Karen Poulsen, Ragna Wettergree...",After the Nobel prize winning Knut Hamsun-nove...,6.6,195,NOK 250000,,4272,,3.0,3.0
245,tt0012190,I quattro cavalieri dell'Apocalisse,The Four Horsemen of the Apocalypse,1921,1923-04-16,"Drama, Romance, War",150,USA,,Rex Ingram,"Vicente Blasco Ibáñez, June Mathis",Metro Pictures Corporation,"Pomeroy Cannon, Josef Swickard, Bridgetta Clar...",An extended family split up in France and Germ...,7.2,3058,$ 800000,$ 9183673,9183673,,45.0,16.0
251,tt0012349,Il monello,The Kid,1921,1923-11-26,"Comedy, Drama, Family",68,USA,"English, None",Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Carl Miller, Edna Purviance, Jackie Coogan, Ch...","The Tramp cares for an abandoned child, but ev...",8.3,109038,$ 250000,,26916,,173.0,105.0
348,tt0014624,La donna di Parigi,A Woman of Paris: A Drama of Fate,1923,1927-06-06,"Drama, Romance",82,USA,"None, English",Charles Chaplin,Charles Chaplin,Charles Chaplin Productions,"Edna Purviance, Clarence Geldart, Carl Miller,...",A kept woman runs into her former fiancé and ...,7.0,4735,$ 351000,,11233,,37.0,24.0
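
The income column is still text at this point. One way to get from the raw value to an integer, while rejecting other currencies in the same step, is a small parser. This is a sketch under the assumption that amounts always look like `<currency> <digits>`, as in the rows shown here; `parse_usd` and `AMOUNT_PATTERN` are hypothetical names.

```python
import re

# "<currency> <digits>", e.g. "$ 8811" or "NOK 250000".
AMOUNT_PATTERN = re.compile(r"^\s*(\S+)\s+(\d+)\s*$")


def parse_usd(value):
    """Return the amount as an int if value is a dollar figure, else None."""
    if not isinstance(value, str):
        return None  # covers NaN / missing values
    match = AMOUNT_PATTERN.match(value)
    if match is None or match.group(1) != "$":
        return None
    return int(match.group(2))
```

Applied to the unmodified column with `Series.map(parse_usd)`, this would leave missing values wherever the amount is absent or not in dollars.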


#### For simplicity we want to aggregate some of the categories within the genre column, but first we want to check how many unique values it has

In [23]:
# Checking how many records we have for each genre category


dict(kaggle_movie_dataset2.genre.value_counts())

{'Drama': 1084,
 'Comedy': 797,
 'Comedy, Drama': 595,
 'Comedy, Drama, Romance': 491,
 'Drama, Romance': 427,
 'Comedy, Romance': 396,
 'Action, Crime, Drama': 313,
 'Crime, Drama, Thriller': 226,
 'Animation, Adventure, Comedy': 219,
 'Crime, Drama, Mystery': 181,
 'Drama, Thriller': 176,
 'Horror, Thriller': 175,
 'Action, Adventure, Comedy': 170,
 'Action, Comedy, Crime': 169,
 'Crime, Drama': 158,
 'Action, Adventure, Sci-Fi': 148,
 'Action, Crime, Thriller': 142,
 'Horror': 139,
 'Action, Adventure, Drama': 136,
 'Biography, Drama, History': 135,
 'Action, Thriller': 133,
 'Horror, Mystery, Thriller': 131,
 'Comedy, Crime, Drama': 131,
 'Biography, Drama': 120,
 'Action, Adventure, Fantasy': 113,
 'Comedy, Crime': 106,
 'Drama, Mystery, Thriller': 96,
 'Action, Drama, Thriller': 93,
 'Drama, War': 81,
 'Comedy, Drama, Family': 80,
 'Adventure, Comedy, Family': 78,
 'Biography, Crime, Drama': 77,
 'Action, Comedy': 75,
 'Animation, Action, Adventure': 74,
 'Thriller': 71,
 'Action

In [24]:
genre_list = ['Drama', 'Comedy', 'Action', 'Crime', 'Horror', 'Adventure', 'Biography', 'Thriller', 'Fantasy', 'Animation']

###### movie_functions.py #####

In [25]:
# This function aggregates categories within a column given a list of categories

def categoryAggr(category_list, dataset, ref_column, new_column):
    for cat in category_list:
        # Rows whose ref_column starts with this category get it as their main category
        dataset.loc[dataset[ref_column].str.startswith(cat), new_column] = cat
        print(f"{cat} done")
    return dataset
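
Because the function matches on the start of the genre string, it effectively labels each movie by the first genre listed. A vectorized sketch of the same idea, on made-up data, is shown below; `toy` and `main_genres` are illustrative assumptions. Unlike `startswith`, it compares the whole first entry, so a genre that merely begins with a listed name would not match.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated layout.
toy = pd.DataFrame(
    {"genre": ["Comedy, Drama", "Fantasy, Horror, Mystery", "Western", "Drama"]}
)
main_genres = ["Drama", "Comedy", "Fantasy"]

# The first comma-separated entry is the leading genre.
first_genre = toy["genre"].str.split(",").str[0].str.strip()

# Keep it when it is one of the chosen genres, otherwise label it "Other".
toy["genre_main"] = first_genre.where(first_genre.isin(main_genres), "Other")
```

This also folds the later "Other" assignment into the same step.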

In [26]:
ref_column = "genre"
new_column = "genre_main"

kaggle_movie_dataset2 = categoryAggr(genre_list, kaggle_movie_dataset2, ref_column, new_column)

Drama done
Comedy done
Action done
Crime done
Horror done
Adventure done
Biography done
Thriller done
Fantasy done
Animation done


In [27]:
# kaggle_movie_dataset2 = mf.categoryAggr(genre_list, kaggle_movie_dataset2, ref_column, new_column)

In [28]:
kaggle_movie_dataset2.loc[~kaggle_movie_dataset2["genre_main"].isin(genre_list), "genre_main"] = "Other"

In [29]:
dict(kaggle_movie_dataset2.genre_main.value_counts())

{'Comedy': 3557,
 'Drama': 2961,
 'Action': 2584,
 'Crime': 879,
 'Adventure': 648,
 'Biography': 637,
 'Horror': 567,
 'Animation': 554,
 'Other': 225,
 'Thriller': 77,
 'Fantasy': 72}

In [30]:
kaggle_movie_dataset2.head(2)

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,genre_main
165,tt0010323,Il gabinetto del dottor Caligari,Das Cabinet des Dr. Caligari,1920,1920-02-27,"Fantasy, Horror, Mystery",76,Germany,German,Robert Wiene,"Carl Mayer, Hans Janowitz",Decla-Bioscop AG,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...","Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.1,55601,$ 18000,$ 8811,8811,,237.0,160.0,Fantasy
210,tt0011440,Markens grøde,Markens grøde,1921,1921-12-02,Drama,107,Norway,,Gunnar Sommerfeldt,"Knut Hamsun, Gunnar Sommerfeldt",Christiana Film,"Amund Rydland, Karen Poulsen, Ragna Wettergree...",After the Nobel prize winning Knut Hamsun-nove...,6.6,195,NOK 250000,,4272,,3.0,3.0,Drama


In [31]:
# Let's also change the genre_main data type to category
kaggle_movie_dataset2["genre_main"] = kaggle_movie_dataset2["genre_main"].astype("category")

#### For simplicity, we are going to take out 'Other' category from genre_main

In [32]:
kaggle_movie_dataset2 = kaggle_movie_dataset2[(kaggle_movie_dataset2["genre_main"] != "Other")]
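
One detail of categorical columns: filtering rows out does not remove the category itself, so "Other" can still show up later with a count of 0. A minimal sketch on a made-up series, showing how `cat.remove_unused_categories()` drops the empty level:

```python
import pandas as pd

# Made-up categorical series standing in for genre_main.
genre_main = pd.Series(["Drama", "Other", "Comedy", "Drama"], dtype="category")

# Filtering rows leaves "Other" registered as a category with zero members.
kept = genre_main[genre_main != "Other"]
counts_before = kept.value_counts().to_dict()

# Dropping unused categories removes the empty level.
kept = kept.cat.remove_unused_categories()
counts_after = kept.value_counts().to_dict()
```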

#### Drop unneeded columns

In [33]:
selected_columns = ['imdb_title_id', 'title', 'original_title', 'year', 'genre', 'genre_main', 'duration', 'country', 'language', 'director', 'writer', 'actors','budget', 'worlwide_gross_income']

In [34]:
kaggle_movie_dataset_final = kaggle_movie_dataset2[kaggle_movie_dataset2.columns.intersection(selected_columns)]

In [35]:
# Rearrange columns
kaggle_movie_dataset_final = kaggle_movie_dataset_final[['imdb_title_id', 'title', 'original_title', 'year', 'genre', 'genre_main', 'duration', 'country', 'language', 'director', 'writer', 'actors', 'budget','worlwide_gross_income']]

#### In order to make a fair comparison of gross income between movies, we are going to select only movies from 2000 onwards so that inflation doesn't affect the comparison much

In [36]:
# Turn year values into integers

kaggle_movie_dataset_final["year"] = kaggle_movie_dataset_final["year"].astype("int")
kaggle_movie_dataset_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12536 entries, 165 to 85847
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   imdb_title_id          12536 non-null  object  
 1   title                  12536 non-null  object  
 2   original_title         12536 non-null  object  
 3   year                   12536 non-null  int64   
 4   genre                  12536 non-null  object  
 5   genre_main             12536 non-null  category
 6   duration               12536 non-null  int64   
 7   country                12536 non-null  object  
 8   language               12502 non-null  object  
 9   director               12534 non-null  object  
 10  writer                 12471 non-null  object  
 11  actors                 12534 non-null  object  
 12  budget                 12536 non-null  object  
 13  worlwide_gross_income  12536 non-null  object  
dtypes: category(1), int64(2), object(11)
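
The `year` column starts out as text, and `astype("int")` raises an error if any value is not a clean number. A more defensive sketch, assuming some entries might carry extra text (the values below are made up for illustration), is to coerce with `pd.to_numeric` and inspect what failed to parse:

```python
import pandas as pd

# Made-up year values, one of them not a plain number.
years = pd.Series(["1999", "2004", "TV Movie 2019", "2012"])

# Non-numeric entries become NaN instead of raising, so they can be inspected.
numeric_years = pd.to_numeric(years, errors="coerce")
unparsed = years[numeric_years.isna()].tolist()

# Keep years from 2000 onwards; NaN compares as False and is dropped.
recent = numeric_years[numeric_years >= 2000].astype(int).tolist()
```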

In [37]:
# Filter out movies released before 2000
kaggle_movie_dataset_final = kaggle_movie_dataset_final[(kaggle_movie_dataset_final["year"]>= 2000)]
kaggle_movie_dataset_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9510 entries, 4334 to 85847
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   imdb_title_id          9510 non-null   object  
 1   title                  9510 non-null   object  
 2   original_title         9510 non-null   object  
 3   year                   9510 non-null   int64   
 4   genre                  9510 non-null   object  
 5   genre_main             9510 non-null   category
 6   duration               9510 non-null   int64   
 7   country                9510 non-null   object  
 8   language               9478 non-null   object  
 9   director               9508 non-null   object  
 10  writer                 9451 non-null   object  
 11  actors                 9508 non-null   object  
 12  budget                 9510 non-null   object  
 13  worlwide_gross_income  9510 non-null   object  
dtypes: category(1), int64(2), object(11)

In [38]:
kaggle_movie_dataset_final.shape

(9510, 14)

#### We need a subset equally weighted

In [39]:
movie_dataset_final_sample = kaggle_movie_dataset_final.sample(n = 500, weights = kaggle_movie_dataset_final.groupby("genre_main")["genre_main"].transform('count'))

In [40]:
dict(movie_dataset_final_sample.genre_main.value_counts())

{'Comedy': 189,
 'Drama': 170,
 'Action': 105,
 'Crime': 10,
 'Biography': 9,
 'Animation': 7,
 'Horror': 5,
 'Adventure': 4,
 'Thriller': 1,
 'Fantasy': 0,
 'Other': 0}
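
The counts above are dominated by Comedy, Drama and Action. That follows from the weights used: each row is weighted by the size of its genre, so large genres are favoured even more than in an unweighted sample. If the goal is a sample in which every genre has the same chance of being drawn, one option is to weight each row by the inverse of its genre size. A minimal sketch on a made-up frame (`toy` and `random_state` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Made-up, deliberately unbalanced genre column.
toy = pd.DataFrame({"genre_main": ["Comedy"] * 6 + ["Drama"] * 3 + ["Fantasy"] * 1})

# Each row is weighted by 1 / (size of its genre), so every genre carries
# the same total weight regardless of how many movies it has.
genre_size = toy.groupby("genre_main")["genre_main"].transform("count")
weights = 1.0 / genre_size
weight_per_genre = weights.groupby(toy["genre_main"]).sum()

# random_state is fixed only to make the sketch repeatable.
balanced_sample = toy.sample(n=4, weights=weights, random_state=0)
```

Since sampling is without replacement, small genres can still run out of rows, so the result is only approximately balanced.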

#### Attach the rating column from kaggle_movie_ratings to the kaggle_movie_dataset_final dataframe, only for movies present in the latter

In [41]:
# Drop columns not needed from kaggle_movie_ratings
selected_columns_ratings = ['imdb_title_id', 'weighted_average_vote']
kaggle_movie_ratings2 = kaggle_movie_ratings[kaggle_movie_ratings.columns.intersection(selected_columns_ratings)]
kaggle_movie_ratings2.head()

Unnamed: 0,imdb_title_id,weighted_average_vote
0,tt0000009,5.9
1,tt0000574,6.1
2,tt0001892,5.8
3,tt0002101,5.2
4,tt0002130,7.0


In [42]:
# Pandas Excel Vlookup :)

kaggle_movie_dataset_final = kaggle_movie_dataset_final.merge(kaggle_movie_ratings2, on='imdb_title_id')
kaggle_movie_dataset_final.shape

(9510, 15)
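
The row count is unchanged here, so every movie found a rating. When that is not guaranteed, `merge` has options that make a lookup-style join safer to check. A sketch on made-up frames (`movies` and `ratings` are illustrative assumptions):

```python
import pandas as pd

# Made-up frames; "tt03" deliberately has no rating.
movies = pd.DataFrame({"imdb_title_id": ["tt01", "tt02", "tt03"], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"imdb_title_id": ["tt01", "tt02"], "weighted_average_vote": [5.9, 6.1]})

# how="left" keeps every movie; validate raises if an id is duplicated on
# either side; indicator adds a "_merge" column saying where each row was found.
merged = movies.merge(
    ratings, on="imdb_title_id", how="left", validate="one_to_one", indicator=True
)
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "imdb_title_id"].tolist()
```

With the default inner join used above, unmatched movies would instead be dropped silently.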

#### Export to CSV

In [43]:
# Complete dataset
kaggle_movie_dataset_final.to_csv('kaggle_movie_dataset_final.csv', sep =',',index = False)

In [44]:
# Sample of 500 (rapidApi limit)
movie_dataset_final_sample.to_csv('movie_dataset_final_sample.csv', sep =',',index = False)