# 02 - Data Wrangling

## 2.1 Contents:<a id='2.1_Contents'></a>
* [2.2_Objective:](#2.2_Objective:)
* [2.3_Imports:](#2.3_Imports:)
* [2.4_Load_Movie_Data:](#2.4_Load_Movie_Data:)
* [2.5_Explore_Data:](#2.5_Explore_Data:)
* [2.6_Clean_Data:](#2.6_Clean_Data:)
     * [2.6.1_Rename_Columns:](#2.6.1_Rename_Columns:)
     * [2.6.2_Drop_Duplicate_rows:](#2.6.2_Drop_Duplicate_rows:)
     * [2.6.3_Change_column_dtypes:](#2.6.3_Change_column_dtypes:)
     * [2.6.4_Converting_list_of_dict_to_list_or_string:](#2.6.4_Converting_list_of_dict_to_list_or_string:)
     * [2.6.5_Summary_Statistics:](#2.6.5_Summary_Statistics:)
     
* [2.7_Joining_and_Saving_Clean_Dataset:](#2.7_Joining_and_Saving_Clean_Dataset:)
* [2.8_Conclusion:](#2.8_Conclusion:)



## 2.2 Objective:

The goal of this step is to collect, explore, and clean the dataset, and then to identify the key columns we can use later for the following three movie recommender systems.


i.   Simple recommender: non-personalized recommendations based on general popularity.

ii.  Content-based recommender: personalized recommendations based on similarities in plot
     overview, genre, and other metadata.

iii. Collaborative filtering: personalized recommendations based on the behavior and
     preferences of similar users.

A potential concern for later steps is the size of the dataset and the difficulty of running ML algorithms on it with limited computing power.

## 2.3 Imports:

In [1]:
#Let's get our imports
import pandas as pd
from ast import literal_eval
import numpy as np


## 2.4 Load Movie Data:

In [2]:
# Load 'The Movies Dataset'
df_keywords = pd.read_csv('../src/data/The Movies Dataset/Keywords.csv')
df_links = pd.read_csv('../src/data/The Movies Dataset/links.csv')
df_metadata = pd.read_csv('../src/data/The Movies Dataset/movies_metadata.csv', dtype= {'popularity': 'object'})
df_ratings = pd.read_csv('../src/data/The Movies Dataset/ratings.csv')
df_credits = pd.read_csv('../src/data/The Movies Dataset/credits.csv')

## 2.5 Explore Data:

In [3]:
#Keyword dataset - can be used for content based filtering. 
df_keywords.info()
df_keywords.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [4]:
#The links dataset can be used in future projects to match records between the IMDb and TMDB databases.
df_links.info()
df_links.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  45843 non-null  int64  
 1   imdbId   45843 non-null  int64  
 2   tmdbId   45624 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.0 MB


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [5]:
#The metadata dataset is the main table used for this project.
df_metadata.info()
df_metadata.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [6]:
#The ratings dataset holds the rating each user gave to each movie they watched. It is the essential dataset for collaborative filtering.
df_ratings.info()
df_ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 794.2 MB


Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
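
With about 26 million rows, the ratings table dominates memory use (794.2 MB in the output above). The following is a minimal sketch, assuming the ID columns fit in 32-bit integers and the ratings in 32-bit floats, of how the `usecols` and `dtype` arguments of `pd.read_csv` could shrink the frame at load time. The inline CSV text is a stand-in for `ratings.csv` and reuses rows from the preview above.

```python
import io

import pandas as pd

# Stand-in for ratings.csv: a few rows copied from the preview above.
csv_text = (
    "userId,movieId,rating,timestamp\n"
    "1,110,1.0,1425941529\n"
    "1,147,4.5,1425942435\n"
    "1,858,5.0,1425941523\n"
)

# Default load: every column becomes a 64-bit type.
default = pd.read_csv(io.StringIO(csv_text))

# Compact load: skip 'timestamp' and request 32-bit types (an assumption
# about the value ranges, not something verified on the full file).
compact = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["userId", "movieId", "rating"],
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

default_bytes = default.memory_usage(index=False).sum()
compact_bytes = compact.memory_usage(index=False).sum()
print(default_bytes, compact_bytes)
```

On this toy sample the compact frame needs well under half the bytes of the default one; the saving on the real file would have to be measured.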


In [7]:
df_credits.info()
df_credits.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [8]:
#checking to see if 'id', 'movieId', and 'tmdbId' in all these datasets equal each other. 
df_keywords[df_keywords['id'] == 862]

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."


In [9]:
df_links[df_links['tmdbId'] == 862].T

Unnamed: 0,0
movieId,1.0
imdbId,114709.0
tmdbId,862.0


In [10]:
df_metadata[df_metadata['id'] == '862'].T

Unnamed: 0,0
adult,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,30000000
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
homepage,http://toystory.disney.com/toy-story
id,862
imdb_id,tt0114709
original_language,en
original_title,Toy Story
overview,"Led by Woody, Andy's toys live happily in his ..."


In [11]:
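# Note: df_links pairs movieId 1 with tmdbId 862, so 'movieId' in df_ratings appears to be
# the MovieLens id; movieId 862 here may not be the same film as TMDB id 862.
df_ratings[df_ratings['movieId'] == 862].T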

Unnamed: 0,184624,200490,524001,610887,643892,692440,712546,842326,913951,949383,...,25338551,25541777,25583989,25636974,25704139,25788364,25798428,25814740,25980015,26001292
userId,1923.0,2103.0,5380.0,6177.0,6525.0,7050.0,7238.0,8659.0,9328.0,9682.0,...,263809.0,265840.0,266243.0,266783.0,267543.0,268336.0,268391.0,268568.0,270422.0,270654.0
movieId,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0,...,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0,862.0
rating,3.0,5.0,1.0,4.0,4.0,3.0,3.0,4.0,4.0,4.0,...,3.5,2.0,4.0,3.0,4.0,4.0,3.0,2.0,4.0,4.0
timestamp,858335006.0,946044912.0,878941641.0,859415226.0,857388995.0,951328483.0,988054686.0,997143296.0,1037486000.0,949005840.0,...,1461371000.0,945065463.0,974730526.0,1145249000.0,945299890.0,955427492.0,856529977.0,943826734.0,941664133.0,976204090.0
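
The links preview pairs `movieId` 1 with `tmdbId` 862, which suggests that ratings may be keyed by a different ID than the metadata, keywords, and credits tables. The following is a minimal sketch, assuming that is the case, of how ratings could be mapped to TMDB IDs through the links table before comparing. The link rows are copied from the preview; the user IDs and ratings are made up for illustration, and `ratings_tmdb` is a hypothetical name.

```python
import pandas as pd

# Link rows copied from the df_links preview above.
links = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "imdbId": [114709, 113497, 113228],
        "tmdbId": [862.0, 8844.0, 15602.0],
    }
)

# Made-up ratings, keyed by the same movieId as the links table.
ratings = pd.DataFrame(
    {
        "userId": [10, 11, 12],
        "movieId": [1, 1, 3],
        "rating": [4.0, 5.0, 3.5],
    }
)

# Attach the TMDB id to every rating, then filter on it.
ratings_tmdb = ratings.merge(links[["movieId", "tmdbId"]], on="movieId", how="left")
toy_story_ratings = ratings_tmdb[ratings_tmdb["tmdbId"] == 862]
print(toy_story_ratings)
```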


## 2.6 Clean Data:

### 2.6.1 Rename Columns:

In [12]:
# Rename id and tmdbId in all df to movieId so it's easier to join datasets. 
df_keywords.columns = ['movieId', 'keyword']
df_keywords.info()
df_keywords.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  46419 non-null  int64 
 1   keyword  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


Unnamed: 0,movieId,keyword
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [13]:
print(df_keywords[df_keywords['movieId'] == 862]['keyword'][0])

[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]


In [14]:
#Drop the original 'movieId' column from df_links, then rename 'tmdbId' to 'movieId'.
df_links = df_links.drop(columns = ['movieId'])
df_links.columns = ['imdbId', 'movieId']
df_links.info()
df_links.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   imdbId   45843 non-null  int64  
 1   movieId  45624 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 716.4 KB


Unnamed: 0,imdbId,movieId
0,114709,862.0
1,113497,8844.0
2,113228,15602.0
3,114885,31357.0
4,113041,11862.0
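
Assigning a list to `.columns` depends on column order, so a reordered input file would silently mislabel the data. As a hedged alternative, the same result can be reached by naming each column explicitly with `rename`. The frame below is a toy copy of the first two link rows, and `df_links_demo` is a hypothetical name.

```python
import pandas as pd

# Toy copy of the first two rows of the links table.
df_links_demo = pd.DataFrame(
    {
        "movieId": [1, 2],
        "imdbId": [114709, 113497],
        "tmdbId": [862.0, 8844.0],
    }
)

# Drop the original movieId, then rename tmdbId by name rather than by position.
renamed = (
    df_links_demo
    .drop(columns=["movieId"])
    .rename(columns={"tmdbId": "movieId"})
)
print(renamed)
```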


In [15]:
#Delete columns with irrelevant data and rename 'id' to 'movieId' in the df_metadata dataframe.
df_metadata=df_metadata.drop(columns = ['homepage', 'poster_path', 'video'])
df_metadata=df_metadata.rename(columns = {'id':'movieId'})
df_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   movieId                45466 non-null  object 
 5   imdb_id                45449 non-null  object 
 6   original_language      45455 non-null  object 
 7   original_title         45466 non-null  object 
 8   overview               44512 non-null  object 
 9   popularity             45461 non-null  object 
 10  production_companies   45463 non-null  object 
 11  production_countries   45463 non-null  object 
 12  release_date           45379 non-null  object 
 13  revenue                45460 non-null  float64
 14  runtime                45203 non-null  float64
 15  sp

In [16]:
#Renaming columns in df_credits.
df_credits = df_credits.rename(columns = {'id':'movieId'})
df_credits.info()
df_credits.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   cast     45476 non-null  object
 1   crew     45476 non-null  object
 2   movieId  45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


Unnamed: 0,cast,crew,movieId
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


### 2.6.2 Drop Duplicate rows:

In [17]:
df_duplicate_1 = df_metadata.duplicated('movieId')
df_metadata[df_duplicate_1]

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,imdb_id,original_language,original_title,overview,popularity,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
1465,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",105045,tt0111613,de,Das Versprechen,"East-Berlin, 1961, shortly after the erection ...",0.122178,...,"[{'iso_3166_1': 'DE', 'name': 'Germany'}]",1995-02-16,0.0,115.0,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Released,"A love, a hope, a wall.",The Promise,5.0,1.0
9165,False,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",5511,tt0062229,fr,Le Samouraï,Hitman Jef Costello is a perfectionist who alw...,9.091288,...,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1967-10-25,39481.0,105.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
9327,False,,0,"[{'id': 12, 'name': 'Adventure'}, {'id': 16, '...",23305,tt0295682,en,The Warrior,"In feudal India, a warrior (Khan) who renounce...",1.967992,...,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2001-09-23,0.0,86.0,"[{'iso_639_1': 'hi', 'name': 'हिन्दी'}]",Released,,The Warrior,6.3,15.0
12066,False,,1600000,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",14788,tt0454792,en,Bubble,Set against the backdrop of a decaying Midwest...,3.008299,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2005-09-03,0.0,73.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Bubble,6.4,36.0
13375,False,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 9648, ...",141971,tt1180333,fi,Blackout,Recovering from a nail gun shot to the head an...,0.411949,...,"[{'iso_3166_1': 'FI', 'name': 'Finland'}]",2008-12-26,0.0,108.0,"[{'iso_639_1': 'fi', 'name': 'suomi'}]",Released,Which one is the first to return - memory or t...,Blackout,6.7,3.0
15074,False,,4,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",22649,tt0022879,en,A Farewell to Arms,British nurse Catherine Barkley (Helen Hayes) ...,2.411191,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1932-12-08,25.0,89.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Every woman who has loved will understand,A Farewell to Arms,6.2,29.0
15765,False,,2500,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",13209,tt0499537,fa,Offside,"Since women are banned from soccer matches, Ir...",1.529879,...,"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",2006-05-26,0.0,93.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,,Offside,6.7,27.0
16764,False,,0,"[{'id': 53, 'name': 'Thriller'}, {'id': 9648, ...",141971,tt1180333,fi,Blackout,Recovering from a nail gun shot to the head an...,0.411949,...,"[{'iso_3166_1': 'FI', 'name': 'Finland'}]",2008-12-26,0.0,108.0,"[{'iso_639_1': 'fi', 'name': 'suomi'}]",Released,Which one is the first to return - memory or t...,Blackout,6.7,3.0
20843,False,,40000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",77221,tt1701210,en,Black Gold,"On the Arabian Peninsula in the 1930s, two war...",6.475665,...,"[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",2011-12-21,5446000.0,130.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Black Gold,5.9,77.0
20899,False,,0,"[{'id': 18, 'name': 'Drama'}]",109962,tt0082992,en,Rich and Famous,Two literary women compete for 20 years: one w...,10.396878,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1981-09-23,0.0,115.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"From the very beginning, they knew they'd be f...",Rich and Famous,4.9,7.0


In [18]:
df_duplicate_2 = df_links.duplicated('movieId')
df_links[df_duplicate_2]

Unnamed: 0,imdbId,movieId
598,115978,
708,118114,
709,114103,
718,125877,
756,116992,
...,...,...
40391,2818654,298721.0
40629,127834,97995.0
45197,235679,10991.0
45202,287635,12600.0


In [19]:
df_duplicate_3 = df_keywords.duplicated('movieId')
df_keywords[df_duplicate_3]
df_keywords[df_keywords['movieId'] == 12600]

Unnamed: 0,movieId,keyword
5535,12600,"[{'id': 9663, 'name': 'sequel'}, {'id': 11451,..."
45779,12600,"[{'id': 9663, 'name': 'sequel'}, {'id': 11451,..."


In [20]:
df_duplicate_4 = df_credits.duplicated('movieId')
df_credits[df_duplicate_4]
df_credits[df_credits['movieId']==5511]

Unnamed: 0,cast,crew,movieId
7345,"[{'cast_id': 11, 'character': 'Jef Costello', ...","[{'credit_id': '52fe440ac3a36847f807ee01', 'de...",5511
9165,"[{'cast_id': 11, 'character': 'Jef Costello', ...","[{'credit_id': '52fe440ac3a36847f807ee01', 'de...",5511


In [21]:
# Drop duplicate movieId rows in df_metadata, df_links, df_keywords, and df_credits.
df_metadata = df_metadata.drop_duplicates('movieId')
df_links = df_links.drop_duplicates('movieId')
df_keywords = df_keywords.drop_duplicates('movieId')
df_credits = df_credits.drop_duplicates('movieId')
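
By default `duplicated` flags only the second and later copies of a key, and `drop_duplicates` keeps the first copy. The sketch below uses a toy frame (IDs and titles borrowed from the duplicate listing above) to show how `keep=False` surfaces every copy, so the copies can be compared before any are dropped.

```python
import pandas as pd

# Toy frame with two duplicated movieIds.
demo = pd.DataFrame(
    {
        "movieId": [5511, 862, 5511, 141971, 141971],
        "title": ["Le Samouraï", "Toy Story", "Le Samouraï", "Blackout", "Blackout"],
    }
)

# keep=False marks every row whose movieId occurs more than once.
all_copies = demo[demo.duplicated("movieId", keep=False)]

# Rows that are identical across all columns, not just on the key.
fully_identical = int(demo.duplicated(keep=False).sum())

# Same call as in the cell above: the first row per movieId is kept.
deduped = demo.drop_duplicates("movieId")
print(len(all_copies), fully_identical, len(deduped))
```

If `fully_identical` is smaller than the number of rows in `all_copies`, some copies differ in other columns and may deserve a closer look before dropping.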

### 2.6.3 Change column dtypes:

In [22]:
#Drop rows with a missing 'movieId', then change its dtype from float64 to int64.
df_links = df_links.dropna()
df_links['movieId']= df_links['movieId'].astype('int64')
df_links.info()
df_links.head()


<class 'pandas.core.frame.DataFrame'>
Index: 45594 entries, 0 to 45842
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   imdbId   45594 non-null  int64
 1   movieId  45594 non-null  int64
dtypes: int64(2)
memory usage: 1.0 MB


Unnamed: 0,imdbId,movieId
0,114709,862
1,113497,8844
2,113228,15602
3,114885,31357
4,113041,11862
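
The cell above drops the rows without an ID so that the column can become `int64`. If those rows ever need to be kept, one option is pandas' nullable integer dtype `'Int64'` (capital I), which stores whole numbers alongside missing values. This is a sketch of that alternative on a toy series, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy version of the float 'movieId' column, including one missing value.
tmdb_ids = pd.Series([862.0, 8844.0, np.nan], name="movieId")

# The nullable integer dtype keeps the missing entry as <NA> instead of forcing floats.
nullable_ids = tmdb_ids.astype("Int64")
missing_count = int(nullable_ids.isna().sum())
print(nullable_ids.dtype, missing_count)
```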


In [23]:
#df_metadata dataset cleaning.
#Start with removing null rows from vote_count
df_metadata[pd.isnull(df_metadata['vote_count'])]
df_metadata = df_metadata.dropna(subset=['vote_count'])

In [24]:
#Delete all Movies/Entries that have not been Released
df_metadata = df_metadata[df_metadata['status'] == 'Released']

In [25]:
#change 'release_date' dtype from object to datetime.
df_metadata['release_date'] = pd.to_datetime(df_metadata['release_date'], errors='coerce')

In [26]:
#change 'popularity' dtype from object to float64
df_metadata[pd.isnull(df_metadata['popularity'])]
df_metadata['popularity'] = df_metadata['popularity'].astype('float64')

In [27]:
#change 'movieId' dtype from object to int64
df_metadata[pd.isnull(df_metadata['movieId'])]
df_metadata['movieId'] = df_metadata['movieId'].astype('int64')
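
`astype('int64')` raises an error as soon as one value cannot be parsed. The following is a minimal sketch, on made-up values, of how `pd.to_numeric(..., errors='coerce')` could be used to list any malformed entries before converting; `'not-an-id'` is a hypothetical bad value, not one taken from the dataset.

```python
import pandas as pd

# Made-up ID column stored as strings, with one unparseable entry.
raw_ids = pd.Series(["862", "8844", "not-an-id"])

# Unparseable values become NaN instead of raising an error.
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

# Inspect the offending rows, then convert the remaining ones.
bad_rows = raw_ids[numeric_ids.isna()]
clean_ids = numeric_ids.dropna().astype("int64")
print(bad_rows.tolist(), clean_ids.tolist())
```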

In [28]:
#change 'budget' & 'revenue' dtype from object/float64 to int64
df_metadata[pd.isnull(df_metadata['budget'])]
df_metadata['budget'] = df_metadata['budget'].astype('int64')
df_metadata['revenue'] = df_metadata['revenue'].astype('int64')
#change 'adult' dtype from object to bool
df_metadata['adult'] = df_metadata['adult'].replace({'True': True, 'False': False})

In [29]:
#merge links data to obtain correct IMDB_ID
df_metadata = pd.merge(df_metadata,df_links, how = 'left', on = 'movieId' )
df_metadata = df_metadata.drop(columns = ['imdb_id'])
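
A left merge keeps every metadata row, so a movie without a match in the links table ends up with a missing `imdbId`. The sketch below, on toy frames, shows how the `validate` and `indicator` arguments of `merge` could confirm that keys are unique on both sides and list the unmatched rows. The third movie is hypothetical.

```python
import pandas as pd

# Toy metadata: the last row has no counterpart in the links table.
meta = pd.DataFrame(
    {
        "movieId": [862, 8844, 99999],
        "title": ["Toy Story", "Jumanji", "Hypothetical Film"],
    }
)
links = pd.DataFrame({"imdbId": [114709, 113497], "movieId": [862, 8844]})

# validate raises if movieId is duplicated on either side;
# indicator adds a '_merge' column describing where each row came from.
merged = meta.merge(
    links, how="left", on="movieId", validate="one_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["movieId", "title"]])
```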

In [30]:
df_metadata.info()
df_metadata.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44985 entries, 0 to 44984
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   adult                  44985 non-null  bool          
 1   belongs_to_collection  4463 non-null   object        
 2   budget                 44985 non-null  int64         
 3   genres                 44985 non-null  object        
 4   movieId                44985 non-null  int64         
 5   original_language      44975 non-null  object        
 6   original_title         44985 non-null  object        
 7   overview               44065 non-null  object        
 8   popularity             44985 non-null  float64       
 9   production_companies   44985 non-null  object        
 10  production_countries   44985 non-null  object        
 11  release_date           44907 non-null  datetime64[ns]
 12  revenue                44985 non-null  int64         
 13  r

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,imdbId
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0,114709
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,[{'name': 'Twentieth Century Fox Film Corporat...,...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0,114885
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,"[{'name': 'Sandollar Productions', 'id': 5842}...",...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0,113041


In [31]:
#df_ratings dataset cleaning. Remove the timestamp column as it's not relevant for our project.
df_ratings = df_ratings.drop(columns = 'timestamp')
df_ratings.info()
df_ratings.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26024289 entries, 0 to 26024288
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   userId   int64  
 1   movieId  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 595.6 MB


Unnamed: 0,userId,movieId,rating
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0
3,1,1221,5.0
4,1,1246,5.0


### 2.6.4 Converting list of dict to list or string:

In [32]:
### df_metadata data conversion.
# Convert 'belongs_to_collection' from a dict stored as a string to just the collection name.
df_metadata[~pd.isnull(df_metadata['belongs_to_collection'])]
df_metadata['belongs_to_collection'][0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

In [33]:
df_metadata['belongs_to_collection'] = df_metadata['belongs_to_collection'].fillna('[]').apply(literal_eval).apply(lambda x: x.get('name') if isinstance(x, dict) else None)
df_metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,imdbId
0,False,Toy Story Collection,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0,114709
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497
2,False,Grumpy Old Men Collection,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228
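
The one-line conversion above chains `fillna`, `literal_eval`, and a lambda. Unpacked into a named function, the same steps could look like the sketch below. `collection_name` is a hypothetical helper, and the sample string is a shortened version of the Toy Story value shown earlier.

```python
from ast import literal_eval

import pandas as pd


def collection_name(raw):
    """Return the collection name from a dict stored as text, or None."""
    if pd.isna(raw):
        return None  # movie does not belong to a collection
    parsed = literal_eval(raw)  # safely turn the text into a Python object
    return parsed.get("name") if isinstance(parsed, dict) else None


samples = ["{'id': 10194, 'name': 'Toy Story Collection'}", None, float("nan")]
names = [collection_name(value) for value in samples]
print(names)
```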


In [34]:
#Convert 'genres' from list of dict to list. 
df_metadata[pd.isnull(df_metadata['genres'])]
df_metadata['genres'][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [35]:
df_metadata['genres'] = df_metadata['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df_metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,imdbId
0,False,Toy Story Collection,30000000,"[Animation, Comedy, Family]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0,114709
1,False,,65000000,"[Adventure, Fantasy, Family]",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497
2,False,Grumpy Old Men Collection,0,"[Romance, Comedy]",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228


In [36]:
#Convert 'production_companies' & 'production_countries' from list of dict to list. 
df_metadata['production_companies'][0]
df_metadata['production_countries'][3769]

"[{'iso_3166_1': 'AR', 'name': 'Argentina'}, {'iso_3166_1': 'DK', 'name': 'Denmark'}, {'iso_3166_1': 'FI', 'name': 'Finland'}, {'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'IS', 'name': 'Iceland'}, {'iso_3166_1': 'IT', 'name': 'Italy'}, {'iso_3166_1': 'NL', 'name': 'Netherlands'}, {'iso_3166_1': 'NO', 'name': 'Norway'}, {'iso_3166_1': 'SE', 'name': 'Sweden'}, {'iso_3166_1': 'GB', 'name': 'United Kingdom'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]"

In [37]:
df_metadata['production_companies'] = df_metadata['production_companies'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df_metadata['production_countries'] = df_metadata['production_countries'].fillna('[]').apply(literal_eval).apply(lambda x: [i['iso_3166_1'] for i in x] if isinstance(x, list) else [])
df_metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,imdbId
0,False,Toy Story Collection,30000000,"[Animation, Comedy, Family]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,[Pixar Animation Studios],...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0,114709
1,False,,65000000,"[Adventure, Fantasy, Family]",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[TriStar Pictures, Teitler Film, Interscope Co...",...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497
2,False,Grumpy Old Men Collection,0,"[Romance, Comedy]",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[Warner Bros., Lancaster Gate]",...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228


In [38]:
#Convert 'spoken_languages' from list of dict to list
df_metadata['spoken_languages'][0]

"[{'iso_639_1': 'en', 'name': 'English'}]"

In [39]:
df_metadata['spoken_languages'] = df_metadata['spoken_languages'].fillna('[]').apply(literal_eval).apply(lambda x: [i['iso_639_1'] for i in x] if isinstance(x, list) else [])
df_metadata.info()
df_metadata.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44985 entries, 0 to 44984
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   adult                  44985 non-null  bool          
 1   belongs_to_collection  4463 non-null   object        
 2   budget                 44985 non-null  int64         
 3   genres                 44985 non-null  object        
 4   movieId                44985 non-null  int64         
 5   original_language      44975 non-null  object        
 6   original_title         44985 non-null  object        
 7   overview               44065 non-null  object        
 8   popularity             44985 non-null  float64       
 9   production_companies   44985 non-null  object        
 10  production_countries   44985 non-null  object        
 11  release_date           44907 non-null  datetime64[ns]
 12  revenue                44985 non-null  int64         
 13  r

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,imdbId
0,False,Toy Story Collection,30000000,"[Animation, Comedy, Family]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,[Pixar Animation Studios],...,1995-10-30,373554033,81.0,[en],Released,,Toy Story,7.7,5415.0,114709
1,False,,65000000,"[Adventure, Fantasy, Family]",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[TriStar Pictures, Teitler Film, Interscope Co...",...,1995-12-15,262797249,104.0,"[en, fr]",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497
2,False,Grumpy Old Men Collection,0,"[Romance, Comedy]",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[Warner Bros., Lancaster Gate]",...,1995-12-22,0,101.0,[en],Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228
3,False,,16000000,"[Comedy, Drama, Romance]",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,[Twentieth Century Fox Film Corporation],...,1995-12-22,81452156,127.0,[en],Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0,114885
4,False,Father of the Bride Collection,0,[Comedy],11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,"[Sandollar Productions, Touchstone Pictures]",...,1995-02-10,76578911,106.0,[en],Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0,113041
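
The conversions in this section all follow the same pattern: fill missing cells, parse the string with `literal_eval`, and keep one field from each dict. A minimal sketch of that pattern as a reusable helper follows. `extract_field` and the toy series are hypothetical names introduced for illustration, not part of the notebook. Unlike the cell above, the helper returns an empty list for strings that fail to parse instead of raising.

```python
from ast import literal_eval

import pandas as pd


def extract_field(cell, key):
    """Return the values stored under `key` in a stringified list of dicts."""
    # Missing cells (None / NaN) are not strings: treat them as empty.
    if not isinstance(cell, str):
        return []
    try:
        parsed = literal_eval(cell)
    except (ValueError, SyntaxError):
        # Malformed cell: fall back to an empty list (a choice made for this sketch).
        return []
    if not isinstance(parsed, list):
        return []
    return [item[key] for item in parsed if isinstance(item, dict) and key in item]


# Toy data shaped like the 'spoken_languages' column (hypothetical values).
sample = pd.Series([
    "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",
    None,
    "not a list",
])
result = sample.apply(lambda cell: extract_field(cell, 'iso_639_1'))
```

The same helper could be pointed at `'name'` for keywords or `'id'` for cast members, at the cost of silently hiding malformed rows that the original cells would surface as errors.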


In [40]:
# df_keywords conversion: inspect the raw 'keyword' string first
df_keywords['keyword'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [41]:
df_keywords['keyword'] = df_keywords['keyword'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df_keywords.info()
df_keywords.head()

<class 'pandas.core.frame.DataFrame'>
Index: 45432 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  45432 non-null  int64 
 1   keyword  45432 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.0+ MB


Unnamed: 0,movieId,keyword
0,862,"[jealousy, toy, boy, friendship, friends, riva..."
1,8844,"[board game, disappearance, based on children'..."
2,15602,"[fishing, best friend, duringcreditsstinger, o..."
3,31357,"[based on novel, interracial relationship, sin..."
4,11862,"[baby, midlife crisis, confidence, aging, daug..."


In [42]:
# df_credits conversion

In [43]:
df_credits['cast'][1]

"[{'cast_id': 1, 'character': 'Alan Parrish', 'credit_id': '52fe44bfc3a36847f80a7c73', 'gender': 2, 'id': 2157, 'name': 'Robin Williams', 'order': 0, 'profile_path': '/sojtJyIV3lkUeThD7A2oHNm8183.jpg'}, {'cast_id': 8, 'character': 'Samuel Alan Parrish / Van Pelt', 'credit_id': '52fe44bfc3a36847f80a7c99', 'gender': 2, 'id': 8537, 'name': 'Jonathan Hyde', 'order': 1, 'profile_path': '/7il5D76vx6QVRVlpVvBPEC40MBi.jpg'}, {'cast_id': 2, 'character': 'Judy Sheperd', 'credit_id': '52fe44bfc3a36847f80a7c77', 'gender': 1, 'id': 205, 'name': 'Kirsten Dunst', 'order': 2, 'profile_path': '/wBXvh6PJd0IUVNpvatPC1kzuHtm.jpg'}, {'cast_id': 24, 'character': 'Peter Shepherd', 'credit_id': '52fe44c0c3a36847f80a7ce7', 'gender': 0, 'id': 145151, 'name': 'Bradley Pierce', 'order': 3, 'profile_path': '/j6iW0vVA23GQniAPSYI6mi4hiEW.jpg'}, {'cast_id': 10, 'character': 'Sarah Whittle', 'credit_id': '52fe44bfc3a36847f80a7c9d', 'gender': 1, 'id': 5149, 'name': 'Bonnie Hunt', 'order': 4, 'profile_path': '/7spiVQwmr

In [44]:
# Parse 'cast' once, then add 'actor_ids' and 'actor_names' columns extracted from it.
cast_parsed = df_credits['cast'].fillna('[]').apply(literal_eval)
df_credits['actor_ids'] = cast_parsed.apply(lambda x: [i['id'] for i in x] if isinstance(x, list) else [])
df_credits['actor_names'] = cast_parsed.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [45]:
df_credits['crew'][0]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

In [46]:
# Add a 'director' column: the comma-joined names of crew members whose job is 'Director'.
df_credits['director'] = df_credits['crew'].fillna('[]').apply(literal_eval).apply(lambda x: ', '.join([i['name'] for i in x if i['job'] == 'Director']) if isinstance(x, list) else None)
df_credits.info()
df_credits.head()

<class 'pandas.core.frame.DataFrame'>
Index: 45432 entries, 0 to 45475
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   cast         45432 non-null  object
 1   crew         45432 non-null  object
 2   movieId      45432 non-null  int64 
 3   actor_ids    45432 non-null  object
 4   actor_names  45432 non-null  object
 5   director     45432 non-null  object
dtypes: int64(1), object(5)
memory usage: 3.4+ MB


Unnamed: 0,cast,crew,movieId,actor_ids,actor_names,director
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",John Lasseter
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Joe Johnston
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[6837, 3151, 13567, 16757, 589, 16523, 7166]","[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Howard Deutch
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[Whitney Houston, Angela Bassett, Loretta Devi...",Forest Whitaker
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[Steve Martin, Diane Keaton, Martin Short, Kim...",Charles Shyer
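
One thing the `info()` output above may hide: `', '.join(...)` over an empty selection returns an empty string, so a film with no crew member whose job is 'Director' gets `''` rather than a missing value, and still counts as non-null. The sketch below uses a made-up three-row series (`crew_lists` is hypothetical) to show one way to surface such rows. Whether any exist in the real data is not checked here.

```python
import pandas as pd

# Toy crew lists, already parsed (hypothetical rows for illustration).
crew_lists = pd.Series([
    [{'job': 'Director', 'name': 'John Lasseter'},
     {'job': 'Screenplay', 'name': 'Joss Whedon'}],
    [{'job': 'Screenplay', 'name': 'Andrew Stanton'}],
    [],
])

# Same joining logic as the cell above, with .get() so a dict without 'job' does not raise.
directors = crew_lists.apply(
    lambda crew: ', '.join(m['name'] for m in crew if m.get('job') == 'Director')
)

# Empty strings mark rows where no director was found.
missing_mask = directors.eq('')

# Replace those empty strings with a missing value so null counts reflect them.
directors_clean = directors.mask(missing_mask)
```

Running `missing_mask.sum()` on the real `df_credits['director']` column would give the number of films affected.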


### 2.6.5 Summary Statistics:

In [47]:
#df_metadata summary statistics.
df_metadata.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
budget,44985.0,4265096.631811,0.0,0.0,0.0,0.0,380000000.0,17509435.648189
movieId,44985.0,108100.065689,2.0,26225.0,59820.0,157353.0,469172.0,112318.916473
popularity,44985.0,2.938769,0.0,0.390896,1.135373,3.72826,547.488298,6.024704
release_date,44907.0,1992-04-23 19:34:06.389204352,1874-12-09 00:00:00,1978-08-21 00:00:00,2001-08-05 00:00:00,2010-12-04 12:00:00,2017-12-27 00:00:00,
revenue,44985.0,11322295.795888,0.0,0.0,0.0,0.0,2787965087.0,64660000.323766
runtime,44734.0,94.268856,0.0,85.0,95.0,107.0,1256.0,38.365762
vote_average,44985.0,5.623861,0.0,5.0,6.0,6.8,10.0,1.915941
vote_count,44985.0,110.928109,0.0,3.0,10.0,35.0,14075.0,493.782123
imdbId,44985.0,987561.298233,1.0,82764.0,281365.0,1528854.0,7158814.0,1357342.943985
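
In the table above, the 25th, 50th, and 75th percentiles of both 'budget' and 'revenue' are 0, so at least three quarters of the rows hold a zero in each column. The sketch below shows how one might measure that share and, assuming zeros stand for unknown values (an assumption, not something established in this notebook), mask them before computing statistics. The toy frame reuses the five rows shown in the `head()` outputs.

```python
import pandas as pd

# Budget and revenue of the five films shown in the head() outputs above.
toy = pd.DataFrame({
    'budget': [30000000, 65000000, 0, 16000000, 0],
    'revenue': [373554033, 262797249, 0, 81452156, 76578911],
})

# Fraction of rows that are exactly zero in each column.
zero_share = toy.eq(0).mean()

# Assumption: a zero means "not reported", so hide it from the statistics.
toy_masked = toy.mask(toy.eq(0))
median_budget = toy_masked['budget'].median()
```

On the full dataset, `df_metadata[['budget', 'revenue']].eq(0).mean()` would report the exact share of zeros.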


## 2.7 Joining and Saving Clean Dataset:

In [48]:
# joining credits and keyword datasets to metadata
df_md = pd.merge(df_metadata, df_keywords, how = 'inner', on = 'movieId')
df_md = pd.merge(df_md, df_credits[['movieId', 'actor_ids', 'actor_names', 'director']], how='inner', on='movieId')
df_md.info()
df_md.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44984 entries, 0 to 44983
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   adult                  44984 non-null  bool          
 1   belongs_to_collection  4463 non-null   object        
 2   budget                 44984 non-null  int64         
 3   genres                 44984 non-null  object        
 4   movieId                44984 non-null  int64         
 5   original_language      44974 non-null  object        
 6   original_title         44984 non-null  object        
 7   overview               44064 non-null  object        
 8   popularity             44984 non-null  float64       
 9   production_companies   44984 non-null  object        
 10  production_countries   44984 non-null  object        
 11  release_date           44906 non-null  datetime64[ns]
 12  revenue                44984 non-null  int64         
 13  r

Unnamed: 0,adult,belongs_to_collection,budget,genres,movieId,original_language,original_title,overview,popularity,production_companies,...,status,tagline,title,vote_average,vote_count,imdbId,keyword,actor_ids,actor_names,director
0,False,Toy Story Collection,30000000,"[Animation, Comedy, Family]",862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,[Pixar Animation Studios],...,Released,,Toy Story,7.7,5415.0,114709,"[jealousy, toy, boy, friendship, friends, riva...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",John Lasseter
1,False,,65000000,"[Adventure, Fantasy, Family]",8844,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,"[TriStar Pictures, Teitler Film, Interscope Co...",...,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,113497,"[board game, disappearance, based on children'...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Joe Johnston
2,False,Grumpy Old Men Collection,0,"[Romance, Comedy]",15602,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,"[Warner Bros., Lancaster Gate]",...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,113228,"[fishing, best friend, duringcreditsstinger, o...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Howard Deutch
3,False,,16000000,"[Comedy, Drama, Romance]",31357,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,[Twentieth Century Fox Film Corporation],...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0,114885,"[based on novel, interracial relationship, sin...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[Whitney Houston, Angela Bassett, Loretta Devi...",Forest Whitaker
4,False,Father of the Bride Collection,0,[Comedy],11862,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,"[Sandollar Productions, Touchstone Pictures]",...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0,113041,"[baby, midlife crisis, confidence, aging, daug...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[Steve Martin, Diane Keaton, Martin Short, Kim...",Charles Shyer
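
The inner joins above return 44984 rows, one fewer than the 44985 rows in `df_metadata`, which suggests that at least one 'movieId' lacks a match in one of the other tables. A hedged sketch of how such rows could be located: a left join with `indicator=True` labels each row's origin, and `validate='one_to_one'` raises if 'movieId' is duplicated on either side. The toy frames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_metadata and df_keywords (hypothetical rows).
metadata = pd.DataFrame({
    'movieId': [862, 8844, 15602],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
})
keywords = pd.DataFrame({
    'movieId': [862, 8844],
    'keyword': [['toy'], ['board game']],
})

# Left join keeps every metadata row; '_merge' records whether a match was found.
check = metadata.merge(
    keywords,
    how='left',
    on='movieId',
    indicator=True,
    validate='one_to_one',  # raises MergeError if 'movieId' repeats on either side
)

# Rows that an inner join would have dropped.
unmatched = check.loc[check['_merge'] == 'left_only', 'movieId'].tolist()
```

Applied to the real frames, the same check would name the 'movieId' values that disappear in the inner join.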


In [49]:
#Saving df_md and df_ratings.

#datapath = '../src/data/Cleaned'
df_md.to_csv('../src/data/Cleaned/metadata_cleaned.csv')
df_ratings.to_csv('../src/data/Cleaned/ratings_cleaned.csv')
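
Columns such as 'genres', 'keyword', and 'actor_names' hold Python lists, and `to_csv` writes each list as its string representation, so they come back as plain strings when the file is read again. Below is a minimal sketch of the round trip, using an in-memory buffer instead of the paths above. The `converters` argument of `pd.read_csv` is one way to restore the lists on load. `index=False` is a choice made for this sketch (the cell above also writes the index).

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Toy frame with a list-valued column, like 'genres' in df_md.
df = pd.DataFrame({
    'movieId': [862, 15602],
    'genres': [['Animation', 'Comedy', 'Family'], ['Romance', 'Comedy']],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = StringIO()
df.to_csv(buffer, index=False)

# A plain read returns the lists as strings.
buffer.seek(0)
raw = pd.read_csv(buffer)

# Parsing the column while reading restores real Python lists.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'genres': literal_eval})
```

The next notebook would need a converter (or an equivalent `apply(literal_eval)`) for every list-valued column it intends to use as a list.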

## 2.8 Conclusion:

In summary, all datasets collected from the source were first explored to find the key identifier column, 'movieId', that connects them. Next, duplicate and 'null' rows were removed and column types were updated to match their content. The datasets were then cleaned further by extracting key fields such as 'actor_names', 'director', 'genres', and 'keyword'. Finally, the essential datasets were joined to form 'df_md', the final clean dataset for the next steps. 'df_ratings' was kept as a separate dataset, since adding it to 'df_md' would unnecessarily enlarge the file.

Another objective of this step was to identify the key columns for each recommender system.

i.   Simple recommender: we will mainly focus on the columns 'vote_average', 'vote_count', and 'genres'.

ii.  Content-based recommendation: we will use 'director', 'actor_names', 'genres', 'belongs_to_collection', 'vote_average', 'vote_count', and 'keyword'.

iii. Collaborative filtering: we will use the ratings in 'df_ratings' to personalize recommendations based on the behavior and preferences of similar users.
