# Movies Dataset Processing & Analysis

![Movies Image](plugins/assets/movies_image.png)

### Objective:
The aim of this project is to collect historical movie metadata from open data sources. The gathered data is processed and cleaned, stored in a data lake on AWS S3, loaded into Snowflake, and visualized through a dashboard.

The [Kaggle Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data) used here contains metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

## Key Features of the Project:

### Data Collection:
Identify reliable open sources for the movies dataset.
Use Python to fetch the data with libraries such as requests, pandas, or openpyxl.


### Data Processing:
Clean and preprocess the data to ensure it's in a usable format.
Handle missing data, duplicates, and irrelevant columns.
Perform any necessary transformations (e.g., from JSON-like strings to tabular DataFrames).

### Data Storage:
Store the data in a local database (e.g., SQLite) or a cloud-based data warehouse (e.g., Google BigQuery, AWS Redshift) for later use.
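
As a minimal sketch of the local-database option, assuming pandas and Python's built-in `sqlite3` module, a cleaned DataFrame can be written to a table and read back. The table name `genres`, the toy values, and the in-memory database are illustrative choices, not part of the project.

```python
import sqlite3

import pandas as pd

# Toy stand-in for a cleaned table (values are illustrative)
genres_demo = pd.DataFrame(
    {
        "id": [16, 35, 18],
        "name": ["Animation", "Comedy", "Drama"],
        "m_id": [862, 862, 110],
    }
)

# ":memory:" keeps the example self-contained; a file path would persist the data
conn = sqlite3.connect(":memory:")
genres_demo.to_sql("genres", conn, index=False, if_exists="replace")

# Read the rows back to confirm the round trip
loaded = pd.read_sql_query("SELECT id, name, m_id FROM genres ORDER BY id", conn)
conn.close()
```

The same `to_sql` call accepts other database connections, so the sketch is only a starting point for a warehouse-backed setup.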


### Dashboard Development:
Use a Python visualization library (e.g., Plotly, Dash, Matplotlib) to build an interactive dashboard.
The dashboard will allow users to interact with the data, filter it (e.g., by genre or release date), and visualize movie metrics such as ratings, budget, and revenue.


### Experimentation & Analysis:
Experiment with data fetching, transformation, and the integration of APIs.
Explore possible analyses such as rating distributions, budget versus revenue, and comparisons across genres.

### Load Libraries:

In [205]:
# Import all necessary packages
import pandas as pd
import numpy as np
import plugins.utils as utils
import snowflake.connector
from plugins.config import snow_creds, aws_creds
import json
from typing import Optional
import ast

## Data Collection:

In [206]:
# Download the dataset from Kaggle
dataset_name = "rounakbanik/the-movies-dataset" 
download_folder = "./plugins/assets/data/the-movies-dataset"
utils.download_kaggle_dataset(dataset_name, download_folder)

Folder './plugins/assets/data/the-movies-dataset' has been filled.


In [207]:
# Load all of the data into DataFrames
# In the derived tables, the movie identifier column `id` is exposed as `m_id` (movie_id)
# to distinguish it from other ids; it most likely corresponds to tmdbId in the links DataFrame
credits = pd.read_csv("./plugins/assets/data/the-movies-dataset/credits.csv")
keywords = pd.read_csv("./plugins/assets/data/the-movies-dataset/keywords.csv")
links = pd.read_csv("./plugins/assets/data/the-movies-dataset/links.csv")
movies_metadata = pd.read_csv("./plugins/assets/data/the-movies-dataset/movies_metadata.csv")
ratings = pd.read_csv("./plugins/assets/data/the-movies-dataset/ratings.csv")

## Data Processing:

In [208]:
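def safe_parse_collection(x):
    """Parse a stringified Python literal (list or dict); return NaN if parsing fails."""
    # Check containers first: pd.isna() on a list returns an array, not a single bool
    if isinstance(x, (dict, list)):
        return x  # Already parsed, return as is
    
    if pd.isna(x):
        return np.nan
    
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        return np.nan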
    
# AI Generated Code
def normalize_data(
    df: pd.DataFrame, 
    subset_column: str, 
    id_column: str,
    prefix: str
) -> pd.DataFrame:
    """Normalize a column of stringified lists of dicts into one flat row per element."""
    # Create working copy with only necessary columns
    working_df = df[[id_column, subset_column]].copy()
    
    # Parse the nested data element by element (apply runs a Python-level loop)
    working_df[subset_column] = working_df[subset_column].apply(safe_parse_collection)
    
    # Filter valid entries and explode lists
    valid_mask = working_df[subset_column].apply(
        lambda x: isinstance(x, list) and len(x) > 0 and all(isinstance(i, dict) for i in x)
    )
    exploded_df = working_df[valid_mask].explode(subset_column)
    
    if exploded_df.empty:
        return pd.DataFrame()
    
    # Normalize nested dicts in vectorized manner
    normalized = pd.json_normalize(exploded_df[subset_column])
    
    # Add the parent ID as a reference column (e.g. m_id); the dtype is kept as is
    id_col_name = f"{prefix}_{id_column}"
    normalized[id_col_name] = exploded_df[id_column].to_numpy()
    
    return normalized.reset_index(drop=True)

def extract_dict_values(
    df: pd.DataFrame, 
    column_name: str, 
    new_column_prefix: Optional[str] = None
) -> pd.DataFrame:
    """Optimized dictionary expansion using vectorized operations."""
    # Parse and normalize in bulk
    parsed = df[column_name].apply(safe_parse_collection)
    normalized = pd.json_normalize(parsed)
    
    # Apply prefix if specified
    if new_column_prefix:
        normalized = normalized.add_prefix(f"{new_column_prefix}_")
    
    # Attach the expanded columns side by side (assumes df has a default RangeIndex)
    return pd.concat([df, normalized], axis=1)
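
The nested columns in this dataset are stored as Python-style literals with single quotes, which is why the helpers above parse them with `ast.literal_eval`. The short sketch below, using only the standard library and an illustrative string shaped like the `keywords` column, shows that `json.loads` rejects such input while `ast.literal_eval` parses it.

```python
import ast
import json

# A value shaped like the nested columns in this dataset: single-quoted keys
raw = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"

# json.loads rejects single-quoted strings because they are not valid JSON
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval evaluates Python literals only and does not execute code
parsed = ast.literal_eval(raw)
```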

#### Credits Dataframe Processing:

In [209]:
# Take a peek at the current credits data
credits.sample(5)

Unnamed: 0,cast,crew,id
20600,[],"[{'credit_id': '52fe48029251416c9107cec9', 'de...",81225
43997,"[{'cast_id': 7, 'character': '', 'credit_id': ...","[{'credit_id': '52fe4c54c3a368484e1b2a87', 'de...",138724
6001,"[{'cast_id': 5, 'character': 'Phillip Dimitriu...","[{'credit_id': '52fe475cc3a36847f8131625', 'de...",48249
23197,"[{'cast_id': 10, 'character': 'Father James La...","[{'credit_id': '540a85e4c3a368799c000a81', 'de...",157832
17110,[],"[{'credit_id': '52fe4914c3a36847f81882ef', 'de...",56822


In [210]:
c_cast = normalize_data(credits, 'cast', 'id', 'm')
c_crew = normalize_data(credits, 'crew', 'id', 'm')
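
To make the effect of `normalize_data` easier to follow, here is a small sketch of its three core steps on a toy frame. The values are illustrative and the lists are already parsed, so the string-parsing step is skipped.

```python
import pandas as pd

# Toy frame with already-parsed lists of dicts (illustrative values)
toy = pd.DataFrame(
    {
        "id": [862, 8844],
        "keywords": [
            [{"id": 931, "name": "jealousy"}, {"id": 4290, "name": "toy"}],
            [{"id": 10090, "name": "board game"}],
        ],
    }
)

# Step 1: one row per list element, repeating the parent movie id
exploded = toy.explode("keywords")

# Step 2: expand each dict into columns
flat = pd.json_normalize(exploded["keywords"].tolist())

# Step 3: attach the parent id under a prefixed name, as normalize_data does
flat["m_id"] = exploded["id"].to_numpy()
```

Each element of a nested list becomes its own row, and `m_id` links that row back to the movie it came from.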

Process the c_cast DataFrame

In [211]:
c_cast.sample(5)

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order,profile_path,m_id
293717,2,"Gabriel Lecouvreur, 'le Poulpe'",52fe466ec3a368484e090625,2,28463,Jean-Pierre Darroussin,0,/951QpVbYDqI8s0TSQrPyhFZOlHY.jpg,62012
274628,26,Mrs. Pitt,568e1e58c3a36858ec001f37,0,1559281,Alicia Ammon,10,,66488
178429,2,Clara Chevalier,52fe4607c3a36847f80e7f91,1,20882,Lea Massari,0,/gp9dPDL6w9VemMzdVi3al9RxnIg.jpg,42501
242124,56,Denver House Bartender,5741ddc1c3a3685ee600544e,2,121066,Lee Phelps,46,/yBLIKr2IpxLzSiqBt6N9PqGQ7EZ.jpg,43821
265815,12,Macarena,52fe46d2c3a36847f81140f5,0,226727,Ana De los Riscos,9,,45765


In [212]:
c_cast.gender.value_counts()

gender
2    226713
0    223964
1    111797
Name: count, dtype: int64

In [213]:
# c_cast preview
c_cast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562474 entries, 0 to 562473
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   cast_id       562474 non-null  int64 
 1   character     562474 non-null  object
 2   credit_id     562474 non-null  object
 3   gender        562474 non-null  int64 
 4   id            562474 non-null  int64 
 5   name          562474 non-null  object
 6   order         562474 non-null  int64 
 7   profile_path  388618 non-null  object
 8   m_id          562474 non-null  int64 
dtypes: int64(5), object(4)
memory usage: 38.6+ MB


Data cleaning process for c_cast

In [214]:
# Check for duplicated rows
c_cast[c_cast.duplicated(subset=['cast_id', 'credit_id', 'm_id', 'name'], keep=False)].sort_values(by=['cast_id', 'credit_id'])

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order,profile_path,m_id
71117,1,Ash Ketchum,52fe43de9251416c750213f1,1,67830,Veronica Taylor,0,/28EFUb3bPWJaWYzZIxurGCrDpHk.jpg,10991
557033,1,Ash Ketchum,52fe43de9251416c750213f1,1,67830,Veronica Taylor,0,/28EFUb3bPWJaWYzZIxurGCrDpHk.jpg,10991
16653,1,Catherine Barkley,52fe444ac3a368484e01aad5,0,47439,Helen Hayes,0,/6QJDTvIT0v5E9pR1rgPtq59Ej8.jpg,22649
235036,1,Catherine Barkley,52fe444ac3a368484e01aad5,0,47439,Helen Hayes,0,/6QJDTvIT0v5E9pR1rgPtq59Ej8.jpg,22649
132474,1,Lafcadia - Warrior,52fe4465c3a368484e020913,2,76793,Irrfan Khan,0,/9O71WSILj1af9smwuN44nGd198Q.jpg,23305
...,...,...,...,...,...,...,...,...,...
512939,1011,Betty's Hawaiian Maid,52fe49fb9251416c750d9d99,0,1109654,Lola Gonzales,10,,97995
219993,1012,Max's Chef - in Mirror Gag,52fe49fc9251416c750d9d9d,0,1109655,Harry Mann,11,,97995
512940,1012,Max's Chef - in Mirror Gag,52fe49fc9251416c750d9d9d,0,1109655,Harry Mann,11,,97995
219994,1013,The Chimpanzee,52fe49fc9251416c750d9da1,0,1109656,Joe Martin,12,,97995


In [215]:
c_cast.drop_duplicates(subset=['cast_id', 'credit_id'], keep='first', inplace=True)
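
The inspection cell flags duplicates on four columns, while the drop uses only `cast_id` and `credit_id`. The sketch below, on an illustrative toy frame, shows how `keep=False` differs from `keep='first'` and how a narrower `subset` can remove rows that a wider check would not flag. Whether that matters for `c_cast` depends on whether `credit_id` is unique per credit, which is an assumption worth verifying.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "cast_id": [1, 1, 2, 2],
        "credit_id": ["a", "a", "b", "b"],
        "m_id": [10, 10, 20, 30],
    }
)

# keep=False marks every member of a duplicate group, which is handy for inspection
flagged = demo.duplicated(subset=["cast_id", "credit_id", "m_id"], keep=False)

# Dropping on a narrower subset can remove rows the wider check did not flag
narrow = demo.drop_duplicates(subset=["cast_id", "credit_id"], keep="first")
wide = demo.drop_duplicates(subset=["cast_id", "credit_id", "m_id"], keep="first")
```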

Validate cleaning

In [216]:
# Validate cleaning process
c_cast[c_cast.duplicated(subset=['cast_id', 'credit_id', 'm_id', 'name'], keep=False)].sort_values(by=['cast_id', 'credit_id']).count()

cast_id         0
character       0
credit_id       0
gender          0
id              0
name            0
order           0
profile_path    0
m_id            0
dtype: int64

In [217]:
c_cast.profile_path.str.split('.').str[-1].value_counts()

profile_path
jpg    388366
Name: count, dtype: int64

In [218]:
c_cast.isnull().sum()

cast_id              0
character            0
credit_id            0
gender               0
id                   0
name                 0
order                0
profile_path    173678
m_id                 0
dtype: int64

Process the c_crew DataFrame

In [219]:
c_crew.sample(5)

Unnamed: 0,credit_id,department,gender,id,job,name,profile_path,m_id
310246,55494700c3a36841af0006ff,Camera,0,1463729,Grip,Ted Gregg,,135397
186856,555b0d4892514158780001a1,Art,0,1468925,Art Direction,Tracey Baryski,,14643
265956,566a4cc3c3a3682e98002e5b,Art,0,1542581,Set Decoration,Mototsugu Komaki,,143946
422781,52fe46a69251416c9105b38f,Directing,2,69038,Director,Carlo Vanzina,/wd51z5tECoJMS0fSn0GEojiHObp.jpg,38289
154681,555db0379251413b290005e2,Editing,2,310,Editor,Ronald Sanders,,24625


In [220]:
# c_crew preview
c_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464314 entries, 0 to 464313
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   credit_id     464314 non-null  object
 1   department    464314 non-null  object
 2   gender        464314 non-null  int64 
 3   id            464314 non-null  int64 
 4   job           464314 non-null  object
 5   name          464314 non-null  object
 6   profile_path  95098 non-null   object
 7   m_id          464314 non-null  int64 
dtypes: int64(3), object(5)
memory usage: 28.3+ MB


Data cleaning process for c_crew

In [221]:
# Check for duplicated rows
c_crew[c_crew.duplicated(subset=['id', 'credit_id', 'm_id', 'name'], keep=False)].sort_values(by=['id', 'credit_id'])

Unnamed: 0,credit_id,department,gender,id,job,name,profile_path,m_id
259782,55c0d60e9251410f19001cec,Production,1,32,Producer,Robin Wright,/tXfQTgcIEPP7gtVdJ44ZxZPhacn.jpg,152795
278138,55c0d60e9251410f19001cec,Production,1,32,Producer,Robin Wright,/tXfQTgcIEPP7gtVdJ44ZxZPhacn.jpg,152795
90620,52fe43e2c3a36847f80760b5,Writing,2,202,Screenplay,Charlie Kaufman,/v5Zc2aplTL0y38Oe91zGnVBUtYi.jpg,4912
379605,52fe43e2c3a36847f80760b5,Writing,2,202,Screenplay,Charlie Kaufman,/v5Zc2aplTL0y38Oe91zGnVBUtYi.jpg,4912
317389,52fe4655c3a36847f80f96bf,Directing,2,525,Director,Christopher Nolan,/7OGmfDF4VHLLgbjxuEwTj3ga0uQ.jpg,43629
...,...,...,...,...,...,...,...,...
317743,596bdc83c3a3684c0200548b,Editing,0,1852798,Dialogue Editor,Matt Gorzkowski,,199591
317345,596bdcd9c3a3684bcb004d02,Sound,0,1852799,Sound Effects Editor,Michael Hanlan,,199591
317744,596bdcd9c3a3684bcb004d02,Sound,0,1852799,Sound Effects Editor,Michael Hanlan,,199591
317346,596bdfcac3a3684c02005741,Costume & Make-Up,0,1852802,Wardrobe Supervisor,Jasmine Murray-Bergquist,,199591


In [222]:
c_crew.drop_duplicates(subset=['id', 'credit_id', 'm_id', 'name'], keep='first', inplace=True)

Validate cleaning

In [223]:
c_crew[c_crew.duplicated(subset=['id', 'credit_id', 'm_id', 'name'], keep=False)].sort_values(by=['id', 'credit_id']).count()

credit_id       0
department      0
gender          0
id              0
job             0
name            0
profile_path    0
m_id            0
dtype: int64

In [224]:
c_crew.profile_path.str.split('.').str[-1].value_counts()

profile_path
jpg    95001
Name: count, dtype: int64

In [225]:
c_crew.isnull().sum()

credit_id            0
department           0
gender               0
id                   0
job                  0
name                 0
profile_path    368835
m_id                 0
dtype: int64

#### Keywords Dataframe Processing:

In [226]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [227]:
keywords = normalize_data(keywords, 'keywords', 'id', 'm')

In [228]:
keywords.head(5)

Unnamed: 0,id,name,m_id
0,931,jealousy,862
1,4290,toy,862
2,5202,boy,862
3,6054,friendship,862
4,9713,friends,862


Data Cleaning process

In [229]:
pd.DataFrame(keywords.name.value_counts())

Unnamed: 0_level_0,count
name,Unnamed: 1_level_1
woman director,3115
independent film,1930
murder,1308
based on novel,835
musical,734
...,...
helping animals,1
animal agriculture,1
brother sister,1
bad boy,1


In [230]:
# Check for duplicated rows
keywords[keywords.duplicated(keep=False)].sort_values(by=['id', 'name', 'm_id'])

Unnamed: 0,id,name,m_id
137369,65,holiday,19252
139324,65,holiday,19252
135581,65,holiday,26381
137536,65,holiday,26381
135628,65,holiday,26537
...,...,...,...
138012,237651,dreamland,325712
136432,238208,dead end road,23382
138387,238208,dead end road,23382
136265,238539,corrupt sheriff,325173


In [231]:
keywords[keywords.duplicated(keep=False)].sort_values(by=['id', 'name', 'm_id']).count()

id      4156
name    4156
m_id    4156
dtype: int64

In [232]:
keywords.drop_duplicates(keep='first', inplace=True)

Validate

In [233]:
keywords[keywords.duplicated(keep=False)].sort_values(by=['id', 'name', 'm_id']).count()

id      0
name    0
m_id    0
dtype: int64

In [234]:
keywords.isnull().sum()

id      0
name    0
m_id    0
dtype: int64

#### Links Dataframe Processing:

In [235]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [236]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  45843 non-null  int64  
 1   imdbId   45843 non-null  int64  
 2   tmdbId   45624 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 1.0 MB


In [237]:
links['tmdbId'] = links.tmdbId.fillna(0).astype(int)
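
Filling missing `tmdbId` values with 0 makes them indistinguishable from a real ID of 0 in later joins. One possible alternative, sketched here on illustrative data, is pandas' nullable `Int64` dtype, which keeps whole numbers as integers and leaves missing values as `<NA>`. Whether the downstream storage layer handles nulls as expected is an assumption to check before adopting it.

```python
import pandas as pd

links_demo = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, None, 15602.0]})

# Nullable integer dtype: whole numbers stay integers, missing values stay <NA>
links_demo["tmdbId"] = links_demo["tmdbId"].astype("Int64")

# The missing value is still visible instead of being replaced by a fake ID
missing = int(links_demo["tmdbId"].isna().sum())
```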

Validate

In [238]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   movieId  45843 non-null  int64
 1   imdbId   45843 non-null  int64
 2   tmdbId   45843 non-null  int64
dtypes: int64(3)
memory usage: 1.0 MB


In [239]:
links.isnull().sum()

movieId    0
imdbId     0
tmdbId     0
dtype: int64

#### Movies Metadata Dataframe Processing:

In [240]:
movies_metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173


In [241]:
# Inspect the first row of the movies_metadata DataFrame to decide the processing method
movies_metadata.iloc[0]

adult                                                                False
belongs_to_collection    {'id': 10194, 'name': 'Toy Story Collection', ...
budget                                                            30000000
genres                   [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
homepage                              http://toystory.disney.com/toy-story
id                                                                     862
imdb_id                                                          tt0114709
original_language                                                       en
original_title                                                   Toy Story
overview                 Led by Woody, Andy's toys live happily in his ...
popularity                                                       21.946943
poster_path                               /rhIRbceoE9lR4veEXuwCC2wARtG.jpg
production_companies        [{'name': 'Pixar Animation Studios', 'id': 3}]
production_countries     

Extracting belongs_to_collection DataFrame

In [242]:
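# Row-wise version; this redefinition replaces the extract_dict_values defined earlier
def extract_dict_values(row: pd.Series, column_name: str, new_column_prefix: Optional[str] = None) -> pd.Series: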
    new_row = safe_parse_collection(row[column_name])
    if isinstance(new_row, dict):
        for key, value in new_row.items():
            if new_column_prefix:
                row[f'{new_column_prefix}_{key}'] = value
            else:
                row[key] = value
    return row

belongs_to_collection = movies_metadata[['belongs_to_collection', 'id']].copy()
belongs_to_collection.rename(columns={'id': 'm_id'}, inplace=True)
belongs_to_collection = belongs_to_collection.apply(lambda row: extract_dict_values(row, 'belongs_to_collection'), axis=1)
movies_metadata.drop('belongs_to_collection', axis=1, inplace=True)
belongs_to_collection = belongs_to_collection.drop('belongs_to_collection', axis=1)
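
The row-wise `apply` above works, but it runs a Python function once per row. A possible alternative, sketched here under the assumption that only rows carrying a dict are needed, parses the column once and builds the collection table directly from those rows. `parse_or_none` and the toy data are hypothetical and not part of the project.

```python
import ast

import numpy as np
import pandas as pd

meta_demo = pd.DataFrame(
    {
        "id": [862, 8844, 15602],
        "belongs_to_collection": [
            "{'id': 10194, 'name': 'Toy Story Collection'}",
            np.nan,
            "{'id': 119050, 'name': 'Grumpy Old Men Collection'}",
        ],
    }
)


def parse_or_none(value):
    """Return a dict for dict-like strings, otherwise None (hypothetical helper)."""
    if not isinstance(value, str):
        return None
    try:
        parsed_value = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    return parsed_value if isinstance(parsed_value, dict) else None


parsed = meta_demo["belongs_to_collection"].map(parse_or_none)
has_collection = parsed.notna()

# Build the collection table only from rows that actually carry a dict
collections = pd.DataFrame(parsed[has_collection].tolist())
collections["m_id"] = meta_demo.loc[has_collection, "id"].to_numpy()
```

Because rows without a collection are excluded before the table is built, the `id` column in this sketch stays an integer instead of becoming a float.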

Processing & cleaning the belongs_to_collection DataFrame

In [243]:
# Take a look at the first 5 rows of the belongs_to_collection DataFrame
belongs_to_collection.head()

Unnamed: 0,backdrop_path,id,m_id,name,poster_path
0,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,10194.0,862,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg
1,,,8844,,
2,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,119050.0,15602,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg
3,,,31357,,
4,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,96871.0,11862,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg


In [244]:
# Inspect the data type
belongs_to_collection.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45463 entries, 0 to 45462
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   backdrop_path  3263 non-null   object 
 1   id             4491 non-null   float64
 2   m_id           45463 non-null  int64  
 3   name           4491 non-null   object 
 4   poster_path    3948 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 1.7+ MB


In [245]:
# Remove rows where both id and name are missing (movies without a collection)
mask = belongs_to_collection[['id', 'name']].isna().all(axis=1)
belongs_to_collection = belongs_to_collection[~mask]
belongs_to_collection.head()

Unnamed: 0,backdrop_path,id,m_id,name,poster_path
0,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,10194.0,862,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg
2,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,119050.0,15602,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg
4,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,96871.0,11862,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg
9,/6VcVl48kNKvdXOZfJPdarlUGOsk.jpg,645.0,710,James Bond Collection,/HORpg5CSkmeQlAolx3bKMrKgfi.jpg
12,/9VM5LiJV0bGb1st1KyHA3cVnO2G.jpg,117693.0,21032,Balto Collection,/w0ZgH6Lgxt2bQYnf1ss74UvYftm.jpg


Validate

In [246]:
# Inspect the data type
belongs_to_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4491 entries, 0 to 45379
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   backdrop_path  3263 non-null   object 
 1   id             4491 non-null   float64
 2   m_id           4491 non-null   int64  
 3   name           4491 non-null   object 
 4   poster_path    3948 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 210.5+ KB


Extracting and separating genres column

In [247]:
genres = normalize_data(movies_metadata, 'genres', 'id', 'm')
movies_metadata.drop(columns=['genres'], inplace=True)
genres.sample(5)

Unnamed: 0,id,name,m_id
7195,10752,War,31681
52057,18,Drama,244267
24863,18,Drama,38583
78316,16,Animation,127380
36660,10749,Romance,41248


Validate

In [248]:
genres.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91094 entries, 0 to 91093
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      91094 non-null  int64 
 1   name    91094 non-null  object
 2   m_id    91094 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 2.1+ MB


Extracting production_companies column

In [249]:
production_companies = normalize_data(movies_metadata, 'production_companies', 'id', 'm')
movies_metadata.drop(columns=['production_companies'], inplace=True)
production_companies.sort_values(by='id').head()

Unnamed: 0,name,id,m_id
4053,Lucasfilm,1,847
570,Lucasfilm,1,11
2474,Lucasfilm,1,89
4517,Lucasfilm,1,10658
3889,Lucasfilm,1,87


Cleaning process

In [250]:
production_companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70557 entries, 0 to 70556
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    70557 non-null  object
 1   id      70557 non-null  int64 
 2   m_id    70557 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.6+ MB


Extracting production_countries DataFrame

In [251]:
production_countries = normalize_data(movies_metadata, 'production_countries', 'id', 'm')
movies_metadata.drop(columns=['production_countries'], inplace=True)
production_countries.sample(5)

Unnamed: 0,iso_3166_1,name,m_id
10039,US,United States of America,31592
25785,US,United States of America,198287
48140,SE,Sweden,37695
21551,US,United States of America,67885
12481,AT,Austria,445


Validate

In [252]:
production_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49430 entries, 0 to 49429
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   iso_3166_1  49430 non-null  object
 1   name        49430 non-null  object
 2   m_id        49430 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.1+ MB


Extracting spoken_languages DataFrame

In [253]:
spoken_languages = normalize_data(movies_metadata, 'spoken_languages', 'id', 'm')
movies_metadata.drop(columns=['spoken_languages'], inplace=True)
spoken_languages.sample(5)

Unnamed: 0,iso_639_1,name,m_id
5785,en,English,17170
19219,ru,Pусский,98914
10228,ja,日本語,5544
17035,en,English,17106
17209,cs,Český,31658


Validate

In [254]:
spoken_languages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53303 entries, 0 to 53302
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   iso_639_1  53303 non-null  object
 1   name       53303 non-null  object
 2   m_id       53303 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.2+ MB


#### Validate that the data is consistent and related across tables

In [255]:
# Preview the first rows of ratings
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [256]:
credits[credits['id'] == 110]

Unnamed: 0,cast,crew,id
302,"[{'cast_id': 9, 'character': 'Valentine Dussau...","[{'credit_id': '52fe4219c3a36847f8003dbd', 'de...",110


In [257]:
# Note: in c_cast, `id` is the person ID; the movie ID is stored in `m_id`
c_cast[c_cast['id'] == 110].head()

Unnamed: 0,cast_id,character,credit_id,gender,id,name,order,profile_path,m_id
2674,4,Lt. Peter 'WEAPS' Ince,52fe44ccc3a36847f80aa7e9,2,110,Viggo Mortensen,4,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,8963
3110,14,Lucifer,535e9b58c3a36830a9005700,2,110,Viggo Mortensen,5,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,11980
7041,11,Lalin,52fe443bc3a36847f8089ef7,2,110,Viggo Mortensen,7,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,6075
12512,11,Roy Nord,52fe44159251416c75028531,2,110,Viggo Mortensen,2,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,11228
23655,7,Guy Foucard,52fe44b7c3a36847f80a5f6f,2,110,Viggo Mortensen,4,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,8744


In [258]:
# Note: in c_crew, `id` is the person ID; the movie ID is stored in `m_id`
c_crew[c_crew['id'] == 110].head()

Unnamed: 0,credit_id,department,gender,id,job,name,profile_path,m_id
344703,5545f2619251414c92003c16,Production,2,110,Co-Producer,Viggo Mortensen,/gYtVNMwX96fE9F0WVkdC0SGffkn.jpg,283708


In [259]:
keywords[keywords['m_id'] == 110].head()

Unnamed: 0,id,name,m_id
2028,934,judge,110
2029,1533,isolation,110
2030,2863,mannequin,110
2031,4918,shadowing,110
2032,5259,english channel,110


In [260]:
links[links['tmdbId'] == 110].head()

Unnamed: 0,movieId,imdbId,tmdbId
303,306,111495,110


[Link validation](https://www.imdb.com/title/tt0111495/)

In [261]:
movies_metadata[movies_metadata['id'] == 110]

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,revenue,runtime,status,tagline,title,video,vote_average,vote_count
302,False,0,,110,tt0111495,fr,Trois couleurs : Rouge,Red This is the third film from the trilogy by...,7.832755,/77CFEssoKesi4zvtADEpIrSKhA3.jpg,1994-05-27,0,99.0,Released,,Three Colors: Red,False,7.8,246


In [262]:
belongs_to_collection[belongs_to_collection['m_id'] == 110]

Unnamed: 0,backdrop_path,id,m_id,name,poster_path
302,/AeHExfHIl70SZCea907KfEoSkfJ.jpg,131.0,110,Three Colors Collection,/rVdd23QuT5rHX7lZvuAkRRUkeCZ.jpg


In [263]:
genres[genres['m_id'] == 110]

Unnamed: 0,id,name,m_id
726,18,Drama,110
727,9648,Mystery,110
728,10749,Romance,110


In [264]:
production_companies[production_companies['m_id'] == 110]

Unnamed: 0,name,id,m_id
674,Zespół Filmowy TOR,38,110
675,Le Studio Canal+,183,110
676,France 3 Cinéma,591,110
677,Télévision Suisse-Romande,1245,110


In [265]:
production_countries[production_countries['m_id'] == 110]

Unnamed: 0,iso_3166_1,name,m_id
375,FR,France,110
376,PL,Poland,110
377,CH,Switzerland,110


In [266]:
spoken_languages[spoken_languages['m_id'] == 110]

Unnamed: 0,iso_639_1,name,m_id
414,fr,Français,110


## Notes
- The rows in movies_metadata with id 82663, 122662, and 249260 were corrected manually in the CSV.
- With the checks above, the tables can be considered integrated and validated based on their IDs.
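
As a hedged sketch of how rows needing manual attention might be located programmatically, `pd.to_numeric` with `errors="coerce"` can flag `id` values that are not numbers. The sample values below, including the date-like string, are made up for illustration and do not describe the actual rows that were corrected.

```python
import pandas as pd

# Illustrative id column as read from a CSV with one malformed value
raw_ids = pd.Series(["862", "8844", "1997-08-20", "15602"], name="id")

# errors="coerce" turns anything that is not a number into NaN instead of raising
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

# Rows that failed conversion are the candidates for manual review
suspect = raw_ids[numeric_ids.isna()]
```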