## Title
Raw Data Exploration

### By:
Santiago Puerta - Juan Gómez

### Date:
2024-05-11

### Description:

This notebook explores the raw movie data collected from the TMDb API. It shows basic statistics, checks missing values, and looks at trends in popularity, genres, and ratings. The goal is to understand the data before building a recommendation system.


## Import libraries

In [1]:
import pandas as pd

## Load data

In [2]:
from pathlib import Path

pd.set_option("display.max_columns", None)

DATA_DIR = Path.cwd().resolve().parents[1]

In [3]:
df_movies = pd.read_parquet(DATA_DIR / "data/01_raw/movies_dataset_2025-05-07.parquet")

## Exploration

In [4]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   adult              8000 non-null   bool          
 1   backdrop_path      7921 non-null   object        
 2   genre_ids          8000 non-null   object        
 3   id                 8000 non-null   int64         
 4   original_language  8000 non-null   object        
 5   original_title     8000 non-null   object        
 6   overview           8000 non-null   object        
 7   popularity         8000 non-null   float64       
 8   poster_path        7991 non-null   object        
 9   release_date       8000 non-null   object        
 10  title              8000 non-null   object        
 11  video              8000 non-null   bool          
 12  vote_average       8000 non-null   float64       
 13  vote_count         8000 non-null   int64         
 14  source  

In [5]:
df_movies.sample(5)

|  | adult | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count | source | entry_date | was_ingested | is_popular | runtime | budget | revenue | status | tagline | genres | spoken_languages | keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5436 | False | /bE1AJOGtcvjegirI5Np6sTUO0gz.jpg | [10751, 18, 35] | 673271 | en | 13: The Musical | After moving from New York City to Indiana, a ... | 0.931 | /rqShG2kTbsVbgrgjfoEwawjR88N.jpg | 2022-08-12 | 13: The Musical | False | 6.0 | 40 | exploratory | 2025-02-12 | False | False | 94 | 0 | 0 | Released |  | [Familia, Drama, Comedia] | [עִבְרִית, English] | [new york city, indiana, usa, bar mitzvah, mus... |
| 4822 | False | /ksDlbRulTNcp3BEzr4345fS54w.jpg | [10749, 35] | 1032124 | en | Ask Me to Dance | Unlucky in love, Jack and Jill are destined to... | 0.3882 | /u04ZJa53UZNfUEnv5H6bOZpLj73.jpg | 2022-10-07 | Ask Me to Dance | False | 5.455 | 11 | exploratory | 2025-04-08 | False | False | 94 | 0 | 0 | Released |  | [Romance, Comedia] | [English] | [] |
| 1332 | False | /84bWZa16ALeGKcdmvKWv6Kvoohb.jpg | [80, 18] | 1007127 | ko | 댓글부대 | Journalist Sang-jin uncovers the existence of ... | 0.8556 | /k7pU2kmGPs6kxoPAMygMU93Rw4C.jpg | 2024-03-27 | Troll Factory | False | 5.7 | 11 | exploratory | 2025-01-18 | False | False | 109 | 0 | 6676327 | Released |  | [Crimen, Drama] | [한국어/조선말] | [journalist, based on novel or book, national ... |
| 4660 | False | /97bwlJw220Z5XE3xAHF6G8gA8g6.jpg | [27, 14, 28] | 644124 | it | Dampyr | In war-torn Balkans, bogus monster hunter Harl... | 2.9222 | /xdWjqmX4x0ObKIPqkr8Vptj99AZ.jpg | 2022-10-28 | Dampyr | False | 6.167 | 162 | exploratory | 2025-05-05 | False | False | 109 | 15000000 | 362113 | Released |  | [Terror, Fantasía, Acción] | [English] | [vampire, balkan war, based on comic, bonelli] |
| 3175 | False | /lLl80wmNwnSdVAPkxqwJwm9M2WH.jpg | [27, 18, 53] | 1129932 | es | Rabia | Alan and his father Alberto flee from the pain... | 0.4412 | /vOUqBfCtfud4ixOiftKom5ULfgj.jpg | 2023-06-03 | Rage | False | 5.9 | 19 | exploratory | 2025-03-17 | False | False | 93 | 0 | 0 | Released |  | [Terror, Drama, Suspense] | [Español] | [] |


### Null Values

In [6]:
print("\nNull values in Movie Data Set:")
null_counts = df_movies.isnull().sum()
display(null_counts[null_counts > 0].sort_values(ascending=False))


Null values in Movie Data Set:


backdrop_path    79
poster_path       9
dtype: int64

In [7]:
print("\nColumns with more than 30% missing values:")
null_threshold = 30
null_percent = df_movies.isnull().mean() * 100  # percentage of null values per column
display(null_percent[null_percent > null_threshold].sort_values(ascending=False))


Columns with more than 30% missing values:


Series([], dtype: float64)
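The two null checks above can be folded into one table that reports both the count and the percentage per column. The following is a minimal sketch on a toy frame; the helper name `summarize_nulls` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage for every column that has nulls."""
    summary = pd.DataFrame(
        {
            "null_count": df.isnull().sum(),
            "null_percent": df.isnull().mean() * 100,
        }
    )
    # Keep only columns with at least one null, most affected first.
    return summary[summary["null_count"] > 0].sort_values(
        "null_count", ascending=False
    )


# Toy data standing in for df_movies (hypothetical values).
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "backdrop_path": ["/a.jpg", None, None, "/d.jpg"],
        "poster_path": ["/a.jpg", "/b.jpg", None, "/d.jpg"],
    }
)

null_summary = summarize_nulls(toy)
print(null_summary)
```

Applied to `df_movies`, the same helper would list `backdrop_path` and `poster_path`, the only columns with missing values in the output above.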

### Remove columns

In [8]:
df_movies = df_movies.drop(
    columns=["backdrop_path", "poster_path", "spoken_languages", "genre_ids"]
)

In [9]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   adult              8000 non-null   bool          
 1   id                 8000 non-null   int64         
 2   original_language  8000 non-null   object        
 3   original_title     8000 non-null   object        
 4   overview           8000 non-null   object        
 5   popularity         8000 non-null   float64       
 6   release_date       8000 non-null   object        
 7   title              8000 non-null   object        
 8   video              8000 non-null   bool          
 9   vote_average       8000 non-null   float64       
 10  vote_count         8000 non-null   int64         
 11  source             8000 non-null   object        
 12  entry_date         8000 non-null   datetime64[ns]
 13  was_ingested       8000 non-null   bool          
 14  is_popul

In [10]:
df_movies.shape

(8000, 22)

### Categorical Variables

In [11]:
cols_categoric = ["original_language", "source", "status"]  # 3

In [12]:
df_movies[cols_categoric] = df_movies[cols_categoric].astype("category")

- Ordinal: status (cast above as an unordered category, so its order is not encoded in the dtype)

- Nominal: original_language, source
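If `status` is meant to be ordinal, the order can be stored in the dtype itself with `pd.CategoricalDtype(ordered=True)`. The sketch below assumes a production-stage ordering; the list `status_order` is a hypothetical example and should be checked against the values actually present in the data.

```python
import pandas as pd

# Hypothetical ordering of production stages (an assumption, not taken
# from the dataset). Values missing from this list would become NaN.
status_order = ["Rumored", "Planned", "In Production", "Post Production", "Released"]
status_dtype = pd.CategoricalDtype(categories=status_order, ordered=True)

status = pd.Series(["Released", "Planned", "Post Production"]).astype(status_dtype)

is_ordered = status.cat.ordered  # the dtype now carries the order
earliest_stage = status.min()  # min/max follow the category order
before_release = (status < "Released").tolist()  # comparisons are allowed

print(is_ordered, earliest_stage, before_release)
```

With an ordered dtype, sorting and comparisons follow the declared stages rather than alphabetical order.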

### Numerical Variables

In [13]:
cols_numeric = [
    "popularity",
    "vote_average",
    "vote_count",
    "runtime",
    "budget",
    "revenue",
]

- Float

In [14]:
cols_numeric_float = ["popularity", "vote_average", "budget", "revenue"]

In [15]:
df_movies[cols_numeric_float] = df_movies[cols_numeric_float].astype("float")

- Int

In [16]:
cols_numeric_int = ["vote_count", "runtime"]

In [17]:
# Caution: int8 only holds -128..127, so larger values (e.g. a vote_count of 162) do not fit.
df_movies[cols_numeric_int] = df_movies[cols_numeric_int].astype("int8")
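The sampled rows include a `vote_count` of 162, which is outside the `int8` range of -128 to 127. A range check before casting makes this visible. The sketch below uses a small stand-in series, shows how a plain NumPy cast wraps around, and lets `pd.to_numeric` pick the smallest integer dtype that fits. It is one possible safeguard, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Stand-in for df_movies["vote_count"], using values from the sample rows.
vote_count = pd.Series([40, 11, 162], dtype="int64")

int8_info = np.iinfo(np.int8)  # min=-128, max=127
fits_int8 = bool(vote_count.between(int8_info.min, int8_info.max).all())

# A raw NumPy cast does not raise; out-of-range values wrap around.
wrapped = vote_count.to_numpy().astype(np.int8).tolist()

# Let pandas choose the smallest signed integer dtype that holds every value.
safe = pd.to_numeric(vote_count, downcast="integer")

print(fits_int8, wrapped, safe.dtype)
```

Here 162 wraps to -94 under `int8`, while the downcast keeps the original values in `int16`.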

### Boolean Variables

In [18]:
cols_boolean = ["adult", "video", "was_ingested", "is_popular"]

In [19]:
df_movies[cols_boolean] = df_movies[cols_boolean].astype("bool")

### String Variables

In [20]:
cols_string = [
    "id",
    "original_title",
    "overview",
    "title",
    "tagline",
    "genres",
    "keywords",
]

In [21]:
df_movies[cols_string] = df_movies[cols_string].astype("string")

### Date Variables

In [22]:
col_date = ["release_date", "entry_date"]

In [23]:
df_movies[col_date] = df_movies[col_date].astype("datetime64[ns]")
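`astype("datetime64[ns]")` works when every value is a well-formed date. If the raw strings could contain blanks or malformed entries (an assumption here, since the columns in this dataset show no nulls), `pd.to_datetime` with `errors="coerce"` turns them into `NaT` so they can be counted. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values; the empty and malformed entries are hypothetical.
raw_dates = pd.Series(["2022-08-12", "2024-03-27", "", "not a date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

n_unparsed = int(parsed.isna().sum())
print(parsed)
print("unparsed rows:", n_unparsed)
```

Counting `NaT` values after parsing gives a quick check of how many dates would need cleaning.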

### Schema

In [24]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   adult              8000 non-null   bool          
 1   id                 8000 non-null   string        
 2   original_language  8000 non-null   category      
 3   original_title     8000 non-null   string        
 4   overview           8000 non-null   string        
 5   popularity         8000 non-null   float64       
 6   release_date       8000 non-null   datetime64[ns]
 7   title              8000 non-null   string        
 8   video              8000 non-null   bool          
 9   vote_average       8000 non-null   float64       
 10  vote_count         8000 non-null   int8          
 11  source             8000 non-null   category      
 12  entry_date         8000 non-null   datetime64[ns]
 13  was_ingested       8000 non-null   bool          
 14  is_popul

In [25]:
import pyarrow as pa

schema = pa.Table.from_pandas(df_movies).schema

In [26]:
schema

adult: bool
id: string
original_language: dictionary<values=string, indices=int8, ordered=0>
original_title: string
overview: string
popularity: double
release_date: timestamp[ns]
title: string
video: bool
vote_average: double
vote_count: int8
source: dictionary<values=string, indices=int8, ordered=0>
entry_date: timestamp[ns]
was_ingested: bool
is_popular: bool
runtime: int8
budget: double
revenue: double
status: dictionary<values=string, indices=int8, ordered=0>
tagline: string
genres: string
keywords: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2936

In [27]:
df_movies.to_parquet(
    DATA_DIR / "data/02_intermediate/movies_dataset_fixed.parquet",
    index=False,
    schema=schema,
)