# Exploratory Analysis of the Bronze Layer

## Initial setup

We import the libraries:

In [1]:
from deltalake import DeltaTable
import pandas as pd

## Now Playing

We load the Now Playing Bronze Delta table:

In [2]:
# Now Playing:

dt_np = DeltaTable('../data/delta/bronze/tmdb/now_playing')
df_np = dt_np.to_pandas()


A first look at the data:

In [3]:
df_np.head()

index,movie_id,title,vote_average,vote_count,popularity,date
0,1125899,Cleaner,6.75,142,417.3003,2025-04-03
1,1229730,Carjackers,7.1,37,435.5444,2025-04-03
2,822119,Captain America: Brave New World,6.118,1187,348.3977,2025-04-03
3,1261050,The Quiet Ones,6.2,15,363.2694,2025-04-03
4,1197306,A Working Man,6.968,94,356.5597,2025-04-03


We check the data types to decide which transformations are needed (note that `date` is stored as `object` rather than as a datetime type):

In [4]:
df_np.dtypes

movie_id          int64
title            object
vote_average    float64
vote_count        int64
popularity      float64
date             object
dtype: object
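
Since `date` comes back as `object`, a later layer will probably need to cast it. The following is a minimal sketch, assuming the values are ISO `YYYY-MM-DD` strings as the preview suggests. The `sample` frame is an illustrative stand-in built from two rows of the preview, not the real table:

```python
import pandas as pd

# Toy frame mirroring two rows of the preview above (illustrative only).
sample = pd.DataFrame({
    "movie_id": [1125899, 1229730],
    "date": ["2025-04-03", "2025-04-03"],
})

# Assumption: dates are ISO-formatted strings. Passing an explicit format
# makes unexpected values raise an error instead of being guessed at.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

print(sample.dtypes)
```

With the column cast, date arithmetic and range filters work directly instead of relying on string comparison.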

We check for null values and fully duplicated rows:

In [5]:
# Missing values
missing_values = df_np.isnull().sum()
print("Null values per column:")
print(missing_values[missing_values > 0])

# Duplicates
duplicates = df_np.duplicated().sum()
print("Number of duplicated rows:", duplicates)

Null values per column:
Series([], dtype: int64)
Number of duplicated rows: 0


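Next we check for duplicates on the composite key, grouping by `movie_id` and `date` together. The full-row check above found none, so any pair that appears here has rows that differ in at least one other column.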

In [6]:
df_np.groupby(['movie_id', 'date']).size().reset_index(name='counts').query('counts > 1')

index,movie_id,date,counts
68,1202479,2025-04-03,2
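
To see why a key repeats, it helps to pull the conflicting rows side by side and find which columns disagree. This is a rough sketch on made-up data: only the `movie_id` and `date` of the repeated key come from the output above, while the title and popularity values are hypothetical placeholders.

```python
import pandas as pd

# Made-up rows: the repeated key is real, the other values are placeholders.
sample = pd.DataFrame({
    "movie_id": [1202479, 1202479, 1125899],
    "title": ["Placeholder title", "Placeholder title", "Cleaner"],
    "popularity": [10.0, 12.5, 417.3003],
    "date": ["2025-04-03"] * 3,
})

key = ["movie_id", "date"]

# keep=False flags every row of a repeated key, not only the later copies.
conflicts = sample[sample.duplicated(subset=key, keep=False)].sort_values(key)
print(conflicts)

# Non-key columns that take more than one value within a repeated key.
differing = [
    col for col in sample.columns
    if col not in key and conflicts.groupby(key)[col].nunique().gt(1).any()
]
print("Columns that differ within a repeated key:", differing)
```

Knowing which columns differ helps decide whether a later layer should keep the latest row, aggregate, or treat the repeat as an ingestion error.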


We check the date range for inconsistent values:

In [7]:
print(f"The earliest date in the dataset is: {df_np['date'].min()}")
print(f"The latest date in the dataset is: {df_np['date'].max()}")


The earliest date in the dataset is: 2025-04-03
The latest date in the dataset is: 2025-04-03
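
Taking `min` and `max` of a string column only gives a lexicographic range, so malformed values can go unnoticed. One possible complement, sketched here on made-up values rather than the real column, is to parse with `errors="coerce"` and list whatever fails to parse:

```python
import pandas as pd

# Made-up values: two valid dates, one impossible date, one missing value.
dates = pd.Series(["2025-04-03", "2025-13-40", None, "2025-04-03"])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Present in the source but unparseable (genuinely missing values are excluded).
invalid = dates[parsed.isna() & dates.notna()]

print("Unparseable dates:", invalid.tolist())
print("Parsed range:", parsed.min().date(), "to", parsed.max().date())
```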


## Movie Details

We load the Movie Details Bronze Delta table:

In [8]:
# Movie Details:

dt_md = DeltaTable('../data/delta/bronze/tmdb/movie_details')
df_md = dt_md.to_pandas()

A first look at the data:

In [9]:
df_md.head()

index,movie_id,title,runtime,budget,genres,imdb_id,homepage,origin_countries,original_language,production_companies,release_date
0,1333100,Attack on Titan: THE LAST ATTACK,145,0,"[{'id': 16, 'name': 'Animation'}, {'id': 28, '...",tt33175825,,[JP],ja,"[{'id': 21444, 'logo_path': '/wSejGn3lAZdQ5muB...",2024-11-08
1,696506,Mickey 17,137,118000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",tt12299608,https://www.mickey17movie.com,"[GB, US]",en,"[{'id': 174, 'logo_path': '/zhD3hhtKB5qyv7ZeL4...",2025-02-28
2,717196,Niko: Beyond the Northern Lights,85,0,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",tt14813816,,"[IE, DE, DK, FI]",fi,"[{'id': 135965, 'logo_path': None, 'name': 'An...",2024-10-07
3,1249289,Alarum,95,20000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",tt31456973,https://justwatch.pro/movie/1249289/alarum,[US],en,"[{'id': 121204, 'logo_path': '/vbtvY4IxgUZk713...",2025-01-16
4,128,Princess Mononoke,134,23500000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",tt0119698,http://www.princess-mononoke.com/,[JP],ja,"[{'id': 10342, 'logo_path': '/uFuxPEZRUcBTEiYI...",1997-07-12


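We check the data types to decide which transformations are needed (`genres`, `origin_countries` and `production_companies` hold nested values, and `release_date` is stored as `object`):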

In [10]:
df_md.dtypes

movie_id                 int64
title                   object
runtime                  int64
budget                   int64
genres                  object
imdb_id                 object
homepage                object
origin_countries        object
original_language       object
production_companies    object
release_date            object
dtype: object
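
The nested `genres` column will likely need flattening before it can be filtered or joined. The following is a minimal sketch of one way to do that, assuming each cell holds a list of dicts with `id` and `name` keys, as the preview suggests. The genre with `id` -1 is a hypothetical placeholder, and the sketch does not handle empty lists:

```python
import pandas as pd

# Toy frame: "Animation" (id 16) comes from the preview, the second genre is hypothetical.
sample = pd.DataFrame({
    "movie_id": [1333100, 717196],
    "genres": [
        [{"id": 16, "name": "Animation"}, {"id": -1, "name": "Hypothetical genre"}],
        [{"id": 16, "name": "Animation"}],
    ],
})

# One row per (movie, genre) pair.
exploded = sample.explode("genres", ignore_index=True)

# Pull the dict fields out into flat columns, then drop the nested column.
exploded["genre_id"] = exploded["genres"].map(lambda g: g["id"])
exploded["genre_name"] = exploded["genres"].map(lambda g: g["name"])
exploded = exploded.drop(columns="genres")

print(exploded)
```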

We check for null values, both per column and per row:

In [11]:
df_md.isnull().sum()

movie_id                0
title                   0
runtime                 0
budget                  0
genres                  0
imdb_id                 1
homepage                0
origin_countries        0
original_language       0
production_companies    0
release_date            0
dtype: int64

In [12]:
df_md[df_md.isnull().any(axis=1)].head()

index,movie_id,title,runtime,budget,genres,imdb_id,homepage,origin_countries,original_language,production_companies,release_date
28,1376879,Tunnel: Sun In The Dark,0,2240000,"[{'id': 10752, 'name': 'War'}, {'id': 36, 'nam...",,,[VN],vi,"[{'id': 31453, 'logo_path': None, 'name': 'HKF...",2025-04-04
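
`isnull()` only counts true nulls. The preview shows blank `homepage` cells while the null count for that column is 0, which suggests empty strings, and `runtime` and `budget` contain zeros that may be placeholders for unknown values. A rough sketch of how to count those, on a toy frame built from preview rows (treating blank homepages as `""` is an assumption):

```python
import pandas as pd

# Toy frame built from preview rows; blank homepages are assumed to be "".
sample = pd.DataFrame({
    "movie_id": [1333100, 696506, 1376879],
    "runtime": [145, 137, 0],
    "budget": [0, 118000000, 2240000],
    "imdb_id": ["tt33175825", "tt12299608", None],
    "homepage": ["", "https://www.mickey17movie.com", ""],
})

# Empty strings per text column (true nulls compare as False here).
empty_strings = sample[["imdb_id", "homepage"]].eq("").sum()
print("Empty strings per column:")
print(empty_strings)

# Zeros per numeric column, which may stand for "unknown".
zeros = sample[["runtime", "budget"]].eq(0).sum()
print("Zeros per column:")
print(zeros)

# One option: turn empty strings into missing values so isnull() sees them.
homepage_clean = sample["homepage"].mask(sample["homepage"].eq(""))
print("Missing homepages after cleaning:", homepage_clean.isna().sum())
```

Whether a zero `budget` or `runtime` should count as missing is a modelling decision. The sketch only shows how to measure how often it happens.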


We check for duplicates using only `movie_id` as the key, since this dataset is expected to hold one row per movie.

In [13]:
df_md.duplicated(subset=['movie_id']).sum()

0