### Extract and Transform

The `movies_dataset.csv` dataset was converted to `parquet` format by `scripts/convert_csv_to_parquet.py`, with the following modifications:

- **Dropped columns**: `video`, `imdb_id`, `adult`, `original_title`, `poster_path`, `homepage`
- **Null values filled**:
  - `revenue`: 0
  - `budget`: 0
- **Type conversions**:
  - `popularity`: `float64`
  - `budget`: `int64`
  - `id`: `int64`
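
The conversion script itself is not shown here, so the following is only a minimal sketch of how the modifications listed above could be applied with pandas. The helper name `clean_movies` and the constant `DROPPED_COLUMNS` are hypothetical, and the sketch assumes that every `id` value parses as a number (a missing or malformed `id` would make the final integer cast fail).

```python
import pandas as pd

# Columns removed during conversion, as listed above.
DROPPED_COLUMNS = ["video", "imdb_id", "adult", "original_title", "poster_path", "homepage"]


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the drops, fills and casts described above."""
    out = df.drop(columns=DROPPED_COLUMNS, errors="ignore")

    # Coerce to numeric first so unparseable strings become NaN instead of raising.
    for col in ("revenue", "budget", "popularity", "id"):
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Fill missing revenue and budget with 0.
    out[["revenue", "budget"]] = out[["revenue", "budget"]].fillna(0)

    out["popularity"] = out["popularity"].astype("float64")
    out["budget"] = out["budget"].astype("int64")
    # Assumption: no id is missing; a NaN here would make this cast raise.
    out["id"] = out["id"].astype("int64")
    return out


# Usage (paths are placeholders; writing parquet needs pyarrow or fastparquet):
# clean_movies(pd.read_csv("movies_dataset.csv", low_memory=False)).to_parquet("movies_dataset.parquet")
```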

In [2]:
import pandas as pd

df = pd.read_parquet(r'C:\Users\mauri\OneDrive\Escritorio\MLops\data\raw\movies_dataset.parquet')

Drop rows with a null `release_date`.

In [3]:
df = df.dropna(subset=['release_date'])

Convert the dates to `YYYY-mm-dd` format and create a `release_year` column holding the release year.

In [4]:
# 1. Identify the rows whose release_date does not match YYYY-mm-dd
incorrect_format = df[~df['release_date'].str.match(r'^\d{4}-\d{2}-\d{2}$', na=False)]
# Show the rows with the wrong format.
incorrect_format

# These errors cannot be corrected manually for lack of information, and the
# remaining fields of these rows also look wrong and are mostly NaN, so the rows are dropped.

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19730,0.065736,,"[{'name': 'Carousel Productions', 'id': 11176}...",,104.0,Released,,False,6.0,1,0.0,,,,,,,
29503,1.931659,,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...",,68.0,Released,,False,7.0,12,0.0,,,,,,,
35587,2.185485,,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...",,82.0,Released,,False,4.3,22,0.0,,,,,,,


In [5]:
# Drop those 3 rows from df.
df.drop(incorrect_format.index, inplace=True)

In [6]:
# Now convert to datetime using the YYYY-mm-dd format
df['release_date'] = pd.to_datetime(df['release_date'], format='%Y-%m-%d')

# Create the release_year column and show a sample.
df['release_year'] = df['release_date'].dt.year
df[['release_date','release_year']].sample(5)


Unnamed: 0,release_date,release_year
3206,2000-03-03,2000
15507,1972-06-01,1972
35888,2011-02-25,2011
1827,1942-06-04,1942
21376,2012-01-06,2012


Create a `return` column by dividing `revenue` by `budget`; when the data needed to compute it is missing, or `budget` is 0, use 0.



In [7]:
df['return'] = df.apply(lambda row: row['revenue'] / row['budget'] if pd.notnull(row['revenue']) and pd.notnull(row['budget']) and row['budget'] != 0 else 0, axis=1)
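
The row-wise `apply` above works, but on tens of thousands of rows a vectorized version is usually faster. The following is a sketch of an equivalent computation, not the notebook's own method; the function name `compute_return` is hypothetical.

```python
import pandas as pd


def compute_return(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: revenue / budget, with 0 where it cannot be computed."""
    revenue = df["revenue"]
    budget = df["budget"]

    # A row is computable only if both values exist and budget is non-zero.
    valid = revenue.notna() & budget.notna() & (budget != 0)

    # Masking invalid budgets to NaN avoids dividing by zero; the resulting
    # NaN ratios (and those from a missing revenue) are then replaced by 0.
    ratio = revenue / budget.where(valid)
    return ratio.fillna(0.0)


demo = pd.DataFrame({
    "revenue": [10.0, 0.0, None, 5.0],
    "budget": [5.0, 0.0, 2.0, None],
})
demo["return"] = compute_return(demo)
```

In the notebook this would be used as `df['return'] = compute_return(df)`.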

- `belongs_to_collection`, `production_companies`, `genres`, `production_countries`, and `spoken_languages` are nested.
- They must be un-nested, or accessed in some way that does not require un-nesting.
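
One possible way to un-nest these columns is sketched below. It assumes the nested values are stored as strings containing Python literals (lists of dicts, or a single dict for `belongs_to_collection`), as the truncated `production_companies` values in the output above suggest. The helpers `parse_nested` and `extract_names` and the `genre_names` column are hypothetical.

```python
import ast

import pandas as pd


def parse_nested(value):
    """Parse a stringified list/dict; return None if missing or unparseable."""
    if not isinstance(value, str):
        return None
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None


def extract_names(value):
    """Return the 'name' field of every item in a nested cell as a flat list."""
    parsed = parse_nested(value)
    if isinstance(parsed, dict):  # single-dict cells, e.g. belongs_to_collection
        parsed = [parsed]
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


demo = pd.DataFrame({
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        None,
        "[]",
    ]
})
demo["genre_names"] = demo["genres"].apply(extract_names)
```

If one row per nested item is needed instead of a list per movie, `demo.explode("genre_names")` would produce that shape.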

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45376 entries, 0 to 45465
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4488 non-null   object        
 1   budget                 45376 non-null  float64       
 2   genres                 45376 non-null  object        
 3   id                     45376 non-null  float64       
 4   original_language      45365 non-null  object        
 5   overview               44435 non-null  object        
 6   popularity             45376 non-null  float64       
 7   production_companies   45376 non-null  object        
 8   production_countries   45376 non-null  object        
 9   release_date           45376 non-null  datetime64[ns]
 10  revenue                45376 non-null  float64       
 11  runtime                45130 non-null  float64       
 12  spoken_languages       45376 non-null  object        
 13  status