# Creating the Netflix DataFrames

## Reading the files

In [20]:
import numpy as np
import pandas as pd

### api_netflix
Movie information returned by the API. Its columns are:
- `id`: Identifier of the movie in the API
- `title`: Name of the movie
- `year`: Release year of the movie
- `imdb_id`: Identifier of the movie on IMDb
- `tmdb_id`: Identifier of the movie on TMDB
- `tmdb_type`: Type of the title on TMDB
- `type`: Type of the title in the API

In [21]:
dfnetflix = pd.read_csv('data/api_netflix.csv')
dfnetflix.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id
0,1823218,The Woman in Cabin 10,2025,tt7130300,1290879
1,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785
2,1812460,She Walks in Darkness,2025,tt32129665,1275151
3,1996393,Swim to Me,2025,tt34682204,1484640
4,1944218,Everybody Loves Me When I'm Dead,2025,tt35669054,1429750


### imdb_basics
Basic information about each title on IMDb. Its columns are:
- `tconst`: Id of the title on IMDb
- `titleType`: Type of the title
- `primaryTitle`: Most commonly used name of the title
- `originalTitle`: Original name of the title
- `isAdult`: Flag (0 or 1) indicating whether the title is for adults
- `startYear`: Release year; for series, the year the series started
- `endYear`: Year the series ended
- `runtimeMinutes`: Runtime of the title in minutes
- `genres`: Comma-separated list of the title's genres

In [22]:
imdb_basics = pd.read_csv('tsv/title.basics.tsv', sep='\t')
imdb_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short
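
The IMDb dumps mark missing values with the literal string `\N`, visible in the `endYear` column above. As a minimal sketch, and not what this notebook does, `pd.read_csv` can map that marker to a missing value at load time through its `na_values` parameter. The inline sample below is a hypothetical two-row stand-in for `title.basics.tsv`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for title.basics.tsv (tab-separated, "\N" marks missing values).
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\t1894\t\\N\t1\n"
    "tt0000002\tshort\t1892\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string "\N" as missing while parsing.
basics_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

print(basics_sample.isna().sum())
```

Parsing the marker at load time would make a later `replace` of `'\\N'` unnecessary for these columns.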


### imdb_ratings
Rating of each movie on IMDb. Its columns are:
- `tconst`: IMDb Id of the title
- `averageRating`: Average score given by the votes
- `numVotes`: Number of votes

In [23]:
imdb_ratings = pd.read_csv('tsv/title.ratings.tsv', sep='\t')
imdb_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2180
1,tt0000002,5.5,302
2,tt0000003,6.4,2254
3,tt0000004,5.2,194
4,tt0000005,6.2,2992


### imdb_principals
People involved in each IMDb title (directors, producers, actors, etc.). Its columns are:
- `tconst`: Id of the title on IMDb
- `ordering`: Number used to enumerate the people within each title
- `nconst`: Id of the person on IMDb
- `category`: Category of the role the person had in the title
- `job`: Job the person had in the title
- `characters`: For actors, the names of the characters they play

In [24]:
imdb_principals = pd.read_csv('tsv/title.principals.tsv', sep='\t')
imdb_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N


### imdb_crew
Directors and writers of each title on IMDb. Its columns are:
- `tconst`: Id of the title on IMDb
- `directors`: IMDb person Ids of the directors
- `writers`: IMDb person Ids of the writers

In [25]:
imdb_crew = pd.read_csv('tsv/title.crew.tsv', sep='\t')
imdb_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,nm0721526
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


### imdb_name
This DataFrame contains information about each person linked to titles on IMDb. Its columns are:
- `nconst`: Id of the person on IMDb
- `primaryName`: Name by which the person is best known
- `birthYear`: Year of birth of the person
- `deathYear`: Year of death of the person
- `primaryProfession`: The three roles the person most often has in titles
- `knownForTitles`: Titles the person is known for

In [26]:
imdb_name = pd.read_csv('tsv/name.basics.tsv', sep='\t')
imdb_name.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0027125,tt0025164"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0038355,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0069467,tt0083922,tt0050976"


## Creating the main DataFrame

### Join with `imdb_basics`

In [27]:
df_main1 = dfnetflix.merge(imdb_basics, how='left', left_on='imdb_id', right_on='tconst')
df_main1.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller"
1,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary"
2,1812460,She Walks in Darkness,2025,tt32129665,1275151,tt32129665,movie,She Walks in Darkness,Un fantasma en la batalla,0.0,2025,\N,108,"Drama,History,Thriller"
3,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama
4,1944218,Everybody Loves Me When I'm Dead,2025,tt35669054,1429750,tt35669054,movie,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,0.0,2025,\N,128,\N
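
A left join keeps every Netflix row even when its `imdb_id` has no match in `imdb_basics`. One way to see how many rows went unmatched is the `indicator` parameter of `DataFrame.merge`. The sketch below uses small hypothetical frames (`netflix_sample`, `basics_sample`) rather than the real data.

```python
import pandas as pd

# Hypothetical miniature versions of dfnetflix and imdb_basics.
netflix_sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb_id": ["tt1", "tt2", "tt3"],
})
basics_sample = pd.DataFrame({
    "tconst": ["tt1", "tt3"],
    "titleType": ["movie", "movie"],
})

# indicator=True adds a "_merge" column: "both" for matches, "left_only" for misses.
# validate="m:1" raises if tconst is not unique on the right side.
merged = netflix_sample.merge(
    basics_sample,
    how="left",
    left_on="imdb_id",
    right_on="tconst",
    indicator=True,
    validate="m:1",
)

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["title", "imdb_id"]])
```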


### Join with `imdb_ratings`

In [28]:
df_main2 = df_main1.merge(imdb_ratings, how='left', left_on='imdb_id', right_on='tconst')
df_main2.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst_x,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst_y,averageRating,numVotes
0,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller",tt7130300,5.9,16028.0
1,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary",tt38466379,6.2,1063.0
2,1812460,She Walks in Darkness,2025,tt32129665,1275151,tt32129665,movie,She Walks in Darkness,Un fantasma en la batalla,0.0,2025,\N,108,"Drama,History,Thriller",tt32129665,7.4,70.0
3,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama,tt34682204,5.3,162.0
4,1944218,Everybody Loves Me When I'm Dead,2025,tt35669054,1429750,tt35669054,movie,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,0.0,2025,\N,128,\N,,,


### Join with `imdb_crew`

In [29]:
df_main3 = df_main2.merge(imdb_crew, how='left', left_on='imdb_id', right_on='tconst')
df_main3.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst_x,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst_y,averageRating,numVotes,tconst,directors,writers
0,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller",tt7130300,5.9,16028.0,tt7130300,nm1404307,"nm1100123,nm3143608,nm1404307,nm0296491,nm3405121"
1,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary",tt38466379,6.2,1063.0,tt38466379,nm1341941,\N
2,1812460,She Walks in Darkness,2025,tt32129665,1275151,tt32129665,movie,She Walks in Darkness,Un fantasma en la batalla,0.0,2025,\N,108,"Drama,History,Thriller",tt32129665,7.4,70.0,tt32129665,nm0246503,nm0246503
3,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama,tt34682204,5.3,162.0,tt34682204,nm3049622,"nm10202287,nm3049622,nm15953578"
4,1944218,Everybody Loves Me When I'm Dead,2025,tt35669054,1429750,tt35669054,movie,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,0.0,2025,\N,128,\N,,,,tt35669054,nm1552261,\N
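
Each join so far carries its own copy of the key along, which is why the result now holds `tconst_x`, `tconst_y` and `tconst` next to `imdb_id`. A possible alternative, sketched here with a hypothetical helper `left_join_on_imdb_id` and toy frames, is to drop the redundant key right after every merge.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real ones.
main_sample = pd.DataFrame({"imdb_id": ["tt1", "tt2"], "title": ["A", "B"]})
ratings_sample = pd.DataFrame({"tconst": ["tt1"], "averageRating": [5.9]})
crew_sample = pd.DataFrame({"tconst": ["tt1", "tt2"], "directors": ["nm1", "nm2"]})


def left_join_on_imdb_id(left, right):
    """Left-join `right` on its tconst key and drop the redundant key column."""
    joined = left.merge(right, how="left", left_on="imdb_id", right_on="tconst")
    return joined.drop(columns="tconst")


joined = left_join_on_imdb_id(left_join_on_imdb_id(main_sample, ratings_sample), crew_sample)
print(joined.columns.tolist())
```

The notebook instead removes the extra key columns in a single selection step during cleaning, which gives the same final set of columns.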


### Cleaning

#### Dropping columns
We drop columns that hold repeated or irrelevant information.

In [30]:
df_main4 = df_main3[['id', 'imdb_id', 'title', 'primaryTitle', 'originalTitle', 'titleType', 'year', 'startYear', 
                     'isAdult', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes', 'directors', 'writers']]
df_main4.head()

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0.0,92,"Drama,Mystery,Thriller",5.9,16028.0,nm1404307,"nm1100123,nm3143608,nm1404307,nm0296491,nm3405121"
1,11042350,tt38466379,"My Father, the BTK Killer","My Father, the BTK Killer","My Father, the BTK Killer",movie,2025,2025,0.0,93,"Crime,Documentary",6.2,1063.0,nm1341941,\N
2,1812460,tt32129665,She Walks in Darkness,She Walks in Darkness,Un fantasma en la batalla,movie,2025,2025,0.0,108,"Drama,History,Thriller",7.4,70.0,nm0246503,nm0246503
3,1996393,tt34682204,Swim to Me,Swim to Me,Limpia,movie,2025,2025,0.0,102,Drama,5.3,162.0,nm3049622,"nm10202287,nm3049622,nm15953578"
4,1944218,tt35669054,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,movie,2025,2025,0.0,128,\N,,,nm1552261,\N


#### Fixing null values and column types

In [32]:
df_main5 = df_main4.replace(['\\N', np.nan, None], pd.NA)

In [33]:
df_main5.dtypes

id                 int64
imdb_id           object
title             object
primaryTitle      object
originalTitle     object
titleType         object
year               int64
startYear         object
isAdult           object
runtimeMinutes    object
genres            object
averageRating     object
numVotes          object
directors         object
writers           object
dtype: object

In [35]:
df_main5['titleType'] = df_main5['titleType'].astype('category')
df_main5['startYear'] = df_main5['startYear'].astype('Int64')
df_main5['isAdult'] = df_main5['isAdult'].astype('Int64')
df_main5['runtimeMinutes'] = df_main5['runtimeMinutes'].astype('Int64')
df_main5['averageRating'] = df_main5['averageRating'].astype('Float64')
df_main5['numVotes'] = df_main5['numVotes'].astype('Int64')
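
The casts above rely on every non-missing value already being a clean number. A more defensive variant, sketched here on a hypothetical Series, first runs `pd.to_numeric` with `errors="coerce"` so that any unparseable string becomes missing, and then casts to the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with the IMDb "\N" marker.
raw_years = pd.Series(["2025", "\\N", "1993"])

# errors="coerce" turns anything that is not a number into NaN;
# the nullable Int64 dtype then stores those gaps as <NA> without falling back to float.
years = pd.to_numeric(raw_years, errors="coerce").astype("Int64")

print(years)
```

The trade-off is that coercion silently hides malformed values, so it may be worth counting the missing entries before and after the conversion.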

In [36]:
df_main5.head()

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"Drama,Mystery,Thriller",5.9,16028.0,nm1404307,"nm1100123,nm3143608,nm1404307,nm0296491,nm3405121"
1,11042350,tt38466379,"My Father, the BTK Killer","My Father, the BTK Killer","My Father, the BTK Killer",movie,2025,2025,0,93,"Crime,Documentary",6.2,1063.0,nm1341941,
2,1812460,tt32129665,She Walks in Darkness,She Walks in Darkness,Un fantasma en la batalla,movie,2025,2025,0,108,"Drama,History,Thriller",7.4,70.0,nm0246503,nm0246503
3,1996393,tt34682204,Swim to Me,Swim to Me,Limpia,movie,2025,2025,0,102,Drama,5.3,162.0,nm3049622,"nm10202287,nm3049622,nm15953578"
4,1944218,tt35669054,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,movie,2025,2025,0,128,,,,nm1552261,


In [37]:
for i in ['genres', 'directors', 'writers']:
    df_main5[i] = df_main5[i].apply(lambda x: str(x).split(',') if pd.notna(x) else [])

df_main5.head()

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028.0,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
1,11042350,tt38466379,"My Father, the BTK Killer","My Father, the BTK Killer","My Father, the BTK Killer",movie,2025,2025,0,93,"[Crime, Documentary]",6.2,1063.0,[nm1341941],[]
2,1812460,tt32129665,She Walks in Darkness,She Walks in Darkness,Un fantasma en la batalla,movie,2025,2025,0,108,"[Drama, History, Thriller]",7.4,70.0,[nm0246503],[nm0246503]
3,1996393,tt34682204,Swim to Me,Swim to Me,Limpia,movie,2025,2025,0,102,[Drama],5.3,162.0,[nm3049622],"[nm10202287, nm3049622, nm15953578]"
4,1944218,tt35669054,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,movie,2025,2025,0,128,[],,,[nm1552261],[]
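
With `genres` stored as Python lists, per-genre counts can be obtained by expanding each list into one row per element with `DataFrame.explode`. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample with the same list-valued layout as the genres column.
titles_sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Thriller"], ["Drama"], []],
})

# explode() emits one row per list element; an empty list becomes a single NaN row.
exploded = titles_sample.explode("genres")

# value_counts() skips NaN by default, so titles without genres are not counted.
genre_counts = exploded["genres"].value_counts()
print(genre_counts)
```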


NA values

In [40]:
resultados = {}
for i in df_main5.columns:
    resultados[i] = df_main5[i].isna().sum()
resultados

{'id': np.int64(0),
 'imdb_id': np.int64(0),
 'title': np.int64(0),
 'primaryTitle': np.int64(4),
 'originalTitle': np.int64(4),
 'titleType': np.int64(4),
 'year': np.int64(0),
 'startYear': np.int64(4),
 'isAdult': np.int64(4),
 'runtimeMinutes': np.int64(13),
 'genres': np.int64(0),
 'averageRating': np.int64(9),
 'numVotes': np.int64(9),
 'directors': np.int64(0),
 'writers': np.int64(0)}
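
The zeros reported for `genres`, `directors` and `writers` do not mean those columns are complete: the earlier split step replaced every missing value with an empty list, and `isna()` does not treat `[]` as missing. A sketch of how empty lists could be counted instead, on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with list-valued columns, some of them empty.
lists_sample = pd.DataFrame({
    "genres": [["Drama"], [], ["Comedy", "Romance"]],
    "writers": [[], [], ["nm1"]],
})

# isna() sees a list object in every cell, so it reports no missing values.
na_counts = lists_sample.isna().sum()

# Counting zero-length lists reveals the rows that originally had no data.
empty_counts = {
    col: int(lists_sample[col].map(len).eq(0).sum()) for col in lists_sample.columns
}

print(na_counts.to_dict())
print(empty_counts)
```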

In [41]:
df_main5[pd.isna(df_main5['runtimeMinutes'])] 

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
19,11050026,tt36265413,Inside Furioza,Furioza Again,Furioza Again,movie,2025,2025.0,0.0,,[Drama],,,[nm2940110],[]
34,11047294,tt38356685,Turn of the Tide: The Surreal Story of Rabo de...,Turn of the Tide: The Surreal Story of Rabo de...,Maré Branca: A Surreal História de Rabo de Peixe,movie,2025,2025.0,0.0,,[Documentary],,,[nm13956670],[nm2029700]
1430,1628879,tt8066940,Indian 2: Zero Tolerance,Indian 2,Indian 2,movie,2024,2024.0,0.0,,"[Action, Drama, Musical]",3.8,16792.0,[nm0788171],"[nm10171496, nm3328122, nm11805162, nm9731217,..."
1683,1771964,tt29892095,Girl Haunts Boy,,,,2024,,,,[],,,[],[]
2114,1693330,tt15939090,Burning Patience,,,,2022,,,,[],,,[],[]
2285,1890136,tt35287639,The Maximum Penalty 2,La Pena Maxima 2,La Pena Maxima 2,movie,2024,2024.0,0.0,,[Comedy],5.5,74.0,[nm1357797],[]
2683,1834116,tt32643879,Outside,,,,2024,,,,[],,,[],[]
2764,1556312,tt0106206,Aashik Aawara,Aashik Aawara,Aashik Aawara,movie,1993,1993.0,0.0,,[Drama],4.8,414.0,[nm0576494],"[nm0576494, nm1100075]"
2769,1899207,tt13079112,The Lockdown Plan,The Lockdown Plan,The Lockdown Plan,tvEpisode,2022,2020.0,0.0,,"[Comedy, Romance]",6.5,102.0,[],[]
2773,1698365,tt15751968,Smile,Smile,Smile,movie,2022,2022.0,0.0,,[Horror],4.3,155.0,[nm10076697],[nm10076697]


In [42]:
df_main5[pd.isna(df_main5['averageRating'])] 

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
4,1944218,tt35669054,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,Everybody Loves Me When I'm Dead,movie,2025,2025.0,0.0,128.0,[],,,[nm1552261],[]
19,11050026,tt36265413,Inside Furioza,Furioza Again,Furioza Again,movie,2025,2025.0,0.0,,[Drama],,,[nm2940110],[]
34,11047294,tt38356685,Turn of the Tide: The Surreal Story of Rabo de...,Turn of the Tide: The Surreal Story of Rabo de...,Maré Branca: A Surreal História de Rabo de Peixe,movie,2025,2025.0,0.0,,[Documentary],,,[nm13956670],[nm2029700]
1683,1771964,tt29892095,Girl Haunts Boy,,,,2024,,,,[],,,[],[]
2114,1693330,tt15939090,Burning Patience,,,,2022,,,,[],,,[],[]
2130,11042008,tt38366532,The Time That Remains,The Time That Remains,The Time That Remains,movie,2025,2025.0,0.0,116.0,"[Drama, Fantasy, Romance]",,,[nm0019723],[]
2391,1901595,tt0498396,The Twits,The Twits,The Twits,movie,2025,2025.0,0.0,98.0,"[Adventure, Animation, Comedy]",,,"[nm1601882, nm2154609, nm4670728]","[nm1601882, nm5044246, nm0001094]"
2683,1834116,tt32643879,Outside,,,,2024,,,,[],,,[],[]
2795,1625502,tt13836494,Pinkfong & Baby Shark's Space Adventure,,,,2019,,,,[],,,[],[]


Duplicates

In [43]:
df_main5.duplicated('imdb_id').sum()

np.int64(0)

Column analysis

In [44]:
df_main5.describe()

Unnamed: 0,id,year,startYear,isAdult,runtimeMinutes,averageRating,numVotes
count,2908.0,2908.0,2904.0,2904.0,2895.0,2899.0,2899.0
mean,1581275.0,2018.924347,2018.878788,0.0,108.786528,6.180476,40037.418765
std,916844.9,7.290506,7.285463,0.0,22.508434,1.047656,105053.375984
min,166.0,1967.0,1967.0,0.0,10.0,1.6,6.0
25%,1412952.0,2018.0,2018.0,0.0,94.0,5.6,1828.5
50%,1620810.0,2021.0,2021.0,0.0,105.0,6.3,6289.0
75%,1697665.0,2023.0,2023.0,0.0,120.0,6.9,27807.5
max,11050030.0,2025.0,2025.0,0.0,210.0,9.0,1725395.0


In [45]:
dfmain = df_main5

In [46]:
dfmain.to_csv('data/imdb_netflix.csv')
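
CSV has no list type, so the list columns are written as their string representation (for example `"['Drama', 'Thriller']"`) and come back as plain strings. The sketch below shows one way to restore them on reload, using `ast.literal_eval` as a converter. An in-memory buffer stands in for `data/imdb_netflix.csv`, and `index=False` is an optional choice, not used in the cell above, that avoids an extra `Unnamed: 0` column on reload.

```python
import ast
import io

import pandas as pd

# Hypothetical frame with one list-valued column.
out_sample = pd.DataFrame({
    "imdb_id": ["tt1", "tt2"],
    "genres": [["Drama", "Thriller"], []],
})

# In-memory buffer standing in for the CSV file on disk.
buffer = io.StringIO()
out_sample.to_csv(buffer, index=False)
buffer.seek(0)

# literal_eval parses "['Drama', 'Thriller']" back into a Python list.
reloaded = pd.read_csv(buffer, converters={"genres": ast.literal_eval})

print(reloaded["genres"].tolist())
```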

## Creating the people DataFrame

In [47]:
df_personas1 = dfmain.merge(imdb_principals, how='left', left_on='imdb_id', right_on='tconst')
df_personas1.head()

Unnamed: 0,id,imdb_id,title,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,...,averageRating,numVotes,directors,writers,tconst,ordering,nconst,category,job,characters
0,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,...,5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n...",tt7130300,1.0,nm0461136,actress,\N,"[""Laura Blacklock""]"
1,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,...,5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n...",tt7130300,2.0,nm0001602,actor,\N,"[""Richard Bullmer""]"
2,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,...,5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n...",tt7130300,3.0,nm2916966,actor,\N,"[""Ben Morgan""]"
3,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,...,5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n...",tt7130300,4.0,nm1926861,actress,\N,"[""Carrie""]"
4,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,...,5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n...",tt7130300,5.0,nm0539562,actor,\N,"[""Dr. Robert Mehta""]"
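
Because `imdb_principals` holds one row per person and title, this join repeats each title once per credited person, so the result has many more rows than `dfmain`. A sketch on hypothetical frames of how to assert that relationship and count credits per title:

```python
import pandas as pd

# Hypothetical miniature versions of dfmain and imdb_principals.
titles_sample = pd.DataFrame({"imdb_id": ["tt1", "tt2"], "title": ["A", "B"]})
principals_sample = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1"],
    "ordering": [1, 2, 3],
    "nconst": ["nm1", "nm2", "nm3"],
})

# validate="1:m" raises MergeError if imdb_id is duplicated on the left side.
people = titles_sample.merge(
    principals_sample, how="left", left_on="imdb_id", right_on="tconst", validate="1:m"
)

# count() ignores missing nconst, so titles without credits get 0.
credits_per_title = people.groupby("imdb_id")["nconst"].count()
print(credits_per_title)
```

Titles with no entry in the principals table still appear once, with missing values in the person columns.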


Dropping columns

In [48]:
df_personas2 = df_personas1[['id', 'imdb_id', 'title', 'nconst', 'ordering', 'category', 'job', 'characters', 'primaryTitle',
                             'originalTitle', 'titleType', 'year', 'startYear', 'isAdult', 'runtimeMinutes', 'genres', 'averageRating',
                             'numVotes', 'directors', 'writers']]
df_personas2.head()

Unnamed: 0,id,imdb_id,title,nconst,ordering,category,job,characters,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1823218,tt7130300,The Woman in Cabin 10,nm0461136,1.0,actress,\N,"[""Laura Blacklock""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
1,1823218,tt7130300,The Woman in Cabin 10,nm0001602,2.0,actor,\N,"[""Richard Bullmer""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
2,1823218,tt7130300,The Woman in Cabin 10,nm2916966,3.0,actor,\N,"[""Ben Morgan""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
3,1823218,tt7130300,The Woman in Cabin 10,nm1926861,4.0,actress,\N,"[""Carrie""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
4,1823218,tt7130300,The Woman in Cabin 10,nm0539562,5.0,actor,\N,"[""Dr. Robert Mehta""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."


### Fixing null values and columns

In [49]:
df_personas3 = df_personas2.replace(['\\N', np.nan, None], pd.NA)
df_personas3.head()

Unnamed: 0,id,imdb_id,title,nconst,ordering,category,job,characters,primaryTitle,originalTitle,titleType,year,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1823218,tt7130300,The Woman in Cabin 10,nm0461136,1.0,actress,,"[""Laura Blacklock""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
1,1823218,tt7130300,The Woman in Cabin 10,nm0001602,2.0,actor,,"[""Richard Bullmer""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
2,1823218,tt7130300,The Woman in Cabin 10,nm2916966,3.0,actor,,"[""Ben Morgan""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
3,1823218,tt7130300,The Woman in Cabin 10,nm1926861,4.0,actress,,"[""Carrie""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."
4,1823218,tt7130300,The Woman in Cabin 10,nm0539562,5.0,actor,,"[""Dr. Robert Mehta""]",The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,2025,0,92,"[Drama, Mystery, Thriller]",5.9,16028,[nm1404307],"[nm1100123, nm3143608, nm1404307, nm0296491, n..."


In [50]:
df_personas3.dtypes

id                   int64
imdb_id             object
title               object
nconst              object
ordering            object
category            object
job                 object
characters          object
primaryTitle        object
originalTitle       object
titleType         category
year                 int64
startYear            Int64
isAdult              Int64
runtimeMinutes       Int64
genres              object
averageRating      Float64
numVotes             Int64
directors           object
writers             object
dtype: object

In [51]:
df_personas3['ordering'] = df_personas3['ordering'].astype('Int64')
df_personas3['category'] = df_personas3['category'].astype('category')
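
The `characters` column is still a string that looks like a JSON array (for example `["Laura Blacklock"]`). Assuming the values are valid JSON, which the rows shown suggest but this notebook does not verify, they could be decoded into Python lists in the same spirit as the `genres` split:

```python
import json

import pandas as pd

# Hypothetical sample mirroring the format seen in the output above.
characters_sample = pd.Series(['["Laura Blacklock"]', pd.NA, '["Carrie"]'], dtype="object")

# Decode JSON text into a list; keep an empty list where the value is missing.
parsed = characters_sample.apply(lambda x: json.loads(x) if pd.notna(x) else [])

print(parsed.tolist())
```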

### Null values

In [53]:
resultados = {}
for i in df_personas3.columns:
    resultados[i] = df_personas3[i].isna().sum()
resultados

{'id': np.int64(0),
 'imdb_id': np.int64(0),
 'title': np.int64(0),
 'nconst': np.int64(6),
 'ordering': np.int64(6),
 'category': np.int64(6),
 'job': np.int64(43264),
 'characters': np.int64(31867),
 'primaryTitle': np.int64(4),
 'originalTitle': np.int64(4),
 'titleType': np.int64(4),
 'year': np.int64(0),
 'startYear': np.int64(4),
 'isAdult': np.int64(4),
 'runtimeMinutes': np.int64(136),
 'genres': np.int64(0),
 'averageRating': np.int64(68),
 'numVotes': np.int64(68),
 'directors': np.int64(0),
 'writers': np.int64(0)}

In [54]:
dfpersonas = df_personas3

In [55]:
dfpersonas.to_csv('data/personas_netflix.csv')