# Creating the Netflix DataFrames

## Reading the files

In [1]:
import numpy as np
import pandas as pd

### api_netflix
Movie information returned by the API. Its columns are:
- `id`: Identifier of the movie in the API
- `title`: Name of the movie
- `year`: Release year of the movie
- `imdb_id`: Identifier of the movie in IMDb
- `tmdb_id`: Identifier of the movie in TMDB
- `tmdb_type`: Type of the title in TMDB
- `type`: Type of the title in the API

In [2]:
dfnetflix = pd.read_csv('data/api_netflix.csv')
dfnetflix.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id
0,1789382,Steve,2025,tt32985279,1242404
1,1823218,The Woman in Cabin 10,2025,tt7130300,1290879
2,1996393,Swim to Me,2025,tt34682204,1484640
3,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785
4,1780773,28 Years Later,2025,tt10548174,1100988


### imdb_basics
Basic information about each title in IMDb. Its columns are:
- `tconst`: Id of the title in IMDb
- `titleType`: Type of the title
- `primaryTitle`: Most commonly used name of the title
- `originalTitle`: Original name of the title
- `isAdult`: Boolean flag (stored as 0/1) indicating whether the title is for adults
- `startYear`: Release year; for series, the year the series started
- `endYear`: Year the series ended (`\N` for other title types)
- `runtimeMinutes`: Runtime of the title in minutes
- `genres`: Comma-separated list of the title's genres

In [4]:
imdb_basics = pd.read_csv('tsv/title.basics.tsv', sep='\t')
imdb_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short
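
The preview above shows `\N` in `endYear`, which means pandas loaded the marker as a literal string rather than as a missing value. A minimal sketch of one way to handle this, assuming the file uses `\N` only as a null marker; the inline `sample_tsv` text and the name `basics_sample` are illustrative stand-ins for the real file, built from two rows of the preview:

```python
import io

import pandas as pd

# Two rows copied from the preview above, tab-separated like the real file.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000003\tshort\tPoor Pierrot\tPauvre Pierrot\t0\t1892\t\\N\t5\tAnimation,Comedy,Romance\n"
)

# na_values tells read_csv to parse the \N marker as NaN instead of text.
basics_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])

print(basics_sample[["tconst", "startYear", "endYear"]])
print(basics_sample.dtypes)
```

With the marker parsed as missing, numeric columns such as `startYear` keep a numeric dtype wherever they contain no other text, and `isna()` works as expected on `endYear`.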


### imdb_ratings
Rating of each title in IMDb. Its columns are:
- `tconst`: Id of the title in IMDb
- `averageRating`: Average score given by the votes
- `numVotes`: Number of votes

In [5]:
imdb_ratings = pd.read_csv('tsv/title.ratings.tsv', sep='\t')
imdb_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2180
1,tt0000002,5.5,302
2,tt0000003,6.4,2254
3,tt0000004,5.2,194
4,tt0000005,6.2,2992


### imdb_principals
People involved in each IMDb title (directors, producers, actors, etc.). Its columns are:
- `tconst`: Id of the title in IMDb
- `ordering`: Index that numbers the people credited within each title
- `nconst`: Id of the person in IMDb
- `category`: Category of the role the person had in the title
- `job`: Specific job the person held in the title
- `characters`: For actors, the names of the characters they play

In [6]:
imdb_principals = pd.read_csv('tsv/title.principals.tsv', sep='\t')
imdb_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N
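
In the preview above, `characters` holds text such as `["Self"]`, which looks like a JSON array stored as a string. A small sketch of how it could be turned into Python lists, assuming every non-missing value is valid JSON; `principals_sample`, `parse_characters` and `character_list` are illustrative names, not part of the notebook:

```python
import json

import pandas as pd

# Two rows copied from the preview above; "\\N" is the literal \N marker.
principals_sample = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000001"],
        "ordering": [1, 2],
        "nconst": ["nm1588970", "nm0005690"],
        "category": ["self", "director"],
        "characters": ['["Self"]', "\\N"],
    }
)


def parse_characters(raw):
    """Return the character names as a list; empty list when the value is missing."""
    if pd.isna(raw) or raw == "\\N":
        return []
    return json.loads(raw)


principals_sample["character_list"] = principals_sample["characters"].apply(parse_characters)
print(principals_sample[["nconst", "character_list"]])
```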


### imdb_crew
Directors and writers of each title in IMDb. Its columns are:
- `tconst`: Id of the title in IMDb
- `directors`: IMDb person id(s) of the director(s), comma-separated
- `writers`: IMDb person id(s) of the writer(s), comma-separated

In [7]:
imdb_crew = pd.read_csv('tsv/title.crew.tsv', sep='\t')
imdb_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,nm0721526
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


### imdb_name
Information about each person associated with titles in IMDb. Its columns are:
- `nconst`: Id of the person in IMDb
- `primaryName`: Name by which the person is best known
- `birthYear`: Year the person was born
- `deathYear`: Year the person died
- `primaryProfession`: The three roles the person most often fills in titles
- `knownForTitles`: Titles the person is known for

In [8]:
imdb_name = pd.read_csv('tsv/name.basics.tsv', sep='\t')
imdb_name.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"actor,miscellaneous,producer","tt0050419,tt0072308,tt0027125,tt0025164"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0038355,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0069467,tt0083922,tt0050976"


## Building the main DataFrame

### Join with `imdb_basics`

In [9]:
df_main1 = dfnetflix.merge(imdb_basics, how='left', left_on='imdb_id', right_on='tconst')
df_main1.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,1789382,Steve,2025,tt32985279,1242404,tt32985279,movie,Steve,Steve,0.0,2025,\N,93,"Comedy,Drama"
1,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller"
2,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama
3,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary"
4,1780773,28 Years Later,2025,tt10548174,1100988,tt10548174,movie,28 Years Later,28 Years Later,0.0,2025,\N,115,"Horror,Sci-Fi,Thriller"


### Join with `imdb_ratings`

In [10]:
df_main2 = df_main1.merge(imdb_ratings, how='left', left_on='imdb_id', right_on='tconst')
df_main2.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst_x,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst_y,averageRating,numVotes
0,1789382,Steve,2025,tt32985279,1242404,tt32985279,movie,Steve,Steve,0.0,2025,\N,93,"Comedy,Drama",tt32985279,6.4,8427.0
1,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller",tt7130300,5.9,16028.0
2,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama,tt34682204,5.3,162.0
3,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary",tt38466379,6.2,1063.0
4,1780773,28 Years Later,2025,tt10548174,1100988,tt10548174,movie,28 Years Later,28 Years Later,0.0,2025,\N,115,"Horror,Sci-Fi,Thriller",tt10548174,6.6,147272.0
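
Because each join uses `left_on='imdb_id'` and `right_on='tconst'`, every merge adds another copy of the key (`tconst_x`, `tconst_y`, ...). One possible alternative, shown here only as a sketch on tiny hand-built samples, is to rename `tconst` to `imdb_id` on the IMDb side and merge `on` that single column. The `*_sample` frames reuse values from the outputs above, and `validate="many_to_one"` is an optional safeguard that assumes `tconst` is unique in each IMDb table:

```python
import pandas as pd

# Small stand-ins for dfnetflix, imdb_basics and imdb_ratings.
netflix_sample = pd.DataFrame(
    {"id": [1789382, 1823218], "imdb_id": ["tt32985279", "tt7130300"]}
)
basics_sample = pd.DataFrame(
    {"tconst": ["tt32985279", "tt7130300"], "titleType": ["movie", "movie"]}
)
ratings_sample = pd.DataFrame(
    {
        "tconst": ["tt32985279", "tt7130300"],
        "averageRating": [6.4, 5.9],
        "numVotes": [8427, 16028],
    }
)

merged_sample = netflix_sample
for imdb_table in (basics_sample, ratings_sample):
    merged_sample = merged_sample.merge(
        imdb_table.rename(columns={"tconst": "imdb_id"}),  # share one key name
        how="left",
        on="imdb_id",
        validate="many_to_one",  # raises if the IMDb side has duplicate keys
    )

print(merged_sample.columns.tolist())
```

The result keeps a single `imdb_id` column, so there are no duplicated key columns to drop afterwards.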


### Join with `imdb_crew`

In [11]:
df_main3 = df_main2.merge(imdb_crew, how='left', left_on='imdb_id', right_on='tconst')
df_main3.head()

Unnamed: 0,id,title,year,imdb_id,tmdb_id,tconst_x,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,tconst_y,averageRating,numVotes,tconst,directors,writers
0,1789382,Steve,2025,tt32985279,1242404,tt32985279,movie,Steve,Steve,0.0,2025,\N,93,"Comedy,Drama",tt32985279,6.4,8427.0,tt32985279,nm2139803,nm10532463
1,1823218,The Woman in Cabin 10,2025,tt7130300,1290879,tt7130300,movie,The Woman in Cabin 10,The Woman in Cabin 10,0.0,2025,\N,92,"Drama,Mystery,Thriller",tt7130300,5.9,16028.0,tt7130300,nm1404307,"nm1100123,nm3143608,nm1404307,nm0296491,nm3405121"
2,1996393,Swim to Me,2025,tt34682204,1484640,tt34682204,movie,Swim to Me,Limpia,0.0,2025,\N,102,Drama,tt34682204,5.3,162.0,tt34682204,nm3049622,"nm10202287,nm3049622,nm15953578"
3,11042350,"My Father, the BTK Killer",2025,tt38466379,1550785,tt38466379,movie,"My Father, the BTK Killer","My Father, the BTK Killer",0.0,2025,\N,93,"Crime,Documentary",tt38466379,6.2,1063.0,tt38466379,nm1341941,\N
4,1780773,28 Years Later,2025,tt10548174,1100988,tt10548174,movie,28 Years Later,28 Years Later,0.0,2025,\N,115,"Horror,Sci-Fi,Thriller",tt10548174,6.6,147272.0,tt10548174,nm0000965,nm0307497
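
The `writers` column above packs several person ids into one comma-separated string, and uses `\N` when there is none. If one row per (title, writer) pair is needed later, a sketch along these lines could work; `crew_sample` copies three rows from the output above, and `writers_long` and `writer` are illustrative names:

```python
import pandas as pd

# Three rows copied from the joined output above; "\\N" is the literal \N marker.
crew_sample = pd.DataFrame(
    {
        "tconst": ["tt32985279", "tt34682204", "tt38466379"],
        "directors": ["nm2139803", "nm3049622", "nm1341941"],
        "writers": ["nm10532463", "nm10202287,nm3049622,nm15953578", "\\N"],
    }
)

# Skip titles without writers, then split the id string and emit one row per id.
has_writers = crew_sample["writers"] != "\\N"
writers_long = (
    crew_sample.loc[has_writers, ["tconst", "writers"]]
    .assign(writer=lambda df: df["writers"].str.split(","))
    .explode("writer")
    .drop(columns="writers")
    .reset_index(drop=True)
)

print(writers_long)
```

A long table like this can then be joined to `imdb_name` on `nconst` to attach names, which is not possible while the ids are packed into a single string.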


### Cleanup

#### Dropping columns
We drop columns that hold repeated or irrelevant information.

In [14]:
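# Keep only the non-redundant columns; .copy() makes df_main4 independent of df_main3
# so that later in-place edits do not trigger SettingWithCopyWarning.
df_main4 = df_main3[['id', 'imdb_id', 'primaryTitle', 'originalTitle', 'titleType', 'startYear',
                     'isAdult', 'runtimeMinutes', 'genres', 'averageRating', 'numVotes', 'directors', 'writers']].copy()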
df_main4

Unnamed: 0,id,imdb_id,primaryTitle,originalTitle,titleType,startYear,isAdult,runtimeMinutes,genres,averageRating,numVotes,directors,writers
0,1789382,tt32985279,Steve,Steve,movie,2025,0.0,93,"Comedy,Drama",6.4,8427.0,nm2139803,nm10532463
1,1823218,tt7130300,The Woman in Cabin 10,The Woman in Cabin 10,movie,2025,0.0,92,"Drama,Mystery,Thriller",5.9,16028.0,nm1404307,"nm1100123,nm3143608,nm1404307,nm0296491,nm3405121"
2,1996393,tt34682204,Swim to Me,Limpia,movie,2025,0.0,102,Drama,5.3,162.0,nm3049622,"nm10202287,nm3049622,nm15953578"
3,11042350,tt38466379,"My Father, the BTK Killer","My Father, the BTK Killer",movie,2025,0.0,93,"Crime,Documentary",6.2,1063.0,nm1341941,\N
4,1780773,tt10548174,28 Years Later,28 Years Later,movie,2025,0.0,115,"Horror,Sci-Fi,Thriller",6.6,147272.0,nm0000965,nm0307497
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2899,1725885,tt21737024,The (Almost) Legends,Los (casi) ídolos de Bahía Colorada,movie,2023,0.0,97,"Comedy,Music",6.4,643.0,nm7473888,"nm7473888,nm6750289,nm2153699,nm0351021"
2900,1795372,tt31495242,Harta Tahta Raisa,Harta Tahta Raisa,movie,2024,0.0,110,"Biography,Documentary,Music",4.0,72.0,nm5780243,nm5780243
2901,1720462,tt28448409,Jurnal Risa by Risa Saraswati,Jurnal Risa by Risa Saraswati,movie,2024,0.0,92,"Documentary,Horror,Thriller",4.4,339.0,nm0559285,nm5193944
2902,1852883,tt33246651,Ahir Shah: Ends,Ahir Shah: Ends,movie,2024,0.0,61,"Comedy,Documentary",6.5,80.0,nm0651369,nm4113476


In [15]:
df_main4['titleType'].unique()

array(['movie', 'tvEpisode', 'video', nan, 'tvShort'], dtype=object)
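
The `nan` among the `titleType` values suggests that some Netflix rows found no match in `imdb_basics` during the left join, presumably because their `imdb_id` is missing or absent from the IMDb file. A minimal sketch of how such rows could be located with the `indicator` option of `merge`; the sample frames are illustrative, and `basics_sample` deliberately omits one title to simulate a failed match:

```python
import pandas as pd

netflix_sample = pd.DataFrame(
    {"id": [1789382, 1823218], "imdb_id": ["tt32985279", "tt7130300"]}
)
# tt7130300 is left out on purpose to simulate a title with no IMDb match.
basics_sample = pd.DataFrame({"tconst": ["tt32985279"], "titleType": ["movie"]})

checked = netflix_sample.merge(
    basics_sample,
    how="left",
    left_on="imdb_id",
    right_on="tconst",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["id", "imdb_id"]])
print(checked["titleType"].value_counts(dropna=False))
```

Passing `dropna=False` to `value_counts` also shows how many rows ended up without a `titleType`, which `unique()` alone does not reveal.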