# Introduction

## Students
- Caio Guedes - ccsg@cesar.school
- Thais Rezende - trb@cesar.school
- Daniel Moares - dmms@cesar.school
- João Paulo Veloso - jpgev@cesar.school
- Bruno Barbosa - bba@cesar.school

## Sources

The following datasets were used in this work:

- [IMDb Dataset](https://www.kaggle.com/datasets/ashirwadsangwan/imdb-dataset)
    - title.basics.tsv
    - title.ratings.tsv
- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)
    - netflix_titles.csv

> 🚧 Attention
> 
> The files are stored in the `data` folder at the project root.


## Analysis objectives
- Number of releases per director per year
- Countries that released the most movies
- Movies with the longest and shortest runtime
- Average rating per director
- Which category is the most frequent in the dataset?
- Generating insights through [charts](netflix_notebook_v2.ipynb#análise-exploratória-dos-dados-eda)

## DataFrame dictionary

|Column|Type|Description|
|--|--|--|
| imdb_id | string | alphanumeric unique identifier of the title. |
| type | string | the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc). |
| primary_title | string | the more popular title / the title used by the filmmakers on promotional materials at the point of release. |
| original_title | string | original title, in the original language. |
| is_adult | boolean | 0: non-adult title; 1: adult title. |
| start_year | int | represents the release year of a title. In the case of TV Series, it is the series start year. |
| end_year | int | TV Series end year; missing (`\N` in the source file) for all other title types. |
| runtime_minutes | – | primary runtime of the title, in minutes. |
| genres | string array | includes up to three genres associated with the title. |
| title_slug | string | slug generated from the primary title. |


In [1]:
import pandas as pd
from slugify import slugify
import numpy as np

# Dataset preparation

## IMDB datasets

### Import IMDB title data

In [2]:
df_imdb_titles = pd.read_csv('./data/title.basics.tsv', 
                             sep='\t', 
                             na_values='\\N', 
                             encoding='utf8', 
                             dtype=str)
df_imdb_titles.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,,1,"Documentary,Short"
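
The effect of `na_values='\\N'` combined with `dtype=str` can be checked on a small in-memory sample. The sketch below is only an illustration: the two-row TSV text is a made-up stand-in for `title.basics.tsv`, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv; the rows are illustrative only.
sample_tsv = (
    "tconst\ttitleType\tstartYear\tendYear\n"
    "tt0000001\tshort\t1894\t\\N\n"
    "tt0000009\tmovie\t1894\t\\N\n"
)

df_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N", dtype=str)

# '\N' markers are read as missing values; every other field stays a string.
missing_end_years = int(df_sample["endYear"].isna().sum())
start_year_is_text = isinstance(df_sample.loc[0, "startYear"], str)
```

Reading everything as text first postpones type decisions to the explicit conversion steps that follow.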


#### Keep only movies and series

In [3]:
#types = df_imdb_titles['titleType'].unique()
#types
df_imdb_titles = df_imdb_titles[df_imdb_titles['titleType'].isin(['movie', 'tvSeries' ])]
df_imdb_titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,,45,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,,100,"Documentary,News,Sport"
498,tt0000502,movie,Bohemios,Bohemios,0,1905,,100,
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,,90,Drama


#### Convert data types
- `startYear` from string to numeric (it stays float while missing values are present)
- `endYear` from string to numeric (it stays float while missing values are present)

In [4]:
df_imdb_titles['startYear'] = pd.to_numeric(df_imdb_titles['startYear'], errors='coerce', downcast='integer')
df_imdb_titles['endYear'] = pd.to_numeric(df_imdb_titles['endYear'], errors='coerce', downcast='integer')
df_imdb_titles['primaryTitle'] = df_imdb_titles['primaryTitle'].astype(str)
df_imdb_titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100,"Documentary,News,Sport"
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100,
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90,Drama
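
The years in the output above still show as `1894.0` because a column that contains missing values cannot be downcast to a plain integer type. A minimal sketch of one alternative, assuming pandas' nullable `Int64` dtype is acceptable for the later steps (the sample series is made up):

```python
import pandas as pd

years = pd.Series(["1894", None, "1905"], dtype=object)

# to_numeric alone yields float64 as soon as a missing value is present.
as_float = pd.to_numeric(years, errors="coerce")

# Casting to the nullable 'Int64' dtype keeps integers and marks gaps as <NA>.
as_nullable_int = as_float.astype("Int64")
```

With this dtype the years print as `1894` rather than `1894.0`, and missing entries remain distinguishable from real values.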


#### Create the title slug

In [5]:
#df_imdb_titles.dtypes
df_imdb_titles['primaryTitleSlug'] = df_imdb_titles['primaryTitle'].fillna('').apply(lambda x: slugify(str(x)))
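
To make the idea of a slug concrete, the sketch below uses a simplified stand-in written with the standard library only. `simple_slug` is a hypothetical helper for illustration; it is not the algorithm used by the `slugify` package and may differ from it on edge cases.

```python
import re
import unicodedata

def simple_slug(text: str) -> str:
    """Hypothetical, simplified stand-in for slugify(); not the library's algorithm."""
    # Decompose accented characters and drop whatever is not ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")

slugs = [simple_slug(t) for t in ["Miss Jerry", "Blood & Water", "Kota Factory"]]
```

Because the slug ignores case, punctuation and spacing, it can serve as an approximate join key between datasets that spell the same title slightly differently.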

#### Create genre slugs

In [6]:
df_imdb_titles['genres'] = df_imdb_titles['genres'].fillna('').str.split(',').apply(lambda x:[slugify(value) for value in x])
df_imdb_titles.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,primaryTitleSlug
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45,[romance],miss-jerry
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100,"[documentary, news, sport]",the-corbett-fitzsimmons-fight
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100,[],bohemios
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70,"[action, adventure, biography]",the-story-of-the-kelly-gang
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90,[drama],the-prodigal-son


#### Normalize type

In [7]:
df_imdb_titles['titleType'] = df_imdb_titles['titleType'].fillna('').apply(lambda x: slugify(str(x)))
df_imdb_titles.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,primaryTitleSlug
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45,[romance],miss-jerry
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100,"[documentary, news, sport]",the-corbett-fitzsimmons-fight
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100,[],bohemios
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70,"[action, adventure, biography]",the-story-of-the-kelly-gang
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90,[drama],the-prodigal-son


#### Rename columns

In [8]:
df_imdb_titles.rename(
    {'tconst': 'imdb_id', 'titleType': 'type', 'primaryTitle': 'primary_title', 'originalTitle': 'original_title', 'isAdult': 'is_adult', 'startYear': 'start_year', 'endYear': 'end_year', 'runtimeMinutes': 'runtime_minutes', 'primaryTitleSlug': 'title_slug'},
    axis=1,
    inplace=True)
df_imdb_titles.head()

Unnamed: 0,imdb_id,type,primary_title,original_title,is_adult,start_year,end_year,runtime_minutes,genres,title_slug
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45,[romance],miss-jerry
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100,"[documentary, news, sport]",the-corbett-fitzsimmons-fight
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100,[],bohemios
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70,"[action, adventure, biography]",the-story-of-the-kelly-gang
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90,[drama],the-prodigal-son


### Import IMDB ratings data

In [9]:
df_imdb_title_ratings = pd.read_csv('./data/title.ratings.tsv', 
                             sep='\t', 
                             na_values='\\N', 
                             encoding='utf8', 
                             dtype=str)
df_imdb_title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2045
1,tt0000002,5.7,273
2,tt0000003,6.5,2003
3,tt0000004,5.4,178
4,tt0000005,6.2,2763


#### Convert data
- `averageRating` to float
- `numVotes` to int

In [10]:
df_imdb_title_ratings['tconst'] = df_imdb_title_ratings['tconst'].astype(str)
df_imdb_title_ratings['averageRating'] = df_imdb_title_ratings['averageRating'].astype(float)
df_imdb_title_ratings['numVotes'] = df_imdb_title_ratings['numVotes'].astype(int)
df_imdb_title_ratings.dtypes

tconst            object
averageRating    float64
numVotes           int64
dtype: object

#### Rename columns

In [11]:
df_imdb_title_ratings.rename(
    {'tconst': 'imdb_id', 'averageRating': 'average_rating', 'numVotes': 'num_votes'},
    axis=1,
    inplace=True)
df_imdb_title_ratings.head()

Unnamed: 0,imdb_id,average_rating,num_votes
0,tt0000001,5.7,2045
1,tt0000002,5.7,273
2,tt0000003,6.5,2003
3,tt0000004,5.4,178
4,tt0000005,6.2,2763


### Combining IMDB movie and ratings data

In [12]:
df_imdb = pd.merge(left=df_imdb_titles, right=df_imdb_title_ratings, how='inner', on='imdb_id')
df_imdb.head()

Unnamed: 0,imdb_id,type,primary_title,original_title,is_adult,start_year,end_year,runtime_minutes,genres,title_slug,average_rating,num_votes
0,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45,[romance],miss-jerry,5.3,210
1,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100,"[documentary, news, sport]",the-corbett-fitzsimmons-fight,5.2,510
2,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100,[],bohemios,4.4,17
3,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70,"[action, adventure, biography]",the-story-of-the-kelly-gang,6.0,886
4,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90,[drama],the-prodigal-son,5.4,24
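
The inner join above keeps only titles that have a rating. A quick way to see what such a join would discard, and to guard against duplicated keys, is sketched below on toy frames. The data is made up; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

titles = pd.DataFrame({"imdb_id": ["tt1", "tt2", "tt3"], "primary_title": ["A", "B", "C"]})
ratings = pd.DataFrame({"imdb_id": ["tt1", "tt3"], "average_rating": [5.3, 6.0]})

# validate='one_to_one' raises MergeError if imdb_id repeats on either side.
merged = pd.merge(titles, ratings, how="left", on="imdb_id",
                  validate="one_to_one", indicator=True)

# Titles without a rating are the rows an inner join would have dropped.
unrated = merged.loc[merged["_merge"] == "left_only", "imdb_id"].tolist()
```

Inspecting `unrated` before switching back to `how='inner'` shows how many titles are lost in the combination.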


## Netflix dataset

### DataFrame dictionary

|Column|Type|Description|
|--|--|--|
| show_id      | string  | alphanumeric unique identifier of the show. |
| type         | string  | the show type (Movie, TV Show) |
| title        | string  | the show title |
| director     | string  | comma-separated values containing all directors' names |
| cast         | string  | comma-separated values containing all cast members' names |
| country      | string  | comma-separated values containing all countries involved in the show |
| date_added   | string  | the date when the show was added to the Netflix catalog |
| release_year | string  | the year in which the show was released |
| rating       | string  | a category indicating the target audience by age |
| duration     | string  | the show duration |
| listed_in    | string  | the show category |
| description  | string  | a show description |


### Import Netflix title data

In [13]:
df_netflix_titles = pd.read_csv('./data/netflix_titles.csv', sep=',', encoding='utf8', dtype=str)
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### Create the title slug

In [14]:
df_netflix_titles['title_slug'] = df_netflix_titles['title'].fillna('').apply(lambda x: slugify(str(x)))
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",dick-johnson-is-dead
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",blood-water
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,ganglands
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,kota-factory


In [15]:
types = df_netflix_titles['type'].unique()
types

array(['Movie', 'TV Show'], dtype=object)

### Normalize type

In [16]:
def normalize_netflix_type(value):
    if value == 'Movie':
        return 'movie'
    elif value == 'TV Show':
        return 'tvseries'
    else:
        return None

In [17]:
df_netflix_titles['type'] = df_netflix_titles['type'].apply(lambda x: normalize_netflix_type(x))
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug
0,s1,movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",dick-johnson-is-dead
1,s2,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",blood-water
2,s3,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,ganglands
3,s4,tvseries,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans
4,s5,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,kota-factory
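
The same normalization can also be written as a dictionary lookup with `Series.map`. This is a sketch of an equivalent alternative rather than a change to the notebook; `type_map` and the sample values (including the unmapped `'Podcast'`) are illustrative.

```python
import pandas as pd

type_map = {"Movie": "movie", "TV Show": "tvseries"}
raw_types = pd.Series(["Movie", "TV Show", "Podcast"])

# Values missing from the dictionary become NaN, mirroring the `return None` branch.
normalized = raw_types.map(type_map)
```

Keeping the mapping in a dictionary makes it easy to extend if a new type ever shows up in the source data.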


### Normalize country

In [18]:
df_netflix_titles['country'] = df_netflix_titles['country'].fillna('').str.split(',')
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug
0,s1,movie,Dick Johnson Is Dead,Kirsten Johnson,,[United States],"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",dick-johnson-is-dead
1,s2,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",[South Africa],"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",blood-water
2,s3,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",[],"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,ganglands
3,s4,tvseries,Jailbirds New Orleans,,,[],"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans
4,s5,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",[India],"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,kota-factory
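
Splitting on `','` alone leaves a leading space on every item after the first, and an empty string where the country is missing. A minimal sketch of a split that also cleans the pieces, assuming a hypothetical helper `split_clean` and made-up sample values:

```python
import pandas as pd

countries = pd.Series(["United States, India", None, "France"])

def split_clean(value):
    """Split on commas, strip whitespace, and drop empty pieces."""
    if pd.isna(value):
        return []
    return [part.strip() for part in value.split(",") if part.strip()]

country_lists = countries.apply(split_clean)
```

With clean lists, later steps such as `explode` and `groupby` do not need an extra `strip` or an empty-string filter.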


### Normalize category

In [19]:
df_netflix_titles['listed_in'] = df_netflix_titles['listed_in'].fillna('').str.split(',')
df_netflix_titles['listed_in'] = df_netflix_titles['listed_in'].apply(lambda x: [s.lower() for s in x])
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug
0,s1,movie,Dick Johnson Is Dead,Kirsten Johnson,,[United States],"September 25, 2021",2020,PG-13,90 min,[documentaries],"As her father nears the end of his life, filmm...",dick-johnson-is-dead
1,s2,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",[South Africa],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, tv dramas, tv myste...","After crossing paths at a party, a Cape Town t...",blood-water
2,s3,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",[],"September 24, 2021",2021,TV-MA,1 Season,"[crime tv shows, international tv shows, tv ...",To protect his family from a powerful drug lor...,ganglands
3,s4,tvseries,Jailbirds New Orleans,,,[],"September 24, 2021",2021,TV-MA,1 Season,"[docuseries, reality tv]","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans
4,s5,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",[India],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, romantic tv shows, ...",In a city of coaching centers known to train I...,kota-factory


### Normalize director

In [20]:
df_netflix_titles['director_list'] = df_netflix_titles['director'].fillna('').str.split(',')
df_netflix_titles['director_list'] = df_netflix_titles['director_list'].apply(lambda x: [s.lower() for s in x])
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug,director_list
0,s1,movie,Dick Johnson Is Dead,Kirsten Johnson,,[United States],"September 25, 2021",2020,PG-13,90 min,[documentaries],"As her father nears the end of his life, filmm...",dick-johnson-is-dead,[kirsten johnson]
1,s2,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",[South Africa],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, tv dramas, tv myste...","After crossing paths at a party, a Cape Town t...",blood-water,[]
2,s3,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",[],"September 24, 2021",2021,TV-MA,1 Season,"[crime tv shows, international tv shows, tv ...",To protect his family from a powerful drug lor...,ganglands,[julien leclercq]
3,s4,tvseries,Jailbirds New Orleans,,,[],"September 24, 2021",2021,TV-MA,1 Season,"[docuseries, reality tv]","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans,[]
4,s5,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",[India],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, romantic tv shows, ...",In a city of coaching centers known to train I...,kota-factory,[]


## Combining IMDB data with Netflix data

In [21]:
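# NOTE: df_netflix_merge is computed here but is not used in the cells below.
df_netflix_merge = pd.merge(df_netflix_titles, df_imdb[['title_slug', 'average_rating']], on=['title_slug'], how='right')
# NOTE: this assignment aligns the two frames by row index, not by title_slug,
# so each Netflix row receives the IMDB rating that sits at the same index position.
df_netflix_titles['rating_imdb'] = df_imdb['average_rating']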
df_netflix_titles.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,title_slug,director_list,rating_imdb
0,s1,movie,Dick Johnson Is Dead,Kirsten Johnson,,[United States],"September 25, 2021",2020,PG-13,90 min,[documentaries],"As her father nears the end of his life, filmm...",dick-johnson-is-dead,[kirsten johnson],5.3
1,s2,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",[South Africa],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, tv dramas, tv myste...","After crossing paths at a party, a Cape Town t...",blood-water,[],5.2
2,s3,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",[],"September 24, 2021",2021,TV-MA,1 Season,"[crime tv shows, international tv shows, tv ...",To protect his family from a powerful drug lor...,ganglands,[julien leclercq],4.4
3,s4,tvseries,Jailbirds New Orleans,,,[],"September 24, 2021",2021,TV-MA,1 Season,"[docuseries, reality tv]","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans,[],6.0
4,s5,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",[India],"September 24, 2021",2021,TV-MA,2 Seasons,"[international tv shows, romantic tv shows, ...",In a city of coaching centers known to train I...,kota-factory,[],5.4
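
One possible way to attach ratings by slug rather than by row position is sketched below. The toy frames, the rating values and the tie-break rule (keep the most voted IMDB title when a slug repeats) are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

netflix = pd.DataFrame({
    "title": ["Silent", "Ganglands", "Unknown Show"],
    "title_slug": ["silent", "ganglands", "unknown-show"],
})
imdb = pd.DataFrame({
    "title_slug": ["silent", "silent", "ganglands"],
    "average_rating": [5.5, 7.1, 7.2],   # illustrative values
    "num_votes": [120, 4000, 900],       # illustrative values
})

# Assumed tie-break: when several IMDB titles share a slug, keep the most voted one.
imdb_unique = (imdb.sort_values("num_votes", ascending=False)
                   .drop_duplicates(subset="title_slug"))

# A left join keeps every Netflix row and matches ratings by slug, not by index.
netflix_rated = netflix.merge(
    imdb_unique[["title_slug", "average_rating"]], on="title_slug", how="left"
).rename(columns={"average_rating": "rating_imdb"})
```

Titles with no IMDB match end up with a missing `rating_imdb`, which makes unmatched rows easy to count. Adding `type` to the join key would be a natural refinement, since a movie and a series can share a slug.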


# Exploratory analysis

In [22]:
df_netflix_titles.drop(columns=['show_id', 'rating'], inplace=True)
df_netflix_titles.head()

Unnamed: 0,type,title,director,cast,country,date_added,release_year,duration,listed_in,description,title_slug,director_list,rating_imdb
0,movie,Dick Johnson Is Dead,Kirsten Johnson,,[United States],"September 25, 2021",2020,90 min,[documentaries],"As her father nears the end of his life, filmm...",dick-johnson-is-dead,[kirsten johnson],5.3
1,tvseries,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",[South Africa],"September 24, 2021",2021,2 Seasons,"[international tv shows, tv dramas, tv myste...","After crossing paths at a party, a Cape Town t...",blood-water,[],5.2
2,tvseries,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",[],"September 24, 2021",2021,1 Season,"[crime tv shows, international tv shows, tv ...",To protect his family from a powerful drug lor...,ganglands,[julien leclercq],4.4
3,tvseries,Jailbirds New Orleans,,,[],"September 24, 2021",2021,1 Season,"[docuseries, reality tv]","Feuds, flirtations and toilet talk go down amo...",jailbirds-new-orleans,[],6.0
4,tvseries,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",[India],"September 24, 2021",2021,2 Seasons,"[international tv shows, romantic tv shows, ...",In a city of coaching centers known to train I...,kota-factory,[],5.4


### Countries that released the most movies

In [23]:
df_exploded = df_netflix_titles.explode(column='country')
df_exploded['country'] = df_exploded['country'].apply(lambda x: x.strip())
df_filtered = df_exploded[df_exploded['country'] != '']
df_grouped = df_filtered.groupby(['country']).size().sort_values(ascending=False)
df_grouped

country
United States     3690
India             1046
United Kingdom     806
Canada             445
France             393
                  ... 
Mozambique           1
Nicaragua            1
Palestine            1
Panama               1
Lithuania            1
Length: 122, dtype: int64

### Movies with the longest and shortest runtime

In [24]:
df_netflix_movies = df_netflix_titles[df_netflix_titles['type'] == 'movie'].copy()
df_netflix_movies.dropna(subset='duration', inplace=True)
df_netflix_movies['duration'] = df_netflix_movies['duration'].apply(lambda x: str(x))
df_netflix_movies['duration'] = df_netflix_movies['duration'].apply(lambda x: x.replace('min', ''))
df_netflix_movies.rename(columns={'duration': 'duration (min)'}, inplace=True)
df_netflix_movies['duration (min)'] = df_netflix_movies['duration (min)'].astype(int)
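
The duration parsing above can also be expressed with a regular expression that only accepts values ending in `min`. This is a sketch of an alternative on made-up sample values, not the notebook's method:

```python
import pandas as pd

durations = pd.Series(["90 min", "2 Seasons", None, "125 min"])

# Capture the number only when it is followed by 'min'; other rows become NaN.
minutes = durations.str.extract(r"^(\d+)\s*min$", expand=False)
minutes = pd.to_numeric(minutes).astype("Int64")
```

Rows such as `'2 Seasons'` become missing instead of raising an error, so the conversion does not depend on filtering by `type` beforehand.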

In [25]:
df_netflix_movies_min_duration = df_netflix_movies['duration (min)'].min()
df_netflix_movies[df_netflix_movies['duration (min)'] == df_netflix_movies_min_duration]

Unnamed: 0,type,title,director,cast,country,date_added,release_year,duration (min),listed_in,description,title_slug,director_list,rating_imdb
3777,movie,Silent,"Limbert Fabian, Brandon Oldenburg",,[United States],"June 4, 2019",2014,3,"[children & family movies, sci-fi & fantasy]","""Silent"" is an animated short film created by ...",silent,"[limbert fabian, brandon oldenburg]",5.5


In [26]:
df_netflix_movies_max_duration = df_netflix_movies['duration (min)'].max()
df_netflix_movies[df_netflix_movies['duration (min)'] == df_netflix_movies_max_duration]

Unnamed: 0,type,title,director,cast,country,date_added,release_year,duration (min),listed_in,description,title_slug,director_list,rating_imdb
4253,movie,Black Mirror: Bandersnatch,,"Fionn Whitehead, Will Poulter, Craig Parkinson...",[United States],"December 28, 2018",2018,312,"[dramas, international movies, sci-fi & fant...","In 1984, a young programmer begins to question...",black-mirror-bandersnatch,[],5.0


### Average rating per director

In [27]:
df_netflix_directors = df_netflix_titles.dropna(subset='director')
df_netflix_directors_grouped = df_netflix_directors.groupby(['director'])['rating_imdb'].mean()
df_netflix_directors_grouped.sort_values(ascending=False, inplace=True)
df_netflix_directors_grouped

director
Ed Lilly                        9.4
Leslye Davis, Catrin Einhorn    9.0
Lori Kaye                       9.0
Vincent Ward                    9.0
John Smithson                   8.7
                               ... 
Jos Humphrey                    1.6
Alessandra de Rossi             1.6
Narendra Nath                   1.6
Tony Collingwood                1.2
Michelle Johnston               1.0
Name: rating_imdb, Length: 4528, dtype: float64
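
Grouping on the raw `director` string treats a pair such as "Leslye Davis, Catrin Einhorn" as a single director. A minimal sketch of a per-person average, assuming a list column like `director_list` and made-up names and ratings:

```python
import pandas as pd

df = pd.DataFrame({
    "director_list": [["ana lima"], ["ana lima", "joe park"], ["joe park"], []],
    "rating_imdb": [8.0, 6.0, 7.0, 5.0],
})

# One row per (title, director); titles with an empty list yield NaN and are dropped.
per_director = df.explode("director_list").dropna(subset=["director_list"])

stats = (per_director.groupby("director_list")["rating_imdb"]
         .agg(mean_rating="mean", n_titles="count")
         .sort_values("mean_rating", ascending=False))
```

Keeping `n_titles` next to the mean helps when reading the ranking, since an average built from a single title is less informative. In the notebook's own `director_list`, missing directors appear as an empty string rather than an empty list, so an extra filter on `''` would be needed there.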

### Which category is the most frequent in the dataset?

In [28]:
df_netflix_genre = df_netflix_titles.copy(deep=True)
df_netflix_genre.dropna(subset='listed_in', inplace=True)
# NOTE: `listed_in` already holds lists at this point, so str() yields the list's text
# representation and the split below keeps brackets and quotes inside the labels.
df_netflix_genre['listed_in'] = df_netflix_genre['listed_in'].apply(lambda x: str(x))
df_netflix_genre['listed_in'] = df_netflix_genre['listed_in'].apply(lambda x: x.split(','))
df_netflix_genre_exploded = df_netflix_genre.explode(column='listed_in')
df_netflix_genre_exploded['listed_in'] = df_netflix_genre_exploded['listed_in'].apply(lambda x: x.strip())
df_netflix_genre_exploded
 

df_netflix_genre_grouped = df_netflix_genre_exploded.groupby(['listed_in']).size()
print(f'The most frequent category is: {df_netflix_genre_grouped.idxmax()}, with {df_netflix_genre_grouped.max()} occurrences.')

The most frequent category is: ' international movies'], with 1786 occurrences.
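
The stray quote and bracket in the printed label come from converting a list to its text form before splitting. A sketch of the same count done directly on list values, using made-up sample data, so the resulting numbers are not comparable with the output above:

```python
import pandas as pd

listed_in = pd.Series([
    ["dramas", " international movies"],
    ["international movies"],
    ["comedies", " international movies"],
])

# explode() works directly on list values, so no str() round-trip is needed.
categories = listed_in.explode().str.strip()
counts = categories.value_counts()

top_category, top_count = counts.idxmax(), int(counts.max())
```

Because each label is stripped before counting, `' international movies'` and `'international movies'` fall into the same bucket.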


# Exploratory data analysis (EDA)

## Automated analysis with the ydata-profiling library

In [29]:
#!pip install -U ydata-profiling

In [30]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df_netflix_titles, title="Profiling Report", explorative=True)
profile.to_notebook_iframe()

## Manual exploratory analysis with the Plotly library

In [80]:
import plotly.express as px

### Top 20 countries by number of releases

Objectives

- Show each country's share of the global film industry

In [32]:
df_netflix_country = df_grouped.rename_axis('country').reset_index(name='count').head(20)

fig = px.bar(df_netflix_country, 
             x='count', 
             y='country', 
             title="Top 20 countries by number of releases",
             labels={'country': '', 'count': ''},
             text_auto=True,
             height=600,
             width=800)
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig.update_xaxes(visible=False)
fig.update_traces(marker_color='#A7C7E7')
fig.show()

### Distribution of movie ratings with a boxplot

Objectives

- Distribution of IMDB ratings over the last 5 years, grouped by type

In [33]:
#print(px.colors.qualitative.Pastel1)

In [34]:
# Convert the year column to integer
df_netflix_titles["release_year"] = df_netflix_titles["release_year"].astype('int')
# Keep the 5 most recent release years (max-4 through max, inclusive)
df_netflix_last_titles = df_netflix_titles[df_netflix_titles["release_year"] >= df_netflix_titles['release_year'].max()-4]

fig = px.box(df_netflix_last_titles, 
             x='release_year', 
             y="rating_imdb", 
             color='type',
             width=700,
             title="Distribution of IMDB ratings<br>over the last 5 years, grouped by type",
             color_discrete_sequence=px.colors.qualitative.Dark24,
             labels={'release_year':'Release year', 'rating_imdb':'IMDB rating', 'type':'Type'})
fig.show()

### Rating histogram

Objectives

- Observe how ratings are distributed across movies and series

In [35]:
fig = px.histogram(df_netflix_titles, 
                   x="rating_imdb", 
                   color='type',
                   labels={'rating_imdb':'IMDB rating', 'type':'Type'},
                   color_discrete_sequence=px.colors.qualitative.Dark24,
                   width=700,
                   nbins=20, 
                   barmode='stack',
                   title="Histogram of IMDB ratings grouped by type")
fig.data = fig.data[::-1]
fig.layout.legend.traceorder = 'reversed'
fig.update_yaxes(title='')
fig.show()


### Movie categories

Objectives

- Show the distribution of film productions by movie category

In [79]:
# group movie categories and count them
listed_in = df_netflix_movies['listed_in']
# strip the leading spaces left by the comma split and skip empty entries
flattened_listed_in = [item.strip() for sublist in listed_in for item in sublist if item.strip()]
flattened_df = pd.DataFrame(flattened_listed_in, columns=['listed_in'])
grouped_df = flattened_df['listed_in'].value_counts().reset_index()
grouped_df.columns = ['listed_in', 'count']
grouped_df = grouped_df.sort_values(by='count', ascending=True)
# plot the result as a horizontal bar chart
fig = px.bar(grouped_df, 
             x='count', 
             y='listed_in', 
             title="Number of movies per category",
             labels={'listed_in': 'Categories', 'count': 'Total'},
             text_auto=True,
             height=800,
             width=800)
fig.show()

### Rating trends for films by acclaimed directors

Objectives

- Show audience perception over time of productions by famous directors

In [76]:
# filter by acclaimed directors
directors = ['quentin tarantino', 'martin scorsese', 'denis villeneuve', 'steven spielberg', 'alfred hitchcock', 'stanley kubrick', 'francis ford coppola', 'woody allen', 'billy wilder', 'peter jackson', 'james cameron', 'ridley scott', 'christopher nolan']
# one row per (title, director); names after the first keep a leading space from the comma split
df_top_directors = df_netflix_titles.explode('director_list')
df_top_directors['director_list'] = df_top_directors['director_list'].str.strip()
df_top_directors = df_top_directors[df_top_directors['director_list'].isin(directors)]
df_top_directors = df_top_directors.sort_values(by='release_year', ascending=True)
df_top_directors.head(5)
# build the chart, one line per individual director
fig = px.line(df_top_directors, 
            x='release_year', 
            y='rating_imdb', 
            color='director_list',
            markers=True,
            hover_data=['title'],
            title='Rating trends for acclaimed directors',
            labels={'release_year': 'Release year', 'rating_imdb': 'IMDB rating', 'director_list': 'Directors'},
            range_x=[df_top_directors['release_year'].min()-1, df_top_directors['release_year'].max()+1],
            width=1500,
            height=700,
            render_mode='svg'
            )
fig.show()