# MVP: Data Engineering
Student: Lucas Siqueira Rodrigues

Note: When tables were shown with `df.display()` or `display(df)`, the output added thousands of rows to the Jupyter file, which made it heavy and unreadable on GitHub because most of the file consisted of table records. I downloaded the `silver_netflix_titles` table and added it to GitHub as a CSV file named [netflix_titles.csv](netflix_titles.csv), and I limited the commands and queries in this notebook so that they return only 10 rows.

# 0. Objectives
* Upload the dataset to Databricks
* Load the dataset with Spark and save it as a Delta table in the bronze layer
* Build a data catalog
* Load the table again with Spark, clean some of the data, and save it as a Delta table in the silver layer
* Analyze the data through several SQL queries
* Write a self-assessment
* Problem definition: run several queries to understand the data better, such as tables with the distribution of categorical values and the maximum and minimum values of date and numeric columns, among others

# 1. Finding the data
I chose the [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset from Kaggle. It is released under the [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/) license and has 12 columns and 8807 rows.


# 2. Collection
The first step was to open the dataset [page](https://www.kaggle.com/datasets/shivamb/netflix-shows) and download the file in CSV format.  
![downloading the dataset](assets/1.png)  
The file was downloaded as a compressed archive and then extracted.
  
On the Databricks home page, we can click `Create Table`.  
![creating a table](assets/2.png)  
  
The next step was to upload the file to Databricks.  
![uploading to Databricks](assets/3.png)
  
Databricks lets us create a table either through the graphical interface or through a notebook. I chose the notebook, which is covered in section 4.  
![options](assets/4.png)  

# 3. Modeling
The dataset is stored in Databricks in CSV format. This is our raw data, without any kind of transformation.  
  
## 3.1. Data Catalog
The table has 12 columns.

|Column|Data Type|Description|
|---|---|---|
| show_id |String| The ID of the title; it is the primary key of the table |
| type |Categorical| The type of content, either Movie or TV Show |
| title |String| The title of the work |
| director |String| The director of the work |
| cast |String| The cast, stored as a single string with the names of the actors and actresses separated by commas |
| country |String| The country of origin of the content; there can be more than one country (values separated by commas) |
| date_added |String| The date the content was added to the platform; there are dates in the formats M/d/yyyy and MMMM d, yyyy. The earliest date is 2008-01-01 and the latest is 2021-09-24 |
| release_year |Integer| The year the content was released; 1925 is the earliest year and 2021 the latest |
| rating |Categorical| Codes showing how the content is rated (PG-13, TV-MA, PG, TV-14, TV-PG, TV-Y, TV-Y7, R, TV-G, G, NC-17, NR, TV-Y7-FV and UR), for example G, which is for all audiences, and TV-MA, which is for adults only |
| duration |Integer and String| The duration of the work; values are given in minutes or in seasons |
| listed_in |String| The categories the work is listed in; some titles are in a single category and others in several, which gives this column a large number of unique values |
| description |String| The description of the work |
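
The catalog notes that `duration` mixes minutes and seasons, and that `cast`, `country`, and `listed_in` pack several values into one comma-separated string. The following plain-Python sketch shows one way such values could be normalized. It is not part of the notebook: `parse_duration` and `split_values` are hypothetical helper names, and the unit labels `"min"` and `"season"` are an assumption.

```python
import re

def parse_duration(value):
    """Split a duration such as '90 min' or '2 Seasons' into (amount, unit)."""
    match = re.fullmatch(r"\s*(\d+)\s+(min|Seasons?)\s*", value)
    if match is None:
        return None  # unexpected format: leave the decision to the caller
    amount, unit = int(match.group(1)), match.group(2)
    return amount, "min" if unit == "min" else "season"

def split_values(value):
    """Turn a comma-separated string into a list of trimmed items."""
    return [part.strip() for part in value.split(",") if part.strip()]

print(parse_duration("90 min"))
print(parse_duration("2 Seasons"))
print(split_values("United States, Ghana, Burkina Faso"))
```

Keeping the unit next to the number avoids comparing minutes with seasons by accident in later queries.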

# 4. Loading
## 4.1. Bronze
Here we load the file into a Delta table in the bronze layer.

In [0]:
# File location and type
file_location = "/FileStore/tables/netflix_titles.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

df.limit(10).display()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Action & Adventure","To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war."
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV Comedies","In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life."
s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries","The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe."
s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, Sofia Carson, Liza Koshy, Ken Jeong, Elizabeth Perkins, Jane Krakowski, Michael McKean, Phil LaMarr",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,"Equestria's divided. But a bright-eyed hero believes Earth Ponies, Pegasi and Unicorns should be pals — and, hoof to heart, she’s determined to prove it."
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.
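
Because `inferSchema` is set to `"false"`, every column is loaded as a string, and quoted fields such as `date_added` keep their internal commas. The sketch below reproduces that reading behavior locally with Python's standard `csv` module instead of Spark, using the header and the first data row from the output above. It is only an illustration of how the file is parsed, not the notebook's loading code.

```python
import csv
import io

# Header plus the first data row, copied from the output above.
sample = (
    "show_id,type,title,director,cast,country,date_added,release_year,"
    "rating,duration,listed_in,description\n"
    's1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,'
    '"September 25, 2021",2020,PG-13,90 min,Documentaries,'
    '"As her father nears the end of his life, filmmaker Kirsten Johnson '
    'stages his death in inventive and comical ways to help them both '
    'face the inevitable."\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]

# Quoted fields keep their commas, and every value is still a string.
print(first["date_added"])
print(type(first["release_year"]).__name__)
# The empty cast field is read as '' by the csv module.
print(repr(first["cast"]))
```
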


Next, I save the DataFrame as a Delta table in the bronze layer.

In [0]:
df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .saveAsTable("bronze_netflix_titles")

Checking that the table was created correctly, using a query that selects all columns and returns the first 10 rows

In [0]:
%sql
SELECT * FROM `bronze_netflix_titles` LIMIT 10;

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Action & Adventure","To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war."
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV Comedies","In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life."
s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries","The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe."
s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, Sofia Carson, Liza Koshy, Ken Jeong, Elizabeth Perkins, Jane Krakowski, Michael McKean, Phil LaMarr",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,"Equestria's divided. But a bright-eyed hero believes Earth Ponies, Pegasi and Unicorns should be pals — and, hoof to heart, she’s determined to prove it."
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.


Querying the number of records in the bronze table

In [0]:
%sql
SELECT COUNT(*) FROM bronze_netflix_titles;

count(1)
8809


## 4.2. Silver
Looking at the data, I can already identify some transformations and improvements that are needed:
* Put all dates in the same format (convert the `date_added` column to the `date` type); currently there are records in 2 formats
  * M/d/yyyy
  * MMMM d, yyyy
* Convert the `release_year` column to the `integer` type
* Some rows have duration values in the `rating` column; these rows will be removed
* Remove rows with null values

To start, we load the bronze table as a Spark DataFrame

In [0]:
df_bronze = spark.read \
    .format("delta") \
    .table("bronze_netflix_titles")

df_bronze.limit(10).display()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Action & Adventure","To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war."
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV Comedies","In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life."
s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries","The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe."
s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, Sofia Carson, Liza Koshy, Ken Jeong, Elizabeth Perkins, Jane Krakowski, Michael McKean, Phil LaMarr",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,"Equestria's divided. But a bright-eyed hero believes Earth Ponies, Pegasi and Unicorns should be pals — and, hoof to heart, she’s determined to prove it."
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.


Here we transform the `date_added` column. It has dates in the formats `M/d/yyyy` and `MMMM d, yyyy`; we convert all of these values to a single format and make the column type `date`.

In [0]:
from pyspark.sql.functions import col, to_date, coalesce

# Build the converted date column by trying multiple formats
df_converted = df_bronze.withColumn(
    "date_added",
    coalesce(
        to_date(col("date_added"), "M/d/yyyy"),
        to_date(col("date_added"), "MMMM d, yyyy")
    )
)

df_converted.limit(10).display()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Action & Adventure","To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war."
s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV Comedies","In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life."
s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, Henry Thomas, Kristin Lehman, Samantha Sloyan, Igby Rigney, Rahul Kohli, Annarah Cymone, Annabeth Gish, Alex Essoe, Rahul Abburi, Matt Biedel, Michael Trucco, Crystal Balint, Louis Oliver",,2021-09-24,2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries","The arrival of a charismatic young priest brings glorious miracles, ominous mysteries and renewed religious fervor to a dying town desperate to believe."
s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, Sofia Carson, Liza Koshy, Ken Jeong, Elizabeth Perkins, Jane Krakowski, Michael McKean, Phil LaMarr",,2021-09-24,2021,PG,91 min,Children & Family Movies,"Equestria's divided. But a bright-eyed hero believes Earth Ponies, Pegasi and Unicorns should be pals — and, hoof to heart, she’s determined to prove it."
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,2021-09-24,2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.
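
The `coalesce` call above returns the result of the first format that parses successfully. The same idea can be checked locally in plain Python, as in the sketch below. It is an analogy and not the Spark implementation: `parse_date_added` is a hypothetical helper, `%m/%d/%Y` and `%B %d, %Y` are assumed to be the `strptime` counterparts of `M/d/yyyy` and `MMMM d, yyyy`, and `%B` matches English month names only under the default (C/English) locale.

```python
from datetime import datetime

# Assumed strptime counterparts of the Spark patterns M/d/yyyy and MMMM d, yyyy.
FORMATS = ("%m/%d/%Y", "%B %d, %Y")

def parse_date_added(value):
    """Return the first successful parse, or None if no format matches."""
    if value is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue  # try the next format, like the next coalesce argument
    return None

print(parse_date_added("September 25, 2021"))
print(parse_date_added("9/24/2021"))
print(parse_date_added("not a date"))
```

A value that matches neither format ends up as `None`, which corresponds to a null `date_added` that the later null-removal step would drop.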


Now I convert the `release_year` column to the `integer` type

In [0]:
df_year = df_converted.withColumn("release_year", col("release_year").cast("integer"))
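
As a rough plain-Python picture of this cast, the sketch below assumes Spark's non-ANSI behavior, in which a string that cannot be parsed as an integer becomes null instead of raising an error (with ANSI mode enabled, the cast fails instead). `to_int_or_none` is a hypothetical helper, not a Spark function.

```python
def to_int_or_none(value):
    """Convert a string to int, returning None when it is not a valid integer."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None  # stands in for the null produced by a failed cast

print(to_int_or_none("2020"))
print(to_int_or_none("United States"))
```
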

Removing the rows that have duration values in the `rating` column

In [0]:
df_filtered = df_year.filter(~col("rating").contains("min"))
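
A filter keeps a row only when its condition evaluates to true. In Spark, `contains` on a null `rating` yields null, so rows with a null rating are also dropped by this step. The plain-Python sketch below mirrors that logic; `keep_row` is a hypothetical helper and the three sample values are illustrative, not taken from the dataset.

```python
def keep_row(rating):
    """Mirror ~col('rating').contains('min'): rows with a null rating are not kept."""
    if rating is None:
        return False  # a null condition in Spark means the row is filtered out
    return "min" not in rating

# Illustrative values: a valid rating, a misplaced duration, a missing rating.
ratings = ["PG-13", "74 min", None]
print([r for r in ratings if keep_row(r)])
```
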

Finally, I remove the rows with null values and inspect the dataset using the display method.

In [0]:
df_cleaned = df_filtered.dropna()
df_cleaned.limit(10).display()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,2021-09-24,2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.
s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, Edin Hasanović, Anna Fialová, Marlon Boess, Victor Boccard, Fleur Geffrier, Aziz Dyab, Mélanie Fouché, Elizaveta Maximová","Germany, Czech Republic",2021-09-23,2021,TV-MA,127 min,"Dramas, International Movies","After most of her family is murdered in a terrorist bombing, a young woman is unknowingly lured into joining the very group that killed them."
s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi, Nassar",India,2021-09-21,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies","When the father of the man she loves insists that his twin sons marry twin sisters, a woman creates an alter ego that might be a bit too convincing."
s28,Movie,Grown Ups,Dennis Dugan,"Adam Sandler, Kevin James, Chris Rock, David Spade, Rob Schneider, Salma Hayek, Maria Bello, Maya Rudolph, Colin Quinn, Tim Meadows, Joyce Van Patten",United States,2021-09-20,2010,PG-13,103 min,Comedies,"Mourning the loss of their beloved junior high basketball coach, five middle-aged pals reunite at a lake house and rediscover the joys of being a kid."
s29,Movie,Dark Skies,Scott Stewart,"Keri Russell, Josh Hamilton, J.K. Simmons, Dakota Goyo, Kadan Rockett, L.J. Benet, Rich Hutchman, Myndy Crist, Annie Thurman, Jake Brennan",United States,2021-09-19,2013,PG-13,97 min,"Horror Movies, Sci-Fi & Fantasy","A family’s idyllic suburban life shatters when an alien force invades their home, and as they struggle to convince others of the deadly threat."
s30,Movie,Paranoia,Robert Luketic,"Liam Hemsworth, Gary Oldman, Amber Heard, Harrison Ford, Lucas Till, Embeth Davidtz, Julian McMahon, Josh Holloway, Richard Dreyfuss, Angela Sarafyan","United States, India, France",2021-09-19,2013,PG-13,106 min,Thrillers,"Blackmailed by his company's CEO, a low-level employee finds himself forced to spy on the boss's rival and former mentor."
s39,Movie,Birth of the Dragon,George Nolfi,"Billy Magnussen, Ron Yuan, Qu Jingjing, Terry Chen, Vanness Wu, Jin Xing, Philip Ng, Xia Yu, Yu Xia","China, Canada, United States",2021-09-16,2017,PG-13,96 min,"Action & Adventure, Dramas","A young Bruce Lee angers kung fu traditionalists by teaching outsiders, leading to a showdown with a Shaolin master in this film based on real events."
s42,Movie,Jaws,Steven Spielberg,"Roy Scheider, Robert Shaw, Richard Dreyfuss, Lorraine Gary, Murray Hamilton, Carl Gottlieb, Jeffrey Kramer, Susan Backlinie, Jonathan Filley, Ted Grossman",United States,2021-09-16,1975,PG,124 min,"Action & Adventure, Classic Movies, Dramas","When an insatiable great white shark terrorizes Amity Island, a police chief, an oceanographer and a grizzled shark hunter seek to destroy the beast."
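
Rows `s1` to `s7` from the earlier outputs no longer appear because each of them has at least one empty field in `director`, `cast`, or `country`, and `dropna()` with its default settings removes a row when any column is null. The sketch below reproduces that rule in plain Python for four of the rows shown earlier, restricted to three columns for brevity; `drop_rows_with_nulls` is a hypothetical helper.

```python
# Values copied from the earlier outputs; None marks an empty field.
rows = [
    {"show_id": "s2", "director": None, "country": "South Africa"},
    {"show_id": "s3", "director": "Julien Leclercq", "country": None},
    {"show_id": "s9", "director": "Andy Devonshire", "country": "United Kingdom"},
    {"show_id": "s10", "director": "Theodore Melfi", "country": "United States"},
]

def drop_rows_with_nulls(rows):
    """Keep only rows in which no value is None (the 'any' rule of dropna)."""
    return [row for row in rows if all(v is not None for v in row.values())]

print([row["show_id"] for row in drop_rows_with_nulls(rows)])
```
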


And finally, loading the data into a Delta table in the `silver` layer

In [0]:
df_cleaned.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .saveAsTable("silver_netflix_titles")

Checking the silver table and returning the first 10 rows

In [0]:
query = "SELECT * FROM silver_netflix_titles LIMIT 10"
spark.sql(query).display()

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri","United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood",United Kingdom,2021-09-24,2021,TV-14,9 Seasons,"British TV Shows, Reality TV","A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor",United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.
s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, Edin Hasanović, Anna Fialová, Marlon Boess, Victor Boccard, Fleur Geffrier, Aziz Dyab, Mélanie Fouché, Elizaveta Maximová","Germany, Czech Republic",2021-09-23,2021,TV-MA,127 min,"Dramas, International Movies","After most of her family is murdered in a terrorist bombing, a young woman is unknowingly lured into joining the very group that killed them."
s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi, Nassar",India,2021-09-21,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies","When the father of the man she loves insists that his twin sons marry twin sisters, a woman creates an alter ego that might be a bit too convincing."
s28,Movie,Grown Ups,Dennis Dugan,"Adam Sandler, Kevin James, Chris Rock, David Spade, Rob Schneider, Salma Hayek, Maria Bello, Maya Rudolph, Colin Quinn, Tim Meadows, Joyce Van Patten",United States,2021-09-20,2010,PG-13,103 min,Comedies,"Mourning the loss of their beloved junior high basketball coach, five middle-aged pals reunite at a lake house and rediscover the joys of being a kid."
s29,Movie,Dark Skies,Scott Stewart,"Keri Russell, Josh Hamilton, J.K. Simmons, Dakota Goyo, Kadan Rockett, L.J. Benet, Rich Hutchman, Myndy Crist, Annie Thurman, Jake Brennan",United States,2021-09-19,2013,PG-13,97 min,"Horror Movies, Sci-Fi & Fantasy","A family’s idyllic suburban life shatters when an alien force invades their home, and as they struggle to convince others of the deadly threat."
s30,Movie,Paranoia,Robert Luketic,"Liam Hemsworth, Gary Oldman, Amber Heard, Harrison Ford, Lucas Till, Embeth Davidtz, Julian McMahon, Josh Holloway, Richard Dreyfuss, Angela Sarafyan","United States, India, France",2021-09-19,2013,PG-13,106 min,Thrillers,"Blackmailed by his company's CEO, a low-level employee finds himself forced to spy on the boss's rival and former mentor."
s39,Movie,Birth of the Dragon,George Nolfi,"Billy Magnussen, Ron Yuan, Qu Jingjing, Terry Chen, Vanness Wu, Jin Xing, Philip Ng, Xia Yu, Yu Xia","China, Canada, United States",2021-09-16,2017,PG-13,96 min,"Action & Adventure, Dramas","A young Bruce Lee angers kung fu traditionalists by teaching outsiders, leading to a showdown with a Shaolin master in this film based on real events."
s42,Movie,Jaws,Steven Spielberg,"Roy Scheider, Robert Shaw, Richard Dreyfuss, Lorraine Gary, Murray Hamilton, Carl Gottlieb, Jeffrey Kramer, Susan Backlinie, Jonathan Filley, Ted Grossman",United States,2021-09-16,1975,PG,124 min,"Action & Adventure, Classic Movies, Dramas","When an insatiable great white shark terrorizes Amity Island, a police chief, an oceanographer and a grizzled shark hunter seek to destroy the beast."


# 5. Análise
Por meio de consultas SQL, irei verificar a qualidade dos dados, e responder algumas perguntas que foram definidas no objetivo

Counting the records in the silver table

In [0]:
%sql
SELECT COUNT(*) FROM silver_netflix_titles;

count(1)
5312


Querying the `show_id` column

In [0]:
%sql
SELECT show_id FROM silver_netflix_titles LIMIT 10;

show_id
s8
s9
s10
s13
s25
s28
s29
s30
s39
s42


Querying the `type` column

In [0]:
%sql
SELECT type FROM silver_netflix_titles LIMIT 10;

type
Movie
TV Show
Movie
Movie
Movie
Movie
Movie
Movie
Movie
Movie


The `type` column is categorical, so I will count the records in each category

In [0]:
%sql
SELECT type, COUNT(*) AS count
FROM silver_netflix_titles
GROUP BY type
ORDER BY count DESC

type,count
Movie,5170
TV Show,142


Querying the `title` column

In [0]:
%sql
SELECT title FROM silver_netflix_titles LIMIT 10;

title
Sankofa
The Great British Baking Show
The Starling
Je Suis Karl
Jeans
Grown Ups
Dark Skies
Paranoia
Birth of the Dragon
Jaws


Querying the `director` column

In [0]:
%sql
SELECT director FROM silver_netflix_titles LIMIT 10;

director
Haile Gerima
Andy Devonshire
Theodore Melfi
Christian Schwochow
S. Shankar
Dennis Dugan
Scott Stewart
Robert Luketic
George Nolfi
Steven Spielberg


Querying the `cast` column

In [0]:
%sql
SELECT cast FROM silver_netflix_titles LIMIT 10;

cast
"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra Duah, Nick Medley, Mutabaruka, Afemo Omilami, Reggie Carter, Mzuri"
"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Hollywood"
"Melissa McCarthy, Chris O'Dowd, Kevin Kline, Timothy Olyphant, Daveed Diggs, Skyler Gisondo, Laura Harrier, Rosalind Chao, Kimberly Quinn, Loretta Devine, Ravi Kapoor"
"Luna Wedler, Jannis Niewöhner, Milan Peschel, Edin Hasanović, Anna Fialová, Marlon Boess, Victor Boccard, Fleur Geffrier, Aziz Dyab, Mélanie Fouché, Elizaveta Maximová"
"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi, Nassar"
"Adam Sandler, Kevin James, Chris Rock, David Spade, Rob Schneider, Salma Hayek, Maria Bello, Maya Rudolph, Colin Quinn, Tim Meadows, Joyce Van Patten"
"Keri Russell, Josh Hamilton, J.K. Simmons, Dakota Goyo, Kadan Rockett, L.J. Benet, Rich Hutchman, Myndy Crist, Annie Thurman, Jake Brennan"
"Liam Hemsworth, Gary Oldman, Amber Heard, Harrison Ford, Lucas Till, Embeth Davidtz, Julian McMahon, Josh Holloway, Richard Dreyfuss, Angela Sarafyan"
"Billy Magnussen, Ron Yuan, Qu Jingjing, Terry Chen, Vanness Wu, Jin Xing, Philip Ng, Xia Yu, Yu Xia"
"Roy Scheider, Robert Shaw, Richard Dreyfuss, Lorraine Gary, Murray Hamilton, Carl Gottlieb, Jeffrey Kramer, Susan Backlinie, Jonathan Filley, Ted Grossman"


Querying the `country` column

In [0]:
%sql
SELECT country FROM silver_netflix_titles LIMIT 10;

country
"United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia"
United Kingdom
United States
"Germany, Czech Republic"
India
United States
United States
"United States, India, France"
"China, Canada, United States"
United States


Some titles list multiple countries separated by commas, so we can count how many titles are associated with each number of countries

In [0]:
%sql
SELECT num_countries, COUNT(*) AS count
FROM (
    SELECT 
        SIZE(SPLIT(country, ',')) AS num_countries
    FROM silver_netflix_titles
) subquery
GROUP BY num_countries
ORDER BY num_countries

num_countries,count
1,4318
2,644
3,213
4,93
5,29
6,9
7,5
8,1


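Returning the `country` value with the most titles in our dataset. The query groups by the full column value, so a multi-country entry such as "United States, India, France" forms its own group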

In [0]:
%sql
SELECT country, COUNT(*) AS count
FROM silver_netflix_titles
GROUP BY country
ORDER BY count DESC
LIMIT 1;

country,count
United States,1834
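
Because the query above groups by the whole `country` string, titles co-produced with other countries are not counted toward United States. The following is a minimal sketch in plain Python (not Spark) of how a per-country count could differ, using only the ten sample values shown in the `country` output earlier; the variable names are illustrative. In Spark SQL, combining `EXPLODE` with `SPLIT` and `TRIM` would presumably play the same role, but that query was not run here.

```python
from collections import Counter

# Ten sample values copied from the `country` output above (not the full table).
sample_countries = [
    "United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia",
    "United Kingdom",
    "United States",
    "Germany, Czech Republic",
    "India",
    "United States",
    "United States",
    "United States, India, France",
    "China, Canada, United States",
    "United States",
]

# Counting the full string, which is what GROUP BY country does.
exact_value_counts = Counter(sample_countries)

# Counting each country separately: split on commas and trim whitespace.
per_country_counts = Counter(
    name.strip()
    for value in sample_countries
    for name in value.split(",")
    if name.strip()
)

print("Exact value 'United States':", exact_value_counts["United States"])
print("Any title involving United States:", per_country_counts["United States"])
```

In this ten-row sample, the exact value matches 4 rows, while United States appears in 7 rows once the lists are split.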


Querying the `date_added` column

In [0]:
%sql
SELECT date_added FROM silver_netflix_titles LIMIT 10;

date_added
2021-09-24
2021-09-24
2021-09-24
2021-09-23
2021-09-21
2021-09-20
2021-09-19
2021-09-19
2021-09-16
2021-09-16


Querying the earliest and latest dates on which a title was added to the platform

In [0]:
%sql
SELECT MIN(date_added), MAX(date_added) FROM silver_netflix_titles;

min(date_added),max(date_added)
2008-01-01,2021-09-24


Querying the `release_year` column

In [0]:
%sql
SELECT release_year FROM silver_netflix_titles LIMIT 10;

release_year
1993
2021
2021
2021
1998
2010
2013
2013
2017
1975


Querying the earliest and latest release years, to then find the titles released in those years

In [0]:
%sql
SELECT MIN(release_year), MAX(release_year) FROM silver_netflix_titles;

min(release_year),max(release_year)
1942,2021


The Battle of Midway is the oldest title, released in 1942

In [0]:
%sql
SELECT title, release_year
FROM silver_netflix_titles
WHERE release_year == 1942;

title,release_year
The Battle of Midway,1942
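
The two steps above (find the minimum year, then filter by it) can also be combined with a scalar subquery, so the year does not have to be typed by hand. The sketch below assumes SQLite as a stand-in for Spark SQL and a hypothetical table `sample_titles` holding a few rows copied from the outputs in this notebook; it is an illustration, not the query that was run on Databricks.

```python
import sqlite3

# A few (title, release_year) pairs copied from the outputs above;
# this is not the full silver table.
rows = [
    ("Grown Ups", 2010),
    ("Dark Skies", 2013),
    ("Paranoia", 2013),
    ("Birth of the Dragon", 2017),
    ("Jaws", 1975),
    ("The Battle of Midway", 1942),
]

# In-memory SQLite database; `sample_titles` is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_titles (title TEXT, release_year INTEGER)")
conn.executemany("INSERT INTO sample_titles VALUES (?, ?)", rows)

# The scalar subquery computes the minimum year, so 1942 is not hard-coded.
oldest = conn.execute(
    """
    SELECT title, release_year
    FROM sample_titles
    WHERE release_year = (SELECT MIN(release_year) FROM sample_titles)
    """
).fetchall()
conn.close()

print(oldest)
```

The same pattern with `MAX` would return every title from the most recent year.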


And several titles were released in 2021

In [0]:
%sql
SELECT title, release_year
FROM silver_netflix_titles
WHERE release_year == 2021
LIMIT 10;

title,release_year
The Great British Baking Show,2021
The Starling,2021
Je Suis Karl,2021
Kate,2021
Thimmarusu,2021
The Water Man,2021
Sweet Girl,2021
Beckett,2021
Gone for Good,2021
Valeria,2021


Querying the `rating` column

In [0]:
%sql
SELECT rating FROM silver_netflix_titles LIMIT 10;

rating
TV-MA
TV-14
PG-13
TV-MA
TV-14
PG-13
PG-13
PG-13
PG-13
PG


The `rating` column is categorical, so we can count the records in each category

In [0]:
%sql
SELECT rating, COUNT(*) AS count
FROM silver_netflix_titles
GROUP BY rating
ORDER BY count DESC

rating,count
TV-MA,1812
TV-14,1212
R,775
PG-13,469
TV-PG,428
PG,274
TV-G,84
TV-Y,76
TV-Y7,76
NR,58
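
As a quick sanity check on the distribution above, the counts can be compared with the total from the `COUNT(*)` query. The sketch below is plain Python using only the figures copied from the two outputs; it computes the share of the most common rating and how many titles are not covered by the ten rows shown.

```python
# Total number of rows, from the COUNT(*) query on silver_netflix_titles.
total_titles = 5312

# Rating counts copied from the output above (ten rows shown).
rating_counts = {
    "TV-MA": 1812,
    "TV-14": 1212,
    "R": 775,
    "PG-13": 469,
    "TV-PG": 428,
    "PG": 274,
    "TV-G": 84,
    "TV-Y": 76,
    "TV-Y7": 76,
    "NR": 58,
}

# Titles accounted for by the ratings shown, and the remainder.
shown_titles = sum(rating_counts.values())
not_shown_titles = total_titles - shown_titles

# Share of the most common rating relative to all titles.
tv_ma_share = rating_counts["TV-MA"] / total_titles

print(f"Titles in the ratings shown: {shown_titles}")
print(f"Titles with a rating not shown: {not_shown_titles}")
print(f"TV-MA share: {tv_ma_share:.1%}")
```

The ten ratings shown account for 5264 of the 5312 titles, so 48 titles carry a rating that does not appear in the output. TV-MA is the most frequent rating at about 34.1% of all titles, which is a plurality rather than a majority.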


Querying the `duration` column

In [0]:
%sql
SELECT duration FROM silver_netflix_titles LIMIT 10;

duration
125 min
9 Seasons
104 min
127 min
166 min
103 min
97 min
106 min
96 min
124 min
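
The `duration` column mixes two units: minutes for movies and seasons for TV shows. Any numeric analysis would first need to separate the number from the unit. The following is a minimal sketch in plain Python, assuming every value has the form "<number> <unit>" as in the ten sample rows above; `parse_duration` is a hypothetical helper, not part of the notebook.

```python
from typing import Tuple

# Sample values copied from the `duration` output above.
sample_durations = [
    "125 min", "9 Seasons", "104 min", "127 min", "166 min",
    "103 min", "97 min", "106 min", "96 min", "124 min",
]

def parse_duration(value: str) -> Tuple[int, str]:
    """Split a value such as '125 min' into (125, 'min')."""
    amount, unit = value.split(maxsplit=1)
    return int(amount), unit

parsed = [parse_duration(value) for value in sample_durations]

# Keep only durations measured in minutes (movies in this sample).
movie_minutes = [amount for amount, unit in parsed if unit == "min"]

print("Parsed:", parsed)
print("Longest movie in the sample (min):", max(movie_minutes))
```

Values that do not follow the assumed pattern would raise a `ValueError`, which makes unexpected formats visible instead of silently ignoring them.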


Querying the `listed_in` column

In [0]:
%sql
SELECT listed_in FROM silver_netflix_titles LIMIT 10;

listed_in
"Dramas, Independent Movies, International Movies"
"British TV Shows, Reality TV"
"Comedies, Dramas"
"Dramas, International Movies"
"Comedies, International Movies, Romantic Movies"
Comedies
"Horror Movies, Sci-Fi & Fantasy"
Thrillers
"Action & Adventure, Dramas"
"Action & Adventure, Classic Movies, Dramas"


Querying the `description` column

In [0]:
%sql
SELECT description FROM silver_netflix_titles LIMIT 10;

description
"On a photo shoot in Ghana, an American model slips back in time, becomes enslaved on a plantation and bears witness to the agony of her ancestral past."
"A talented batch of amateur bakers face off in a 10-week competition, whipping up their best dishes in the hopes of being named the U.K.'s best."
A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward.
"After most of her family is murdered in a terrorist bombing, a young woman is unknowingly lured into joining the very group that killed them."
"When the father of the man she loves insists that his twin sons marry twin sisters, a woman creates an alter ego that might be a bit too convincing."
"Mourning the loss of their beloved junior high basketball coach, five middle-aged pals reunite at a lake house and rediscover the joys of being a kid."
"A family’s idyllic suburban life shatters when an alien force invades their home, and as they struggle to convince others of the deadly threat."
"Blackmailed by his company's CEO, a low-level employee finds himself forced to spy on the boss's rival and former mentor."
"A young Bruce Lee angers kung fu traditionalists by teaching outsiders, leading to a showdown with a Shaolin master in this film based on real events."
"When an insatiable great white shark terrorizes Amity Island, a police chief, an oceanographer and a grizzled shark hunter seek to destroy the beast."


All columns appear to contain clean data.
With SQL commands, we can easily extract several facts.
* Of the records in the silver table, 5170 are Movie and 142 are TV Show
* The earliest recorded addition of a title was on 2008-01-01, and the most recent on 2021-09-24
* The title with the oldest release year (The Battle of Midway) was released in 1942, and several titles were released in 2021
* The most common rating is TV-MA (1812 titles), which stands for Mature Audience Only, meaning it is not recommended for minors
* Titles list up to 8 countries; most of them (4318) list a single country, and the most frequent `country` value is United States, with 1834 records

# Self-assessment
First of all, this project was without a doubt very enriching, and I will certainly apply the knowledge gained in this course in my professional work.
I currently work with image processing, and I realized that I already apply some data processing concepts (such as separating data into raw, bronze, silver and gold layers), but I did not know their names.
  
I also learned a lot about SQL. Although I use databases, I do not have much practice writing queries, and the course and the MVP gave me plenty of practice, which will help me in future projects.
  
Overall, I would say I got a lot out of the course, and I am grateful to the professors who answered many of my questions during the live sessions.