<a href="https://colab.research.google.com/github/luasampaio/data-engineering/blob/main/22_ntb_layer_bronze.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Initial Setup and Imports**

This notebook is an example of implementing the Medallion architecture, with its Bronze, Silver, and Gold layers, in PySpark.

**Explanations:**

- Import the required libraries and functions.
- Define the file paths for the Bronze, Silver, and Gold layers.
- Configure Spark settings for good performance, such as automatic shuffle partitioning.
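The path and configuration steps listed above could be sketched as follows. The base directory `/tmp/medallion`, the names `LAYER_PATHS`, `SPARK_CONF`, and `build_session` are all hypothetical, and mapping "automatic shuffle partitioning" to Spark's Adaptive Query Execution settings is an assumption, not something the notebook states.

```python
# Hypothetical base directory; the notebook does not define one.
BASE_PATH = "/tmp/medallion"

# One storage path per Medallion layer.
LAYER_PATHS = {layer: f"{BASE_PATH}/{layer}" for layer in ("bronze", "silver", "gold")}

# Adaptive Query Execution settings (assumed interpretation of "automatic
# shuffle partitioning"): with both enabled, Spark can coalesce shuffle
# partitions at runtime instead of always using a fixed count.
SPARK_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def build_session(app_name, conf=SPARK_CONF):
    """Create (or reuse) a SparkSession with the given settings applied."""
    # Imported here so the constants above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook cell this might be used as `spark = build_session("LucianaSampaio")`, with `LAYER_PATHS["bronze"]` as the target when the Bronze data is written to storage.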

In [19]:
!pip install pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder.appName("LucianaSampaio").getOrCreate()

# URL of the raw CSV file
url = 'https://raw.githubusercontent.com/luasampaio/datasets/main/customers.csv'





- Reading from the URL


In [20]:
# Read the CSV file directly from the URL using pandas
pandas_df = pd.read_csv(url)

In [21]:
# Create a Spark DataFrame from the pandas DataFrame
df = spark.createDataFrame(pandas_df)

In [22]:
# Register the Spark DataFrame as a temporary view named "bronze"
# (createOrReplaceTempView replaces the deprecated registerTempTable)
df.createOrReplaceTempView("bronze")



In [23]:
# Now you can access the table using spark.table()
df = spark.table("bronze")
display(df)

DataFrame[customer_id: bigint, first_name: string, last_name: string, phone: string, email: string, street: string, city: string, state: string, zip_code: bigint]

In [24]:
df.count()

1445

In [25]:
# Instead of referencing 'bronze' directly, access it through the Spark session:
spark.table("bronze").show()

+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+
|customer_id|first_name|last_name|         phone|               email|              street|           city|state|zip_code|
+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+
|          1|     Debra|    Burks|           NaN|debra.burks@yahoo...|   9273 Thorne Ave. |   Orchard Park|   NY|   14127|
|          2|     Kasha|     Todd|           NaN|kasha.todd@yahoo.com|    910 Vine Street |       Campbell|   CA|   95008|
|          3|    Tameka|   Fisher|           NaN|tameka.fisher@aol...|769C Honey Creek ...|  Redondo Beach|   CA|   90278|
|          4|     Daryl|   Spence|           NaN|daryl.spence@aol.com|     988 Pearl Lane |      Uniondale|   NY|   11553|
|          5|Charolette|     Rice|(916) 381-6003|charolette.rice@m...|      107 River Dr. |     Sacramento|   CA|   95820|
|          6|   
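In the output above, missing `phone` values appear as `NaN`, which is how pandas represents missing data. One possible way to turn them into proper nulls before building the Spark DataFrame is sketched below. The helper name `nan_to_none` is hypothetical, and the sample rows use only a subset of the columns.

```python
import math


def nan_to_none(records):
    """Return a copy of `records` (a list of dicts) with float NaN values replaced by None."""

    def clean(value):
        return None if isinstance(value, float) and math.isnan(value) else value

    return [{key: clean(value) for key, value in row.items()} for row in records]


# Small in-memory example mirroring two rows from the output above.
sample = [
    {"customer_id": 1, "first_name": "Debra", "phone": float("nan")},
    {"customer_id": 5, "first_name": "Charolette", "phone": "(916) 381-6003"},
]
cleaned = nan_to_none(sample)
print(cleaned[0]["phone"])  # None
print(cleaned[1]["phone"])  # (916) 381-6003
```

In the notebook, the helper could be applied to `pandas_df.to_dict("records")` and the result passed to `spark.createDataFrame`. How Spark then infers the column types may vary by version, so the schema is worth checking afterwards.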

The same data can also be read with a SQL query, as in the example below.


In [26]:
spark.sql("SELECT * FROM bronze").show()

+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+
|customer_id|first_name|last_name|         phone|               email|              street|           city|state|zip_code|
+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+
|          1|     Debra|    Burks|           NaN|debra.burks@yahoo...|   9273 Thorne Ave. |   Orchard Park|   NY|   14127|
|          2|     Kasha|     Todd|           NaN|kasha.todd@yahoo.com|    910 Vine Street |       Campbell|   CA|   95008|
|          3|    Tameka|   Fisher|           NaN|tameka.fisher@aol...|769C Honey Creek ...|  Redondo Beach|   CA|   90278|
|          4|     Daryl|   Spence|           NaN|daryl.spence@aol.com|     988 Pearl Lane |      Uniondale|   NY|   11553|
|          5|Charolette|     Rice|(916) 381-6003|charolette.rice@m...|      107 River Dr. |     Sacramento|   CA|   95820|
|          6|   

In [27]:
# When in doubt about the DataFrame's type, print it.

print(type(df))

<class 'pyspark.sql.dataframe.DataFrame'>


In [28]:
df.printSchema()

root
 |-- customer_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- email: string (nullable = true)
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip_code: long (nullable = true)
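Since a Bronze table usually feeds later layers, it can help to check the schema programmatically rather than by eye. The sketch below compares a list of `(column, type)` pairs, such as the one returned by `df.dtypes`, against the schema printed above. The names `EXPECTED_BRONZE_DTYPES` and `schema_mismatches` are hypothetical.

```python
# Expected (column, type) pairs, copied from the printSchema() output above.
# DataFrame.dtypes reports the `long` type as "bigint".
EXPECTED_BRONZE_DTYPES = [
    ("customer_id", "bigint"),
    ("first_name", "string"),
    ("last_name", "string"),
    ("phone", "string"),
    ("email", "string"),
    ("street", "string"),
    ("city", "string"),
    ("state", "string"),
    ("zip_code", "bigint"),
]


def schema_mismatches(actual_dtypes, expected_dtypes=EXPECTED_BRONZE_DTYPES):
    """Return {column: (expected_type, actual_type)} for every column that differs.

    A type of None means the column is missing on that side.
    """
    actual = dict(actual_dtypes)
    expected = dict(expected_dtypes)
    return {
        name: (expected.get(name), actual.get(name))
        for name in sorted(set(actual) | set(expected))
        if expected.get(name) != actual.get(name)
    }
```

Calling `schema_mismatches(df.dtypes)` on the DataFrame loaded above would be expected to return an empty dict. A non-empty result lists the columns that are missing, unexpected, or typed differently.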



- Concatenating into a new column


In [32]:
from pyspark.sql.functions import concat, col, lit

df = df.withColumn("NomeCompleto", concat(col("first_name"), lit(" "), col("last_name")))


In [33]:
df.show()

+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+----------------+----------------+
|customer_id|first_name|last_name|         phone|               email|              street|           city|state|zip_code|       full_name|    NomeCompleto|
+-----------+----------+---------+--------------+--------------------+--------------------+---------------+-----+--------+----------------+----------------+
|          1|     Debra|    Burks|           NaN|debra.burks@yahoo...|   9273 Thorne Ave. |   Orchard Park|   NY|   14127|     Debra Burks|     Debra Burks|
|          2|     Kasha|     Todd|           NaN|kasha.todd@yahoo.com|    910 Vine Street |       Campbell|   CA|   95008|      Kasha Todd|      Kasha Todd|
|          3|    Tameka|   Fisher|           NaN|tameka.fisher@aol...|769C Honey Creek ...|  Redondo Beach|   CA|   90278|   Tameka Fisher|   Tameka Fisher|
|          4|     Daryl|   Spence|           NaN|daryl.spe

- Finding people whose first name starts with "A"


In [35]:
df_filtrado = df.filter(df.first_name.startswith("A"))
df_filtrado.show()

+-----------+----------+---------+--------------+--------------------+--------------------+----------------+-----+--------+----------------+----------------+
|customer_id|first_name|last_name|         phone|               email|              street|            city|state|zip_code|       full_name|    NomeCompleto|
+-----------+----------+---------+--------------+--------------------+--------------------+----------------+-----+--------+----------------+----------------+
|         20|     Aleta|  Shepard|           NaN|aleta.shepard@aol...|     684 Howard St. |      Sugar Land|   TX|   77478|   Aleta Shepard|   Aleta Shepard|
|         22|    Adelle|   Larsen|           NaN|adelle.larsen@gma...|683 West Kirkland...|  East Northport|   NY|   11731|   Adelle Larsen|   Adelle Larsen|
|         32|   Araceli|   Golden|           NaN|araceli.golden@ms...|  12 Ridgeview Ave. |       Fullerton|   CA|   92831|  Araceli Golden|  Araceli Golden|
|         62|     Alica|   Hunter|           NaN|ali

In [39]:
# count() returns an integer, so df_filtrado2 holds the number of matching rows, not a DataFrame
df_filtrado2 = df.filter(df.first_name.startswith("A")).count()
print(df_filtrado2)
print('Number of people whose first name starts with the letter "A": ', df_filtrado2)

114
Number of people whose first name starts with the letter "A":  114
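Because the data is also registered as the temporary view `bronze`, the same count could be expressed in SQL. This is a minimal sketch: the names `STARTS_WITH_A_SQL` and `count_names_starting_with_a` are hypothetical, and it assumes the view is still registered in the active session.

```python
# SQL equivalent of df.filter(df.first_name.startswith("A")).count().
# LIKE 'A%' matches values that begin with an uppercase "A".
STARTS_WITH_A_SQL = """
SELECT COUNT(*) AS total
FROM bronze
WHERE first_name LIKE 'A%'
"""


def count_names_starting_with_a(spark):
    """Run the query on an active SparkSession and return the count."""
    # first() returns the single result row; "total" is the alias set in the query.
    return spark.sql(STARTS_WITH_A_SQL).first()["total"]
```

In the notebook this might be called as `count_names_starting_with_a(spark)`, which should agree with the DataFrame-based count above.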
