
In [1]:

from pyspark.sql import SparkSession
import pandas as pd

# Create a SparkSession
spark = SparkSession.builder.appName("LucianaSampaioApp").getOrCreate()

# URL of the raw CSV file content
url = 'https://raw.githubusercontent.com/luasampaio/datasets/main/products.csv'

In [3]:
# Read the CSV file directly from the URL using pandas
pandas_df = pd.read_csv(url)

In [5]:
# Create a Spark DataFrame from the pandas DataFrame
df = spark.createDataFrame(pandas_df)

In [6]:
display(df)

DataFrame[product_id: bigint, product_name: string, brand_id: bigint, category_id: bigint, model_year: bigint, list_price: double]

In [7]:
# Define the columns the DataFrame must have
required_columns = {"product_id", "product_name"}

# Compare against the DataFrame's actual columns
missing_columns = required_columns - set(df.columns)

# If any required column is missing, raise an error
if missing_columns:
    raise ValueError(f"❌ Error: missing columns in DataFrame: {missing_columns}")
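
The same set-difference check can be wrapped in a small reusable helper. This is a minimal sketch, not part of the original notebook: the function names find_missing_columns and validate_columns are hypothetical, and the helper works on any iterable of column names, so it does not need a running Spark session.

```python
from typing import Iterable, Set


def find_missing_columns(actual: Iterable[str], required: Iterable[str]) -> Set[str]:
    """Return the required column names that are absent from `actual`."""
    return set(required) - set(actual)


def validate_columns(actual: Iterable[str], required: Iterable[str]) -> None:
    """Raise ValueError listing every missing column, in sorted order."""
    missing = find_missing_columns(actual, required)
    if missing:
        raise ValueError(f"Missing columns in DataFrame: {sorted(missing)}")


# Example with the column names shown in this notebook
columns = ["product_id", "product_name", "brand_id",
           "category_id", "model_year", "list_price"]
validate_columns(columns, {"product_id", "product_name"})  # passes silently
```

With a Spark DataFrame, the call would be validate_columns(df.columns, required_columns). Sorting the missing names keeps the error message stable between runs, which a raw set does not guarantee.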


In [8]:
df.show(7)

+----------+--------------------+--------+-----------+----------+----------+
|product_id|        product_name|brand_id|category_id|model_year|list_price|
+----------+--------------------+--------+-----------+----------+----------+
|         1|     Trek 820 - 2016|       9|          6|      2016|    379.99|
|         2|Ritchey Timberwol...|       5|          6|      2016|    749.99|
|         3|Surly Wednesday F...|       8|          6|      2016|    999.99|
|         4|Trek Fuel EX 8 29...|       9|          6|      2016|   2899.99|
|         5|Heller Shagamaw F...|       3|          6|      2016|   1320.99|
|         6|Surly Ice Cream T...|       8|          6|      2016|    469.99|
|         7|Trek Slash 8 27.5...|       9|          6|      2016|   3999.99|
+----------+--------------------+--------+-----------+----------+----------+
only showing top 7 rows



In [9]:
df.printSchema()

root
 |-- product_id: long (nullable = true)
 |-- product_name: string (nullable = true)
 |-- brand_id: long (nullable = true)
 |-- category_id: long (nullable = true)
 |-- model_year: long (nullable = true)
 |-- list_price: double (nullable = true)



# Adding the ingestion date


In [10]:
from pyspark.sql.functions import to_timestamp, current_timestamp, lit, concat_ws, col


# Add the ingestion timestamp as an extra column
df = df.withColumn("ingestion_date", current_timestamp())

In [11]:
df.show(7)

+----------+--------------------+--------+-----------+----------+----------+--------------------+
|product_id|        product_name|brand_id|category_id|model_year|list_price|      ingestion_date|
+----------+--------------------+--------+-----------+----------+----------+--------------------+
|         1|     Trek 820 - 2016|       9|          6|      2016|    379.99|2025-02-04 14:59:...|
|         2|Ritchey Timberwol...|       5|          6|      2016|    749.99|2025-02-04 14:59:...|
|         3|Surly Wednesday F...|       8|          6|      2016|    999.99|2025-02-04 14:59:...|
|         4|Trek Fuel EX 8 29...|       9|          6|      2016|   2899.99|2025-02-04 14:59:...|
|         5|Heller Shagamaw F...|       3|          6|      2016|   1320.99|2025-02-04 14:59:...|
|         6|Surly Ice Cream T...|       8|          6|      2016|    469.99|2025-02-04 14:59:...|
|         7|Trek Slash 8 27.5...|       9|          6|      2016|   3999.99|2025-02-04 14:59:...|
+----------+--------------------+--------+-----------+----------+----------+--------------------+

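A second option, using col from pyspark.sql.functions to reference the column. Because ingestion_date is already a timestamp, to_timestamp leaves its type unchanged.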

In [12]:
df = df.withColumn("ingestion_date", to_timestamp(col("ingestion_date")))

In [13]:
df.show(7)

+----------+--------------------+--------+-----------+----------+----------+--------------------+
|product_id|        product_name|brand_id|category_id|model_year|list_price|      ingestion_date|
+----------+--------------------+--------+-----------+----------+----------+--------------------+
|         1|     Trek 820 - 2016|       9|          6|      2016|    379.99|2025-02-04 15:10:...|
|         2|Ritchey Timberwol...|       5|          6|      2016|    749.99|2025-02-04 15:10:...|
|         3|Surly Wednesday F...|       8|          6|      2016|    999.99|2025-02-04 15:10:...|
|         4|Trek Fuel EX 8 29...|       9|          6|      2016|   2899.99|2025-02-04 15:10:...|
|         5|Heller Shagamaw F...|       3|          6|      2016|   1320.99|2025-02-04 15:10:...|
|         6|Surly Ice Cream T...|       8|          6|      2016|    469.99|2025-02-04 15:10:...|
|         7|Trek Slash 8 27.5...|       9|          6|      2016|   3999.99|2025-02-04 15:10:...|
+----------+--------------------+--------+-----------+----------+----------+--------------------+