![Verne Academy Summit 2024](https://github.com/javendia/verne-academy-summit-2024/blob/main/header.png?raw=true)

## Step 1: Data ingestion
- Read the **Customers.csv** file, specifying the format (CSV) and indicating that the file contains the column headers
- Create the target Delta table if it does not already exist

In [49]:
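# Spark column types used to declare the target table schema
from pyspark.sql.types import (
    BooleanType, DateType, DecimalType, IntegerType, StringType, TimestampType
)
from delta.tables import DeltaTable
from pyspark.sql.functions import when, lit, col, regexp_replace

# Read the source file as CSV; header=True takes the column names from the first line
df = spark.read.load('Files/wwi/Customers.csv',
    format='csv',
    header=True
)

# Create the target Delta table only if it does not already exist
DeltaTable.createIfNotExists(spark) \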
     .tableName("Customers") \
     .addColumn("CustomerID", IntegerType()) \
     .addColumn("CustomerName", StringType()) \
     .addColumn("BillToCustomerID", IntegerType()) \
     .addColumn("CustomerCategoryID", IntegerType()) \
     .addColumn("BuyingGroupID", IntegerType()) \
     .addColumn("PrimaryContactPersonID", IntegerType()) \
     .addColumn("AlternateContactPersonID", IntegerType()) \
     .addColumn("DeliveryMethodID", IntegerType()) \
     .addColumn("DeliveryCityID", IntegerType()) \
     .addColumn("PostalCityID", IntegerType()) \
     .addColumn("CreditLimit", StringType()) \
     .addColumn("AccountOpenedDate", DateType()) \
     .addColumn("StandardDiscountPercentage", DecimalType(18,3)) \
     .addColumn("IsStatementSent", BooleanType()) \
     .addColumn("IsOnCreditHold", BooleanType()) \
     .addColumn("PaymentDays", IntegerType()) \
     .addColumn("PhoneNumber", StringType()) \
     .addColumn("FaxNumber", StringType()) \
     .addColumn("DeliveryRun", StringType()) \
     .addColumn("RunPosition", StringType()) \
     .addColumn("WebsiteURL", StringType()) \
     .addColumn("DeliveryAddressLine1", StringType()) \
     .addColumn("DeliveryAddressLine2", StringType()) \
     .addColumn("DeliveryPostalCode", IntegerType()) \
     .addColumn("PostalAddressLine1", StringType()) \
     .addColumn("PostalAddressLine2", StringType()) \
     .addColumn("PostalPostalCode", IntegerType()) \
     .addColumn("LastUpdated", TimestampType()) \
     .execute()

deltaTable = DeltaTable.forPath(spark, 'Tables/customers')

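The read in this step sets only `format` and `header`, so no schema is supplied or inferred and every column arrives as a string. That is why later steps have to cast values to the target types. The sketch below illustrates the same idea in plain Python with the standard `csv` module rather than Spark. The sample rows are hypothetical and the quoted comma-decimal values are an assumption about how the file is laid out.

```python
import csv
import io

# Hypothetical two-row extract; the real Customers.csv is not reproduced here.
sample = io.StringIO(
    'CustomerID,CustomerName,StandardDiscountPercentage\n'
    '1,Example Customer,"0,000"\n'
    '2,,"2,500"\n'
)

# The first line supplies the column names, much like header=True
rows = list(csv.DictReader(sample))

# Every value is text until it is explicitly converted
for row in rows:
    print({name: type(value).__name__ for name, value in row.items()})
```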

## Step 2: Transformation
- Return a controlled value (**N/A**) for the **CustomerName** column when it is null or empty
- Normalize the decimal separator in **StandardDiscountPercentage**, replacing **','** with **'.'**, and cast the column to the target data type

In [50]:
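# Replace a null or empty CustomerName with "N/A"; then swap ',' for '.' in
# StandardDiscountPercentage and cast it to the target type DecimalType(18,3)
df = df.withColumn("CustomerName", when((col("CustomerName").isNull() | (col("CustomerName")=="")),lit("N/A")).otherwise(col("CustomerName"))) \
        .withColumn('StandardDiscountPercentage', regexp_replace('StandardDiscountPercentage', ',', '.').cast(DecimalType(18,3)))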

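To make the two rules of this step easier to check by hand, here is a minimal plain-Python sketch of the same logic applied to single values. The function names `clean_customer_name` and `parse_discount` are hypothetical and not part of the notebook. Returning `None` for unparseable text and rounding half up to three decimals are assumptions about how the cast behaves, not something the notebook states.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from typing import Optional


def clean_customer_name(name: Optional[str]) -> str:
    """Return "N/A" for a missing or empty name, otherwise the name unchanged."""
    return "N/A" if name is None or name == "" else name


def parse_discount(raw: Optional[str]) -> Optional[Decimal]:
    """Swap ',' for '.' and convert to a Decimal with three decimal places."""
    if raw is None:
        return None
    try:
        value = Decimal(raw.replace(",", "."))
        # Assumption: three decimals, rounded half up
        return value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
    except InvalidOperation:
        # Assumption: unparseable text becomes a missing value
        return None


print(clean_customer_name(""))   # N/A
print(parse_discount("2,5"))     # 2.500
```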

## Step 3: Creating the temporary view
- Create the temporary view **vw_customers** to hold the data transformed in the previous step

In [51]:
df.createOrReplaceTempView("vw_customers")


## Step 4: MERGE statement

- If the record's **CustomerID** value exists **(MATCHED)** in the target table and any column differs from the existing record, update the row
- If the record's **CustomerID** value does not exist **(NOT MATCHED)** in the target table, insert a new row
- If a record in the target table does not exist in the source file **(NOT MATCHED BY SOURCE)**, delete that row

In [52]:
%%sql

MERGE INTO customers AS target
USING vw_customers AS source
ON target.CustomerID = source.CustomerID
WHEN MATCHED AND
(
        -- <=> is the null-safe equality operator, so a change to or from NULL also counts as a difference
        NOT (target.CustomerName <=> source.CustomerName)
        OR NOT (target.BillToCustomerID <=> source.BillToCustomerID)
        OR NOT (target.CustomerCategoryID <=> source.CustomerCategoryID)
        OR NOT (target.BuyingGroupID <=> source.BuyingGroupID)
        OR NOT (target.PrimaryContactPersonID <=> source.PrimaryContactPersonID)
        OR NOT (target.AlternateContactPersonID <=> source.AlternateContactPersonID)
        OR NOT (target.DeliveryMethodID <=> source.DeliveryMethodID)
        OR NOT (target.DeliveryCityID <=> source.DeliveryCityID)
        OR NOT (target.PostalCityID <=> source.PostalCityID)
        OR NOT (target.CreditLimit <=> source.CreditLimit)
        OR NOT (target.AccountOpenedDate <=> source.AccountOpenedDate)
        OR NOT (target.StandardDiscountPercentage <=> source.StandardDiscountPercentage)
        OR NOT (target.IsStatementSent <=> source.IsStatementSent)
        OR NOT (target.IsOnCreditHold <=> source.IsOnCreditHold)
        OR NOT (target.PaymentDays <=> source.PaymentDays)
        OR NOT (target.PhoneNumber <=> source.PhoneNumber)
        OR NOT (target.FaxNumber <=> source.FaxNumber)
        OR NOT (target.DeliveryRun <=> source.DeliveryRun)
        OR NOT (target.RunPosition <=> source.RunPosition)
        OR NOT (target.WebsiteURL <=> source.WebsiteURL)
        OR NOT (target.DeliveryAddressLine1 <=> source.DeliveryAddressLine1)
        OR NOT (target.DeliveryAddressLine2 <=> source.DeliveryAddressLine2)
        OR NOT (target.DeliveryPostalCode <=> source.DeliveryPostalCode)
        OR NOT (target.PostalAddressLine1 <=> source.PostalAddressLine1)
        OR NOT (target.PostalAddressLine2 <=> source.PostalAddressLine2)
        OR NOT (target.PostalPostalCode <=> source.PostalPostalCode)
)
THEN
UPDATE SET
        target.CustomerName = source.CustomerName
        ,target.BillToCustomerID = source.BillToCustomerID
        ,target.CustomerCategoryID = source.CustomerCategoryID
        ,target.BuyingGroupID = source.BuyingGroupID
        ,target.PrimaryContactPersonID = source.PrimaryContactPersonID
        ,target.AlternateContactPersonID = source.AlternateContactPersonID
        ,target.DeliveryMethodID = source.DeliveryMethodID
        ,target.DeliveryCityID = source.DeliveryCityID
        ,target.PostalCityID = source.PostalCityID
        ,target.CreditLimit = source.CreditLimit
        ,target.AccountOpenedDate = source.AccountOpenedDate
        ,target.StandardDiscountPercentage = source.StandardDiscountPercentage
        ,target.IsStatementSent = source.IsStatementSent
        ,target.IsOnCreditHold = source.IsOnCreditHold
        ,target.PaymentDays = source.PaymentDays
        ,target.PhoneNumber = source.PhoneNumber
        ,target.FaxNumber = source.FaxNumber
        ,target.DeliveryRun = source.DeliveryRun
        ,target.RunPosition = source.RunPosition
        ,target.WebsiteURL = source.WebsiteURL
        ,target.DeliveryAddressLine1 = source.DeliveryAddressLine1
        ,target.DeliveryAddressLine2 = source.DeliveryAddressLine2
        ,target.DeliveryPostalCode = source.DeliveryPostalCode
        ,target.PostalAddressLine1 = source.PostalAddressLine1
        ,target.PostalAddressLine2 = source.PostalAddressLine2
        ,target.PostalPostalCode = source.PostalPostalCode
        ,target.LastUpdated = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (CustomerID, CustomerName, BillToCustomerID, CustomerCategoryID, BuyingGroupID, PrimaryContactPersonID, AlternateContactPersonID, DeliveryMethodID, 
        DeliveryCityID, PostalCityID, CreditLimit, AccountOpenedDate, StandardDiscountPercentage, IsStatementSent, IsOnCreditHold, 
        PaymentDays, PhoneNumber, FaxNumber, DeliveryRun, RunPosition, WebsiteURL, DeliveryAddressLine1, DeliveryAddressLine2, DeliveryPostalCode, 
        PostalAddressLine1, PostalAddressLine2, PostalPostalCode, LastUpdated)
VALUES (source.CustomerID, source.CustomerName, source.BillToCustomerID, source.CustomerCategoryID, source.BuyingGroupID, source.PrimaryContactPersonID, source.AlternateContactPersonID, source.DeliveryMethodID, 
        source.DeliveryCityID, source.PostalCityID, source.CreditLimit, source.AccountOpenedDate, source.StandardDiscountPercentage, source.IsStatementSent, source.IsOnCreditHold, 
        source.PaymentDays, source.PhoneNumber, source.FaxNumber, source.DeliveryRun, source.RunPosition, source.WebsiteURL, source.DeliveryAddressLine1, source.DeliveryAddressLine2, source.DeliveryPostalCode,
        source.PostalAddressLine1, source.PostalAddressLine2, source.PostalPostalCode, CURRENT_TIMESTAMP())
WHEN NOT MATCHED BY SOURCE THEN
DELETE;


<Spark SQL result set with 1 rows and 4 fields>
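The three branches of the statement can be easier to reason about on a toy model. The sketch below is a plain-Python approximation that uses dictionaries keyed by **CustomerID** instead of Delta tables. The `merge` function, the `now` parameter and the sample rows are all hypothetical. Note that Python's `!=` treats `None` as an ordinary value, whereas SQL's `<>` evaluates to NULL when either operand is NULL.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]


def merge(target: Dict[int, Row], source: Dict[int, Row], now: str) -> Dict[str, int]:
    """Apply upsert-and-delete semantics to `target` in place, keyed by CustomerID."""
    counts = {"updated": 0, "inserted": 0, "deleted": 0}

    for key, src in source.items():
        if key not in target:
            # NOT MATCHED: the key is new, so insert it with a fresh timestamp
            target[key] = {**src, "LastUpdated": now}
            counts["inserted"] += 1
        else:
            current = {k: v for k, v in target[key].items() if k != "LastUpdated"}
            if current != src:
                # MATCHED and at least one column differs: overwrite and restamp
                target[key] = {**src, "LastUpdated": now}
                counts["updated"] += 1

    # NOT MATCHED BY SOURCE: the key is no longer in the source, so delete it
    for key in [k for k in target if k not in source]:
        del target[key]
        counts["deleted"] += 1

    return counts


# Hypothetical data: 1 is unchanged, 2 changed, 3 disappeared, 4 is new
target = {
    1: {"CustomerName": "A", "LastUpdated": "t0"},
    2: {"CustomerName": "B", "LastUpdated": "t0"},
    3: {"CustomerName": "C", "LastUpdated": "t0"},
}
source = {
    1: {"CustomerName": "A"},
    2: {"CustomerName": "B2"},
    4: {"CustomerName": "D"},
}

counts = merge(target, source, now="t1")
print(counts)  # {'updated': 1, 'inserted': 1, 'deleted': 1}
```

Unchanged rows keep their original `LastUpdated` value, which is the point of guarding the update branch with a difference check instead of updating every matched row.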

## Step 5: Dropping the temporary view

In [53]:
spark.catalog.dropTempView("vw_customers")


True