![Verne Academy Summit 2024](https://github.com/javendia/verne-academy-summit-2024/blob/main/header.png?raw=true)

## Step 1: Data ingestion
- Read the **People.csv** file, specifying the format (CSV) and indicating that the file contains a header row with the column names
- Create the destination Delta table if it does not already exist

In [1]:
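# Explicit imports instead of wildcards, so the origin of each name is clear
from pyspark.sql.types import BooleanType, IntegerType, StringType, TimestampType
from delta.tables import DeltaTable
from pyspark.sql.functions import when, lit, col

# Read the CSV with a header row; with no schema and no inferSchema,
# every column is loaded as a string
df = spark.read.load('Files/wwi/People.csv',
    format='csv',
    header=True
)

# Create the destination Delta table only if it does not exist yet
DeltaTable.createIfNotExists(spark) \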
     .tableName("People") \
     .addColumn("PersonID", IntegerType()) \
     .addColumn("FullName", StringType()) \
     .addColumn("PreferredName", StringType()) \
     .addColumn("SearchName", StringType()) \
     .addColumn("IsPermittedToLogon", BooleanType()) \
     .addColumn("LogonName", StringType()) \
     .addColumn("IsExternalLogonProvider", BooleanType()) \
     .addColumn("IsSystemUser", BooleanType()) \
     .addColumn("IsEmployee", BooleanType()) \
     .addColumn("IsSalesperson", BooleanType()) \
     .addColumn("PhoneNumber", StringType()) \
     .addColumn("FaxNumber", StringType()) \
     .addColumn("EmailAddress", StringType()) \
     .addColumn("LastUpdated", TimestampType()) \
     .execute()

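# Reference the Delta table through its lakehouse path (the SQL MERGE below does not use this handle)
deltaTable = DeltaTable.forPath(spark, 'Tables/people')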

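As a rough illustration of what reading with a header row implies, the following sketch uses Python's standard `csv` module rather than Spark. The sample text, the column subset, and the `to_typed` helper are made up for the example, and treating an empty field as missing is an assumption. It shows that values arrive as strings and must be converted before they match typed columns such as `PersonID` or `IsEmployee`.

```python
import csv
import io

# Made-up sample with a subset of the People.csv columns (not real data)
sample = "PersonID,FullName,IsEmployee\n1,Ada,True\n2,,False\n"

# DictReader takes the first line as the header, similar in spirit to header=True
rows = list(csv.DictReader(io.StringIO(sample)))

# Every value is a string at this point, including the ID and the flag
raw_types = {name: type(value).__name__ for name, value in rows[0].items()}


def to_typed(row):
    """Hypothetical helper: convert one raw row to typed values."""
    return {
        "PersonID": int(row["PersonID"]),
        # Assumption: an empty field is treated as a missing value
        "FullName": row["FullName"] or None,
        "IsEmployee": row["IsEmployee"].strip().lower() == "true",
    }


typed = [to_typed(row) for row in rows]
```

In the notebook itself no conversion is written out, so the string columns of `df` are presumably cast to the table's column types when the data is merged.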

## Step 2: Transformation
- Return a controlled value (`N/A`) for the **FullName** and **PreferredName** columns when they are null or empty

In [2]:
# Replace null or empty values with the placeholder "N/A"
for name in ("FullName", "PreferredName"):
    df = df.withColumn(
        name,
        when(col(name).isNull() | (col(name) == ""), lit("N/A")).otherwise(col(name))
    )

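The rule applied in this step can be sketched on plain Python values, which makes its edge cases easier to see. The `default_if_blank` function and the sample rows are hypothetical and are not part of the notebook; they only mirror the "null or empty" condition.

```python
from typing import Optional

PLACEHOLDER = "N/A"  # same controlled value as in the transformation cell


def default_if_blank(value: Optional[str], placeholder: str = PLACEHOLDER) -> str:
    """Return the placeholder when the value is None or an empty string."""
    if value is None or value == "":
        return placeholder
    return value


# Made-up rows for illustration only
rows = [
    {"FullName": "Ada Example", "PreferredName": "Ada"},
    {"FullName": None, "PreferredName": ""},
    {"FullName": " ", "PreferredName": "Bo"},
]

cleaned = [{key: default_if_blank(value) for key, value in row.items()} for row in rows]
```

The third row shows that a value containing only whitespace is kept as is, because the condition compares against the empty string exactly. If such values should also be replaced, the comparison would need a trim first.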

## Step 3: Creating a temporary view
- Create the temporary view **vw_people** to expose the data transformed in the previous step to Spark SQL

In [3]:
df.createOrReplaceTempView("vw_people")


## Step 4: MERGE statement

- If the record's **PersonID** value exists **(MATCHED)** in the destination table and any column differs from the existing record, update the row
- If the record's **PersonID** value does not exist **(NOT MATCHED)** in the destination table, insert a new row
- If a record in the destination table does not exist in the source file **(NOT MATCHED BY SOURCE)**, delete that row

In [4]:
%%sql

MERGE INTO people AS target
USING vw_people AS source
ON target.PersonID = source.PersonID
WHEN MATCHED AND 
(
        -- Null-safe comparison: a plain <> evaluates to NULL when either side is NULL, so those changes would be skipped
        NOT (target.FullName <=> source.FullName)
        OR NOT (target.PreferredName <=> source.PreferredName)
        OR NOT (target.SearchName <=> source.SearchName)
        OR NOT (target.IsPermittedToLogon <=> source.IsPermittedToLogon)
        OR NOT (target.LogonName <=> source.LogonName)
        OR NOT (target.IsExternalLogonProvider <=> source.IsExternalLogonProvider)
        OR NOT (target.IsSystemUser <=> source.IsSystemUser)
        OR NOT (target.IsEmployee <=> source.IsEmployee)
        OR NOT (target.IsSalesperson <=> source.IsSalesperson)
        OR NOT (target.PhoneNumber <=> source.PhoneNumber)
        OR NOT (target.FaxNumber <=> source.FaxNumber)
        OR NOT (target.EmailAddress <=> source.EmailAddress)
)
THEN
UPDATE SET
        target.FullName = source.FullName
        ,target.PreferredName = source.PreferredName
        ,target.SearchName = source.SearchName
        ,target.IsPermittedToLogon = source.IsPermittedToLogon
        ,target.LogonName = source.LogonName
        ,target.IsExternalLogonProvider = source.IsExternalLogonProvider
        ,target.IsSystemUser = source.IsSystemUser
        ,target.IsEmployee = source.IsEmployee
        ,target.IsSalesperson = source.IsSalesperson
        ,target.PhoneNumber = source.PhoneNumber
        ,target.FaxNumber = source.FaxNumber
        ,target.EmailAddress = source.EmailAddress
        ,target.LastUpdated = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (PersonID, FullName, PreferredName, SearchName, IsPermittedToLogon, LogonName, IsExternalLogonProvider, 
        IsSystemUser, IsEmployee, IsSalesperson, PhoneNumber, FaxNumber, EmailAddress, LastUpdated)
VALUES (source.PersonID, source.FullName, source.PreferredName, source.SearchName, source.IsPermittedToLogon, source.LogonName, source.IsExternalLogonProvider, 
        source.IsSystemUser, source.IsEmployee, source.IsSalesperson, source.PhoneNumber, source.FaxNumber, source.EmailAddress, CURRENT_TIMESTAMP())
WHEN NOT MATCHED BY SOURCE THEN
DELETE;


<Spark SQL result set with 1 rows and 4 fields>
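To make the three branches of the MERGE easier to follow, here is a minimal model of the same decisions on plain Python dictionaries keyed by `PersonID`. It is a sketch and not how Delta Lake executes a merge; the `merge_rows` function, the single `FullName` column, and the sample rows are all hypothetical.

```python
from datetime import datetime, timezone


def merge_rows(target, source, now):
    """Model the MERGE branches on dicts keyed by PersonID.

    Target rows carry a LastUpdated field; source rows do not.
    Returns the new target and a count per action.
    """
    result = {}
    counts = {"updated": 0, "inserted": 0, "deleted": 0}

    for person_id, old in target.items():
        if person_id not in source:
            counts["deleted"] += 1  # NOT MATCHED BY SOURCE: drop the row
            continue
        new = source[person_id]
        if any(old.get(key) != value for key, value in new.items()):
            # MATCHED and at least one column differs: update and stamp the row
            result[person_id] = {**old, **new, "LastUpdated": now}
            counts["updated"] += 1
        else:
            result[person_id] = old  # MATCHED but identical: left untouched

    for person_id, new in source.items():
        if person_id not in target:
            # NOT MATCHED: insert a new row with the current timestamp
            result[person_id] = {**new, "LastUpdated": now}
            counts["inserted"] += 1

    return result, counts


# Made-up rows for illustration only
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 6, 1, tzinfo=timezone.utc)
target = {
    1: {"FullName": "A", "LastUpdated": t0},  # unchanged in the source
    2: {"FullName": "B", "LastUpdated": t0},  # changed in the source
    3: {"FullName": "C", "LastUpdated": t0},  # missing from the source
}
source = {1: {"FullName": "A"}, 2: {"FullName": "B2"}, 4: {"FullName": "D"}}

merged, counts = merge_rows(target, source, now=t1)
```

Row 1 keeps its original `LastUpdated` because nothing changed, which is the purpose of the extra condition on the MATCHED branch. Note that Python's `!=` treats `None` as an ordinary value, whereas SQL `<>` yields NULL when either side is NULL, so the model behaves like a null-safe comparison.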

## Step 5: Dropping the temporary view

In [5]:
spark.catalog.dropTempView("vw_people")


True