### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) **Create a table and insert data using Python**

Remember that tables created from Python with **saveAsTable** are created as **MANAGED TABLES** by default, that is, when no explicit path is given.

In [None]:
employee_data = [(10,"Raj","Kumar","1999","100","M",2000),
                 (20,"Rahul","Rajan","2002","200","f",2000),
                 (30,"Raghav","Manish","2010","100",None,2000),
                 (40,"Raja","Singh","2004","100","F",2000),
                 (50,"Rama","Krish","2008","400","M",2000),
                 (60,"Rasul","Kutty","2014","500","M",2000),
                 (70,"Kumar","Chand","2004","600","M",2000)
                ]
employee_schema = ["employee_id","first_name","last_name","doj",
                   "employee_dept_id","gender","salary"]

df = spark.createDataFrame(data=employee_data, schema=employee_schema)

df.printSchema()

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- doj: string (nullable = true)
 |-- employee_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



In [None]:
employee_data = [(80,"Pedro","Rojas","1999","100","M",2000),
                 (90,"Jose","Perez","2002","200","f",2000),
                 (100,"Belen","Oyarce","2010","100",None,2000)
                ]
employee_schema = ["employee_id","first_name","last_name","doj",
                   "employee_dept_id","gender","salary"]

df_new = spark.createDataFrame(data=employee_data, schema=employee_schema)

df_new.printSchema()

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- doj: string (nullable = true)
 |-- employee_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



#### Example 1

In [None]:
%sql
CREATE DATABASE IF NOT EXISTS demo;

In [None]:
df.write.format("parquet").saveAsTable("demo.empleados")

In [None]:
%sql
SELECT * FROM demo.empleados;

employee_id,first_name,last_name,doj,employee_dept_id,gender,salary
80,Pedro,Rojas,1999,100,M,2000
20,Rahul,Rajan,2002,200,f,2000
60,Rasul,Kutty,2014,500,M,2000
70,Kumar,Chand,2004,600,M,2000
30,Raghav,Manish,2010,100,,2000
40,Raja,Singh,2004,100,F,2000
90,Jose,Perez,2002,200,f,2000
50,Rama,Krish,2008,400,M,2000
100,Belen,Oyarce,2010,100,,2000
10,Raj,Kumar,1999,100,M,2000


Note that **insertInto** uses **append** mode by default, so there is no need to specify it.

In [None]:
df_new.write.format("parquet").insertInto("demo.empleados")

In [None]:
%sql
SELECT * FROM demo.empleados
ORDER BY employee_id;

employee_id,first_name,last_name,doj,employee_dept_id,gender,salary
10,Raj,Kumar,1999,100,M,2000
20,Rahul,Rajan,2002,200,f,2000
30,Raghav,Manish,2010,100,,2000
40,Raja,Singh,2004,100,F,2000
50,Rama,Krish,2008,400,M,2000
60,Rasul,Kutty,2014,500,M,2000
70,Kumar,Chand,2004,600,M,2000
80,Pedro,Rojas,1999,100,M,2000
90,Jose,Perez,2002,200,f,2000
100,Belen,Oyarce,2010,100,,2000


With **overwrite** mode, the new records replace the existing contents of the table.

In [None]:
df_new.write.format("parquet").mode('overwrite').insertInto("demo.empleados")

In [None]:
%sql
SELECT * FROM demo.empleados
ORDER BY employee_id;

employee_id,first_name,last_name,doj,employee_dept_id,gender,salary
80,Pedro,Rojas,1999,100,M,2000
90,Jose,Perez,2002,200,f,2000
100,Belen,Oyarce,2010,100,,2000


#### Example 2

In this example the table is partitioned on the **employee_id** column.

In [None]:
df.write.format("parquet").partitionBy('employee_id').saveAsTable("demo.empleados_nuevo")

When a table is partitioned on a column, here **employee_id**, **saveAsTable** moves the partition column to the end of the table's column list.

In [None]:
%sql
SELECT * FROM demo.empleados_nuevo
ORDER BY employee_id;

first_name,last_name,doj,employee_dept_id,gender,salary,employee_id
Raj,Kumar,1999,100,M,2000,10
Rahul,Rajan,2002,200,f,2000,20
Raghav,Manish,2010,100,,2000,30
Raja,Singh,2004,100,F,2000,40
Rama,Krish,2008,400,M,2000,50
Rasul,Kutty,2014,500,M,2000,60
Kumar,Chand,2004,600,M,2000,70
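
The column order shown above can be modeled in plain Python. The helper `saved_column_order` below is a hypothetical illustration, not a Spark API. It assumes that the saved table lists the data columns first, in their original order, followed by the partition columns in the order they were passed to **partitionBy**.

```python
def saved_column_order(columns, partition_cols):
    """Model the column order of a table written with partitionBy.

    Hypothetical helper (not part of PySpark): data columns keep their
    relative order and the partition columns are appended at the end.
    """
    missing = [c for c in partition_cols if c not in columns]
    if missing:
        raise ValueError(f"unknown partition columns: {missing}")
    data_cols = [c for c in columns if c not in partition_cols]
    return data_cols + list(partition_cols)


employee_schema = ["employee_id", "first_name", "last_name", "doj",
                   "employee_dept_id", "gender", "salary"]

# Expected to match the header row of the query output above.
table_order = saved_column_order(employee_schema, ["employee_id"])
print(table_order)
```

The resulting list has **employee_id** in the last position, which is the order any later **insertInto** has to respect.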


The problem is that **insertInto** matches columns by position, following the table's schema, rather than by name. Since **df_new** still has **employee_id** as its first column, the insert fails.
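
A plain-Python sketch can make the mismatch concrete. It assumes that **insertInto** pairs DataFrame columns with table columns by position rather than by name. The column names and types are copied from the **printSchema** and query outputs above, and the helper names `positional_pairs` and `misaligned` are hypothetical.

```python
# Column order of demo.empleados_nuevo (partition column last).
table_cols = [("first_name", "string"), ("last_name", "string"),
              ("doj", "string"), ("employee_dept_id", "string"),
              ("gender", "string"), ("salary", "long"),
              ("employee_id", "long")]

# Column order of df_new before any reordering.
df_cols = [("employee_id", "long"), ("first_name", "string"),
           ("last_name", "string"), ("doj", "string"),
           ("employee_dept_id", "string"), ("gender", "string"),
           ("salary", "long")]


def positional_pairs(source, target):
    """Pair the i-th source column with the i-th target column."""
    if len(source) != len(target):
        raise ValueError("column counts differ")
    return list(zip(source, target))


def misaligned(source, target):
    """Return the pairs whose names differ once matched by position."""
    return [(s, t) for s, t in positional_pairs(source, target) if s[0] != t[0]]


for (src, src_type), (dst, dst_type) in misaligned(df_cols, table_cols):
    print(f"{src} ({src_type}) -> {dst} ({dst_type})")
```

Every column lands in the wrong slot. The pair `gender (string) -> salary (long)` is a likely trigger for the **AnalysisException**, since a string column would be written into a long column; the exact message depends on the Spark version and its store-assignment settings.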

In [None]:
df_new.write.format("parquet").insertInto("demo.empleados_nuevo")

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-4037067367114995>:1
----> 1 df_new.write.format("parquet").mode('append').insertInto("demo.empleados_nuevo")

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 

To fix this, reorder the columns of the DataFrame that holds the records to add, moving the partition column to the end.

In [None]:
df_new = df_new.select("first_name","last_name","doj","employee_dept_id","gender","salary","employee_id")
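
Instead of typing the column list by hand, the order can be taken from the target table. The helper below is a minimal sketch with a hypothetical name, `align_to_table`. It only works on lists of column names and checks that both sides have the same columns before returning the table's order.

```python
def align_to_table(df_columns, table_columns):
    """Return the table's column order after validating the DataFrame columns.

    Hypothetical helper: raises if a column is missing or unexpected, so a
    mismatch is caught before insertInto pairs columns by position.
    """
    missing = [c for c in table_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in table_columns]
    if missing or extra:
        raise ValueError(f"missing={missing}, extra={extra}")
    return list(table_columns)


df_columns = ["employee_id", "first_name", "last_name", "doj",
              "employee_dept_id", "gender", "salary"]
table_columns = ["first_name", "last_name", "doj",
                 "employee_dept_id", "gender", "salary", "employee_id"]

ordered = align_to_table(df_columns, table_columns)
print(ordered)
```

In the notebook this could be used as `df_new.select(*align_to_table(df_new.columns, spark.table("demo.empleados_nuevo").columns))`, assuming the table already exists.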

In [None]:
df_new.write.format("parquet").insertInto("demo.empleados_nuevo")

In [None]:
%sql
SELECT * FROM demo.empleados_nuevo
ORDER BY employee_id;

first_name,last_name,doj,employee_dept_id,gender,salary,employee_id
Raj,Kumar,1999,100,M,2000,10
Rahul,Rajan,2002,200,f,2000,20
Raghav,Manish,2010,100,,2000,30
Raja,Singh,2004,100,F,2000,40
Rama,Krish,2008,400,M,2000,50
Rasul,Kutty,2014,500,M,2000,60
Kumar,Chand,2004,600,M,2000,70
Pedro,Rojas,1999,100,M,2000,80
Jose,Perez,2002,200,f,2000,90
Belen,Oyarce,2010,100,,2000,100


#### Example 3

Let's see how to overwrite dynamically only the partitions that appear in the incoming records, leaving the other partitions untouched.

In [None]:
df.write.format("parquet").partitionBy('employee_id').saveAsTable("demo.empleados_ejemplo")

In [None]:
%sql
SELECT * FROM demo.empleados_ejemplo
ORDER BY employee_id;

first_name,last_name,doj,employee_dept_id,gender,salary,employee_id
Raj,Kumar,1999,100,M,2000,10
Rahul,Rajan,2002,200,f,2000,20
Raghav,Manish,2010,100,,2000,30
Raja,Singh,2004,100,F,2000,40
Rama,Krish,2008,400,M,2000,50
Rasul,Kutty,2014,500,M,2000,60
Kumar,Chand,2004,600,M,2000,70


To do so, we set the following configuration:

In [None]:
# The default value of this setting is "static"
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
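
The difference between the two settings can be sketched with plain dictionaries. This is a simplified model, not Spark code: `overwrite_partitions` is a hypothetical function, and it assumes an overwrite with **insertInto** and no explicit partition specification, where "static" drops every existing partition and "dynamic" replaces only the partitions present in the incoming data.

```python
def overwrite_partitions(table, incoming, mode="static"):
    """Model insertInto with mode('overwrite') on a partitioned table.

    `table` and `incoming` map a partition value to its list of rows.
      - "static":  every existing partition is dropped before writing.
      - "dynamic": only partitions present in `incoming` are replaced.
    """
    if mode not in ("static", "dynamic"):
        raise ValueError(f"unknown mode: {mode}")
    # Static starts from an empty table; dynamic starts from a copy.
    result = {} if mode == "static" else {k: list(v) for k, v in table.items()}
    for key, rows in incoming.items():
        result[key] = list(rows)
    return result


table = {10: [("Raj", "Kumar")],
         20: [("Rahul", "Rajan")],
         30: [("Raghav", "Manish")]}
incoming = {10: [("Alejandra", "Soto")],
            20: [("Tomas", "Lino")]}

static_result = overwrite_partitions(table, incoming, mode="static")
dynamic_result = overwrite_partitions(table, incoming, mode="dynamic")

print(sorted(static_result))   # partitions left after a static overwrite
print(sorted(dynamic_result))  # partitions left after a dynamic overwrite
```

In this model partition 30 survives only in dynamic mode, which is the behavior this example relies on.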

In [None]:
employee_data = [(10,"Alejandra","Soto","2004","300","F",3000),
                 (20,"Tomas","Lino","2010","100","M",4000),
                ]
employee_schema = ["employee_id","first_name","last_name","doj",
                   "employee_dept_id","gender","salary"]

df_new = spark.createDataFrame(data=employee_data, schema=employee_schema)

df_new.printSchema()

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- doj: string (nullable = true)
 |-- employee_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



In [None]:
df_new = df_new.select("first_name","last_name","doj","employee_dept_id","gender","salary","employee_id")

In [None]:
df_new.write.format("parquet").mode('overwrite').insertInto("demo.empleados_ejemplo")

In [None]:
%sql
-- Partitions 10 and 20 were overwritten; the other partitions are unchanged
SELECT * FROM demo.empleados_ejemplo
ORDER BY employee_id;

first_name,last_name,doj,employee_dept_id,gender,salary,employee_id
Alejandra,Soto,2004,300,F,3000,10
Tomas,Lino,2010,100,M,4000,20
Raghav,Manish,2010,100,,2000,30
Raja,Singh,2004,100,F,2000,40
Rama,Krish,2008,400,M,2000,50
Rasul,Kutty,2014,500,M,2000,60
Kumar,Chand,2004,600,M,2000,70
