## PySpark Dynamic Partition (Part 1)

Spark's `DataFrameWriter` supports four save modes:

**error** (the default, also spelled `errorifexists`): Raises an error if the target table already exists.  
**ignore**: Silently skips the write if the table already exists.  
**overwrite**: Replaces the entire table with the new data.  
**append**: Adds the new data to the existing table.  
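
The four modes can be illustrated with a small pure-Python model. This is only a sketch of the semantics listed above, not Spark's implementation: `write_rows` and the in-memory `tables` dict are hypothetical names introduced for illustration.

```python
# Toy model of the four save modes; `tables` stands in for the storage layer.
tables = {}

VALID_MODES = {"error", "ignore", "overwrite", "append"}

def write_rows(name, rows, mode="error"):
    """Mimic the save-mode semantics on an in-memory dict (illustrative only)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    exists = name in tables
    if mode == "error" and exists:
        raise ValueError(f"table {name!r} already exists")
    if mode == "ignore" and exists:
        return  # keep the existing data untouched
    if mode == "append" and exists:
        tables[name] = tables[name] + list(rows)
        return
    # "overwrite", or any mode when the table does not exist yet
    tables[name] = list(rows)

write_rows("t", [("item1", 5)])                    # creates the table
write_rows("t", [("item2", 1)], mode="ignore")     # no-op: table exists
write_rows("t", [("item2", 1)], mode="append")     # adds one row
appended = list(tables["t"])
write_rows("t", [("item9", 9)], mode="overwrite")  # replaces everything
print(appended)
print(tables["t"])
```

Note that none of the modes in this model can replace a single partition while keeping the rest, which is the gap the remainder of the notebook addresses.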

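In practice, data often arrives in incremental deliveries, each containing one or more complete partitions. None of the four modes handles this case directly: `append` duplicates rows that were already delivered, and a plain `overwrite` discards partitions that are not part of the delivery.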

In [0]:
dbutils.library.restartPython()  # Restarts the Python process and clears its state; some newly installed libraries only work after this call.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, IntegerType, StringType, StructField, StructType
# Import functions under a namespace so that Python's built-in sum, max and min are not shadowed.
from pyspark.sql import functions as F

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark Dynamic Partitions').getOrCreate()

#### Create Dataframe

In [0]:
from datetime import datetime

data = [
  ('item1', 5, datetime.strptime('2021-06-15','%Y-%m-%d')),
  ('item2', 1, datetime.strptime('2021-06-20','%Y-%m-%d')),
  ('item8', 9, datetime.strptime('2021-06-20','%Y-%m-%d')),
  ('item3', 2, datetime.strptime('2021-06-20','%Y-%m-%d')),
  ('item1', 3, datetime.strptime('2021-07-05','%Y-%m-%d')),
  ('item3', 4, datetime.strptime('2021-07-25','%Y-%m-%d')),
  ('item2', 1, datetime.strptime('2021-07-30','%Y-%m-%d')),
  ('item4', 6, datetime.strptime('2021-08-01','%Y-%m-%d')),
  ('item2', 8, datetime.strptime('2021-08-01','%Y-%m-%d')),
  ('item5', 8, datetime.strptime('2021-08-03','%Y-%m-%d'))
]

schema = StructType([
  StructField('item', StringType(), True),
  StructField('quantity', IntegerType(), True),
  StructField('date', DateType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

In [0]:
table_name = 'temp.partitions_testing'
table_path = f'/mnt/{table_name}'

#### Create a delta table

In [0]:
(df
.write
.format('delta')
.mode('overwrite')
.partitionBy('date')
.save(table_path))

In [0]:
%sh
ls -lah /dbfs/mnt/temp.partitions_testing
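
A partitioned write lays the data out in Hive-style directories, one per distinct value of the partition column, so the listing above would be expected to show entries such as `date=2021-06-15` next to the `_delta_log` directory. The sketch below derives those directory names from the sample dates in plain Python. `partition_dir` is a hypothetical helper, and the naming assumes the default layout (for example, no Delta column mapping).

```python
from datetime import date

# Dates used in the sample DataFrame above (duplicates included on purpose).
sample_dates = [
    date(2021, 6, 15), date(2021, 6, 20), date(2021, 6, 20), date(2021, 6, 20),
    date(2021, 7, 5), date(2021, 7, 25), date(2021, 7, 30),
    date(2021, 8, 1), date(2021, 8, 1), date(2021, 8, 3),
]

def partition_dir(column, value):
    """Hypothetical helper: Hive-style partition directory name."""
    return f"{column}={value.isoformat()}"

# One directory per distinct date, regardless of how many rows share it.
expected_dirs = sorted({partition_dir("date", d) for d in sample_dates})
for name in expected_dirs:
    print(name)
```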

#### Create a new dataset with data that should be added to table

In [0]:
from datetime import datetime

aug_data = [
  #('item4', 6, datetime.strptime('2021-08-01','%Y-%m-%d')), # remove one row
  ('item2', 8, datetime.strptime('2021-08-01','%Y-%m-%d')),
  ('item5', 8, datetime.strptime('2021-08-03','%Y-%m-%d')),
  ('item1', 8, datetime.strptime('2021-08-04','%Y-%m-%d')) # that's a new item for August
]

schema = StructType([
  StructField('item', StringType(), True),
  StructField('quantity', IntegerType(), True),
  StructField('date', DateType(), True)
])

aug_df = spark.createDataFrame(data=aug_data, schema=schema)
aug_df.printSchema()
aug_df.show() 

In [0]:
# State the format explicitly; the default source is only Delta on Databricks.
spark.read.format('delta').load(table_path).orderBy('date').show()

In [0]:
# With append, item2 and item5 would be duplicated.
# Instead, replaceWhere overwrites only the date range covered by this delivery
# (Aug 1 to Aug 4) and keeps the purchases from all earlier days.
# Alternative (supported by newer Delta Lake releases): dynamic partition overwrite.
#spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
st_dt = '2021-08-01'
end_dt = '2021-08-04'

# Single quotes mark SQL string literals regardless of the session's quoting settings.
query = f"date >= '{st_dt}' AND date <= '{end_dt}'"

(aug_df
.write
.format('delta')
#.option('partitionOverwriteMode', 'dynamic')
.option('replaceWhere',query)
.partitionBy('date')
.mode('overwrite')
.save(table_path))

In [0]:
# The August rows should now match aug_df, while June and July are unchanged.
spark.read.format('delta').load(table_path).orderBy('date').show()
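
To make the effect of `replaceWhere` concrete, here is a plain-Python model of what the write above is expected to do: rows matching the predicate are removed and the incoming rows take their place, while everything outside the date range is left alone. `replace_where` is a hypothetical helper, not a Delta Lake API. The check that incoming rows fall inside the range only loosely mirrors the validation Delta performs by default, and the existing rows are a shortened subset of the sample data.

```python
from datetime import date

# A shortened subset of the table contents before the August delivery.
existing = [
    ("item1", 3, date(2021, 7, 5)),
    ("item2", 1, date(2021, 7, 30)),
    ("item4", 6, date(2021, 8, 1)),
    ("item2", 8, date(2021, 8, 1)),
    ("item5", 8, date(2021, 8, 3)),
]
incoming = [
    ("item2", 8, date(2021, 8, 1)),
    ("item5", 8, date(2021, 8, 3)),
    ("item1", 8, date(2021, 8, 4)),
]

def replace_where(table, new_rows, start, end):
    """Hypothetical model: drop rows with start <= date <= end, then add new_rows."""
    def in_range(row):
        return start <= row[2] <= end

    if not all(in_range(row) for row in new_rows):
        raise ValueError("incoming rows must satisfy the replaceWhere predicate")
    kept = [row for row in table if not in_range(row)]
    return kept + list(new_rows)

result = replace_where(existing, incoming, date(2021, 8, 1), date(2021, 8, 4))
for row in sorted(result, key=lambda r: (r[2], r[0])):
    print(row)
```

In this model the `item4` row from August 1 disappears because it is absent from the delivery, which is the behavior that distinguishes a range overwrite from an append.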

In [0]:
%sh
rm -rf /dbfs/mnt/temp.partitions_testing

#### Same with Parquet

In [0]:
parquet_path = '/mnt/items'

In [0]:
(df
.write 
.mode('overwrite')
.partitionBy('date')
.parquet(parquet_path)
)

In [0]:
%sh
ls -lah /dbfs/mnt/items

In [0]:
spark.read.parquet(parquet_path).orderBy('date').show()

In [0]:
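# Dynamic mode replaces only the partitions present in aug_df.
# It can be set per write (as below) or for the whole session:
#spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')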

(aug_df
.write
.mode('overwrite')
.option('partitionOverwriteMode', 'dynamic')
.partitionBy('date')
.parquet(parquet_path))

In [0]:
spark.read.parquet(parquet_path).orderBy('date').show()
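
The difference between the default (static) overwrite and dynamic partition overwrite can be sketched in plain Python by modelling the table as a dict from partition value to rows. `overwrite_partitions` is a hypothetical helper for illustration, not Spark's implementation, and the table contents are a shortened subset of the sample data.

```python
def overwrite_partitions(table, incoming, mode="static"):
    """table / incoming: dict mapping partition value -> list of rows."""
    if mode == "static":
        # Static: every existing partition is dropped; only incoming data remains.
        return {key: list(rows) for key, rows in incoming.items()}
    if mode == "dynamic":
        # Dynamic: only partitions present in the incoming data are replaced.
        merged = {key: list(rows) for key, rows in table.items()}
        merged.update({key: list(rows) for key, rows in incoming.items()})
        return merged
    raise ValueError(f"unknown mode: {mode!r}")

table = {
    "2021-07-30": [("item2", 1)],
    "2021-08-01": [("item4", 6), ("item2", 8)],
    "2021-08-03": [("item5", 8)],
}
incoming = {
    "2021-08-01": [("item2", 8)],
    "2021-08-03": [("item5", 8)],
    "2021-08-04": [("item1", 8)],
}

static_result = overwrite_partitions(table, incoming, mode="static")
dynamic_result = overwrite_partitions(table, incoming, mode="dynamic")
print(sorted(static_result))   # partitions left after a static overwrite
print(sorted(dynamic_result))  # partitions left after a dynamic overwrite
```

Under this model the July partition survives only in dynamic mode, which is why the write above sets `partitionOverwriteMode` to `dynamic`.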

In [0]:
%sh
rm -rf /dbfs/mnt/items

#### The end of the notebook