#### Setup Instructions
##### Step 1 - Initialise
Set the number of tables required and the relative base location where the sample data and checkpoints are stored.

Run the cell below to create the tables and populate the sample data under the relative base location.

Remember to add the target lakehouse to this notebook and set it as the default.
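As a rough illustration of how the two settings drive the layout, the sketch below derives the per-table paths used in this notebook (Delta table, checkpoint folder, incremental feed folder). The helper name `table_layout` is hypothetical and not part of the notebook; it uses `posixpath.join` so that a trailing slash on the base location does not produce a doubled separator.

```python
import posixpath

def table_layout(numtables, relbaselocation):
    """Return the paths associated with each sample table.

    Hypothetical helper: it mirrors the naming used in this notebook
    (table1..tableN) but is not part of it.
    """
    layout = {}
    for n in range(1, numtables + 1):
        name = "table" + str(n)
        layout[name] = {
            "delta": posixpath.join("Tables", name),
            "checkpoint": posixpath.join(relbaselocation, "checkpoints", name),
            "feed": posixpath.join(relbaselocation, "incrementalfeed", name),
        }
    return layout

# Example with three tables and the base location used in this notebook
layout = table_layout(3, "Files/AutoMerger/")
for name, paths in layout.items():
    print(name, paths)
```

The same function gives identical paths whether or not the base location ends with a slash, which avoids mixing the two concatenation styles.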


In [19]:
# Set the number of tables required
numtables=10
# Set the relative base path for sample data
relbaselocation = "Files/AutoMerger/"

def create_tables(tablenum):
  table_name = 'table'+str(tablenum+1)
  print("Dropping table "+table_name+" if exists")
  spark.sql("drop table if exists "+table_name+";")
  print("Loading and creating table "+table_name)
  df = spark.read.format("parquet").load(relbaselocation+'/basetable/part-00000-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet')
  df.write.mode("overwrite").format("delta").save("Tables/" + table_name)
  print("Resetting checkpoint and incremental directories")
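  try:
    mssparkutils.fs.rm(relbaselocation+"checkpoints/"+table_name,True)
  except Exception:
    pass  # Ignore failures, e.g. when the directory does not exist yet
  try:
    mssparkutils.fs.rm(relbaselocation+"incrementalfeed/"+table_name,True)
  except Exception:
    pass  # Ignore failures, e.g. when the directory does not exist yet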
  # Copy the first sample data to the incremental feed folder   
  mssparkutils.fs.cp(relbaselocation+"/basetable/part-00001-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet",relbaselocation+"/incrementalfeed/"+table_name+"/part-00001-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet")


# Copy sample data
for i in range(0,10):
    # Build the sample file name once and reuse it for the check, the message and the copy
    filename = "part-0000"+str(i)+"-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet"
    if not mssparkutils.fs.exists(relbaselocation+"/basetable/"+filename):
        print("Copying sample data file: "+ relbaselocation+"/basetable/"+filename)
        mssparkutils.fs.cp("https://azuresynapsestorage.blob.core.windows.net/sampledata/WideWorldImportersDW/parquet/incremental/fact_sale_1y_incremental/"+filename,relbaselocation+"/basetable/"+filename)
# Create tables. Note these will appear in the tables section with the name "table" suffixed with a numerical value
for i in range(numtables):
  create_tables(i)
print("Complete")


Dropping table table1 if exists
Loading and creating table table1
Resetting checkpoint and incremental directories
Dropping table table2 if exists
Loading and creating table table2
Resetting checkpoint and incremental directories
Dropping table table3 if exists
Loading and creating table table3
Resetting checkpoint and incremental directories
Dropping table table4 if exists
Loading and creating table table4
Resetting checkpoint and incremental directories
Dropping table table5 if exists
Loading and creating table table5
Resetting checkpoint and incremental directories
Dropping table table6 if exists
Loading and creating table table6
Resetting checkpoint and incremental directories
Dropping table table7 if exists
Loading and creating table table7
Resetting checkpoint and incremental directories
Dropping table table8 if exists
Loading and creating table table8
Resetting checkpoint and incremental directories
Dropping table table9 if exists
Loading and creating table table9
Resetting chec

##### Step 2

Now navigate to the Orchestrator notebook, read the instructions and ensure the streams are running before returning to this notebook.


##### Step 3 - Prepare and load incremental data into the incremental feed per table
The cells below let you simulate incoming incremental files to be merged into the target tables. Each incremental file arrives in a subfolder matching the table name. Once the streams are running, you can run these cells at various intervals and monitor the changes using the describe history command in Step 4.

Set the variables below to control the number of inserts and updates per batch and the relative base location (as in Step 1), but leave the filepos variable set to 2.
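To make the shape of one simulated batch concrete, here is a minimal plain-Python sketch of what the Step 3 cell assembles: a set of existing rows with a changed `Description` (the updates) followed by rows taken from the next sample file (the inserts). The function `build_batch` and the sample rows are made up for illustration, and whether the inserted keys are really absent from the target depends on the sample data.

```python
def build_batch(existing_rows, next_file_rows, numupdates, numinserts, filepos):
    """Hypothetical stand-in for the Spark logic in Step 3."""
    # Updates: the lowest sale keys from the base file, with a changed description
    updates = [
        {**row, "Description": "Test update" + str(filepos)}
        for row in sorted(existing_rows, key=lambda r: r["salekey"])[:numupdates]
    ]
    # Inserts: the highest sale keys from the next incremental file
    inserts = sorted(next_file_rows, key=lambda r: r["salekey"], reverse=True)[:numinserts]
    return updates + inserts

# Made-up rows for illustration only
base = [{"salekey": k, "Description": "base"} for k in range(1, 6)]
incoming = [{"salekey": k, "Description": "new"} for k in range(100, 106)]

batch = build_batch(base, incoming, numupdates=2, numinserts=3, filepos=2)
print([(r["salekey"], r["Description"]) for r in batch])
```

Note that the update half always starts from the same lowest sale keys, so successive batches touch the same target rows with a different `Description` each time.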

In [None]:
# Set the number of tables, inserts and updates  
numupdates = 500
numinserts=500
numtables=10
relbaselocation = "Files/AutoMerger/"

# Do not change this value: 2 is the starting position of the incremental files.
# It is incremented each time the cell below is run to add incremental files.
# Step 1 copies files part-00000 to part-00009 only, so positions 2 to 9 are available.
filepos = 2


In [36]:


print("Loading incremental file position " + str(filepos))

# Obtain a set of already existing records so that these can be updated to form part of the next incremental batch 
dfupdates = spark.read.format("parquet") \
.load(relbaselocation+"/basetable/part-00000-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet") \
.orderBy('salekey',ascending=True).limit(numupdates)
dfupdates.write.mode("overwrite").format("delta").save("Tables/temptable")
spark.sql("update temptable set Description = 'Test update"+str(filepos)+"'")
dfupdates = spark.sql("select * from temptable")

dfinserts = spark.read.format("parquet") \
.load(relbaselocation+"/basetable/part-0000" + str(filepos) + "-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet") \
.orderBy('salekey',ascending=False).limit(numinserts)
df03 = dfupdates.union(dfinserts)
# Tables are named table1..table<numtables>, so the range must include numtables
for p in range(1,numtables+1):
  df03.coalesce(1).write.mode("append").format("parquet").save(relbaselocation+"/incrementalfeed/table"+str(p))  

filepos = filepos+1

Loading incremental file position 3


##### Step 4 - Verify data has been merged into the target tables

You can verify the number of inserts and updates as well as the latency/timestamp at which these occurred.

In [37]:
# Verify the recent inserts and updates for a specific table
df = spark.sql("describe history table1")
display(df.select("timestamp","operationMetrics.numTargetRowsInserted","operationMetrics.numTargetRowsMatchedUpdated").orderBy("timestamp",ascending=False))

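When reading the history, keep in mind that Delta reports `operationMetrics` as a map of strings, and that only merge operations carry the two metrics selected above. The sketch below shows one way to turn history rows into per-merge counts in plain Python. The function `summarise_merges` and the sample rows are hypothetical; real rows could be obtained with something like `[r.asDict(recursive=True) for r in df.collect()]`, which is an assumption about how you would wire it in rather than part of this notebook.

```python
def summarise_merges(history_rows):
    """Return (timestamp, inserted, updated) per MERGE, newest first."""
    summary = []
    for row in history_rows:
        if row.get("operation") != "MERGE":
            continue  # Writes, updates etc. do not report merge metrics
        metrics = row.get("operationMetrics") or {}
        summary.append((
            row["timestamp"],
            int(metrics.get("numTargetRowsInserted", 0)),        # metric values are strings
            int(metrics.get("numTargetRowsMatchedUpdated", 0)),
        ))
    return sorted(summary, key=lambda item: item[0], reverse=True)

# Made-up history rows for illustration only
sample_history = [
    {"timestamp": "2024-01-01 10:00:00", "operation": "WRITE",
     "operationMetrics": {"numOutputRows": "1000"}},
    {"timestamp": "2024-01-01 10:05:00", "operation": "MERGE",
     "operationMetrics": {"numTargetRowsInserted": "500", "numTargetRowsMatchedUpdated": "500"}},
    {"timestamp": "2024-01-01 10:09:00", "operation": "MERGE",
     "operationMetrics": {"numTargetRowsInserted": "500", "numTargetRowsMatchedUpdated": "0"}},
]

merges = summarise_merges(sample_history)
for timestamp, inserted, updated in merges:
    print(timestamp, inserted, updated)
```

Comparing these counts with `numinserts` and `numupdates` from Step 3 gives a quick check that each batch was merged as expected.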

In [None]:
'''
Issue to resolve

Cannot perform Merge as multiple source rows matched and attempted to modify the same
target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge,
when multiple source rows match on the same target row, the result may be ambiguous
as it is unclear which source row should be used to update or delete the matching
target row. You can preprocess the source table to eliminate the possibility of
multiple matches. Please refer to
https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge'''
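The error quoted above is raised when the merge source contains more than one row for the same target row. One plausible cause here, offered as an assumption rather than a diagnosis, is that Step 3 reuses the same lowest sale keys for every batch, so a micro-batch that picks up two incremental files could carry the same `salekey` twice (this also assumes the merge is keyed on `salekey`, which is defined in the Orchestrator notebook, not here). The usual remedy is to reduce the source to one row per key before merging. The plain-Python sketch below illustrates that idea with a hypothetical helper `latest_per_key`, where later rows win.

```python
def latest_per_key(rows, key="salekey"):
    """Keep one row per key; rows later in the input replace earlier ones."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())

# Made-up source rows: salekey 1 arrives twice, from two different batches
rows = [
    {"salekey": 1, "Description": "Test update2"},
    {"salekey": 2, "Description": "Test update2"},
    {"salekey": 1, "Description": "Test update3"},
]

deduped = latest_per_key(rows)
print(deduped)
```

In Spark the same pre-processing is commonly expressed with `dropDuplicates` on the key, or with a `row_number()` window when a specific row must win. The latter needs a column that orders arrivals, and which column that would be is not stated in this notebook.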