#### Setup Instructions
##### Step 1 - Initialise
Set the number of tables required and the relative base location where the sample data and checkpoints are stored.

Before running the cell below, add the target lakehouse to this notebook and set it as the default. The cell then copies the sample data and creates the tables.


In [None]:
# Set the number of tables required
numtables=5
# Set the relative base path for sample data
relbaselocation = "Files/AutoMerger"  # no trailing slash: the paths below add their own "/"

def create_tables(tablenum):
  table_name = 'table'+str(tablenum+1)
  print("Dropping table "+table_name+" if exists")
  # table_name already carries the "table" prefix, so do not prepend it again
  spark.sql("drop table if exists "+table_name)
  print("Loading and creating table "+table_name)
  df = spark.read.format("parquet").load(relbaselocation+'/basetable/part-00000-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet')
  df.write.mode("overwrite").format("delta").save("Tables/" + table_name)
  print("Resetting checkpoint and incremental directories")
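  try:
    mssparkutils.fs.rm("Files/checkpoints/"+table_name,True)
  except Exception:
    # Ignore failures, e.g. when the directory does not exist yet
    pass
  try:
    mssparkutils.fs.rm("Files/incremental/"+table_name,True)
  except Exception:
    # Ignore failures, e.g. when the directory does not exist yet
    pass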
  # Copy the first sample data to the incremental feed folder   
  mssparkutils.fs.cp(relbaselocation+"/basetable/part-00001-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet",relbaselocation+"/incrementalfeed/"+table_name+"/part-00001-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet")


# Copy sample data
for i in range(0,10):
    if not mssparkutils.fs.exists(relbaselocation+"/basetable/part-0000"+str(i)+"-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet"):
      print("Copying sample data file: "+ relbaselocation+"/basetable/part-0000"+str(i)+"-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet")
      mssparkutils.fs.cp("https://azuresynapsestorage.blob.core.windows.net/sampledata/WideWorldImportersDW/parquet/incremental/fact_sale_1y_incremental/part-0000"+str(i)+"-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet",relbaselocation+"/basetable/part-0000"+str(i)+"-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet")
# Create tables. Note these will appear in the Tables section with the name "table" suffixed with a number (table1, table2, ...)
for i in range(numtables):
  create_tables(i)
print("Complete")
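The setup cell repeats the same long parquet file name in several places. A minimal sketch of a helper that builds those paths in one place follows. `sample_file_name` and `sample_path` are hypothetical names that do not exist in the notebook; the GUID suffix is copied from the file names used above.

```python
import posixpath

# Suffix shared by the sample parquet files referenced in the setup cell
SAMPLE_SUFFIX = "-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet"

def sample_file_name(filepos):
    """Return the sample parquet file name for a file position (0-9 in this notebook)."""
    return "part-" + str(filepos).zfill(5) + SAMPLE_SUFFIX

def sample_path(base, folder, filepos):
    """Join base location, folder and file name with exactly one separator between parts."""
    return posixpath.join(base.rstrip("/"), folder, sample_file_name(filepos))

print(sample_path("Files/AutoMerger/", "basetable", 0))
```

Building the name with `zfill` rather than concatenating `"part-0000"` keeps the five-digit part number intact, and joining with `posixpath.join` avoids a doubled `/` when the base location ends in a slash.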


##### Step 2 - Run the Orchestrator notebook

Now open the Orchestrator notebook, read its instructions, and run it.


##### Step 3 - Prepare and load incremental data into the incremental feed per table
Set the variables below. Each run writes one batch of updates and inserts to every table's incremental feed folder.

In [None]:
# Set the number of tables, inserts and updates  
numupdates = 500
numinserts=500
numtables=5

# Starting incremental feed at file position 2. Will be incremented each time the cell below is run
filepos = 2

relbaselocation = "Files/AutoMerger"  # no trailing slash: the paths below add their own "/"

In [None]:
print("Loading incremental file position " + str(filepos))

# Obtain a set of already existing records so that these can be updated to form part of the next incremental batch 
dfupdates = spark.read.format("parquet") \
.load(relbaselocation+"/basetable/part-00000-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet") \
.orderBy('salekey',ascending=True).limit(numupdates)
dfupdates.write.mode("overwrite").format("delta").save("Tables/temptable")
spark.sql("update temptable set Description = 'Test update"+str(filepos)+"'")
dfupdates = spark.sql("select * from temptable")

dfinserts = spark.read.format("parquet") \
.load(relbaselocation+"/basetable/part-0000" + str(filepos) + "-5eaa21a0-54da-459d-b098-307a78d5d41e-c000.snappy.parquet") \
.orderBy('salekey',ascending=False).limit(numinserts)
df03 = dfupdates.union(dfinserts)
# Tables are named table1..table<numtables>, so the range must include numtables
for p in range(1,numtables+1):
  df03.coalesce(1).write.mode("append").format("parquet").save(relbaselocation+"/incrementalfeed/table"+str(p))

filepos = filepos+1

##### Step 4 - Verify data has been merged into the target tables

In [None]:
# Verify the recent inserts and updates for a specific table
df = spark.sql("describe history table1")
display(df.select("timestamp","operationMetrics.numTargetRowsInserted","operationMetrics.numTargetRowsMatchedUpdated").orderBy("timestamp",ascending=False))

In [None]:
'''
Issue to resolve

Cannot perform Merge as multiple source rows matched and attempted to modify the same
target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge,
when multiple source rows match on the same target row, the result may be ambiguous
as it is unclear which source row should be used to update or delete the matching
target row. You can preprocess the source table to eliminate the possibility of
multiple matches. Please refer to
https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge'''
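
The preprocessing the error message asks for amounts to keeping at most one source row per merge key before the merge runs. The sketch below illustrates that rule in plain Python on a list of dictionaries. It assumes `salekey` is the merge key (the column this notebook orders by) and uses a hypothetical `batch` field to decide which duplicate is newest; neither the function name nor the `batch` field comes from the notebook.

```python
def latest_per_key(rows, key="salekey", order="batch"):
    """Keep one row per key, preferring the row with the highest `order` value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace the stored row only when this one is newer (ties keep the first seen)
        if current is None or row[order] > current[order]:
            latest[row[key]] = row
    # Return rows sorted by key so the result is deterministic
    return [latest[k] for k in sorted(latest)]

# Two batches both update salekey 1; only the newer update should reach the merge
source = [
    {"salekey": 1, "Description": "Test update2", "batch": 2},
    {"salekey": 1, "Description": "Test update3", "batch": 3},
    {"salekey": 2, "Description": "Test update2", "batch": 2},
]
deduped = latest_per_key(source)
print([(r["salekey"], r["Description"]) for r in deduped])
```

On a Spark DataFrame the same effect is usually achieved with `dropDuplicates` on the key column, or with a window function when a specific duplicate must win. Which one fits depends on how the Orchestrator notebook builds its merge source.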