##### Streaming ingestion orchestration notebook 

This notebook generates the [DAG](https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#reference-run-multiple-notebooks-in-parallel) required by the runMultiple utility. The DAG defines the tables to load and their associated parameters. 
Each DAG entry invokes notebook 03 - TableLoader, which accepts three main parameters: the table name (generated by the for loop in the cell below), the primary key, and the dense rank order-by key. The order-by key is often required to extract the latest change per primary key, i.e. when your change feed contains multiple records with the same primary key. For this demo leave the final parameter empty.
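
To make the role of the order-by key concrete, here is a minimal plain-Python sketch of "keep the latest change per primary key". It only illustrates the idea; it is not the 03 - TableLoader implementation, which is not shown in this notebook and would more likely use a Spark window function. The helper name `latest_change_per_key` and the sample rows are hypothetical; the column names `salekey` and `changeTimestamp` are the ones passed in the DAG below.

```python
def latest_change_per_key(records, join_key, order_key):
    """Keep the rows that rank first per join_key when ordered by order_key descending."""
    # Find the highest order_key value seen for each primary key
    latest = {}
    for row in records:
        key = row[join_key]
        if key not in latest or row[order_key] > latest[key]:
            latest[key] = row[order_key]
    # Rows tied on the highest value share dense rank 1, so all of them are kept
    return [row for row in records if row[order_key] == latest[row[join_key]]]


# Hypothetical change feed: salekey 1 was changed twice
change_feed = [
    {"salekey": 1, "changeTimestamp": "2024-01-01T10:00:00", "amount": 10},
    {"salekey": 1, "changeTimestamp": "2024-01-01T10:05:00", "amount": 12},
    {"salekey": 2, "changeTimestamp": "2024-01-01T09:00:00", "amount": 7},
]

latest_rows = latest_change_per_key(change_feed, "salekey", "changeTimestamp")
print(latest_rows)
```

ISO 8601 timestamps in the same format compare correctly as strings, which is why the sketch does not parse them.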

In the cell below, set the number of tables and the desired [trigger mode](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers). Streaming mode runs the process continuously (micro-batch) and processes changes as soon as possible. Batch mode should be used when the notebook is run on a schedule. Optionally, update the relative base location if it was changed in the Setup notebook. 
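
As a rough illustration of how the two modes could map onto Structured Streaming triggers, the sketch below returns keyword arguments for `DataStreamWriter.trigger()`. This is an assumption: how 03 - TableLoader actually interprets `pTriggerType` is not shown in this notebook. The helper name `trigger_options` and the 10 second interval are hypothetical, and `availableNow` requires Spark 3.3 or later.

```python
def trigger_options(trigger_mode, interval="10 seconds"):
    """Map the notebook's trigger_mode to keyword arguments for DataStreamWriter.trigger()."""
    if trigger_mode == "batch":
        # Process everything currently available, then stop (suits scheduled runs)
        return {"availableNow": True}
    if trigger_mode == "streaming":
        # Keep running and start a new micro-batch at a fixed interval
        return {"processingTime": interval}
    raise ValueError(f"Unsupported trigger mode: {trigger_mode!r}")


print(trigger_options("batch"))
print(trigger_options("streaming"))

# Hypothetical usage inside a loader notebook (needs a streaming DataFrame `df`):
# df.writeStream.trigger(**trigger_options(trigger_mode))
```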

Run the cell and wait for all the streams to initialise.

Then leave the notebook running, return to the Setup notebook, and continue with step 3 to add incremental files and verify that the changes have been loaded.

 <font size="2" color="red" face="sans-serif" bold> 

<b><i><u>Ensure a default lakehouse has been set for this notebook and 03 - TableLoader before running this notebook.</u></i></b>
</font>



In [8]:
import ast

# Set the desired number of tables to simulate. This should be the same as the number chosen in the Setup notebook.
numtables = 5

# Specify one of the desired trigger modes by uncommenting the relevant line.
trigger_mode = "batch" # Note: per run, batch mode will take approximately 3 mins to complete for 10 tables
#trigger_mode = "streaming"

# Do not change the relative base - this is where change feed data and checkpoints will be stored
relbaselocation = "Files/AutoMerger"

# The notebook to be run in the runMultiple utility. This notebook should exist in your workspace.
NotebookName='03 - TableLoader'

DAGstrnotebooks=''
DAGstr=''
DAGstrbeg =  '''{
    "activities": ['''

# Loop through the number of required tables to generate a DAG in JSON
for i in range(numtables):
  table_name = 'table'+str(i+1)
  DAGstrnotebook = '''        {
            "name": "'''+table_name+'''", 
            "path": "'''+NotebookName+'''", 
            "timeoutPerCellInSeconds": 90, 
            "args": {"pTableName": "'''+ table_name + '''", "pJoinKey":"salekey", "pOrderKey":"changeTimestamp", "pTriggerType":"''' + trigger_mode + '''"},
            "retry": 3,
            "retryIntervalInSeconds": 10
        }'''
  if i<int(numtables)-1:
    DAGstrdelim = ','
  else:
    DAGstrdelim = ''
  DAGstrnotebooks = DAGstrnotebooks+DAGstrnotebook + DAGstrdelim
DAGstrend = '''        
    ]
}'''
DAGstr=DAGstrbeg + DAGstrnotebooks + DAGstrend
# Parse the DAG string into a Python dictionary (literal_eval reads it as a Python literal)
DAG = ast.literal_eval(DAGstr)

# Execute the notebooks using runMultiple utility passing the DAG as a parameter
exitval = mssparkutils.notebook.runMultiple(DAG)

# Print the output of the queries (applicable only in batch mode)
print(exitval)



#### Terminate streaming queries

If you are running in streaming (continuous) mode and cancel the cell above, it is a good idea to check whether any active queries remain and terminate them. Note that the cell below may need to be run several times until no active queries remain.
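
The "run several times" step can be sketched as a simple retry loop. This is a hedged alternative, not a replacement for the helper in the cell below: it stops queries immediately rather than waiting for them to become idle. The function name `stop_all_queries` and its parameters are hypothetical; it relies only on the `active` property of `spark.streams` and on the `name`, `id` and `stop()` members of each query.

```python
import time


def stop_all_queries(streams, max_passes=5, pause_seconds=1.0):
    """Stop every active query, re-checking until none remain or max_passes is reached."""
    for _ in range(max_passes):
        active = list(streams.active)  # copy, because the list changes as queries stop
        if not active:
            return True
        for query in active:
            # Unnamed queries have name None, so fall back to the id
            print(f"Stopping {query.name or query.id}...")
            query.stop()
        time.sleep(pause_seconds)
    # True only if nothing is left running after the final pass
    return not streams.active


# Hypothetical usage in this notebook:
# all_stopped = stop_all_queries(spark.streams)
```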

In [18]:
import time
# Helper method to stop a streaming query
def stop_stream_query(query, wait_time):
    """Stop a running streaming query"""
    while query.isActive:
        msg = query.status['message']
        data_avail = query.status['isDataAvailable']
        trigger_active = query.status['isTriggerActive']
        if not data_avail and not trigger_active and msg != "Initializing sources":
            print('Stopping query...')
            query.stop()
        time.sleep(0.5)

    # Okay wait for the stop to happen
    print('Awaiting termination...')
    query.awaitTermination(wait_time)

if trigger_mode == "streaming":
    sqm = spark.streams
    if len(sqm.active)>0:
        print(str(len(sqm.active))+" active streaming queries exist.")
        for q in sqm.active:
            # q.name is None for unnamed queries, so fall back to the query id
            print(str(q.name or q.id) + " streaming query is still active, terminating...")
            stop_stream_query(q,1000)
        print("Done")
    else:
        print("No active streaming queries exist.")



No active streaming queries exist.


###### Next Steps...
You may wish to make the process metadata driven. This can be achieved by storing the list of tables and their associated primary keys in a Delta table (or a SQL database table) and using it to populate the DAG. Additionally, you can have the solution scan all subfolders under the incremental feed folder to determine which target tables to load. The cell below adapts the first cell to demonstrate this. 
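
The matching step can also be sketched without string concatenation by building the DAG directly as a Python dictionary, which is the type that `ast.literal_eval` produces in the cells of this notebook. This is a minimal sketch under stated assumptions: the helper name `build_dag` and the sample folder names and metadata rows are hypothetical, while the activity keys and the `source_database`, `table_name` and `primary_key_column` columns mirror those used in the cell below.

```python
import re


def build_dag(folder_names, key_rows, notebook_name, trigger_mode):
    """Build a runMultiple-style DAG dictionary from folder names and key metadata."""
    # Index the metadata by "<source_database>_<table_name>" for direct lookup
    keys_by_table = {
        f"{row['source_database']}_{row['table_name']}": row["primary_key_column"]
        for row in key_rows
    }
    activities = []
    for folder in folder_names:
        # Normalise the folder name, e.g. "sales.orders__ct" becomes "sales_orders"
        lookup = re.sub(r"__ct$", "", folder.replace(".", "_"))
        if lookup not in keys_by_table:
            continue  # no metadata for this folder, so it is skipped
        activities.append({
            "name": folder,
            "path": notebook_name,
            "timeoutPerCellInSeconds": 200,
            "args": {
                "pTableName": folder,
                "pJoinKey": keys_by_table[lookup],
                "pOrderKey": "changeTimestamp",
                "pTriggerType": trigger_mode,
            },
            "retry": 3,
            "retryIntervalInSeconds": 10,
        })
    return {"activities": activities}


# Hypothetical inputs standing in for the folder listing and the lookup table
sample_folders = ["sales.orders__ct", "sales.customers__ct", "hr.unmapped__ct"]
sample_keys = [
    {"source_database": "sales", "table_name": "orders", "primary_key_column": "orderkey"},
    {"source_database": "sales", "table_name": "customers", "primary_key_column": "customerkey"},
]

sample_dag = build_dag(sample_folders, sample_keys, "IngestionApply", "batch")
print([activity["name"] for activity in sample_dag["activities"]])
```

One difference from the cell below: if the lookup table holds several rows for the same table, the dictionary index keeps only the last one, whereas the nested loops would emit one activity per matching row.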

In [None]:
import ast
import os
import re

# Fetch the metadata from a table
keys_df = spark.sql("select * from table_primary_key_lookup")
keys_list = keys_df.collect()

# Convert each row to a dictionary
keys_dict_list = [row.asDict() for row in keys_list]


# List the incrementalfeed folders and collect the name of every folder that ends with the '__ct' suffix
# TODO handle empty folders
# relbaselocation already starts with "Files/" (see the first cell), so it is not prefixed again
folders = mssparkutils.fs.ls(relbaselocation + "/incrementalfeed/")
full_tables = []
for folder in folders:
    folder_path = folder.path
    
    # If the folder name ends with __ct, add it to the list of full_tables
    if folder_path.endswith('__ct'):
        extracted_folder = os.path.basename(folder_path.split("/Files/")[-1])
        full_tables.append(extracted_folder) 


# Build the steps in the DAG:
#   1. Normalise each folder name in full_tables: replace '.' with '_' and strip the trailing '__ct'.
#   2. Compare it with source_database + '_' + table_name for each row in keys_dict_list.
#   3. For every match, append an activity to DAGstrnotebook, with pTableName set to the folder name,
#      pJoinKey set to primary_key_column, and the remaining fixed values.
# Finally, strip the trailing comma from DAGstrnotebook, wrap it in DAGstrbeg and DAGstrend to form DAGstr,
# and parse DAGstr into the DAG dictionary.
NotebookName='IngestionApply'
DAGstrnotebook=''
DAGstr=''
DAGstrbeg =  '''{ "activities": ['''
for tablename in full_tables:
    for keys in keys_dict_list:
        if re.sub(r'__ct$','',tablename.replace('.','_')) == keys['source_database']+'_'+keys['table_name']:
            DAGstrnotebook = DAGstrnotebook+'''        {
                    "name": "'''+tablename+'''", 
                    "path": "'''+NotebookName+'''", 
                    "timeoutPerCellInSeconds": 200, 
                    "args": {"pTableName": "'''+tablename+'''", "pJoinKey":"'''+keys['primary_key_column']+'''", "pOrderKey":"changeTimestamp", "pTriggerType":"''' + trigger_mode + '''"},
                    "retry": 3,
                    "retryIntervalInSeconds": 10
                },'''
DAGstrend = '''        
    ]
}'''
DAGstr=DAGstrbeg + DAGstrnotebook.rstrip(',') + DAGstrend

DAG = ast.literal_eval(DAGstr)

# Execute the notebooks using runMultiple utility passing the DAG as a parameter
exitval = mssparkutils.notebook.runMultiple(DAG)

# Print the output of the queries (applicable only in batch mode)
print(exitval)