## Programmatically create multiple tables

You can use Python with Delta Live Tables to programmatically create multiple tables to reduce code redundancy.

You might have pipelines containing multiple flows or dataset definitions that differ only by a small number of parameters. This redundancy results in pipelines that are error-prone and difficult to maintain. For example, the following diagram shows the graph of a pipeline that uses a fire department dataset to find neighborhoods with the fastest response times for different categories of emergency calls. In this example, the parallel flows differ by only a few parameters.
<img src="https://learn.microsoft.com/en-us/azure/databricks/_static/images/workflows/delta-live-tables/fire-dataset-flows.png">


## Delta Live Tables metaprogramming with Python example
You can use a metaprogramming pattern to reduce the overhead of generating and maintaining redundant flow definitions. Metaprogramming in Delta Live Tables is done using Python inner functions. Because these functions are lazily evaluated, you can use them to create flows that are identical except for input parameters. Each invocation can include a different set of parameters that controls how each table should be generated, as shown in the following example.

In [0]:
import dlt
from pyspark.sql.functions import *

Important

Because Python functions with Delta Live Tables decorators are invoked lazily, when creating datasets in a loop you must call a separate function to create the datasets to ensure correct parameter values are used. Failing to create datasets in a separate function results in multiple tables that use the parameters from the final execution of the loop.
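The pitfall described above is ordinary Python late binding, so it can be reproduced without a pipeline. The following is a minimal sketch under stated assumptions: `register` and `registry` are hypothetical stand-ins for a decorator that stores a function and invokes it later, and they are not part of the `dlt` module.

```python
# Hypothetical stand-in for a lazily invoked decorator: it stores the
# function under a name and does not call it until later.
registry = {}

def register(name):
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator

# Pitfall: the functions are defined directly in the loop body, so every
# one of them looks up the loop variable n only when it is finally called.
for n in ["t1", "t2"]:
    @register(f"bad_{n}")
    def _query():
        return n

# Fix: a separate function gives each definition its own copy of n.
def create(n):
    @register(f"good_{n}")
    def _query():
        return n

for n in ["t1", "t2"]:
    create(n)

# Invoke everything only now, after both loops have finished.
results = {key: fn() for key, fn in registry.items()}
print(results)
```

In this sketch both `bad_` entries return `"t2"`, the value from the final loop iteration, while each `good_` entry returns the value it was created with.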

The following example calls the create_table() function inside a loop to create tables t1 and t2:

<ol>
<li class="has-line-data" data-line-start="0" data-line-end="1">def create_table(name): - This defines a function create_table that takes one argument, name, which is expected to be the name of the table you want to create.</li>
<li class="has-line-data" data-line-start="1" data-line-end="2">@dlt.table(name=name) - This is a decorator that applies to the function defined immediately below it. The dlt.table decorator is used to register a function as a Databricks Delta Live Table (DLT). The name=name part sets the name of the DLT to the value of the name parameter passed to the create_table function.</li>
<li class="has-line-data" data-line-start="2" data-line-end="3">def t(): - This defines a nested function within create_table, which will be registered as a DLT due to the decorator above it.</li>
<li class="has-line-data" data-line-start="3" data-line-end="4">return spark.read.table(name) - Inside the nested function t, it reads a table from the Spark session using the name provided and returns it. This is the operation that will be performed when the DLT is executed.</li>
<li class="has-line-data" data-line-start="4" data-line-end="5">tables = ["t1", "t2"] - This creates a list of table names.</li>
<li class="has-line-data" data-line-start="5" data-line-end="6">for t in tables: - This starts a loop over the list of table names.</li>
<li class="has-line-data" data-line-start="6" data-line-end="9">create_table(t) - Inside the loop, it calls the create_table function with each table name, which will define and register a new DLT for each table name in the list.<br>
In essence, this code is dynamically creating and registering DLTs for each table name specified in the tables list. When executed in a Databricks environment, it would result in the creation of two DLTs named t1 and t2, each associated with a Spark table of the same name.</li>
</ol>

In [0]:
def create_table(name):
  @dlt.table(name=name)
  def t():
    return spark.read.table(name)

tables = ["t1", "t2"]
for t in tables:
  create_table(t)

py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 642, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/databricks/spark/python/dlt/helpers.py", line 29, in call
    res = self.func()
  File "/root/.ipykernel/1060/command-4315130338077372-363941997", line 4, in t
    return spark.read.table(name)
  File "/databricks/spark/python/dlt/overrides.py", line 34, in dlt_read_table_fn
    return spark_read_table(self, name)
  File "/databricks/spark/python/pyspark/instrumentation_utils.py", line 48, in wrapper
    res = func(*args, **kwargs)
  File "/databricks/spark/python/pyspark/sql/readwriter.py", line 484, in table
    return self._df(self._jreader.table(tableName))
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1355, in __call__
    return_value = get_return_va

<ol>
<li class="has-line-data" data-line-start="0" data-line-end="3">Import Statements:<br>
• import dlt: This imports the dlt module, the Python interface to Delta Live Tables (DLT).<br>
• from pyspark.sql.functions import *: This imports all functions from the pyspark.sql.functions module.</li>
<li class="has-line-data" data-line-start="3" data-line-end="10">Creating a Raw Table:<br>
• @dlt.table(name="raw_fire_department", comment="raw table for fire department response"): This decorator defines a DLT named "raw_fire_department" with a comment describing its purpose.<br>
• @dlt.expect_or_drop("valid_received", "received IS NOT NULL"), @dlt.expect_or_drop("valid_response", "responded IS NOT NULL"), and @dlt.expect_or_drop("valid_neighborhood", "neighborhood != 'None'"): These decorators declare expectations on the data, and rows that fail an expectation are dropped from the table. For example, the first one drops rows where the "received" column is null.<br>
• Inside the decorated function get_raw_fire_department(), the following operations are performed:<br>
◦ Read a CSV file from the specified path ('/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv').<br>
◦ Rename columns using withColumnRenamed.<br>
◦ Select specific columns: 'call_type', 'received', 'responded', and 'neighborhood'.</li>
<li class="has-line-data" data-line-start="10" data-line-end="21">Generating Tables:<br>
• The generate_tables function creates two DLTs based on the provided parameters:<br>
◦ create_call_table():<br>
◦ Reads data from the "raw_fire_department" DLT.<br>
◦ Filters rows where the "call_type" matches the specified filter (e.g., "Alarms").<br>
◦ Converts the "received" and "responded" timestamp strings to Unix timestamps (ts_received and ts_responded).<br>
◦ Selects the "neighborhood" column.<br>
◦ create_response_table():<br>
◦ Reads data from the previously created DLT (e.g., "alarms_table").<br>
◦ Calculates, for each neighborhood, the average difference between the ts_received and ts_responded timestamps as the response time.<br>
◦ Orders the results by response time and limits to the top 10 neighborhoods.</li>
<li class="has-line-data" data-line-start="21" data-line-end="24">Table Names and Appending to all_tables:<br>
• The generate_tables function is called three times, once for each call table ("alarms_table", "fire_table", and "medical_table").<br>
• The names of the resulting response tables ("alarms_response", "fire_response", and "medical_response") are appended to the all_tables list.</li>
<li class="has-line-data" data-line-start="24" data-line-end="30">Summary Table:<br>
• A final DLT named "best_neighborhoods" is created.<br>
• It combines data from all previously generated response DLTs (union operation).<br>
• Groups by neighborhood and calculates the count (score) for each neighborhood.<br>
• Orders the results by score in descending order.</li>
</ol>
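The three hand-written calls described in step 4 can also be driven from a list of parameters, which keeps the pipeline definition in one place. This is a minimal sketch, not the pipeline code itself: `generate_tables` here is a stub that only records the response table name, and the `table_specs` list and the `call_type_filter` parameter name are hypothetical names introduced for illustration.

```python
# Stub standing in for the generate_tables function described above.
# The real function registers two Delta Live Tables; this one only
# records the response table name so the loop can run anywhere.
all_tables = []

def generate_tables(call_table, response_table, call_type_filter):
    all_tables.append(response_table)

# Hypothetical parameter list: (call table, response table, call_type value).
table_specs = [
    ("alarms_table", "alarms_response", "Alarms"),
    ("fire_table", "fire_response", "Structure Fire"),
    ("medical_table", "medical_response", "Medical Incident"),
]

# One call per spec. Because the table definitions live inside
# generate_tables, each call keeps its own parameter values.
for call_table, response_table, call_type in table_specs:
    generate_tables(call_table, response_table, call_type)

print(all_tables)
```

Adding another call type then means adding one tuple to the list rather than another call.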

In [0]:
import functools  # required by functools.reduce in summary()
import dlt
from pyspark.sql.functions import *

@dlt.table(
  name="raw_fire_department",
  comment="raw table for fire department response"
)
@dlt.expect_or_drop("valid_received", "received IS NOT NULL")
@dlt.expect_or_drop("valid_response", "responded IS NOT NULL")
@dlt.expect_or_drop("valid_neighborhood", "neighborhood != 'None'")
def get_raw_fire_department():
  return (
    spark.read.format('csv')
      .option('header', 'true')
      .option('multiline', 'true')
      .load('/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv')
      .withColumnRenamed('Call Type', 'call_type')
      .withColumnRenamed('Received DtTm', 'received')
      .withColumnRenamed('Response DtTm', 'responded')
      .withColumnRenamed('Neighborhooods - Analysis Boundaries', 'neighborhood')
      .select('call_type', 'received', 'responded', 'neighborhood')
  )

all_tables = []

def generate_tables(call_table, response_table, filter):
  @dlt.table(
    name=call_table,
    comment="top level tables by call type"
  )
  def create_call_table():
    return (
      spark.sql("""
        SELECT
          unix_timestamp(received,'M/d/yyyy h:m:s a') as ts_received,
          unix_timestamp(responded,'M/d/yyyy h:m:s a') as ts_responded,
          neighborhood
        FROM LIVE.raw_fire_department
        WHERE call_type = '{filter}'
      """.format(filter=filter))
    )

  @dlt.table(
    name=response_table,
    comment="top 10 neighborhoods with fastest response time"
  )
  def create_response_table():
    return (
      spark.sql("""
        SELECT
          neighborhood,
          AVG(ts_responded - ts_received) as response_time
        FROM LIVE.{call_table}
        GROUP BY 1
        ORDER BY response_time
        LIMIT 10
      """.format(call_table=call_table))
    )

  all_tables.append(response_table)

generate_tables("alarms_table", "alarms_response", "Alarms")
generate_tables("fire_table", "fire_response", "Structure Fire")
generate_tables("medical_table", "medical_response", "Medical Incident")

@dlt.table(
  name="best_neighborhoods",
  comment="which neighborhood appears in the best response time list the most"
)
def summary():
  target_tables = [dlt.read(t) for t in all_tables]
  unioned = functools.reduce(lambda x,y: x.union(y), target_tables)
  return (
    unioned.groupBy(col("neighborhood"))
      .agg(count("*").alias("score"))
      .orderBy(desc("score"))
  )

| Name | Type |
| --- | --- |
| neighborhood | string |
| score | bigint |
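The scoring logic of the summary table can be illustrated without Spark. The sketch below mirrors the union, group, count, and sort steps using plain Python lists. The neighborhood lists are made-up sample values for illustration only, not results from the dataset.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: each list plays the role of the neighborhood
# column of one response table. These are not real pipeline results.
alarms_response = ["Mission", "Sunset", "Marina"]
fire_response = ["Sunset", "Mission", "Nob Hill"]
medical_response = ["Sunset", "Marina", "Chinatown"]

# Union step: fold the lists into one, as functools.reduce does with
# DataFrame.union in the summary function.
unioned = reduce(lambda x, y: x + y,
                 [alarms_response, fire_response, medical_response])

# Group and count step: the score is how many lists a neighborhood appears in.
score = Counter(unioned)

# Order step: highest score first.
ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

A neighborhood that appears in all three sample lists receives a score of 3 and sorts first, which is the same ranking idea the `best_neighborhoods` table applies to the response tables.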
