## BigQuery Spark Connector

Python does not currently support JAR files loaded in an Environment, so the required library is loaded directly into the session.

<mark>PLEASE VERIFY THE LAKEHOUSE (ABFSS) PATH TO THE JAR IS CORRECT FOR YOUR WORKSPACE</mark>

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars": "<<<PATH_SPARK_BQ_JAR>>>"
    }
}  
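Before running the cell above, a quick shape check on the path can catch typos or a placeholder that was never filled in. This is a minimal sketch, not part of the BQ Sync package: the helper name <code>looks_like_abfss_jar</code> and the example path are hypothetical, and the check only inspects the string. It does not confirm that the file exists in your lakehouse.

```python
from urllib.parse import urlparse

def looks_like_abfss_jar(path: str) -> bool:
    """Return True if path has the shape abfss://<name>@<host>/<...>.jar."""
    parsed = urlparse(path)
    return (
        parsed.scheme == "abfss"
        and bool(parsed.username)   # the part before "@"
        and bool(parsed.hostname)   # the storage endpoint after "@"
        and parsed.path.lower().endswith(".jar")
    )

# Hypothetical example path; substitute the value used for "spark.jars".
example = (
    "abfss://my_workspace@onelake.dfs.fabric.microsoft.com/"
    "my_lakehouse.Lakehouse/Files/libs/example.jar"
)
print(looks_like_abfss_jar(example))
```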

If you are not going to leverage an environment, the BQ Sync package needs to be installed at runtime. 

<strong>Please note that if you are scheduling this notebook to run from a pipeline, you must provide the <code>_inlineInstallationEnabled</code> parameter to the pipeline for pip install support.</strong>

In [None]:
%pip install --upgrade --force-reinstall <<<PATH_TO_BQ_SYNC_PACKAGE>>>

The set-up process creates a minimal config file based on the parameters provided. You can update the config file at any time and manually upload it to either the notebook resources or the environment resources path below.

In [None]:
config_json_path = "<<<PATH_TO_USER_CONFIG>>>"

In [None]:
from FabricSync.BQ.Loader import *
from FabricSync.DeltaTableUtility import *

# Metadata Sync

This step loads the user-supplied configuration and retrieves the relevant BQ metadata. Once the metadata has been synced to the Fabric lakehouse, the auto-detect process estimates the best way to load the BQ data and persists it to the metadata lakehouse. You can modify any of the configuration data as needed by either:
-  Updating the <code>bq_sync_configuration</code> table directly
-  Specifying overrides in the user configuration file

Once the scheduler and loader have run for the first time for a given table, the configuration for that table is locked. To change the configuration or fix an error after a table has been loaded, take the following steps:
1. Delete the configuration record from the <code>bq_sync_configuration</code> table.
2. Delete any related sync metadata from the <code>bq_sync_schedule</code> and <code>bq_sync_schedule_telemetry</code> tables.
3. Remove the target BQ table data from the target lakehouse, either manually or with <code>mssparkutils.fs.rm("\<PATH TO BQ TABLE\>", recurse=True)</code>.
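The metadata cleanup in the first two steps can be sketched as a helper that builds one DELETE statement per metadata table. This is a minimal sketch, not part of the BQ Sync package: the function name <code>build_reset_statements</code> is hypothetical, and the filter predicate is left to the caller because the column layout of the metadata tables is not described here.

```python
# Hypothetical helper, not part of the BQ Sync package.
METADATA_TABLES = [
    "bq_sync_configuration",
    "bq_sync_schedule",
    "bq_sync_schedule_telemetry",
]

def build_reset_statements(predicate: str) -> list[str]:
    """Return one DELETE statement per metadata table for the given filter."""
    if not predicate.strip():
        # An empty filter would wipe the metadata for every table.
        raise ValueError("Refusing to build an unfiltered DELETE")
    return [f"DELETE FROM {table} WHERE {predicate}" for table in METADATA_TABLES]

# Placeholder predicate; replace it with a filter that matches your table.
for statement in build_reset_statements("<FILTER FOR MY BQ TABLE>"):
    print(statement)  # review first, then run with spark.sql(statement)
```

Printing the statements before running them keeps the destructive step explicit and reviewable.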

# Scheduler

Builds the schedule for the tables configured in the step above. Currently only AUTO is supported for loading, which evaluates all enabled tables on every load. If a table is static and needs to be skipped, add an entry to the user configuration file to disable the table.

<code>
	"tables": [ 
	{
		"table_name": "<MY BQ TABLE>",
		"enabled": false
	}
	]
</code>
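If you prefer to script this change rather than edit the file by hand, the entry above can be added with the standard <code>json</code> module. This is a minimal sketch under two assumptions: the config file is a local JSON file that the notebook can read and write, and the helper name <code>disable_table</code> is hypothetical (it is not part of the BQ Sync package).

```python
import json
from pathlib import Path

def disable_table(config_path: str, table_name: str) -> dict:
    """Set "enabled": false for table_name in the user configuration file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    tables = config.setdefault("tables", [])
    for entry in tables:
        if entry.get("table_name") == table_name:
            entry["enabled"] = False  # update the existing entry in place
            break
    else:
        # No entry for this table yet, so add one.
        tables.append({"table_name": table_name, "enabled": False})
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

A call such as <code>disable_table(config_json_path, "<MY BQ TABLE>")</code> rewrites the file in place. If the file lives in the notebook or environment resources, upload the updated copy afterwards.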

# Loader

The async scheduler uses Python threading to saturate the configured Spark cluster resources and optimize load efficiency for your BQ tables/partitions. Python thread parallelism is controlled by the user configuration file.

<code>
"async": {
	"enabled": true,
	"parallelism": 5,
	"cell_timeout": 36000,
	"notebook_timeout": 72000
}
</code>

Choose a sensible number of threads based on your configured Spark environment (the default is 5). Setting the degree of parallelism too high will slow the whole process down.
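To illustrate what a <code>parallelism</code> setting means in practice, the sketch below caps the number of concurrent loads with a thread pool from the standard library. It is not the BQ Sync implementation: <code>load_table</code> and <code>run_loads</code> are hypothetical stand-ins, and the load itself is simulated with a short sleep instead of a Spark job.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_table(table_name: str) -> str:
    """Stand-in for a real table load; sleeps instead of calling Spark."""
    time.sleep(0.01)
    return table_name

def run_loads(table_names, parallelism=5):
    """Run load_table for every table with at most `parallelism` threads."""
    loaded = []
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(load_table, name) for name in table_names]
        for future in as_completed(futures):
            loaded.append(future.result())  # re-raises any load error
    return loaded

print(sorted(run_loads(["table_a", "table_b", "table_c"], parallelism=2)))
```

With <code>parallelism=2</code>, the third table waits until one of the first two finishes, which is the trade-off the setting controls.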

# Metadata Table Maintenance

Hygiene for the BQ sync process metadata tables:
- Optimize
- Vacuum with Retention 0 to minimize data size

In [None]:
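# Initialize BQ Sync with the Spark session and the user configuration path
bq_sync = BQSync(spark, config_json_path)
# Build the schedule for the configured tables
group_schedule_id = bq_sync.build_schedule()
# Run the loader for the schedule that was just built
bq_sync.run_schedule(group_schedule_id)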