## <mark>Notebook must be attached to metadata Lakehouse created during set-up</mark>

## BigQuery Spark Connector

In [None]:
%%configure -f
{
    "defaultLakehouse": {
        "name": "<<<METADATA_LAKEHOUSE_NAME>>>",
        "id": "<<<METADATA_LAKEHOUSE_ID>>>",
        "workspaceId": "<<<FABRIC_WORKSPACE_ID>>>"
    },
    "conf": {
        "spark.jars": "<<<PATH_SPARK_BQ_JAR>>>"
    }
}  

## BQ Sync Python Package
If you are not going to leverage an environment, the BQ Sync package needs to be installed at runtime. 

<strong>Options for Loading/Using the BQ Sync Package</strong>
1. Runtime from OneLake (stable): <br />
    <code>%pip install /lakehouse/default/Files/BQ_Sync_Process/libs/FabricSync-0.1.0-py3-none-any.whl</code>
2. Runtime from GitHub (latest version): <br/>
    <code>%pip install https://github.com/microsoft/FabricBQSync/raw/main/Packages/FabricSync/dist/FabricSync-0.1.0-py3-none-any.whl</code>
3. From Spark Environment 

<strong>Please note that if you are scheduling this notebook to run from a pipeline, you must provide the <code>_inlineInstallationEnabled</code> parameter to the pipeline for pip install support.</strong>

In [None]:
%pip install --upgrade --force-reinstall <<<PATH_TO_BQ_SYNC_PACKAGE>>>

# Sync Loader Documentation

## Config File Path
The set-up process creates a minimal config file based on the parameters provided. 

You can update the config file at any time and manually upload it to an alternate path.

Note: If you upload to a OneLake destination, it must be in the default Lakehouse and the <code>config_json_path</code> should point to the File API path (example: <code>/lakehouse/default/Files/myconfigfile.json</code>).
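
Because the config path is easy to get wrong, a quick check before running the sync can help. The following is a minimal sketch, not part of the BQ Sync package: the helper name `check_config_path` is hypothetical, and the prefix is taken from the example path in the note above.

```python
import json
from pathlib import Path

# Prefix taken from the example File API path in the note above.
DEFAULT_FILES_PREFIX = "/lakehouse/default/Files/"


def check_config_path(path, require_default_lakehouse=True):
    """Return a list of problems found with a config path (empty if none).

    Hypothetical helper for illustration; it only checks the path prefix,
    that the file exists, and that it parses as JSON.
    """
    problems = []
    if require_default_lakehouse and not path.startswith(DEFAULT_FILES_PREFIX):
        problems.append(f"path does not start with {DEFAULT_FILES_PREFIX}")
    config_file = Path(path)
    if not config_file.is_file():
        problems.append("file does not exist")
        return problems
    try:
        json.loads(config_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON: {exc}")
    return problems
```

In the notebook, calling `check_config_path(config_json_path)` and confirming that it returns an empty list is one way to catch a mistyped path before the sync starts. It does not validate the configuration settings themselves.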

In [None]:
config_json_path = "<<<PATH_TO_USER_CONFIG>>>"
schedule_type = "AUTO"
optimize_metadata = False

In [None]:
from FabricSync.BQ.Loader import *
from FabricSync.DeltaTableUtility import *

# Metadata Sync
This step loads the user-supplied configuration and retrieves the relevant BQ metadata. Once the metadata has been synced to the Fabric lakehouse, the auto-detect process estimates the best way to load the BQ data and persists the result to the metadata lakehouse. You can modify any of the configuration data as needed by either:
-  Updating the <code>bq_sync_configuration</code> table directly
-  Specifying overrides in the user configuration file

Once the scheduler and loader have run for the first time for a given table, its configuration is locked. To change the configuration or fix an error after a table has been loaded, take the following steps:
1. Delete the configuration record from the <code>bq_sync_configuration</code> table
2. Delete any related sync metadata from the <code>bq_sync_schedule</code> and <code>bq_sync_schedule_telemetry</code> tables.
3. Remove the target BQ table data from the target lakehouse manually or with <code>mssparkutils.fs.rm("\<PATH TO BQ TABLE\>", recurse=True)</code>

# Scheduler
Builds the schedule for the tables configured in the step above. Currently only the AUTO schedule type is supported, which evaluates all enabled tables on every load. If a table is static and needs to be skipped, add an entry to the user configuration file to disable the table.

<code>
	"tables": [ 
	{
		"table_name": "<MY BQ TABLE>",
		"enabled": false
	}
	]
</code>
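
The same change can be made programmatically on a loaded config. This is a minimal sketch with a hypothetical helper, `disable_table`, and it assumes `tables` is a top-level key in the user configuration file, which the fragment above suggests but does not confirm. The table names are placeholders.

```python
import json


def disable_table(config, table_name):
    """Mark table_name as disabled in a user-config dict.

    Hypothetical helper: updates an existing entry if one matches,
    otherwise appends a new disabled entry.
    """
    tables = config.setdefault("tables", [])
    for entry in tables:
        if entry.get("table_name") == table_name:
            entry["enabled"] = False
            return config
    tables.append({"table_name": table_name, "enabled": False})
    return config


# Placeholder config and table names for illustration only.
config = {"tables": [{"table_name": "orders", "enabled": True}]}
disable_table(config, "orders")          # existing entry is switched off
disable_table(config, "static_lookup")   # new disabled entry is appended
print(json.dumps(config, indent=2))
```

After editing, the dict would still need to be written back to the path that <code>config_json_path</code> points to.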

# Loader
The async scheduler uses Python threading to keep the configured Spark cluster busy, so that your BQ tables and partitions load as efficiently as possible. The degree of thread parallelism is set in the user configuration file.

<code>
"async": {
	"enabled": true,
	"parallelism": 5,
	"cell_timeout": 36000,
	"notebook_timeout": 72000
	}
</code>

Choose a sensible number of threads based on your configured Spark environment (the default is 5). Setting the degree of parallelism too high will slow the whole process down.
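
To illustrate what a bounded degree of parallelism means, here is a generic sketch using Python's standard `concurrent.futures` module. It is not the FabricSync implementation: `load_table` is a hypothetical stand-in for the per-table load, and the table names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    # Hypothetical stand-in for the work done per table or partition.
    return f"{table_name}: COMPLETE"


def run_loads(table_names, parallelism=5):
    """Run load_table for each name with at most `parallelism` threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {pool.submit(load_table, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[name] = f"{name}: FAILED ({exc})"
    return results


# Placeholder table names; at most two loads run at the same time here.
results = run_loads(["orders", "customers", "events"], parallelism=2)
```

With `max_workers` as the cap, extra tasks wait in a queue until a thread is free, so raising the value only helps while the cluster has spare capacity for the additional concurrent jobs.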

# Metadata Table Maintenance
Hygiene for the BQ sync process metadata tables:
- Optimize
- Vacuum with Retention 0 to minimize data size
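
For reference, the two maintenance operations map onto standard Delta Lake SQL commands. The sketch below only builds the statements. It assumes the three metadata tables named earlier in this notebook are the ones to maintain (the package may manage others), and it is not a description of what the package runs internally. The helper name `maintenance_statements` is hypothetical.

```python
# Metadata tables named elsewhere in this notebook; the list may not be complete.
METADATA_TABLES = [
    "bq_sync_configuration",
    "bq_sync_schedule",
    "bq_sync_schedule_telemetry",
]

# Delta's default retention threshold is 7 days (168 hours).
DEFAULT_RETENTION_HOURS = 168


def maintenance_statements(tables, retain_hours=0):
    """Build OPTIMIZE and VACUUM statements for the given Delta tables."""
    statements = []
    if retain_hours < DEFAULT_RETENTION_HOURS:
        # Delta rejects a VACUUM below the retention threshold unless
        # this safety check is disabled for the session.
        statements.append(
            "SET spark.databricks.delta.retentionDurationCheck.enabled = false"
        )
    for table in tables:
        statements.append(f"OPTIMIZE {table}")
        statements.append(f"VACUUM {table} RETAIN {retain_hours} HOURS")
    return statements


statements = maintenance_statements(METADATA_TABLES)
# In a notebook session, each statement could then be run with spark.sql(stmt).
```

Vacuuming with a retention of 0 removes the files needed for time travel to earlier table versions, so it is best run when no other job is reading or writing these tables.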

# Running BQ Sync

In [None]:
bq_sync = BQSync(spark, config_json_path)
bq_sync.sync_metadata()

If you would like to export the auto-discovered BQ sync configuration to a new user configuration file, un-comment and run the following lines of code. 

This step generates a potentially verbose configuration file with all tables discovered from your BQ dataset. You can tweak and override any of the auto-discovered settings. Please note that there is currently no validation of the configuration set-up, so an invalid configuration may break the sync process.

If configuration changes are made and you re-target the sync process within an existing session, please run the following line of code to clean up any cached configuration before re-running the sync process.

<code>bq_sync.cleanup_session(spark)</code>

After the session is invalidated, you must re-run the <code>BQSync()</code> constructor to force a reload of the user configuration.

Before you continue, carefully evaluate your config for correctness.

Once you run the next step, your load configuration is locked and cannot be changed without manually resetting the sync metadata and sync'd data.

In [None]:
schedule_id = bq_sync.build_schedule(sync_metadata=False, schedule_type=schedule_type)
bq_sync.run_schedule(group_schedule_id=schedule_id, optimize_metadata=optimize_metadata)

In [None]:
display(spark.sql(f"""
SELECT * FROM (
    SELECT status
    FROM bq_sync_schedule
    WHERE group_schedule_id='{schedule_id}'
)
PIVOT (
  COUNT(*)
  FOR status in (
    'COMPLETE', 'SKIPPED', 'FAILED', 'SCHEDULED'
  )
)
"""))