Sample - Azure Data Factory Upsert to Document DB

A sample project demonstrating how to implement upsert logic (strictly speaking, insert-or-replace logic, since DocumentDB does not support partial updates) using Azure Data Factory with a custom C# activity running on an auto-scaling pool of VMs inside Azure Batch. Inspired by Microsoft's how-to and the solution for local debugging of custom C# activities.
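The heart of the approach is the insert-or-replace call itself. As a minimal sketch (not the repository's exact code; the endpoint, key, database and collection names below are placeholders), the DocumentDB .NET SDK's UpsertDocumentAsync provides exactly these semantics:

// Minimal insert-or-replace sketch using the DocumentDB .NET SDK
// (Microsoft.Azure.DocumentDB NuGet package); the endpoint, key and
// names are placeholders, not values from this repository.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class UpsertSketch
{
    static void Main()
    {
        RunAsync().GetAwaiter().GetResult();
    }

    static async Task RunAsync()
    {
        using (var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com:443/"), "<account-key>"))
        {
            Uri collectionUri = UriFactory.CreateDocumentCollectionUri("mydb", "mycollection");
            var document = new { id = "42", name = "sample" };

            // UpsertDocumentAsync inserts the document, or replaces the
            // existing document with the same id in full; DocumentDB has no
            // partial update, so the whole document is rewritten every time.
            await client.UpsertDocumentAsync(collectionUri, document);
        }
    }
}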

Solution Components

  1. Azure Data Factory
  2. Azure Batch
  3. C# custom activity implementing the Insert or Replace logic
  4. Input: Blob Storage Container
  5. Blob Storage Container for compiled C# code
  6. Output: DocumentDB Collection

Provisioning

Provisioning is currently manual (no ARM template yet):

  1. Download the latest Azure Data Factory plugin for Visual Studio
  2. Provision a DocDB database and create a collection
  3. Create a blob storage container for the source data; optionally upload a sample JSON file
  4. Create a blob storage container for the custom activity
  5. Create an Azure Batch pool; you may use the auto-scaling formula shown in the Azure Batch Scaling section below
  6. Fire up the solution in VS2015/2017
  7. Adjust the ADF JSON config files in the DataFactory project
    • edit the linked service JSON files, replacing all occurrences of *** with your values
    • adjust the input/output datasets if your input data is time/date partitioned
    • adjust the pipeline definition json as required
  8. You can debug the solution locally (launch the console app), or alternatively compile the custom C# activity, zip up all the binaries in the \bin\Debug folder as described here, and upload the zip file to the blob storage container specified in the pipeline definition JSON (a skeleton of the activity entry point is sketched after this list)
  9. Deploy the ADF, fire up the pipeline, and observe jobs being created on the Azure Batch pool, new VMs being spun up automatically, and magic happening :-)
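For reference, an ADF (v1) custom .NET activity is a class implementing the IDotNetActivity interface from Microsoft.Azure.Management.DataFactories.Runtime; ADF (or the local console app used for debugging) calls its Execute method. The skeleton below is a hedged sketch of that entry point, not the repository's actual class:

// Skeleton of an ADF v1 custom .NET activity (the class name is
// illustrative). Requires the Microsoft.Azure.Management.DataFactories
// NuGet package.
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

public class DocDbUpsertActivity : IDotNetActivity
{
    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        logger.Write("Custom activity started.");

        // Typical flow: resolve connection strings from linkedServices,
        // read the input blobs, then insert-or-replace each document
        // into the DocumentDB collection.

        return new Dictionary<string, string>();
    }
}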

Azure Batch Scaling

The Azure Batch pool uses the following auto-scaling formula:

startingNumberOfVMs = 0;
maxNumberOfVMs = 5;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberOfVMs, pendingTaskSamples);
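In words: the pool starts with zero VMs; if fewer than 70% of the $PendingTasks samples from the last 180 seconds are available, the target stays at zero, otherwise it is the average number of pending tasks over that window, capped at five dedicated nodes. As a minimal sketch (the pool ID, account URL and key are placeholders), such a formula could be applied with the Azure Batch .NET SDK as follows:

// Sketch: apply the auto-scaling formula above via the Azure Batch
// .NET SDK (Microsoft.Azure.Batch NuGet package). Account details and
// pool ID are assumed, not values from this repository.
using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class EnableAutoScaleSample
{
    static void Main()
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://mybatchaccount.westus.batch.azure.com", // assumed URL
            "mybatchaccount",                                // assumed name
            "<account-key>");

        string formula = @"
            startingNumberOfVMs = 0;
            maxNumberOfVMs = 5;
            pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
            pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
            $TargetDedicated = min(maxNumberOfVMs, pendingTaskSamples);";

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // 'adf-pool' is a placeholder pool ID; Batch re-evaluates the
            // formula at the given interval (minimum five minutes).
            batchClient.PoolOperations.EnableAutoScale(
                "adf-pool",
                autoscaleFormula: formula,
                autoscaleEvaluationInterval: TimeSpan.FromMinutes(5));
        }
    }
}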

Caveats

  1. The C# code assumes all input files are in the arrayOfObjects JSON format; the filePattern setting is ignored, so feel free to implement it yourself (a parsing sketch follows this list)
  2. The ADF setup does not implement slices: each invocation of the pipeline will re-read the same input file over and over again. Please adjust the dataset definitions accordingly
  3. The code can be further optimised
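Regarding caveat 1, arrayOfObjects means each input file is a single JSON array of documents rather than line-delimited JSON. A minimal parsing sketch using Newtonsoft.Json (the helper name is hypothetical, not the repository's code):

// Hypothetical helper illustrating the arrayOfObjects assumption:
// the entire file is one JSON array, e.g. [{"id":"1"},{"id":"2"}].
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

static class ArrayOfObjectsParser
{
    public static IEnumerable<JObject> Parse(string fileContent)
    {
        // JArray.Parse fails fast if the file is not a JSON array,
        // e.g. if it is line-delimited JSON instead.
        return JArray.Parse(fileContent).Cast<JObject>();
    }
}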
