Sample - Azure Data Factory Upsert to Document DB

A sample project demonstrating how to implement upsert logic (strictly speaking, insert-or-replace logic, since DocumentDB does not support partial updates) using Azure Data Factory with a custom C# activity running on an auto-scaling pool of VMs inside Azure Batch. Inspired by Microsoft's how-to and the solution for local debugging of custom C# activities.
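The heart of the approach is the insert-or-replace call itself. As a minimal sketch (not the repository's exact code; the endpoint, key, database and collection names below are placeholders), the DocumentDB .NET SDK's UpsertDocumentAsync provides exactly these semantics:

// Minimal insert-or-replace sketch using the DocumentDB .NET SDK
// (Microsoft.Azure.DocumentDB NuGet package); the endpoint, key and
// names are placeholders, not values from this repository.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

class UpsertSketch
{
    static void Main()
    {
        RunAsync().GetAwaiter().GetResult();
    }

    static async Task RunAsync()
    {
        using (var client = new DocumentClient(
            new Uri("https://myaccount.documents.azure.com:443/"), "<account-key>"))
        {
            Uri collectionUri = UriFactory.CreateDocumentCollectionUri("mydb", "mycollection");
            var document = new { id = "42", name = "sample" };

            // UpsertDocumentAsync inserts the document, or replaces the
            // existing document with the same id in full; DocumentDB has no
            // partial update, so the whole document is rewritten every time.
            await client.UpsertDocumentAsync(collectionUri, document);
        }
    }
}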

Solution Components

  1. Azure Data Factory
  2. Azure Batch
  3. C# custom activity implementing the Insert or Replace logic
  4. Input: Blob Storage Container
  5. Blob Storage Container for compiled C# code
  6. Output: DocumentDB Collection

Provisioning

Provisioning is currently manual (no ARM template yet):

  1. Download the latest Azure Data Factory plugin for Visual Studio
  2. Provision a DocDB database and create a collection
  3. Create a blob storage container for the source data; optionally upload a sample JSON file
  4. Create a blob storage container for the custom activity
  5. Create an Azure Batch pool; you may use the auto-scaling formula shown in the Azure Batch Scaling section below
  6. Fire up the solution in VS2015/2017
  7. Adjust the ADF JSON config files in the DataFactory project
    • edit the linked service JSON files, replacing all occurrences of *** with your values
    • adjust the input/output datasets if your input data is time/date partitioned
    • adjust the pipeline definition json as required
  8. You can debug the solution locally (launch the console app), or alternatively compile the custom C# activity, zip up all the binaries in the \bin\Debug folder as described here, and upload the zip file to the blob storage container specified in the pipeline definition JSON (a skeleton of the activity entry point is sketched after this list)
  9. Deploy the ADF, fire up the pipeline, and observe jobs being created on the Azure Batch pool, new VMs being spun up automatically, and magic happening :-)
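For reference, an ADF (v1) custom .NET activity is a class implementing the IDotNetActivity interface from Microsoft.Azure.Management.DataFactories.Runtime; ADF (or the local console app used for debugging) calls its Execute method. The skeleton below is a hedged sketch of that entry point, not the repository's actual class:

// Skeleton of an ADF v1 custom .NET activity (the class name is
// illustrative). Requires the Microsoft.Azure.Management.DataFactories
// NuGet package.
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

public class DocDbUpsertActivity : IDotNetActivity
{
    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        logger.Write("Custom activity started.");

        // Typical flow: resolve connection strings from linkedServices,
        // read the input blobs, then insert-or-replace each document
        // into the DocumentDB collection.

        return new Dictionary<string, string>();
    }
}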

Azure Batch Scaling

The Azure Batch pool uses the following auto-scaling formula:

startingNumberOfVMs = 0;
maxNumberOfVMs = 5;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberOfVMs, pendingTaskSamples);
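In words: the pool starts with zero VMs; if fewer than 70% of the $PendingTasks samples from the last 180 seconds are available, the target stays at zero, otherwise it is the average number of pending tasks over that window, capped at five dedicated nodes. As a minimal sketch (the pool ID, account URL and key are placeholders), such a formula could be applied with the Azure Batch .NET SDK as follows:

// Sketch: apply the auto-scaling formula above via the Azure Batch
// .NET SDK (Microsoft.Azure.Batch NuGet package). Account details and
// pool ID are assumed, not values from this repository.
using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class EnableAutoScaleSample
{
    static void Main()
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://mybatchaccount.westus.batch.azure.com", // assumed URL
            "mybatchaccount",                                // assumed name
            "<account-key>");

        string formula = @"
            startingNumberOfVMs = 0;
            maxNumberOfVMs = 5;
            pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
            pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
            $TargetDedicated = min(maxNumberOfVMs, pendingTaskSamples);";

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // 'adf-pool' is a placeholder pool ID; Batch re-evaluates the
            // formula at the given interval (minimum five minutes).
            batchClient.PoolOperations.EnableAutoScale(
                "adf-pool",
                autoscaleFormula: formula,
                autoscaleEvaluationInterval: TimeSpan.FromMinutes(5));
        }
    }
}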

Caveats

  1. The C# code assumes all input files are in the arrayOfObjects JSON format; the filePattern setting is ignored, so feel free to implement it yourself (a parsing sketch follows this list)
  2. The ADF setup does not implement slices: each invocation of the pipeline will re-read the same input file over and over again. Please adjust the dataset definitions accordingly
  3. The code can be further optimised
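Regarding caveat 1, arrayOfObjects means each input file is a single JSON array of documents rather than line-delimited JSON. A minimal parsing sketch using Newtonsoft.Json (the helper name is hypothetical, not the repository's code):

// Hypothetical helper illustrating the arrayOfObjects assumption:
// the entire file is one JSON array, e.g. [{"id":"1"},{"id":"2"}].
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json.Linq;

static class ArrayOfObjectsParser
{
    public static IEnumerable<JObject> Parse(string fileContent)
    {
        // JArray.Parse fails fast if the file is not a JSON array,
        // e.g. if it is line-delimited JSON instead.
        return JArray.Parse(fileContent).Cast<JObject>();
    }
}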
