
Azure ML Pipeline/Azure Data Factory Sample

This repo contains sample code for creating an Azure Machine Learning pipeline that saves outputs to an AML-linked blob store and can be readily adapted for batch scoring. Below we include instructions for deploying the pipeline in an Azure ML workspace, and for configuring an Azure Data Factory pipeline to execute it on a regular schedule and copy results from the AML-linked blob store into an Azure SQL Database.

Setting up your Virtual Environment

Azure Machine Learning

A requirements.txt file is included listing the packages that need to be installed in your working environment. To install them, run the following from a notebook cell:

%pip install -r requirements.txt

VS Code

A vscode_environment.yml file is included in this repository for creating a virtual environment for this project. You can create and populate a conda environment from the VS Code terminal with the following commands:

conda create -n name_of_env -y
conda activate name_of_env
conda env update --name name_of_env --file vscode_environment.yml

The last command installs the proper version of Python for the environment along with all of the conda and pip packages.

Cloning the code into your project

The instructions below assume you have provisioned an Azure Machine Learning workspace and a Compute Instance within that workspace. You can clone the code in this repo into your workspace by executing the following command in a terminal. The terminal icon is located above the folder tree, toward the upper right-hand side.

[Screenshot: Clone a git repository]

A terminal will now open where you can navigate the directory tree; make sure to change directory to the appropriate folder. If the terminal does not show a command prompt, you need to select a compute to work from. If you don't have a compute, you can easily create one from the Compute tab in the left navigation bar.

[Screenshot: Terminal]

git clone https://github.com/nickwiecien/AML_ADF_PipelineSample

The command will pull the code repository down into the current folder, and the project will appear in the folder navigation pane on the left.

[Screenshot: Terminal]

After cloning this repo, execute the notebooks 01_Demo_Env_Setup.ipynb and 02_AML_Pipeline_Setup.ipynb in sequence. Note: to run all steps of this demo you will need access to an Azure SQL database that you can log into using SQL authentication. There is a section in the 01 notebook containing a pyodbc snippet for creating a new table.
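For reference, such a snippet follows the pattern sketched below. All connection values and the table schema here are placeholders; the actual snippet (and the columns it creates) lives in the 01 notebook.

import pyodbc

# Placeholder connection details - substitute your own server, database,
# and SQL authentication credentials.
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=your-server.database.windows.net;'
    'DATABASE=your-database;'
    'UID=your-username;PWD=your-password'
)

# Hypothetical schema - match the columns your scoring pipeline actually emits.
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE dbo.mydata (
    Column1 FLOAT,
    Column2 FLOAT,
    ScoredTimestamp DATETIME
)
''')
conn.commit()
conn.close()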

Once you have run both notebooks you should see an Azure Machine Learning pipeline named 'Sample Scoring Pipeline' available under your Pipeline Endpoints tab.
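You can also verify the endpoint programmatically. A minimal sketch using the v1 Azure ML Python SDK, assuming you run it from a compute instance in the same workspace (where Workspace.from_config() can resolve the workspace):

from azureml.core import Workspace
from azureml.pipeline.core import PipelineEndpoint

ws = Workspace.from_config()  # resolves the workspace from the local config
endpoint = PipelineEndpoint.get(workspace=ws, name='Sample Scoring Pipeline')
print(endpoint.id)  # the ID surfaced as 'Pipeline endpoint ID' in ADF below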

[Screenshot: Published Pipeline Endpoint]

Azure Data Factory Pipeline Setup

Update Linked Services

Inside your Azure Data Factory workspace, create linked services for Azure Machine Learning, the associated Default Datastore (Azure Storage Account), and the target Azure SQL Database. Instructions for creating linked services can be found here.

  • AML_Workspace (Azure ML Workspace)
  • AML_BlobStore (Azure ML-Linked Storage Account)
  • AzSQLDB (Azure SQL Database)

[Screenshot: Azure Data Factory Linked Services]

Create Datasets

Create a new dataset inside ADF from your linked AML_BlobStore account and select 'Delimited text' under file options. Configure your dataset to read from the default blobstore container (its name will have syntax like azureml-blobstore-xxxxxxxxxxxx...) and the scored_data subdirectory.
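If you're unsure of the container name, one way to look it up is from the AML side; a short sketch with the v1 SDK, assuming the workspace's default datastore is blob-backed (it is unless you've changed it):

from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()
print(datastore.container_name)  # e.g., azureml-blobstore-xxxxxxxxxxxx
print(datastore.account_name)    # storage account behind the AML_BlobStore linked service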

[Screenshot: CSV Dataset]

Create a new dataset from your linked AzSQLDB database pointing at the dbo.mydata table created in your execution of the attached notebooks.

[Screenshot: SQL Dataset]

Create ADF Pipeline

Create a new ADF pipeline and, before adding any steps, configure a pipeline variable called filename. Set the type to 'String' and the default value to 'test_file.csv'.

[Screenshot: Pipeline Variable]

Add a 'Set variable' step to your pipeline. In the step's settings, under the 'Variables' tab, select the filename variable and add the following as dynamic content:

@concat(formatDateTime(convertTimeZone(utcNow(),'UTC','Eastern Standard Time'),'yyyyMMddHHmmss'), '.csv')
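This expression yields a unique, timestamp-based filename for each run, e.g., 20240115143000.csv (Eastern time), so successive runs write distinct files.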

[Screenshot: Set Variable]

As a second step, add a 'Machine Learning Execute Pipeline' step. Under the settings tab, select your linked AML workspace resource, then under 'Pipeline endpoint ID' navigate to 'Sample Scoring Pipeline.' Under the 'Machine Learning pipeline parameters' section, add a single key-value pair with filename as the key and @variables('filename') as the value. This passes the pipeline variable set in the previous step into your AML pipeline.
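For context, ADF can only pass this value because the AML pipeline declares a matching parameter. In the v1 SDK this is a PipelineParameter, roughly as sketched below; the step, script, and compute names are illustrative, and the actual definitions live in 02_AML_Pipeline_Setup.ipynb.

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# The parameter name must match the key used in the ADF step ('filename').
filename_param = PipelineParameter(name='filename', default_value='test_file.csv')

scoring_step = PythonScriptStep(
    name='score_data',                        # illustrative step name
    script_name='score.py',                   # illustrative script name
    arguments=['--filename', filename_param],
    compute_target='cpu-cluster',             # placeholder compute target
    source_directory='.'
)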

[Screenshot: Machine Learning Execute Pipeline]

As a final step, add a 'Copy data' step. Here we will configure the file written by the AML pipeline to the linked datastore as a source, and the Azure SQL DB table as a sink.

Under 'Source', select your configured CSV dataset. Under 'File path type', select 'Wildcard file path' and enter the default blob storage container + scored_data + @variables('filename') as your effective path.
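For example, after the 'Set variable' step has run, the effective path will resemble azureml-blobstore-xxxxxxxxxxxx/scored_data/20240115143000.csv.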

[Screenshot: Configure Source]

Under 'Sink', select your Azure SQL Database table dataset. Under 'Write behavior', select 'Insert.'

[Screenshot: Configure Sink]

Finally, under 'Mapping' you can either update the mapping manually or import schemas and link source columns to sink columns as shown below.

[Screenshot: Configure Mapping]

Once complete, click 'Publish all' to save changes to your pipeline. Then add a trigger to run the pipeline on a regular schedule.

[Screenshot: Configure Schedule]
