## Azure Machine Learning

Create a workspace before running this notebook.

In this notebook we build and deploy a simple model that predicts the sector (`hovedenhet_sektor`) of vacant positions in the [NAV dataset](https://www.kaggle.com/kongharald/navledigestillinger) on Kaggle.

In [None]:
! pip install kaggle &> /dev/null

In [6]:
## Get the dataset
# https://www.kaggle.com/kongharald/navledigestillinger
import kaggle

kaggle.api.authenticate()

kaggle.api.dataset_download_files('kongharald/navledigestillinger', path='./', unzip=True)

In [1]:
## Read the data
import pandas as pd
df = pd.read_csv('nav_ledigestillinger.tsv', delimiter='\t', nrows=5000)
df.head()
df = df.dropna(subset=['beskrivelse', 'hovedenhet_sektor'])
df.to_csv('nav_ledigestillinger_minimal2.tsv', sep='\t')
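The dataset preview further down shows a `Column1` column holding the original row numbers. That is most likely the pandas index, which `to_csv` writes with an empty header by default. A minimal sketch on an invented toy frame (not the NAV data) of how `index=False` would avoid the extra column; when pandas reads such a file back it names the column `Unnamed: 0`:

```python
import io
import pandas as pd

# Toy stand-in for the NAV data; the values are invented for illustration.
toy = pd.DataFrame({
    "beskrivelse": ["Vi trenger en ingenior", None, "Deltid i butikk"],
    "hovedenhet_sektor": ["Privat", "Statlig", None],
})
# Same filtering as in the cell above: keep rows that have both columns.
toy = toy.dropna(subset=["beskrivelse", "hovedenhet_sektor"])

# Default behaviour: the index is written as a first column with an empty header.
with_index = io.StringIO()
toy.to_csv(with_index, sep="\t")
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index, sep="\t").columns)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, sep="\t", index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index, sep="\t").columns)

print(cols_with_index)
print(cols_without_index)
```

The training script only reads `beskrivelse` and `hovedenhet_sektor`, so the extra column is harmless there.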

## Azure ML

First we get references to the required resources and upload our dataset.

In [16]:
! pip install --upgrade azureml-sdk mlflow azureml-mlflow &> /dev/null

In [None]:
## Workaround on Arch Linux: make dotnetcore2 treat the system as Ubuntu 18.10
from dotnetcore2 import runtime
runtime.version = ("18", "10", "0")
runtime.dist = 'ubuntu'

## Get workspace
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()

## Get a new datastore reference
default_ds = ws.get_default_datastore()

In [None]:
## Push to azure ml
default_ds.upload_files(files=['nav_ledigestillinger_minimal2.tsv'], # Upload the NAV TSV file written above
                       target_path='nav_dataset/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

In [19]:
from azureml.core import Dataset
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'nav_dataset/nav_ledigestillinger_minimal2.tsv'), separator='\t')

# Display the first 5 rows as a Pandas dataframe
tab_data_set.take(5).to_pandas_dataframe()

Unnamed: 0,Column1,stilling_id,tittel,antall_stillinger,yrkeskode_styrk08,yrkeskode_styrk08_navn,yrkeskode_hovedkategori_navn,regdato,siste_publisert_dato,kommunenr,...,aktiv_flagg,org_nace,hovedenhet,hovedenhet_navn,hovedenhet_sektor,tilleggskriterium,beskrivelse,sprak,kilde,nav_enhet_kode
0,0,1894669,Senior Ingeniør Konstruksjonsteknikk,1,2142,Sivilingeniører (bygg og anlegg),Ingeniør- og ikt-fag,2009-02-05,2009-12-31,301.0,...,1.0,74909.0,891697872.0,SEATOWER AS,Privat og offentlig næringsvirksomhet,"Dagtid, Fast stilling, Heltid",Har du tung erfaring innen design av konstruks...,no,Reg av arb.giver på nav.no,334
1,2,2633246,Deltid og ekstrahjelp,2,5223,Butikkmedarbeidere,Butikk- og salgsarbeid,2011-07-13,2012-05-15,1106.0,...,0.0,47721.0,996406431.0,UKT HAUGESUND AS,Privat og offentlig næringsvirksomhet,"Deltid, Fast stilling, Vakt",<p>Vi søker en deltidsansatt og en ekstrahjelp...,no,Reg av arb.giver på nav.no,1106
2,3,2690643,Fysioterapeut,1,2264,Fysioterapeuter,"Helse, pleie og omsorg",2011-11-23,2011-12-23,301.0,...,1.0,86902.0,988131482.0,MARIA JOHANSSON,Privat og offentlig næringsvirksomhet,"Dagtid, Deltid, Heltid, Vikar",<html />,,Reg av arb.giver på nav.no,316
3,4,2691938,gårdsarbeider/ dyrepasser,1,6121,Melke- og husdyrprodusenter,"Jordbruk, skogbruk og fiske",2011-11-25,2011-12-01,822.0,...,0.0,1190.0,987202130.0,ANJA IREN MOLAND MIDTBØ,Privat og offentlig næringsvirksomhet,"Dagtid, Deltid, Fast stilling, Heltid",<html />,,Reg av arb.giver på nav.no,822
4,5,2698137,Vikar - helg/deltid,1,8322,"Bil-, drosje- og varebilførere",Reiseliv og transport,2011-12-09,2011-12-31,1504.0,...,1.0,49410.0,984635648.0,FRØYSAGARDEN V/HANS FREDRIK KLOKK,Privat og offentlig næringsvirksomhet,"Deltid, Helg, Vikar",<p>Stillingen innebærer kjøring av varebil nat...,no,Reg av arb.giver på nav.no,1528


In [20]:
tab_data_set = tab_data_set.register(workspace=ws, 
                                    name='NAV Dataset',
                                    description='Vacant positions from NAV on Kaggle https://www.kaggle.com/kongharald/navledigestillinger/metadata',
                                    tags = {'format':'TSV'},
                                    create_new_version=True)

### Programs for training, model handling and deployment

The following cells write scripts to the filesystem; the pipeline runs them later.

In [2]:
import os
# Create a folder for the pipeline step files
experiment_folder = 'nav_pipeline'
os.makedirs(experiment_folder, exist_ok=True)

In [3]:
%%writefile $experiment_folder/train_nav.py
# Import libraries
from azureml.core import Run
import argparse
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--output_folder', type=str, dest='output_folder', default="nav_model", help='output folder')
args = parser.parse_args()
output_folder = args.output_folder

# Get the experiment run context
run = Run.get_context()

# Load the NAV data (passed as an input dataset)
print("Loading Data...")
nav = run.input_datasets['nav_train'].to_pandas_dataframe()

nav['beskrivelse_length'] = nav['beskrivelse'].str.len() 
# Separate features and labels
X, y = nav[['beskrivelse_length']].values, nav['hovedenhet_sektor'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))  # np.float was removed in NumPy 1.24

# Save the trained model
os.makedirs(output_folder, exist_ok=True)
output_path = output_folder + "/model.pkl"
joblib.dump(value=model, filename=output_path)

run.complete()

Overwriting nav_pipeline/train_nav.py
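The training logic can be tried locally without an Azure ML run context. The following is a minimal sketch, assuming invented toy data in place of the NAV dataset; it uses the same single feature (description length), the same split, and the same accuracy calculation as `train_nav.py`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: shorter texts labelled 'Kommunal', longer ones 'Privat'.
toy = pd.DataFrame({
    "beskrivelse": ["x" * n for n in range(10, 210, 10)],
    "hovedenhet_sektor": ["Kommunal"] * 10 + ["Privat"] * 10,
})

# Same single feature as train_nav.py: the character length of the description.
toy["beskrivelse_length"] = toy["beskrivelse"].str.len()
X = toy[["beskrivelse_length"]].values
y = toy["hovedenhet_sektor"].values

# Same 70/30 split as the script.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy computed the same way as in the script, as a plain Python float.
acc = float(np.average(model.predict(X_test) == y_test))
print("Accuracy:", acc)
```

With only one numeric feature, the tree can learn nothing beyond length thresholds, so accuracy on the real data should be read as a baseline rather than a useful classifier.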


In [4]:
%%writefile $experiment_folder/register_nav.py
# Import libraries
import argparse
import joblib
from azureml.core import Workspace, Model, Run

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--model_folder', type=str, dest='model_folder', default="nav_model", help='model location')
args = parser.parse_args()
model_folder = args.model_folder

# Get the experiment run context
run = Run.get_context()

# load the model
print("Loading model from " + model_folder)
model_file = model_folder + "/model.pkl"
model = joblib.load(model_file)

Model.register(workspace=run.experiment.workspace,
               model_path = model_file,
               model_name = 'nav_model',
               tags={'Training context':'Pipeline'})

run.complete()

Overwriting nav_pipeline/register_nav.py


In [None]:
from azureml.core.conda_dependencies import CondaDependencies
folder_name = 'nav_service'

# Create a folder for the web service files
deploy_folder = experiment_folder+'/'+folder_name
os.makedirs(deploy_folder, exist_ok=True)

print(folder_name, 'folder created.')

# Add the dependencies for our model (the AzureML defaults are already included)
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")
myenv.add_conda_package("pandas")

# Save the environment config as a .yml file
env_file = deploy_folder + "/nav_env.yml"
with open(env_file,"w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)

In [6]:
%%writefile $deploy_folder/score_nav.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = Model.get_model_path('nav_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Return the predictions as JSON
    return str(predictions)

Overwriting nav_pipeline/nav_service/score_nav.py
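The request handling in `score_nav.py` can be checked locally before deployment. This is a minimal sketch under stated assumptions: `run_local` is a hypothetical stand-in for `run()`, the model is trained on invented data instead of being loaded from `model.pkl`, and the function returns a list rather than `str(predictions)` so the response is valid JSON data. The deployed script above still returns a string.

```python
import json
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in model trained on invented data; the real service loads model.pkl.
model = DecisionTreeClassifier(random_state=0).fit(
    np.array([[100], [200], [800], [900]]),
    np.array(["Kommunal", "Kommunal", "Privat", "Privat"]),
)

def run_local(raw_data, model):
    """Hypothetical mirror of run() in score_nav.py that returns a plain list."""
    # Parse the request body into a 2-D array: one row per ad, one feature per row.
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    # tolist() converts NumPy strings to Python strings, which serialize cleanly.
    return predictions.tolist()

request_body = json.dumps({"data": [[150], [850]]})
result = run_local(request_body, model)
print(result)
```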


In [15]:
%%writefile $deploy_folder/deploy_nav.py
import os

from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.conda_dependencies import CondaDependencies 
from azureml.core import Workspace, Model, Run

# Get the correct references to needed resources
run = Run.get_context()
ws = run.experiment.workspace
model = ws.models['nav_model']

# Configure the scoring environment
inference_config = InferenceConfig(runtime= "python",
                                   source_directory = os.path.dirname(os.path.realpath(__file__)),
                                   entry_script="score_nav.py",
                                   conda_file="nav_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

service_name = "nav-service"

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config, overwrite=True)

service.wait_for_deployment(True)
print(service.state)

Overwriting nav_pipeline/nav_service/deploy_nav.py


In [16]:
! tree ./nav_pipeline

./nav_pipeline
|-- nav_service
|   |-- deploy_nav.py
|   |-- nav_env.yml
|   `-- score_nav.py
|-- register_nav.py
`-- train_nav.py

1 directory, 5 files


## Computation

Select a compute cluster to run the experiment, creating it if it does not exist yet.

In [17]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "ml-basic-clst1"

try:
    # Check for existing compute target
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        pipeline_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

Found existing cluster, use it.


## Run configuration

Define the Python environment (conda and pip dependencies) and the run configuration that the pipeline steps use.

In [18]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration

# Create a Python environment for the experiment
nav_env = Environment("nav-pipeline-env")
nav_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
nav_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies
nav_packages = CondaDependencies.create(conda_packages=['scikit-learn','pandas'],
                                             pip_packages=['azureml-defaults','azureml-dataprep[pandas]'])

# Add the dependencies to the environment
nav_env.python.conda_dependencies = nav_packages

# Register the environment (just in case you want to use it again)
nav_env.register(workspace=ws)
registered_env = Environment.get(ws, 'nav-pipeline-env')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above. 
pipeline_run_config.target = pipeline_cluster
#pipeline_run_config.target = 'local'

# Assign the environment to the run configuration
pipeline_run_config.environment = registered_env

print ("Run configuration created.")

Run configuration created.


## Define the pipeline

In [19]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.train.estimator import Estimator

# Get the training dataset
nav_ds = ws.datasets.get("NAV Dataset")  # Same name as used when registering the dataset

# Create a PipelineData (temporary Data Reference) for the model folder
model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

estimator = Estimator(source_directory=experiment_folder,
                        compute_target = pipeline_cluster,
                        environment_definition=pipeline_run_config.environment,
                        entry_script='train_nav.py')

# Step 1, run the estimator to train the model
train_step = EstimatorStep(name = "Train Model",
                           estimator=estimator, 
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[nav_ds.as_named_input('nav_train')],
                           outputs=[model_folder],
                           compute_target = pipeline_cluster,
                           allow_reuse = True)

# Step 2, run the model registration script
register_step = PythonScriptStep(name = "Register Model",
                                source_directory = experiment_folder,
                                script_name = "register_nav.py",
                                arguments = ['--model_folder', model_folder],
                                inputs=[model_folder],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

# Step 3, run the model deployment script
deploy_step = PythonScriptStep(name = "Deploy Model",
                                source_directory = deploy_folder,
                                script_name = "deploy_nav.py",
                                arguments = [],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

print("Pipeline steps defined")

Pipeline steps defined


## Run the pipeline

In [None]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline, StepSequence

# Construct the pipeline
pipeline_steps = StepSequence(steps=[train_step, register_step, deploy_step])
pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'nav-training-pipeline')
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")
pipeline_run.wait_for_completion()

## Run a prediction

Call the scoring endpoint of the deployed service with an HTTP POST request.

In [70]:
service_name = "nav-service"
endpoint = ws.webservices[service_name].scoring_uri
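The model was trained on a single feature, the character length of `beskrivelse`, so the request body must contain one length per ad rather than the text itself. A minimal sketch of building that body from raw descriptions; `build_payload` is a hypothetical helper and the example texts are invented:

```python
import json

def build_payload(descriptions):
    """Turn raw description texts into the JSON body the scoring script expects."""
    # One row per ad, one feature per row: the character length of the text.
    rows = [[len(text)] for text in descriptions]
    return json.dumps({"data": rows})

payload = build_payload(["Vi trenger en ingenior", "Deltid"])
print(payload)
```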

In [71]:
import requests
import json

x_new = [[222],[555]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Set the content type
headers = { 'Content-Type':'application/json' }

predictions = requests.post(endpoint, input_json, headers = headers)
predicted_classes = predictions.json()
predicted_classes

"['Kommunal forvaltning' 'Privat og offentlig næringsvirksomhet']"
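The service returns the `str()` form of a NumPy array, not a JSON list, so the labels have to be extracted from that string. A minimal sketch, assuming no label contains a single quote; `parse_prediction_string` is a hypothetical helper:

```python
import re

def parse_prediction_string(text):
    """Extract the quoted labels from the str() form of a NumPy string array."""
    # Assumption: labels never contain a single quote themselves.
    return re.findall(r"'([^']*)'", text)

response_text = "['Kommunal forvaltning' 'Privat og offentlig næringsvirksomhet']"
labels = parse_prediction_string(response_text)
print(labels)
```

Returning `predictions.tolist()` from the scoring script would make this parsing unnecessary, at the cost of redeploying the service.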