diff --git a/README.md b/README.md index fad4163873..86058697a5 100644 --- a/README.md +++ b/README.md @@ -91,7 +91,7 @@ notebooks. See the [documentation](docs/docker.md) for more on Docker use. > To read the EULA for using the docker image, run \ > `docker run -it -p 8888:8888 microsoft/mmlspark eula` -#### GPU VM Setup +### GPU VM Setup MMLSpark can be used to train deep learning models on a GPU node from a Spark application. See the instructions for [setting up an Azure GPU diff --git a/docs/gpu-setup.md b/docs/gpu-setup.md index 4493c7589f..a0aa612f34 100644 --- a/docs/gpu-setup.md +++ b/docs/gpu-setup.md @@ -2,20 +2,23 @@ ## Requirements -CNTK training using MMLSpark in Azure requires an HDInsight Spark cluster and a -GPU virtual machine (VM). The GPU VM should be reachable via SSH from the -cluster, but no public SSH access (or even a public IP address) is required. -As an example, it can be on a private Azure virtual network (VNet), and within -this VNet, it can be addressed directly by its name and access the Spark -clsuter nodes (e.g., use the active NameNode RPC endpoint). - -See the original [copyright and license notices](third-party-notices.txt) of -third party software used by MMLSpark. +CNTK training using MMLSpark in Azure requires an HDInsight Spark +cluster and a GPU virtual machine (VM). The GPU VM should be reachable +via SSH from the cluster, but no public SSH access (or even a public IP +address) is required, and the cluster's NameNode should be accessible +from the GPU machine via the HDFS RPC. As an example, it can be on a +private Azure virtual network (VNet), and within this VNet, it can be +addressed directly by its name and access the Spark cluster nodes (e.g., +use the active NameNode RPC endpoint). + +(See the original [copyright and license +notices](third-party-notices.txt) of third party software used by +MMLSpark.) ### Data Center Compatibility -Currently, not all data centers have GPU VMs available. See [the Linux -VMs page](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) +Currently, not all data centers have GPU VMs available. See [the Linux VMs +page](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) to check availability in your data center. ## Connect an HDI cluster and a GPU VM via the ARM template @@ -44,21 +47,7 @@ the associated GPU VM: - `gpuVirtualMachineName`: The name of the GPU virtual machine to create - `gpuVirtualMachineSize`: The size of the GPU virtual machine to create -If you need to further configure the environment (for example, to change [the -class of VM -sizes](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) -for HDI cluster nodes), modify the template directly before deployment. See -also [the guide for best ARM template -practices](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-template-best-practices). -For the naming rules and restrictions for Azure resources please refer to the -[Naming conventions -article](https://docs.microsoft.com/en-us/azure/architecture/best-practices/naming-conventions). - -There are actually three templates that are used for deployment: - [`deploy-main-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json): - This is the main template. It referencs the following two child - templates — these are relative references so they are expected to be - found in the same base URL.
+There are actually two additional templates that are used from this main template: - [`spark-cluster-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/spark-cluster-template.json): A template for creating an HDI Spark cluster within a VNet, including MMLSpark and its dependencies. (This template installs MMLSpark using @@ -69,46 +58,40 @@ There are actually three templates that are used for deployment: CNTK and other dependencies that MMLSpark needs for GPU training. (This is done via a script action that runs [`gpu-setup.sh`](https://mmlspark.azureedge.net/buildartifacts/0.9/gpu-setup.sh).) - -Note that the last two child templates can also be deployed independently, if +Note that these child templates can also be deployed independently, if you don't need both parts of the installation. ## Deploying an ARM template ### 1. Deploy an ARM template within the [Azure Portal](https://ms.portal.azure.com/) -An ARM template can be opened within the Azure Portal via the following REST -API: +[Click here to open the above +template](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fmmlspark.azureedge.net%2Fbuildartifacts%2F0.9%2Fdeploy-main-template.json) +in the Azure portal. - https://portal.azure.com/#create/Microsoft.Template/uri/ +(If needed, you can click the **Edit template** button to view and edit the +template.) -The URI can be one for either an *Azure Blob* or a *GitHub file*. For example, +This link uses the Azure Portal API: - https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fmystorage.blob.core.windows.net%2Fdeploy-main-template.json + https://portal.azure.com/#create/Microsoft.Template/uri/〈ARM-template-URI〉 -(Note that the URL is percent-encoded.) Clicking on the above link will -open the template in the Portal. If needed, click the **Edit template** button -(see screenshot below) to view and edit the template. +where the template URI is percent-encoded. -![ARM template in Portal](http://image.ibb.co/gZ6iiF/arm_Template_In_Portal.png) +### 2. Deploy an ARM template with MMLSpark Azure CLI 2.0 -### 2. Deploy an ARM template with [MMLSpark Azure CLI 2.0](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.sh) +We also provide a convenient shell script to create a deployment on the +command line: -MMLSpark provides an Azure CLI 2.0 script -([`deploy-arm.sh`](../tools/deployment/deploy-arm.sh)) to deploy an ARM -template (such as -[`deploy-main-template.json`](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-main-template.json)) -along with a parameter file (see -[deploy-parameters.template](../tools/deployment/deploy-parameters.template) -for a template of such a file). +* Download the [shell + script](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.sh) + and make a local copy of it -> Note that you cannot use the -> [template file](../tools/deployment/deploy-main-template.json) from -> the source tree, since it requires additional resources that are -> created by the build (specifically, a working version of -> [`install-mmlspark.sh`](../tools/hdi/install-mmlspark.sh)). +* Create a JSON parameter file by downloading [this template + file](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-parameters.template) + and modifying it according to your specifications.
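For illustration, here is a rough sketch of what a filled-in parameters file could look like. This is only a sketch: the authoritative list of parameters comes from the `deploy-parameters.template` file you just downloaded (and from `deploy-main-template.json`), the file name is arbitrary, and every value below is a placeholder.

    # hypothetical working copy of the parameter file; adjust the parameter
    # names and values to match the deploy-parameters.template you downloaded
    cat > deploy-parameters.json <<'EOF'
    {
      "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
      "contentVersion": "1.0.0.0",
      "parameters": {
        "clusterName":           { "value": "my-mmlspark-cluster" },
        "gpuVirtualMachineName": { "value": "my-gpu-vm" },
        "gpuVirtualMachineSize": { "value": "Standard_NC6" }
      }
    }
    EOF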
-The script take the following arguments: +You can now run the script — it takes the following arguments: - `subscriptionId`: The GUID that identifies your subscription (e.g., `01234567-89ab-cdef-0123-456789abcdef`), defaults to setting in your `az` environment. @@ -118,13 +101,10 @@ The script take the following arguments: `East US`), note that this is required if creating a new resource group. - `deploymentName`: The name for this deployment. -- `templateLocation`: The URL of an ARM template file, or the path to - one. By default, it is set to `deploy-main-template.json` in the same - directory, but note that this will normally not work without the rest - of the required resources. -- `parametersFilePath`: The path to the parameter file, which you need - to create. Use `deploy-parameters.template` as a template for - creating a parameters file. +- `templateLocation`: The URL of an ARM template file. By default, it + is set to the above main template. +- `parametersFilePath`: The path to the parameter file, which you have + created. Run the script with a `-h` or `--help` to see the flags that are used to set these arguments: @@ -132,15 +112,17 @@ set these arguments: ./deploy-arm.sh -h If no flags are specified on the command line, the script will prompt -you for all values. If needed, install the Azure CLI 2.0 using the -instruction found in the [Azure CLI Installation -Guide](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli). +you for all needed values. + +> Note that the script uses the Azure CLI 2.0; see the +> [Azure CLI Installation Guide](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) +> if you need to install it. -### 3. Deploy an ARM template with the [MMLSpark Azure PowerShell](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.ps1) +### 3. Deploy an ARM template with the MMLSpark Azure PowerShell MMLSpark also provides a [PowerShell script](https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.ps1) -to deploy ARM templates, similar to the above bash script, run it with +to deploy ARM templates, similar to the above bash script. Run it with `-?` to see the usage instructions (or use `get-help`). If needed, install the Azure PowerShell cmdlets using the instructions in the [Azure PowerShell @@ -164,7 +146,7 @@ Azure will stop billing if a VM is in a "Stopped (**Deallocated**)" state, which is different from the "Stopped" state. So make sure it is *Deallocated* to avoid billing. In the Azure Portal, clicking the "Stop" button will put the VM into a "Stopped (Deallocated)" state and clicking the "Start" button brings -it VM. See "[Properly Shutdown Azure VM to Save +it back up. See "[Properly Shutdown Azure VM to Save Money](https://buildazure.com/2017/03/16/properly-shutdown-azure-vm-to-save-money/)" for futher details.
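Putting the command-line pieces above together, a minimal (hypothetical) session could look like the following; the file names are placeholders, and running the script with no flags makes it prompt for any missing values:

    # fetch the deployment script and the parameter-file template
    wget https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-arm.sh
    wget https://mmlspark.azureedge.net/buildartifacts/0.9/deploy-parameters.template
    cp deploy-parameters.template deploy-parameters.json   # then edit the values
    chmod +x deploy-arm.sh
    ./deploy-arm.sh -h     # show the flags for the arguments listed above
    ./deploy-arm.sh        # with no flags, the script prompts for all needed values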
diff --git a/notebooks/gpu/401 - CNTK train on HDFS.ipynb b/notebooks/gpu/401 - CNTK train on HDFS.ipynb index af52efe354..8b735c9844 100644 --- a/notebooks/gpu/401 - CNTK train on HDFS.ipynb +++ b/notebooks/gpu/401 - CNTK train on HDFS.ipynb @@ -96,12 +96,7 @@ "brainscriptText = \"\"\"\n", " # ConvNet applied on CIFAR-10 dataset, with no data augmentation.\n", "\n", - " command = TrainNetwork\n", - "\n", - " precision = \"double\"; traceLevel = 1 ; deviceId = \"auto\"\n", - "\n", - " rootDir = \"../../..\" ; dataDir = \"$$rootDir$$/DataSets/CIFAR-10\" ;\n", - " outputDir = \"./Output\" ;\n", + " parallelTrain = true\n", "\n", " TrainNetwork = {\n", " action = \"train\"\n", @@ -148,7 +143,7 @@ "\n", " SGD = {\n", " epochSize = 0\n", - " minibatchSize = 256\n", + " minibatchSize = 32\n", "\n", " learningRatesPerSample = 0.0015625*10:0.00046875*10:0.00015625\n", " momentumAsTimeConstant = 0*20:607.44\n", @@ -164,18 +159,7 @@ " dataParallelSGD = { gradientBits = 1 }\n", " }\n", " }\n", - "\n", - " reader = {\n", - " readerType = \"CNTKTextFormatReader\"\n", - " file = \"$$DataDir$$/Train_cntk_text.txt\"\n", - " randomize = true\n", - " keepDataInMemory = true # cache all data in memory\n", - " input = {\n", - " features = { dim = 3072 ; format = \"dense\" }\n", - " labels = { dim = 10 ; format = \"dense\" }\n", - " }\n", - " }\n", - "}\n", + " }\n", "\"\"\"" ] }, diff --git a/tools/deployment/deploy-arm.ps1 b/tools/deployment/deploy-arm.ps1 index 5b251a1604..d9c23f840f 100755 --- a/tools/deployment/deploy-arm.ps1 +++ b/tools/deployment/deploy-arm.ps1 @@ -25,9 +25,9 @@ .PARAMETER deploymentName The deployment name. - .PARAMETER templateFilePath - Path of the template file to deploy. - Optional, defaults to deploy-main-template.json in this directory. + .PARAMETER templateLocation + URL of the template to deploy. + Optional, defaults to the one corresponding to this script. 
.PARAMETER parametersFilePath Path of the parameters file to use for the template, use @@ -57,6 +57,7 @@ param( [string] $resourceGroupName, + [Parameter(Mandatory=$False)] [string] $resourceGroupLocation, @@ -64,30 +65,40 @@ param( [string] $deploymentName, + [Parameter(Mandatory=$False)] [string] - $templateFilePath = "deploy-main-template.json", + $templateLocation, [Parameter(Mandatory=$True)] [string] $parametersFilePath ) +# <=<= this line is replaced with variables defined with `defvar -X` =>=> +$DOWNLOAD_URL = "$STORAGE_URL/$MML_VERSION" +# TODO: throw an error if $MML_VERSION is not defined + <# .SYNOPSIS Registers RPs #> Function RegisterRP { - Param( - [string]$ResourceProviderNamespace - ) - Write-Host "Registering resource provider '$ResourceProviderNamespace'"; - Register-AzureRmResourceProvider -ProviderNamespace $ResourceProviderNamespace; + Param( + [string]$ResourceProviderNamespace + ) + Write-Host "Registering resource provider '$ResourceProviderNamespace'"; + Register-AzureRmResourceProvider -ProviderNamespace $ResourceProviderNamespace; } #****************************************************************************** # Script body # Execution begins here #****************************************************************************** + +if (!$templateLocation) { + $templateLocation = $DOWNLOAD_URL + "/deploy-main-template.json"; +} + $ErrorActionPreference = "Stop" # sign in @@ -101,29 +112,29 @@ Select-AzureRmSubscription -SubscriptionID $subscriptionId; # Register RPs $resourceProviders = @("microsoft.hdinsight"); if ($resourceProviders.length) { - Write-Host "Registering resource providers" - foreach ($resourceProvider in $resourceProviders) { - RegisterRP($resourceProvider); - } + Write-Host "Registering resource providers" + foreach ($resourceProvider in $resourceProviders) { + RegisterRP($resourceProvider); + } } #Create or check for existing resource group $resourceGroup = Get-AzureRmResourceGroup -Name $resourceGroupName -ErrorAction SilentlyContinue if (!$resourceGroup) { - Write-Host "Resource group '$resourceGroupName' does not exist. To create a new resource group, please enter a location."; - if (!$resourceGroupLocation) { - $resourceGroupLocation = Read-Host "resourceGroupLocation"; - } - Write-Host "Creating resource group '$resourceGroupName' in location '$resourceGroupLocation'"; - New-AzureRmResourceGroup -Name $resourceGroupName -Location $resourceGroupLocation + Write-Host "Resource group '$resourceGroupName' does not exist. 
To create a new resource group, please enter a location."; + if (!$resourceGroupLocation) { + $resourceGroupLocation = Read-Host "resourceGroupLocation"; + } + Write-Host "Creating resource group '$resourceGroupName' in location '$resourceGroupLocation'"; + New-AzureRmResourceGroup -Name $resourceGroupName -Location $resourceGroupLocation } else { - Write-Host "Using existing resource group '$resourceGroupName'"; + Write-Host "Using existing resource group '$resourceGroupName'"; } # Start the deployment Write-Host "Starting deployment..."; if (Test-Path $parametersFilePath) { - New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateFile $templateFilePath -TemplateParameterFile $parametersFilePath; + New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateUri $templateLocation -TemplateParameterFile $parametersFilePath; } else { - New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateFile $templateFilePath; + New-AzureRmResourceGroupDeployment -ResourceGroupName $resourceGroupName -TemplateUri $templateLocation; } diff --git a/tools/deployment/deploy-arm.sh b/tools/deployment/deploy-arm.sh index 305f2cb421..c3ea3bfbfd 100755 --- a/tools/deployment/deploy-arm.sh +++ b/tools/deployment/deploy-arm.sh @@ -2,6 +2,15 @@ # Copyright (C) Microsoft Corporation. All rights reserved. # Licensed under the MIT License. See LICENSE in project root for information. +# This script deploys a Spark Cluster and a GPU, see docs/gpu-setup.md +# for details. + +# <=<= this line is replaced with variables defined with `defvar -X` =>=> +DOWNLOAD_URL="$STORAGE_URL/$MML_VERSION" +if [[ -z "$MML_VERSION" ]]; then + echo "Error: this script cannot be executed as-is" 1>&2; exit 1 +fi + set -euo pipefail # -e: exit if any command has a non-zero exit status # -u: unset variables are an error @@ -71,7 +80,8 @@ readarg subscriptionId "Subscription ID" "$cursub" readarg -r resourceGroupName "Resource Group Name" readarg deploymentName "Deployment Name" readarg resourceGroupLocation "Resource Group Location" -readarg templateLocation "Template Location (Path/URL)" "$here/deploy-main-template.json" +readarg templateLocation "Template Location URL" \ + "$DOWNLOAD_URL/deploy-main-template.json" readarg -rf parametersFilePath "Parameters File" if [[ "$subscriptionId" != "$cursub" ]]; then @@ -99,12 +109,7 @@ echo "Starting deployment..." 
args=() if [[ -n "$deploymentName" ]]; then args+=(--name "$deploymentName"); fi args+=(--resource-group "$resourceGroupName") -if [[ "$templateLocation" = "http://"* ]]; then args+=(--template-uri) -elif [[ "$templateLocation" = "https://"* ]]; then args+=(--template-uri) -elif [[ -r "$templateLocation" ]]; then args+=(--template-file) -else failwith "templateLocation is neither a URL, nor does it point at a file" -fi -args+=("$templateLocation") +args+=(--template-uri "$templateLocation") args+=(--parameters "@$parametersFilePath") az group deployment create "${args[@]}" || failwith "Deployment failed" diff --git a/tools/deployment/deploy-main-template.json b/tools/deployment/deploy-main-template.json index 1ead9147c7..9cb1927570 100644 --- a/tools/deployment/deploy-main-template.json +++ b/tools/deployment/deploy-main-template.json @@ -93,8 +93,7 @@ "variables": { "vnetName": "[concat(parameters('clusterName'), '-vnet')]", "subnetName": "subnet1", - "sparkClusterTemplateUrl": "[uri(deployment().properties.templateLink.uri, 'spark-cluster-template.json')]", - "gpuVmTemplateUrl": "[uri(deployment().properties.templateLink.uri, 'gpu-vm-template.json')]", + "thisTemplateUri": "[deployment().properties.templateLink.uri]", "sparkClusterDeploymentName": "sparkClusterTemplate", "vmDeploymentName": "gpuVmTemplate" }, @@ -106,7 +105,7 @@ "properties": { "mode": "incremental", "templateLink": { - "uri": "[variables('sparkClusterTemplateUrl')]", + "uri": "[uri(variables('thisTemplateUri'), 'spark-cluster-template.json')]", "contentVersion": "1.0.0.0" }, "parameters": { @@ -153,7 +152,7 @@ "properties": { "mode": "incremental", "templateLink": { - "uri": "[variables('gpuVmTemplateUrl')]", + "uri": "[uri(variables('thisTemplateUri'), 'gpu-vm-template.json')]", "contentVersion": "1.0.0.0" }, "parameters": { diff --git a/tools/hdi/setup-ssh-access.sh b/tools/hdi/setup-ssh-access.sh index b920b951bf..33295187cf 100755 --- a/tools/hdi/setup-ssh-access.sh +++ b/tools/hdi/setup-ssh-access.sh @@ -17,7 +17,7 @@ if [[ "$#" = 0 || "$1" = "-h" || "$1" == "--help" ]]; then usage; fi vmname="$1"; shift user="$1"; shift -if [[ -z "$vmname" "x$1" == "x" ]]; then usage; fi +if [[ -z "$vmname" ]]; then usage; fi wasb_dir="wasb:///MML-GPU" wasb_private="$wasb_dir/identity" diff --git a/tools/runme/build.sh b/tools/runme/build.sh index 4bbbf24d1d..df8bd5a3ec 100644 --- a/tools/runme/build.sh +++ b/tools/runme/build.sh @@ -269,7 +269,8 @@ _upload_artifacts_to_storage() { fi txt="$(< "$f")" if [[ "$txt" =~ $varlinerx ]]; then - txt="${BASH_REMATCH[1]}$(_show_gen_vars)${BASH_REMATCH[2]}" + local sfx="${target##*.}" + txt="${BASH_REMATCH[1]}$(_show_gen_vars "$sfx")${BASH_REMATCH[2]}" fi # might be useful to allow <{...}> substitutions: _replace_var_substs txt echo "$txt" > "$target" @@ -279,6 +280,30 @@ _upload_artifacts_to_storage() { _add_to_description \ '* **HDInsight**: Copy the link to %s to setup this build on a cluster.\n' \ "[this Script Action]($STORAGE_URL/$MML_VERSION/install-mmlspark.sh)" + portal_link() { + local url="$STORAGE_URL/$MML_VERSION/$1" + url="${url//\//%2F}"; url="${url//:/%3A}"; url="${url//+/%2B}" + url="https://portal.azure.com/#create/Microsoft.Template/uri/$url" + printf "([in the portal](%s))" "$url" + } + _add_to_description \ + '* **ARM Template**: Use %s to setup a cluster with a gpu %s.\n' \ + "[this ARM Template]($STORAGE_URL/$MML_VERSION/deploy-main-template.json)" \ + "$(portal_link "deploy-main-template.json")" + _add_to_description \ + ' - **HDI Sub-Template**: Use the %s for just 
the HDI deployment %s.\n' \ + "[HDI sub-template]($STORAGE_URL/$MML_VERSION/spark-cluster-template.json)" \ + "$(portal_link "spark-cluster-template.json")" + _add_to_description \ + ' - **GPU VM Sub-Template**: Use the %s for just the GPU VM deployment %s.\n' \ + "[GPU sub-template]($STORAGE_URL/$MML_VERSION/gpu-vm-template.json)" \ + "$(portal_link "gpu-vm-template.json")" + _add_to_description \ + ' - **Convenient Deployment Script**: Download %s or %s, create a %s, and run as\n\n%s\n' \ + "[this bash script]($STORAGE_URL/$MML_VERSION/deploy-arm.sh)" \ + "[this PowerShell script]($STORAGE_URL/$MML_VERSION/deploy-arm.ps1)" \ + "parameters file [based on this template]($STORAGE_URL/$MML_VERSION/deploy-parameters.template)" \ + " ./deploy-arm.sh ... -p " } _full_build() { diff --git a/tools/runme/utils.sh b/tools/runme/utils.sh index 2d23bda804..9c09759351 100644 --- a/tools/runme/utils.sh +++ b/tools/runme/utils.sh @@ -36,9 +36,14 @@ defvar() { if [[ "$opts" == *[eE]* ]]; then envinit_commands+=("export $var=$(qstr "${!var}")"); fi } -_show_gen_vars() { - local var - for var in "${_gen_vars[@]}"; do printf '%s=%s\n' "$var" "$(qstr "${!var}")"; done +_show_gen_vars() { # mode, which is either "sh" or "ps1" + local var mode="$1"; shift + case "$mode" in + ( "sh" ) mode='%s=%s\n' ;; + ( "ps1" ) mode='$%s = %s;\n' ;; + ( * ) failwith "internal error, unknown mode for _show_gen_vars: $mode" + esac + for var in "${_gen_vars[@]}"; do printf "$mode" "$var" "$(qstr "${!var}")"; done } _replace_var_substs() { # var... local var val pfx sfx change=1
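As a concrete illustration of the two `_show_gen_vars` modes above (assuming that `qstr` produces single-quoted strings; the values shown are placeholders), the `defvar -X` marker line in `deploy-arm.sh` is replaced via the `sh` mode with plain shell assignments such as

    STORAGE_URL='https://mmlspark.azureedge.net/buildartifacts'
    MML_VERSION='0.9'

while the `ps1` mode emits the same variables as `$STORAGE_URL = '...';`-style assignments, which is what the corresponding marker line in `deploy-arm.ps1` expects.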