# Notes from my Azure experience
I have only experienced a small part of the Azure ecosystem, but I spent a lot of time on Azure Machine Learning Services (AML), which is in rapid development, and this is an account of some of the challenges I ran into.  

## Using an AML cluster vs a VM

The AML workspace concept is excellent.  It provides good integration between scalable Compute units ("clusters"), a container registry for environments and code, an excellent design for logging and tracking runs that are nested inside 'experiments,' good integration with blob and file storage, and the ability to easily add your code blocks to more sophisticated workflow pipelines and/or Kubernetes-managed clusters for deployment.  However, you must learn a whole new codebase to operate an AML workspace.  For that, the AML Python SDK (`azureml`) is reliable, flexible, and powerful, but like other software it has its fair share of glitches, counterintuitive behavior, heinous access control problems, and outright bugs, so it is not a trivial undertaking. 

**Here's the problem**: for development, you want a highly interactive environment like Jupyter notebooks which gives you immediate feedback.  But for model training and deployment, you want the ability to rapidly and easily scale the power and number of machines up and back down.  In my Azure experience, a standalone VM is far better for flexible, interactive, rapid development, but it is difficult to scale up and down (aside: I haven't explored "scale sets," but they seem less flexible than AML).  In contrast, the ability to scale up or down an AML cluster during training and deployment is life-changing.  Unfortunately, AML clusters are not easy to use interactively.  In other words, neither VMs nor AML have _all_ of the qualities that one would like.  Ongoing issues include:

- **AML is not good for development**. AML is quite clunky and slow if driven by the Python SDK -- it lacks the flexibility of Jupyter notebooks and you have to hunt through logs to figure out what went wrong.  AML Notebooks are supposed to be a Microsoft version of Jupyter notebooks, but they cannot be run on a cluster or attached VM, which makes them almost useless.  
- **An AML workspace can drive your costs up.**  AML workspaces fill up a lot of storage quickly because all input and all output is kept for every run, in addition to copious logs.  That can raise your overall costs very quickly.  Microsoft Support [responded to a question I asked](https://github.com/MicrosoftDocs/azure-docs/issues/60501#issuecomment-671608408) by saying that it's not possible to delete experiments at present (although they suggested a workaround).  You can delete your entire workspace, but typically you don't really want to.  That's a real problem.
- **It was not possible to attach a VM as a Compute resource to an AML workspace until recently.** After about 2 months of phone and email conversations, Microsoft Support finally issued a hotfix that now works (at least for US West2).  
- **CUDA versions can be a problem.**  As of Aug 2020, a new Linux Data Science VM (DSVM) comes with CUDA 10.2, whereas the AML clusters run CUDA 10.1.  That means you may run into CUDA compatibility problems if you develop on a DSVM and then port the code to a cluster, as I do. I think the solution is to downgrade CUDA on the VM, but that's likely to be difficult (root canal, anyone?).
- **Jupyter notebooks can cause problems.** When running Detectron2 from Jupyter notebooks, I ran into odd problems with objects that couldn't be serialized ('pickled', in Python), but only when doing distributed training with multiple GPUs.  Those problems disappeared when I ran the program from the command line.  So, for multi-GPU training, in some cases you may not be able to benefit much from using Jupyter notebooks.
- **It can be difficult to containerize code for an AML cluster**.  If you're new to the process, you will likely struggle with aspects of it, including arcane access control for Azure Container Registries (parallel but non-equivalent command sets from Docker and Azure's `az-cli` double the confusion).
- **The performance difference between version 1 and 2 machines is larger than advertised.**  I experienced a 4.5X increase in speed when training (and also when doing inference) on an NC24_v2 machine, vs. an NC24_v1 machine.  The advertised difference is 2X, and I was using identical code.  I suspect that v1 machines are not performing as they should for multi-GPU training.
- **Compute quotas can be a roadblock**.  Microsoft sets quota limits on the number of vCPUs that any user can take advantage of.  Requests to raise a quota, at least on a complimentary account, are not always approved, and the approval takes a couple of days.  In addition, Microsoft may try to steer you (as they did with me) to a more expensive machine when they are upgrading a region's hardware.  I requested two 4-GPU version-1 machines in US West2, and was told that they were no longer available.  I whined and Microsoft relented, but the difference in price is significant and it would have been frustrating to be forced to either develop on an \$8/hour machine, or to set up workspaces in two regions just so I could use a cheaper machine in one. 
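
One cheap way to catch the CUDA mismatch mentioned above before a long job starts is to compare the version reported on the development VM with the one used on the cluster.  What follows is a minimal sketch, assuming you have already collected both version strings yourself (for example from `nvcc --version`, or from `torch.version.cuda` if you use PyTorch).  The function names and the "major.minor must match" rule are illustrative assumptions, not part of any Azure tooling.

```python
def cuda_major_minor(version):
    """Return (major, minor) from a CUDA version string such as '10.2' or '10.1.243'."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def cuda_versions_match(dev_version, cluster_version):
    """True if both versions share the same major.minor (patch level is ignored)."""
    return cuda_major_minor(dev_version) == cuda_major_minor(cluster_version)


# Versions quoted in this post (Aug 2020): DSVM vs. AML cluster.
dev, cluster = "10.2", "10.1"
if not cuda_versions_match(dev, cluster):
    print(f"Warning: dev VM has CUDA {dev}, cluster has CUDA {cluster}")
```

Whether a minor-version difference actually breaks your code depends on the libraries involved, so treat a warning here as a prompt to test, not as proof of a problem.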

### Summary (absolutely free advice):
The flexibility of being able to develop interactively using Jupyter notebooks on a VM is invaluable.  But when it comes to scaling up to better or more machines, the AML clusters are truly life-changing.  Also, for managing costs, it's super useful to develop on a machine that only costs \$2/hour, and then train on a cluster of \$9/hour machines that will shut itself down as soon as the job is finished.  So despite the built-in CUDA hassles, and some incompatibilities between Jupyter Notebook code and containerized, command-line code, I think it's worth learning the AML code and using both approaches simultaneously.  The AML documentation is quite good.  
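
To make the cost argument concrete, here is a back-of-the-envelope sketch.  The hourly rates are the ones quoted above; the hour counts are hypothetical placeholders, and the function names are made up for illustration.  Storage and other charges are ignored.

```python
def mixed_cost(dev_hours, train_hours, dev_rate=2.0, cluster_rate=9.0, nodes=1):
    """Develop on a cheap VM, then train on a cluster that shuts down when done."""
    return dev_hours * dev_rate + train_hours * cluster_rate * nodes


def single_machine_cost(dev_hours, train_hours, rate=9.0):
    """Do everything, development included, on the expensive machine."""
    return (dev_hours + train_hours) * rate


# Hypothetical workload: 40 hours of development, 5 hours of training on one node.
mixed = mixed_cost(dev_hours=40, train_hours=5)            # 40*2 + 5*9
single = single_machine_cost(dev_hours=40, train_hours=5)  # 45*9
print(f"VM + cluster: ${mixed:.2f}   expensive machine only: ${single:.2f}")
```

The gap grows with the ratio of development time to training time, which in my experience is usually large.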

## AML Studio quirks
Studio is basically a GUI front end for an AML workspace.  I have so far only used the basic edition of Studio (not the Enterprise version) and I haven't experimented with graphical model-building.  My impressions:
- **Studio is still pretty glitchy**. For example, if I leave it alone for a while and then refresh it, it will typically crash and I'll have to close the webpage and re-launch it.  
- **Logs are better in the old version.**  If your program crashes while the container is being built (i.e., before any code runs), the new version of Studio doesn't give any clues to what went wrong with the container build.  The old version did.
- **File storage locations for runs are (exceedingly!) hard to find.**  Most of the time you won't need to, but every once in a while you need to copy model weights or something from a previous run.  The best way to find your files is to use the [Microsoft Azure Storage Explorer](https://azure.microsoft.com/en-us/features/storage-explorer/).  Even then it's a real hunt. Basics:
    - Open Microsoft Azure Storage Explorer
    - Go to the storage account associated with your workspace (probably has a long name)
    - Go to `Blob Containers`, then `azureml-blobstore-<some long hexadecimal number>`
    - Go to `azureml`.  You're getting there!
    - Now there are _two folders and two files_ generated for every run.  They all have 19-digit hexadecimal names and the modification date is not shown for folders, so good luck figuring out which is which.  Look for the latest _folders_ (not files), noting that there may be multiple pages and the folders probably end somewhere in the middle of the pages; click on the folder of the latest pair whose name does not end in `- setup`.
    - Finally, click on the `outputs` folder. Hey, piece of cake!
- **Passing output to Studio isn't always easy.**  The Detectron2 model I used comes with its own logging system, so unless I intercept the messages on their way to being stored and pass them to the AML run logger, I won't be able to see output until the run is over, and Studio can't plot it for me.  It's a relatively mild annoyance, but it means it will require several extra steps to take advantage of Studio's potentially nice display options.
- **AML Notebooks can't be run on a cluster or attached VM.**  Microsoft Support just told me that AML Notebooks can't be run on a cluster or an attached VM -- only a Compute Instance.  That seriously limits their usefulness, and essentially guarantees that a standalone VM is still essential for development. AML Notebooks are Microsoft's version of Jupyter Notebooks, and Jupyter is a thriving (and rapidly moving) ecosystem, so compatibility problems seem nearly guaranteed.
- **Studio is threatening to increase prices.**  Microsoft pushes you towards the "Enterprise" version every time you use Studio, and if you read the fine print they say that they may increase prices on it in the future, but there are no numbers attached.  That's a bit ominous, frankly.
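
The "intercept the messages" idea for passing output to Studio can be sketched with Python's standard `logging` module: attach a handler to the model's logger and forward each message to a callback as it is emitted.  This is a minimal sketch under assumptions, not what I actually ran.  The logger name `"detectron2"` and the class and function names are illustrative, and the callback here is a plain list standing in for the AML run logger (in a real run it might wrap something like `Run.get_context().log(...)` from `azureml.core`, after parsing out the numbers you want plotted).

```python
import logging


class ForwardingHandler(logging.Handler):
    """Forward each formatted log message to a callback as soon as it is emitted."""

    def __init__(self, forward, level=logging.INFO):
        super().__init__(level)
        self.forward = forward

    def emit(self, record):
        try:
            self.forward(self.format(record))
        except Exception:
            self.handleError(record)


def attach_forwarder(logger_name, forward):
    """Attach a ForwardingHandler to the named logger and return the handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.INFO)
    handler = ForwardingHandler(forward)
    logger.addHandler(handler)
    return handler


# Stand-in for the AML run logger: just collect the messages in a list.
received = []
attach_forwarder("detectron2", received.append)

# Child loggers propagate to the parent, so this message reaches the handler.
logging.getLogger("detectron2.engine").info("iter: 20  total_loss: 1.234")
print(received)
```

Because the handler sits on the parent logger, it sees messages from any child logger without touching the model's own code.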

### Summary
Studio is still very glitchy. It's usable, but hard to love. I'm personally not interested in graphical model building so I have no comment on that. 

## Swapping disks
Azure "thinks of a computer as a collection of connected disks," as one Microsoft staff member said to me, and the result can be some counterintuitive behavior when you want to back up, upgrade, or swap disks.  
- Losing owner and permissions when swapping a disk.  Documented [here](https://askubuntu.com/questions/1259744/recovering-a-lost-user-group).
- Properly re-sizing a disk.  Documented [here](https://askubuntu.com/questions/1253516/properly-resize-an-azure-vm-disk).

## Azure Command-line Interface (`az-cli`)
Certain tasks in AML are almost unavoidably done with `az-cli`.  I have had plenty of problems with `az` and its children (`azcopy`, etc.), and I avoid it whenever possible.  It's powerful and flexible, but it seems to be plagued with extra-awful access control problems that are coupled with sub-standard error messaging.  That's not a likeable combination.  For example, logging into your AML Container Registry takes two steps:
1. `az login`, which directs you to a webpage where you enter a code
2. `az acr login --name <registry-name>`, and **no, you can't jump straight to this step**.

Furthermore, even if you succeed in logging in, there is no apparent change in your environment, so it's unclear "where" you are or whether commands that you issue will be directed to the registry, or not.  Ugh.
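
If you end up scripting the two-step login, the ordering can at least be pinned down in one place.  This is a minimal sketch, assuming the `az` CLI is installed and on your `PATH`; the function names and the registry name `myregistry` are hypothetical, and the `runner` parameter exists only so the sequence can be checked without actually calling `az`.

```python
import subprocess


def acr_login_commands(registry_name):
    """Return the two CLI commands in the order they must run."""
    return [
        ["az", "login"],                                  # step 1: interactive Azure login
        ["az", "acr", "login", "--name", registry_name],  # step 2: registry login
    ]


def run_acr_login(registry_name, runner=subprocess.run):
    """Run both steps in order; check=True stops at the first failure."""
    for cmd in acr_login_commands(registry_name):
        runner(cmd, check=True)


# Usage (not executed here, since `az login` is interactive):
# run_acr_login("myregistry")
```

As for knowing "where" you are afterwards, `az account show` prints the active subscription, which at least confirms that the first step took.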