Why use Docker containers for machine learning development?

Jimena Lozano edited this page Oct 16, 2020 · 4 revisions

Source: https://aws.amazon.com/es/blogs/opensource/why-use-docker-containers-for-machine-learning-development/

Machine learning development environment: Bare necessities

Let’s start with the four basic ingredients you need for a machine learning development environment:

  1. Compute: High-performance CPUs and GPUs to train models.
  2. Storage: For large training datasets and the metadata you generate during training.
  3. Frameworks and libraries: To provide APIs and execution environment for training.
  4. Source control: For collaboration, backup, and automation.

At some point in your machine learning development process, you’ll hit one of these two walls:

  1. You’re experimenting and you have too many variations of your training scripts to run, and you're bottlenecked by your single machine.
  2. You’re running training on a large model with a large dataset, and it’s not feasible to run on your single machine and get results in a reasonable amount of time.

To address the first wall, you can run every model independently and asynchronously on a cluster of computers. To address the second wall, you can distribute a single model across a cluster and train it faster. Both solutions require that you can successfully and consistently reproduce your development training setup on a cluster. That is challenging because the cluster could be running different operating systems and kernel versions; different GPUs, drivers, and runtimes; and different software dependencies than your development machine.

Another reason you need portable machine learning environments is collaborative development. Sharing your training scripts with a collaborator through version control is easy. Sharing the full execution environment those scripts depend on is harder.
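One way to see how far a cluster node has drifted from your development machine is to record a small fingerprint of each environment and compare the two. The following is a minimal sketch that uses only the Python standard library. The function names `environment_fingerprint` and `diff_fingerprints` and the example package names are hypothetical, chosen for illustration, and the fingerprint covers only a few of the differences listed above (it does not inspect GPUs or drivers).

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect OS, kernel, Python, and package versions for this machine."""
    fingerprint = {
        "os": platform.system(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            fingerprint["pkg:" + name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            fingerprint["pkg:" + name] = None  # package is not installed here
    return fingerprint


def diff_fingerprints(local, remote):
    """Return {key: (local_value, remote_value)} for every key that differs."""
    keys = sorted(set(local) | set(remote))
    return {
        key: (local.get(key), remote.get(key))
        for key in keys
        if local.get(key) != remote.get(key)
    }


if __name__ == "__main__":
    # The package names are assumptions; list whatever your training code imports.
    print(environment_fingerprint(["numpy", "torch"]))
```

Running the script on both machines and passing the two dictionaries to `diff_fingerprints` gives a short list of mismatches to investigate. An empty result only means the recorded fields agree, not that the environments are identical.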

Machine learning, open source, and specialized hardware

A challenge with machine learning development environments is that they rely on complex and continuously evolving open source machine learning frameworks and toolkits, and complex and continuously evolving hardware ecosystems. Both are positive qualities that we desire, but they pose short-term challenges.

How many times have you run machine learning training and asked yourself these questions:

  • Is my code taking advantage of all available resources on CPUs and GPUs?
  • Do I have the right hardware libraries? Are they the right versions?
  • Why does my training code work fine on my machine but crash on my colleague’s, when the environments are more or less identical?
  • I updated my drivers today, and training is now slower or errors out. Why?

[Figure: the machine learning software stack, with the magenta My code box]

If you examine your machine learning software stack, you will notice that you spend most of your time in the magenta box called My code in the accompanying figure. This includes your training scripts, your utility and helper routines, your collaborators’ code, community contributions, and so on. As if that were not complex enough, you will also notice that your dependencies include:

  • the machine learning framework API that is evolving rapidly;
  • the machine learning framework dependencies, many of which are independent projects;
  • CPU-specific libraries for accelerated math routines;
  • GPU-specific libraries for accelerated math and inter-GPU communication routines; and
  • the GPU driver, which needs to be aligned with the GPU compiler used to compile the GPU libraries above.
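The lowest layer in this list is also the easiest to forget to record. As a hedged sketch, assuming NVIDIA GPUs and that the `nvidia-smi` command-line tool is on the PATH (the text above does not name a GPU vendor), the driver version could be logged next to each experiment like this. The function names are hypothetical.

```python
import shutil
import subprocess


def parse_driver_versions(output):
    """Turn nvidia-smi CSV output (one line per GPU) into a list of versions."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def query_driver_versions():
    """Return the driver version reported for each GPU, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA tooling found on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return parse_driver_versions(result.stdout)


if __name__ == "__main__":
    print(query_driver_versions())
```

Storing this value alongside the training results makes it easier to tell later whether a slowdown or failure coincided with a driver update.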

In the figure below, notice that even if you control for changes to your training code and the machine learning framework, there are lower-level changes that you may not account for, resulting in failed experiments. Ultimately, this costs you the most precious commodity of all: your time.

[Figure: lower-level changes in the stack beneath the training code and the machine learning framework]

Enter containers for machine learning development

What you should and shouldn’t include in your machine learning development container:

  1. Only the machine learning frameworks and dependencies: This is the cleanest approach. Every collaborator gets the same copy of the same execution environment. They can clone their training scripts into the container at runtime or mount a volume that contains the training code.
  2. Machine learning frameworks, dependencies, and training code: This approach is preferred when scaling workloads on a cluster. You get a single executable unit of machine learning software that can be scaled on a cluster. Depending on how you structure your training code, you could allow your scripts to execute variations of training to run hyperparameter search experiments.
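The two approaches differ mainly in how the container is launched. The sketch below builds `docker run` argument lists for each: the first bind-mounts the training code into an image that contains only frameworks, and the second varies command-line arguments against an image that already contains the code. The function names, the `/workspace` path, the `train.py` script name, and the `--name=value` flag style are all assumptions made for illustration.

```python
import itertools


def run_with_mounted_code(image, host_code_dir, script="train.py"):
    """Approach 1: the image holds only frameworks; code is mounted at runtime."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_code_dir}:/workspace",  # bind-mount the training code
        "-w", "/workspace",                   # run from the mounted directory
        image, "python", script,
    ]


def hyperparameter_runs(image, grid, script="train.py"):
    """Approach 2: the image already contains the code; vary only the arguments."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        args = [f"--{name}={value}" for name, value in zip(names, values)]
        commands.append(["docker", "run", "--rm", image, "python", script] + args)
    return commands


if __name__ == "__main__":
    # Image names and hyperparameters here are placeholders.
    for command in hyperparameter_runs("my-ml-train:latest", {"lr": [0.1, 0.01]}):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run` on a single machine or translated into a job specification for a cluster scheduler. The second function assumes the image's working directory already contains the training script.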

Sharing your development container is also easy. You can share it as a:

  1. Container image: This is the easiest option. This allows every collaborator or a cluster management service, such as Kubernetes, to pull a container image, instantiate it, and execute training immediately.
  2. Dockerfile: This is a lightweight option. Dockerfiles contain instructions on what dependencies to download, build, and compile to create a container image. Dockerfiles can be versioned along with your training code. You can automate the process of creating container images from Dockerfiles by using continuous integration services, such as AWS CodeBuild.
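To make the Dockerfile option concrete, here is a minimal sketch that renders Dockerfile text from a base image and a set of pinned package versions, plus the `docker build` and `docker push` commands a continuous integration job might run. The function names, the `/workspace` directory, and the `train.py` entry point are hypothetical. The `include_code` switch corresponds to the choice between shipping only the frameworks and shipping the training code as well.

```python
def render_dockerfile(base_image, pinned_packages, include_code=False):
    """Build Dockerfile text from a base image and exact package versions."""
    lines = [f"FROM {base_image}", "WORKDIR /workspace"]
    if pinned_packages:
        specs = " ".join(
            f"{name}=={version}" for name, version in sorted(pinned_packages.items())
        )
        lines.append(f"RUN pip install --no-cache-dir {specs}")
    if include_code:
        # Copy the build context (assumed to hold the training code) into the image.
        lines.append("COPY . /workspace")
        lines.append('CMD ["python", "train.py"]')
    return "\n".join(lines) + "\n"


def build_and_push_commands(tag, context_dir="."):
    """Return the argument lists for building an image and pushing it to a registry."""
    return [
        ["docker", "build", "-t", tag, context_dir],
        ["docker", "push", tag],
    ]
```

Pinning exact versions in the generated file is a design choice: it trades convenience for reproducibility, so that an image rebuilt later contains the same packages as the one used for the original experiment.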