dsdinter and k8s-ci-robot Lint fixes mnist (#581)
* Remove modules from .pylintrc

* Add lint inline exceptions

* Add lint inline exceptions as all as the specific exception is not available for Pylint 1.8

* Fix string formatting logging message and remove unnecessary Pylint exception

* Update app.yaml with correct environment details
Latest commit a9c6e69 Jul 25, 2019
Type Name Latest commit message Commit time
ks_app Lint fixes mnist (#581) Jul 25, 2019
serving/seldon-wrapper
testing [pytorch_mnist] Automate image build (#490) Jun 14, 2019
training/ddp/mnist Lint fixes mnist (#581) Jul 25, 2019
web-ui
01_setup_a_kubeflow_cluster.md [mnist_pytorch] Update documentation (#463) Jan 8, 2019
02_distributed_training.md [pytorch_mnist] Automate image build (#490) Jun 14, 2019
03_serving_the_model.md [pytorch_mnist] Automate image build (#490) Jun 14, 2019
04_querying_the_model.md E2E Pytorch mnist example (#274) Nov 18, 2018
05_teardown.md E2E Pytorch mnist example (#274) Nov 18, 2018
Dockerfile.ksonnet [pytorch_mnist] Automate image build (#490) Jun 14, 2019
Makefile [pytorch_mnist] Automate image build (#490) Jun 14, 2019
OWNERS E2E Pytorch mnist example (#274) Nov 18, 2018
README.md Fixed typo in README and one bad link Feb 15, 2019
image_build.jsonnet [pytorch_mnist] Automate image build (#490) Jun 14, 2019
ksonnet-entrypoint.sh [pytorch_mnist] Automate image build (#490) Jun 14, 2019

README.md

End-to-End Kubeflow tutorial using a PyTorch model in Google Cloud

This example demonstrates how to use Kubeflow end-to-end to train and serve a distributed PyTorch model on a Kubernetes cluster in GCP. This tutorial is based upon the below projects:

Goals

There are two primary goals for this tutorial:

  • Demonstrate an End-to-End Kubeflow example
  • Present an End-to-End PyTorch model

By the end of this tutorial, you should learn how to:

  • Set up a Kubeflow cluster on a new Kubernetes deployment
  • Provision shared persistent storage across the cluster to store models
  • Train a distributed model using PyTorch and GPUs on the cluster
  • Serve the model using Seldon Core
  • Query the model from a simple front-end application
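
The distributed training goal above rests on data parallelism: each worker computes gradients on its own shard of the data, the gradients are averaged across workers, and every replica applies the same update. The following is a conceptual sketch in plain Python, not the tutorial's training code. The toy model `y = w * x`, the data, and the helper names `local_gradient`, `all_reduce_mean`, and `train` are all hypothetical, chosen only to illustrate the idea without needing PyTorch or a cluster.

```python
# Conceptual sketch of data-parallel training (the idea behind DDP).
# Everything here is illustrative; it does not use PyTorch or a real cluster.

def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every worker receives the average."""
    return sum(values) / len(values)

def train(shards, w=0.0, lr=0.05, steps=200):
    """Run synchronous data-parallel gradient descent over the given shards."""
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]  # one per worker
        w -= lr * all_reduce_mean(grads)  # identical update on every replica
    return w

# Hypothetical toy data following y = 3x, split evenly across two "workers".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]
w = train(shards)
print(f"learned weight: {w:.4f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the replicas stay in sync as if one process had seen all the data.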

The model and the data

This tutorial trains a PyTorch model on the MNIST dataset, which is often called the "hello world" of machine learning.

The MNIST dataset contains a large number of images of hand-written digits in the range 0 to 9, as well as the labels identifying the digit in each image.

After training, the model classifies incoming images into 10 categories (0 to 9) based on what it’s learned about handwritten images. In other words, you send an image to the model, and the model does its best to identify the digit shown in the image.

In the above screenshot, the image shows a hand-written 7. The table below the image shows a bar graph for each classification label from 0 to 9. Each bar represents the probability that the image matches the respective label. Looks like the model is pretty confident this one is a 7!
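
As a rough illustration of how ten per-label probabilities like these can arise, the sketch below applies a softmax to ten raw scores and picks the most probable digit. It assumes the model emits one raw score per digit; the scores are made up for this example and are not output from the tutorial's model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for digits 0-9; the score for 7 is the largest.
logits = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -2.0, 6.0, 0.4, 1.1]
probs = softmax(logits)

# The predicted digit is the label with the highest probability.
predicted = max(range(10), key=lambda d: probs[d])

# Print a crude text bar graph, one bar per label.
for digit, p in enumerate(probs):
    print(f"{digit}: {'#' * round(p * 40):<40} {p:.3f}")
print("predicted digit:", predicted)
```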

Steps:

  1. Set up a Kubeflow cluster
  2. Distributed Training using DDP and PyTorchJob
  3. Serving the model
  4. Querying the model
  5. Teardown
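
To give a feel for the querying step, here is a minimal sketch of preparing a request body and reading a prediction back. It assumes a Seldon-style JSON layout with the values under `data.ndarray`, a flattened 28x28 grayscale input, and a response that returns ten scores in the same layout. The helpers `build_payload` and `top_digit` are hypothetical, and the actual endpoint URL and preprocessing depend on your deployment, so check the serving step before relying on any of this.

```python
import json

def build_payload(pixels):
    """Wrap a flattened 28x28 grayscale image in a Seldon-style JSON body."""
    if len(pixels) != 28 * 28:
        raise ValueError("expected 784 pixel values")
    # Scale 0-255 intensities to 0-1 (assumption: the served model may
    # expect different preprocessing).
    scaled = [p / 255.0 for p in pixels]
    return json.dumps({"data": {"ndarray": [scaled]}})

def top_digit(response_body):
    """Return the highest-scoring digit from a response shaped like the request."""
    scores = json.loads(response_body)["data"]["ndarray"][0]
    return max(range(len(scores)), key=lambda d: scores[d])

# Hypothetical input: an all-black image.
payload = build_payload([0] * 784)

# Hypothetical response, hand-written here instead of calling a live service.
fake_response = json.dumps(
    {"data": {"ndarray": [[0.0, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.97, 0.01, 0.01]]}}
)
print("predicted digit:", top_digit(fake_response))
```

The payload string would be sent as the body of an HTTP POST to the model's prediction endpoint. The sketch leaves out the network call so that it runs without a cluster.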