ml-kubernetes MNIST
is simple project that trains and predicts MNIST
dataset in Kubernetes cluster. This projects comducts PoC (Proof of Concept) of distributed machine learning on container environment.
All steps are done on GKE of Google Cloud Platform. You have to install gcloud
comman line to use GKE.
We assume that gcloud
is installed and available. If not, please refer the below link, GCP official document.
gcloud install guide : https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu
-
Create Kubernetes cluster using gcloud container clusters create command.
$ gcloud container clusters create my-kube-cluster \ --zone=us-central1-a \ --cluster-version=1.11.7-gke.12 \ --disk-size=20GB \ --disk-type=pd-standard \ --num-nodes=3
You can adjust specified options, such as
--zone
,--disk-size
,--num-nodes
, etc. -
Get access credential for Kubernetes.
$ gcloud container clusters get-credentials my-kube-cluster --zone us-central1-a
Test the kubectl command.
$ kubectl get nodes NAME STATUS ROLES AGE VERSION gke-my-kube-cluster-default-pool-d0ed872d-33mt Ready <none> 49s v1.11.7-gke.12 gke-my-kube-cluster-default-pool-d0ed872d-gd42 Ready <none> 49s v1.11.7-gke.12 gke-my-kube-cluster-default-pool-d0ed872d-wpv4 Ready <none> 49s v1.11.7-gke.12
-
Create a external disk to provide dataset to each Kubernetes node.
Also, you can adjust the options such as
--size
,--zone
, etc.$ gcloud compute disks create --type=pd-standard \ --size=1GB --zone=us-central1-a ml-kube-disk
Congraturation! You has just created a Kubernetes cluster with 3 worker nodes.
-
Splitter
split MNIST dataset into multiple dataset. -
Trainer
conducts distributed learning process in Kubernetes container. -
Aggregator
aggregates output models extracted from step 2. -
Server
container provides web page to demonstrate MNIST prediction.
-
First, define environment values used in distributed MNIST learning.
-
$ export WORKER_NUMBER=3 $ export EPOCH=2 $ export BATCH=100
-
Create NFS container to store datasets and models. It will be used as a PV, PVC in Kubernetes.
1-nfs-deployment.yaml
creates NFS server container to be mounted to other components, such as splitter, trainer.$ kubectl apply -f 1-nfs-deployment.yaml $ kubectl apply -f 2-nfs-service.yaml
Create PV and PVC using NFS container.
$ export NFS_CLUSTER_IP=$(kubectl get svc/nfs-server -o jsonpath='{.spec.clusterIP}') $ cat 3-nfs-pv-pvc.yaml | sed "s/{{NFS_CLUSTER_IP}}/$NFS_CLUSTER_IP/g" | kubectl apply -f -
[Optional (but recommended) ]
If you want to view directory of NFS server, create
busybox
deployment and enter into container. By default,index.html
andlost+found
files exist.$ kubectl apply -f 9999-busybox.yaml $ kubectl exec -it $(kubectl get pods | grep busybox | awk '{print $1}') sh / # ls /mnt index.html lost+found / # exit
-
Split MNIST dataset using splitter. Splitter will create datasets, the number of $(WORKER_NUMBER)
$ cat 4-splitter.yaml | sed "s/{{WORKER_NUMBER}}/$WORKER_NUMBER/g" | kubectl apply -f -
To check datasets are created, check in
busybox
deployment. Splitted datasets exist as *.npz$ kubectl exec $(kubectl get pods | grep busybox | awk '{print $1}') ls /mnt/data 0.npz 1.npz 2.npz
-
Train each dataset in Kubernetes workers. Below bash commands create trainers as deployment to train and extract neural network model.
$ for (( c=0; c<=($WORKER_NUMBER)-1; c++ )) do echo $(date) [INFO] "$c"th Creating th trainer in kubernetes.. cat 5-trainer.yaml | sed "s/{{EPOCH}}/$EPOCH/g; s/{{BATCH}}/$BATCH/g; s/{{INCREMENTAL_NUMBER}}/$c/g;" | kubectl apply -f - & done
After about 3 minitues, you can view the status of trainer job. Status should be
completed
.$ kubectl get po NAME READY STATUS RESTARTS AGE mnist-splitter-qgkxf 0/1 Completed 0 14m mnist-trainer-0-g896k 0/1 Completed 0 3m mnist-trainer-1-6xfkg 0/1 Completed 0 3m mnist-trainer-2-ppnsc 0/1 Completed 0 3m
Also you can check generated models using
busybox
deployment.$ kubectl exec $(kubectl get pods | grep busybox | awk '{print $1}') ls /mnt/model 0-model.h5 1-model.h5 2-model.h5
-
Aggregate generated models into one model. Below command creates aggregator, which aggregate models into single model.
$ kubectl apply -f 6-aggregator.yaml
Check a aggregated model.
$ kubectl exec $(kubectl get pods | grep busybox | awk '{print $1}') ls /mnt aggregated-model.h5 ...
If you want to test accuracy of aggregated model, use
9999-accuracy-test
deployment.$ kubectl apply -f 9999-accuracy-test.yaml $ kubectl logs --tail 1 $(kubectl get pods | grep accuracy-test | awk '{print $1}') 10000/10000 [==============================] - 10s 997us/sample - loss: 1.2667 - acc: 0.8728
-
Create
server
deployment for demo. You can test MNIST prediction.$ kubectl apply -f 7-server.yaml
After a few seconds, you can see the external IP to access the demo web page. Below example shows external IP is a.b.c.d, so you can access a.b.c.d:80 in web browser
$ kubectl get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE ... mnist-server-svc LoadBalancer 10.19.253.70 a.b.c.d 80:30284/TCP 12m ...
Upload a sample MNIST dataset located in
samples/
directory, such as 6.jpg.After uploading MNIST sample, web page shows prediction result.
-
Splitter : splitter distributes MNIST data equally by the number of contents.
-
--n_container : number of training container.
-
--savedir : saved directory path of splitted data. each data files saved
.npz
file format will be saved as number of--n_container
. -
Usage Example
$python splitter.py --n_container 10 --savedir
-
-
Trainer : trainer will independently learn the data divided into each container and create a model for each container as
*.h5
file format.-
--data : Data you want to learn from the container.
-
--epoch : number of epoch.
-
--batch : size of batch.
-
--savemodel : saved model in each container. model format should be h5 file format.
-
Usage Example
$python train.py --data 0.npz --epoch 3 --batch 100 --savemodel model.h5
-
-
Aggregater : aggregater averages the stored models in each container. aggregater will find
*.h5
file format in--dir
directory and average it.-
--dir : In
trainer.py
,*.h5
files saved in specific directory. This--dir
parameter refers to its saved folder. -
--savefile : savefile is final averaged model. model format should be h5 file format.
-
Usage Example
$python aggregater.py --dir ./models --savefile final_model.h5
-
-
Server : server serves as the final average model and serves through flasks.
-
Usage Example
model = tf.keras.models.load_model('your_model.h5', compile=False)
-