Code repo for "Gymnasie Arbete" (Swedish upper-secondary diploma project)
- Use MNIST dataset for metrics
- Create TFJobs (https://www.kubeflow.org/docs/components/training/tftraining/) with MNIST and deploy on k8s
- Create and deploy the MNIST model with single-node computation
- Deploy the MNIST model with multi-worker computation (https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy) using MultiWorkerMirroredStrategy
- Run metrics on both setups (TFJobs on k8s and multi-worker computation)
- Analyze metrics
- Done
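For the multi-worker step, TensorFlow discovers its peers through the `TF_CONFIG` environment variable. A minimal sketch of what each worker pod would set before training starts; the hostnames and port are placeholders, not the actual deployment's names:

```python
import json
import os

# Hypothetical worker addresses; in the real deployment these would be
# the k8s Service DNS names of the worker pods.
workers = ["mnist-worker-0:2222", "mnist-worker-1:2222"]

# Every worker gets the same cluster spec but its own task index.
tf_config = {
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": 0},  # index differs per pod
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# With TF_CONFIG set before TensorFlow initializes, the strategy
# resolves the cluster automatically:
#   import tensorflow as tf
#   strategy = tf.distribute.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = build_mnist_model()  # model must be created in scope
```

In a TFJob, Kubeflow's operator injects `TF_CONFIG` into each pod automatically, so this manual construction is only needed for the hand-rolled multi-worker setup.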
-
The model_save path is only passed into the app as an env variable, so saving needs to be handled internally by the app.
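A minimal sketch of resolving the save path inside the app; `MODEL_SAVE_PATH` is an assumed variable name (the actual env var name is not given in these notes):

```python
import os

def resolve_save_path(default="/tmp/model"):
    """Read the model save path from the environment.

    MODEL_SAVE_PATH is an assumed name; the deployment's env var may
    differ. Falling back to a local default lets the app also run
    outside the cluster.
    """
    return os.environ.get("MODEL_SAVE_PATH", default)

# Inside the training app, after model.fit(...):
#   model.save(resolve_save_path())
```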
-
Mount a PVC and let multiple nodes write to it: https://stackoverflow.com/questions/67345577/can-we-connect-multiple-pods-to-the-same-pvc
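When several workers write to the same PVC mount, giving each worker its own subdirectory avoids them clobbering each other's files. A sketch; the mount path `/mnt/shared` and the `WORKER_INDEX` env var are assumptions (in a TFJob the index could instead be parsed from `TF_CONFIG`):

```python
import os

def worker_output_dir(mount="/mnt/shared", env_var="WORKER_INDEX"):
    """Build a per-worker directory under the shared PVC mount.

    Each worker writing under its own worker-<index> subdirectory
    keeps concurrent writes to the shared volume from colliding.
    """
    index = os.environ.get(env_var, "0")
    path = os.path.join(mount, f"worker-{index}")
    os.makedirs(path, exist_ok=True)
    return path
```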
- I created the AI model
- I downloaded the Kubeflow manifests to the cluster using kustomize (make sure the k8s cluster is v1.21.1 and kustomize is v3.2.0)
- "k create -f mnist.yaml"
- what needs to be coded, with containers, and how much can be taken from Kubeflow
- a more concrete timeline
- make sure all the parts are there
-
explain the hypothesis
-
use diagrams; show loss divergence in the results
https://github.com/nottombrown/distributed-tensorflow-example/blob/master/example.py
Implement a parameter server (PS) to speed up synchronisation between workers.
Should only need to specify the PS job to join the strategy, and the worker job to do the default work.
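A sketch of that job-type dispatch, assuming each task reads its role from a `TF_CONFIG`-style JSON string (in the TF1-style example linked above, a PS task calls `server.join()` and blocks, while a worker task runs the training loop):

```python
import json
import os

def run_task(tf_config_json):
    """Dispatch on the task type from a TF_CONFIG-style JSON string.

    Returns a label for illustration. In the real app, roughly:
      "ps"     -> start a tf.distribute.Server and call .join()
      "worker" -> run the default training work
    """
    task = json.loads(tf_config_json)["task"]["type"]
    if task == "ps":
        return "joining as parameter server"
    elif task == "worker":
        return "running default training work"
    raise ValueError(f"unknown task type: {task}")
```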
8 PS, 12 workers: 71 s
2 PS, 4 workers: 90 s
time to beat: 27.4 s