# pistachio-mlops

Came across a pistachio dataset for classification. The intent here is to use it to develop an ML pipeline (Kubeflow / Vertex AI) and deploy it on GCP.

The plan is to develop a simple model in a JupyterLab notebook and use that as a starting point for pipeline development.

## Data

Pistachio Image Dataset, downloaded from Kaggle.

This project uses the 16-feature version, which contains 1718 records across two pistachio types.
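
For reference, `.arff` files can be read with `scipy.io.arff` and converted to a DataFrame; nominal columns come back as bytes and need decoding. The tiny inline sample below is illustrative only, standing in for the real dataset file:

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# A tiny inline ARFF sample standing in for Pistachio_16_Features_Dataset.arff
sample = StringIO(
    "@relation pistachio\n"
    "@attribute AREA numeric\n"
    "@attribute Class {Kirmizi_Pistachio,Siirt_Pistachio}\n"
    "@data\n"
    "78.6,Siirt_Pistachio\n"
)
data, meta = arff.loadarff(sample)
df = pd.DataFrame(data)

# nominal attributes are parsed as bytes; decode them to str
df["Class"] = df["Class"].str.decode("utf-8")
```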

pandera will be used for schema/data validation.

Installing packages into the image: just use pip.

## Pipeline

- base image for all Python functionality
- Python scripts to handle arguments for each component definition
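
A minimal sketch of what one of those component scripts could look like; the argument names are assumptions based on the test commands below:

```python
import argparse

def parse_args(argv=None):
    # Each component script exposes its inputs/outputs as positional arguments,
    # so the pipeline (or a bare `docker run`) can pass paths straight through.
    parser = argparse.ArgumentParser(description="load_data component")
    parser.add_argument("input_path", help="path to the ARFF dataset")
    parser.add_argument("train_path", help="where to write the train split (parquet)")
    parser.add_argument("test_path", help="where to write the test split (parquet)")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(args)
```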

Things to work out:

- some sort of config: project, storage, artifact locations, etc.
- build images locally and push to Artifact Registry, vs. Cloud Build
- component definitions, including image location

https://www.kubeflow.org/docs/components/pipelines/v1/sdk/component-development/#creating-a-component-specification
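
Following that doc, a component definition in the KFP v1 style is a small YAML file that ties the image location to the script's arguments. This is a hypothetical sketch for `load_data` (the image tag and input/output names are assumptions):

```yaml
# Hypothetical component spec for load_data (KFP v1 format)
name: Load data
description: Reads the ARFF dataset and writes train/test parquet files.
inputs:
- {name: input_path, type: String}
outputs:
- {name: train_data, type: Dataset}
- {name: test_data, type: Dataset}
implementation:
  container:
    image: pistachio_base:0.0.1   # a registry prefix would go here once pushed
    command: [python, load_data.py]
    args:
    - {inputValue: input_path}
    - {outputPath: train_data}
    - {outputPath: test_data}
```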

## Images

Test `load_data`:

```shell
docker run -v ./pipeline/data:/data pistachio_base:0.0.1 load_data.py \
    /data/Pistachio_16_Features_Dataset.arff \
    /data/pistachio_imagetest_train.pqt \
    /data/pistachio_imagetest_test.pqt
```

Test `validate_data`:

```shell
docker run -v ./pipeline/data:/data pistachio_base:0.0.1 validate_data.py \
    /data/pistachio_imagetest_train.pqt \
    /data/pistachio_schema.json
```

## TODO

- kfp has a local runner/Docker setup for testing components; look at this instead of `test_images.sh`
- XGBoost warnings: these can be disabled in the container code with verbosity 0 or some other flag