
Add GCS support and tests #1105

Merged
merged 4 commits on Jan 15, 2021

Conversation

@mszacillo (Contributor) commented on Sep 27, 2020

What this PR does / why we need it:
This PR adds GCS Protocol support for downloading models.

Fixes #1048

Overview:
I've added GCS as an accepted protocol in provider.go, along with the necessary model download logic, which lives in gcs.go. For unit testing, I needed a way to mock the GCS client. The best resource I could find was googleapis/google-cloud-go-testing, but when I tried adding this library as a dependency there was a conflict that prevented the project from pulling it in. Because of this, I added some interface types that adapt the GCS client to a general type, which can then be used for mocking. Test coverage for the gcs package is currently at 75%, but this can be increased with a couple more tests.
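For illustration, here is a minimal sketch of that adapter idea (the type and method names are hypothetical, not the ones in this PR): a small interface captures only the calls the downloader needs, the real *storage.Client is wrapped to satisfy it, and unit tests can substitute a fake.

```go
package gcs

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// Client is the minimal surface the downloader needs; both the real GCS
// client and a hand-written test fake can satisfy it.
type Client interface {
	Bucket(name string) BucketHandle
}

type BucketHandle interface {
	Object(name string) ObjectHandle
}

type ObjectHandle interface {
	NewReader(ctx context.Context) (io.ReadCloser, error)
}

// AdaptClient wraps the concrete *storage.Client so it satisfies Client.
func AdaptClient(c *storage.Client) Client { return adaptedClient{c} }

type adaptedClient struct{ c *storage.Client }

func (a adaptedClient) Bucket(name string) BucketHandle {
	return adaptedBucket{a.c.Bucket(name)}
}

type adaptedBucket struct{ b *storage.BucketHandle }

func (a adaptedBucket) Object(name string) ObjectHandle {
	return adaptedObject{a.b.Object(name)}
}

type adaptedObject struct{ o *storage.ObjectHandle }

func (a adaptedObject) NewReader(ctx context.Context) (io.ReadCloser, error) {
	return a.o.NewReader(ctx)
}
```

This is essentially what google-cloud-go-testing's stiface package provides out of the box, which is why the later commits in this PR switch over to it.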

Note: There were conflicts when importing cloud.google.com/go/storage alongside the storage package within agent. Because of this, I renamed the storage package to kfstorage (but I'm happy to change this name to something more fitting if needed). I also noticed that there are a number of conflicts with #1055, so I can resolve those after that PR merges. :)
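As an aside, a common alternative to renaming the local package is to alias the upstream import; a tiny sketch (the alias gstorage is just an example):

```go
package main

import (
	"fmt"

	// Aliased so it does not clash with a local package named "storage".
	gstorage "cloud.google.com/go/storage"
)

func main() {
	// Any reference through the alias works as usual.
	fmt.Println(gstorage.ScopeReadOnly)
}
```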

Testing

For testing, I deployed the puller pod manually and checked the logs to make sure that the models were being successfully downloaded from my GCS bucket. I did this using the following yaml:

apiVersion: v1
kind: Pod
metadata:
  name: puller
spec:
  serviceAccountName: default
  restartPolicy: Never
  containers:
  - name: job
    image: mszacillo/korepo:gcssupporttest
    imagePullPolicy: Always
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    command:
    - /agent
    args:
    - -config-dir
    - /mnt/configs
    volumeMounts:
    - name: config-volume
      mountPath: /mnt/configs
    - name: models-volume
      mountPath: /mnt/models
    - name: gcs-serviceaccount-volume
      mountPath: /gcssecret
  volumes:
    - name: config-volume
      configMap:
        name: models-config
    - name: models-volume
      emptyDir: {}
    - name: gcs-serviceaccount-volume
      configMap:
        name: gcs-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: models-config
data:
  models.json: |
    [
      {
        "modelName": "model1",
        "modelSpec": {
          "storageUri": "gs://mskfservingmodels/iris.py",
          "framework": "sklearn",
          "memory": "1G"
        }
      }
    ]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gcs-config
data:
  gcloud-application-credentials.json: |
     { *** credential info removed *** }

After the puller pod comes up, the following logs show that the download was successful; however, I will also run a full e2e test with the kfserving setup. Below you can also see a failure when trying to connect to localhost:8080, which is because I wasn't port-forwarding the ingress gateway.

{"level":"info","ts":1606770594.7795548,"logger":"modelAgent","msg":"Initializing model agent with","config-dir":"/mnt/configs","model-dir":"/mnt/models"}
{"level":"info","ts":1606770594.7796307,"logger":"modelAgent","msg":"Setting credential file for GCS."}
{"level":"info","ts":1606770594.7796636,"logger":"modelAgent","msg":"Initializing gcs client, using existing GOOGLE_APPLICATION_CREDENTIALS variable."}
{"level":"info","ts":1606770595.8202243,"logger":"Syncer","msg":"Syncing model directory..","modelDir":"/mnt/models"}
&{mskfservingmodels [{project-owners-810000861461  OWNER   0xc00052d9c0} {project-editors-810000861461  OWNER   0xc00052da00} {project-viewers-810000861461  READER   0xc00052da20}] {false 0001-01-01 00:00:00 +0000 UTC} {false 0001-01-01 00:00:00 +0000 UTC} [{project-owners-810000861461  OWNER   0xc00052da40} {project-editors-810000861461  OWNER   0xc00052da80} {project-viewers-810000861461  READER   0xc00052daa0}] false   US 1 STANDARD 2020-11-30 18:28:08.213 +0000 UTC false map[] false {[]} <nil> [] <nil> <nil> <nil> CAE= multi-region}
{"level":"info","ts":1606770595.8207524,"logger":"modelWatcher","msg":"adding model","modelName":"model1"}
{"level":"info","ts":1606770595.8211873,"logger":"Watcher","msg":"Start to watch model config event"}
{"level":"info","ts":1606770595.8213172,"logger":"Watcher","msg":"Watching","modelConfig":"/mnt/data/models.json"}
{"level":"info","ts":1606770595.8212771,"logger":"modelProcessor","msg":"worker is started for","model":"model1"}
{"level":"info","ts":1606770595.8214014,"logger":"modelProcessor","msg":"Downloading model","storageUri":"gs://mskfservingmodels/iris.py"}
{"level":"info","ts":1606770595.8214495,"logger":"Downloader","msg":"Downloading to model dir","modelUri":"gs://mskfservingmodels/iris.py","modelDir":"/mnt/models"}
{"level":"info","ts":1606770595.8214915,"logger":"Downloader","msg":"Success file does not exist"}
{"level":"info","ts":1606770595.82158,"logger":"modelAgent","msg":"Downloading model ","modelName":"model1","storageUri":"gs://mskfservingmodels/iris.py","modelDir":"/mnt/models"}
{"level":"info","ts":1606770596.0174067,"logger":"modelAgent","msg":"Getting file /mnt/models/model1/iris.py"}
{"level":"info","ts":1606770596.2042253,"logger":"modelWatcher","msg":"Downloaded model."}
{"level":"error","ts":1606770596.213174,"logger":"modelProcessor","msg":"Failed to Load model","modelName":"model1","error":"Post http://localhost:8080/v2/repository/models/model1/load: dial tcp [::1]:8080: connect: connection refused","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/kubeflow/kfserving/pkg/agent.(*Puller).modelProcessor\n\t/go/src/github.com/kubeflow/kfserving/pkg/agent/puller.go:143"}
{"level":"info","ts":1606770596.2133465,"logger":"modelOnComplete","msg":"completion event for model","modelName":"model1","inFlight":0}

Release note:

Adding in GCS as an accepted storage provider

@kubeflow-bot

This change is Reviewable

@k8s-ci-robot (Contributor)

Hi @mszacillo. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-cla bot commented on Sep 27, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

2 similar comments

@mszacillo (Contributor, Author)

@googlebot I fixed it.

google-cla bot commented on Sep 27, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@ifilonenko (Contributor)

/ok-to-test

@yuzisun (Member) commented on Sep 29, 2020

@mszacillo This is great! Can you help rebase the master and sign the Google CLA?

@mszacillo (Contributor, Author)

@yuzisun Yes! I can take care of rebasing the branch. As for the CLA, I'm just waiting to hear back from legal for approval; I should be able to get that done in the next couple of days.

@ifilonenko (Contributor)

Has this been tested outside of unit tests, i.e. an e2e example of a pod with the appropriate labels pulling down from GCS? Can the results of that test be put in the PR description, please?

@mszacillo (Contributor, Author)

Good point, I can do some manual verifications and post the results on the PR description.

@@ -65,3 +65,31 @@ type ModelSpec struct {
// +optional
Memory resource.Quantity `json:"memory,omitempty"`
}

type TrainedModelYaml struct {
Member

Why do we need the yaml version?

Contributor

+1

Downloader: s3manager.NewDownloaderWithClient(sessionClient, func(d *s3manager.Downloader) {
}),
downloader.Providers[kfstorage.GCS] = &kfstorage.GCSProvider{
Client: mockapi.AdaptClient(client),
Member

Shouldn't this use the real client here?

@mszacillo (Contributor, Author) commented on Nov 9, 2020

The real client is defined by storage.NewClient(ctx) on line 54. This line instead adapts the client to fit the mock client interface, which is what allows unit testing. At first I had set the client type for the GCS provider to storage.Client, but this caused type errors when I mocked the client for unit tests, so I used a common interface shared by the mocked version and the real client.

I might be missing a better way to do this, so I'll look into it further.
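For context, the pattern the PR eventually converged on with google-cloud-go-testing's stiface package looks roughly like this (a sketch, not the exact wiring in the diff):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
	"github.com/googleapis/google-cloud-go-testing/storage/stiface"
)

func main() {
	ctx := context.Background()

	// The real client still picks up GOOGLE_APPLICATION_CREDENTIALS (or
	// other default credentials) exactly as before.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// AdaptClient wraps the concrete *storage.Client in the stiface.Client
	// interface, so production code and unit tests can share one field type.
	var c stiface.Client = stiface.AdaptClient(client)
	_ = c // in the agent, this would be stored on the GCS provider
}
```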

@@ -77,14 +80,41 @@ func (d *Downloader) download(modelName string, storageUri string) error {
return nil
}

func (d* Downloader) GetProvider() (string, error) {
storageUri := ""
matches, _ := filepath.Glob(d.ModelDir + "/*.yaml")
Member

Not sure I understand this; the storage URI should be passed in the ModelSpec and does not need to be parsed from the yaml.

Contributor Author

My understanding was that we'd configure the storage provider before any of the models are pulled, but I think I was overcomplicating things. We can just add each supported provider to the Provider map, and then on an add event the correct provider will be parsed from the ModelSpec, as you said. I'll go ahead and remove this logic.
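For reference, a minimal sketch of that approach (the protocol constants mirror the diff above; the helper and its name are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// Protocol identifies a storage backend by its URI scheme.
type Protocol string

const (
	S3  Protocol = "s3://"
	GCS Protocol = "gs://"
)

// providerFor picks the provider key directly from the storageUri carried
// by the ModelSpec, so nothing has to be re-parsed from yaml on disk.
func providerFor(storageUri string) (Protocol, error) {
	for _, p := range []Protocol{S3, GCS} {
		if strings.HasPrefix(storageUri, string(p)) {
			return p, nil
		}
	}
	return "", fmt.Errorf("unsupported storageUri %q", storageUri)
}

func main() {
	p, err := providerFor("gs://mskfservingmodels/iris.py")
	fmt.Println(p, err) // prints: gs:// <nil>
}
```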

Contributor

+1 to removing this logic

@ifilonenko (Contributor) left a comment

A few nits, but thanks for writing up a much needed storage protocol!

panic(err)
}

if strings.Contains(storageUri, string(kfstorage.S3)) {
Contributor

This is not the appropriate place to do this. This main method sets up the watcher and initializes all the available providers (by checking os.LookupEnv() to see if the agent is configured properly). In essence, this just means you need to do:

if endpoint, ok := os.LookupEnv(bcscredential.GCSEndpointUrl); ok {

or whatever is needed to set up the GCS client. The logic of whether the download should happen from GCS or S3 is configured elsewhere.
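For illustration, a rough sketch of the shape described here (the env var checked and the registration step are assumptions, not the PR's exact code):

```go
package main

import (
	"context"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// main only decides which providers to initialize based on the
	// environment; per-model provider selection happens later from the
	// storageUri.
	if _, ok := os.LookupEnv("GOOGLE_APPLICATION_CREDENTIALS"); ok {
		gcsClient, err := storage.NewClient(ctx)
		if err != nil {
			log.Fatalf("failed to create GCS client: %v", err)
		}
		defer gcsClient.Close()
		// ...register the GCS provider with the downloader here...
	}
}
```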

@@ -77,14 +80,41 @@ func (d *Downloader) download(modelName string, storageUri string) error {
return nil
}

func (d* Downloader) GetProvider() (string, error) {
storageUri := ""
matches, _ := filepath.Glob(d.ModelDir + "/*.yaml")
Contributor

+1 to removing this logic

)

type GCSProvider struct {
Client mockapi.Client
Contributor

mockapi? this should be the physical client

)
}
defer rc.Close()
data, err := ioutil.ReadAll(rc)
Contributor

Is there a way to send this directly to disk, i.e. a zero-copy transfer or some more efficient way to write to disk directly? Or is there a BatchDownloader that takes a FileWriter? And what is rc here?
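For what it's worth, rc is the object reader returned by NewReader; a minimal sketch of streaming it straight to a file instead of buffering the whole object with ioutil.ReadAll (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// writeObjectToFile streams rc (e.g. the reader returned by
// ObjectHandle.NewReader) to path without holding the whole object in memory.
func writeObjectToFile(rc io.ReadCloser, path string) error {
	defer rc.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy reuses a small buffer, so memory use stays flat regardless
	// of object size.
	if _, err := io.Copy(f, rc); err != nil {
		return fmt.Errorf("writing %s: %w", path, err)
	}
	return nil
}

func main() {}
```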

@@ -65,3 +65,31 @@ type ModelSpec struct {
// +optional
Memory resource.Quantity `json:"memory,omitempty"`
}

type TrainedModelYaml struct {
Contributor

+1

cmd/agent/Dockerfile (outdated review comments, resolved)
cmd/agent/main.go (outdated review comments, resolved)
@@ -332,4 +297,39 @@ var _ = Describe("Watcher", func() {
})
})
})

Describe("Use GCS Downloader", func() {
Member

Nice test! Could you add a test for failing to download?

@yuzisun (Member) commented on Dec 4, 2020

@mszacillo This looks great! Can you help resolve the conflicts and we should be good to merge!

@yuzliu (Contributor) commented on Dec 8, 2020

@mszacillo I think there must be a bug in the gcs downloader. I tested using gs://kfserving-samples/models/sklearn/iris. The logs show that it downloaded the model, but when I exec into the pod the model was not downloaded.

{"level":"info","ts":1607388819.7210934,"logger":"modelProcessor","msg":"Downloading model","storageUri":"gs://kfserving-samples/models/sklearn/iris"}
{"level":"info","ts":1607388819.7213714,"logger":"Downloader","msg":"Downloading to model dir","modelUri":"gs://kfserving-samples/models/sklearn/iris","modelDir":"/mnt/models"}
{"level":"info","ts":1607388819.7258606,"logger":"modelOnComplete","msg":"completion event for model","modelName":"model3","inFlight":0}

Could you please try

kubectl exec -it <pod_name> -- /bin/bash
cd /mnt/models

And verify that models are downloaded from gcs?

@mszacillo (Contributor, Author)

Could you please try

kubectl exec -it <pod_name> -- /bin/bash
cd /mnt/models

And verify that models are downloaded from gcs?

@yuzliu I added some debug logs to double-check that the file is getting downloaded successfully. It seems like the download works, the file gets written successfully, and the file exists after the download:

{"level":"info","ts":1607454256.6356146,"logger":"modelAgent","msg":"Downloading model ","modelName":"model1","storageUri":"gs://mskfservingmodels/iris.py","modelDir":"/mnt/models"}
{"level":"info","ts":1607454257.6121614,"logger":"modelAgent","msg":"Object iris.py exists."}
{"level":"info","ts":1607454257.7909303,"logger":"modelAgent","msg":"Created filename /mnt/models/model1/iris.py"}
{"level":"info","ts":1607454258.070437,"logger":"modelAgent","msg":"Wrote to file /mnt/models/model1/iris.py"}
{"level":"info","ts":1607454258.0704966,"logger":"modelAgent","msg":"File /mnt/models/model1/iris.py exists."}
{"level":"info","ts":1607454258.070524,"logger":"modelWatcher","msg":"Downloaded model."}

This is, however, with only the puller pod deployed by itself. Next I'll try creating a TrainedModel.

func (g *GCSObjectDownloader) Download(client stiface.Client, it stiface.ObjectIterator) error {
var errs []error
// flag to help determine if query prefix returned an empty iterator
var foundObject = false
Contributor Author

I added a flag to determine whether the returned iterator is empty (meaning the storage URI prefix used to search for objects returned nothing). If the flag is still false after iteration, no objects were found and we should log a warning for the user. I did it this way because the object iterator has no way to check its size. That said, I'm not entirely happy with this solution; any suggestions?
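A minimal sketch of that check against the real client API, using the standard iterator sentinel (the bucket and prefix values are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	it := client.Bucket("mskfservingmodels").Objects(ctx, &storage.Query{Prefix: "iris.py"})

	foundObject := false
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("listing objects: %v", err)
		}
		foundObject = true
		fmt.Println("would download", attrs.Name)
	}

	// The iterator exposes no length, so this flag is the only signal that
	// the prefix matched nothing and the user should see a warning.
	if !foundObject {
		log.Printf("warning: no objects found under the given prefix")
	}
}
```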

@mszacillo force-pushed the gcs-support branch 2 times, most recently from d42b53e to 4183643 on January 14, 2021 05:08
@yuzisun (Member) commented on Jan 14, 2021

/retest

2 similar comments
@ifilonenko (Contributor)

/retest

@ifilonenko (Contributor) commented on Jan 14, 2021

/retest

@aws-kf-ci-bot (Contributor)

@ifilonenko: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test kubeflow-kfserving-presubmit

Use /test all to run all jobs.

In response to this:

/retest this PR also might need to be rebased

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ifilonenko (Contributor)

/retest

michael.szacillo added 4 commits January 15, 2021 09:13
Adding in testing

Removing mockapi and using google-cloud-go-testing instead, removing unnecessary methods, cleaning up code

Changing back dockerfile name

Rebasing on master

Fixing import statement

Reverting kfstorage rename to storage, changing gcs import

Fixing import statement

Changing import in test

Combining tests into watcher_test, putting mocks into a testutils package

Removing unnecessary suite run, renaming testutils to mocks

Adding more test cases, accounting for lack of model name in passed in storageURI

Changing iterator retrieval logic

Rebasing and cleaning code
@yuzisun (Member) commented on Jan 15, 2021

/retest

@yuzisun (Member) commented on Jan 15, 2021

@mszacillo Thanks for your awesome contribution!

/lgtm
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mszacillo, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit b8b3584 into kserve:master Jan 15, 2021
abchoo pushed a commit to abchoo/kfserving that referenced this pull request Jan 19, 2021
* Adding in GCS support

Adding in testing

Removing mockapi and using google-cloud-go-testing instead, removing unnecessary methods, cleaning up code

Changing back dockerfile name

Rebasing on master

Fixing import statement

Reverting kfstorage rename to storage, changing gcs import

Fixing import statement

Changing import in test

Combining tests into watcher_test, putting mocks into a testutils package

Removing unnecessary suite run, renaming testutils to mocks

Adding more test cases, accounting for lack of model name in passed in storageURI

Changing iterator retrieval logic

Rebasing and cleaning code

* Returning a warning if queried object doesn't exist in bucket, resolving test

* Rebasing, removing unused import, refactoring

* Adding missing parameter to NewWatcher call

Successfully merging this pull request may close these issues.

Harden Puller logic in the side-car Agent
7 participants