
Support webhdfs in storageURI and storage spec #2077

Merged
merged 9 commits into kserve:master
Jun 4, 2022

Conversation

markwinter
Member

@markwinter markwinter commented Mar 7, 2022

What this PR does / why we need it:

Adds support for hdfs:// in the storageUri, and also in the new storage spec.
It requires a WebHDFS-enabled cluster, and optionally supports Kerberized clusters too.
See the included documentation file for full usage details.

Fixes #2135
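The flow this PR adds can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: `parse_hdfs_uri` and `download_model` are hypothetical names, and the download step assumes the hdfscli (`hdfs`) Python package's `KerberosClient`, which the storage-initializer logs below show in use.

```python
from urllib.parse import urlparse


def parse_hdfs_uri(uri: str) -> str:
    """Map hdfs://user/markwinter/tfmodel-20201003/1 to the HDFS path
    /user/markwinter/tfmodel-20201003/1 (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("hdfs", "webhdfs"):
        raise ValueError(f"unsupported scheme: {parsed.scheme}")
    # In this URI style the first component after the scheme is part of the
    # path rather than a hostname, so re-join netloc and path.
    return "/" + parsed.netloc + parsed.path


def download_model(uri: str, namenode: str, dest: str = "/mnt/models") -> None:
    # Lazy import so the parser stays usable without the hdfs package installed.
    from hdfs.ext.kerberos import KerberosClient  # hdfscli's Kerberos client

    # Assumes a valid Kerberos ticket (e.g. kinit run with the mounted keytab).
    client = KerberosClient(namenode)
    client.download(parse_hdfs_uri(uri), dest, overwrite=True)
```

For example, `parse_hdfs_uri("hdfs://user/markwinter/tfmodel-20201003/1")` yields `/user/markwinter/tfmodel-20201003/1`, matching the paths in the logs below.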

Type of changes

  • New feature (non-breaking change which adds functionality)

Feature/Issue validation/testing:

Tested on a local cluster using storageUri
  1. Create a secret like this and attach it to a service account:
    kubectl create secret generic hdfscreds --from-file=HDFS_KEYTAB=./markwinter.keytab.b64 --from-literal=HDFS_NAMENODE="https://redacted:port" --from-literal=HDFS_PRINCIPAL="markwinter@redacted" -n markwinter

  2. Create an isvc:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tfmodel
spec:
  predictor:
    serviceAccountName: sa  # has hdfscreds secret
    model:
      modelFormat:
        name: tensorflow
      storageUri: "hdfs://user/markwinter/tfmodel-20201003/1"
  3. Check the storage-initializer logs:
❯ kubectl logs tfmodel-predictor-default-00001-deployment-7994c657f9-px5qv storage-initializer -n markwinter
[I 220308 05:17:29 initializer-entrypoint:13] Initializing, args: src_uri [hdfs://user/markwinter/tfmodel-20201003/1] dest_path[ [/mnt/models]
[I 220308 05:17:29 storage:56] Copying contents of hdfs://user/markwinter/tfmodel-20201003 to local
[I 220308 05:17:29 client:192] Instantiated <KerberosClient(url='https://redacted:port')>.
[I 220308 05:17:29 client:745] Downloading '/user/markwinter/tfmodel-20201003/1' to '/mnt/models'.
[I 220308 05:17:29 client:1114] Walking '/user/markwinter/tfmodel-20201003/1' (depth 0).
[I 220308 05:17:29 client:317] Fetching status for '/user/markwinter/tfmodel-20201003/1'.
[I 220308 05:17:29 client:1078] Listing '/user/markwinter/tfmodel-20201003/1'.
[I 220308 05:17:29 client:317] Fetching status for '/user/markwinter/tfmodel-20201003/1'.
[I 220308 05:17:29 client:1078] Listing '/user/markwinter/tfmodel-20201003/1'.
[I 220308 05:17:29 client:1078] Listing '/user/markwinter/tfmodel-20201003/1/variables'.
[I 220308 05:17:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/saved_model.pb'.
[I 220308 05:17:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/variables/variables.data-00000-of-00001'.
[I 220308 05:17:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/variables/variables.index'.
[I 220308 05:17:42 storage:94] Successfully copied hdfs://user/markwinter/tfmodel-20201003/1 to /mnt/models
  4. Check the model server logs:
❯ kubectl logs tfmodel-predictor-default-00002-deployment-cb7bc8df5-cfr5b kserve-container -n markwinter
2022-03-08 05:31:44.366405: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config:  model_name: tfmodel model_base_path: /mnt/models
2022-03-08 05:31:44.366677: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-03-08 05:31:44.366693: I tensorflow_serving/model_servers/server_core.cc:591]  (Re-)adding model: tfmodel
2022-03-08 05:31:44.467044: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: tfmodel version: 1}
2022-03-08 05:31:44.467071: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: tfmodel version: 1}
2022-03-08 05:31:44.467081: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: tfmodel version: 1}
2022-03-08 05:31:44.467120: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /mnt/models/1
2022-03-08 05:31:44.492801: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2022-03-08 05:31:44.492829: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /mnt/models/1
2022-03-08 05:31:44.492936: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-08 05:31:44.633731: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:211] Restoring SavedModel bundle.
2022-03-08 05:31:44.744580: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 139378688 exceeds 10% of free system memory.
2022-03-08 05:31:44.744611: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 139378688 exceeds 10% of free system memory.
2022-03-08 05:31:44.744651: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 139378688 exceeds 10% of free system memory.
2022-03-08 05:31:45.320376: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:283] SavedModel load for tags { serve }; Status: success: OK. Took 853253 microseconds.
2022-03-08 05:31:45.334315: I tensorflow_serving/servables/tensorflow/saved_model_bundle_factory.cc:160] Wrapping session to perform batch processing
2022-03-08 05:31:45.334364: I tensorflow_serving/servables/tensorflow/bundle_factory_util.cc:65] Wrapping session to perform batch processing
2022-03-08 05:31:45.334446: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /mnt/models/1/assets.extra/tf_serving_warmup_requests
2022-03-08 05:31:45.335220: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: tfmodel version: 1}
2022-03-08 05:31:45.407852: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-03-08 05:31:45.407927: I tensorflow_serving/model_servers/server.cc:133] Using InsecureServerCredentials
2022-03-08 05:31:45.407941: I tensorflow_serving/model_servers/server.cc:383] Profiler service is enabled
2022-03-08 05:31:45.409435: I tensorflow_serving/model_servers/server.cc:409] Running gRPC ModelServer at 0.0.0.0:9000 ...
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...
2022-03-08 05:31:45.499813: I tensorflow_serving/model_servers/server.cc:430] Exporting HTTP/REST API at:localhost:8080 ...
Tested on a local cluster using the storage spec
  1. Create the storage-config secret:
apiVersion: v1
kind: Secret
metadata:
  name: storage-config
  namespace: markwinter
type: Opaque
stringData:
  internalhdfs: |
    {
      "type": "hdfs",
      "keytab_b64": "redacted",
      "namenode": "https://redacted:port",
      "principal": "markwinter@redacted"
    }
  2. Create an isvc:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tfmodel
  namespace: markwinter
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storage:
        key: internalhdfs
        path: /user/markwinter/tfmodel-20201003/1
  3. Check the storage-initializer logs:
[I 220308 07:58:29 initializer-entrypoint:13] Initializing, args: src_uri [hdfs://user/markwinter/tfmodel-20201003/1] dest_path[ [/mnt/models]
[I 220308 07:58:29 storage:56] Copying contents of hdfs://user/markwinter/tfmodel-20201003/1 to local
[I 220308 07:58:29 client:192] Instantiated <KerberosClient(url='https://redacted:port')>.
[I 220308 07:58:29 client:745] Downloading '/user/markwinter/tfmodel-20201003/1' to '/mnt/models'.
[I 220308 07:58:29 client:1114] Walking '/user/markwinter/tfmodel-20201003/1' (depth 0).
[I 220308 07:58:29 client:317] Fetching status for '/user/markwinter/tfmodel-20201003/1'.
[I 220308 07:58:29 client:1078] Listing '/user/markwinter/tfmodel-20201003/1'.
[I 220308 07:58:29 client:1078] Listing '/user/markwinter/tfmodel-20201003/1/variables'.
[I 220308 07:58:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/saved_model.pb'.
[I 220308 07:58:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/variables/variables.data-00000-of-00001'.
[I 220308 07:58:29 client:686] Reading file '/user/markwinter/tfmodel-20201003/1/variables/variables.index'.
[I 220308 07:58:40 storage:94] Successfully copied hdfs://user/markwinter/tfmodel-20201003/1 to /mnt/models
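The storage-config entry from step 1 could be consumed roughly like this; this is a sketch only, with `load_hdfs_config` as a hypothetical name, and the actual initializer code may differ. Binary files such as the keytab are base64-encoded because they cannot be embedded raw in JSON:

```python
import base64
import json
import tempfile


def load_hdfs_config(raw: str) -> dict:
    """Parse one storage-config entry and write the keytab back out to a file."""
    conf = json.loads(raw)
    if conf["type"] != "hdfs":
        raise ValueError(f"not an hdfs config: {conf['type']}")
    # keytab_b64 holds base64 in the JSON; decode it back to a file that a
    # tool like kinit could consume.
    with tempfile.NamedTemporaryFile(suffix=".keytab", delete=False) as f:
        f.write(base64.b64decode(conf["keytab_b64"]))
        conf["keytab_path"] = f.name
    return conf


# Demo with placeholder values (the real secret holds redacted credentials).
entry = json.dumps({
    "type": "hdfs",
    "keytab_b64": base64.b64encode(b"fake-keytab-bytes").decode(),
    "namenode": "https://namenode.example:9871",
    "principal": "user@EXAMPLE.COM",
})
conf = load_hdfs_config(entry)
```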

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:

- Add support for kerberized hdfs in storageUri and storage spec

@markwinter markwinter marked this pull request as draft March 7, 2022 07:06
@markwinter markwinter changed the title WIP: Support (web) hdfs in storage-initializer WIP: Support (web) kerberized hdfs in storage-initializer Mar 7, 2022
@markwinter markwinter changed the title WIP: Support (web) kerberized hdfs in storage-initializer WIP: Support kerberized (web)hdfs in storage-initializer Mar 7, 2022
@markwinter
Member Author

markwinter commented Mar 7, 2022

@yuzisun

Currently I have the kerberos and hdfs Python packages in the kserve SDK's requirements.txt. This means all Docker images need the krb5-config package installed on the OS.

Do you think it would be better to only install those packages and krb5-config in storage-initializer.Dockerfile?

I wonder if the storage module should be removed from the kserve SDK, as it only seems to be used by the storage-initializer. That way the storage-initializer dependencies are separate from the kserve SDK dependencies.

Edit: I have now changed it to only install them in storage-initializer.Dockerfile for now

@markwinter
Member Author

markwinter commented Mar 7, 2022

@yuzisun

Edit: Made an issue for the tests #2078

@markwinter
Member Author

markwinter commented Mar 8, 2022

@Tomcli

I updated some of the code relating to the storage spec, namely removing the bucket placeholder from the component annotation so I can support the new hdfs type nicely. I still add the bucket into the URI in the Python storage code. Could you review it, please?

@markwinter markwinter changed the title WIP: Support kerberized (web)hdfs in storage-initializer Support kerberized (web)hdfs in storageURI and storage spec Mar 8, 2022
@markwinter markwinter marked this pull request as ready for review March 8, 2022 07:58
@markwinter
Member Author

markwinter commented Mar 11, 2022

@Tomcli

I've updated it now to hopefully handle all cases.

Instead of splitting the env var handling and bucket injection between the Python code and the controller, I'm now doing it all in the controller, like storageUri does.

Could you take another look please?

@Tomcli
Member

Tomcli commented Mar 11, 2022

Thanks @markwinter. One of the reasons I had to split the logic between the controller and the Python server is that we don't want to expose credentials such as AWSSecretAccessKey as raw strings. Maybe for the S3 case we can read the secret to generate the S3 URI in the controller, and let the Python server handle the other environment variables via a secret mount?
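The secret-mount pattern suggested here keeps credentials out of the pod spec: each key of the Kubernetes secret surfaces as a file under the mount path, which the Python code reads directly. A minimal sketch, where the helper name and mount path are illustrative rather than the PR's actual code:

```python
import os
import tempfile


def read_mounted_credentials(mount_dir: str) -> dict:
    """Read every file in a secret volume mount into a dict keyed by filename."""
    creds = {}
    for name in os.listdir(mount_dir):
        path = os.path.join(mount_dir, name)
        if os.path.isfile(path):  # skips Kubernetes' internal ..data directories
            with open(path) as f:
                creds[name] = f.read().strip()
    return creds


# Demo: emulate a secret mounted at e.g. /var/secrets/hdfscreds.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "HDFS_NAMENODE"), "w") as f:
    f.write("https://namenode.example:9871\n")
creds = read_mounted_credentials(demo_dir)
```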

@markwinter
Member Author

@Tomcli
Oh right, I forgot about that; that's important.
I've updated it as you suggested.
Thanks for reviewing in detail several times! Sorry for so much back and forth

@Tomcli
Member

Tomcli commented Mar 14, 2022

thanks @markwinter
/lgtm

@yuzisun
Member

yuzisun commented Apr 1, 2022

@markwinter Can you rebase from your other PR?

@markwinter
Member Author

@yuzisun
Done 👍

@lizzzcai
Member

lizzzcai commented Apr 6, 2022

Hi @markwinter, just want to check whether this PR supports general WebHDFS with SSL/TLS.

We have the exact same requirement on WebHDFS, but we are not using the KerberosClient. We will need a cert and key, as well as some headers, to download our files through WebHDFS from our remote Hadoop-based storage.

Is it possible to make the WebHDFS support more general if it is not covered in this PR? I can help with it if needed. Thanks.

@markwinter
Member Author

Hey @lizzzcai

This PR is only for Kerberized clusters but it would be good to support other methods of connecting to HDFS as well.

Could you create an issue with the requirements and maybe an example of how you download a file? In this PR I've used hdfscli to download files.

We can take a look at supporting it then.

@lizzzcai
Member

lizzzcai commented Apr 7, 2022

> Hey @lizzzcai
>
> This PR is only for Kerberized clusters but it would be good to support other methods of connecting to HDFS as well.
>
> Could you create an issue with the requirements and maybe an example of how you download a file? In this PR I've used hdfscli to download files.
>
> We can take a look at supporting it then

Hi @markwinter, thanks. I am using the same hdfscli with TLS and extra headers. I have created an issue with an example: #2135. Please take a look and see if it can be supported.

@markwinter
Member Author

This PR now handles general WebHDFS support for storageURI and storage spec

@markwinter markwinter marked this pull request as ready for review April 21, 2022 05:35
@lizzzcai
Member

> @lizzzcai
>
> My PR should now support your use-case (when using storageUri). It would be good if you could give it a try. You will need to build and deploy kserve-controller and update the storage-initializer image (specified in the inferenceservice-config configmap); let me know if you need help with this. https://kserve.github.io/website/0.8/developer/developer/
>
> Documentation: https://github.com/markwinter/kserve/tree/storage-init-hdfs-support/docs/samples/storage/hdfs

Hi @markwinter, I have tried it out and it is working fine for my use cases.

One small piece of feedback: I saw you encode kerberos.keytab with base64. For TLS_CERT, TLS_KEY, and TLS_CA, is it possible to encode them as well (users would have to provide base64-encoded data for these), or to add optional TLS_CERT_BASE64, TLS_KEY_BASE64, and TLS_CA_BASE64 variants? (Not sure which one is better.)

Another thing is the protocol: should we name it webhdfs rather than hdfs, since we are actually using the WebHDFS REST API?

These are just some small suggestions, but overall the code is working on my side. Thanks a lot!

@markwinter
Member Author

markwinter commented Apr 22, 2022

@lizzzcai
Thanks for reviewing and checking it works!

When using storageUri, e.g. storageUri: "hdfs://...", none of the files are base64 encoded anymore. The Kerberos keytab, TLS files, etc. are expected to be given as-is using --from-file=KEY=./file. Before, I encoded the Kerberos keytab to base64 because it was being loaded as an env variable. Now I use a secret volume mount, so all secrets are loaded as files and there is no need for base64 anymore.

When using the storage spec (not yet released), the user has to write a JSON configuration in a secret called storage-config. In this case, all files (keytab, TLS, etc.) must be encoded in base64 so they can be embedded in the JSON config. The storage spec isn't intended to replace the storageUri method above; both will be usable.

> Another thing is for the protocol, should we name it as webhdfs rather than hdfs as we are actually using webhdfs REST API.

Yes, this might make more sense.

@lizzzcai
Member

> @lizzzcai Thanks for reviewing and checking it works!
>
> When using storageUri e.g. storageUri: "hdfs://...", none of the files are base64 encoded anymore. The kerberos keytab, TLS files etc. are expected to be given as-is using --from-file=KEY=./file. Before, I did encode the kerberos keytab to base64 because it was being loaded as an env variable. Now, I use a secret volume mount so all secrets are loaded as files, so no need for base64 anymore.
>
> When using the storage spec (not yet released), the user has to write a json configuration in a secret called storage-config. In this case, all files (keytab, tls etc.) must be encoded in base64 so they can be used in the json config. The storage spec isn't intended to replace the above storageUri method, both will be usable.
>
> Another thing is for the protocol, should we name it as webhdfs rather than hdfs as we are actually using webhdfs REST API.
>
> Yes this might make more sense

Thanks for the clarification. I have tested the storage-config with base64-encoded TLS, and it is working fine as well. Cool.

BTW, I saw that this PR is under KServe 0.9; just want to check whether there is an estimated timeline for releasing 0.9? Thanks.

@markwinter
Member Author

markwinter commented Apr 25, 2022

@lizzzcai

Thanks for testing that method as well 👍

I think 0.9 is planned for mid-May

@yuzisun yuzisun mentioned this pull request May 1, 2022
Signed-off-by: Mark Winter <mark.winter@navercorp.com>
Signed-off-by: Mark Winter <mark.winter@navercorp.com>
Signed-off-by: Mark Winter <mark.winter@navercorp.com>
@aws-kf-ci-bot
Contributor

@markwinter: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
kubeflow-kserve-presubmit 9314a38 link /test kubeflow-kserve-presubmit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@lizzzcai
Member

> @lizzzcai Thanks for reviewing and checking it works!
>
> When using storageUri e.g. storageUri: "hdfs://...", none of the files are base64 encoded anymore. The kerberos keytab, TLS files etc. are expected to be given as-is using --from-file=KEY=./file. Before, I did encode the kerberos keytab to base64 because it was being loaded as an env variable. Now, I use a secret volume mount so all secrets are loaded as files, so no need for base64 anymore.
>
> When using the storage spec (not yet released), the user has to write a json configuration in a secret called storage-config. In this case, all files (keytab, tls etc.) must be encoded in base64 so they can be used in the json config. The storage spec isn't intended to replace the above storageUri method, both will be usable.
>
> Another thing is for the protocol, should we name it as webhdfs rather than hdfs as we are actually using webhdfs REST API.
>
> Yes this might make more sense

Hi @markwinter, for the protocol, can we update it to webhdfs? We have maintained our own version of the storage initializer to support it so far, and are thinking of switching back to the official image once the feature is released; webhdfs will make the transition easier. Thanks.

@markwinter
Member Author

@lizzzcai
Will make sure to do it 👍
Sorry it's taking a while

Signed-off-by: Mark Winter <mark.winter@navercorp.com>
@markwinter
Member Author

markwinter commented Jun 3, 2022

@lizzzcai @yuzisun

For now I have added support for both the hdfs:// and webhdfs:// schemes. They will both use the same Python storage code. 7e9960c

Reasoning: @lizzzcai would prefer the webhdfs:// uri scheme, whilst internally we are using regular hdfs:// paths. I don't think this will cause much of an issue. What do you think?
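Routing both schemes through the same code can be as simple as normalizing the prefix before handing the URI to the shared WebHDFS download path. A hedged sketch with hypothetical names, not the PR's actual implementation:

```python
SUPPORTED_SCHEMES = ("hdfs://", "webhdfs://")


def is_hdfs_uri(uri: str) -> bool:
    """Both hdfs:// and webhdfs:// URIs route to the same WebHDFS download code."""
    return uri.startswith(SUPPORTED_SCHEMES)


def strip_scheme(uri: str) -> str:
    """Drop the scheme prefix and return the absolute HDFS path."""
    for scheme in SUPPORTED_SCHEMES:
        if uri.startswith(scheme):
            return "/" + uri[len(scheme):]
    raise ValueError(f"unsupported uri: {uri}")
```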

@yuzisun
Member

yuzisun commented Jun 4, 2022

Thanks @markwinter ! Awesome work!

/lgtm
/approve

@kserve-oss-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markwinter, yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kserve-oss-bot kserve-oss-bot merged commit d61fc1f into kserve:master Jun 4, 2022
@markwinter markwinter deleted the storage-init-hdfs-support branch June 4, 2022 04:14
alexagriffith pushed a commit to alexagriffith/kserve that referenced this pull request Sep 19, 2022
* hdfs storage support

Signed-off-by: Mark Winter <mark.winter@navercorp.com>

* fix merge

Signed-off-by: Mark Winter <mark.winter@navercorp.com>

* check status to see if its a file or directory

Signed-off-by: Mark Winter <mark.winter@navercorp.com>

* remove unused import

Signed-off-by: Mark Winter <mark.winter@navercorp.com>

* Support webhdfs:// scheme too

Signed-off-by: Mark Winter <mark.winter@navercorp.com>
Signed-off-by: alexagriffith <agriffith96@gmail.com>
Development

Successfully merging this pull request may close these issues.

Support download model through general WebHDFS REST API
6 participants