PUBDEV-6852 - Kubernetes support #4370

Merged
merged 24 commits into from
Mar 10, 2020
Conversation

Pscheidl
Contributor

@Pscheidl Pscheidl commented Feb 28, 2020

https://0xdata.atlassian.net/browse/PUBDEV-6852

It is recommended to read the README.md introduced in this PR.

Screenshot from 2020-02-29 14-12-26

Works even locally on minikube:

Screenshot from 2020-02-29 21-12-22

Example output of a Pod

02-29 13:09:06.927 10.129.20.65:54321    10     main      INFO: ----- H2O started  -----
02-29 13:09:06.945 10.129.20.65:54321    10     main      INFO: Build git branch: pavel/pubdev-6852
02-29 13:09:06.945 10.129.20.65:54321    10     main      INFO: Build git hash: e6623b22c55e5f43359348efeaac94bfe2f4433d
02-29 13:09:06.945 10.129.20.65:54321    10     main      INFO: Build git describe: jenkins-3.28.0.4-11-ge6623b22c5-dirty
02-29 13:09:06.946 10.129.20.65:54321    10     main      INFO: Build project version: 3.28.0.99999
02-29 13:09:06.946 10.129.20.65:54321    10     main      INFO: Build age: -55 minutes
02-29 13:09:06.946 10.129.20.65:54321    10     main      INFO: Built by: 'pavel'
02-29 13:09:06.946 10.129.20.65:54321    10     main      INFO: Built on: '2020-02-29 14:04:48'
02-29 13:09:06.947 10.129.20.65:54321    10     main      INFO: Found H2O Core extensions: [XGBoost, KrbStandalone]
02-29 13:09:06.948 10.129.20.65:54321    10     main      INFO: Processed H2O arguments: []
02-29 13:09:06.948 10.129.20.65:54321    10     main      INFO: Java availableProcessors: 1
02-29 13:09:06.948 10.129.20.65:54321    10     main      INFO: Java heap totalMemory: 12.9 MB
02-29 13:09:06.949 10.129.20.65:54321    10     main      INFO: Java heap maxMemory: 123.8 MB
02-29 13:09:06.949 10.129.20.65:54321    10     main      INFO: Java version: Java 11.0.6 (from Ubuntu)
02-29 13:09:06.949 10.129.20.65:54321    10     main      INFO: JVM launch parameters: []
02-29 13:09:06.949 10.129.20.65:54321    10     main      INFO: JVM process id: 10@example-0
02-29 13:09:06.950 10.129.20.65:54321    10     main      INFO: OS version: Linux 4.18.0-147.5.1.el8_1.x86_64 (amd64)
02-29 13:09:06.950 10.129.20.65:54321    10     main      INFO: Machine physical memory: 31.41 GB
02-29 13:09:06.950 10.129.20.65:54321    10     main      INFO: Machine locale: en_US
02-29 13:09:06.951 10.129.20.65:54321    10     main      INFO: X-h2o-cluster-id: 1582981745644
02-29 13:09:06.951 10.129.20.65:54321    10     main      INFO: User name: '1002280000'
02-29 13:09:06.951 10.129.20.65:54321    10     main      INFO: IPv6 stack selected: false
02-29 13:09:06.951 10.129.20.65:54321    10     main      INFO: Possible IP Address: eth0 (eth0), fe80:0:0:0:80f0:dcff:fef8:1dcd%eth0
02-29 13:09:06.951 10.129.20.65:54321    10     main      INFO: Possible IP Address: eth0 (eth0), 10.129.20.65
02-29 13:09:06.952 10.129.20.65:54321    10     main      INFO: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%lo
02-29 13:09:06.954 10.129.20.65:54321    10     main      INFO: Possible IP Address: lo (lo), 127.0.0.1
02-29 13:09:06.954 10.129.20.65:54321    10     main      INFO: H2O node running in unencrypted mode.
02-29 13:09:06.956 10.129.20.65:54321    10     main      INFO: Internal communication uses port: 54322
02-29 13:09:06.956 10.129.20.65:54321    10     main      INFO: Listening for HTTP and REST traffic on http://10.129.20.65:54321/
02-29 13:09:06.960 10.129.20.65:54321    10     main      INFO: Initializing H2O Kubernetes cluster
02-29 13:09:07.032 10.129.20.65:54321    10     main      INFO: Timeout for node discovery is set to 120 seconds.
02-29 13:09:07.033 10.129.20.65:54321    10     main      INFO: Desired cluster size is set to 3 nodes.
02-29 13:09:07.120 10.129.20.65:54321    10     main      INFO: New H2O pod with DNS record 'example-0.h2o-service.h2o-statefulset.svc.cluster.local./10.129.20.65' discovered.
02-29 13:09:21.199 10.129.20.65:54321    10     main      INFO: New H2O pod with DNS record 'example-1.h2o-service.h2o-statefulset.svc.cluster.local./10.131.10.165' discovered.
02-29 13:09:52.344 10.129.20.65:54321    10     main      INFO: New H2O pod with DNS record 'example-2.h2o-service.h2o-statefulset.svc.cluster.local./10.130.8.110' discovered.
02-29 13:09:53.345 10.129.20.65:54321    10     main      INFO: Using the following pods to form H2O cluster: [10.131.10.165,10.130.8.110,10.129.20.65]
02-29 13:09:53.346 10.129.20.65:54321    10     main      INFO: Dynamically loaded 'H2OKubernetesEmbeddedConfigProvider' as AbstractEmbeddedH2OConfigProvider.
02-29 13:09:53.352 10.129.20.65:54321    10     main      INFO: H2O cloud name: '1002280000' on /10.129.20.65:54321, discovery address /234.145.114.231:60049
02-29 13:09:53.353 10.129.20.65:54321    10     main      INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
02-29 13:09:53.353 10.129.20.65:54321    10     main      INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 1002280000@10.129.20.65'
02-29 13:09:53.353 10.129.20.65:54321    10     main      INFO:   2. Point your browser to http://localhost:55555
02-29 13:09:54.128 10.129.20.65:54321    10     main      INFO: Log dir: '/tmp/h2o-1002280000/h2ologs'
02-29 13:09:54.128 10.129.20.65:54321    10     main      INFO: Cur dir: '/'
02-29 13:09:54.136 10.129.20.65:54321    10     main      INFO: Subsystem for distributed import from HTTP/HTTPS successfully initialized
02-29 13:09:54.136 10.129.20.65:54321    10     main      INFO: HDFS subsystem successfully initialized
02-29 13:09:54.139 10.129.20.65:54321    10     main      INFO: S3 subsystem successfully initialized
02-29 13:09:54.209 10.129.20.65:54321    10     main      INFO: GCS subsystem successfully initialized
02-29 13:09:54.211 10.129.20.65:54321    10     main      INFO: Flow dir: '//h2oflows'
02-29 13:09:54.237 10.129.20.65:54321    10     main      INFO: Cloud of size 1 formed [example-0.h2o-service.h2o-statefulset.svc.cluster.local/10.129.20.65:54321]
02-29 13:09:54.237 10.129.20.65:54321    10     main      INFO: Created cluster of size 1, leader node IP is 'example-0.h2o-service.h2o-statefulset.svc.cluster.local/10.129.20.65'
02-29 13:09:54.251 10.129.20.65:54321    10     main      INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
02-29 13:09:54.255 10.129.20.65:54321    10     main      INFO: XGBoost extension initialized
02-29 13:09:54.256 10.129.20.65:54321    10     main      INFO: KrbStandalone extension initialized
02-29 13:09:54.312 10.129.20.65:54321    10     main      INFO: Registered 2 core extensions in: 475ms
02-29 13:09:54.313 10.129.20.65:54321    10     main      INFO: Registered H2O core extensions: [XGBoost, KrbStandalone]
02-29 13:09:54.932 10.129.20.65:54321    10     main      INFO: Found XGBoost backend with library: xgboost4j_minimal
02-29 13:09:54.932 10.129.20.65:54321    10     main      WARN: Your system supports only minimal version of XGBoost (no GPUs, no multithreading)!
02-29 13:09:55.233 10.129.20.65:54321    10     main      INFO: Registered: 187 REST APIs in: 920ms
02-29 13:09:55.233 10.129.20.65:54321    10     main      INFO: Registered REST API extensions: [Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4]
02-29 13:09:55.707 10.129.20.65:54321    10     main      INFO: Registered: 279 schemas in 474ms
02-29 13:09:55.707 10.129.20.65:54321    10     main      INFO: H2O started in 50056ms
02-29 13:09:55.707 10.129.20.65:54321    10     main      INFO: 
02-29 13:09:55.708 10.129.20.65:54321    10     main      INFO: Open H2O Flow in your web browser: http://10.129.20.65:54321
02-29 13:09:55.708 10.129.20.65:54321    10     main      INFO: 
02-29 13:09:57.764 10.129.20.65:54321    10     FJ-126-1  INFO: Cloud of size 3 formed [example-0.h2o-service.h2o-statefulset.svc.cluster.local/10.129.20.65:54321, example-2.h2o-service.h2o-statefulset.svc.cluster.local/10.130.8.110:54321, example-1.h2o-service.h2o-statefulset.svc.cluster.local/10.131.10.165:54321]
02-29 13:09:57.765 10.129.20.65:54321    10     FJ-126-1  INFO: Created cluster of size 3, leader node IP is 'example-0.h2o-service.h2o-statefulset.svc.cluster.local/10.129.20.65'
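
The log above shows the discovery loop at work: a 120-second timeout, a desired cluster size of 3, and pods appearing one by one as their DNS records are published by the headless service. The following is a minimal Python sketch of that idea, not the Java implementation from this PR. The function names, the polling interval, and the injectable `resolver`, `clock`, and `sleep` parameters are assumptions made for illustration.

```python
import socket
import time


def resolve_pod_ips(service_dns, resolver=socket.getaddrinfo):
    """Return the set of IPv4 addresses currently published for a headless service."""
    try:
        records = resolver(service_dns, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name may not resolve yet if no pod has been registered.
        return set()
    # Each record is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in records}


def discover_pods(service_dns, desired_size, timeout_s=120.0, poll_s=1.0,
                  resolver=socket.getaddrinfo, clock=time.monotonic, sleep=time.sleep):
    """Poll DNS until desired_size pods have been seen or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    seen = set()
    while True:
        seen |= resolve_pod_ips(service_dns, resolver)
        if len(seen) >= desired_size or clock() >= deadline:
            return sorted(seen)
        sleep(poll_s)
```

Calling `discover_pods("h2o-service.default.svc.cluster.local", 3)` inside a pod would return whatever addresses were seen by the time either condition is met, so the caller still has to decide what to do with a short list.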

@Pscheidl Pscheidl added the WIP label Feb 28, 2020
@Pscheidl Pscheidl force-pushed the pavel/pubdev-6852 branch 2 times, most recently from 17556df to d07eb3d Compare February 28, 2020 12:59
@Pscheidl
Contributor Author

Pscheidl commented Mar 2, 2020

Memory limits should be handled at the JVM level and are the container's responsibility, e.g. java -Xmx... -jar h2o.jar

h2o-k8s/README.md

Exposing the H2O cluster is a responsibility of the Kubernetes administrator. By default, an
[Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/) can be created. Different platforms offer
different capabilities, e.g. OpenShift offers [Routess](https://docs.openshift.com/container-platform/4.3/networking/routes/route-configuration.html).
Contributor

I couldn't find in the documentation a bit where actually h2o is started, am I missing something?

Contributor Author

Thanks, I'll add more explanation. H2O starts as soon as all pods have clustered, which means the headless service has been created and the H2O pods have started. The Ingress only makes H2O accessible from the outside; it may even change without affecting the underlying layers.

You may spawn H2O in one go by doing kubectl apply -f h2o.yaml. The YAML looks like this and it will work on your local minikube as well:

apiVersion: v1
kind: Pod
metadata:
  name: h2o-k8s
  labels:
    app: h2o-k8s
  namespace: default
spec:
  containers:
    - name: h2o-k8s
      image: pscheidl/h2o-k8s
      ports:
        - containerPort: 54321
      env:
      - name: H2O_KUBERNETES_SERVICE_DNS
        value: h2o-service.default.svc.cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: h2o-service
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: h2o-k8s
  ports:
  - protocol: TCP
    port: 54321

Contributor

Nice!

Contributor

@Pscheidl Routess looks like a typo

Contributor Author

Fixed, thanks.

@jakubhava
Contributor

Nice work! I mentioned a few nits and proposals.

@jakubhava
Contributor

One more question: are all nodes accessible on the default port from the outside world? In Sparkling Water we need all nodes to be exposed (just the API port).

A GET request to /3/Cloud returns the IP:port of every H2O node in the cluster. Are these IP:port pairs accessible from the outside? (This is how SW discovers the nodes in the external backend.)

Question 2: Could you please try it with HTTPS? I guess it should work out of the box, but it would be good to mention this in the documentation as well.
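
To make the /3/Cloud question concrete, here is a small Python sketch of how a client might list the node addresses that the endpoint reports. The `nodes` and `ip_port` field names reflect my understanding of the response shape and should be treated as assumptions to verify against a running cluster; `fetch_cloud` and `node_addresses` are hypothetical helper names, not part of any H2O client.

```python
import json
import urllib.request


def node_addresses(cloud_json):
    """Extract the IP:port strings from a parsed /3/Cloud response.

    Assumes each entry in "nodes" carries an "ip_port" field.
    """
    return [node["ip_port"] for node in cloud_json.get("nodes", [])]


def fetch_cloud(base_url, timeout_s=5.0):
    """GET <base_url>/3/Cloud and return the parsed JSON body."""
    url = f"{base_url.rstrip('/')}/3/Cloud"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)
```

A client outside the cluster would call `node_addresses(fetch_cloud("http://<exposed-host>:54321"))` and then try each returned address. Pod IPs such as 10.129.20.65 are normally routable only inside the cluster, which is why per-node exposure matters for the external backend.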

@Pscheidl
Contributor Author

Pscheidl commented Mar 7, 2020

The next step is to add health checks to allow automatic cluster restarts when required (those are turned on and off by the K8S cluster administrator or the orchestration software).

I created a separate JIRA, as this functionality is definitely not required; it's a convenience feature. https://0xdata.atlassian.net/browse/PUBDEV-7359

@Pscheidl
Contributor Author

Pscheidl commented Mar 7, 2020

The REST API clustering could also be a convenient option for some. It has no huge advantages and it requires role bindings to be set up. I've created a JIRA for that as well, to keep track of the task. We can always postpone/decline the JIRA.

https://0xdata.atlassian.net/browse/PUBDEV-7360

Contributor

@jakubhava jakubhava left a comment

As we discussed, this is good for now, but it would be good to add the ability to expose all H2O nodes on their API ports to the outside world (or, if that is the business of Kubernetes, to document how to achieve it).

@Pscheidl
Contributor Author

Pscheidl commented Mar 9, 2020

Note to self: Detect K8S presence in some other way. Sometimes we do NOT want the lookup to be triggered, even if running on K8S.

Member

@bilcus bilcus left a comment

Thanks Pavel!

@Pscheidl
Contributor Author

Commit 9517354 makes it easy to disable this behavior. As the clustering only works with a headless service set up and the environment variable H2O_KUBERNETES_SERVICE_DNS present, the KubernetesEmbeddedConfigProvider simply detects the presence of this environment variable. If it's there, it's a clear signal to do the clustering. If not, it ignores even the very fact that it's running on K8S.

This means freedom for the users. For example, it'd not serve us well in our test pipeline once it starts running inside K8S.
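
The opt-in rule described above can be sketched in a few lines of Python. This is an illustration only, not the Java code of KubernetesEmbeddedConfigProvider: the function name is hypothetical, and treating an empty or blank value as "absent" is an assumption of this sketch that the PR does not confirm.

```python
import os

SERVICE_DNS_ENV = "H2O_KUBERNETES_SERVICE_DNS"


def kubernetes_clustering_enabled(env=None):
    """Return True only when the service DNS variable is set to a non-blank value.

    `env` defaults to the process environment; passing a plain dict makes
    the rule easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    return bool(env.get(SERVICE_DNS_ENV, "").strip())
```

With this rule a test pipeline that happens to run inside K8S stays unaffected as long as it does not set the variable.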

@Pscheidl
Contributor Author

Pscheidl commented Mar 10, 2020

@michalkurka @mn-mikke This is it. Here, on my side, this is the final version, as I can't think of any improvements related to DNS-based clustering. Did a small change today and tested it manually.

Please review.

Contributor

@pkozelka pkozelka left a comment

I could not go deep, but I really like how the code and documentation look.
Consistent naming, enough meaningful javadocs, no new underscore fields - simply, well done!

@Pscheidl
Contributor Author

Thank you @pkozelka .

@Pscheidl Pscheidl merged commit e435d42 into rel-yule Mar 10, 2020
@Pscheidl Pscheidl deleted the pavel/pubdev-6852 branch March 10, 2020 17:52
michalkurka pushed a commit that referenced this pull request Mar 11, 2020
Should only go to master

This reverts commit e435d42.
michalkurka pushed a commit that referenced this pull request Mar 11, 2020
Brings back support for Kubernetes from Pavel