Kube2Hadoop

Use of Kubernetes has flourished for offline AI workloads. Offline training jobs on Kubernetes, such as TensorFlow or Spark jobs, need secure access to data lakes like HDFS. However, there is a gap between the security models of Kubernetes and Hadoop. Kube2Hadoop bridges this gap by providing a scalable and secure integration between Kubernetes and Kerberized HDFS. Kube2Hadoop consists of three main components:

  1. Hadoop Token Fetcher, which fetches delegation tokens and is deployed as a Kubernetes Deployment.
  2. IDDecorator, which writes the authenticated user ID and is deployed as a Kubernetes Admission Controller.
  3. Kube2Hadoop Init Container, which runs in each worker pod as the client that requests delegation tokens from the Hadoop Token Service.

For more details on how Kube2Hadoop works internally and its authentication mechanism, please read the Kube2Hadoop blog post.

Build and deploy

The Hadoop Token Fetcher is built using Gradle. To build it, run:

./gradlew build

The resulting jar will be located in ./token-fetcher/build/libs/.

You can find sample Kubernetes Deployment YAML files under ./token-fetcher/resources/, and Kubernetes-related service definitions under ./core/src/resources.
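For orientation, a minimal sketch of what a token-fetcher Deployment could look like is shown below. The metadata name, labels, image path, and port are illustrative assumptions; refer to the sample files above for the actual configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hadoop-token-fetcher          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hadoop-token-fetcher
  template:
    metadata:
      labels:
        app: hadoop-token-fetcher
    spec:
      containers:
      - name: token-fetcher
        image: <token-fetcher-image>  # image packaging the jar built above
        ports:
        - containerPort: 8080         # assumed port for token requests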

Visit this page for instructions on deploying the IDDecorator.
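Since the IDDecorator runs as an admission controller that mutates incoming pods, its registration with the API server would typically take the form of a MutatingWebhookConfiguration. The sketch below is illustrative only; the webhook name, Service name, namespace, and path are assumptions, not the project's actual values.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: iddecorator                    # hypothetical name
webhooks:
- name: iddecorator.example.com        # hypothetical webhook name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  clientConfig:
    service:
      name: iddecorator                # assumed Service fronting the webhook
      namespace: kube-system           # assumed namespace
      path: /mutate                    # assumed mutation endpoint
    caBundle: <base64-encoded-ca-cert>
  rules:
  - operations: ["CREATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]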

Usage

Once the iddecorator and token-fetcher services are deployed on your Kubernetes cluster, you can use Kube2Hadoop by adding an init container that launches the following command: ./misc/fetch_delegation_token. The fetched token is placed at the path referenced by $HADOOP_TOKEN_FILE_LOCATION. A sample init-container config:

initContainers:
          - name: tokenfetcher
            image: <init-container-image-path>
            imagePullPolicy: IfNotPresent
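            # Assumption: invoke the fetch script explicitly if the image's
            # entrypoint does not already run it (path per the Usage note above):
            command: ["./misc/fetch_delegation_token"]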
            env:
            - name: K8SNAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            volumeMounts:
            - name: shared-data
              mountPath: "/var/tmp"

Be sure to add an emptyDir volume to the pod for storing the delegation token:

volumes:
    - name: shared-data
      emptyDir:
        sizeLimit: "10Mi"

Finally, set the HADOOP_TOKEN_FILE_LOCATION environment variable in your main container to point at the delegation token, and mount the same volume so the container can access the token written by the init container:

containers:
    - name: <my-main-container>
      image: <my-image-url>
      ...
      env:
      - name: HADOOP_TOKEN_FILE_LOCATION
        value: "/var/tmp/hdfs-delegation-token"
      ...
      volumeMounts:
      - name: shared-data
        mountPath: "/var/tmp"
      ...
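Putting the pieces together, a complete pod spec combining the snippets above might look like the following sketch (image paths are placeholders, and the init container command is the same assumption noted earlier). Hadoop clients in the main container pick up the token automatically via HADOOP_TOKEN_FILE_LOCATION.

apiVersion: v1
kind: Pod
metadata:
  name: kube2hadoop-example
spec:
  initContainers:
  - name: tokenfetcher
    image: <init-container-image-path>
    imagePullPolicy: IfNotPresent
    command: ["./misc/fetch_delegation_token"]   # assumed explicit invocation
    env:
    - name: K8SNAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    volumeMounts:
    - name: shared-data
      mountPath: "/var/tmp"
  containers:
  - name: my-main-container
    image: <my-image-url>
    env:
    - name: HADOOP_TOKEN_FILE_LOCATION
      value: "/var/tmp/hdfs-delegation-token"
    volumeMounts:
    - name: shared-data
      mountPath: "/var/tmp"
  volumes:
  - name: shared-data
    emptyDir:
      sizeLimit: "10Mi"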