Cannot mount volume in ContainerOp #477

Closed
StefanoFioravanzo opened this issue Dec 5, 2018 · 14 comments

@StefanoFioravanzo (Member)

[Running Pipelines on Minikube]
I am trying to mount a folder from the Minikube VM into the containers of my pipeline.

I have a data_processing folder in the Minikube VM that I want to be accessible by the Pipelines containers:

$> minikube ssh ls /data_processing
data.csv

I created a PersistentVolume using:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-processing
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 5Gi
  hostPath:
    path: /data_processing

I tried to test the mount in a Pipelines Lightweight component:

def data_loader():
    import pandas as pd
    data = pd.read_csv('/data_processing/data.csv', sep=',', header=None)
    print(data)

import kfp.components as comp

data_loading_op = comp.func_to_container_op(data_loader, base_image='tensorflow/tensorflow:1.11.0-py3')

import kfp.dsl as dsl
from kubernetes import client as k8s_client
@dsl.pipeline(
   name='DataLoading Pipeline',
   description='Test.'
)
def phenology_pipeline():
    data_loading_task = data_loading_op(data_path).add_volume(k8s_client.V1VolumeMount(
          mount_path='/data_processing',
          name='data-processing'))

# ...

But the folder cannot be found. Any ideas on how to solve this?

@gaoning777 (Contributor)

@IronPan
To mount a volume, one needs to call both add_volume and add_volume_mount. Refer to here for an example.
add_volume specifies the volume itself (for example, its type, such as a Kubernetes secret), and add_volume_mount specifies the mount path, etc.

@StefanoFioravanzo (Member, Author)

@gaoning777 Using the following code:

import kfp.dsl as dsl
from kubernetes import client as k8s_client
@dsl.pipeline(
   name='DataLoading Pipeline',
   description='Test.'
)
def phenology_pipeline():
    data_loading_task = data_loading_op(data_path) \
                            .add_volume(k8s_client.V1Volume(name='data-processing')) \
                            .add_volume_mount(k8s_client.V1VolumeMount(
                                          mount_path='/data_processing',
                                          name='data-processing'))

This worked in that a /data_processing folder is mounted in the containers, but the folder is empty (I expected it to contain the data.csv file). I guess it is mounting an empty volume regardless of my PersistentVolume? Or am I missing something to map the PersistentVolume to the real /data_processing folder in the Minikube VM?

My PersistentVolume details:

$> kubectl describe pv data-processing
Name:            data-processing
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller=yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    manual
Status:          Bound
Claim:           kubeflow/data-processing-claim
Reclaim Policy:  Retain
Access Modes:    RWO
Capacity:        5Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data_processing
    HostPathType:
Events:            <none>
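
(For reference, a claim matching this describe output would look roughly like the manifest below; the actual PVC is not shown in the thread, so this is a reconstruction from the Claim, StorageClass, Access Modes, and Capacity fields above.)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-processing-claim
  namespace: kubeflow
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi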

@IronPan (Member) commented Dec 5, 2018

Have you tried adding the hostPath to the volume definition in the DSL?

.add_volume(k8s_client.V1Volume(name='data-processing', host_path=k8s_client.V1HostPathVolumeSource(path='/data_processing')))

@StefanoFioravanzo (Member, Author)

@IronPan Thanks, that did the trick!

Anyway, are there any plans to better support volume mounts in Container Ops, maybe in some declarative way?
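
(Putting the snippets from this thread together, the full working hostPath mount looks like the following sketch; data_loading_op and data_path are from the earlier code.)

from kubernetes import client as k8s_client

# Declare the pod-level volume backed by the Minikube host directory,
# then mount it into the step's container at /data_processing.
data_loading_task = data_loading_op(data_path) \
    .add_volume(k8s_client.V1Volume(
        name='data-processing',
        host_path=k8s_client.V1HostPathVolumeSource(path='/data_processing'))) \
    .add_volume_mount(k8s_client.V1VolumeMount(
        mount_path='/data_processing',
        name='data-processing'))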

@vkoukis (Member) commented Dec 5, 2018

Anyway, are there any plans to better support volume mounts in Container Ops, maybe in some declarative way?

+1!
I have not worked extensively with Kubeflow Pipelines specifically, but it definitely makes sense for Pipelines to be able to use K8s-native storage concepts, e.g., PVCs.

For context, here is the work we have contributed to in Kubeflow, initially for notebooks:
kubeflow/kubeflow#34
and the PR:
kubeflow/kubeflow#1918

@IronPan We -- Arrikto -- would be more than willing to contribute effort in getting Kubeflow Pipelines to use the native K8s storage resources seamlessly.

And in general, we would definitely like to contribute more in getting Pipelines to be able to expose the characteristics of the underlying native K8s resources [e.g., pod spec] easily.

@yebrahim (Contributor) commented Dec 7, 2018

/assign ark-kun

@IronPan (Member) commented Dec 7, 2018

Anyway, are there any plans to better support volume mounts in Container Ops, maybe in some declarative way?

This makes sense, especially for those who are familiar with the K8s YAML paradigm. @Ark-kun I think this aligns with your idea. Any thoughts?

Pipelines to be able to use K8s-native storage concepts, e.g., PVCs.

The container op supports mounting volumes, so it should already support PVCs.
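
(A sketch of mounting the PVC from earlier in this thread with the same add_volume/add_volume_mount pattern; the claim name comes from the kubectl describe output above.)

from kubernetes import client as k8s_client

# Reference the existing PersistentVolumeClaim instead of a hostPath,
# then mount it into the step's container as before.
data_loading_task = data_loading_op(data_path) \
    .add_volume(k8s_client.V1Volume(
        name='data-processing',
        persistent_volume_claim=k8s_client.V1PersistentVolumeClaimVolumeSource(
            claim_name='data-processing-claim'))) \
    .add_volume_mount(k8s_client.V1VolumeMount(
        mount_path='/data_processing',
        name='data-processing'))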

be able to expose the characteristics of the underlying native K8s resources [e.g., pod spec] easily.

Pipelines uses Argo as the underlying orchestrator, and not every pod spec field makes sense for the orchestrator, so I think we should be a bit cautious about which pod API fields to expose. I would like to treat this case by case. Do you have specific features you want supported?

@Ark-kun (Contributor) commented Dec 7, 2018

@StefanoFioravanzo

better support volume mounts in Container Ops

@vkoukis

use the native K8s storage resources seamlessly

Can you elaborate more? We're already using the full native K8s types to specify the volumes and volumeMounts. K8s has a pretty extensive volume API. Do you want us to make some subset of the K8s APIs easier to specify, instead of writing a K8s-style spec?

In any case, we're working on both expanding our support for k8s APIs and adding ways to simplify the pipeline author's job. See gcp.set_tpu_resource or gcp.use_gcp_secret for instance.
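
(For instance, a minimal sketch of how use_gcp_secret is typically applied to a task; the secret name 'user-gcp-sa' is the conventional default, not something stated in this thread.)

from kfp import gcp

# Apply the helper to an existing op; it mounts the GCP service-account
# secret into the step's container for you.
data_loading_task = data_loading_op(data_path).apply(gcp.use_gcp_secret('user-gcp-sa'))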

@Ark-kun (Contributor) commented Dec 7, 2018

expose the characteristics of the underlying native K8s resources [e.g., pod spec] easily

We're working on some improvements here, but we're currently limited by the parts of the K8s API that Argo supports. While Argo supports the full K8s Container spec, it does not fully support the Pod spec.
Argo currently has some support for:

  • ActiveDeadlineSeconds
  • Affinity
  • Metadata (Already implemented)
  • NodeSelector
  • Parallelism (Argo only, not Pod spec)
  • RetryStrategy
  • Tolerations
  • Volumes (global: on Workflow level; Already implemented)

Which of those do you need for your pipelines?

What would you like the API to look like?
Do you find it confusing to have both Container properties and Pod properties in the same place?
Would you like Pod properties to have a pod_ prefix so they are easy to identify (e.g. train_op(...).set_pod_retry_strategy(...))?

@StefanoFioravanzo (Member, Author)

@Ark-kun You are right, the support for K8s-native types to specify Volumes and VolumeMounts is there. My comment was about an easier way to specify volume mounts, and the gcp.use_gcp_secret example is exactly what I meant.

@IronPan Now that I have managed to run a custom pipeline using local volume mounts, I would also like to read from a GCP bucket (I am still running on Minikube). As far as I understand, Kubeflow creates its own service account when deployed on a GCP cluster. Missing that, I would need to create my own service account and then create a K8s secret from it.

But then how should I mount the secret in the container?
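
(One generic way is the same add_volume/add_volume_mount pattern used above, plus an environment variable pointing at the key file; the secret name 'user-gcp-sa' and key file 'user-gcp-sa.json' below are illustrative, not from this thread.)

from kubernetes import client as k8s_client

# Mount the K8s secret holding the service-account key into the container,
# and point GOOGLE_APPLICATION_CREDENTIALS at the key file so GCP client
# libraries pick it up automatically.
data_loading_task = data_loading_op(data_path) \
    .add_volume(k8s_client.V1Volume(
        name='gcp-credentials',
        secret=k8s_client.V1SecretVolumeSource(secret_name='user-gcp-sa'))) \
    .add_volume_mount(k8s_client.V1VolumeMount(
        mount_path='/secret/gcp-credentials',
        name='gcp-credentials')) \
    .add_env_variable(k8s_client.V1EnvVar(
        name='GOOGLE_APPLICATION_CREDENTIALS',
        value='/secret/gcp-credentials/user-gcp-sa.json'))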

@jlewi (Contributor) commented Dec 13, 2018

@Ark-kun Why not let pipelines just orchestrate K8s objects so users have full access to the K8s APIs rather than trying to figure out which fields to expose?

@Ark-kun (Contributor) commented Dec 18, 2018

@Ark-kun Why not let pipelines just orchestrate K8s objects so users have full access to the K8s APIs rather than trying to figure out which fields to expose?

Argo does not allow full access to the K8s APIs. For instance, Argo has support for the Container spec, but not for the Pod or Job spec.

Should we expose Argo-specific functionality which differs from K8s model?
Should we expose K8s functionality that cannot be executed in Argo?

@Ark-kun (Contributor) commented Dec 18, 2018

Also, some features that make sense for raw K8s network services are not useful when running finite tasks, and exposing them prevents the pipeline system from adding value for the user. Examples of features that a pipeline system can add on top of K8s: caching, reproducibility, data provenance, security.

E.g. if we allow any Pod to have privileged access, we cannot guarantee security or data consistency. The same applies if we allow Pods to freely modify the Pipelines system databases.

Nevertheless, we strive to have maximum parity with the K8s API as allowed by Argo.

@zoux86 commented Jan 20, 2019

thanks
