
Expand user group to XArray mailing list #130

Closed
mrocklin opened this issue Feb 23, 2018 · 27 comments

@mrocklin
Member

So far, knowledge of the pangeo.pydata.org website has mostly spread socially. We might also push that spread along a bit once we feel we're ready. I suspect that the next logical group would be the XArray mailing list. What do we want to accomplish before we are comfortable with this?

Some options:

  • Controls to limit a single user taking over the cluster
  • Particular examples for particular use cases
  • Instructions on how to upload data
  • A set of issues to encourage people to start contributing

How much do we care about these? What are other possible blockers that we might care about before releasing to the XArray mailing list?

@jhamman
Member

jhamman commented Feb 23, 2018

@mrocklin - how hard would it be to add some documentation to the JupyterHub landing page? I have shown our deployment to a number of people and, almost universally, I get questions like "what is this?" or "how does this differ from my JupyterHub?". It would be great if, up front, we could tell people that this is running on GCP, that it comes from the Pangeo project (with a link to our website), and explain how this deployment is different (autoscaling, integration with distributed).
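
For what it's worth, a minimal sketch of how that text could be injected, assuming the hub lets us mount a custom templates directory (the path below is an example, and the exact wiring through the helm chart would still need working out):

```python
# Sketch for jupyterhub_config.py: point JupyterHub at a directory of custom
# Jinja templates (e.g. an overridden login.html or page.html) describing the
# Pangeo deployment, linking to the project website, and explaining the
# autoscaling dask integration. The path is an example, not our current setup.
c = get_config()  # provided by JupyterHub when it loads this config file
c.JupyterHub.template_paths = ["/etc/jupyterhub/custom-templates"]
```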

@mrocklin
Member Author

mrocklin commented Feb 23, 2018 via email

@jhamman
Member

jhamman commented Feb 23, 2018

Perhaps @yuvipanda can comment on the potential to do this.

@rabernat
Member

Great points @jhamman. I think it would be great to link to our other project sites and acknowledge our funders on the landing page.

@rabernat
Member

To @mrocklin's list I would add

  • Metrics to track usage statistics over time

Having fine-grained and long-term timeseries could be very valuable in the long run (e.g. when it's time to get more funding).

@mrocklin
Member Author

It has been about a month here. Checking in on this.

It looks like @jhamman has made progress on adding documentation to the landing page.

Most of the things that I've mentioned have not happened:

  • Controls to limit a single user taking over the cluster
  • Instructions on how to upload data
  • A set of issues to encourage people to start contributing

For this item I've added a couple of examples here:

  • Particular examples for particular use cases

I would encourage others to add more. This can be done by submitting a PR that targets the gce/notebook/examples directory. I'll try to dump in a few more generic dask examples.
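
As a point of reference, the sort of generic dask example I have in mind for gce/notebook/examples would look roughly like the sketch below. It assumes the notebook image ships dask_kubernetes and that KubeCluster picks up its worker configuration from the deployment; this is a sketch, not one of the actual notebooks.

```python
# Generic dask example of the sort that could live in gce/notebook/examples.
# Assumes dask, distributed, and dask_kubernetes are installed in the notebook
# image and that KubeCluster is pre-configured by the deployment.
from dask.distributed import Client
from dask_kubernetes import KubeCluster
import dask.array as da

cluster = KubeCluster()   # launch dask workers as Kubernetes pods
cluster.scale(10)         # ask for ten workers
client = Client(cluster)  # connect this notebook to the scheduler

# A toy computation: mean of a large random array, computed in parallel.
x = da.random.random((100_000, 100_000), chunks=(5_000, 5_000))
print(x.mean().compute())
```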

@rabernat's desire to have fine-grained metrics and long-term timeseries also has not happened.

Do we go ahead and expand to the xarray mailing list anyway?

@jhamman
Member

jhamman commented Mar 21, 2018

I'm fine publicizing this as is. I have two talks in the next few weeks where I'll be using or mentioning this deployment, so it would be nice to keep these things moving to whatever extent possible.

@rabernat
Member

@rabernat's desire to have fine-grained metrics and long-term timeseries also has not happened.

It looks like there is already a lot of kubernetes log info dumped into stackdriver. (GCE docs on kubernetes logging.) Browsing through the logs I can see lots of details on how individual users create and delete different resources. This should be sufficient to retroactively create the metrics we want at a later date. So I am satisfied in the sense that the data is fundamentally there; we just haven't analyzed it yet.
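
For when we do get around to that analysis, a rough sketch of pulling those entries programmatically might look like the following, assuming the google-cloud-logging client library and credentials with read access to the project; the filter and project id are placeholders, not the exact values for our deployment.

```python
# Rough sketch: pull Kubernetes log entries out of Stackdriver for offline
# analysis (e.g. per-user pod creation/deletion over time). Assumes the
# google-cloud-logging package and suitable credentials; the project id and
# filter string are illustrative placeholders.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="example-pangeo-project")
entries = client.list_entries(
    filter_='resource.type="container" AND timestamp>="2018-03-01T00:00:00Z"'
)
for entry in entries:
    print(entry.timestamp, entry.payload)
```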

Are we concerned at all about security? What about RBAC (discussion in #167)? I am somewhat paranoid that a user could easily, and even accidentally, delete data that I have worked very hard to upload (see e.g. #150, #166). If @mrocklin and @jacobtomlinson feel that these are not serious concerns, I will defer to their expertise.

I also have a few more example notebooks that I would really like to add to the default docker image. Is the current notebook dockerfile in this repo up to date? If so, I will add my examples today and rebuild the docker image.

@mrocklin
Member Author

Are we concerned at all about security?

I'm concerned about security, but more from a misuse-of-resources perspective than a delete-data perspective. Data is stored on GCS, which is separate from our kubernetes deployment here. You can (and should) view permissions on GCS by going to the cloud console and navigating to the storage tab. I don't think that people we don't know can delete data there, but you shouldn't trust me on this.
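
For anyone who would rather check from Python than click through the console, a sketch along these lines lists the IAM bindings on a bucket (assuming the google-cloud-storage client library and suitable credentials; the bucket name is just an example):

```python
# Sketch: inspect who can do what on a GCS bucket rather than relying on
# anyone's recollection. Assumes google-cloud-storage and credentials with
# permission to read the bucket's IAM policy; the bucket name is an example.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-pangeo-data")
policy = bucket.get_iam_policy()
for role, members in policy.items():
    print(role, sorted(members))
```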

What about RBAC

Yeah, this would be great. Unfortunately it looks like no one is doing it.

If @mrocklin and @jacobtomlinson feel that these are not serious concerns, I will defer to their expertise.

I am not an expert on cloud deployments and you should not trust me. I do not accept this responsibility and instead push it back up to you as PI.

I also have a few more example notebooks that I would really like to add to the default docker image. Is the current notebook dockerfile in this repo up to date? If so, I will add my examples today and rebuild the docker image.

I'll be rebuilding the image sometime in the next day or two. If you just push things to the gce/notebook/examples directory, they'll end up getting included in the next build.

@rabernat
Member

@tjcrone: would it be feasible for you to apply what you have learned about RBAC to pangeo.pydata.org? It sounds like this is an important security feature we should have enabled, but few people around here have the necessary expertise to make it work.

@jacobtomlinson
Member

I have some large concerns about the security of the current platform. The two main issues are down to RBAC not being enabled and the notebooks being run in privileged containers.

If data on GCS is writable by any single user on the cluster then it technically is writable by anyone via privilege escalation.

It would also be reasonably trivial to begin crypto mining or other such activity on this platform; this is unavoidable, as the whole platform is intended to allow people to execute arbitrary code. Perhaps some monitoring of resources would be useful so maintainers can be notified of large-scale usage.

@rabernat
Member

@jacobtomlinson I hear you! The problem is that we are having a hard time finding someone with the necessary expertise to actually fix these issues.

Is there any chance you would have some time to take on the RBAC issue? It sounds like you have already implemented this in your own deployments, so perhaps it would not be too heavy a burden to enable it for pangeo.pydata.org. We would be sincerely grateful.

As for crypto mining and other misuse, I'm slightly less concerned about that right now. How does mybinder deal with that question? Hopefully we can eventually find a way to use the stackdriver logs for both retroactive analysis and realtime monitoring of resource usage.

@rabernat
Member

We need to resolve #176 before we can share with the XArray mailing list. It looks like this will require a new version of gcsfs.

@mrocklin
Member Author

mrocklin commented Mar 22, 2018 via email

@yuvipanda
Member

Currently, you should assume that anyone with any access to this hub can impersonate anyone else (covertly or not) and do anything they want with it. As others have already mentioned, this is primarily because RBAC is not enabled and the notebooks run in privileged containers.

I'm unfortunately swamped with a course launch happening in the first week of April here in Berkeley, and will not be able to help at least until sometime after mid-April :( Am happy to answer specific questions people have in the meantime though!

@mrocklin
Member Author

mrocklin commented Mar 22, 2018 via email

@jhamman
Member

jhamman commented Mar 22, 2018

People don't seem to be using fuse at the moment.

I have been using FUSE recently to provide read access to non-zarr file types. I may not be following, but what is the current motivation for tearing down FUSE?

@mrocklin
Member Author

mrocklin commented Mar 23, 2018 via email

@jacobtomlinson
Member

jacobtomlinson commented Mar 23, 2018

I've been in and out of the office the last few days for personal reasons, but it looks like @tjcrone has done a good job of the RBAC work in #172.

We don't use privileged containers and we give FUSE access using our S3 FUSE flex volume drivers. These are based on @yuvipanda's NFS flex volume driver.

It does make some assumptions about the cluster (for example, that the nodes are running Debian Jessie) and it installs packages on the hosts; I'm unsure whether that's possible on your GCE clusters. If it is possible, then we could definitely adapt this package to run on your cluster. If the use of golang puts you off, that is also easy to change to something else (but that something needs to be available on the host, which is why golang static binaries are preferable).

@mrocklin
Member Author

mrocklin commented Mar 23, 2018 via email

@rabernat
Member

Where do we stand on this? Does FUSE still work with #172 merged?

Once we have the cluster re-deployed with RBAC, I think we are ready to share with the xarray mailing list.

@mrocklin
Member Author

mrocklin commented Mar 26, 2018 via email

@rabernat
Member

Someone implements FUSE more cleanly using FlexVolumes. UK Met has a fairly clean example doing this that they use in production.

This is obviously the best choice. Who else might be qualified to do that? We are already leaning on @jacobtomlinson pretty hard...

@jacobtomlinson
Member

jacobtomlinson commented Mar 26, 2018

I don't know if I will have time to look at implementing this for GCE, but I would be keen for my flex volume driver to be expanded and made more generic. There is no reason why it has to be S3 specific, we could make it an "object store FUSE flex driver" instead.

We could add more FUSE applications and create drivers for them; for example, we could add gcsfuse.

This requires changes to two sections:

The first installs system requirements on the kubernetes nodes themselves. This is currently done here using a privileged container and assumes the node is running Debian. This could be abstracted out into a custom container with logic along the lines of "if <distro>, install with <x> method", but if the default GCE container clusters use Debian then we could always address this at a later stage. To add gcsfuse we would need to add it to this step, following its install instructions.

The second step builds the (currently two) flex volume drivers and drops them onto the node. The drivers are golang CLI applications which wrap the FUSE CLI applications to conform to the Kubernetes flex volume API. They are packaged as a docker image: the binaries are built when you build the image and copied to a volume when you run the container. It is reasonably straightforward to copy one of the existing go applications and adapt it to the gcsfuse command line application.

The drivers are written in golang for portability; however, you could probably write a short shell script that achieves the same task. The only blocker is that it requires some JSON munging, which would probably require jq.
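
To make that contract concrete, here is an illustrative sketch, in Python purely for readability rather than golang or shell, of the call pattern a flex volume driver has to satisfy: kubelet invokes the executable with init / mount / unmount and reads a JSON status object from stdout. The gcsfuse invocation and the "bucket" option name are assumptions for illustration, not how the Met Office driver actually behaves.

```python
#!/usr/bin/env python
# Illustrative sketch of the FlexVolume driver contract, not the actual driver:
# kubelet calls the executable with "init", "mount <dir> <json options>", or
# "unmount <dir>", and expects a JSON status object on stdout.
import json
import subprocess
import sys


def respond(status, message=""):
    print(json.dumps({"status": status, "message": message}))
    sys.exit(0)


if __name__ == "__main__":
    op = sys.argv[1] if len(sys.argv) > 1 else ""

    if op == "init":
        # Tell kubelet this driver does not implement attach/detach.
        print(json.dumps({"status": "Success", "capabilities": {"attach": False}}))
    elif op == "mount":
        mount_dir, options = sys.argv[2], json.loads(sys.argv[3])
        bucket = options.get("bucket", "")  # option name is an assumption
        rc = subprocess.call(["gcsfuse", bucket, mount_dir])
        respond("Success" if rc == 0 else "Failure", "gcsfuse exited %d" % rc)
    elif op == "unmount":
        rc = subprocess.call(["fusermount", "-u", sys.argv[2]])
        respond("Success" if rc == 0 else "Failure")
    else:
        respond("Not supported")
```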

Once you have installed the helm chart, and therefore the FUSE applications and drivers, you can use them (or not) in any pod on your cluster. So an AWS cluster would have the gcsfuse driver available; you just wouldn't use it on your Pangeo, as you would get charged for data transit, and vice versa for GCE clusters.

@yuvipanda
Member

#190 has a way forward on the FUSE issue. I no longer use the NFS flex volume driver I built; I instead use the approach mentioned in that issue.

@stale

stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 25, 2018
@stale

stale bot commented Jul 2, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jul 2, 2018