Expand user group to XArray mailing list #130

mrocklin · 2018-02-23T00:17:38Z

So far knowledge of the pangeo.pydata.org website has mostly been communicated in a social fashion. We might also force this spread a bit once we feel we're ready. I suspect that the next logical group would be the XArray mailing list. What do we want to accomplish before we are comfortable with this?

Some options:

Controls to limit a single user taking over the cluster
Particular examples for particular use cases
Instructions on how to upload data
A set of issues to encourage people to start contributing

How much do we care about these? What are other possible blockers that we might care about before releasing to the XArray mailing list?

jhamman · 2018-02-23T19:08:44Z

@mrocklin - how hard would it be to add some documentation on the JupyterHub landing page. I have showed our deployment to a number of people and almost universally, I get questions like: "what is this?" or "how does this differ from my JupyterHub?". It would be great if up front, we could tell people that this is running on GCP, it comes from the pangeo project (give a link to our website), and explain how this deployment is different (autoscales, integration with distributed).

mrocklin · 2018-02-23T19:15:27Z

I don't know, but I agree that it would be a good idea.

…

On Fri, Feb 23, 2018 at 2:08 PM, Joe Hamman ***@***.***> wrote: @mrocklin <https://github.com/mrocklin> - how hard would it be to add some documentation on the JupyterHub landing page. I have showed our deployment to a number of people and almost universally, I get questions like: "what is this?" or "how does this differ from my JupyterHub?". It would be great if up front, we could tell people that this is running on GCP, it comes from the pangeo project (give a link to our website), and explain how this deployment is different (autoscales, integration with distributed). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszFq4gBRdp-I6UuT9bw8Aj7KicNLqks5tXwy_gaJpZM4SQMqe> .

jhamman · 2018-02-23T19:17:34Z

Perhaps @yuvipanda can comment on the potential to do this.

rabernat · 2018-02-23T19:31:13Z

Great points @jhamman. I think it would be great to link to our other project sites and acknowledge our funders on the landing page.

rabernat · 2018-02-23T20:11:36Z

To @mrocklin's list I would add

Metrics to track usage statistics over time

Having fine-grained and long-term timeseries could be very valuable in the long run (e.g. when it's time to get more funding.

mrocklin · 2018-03-20T19:56:31Z

It has been about a month here. Checking in on this.

It looks like @jhamman has made progress on adding documentation to the landing page.

Most of the things that I've mentioned have not happened:

Controls to limit a single user taking over the cluster
Instructions on how to upload data
A set of issues to encourage people to start contributing

I've added a couple examples here for this item.

Particular examples for particular use cases

I would encourage others to add more. This can be done by submitting a PR that targets the gce/notebook/examples directory. I'll try to dump in a few more generic dask examples.

@rabernat 's desire to have fine-grained metrocs and long term timeseries also has not happened.

Do we go ahead an expand to the xarray mailing list anyway?

jhamman · 2018-03-21T04:40:07Z

I'm fine publicizing this as is. I have two talks in the next few weeks where I'll be using or mentioning this deployment so it be nice to keep these things moving to whatever extent possible.

rabernat · 2018-03-21T12:50:31Z

@rabernat 's desire to have fine-grained metrocs and long term timeseries also has not happened.

It looks like there is already a lot of kubernetes log info dumped into stackdriver. (GCE docs on kubernetes logging.) Browsing through the logs I can see lots of details on how individual users create and delete different resources. This should be sufficient to retroactively create the metrics we want at a later date. So I am satisfied in the sense that the data is fundamentally there; we just haven't analyzed it yet.

Are we concerned at all about security? What about RBAC (disusssion in #167)? I am somewhat paranoid that a user could easily and even accidentally delete data that I have worked very hard to upload (see e.g. #150, #166). If @mrocklin and @jacobtomlinson feel that these are not serious concerns, I will defer to their expertise.

I also have a few more example notebooks that I would really like add to the default docker image. Is the current notebook dockerfile in this repo up to date? If so I will add my examples today and rebuild the docker image.

mrocklin · 2018-03-21T12:59:48Z

Are we concerned at all about security?

I'm concerned about security, but more from a misuse of resources perspective than a delete-data perspective. Data is stored on GCS, which is separate from our kubernetes deployment here. You can (and should) view permissions on GCS by going to the cloud console and navigating to the storage tab. I don't think that people we don't know can delete data there, but you shouldn't trust me on this,

What about RBAC

Yeah, this would be great. Unfortunately it looks like no one is doing it.

If @mrocklin and @jacobtomlinson feel that these are not serious concerns, I will defer to their expertise.

I am not an expert on cloud deployments and you should not trust me. I do not accept this responsibility and instead push it back up to you as PI.

I also have a few more example notebooks that I would really like add to the default docker image. Is the current notebook dockerfile in this repo up to date? If so I will add my examples today and rebuild the docker image.

I'll be rebuilding the image sometime in the next day or two. If you just push things to the gce/notebook/examples directory they'll end up getting included on the next build

rabernat · 2018-03-21T13:08:16Z

@tjcrone: would it be feasible for you to apply what you have learned about RBAC to pangeo.pydata.org? It sounds like this is an important security feature we should have enabled, but few people around here have the necessary expertise to make it work.

jacobtomlinson · 2018-03-21T13:49:21Z

I have some large concerns about the security of the current platform. The two main issues are down to RBAC not being enabled and the notebooks being run in privileged containers.

If data on GCS is writable by any single user on the cluster then it technically is writable by anyone via privilege escalation.

It would also be reasonable trivial to begin crypto mining or other things on this platform, this is unavoidable as the whole platform is intended to allow people to execute arbitrary code. Perhaps some monitoring of resources would be useful so maintainers can be notified of large scale usage.

rabernat · 2018-03-21T13:52:56Z

@jacobtomlinson I hear you! The problem is that we are having a hard time finding someone with the necessary expertise to actually fix these issues.

Is there any chance you would have some time to take on the RBAC issue? It sounds like you have already implemented this in your own deployments, so perhaps it would not be too heavy a burden to enable it for pangeo.pydata.org. We would be sincerely grateful.

As for crypto mining and other misuse, I'm slightly less concerned about that right now. How does mybinder deal with that question? Hopefully we can eventually find a way to use the stackdriver logs for both retroactive analysis and realtime monitoring of resource usage.

rabernat · 2018-03-22T17:21:00Z

We need to resolve #176 before we can share with the XArray mailing list. It looks like this will require a new version of gcsfs.

mrocklin · 2018-03-22T17:30:59Z

We're already running on a branch. This just requires someone to make a PR with an appropriate fix. It would be good to get more hands in that project if anyone is around. This sounds like a perfect first PR opportunity.

…

On Thu, Mar 22, 2018 at 1:21 PM, Ryan Abernathey ***@***.***> wrote: We need to resolve #176 <#176> before we can share with the XArray mailing list. It looks like this will require a new version of gcsfs. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszO_92mpfPoTrCmzPq86yXVMJi_izks5tg919gaJpZM4SQMqe> .

yuvipanda · 2018-03-22T19:23:10Z

Currently, you should assume that anyone with any access to this hub can impersonate anyone else (covertly or not) and do anything they want with it. As others have already mentioned, this is primarily because of RBAC and using Privileged containers.

I'm unfortunately swamped with a course launch happening in the first week of April here in Berkeley, and will not be able to help at least until sometime after mid-April :( Am happy to answer specific questions people have in the meantime though!

mrocklin · 2018-03-22T19:26:14Z

People don't seem to be using fuse at the moment. So we can tear that down. Hopefull with @tjcrone's RBAC implementatation that might resolve all of the critical issues?

…

On Thu, Mar 22, 2018 at 3:23 PM, Yuvi Panda ***@***.***> wrote: Currently, you should assume that anyone with any access to this hub can impersonate anyone else (covertly or not) and do anything they want with it. As others have already mentioned, this is primarily because of RBAC and using Privileged containers. I'm unfortunately swamped with a course launch happening in the first week of April here in Berkeley, and will not be able to help at least until sometime after mid-April :( Am happy to answer specific questions people have in the meantime though! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszFKen6koFenb5Nap4bLLYByTb4UVks5tg_ofgaJpZM4SQMqe> .

jhamman · 2018-03-22T23:47:37Z

People don't seem to be using fuse at the moment.

I have been using fuse recently to provide read access to non-zarr filetypes. I may not be following but what is the current motivation for tearing down fuse?

mrocklin · 2018-03-23T01:26:28Z

As we currently have it configured FUSE is implemented with elevated permissions on the docker containers. Tearing it down would increase security. We could also mount FUSE using Kubernetes FlexVolumes. I think that the UK Met office solution does this currently with S3

…

On Thu, Mar 22, 2018 at 7:47 PM, Joe Hamman ***@***.***> wrote: People don't seem to be using fuse at the moment. I have been using fuse recently to provide read access to non-zarr filetypes. I may not be following but what is the current motivation for tearing down fuse? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszDuJJTJH9E5L3YkIByJjbx1qp-ALks5thDgdgaJpZM4SQMqe> .

jacobtomlinson · 2018-03-23T09:52:15Z

I've been in and out of the office the last few days for personal reasons. But it looks like @tjcrone has done is good job of RBAC stuff in #172.

We don't use privileged containers and we give FUSE access using our S3 FUSE flex volume drivers. These are based on @yuvipanda's NFS flex volume driver.

It does make some assumptions about the cluster (it's running ~~ubuntu~~ debian jesse for example) and it installs packages on the hosts, which I'm unsure whether that's possible on your GCE clusters. If it is possible then we could definitely adapt this package to run on your cluster, if the use of golang puts you off that is also easy to change to something else (but that thing needs to be available on the host, hence golang static binaries being preferable).

mrocklin · 2018-03-23T11:54:50Z

My guess is that people will not be picky :) We're sufficiently constrained by expertise and hours that language preferences seem extravagant :)

…

On Fri, Mar 23, 2018 at 5:52 AM, Jacob Tomlinson ***@***.***> wrote: I've been in and out of the office the last few days for personal reasons. But it looks like @tjcrone <https://github.com/tjcrone> has done is good job of RBAC stuff in #172 <#172> . We don't use privileged containers and we give FUSE access using our S3 FUSE flex volume drivers <https://github.com/met-office-lab/s3-fuse-flex-volume>. These are based on @yuvipanda <https://github.com/yuvipanda>'s NFS flex volume driver <https://github.com/yuvipanda/nfs-flex-volume>. It does make some assumptions about the cluster (it's running ubuntu for example) and it installs packages on the hosts, which I'm unsure whether that's possible on your GCE clusters. If it is possible then we could definitely adapt this package to run on your cluster, if the use of golang puts you off that is also easy to change to something else (but that thing needs to be available on the host). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszE9xkzwvwm_H03rfMzNCGj7bTrsnks5thMXQgaJpZM4SQMqe> .

rabernat · 2018-03-26T14:05:50Z

Where do we stand on this. Does FUSE still work with #172 merged?

Once we have the cluster re-deployed with RBAC, I think we are ready to share with the xarray mailing list.

mrocklin · 2018-03-26T14:46:28Z

According to @jacobtomlinson FUSE still represents a serious security issue. @jhamman still finds fuse useful. I see three options: 1. Someone implements FUSE more cleanly using FlexVolumes. UK Met has a fairly clean example doing this that they use in production. 2. We remove FUSE 3. We publish anyway

…

On Mon, Mar 26, 2018 at 10:05 AM, Ryan Abernathey ***@***.***> wrote: Where do we stand on this. Does FUSE still work with #172 <#172> merged? Once we have the cluster re-deployed with RBAC, I think we are ready to share with the xarray mailing list. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#130 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AASszKAg61QnHmx0qGqOT4yahmGUWuuUks5tiPW_gaJpZM4SQMqe> .

rabernat · 2018-03-26T15:22:48Z

Someone implements FUSE more cleanly using FlexVolumes. UK Met has a
fairly clean example doing this that they use in production.

This is obviously the best choice. Who else might be qualified to do that? We are already leaning on @jacobtomlinson pretty hard...

jacobtomlinson · 2018-03-26T15:35:09Z

I don't know if I will have time to look at implementing this for GCE, but I would be keen for my flex volume driver to be expanded and made more generic. There is no reason why it has to be S3 specific, we could make it an "object store FUSE flex driver" instead.

We could add more FUSE applications and create drivers for them, for example we could add gcsfuse.

This requires changes to two sections:

The first installs system requirements on the kubernetes nodes themselves. This is currently done here using a privileged container and assumes the node is running Debian. This could be abstracted out into a custom container with logic for if <distro> install with <x> method, but if the default GCE container clusters use Debian then we could always address this at a later stage. To add gcsfuse we would need to add it, following the install instructions, to this step.

The second step builds and drops the, currently two, flex volume drivers onto the node. The drivers are golang cli applications which wrap the FUSE cli applications to conform to the Kubernetes flex volume API. They are packaged as a docker image, which builds the binaries when you build the docker image and then copies them to a volume when you run the container. It is reasonably straight forward to copy one of the existing go applications and adapt it to the gcsfuse command line application.

The reason the drivers are written in golang is for portability, however you could probably write a short shell script which achieves the same task. The only blocker is it requires some json munging which probably would require jq.

Once you have installed the helm chart, and therefore the FUSE applications and drivers, you can use them (or not) in any pod on your cluster. So an AWS cluster would have the gcsfuse driver available, you just wouldn't use it on your Pangeo as you will get charged for data transit. And vice versa for GCE clusters.

yuvipanda · 2018-04-02T17:42:44Z

#190 has way forward on the FUSE issue. I no longer use the NFS Flex Volume I built. I instead use the approach mentioned in that issue.

stale · 2018-06-25T16:15:20Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2018-07-02T16:38:45Z

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

jhamman mentioned this issue Feb 28, 2018

Monthly check-in with UK Met Office #76

Closed

rabernat mentioned this issue Mar 22, 2018

added new notebook #174

Merged

jacobtomlinson added the community label Apr 26, 2018

stale bot added the stale label Jun 25, 2018

stale bot closed this as completed Jul 2, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand user group to XArray mailing list #130

Expand user group to XArray mailing list #130

mrocklin commented Feb 23, 2018

jhamman commented Feb 23, 2018

mrocklin commented Feb 23, 2018 via email

jhamman commented Feb 23, 2018

rabernat commented Feb 23, 2018

rabernat commented Feb 23, 2018

mrocklin commented Mar 20, 2018

jhamman commented Mar 21, 2018

rabernat commented Mar 21, 2018

mrocklin commented Mar 21, 2018

rabernat commented Mar 21, 2018

jacobtomlinson commented Mar 21, 2018

rabernat commented Mar 21, 2018

rabernat commented Mar 22, 2018

mrocklin commented Mar 22, 2018 via email

yuvipanda commented Mar 22, 2018

mrocklin commented Mar 22, 2018 via email

jhamman commented Mar 22, 2018

mrocklin commented Mar 23, 2018 via email

jacobtomlinson commented Mar 23, 2018 •

edited

Loading

mrocklin commented Mar 23, 2018 via email

rabernat commented Mar 26, 2018

mrocklin commented Mar 26, 2018 via email

rabernat commented Mar 26, 2018

jacobtomlinson commented Mar 26, 2018 •

edited

Loading

yuvipanda commented Apr 2, 2018

stale bot commented Jun 25, 2018

stale bot commented Jul 2, 2018

Expand user group to XArray mailing list #130

Expand user group to XArray mailing list #130

Comments

mrocklin commented Feb 23, 2018

jhamman commented Feb 23, 2018

mrocklin commented Feb 23, 2018 via email

jhamman commented Feb 23, 2018

rabernat commented Feb 23, 2018

rabernat commented Feb 23, 2018

mrocklin commented Mar 20, 2018

jhamman commented Mar 21, 2018

rabernat commented Mar 21, 2018

mrocklin commented Mar 21, 2018

rabernat commented Mar 21, 2018

jacobtomlinson commented Mar 21, 2018

rabernat commented Mar 21, 2018

rabernat commented Mar 22, 2018

mrocklin commented Mar 22, 2018 via email

yuvipanda commented Mar 22, 2018

mrocklin commented Mar 22, 2018 via email

jhamman commented Mar 22, 2018

mrocklin commented Mar 23, 2018 via email

jacobtomlinson commented Mar 23, 2018 • edited Loading

mrocklin commented Mar 23, 2018 via email

rabernat commented Mar 26, 2018

mrocklin commented Mar 26, 2018 via email

rabernat commented Mar 26, 2018

jacobtomlinson commented Mar 26, 2018 • edited Loading

yuvipanda commented Apr 2, 2018

stale bot commented Jun 25, 2018

stale bot commented Jul 2, 2018

jacobtomlinson commented Mar 23, 2018 •

edited

Loading

jacobtomlinson commented Mar 26, 2018 •

edited

Loading