Proposal: Include very basic tracking of usage by default #55

Closed
aronchick opened this issue Dec 20, 2017 · 33 comments

@aronchick
Contributor

Using something like Spartakus (https://github.com/kubernetes-incubator/spartakus), ping back to a central server information about the Kubeflow deployment once per day. It should be absolutely anonymous, with zero PII. Just how many components are deployed, and how many pods are running - with a unique identifier to track deployments that last for more than one day.

We should also enable opting out with a single flag, something like --report-metrics=false during ksonnet deployment.
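
A minimal sketch of what such a daily report could contain, assuming a JSON payload. The field names and the `report_metrics` parameter are illustrative only; they are not Spartakus's actual schema or flags.

```python
import json
from typing import Optional


def build_report(cluster_id: str, component_count: int, pod_count: int) -> str:
    """Serialize an anonymous usage report (all field names are hypothetical)."""
    report = {
        "cluster_id": cluster_id,       # random identifier, not derived from user data
        "components": component_count,  # number of components deployed
        "pods": pod_count,              # number of pods running
    }
    return json.dumps(report, sort_keys=True)


def maybe_report(report_metrics: bool, cluster_id: str,
                 component_count: int, pod_count: int) -> Optional[str]:
    """Return the payload to send, or None when the user has opted out."""
    if not report_metrics:
        return None
    return build_report(cluster_id, component_count, pod_count)
```

With `report_metrics=False` nothing is built at all, which is the behaviour a single opt-out flag would be expected to give.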

@jlewi
Contributor

jlewi commented Dec 20, 2017

This looks pretty easy to set up.

  1. Set up a GKE cluster running the collector
  2. Add the volunteer component to our ksonnet core package with an option to disable it.

@jlewi jlewi added this to the Kubecon Europe milestone Jan 29, 2018
@jlewi jlewi self-assigned this Jan 29, 2018
@jlewi
Contributor

jlewi commented Feb 15, 2018

Created the project kubeflow.org/kubeflow-usage

Create the cluster

gcloud container clusters create --project=kubeflow-usage reporting --zone=us-central1-c

Reserve a static IP

gcloud compute --project=kubeflow-usage addresses create stats-collector --global

@jlewi
Contributor

jlewi commented Feb 15, 2018

Created a DNS record to associate stats-collector.kubeflow.org with the static IP.

@erikerlandson

I feel obligated to mention that modern ML technology (irony!) has demonstrated the ability to infer PII from patterns in data that have no literal PII in them. To be clear, when I look at the information currently broadcast by spartakus, I can't off the top of my head imagine a scenario for how that would happen here. OTOH that's what ML is good at, exploiting patterns humans can't directly perceive.

And yes, users can opt out :)

@elmiko

elmiko commented Feb 22, 2018

And yes, users can opt out :)

+1

@erikerlandson

Is there a writeup anywhere that gives examples of the various stats that spartakus will collect, and how we plan to use those to improve Kubeflow roadmapping?

@jlewi
Contributor

jlewi commented Feb 23, 2018

@erikerlandson https://github.com/kubernetes-incubator/spartakus describes the basic metrics collected; these are all generic K8s metrics that aren't Kubeflow specific.

So I think the immediate use for these metrics is so that contributors to Kubeflow can demonstrate impact and justify further investment.

I think the next step would be to collect more specific Kubeflow metrics to see which components are being used.

@erikerlandson

@jlewi so iiuc, the idea is to demonstrate that Kubeflow is being used in the wild? As in "our metrics show that xxx Kubeflow clusters are reporting in, and here is a plot of Kubeflow cluster reports over time"

If I'm reading the report definitions right, it's reporting total resources available on nodes in a cluster. Like "here is a node that has 1TB of RAM" as opposed to "here is a pod using 200MB of RAM"

@aronchick
Contributor Author

aronchick commented Feb 23, 2018 via email

@jlewi
Contributor

jlewi commented Feb 24, 2018

@erikerlandson An obvious metric to track would be deployments of different versions of Kubeflow. This will help us make informed decisions about breaking changes and how much effort to spend supporting older versions.
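
As a rough sketch of how a per-version count could be derived from collected reports: this assumes each report carries a `cluster_id` and a `version` field, both of which are hypothetical here (the metrics described so far are generic K8s ones).

```python
from collections import Counter
from typing import Dict, Iterable


def deployments_by_version(reports: Iterable[Dict[str, str]]) -> Counter:
    """Count deployments per version, using each cluster's most recent report.

    Reports are assumed to be ordered oldest first, so a later report
    from the same cluster overwrites an earlier one.
    """
    latest: Dict[str, str] = {}
    for report in reports:
        latest[report["cluster_id"]] = report["version"]
    return Counter(latest.values())
```

Keying on the cluster identifier keeps a long-running deployment from being counted once per daily report.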

@mattf
Contributor

mattf commented Feb 24, 2018

three things should be present for something like this to work.

  1. data donated to the community, for the benefit of the community, needs to be available to the community. for instance, data readily available to anyone going to kubeflow.org.
  2. there should be a clear value proposition for the community. for instance, being able to connect with others who are using similar projects or are in similar locations, or clear use of the data for improvement of the project, which may take some time to demonstrate.
  3. it should be opt-in.

the first two go to the social contract established.

the last is my personal position, and i'm usually mollified by a strong social contract, clear indication that the data is collected, and a trivial opt-out option.

@jlewi
Contributor

jlewi commented Feb 24, 2018

100% on board with the first 2. One of the main reasons we want to collect this data is to build trust in Kubeflow by showing that companies/individuals investing in Kubeflow are extending their reach.

I'm strongly in favor of starting with opt out and seeing what users think. We're still in alpha/experimental so I think that's very reasonable.

If we're opt out we'll get much higher participation just because it's the default option.

@aronchick
Contributor Author

+1 with 100% about the first two - this should absolutely be available and build trust.

I think we're saying the same thing on #3 - specifically, Matthew has said (which I support), that we have a strong social contract and trivial opt out.

Trivial opt out is done (just one command, and it's gone). What does a clear social contract look like?

@mattf
Contributor

mattf commented Feb 26, 2018

the social contract is embodied in doing (0) and (1).

@gsunner
Member

gsunner commented Feb 27, 2018

I agree that trust and transparency should be the main goal.

We are also looking to get some basic usage tracking on our project seldon-core using spartakus.
We also have the same issue of whether to have the usage tracking on by default with an easy opt-out.

As we are in the process of integrating Seldon and Kubeflow, we would also want to take advantage of any global flag for an 'opt-out' of all tracking.

Also as you are proposing to share collected data with the community - we may not need to collect the same data as long as usage of Kubeflow related components such as Seldon is also available.

@jlewi
Contributor

jlewi commented Feb 27, 2018

It seems like the consensus is that collecting metrics is a good thing.

Let's start with opt in opt out and see what users say. If people would strongly prefer opt out we can change.

@gsunner My hope is that in follow on PRs we can include additional metadata to break down usage by component.

Does someone want to approve the actual PR?

@aronchick
Contributor Author

aronchick commented Feb 27, 2018 via email

@jlewi
Contributor

jlewi commented Feb 27, 2018

@aronchick That was a typo on my part. I agree with you about making it opt out by default.

@mattf
Contributor

mattf commented Feb 28, 2018

opt-in is my personal view.

i agree that opt-out is a reasonable starting point for the community, especially if we make it clear we're collecting, make it clear how to opt out, share the data with the community, and demonstrate ways we use the data to benefit the community.

i don't think all those things must, or even can, be done before proceeding.

let's proceed in good faith.

the kubeflow-discuss post has given this heightened attention for a week now. i propose this be on the agenda for the next community meeting and give until the following day for comments before proceeding w/ opt-out.

@aronchick
Contributor Author

aronchick commented Feb 28, 2018 via email

@jlewi
Contributor

jlewi commented Mar 1, 2018

@mattf Sounds good. I've updated the PR to make it opt in for now and updated the instructions to include the commands to opt in (and make it clear you can skip them).

@elmiko

elmiko commented Mar 1, 2018

opt-in is my personal view.

same for me, thanks for updating the PR @jlewi

jlewi added a commit that referenced this issue Mar 2, 2018
Use Kubernetes reporting tool (spartakus) to report anonymous statistics about Kubeflow usage such as basic cluster stats.

This is optional and we're making it opt in for now.

One current limitation is that there's no easy way to give each kubeflow deployment a unique, random id, so it will be hard to distinguish different deployments.

Users can manually assign a unique id.

We could potentially modify spartakus (or the Docker image) to generate a random id. The one downside of this is that the id would be regenerated if the pod restarts.
Related to #55
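
One possible way around the regenerated-id problem, sketched under the assumption that the reporter can write to a file on storage that outlives the pod (for example a mounted volume). The function name and the file path are hypothetical; this is not something spartakus does today.

```python
import os
import uuid


def load_or_create_id(path: str) -> str:
    """Return the id stored at `path`, creating a random one on first use."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            stored = f.read().strip()
        if stored:
            return stored
    # No usable id yet: generate a random UUID and persist it for later runs.
    new_id = str(uuid.uuid4())
    with open(path, "w", encoding="utf-8") as f:
        f.write(new_id)
    return new_id
```

If `path` sits on the container's own filesystem instead, the id is still lost on restart, so the storage assumption matters more than the code.
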
@jlewi
Contributor

jlewi commented Mar 5, 2018

PR has been submitted with opt in.

I have created a group
data-analysts@kubeflow.org
to give access to the data in BigQuery to folks preparing reports. I've given access to @chrisheecho who's been doing some of our data analysis and who I'm going to ask to prepare some initial reports.

I can share access with other folks who will be working on preparing reports for the community.

I'll also open up an issue on whether we should make the raw data open to all.

@inc0

inc0 commented Mar 6, 2018

As I said in the meeting, even opt-in is iffy for me. This can be a security risk and, well, damage from these can be hard to recover from. Another thing is the usefulness of this data. We can see the scale of the clusters people use, but how much of that is Kubeflow? We can add a footnote that if you're willing to run spartakus, that's our endpoint and thank you :)

I'd rather create a Google Doc (?) questionnaire that we can modify, asking open questions tailored to actually improving our project. If we ask for scale brackets rather than the number of nodes, it's easier to convince operators to share this info, etc.
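
To illustrate the scale-bracket idea, here is a small sketch that maps an exact node count to a coarse range. The bracket boundaries are made up for the example.

```python
import bisect

# Hypothetical bracket boundaries; each upper bound is inclusive.
UPPER_BOUNDS = [10, 50, 200]
LABELS = ["1-10", "11-50", "51-200", "201+"]


def scale_bracket(node_count: int) -> str:
    """Map an exact node count to a coarse bracket label."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # bisect_left finds the first bracket whose upper bound is >= node_count.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, node_count)]
```

Only the label would be shared, so an operator reveals "51-200" rather than the exact size of the cluster.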

@mhausenblas
Member

mhausenblas commented Mar 6, 2018

I'm for opt-in (with a very clear, strong, red-blink notice at install time). While a questionnaire like the one suggested by @inc0 sounds nice, I believe the point is automation, so I don't think the folks who want the data for planning or whatever reasons would prefer that option (understandably so).

@mhausenblas
Member

After having reviewed the kubernetes-incubator/spartakus source code now I do have a question: given that it has a hard dependency on BigQuery, how are folks supposed to use this behind a firewall, in an on-premises setup?

Don't get me wrong, I love and admire BigQuery (heck, a long time ago I even contributed to the open source version of the underlying engine called Dremel, that is, Apache Drill), but I really wouldn't know how I'd explain to someone who wanted to set up Kubeflow in a stand-alone fashion that in order to do so she needs a BigQuery account and can't really use Kubeflow "off-line" with telemetry enabled. Please tell me I'm missing something obvious here?

@mhausenblas
Member

I asked around a bit and Tim confirmed Spartakus is a PoC and so I think, since we've apparently decided to adopt it, it would make sense to do it properly ;)

I've reached out to Tim to see how I can get involved so that if we have needs (for example, my interest for on-prem deployments is to allow for alternative back-ends) we can meet them in a timely fashion. WDYT @jlewi @aronchick?

@aronchick
Contributor Author

aronchick commented Mar 7, 2018 via email

@mhausenblas
Member

Thanks @aronchick.

In re: on-prem, part of the idea is that we're able to track how this is
being used even on-prem. The fact that it uses a centralized logging system
(BQ) is a feature, not a bug, because otherwise how would we aggregate?
Because opting out is SO trivial, we're hoping that it doesn't cause any
issues.

Yes, I get that and I hope you remember that we actually decided on an opt-in policy ;)

I think I might be missing the point in re: using KF offline - did you mean
you think that users would like to aggregate all the KF deployed across
their enterprise in an offline way? What an interesting (and exciting)
proposition! I love the idea of exploring that.

That is exactly what I mean, apologies for not being able to communicate that better. We're all guilty of having a bit of a tunnel vision as we're living in a bubble where we take the tools in our org for granted, but you can trust me, I've been in enough situations with users/customers that went like: "what do you mean, technology X is hard-wired and can't be replaced?" not gonna use/buy it …

FWIW, I'm in touch with @thockin concerning Spartakus. I will raise issues there and see how I can help in refactoring and extending the pluggable backend stuff, with the goal of having a reliable component we can ship with Kubeflow. Hope that makes sense?

@aronchick
Contributor Author

aronchick commented Mar 9, 2018 via email

@mhausenblas
Member

@aronchick for now I think we should be good, thanks. I'm trying to get involved in Spartakus to ensure that it's a stable and reliable component for our needs. For starters I'm focusing on improving the docs (see kubernetes-retired/spartakus#31), and then we'll see how merciful Mr @thockin is with my refactoring PRs ;)

@jlewi
Contributor

jlewi commented Mar 19, 2018

@mhausenblas The spartakus collector defines an interface that abstracts away the database. So if someone wanted to support a DB other than BigQuery it should be pretty straightforward.
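
To show the general shape of that kind of abstraction: the sketch below is in Python purely for illustration (spartakus itself is written in Go), and every name in it is hypothetical rather than taken from the spartakus code.

```python
from typing import Dict, List, Protocol


class ReportStore(Protocol):
    """Anything that can persist a report; the collector depends only on this."""

    def save(self, report: Dict[str, object]) -> None: ...


class InMemoryStore:
    """Stand-in backend that could be swapped for BigQuery or an on-prem DB."""

    def __init__(self) -> None:
        self.reports: List[Dict[str, object]] = []

    def save(self, report: Dict[str, object]) -> None:
        self.reports.append(dict(report))


class Collector:
    """Receives reports and hands them to whichever backend it was given."""

    def __init__(self, store: ReportStore) -> None:
        self._store = store

    def handle(self, report: Dict[str, object]) -> None:
        self._store.save(report)
```

Supporting another database then means writing one more class with a `save` method and passing it to `Collector`; the collector itself does not change.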

@jlewi
Contributor

jlewi commented Mar 19, 2018

Per the discussion in this thread, we are now collecting metrics on an opt-in basis. This is described in our instructions:
https://github.com/kubeflow/kubeflow#steps

So I'm closing this issue.

@mhausenblas thanks for chipping in on spartakus; that will be very useful.

@jlewi jlewi closed this as completed Mar 19, 2018
kimwnasptd pushed a commit to arrikto/kubeflow that referenced this issue Mar 5, 2019
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
Signed-off-by: Ce Gao <gaoce@caicloud.io>