
Create opt-in statistics gathering #1651

Closed
arantius opened this issue Oct 7, 2012 · 12 comments

@arantius (Collaborator) commented Oct 7, 2012

See #1573 for an example.

It would help to know how many scripts people have installed, how big they are, all sorts of things. But we don't know this. It would be nice to have some sort of statistic gathering, opt-in only, to help us make decisions like that.

  1. What should we gather?
  2. How should we gather it?
  3. What are the privacy implications?
@johan (Collaborator) commented Oct 15, 2012

3 won't fall out until we've decided on 1, 2 and if/how we'll identify each user. We might want to tie each user's data to a persisted randomly-generated GUID (likely synced per #1573), so we can tell apart per-user and per-used-browser-profile counts.

Data that could be highly useful to understand how people use Greasemonkey:

  • count of scripts installed
  • count of scripts enabled
  • count of scripts with tweaked include/exclude patterns
  • for scripts tweaked this way, perhaps min/max/avg/median number of added includes or excludes?
  • some kind of metrics on GM features used, both UI and (on behalf of scripts) what header keywords are used
  • size and count of user GM_setValue values
  • metrics on how often / seldom installed and enabled scripts de facto run in visited web pages, perhaps?
  • install urls
@arantius (Collaborator, Author) commented Oct 16, 2012

This is probably the right general level of detail. We want to know things, but not so much that we violate privacy. I might also be interested in:

  • Count of scripts installed where downloadURL contains "userscripts.org".
  • Count of locally modified (lastmod time > install time) scripts.

But probably not the whole install URL. There could be semi-private info (hostnames that only resolve in the right environment, etc.) in there.

When you say "metrics on GM features used" do you mean e.g. counts of clicks on each UI element, from which we could build a heatmap?
And "de facto run in visited web pages" just means... a counter of how many times any script ever executes?

@arantius (Collaborator, Author) commented Oct 17, 2012

If we're ever going to do Sync we'll have to provide a privacy policy, and if we do this we definitely should. Not a fun task but an important one.

@arantius (Collaborator, Author) commented Oct 17, 2012

Started working on a draft, based on a template/generator, and tweaking for our specific case:
https://docs.google.com/document/d/1n1jQoNvwSxEmvhR-OXT8I1OZQ9I2A5eNK5YH7DTUeko/edit

Once finalized, I'd expect to remove this doc and put the content in the AMO privacy policy slot.

@TriMoon commented Oct 18, 2012

Maybe also take into consideration the time between updates/re-installs of a script.
This would provide info on how much a developer is struggling to make things work, and hence on the quality of GM's documentation.

@johan (Collaborator) commented Oct 22, 2012

> When you say "metrics on GM features used" do you mean e.g. counts of clicks on each UI element, from which we could build a heatmap?

That, and the kind of grep-based survey of script header imperatives and GM_* calls (in the main code body, in your case) that you did on the userscripts.org collection, which is one step removed from what people actually have installed and enabled in their browser profiles.

The two are different enough to merit separate bullet points, I suppose.

> Count of scripts installed where downloadURL contains "userscripts.org".

Good point also about censoring the details of the URL. I think "protocol and hostname of the downloadURL" would be an excellent granularity, telling us to what extent people now host on https:// (which is ideal), and also tallying a modern cut of where to go looking for good user scripts over time. I am personally more and more dropping off userscripts.org, in favour of github (lately to https://github.com/johan/user.js/), and expect many others do similarly.

@TriMoon Excellent suggestion; that data is hard to measure any other way, especially filtered on the reality typical users see.

@arantius (Collaborator, Author) commented Oct 24, 2012

> Good point also about censoring the details of the URL. I think "protocol and hostname of the downloadURL" would be an excellent granularity,

I'm gonna be extra paranoid and say "domain" not "host name". Not "http://secret.admin.site.company.com" but just "http" and "company.com".


So first we should probably implement some sort of infobar like Firefox has for the same purpose (i.e. https://bug652657.bugzilla.mozilla.org/attachment.cgi?id=532566 ). Iff the user accepts, generate a persistent (to the firefox profile) pseudorandom ID and enable submission. When submitting store that ID, the (server's) time, and whatever data is submitted. Add a checkbox in the Greasemonkey options window to toggle this setting later.

Probably the "value" will just be a string of JSON to be parsed and handled later. This makes the server side implementation stupid simple and easy. We could do something simple like filesystem (ID is a directory, time is a file name, file contents are the values) storage, or any SQL table with 3 columns (ID, time, data). Leave the complexity to the much less often run script that reads and makes sense of the data. The JSON object is easy to parse in any decent language, and we can add/remove whatever data we want in the future without breaking the server, only the parsing scripts might need to be updated.

I'm thinking we define a default period (weekly?) but allow (not require) the server to return a hint about how frequently submissions should be made in the future, to either alleviate load, or gather more data, without changing code.
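Something like this, say (the response field name `nextIntervalDays` and the sanity bounds are pure assumptions):

```javascript
const DEFAULT_INTERVAL_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Honor a sane server hint for the next submission delay; otherwise
// fall back to the weekly default.
function nextSubmissionDelayMs(serverResponse) {
  const hint = serverResponse && serverResponse.nextIntervalDays;
  if (typeof hint === 'number' && hint >= 1 && hint <= 90) {
    return hint * MS_PER_DAY;
  }
  return DEFAULT_INTERVAL_DAYS * MS_PER_DAY;
}

console.log(nextSubmissionDelayMs({}) / MS_PER_DAY);                       // 7
console.log(nextSubmissionDelayMs({ nextIntervalDays: 14 }) / MS_PER_DAY); // 14
```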


As for the data itself, a first draft of an example:

{
  firefoxVersion: '16.0.1',
  platform: 'Linux',
  greasemonkeyVersion: '1.5',
  greasemonkeyEnabled: true,
  scripts: [{
      enabled: true,
      explicitGrants: [],
      id: 'script_id_1',
      imperatives: ['name', 'description', 'include', 'include', 'include'],
      implicitGrants: ['GM_xmlhttpRequest'],
      installScheme: 'http',
      installDomain: 'userscripts.org',
      installTime: 'YYYY-MM-DDTHH:MM:SSZ',
      modifiedTime: 'YYYY-MM-DDTHH:MM:SSZ',
      updateCount: A,
      userExcludeCount: X,
      userIncludeCount: Y,
      valueCount: N,
      valueSize: M,
    }, {
      enabled: false,
      explicitGrants: ['GM_addStyle', 'GM_log'],
      id: 'script_id_2',
      imperatives: ['name', 'namespace', 'match', 'grant', 'require'],
      implicitGrants: [],
      installScheme: 'https',
      installDomain: 'github.com',
      installTime: 'YYYY-MM-DDTHH:MM:SSZ',
      modifiedTime: 'YYYY-MM-DDTHH:MM:SSZ',
      updateCount: A,
      userExcludeCount: X,
      userIncludeCount: Y,
      valueCount: N,
      valueSize: M,
    }],
  ui: {
    monkeyButtonClicks: M,
    monkeyButtonMenuOpens: N,
    monkeyToolsMenuOpens: O,
    menuCommanderClicks: {
      script_id_1: {
        "title 1": A,
        "title 2": B,
        },
    }
  }
}

Most of this is very easy to gather. The "ui" bit and "updateCount" are harder (new routines to count, new storage for the values), and are probably worth doing separately.

@arantius (Collaborator, Author) commented Oct 24, 2012

I'm also signing up for a StartCom SSL certificate. I've heard of people (e.g. in restricted corporate environments) not being able to connect to that. But AFAIK it's in the default Firefox root CA list, and I think encryption on the wire is worth it.

@johan (Collaborator) commented Oct 25, 2012

All of the above sounds good. (Yes, the encryption is worth it.)

> I'm gonna be extra paranoid and say "domain" not "host name". Not "http://secret.admin.site.company.com" but just "http" and "company.com".

Good point. Does Firefox expose something that lets us grab the domain name, for whatever that means in each top level domain, these days? (I guess we have the .tld regexp to go on, if not; however out of date it is, it will be better than having a ton of data points for "co.uk", and the like.)

Ideal data would probably drill down to the organizational owner (e.g. finding lysator.liu.se from www.lysator.liu.se, which would be the Lysator Academic Computer Society, not liu.se for Linköping University), but I assume that wouldn't be easy to do, implementation-wise.

> imperatives: ['name', 'description', 'include', 'include', 'include']

Thoughts on this vs {"name": 1, "description": 1, "include": 3}? Yours arguably provides richer data, and might be the better choice. Should ignored (misspelled and otherwise non-GM) meta imperatives log as well? My gut feel says yes; it would be good to know how many have adopted @match and other similar good ideas we haven't implemented yet. (Bad example these days, but you get the idea.)
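For what it's worth, the array form costs nothing downstream, since it collapses to the counts form with a trivial pass; a sketch:

```javascript
// Collapse ['name', 'include', 'include'] style arrays into
// {"name": 1, "include": 2} style tallies.
function tallyImperatives(imperatives) {
  const counts = {};
  for (const key of imperatives) {
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

console.log(tallyImperatives(['name', 'description', 'include', 'include', 'include']));
// → { name: 1, description: 1, include: 3 }
```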

Do installTime and modifyTime capture first install and latest filesystem modification, but not most recent updateTime – and intentionally so? I guess with aggregate over-time data submissions we get enough data tallying up the update counts to get a feel for script update frequencies (after some data massage), but without last update we will not know if a modifyTime means the user has modified the script or if the update system did, which I think would be a property of interest.

@arantius (Collaborator, Author) commented Oct 25, 2012

> Does Firefox expose something that lets us grab the domain name

https://developer.mozilla.org/en-US/docs/XPCOM_Interface_Reference/nsIEffectiveTLDService#getBaseDomain%28%29
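That service is Firefox-only (XPCOM), but for a feel of what it computes, here's a toy stand-in driven by a deliberately tiny suffix set (the real getBaseDomain consults the full public-suffix list):

```javascript
// Deliberately incomplete; only for illustration.
const TOY_SUFFIXES = new Set(['com', 'org', 'se', 'co.uk']);

function toyBaseDomain(host) {
  const labels = host.split('.');
  // The longest matching public suffix wins; keep exactly one label above it.
  for (let i = 0; i < labels.length - 1; i++) {
    if (TOY_SUFFIXES.has(labels.slice(i + 1).join('.'))) {
      return labels.slice(i).join('.');
    }
  }
  return host;
}

console.log(toyBaseDomain('secret.admin.site.company.com')); // company.com
console.log(toyBaseDomain('www.example.co.uk'));             // example.co.uk
```

Note it gives liu.se for www.lysator.liu.se, i.e. the registrable domain, not the organizational owner discussed above.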

> drill down to organizational owner

Yeah, how?

> Thoughts on this vs {"name": 1, "description": 1, "include": 3}?

I'm defaulting to making the client side as simple as possible. Scripts can be written and re-written to parse the data later. My example above was the imagined result of a very simple re-parse (if we keep all entries) or scan of known data (if we only use "supported" data already stored in the Script objects).

> @match and other similar good ideas we haven't implemented yet.

Ah, but we do support match! For over a year: 23241da

> Do installTime and modifyTime capture first install and latest filesystem modification

At first, it would be whatever is already stored in config.xml/the file system. An update is an install and thus would change that value. These values being different means the user edited the script some time after install.
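So the check could literally be this predicate (assuming the ISO-style timestamp strings from the draft payload above, which compare correctly as strings at equal precision):

```javascript
// A script counts as locally modified when its recorded modification
// time is later than its install time; an update rewrites both, so
// after an update the two coincide again.
function wasLocallyModified(script) {
  return script.modifiedTime > script.installTime;
}

console.log(wasLocallyModified({
  installTime: '2012-10-01T00:00:00Z',
  modifiedTime: '2012-10-05T12:00:00Z',
})); // true
```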

@johan (Collaborator) commented Oct 26, 2012

Cool! I sometimes wish getBaseDomain was in the DOM api too.

> > @match and other similar good ideas we haven't implemented yet.
>
> Ah, but we do support match! For over a year: 23241da

Exactly, hence "bad example these days". :-) Point being that external innovation will keep happening, and it might be good to get an indicator of what stuff people actually use, whether we support it yet or not.

> > Do installTime and modifyTime capture first install and latest filesystem modification
>
> At first, it would be whatever is already stored in config.xml/the file system. An update is an install and thus would change that value. These values being different means the user edited the script some time after install.

Ah, right – I mis-recalled that we hung on to initial install times. Pardon the noise. :)

@johan (Collaborator) commented Oct 27, 2012

I just realized why I thought script install times would retain the initial install time – in our discussion about how to pick Mozilla-style salted unique script id cryptographic hashes, we leaned towards basing it on the initial install time. I guess we'd resolve and assign that id once, though, and then store it in the meta area until the script is uninstalled, rather than re-resolving it on each access.

arantius added a commit to arantius/greasemonkey that referenced this issue Nov 7, 2012

arantius added a commit to arantius/greasemonkey that referenced this issue Nov 7, 2012

@arantius closed this in 5b42344 Nov 7, 2012
