Clarifications on METRICS.md #2579

Closed
fmarier opened this Issue Apr 5, 2017 · 8 comments

fmarier commented Apr 5, 2017

I'm reading through METRICS.md and I'd like to suggest two small clarifications:

Page views

  • Views of shot pages show up as /a-shot/{hash}

What exactly goes into this hash? Is it a random number that's then SHA256'd?

Metrics schema

Each item in these events requires:

Is it possible to provide an actual example of a submitted event?

Contributor

ianb commented Apr 6, 2017

What exactly goes into this hash? Is it a random number that's then SHA256'd?

It's just a SHA256 hash of the path, there's no added randomness. See code

The goal of using a hash is that we can get some sense of how views are clustered; e.g., are there some pages that are viewed (and thus shared) heavily and many that are not? Since we're doing the hashing on the client side we don't have access to any private random number.
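As a rough illustration of the scheme described above, here is a minimal sketch of hashing a shot path with SHA256 and no salt. It is not the project's actual code: the function name `hashedShotPath`, the sample path, and the hex encoding of the digest are assumptions made for illustration. It uses Node's `crypto` module so it can be run directly; in a browser the equivalent primitive would be `crypto.subtle.digest`.

```javascript
// Sketch only: map a shot path to the /a-shot/{hash} form seen in analytics.
const { createHash } = require("node:crypto");

// The "/a-shot/" prefix comes from METRICS.md. Hex encoding of the digest
// and the function name are assumptions, not taken from the real code.
function hashedShotPath(shotPath) {
  const digest = createHash("sha256").update(shotPath, "utf8").digest("hex");
  return `/a-shot/${digest}`;
}

// Hypothetical shot path, used only to show the call.
console.log(hashedShotPath("/abc123/example.com"));
```

Because nothing random is mixed in, every client computes the same value for the same path, which is what makes view counts comparable across clients and visits.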

Is it possible to provide an actual example of a submitted event?

Here's an example running locally of an event from onboarding:

{
  "an":"firefox",
  "av":"55.0",
  "ec":"addon",
  "ea":"visited-slide",
  "el":"slide-1",
  "ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0",
  "v":"1",
  "tid":"UA-49796218-30",
  "cid":"e0d5c32b-1115-5870-9603-f8640eb18673",
  "t":"event"
}

Because the client hasn't logged in at this point, no A/B tests are assigned; otherwise there would be some cd* variables as well. cid is the client ID. tid is our Google Analytics tracking ID.
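For readers who want to see how such an event might be assembled, here is a minimal sketch that builds a form-encoded body from the same fields. The parameter names (`v`, `tid`, `cid`, `t`, `an`, `av`, `ec`, `ea`, `el`, `ua`) are the ones in the example above; the function name `buildEventPayload`, its argument names, and the placeholder IDs are assumptions, and nothing is sent over the network.

```javascript
// Sketch only: assemble an event like the one shown above as a
// URL-encoded string. The helper name and argument names are hypothetical.
function buildEventPayload({ trackingId, clientId, category, action, label, appVersion, userAgent }) {
  const params = new URLSearchParams({
    v: "1",          // protocol version
    tid: trackingId, // tracking ID
    cid: clientId,   // hashed client ID
    t: "event",      // hit type
    an: "firefox",   // application name
    av: appVersion,  // application version
    ec: category,    // event category
    ea: action,      // event action
    el: label,       // event label
    ua: userAgent,   // user agent override
  });
  return params.toString();
}

// Placeholder values; the IDs below are not real.
const body = buildEventPayload({
  trackingId: "UA-XXXXXXXX-X",
  clientId: "00000000-0000-0000-0000-000000000000",
  category: "addon",
  action: "visited-slide",
  label: "slide-1",
  appVersion: "55.0",
  userAgent: "Mozilla/5.0 (example)",
});
console.log(body);
```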

I thought we also included the add-on version, but apparently not; filed as #2583

Author

fmarier commented Apr 7, 2017

It's just a SHA256 hash of the path, there's no added randomness.

Thanks. It would be good to add this to the documentation.

The goal of using a hash is that we can get some sense of how views are clustered; e.g., are there some pages that are viewed (and thus shared) heavily and many that are not?

So this is a way to obfuscate the real path from GA but still get a unique identifier for every page that will be stable across clients and visits?

Author

fmarier commented Apr 7, 2017

Firefox Screenshots assigns each user a random ID (associated with their profile) when the add-on is installed. This ID is associated with all shots the user makes. For the purpose of Google Analytics (GA) the ID is hashed. The same hashed ID is used for website visits and events, and for add-on events.

I've got a few questions about this random user ID:

  1. This is the same thing as the "client ID" (cid) you mentioned in #2579 (comment), right?
  2. What exactly goes into that ID? You mentioned there's a device ID (where does that come from?) and a private random number that then gets hashed (i.e. SHA256(deviceID + randomSalt))?
  3. Is the "hashed" GA ID simply SHA256(userId)?

Author

fmarier commented Apr 7, 2017

General Google Analytics information

This is stuff we get from including ga.js on Screenshots pages.
[...]
Location

Is that the location with the path hashed as you explained in #2579 (comment)?

Referrals

I assume you mean the HTTP Referrer here? If so, is it the full Referer header or are we trimming it to just include the origin?

Social referral

What exactly does that look like?

Contributor

ianb commented Apr 10, 2017

It's just a SHA256 hash of the path, there's no added randomness.
Thanks. It would be good to add this to the documentation.

Sure, added in d5891ee

So this is a way to obfuscate the real path from GA but still get a unique identifier for every page that will be stable across clients and visits?

Yes. Adding a secret that rotates would be fine in terms of stability, but is just harder when we're doing the hashing on the client side.

I've got a few questions about this random user ID:

This is the same thing as the "client ID" (cid) you mentioned in #2579 (comment), right?
What exactly goes into that ID? You mentioned there's a device ID (where does that come from?) and a private random number that then gets hashed (i.e. SHA256(deviceID + randomSalt))?
Is the "hashed" GA ID simply SHA256(userId)?

The random user ID, which is generated in the add-on, is referred to as the deviceId. cid is a Google Analytics term; we compute it as SHA1(private_rotating_secret + deviceId). That is done in server.js.

The deviceId is generated as a UUID using makeUuid, which uses window.crypto.getRandomValues()
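To make the two steps concrete, here is a minimal sketch of generating a random UUID with `getRandomValues()` and then deriving a hashed ID as SHA1(secret + deviceId). It is not the real `makeUuid` or the real server.js: the function names `makeDeviceId` and `hashedClientId`, the version 4 UUID layout, and the plain hex output are assumptions. The cid in the event example earlier in this thread is UUID-shaped, so the real output format evidently differs from the hex string produced here.

```javascript
// Sketch only: a random device ID plus a secret-keyed SHA1 of it.
const { createHash, webcrypto } = require("node:crypto");

// Build a version 4 UUID from 16 random bytes. In a browser,
// window.crypto.getRandomValues() plays the role of webcrypto here.
function makeDeviceId() {
  const bytes = new Uint8Array(16);
  webcrypto.getRandomValues(bytes);
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // set version to 4
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set RFC 4122 variant
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Concatenation order follows the comment above: secret first, then deviceId.
// Returning a hex digest is an assumption.
function hashedClientId(secret, deviceId) {
  return createHash("sha1").update(secret + deviceId, "utf8").digest("hex");
}

const deviceId = makeDeviceId();
console.log(deviceId, hashedClientId("example-secret", deviceId));
```

Since the secret stays on the server, someone who only sees the value sent to Google Analytics cannot recompute it from a known deviceId.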

[From GA: Location] Is that the location with the path hashed as you explained in #2579 (comment)?

Yes, the path, hashed for individual shots, or kept in full for other pages (My Shots, homepage, etc).

I assume you mean the HTTP Referrer here? If so, is it the full Referer header or are we trimming it to just include the origin?

The complete referrer is kept. Since it's google-analytics.js doing the collecting, presumably it's document.referrer

[Re: Social Referrer] What exactly does that look like?

I don't know much about this specifically. Looking at the report in GA, I believe it's just referral information with a filter mapping domains to social networks.

Author

fmarier commented Apr 11, 2017

The random user ID, which is generated in the add-on, is referred to as the deviceId. cid is a Google Analytics term; we compute it as SHA1(private_rotating_secret + deviceId). That is done in server.js.

It's a little odd that we're using a SHA1 hash here and a SHA256 for the paths, but I imagine it's a limitation of the nodify-uuid module.

Since this is about obfuscating the real deviceId, I don't think it's a concern.

The deviceId is generated as a UUID using makeUuid, which uses window.crypto.getRandomValues()

So this deviceId is a unique ID that's generated when the add-on is installed and never changes? (since you can't uninstall a system addon)

It would be good to document what's going into the user ID, basically what you wrote in #2579 (comment).

The complete referrer is kept. Since it's google-analytics.js doing the collecting, presumably it's document.referrer

Do you need the full referrer or could we provide to GA a referrer that has been truncated to just the origin (i.e. no path)?

A truncated referrer would still tell us for example that traffic is coming from Twitter, but it would avoid revealing the userid of the Twitter user who shared that screenshot (likely the owner of the screenshot).
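The truncation proposed here could look something like the following minimal sketch, which reduces a referrer URL to its origin using the standard `URL` class. The function name `truncateReferrer` and the choice to return an empty string for missing or unparseable values are assumptions; how the result would be handed to Google Analytics is not covered.

```javascript
// Sketch only: keep the scheme and host of a referrer, drop path and query.
function truncateReferrer(referrer) {
  if (!referrer) {
    return "";
  }
  try {
    const origin = new URL(referrer).origin;
    // URL reports the string "null" for schemes without a meaningful origin.
    return origin === "null" ? "" : origin;
  } catch (e) {
    // Not a parseable absolute URL.
    return "";
  }
}

// Hypothetical referrer, used only to show the call.
console.log(truncateReferrer("https://twitter.com/someuser/status/123"));
```

With this, the referring site is still visible, but the path that identifies a particular user or post is not.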

ianb added a commit that referenced this issue Apr 13, 2017

Contributor

ianb commented Apr 13, 2017

So this deviceId is a unique ID that's generated when the add-on is installed and never changes? (since you can't uninstall a system addon)

Yes. We used to regenerate the deviceId when a user chose to leave the service, but this caused some issues if they then tried to start using the service again. We simplified that: now we just remove all the shots on the server but leave the deviceId in place.

I added a note in d29ed92

Do you need the full referrer or could we provide to GA a referrer that has been truncated to just the origin (i.e. no path)?

We do use the full referrer sometimes when we are curious how people are using Screenshots. Generally it is only available if a shot was posted somewhere publicly, so it's an opportunity to see what public use looks like.

Google Alerts would offer some ability to see public use. We plan to turn on robots.txt hiding (#2256) as a matter of caution while we get used to the service, but I suppose that won't hide links to Screenshots, only the individual shots.

A truncated referrer would still tell us for example that traffic is coming from Twitter, but it would avoid revealing the userid of the Twitter user who shared that screenshot (likely the owner of the screenshot).

Services that want to make a user or link private would typically hide the referrer. I believe Twitter does so with its t.co links.

Contributor

wresuolc commented Apr 20, 2017

Thank you for the discussion here. I've broken the todo item here into #2717 and will close this.

wresuolc closed this Apr 20, 2017
