This repository has been archived by the owner on Feb 29, 2020. It is now read-only.

Determine constraints/what needs to be collected in our data pipeline #27

Closed
k88hudson opened this issue Feb 2, 2016 · 29 comments

Comments

@k88hudson
Contributor

If we are collecting data in splice:

  • what data do we need
  • what are the constraints?

Notes:

  • legal constraints are determined by TestPilot
  • Product to fill in exact data requirements/KPIs

@tspurway do you want to add some clarification here?

@k88hudson k88hudson added this to the 1. UI Demo milestone Feb 2, 2016
@tspurway tspurway modified the milestones: 2. Alpha, 1. UI Demo Feb 12, 2016
@k88hudson
Contributor Author

@nchapman can you help wrap up the requirements here?

@ncloudioj
Member

Target metrics

  • GUID
  • metric type [click | close | ...]
  • duration (in milliseconds?)
  • position (click position: [spotlight | top sites | top activity | timeline])
  • scroll depth
  • device
  • OS
  • browser
  • version
  • country
  • locale
  • time stamp (auto-generated by server)

Other metrics (TBD)

  • number of newtabs opened per user
  • number of opt-out users
  • disable or remove content
  • retention rate ??

Thoughts? @oyiptong @tspurway @nchapman @k88hudson @emtwo

@tspurway
Contributor

GUID needs to persist over time. Initially could be FxA ID? Test Pilot ID?

Scroll position could always be tracked with 'click' metrics.

Quick and dirty duration: start a timer on newtab, stop it on 'defocus', then send the ping.

Basic navigation and interaction events should be tracked.

On click we will track what position the clicked item was and scroll depth.

how old are the items that users are clicking on (ages should be client time differences (in seconds))?

check out mixpanel and GA data formats (use standard ping/metrics formats)

add-on version, experiment versions, cohort id, etc.

performance metrics: load times, latencies

size of history, number of bookmarks, time spent in the browser (active usage hours)

install addon, uninstall, updated (lifecycle events of the addon itself)

@tspurway
Contributor

We are defining an Activity Stream session as the time between when the newtab page gains and loses focus. We will grab a timestamp on each of these events and report the difference as session_duration for every ping.

A ping will be sent whenever the session ends (which will be when the tab loses focus). The ping will contain the following data:

  • client_id (unique id for user)
  • tab_id (unique id for that user's newtab)
  • load_reason (newtab, refocus, restore)
  • source (timeline or activity_stream)
  • search (indicates a search was performed)
  • top_site_click (0 or 1)
  • max_scroll_depth (int)
  • click_position (index of object clicked on a click event or -1)
  • spotlight_click (0 or 1)
  • recent_item_click (0 or 1)
  • total_bookmarks (total number of bookmarks for this user)
  • total_history_size (number of entries in user's history)
  • session_duration (int in milliseconds)
  • unload_reason (click, search, close, unfocus, navigation, refresh, crash)
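The session definition above can be sketched as a small timer. This is only an illustration in Python, not the add-on's actual code: the `SessionTimer` class, its method names, and the injectable `clock` parameter are all hypothetical, and only two of the listed fields are filled in.

```python
import time
from typing import Callable, Optional


class SessionTimer:
    """Tracks one session: newtab gains focus -> newtab loses focus.

    Hypothetical helper for illustration; not part of the add-on.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so the timer can be exercised in tests.
        self._clock = clock
        self._started_at: Optional[float] = None

    def on_focus(self) -> None:
        """Record the moment the newtab page gains focus."""
        self._started_at = self._clock()

    def on_blur(self, unload_reason: str) -> Optional[dict]:
        """End the session and return a partial ping, or None if none is open."""
        if self._started_at is None:
            return None
        elapsed_ms = int(round((self._clock() - self._started_at) * 1000))
        self._started_at = None
        return {"session_duration": elapsed_ms, "unload_reason": unload_reason}
```

A monotonic clock is used here on the assumption that only the difference between the two timestamps matters, so changes to the wall clock during a session do not distort the duration.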

Are we missing anything here @emtwo ?

@tspurway
Contributor

tspurway commented Mar 1, 2016

  • crash data over a shortish time period might be useful
  • let's make sure metrics team has access to the redshift instance

@ncloudioj
Member

Based on the feedback of @oyiptong

  • client_id (unique id for user)
  • tab_id (unique id for that user's newtab) str
  • addon_version (e.g. "1.0") str
  • load_reason (newtab, focus, restore)
  • source (recent_links, recent_bookmarks, frecent_links, top_sites, spotlight) other
  • search (indicates a search was performed)
  • max_scroll_depth (int) 0
  • click_position (index of object clicked on a click event or -1?) -1 (global index or local index?)
  • total_bookmarks (total number of bookmarks for this user) -1
  • total_history_size (number of entries in user's history) -1
  • session_duration (int in milliseconds)
  • unload_reason (click, search, close, unfocus? navigation? refresh? crash?) other str

Edit1: scratch search as it's already been captured in unload_reason.
Edit2:

  • all the fields above are required in the ping. Each field will be either populated with the actual value, or the default value if not otherwise specified.
  • change tab_id type from int to str
  • use integer type for click_position for the click event, use -1 if it's non-click event
  • add 'other' to the unload_reason

FYI, @tspurway @emtwo

@emtwo
Contributor

emtwo commented Mar 3, 2016

Do we think we'll want to do any analytics on the time of day of activations? I was keeping track of a start timestamp, wondering if it's worth keeping?

@tspurway @ncloudioj @oyiptong

@oyiptong
Contributor

oyiptong commented Mar 3, 2016

The assumption is that the server-side will take note of the receipt time of a ping.
The receipt time is then taken as the datetime of note.

The client doesn't need to keep track of the start time.

emtwo pushed a commit that referenced this issue Mar 3, 2016
@ncloudioj
Member

@emtwo FYI, the Onyx endpoint for AS is live in stage:

"https://onyx_tiles.stage.mozaws.net/v3/links/activity-stream"

In production, it will be "https://tiles.services.mozilla.com/v3/links/activity-stream"

The add-on can send pings to Onyx via HTTP POST. Onyx returns status code 200 with an empty response on success; otherwise, it returns 400 if any error occurs.
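As a rough sketch of the request/response contract described above, here is a Python illustration (the add-on itself is not written in Python). The function names, the JSON body encoding, and the `Content-Type` header are assumptions; the thread only states that pings go over HTTP POST and that the endpoint answers 200 or 400.

```python
import json
import urllib.request

# Stage endpoint quoted in the thread.
STAGE_URL = "https://onyx_tiles.stage.mozaws.net/v3/links/activity-stream"


def build_ping_request(ping: dict, url: str = STAGE_URL) -> urllib.request.Request:
    """Build the POST request for a ping (JSON encoding is an assumption)."""
    body = json.dumps(ping).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send_ping(ping: dict, url: str = STAGE_URL, timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers 200."""
    try:
        request = build_ping_request(ping, url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTPError (e.g. the 400 case) and URLError both subclass OSError,
        # as do socket timeouts.
        return False
```

Splitting request construction from sending keeps the payload logic testable without touching the network.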

@emtwo
Contributor

emtwo commented Mar 3, 2016

@oyiptong is the server aware of the timezone of the client? If not, then the server doesn't actually know what time of day a given user is browsing, right?

Edit: sounds like we can infer timezone on the server using geoip, so no need to send client timestamp

emtwo pushed a commit that referenced this issue Mar 3, 2016
@oyiptong
Contributor

oyiptong commented Mar 3, 2016

I see what you mean. The local datetime. And yes, we obtain the country data on the server-side.
To obtain timezone awareness, we should store the state level information as well.

@ncloudioj
Member

Usually it doesn't make sense to save timestamps from the clients, as we can't control how they set their local time. I will try to get the timezone from the geo data, which is in turn derived from the IP address.

emtwo pushed a commit that referenced this issue Mar 3, 2016
fix(addon): #27 Update data collection format.
emtwo pushed a commit that referenced this issue Mar 5, 2016
@tspurway
Contributor

tspurway commented Mar 7, 2016

The MaxMind GeoIP database we use has Timezone as a field (https://www.maxmind.com/en/geoip2-city#features)
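To illustrate how a server-side receipt time could be turned into the user's local time of day, here is a minimal Python sketch. It assumes the GeoIP lookup has already produced an IANA time zone name such as "America/Toronto"; the lookup itself is not shown, and `local_receipt_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_receipt_time(received_at_utc: datetime, tz_name: str) -> datetime:
    """Convert the server's UTC receipt time to the client's local time.

    tz_name is assumed to come from a GeoIP city lookup (not shown here).
    """
    if received_at_utc.tzinfo is None:
        raise ValueError("receipt time must be timezone-aware")
    return received_at_utc.astimezone(ZoneInfo(tz_name))


# Example: a ping received at 18:30 UTC on Mar 7, 2016.
received = datetime(2016, 3, 7, 18, 30, tzinfo=timezone.utc)
local = local_receipt_time(received, "America/Toronto")
```

Since the result depends on the accuracy of the IP-to-location mapping, it is best treated as an approximation of the user's local time.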

@oyiptong
Contributor

oyiptong commented Mar 7, 2016

AFAIK, our servers are provisioned with the country db. We'll need to ensure they are provisioned with the city DB

@tspurway
Contributor

tspurway commented Mar 7, 2016

That's a good point - I will double-check with travis that we have the proper license somewhere

@ncloudioj
Member

Looking to add a filter for session_duration, if it goes beyond a certain threshold, say, 20 minutes, we treat it as an outlier and discard this ping.

What do you think? @tspurway @oyiptong

@oyiptong
Contributor

oyiptong commented Mar 7, 2016

We can capture the ping and just not count it when we do our aggregate queries. The information could still be useful.

@ncloudioj
Member

Moving the duration filtering to the query level also sounds viable. Plus, keeping the long-lived session pings could help with further investigation. OK, I will take this filter out of Infernyx. Thanks!
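The idea of keeping every ping and excluding outliers only at aggregation time can be sketched as follows. This is a Python illustration rather than the actual query: the function name is hypothetical, and the 20-minute cutoff is simply the example threshold floated earlier in the thread.

```python
from typing import Iterable

# Example threshold from the discussion: 20 minutes, in milliseconds.
MAX_SESSION_MS = 20 * 60 * 1000


def mean_session_duration(
    durations_ms: Iterable[int], max_ms: int = MAX_SESSION_MS
) -> float:
    """Average session_duration, ignoring sessions longer than max_ms.

    The stored pings are untouched; the filter applies to this aggregate only.
    """
    kept = [d for d in durations_ms if d <= max_ms]
    if not kept:
        raise ValueError("no sessions within the threshold")
    return sum(kept) / len(kept)
```

Because the threshold is a parameter of the query rather than of ingestion, it can be changed later without losing data.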

@ncloudioj
Member

Just confirmed that the timezone is not available in the GeoIP2 country database. It only comes with the city one.

@tspurway
Contributor

I just noticed there was a question about the click_position above that was never answered @ncloudioj @emtwo . It should definitely be an index, and not an x-y coordinate.

@ncloudioj
Member

click_top_site, click_spotlight, and click_top_activity have been moved to the source field.

As for the click_position, @emtwo and I will finalize the exact type and value today.

@tspurway
Contributor

Also, is it possible to grab telemetry data from #19 and include it in this ping? @ncloudioj @emtwo @mzhilyaev

@emtwo
Contributor

emtwo commented Mar 10, 2016

I think it should be fine to add metrics from #19 to the ping. @mzhilyaev should be able to confirm.

@ncloudioj
Member

Yes, I asked @oyiptong about this in the last meeting. It sounds like those are not that urgent compared to the other metrics. But we can definitely include them in the ping now.

@oyiptong can you confirm this please?

@mzhilyaev
Contributor

Yes, we can attach performance events or any metric computable from them to the ping

@ncloudioj
Member

@emtwo @tspurway FYI, based on the discussion we had today,

  • client_id (str: unique id for user)
  • tab_id (str: unique id for that user's newtab)
  • addon_version (str: e.g. "1.0")
  • load_reason (str: newtab, focus, restore, other)
  • source (str: recent_links, recent_bookmarks, frecent_links, top_sites, spotlight, other)
  • max_scroll_depth (int: use -1 for "N/A")
  • click_position (int: index of object clicked on a click event, use -1 for non-click ping)
  • total_bookmarks (int: total number of bookmarks for this user, use -1 for "N/A")
  • total_history_size (int: number of entries in user's history, use -1 for "N/A")
  • session_duration (int: in milliseconds)
  • unload_reason (str: click, search, close, unfocus, navigation, refresh, crash, other)
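The field list above could be checked with something like the following Python sketch. The field names, types, allowed values, and the -1 "N/A" convention come from the list; the function name, the choice to fill missing integer fields with -1, and the choice to reject unknown enum values (rather than map them to "other") are assumptions made for illustration.

```python
from typing import Any, Dict

PING_SCHEMA: Dict[str, type] = {
    "client_id": str, "tab_id": str, "addon_version": str,
    "load_reason": str, "source": str, "max_scroll_depth": int,
    "click_position": int, "total_bookmarks": int,
    "total_history_size": int, "session_duration": int,
    "unload_reason": str,
}

ALLOWED_VALUES = {
    "load_reason": {"newtab", "focus", "restore", "other"},
    "source": {"recent_links", "recent_bookmarks", "frecent_links",
               "top_sites", "spotlight", "other"},
    "unload_reason": {"click", "search", "close", "unfocus",
                      "navigation", "refresh", "crash", "other"},
}

# Fields whose list entry says to use -1 for "N/A" or non-click pings.
NA_DEFAULTS = {"max_scroll_depth": -1, "click_position": -1,
               "total_bookmarks": -1, "total_history_size": -1}


def normalize_ping(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Fill the -1 defaults, then check presence, type, and enum values."""
    ping = {**NA_DEFAULTS, **raw}
    for field, expected in PING_SCHEMA.items():
        if field not in ping:
            raise ValueError(f"missing required field: {field}")
        value = ping[field]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{field} must be {expected.__name__}")
        if field in ALLOWED_VALUES and value not in ALLOWED_VALUES[field]:
            raise ValueError(f"unexpected value for {field}: {value!r}")
    return ping
```

For example, a non-click ping that omits click_position comes back with click_position set to -1, while a ping without client_id is rejected.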

Note:

Question:

  • Regarding click_position, do we want to use the global index for each item on the page, or the local index for each section? @k88hudson Thoughts? I like the global one, easier to query in the database :)

Edit 1: We've decided to include the perf metrics in the ping.
Edit 2: The type of client_id should be string.

@ncloudioj
Member

Performance metrics to be included in the ping:

  • load_latency (int: in milliseconds)

@oyiptong @mzhilyaev could you add other metrics of interest to the list above please?

k88hudson pushed a commit to k88hudson/activity-stream that referenced this issue Mar 11, 2016
@mzhilyaev
Contributor

The load time metric is in milliseconds; the client default is null.

@ncloudioj
Member

Update: Add-on telemetry has been enabled since 1.0.4. Data processing and persistence look good so far. I will keep watching it as more metrics are brought into the pipeline.

ScottDowne added a commit to ScottDowne/activity-stream that referenced this issue Oct 5, 2018