Add the validation for the new-profile ping on Nightly #65
Add the validation for the new-profile ping on Nightly #65
Conversation
Duplicate pings percentage (by document id): 0.21% | ||
|
||
|
||
The 0.21% of ping duplicates is nice, compared to ~1% we usually get from the `main` and `crash` pings. However, nowdays we're running de-duplication by document id at the pipeline ingestion, so this might be a bit higher. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we have stats on duplicate documents that you can check - there is a parquet table called telemetry_duplicates_parquet
available in re:dash. I did a quick check and it doesn't look like the duplicate detection flagged any duplicate new-profile pings from June 1 until today:
SELECT count(*)
FROM telemetry_duplicates_parquet
WHERE submission_date_s3 >= '20170601'
AND fields.doc_type = 'new-profile'
returns 0 rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Mark! I've created a re:dash query and linked it in the report.
total_duplicates_clientid = deduped_docid_count - deduped_clientid.count() | ||
print("Duplicate pings percentage (by client id): {:.2f}%".format(pct(total_duplicates_clientid, deduped_docid_count))) | ||
``` | ||
Duplicate pings percentage (by client id): 0.89% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, 22. Not a whole lot to make conclusions on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree. Given the nature of the 'new-profile' ping, this is sort of expected. This could just be an handful of misbehaving clients or some bug. Either way, I think we'd have to wait until Beta+1 week to have a better understanding of this.
```python | ||
sorted_clientid_dupes = sorted(collected_clientid_reasons, key=lambda k: len(k[1]), reverse=True) | ||
misbehaving_clients = [client_id for client_id, _ in sorted_clientid_dupes] | ||
to_redact = subset.filter(lambda p: p.get("clientId") == misbehaving_clients[1]).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're only looking at pings that are from the 2nd misbehaving client? Shouldn't it be p.get("clientId") in misbehaving_clients
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to a more compact form now.
|
||
|
||
|
||
Interesting! Looks like many of these pings are missing the pingsender header from the request, meaning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but they're all from the same client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, rephrased that part to make it clearer.
|
||
|
||
|
||
Sending `new-profile` pings with reason `shutdown` seems to happen half of the times with misbehaving clients and about 10% of the times with well behaving clients. The pingsender doesn't seem to be the issue here: if we generate a ping at shutdown and try to send it with the pingsender, and fail, then it's normal for Firefox to pick it back up and send it. As long as we don't generate a new, different, `new-profile` ping for the same client. And this is something that well behaving clients don't do, so that 10% should be safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this follow from the numbers above? Looks like a third of pings from misbehaving clients are coming from pingsender and less than a quarter of pings from behaving clients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops. Looks like we're receiving about ~75% of the pings with the pingsender (~81% from well behaved clients). Changed this too with the correct numbers: the conclusion stays the same, looks like the pingsender in itself isn't the problem there, as most of the pings are being received from the pingsender with both good and bad behaving clients.
|
||
Sending `new-profile` pings with reason `shutdown` seems to happen half of the times with misbehaving clients and about 10% of the times with well behaving clients. The pingsender doesn't seem to be the issue here: if we generate a ping at shutdown and try to send it with the pingsender, and fail, then it's normal for Firefox to pick it back up and send it. As long as we don't generate a new, different, `new-profile` ping for the same client. And this is something that well behaving clients don't do, so that 10% should be safe. | ||
|
||
### Does the profileCreationDate match the date we received the pings? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting analysis, but given that we know there are dupes, would it be better to compare only the delays of the first new-profile pings we receive?
(probably won't change things much)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! As expected though, this resulted in a modest improvement.
# If there's a new-profile ping with reason 'shutdown', look for the closest next | ||
# 'main' ping with reason shutdown. | ||
for i, p in enumerate(ordered_pings): | ||
# Skip until we find the 'new-profile' ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This'd be an interesting case, no? Where we send a main/shutdown ping before sending a new-profile/shutdown ping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I've extended the case to account for that too. Looks like the 'new-profile' ping is always the first generated ping, if generated.
# Is the next ping a 'main' ping with reason 'shutdown'? | ||
next_ping = ordered_pings[next_index] | ||
if is_newprofile(next_ping): | ||
return ("Duplicate 'new-profile' ping.", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a special case of duplicates with no main/shutdown ping in between them. It is possible that there might be a new-profile/shutdown, then a main/shutdown, then a new-profile/shutdown.... but dupes are so infrequent maybe we can just gloss over that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored that part to simply count the number of times we have new-profile pings
if is_newprofile(next_ping): | ||
return ("Duplicate 'new-profile' ping.", 1) | ||
|
||
if next_ping.get("payload/info/reason") == "shutdown": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always true, as all non-newprofile pings are main/shutdown thanks to earlier filtering.
```python | ||
newprofile_plus_main = first_main.union(newping_shutdown) | ||
sorted_per_client = newprofile_plus_main.map(lambda p: (p["clientId"], [(p, p["meta/creationTimestamp"])]))\ | ||
.reduceByKey(lambda a, b: sorted(a + b, key=lambda k: k[1]))\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we guaranteed that, at shutdown, the main ping is created after the new-profile ping? The 98% below suggests that this is almost always the case, but is it guaranteed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'new-profile' should always be generated before the 'main', at shutdown. See here where we wait for the 'new-profile' to be generated before moving on to shut down TelemetrySession (which generates the 'main' ping). Although this is how it happens, we're not making any strict promise on that.
The 'new-profile' ping was scheduled at the wrong time: 0.881953867028 | ||
Duplicate 'new-profile' ping.: 0.542740841248 | ||
No 'shutdown' new-profile ping found: 0.203527815468 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing validation case: we didn't check that new-profile/startup is sent only from sessions that last longer than 30min
I don't know that we absolutely must validate that case. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, just added this checks. Looks good so far!
c5567e2
to
ecf43c8
Compare
Thanks Chris for the feedback so far. This is RFAL. I've also attempted to pin duplicate new-profile pings on crashes, but looks like we only have crash pings for few of the cases :-\ I left my attempt there so that we can run this again on Beta FWIW. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I look forward to seeing the beta week 1 results.
return ("Duplicate 'new-profile' ping.", 1) | ||
|
||
if not newprofile_mask[0]: | ||
return ("The 'new-profile' ping it's not the first ping", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar nit: "The 'new-profile' ping is not the first ping"
ecf43c8
to
510adb2
Compare
Thanks Chris! |
No description provided.