This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

Wrong padding multiplier/wrong number of users for 2020-08-18? #2

Closed
PalminX opened this issue Aug 19, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@PalminX

PalminX commented Aug 19, 2020

It seems that the data for 2020-08-18 is not quite right:

  • https://ctt.pfstr.de/users/2020-08-18.txt shows a detected padding number of 1, resulting in 379 users
  • The graphs for number of users and number of keys show 98 users and 952 keys, which doesn't match the numbers from approx. users file
  • The dashboard from @micb25 https://micb25.github.io/dka/ shows 75 users, 749 keys and a padding of 5 for 2020-08-18, which seems to be about right.

Is there a problem in the downloaded source data? If so, why does https://micb25.github.io/dka/ show more reasonable values?

@PalminX
Author

PalminX commented Aug 19, 2020

Hm, there are 3749 keys in the 2020-08-18 file, which is obviously not a multiple of 5. So this may be more a question for @micb25: how does he handle these discrepancies?
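
A quick way to check this kind of inconsistency, as a minimal sketch: the function names below are made up for illustration and are not part of any of the tools mentioned in this thread. It only tests whether a key count divides evenly by an assumed padding multiplier.

```python
def padding_remainder(total_keys: int, multiplier: int = 5) -> int:
    """Return how many keys are left over after grouping by the multiplier."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return total_keys % multiplier


def is_consistent(total_keys: int, multiplier: int = 5) -> bool:
    """True if the key count is an exact multiple of the assumed multiplier."""
    return padding_remainder(total_keys, multiplier) == 0


if __name__ == "__main__":
    # Key count reported above for the 2020-08-18 daily file.
    print(padding_remainder(3749))  # 4 keys left over
    print(is_consistent(3749))      # False
```

A non-zero remainder does not say what went wrong, only that a multiplier of 5 cannot explain the file on its own.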

@PalminX
Author

PalminX commented Aug 19, 2020

OK, I saw that @micb25 sometimes manually corrected the multiplier in the past.
So I think here you should also have some way of handling or flagging these inconsistent values, because currently the number of users from 2020-08-18 is probably too high

@micb25

micb25 commented Aug 19, 2020

> OK, I saw that @micb25 sometimes manually corrected the multiplier in the past.

Yes, I had to correct this manually for one of yesterday's hourly packages as well as for one package in the past (2020-08-04). I wonder what situation causes these issues. Fortunately, it seems to happen very rarely. However, the impact on the statistics can be quite significant, as you pointed out.

Edit: As a consequence, I do manually check the statistics every day before uploading the new data. And I think this is still necessary for the future, at least as long as fake diagnosis keys are being generated.

@mh-

mh- commented Aug 19, 2020

For one specific case, there was an explanation here: corona-warn-app/cwa-server#693
A user submitted twice, the original keys were accepted only once (to avoid duplicate keys), but 2x4 random padding keys were added.
If someone uses my diagnosis-keys tools, I’d suggest not always using the auto-detect feature, but fixing the factor to 5 for now and changing it when required.
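
The arithmetic behind that explanation can be sketched as follows. This is a simplified model under stated assumptions, not the cwa-server logic: it assumes every real key is published together with `multiplier - 1` padding keys, and that a duplicate submission adds the padding a second time without adding the real keys again. The function and parameter names are hypothetical.

```python
def published_keys(real_keys: int, multiplier: int = 5, resubmitted_keys: int = 0) -> int:
    """Estimate the published key count under the simplified model.

    real_keys:        distinct real keys accepted by the server
    multiplier:       assumed padding factor (1 real key + multiplier - 1 padding keys)
    resubmitted_keys: real keys that were submitted a second time (assumption:
                      only their padding is added again, not the keys themselves)
    """
    padding_per_key = multiplier - 1
    normal = real_keys * multiplier
    extra_padding = resubmitted_keys * padding_per_key
    return normal + extra_padding


if __name__ == "__main__":
    # One user with one key, submitted twice: 1 real key + 2 x 4 padding keys.
    print(published_keys(real_keys=1, resubmitted_keys=1))  # 9
    # The total is then no longer a multiple of 5, which confuses auto-detection.
    print(published_keys(real_keys=100, resubmitted_keys=1) % 5)  # 4
```

Under this model a single duplicate submission is enough to break divisibility by 5, which is why fixing the factor is more robust than detecting it.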

@janpf
Owner

janpf commented Aug 19, 2020

>   • https://ctt.pfstr.de/users/2020-08-18.txt shows a detected padding number of 1, resulting in 379 users
>   • The graphs for number of users and number of keys show 98 users and 952 keys, which doesn't match the numbers from approx. users file

The https://ctt.pfstr.de/X/Y.txt files are generated from the published daily package, while the graphs are based on the hourly packages, so there will be a discrepancy.
I do it this way because there is no point in making the end user click through 24 hourly files per day, but the analysis of the hourly files is of course more precise.

I've now changed it so that the https://ctt.pfstr.de/X/Y.txt files are always analysed with a fixed multiplier of 5. If the multiplier is wrongly detected, or actually changes, this will now be visible by comparing those files to the graphs (a difference of 1 or 2 users will nearly always be present).
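
That comparison could be automated along these lines. This is only a sketch, not what the workflow in this repository does: the function name and the tolerance of 2 users (taken from the "1 or 2 users" remark) are assumptions.

```python
def needs_review(daily_users: int, hourly_users: int, tolerance: int = 2) -> bool:
    """Flag a day when the two user estimates differ by more than the tolerance.

    daily_users:  estimate from the daily package (fixed multiplier)
    hourly_users: estimate from the hourly packages (as shown in the graphs)
    tolerance:    assumed acceptable difference in users
    """
    if tolerance < 0:
        raise ValueError("tolerance must not be negative")
    return abs(daily_users - hourly_users) > tolerance


if __name__ == "__main__":
    # Numbers reported in this issue for 2020-08-18: 379 users in the
    # daily file versus 98 users in the graphs.
    print(needs_review(379, 98))  # True
```

A flagged day still needs a human look, since the check cannot tell which of the two estimates is the wrong one.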

> Is there a problem in the downloaded source data? If so, why does https://micb25.github.io/dka/ show more reasonable values?

Kind of, yes. If the padding is detected strictly automatically, the value jumps all over the place for the hourly packages, as you correctly noted:

> There are 3749 keys in the 2020-08-18 file, which is obviously not a multiple of 5.

I wanted to keep the process of updating the page and analyzing new data as automated and "hands-off" as possible, so these cases were handled incorrectly on my end.
I did this to generate the data as transparently as possible, without any manual interventions.
Everybody can replicate my numbers by running the commands defined in the workflow file in that order.

> So this may be more a question for @micb25: how does he handle these discrepancies?

It seems the only way to handle this is to set some reasonable hard-coded values like @micb25 did.

> So I think here you should also have some way of handling or flagging these inconsistent values, because currently the number of users from 2020-08-18 is probably too high

> If someone uses my diagnosis-keys tools, I’d suggest not always using the auto-detect feature, but fixing the factor to 5 for now and changing it when required.

I've put some safeguards in place, which should fix it for the moment.
I will use -n -a -m 5 (so with automatic detection activated, but capped at 5) on new packages every day, and when an issue appears I will manually flag the file to be reanalyzed with a fixed multiplier of 5 by adding it to this list.
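
The decision described here can be sketched roughly as below. This is an illustration, not the actual workflow: it assumes "capped at 5" means the detected value is limited to at most 5, and the function name, the `flagged` set, and its example entry are hypothetical.

```python
from datetime import date


def choose_multiplier(day: date, detected: int, flagged: set[date], cap: int = 5) -> int:
    """Pick the padding multiplier to use for one day's package.

    day:      date of the package
    detected: multiplier found by automatic detection
    flagged:  days manually marked for reanalysis with the fixed multiplier
    cap:      fixed multiplier, also used as the upper bound for detection
    """
    if detected < 1:
        raise ValueError("detected multiplier must be at least 1")
    if day in flagged:
        return cap  # manual override: always use the fixed value
    return min(detected, cap)  # automatic detection, limited by the cap


if __name__ == "__main__":
    # Hypothetical flag list containing the day discussed in this issue.
    flagged = {date(2020, 8, 18)}
    print(choose_multiplier(date(2020, 8, 18), detected=1, flagged=flagged))  # 5
    print(choose_multiplier(date(2020, 8, 19), detected=9, flagged=flagged))  # 5
```

Keeping the flag list as plain data means anyone rerunning the analysis can see which days were overridden by hand.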

Thank you for notifying me about the issue!

@janpf janpf added the bug Something isn't working label Aug 19, 2020
@janpf janpf closed this as completed Aug 22, 2020

4 participants