Implementation of new TRL vectors #12

Open · micb25 opened this issue Oct 7, 2020 · 15 comments
Labels
enhancement New feature or request

Comments

@micb25
Contributor

micb25 commented Oct 7, 2020

The next version of the CWA will contain new TRL profiles that depend on the patient's symptom status and the onset of symptoms. This will make it significantly more difficult to accurately estimate the number of people who shared their diagnosis keys. However, I think it should still be possible to estimate the number of people with reasonable accuracy.

	var transmissionRiskVector: [Int32] {
		switch self {
		case .noInformation:
			return [5, 6, 7, 7, 7, 6, 4, 3, 2, 1, 1, 1, 1, 1, 1]
		case .nonSymptomatic:
			return [4, 4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
		case .symptomaticWithUnknownOnset:
			return [5, 6, 8, 8, 8, 7, 5, 3, 2, 1, 1, 1, 1, 1, 1]
		case .lastSevenDays:
			return [4, 5, 6, 7, 7, 7, 6, 5, 4, 3, 2, 1, 1, 1, 1]
		case .oneToTwoWeeksAgo:
			return [1, 1, 1, 1, 2, 3, 4, 5, 6, 6, 7, 7, 6, 6, 4]
		case .moreThanTwoWeeksAgo:
			return [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 4, 5]
		case .daysSinceOnset(let daysSinceOnset):
			assert(daysSinceOnset < 22)
			return SymptomsOnset.daysSinceOnsetRiskVectors[min(daysSinceOnset, 21)]
		}
	}

// [...]

	private static let daysSinceOnsetRiskVectors: [[Int32]] = [
		[8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
		[8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1],
		[6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1],
		[5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1],
		[3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1],
		[2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1],
		[2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1],
		[1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1],
		[1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1],
		[1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2],
		[1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4],
		[1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6],
		[1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7],
		[1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8],
		[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
		[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
	]
@mh-
Owner

mh- commented Oct 7, 2020

So - the user gets a choice to either select from the more vague options above, or provide the exact date, in which case a vector from the 2nd table is selected?

@mh-
Owner

mh- commented Oct 7, 2020

Anyway - how to estimate? Just count the keys on the most recent day and divide by a padding factor? But then the (non-published) padding factor becomes the critical value.

@micb25
Contributor Author

micb25 commented Oct 7, 2020

Yes, this is exactly what I read out of the CWA source code.

private var optionGroupSelection: OptionGroupViewModel.Selection? {
		didSet {
			switch optionGroupSelection {
			case .datePickerOption(index: 0, selectedDate: let date):
				selectedSymptomsOnsetOption = .exactDate(date)
			case .option(index: 1):
				selectedSymptomsOnsetOption = .lastSevenDays
			case .option(index: 2):
				selectedSymptomsOnsetOption = .oneToTwoWeeksAgo
			case .option(index: 3):
				selectedSymptomsOnsetOption = .moreThanTwoWeeksAgo
			case .option(index: 4):
				selectedSymptomsOnsetOption = .preferNotToSay
			case .none:
				selectedSymptomsOnsetOption = nil
			default:
				fatalError("This selection has not yet been handled.")
			}
		}
	}

Maybe someone from the CWA developers (@thomasaugsten @nickguendling) can briefly confirm/explain this behavior.

@micb25
Contributor Author

micb25 commented Oct 7, 2020

Anyway - how to estimate? Just count the keys on the most recent day and divide by a padding factor? But then the (non-published) padding factor becomes the critical value.

According to private communications, the generation of fake diagnosis keys will be terminated in the near future.

@thomasaugsten

Here is the pull request for a better understanding:
corona-warn-app/cwa-app-ios#1151
It's planned to set the padding factor to 1 because at the moment no padding is needed.

@mh-
Owner

mh- commented Oct 11, 2020

@micb25 @thomasaugsten
So I started to work on this...

At the moment, count_keys.py mainly displays the number of keys it finds each day per TRL:

% curl https://svc90.main.px.t-online.de/version/v1/diagnosis-keys/country/DE/date/2020-10-10 --output 2020-10-10.zip 
% ./count_keys.py -d 2020-10-10.zip -m 5                                                                          
Exposure Notification Diagnosis Key Parser / Counter
This script counts published Diagnosis Keys.

File '2020-10-10.zip' read.

Parsing the Diagnosis Key list...
Length: 23657 keys (4731 without padding)
WARNING: The number of keys is not a multiple of 5!
See https://github.com/corona-warn-app/cwa-server/issues/693#issuecomment-672688610 for a possible explanation.
Number of keys per TRL:
Date:             1      2      3      4      5      6      7      8   Sum:
2020-10-09:       0      0      0      0      0   2269      0      0    2269
2020-10-08:       0      0      0      0      0     30      0   2024    2054
2020-10-07:       0      0      0      0      0      0      0   1889    1889
2020-10-06:       0      0      0      0      0     15      0   1814    1829
2020-10-05:       0      0      0      0   1734      5      0     45    1784
2020-10-04:       0      0   1689      0     45      5      0     25    1764
2020-10-03:    1649      0     70      0      5      0      0     30    1754
2020-10-02:    1709      0      5      0     25      0      0     20    1759
2020-10-01:    1704      0     10      0     10      0      0     20    1744
2020-09-30:    1694      0     20      0     15      0      0      5    1734
2020-09-29:    1699      0     10      0      5      0      0      0    1714
2020-09-28:    1684      0      5      0      0      0      0      0    1689
2020-09-27:    1664      0      0      0      0     10      0      0    1674
Total:        11803      0   1809      0   1839   2334      0   5872   23657

Pre-V1.5 user count:  467
Post-V1.5 user count: 454

For the current fixed TRL profile, we could just use the sum of the TRL 6 column, because "6" is currently unique in the TRL vector and indicates the "freshest" uploaded key.
So the number of users would be:
2334/5 = 467 (because padding is still 5) - and this comment probably explains why 2334 is not a multiple of 5.
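
A minimal sketch of that calculation (assuming the per-day/per-TRL counts from the table above are available as a numpy array counts of shape (days, 8), where column j holds the TRL j+1 counts):

import numpy as np

PADDING_FACTOR = 5  # current server-side padding; planned to be reduced to 1

def estimate_users_fixed_profile(counts: np.ndarray) -> int:
    # TRL 6 appears exactly once per uploaded (pre-1.5) vector, on the freshest key,
    # so the TRL 6 column sum divided by the padding factor estimates the user count.
    trl6_total = counts[:, 5].sum()  # column index 5 == TRL 6
    return int(round(trl6_total / PADDING_FACTOR))

# with the table above: 2334 / 5 ≈ 467 users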

About the last line:

2020-09-27:    1664      0      0      0      0     10      0      0    1674

10/5 = 2 users apparently had GAEN last active on 2020-09-27, and on 2020-10-09 decided to upload their keys.

So far so good.

This kind of nicely explains what we are up against in the future:
We will not have a unique starting TRL anymore (like the 6 above).

  1. For a simple approach, we could find a lower bound for the number of users by taking the maximum of the row sums (which would typically be in the first row) - here: 2269 --> 454.
    2269/2334 = 97%; in this case we lost the 3% of users whose first uploaded keys were older. (A small sketch of this approach follows below.)

  2. A more complex approach would iteratively (or recursively) try to apply (parts of) the vectors to this table, "crossing out" the TRLs that are covered by the currently attempted section of the vector in the process, and thereby search for solutions where all TRLs in that table can be covered.
    This approach also somewhat depends on whether corona-warn-app/cwa-documentation#343 ("When TEKs are missing on the device, wrong TRLs are applied during Diagnosis Key upload") is fixed or not. The discussion there indicates that it will be fixed; so the vectors will all start at the last day (or the previous day), and some positions will be missing.

Approach 2) could find users missed in approach 1) who had GAEN deactivated for at least one day, but only if all other users had their phone active on these days - otherwise the algorithm would attribute those "otherwise missed" keys to the "normal flow" of keys.
I wonder how often this will happen, and if it's worth the additional effort.
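
For reference, approach 1) is just a few lines (a sketch, assuming the same hypothetical counts array as above, one row per day):

def lower_bound_users(counts, padding_factor=5):
    # each uploader contributes at most one key per calendar day, so the largest
    # per-day key count (divided by the padding factor) is a lower bound on uploaders
    row_sums = counts.sum(axis=1)
    return int(round(row_sums.max() / padding_factor))

# for the 2020-10-10 package: max row sum is 2269 --> about 454 users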

What do you think?

P.S.
./parse_keys.py -d 2020-10-10.zip -m 5 -u finds (only) 459 users, and 133 non-parsed keys.

@janpf
Contributor

janpf commented Oct 22, 2020

Two quick questions about the package from 2020-10-21, as I'm trying to understand how the trl vectors are applied (I'm posting here, as I'm unsure whether it is tool-related or server-related):

Parsing the Diagnosis Key list...
Length: 14309 keys (14309 without padding)
Number of keys per TRL:
Date:            1     2     3     4     5     6     7     8   Sum:
2020-10-20:    106   101    86    85   203   648    *1*  144   1374
2020-10-19:     68    69   131     0    77   182   247   485   1259
2020-10-18:     44   118    40     0    46    72   326   497   1143
2020-10-17:    104    32    28     4    40    83   358   470   1119
2020-10-16:     97    17    25    36   244   311   180   191   1101
2020-10-15:     96    43   219   277    24   203    75   149   1086
2020-10-14:    333    83   196    97    97    86    60   110   1062
2020-10-13:    412   280     9   147     6    76    45    75   1050
2020-10-12:    683    77    78    58     2    60    36    46   1040
2020-10-11:    747   135     2    44     5    39    33    22   1027
2020-10-10:    864    50     2    38     0    26    20    14   1014
2020-10-09:    909    42     2    22     3    20     6    10   1014
2020-10-08:    941    28     4    11     2    17     3     7   1013
2020-10-07:     *6*    0     0     0     1     0     0     0      7
Total:        5410  1075   822   819   750  1823  1390  2220  14309

Where does the single 7 for 2020-10-20 come from? There is no vector that could have produced it to my understanding.

Why are there only 6 1s/7 keys overall on the earliest day? This second error (?) happens on other dates as well.

@mh-
Owner

mh- commented Oct 22, 2020

Where does the single 7 for 2020-10-20 come from? There is no vector that could have produced it to my understanding.

It doesn't match the profiles on the top of this issue, correct.
I see two possible explanations:

  • Wrong device time: there's some text in some issues in the CWA repo about that; I think one theory is that users set their device time to the future, to get advantages in mobile games, reduce waiting time there?!
  • Federation: keys from Ireland or Italy might have different properties.

Why are there only 6 1s/7 keys overall on the earliest day? This second error (?) happens on other dates as well.

  • I would think Federation again. German CWA apps upload only 13 keys, and this 14th day will come via input from Federation.

mh- added the enhancement (New feature or request) label on Oct 22, 2020
@janpf
Contributor

janpf commented Oct 22, 2020

Thanks! That explains it.
I'm cross-referencing #14, as this will likely be my best shot at this, despite containing a lot of guesswork, so:

disclaimer: everything following is highly speculative and is just a test to see what is possible without manual optimization

Basically what we are trying to do is pattern-matching.
I'll try to solve it using an Integer Linear Program (ILP) and will be ignoring federated keys as it gets quite complicated with them.

Feel free to skip to Implementation, if you don't care for my clumsy explanations paired with awkward formatting ;)

Theoretical Explanation

What we are trying to find is the minimal (? more on that later) set of trl-profiles uploaded that matches the daily published distribution of uploaded keys:
(sorry, github issues don't support latex in running text (yet?))

We have T unique transmission risk vectors, where every vector v_{i} contains one risk level l of the available risk levels L for every day d: v_{id} = l.
So let k_{dl} be the number of keys we are observing on day d for risk level l.

Let's introduce a variable for every vector v_{i} that counts how often it got uploaded: trv_{i}

What we are trying to minimize is the sum over our variables trv_{i}.
Now we have a minimization problem on our hands:
minimize \sum_{trv_i \in T} trv_i

What's missing are constraints:
We need enough keys for every day for every risk level in our selected transmission risk vectors to explain every key we see in our daily package.

So we add a constraint

  • for every day d

  • for every risk level l

  • that the sum off all the selected transmission risk vectors, that are relevant for risk level l for the given day d: T_{relevant}

  • are at least as many as we are seeing in our packages for that day for that risk level (k_{dl}):

so for each day d \in D, for each level l \in L: \left(\sum_{i \in T_{relevant}} trv_i\right) \geq k_{dl}

now we just need some additional constraints that all trv_{i} are at least zero, in order to avoid weird solutions where we are subtracting keys.
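
putting the pieces together (in plain LaTeX, since the render links above may not survive), the whole ILP reads:

\begin{aligned}
\text{minimize}\quad & \sum_{i \in T} trv_i \\
\text{subject to}\quad & \sum_{i \in T:\; v_{id} = l} trv_i \;\geq\; k_{dl} \quad \text{for all } d \in D,\ l \in L \\
& trv_i \in \mathbb{Z}_{\geq 0} \quad \text{for all } i \in T
\end{aligned}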

skip here if you want

Implementation

So let's implement it in Python using our favourite ILP solver library ;) (read: the first and only one I got running out of the box without booting Linux).

I'll demonstrate it here using the latest package from 2020-10-21.
What we need is the daily key distribution, which your script provides @mh-:

Date:            1     2     3     4     5     6     7     8   Sum:
2020-10-20:    106   101    86    85   203   648     1   144   1374
2020-10-19:     68    69   131     0    77   182   247   485   1259
2020-10-18:     44   118    40     0    46    72   326   497   1143
2020-10-17:    104    32    28     4    40    83   358   470   1119
2020-10-16:     97    17    25    36   244   311   180   191   1101
2020-10-15:     96    43   219   277    24   203    75   149   1086
2020-10-14:    333    83   196    97    97    86    60   110   1062
2020-10-13:    412   280     9   147     6    76    45    75   1050
2020-10-12:    683    77    78    58     2    60    36    46   1040
2020-10-11:    747   135     2    44     5    39    33    22   1027
2020-10-10:    864    50     2    38     0    26    20    14   1014
2020-10-09:    909    42     2    22     3    20     6    10   1014
2020-10-08:    941    28     4    11     2    17     3     7   1013
2020-10-07:      6     0     0     0     1     0     0     0      7
Total:        5410  1075   822   819   750  1823  1390  2220  14309

now we parse that matrix into a numpy array
(note: I included an empty column, to make indexing via risk levels easier, since risk levels start at 1)
(note: I removed the erroneous/federated/... key 7 manually, as otherwise no solution would be possible)
(note: I removed the last row, as I can't pattern match foreign keys atm)

keys =
[[  0 106 101  86  85 203 648   0 144]
 [  0  68  69 131   0  77 182 247 485]
 [  0  44 118  40   0  46  72 326 497]
 [  0 104  32  28   4  40  83 358 470]
 [  0  97  17  25  36 244 311 180 191]
 [  0  96  43 219 277  24 203  75 149]
 [  0 333  83 196  97  97  86  60 110]
 [  0 412 280   9 147   6  76  45  75]
 [  0 683  77  78  58   2  60  36  46]
 [  0 747 135   2  44   5  39  33  22]
 [  0 864  50   2  38   0  26  20  14]
 [  0 909  42   2  22   3  20   6  10]
 [  0 941  28   4  11   2  17   3   7]]

now we load all currently possible transmission risk vectors into another matrix

trvs = 
[[6 6 4 6 5 1 1 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [8 7 3 8 6 1 1 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [8 7 2 8 7 1 1 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1]
 [8 7 1 8 7 2 1 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1]
 [5 6 1 7 7 3 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1]
 [3 4 1 5 6 4 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1]
 [1 3 1 3 5 5 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1]
 [1 2 1 2 4 6 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1]
 [1 1 1 1 3 6 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1]
 [1 1 1 1 2 7 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1]
 [1 1 1 1 1 7 2 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1]
 [1 1 1 1 1 6 3 1 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1]
 [1 1 1 1 1 6 4 1 1 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1]]

(note: we cut the latest key for every vector, as today's key doesn't get published right away.

if we think about this further, this means my solution more or less depends on us being able to filter out the single key that gets published the next day at 3am; otherwise the pattern matching will of course go haywire, which could possibly even be the case in this example)

let's create a solver:

from ortools.linear_solver import pywraplp
solver: pywraplp.Solver = pywraplp.Solver.CreateSolver("SCIP")

create our variables for every possible transmission vector
(note: this already includes the constraints that all variables must be >= 0):

from typing import List

vars: List[pywraplp.Variable] = []
for i in range(trvs.shape[1]):  # trv_i: number of people who submitted vector i
    trvi: pywraplp.Variable = solver.IntVar(0, solver.infinity(), f"trv{i}")
    vars.append(trvi)

now we need to assign each variable to every field in keys its vector affects:

# key_vars[day][level] collects all variables whose vector produces a key with
# that risk level on that day (the initial 0 entry is just a harmless summand)
key_vars = np.zeros((keys.shape[0], keys.shape[1], 1), dtype=int).tolist()
for i, var in enumerate(vars):
    for day in range(keys.shape[0]):
        key_vars[day][trvs[day, i]].append(var)

all that is left now is to add the constraint that we need at least k_{dl} keys in our solution for every key that was uploaded:

for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        # adding the constraint:
        solver.Add(sum(key_vars[day][level]) >= (int(keys[day, level])))

short debug print:

print("Number of variables =", solver.NumVariables()) # 29 (because there are 29 possible different vectors including the pre1.5 one, which to my understanding could still get published)
print("Number of constraints =", solver.NumConstraints()) # 117

let's define our minimization goal:

solver.Minimize(sum(vars))

and then solve the problem:

status = solver.Solve() # 1373 vectors uploaded (note: I removed one 7 by hand)
print(f"took: {solver.wall_time()}") # 30ms

let's print the suggested distribution (trv[i] corresponds to column i in the trv matrix above)
trv0 216 means the vector in column 0 got submitted 216 times (in the hypothetical solution we just calculated).
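
(roughly how these values can be read out of the solver -- just a sketch:)

for var in vars:
    count = int(var.solution_value())
    if count > 0:
        print(var.name(), count)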

trv0	216
trv1	243
trv2	85
trv3	12
trv4	128
trv5	14
trv7	4
trv8	37
trv9	103
trv10	177
trv11	75
trv12	86
trv13	46
trv14	55
trv15	28
trv16	12
trv17	6
trv18	7
trv19	9
trv20	5
trv21	16
trv22	3
trv23	2
trv24	4

if we now calculate everything backwards and "simulate" a submission of the calculated number of keys, we arrive at the following expected distribution:

predicted_keys =
[[   0  106  101   86   85  203  648    0  144]
 [   0   78   83  131    0   86  203  247  545]
 [   0   66  125   55    0   46   90  408  583]
 [   0  131   32   28    4   55   83  474  566]
 [   0  124   17   26   37  244  401  317  207]
 [   0  119   53  222  360   24  333   75  187]
 [   0  367  117  262  177  148   87   86  129]
 [   0  454  453    9  203    7  106   46   95]
 [   0  883   94  133   86    9   67   55   46]
 [   0  956  219   16   46    5   64   42   25]
 [   0 1166   52    3   55   16   33   26   22]
 [   0 1212   59    2   28    3   42    6   21]
 [   0 1267   28    4   12    2   23    7   30]]
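
(for reference, a sketch of that back-calculation, assuming solution[i] holds the rounded integer value of trv_i, in the same column order as trvs:)

predicted_keys = np.zeros_like(keys)
for i, count in enumerate(solution):
    for day in range(keys.shape[0]):
        # vector i contributes `count` keys of level trvs[day, i] on that day
        predicted_keys[day, trvs[day, i]] += count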

calculating missing keys

predicted_keys - keys = # missing keys
[[  0   0   0   0   0   0   0   0   0]
 [  0  10  14   0   0   9  21   0  60]
 [  0  22   7  15   0   0  18  82  86]
 [  0  27   0   0   0  15   0 116  96]
 [  0  27   0   1   1   0  90 137  16]
 [  0  23  10   3  83   0 130   0  38]
 [  0  34  34  66  80  51   1  26  19]
 [  0  42 173   0  56   1  30   1  20]
 [  0 200  17  55  28   7   7  19   0]
 [  0 209  84  14   2   0  25   9   3]
 [  0 302   2   1  17  16   7   6   8]
 [  0 303  17   0   6   0  22   0  11]
 [  0 326   0   0   1   0   6   4  23]]

missing number of keys: `3548`
missing number of keys per uploaded vector: `2.58` #  which fits very nicely into the average of the graph of "number of keys published per user per day"

I'd argue it works quite OK, except for the level 1 transmission risks.
This could potentially be influenced by the 3 o'clock package which contains a few level 6 keys (at least in the past), which my algorithm then of course tries to match with a vector that includes a lot of level 1 exactly in those earlier days.

btw:
alternatively we could replace our minimization goal such that we are not looking for the minimum number of vectors submitted, but the solution which minimizes the difference between the actually submitted keys and the predicted keys:

min_fun = 0
for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        min_fun += sum(key_vars[day][level]) - int(keys[day, level])
solver.Minimize(min_fun)

this results in the exact same distribution (in this case).

alternative idea:

we can invert the constraints and convert the problem into a maximization problem:

for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        # solver.Add(sum(key_vars[day][level]) >= (int(keys[day, level])))
        solver.Add(sum(key_vars[day][level]) <= (int(keys[day, level])))
solver.Maximize(sum(vars))

this results in the maximum number of keys we can pattern match with only "complete" vectors, where no key was skipped:

status = solver.Solve() # 1009 vectors are present without any skipped keys

distribution:
trv0 213.0
trv1 180.0
trv2 68.0
trv3 13.0
trv4 73.0
trv5 9.0
trv7 4.0
trv8 29.0
trv9 76.0
trv10 65.0
trv11 74.0
trv12 58.0
trv13 44.0
trv14 36.0
trv15 21.0
trv16 11.0
trv17 6.0
trv18 3.0
trv19 2.0
trv20 5.0
trv22 2.0
trv23 2.0
trv24 4.0
trv25 2.0
trv26 1.0
trv27 8.0

predicted_keys =
[[  0  76  80  58  68 147 471   0 109]
 [  0  55  57 112   0  58 147 184 396]
 [  0  44 100  36   0  44  62 282 441]
 [  0  97  26  21   4  36  73 329 423]
 [  0  94  13  20  29 234 292 151 176]
 [  0  96  34 219 265  24 159  74 138]
 [  0 333  83 196  65  88  85  58 101]
 [  0 409 263   2 147   3  73  44  68]
 [  0 665  76  78  58   2  56  36  38]
 [  0 737 135   0  44   5  38  30  20]
 [  0 864  50   2  36   0  26  20  11]
 [  0 906  42   2  21   2  20   6  10]
 [  0 941  24   4  11   2  17   3   7]]

predicted_keys - keys = # keys, which were not yet assigned a vector, as no more vector can be found which would explain any combination of columns of keys
[[   0  -30  -21  -28  -17  -56 -177    0  -35]
 [   0  -13  -12  -19    0  -19  -35  -63  -89]
 [   0    0  -18   -4    0   -2  -10  -44  -56]
 [   0   -7   -6   -7    0   -4  -10  -29  -47]
 [   0   -3   -4   -5   -7  -10  -19  -29  -15]
 [   0    0   -9    0  -12    0  -44   -1  -11]
 [   0    0    0    0  -32   -9   -1   -2   -9]
 [   0   -3  -17   -7    0   -3   -3   -1   -7]
 [   0  -18   -1    0    0    0   -4    0   -8]
 [   0  -10    0   -2    0    0   -1   -3   -2]
 [   0    0    0    0   -2    0    0    0   -3]
 [   0   -3    0    0   -1   -1    0    0    0]
 [   0    0   -4    0    0    0    0    0    0]]

maybe those keys can then be matched with another algorithm.
this suggests, btw, that indeed, as presumed above, the number of level 6 keys for the latest day is in no way matchable to the rest of the data.

TLDR: please publish some raw data and/or implement an official dashboard with the official data ;)

ideas for improvements:

  • This approach could benefit from applying it to hourly packages, as fewer keys => fewer ways to mess up the pattern matching.
  • exclude the 3 o'clock packages, which submit the missing key from the previous day, as that package cannot be pattern-matched by the algorithm
  • somebody formulates a minimization problem with additional variables, which allows for skipping of keys for single instances of vectors, so that we not only minimize the number of submitted vectors, but the number of skipped keys as well (rough sketch below)
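
a very rough, untested sketch of that last idea (on a fresh solver, i.e. without the >= constraints from above; the relative weights in the objective are arbitrary):

infinity = solver.infinity()
unexplained = {}  # observed keys at (day, level) that no chosen vector explains
skipped = {}      # keys a chosen vector predicts at (day, level) but that are missing from the package
for day in range(keys.shape[0]):
    for level in range(keys.shape[1]):
        unexplained[day, level] = solver.IntVar(0, infinity, f"u_{day}_{level}")
        skipped[day, level] = solver.IntVar(0, infinity, f"s_{day}_{level}")
        # coverage by chosen vectors + unexplained == observed + skipped
        solver.Add(sum(key_vars[day][level]) + unexplained[day, level]
                   == int(keys[day, level]) + skipped[day, level])
# minimize uploaded vectors and both kinds of mismatch at the same time
solver.Minimize(sum(vars) + sum(unexplained.values()) + sum(skipped.values()))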

I'm open to suggestions/improvements, as I have never worked with ILPs before and it's been a few years since I last dealt with them in theory.

@thomasaugsten

Publishing of the data is planned. Also be aware of the new package with federation keys:
https://github.com/corona-warn-app/cwa-app-ios/blob/c02a8b20f2e086c3f5d2b5c5f1726daeeb17b651/src/xcode/ENA/ENA/Source/Services/Exposure%20Transaction/ExposureDetection.swift#L34

@felixlen

disclaimer: everything following is highly speculative and is just a test to see what is possible without manual optimization

Basically what we are trying to do is pattern-matching.
I'll try to solve it using an Integer Linear Program (ILP) and will be ignoring federated keys as it gets quite complicated with them.

That is quite an interesting approach. Having regression problems in mind, I think a least squares formulation might also be natural for the estimation. I gave it a quick shot (without a lot of validation so far) based on the data that you posted, so that the results can be directly compared.

I attached a LaTeX description: estimation.pdf. I have used cvxpy as the SCIP license is not very permissive (free for academic use). Code to do the least squares estimation would be:
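
In short, the problem the code below solves (my condensed paraphrase; m_i is the number of users who uploaded transmission risk vector i, A the 0/1 matrix mapping vectors to (day, risk level) cells, n the flattened observed counts) is:

\min_{m \in \mathbb{Z}^{|T|},\ m \geq 0} \; \lVert A m - n \rVert_2^2
\qquad \text{(variant 2 additionally requires } A m - n \geq 0\text{)}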

import cvxpy
import numpy as np
TOTALDAYS = 13
TOTALLEVELS = 8

# rows: days 1-13, columns: risk levels 1-8
# example from https://github.com/mh-/diagnosis-keys/issues/12#issuecomment-714272588
counts = np.array([[  106, 101, 86, 85, 203, 648, 0, 144],
    [  68, 69, 131, 0, 77, 182, 247, 485],
    [  44, 118, 40, 0, 46, 72, 326, 497],
    [  104, 32, 28, 4, 40, 83, 358, 470],
    [  97, 17, 25, 36, 244, 311, 180, 191],
    [  96, 43, 219, 277, 24, 203, 75, 149],
    [  333, 83, 196, 97, 97, 86, 60, 110],
    [  412, 280, 9, 147, 6, 76, 45, 75],
    [  683, 77, 78, 58, 2, 60, 36, 46],
    [  747, 135, 2, 44, 5, 39, 33, 22],
    [  864, 50, 2, 38, 0, 26, 20, 14],
    [  909, 42, 2, 22, 3, 20, 6, 10],
    [  941, 28, 4, 11, 2, 17, 3, 7]])

# taken from from https://github.com/mh-/diagnosis-keys/issues/12#issuecomment-714272588
# but transposed in comparison to the table there
trvectors = [
    [6, 8, 8, 8, 5, 3, 1, 1, 1, 1, 1, 1, 1],
    [5, 7, 7, 7, 6, 4, 3, 2, 1, 1, 1, 1, 1],
    [4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [6, 8, 8, 8, 7, 5, 3, 2, 1, 1, 1, 1, 1],
    [5, 6, 7, 7, 7, 6, 5, 4, 3, 2, 1, 1, 1],
    [1, 1, 1, 2, 3, 4, 5, 6, 6, 7, 7, 6, 6],
    [8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1],
    [8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1],
    [8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1],
    [6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1],
    [5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1],
    [3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1],
    [2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1],
    [2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1],
    [1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2],
    [1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4],
    [1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6],
    [1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7],
    [1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8],
    [1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8],
    [1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8],
    [1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8],
    [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]


def trvector_to_matrix(trvector):
    T = np.zeros([TOTALDAYS, TOTALLEVELS])
    for i, level in enumerate(trvector):
        T[i, level-1] = 1.0 # level-1 to switch between 1 and 0 based indexing
    return T

# set up matrix A
A = np.array([
    trvector_to_matrix(trvector).flatten() for trvector in trvectors
]).T
# assert that A is non-degenerate
#if np.linalg.matrix_rank(A) < min(A.shape):
#    raise ValueError(F"Transmission Risk Matrix is rank deficient! Rank {np.linalg.matrix_rank(A)}, shape {A.shape}")

# set up vector n
n = counts.flatten()

# formulate as MIQP and solve with cvxpy
m = cvxpy.Variable(len(trvectors), integer=True)
objective = cvxpy.Minimize(cvxpy.sum_squares(A @ m - n))
prob = cvxpy.Problem(objective, [m >= 0])
prob.solve(solver=cvxpy.ECOS_BB)
m_opt = np.rint(m.value) # round to nearest integer, solution violates integrality by about tol = 1e-8
print(F"Solved Mixed-Integer Least Squares with status {prob.status} and residual norm {np.sqrt(prob.value)}\n")
print("Number of users per transmission vectors")
for i, trvector in enumerate(trvectors):
    print(F"{trvector}: {m_opt[i]}")
print(F"Sum of users {np.linalg.norm(m_opt, 1)}")
print("\nResidual counts ")
print((A.dot(m_opt) - n).reshape(counts.shape))

# formulate as MIQP with additional constraint A m - n >= 0
m = cvxpy.Variable(len(trvectors), integer=True)
objective = cvxpy.Minimize(cvxpy.sum_squares(A @ m - n))
prob = cvxpy.Problem(objective, [m >= 0, A @ m - n >= 0])
prob.solve(solver=cvxpy.ECOS_BB)
m_opt = np.rint(m.value) # round to nearest integer, solution violates integrality by about tol = 1e-8
print(F"\n\nVariant 2: With monotonicity constraint.\n Solved Mixed-Integer Least Squares with status {prob.status} and residual norm {np.sqrt(prob.value)}\n")
print("Number of users per transmission vectors")
for i, trvector in enumerate(trvectors):
    print(F"{trvector}: {m_opt[i]}")
print(F"Sum of users {np.linalg.norm(m_opt, 1)}")
print("\nResidual counts ")
print((A.dot(m_opt) - n).reshape(counts.shape))

@corneliusroemer
Contributor

Are we now using the new TRL vectors or not? Did CWA change their mind after all? The new documentation PDFs seem to suggest so. I'm very confused by the calculation update to v2. Here are the three PDFs that should be relevant to this discussion:

What I find difficult to square is: right now, the only report type is "confirmed test", so that should mean TRLs of 7 or 8 according to the overview/client interpretation PDFs. Yet, in reality, we find TRLs of 0-8. Is the explanation that, right now, TRLs are still calculated as before, but the encoding is prepared for a switch-over to a different way of calculating TRLs?

I'd be very interested in your view of this @micb25 @janpf @mh- @thomasaugsten

@thomasaugsten

For ENF v1, the server distributes TRL 1-8. For ENF v2, the TRL is encoded in DSOS and Report Type; the mapping to TRL happens on the client.

@corneliusroemer
Contributor

@thomasaugsten Thanks! You say the TRL is encoded in DSOS and Report Type. Does that mean that all that has changed so far is the encoding - not the calculation/determination of the TRL? And that what's discussed in this issue is still WIP and not live?

@thomasaugsten

TRL profiles based on symptoms have been included since v1.5, released on 19.10.2020.
