
Issues in handling duplicate report_ids #192

Closed · hellais opened this issue May 10, 2018 · 4 comments

Labels: bug Something isn't working · ooni/pipeline Issues related to https://github.com/ooni/pipeline · priority/medium
hellais (Member) commented May 10, 2018

When the same report is submitted twice, because OONI Probe attempted to re-submit it, the centrifugation step fails due to a duplicate key.

I see that there is already some logic to handle this situation in load_global_duplicate_reports (https://github.com/TheTorProject/ooni-pipeline/blob/master/af/shovel/centrifugation.py#L586) (which is great 👍 ), but this does not apply "cleanly" when you are backfilling data.

My understanding is that this table should somehow be populated by running shovel/canned_repeated.py; however, I don't see it ever being called (nor is it part of the DAG).

@darkk what is the process for handling duplicate reports?

Here are the relevant log lines of a situation of this sort:

[2018-05-10 18:11:51,990] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,983] {bash_operator.py:94} INFO - File "/usr/local/bin/centrifugation.py", line 1766, in main
[2018-05-10 18:11:51,991] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,983] {bash_operator.py:94} INFO - meta_pg(opt.autoclaved_root, bucket, opt.postgres)
[2018-05-10 18:11:51,991] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,983] {bash_operator.py:94} INFO - File "/usr/local/bin/centrifugation.py", line 1726, in meta_pg
[2018-05-10 18:11:51,991] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - copy_data_from_autoclaved(pgconn, stconn, in_root, bucket, bucket_code_ver)
[2018-05-10 18:11:51,991] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - File "/usr/local/bin/centrifugation.py", line 770, in copy_data_from_autoclaved
[2018-05-10 18:11:51,992] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - feeder.close()
[2018-05-10 18:11:51,992] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - File "/usr/local/bin/centrifugation.py", line 1018, in close
[2018-05-10 18:11:51,992] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - ''') # TODO: `LEFT JOIN report_blob` to fail fast in case of errors
[2018-05-10 18:11:51,992] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,984] {bash_operator.py:94} INFO - psycopg2.IntegrityError: duplicate key value violates unique constraint "report_report_id_key"
[2018-05-10 18:11:51,993] {base_task_runner.py:95} INFO - Subtask: [2018-05-10 18:11:51,985] {bash_operator.py:94} INFO - DETAIL:  Key (report_id)=(20180505T000009Z_AS9143_tvL8E3NvGfbiRDuTH1KanhB2XMEs1xO9Fh1ykMdx2toQeKWoQ2) already exists.
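The constraint being violated is the uniqueness of report.report_id. For reference, here is a hypothetical sketch (not part of the pipeline; the connection string and the report_id list are placeholders) of how one could check, before re-running centrifugation on a bucket, which report_ids from that bucket are already present in the report table:

# Hypothetical helper, not part of centrifugation.py: list report_ids from a
# bucket that are already in the `report` table, i.e. the rows that would trip
# the "report_report_id_key" unique constraint on re-ingestion.
import psycopg2

DSN = 'postgresql://...'  # placeholder connection string
report_ids = [
    '20180505T000009Z_AS9143_tvL8E3NvGfbiRDuTH1KanhB2XMEs1xO9Fh1ykMdx2toQeKWoQ2',
    # report_ids extracted from the bucket being backfilled
]

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as c:
        c.execute('SELECT report_id FROM report WHERE report_id = ANY(%s)',
                  (report_ids,))
        for (rid,) in c.fetchall():
            print('already ingested:', rid)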
darkk commented May 15, 2018

What I actually did was the following:

  • I looked at 2018-05-{06..10} for reports with the same name and found 5 distinct reports:
20180505T000008Z-NL-AS9143-web_connectivity-20180505T000008Z_AS9143_YXblHbyqIlBUxqzkwQ344hJM4O19Nx9q2E90RUv4W6yFTi4QyS-0.2.0-probe.json
20180505T000009Z-NL-AS9143-facebook_messenger-20180505T000009Z_AS9143_tvL8E3NvGfbiRDuTH1KanhB2XMEs1xO9Fh1ykMdx2toQeKWoQ2-0.2.0-probe.json
20180505T000013Z-NL-AS9143-meek_fronted_requests_test-20180505T000013Z_AS9143_0yenXJrRnCtCugkdC5FZKdu5bEEfOByk7yQ9Uo69mUakOzFwvh-0.2.0-probe.json
20180505T002253Z-SE-AS3301-tcp_connect-20180505T002253Z_AS3301_IAgBsdVLd4qaCpAZoEVYueJ9r9wm84216dpFbaRzRgwnRuRFNO-0.2.0-probe.json
20180505T003212Z-NL-AS9143-tcp_connect-20180505T003212Z_AS9143_F4POCxnCWlhm5f2gMqYmUz3Fqou6kdcjK00zaBHcCneHVTCTNe-0.2.0-probe.json
  • I checked whether those were the same reports -- the files had the same sha1 and sha512 (a sketch of such a check follows the query below)
  • I added those to the repeated_report table with the following query (and that was enough):
insert into repeated_report
select
    dupgrp_no,
    u, 
    '2018-05-' || d || '/20180505T000008Z-NL-AS9143-web_connectivity-20180505T000008Z_AS9143_YXblHbyqIlBUxqzkwQ344hJM4O19Nx9q2E90RUv4W6yFTi4QyS-0.2.0-probe.json',
    '\xf74412b29a700451b8a0e6dedf6a4c49d781750a'::sha1 as orig_sha1,
    '\x873269d0cf236fdfa30da85017c3bf0ee45b5b5e8796076010fe65bee749fa01f8f9e1fc35137b45875af95867a1613784371f00c5dcdddf3263fe38e7b9fe0b'::sha512 as orig_sha512
from (values ('06', false), ('07', true), ('08', false), ('09', false), ('10', false)) as t1 (d, u)
, (values (nextval('dupgrp_no_seq'))) AS t2 (dupgrp_no) ;
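For the second bullet, a minimal sketch of such a byte-identity check, assuming the duplicate copies are plain files on disk (the paths below are illustrative placeholders built from the bucket dates above):

# Hypothetical sketch: confirm that two copies of a report file are
# byte-identical by comparing their sha1 and sha512 digests.
import hashlib

def digests(path):
    sha1, sha512 = hashlib.sha1(), hashlib.sha512()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha1.update(chunk)
            sha512.update(chunk)
    return sha1.hexdigest(), sha512.hexdigest()

# Illustrative paths; the real copies live in the per-day buckets.
copies = [
    '2018-05-06/20180505T000008Z-NL-AS9143-web_connectivity-20180505T000008Z_AS9143_YXblHbyqIlBUxqzkwQ344hJM4O19Nx9q2E90RUv4W6yFTi4QyS-0.2.0-probe.json',
    '2018-05-07/20180505T000008Z-NL-AS9143-web_connectivity-20180505T000008Z_AS9143_YXblHbyqIlBUxqzkwQ344hJM4O19Nx9q2E90RUv4W6yFTi4QyS-0.2.0-probe.json',
]
assert len({digests(p) for p in copies}) == 1, 'copies differ'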

I considered that extending canned_repeated is impractical for the following reasons:

  • The 2018-05-07 bucket was already ingested while 2018-05-06 was not, so the canned_repeated heuristic that the first report in the "series" of duplicate reports is the correct one was wrong in this case.
  • I spent some time thinking about better heuristics and have not come up with a better one.
  • I think the better way is to drop the unique constraint on report_id and handle rare duplicates at the submission & API level (a rough sketch of that direction follows).
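A rough, hypothetical sketch of that direction (nothing below is implemented in the pipeline; the connection string is a placeholder):

# Hypothetical sketch: drop the unique constraint so ingestion no longer
# aborts, and surface the (rare) duplicate report_ids with a periodic query
# instead, so they can be handled at the submission / API level.
import psycopg2

DSN = 'postgresql://...'  # placeholder connection string

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as c:
        # The constraint named in the IntegrityError above.
        c.execute('ALTER TABLE report DROP CONSTRAINT report_report_id_key')
        # Any duplicates that later slip in would show up here:
        c.execute('''SELECT report_id, count(*) FROM report
                     GROUP BY report_id HAVING count(*) > 1''')
        for report_id, n in c.fetchall():
            print(report_id, n)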

darkk commented Aug 4, 2019

I was looking at old notes and found the following one:

broken OONI Explorer links: we have two different(!) report files having the same report_id, that's an unexpected corner-case

Unfortunately, the affected report_id was not written down.

hellais (Member, Author) commented Nov 19, 2019

@FedericoCeratto this may be relevant to the issues related to duplicate report_id and input pairs and the ooid story.

@hellais hellais transferred this issue from ooni/pipeline Jan 13, 2020
@hellais hellais added bug Something isn't working ooni/pipeline Issues related to https://github.com/ooni/pipeline priority/medium labels Jan 13, 2020
FedericoCeratto pushed a commit that referenced this issue Mar 16, 2023
FedericoCeratto (Contributor) commented:

Fixed by moving to measurement uid
