[API] Discrepancy between the total annotations and the number of returned annotations #7796

Open
kael opened this issue Jan 9, 2023 · 10 comments


kael commented Jan 9, 2023

Querying the API for a URI returns a total of 3 but only 2 annotations:

{
  "total": 3,
  "rows": [
    {
      "id": "2ZlzQI_6Ee2_GiO1q_kZlw",
      "created": "2023-01-09T08:51:47.348734+00:00",
      "updated": "2023-01-09T08:51:47.348734+00:00",
      "user": "acct:kael@hypothes.is",
      "uri": "https://docdrop.org/ocr/",
      "text": "",
      "tags": [
        "docdrop",
        "ocr",
        "pdf"
      ],
      "group": "__world__",
      "permissions": {
        "read": [
          "group:__world__"
        ],
        "admin": [
          "acct:kael@hypothes.is"
        ],
        "update": [
          "acct:kael@hypothes.is"
        ],
        "delete": [
          "acct:kael@hypothes.is"
        ]
      },
      "target": [
        {
          "source": "https://docdrop.org/ocr/"
        }
      ],
      "document": {
        "title": [
          "DocDrop | OCR"
        ]
      },
      "links": {
        "html": "https://hypothes.is/a/2ZlzQI_6Ee2_GiO1q_kZlw",
        "incontext": "https://hyp.is/2ZlzQI_6Ee2_GiO1q_kZlw/docdrop.org/ocr/",
        "json": "https://hypothes.is/api/annotations/2ZlzQI_6Ee2_GiO1q_kZlw"
      },
      "user_info": {
        "display_name": "kael"
      },
      "flagged": false,
      "hidden": false
    },
    {
      "id": "YTWzkF3QEeuxR3Nj9ycYWw",
      "created": "2021-01-23T23:11:53.018743+00:00",
      "updated": "2021-02-14T10:37:00.189592+00:00",
      "user": "acct:ankostis@hypothes.is",
      "uri": "https://docdrop.org/ocr/",
      "text": "Freely OCR PDFs",
      "tags": [
        "pdf-service"
      ],
      "group": "__world__",
      "permissions": {
        "read": [
          "group:__world__"
        ],
        "admin": [
          "acct:ankostis@hypothes.is"
        ],
        "update": [
          "acct:ankostis@hypothes.is"
        ],
        "delete": [
          "acct:ankostis@hypothes.is"
        ]
      },
      "target": [
        {
          "source": "https://docdrop.org/ocr/"
        }
      ],
      "document": {
        "title": [
          "DocDrop | OCR"
        ]
      },
      "links": {
        "html": "https://hypothes.is/a/YTWzkF3QEeuxR3Nj9ycYWw",
        "incontext": "https://hyp.is/YTWzkF3QEeuxR3Nj9ycYWw/docdrop.org/ocr/",
        "json": "https://hypothes.is/api/annotations/YTWzkF3QEeuxR3Nj9ycYWw"
      },
      "user_info": {
        "display_name": "Kostis Anagnostopoulos"
      },
      "flagged": false,
      "hidden": false
    }
  ]
}

Archived version of the API payload

@robertknight robertknight transferred this issue from hypothesis/client Jan 9, 2023
@robertknight
Member

It looks like there is an entry in our Elasticsearch index which doesn't correspond to any annotation in Postgres. My guess is that at some point an annotation was deleted from Hypothesis but the deletion wasn't executed in Elasticsearch. The total count in the API results reflects what Elasticsearch believes the number of results to be, but the actual response omits any entries which can't be found in Postgres.

Comparing entries in our production DB for this URL and the Elasticsearch service (using Kibana (internal link)), I see there are 2 shared annotations in the __world__ group for this URL in Postgres, but 3 in Elasticsearch.

select * from annotation where document_id in (select document_id from document_uri where uri_normalized = 'httpx://docdrop.org/ocr')

The id of the annotation in Elasticsearch that doesn't appear in Postgres is RQjvooSeEey5H_tgbkmkbg, created by acct:you.me.hypothesis.thisness.us@hypothes.is on 2022-02-03T03:06:23.934527+00:00.
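
As a rough illustration of the behaviour described above, here is a minimal sketch of how a count taken from the index can drift from the rows that survive a database lookup. This is not the actual Hypothesis server code: `buildSearchResponse`, `indexHitIds` and `dbRowsById` are hypothetical names, and only the three annotation ids come from this thread.

```javascript
// Illustrative model only, not the real Hypothesis implementation.
// `indexHitIds` stands in for the ids matched by the search index,
// `dbRowsById` for the annotations that still exist in the database.
function buildSearchResponse(indexHitIds, dbRowsById) {
  // Rows are looked up in the database; ids with no matching row are dropped.
  const rows = indexHitIds
    .filter((id) => dbRowsById.has(id))
    .map((id) => dbRowsById.get(id));
  // `total` is taken from the index, so it still counts the orphaned entry.
  return { total: indexHitIds.length, rows };
}

const indexHitIds = [
  '2ZlzQI_6Ee2_GiO1q_kZlw',
  'YTWzkF3QEeuxR3Nj9ycYWw',
  'RQjvooSeEey5H_tgbkmkbg', // present in the index, missing from the database
];
const dbRowsById = new Map([
  ['2ZlzQI_6Ee2_GiO1q_kZlw', { id: '2ZlzQI_6Ee2_GiO1q_kZlw' }],
  ['YTWzkF3QEeuxR3Nj9ycYWw', { id: 'YTWzkF3QEeuxR3Nj9ycYWw' }],
]);

const response = buildSearchResponse(indexHitIds, dbRowsById);
console.log(response.total, response.rows.length); // 3 2
```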

Author

kael commented Jan 9, 2023

created by acct:you.me.hypothesis.thisness.us@hypothes.is on 2022-02-03T03:06:23.934527+00:00

The account https://hypothes.is/users/you.me.hypothesis.thisness.us has been deleted recently though - they even announced it with a Good Bye.

It seems some trigger on account deletion might be missing?

Thanks for the lookup.

@robertknight
Member

Insight from @seanh about our Elasticsearch <-> Postgres sync:

IIRC, the system that makes double-double-sure that annotations are in sync in Elasticsearch only handles creates and updates, not deletes. I'm assuming we do normally delete annotations from Es when we delete them from Pg, but maybe a rare blip or incident can cause that not to happen
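
To make the gap concrete: a sync pass that also covered deletes would need to find ids that exist in the index but no longer exist in the database. The sketch below shows that set difference only. `findOrphanedIds` is a hypothetical helper, not part of the Hypothesis codebase, and how the two id lists would be obtained is left out.

```javascript
// Hypothetical reconciliation helper; the name and inputs are illustrative.
// Returns the ids that the index still holds but the database no longer has.
function findOrphanedIds(indexIds, dbIds) {
  const known = new Set(dbIds);
  return indexIds.filter((id) => !known.has(id));
}

const orphaned = findOrphanedIds(
  ['2ZlzQI_6Ee2_GiO1q_kZlw', 'YTWzkF3QEeuxR3Nj9ycYWw', 'RQjvooSeEey5H_tgbkmkbg'],
  ['2ZlzQI_6Ee2_GiO1q_kZlw', 'YTWzkF3QEeuxR3Nj9ycYWw'],
);
console.log(orphaned); // logs the single id that would need deleting from the index
```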

Author

kael commented Jan 10, 2023

Insight from @seanh about our Elasticsearch <-> Postgres sync:

I'm assuming we do normally delete annotations from Es when we delete them from Pg, but maybe a rare blip or incident can cause that not to happen

Looking at former links bookmarked by the user via a Memento, deletion seems to have worked for all the links I've tested.

Yes, it might look like a DB incident, or perhaps an edge case in the code. If you have an entry point in the code, I could take a look to spot a potential edge case.

I'll keep looking at that buggy link to check whether the value of total has changed.

Author

kael commented Jan 13, 2023

Looking at former links bookmarked by the user via a Memento, deletion seems to have worked for all the links I've tested

Testing with a Greasemonkey script over all the links on the page, deletion appears OK for the deleted account's links: for each link, the total number of annotations equals the number of rows.

🐒 GM Script for Hypothesis API deletion test
// ==UserScript==
// @name     Hypothesis API deletion test
// @version  1
// @grant    none
// @include  https://web.archive.org/web/20221223125432if_/https://hypothes.is/users/you.me.hypothesis.thisness.us
// ==/UserScript==


(async function() {
  'use strict';
    
  const links = document.querySelectorAll('.search-bucket-stats__val.search-bucket-stats__url > a.link--plain');
  
  const brokenLinks = [];
  
  for (const { href } of links) {

    const response = await fetch(`https://hypothes.is/api/search?uri=${encodeURIComponent(href)}`);
    const { total, rows: { length } } = await response.json();
    if (total === length) {
      console.log('URI OK', href);
    } else {
      brokenLinks.push(href);
      console.error('URI Error', href, total, length);
    }
  }
    
  console.warn(brokenLinks.length + ' broken links', brokenLinks);

})();

-> 0 broken links Array []

Can't figure what caused that DB glitch.

Author

kael commented Jan 13, 2023

Testing with a Greasemonkey script over all the links on the page, deletion appears OK for the deleted account's links: for each link, the total number of annotations equals the number of rows.

Erratum: I had forgotten to rewrite the resource URL in the previous version of the script. With that fixed, there are 8 discrepancies:

🐒 GM Script for Hypothesis API deletion test
// ==UserScript==
// @name     Hypothesis API deletion test
// @version  2
// @grant    none
// @include  https://web.archive.org/web/20221223125432if_/https://hypothes.is/users/you.me.hypothesis.thisness.us
// ==/UserScript==


(async function() {
  'use strict';
  
  const links = document.querySelectorAll('.search-bucket-stats__val.search-bucket-stats__url > a.link--plain');
  
  const brokenLinks = [];
  
  for (let { href } of links) {
    // Strip the Wayback Machine prefix so the API is queried with the original URL.
    href = href.replace("https://web.archive.org/web/20221223125432/", "");
    const response = await fetch(`https://hypothes.is/api/search?uri=${encodeURIComponent(href)}`);
    const { total, rows: { length } } = await response.json();
    if (total === length) {
      console.log('URI OK', href, { total, length });
    } else {
      brokenLinks.push(href);
      console.error('URI Error', href, { total, length });
    }
  }
  
  console.warn(brokenLinks.length + ' broken links', brokenLinks);

})();
10:23:57.251 URI Error http://phrack.org/issues/7/3.html 22 20
10:24:01.466 URI Error https://arxiv.org/pdf/0903.0340.pdf 29 20
10:24:06.429 URI Error http://docdrop.org/video/-BcuCmf00_Y/ 1 0
10:24:06.712 URI Error https://cavotes.org/current-election 1 0
10:24:06.955 URI Error https://aeon.co/essays/gaia-why-some-scientists-think-its-a-nonsensical-fantasy 1 0
10:24:07.322 URI Error https://aeon.co/essays/james-lovelock-the-death-of-scientific-independence 3 0
10:24:10.047 URI Error http://docdrop.org/video/zi5-90TnI3Y/ 60 20
10:24:10.770 URI Error https://docdrop.org/ocr/download/wynter-the-ceremony-must-be-found-after-humanism-ex68n_ocr.pdf 12 0
10:24:13.098 8 broken links
Array(8) [ "http://phrack.org/issues/7/3.html", "https://arxiv.org/pdf/0903.0340.pdf", "http://docdrop.org/video/-BcuCmf00_Y/", "https://cavotes.org/current-election", "https://aeon.co/essays/gaia-why-some-scientists-think-its-a-nonsensical-fantasy", "https://aeon.co/essays/james-lovelock-the-death-of-scientific-independence", "http://docdrop.org/video/zi5-90TnI3Y/", "https://docdrop.org/ocr/download/wynter-the-ceremony-must-be-found-after-humanism-ex68n_ocr.pdf" ]
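
A caveat when reading these numbers: three of the eight results have exactly 20 rows, which suggests a page-size cap rather than missing annotations. The sketch below separates the two cases. It assumes the search endpoint returns at most 20 rows per request unless asked otherwise (an assumption, not something confirmed in this thread), and `classifyDiscrepancy` is a hypothetical helper.

```javascript
// Assumption: a single search request returns at most 20 rows by default.
const ASSUMED_PAGE_SIZE = 20;

function classifyDiscrepancy(total, length, pageSize = ASSUMED_PAGE_SIZE) {
  if (total === length) return 'ok';
  // A full page with a larger total may simply mean that more pages remain.
  if (length === pageSize && total > length) return 'needs-paging';
  // Fewer rows than a full page, yet a different total: a real mismatch.
  return 'mismatch';
}

// [uri, total, length] triples copied from the console output above.
const results = [
  ['http://phrack.org/issues/7/3.html', 22, 20],
  ['https://arxiv.org/pdf/0903.0340.pdf', 29, 20],
  ['http://docdrop.org/video/-BcuCmf00_Y/', 1, 0],
  ['https://cavotes.org/current-election', 1, 0],
  ['https://aeon.co/essays/gaia-why-some-scientists-think-its-a-nonsensical-fantasy', 1, 0],
  ['https://aeon.co/essays/james-lovelock-the-death-of-scientific-independence', 3, 0],
  ['http://docdrop.org/video/zi5-90TnI3Y/', 60, 20],
  ['https://docdrop.org/ocr/download/wynter-the-ceremony-must-be-found-after-humanism-ex68n_ocr.pdf', 12, 0],
];

for (const [uri, total, length] of results) {
  console.log(classifyDiscrepancy(total, length), uri);
}
```

Under that assumption, the five URIs with fewer than 20 rows are definite mismatches, while the three with exactly 20 rows would need to be paged through before drawing a conclusion.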

@robertknight
Member

I encountered another instance of this in hypothesis/client#5219, which caused a major problem in the client, where the Notebook showed only a small fraction of the annotations in the group.

Author

kael commented Sep 9, 2023

Is there any update on this bug?

Can we rely on the total value returned by the API, or do we need to query the next batch (until rows is empty) to be sure that we don't miss annotations?

@robertknight
Member

Is there any update on this bug?

No, nothing has happened since my last comments.

Can we rely on the total value returned by the API, or do we need to query the next batch (until rows is empty) to be sure that we don't miss annotations?

Keep iterating through pages with search_after until you get to an empty page.
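
A minimal sketch of that loop, for anyone landing here. Several details are assumptions not stated in this thread: that results are requested with `sort=updated` and `order=asc`, that `search_after` takes the `updated` value of the last row of the previous page, and that a `limit` of 200 is accepted. `fetchAllAnnotations` and `fetchJson` are hypothetical names; check the API documentation before relying on any of this.

```javascript
// Sketch only. Assumed, not confirmed here: sort=updated with order=asc, and
// `search_after` set to the `updated` value of the last row already received.
// `fetchJson` is any function that takes a URL and resolves to the parsed JSON.
async function fetchAllAnnotations(uri, fetchJson, limit = 200) {
  const all = [];
  let searchAfter = null;
  for (;;) {
    const params = new URLSearchParams({
      uri,
      limit: String(limit),
      sort: 'updated',
      order: 'asc',
    });
    if (searchAfter !== null) params.set('search_after', searchAfter);
    const { rows } = await fetchJson(`https://hypothes.is/api/search?${params}`);
    // An empty page marks the end; `total` is deliberately not consulted.
    if (rows.length === 0) break;
    all.push(...rows);
    searchAfter = rows[rows.length - 1].updated;
  }
  return all;
}

// Usage in a browser or userscript:
// const fetchJson = (url) => fetch(url).then((r) => r.json());
// const annotations = await fetchAllAnnotations('https://docdrop.org/ocr/', fetchJson);
```

Passing `fetchJson` in keeps the loop independent of the environment, so it can be exercised with a stub instead of the live API.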

Author

kael commented Sep 11, 2023

Alright. Thanks


4 participants