SPIKE: Research Annotations not created in DB despite "successful" `POST /api/annotations` request #4704

Closed
robertknight opened this Issue Nov 27, 2017 · 11 comments

@robertknight
Contributor

robertknight commented Nov 27, 2017

@nlisgo and I have both experienced situations where, when creating a large number (e.g. 500 or 1000) of annotations with a sequence of requests, a small number of annotations are subsequently missing from the DB even though every POST /api/annotations request returned a 200 response with a valid payload.

So far I have only been able to reproduce the issue in production, not locally.

Steps to reproduce

  1. Fill in the API_KEY constant below with a valid developer token and run the script against prod.
#!/usr/bin/env python

import requests

API_KEY = '{redacted}'
HEADERS = {'Authorization': 'Bearer {}'.format(API_KEY)}

# Create `n_anns` annotations and record the ID from each successful response.
ids = []
n_anns = 500
for i in range(n_anns):
    print('Creating annotation {}'.format(i))
    ann = {'uri': 'https://example.org/robert-is-testing-stuff',
           'text': 'A bulk annotation creation test'}
    create_rsp = requests.post('https://hypothes.is/api/annotations',
                               json=ann, headers=HEADERS)
    if create_rsp.status_code >= 400:
        print('Request failed: ', create_rsp.json())
        continue
    ids.append(create_rsp.json()['id'])

# Read back each annotation that the API reported as created.
fetched_ok = []
for id_ in ids:
    read_rsp = requests.get('https://hypothes.is/api/annotations/{}'.format(id_),
                            headers=HEADERS)
    if read_rsp.status_code != 200:
        print('Failed to fetch annotation {}'.format(id_), read_rsp.json())
    else:
        fetched_ok.append(id_)

print('Fetched {} of {} annotations'.format(len(fetched_ok), len(ids)))

Expected result

All 500 annotations are created successfully and can subsequently be fetched.

Actual result

I saw a small number of annotations which were "created" successfully yet do not subsequently appear in the DB or in Elasticsearch.

Creating annotation 0
Creating annotation 1
...
Creating annotation 499
('Failed to fetch annotation WhvyKNOFEeeFPw9UxPBc5A', {u'status': u'failure', u'reason': u"Either the resource you requested doesn't exist, or you are not currently authorized to see it."})
('Failed to fetch annotation bDpi0tOFEeewg0sezoe5qA', {u'status': u'failure', u'reason': u"Either the resource you requested doesn't exist, or you are not currently authorized to see it."})
('Failed to fetch annotation cV2h1NOFEeeJPo9ASdDObw', {u'status': u'failure', u'reason': u"Either the resource you requested doesn't exist, or you are not currently authorized to see it."})
Fetched 497 of 500 annotations

Notes

  • My working hypothesis at the moment is that the transaction which created the annotation was rolled back after the HTTP response had been sent.
  • This issue originally came up when @nlisgo attempted to perform a bulk import of >1000 annotations from Disqus into Hypothesis.
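
To make the working hypothesis concrete, here is a minimal sketch of the suspected failure mode, using the stdlib sqlite3 module rather than the production Postgres/SQLAlchemy stack, and a made-up `annotation` table: inside an open transaction the freshly inserted row is visible, so a handler can read it back and build a "successful" response, but if the transaction is rolled back after the response goes out, the row never reaches the database.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "anns.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE annotation (id TEXT PRIMARY KEY)")
writer.commit()

# Insert but do not commit -- analogous to a flushed SQLAlchemy session.
writer.execute("INSERT INTO annotation VALUES ('WhvyKNOF')")
visible_in_txn = writer.execute(
    "SELECT count(*) FROM annotation").fetchone()[0]  # the row is visible here

writer.rollback()  # the transaction is aborted after the "200" has been sent

reader = sqlite3.connect(path)
visible_after = reader.execute(
    "SELECT count(*) FROM annotation").fetchone()[0]  # the row was never stored

print(visible_in_txn, visible_after)  # 1 0
```

This is only an illustration of the transaction semantics, not Hypothesis's actual code path.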
@seanh
Contributor

seanh commented Nov 27, 2017

The fact that you were only able to reproduce it in production, not locally, suggests that it might have something to do with transactions running concurrently with the one that fails. That is what a previous session rollback issue of ours turned out to be. Really hard to debug, unfortunately.

@tgbugs

tgbugs commented Nov 27, 2017

I can report that I encountered this issue multiple times when doing bulk POSTs of scibot annotations (40,000) to __world__. I have a vague recollection that I did not experience this issue when doing a similarly large set of POSTs to a private group, but I think that is because I was keeping a cached local copy of the responses, so I never noticed the problem until I posted to __world__, where I was not keeping a copy.

@robertknight
Contributor

robertknight commented Nov 28, 2017

Some relevant chatter from the #eng-sr channel back in July 2017: https://hypothes-is.slack.com/archives/C074BUPEG/p1499245083385744

"closed with a dirty session" errors are still being reported frequently in Sentry: https://sentry.io/hypothesis/h/issues/232281261/

@seanh
Contributor

seanh commented Nov 28, 2017

Thanks for the report @tgbugs. Damn, we (or at least I) had no idea this bug existed. We need to fix this!

@tgbugs I don't suppose you remember roughly how long ago it was that you encountered this issue?

@tgbugs

tgbugs commented Nov 28, 2017

It was on November 9th. If you run an API query to fetch all the annotations that show up here: https://hypothes.is/users/scibot?q=group%3A__world__ you should be able to get the timestamps for all the POSTs. They were running consecutively, so by looking for gaps in the submissions you could probably narrow down to within ±100 ms the times when an annotation was missed (i.e. when the issue occurred). The POSTs were done in 3 batches, so there will be bigger gaps between batches as well.

@seanh
Contributor

seanh commented Nov 28, 2017

Good to know. So I guess it's possible this bug was introduced recently, though I've got a feeling it has been here for a while and we never noticed.

@dwhly
Member

dwhly commented Dec 1, 2017

This possibly also happened recently to @heatherstaines, who tried to reply to an annotation she had just made.

[screenshot: 2017-12-01 at 7:00 AM]

She also says it happened to her a few weeks back.

If so, it may mean this is more serious than just batch imports.

@ajpeddakotla ajpeddakotla changed the title from Annotations not created in DB despite "successful" `POST /api/annotations` request to SPIKE: Research Annotations not created in DB despite "successful" `POST /api/annotations` request Dec 11, 2017

@robertknight
Contributor

robertknight commented Jan 2, 2018

A slightly more useful script for reproducing the issue, which does not leave hundreds of annotations around requiring cleanup afterwards: https://gist.github.com/robertknight/24e9a408472e62c910b8d6c522edda64

robertknight added a commit that referenced this issue Jan 3, 2018

Add a utility class for tracking uncommitted changes in SQLA sessions

Following the discussion in https://groups.google.com/forum/#!topic/sqlalchemy/eGxpQBChXQw
implement a utility class which can be used to check whether any changes
have been made in a `Session` but not yet committed to the DB.

This is part of an investigation into #4704

robertknight added a commit that referenced this issue Jan 3, 2018

Log a warning if a Pyramid request ends with uncommitted DB session changes

The previous `session.dirty` check only tested for _changes_ to existing
DB objects which have not been flushed. It did not check for added or
deleted model objects which had not yet been flushed to the DB, or for
changes which have been flushed inside a transaction but not yet _committed_.

When debugging #4704 we want to
specifically catch additions (of annotations) to the DB which have been
flushed (by `storage.create_annotation`) but not yet committed.
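
As a rough illustration of the kind of end-of-request check this commit describes, here is a hedged sketch using the stdlib sqlite3 module. The real implementation inspects a SQLAlchemy `Session` (its new/dirty/deleted sets and transaction state); `end_of_request_check` is a hypothetical name for illustration only.

```python
import sqlite3

def end_of_request_check(conn):
    """Return True if `conn` still has changes that were never committed.

    Hypothetical stand-in for the SQLA session tracker described above.
    """
    return conn.in_transaction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotation (id TEXT)")  # DDL runs in autocommit mode

conn.execute("INSERT INTO annotation VALUES ('abc')")  # opens a transaction
pending = end_of_request_check(conn)  # True: here a warning would be logged

conn.commit()
clean = end_of_request_check(conn)  # False: the request ended cleanly

print(pending, clean)  # True False
```

The idea is the same as in the commit: run the check in an end-of-request hook, and log a warning whenever a request finishes while changes are still pending.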

@judell
Contributor

judell commented Jan 5, 2018

Tonight I was running a test that led me to interactively create and then immediately delete annotations. Once, out of a few dozen attempts, a delete provoked a 404.

@heatherstaines

heatherstaines commented Jan 5, 2018

robertknight added a commit that referenced this issue Jan 8, 2018

Add a utility class for tracking uncommitted changes in SQLA sessions

Following the discussion in https://groups.google.com/forum/#!topic/sqlalchemy/eGxpQBChXQw
implement a utility class which can be used to check whether any changes
have been made in a `Session` but not yet committed to the DB.

This is part of an investigation into #4704

robertknight added a commit that referenced this issue Jan 8, 2018

Log a warning if a Pyramid request ends with uncommitted DB session changes

The previous `session.dirty` check only tested for _changes_ to existing
DB objects which have not been flushed. It did not check for added or
deleted model objects which had not yet been flushed to the DB, or for
changes which have been flushed inside a transaction but not yet _committed_.

When debugging #4704 we want to
specifically catch additions (of annotations) to the DB which have been
flushed (by `storage.create_annotation`) but not yet committed.

@robertknight robertknight self-assigned this Jan 8, 2018

@robertknight
Contributor

robertknight commented Jan 9, 2018

Based on additional logging added in #4728 we've learned a few things:

  1. The issue affects more than just creating annotations. It has been observed on DELETE /api/annotations and PATCH /api/annotations requests, and my guess is that it potentially affects all requests that modify the DB. The most common symptom is a failure to update auth tickets when making a GET /api/badge call.
  2. The issue definitely appears to happen less frequently shortly after a deploy, requiring hundreds of attempts to reproduce. Half an hour or so after a deploy, a repro happens much more quickly (e.g. typically only 50 or so attempts required).
  3. Looking at the breadcrumbs for an example of a failed PATCH /api/annotations call, we can see that a txn.{id} commit event is usually preceded by a txn.{id} commit <zope.sqlalchemy.datamanager.SessionDataManager> event, which is a log message produced by the "transaction" package when it commits a resource manager. In the cases where the warning is produced, the "zope" commit event is missing just before the txn.{id} commit event.

Taking (2) and (3) together, this strongly suggests the problem is related to the earlier issues with zope.sqlalchemy's internal _SESSION_STATE map.
