Investigate cause of outage during feature.first_party migration on 2023-06-23 #8018

Open
robertknight opened this issue Jun 23, 2023 · 4 comments

@robertknight
Member

In #8014 we attempted to deploy a "trivial" migration that added a new boolean column to the tiny feature table.

The task ran successfully and quickly in the ca-central-1 AWS environment. In the US environment the migration did run and the DB schema was modified, but the GitHub Actions task kept running for a long time and h suffered an outage.

Re-deploying h resolved the problem, but we need to understand what happened and avoid a repeat on the next migration.

Incident thread in Slack: https://hypothes-is.slack.com/archives/C074BUPEG/p1687513992710189

@robertknight added the "Added to sprint" label Jun 23, 2023
@robertknight
Member Author

robertknight commented Jun 23, 2023

Sentry issues that are related:

Alembic logs from S3 in the us-west-1 region:

2023-06-23 09:44:40 73 alembic.runtime.migration [INFO] Context impl PostgresqlImpl.
2023-06-23 09:44:40 73 alembic.runtime.migration [INFO] Will assume transactional DDL.
2023-06-23 09:44:41 73 alembic.runtime.migration [INFO] Running upgrade be612e693243 -> 7d39ade34b69, Add feature.first_party column.

The logs from ca-central-1, where the task finished successfully, look the same:

2023-06-23 09:44:43 64 alembic.runtime.migration [INFO] Context impl PostgresqlImpl.
2023-06-23 09:44:43 64 alembic.runtime.migration [INFO] Will assume transactional DDL.
2023-06-23 09:44:43 64 alembic.runtime.migration [INFO] Running upgrade be612e693243 -> 7d39ade34b69, Add feature.first_party column.

@seanh
Contributor

seanh commented Jun 26, 2023

I believe these are the relevant logs from the migration run on GitHub Actions. Unfortunately they contain nothing useful:

Run ./bin/eb-task-run "$APP" "$ENV" "$TIMEOUT" "$COMMAND" "us-west-1"
  ./bin/eb-task-run "$APP" "$ENV" "$TIMEOUT" "$COMMAND" "us-west-1"
  shell: /usr/bin/bash -e {0}
  env:
    APP: h
    ENV: prod
    TIMEOUT: 900
    COMMAND: hypothesis migrate upgrade head
    REGION: all
    AWS_DEFAULT_REGION: us-west-1
    AWS_REGION: us-west-1
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
    AWS_SESSION_TOKEN: ***
---> selecting instance
---> selected instance i-083b970b03d0061e6
---> protecting instance from a scale-in event
---> initiating command
---> running command f0aecef9-5707-4897-bbd0-122b56107b57. Output is available in the "elasticbeanstalk-run-command-output" S3 bucket.
Error: The operation was canceled.

@robertknight
Member Author

There is a lengthy Slack thread at https://hypothes-is.slack.com/archives/C4K6M7P5E/p1687768113678409 covering the investigation done today. The current leading hypothesis is that a long-running query held a shared lock on the feature table at the time of the migration. This caused the migration transaction to hang while running ALTER TABLE. That in turn caused most subsequent activity in h to block, because most web requests end up reading from this table.
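
One way to check this hypothesis while a migration is hanging is to ask Postgres which sessions are waiting on locks and which sessions are blocking them. The following is a minimal sketch, not something taken from the Slack thread: the helper name `find_blocked_sessions` is hypothetical, and it assumes a DB-API style cursor (for example from psycopg2) connected to the h database. `pg_stat_activity` and `pg_blocking_pids()` are standard Postgres (9.6 and later).

```python
# Lists sessions that are currently waiting on a lock, together with the
# PIDs of the sessions holding (or queued ahead for) that lock.
BLOCKED_SESSIONS_SQL = """
SELECT
    pid,
    pg_blocking_pids(pid) AS blocked_by,
    state,
    now() - query_start AS running_for,
    query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY query_start
"""


def find_blocked_sessions(cursor):
    """Return one row per blocked session.

    `cursor` is assumed to be a DB-API cursor (hypothetical usage, e.g.
    `connection.cursor()` from psycopg2). Each row is
    (pid, blocked_by, state, running_for, query).
    """
    cursor.execute(BLOCKED_SESSIONS_SQL)
    return list(cursor.fetchall())
```

If the hypothesis is right, running this during the hang would be expected to show the ALTER TABLE session blocked by the long-running query, and ordinary web request queries blocked in turn by the ALTER TABLE session.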

The Slack thread describes a range of debugging tools and measures we have encountered. The most general measure is that we should set some kind of timeout on migrations unless we expect them to run for a long time. I haven't yet worked out the most convenient way to do this.
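
As a rough sketch of what such a timeout could look like, the helper below sets Postgres's `lock_timeout` before running a DDL statement, so that a migration which cannot acquire its lock fails quickly instead of hanging and queuing other requests behind it. This is an assumption about one possible approach, not a decision we have made: the helper name `run_ddl_with_lock_timeout`, the 5 second default and the exact DDL text are all hypothetical.

```python
def run_ddl_with_lock_timeout(execute, ddl, lock_timeout_ms=5000):
    """Run `ddl` with a lock timeout applied to the current transaction.

    `execute` is any callable that runs a SQL string. In an Alembic
    migration this could be `op.execute` (assumed usage).
    The 5000 ms default is an arbitrary illustrative value.
    """
    if lock_timeout_ms <= 0:
        raise ValueError("lock_timeout_ms must be a positive number of milliseconds")

    # SET LOCAL only lasts until the end of the current transaction, so it
    # relies on the migration running inside one (the Alembic logs above say
    # "Will assume transactional DDL").
    execute(f"SET LOCAL lock_timeout = '{int(lock_timeout_ms)}ms'")

    # If the lock cannot be acquired within the timeout, Postgres aborts this
    # statement with an error and the migration fails instead of waiting.
    execute(ddl)
```

Inside a migration's `upgrade()` this might be called as `run_ddl_with_lock_timeout(op.execute, "ALTER TABLE feature ADD COLUMN first_party BOOLEAN")`, where the DDL string is only illustrative. A failed migration would then need to be retried, which seems preferable to an outage.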

I also bookmarked some useful pages in the Hypothesis Reading group - see the postgres tag.

@robertknight removed the "Added to sprint" label Aug 30, 2023
@robertknight
Member Author

This still needs investigation because it poses a hazard for future schema changes in h. However, I am not actively working on it right now.
