AB-406 Step 1: Offset migration for existing submissions #339

merged 9 commits into from May 22, 2019
@@ -0,0 +1,89 @@
from __future__ import print_function

alastair May 22, 2019


I renamed this file so that its name makes a bit more sense, and reformatted a few things. Take a look at the whitespace changes for a better idea of the formatting that we use. An IDE or other tool can help you do this too.

aidanlw17 May 22, 2019

Author Contributor

Thanks for the changes. For whitespace, I see the change to two lines between functions, and also the two lines before the first function definition after the cli setup. Are there any other key whitespace things that I should be aware of?

from flask.cli import FlaskGroup
import click
from collections import defaultdict

import db
import webserver

from sqlalchemy import text

cli = FlaskGroup(add_default_commands=False, create_app=webserver.create_app_flaskgroup)


@cli.command()
@click.option("--limit", "-l", default=10000)
def add_offsets(limit):
    """Update lowlevel submission offsets with a specified limit."""
    incremental_add_offset(limit)


def incremental_add_offset(limit):
    with db.engine.connect() as connection:

        # Find number of items in table
        size_query = text("""
            SELECT count(*) AS size

alastair May 22, 2019


We're missing some rows in this table, so max(id) is not the same as the number of rows. I changed it to count(*).
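The point in this comment can be reproduced with a small sketch. It uses an in-memory SQLite table as a hypothetical stand-in for `lowlevel` (the project itself runs on a different database, and the three rows here are made up), and shows that once ids have gaps, `max(id)` overstates the number of rows while `count(*)` does not.

```python
import sqlite3

# Hypothetical stand-in for the lowlevel table: ids 1..5 with rows 2 and 4 missing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lowlevel (id INTEGER PRIMARY KEY, submission_offset INTEGER)"
)
conn.executemany("INSERT INTO lowlevel (id) VALUES (?)", [(1,), (3,), (5,)])

# Highest id present in the table.
max_id = conn.execute("SELECT max(id) FROM lowlevel").fetchone()[0]

# Number of rows that still need an offset, as in the size query of this PR.
row_count = conn.execute(
    "SELECT count(*) FROM lowlevel WHERE submission_offset IS NULL"
).fetchone()[0]

print("max(id) = {}, count(*) = {}".format(max_id, row_count))
```

With the gaps above, the two values differ, so a progress total based on `max(id)` would never be reached.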

            FROM lowlevel
            WHERE submission_offset IS NULL
        """)
        size_result = connection.execute(size_query)
        table_size = size_result.fetchone()["size"]

        # Find max existing offsets
        offset_query = text("""
            SELECT gid, MAX(submission_offset)
            FROM lowlevel
            WHERE submission_offset IS NOT NULL
            GROUP BY gid
        """)
        offset_result = connection.execute(offset_query)

        max_offsets = defaultdict(int)
        for gid, max_offset in offset_result.fetchall():
            max_offsets[gid] = max_offset

        # Find the next batch of items to update
        batch_query = text("""
            SELECT id, gid
            FROM lowlevel
            WHERE submission_offset IS NULL
            LIMIT :limit
        """)

        batch_count = 0
        item_count = 0
        print("Starting batch insertions...")
        while True:
            batch_result = connection.execute(batch_query, {"limit": limit})
            if not batch_result.rowcount:
                print("Submission offset exists for all items. Exiting...")
                break

            batch_count += 1
            print("Updating batch {}:".format(batch_count))
            with connection.begin() as transaction:
                for id, gid in batch_result.fetchall():
                    if gid in max_offsets:
                        # Current offset exists
                        max_offsets[gid] += 1
                    else:
                        # No existing offset
                        max_offsets[gid] = 0
                    offset = max_offsets[gid]

                    query = text("""
                        UPDATE lowlevel
                        SET submission_offset = :offset
                        WHERE id = :id
                    """)
                    connection.execute(query, {"id": id, "offset": offset})
                    item_count += 1
            # Report progress once per batch rather than once per item
            print(" Batch done, inserted {}/{} items...".format(item_count, table_size))

alastair May 22, 2019


This is good, but I moved the print message to the end of each batch. It turns out that printing stuff to the screen can take a lot of time, and it's a thing that can actually slow down processes like this that run really quickly. We're not missing out on any information by not updating every item
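As a rough illustration of the change described in this comment, the sketch below reports progress once per batch instead of once per item. The function name `process_in_batches` and the `report` hook are hypothetical and not part of the PR; the per-item work is only a placeholder for the real `UPDATE`.

```python
def process_in_batches(items, batch_size, report=print):
    """Process items in batches and report progress once per batch.

    `report` is a hypothetical hook (defaults to print) so that the number
    of progress messages can be counted in this sketch.
    """
    done = 0
    total = len(items)
    for start in range(0, total, batch_size):
        batch = items[start:start + batch_size]
        for _ in batch:
            done += 1  # placeholder for the per-row UPDATE
        # One message per batch, not one per item.
        report("Batch done, inserted {}/{} items...".format(done, total))
    return done


messages = []
process_in_batches(list(range(25)), batch_size=10, report=messages.append)
for line in messages:
    print(line)
```

For 25 items in batches of 10 this emits three messages instead of 25, and the final message still carries the full count.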


        print("Batch insertions finished.")
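The per-`gid` numbering in the loop above can be isolated from the database and checked on its own. The following is a minimal sketch under that assumption: `assign_offsets` is a hypothetical helper, not a function in this PR, and the ids and gids are made up.

```python
def assign_offsets(rows, max_offsets):
    """Return (id, offset) pairs for rows that have no offset yet.

    rows: iterable of (id, gid) tuples, in the order they should be numbered.
    max_offsets: dict mapping gid -> highest offset already stored.
                 It is updated in place so later batches continue the count.
    """
    assigned = []
    for row_id, gid in rows:
        if gid in max_offsets:
            # An offset already exists for this gid: take the next one.
            max_offsets[gid] += 1
        else:
            # First submission seen for this gid.
            max_offsets[gid] = 0
        assigned.append((row_id, max_offsets[gid]))
    return assigned


# gid "a" already has offsets 0 and 1 stored; gid "b" has none.
existing = {"a": 1}
result = assign_offsets([(10, "a"), (11, "b"), (12, "a"), (13, "b")], existing)
print(result)
```

Because membership is tested explicitly with `in`, a plain `dict` is enough for this sketch; the explicit `else` branch is what makes the first submission of a gid start at 0.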
@@ -6,7 +6,8 @@ CREATE TABLE lowlevel (
   build_sha1 TEXT NOT NULL,
   lossless BOOLEAN DEFAULT 'n',
-  gid_type gid_type NOT NULL
+  gid_type gid_type NOT NULL,
+  submission_offset INTEGER
 );

CREATE TABLE lowlevel_json (
@@ -0,0 +1,5 @@

ALTER TABLE lowlevel ADD COLUMN submission_offset INTEGER;
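To show why the migration script selects rows `WHERE submission_offset IS NULL`, here is a small sketch of what this `ALTER TABLE` does to existing rows. It uses an in-memory SQLite database as a stand-in, which is an assumption for illustration only; the table is reduced to two columns and the rows are made up.

```python
import sqlite3

# Hypothetical, reduced version of the lowlevel table with some existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lowlevel (id INTEGER PRIMARY KEY, gid TEXT NOT NULL)")
conn.executemany("INSERT INTO lowlevel (gid) VALUES (?)", [("a",), ("a",), ("b",)])

# Same statement as in the migration: a nullable column with no default.
conn.execute("ALTER TABLE lowlevel ADD COLUMN submission_offset INTEGER")

# Every pre-existing row now has a NULL offset and is picked up by the script.
pending = conn.execute(
    "SELECT count(*) FROM lowlevel WHERE submission_offset IS NULL"
).fetchone()[0]
print("rows awaiting an offset:", pending)
```

All rows that existed before the column was added count as pending, which is the set the batch loop works through.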

@@ -19,6 +19,8 @@
import db.user
import webserver

import add_submission_offsets

ADMIN_SQL_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'admin', 'sql')

cli = FlaskGroup(add_default_commands=False, create_app=webserver.create_app_flaskgroup)
@@ -241,7 +243,7 @@ def toggle_site_status():

# Please keep additional sets of commands down there
cli.add_command(db.dump_manage.cli, name="dump")

cli.add_command(add_submission_offsets.cli, name="update-offsets")

if __name__ == '__main__':