
AB-406 Step 1: Offset migration for existing submissions #339

Merged
merged 9 commits into metabrainz:master on May 22, 2019

Conversation

@aidanlw17 (Contributor) commented May 8, 2019

Submission Offsets for Existing Submissions

Migration to add nullable submission_offset column

This allows us to populate the submission offset while the field stays
null for recordings that are submitted during the process, so we can
catch and update them in the next step. Note that the offset column
will only exist in the lowlevel table.

Migration to add offset for existing submissions

Uses a subquery with a window function to calculate the offsets for all
existing entries, then updates the submission_offset column in lowlevel
with the results.
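
As an illustration only (not necessarily the exact SQL in this PR, and assuming offsets are zero-based), such a migration can be written as a single UPDATE driven by a window function over ascending submission dates:

-- Sketch: number each recording's submissions in ascending order of
-- submission date and store the result as that row's offset.
UPDATE lowlevel
SET submission_offset = computed.calc_offset
FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY gid ORDER BY submitted) - 1 AS calc_offset
    FROM lowlevel
) AS computed
WHERE lowlevel.id = computed.id;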

aidanlw17 added 2 commits May 8, 2019
Adds and populates submission_offset in lowlevel
This commit is the first part of AB-406 (1/3). It contains the
following changes:

- Migration for lowlevel table to include a submission offset that
  is nullable

- Migration to update the submission offset column by populating
  it with calculated offsets based on ascending submission dates
SQL formatting change
Very minor formatting change to SQL.

ALTER TABLE lowlevel ADD COLUMN submission_offset INTEGER;

-- Updating the submission offset column based on determination of offset

@alastair (Contributor) commented May 14, 2019

Doing this in a single query is a good idea, but it will block the entire table and stop other insert actions from taking place. Unfortunately we're going to have to do it in a script, which is a bit more complex and will take longer to run, but will leave the site up while it runs.

Take a look at making a script like manage.py that uses a click cli in order to start up the application and connect to the database:

cli = FlaskGroup(add_default_commands=False, create_app=webserver.create_app_flaskgroup)
logging.basicConfig(level=logging.INFO)
@cli.command(name='runserver')

Do a query like:
select id, gid from lowlevel where submission_offset is null limit 10000

The limit can be a parameter in the script that we can change.

Use something like our offset counting code in the dumps to find the current max offset for each mbid:

results = connection.execute(sqlalchemy.text("""
    SELECT gid, COUNT(id)
    FROM lowlevel
    WHERE submitted <= :start_time
    GROUP BY gid
"""), {
    "start_time": start_time,
})
for mbid, count in results.fetchall():
    mbid_occurences[mbid] = count

this query might be good:
select gid, max(submission_offset) from lowlevel where submission_offset is not null group by gid

Then you can loop through the set of 10000 items and insert the offset counts. Do this loop in a transaction.

You don't need to write the SQL in the db module, just write it inline in this script.

Make sure that you print some debugging information: each time it loops around, each time it starts and finishes doing the inserts, how many items have been done, and how many remain.
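
A rough sketch of the kind of script being described, assuming db is the project's database module exposing db.engine and cli is the FlaskGroup quoted above from manage.py; the command name, option, and queries here are illustrative, not the code that was eventually merged:

import click
from sqlalchemy import text

import db  # assumed project database module exposing db.engine
# `cli` is the FlaskGroup already defined in manage.py (see the excerpt above).


@cli.command(name='add-offsets')
@click.option("--limit", "-l", default=10000, help="Rows to process per batch")
def add_offsets(limit):
    """Populate submission_offset for a batch of existing submissions."""
    with db.engine.connect() as connection:
        # Current highest offset already assigned for each MBID.
        max_offsets = {}
        offset_result = connection.execute(text("""
            SELECT gid, MAX(submission_offset)
            FROM lowlevel
            WHERE submission_offset IS NOT NULL
            GROUP BY gid
        """))
        for gid, max_offset in offset_result.fetchall():
            max_offsets[gid] = max_offset

        # Next batch of rows that still need an offset.
        batch = connection.execute(text("""
            SELECT id, gid
            FROM lowlevel
            WHERE submission_offset IS NULL
            LIMIT :limit
        """), {"limit": limit}).fetchall()

        print("Starting batch of {} items...".format(len(batch)))
        with connection.begin():
            for row_id, gid in batch:
                offset = max_offsets.get(gid, -1) + 1
                max_offsets[gid] = offset
                connection.execute(text("""
                    UPDATE lowlevel
                    SET submission_offset = :offset
                    WHERE id = :id
                """), {"offset": offset, "id": row_id})
        print("Finished batch, {} items updated".format(len(batch)))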

aidanlw17 added 5 commits May 16, 2019
Batch add submission offsets
This commit contains a script that integrates with manage.py to
allow the submission offset column to be updated for existing
submissions in smaller batches.
Refactor submission offset script
This commit contains slight changes to refactor the query for
updating submission offsets, and removes an unnecessary update
query.
print("Starting batch insertions...")
print("============================")
count = 0
for id, gid in batch_result.fetchall():

@aidanlw17 (Author, Contributor) commented May 16, 2019

@alastair When you said that the loop should be within a transaction, is it sufficient for the entire operation to be within the same transaction (starting with with db.engine.connect() as connection: on line 19)? Or should I embed another transaction here that is just for the loop?

@alastair (Contributor) commented May 20, 2019

This loop should be in an additional transaction. You can do this using with connection.begin() as transaction:
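
For illustration, a minimal sketch of that structure (the helper name add_offsets_for_batch and its rows argument of (id, offset) pairs are hypothetical, not code from this PR):

from sqlalchemy import text

import db  # assumed project database module exposing db.engine


def add_offsets_for_batch(rows):
    """Sketch: write the offsets for one batch inside its own transaction."""
    with db.engine.connect() as connection:
        # The enclosing connection block stays as it is; the insert loop gets
        # an additional transaction so the batch commits (or rolls back) as a unit.
        with connection.begin() as transaction:
            for row_id, offset in rows:
                connection.execute(text("""
                    UPDATE lowlevel
                    SET submission_offset = :offset
                    WHERE id = :id
                """), {"offset": offset, "id": row_id})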

@alastair (Contributor) left a comment

Thanks, this is a good start for a script. A few changes are needed before we run it.

@cli.command(name='add-offsets')
@click.option("--limit", "-l", default=10000)
def add_offsets(limit):
    """Update lowlevel submission offsets with a specified limit."""

@alastair (Contributor) commented May 20, 2019

Careful about indentation in this file: we should only use spaces.

print("Starting batch insertions...")
print("============================")
count = 0
for id, gid in batch_result.fetchall():

This comment has been minimized.

Copy link
@alastair

alastair May 20, 2019

Contributor

this loop should be in an additional transaction - you can do this with with connection.begin() as transaction:

print("============================")
count = 0
for id, gid in batch_result.fetchall():
print("Finished {}/{} items".format(count, limit))

@alastair (Contributor) commented May 20, 2019

will these print messages occur for every item in the 10000 batch?

@aidanlw17 (Author, Contributor) commented May 20, 2019

Yes, the printing occurs for each item in the 10000 batch as the offset is being found and inserted. If we are running this continuously with the added loop that you mentioned, would it be best for me to remove some of the printing, i.e. print when each batch starts and ends rather than for each item in a batch?

@alastair (Contributor) commented May 20, 2019

Exactly, there should be one print statement for each batch of 10000 items.

def incremental_add_offset(limit):
    with db.engine.connect() as connection:
        # Find the next batch of items to update
        batch_query = text("""

@alastair (Contributor) commented May 20, 2019

It looks to me like this script will only do 10000 items and then quit, is that right?
Instead, it should repeatedly do 10000 items until there are no more remaining. In a loop, get x items, then for each of them count the submission offsets and add them in a transaction.
Then, loop around again and get another x items. You can check the number of rows returned by the batch_result query, and when it's 0, quit.

@aidanlw17 (Author, Contributor) commented May 20, 2019

That’s correct; currently it only does 10000 items and then quits. I understand what you’re saying. I’ll add the loop so that it completes all the batches one after another.
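
A sketch of that outer loop, assuming incremental_add_offset is adjusted to return how many rows it processed (the return value and the name add_all_offsets are illustrative):

def add_all_offsets(limit):
    """Sketch: keep processing batches until the batch query returns no rows."""
    batch_number = 0
    while True:
        processed = incremental_add_offset(limit)
        if processed == 0:
            # Nothing left with a NULL submission_offset, so we are done.
            break
        batch_number += 1
        print("Batch {} done ({} items updated)".format(batch_number, processed))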

batch_query = text("""
SELECT id, gid
FROM lowlevel
WHERE submission_offset IS NULL

@alastair (Contributor) commented May 20, 2019

This query will need an ORDER BY id to ensure that we're getting items in submission order.

@aidanlw17 (Author, Contributor) commented May 20, 2019

Does this query order by id when no other order is specified? I tried this query with an ORDER BY id and without one, and it returned the same order (by id) in both cases.

@alastair (Contributor) commented May 21, 2019

The SQL standard doesn't specify a behaviour when you don't specify ORDER BY. Whether by an implementation decision or by luck, Postgres appears to return items in id order, but this is not guaranteed. We should always be safe and set it explicitly.

@aidanlw17 (Author, Contributor) commented May 21, 2019

Okay thanks, I've specified order by id in the latest commit.
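
With the ordering made explicit, the batch query looks something like this (a sketch of the resulting query, not a verbatim quote from the commit):

batch_query = text("""
    SELECT id, gid
    FROM lowlevel
    WHERE submission_offset IS NULL
    ORDER BY id
    LIMIT :limit
""")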

batch_result = connection.execute(batch_query, { "limit": limit })

# Find max existing offsets
offset_query = text("""

@alastair (Contributor) commented May 20, 2019

If we keep the max_offsets dictionary, we only need to do this query once when the script starts up, instead of once for each batch of 10,000 items.
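
A sketch of that change, assuming connection and text as in the script excerpts above: the dictionary is built once before any batches run and is then kept up to date in memory as offsets are written.

# Build the per-MBID maximum offsets once, at startup.
max_offsets = {}
results = connection.execute(text("""
    SELECT gid, MAX(submission_offset)
    FROM lowlevel
    WHERE submission_offset IS NOT NULL
    GROUP BY gid
"""))
for gid, max_offset in results.fetchall():
    max_offsets[gid] = max_offset

# Later, inside the batch loop, update the dictionary instead of re-querying:
#     offset = max_offsets.get(gid, -1) + 1
#     max_offsets[gid] = offset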

Allows multiple batch insertions in one loop
This commit makes the following changes to the offset insertion script:

- Adds a loop to complete multiple batch insertions (the whole
  table) in a single execution of the script
- Fixes indentation errors (convert to spaces)
- Improves debugging outputs with item and batch counts
@aidanlw17 (Author, Contributor) commented May 20, 2019

@alastair The latest commit here should now allow the offsets for the whole table to be updated in one execution.

@alastair marked this pull request as ready for review May 21, 2019

@@ -0,0 +1,89 @@
from __future__ import print_function

@alastair (Contributor) commented May 22, 2019

I renamed this file so its name makes a bit more sense, and reformatted a few things. Take a look at the whitespace changes for a better idea of the formatting that we use; an IDE or other tool can help you with this too.

@aidanlw17 (Author, Contributor) commented May 22, 2019

Thanks for the changes. For whitespace, I see the change to two blank lines between functions, and also the two blank lines before the first function definition after the cli setup. Are there any other key whitespace conventions that I should be aware of?


# Find number of items in table
size_query = text("""
    SELECT count(*) AS size

@alastair (Contributor) commented May 22, 2019

We're missing some rows in this table, so max(id) is not the same as the number of rows; I changed it to count(*).

""")
connection.execute(query, {"id": id, "offset": offset})
item_count += 1
print(" Batch done, inserted {}/{} items...".format(item_count, table_size)),

@alastair (Contributor) commented May 22, 2019

This is good, but I moved the print message to the end of each batch. It turns out that printing to the screen can take a lot of time, and it can actually slow down processes like this that otherwise run very quickly. We're not missing out on any information by not printing an update for every item.

@alastair (Contributor) left a comment

\o/ thanks

@alastair alastair merged commit 5c1f1db into metabrainz:master May 22, 2019
