AB-394 - Bulk get items with a single database query #337

Closed
wants to merge 3 commits

Conversation

@aidanlw17 (Contributor) commented on Apr 24, 2019

Single database query for low-level data of many recordings

Check for database query method in endpoint query string

The API endpoint get_many_lowlevel takes an optional parameter,
`query_method`, to select the implementation: if the value of
`query_method` is `single`, the new single-query method is used;
otherwise the existing multi-query method is used. This is meant to
simplify the testing process.

Single query implemented for low-level data

The single query uses a subquery to partition recordings and attach
offset values; the outer query then filters the results. (A sketch of
the full query shape follows the to-do list below.)

To Do:

  • Test single query against previous method on larger datasets
  • Consider changes that may allow the query to be better optimized
  • Implement a similar method for high-level data
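For reference, a minimal sketch of the overall shape the query takes by the end of this review. The inner FROM lowlevel is an assumption (the diff excerpts below do not show it), and the ASC ordering reflects the final review decision; treat this as an illustration, not the exact committed code:

SELECT gid, submission_offset, data
  FROM (
        SELECT id, gid,
               ROW_NUMBER()
               OVER (PARTITION BY gid
                     ORDER BY submitted ASC) - 1 AS submission_offset
          FROM lowlevel
       ) AS partitions
  LEFT JOIN lowlevel_json
       USING (id)
 WHERE (gid, submission_offset) IN :recordings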
@aidanlw17 (Contributor, Author) commented on Apr 24, 2019

@alastair This pull request contains the single-query method for AB-394. It's still a work in progress, but I think it would be beneficial to discuss the current state of the query and how I can approach testing it on a larger scale. So far, I've tested the query with 40-50 recordings and seen a ~11% speed improvement at this small scale. I still need to write unit tests for this work as well (as soon as I get some breathing room from my exams this week).

One other thing I noticed: when adding offsets to a query string, if you specify an offset that does not exist (out of range for the recordings in the db), the API just returns data for all the other recordings but makes no note that one is missing. I feel the functionality of this endpoint might be improved with an error message for an out-of-range offset, or something to that effect?

@alastair (Contributor) left a comment

Thanks, good to see that we can get something working. I think we should make a few changes to the query, and some cleanups to the way that we call the db methods from the view.


db/data.py Outdated
(
    SELECT id, gid,
           ROW_NUMBER ()
           OVER (PARTITION BY gid) RowNum
@alastair (Contributor) commented on Apr 24, 2019

let's call this submission_offset instead of RowNum, to give a better idea of what its purpose is

@alastair (Contributor) commented on Apr 24, 2019

we'll have to sort by submission date to ensure that the submission order number is always the same when we run this query


db/data.py Outdated
) AS partitions
  LEFT JOIN lowlevel_json USING (id)
 WHERE (:recordings_info)::jsonb ? gid::text
   AND (:recordings_info)::jsonb->gid::text ? (RowNum-1)::text
@alastair (Contributor) commented on Apr 24, 2019

this is a neat way of doing the query, but let's not introduce complexity by using json here. Instead you can format the input as a list of tuples: [(mbid, offset), (mbid, offset)] and do WHERE (gid, submission_offset) IN :tuples
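A sketch of what that suggestion looks like from the Python side, mirroring the execute call that appears later in this thread. The MBIDs are placeholders, and db is assumed to be the application's database module with a configured engine:

from sqlalchemy import text

import db

# Placeholder MBIDs for illustration only
recordings = [("00000000-0000-0000-0000-000000000001", 0),
              ("00000000-0000-0000-0000-000000000002", 1)]

query = text("""
    SELECT gid, submission_offset, data
      FROM (SELECT id, gid,
                   ROW_NUMBER()
                   OVER (PARTITION BY gid
                         ORDER BY submitted) - 1 AS submission_offset
              FROM lowlevel) AS partitions
      LEFT JOIN lowlevel_json USING (id)
     WHERE (gid, submission_offset) IN :recordings
""")

with db.engine.connect() as connection:
    # psycopg2 adapts the tuple of tuples into a row-value list, so the
    # IN clause is rendered as IN (('...', 0), ('...', 1))
    result = connection.execute(query, {"recordings": tuple(recordings)})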


db/data.py Outdated

with db.engine.connect() as connection:
    query = text(
        """ SELECT gid, RowNum, data
@alastair (Contributor) commented on Apr 24, 2019

you can subtract 1 from RowNum at this point and alias it to a new column name for the WHERE

@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

Could you elaborate further on how this would look? I was under the impression that SELECT is evaluated after the WHERE clause, so an alias from the SELECT list cannot be used in WHERE; the full expression must be used instead. When trying to use the alias, I see an error saying the column name is not found. The Postgres docs support this: https://www.postgresql.org/docs/current/sql-select.html#SQL-SELECT-LIST

@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

That being said, I see an alternative to both. Below, in the inner query, I can select ROW_NUMBER () OVER (PARTITION BY gid) - 1 RowNum; then the column RowNum (soon to be submission_offset) will start from 0 in the first place.
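To make the underlying restriction concrete: in Postgres, neither the SELECT-list alias nor the window expression itself may appear in WHERE (window functions are evaluated after WHERE), so the subquery is what makes filtering possible at all. An illustrative sketch, not taken from the diff:

-- Fails: output-list aliases are not visible in WHERE
SELECT gid,
       ROW_NUMBER() OVER (PARTITION BY gid) - 1 AS submission_offset
  FROM lowlevel
 WHERE submission_offset = 0;

-- Also fails: window functions are not allowed in WHERE
SELECT gid
  FROM lowlevel
 WHERE ROW_NUMBER() OVER (PARTITION BY gid) - 1 = 0;

-- Works: compute the offset in a subquery, filter in the outer query
SELECT gid, submission_offset
  FROM (SELECT gid,
               ROW_NUMBER() OVER (PARTITION BY gid) - 1 AS submission_offset
          FROM lowlevel) AS partitions
 WHERE submission_offset = 0;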


# Query option allowed for testing purposes:
if request.args.get("query_method") == "single":
    # Use single-query method
    recording_details = get_data_for_multiple_recordings(db.data.load_many_low_level)
@alastair (Contributor) commented on Apr 24, 2019

instead of doing this, I would call check_bad_request_for_multiple_recordings and load_many_low_level directly and return the value; it will make the logic a bit easier to follow
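For illustration, the view would then reduce to something like this (a sketch: the route path and blueprint name are hypothetical, while the two function names are the ones discussed in this PR):

@api_bp.route("/low-level", methods=["GET"])  # hypothetical route and blueprint
def get_many_lowlevel():
    recordings = check_bad_request_for_multiple_recordings()
    return jsonify(db.data.load_many_low_level(recordings))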


db/data.py Outdated

# RowNum starts at 1 so it must be decremented to match offset
for row in result.fetchall():
    recordings_info[str(row[0])][str(row[1]-1)] = row[2]
@alastair (Contributor) commented on Apr 24, 2019

you should be able to access this as row['gid'], etc, to make it a bit more explicit about what fields you are reading
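For example (a sketch; note that Postgres folds the unquoted alias RowNum to lowercase, so the result key is "rownum" until the column is renamed):

recordings_info = {}
for row in result.fetchall():
    # rownum starts at 1, so subtract 1 to match the requested offset
    recordings_info.setdefault(str(row["gid"]), {})[str(row["rownum"] - 1)] = row["data"]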


db/data.py Outdated
recordings_info_json = json.dumps(recordings_info, separators=(',', ':'))

with db.engine.connect() as connection:
    query = text(
@alastair (Contributor) commented on Apr 24, 2019

take a look at our guidelines for aligning SQL statements: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql


db/data.py Outdated
@@ -359,6 +360,58 @@ def load_low_level(mbid, offset=0):
    return row[0]


def load_many_low_level(recordings):
    """Recordings is a list of tuples (mbid, offset).
@alastair (Contributor) commented on Apr 24, 2019

We have some guidelines here regarding how to write docstrings: https://github.com/metabrainz/guidelines/blob/master/Python.md#docstrings

Specifically, I wouldn't have the "Steps" section below, or the text starting "Sample query response..."
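A sketch of a trimmed-down docstring along those lines (the Args/Returns layout is a common convention and an assumption here, not a quote from the linked guidelines):

def load_many_low_level(recordings):
    """Collect low-level JSON data for multiple recordings.

    Args:
        recordings: a list of (mbid, offset) tuples specifying which
            submission of each recording to load.

    Returns:
        A dictionary mapping each mbid to {offset: data} for the
        (mbid, offset) pairs found in the database.
    """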

@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

Thanks for the link! I've read and bookmarked those guidelines, and have altered the docstring accordingly.


This commit improves the query method for low-level data by
making the following changes:

1. Restructures the function calls involved in the single-query
method for the get_many_lowlevel API endpoint to improve simplicity.

2. Uses recording tuples as an SQL query parameter rather than
a dict that is cast to jsonb.

3. Orders by submission date within the subquery to maintain
consistent use of the offset parameter.

4. Improves documentation via the docstring, query format, and
descriptive naming.
db/data.py Outdated
SELECT id, gid,
       ROW_NUMBER ()
       OVER (PARTITION BY gid
             ORDER BY submitted DESC) - 1 submission_offset
@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

@alastair I've reformatted the query to match the styling that you suggested, and decremented the row number within the subquery so we can use submission_offset in both the outer SELECT and WHERE clauses. I've also ordered the recordings by submission date such that the most recent submission is always offset 0.

@alastair (Contributor) commented on Apr 26, 2019

this query is much better! Just a small thing: we want to order by ASC, since offset 0 should be the first submission for a given gid.
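Concretely, only the sort direction in the window clause changes (sketch):

ROW_NUMBER()
OVER (PARTITION BY gid
      ORDER BY submitted ASC) - 1 submission_offset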


) AS partitions
  LEFT JOIN lowlevel_json
       USING (id)
 WHERE (gid, submission_offset) IN :recordings
@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

@alastair here, I've simplified the query using tuples rather than jsonb. I agree that the previous complexity was unnecessary!


- recording_details = get_data_for_multiple_recordings(db.data.load_many_low_level)
+ recordings = check_bad_request_for_multiple_recordings()
+ recording_details = db.data.load_many_low_level(recordings)
  return jsonify(recording_details)
@aidanlw17 (Contributor, Author) commented on Apr 25, 2019

I've moved the calls to check_bad_request_for_multiple_recordings() and load_many_low_level(recordings) to make this process easier to follow. However, I've assumed that I should leave the pre-existing flow for the multi-query method as is for now?


db/data.py Outdated

result = connection.execute(query, {'recordings': tuple(recordings)})

recordings_info = {}
@alastair (Contributor) commented on Apr 26, 2019

you can use defaultdict(dict) here; then you don't need to use setdefault below
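A sketch of how that looks, assuming the gid/submission_offset/data columns from the query above:

from collections import defaultdict

recordings_info = defaultdict(dict)
for row in result.fetchall():
    # defaultdict creates the per-mbid dict on first access,
    # so no setdefault call is needed
    recordings_info[str(row["gid"])][str(row["submission_offset"])] = row["data"]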


@alastair (Contributor) commented on Apr 26, 2019

Thanks! Let's take this out of draft mode.
Don't worry about the failing tests; we'll fix them in #338


@aidanlw17 (Contributor, Author) commented on Apr 26, 2019

@alastairp great, thanks! I'll change the ordering to ascending in the query and then take it out of draft.


- Small fix for submission order, i.e. offset 0 gives the first
recording submitted for an mbid.
- Uses defaultdict to eliminate use of setdefault in processing
the query results.
@aidanlw17 marked this pull request as ready for review on Apr 27, 2019
@alastair (Contributor) commented on May 10, 2019

We're going to close this because the changes that we're making in AB-406 and #339/#340 will fix this in a better way.


@alastair closed this on May 10, 2019
aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this issue on May 10, 2019
This commit uses the submission offset column in lowlevel to make
improvements in the following queries:

- Selects entry based on submission offset in high and lowlevel
  queries for a single recording.

- Simplifies the attempt at a single query method used in PR metabrainz#337
  for returning lowlevel data of multiple recordings.

- Uses two queries to return highlevel data for multiple
  recordings.
aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this issue on May 15, 2019 (same commit message as above)
aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this issue on Jun 3, 2019 (same commit message as above)