
AB-394 - Bulk get items with a single database query #337

Closed
wants to merge 3 commits

Conversation

@aidanlw17
Contributor

commented Apr 24, 2019

Single database query for low-level data of many recordings

Check for database query method in endpoint query string

The API endpoint get_many_lowlevel takes an optional parameter, query_method, to select this method: if query_method is single, the new single-query method is used; otherwise the existing multi-query method is used. This is meant to simplify testing.

Single query implemented for low-level data

The single query uses a subquery to partition recordings and attach offset values; the outer query filters the results.
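Pieced together from the review below, the query shape being described looks roughly like this (a sketch, not the exact code from the PR; it assumes the subquery reads from the lowlevel table, with the data column coming from lowlevel_json):

    from sqlalchemy import text

    # Number each submission within its gid partition, then keep only
    # the requested (mbid, offset) pairs in the outer query.
    query = text("""
        SELECT gid, submission_offset, data
          FROM (
                SELECT id, gid,
                       ROW_NUMBER()
                       OVER (PARTITION BY gid
                             ORDER BY submitted) - 1 AS submission_offset
                  FROM lowlevel
               ) AS partitions
          LEFT JOIN lowlevel_json
               USING (id)
         WHERE (gid, submission_offset) IN :recordings
        """)

Here :recordings is the list of (mbid, offset) tuples discussed further down.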

To Do:

  • Test the single query against the previous method on larger datasets
  • Consider changes that may allow the query to be better optimized
  • Implement a similar method for high-level data
Create single query low level data retrieval
@aidanlw17

Contributor Author

commented Apr 24, 2019

@alastair This pull request contains the single query method for AB-394. It's still a work in progress, but I think it would be beneficial to discuss the current state of the query and how I can approach testing it on a larger scale. Thus far, I've tested the query with 40-50 recordings and seen a ~11% speed improvement at this small scale. I still need to do unit testing for this work as well (as soon as I get some breathing room from my exams this week).

One other thing I noticed: when adding offsets to a query string, if you specify an offset that does not exist (out of range for recordings in the db), the API just returns data for all the other recordings but makes no note that one is missing. The functionality of this endpoint might be improved with an error message for an out-of-range offset, or something to that effect?
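A hypothetical sketch of such a check (not part of this PR, and using the (mbid, offset) tuple format suggested later in the review):

    # Hypothetical: collect requested (mbid, offset) pairs that produced no
    # result, so the response can report them instead of silently dropping them.
    missing = [(mbid, offset) for mbid, offset in recordings
               if str(offset) not in recordings_info.get(str(mbid), {})]
    if missing:
        pass  # e.g. return a 400, or attach the missing pairs to the response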

@alastair
Contributor

left a comment

Thanks, good to see that we can get something working. I think we should make a few changes to the query, and some cleanups to the way that we call the db methods from the view.

db/data.py Outdated
(
    SELECT id, gid,
           ROW_NUMBER ()
           OVER (PARTITION BY gid) RowNum

alastair Apr 24, 2019

Contributor

let's call this submission_offset instead of RowNum, to give a better idea of what its purpose is

alastair Apr 24, 2019

Contributor

we'll have to sort by submission date to ensure that the submission order number is always the same when we run this query

db/data.py Outdated
) AS partitions
LEFT JOIN lowlevel_json USING (id)
WHERE (:recordings_info)::jsonb ? gid::text
AND (:recordings_info)::jsonb->gid::text ? (RowNum-1)::text

alastair Apr 24, 2019

Contributor

this is a neat way of doing the query, but let's not introduce complexity by using json here. Instead you can format the input as a list of tuples: [(mbid, offset), (mbid, offset)] and do WHERE (gid, submission_offset) IN :tuples
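Sketched out, the suggested parameter format would look something like this (mbid_1 and mbid_2 stand in for MusicBrainz recording MBID strings; the execute call mirrors the one that appears later in this thread):

    # One bind parameter holding every requested (mbid, offset) pair.
    recordings = [(mbid_1, 0), (mbid_1, 2), (mbid_2, 0)]
    result = connection.execute(query, {'recordings': tuple(recordings)})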

db/data.py Outdated

with db.engine.connect() as connection:
    query = text(
        """ SELECT gid, RowNum, data

alastair Apr 24, 2019

Contributor

you can subtract 1 from RowNum at this point and alias it to a new column name for the WHERE clause

aidanlw17 Apr 25, 2019

Author Contributor

Could you elaborate further on how this would look? I was under the impression that the SELECT list is evaluated after the WHERE clause, and thus an alias from SELECT cannot be used in WHERE; the full expression must be used instead. When I try to use the alias, I see an error that the column name is not found. The Postgres docs support this: https://www.postgresql.org/docs/current/sql-select.html#SQL-SELECT-LIST

aidanlw17 Apr 25, 2019

Author Contributor

That being said, I see an alternative to both. Below, in the inner query, I can select ROW_NUMBER () OVER (PARTITION BY gid) - 1 RowNum; then the column RowNum (soon to be submission_offset) starts from 0 in the first place.

# Query option allowed for testing purposes:
if request.args.get("query_method") == "single":
    # Use single-query method
    recording_details = get_data_for_multiple_recordings(db.data.load_many_low_level)

alastair Apr 24, 2019

Contributor

instead of doing this, I would call check_bad_request_for_multiple_recordings and load_many_low_level directly and return the value; it will make the logic a bit easier to follow

db/data.py Outdated

# RowNum starts at 1 so it must be decremented to match offset
for row in result.fetchall():
    recordings_info[str(row[0])][str(row[1]-1)] = row[2]

alastair Apr 24, 2019

Contributor

you should be able to access this as row['gid'], etc, to make it a bit more explicit about what fields you are reading

db/data.py Outdated
recordings_info_json = json.dumps(recordings_info, separators=(',', ':'))

with db.engine.connect() as connection:
    query = text(

alastair Apr 24, 2019

Contributor

take a look at our guidelines for aligning SQL statements: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql

db/data.py Outdated
@@ -359,6 +360,58 @@ def load_low_level(mbid, offset=0):
return row[0]


def load_many_low_level(recordings):
    """Recordings is a list of tuples (mbid, offset).

alastair Apr 24, 2019

Contributor

We have some guidelines here regarding how to write docstrings: https://github.com/metabrainz/guidelines/blob/master/Python.md#docstrings

Specifically, I wouldn't have the "Steps" section below, or the text starting "Sample query response..."

aidanlw17 Apr 25, 2019

Author Contributor

Thanks for the link! I've read and bookmarked those guidelines, and have altered the docstring accordingly.

Improves Single Query Method and Organization
This commit improves the query method for low-level data by
making the following changes:

1. Restructures the function calls involved in the single query
method for get_many_lowlevel api endpoint to improve simplicity.

2. Uses recording tuples as an SQL query parameter rather than
a dict that is cast to jsonb.

3. Orders by submission date within the subquery to maintain
consistent use of offset parameter.

4. Improves documentation via docstring, query format, and
descriptive naming.
db/data.py Outdated
SELECT id, gid,
       ROW_NUMBER ()
       OVER (PARTITION BY gid
             ORDER BY submitted DESC) - 1 submission_offset

aidanlw17 Apr 25, 2019

Author Contributor

@alastair I've reformatted the query to match the styling that you suggested. I've also decremented the row number within the subquery, so we can use submission_offset in both the outer SELECT and WHERE clauses. Finally, I've ordered the recordings by submission date such that the most recent submission is always offset 0.

alastair Apr 26, 2019

Contributor

this query is much better! Just a small thing: we want to order by ASC, since offset 0 is the first submission for a given gid.

) AS partitions
LEFT JOIN lowlevel_json
USING (id)
WHERE (gid, submission_offset) IN :recordings

aidanlw17 Apr 25, 2019

Author Contributor

@alastair here, I've simplified the query using tuples rather than jsonb. I agree that the previous complexity was unnecessary!

- recording_details = get_data_for_multiple_recordings(db.data.load_many_low_level)
+ recordings = check_bad_request_for_multiple_recordings()
+ recording_details = db.data.load_many_low_level(recordings)
  return jsonify(recording_details)

aidanlw17 Apr 25, 2019

Author Contributor

I've moved the calls to check_bad_request_for_multiple_recordings() and load_many_low_level(recordings) to make this process easier to follow. However, I've assumed that I should leave the pre-existing flow for the multiple query method as is for now?

db/data.py Outdated

result = connection.execute(query, { 'recordings': tuple(recordings) })

recordings_info = {}

alastair Apr 26, 2019

Contributor

you can use defaultdict(dict) here; then you don't need to use setdefault below
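A sketch of the suggested result handling, assuming the query exposes columns named gid, submission_offset, and data as above:

    from collections import defaultdict

    # defaultdict(dict) creates the per-mbid dict on first access, removing
    # the need for setdefault; named columns make the reads explicit.
    recordings_info = defaultdict(dict)
    for row in result.fetchall():
        recordings_info[str(row['gid'])][str(row['submission_offset'])] = row['data']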

@alastair

Contributor

commented Apr 26, 2019

Thanks! Let's take this out of draft mode.
Don't worry about the failing tests; we'll fix them in #338

@aidanlw17

Contributor Author

commented Apr 26, 2019

@alastairp great, thanks! I'll make the change to ascending order on the query and then take it out of draft.

load_many_low_level query orders by ASC submission
- Small fix for submission order, i.e. offset 0 gives the first
recording submitted for an mbid.
- Uses defaultdict to eliminate use of setdefault in processing
the query results.

@aidanlw17 aidanlw17 marked this pull request as ready for review Apr 27, 2019

@alastair

Contributor

commented May 10, 2019

We're going to close this because the changes that we're making in AB-406 and #339/#340 will fix this in a better way.

@alastair alastair closed this May 10, 2019

aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this pull request May 10, 2019
Use submission offset in endpoint queries
This commit uses the submission offset column in lowlevel to make
improvements in the following queries:

- Selects entry based on submission offset in high and lowlevel
  queries for a single recording.

- Simplifies the attempt at a single query method used in PR metabrainz#337
  for returning lowlevel data of multiple recordings.

- Uses two queries to return highlevel data for multiple
  recordings.
aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this pull request May 15, 2019
Use submission offset in endpoint queries
aidanlw17 added a commit to aidanlw17/acousticbrainz-server that referenced this pull request Jun 3, 2019
Use submission offset in endpoint queries