AB-406 Part 3: Remove nullable, use offset for bulk get endpoints #341

Merged
15 commits merged into metabrainz:master on Jun 4, 2019

Conversation

@aidanlw17 (Contributor) commented May 15, 2019

AB-406 Part 3:

Remove nullable on submission offset and add index

  • Migration to remove nullable on submission_offset.
  • Adds index to improve querying by submission_offset (a rough SQL sketch
    follows below).
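Roughly, the migration boils down to SQL like the following. This is a sketch, not the PR's migration script; the index name matches the one shown in the \d lowlevel output later in this thread, but the exact statements are an assumption.

    BEGIN;
    -- Enforce NOT NULL on the existing column, then index it for offset lookups.
    ALTER TABLE lowlevel ALTER COLUMN submission_offset SET NOT NULL;
    CREATE INDEX submission_offset_ndx_lowlevel ON lowlevel (submission_offset);
    COMMIT;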

Use offset for bulk get endpoints

  • Updates bulk get for both high and low level data to use the submission
    offset, which simplifies queries.
  • Creates new functions for loading high and low level data of multiple
    submissions in a single query with the use of submission offsets (see the
    sketch after this list).
  • Adds unit tests for new behaviour of the bulk get methods.
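As a rough illustration of the bulk-get idea only: this is a sketch, not the PR's implementation. The llj.data column name and the return shape are assumptions, and the sketch filters the requested offsets in Python, whereas the PR filters on submission_offset in SQL.

    from sqlalchemy import text

    def load_many_low_level_sketch(connection, recordings):
        """Sketch of a bulk lookup; recordings is a list of (mbid, submission_offset) pairs."""
        wanted = set(recordings)
        query = text("""
            SELECT ll.gid::text AS mbid,
                   ll.submission_offset,
                   llj.data
              FROM lowlevel ll
              JOIN lowlevel_json llj ON llj.id = ll.id
             WHERE ll.gid::text = ANY(:mbids)
        """)
        result = connection.execute(query, {"mbids": [mbid for mbid, _ in recordings]})
        data = {}
        for mbid, offset, document in result.fetchall():
            # Keep only the (MBID, offset) pairs that were actually requested.
            if (mbid, offset) in wanted:
                data.setdefault(mbid, {})[str(offset)] = document
        return data  # an empty dict when nothing matches, per the review discussion below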
Diff context for the review comment below (a test for the bulk get):

    recordings = [('48877286-42d4-4b0a-a1e0-d703a587f64b', 1),
                  ('ffcc4249-28bb-4c91-9195-a60b21d4fb94', 0)]
    with self.assertRaises(db.exceptions.NoDataFoundException):
        db.data.load_many_low_level(list(recordings))
@alastair (Contributor) commented May 15, 2019

Is this what we talked about in IRC? Even if there are no matches to the queried mbids/offsets, I would expect to get an empty dictionary response rather than an error.

@aidanlw17 (Author) commented May 18, 2019

Sorry, raising the error for no matches was an oversight. I've fixed it to use an empty dictionary and updated the tests in the most recent commit 5c6b601.
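For reference, the updated assertion presumably ends up looking something like this. It is a sketch based on the snippet above and this reply, not the exact test code from commit 5c6b601.

    recordings = [('48877286-42d4-4b0a-a1e0-d703a587f64b', 1),
                  ('ffcc4249-28bb-4c91-9195-a60b21d4fb94', 0)]
    # No submissions exist for these (MBID, offset) pairs, so the bulk load
    # now returns an empty dictionary instead of raising NoDataFoundException.
    self.assertEqual({}, db.data.load_many_low_level(list(recordings)))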

Diff context for the review comment below:

    @@ -296,14 +296,27 @@ def add_empty_lowlevel(mbid, lossless, date):
        gid_type = gid_types.GID_TYPE_MSID
        with db.engine.connect() as connection:
            query = text("""
                INSERT INTO lowlevel (gid, build_sha1, lossless, submitted, gid_type)
@alastair (Contributor) commented May 15, 2019

I can't remember why we made our own insert method here. As this tests the stats module and not data, could we use the db.data module instead of duplicating code here?

@aidanlw17 (Author) commented May 18, 2019

The functions that would be used here from the db.data module are nested within write_low_level, so they aren't accessible. Should I write the function that calculates the next submission offset outside of write_low_level so that it can be used in other places like here, or is it best to leave this duplicated?

@alastair (Contributor) commented May 20, 2019

OK, revisiting my comment from #340 (comment) - we have this code twice in tests now. Let's move the nested function to the top level of the data module so that we can use it here and in test_data.
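A minimal sketch of what the promoted helper could look like, using the name it eventually gets (get_next_submission_offset, per the follow-up further down) and the max(submission_offset) approach discussed below; the real signature and query in db.data may differ.

    from sqlalchemy import text

    def get_next_submission_offset(connection, mbid):
        """Return the submission_offset to use for the next lowlevel row for this MBID."""
        query = text("""
            SELECT max(submission_offset) AS max_offset
              FROM lowlevel
             WHERE gid = :mbid
        """)
        row = connection.execute(query, {"mbid": mbid}).fetchone()
        if row and row[0] is not None:
            return row[0] + 1
        # First submission for this MBID starts at offset 0.
        return 0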

Diff context for the review comment below:

        return recordings_info


    def count_lowlevel(mbid):
@alastair (Contributor) commented May 15, 2019

Can you use the EXPLAIN command in Postgres (https://www.postgresql.org/docs/current/sql-explain.html) and paste the query plan for this query using both count() and max(submission_offset)? I'm not sure which one will be faster.
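For context, the two plans being compared can be produced from a psql shell roughly like this (the MBID is the placeholder value that appears in the plans below; the exact query text inside count_lowlevel may differ):

    EXPLAIN SELECT count(*)
      FROM lowlevel
     WHERE gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc';

    EXPLAIN SELECT max(submission_offset)
      FROM lowlevel
     WHERE gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc';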

@aidanlw17 (Author) commented May 18, 2019

The query plan when using count():

[screenshot: EXPLAIN output for the count() query]

The query plan when using max(submission_offset) - we will also need to increment the offset:

[screenshot: EXPLAIN output for the max(submission_offset) query]

On my own, I'm not sure what to make of the difference between these.

@alastair (Contributor) commented May 20, 2019

Please always copy the output of programs as text; that way we can copy the contents if needed.

The cost for both of these queries is the same, so it doesn't look like there's a problem with either version.
I notice that your database is doing a sequential scan instead of an index scan. That's a bit strange, as there should be an index in place. It's possible that you have so few rows that Postgres decided not to use the index.

For example, mine shows:

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=12.65..12.66 rows=1 width=8)
   ->  Bitmap Heap Scan on lowlevel  (cost=4.18..12.64 rows=4 width=0)
         Recheck Cond: (gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc'::uuid)
         ->  Bitmap Index Scan on gid_ndx_lowlevel  (cost=0.00..4.18 rows=4 width=0)
               Index Cond: (gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc'::uuid)
(5 rows)

Can you do the following things:

  • Run \d lowlevel in a psql shell and copy the output of the section that starts with Indexes:, to verify that you have the necessary indexes.
  • Run ANALYZE lowlevel; to refresh the planner statistics for this table.

@aidanlw17 (Author) commented May 20, 2019

I believe my table didn't have the index because I was switching between branches, and the other branches for #339 and #340 don't have the index migration yet. When I switch between working on each branch, I typically reinitialize the database using ./develop.sh run --rm webserver python2 manage.py init-db --force and import archived data. When I got these query plans in a psql session, I must have been using the database as it was updated for one of the earlier steps.

A question I have about this: is there a better way to apply the migrations to my database that doesn't require reinitializing it and re-importing my archived data? I have also been wondering what the process is for making these changes on the actual database for acousticbrainz.org.

Anyway, I reinitialized my database once on this branch, and now my lowlevel table shows the index; my query plan also matches yours:

                                          Table "public.lowlevel"
      Column       |           Type           | Collation | Nullable |               Default                
-------------------+--------------------------+-----------+----------+--------------------------------------
 id                | integer                  |           | not null | nextval('lowlevel_id_seq'::regclass)
 gid               | uuid                     |           | not null | 
 build_sha1        | text                     |           | not null | 
 lossless          | boolean                  |           |          | false
 submitted         | timestamp with time zone |           |          | now()
 gid_type          | gid_type                 |           | not null | 
 submission_offset | integer                  |           | not null | 
Indexes:
    "lowlevel_pkey" PRIMARY KEY, btree (id)
    "build_sha1_ndx_lowlevel" btree (build_sha1)
    "gid_ndx_lowlevel" btree (gid)
    "gid_type_ndx_lowlevel" btree (gid_type)
    "lossless_ndx_lowlevel" btree (lossless)
    "submission_offset_ndx_lowlevel" btree (submission_offset)
    "submitted_ndx_lowlevel" btree (submitted)
Referenced by:
    TABLE "highlevel" CONSTRAINT "highlevel_fk_lowlevel" FOREIGN KEY (id) REFERENCES lowlevel(id)
    TABLE "lowlevel_json" CONSTRAINT "lowlevel_json_fk_lowlevel" FOREIGN KEY (id) REFERENCES lowlevel(id)
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Aggregate  (cost=12.65..12.66 rows=1 width=8)
   ->  Bitmap Heap Scan on lowlevel  (cost=4.18..12.64 rows=4 width=0)
         Recheck Cond: (gid = 'd1a0c57a-9b5c-4869-b039-629e845583fd'::uuid)
         ->  Bitmap Index Scan on gid_ndx_lowlevel  (cost=0.00..4.18 rows=4 width=0)
               Index Cond: (gid = 'd1a0c57a-9b5c-4869-b039-629e845583fd'::uuid)

Diff context for the review comment below:

    @@ -296,14 +296,27 @@ def add_empty_lowlevel(mbid, lossless, date):
        gid_type = gid_types.GID_TYPE_MSID
@alastair (Contributor) commented May 20, 2019

Would it be possible to use db.data.write_low_level for the body of this method? e.g. we create the empty data object and then call write_low_level to write it to the database, instead of duplicating this code. I guess we added it here ourselves to skip a few tests. I think it makes sense to unify this code if possible.

@aidanlw17 (Author) commented May 20, 2019

I've moved the get_submission_offset function to the top level of the db.data module and renamed it, more appropriately, to get_next_submission_offset. I'm looking into using db.data.write_low_level entirely here, but one issue I'm finding is that this function requires a unique sha256 field. In db.data.write_low_level the data_sha256 field is found by hashing the encoded data_json, but data_json will be the same for every item added by this function. I know I can construct the data before passing it to db.data.write_low_level like so:

data = {"data": {}, "metadata": {"audio_properties": {"lossless": lossless}, "version": {"essentia_build_sha": "sha1"}}}

But how can I ensure that the sha256 will be unique for each item added by this function? Or how can I differentiate the data field for each one when they need to be empty?
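To make the concern concrete, the hashing step being described is roughly this (a sketch; the exact serialisation used by db.data.write_low_level may differ):

    import hashlib
    import json

    lossless = False  # example value
    data = {"data": {}, "metadata": {"audio_properties": {"lossless": lossless},
                                     "version": {"essentia_build_sha": "sha1"}}}
    # Every empty submission serialises to the same JSON, so the digest is
    # identical for all of them, which is the uniqueness problem described above.
    data_sha256 = hashlib.sha256(json.dumps(data, sort_keys=True).encode("utf-8")).hexdigest()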

aidanlw17 and others added 8 commits Jun 3, 2019

This commit contains the migrations for the following:

- Adds constraint submission_offset in lowlevel, not null.

- Creates index on lowlevel with submission_offset to improve
  query performance.

This commit uses the submission offset column in lowlevel to make
improvements in the following queries:

- Selects entry based on submission offset in high and lowlevel
  queries for a single recording.

- Simplifies the attempt at a single query method used in PR metabrainz#337
  for returning lowlevel data of multiple recordings.

- Uses two queries to return highlevel data for multiple
  recordings.

This file should have belonged to commit e5d2aad.

This commit makes some small changes to load_many_high_level:

- Embeds multiple defaultdict objects when structuring the
  data returned by queries.
- Adds an exception for no data found when metadata query returns
  None.
- Removes exceptions on rows so that missing data rows are skipped
  and the rest can be returned.
- Fixes incorrect indexing for elements of returned rows.

Change in bulk get endpoints:

- Removes unnecessary function get_data_for_multiple_recordings.

In test_stats:

- Includes small fix for stats unit testing to account for the
  submission offset.

In test_data:

- Adds unit tests for load_many_low_level and load_many_high_level.

In test_core.py:

- Updates bulk get unit tests to use load_many_low_level and
  load_many_high_level.

This commit removes the no data found error that is raised when
there are no existing submissions for any of the recordings in
load_many_low_level and load_many_high_level. Instead, these
functions will now return an empty dictionary under these
circumstances.

This commit also adjusts unit tests accordingly.

This commit includes the following changes:

- Creates a test to check the successful assignment of a submission
  offset when writing lowlevel data.

- Removes the gid::text cast when querying for the max submission offset;
  this cast is unnecessary.

- Creates an alias for max submission offset to improve clarity.

Remove some unused code
@alastair (Contributor) commented Jun 4, 2019

@paramsingh review here would be appreciated!

@paramsingh merged commit 786824f into metabrainz:master on Jun 4, 2019
1 check passed