
AB-406 Part 3: Remove nullable, use offset for bulk get endpoints #341

Merged: 15 commits from aidanlw17:ab-406-3 into metabrainz:master on Jun 4, 2019

Conversation

@aidanlw17 (Contributor) commented May 15, 2019

AB-406 Part 3:

Remove nullable on submission offset and add index

  • Migration to remove nullable on submission_offset.
  • Adds index to improve querying by submission_offset.

Use offset for bulk get endpoints

  • Updates bulk get for both high- and low-level data to use the submission
    offset, which simplifies queries.
  • Creates new functions for loading high- and low-level data of multiple
    submissions in a single query using submission offsets (see the sketch below).
  • Adds unit tests for the new behaviour of the bulk get methods.
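For illustration, a minimal sketch of what such a single-query bulk lookup could look like, assuming a lowlevel_json.data column and driver-side expansion of the tuple parameter (psycopg2 adapts a tuple of tuples into a row-value list); the PR's actual function names and return shape may differ:

from collections import defaultdict

from sqlalchemy import text

import db


def load_many_low_level_sketch(recordings):
    """recordings: a list of (mbid, submission_offset) tuples."""
    with db.engine.connect() as connection:
        result = connection.execute(text("""
            SELECT ll.gid::text, ll.submission_offset, llj.data
              FROM lowlevel ll
              JOIN lowlevel_json llj ON llj.id = ll.id
             WHERE (ll.gid::text, ll.submission_offset) IN :recordings
        """), {"recordings": tuple(recordings)})
        # Group rows as {mbid: {offset: document}}; pairs with no stored
        # submission are simply absent from the result.
        documents = defaultdict(dict)
        for gid, offset, data in result.fetchall():
            documents[gid][offset] = data
        return dict(documents)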
    recordings = [('48877286-42d4-4b0a-a1e0-d703a587f64b', 1),
                  ('ffcc4249-28bb-4c91-9195-a60b21d4fb94', 0)]
    with self.assertRaises(db.exceptions.NoDataFoundException):
        db.data.load_many_low_level(list(recordings))

@alastair (Contributor) commented May 15, 2019

Is this what we talked about in IRC? Even if there are no matches to the queried mbids/offsets I would expect to get an empty dictionary response, rather than an error.

@aidanlw17 (Author, Contributor) commented May 18, 2019

Sorry, raising the error for no matches was an oversight. I've fixed it to use an empty dictionary and updated the tests in the most recent commit 5c6b601.
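A test along these lines (a hypothetical sketch; the real test lives in test_data and uses the project's database test setup) pins down the new behaviour:

import unittest

import db.data


class BulkGetSketchTestCase(unittest.TestCase):

    def test_load_many_low_level_none_exist(self):
        # Neither (MBID, offset) pair has a submission, so the bulk loader
        # should now return {} instead of raising NoDataFoundException.
        recordings = [("48877286-42d4-4b0a-a1e0-d703a587f64b", 1),
                      ("ffcc4249-28bb-4c91-9195-a60b21d4fb94", 0)]
        self.assertEqual({}, db.data.load_many_low_level(recordings))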

@@ -296,14 +296,27 @@ def add_empty_lowlevel(mbid, lossless, date):
    gid_type = gid_types.GID_TYPE_MSID
    with db.engine.connect() as connection:
        query = text("""
            INSERT INTO lowlevel (gid, build_sha1, lossless, submitted, gid_type)

@alastair (Contributor) commented May 15, 2019

I can't remember why we made our own insert method here. As this tests the stats module and not data, could we use the db.data module instead of duplicating code here?

@aidanlw17 (Author, Contributor) commented May 18, 2019

The functions that would be used here from the db.data module are nested within write_low_level, so they aren't accessible. Should I write the function that calculates the next submission offset outside of write_low_level so that it can be used in other places like here, or is it best to leave this duplicated?

@alastair (Contributor) commented May 20, 2019

OK, revisiting my comment from #340 (comment) - we have this code twice in tests now. Let's move the nested function to the top level of the data module so that we can use it here and in test_data.
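For example, the promoted helper could look roughly like this (a sketch in the project's text() style; the exact query and return convention are assumptions):

from sqlalchemy import text


def get_next_submission_offset(connection, mbid):
    """Return the submission_offset to use for the next submission of mbid."""
    result = connection.execute(text("""
        SELECT max(submission_offset) AS max_offset
          FROM lowlevel
         WHERE gid = :mbid
    """), {"mbid": mbid})
    row = result.fetchone()
    if row is None or row["max_offset"] is None:
        # No submissions yet for this MBID, so the first one gets offset 0.
        return 0
    return row["max_offset"] + 1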


    return recordings_info


def count_lowlevel(mbid):

@alastair (Contributor) commented May 15, 2019

Can you use the EXPLAIN command in postgres (https://www.postgresql.org/docs/current/sql-explain.html) and paste the query plan for this query using both count() and max(submission_offset)? I'm not sure which one will be faster.
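For reference, the two candidates are roughly the following (a sketch in the project's text() style; the exact query text is an assumption, and either statement can be prefixed with EXPLAIN in a psql shell to get its plan):

from sqlalchemy import text

# Option 1: count the rows stored for the MBID directly.
count_query = text("""
    SELECT count(*)
      FROM lowlevel
     WHERE gid = :mbid
""")

# Option 2: take the highest stored offset; since offsets start at 0, the
# caller adds 1 to turn this into a count (or the next offset to assign).
max_offset_query = text("""
    SELECT max(submission_offset) AS max_offset
      FROM lowlevel
     WHERE gid = :mbid
""")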

@aidanlw17 (Author, Contributor) commented May 18, 2019

The query plan when using count():

[Screenshot: query plan output]

The query plan when using max(submission_offset) - we will also need to increment the offset:

[Screenshot: query plan output]

On my own, I'm not sure what to make of the difference between these.

@alastair (Contributor) commented May 20, 2019

Please always copy the output of programs as text; that way we can copy the contents if needed.

The cost for both of these queries is the same, so it doesn't look like there's a problem with which version we use.
I notice that your database is doing a sequential scan instead of an index scan; that's a bit strange, as there should be an index in place. It's possible that you have so few rows that postgres decided not to use the index.

For example, mine shows:

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=12.65..12.66 rows=1 width=8)
   ->  Bitmap Heap Scan on lowlevel  (cost=4.18..12.64 rows=4 width=0)
         Recheck Cond: (gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc'::uuid)
         ->  Bitmap Index Scan on gid_ndx_lowlevel  (cost=0.00..4.18 rows=4 width=0)
               Index Cond: (gid = 'bbe7bd1f-61c7-404d-811e-2bf42ca4b9cc'::uuid)
(5 rows)

Can you do the following things:

  • Copy the output of \d lowlevel from a psql shell, in particular the section that starts with Indexes:, to verify that you have the necessary indexes.
  • Run ANALYZE lowlevel; to refresh the statistics that postgres uses to plan queries on this table.

@aidanlw17 (Author, Contributor) commented May 20, 2019

I believe my table didn't have the index because I was switching between branches and the other branches for #339 and #340 don't have the index migration yet. When I switch between working on each branch I typically reinitialize the database using ./develop.sh run --rm webserver python2 manage.py init-db --force and import archived data. When I got these query plans in a psql session I must have been using the database as it was updated for one of the earlier steps.

A question I have about this: is there a better way to update my database with the migrations that doesn't require me to reinitialize it and import my archived data? I have also been wondering what the process is for making these changes on the actual database for acousticbrainz.org.

Anyway, I reinitialized my database once on this branch; my lowlevel table now shows the index, and my query plan matches yours:

                                          Table "public.lowlevel"
      Column       |           Type           | Collation | Nullable |               Default                
-------------------+--------------------------+-----------+----------+--------------------------------------
 id                | integer                  |           | not null | nextval('lowlevel_id_seq'::regclass)
 gid               | uuid                     |           | not null | 
 build_sha1        | text                     |           | not null | 
 lossless          | boolean                  |           |          | false
 submitted         | timestamp with time zone |           |          | now()
 gid_type          | gid_type                 |           | not null | 
 submission_offset | integer                  |           | not null | 
Indexes:
    "lowlevel_pkey" PRIMARY KEY, btree (id)
    "build_sha1_ndx_lowlevel" btree (build_sha1)
    "gid_ndx_lowlevel" btree (gid)
    "gid_type_ndx_lowlevel" btree (gid_type)
    "lossless_ndx_lowlevel" btree (lossless)
    "submission_offset_ndx_lowlevel" btree (submission_offset)
    "submitted_ndx_lowlevel" btree (submitted)
Referenced by:
    TABLE "highlevel" CONSTRAINT "highlevel_fk_lowlevel" FOREIGN KEY (id) REFERENCES lowlevel(id)
    TABLE "lowlevel_json" CONSTRAINT "lowlevel_json_fk_lowlevel" FOREIGN KEY (id) REFERENCES lowlevel(id)
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Aggregate  (cost=12.65..12.66 rows=1 width=8)
   ->  Bitmap Heap Scan on lowlevel  (cost=4.18..12.64 rows=4 width=0)
         Recheck Cond: (gid = 'd1a0c57a-9b5c-4869-b039-629e845583fd'::uuid)
         ->  Bitmap Index Scan on gid_ndx_lowlevel  (cost=0.00..4.18 rows=4 width=0)
               Index Cond: (gid = 'd1a0c57a-9b5c-4869-b039-629e845583fd'::uuid)

@@ -296,14 +296,27 @@ def add_empty_lowlevel(mbid, lossless, date):
    gid_type = gid_types.GID_TYPE_MSID

@alastair (Contributor) commented May 20, 2019

Would it be possible to use db.data.write_low_level for the body of this method? E.g., we create the empty data object and then call write_low_level to write it to the database, instead of duplicating this code here. I guess we added it here ourselves to skip a few tests. I think it makes sense to unify this code if possible.

@aidanlw17 (Author, Contributor) commented May 20, 2019

I've moved the get_submission_offset function to the top level of the db.data module and renamed it, more appropriately, to get_next_submission_offset. I'm looking into using db.data.write_low_level entirely here, but one issue I'm finding is that this function requires a unique sha256 field. In db.data.write_low_level the data_sha256 field is found by hashing the encoded data_json, but data_json will be the same for every item added by this function. I know I can construct the data before passing it to db.data.write_low_level like so:

data = {"data": {}, "metadata": {"audio_properties": {"lossless": lossless}, "version": {"essentia_build_sha": "sha1"}}}

But how can I ensure that the sha256 will be unique for each item added by this function? Or how can I differentiate the data field for each one when they need to be empty?
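For context, the clash comes from the hashing step described above. A rough sketch of the idea (not the exact write_low_level code):

import hashlib
import json

lossless = False
# The empty document from the comment above; every call builds the same dict.
data = {"data": {},
        "metadata": {"audio_properties": {"lossless": lossless},
                     "version": {"essentia_build_sha": "sha1"}}}

# Identical documents serialize to identical JSON and therefore identical
# digests, which clashes with a unique data_sha256 column.
data_json = json.dumps(data, sort_keys=True, separators=(",", ":"))
data_sha256 = hashlib.sha256(data_json.encode("utf-8")).hexdigest()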

aidanlw17 and others added 8 commits May 10, 2019
Remove nullable, add index for submission offset
This commit contains the migrations for the following:

- Adds a NOT NULL constraint on submission_offset in lowlevel.

- Creates an index on submission_offset in lowlevel to improve
  query performance.
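Conceptually, the migration amounts to something like the following sketch (the real change ships as an SQL update file; the index name matches the \d lowlevel output shown earlier in this thread, and the statement wording is otherwise assumed):

from sqlalchemy import text

import db

# Assumes db.engine has already been initialised.
with db.engine.begin() as connection:
    connection.execute(text(
        "ALTER TABLE lowlevel ALTER COLUMN submission_offset SET NOT NULL"))
    connection.execute(text(
        "CREATE INDEX submission_offset_ndx_lowlevel "
        "ON lowlevel (submission_offset)"))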
Use submission offset in endpoint queries
This commit uses the submission offset column in lowlevel to make
improvements in the following queries:

- Selects entry based on submission offset in high and lowlevel
  queries for a single recording.

- Simplifies the attempt at a single query method used in PR #337
  for returning lowlevel data of multiple recordings.

- Uses two queries to return highlevel data for multiple
  recordings.
Adds SQL update for removing nullable and adding index
This file should have belonged to commit e5d2aad.
Bulk get hl row skipping, remove exception
This commit makes some small changes to load_many_high_level:

- Embeds multiple defaultdict objects when structuring the
  data returned by queries.
- Adds an exception for no data found when metadata query returns
  None.
- Removes exceptions on rows so that missing data rows are skipped
  and the rest can be returned.
- Fixes incorrect indexing for elements of returned rows.

Change in bulk get endpoints:

- Removes unnecessary function get_data_for_multiple_recordings.
Adds unit testing for bulk get methods
In test_stats:

- Includes small fix for stats unit testing to account for the
  submission offset.

In test_data:

- Adds unit tests for load_many_low_level and load_many_high_level.

In test_core.py:

- Updates bulk get unit tests to use load_many_low_level and
  load_many_high_level.
Returns empty dict on no data found in load many
This commit removes the no data found error that is raised when
there are no existing submissions for any of the recordings in
load_many_low_level and load_many_high_level. Instead, these
functions will now return an empty dictionary under these
circumstances.

This commit also adjusts unit tests accordingly.
Add test for writing lowlevel offset
This commit includes the following changes:

- Creates a test to check the successful assignment of a submission
  offset when writing lowlevel data.

- Removes the gid::text cast when querying for the max submission
  offset; this cast is unnecessary.

- Creates an alias for max submission offset to improve clarity.
Reformat data files
Remove some unused code

@aidanlw17 force-pushed the aidanlw17:ab-406-3 branch from 5c6b601 to 73a9981 on Jun 3, 2019

aidanlw17 and others added 7 commits Jun 3, 2019
Make the submission offset index also include gid
This allows us to use a single index for lookups that select on both gid
and submission offset.
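A rough sketch of the reworked index (the composite index name is an assumption, and the PR ships this as a migration rather than inline code):

from sqlalchemy import text

import db

# Replace the single-column index with a composite (gid, submission_offset)
# index so lookups filtering on both columns can be served by one index.
with db.engine.begin() as connection:
    connection.execute(text(
        "DROP INDEX IF EXISTS submission_offset_ndx_lowlevel"))
    connection.execute(text(
        "CREATE INDEX gid_submission_offset_ndx_lowlevel "
        "ON lowlevel (gid, submission_offset)"))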
Use the bulk load methods to implement load single
For lowlevel and highlevel, we can remove duplicate SQL by using the
bulk get methods.
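The idea, roughly (a sketch; function names, signatures and the shape of the returned mapping are assumptions rather than the PR's exact code):

import db.data
import db.exceptions


def load_low_level(mbid, offset=0):
    """Load a single low-level document by delegating to the bulk loader."""
    results = db.data.load_many_low_level([(mbid, offset)])
    try:
        return results[mbid][offset]
    except KeyError:
        raise db.exceptions.NoDataFoundException(
            "Can't find low-level data for %s (offset %d)" % (mbid, offset))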
@alastair (Contributor) commented Jun 4, 2019

@paramsingh review here would be appreciated!

@paramsingh merged commit 786824f into metabrainz:master on Jun 4, 2019

1 check passed

Jenkins Build finished.