Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AB-404: Endpoint for users to select specific lowlevel features #347

Merged
merged 22 commits into from May 14, 2020

Conversation

@aidanlw17
Copy link
Contributor

@aidanlw17 aidanlw17 commented Jun 7, 2019

AB-404: Endpoint for users to select specific lowlevel features

This feature allows users to select only a subset of the lowlevel data available when they do not require the whole file, and should help to reduce loads as a result. It makes this possible with the following changes:

  • Create an endpoint, get_many_select_features which takes the required features as a parameter.
  • Parse the features to get a string of paths that access different features of lowlevel.data.
  • Query for these features in a similar fashion to other bulk get methods.
if not features_param:
raise webserver.views.api.exceptions.APIBadRequest("Missing `features` parameter")

selectable_features = ['lowlevel.average_loudness',
Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alastair I'm unsure of the best place to locate this hardcoded array of features. Maybe it should belong here, or perhaps it should be moved to the bottom of the file at the top level?

Loading

Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this could go at the top of this file in a constant

Loading


# Remove duplicates, preserving order
seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]
Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alastair I iterate over the features here twice, once to add them to this array to remove duplicates and then again to form the string. It would be better to only iterate once but I wasn't sure of a better approach with the need to remove duplicates?

Loading

features_string = parsed_features[0] + ' AS "' + raw_paths[0] + '", '
for feature, alias in zip(parsed_features[1:len(parsed_features)-1], raw_paths[1:len(raw_paths)-1]):
features_string += feature + ' AS "' + alias + '", '
features_string += parsed_features[len(parsed_features)-1] + ' AS "' + raw_paths[len(raw_paths)-1] + '"'
Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alastair I had trouble passing the list of features into the sql query without concatenating them all into a string like this, and obviously the above is hard to follow. Is it preferable to use str.join(), or a different method to concatenate?

Loading

if not features_param:
raise webserver.views.api.exceptions.APIBadRequest("Missing `features` parameter")

selectable_features = ['lowlevel.average_loudness',
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this could go at the top of this file in a constant

Loading

seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]

features_string = parsed_features[0] + ' AS "' + raw_paths[0] + '", '
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't be generating SQL in the webserver module. This should happen in the db module. Here you can parse the parameter and discard items which are not valid, then pass the list to db.data and generate the sql query there.

Loading


# Remove duplicates, preserving order
seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is now used in so many places we should turn it into a utility

Loading

# Build feature path
feature_path = 'llj.data'
for element in feature.split('.'):
feature_path += '->\'' + element + '\''
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remember that in python you can use " and ' for quotes. If you need to use a ' in a string, enclose it in ".
In this case, I would use string formatting: "llj.data -> '%s'" % (element)
Also look at "->".join(somearray) - this could be done as "->".join(["'%s'" % e for e in feature.split(".")]) and should be slightly faster.

Loading

Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative here, because the list of features is fixed, why don't you just make selectable_features a dictionary of feature -> query? This way we don't have to continually create these strings.

Loading

db/data.py Outdated
"""
with db.engine.connect() as connection:
result = connection.execute(
"SELECT ll.gid::text, ll.submission_offset::text, %(features)s "
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use sqlalchemy.text in all other queries here. You should use that here too.

Loading

Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and use """ strings so that we only need to quote at the beginning and end

Loading

db/data.py Outdated
"IN %(recordings)s" % {'recordings': tuple(recordings),
'features': features})

feature_names = result.keys()[2:]
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like using the keys of the results here, especially skipping some random fields. If we pass in a list of the feature aliases to this function then we can use that to select the columns from each row.

Loading

db/data.py Outdated
# Build dictionary of feature columns
for name in feature_names:
features_info[name] = row[name]
recordings_info[row['gid']][row['submission_offset']] = features_info
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's replicate the structure of the lowlevel document here instead of just adding it to a dictionary. This gives us the advantage of allowing clients who already access this data to update their query url and have the selection logic still work

Loading

'tonal.key_key',
'tonal.key_scale',
'tonal.tuning_frequency',
'tonal.tuning_equal_tempered_deviation']
Copy link
Contributor

@alastair alastair Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add metadata.tags as an option here, and also always include metadata.version and metadata.audio_properties in all responses.

Loading

webserver/views/api/v1/core.py Outdated Show resolved Hide resolved
Loading
webserver/views/api/v1/core.py Outdated Show resolved Hide resolved
Loading
@aidanlw17 aidanlw17 changed the title AB-404: Bulk endpoint for select lowlevel features AB-404: Endpoint for users to select only specific lowlevel features Jun 13, 2019
@aidanlw17 aidanlw17 changed the title AB-404: Endpoint for users to select only specific lowlevel features AB-404: Endpoint for users to select specific lowlevel features Jun 13, 2019
webserver/views/api/v1/core.py Show resolved Hide resolved
Loading
features = [("llj.data->'lowlevel'->'average_loudness'", "lowlevel.average_loudness", None),
("llj.data->'metadata'->'version'", "metadata.version", {}),
("llj.data->'metadata'->'audio_properties'", "metadata.audio_properties", {})]
load_many_select_features.assert_called_with(recordings, features)
Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alastair this assertion will fail until PR #349 adds the conversion to lowercase in the _parse_bulk_params method

Loading

Copy link
Contributor

@alastair alastair Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because you proactively added a test for this specific case, even though the previous PR hasn't been merged yet? This is interesting. Normally I would say that this specific PR should be standalone, and that you shouldn't write a test that intentionally fails, but you're right that it's a good idea to have this test eventually, and we need to keep track somehow that it's necessary.
One thing we could do here is mark this test function with @unittest.skip so that it's not run, and make a note that it has to be un-skipped when the other is merged.

Loading

Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I added it proactively because once we merge #349 we will need it anyways, so my thought was that this would be easier than adding a new PR just for the test once this one and #349 are already merged? I will use @unittest.skip and add a comment so we can keep track of it for now.

Loading

self.assertEqual(expected_result, resp.json)

# Once PR #349 adds conversion to lowercase in _parse_bulk_params, this skip should be removed
@unittest.skip('Conversion to lowercase in _parse_bulk_params not yet added via PR #349.')
Copy link
Contributor Author

@aidanlw17 aidanlw17 Jun 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alastair this now uses unitttest.skip until #349 is merged - sorry to forget about adding this until now.

Loading

db/data.py Outdated Show resolved Hide resolved
Loading
db/data.py Outdated
recordings_info = defaultdict(dict)
for row in result.fetchall():
# Build dictionary of feature columns
data = defaultdict(dict)
Copy link
Contributor

@alastair alastair Sep 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would break this indented section into a separate function (input row, output data), and then have some tests for it

Loading

webserver/views/api/v1/core.py Outdated Show resolved Hide resolved
Loading
@@ -0,0 +1,3 @@
def remove_duplicates(arr):
Copy link
Contributor

@alastair alastair Oct 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd name this file something else - when I see remove_duplicates I think of duplicate submissions. How about utils/container_utils.py? (since it's currently got utils for sets, perhaps we might add other things for other container types too)

Loading

@alastair
Copy link
Contributor

@alastair alastair commented Oct 28, 2019

@aidanlw17 is this finished now with the spelling changes you made? Can you rebase to fix the merge conflicts?

Loading

@aidanlw17
Copy link
Contributor Author

@aidanlw17 aidanlw17 commented Oct 28, 2019

@alastairp not quite finished, I had another set of commits with a refactor that I didn’t push yet. Sorry! I will get to it either today or tomorrow and then can rebase.

Loading

@aidanlw17
Copy link
Contributor Author

@aidanlw17 aidanlw17 commented Oct 30, 2019

@alastair This is now good to go! I refactored parts of db/data/load_many_individual_features into smaller functions and added unit tests for them. And rebased.

Loading

aidanlw17 added 13 commits Apr 27, 2020
- Adds the endpoint for select lowlevel features.
- Adds method to parse features provided in query string, provided
  that the features exist in the hardcoded set.
The following changes occur the method parse_select_features:

- Construct feature paths for use in postgres query
- Concatenate feature paths as one string with aliases
- Refactors to move construction of sql into the db module.
- Always ensures metadata.version and metadata.audio_properties
  are in the parsed list of features.
- Reconstructs lowlevel data format for the response.
- Uses sqlalchemy.text for building sql query
- Makes selectable features a dict mapping feature alias to query
  path, removes construction of query strings when parsing the
  features parameter
- Updates docstring
- Adds support for a default type in case the feature is not present
  in the lowlevel document

- Adds unit testing for load_many_select_features and bulk get
  select features endpoint
@alastair alastair force-pushed the select-features branch from f216690 to 667e795 Apr 27, 2020
@pep8speaks
Copy link

@pep8speaks pep8speaks commented Apr 27, 2020

Hello @aidanlw17! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 609:86: W291 trailing whitespace
Line 623:131: E501 line too long (135 > 130 characters)
Line 677:131: E501 line too long (184 > 130 characters)

Line 405:131: E501 line too long (176 > 130 characters)
Line 455:131: E501 line too long (137 > 130 characters)

Comment last updated at 2020-05-14 08:27:32 UTC

Loading

@alastair
Copy link
Contributor

@alastair alastair commented Apr 27, 2020

I just rebased on master to take into account some improvements in master. Working on final tests on this now

Loading

@alastair
Copy link
Contributor

@alastair alastair commented Apr 27, 2020

oh cool! I didn't realise that pep8speaks would talk on this too! Let's skip it for now - I still need to configure some things like line length, then I was planning on doing a single commit to fix all code layout in the whole project so we can start fresh

Loading

@alastair
Copy link
Contributor

@alastair alastair commented Apr 29, 2020

After some discussion in IRC, we decided to merge this functionality into the bulk lowlevel api, that is, /api/v1/low-level. I'll make this change and update the tests and documentation.

Loading

@alastair alastair merged commit 6377179 into metabrainz:master May 14, 2020
1 check passed
Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants