
AB-404: Endpoint for users to select specific lowlevel features #347

Open · wants to merge 12 commits into base: master

Conversation

@aidanlw17 (Contributor) commented Jun 7, 2019:

AB-404: Endpoint for users to select specific lowlevel features

This feature allows users to select only a subset of the lowlevel data available when they do not require the whole file, which should help to reduce load. It makes this possible with the following changes:

  • Create an endpoint, get_many_select_features, which takes the required features as a parameter.
  • Parse the features to get a string of paths that access different features of lowlevel.data.
  • Query for these features in a similar fashion to other bulk get methods.
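As a rough illustration of the parsing step described above (the constant name, helper name, and feature list here are assumptions for illustration, not the PR's exact code; the `feature1;feature2` parameter format is taken from the endpoint docs later in this thread):

```python
# Hypothetical sketch of parsing the `features` query parameter.
# SELECTABLE_FEATURES is an illustrative subset, not the full hardcoded set.
SELECTABLE_FEATURES = ['lowlevel.average_loudness', 'tonal.key_key', 'tonal.key_scale']

def parse_select_features(features_param):
    """Split the semicolon-separated parameter and keep only features
    that exist in the hardcoded set of selectable features."""
    requested = [f.strip() for f in features_param.split(';') if f.strip()]
    return [f for f in requested if f in SELECTABLE_FEATURES]
```

Unrecognised names are silently discarded here; the actual endpoint may instead raise an error for them.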
aidanlw17 added 4 commits Jun 6, 2019
Adds endpoint for select lowlevel features
- Adds the endpoint for select lowlevel features.
- Adds method to parse features provided in query string, provided
  that the features exist in the hardcoded set.
Parses to string of feature paths instead of array
The following changes occur in the method parse_select_features:

- Construct feature paths for use in postgres query
- Concatenate feature paths as one string with aliases
if not features_param:
    raise webserver.views.api.exceptions.APIBadRequest("Missing `features` parameter")

selectable_features = ['lowlevel.average_loudness',

@aidanlw17 (Author) commented Jun 7, 2019:

@alastair I'm unsure of the best place to locate this hardcoded array of features. Maybe it should belong here, or perhaps it should be moved to the bottom of the file at the top level?

@alastair (Contributor) commented Jun 7, 2019:

yes, this could go at the top of this file in a constant


# Remove duplicates, preserving order
seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]

@aidanlw17 (Author) commented Jun 7, 2019:

@alastair I iterate over the features here twice: once to add them to this array to remove duplicates, and then again to form the string. It would be better to iterate only once, but I wasn't sure of a better approach given the need to remove duplicates?

features_string = parsed_features[0] + ' AS "' + raw_paths[0] + '", '
for feature, alias in zip(parsed_features[1:len(parsed_features)-1], raw_paths[1:len(raw_paths)-1]):
    features_string += feature + ' AS "' + alias + '", '
features_string += parsed_features[len(parsed_features)-1] + ' AS "' + raw_paths[len(raw_paths)-1] + '"'

@aidanlw17 (Author) commented Jun 7, 2019:

@alastair I had trouble passing the list of features into the SQL query without concatenating them all into a string like this, and obviously the above is hard to follow. Is it preferable to use str.join(), or a different method of concatenation?
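For reference, a join-based version of the concatenation above (a sketch, assuming the same parallel lists of query paths and aliases as in the snippet under review):

```python
# Illustrative inputs matching the shapes used in the PR's snippet.
parsed_features = ["llj.data->'lowlevel'->'average_loudness'",
                   "llj.data->'tonal'->'key_key'"]
raw_paths = ['lowlevel.average_loudness', 'tonal.key_key']

# ', '.join places the separators, so there is no need to special-case
# the first and last elements as the explicit loop does.
features_string = ', '.join('%s AS "%s"' % (feature, alias)
                            for feature, alias in zip(parsed_features, raw_paths))
```

This also behaves sensibly for a single feature or an empty list, which the index-based version does not.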

seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]

features_string = parsed_features[0] + ' AS "' + raw_paths[0] + '", '

@alastair (Contributor) commented Jun 7, 2019:

you shouldn't be generating SQL in the webserver module. This should happen in the db module. Here you can parse the parameter and discard items which are not valid, then pass the list to db.data and generate the SQL query there.


# Remove duplicates, preserving order
seen = set()
parsed_features = [x for x in parsed_features if not (x in seen or seen.add(x))]

@alastair (Contributor) commented Jun 7, 2019:

since this is now used in so many places we should turn it into a utility
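A minimal sketch of such a utility (assuming the items are hashable; the unhashable `{}` defaults that come up later in this thread would need the list-based variant instead):

```python
from collections import OrderedDict

def remove_duplicates(items):
    """Return `items` with duplicates removed, preserving first-seen order.
    OrderedDict.fromkeys keeps one entry per key in insertion order, so
    this is a single pass and avoids the seen-set comprehension."""
    return list(OrderedDict.fromkeys(items))
```

The helper name is illustrative; the point is that the seen-set idiom can live in one place instead of being repeated at each call site.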

# Build feature path
feature_path = 'llj.data'
for element in feature.split('.'):
    feature_path += '->\'' + element + '\''

@alastair (Contributor) commented Jun 7, 2019:

Remember that in Python you can use both " and ' for quotes. If you need to use a ' in a string, enclose the string in ".
In this case, I would use string formatting: "llj.data -> '%s'" % (element)
Also look at "->".join(somearray) - this could be done as "->".join(["'%s'" % e for e in feature.split(".")]) and should be slightly faster.
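Putting the suggestion together, a sketch of building the full JSON path for one feature (the `llj.data` prefix is taken from the snippet above):

```python
feature = 'lowlevel.average_loudness'

# "->".join inserts the arrow between the quoted path elements, and the
# llj.data prefix is prepended once at the front.
feature_path = "llj.data->" + "->".join("'%s'" % e for e in feature.split("."))
# feature_path == "llj.data->'lowlevel'->'average_loudness'"
```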

@alastair (Contributor) commented Jun 7, 2019:

an alternative here, because the list of features is fixed, why don't you just make selectable_features a dictionary of feature -> query? This way we don't have to continually create these strings.
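A sketch of that approach (the entries shown are an illustrative subset of the selectable features, not the full list):

```python
# Map each selectable feature alias directly to its precomputed query
# path, so the path strings are built once instead of on every request.
SELECTABLE_FEATURES = {
    'lowlevel.average_loudness': "llj.data->'lowlevel'->'average_loudness'",
    'tonal.key_key': "llj.data->'tonal'->'key_key'",
    'tonal.key_scale': "llj.data->'tonal'->'key_scale'",
}

def query_path(feature):
    """Return the query path for a feature, or None if it is not selectable."""
    return SELECTABLE_FEATURES.get(feature)
```

Validation then becomes a dictionary lookup, and the webserver never needs to assemble path strings at all.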

db/data.py Outdated
"""
with db.engine.connect() as connection:
result = connection.execute(
"SELECT ll.gid::text, ll.submission_offset::text, %(features)s "

@alastair (Contributor) commented Jun 7, 2019:

we use sqlalchemy.text in all other queries here. You should use that here too.

@alastair (Contributor) commented Jun 7, 2019:

and use """ strings so that we only need to quote at the beginning and end

db/data.py Outdated
"IN %(recordings)s" % {'recordings': tuple(recordings),
'features': features})

feature_names = result.keys()[2:]

@alastair (Contributor) commented Jun 7, 2019:

I don't like using the keys of the results here, especially skipping some random fields. If we pass in a list of the feature aliases to this function then we can use that to select the columns from each row.

db/data.py Outdated
# Build dictionary of feature columns
for name in feature_names:
    features_info[name] = row[name]
recordings_info[row['gid']][row['submission_offset']] = features_info

@alastair (Contributor) commented Jun 7, 2019:

Let's replicate the structure of the lowlevel document here instead of just adding it to a flat dictionary. This gives us the advantage of allowing clients who already access this data to update their query URL and have the selection logic still work.
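One way to sketch that reconstruction (the helper name and flat-row shape are assumptions): split each dotted column alias and rebuild the nested dictionaries of the lowlevel document.

```python
def rebuild_lowlevel_structure(flat_row):
    """Turn a flat mapping of dotted aliases to values, e.g.
    {'lowlevel.average_loudness': 0.93}, back into the nested layout of a
    lowlevel document: {'lowlevel': {'average_loudness': 0.93}}."""
    document = {}
    for dotted_key, value in flat_row.items():
        parts = dotted_key.split('.')
        node = document
        for part in parts[:-1]:
            # setdefault walks/creates intermediate dicts for each path segment.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return document
```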

'tonal.key_key',
'tonal.key_scale',
'tonal.tuning_frequency',
'tonal.tuning_equal_tempered_deviation']

@alastair (Contributor) commented Jun 7, 2019:

let's also add metadata.tags as an option here, and also always include metadata.version and metadata.audio_properties in all responses.

:query features: *Required.* A list of features to be returned for each mbid.
Takes the form `feature1;feature2`.

@alastair (Contributor) commented Jun 7, 2019:

we should list the features here. One way you could do this is like I did for the max number of bulk queries:

You can specify up to :py:const:`~webserver.views.api.v1.core.MAX_ITEMS_PER_BULK_REQUEST` MBIDs in a request.

@bp_core.route("/low-level/select", methods=["GET"])
@crossdomain()
def get_many_select_features():
    """Get a specified subset of low-level data for many recordings at once.

@alastair (Contributor) commented Jun 7, 2019:

make sure we also say here what happens if no data exists for an MBID/offset (we skip it, and it is not returned in the data). Do we have that for the other bulk queries? If not, we should add it there too.

aidanlw17 added 5 commits Jun 8, 2019
Build sql in db module, reconstruct ll.data structure
- Refactors to move construction of sql into the db module.
- Always ensures metadata.version and metadata.audio_properties
  are in the parsed list of features.
- Reconstructs lowlevel data format for the response.
- Uses sqlalchemy.text for building sql query
Makes selectable features a dict with query paths
- Makes selectable features a dict mapping feature alias to query
  path, removes construction of query strings when parsing the
  features parameter
- Updates docstring

@aidanlw17 aidanlw17 changed the title AB-404: Bulk endpoint for select lowlevel features AB-404: Endpoint for users to select only specific lowlevel features Jun 13, 2019

@aidanlw17 aidanlw17 changed the title AB-404: Endpoint for users to select only specific lowlevel features AB-404: Endpoint for users to select specific lowlevel features Jun 13, 2019

aidanlw17 added 2 commits Jun 17, 2019
Nonexistent features default to tracked default type
- Adds support for a default type in case the feature is not present
  in the lowlevel document

- Adds unit testing for load_many_select_features and bulk get
  select features endpoint

# Remove duplicates, preserving order
ret = []
return [x for x in parsed_features if not (x in ret or ret.append(x))]

@aidanlw17 (Author) commented Jun 17, 2019:

@alastair although we added the remove-duplicates utility, that method did not work once I added support for default types: some of the default types were {}, which is unhashable, so it didn't work with the usage of set(). Is this a suitable method instead?

@alastair (Contributor) commented Jun 17, 2019:

OK, if you need to use the same functionality but in a different form then that's fine to duplicate it

features = [("llj.data->'lowlevel'->'average_loudness'", "lowlevel.average_loudness", None),
            ("llj.data->'metadata'->'version'", "metadata.version", {}),
            ("llj.data->'metadata'->'audio_properties'", "metadata.audio_properties", {})]
load_many_select_features.assert_called_with(recordings, features)

@aidanlw17 (Author) commented Jun 17, 2019:

@alastair this assertion will fail until PR #349 adds the conversion to lowercase in the _parse_bulk_params method

@alastair (Contributor) commented Jun 17, 2019:

Is this because you proactively added a test for this specific case, even though the previous PR hasn't been merged yet? This is interesting. Normally I would say that this specific PR should be standalone, and that you shouldn't write a test that intentionally fails, but you're right that it's a good idea to have this test eventually, and we need to keep track somehow that it's necessary.
One thing we could do here is mark this test function with @unittest.skip so that it's not run, and make a note that it has to be un-skipped when the other is merged.

@aidanlw17 (Author) commented Jun 18, 2019:

Yes, I added it proactively because once we merge #349 we will need it anyway; my thought was that this would be easier than opening a new PR just for the test after this one and #349 are merged. I will use @unittest.skip and add a comment so we can keep track of it for now.

self.assertEqual(expected_result, resp.json)

# Once PR #349 adds conversion to lowercase in _parse_bulk_params, this skip should be removed
@unittest.skip('Conversion to lowercase in _parse_bulk_params not yet added via PR #349.')

@aidanlw17 (Author) commented Jun 24, 2019:

@alastair this now uses unittest.skip until #349 is merged - sorry for forgetting to add this until now.
