
AB-412: Compute recording similarity metrics on database #353

Closed · wants to merge 121 commits

Conversation


@aidanlw17 (Contributor) commented Jun 19, 2019

The first step in adding similarity to AB is to compute metrics for measuring similarity between recordings, since these measurements can later be indexed. This PR adds classes for the metrics and scripts to compute them for the entire database of recordings.

Database Migrations

'similarity' Table
  • Holds computed vectors for each metric, for each recording in the database.
  • These vectors will be added to the indices for querying.
'similarity_metrics' Table
  • Holds metadata about each metric that has been computed for the database.
'similarity_stats' Table
  • Holds statistical information: the mean and standard deviation of each computed metric.
  • These statistics may be useful for representations at a later stage in the project.
    (A rough schema sketch for all three tables follows this list.)
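
For orientation, here is a rough sketch of what these three tables might look like. The column names and types are illustrative guesses, not the actual migration from this PR.

```python
# Illustrative schema sketch only: column names and types are assumptions,
# not this PR's actual migration files.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://ab:ab@localhost/acousticbrainz")  # assumed DSN

schema = """
CREATE TABLE similarity (
    id INTEGER PRIMARY KEY  -- references lowlevel.id; plus one vector column per metric
);

CREATE TABLE similarity_metrics (
    metric      TEXT PRIMARY KEY,  -- e.g. 'mfccs', 'bpm'
    is_hybrid   BOOLEAN,
    description TEXT
);

CREATE TABLE similarity_stats (
    metric  TEXT,                  -- references similarity_metrics.metric
    means   DOUBLE PRECISION[],
    stddevs DOUBLE PRECISION[]
);
"""

with engine.begin() as connection:
    connection.execute(text(schema))
```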

Metrics Classes

  • Adds classes that hold information about each metric, plus methods for creating, deleting,
    and computing the metric for a recording.
  • Includes formation of hybrid metrics, useful at a later stage in the project. (A minimal
    class sketch follows this list.)
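
As a rough illustration of the class shape being described (class and method names here are guesses, not the PR's exact API):

```python
# Sketch of the metric-class idea; names and structure are assumptions,
# not the PR's exact API.
import math


class BaseMetric(object):
    name = ''         # vector column name in the similarity table
    description = ''

    def get_data(self, id):
        """Fetch the raw feature data for a lowlevel.id (left abstract here)."""
        raise NotImplementedError

    def transform(self, data):
        """Turn raw feature data into a fixed-length similarity vector."""
        raise NotImplementedError


class BpmMetric(BaseMetric):
    name = 'bpm'
    description = 'BPM'

    def transform(self, data):
        bpm = data['rhythm']['bpm']
        # log-scale so that double/half tempo land a constant distance apart
        return [math.log(bpm, 2)] if bpm > 0 else [0.0]
```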

Incremental Computation of Metrics

  • After the similarity table has a row for every recording, added with the CLI command
    ./develop.sh run --rm webserver python2 manage.py similarity init, a metric can be
    computed incrementally across the entire database with the following command:
    ./develop.sh run --rm webserver python2 manage.py similarity add-metric <metric_name>
    (both commands are sketched after this list).
  • For now, the metric must exist within the list of base metrics; later, hybrid metrics
    will relax this restriction.
  • Also includes CLI commands for adding and deleting hybrid metrics, useful at a later
    stage in the project.
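
manage.py in this repository is built on click, so the wiring presumably resembles the sketch below; the command names mirror those above, but the db.similarity helper names are assumptions:

```python
# Hypothetical sketch of the CLI wiring; db.similarity helper names are guesses.
import click

import db.similarity


@click.group(name='similarity')
def cli():
    """Manage similarity metrics on the database."""
    pass


@cli.command(name='init')
def init():
    """Add a row to the similarity table for every recording."""
    db.similarity.init_similarity()  # hypothetical helper


@cli.command(name='add-metric')
@click.argument('metric_name')
def add_metric(metric_name):
    """Incrementally compute METRIC_NAME for the entire database."""
    db.similarity.add_metric(metric_name)  # hypothetical helper
```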

Much of this work is adapted from the following thesis by Philip Tovstogan:
https://doi.org/10.5281/zenodo.1479769

philtgun added 30 commits May 14, 2018
@alastair (Contributor) left a comment

Answering some questions and making some comments about methods that we no longer need. As I said in IRC, I'll remove these unneeded methods myself.


@@ -593,6 +593,107 @@ def load_many_high_level(recordings, map_classes=False):
return dict(recordings_info)


def get_lowlevel_metric_feature(id, path):
@alastair (Contributor) Aug 9, 2019

since we switched to getting data for all items at once, is this needed any more?

@aidanlw17 (Contributor, Author) Aug 15, 2019

It's possible to remove this function, the associated tests, and the get_data function in the LowLevelMetric class. However, we discussed leaving this functionality in case we use it in the future. We have the git history, and I think it can be removed without too much concern.


return result.fetchone()["id"]


def check_for_submission(id):
@alastair (Contributor) Aug 9, 2019

doesn't seem to be used any more


return row["data"]


def get_lowlevel_id(mbid, offset):
@alastair (Contributor) Aug 9, 2019

used only in submit_similarity_by_mbid, but that doesn't seem to be used anywhere


db/similarity.py Outdated
with db.engine.connect() as connection:
metrics = similarity.utils.init_metrics()
sim_count = count_similarity()
print("Processed {} / {} ({:.3f}%)".format(sim_count,
@alastair (Contributor) Aug 9, 2019

Replace with logger, though typically we shouldn't be doing user-facing output in the db module. If possible, look at how to introduce this into the management command.


query = text("""
SELECT highlevel, jsonb_object_agg(model, data) AS data
FROM highlevel_model
WHERE highlevel = :id
@alastair (Contributor) Aug 9, 2019

should only check active highlevel models
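
A guess at what that might look like: join against the model table and filter on its status. The status column and the 'show' value are assumptions based on how active models are flagged elsewhere in AcousticBrainz.

```python
# Sketch only: assumes model.status exists with a 'show' value for active models.
from sqlalchemy import text

query = text("""
    SELECT hlm.highlevel, jsonb_object_agg(m.model, hlm.data) AS data
      FROM highlevel_model hlm
      JOIN model m ON m.id = hlm.model
     WHERE hlm.highlevel = :id
       AND m.status = 'show'
  GROUP BY hlm.highlevel
""")
```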


db/similarity.py Outdated
"""Inserts a row of similarity vectors for a given lowlevel.id into
the similarity table.
Args: id: lowlevel.id to be submitted
@alastair (Contributor) Aug 9, 2019

not all args are here
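
For illustration, a docstring covering every argument might look like the sketch below; the function name and the extra parameter are inferred from context, since the diff does not show the full signature:

```python
def submit_similarity_by_id(id, metrics=None):  # signature partly assumed
    """Inserts a row of similarity vectors for a given lowlevel.id into
    the similarity table.

    Args:
        id: lowlevel.id to be submitted.
        metrics: list of initialized metric objects to compute for the
            recording (assumed parameter; see the full diff for the
            actual signature).
    """
```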


path = ''
indices = None

def get_data(self, id):
@alastair (Contributor) Aug 9, 2019

Sorry that I didn't reply to this. One easy way would be to just mock the call to db.data; as long as that method also has a test, this would be enough for the test.
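
A minimal sketch of that suggestion, assuming a hypothetical MfccsMetric whose get_data delegates to db.data.get_lowlevel_metric_feature:

```python
# Sketch of the suggested test; MfccsMetric and the exact call being
# patched are assumptions based on the surrounding discussion.
import unittest

import mock  # the python2-era mock package

from similarity.metrics import MfccsMetric  # hypothetical import path


class LowLevelMetricTestCase(unittest.TestCase):

    @mock.patch('db.data.get_lowlevel_metric_feature')
    def test_get_data(self, get_feature):
        # get_data should just pass through whatever db.data returns
        get_feature.return_value = [1.0, 2.0, 3.0]
        metric = MfccsMetric()
        self.assertEqual(metric.get_data(42), [1.0, 2.0, 3.0])
        get_feature.assert_called_with(42, metric.path)
```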


means = None
stddevs = None

def calculate_stats(self):
@alastair (Contributor) Aug 9, 2019

I think it's good that we moved the sql into db.similarity, thanks


name = 'moods'
description = 'Moods'
models = [
11, # mood_happy
@alastair (Contributor) Aug 9, 2019

My idea was always to use the name of the model to identify it. Is there any reason that you can see why we can't select from the model table by the name instead of the id?
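
That is, something along these lines; the model table's name column is assumed here to be called model:

```python
# Sketch: identify models by name instead of hard-coded ids.
from sqlalchemy import text

models = ['mood_happy', 'mood_sad', 'mood_aggressive']

query = text("""
    SELECT id, model
      FROM model
     WHERE model = ANY(:names)
""")
# result = connection.execute(query, {'names': models})
```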


metric = metric_cls()
metrics.append(metric)
try:
metric.calculate_stats()
@alastair (Contributor) Aug 9, 2019

We talked about moving this statistics calculation into an init method.
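
Roughly like this sketch, with assumed names:

```python
# Sketch of the suggestion: compute stats when a metric is constructed,
# rather than looping over metrics afterwards. Names are assumptions.
class BaseMetric(object):
    def __init__(self):
        self.calculate_stats()

    def calculate_stats(self):
        # fetch means/stddevs for this metric from similarity_stats
        pass

# the calling loop then no longer needs an explicit calculate_stats step:
# metrics = [metric_cls() for metric_cls in BASE_METRICS]
```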


@aidanlw17 (Contributor, Author) left a comment

@alastair I've fixed up the things that you mentioned in your last review. When changing to use model names I noticed a bug in the query; fixing it and adding another join to use the model names took some restructuring. It also slowed the query down a bit, I think; we should rebuild the metrics again on bono to check.

This is ready for review again and for those unneeded functions to be removed! Thanks.




def get_highlevel_models(id):
# Get highlevel model data for a specified id.
try:
@aidanlw17 (Contributor, Author) Aug 15, 2019

Added the docstring.


db.similarity_stats.assign_stats(metric)

sim_count = count_similarity()
current_app.logger.info("Processed {} / {} ({:.3f}%)".format(sim_count,
@aidanlw17 (Contributor, Author) Aug 15, 2019

@alastair I looked for a solution to echo this information in the manage.py command rather than the db module, but wasn't very successful. Since it all happens in one loop, I'm not sure how we could send info back intermittently? Replaced with logger for now.
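
One pattern that would keep all user-facing output in manage.py (a sketch, not the PR's code): turn the db-side loop into a generator that yields progress after each batch, and let the management command consume and echo it. The helper names below are assumptions.

```python
# Sketch only: helper names (count_lowlevel, process_batch) are assumptions.

# --- db/similarity.py ---------------------------------------------------
def add_metrics_incremental(batch_size=10000):
    """Yield (done, total) after each batch so the caller owns the output."""
    total = count_lowlevel()        # assumed existing helper
    done = count_similarity()
    while done < total:
        process_batch(batch_size)   # hypothetical: compute one batch of rows
        done = count_similarity()
        yield done, total

# --- manage.py ----------------------------------------------------------
import click
import db.similarity

@cli.command(name='add-metrics')   # 'cli' is the click group from manage.py
def add_metrics():
    for done, total in db.similarity.add_metrics_incremental():
        click.echo("Processed {} / {} ({:.3f}%)".format(
            done, total, done * 100.0 / total))
```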


@alastair (Contributor) commented Sep 17, 2020

The eval branch already contains all of this work, so I'm going to close this PR and continue any necessary work in #364

alastair@apmbp:~/code/acousticbrainz-server$ git checkout eval
Already on 'eval'
Your branch is up to date with 'aidanlw17/eval'.
alastair@apmbp:~/code/acousticbrainz-server$ git merge aidanlw17/compute-metrics
Already up to date.


@alastair closed this Sep 17, 2020