
AB-412: Compute recording similarity metrics on database #353

Open · wants to merge 118 commits into base: master

Conversation

@aidanlw17 (Contributor) commented Jun 19, 2019


The first step in adding similarity to AB is to compute metrics for measuring similarity between recordings, since these measurements can later be indexed. This PR will add classes for the metrics and scripts to compute them for the entire database of recordings.

Database Migrations

'similarity' Table
  • Holds computed vectors for each metric, for each recording in the database.
  • These vectors will be added to the indices for querying.
'similarity_metrics' Table
  • Holds metadata about each metric that has been computed for the database.
'similarity_stats' Table
  • Holds statistical information: the mean and standard deviation of each computed metric.
  • These statistics may be useful for representations at a later stage in the project.
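The three tables above can be sketched roughly as follows. This is a hypothetical illustration using SQLite for portability (the real migrations target PostgreSQL, and the column details beyond the table names are assumptions, not the PR's actual schema):

```python
import sqlite3

# Hypothetical sketch of the three migration tables. Table names come from
# the PR description; the columns are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE similarity (
    id INTEGER PRIMARY KEY,      -- references lowlevel.id (assumed)
    mfccs BLOB,                  -- one vector column per metric (assumed)
    moods BLOB
);
CREATE TABLE similarity_metrics (
    metric TEXT PRIMARY KEY,     -- e.g. 'mfccs', 'moods'
    is_hybrid INTEGER DEFAULT 0, -- hybrid metrics arrive later in the project
    description TEXT
);
CREATE TABLE similarity_stats (
    metric TEXT PRIMARY KEY,
    means TEXT,                  -- per-dimension mean (serialized)
    stddevs TEXT                 -- per-dimension standard deviation (serialized)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # → ['similarity', 'similarity_metrics', 'similarity_stats']
```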

Metrics Classes

  • Adds classes with information about each metric and methods for creating, deleting, and
    computing it for a recording.
  • Includes the formation of hybrid metrics, useful at a later stage in the project.
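A minimal sketch of what such a metric class hierarchy might look like. Class and attribute names here are illustrative, not the PR's actual code; the hybrid metric simply concatenates the vectors of its component metrics:

```python
class BaseMetric:
    """Each metric knows how to compute its vector for one recording."""
    name = None

    def transform(self, data):
        raise NotImplementedError

class MfccsMetric(BaseMetric):
    name = "mfccs"

    def transform(self, data):
        # Assumed input shape: lowlevel document with an 'mfcc' mean vector.
        return list(data["lowlevel"]["mfcc"]["mean"])

class HybridMetric(BaseMetric):
    """A hybrid metric concatenates the vectors of several base metrics."""
    def __init__(self, name, metrics):
        self.name = name
        self.metrics = metrics

    def transform(self, data):
        vector = []
        for metric in self.metrics:
            vector.extend(metric.transform(data))
        return vector
```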

Incremental Computation of Metrics

  • After all recordings have been added to the similarity table with the CLI command
    ./develop.sh run --rm webserver python2 manage.py similarity init, a metric
    can be computed incrementally over the entire database with:
    ./develop.sh run --rm webserver python2 manage.py similarity add-metric <metric_name>
  • For now, the metric must exist in the list of base metrics; later, hybrid metrics will
    change this.
  • Also includes CLI commands for adding and deleting hybrid metrics, useful at a later stage
    in the project.
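The incremental computation described above can be sketched as a batched loop that always resumes from the highest lowlevel.id processed so far. This is a hedged sketch under assumed function names, not the PR's actual implementation:

```python
# Hypothetical incremental metric computation: process recordings in
# fixed-size batches keyed on the last id seen, so the job can resume.
def add_metric_incrementally(fetch_batch, compute, submit, batch_size=10000):
    """fetch_batch(last_id, n) -> up to n (id, data) pairs with id > last_id,
    in ascending id order. Returns the total number of recordings processed."""
    last_id = 0
    processed = 0
    while True:
        batch = fetch_batch(last_id, batch_size)
        if not batch:
            break
        for rec_id, data in batch:
            submit(rec_id, compute(data))
        last_id = batch[-1][0]      # resume point for the next batch
        processed += len(batch)
    return processed
```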

Much of this work is adapted from the following thesis by Philip Tovstogan:
https://doi.org/10.5281/zenodo.1479769

philtgun added 30 commits May 14, 2018
@alastair (Contributor) left a comment

Answering some questions and making some comments about methods that we no longer need. As I said on IRC, I'll remove these unneeded methods myself.

@@ -593,6 +593,107 @@ def load_many_high_level(recordings, map_classes=False):
return dict(recordings_info)


def get_lowlevel_metric_feature(id, path):

alastair (Contributor) commented Aug 9, 2019:

since we switched to getting data for all items at once, is this needed any more?

aidanlw17 (Author, Contributor) commented Aug 15, 2019:

It's possible to remove this function, the associated tests, and the get_data function in the LowLevelMetric class. However, we discussed leaving this functionality in case we use it in the future. We have the git history, so I think it can be removed without too much concern.

return result.fetchone()["id"]


def check_for_submission(id):

alastair (Contributor) commented Aug 9, 2019:

doesn't seem to be used any more

return row["data"]


def get_lowlevel_id(mbid, offset):

alastair (Contributor) commented Aug 9, 2019:

used only in submit_similarity_by_mbid, but that doesn't seem to be used anywhere

with db.engine.connect() as connection:
metrics = similarity.utils.init_metrics()
sim_count = count_similarity()
print("Processed {} / {} ({:.3f}%)".format(sim_count,

alastair (Contributor) commented Aug 9, 2019:

replace with logger, though typically we shouldn't be doing user-facing output in the db module. if possible look at how to introduce this into the management command.

query = text("""
SELECT highlevel, jsonb_object_agg(model, data) AS data
FROM highlevel_model
WHERE highlevel = :id

alastair (Contributor) commented Aug 9, 2019:

should only check active highlevel models
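One way to act on this review comment would be to join against the model table and filter on a status flag, so only active models are aggregated. This is a hypothetical revision of the quoted query: the `status` column and its `'show'` value are assumptions about the schema, not confirmed by this PR.

```python
# Hypothetical revision of the quoted query: join the `model` table and
# keep only active models. Column names and the 'show' status value are
# assumptions for illustration.
QUERY = """
    SELECT highlevel, jsonb_object_agg(model.model, highlevel_model.data) AS data
      FROM highlevel_model
      JOIN model ON model.id = highlevel_model.model
     WHERE highlevel = :id
       AND model.status = 'show'
  GROUP BY highlevel
"""
```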

"""Inserts a row of similarity vectors for a given lowlevel.id into
the similarity table.
Args: id: lowlevel.id to be submitted

alastair (Contributor) commented Aug 9, 2019:

not all args are here

path = ''
indices = None

def get_data(self, id):

alastair (Contributor) commented Aug 9, 2019:

sorry that I didn't reply to this - one easy way would be to just mock the call to db.data, as long as that method also has a test, then this would be enough for the test
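The mocking approach suggested here might look like the following. All function names are illustrative stand-ins for the real code; the point is that the db call is patched so get_data can be tested in isolation:

```python
import unittest
from unittest import mock

def db_get_lowlevel_metric_feature(id, path):
    # Stand-in for the real db-layer function; would hit the database.
    raise RuntimeError("should be mocked in tests")

def get_data(id):
    # Stand-in for LowLevelMetric.get_data: delegates to the db layer.
    return db_get_lowlevel_metric_feature(id, "data->'lowlevel'->'mfcc'")

class GetDataTest(unittest.TestCase):
    @mock.patch(__name__ + ".db_get_lowlevel_metric_feature")
    def test_get_data_delegates(self, mock_feature):
        # As long as the db function has its own test, mocking it here
        # is enough to cover get_data.
        mock_feature.return_value = [0.1, 0.2]
        self.assertEqual(get_data(42), [0.1, 0.2])
        mock_feature.assert_called_with(42, "data->'lowlevel'->'mfcc'")
```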

means = None
stddevs = None

def calculate_stats(self):

alastair (Contributor) commented Aug 9, 2019:

I think it's good that we moved the sql into db.similarity, thanks

name = 'moods'
description = 'Moods'
models = [
11, # mood_happy

alastair (Contributor) commented Aug 9, 2019:

My idea was always to use the name of the model to identify it. is there any reason that you can see why we can't select from the model table by the name instead of the id?
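The suggestion here, sketched out: store model names on the metric class and resolve them to ids with one query at initialization. The class shape and query are hedged illustrations (the model names are real AcousticBrainz high-level models, but the table columns are assumptions):

```python
# Hypothetical version of the metric class identifying models by name
# rather than hard-coded ids such as `11  # mood_happy`.
class MoodsMetric:
    name = "moods"
    description = "Moods"
    model_names = ["mood_happy", "mood_sad", "mood_aggressive",
                   "mood_relaxed", "mood_party"]

# Names could be resolved to ids once, at metric initialization
# (column names are assumptions for illustration):
RESOLVE_QUERY = """
    SELECT id, model
      FROM model
     WHERE model = ANY(:names)
"""
```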

metric = metric_cls()
metrics.append(metric)
try:
metric.calculate_stats()

alastair (Contributor) commented Aug 9, 2019:

we talked about moving this statistics calculation into an init method

@aidanlw17 (Author, Contributor) left a comment

@alastair I've fixed up the things you mentioned in your last review. When changing to use model names I noticed a bug in the query; fixing it and adding another join to use the model names took some restructuring, and I think it slowed things down a bit. We should rebuild on bono to check.

This is ready for review again and for those unneeded functions to be removed! Thanks.



def get_highlevel_models(id):
# Get highlevel model data for a specified id.
try:

aidanlw17 (Author, Contributor) commented Aug 15, 2019:

Added the docstring.

db.similarity_stats.assign_stats(metric)

sim_count = count_similarity()
current_app.logger.info("Processed {} / {} ({:.3f}%)".format(sim_count,

aidanlw17 (Author, Contributor) commented Aug 15, 2019:

@alastair I looked for a way to echo this information from the manage.py command rather than the db module, but wasn't very successful. Since it all happens in one loop, I'm not sure how we could send info back intermittently. Replaced with the logger for now.
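One possible answer to the "send info back intermittently" question is to make the db-layer function a generator that yields progress after each batch, letting the manage.py command own all printing. This is a hedged sketch with hypothetical function names, not the PR's code:

```python
# db layer: yields (processed, total) after each batch instead of printing.
def add_metrics(batches):
    total = sum(len(b) for b in batches)
    processed = 0
    for batch in batches:
        # ... compute and write similarity vectors for the batch ...
        processed += len(batch)
        yield processed, total

# manage.py layer: consumes the generator and owns user-facing output.
def add_metric_command(batches):
    lines = []
    for done, total in add_metrics(batches):
        lines.append("Processed {} / {} ({:.3f}%)".format(
            done, total, done * 100.0 / total))
    return lines
```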

Projects: None yet · 3 participants