New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LB-367: MessyBrainz: Add script to create recording clusters #37

Merged
merged 21 commits into from Jun 13, 2018

Conversation

Projects
None yet
3 participants
@kartikeyaSh
Contributor

kartikeyaSh commented May 27, 2018

Summary

This PR includes a script for creating clusters and tests for the same.

  • This is a…
    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other

Problem

We need to create clusters and associate MSIDs to MBIDs.

  • JIRA ticket (optional): LB-367

Solution

Add a script which will create clusters for recordings that have recording MBIDs in recording_json table.
This script does work in the following steps for a single recording MBID.

  1. For a given recording MBID in the recording_json table, get all recording_json.id which contain this recording MBID.
  2. From this list of recording_json.id query the recording table to get the recording.gid associated with this recording_json.id.
  3. For this list of recording gids make the first gid as cluster_id in the recording_cluster table and associate this cluster_id to all the recording gids.
  4. In the recording_redirect table add an entry for the cluster_id and the recording MBID which represents this cluster.
    This is somewhat similar to what I've written in my proposal
@kartikeyaSh

This comment has been minimized.

Contributor

kartikeyaSh commented May 27, 2018

Here are 2 screenshots containing data in recording_json, recording, recording_cluster, and recording_redirect tables for the test run.

screenshot from 2018-05-27 15-37-27
screenshot from 2018-05-27 15-38-08

A few questions:

  1. Do we need to have an option like "only execute this script for recordings submitted after some desired date"?
  2. As a recording can have multiple recording MBIDs should I first fetch the recording MBID (recording_redirect stuff) from the MusicBrainz database and then insert the MBID fetched into our database. Or should it be done in some other script?
  3. cluster_ID field: should it be a serial field or is it fine to have it represented by some gid?
@mayhem

This comment has been minimized.

Member

mayhem commented May 28, 2018

Screenshots are hard to copy data from -- can you please post a gist here? https://gist.github.com ?

@kartikeyaSh

This comment has been minimized.

manage.py Outdated
@@ -95,6 +97,16 @@ def init_test_db(force=False):

print("Done!")

@cli.command()
@click.option("--reset", "-r", is_flag=True, help="Drop existing database and user.")

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

Looks like this option text needs updating.

manage.py Outdated
@click.option("--reset", "-r", is_flag=True, help="Drop existing database and user.")
def create_recording_clusters_for_mbids(reset=False):
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)
clusters_modified, clusters_add_to_redirect, num_msid_processed = create_recording_clusters(reset)

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

You need to catch exceptions here and show them to the user.

True if a new value is added to cluster otherwise False is returned.
"""

query = text("""INSERT INTO recording_cluster (cluster_id, recording_gid, updated)

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

As with the previous PR, please clean up the formatting of your SQL queries.

VALUES (:cluster_id, :recording_gid, now())
""")
cluster_modified = False
for recording_gid in recording_gids:

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

While this seems technically correct, it is a very inefficient way to fetch data from the DB -- you are making one query for each data piece you're fetching. Each SQL query has overhead to setup the query, plan an execution plan for it and then wait for the response data.

Instead you should see if you can make this happen in one single query.

True if link is inserted into the cluster otherwise False is returned.
"""

cluster_added = False

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

You don't need to add a variable here -- you should just return True from inside the if, and False from the end of the function.

"""

cluster_added = False
if not is_recording_cluster_present_in_recording_redirect(connection, cluster_id, mbid):

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

There is a better way to do this that to query the data and then take action if the data isn't there. Namely adding a unique table constraint that allows only one unique recording_cluster_id, recording_id pair to be added to the table. Then you simply carry out the insertion -- if the data already exists it will throw an error that you can safely ignore. If the data isn't there, it will be inserted correctly.

return False


def is_recording_cluster_present_in_recording_cluster(connection, cluster_id, gid):

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

Ideally, both of these functions should not be needed if we add the right table contraints as mentioned above. Lets see if we can get rid of these functions, shall we?

clusters_add_to_redirect = 0
num_msid_processed = 0
with db.engine.begin() as connection:
if reset:

This comment has been minimized.

@mayhem

mayhem May 29, 2018

Member

This reset functionality makes me nervous -- for putting this into production is could lead to disaster -- but for development this is fine for now. Later, once we're past the experimentation stage of this project we should consider removing these features, or at least adding some safeguards so that the tables can't be truncated in a production environment.

manage.py Outdated
@click.option("--reset", "-r", is_flag=True, help="Drop existing database and user.")
def create_recording_clusters_for_mbids(reset=False):
def truncate_recording_cluster_and_redirect():
"""Truncate recording_cluster and recording_redirect tables."""
db.init_db_engine(config.SQLALCHEMY_DATABASE_URI)

This comment has been minimized.

@mayhem

mayhem Jun 8, 2018

Member

You need to catch exceptions and properly report them to the user here.

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:recording_cluster branch 3 times, most recently from 47d367a to ab9f1eb Jun 8, 2018

kartikeyaSh added some commits May 27, 2018

Remove UNIQUE constraint from indexes on cluster tables
The UNIQUE constraint makes it impossible to add information
about the gids represented by some cluster_id.
Add a script to create clusters for recordings
This script creates clusters for the recordings which
contain recording_mbids.
Add test data
This data will be used for testing clustering scripts.
Add seperate method for truncating tables
Add a seperate method to truncate tables
and remove reset option as it can be dangerous
in production.

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:recording_cluster branch 2 times, most recently from a09ea82 to 4945a9b Jun 12, 2018

kartikeyaSh added some commits Jun 2, 2018

Modify queries and remove unuesd code
Modify queries to increase the efficiency of the script
and remove the functions no longer needed.
Remove nested query and use JOIN
This is better for large tables

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:recording_cluster branch from 4945a9b to fd37208 Jun 12, 2018

kartikeyaSh added some commits Jun 8, 2018

Move function from data.py to recording.py
Move get_recording_cluster_id_using_recording_mbid from data.py
to recording.py as it used by recording.py only and makes
more sense to have it in recording.py.
Add UNIQUE constraint
Add UNIQUE constraints so that we don't get duplicate entries
in tables for clusters and redirects.
Create and modify constraints
Recreate indexes to remove UNIQUE constraints on cluster_id
and create new UNIQUE constraints on cluster and redirect tables.
Alphabetize tables
Write tables in alphabetical order.
Rename truncate_tables
Rename truncate_tables to truncate_recording_cluster_and_recording_redirect_table.

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:recording_cluster branch from fd37208 to f28c018 Jun 12, 2018

@mayhem

mayhem approved these changes Jun 12, 2018

Overall, I think this now looks good.

@@ -0,0 +1,22 @@
BEGIN;

-- Drop indexes and recreate them to remove UNIQUE constraint

This comment has been minimized.

@mayhem

mayhem Jun 12, 2018

Member

Remove? is this comment correct?

This comment has been minimized.

@kartikeyaSh

kartikeyaSh Jun 12, 2018

Contributor

CREATE UNIQUE INDEX cluster_id_ndx_recording_cluster ON recording_cluster (cluster_id);
Here I've removed the UNIQUE constraint on the index (similar for artist_credit_cluster and release_cluster table). Otherwise we can't represent clusters like:

cluster_id recording_gid updated
6d49962f-fb7b-489f-8656-811c2c8f11fb 6d49962f-fb7b-489f-8656-811c2c8f11fb 2018-05-27 10:04:52.308466+00
6d49962f-fb7b-489f-8656-811c2c8f11fb 4befe14b-629b-4af6-927e-e940441c5c0c 2018-05-27 10:04:52.308466+00
connection.execute(query, values)


def fetch_gids_for_recording_mbid(connection, recording_mbid):

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

this name should be something like fetch_unclustered_gids_for_recording_mbid), it is confusing to read right now.

List of gids.
"""

query = text("""SELECT r.gid

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

just a small nitpick as I noticed this in the last PR as well, instead of saving the query in a variable and then calling connection.execute(query, args), using connection.execute(text("""SELECT * FROM recording"""), args) would be much more readable.

connection: the sqlalchemy db connection to be used to execute queries
Returns:
recording_mbids.

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Better to remove this if you can't think of anything more detailed.

def test_fetch_gids_for_recording_mbid(self):
"""Tests if gids are correctly fetched."""

recording_1 = {

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

mentioning the difference between the two recordings (& vs and) in a comment would be helpful, took me some time to find. :D

}
submit_listens([recording_1, recording_2])

clusters_modified, clusters_add_to_redirect = create_recording_clusters()

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Just mention here that recording_1 is a completely new recording while recording_2 should be clustered with a new one.

This comment has been minimized.

@kartikeyaSh

kartikeyaSh Jun 12, 2018

Contributor

done, a few lines above

kartikeyaSh added some commits Jun 12, 2018

Rename function and modify comments
Rename fetch_gids_for_recording_mbid to fetch_unclustered_gids_for_recording_mbid,
add comments to tests to differentiate recordings, and modify doc strings for
better clarity.
@paramsingh

🥇

@paramsingh paramsingh merged commit fe6b05c into metabrainz:master Jun 13, 2018

1 check passed

Jenkins Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment