New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LB-372: MessyBrainz: Create artist_credit clusters using artist MBIDs in database #41

Merged
merged 17 commits into from Jul 2, 2018

Conversation

Projects
None yet
3 participants
@kartikeyaSh
Contributor

kartikeyaSh commented Jun 13, 2018

Summary

This PR includes a script for creating artist_credit clusters.

  • This is a…
    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other

Problem

We need to create clusters and associate MSIDs to MBIDs.

  • JIRA ticket (optional): LB-372

Solution

Add a script which will create clusters for artists that have artist MBIDs in recording_json table.
This is somewhat similar to what I've written in my proposal

Additional things apart from the proposal:

  1. New table artist_credit_redirect_array which stores artist MBIDs as arrays, and is represented by an artist_credit_cluster_id.
  2. The clusters are created in 2 phases in phase-1 simple cases are handled and in phase-2 anomalies are handled.

kartikeyaSh added some commits Jun 13, 2018

Add two functions to postgres
Add convert_json_array_to_sorted_uuid_array which converts
a given JSON array to a sorted psql array of type UUID and
array_sort which returns a sorted UUID array.
Add a new table
Add artist_credit_redirect_array table to store
cluster_id, artist_mbids_array pairs.
@kartikeyaSh

This comment has been minimized.

Contributor

kartikeyaSh commented Jun 13, 2018

Here is a gist to test run: https://gist.github.com/kartikeyaSh/f80ce28b0d29ccee61fbe235f5acddc3
I'll write tests for this soon.

@kartikeyaSh kartikeyaSh changed the title from Artist cluster to LB-372: MessyBrainz: Create artist_credit clusters using artist MBIDs in database Jun 13, 2018

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:artist_cluster branch from fd45542 to b2b44f1 Jun 14, 2018

kartikeyaSh added some commits Jun 13, 2018

Add functionality to create artist_credit clusters
Add functionality to create clusters of artist_credit
using artist_mbids present in the recording_json table.
Add schema update
Add artist_credit_redirect_array table

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:artist_cluster branch from b2b44f1 to 4631d20 Jun 15, 2018

@paramsingh

This comment has been minimized.

Member

paramsingh commented Jun 16, 2018

What exactly do you mean by anomaly here?

@kartikeyaSh

This comment has been minimized.

Contributor

kartikeyaSh commented Jun 17, 2018

@paramsingh A single MSID pointing to multiple MBIDs. e.g. James Morrison and James Morrison. Another way to put it is: Any artist_credit_cluster_id which appears more than once in artist_credit_redirect_array table is an anomaly. You can see that in the above gist that cluster ID 1ff80901-f7e9-4f2a-ac37-3ed0ce216362 points to 4 different MBIDs in artist_credit_redirect_array.

Change the way how anomalies are clustered
First fetch all unique cluster_id for a list of artist_gids
instead of fetching cluster_id for artist_gid as this may
cause duplicate inserts in artist_credit_redirect table.
@mayhem

Undo the dropping of that table and then I think we should merge and move on.

@@ -0,0 +1,5 @@
BEGIN;

This comment has been minimized.

@mayhem

mayhem Jun 26, 2018

Member

Not sure this is really needed since we haven't updated production yet -- so that table will not exist there...

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:artist_cluster branch from 4aecf0a to fc91ae4 Jun 26, 2018

Args:
connection: the sqlalchemy db connection to be used to execute queries
artist_credit_mbids (list): a list of sorted artist MBIDs for which

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

What will happen if this list is not sorted? We should consider taking any order and sorting it ourselves inside the function.



def create_artist_credit_clusters_without_considering_anomalies(connection):
"""Creates cluster for artist_credit without considering anamolies.

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

Please specify in the comment what you mean by anomalies here (and in other places).

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

Also anamolies -> anomalies

@@ -102,3 +103,297 @@ def fetch_and_store_artist_mbids_for_all_recording_mbids():
pass

return num_recording_mbids_processed, num_recording_mbids_added


def fetch_distinct_artist_credit_mbids(connection):

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

this name should be fixed, there is no indication in the function name that it only fetches those mbids which do not have msids in the artist_credit_cluster table.

@@ -1,6 +1,7 @@

import brainzutils.musicbrainz_db.recording as mb_recording
import json
import uuid

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

is this import being used somewhere?

@@ -0,0 +1,5 @@
BEGIN;

DROP TABLE IF EXISTS artist_credit_redirect CASCADE;

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

Just move this to the other update script, keeping multiple schema change scripts in one PR doesn't really make sense.



def test_link_artist_mbids_to_artist_credit_cluster_id(self):
"""Tests if artist MBIDs are linked to correctly. Links are created in

This comment has been minimized.

@paramsingh

paramsingh Jun 26, 2018

Member

linked to cluster

kartikeyaSh added some commits Jun 28, 2018

Alter and rename artist_credit_redirect_array
Rename artist_credit_redirect to artist_credit_redirect
and alter the column to store artist MBIDs to array of UUIDs.
Rename fetch_distinct_artist_credit_mbids
Rename fetch_distinct_artist_credit_mbids to
fetch_unclustered_distinct_artist_credit_mbids. This
gives indication that only those MBIDs are fetched which
have not been clustered.

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:artist_cluster branch from f762c15 to 8c7af9a Jun 28, 2018

to multiple MBIDs in entity_redirect table).
Args:
connection: the sqlalchemy db connection to be used to execute queries.

This comment has been minimized.

@paramsingh

paramsingh Jun 30, 2018

Member

taking a connection as input here and then creating a connection in the function as well.

This comment has been minimized.

@kartikeyaSh

kartikeyaSh Jun 30, 2018

Contributor

done:). removed creating a connection in the function.

@paramsingh

This comment has been minimized.

Member

paramsingh commented Jun 30, 2018

From IRC

iliekcomputers: as you mentioned in #44 I wrote generic functions. I'm not sure If I need to write tests for those functions cz those get tested when testing create_artist_credit_clusters, create_artist_credit_clusters_without_anomalies........ And I can't test those functions without using an entity like artist/release.

I don't think we need tests for the generic functions. They'll only get called by each of the entity clustering functions and tested there.

kartikeyaSh added some commits Jun 29, 2018

Add generic functions common to entities
These functions are common to both artist and
release when creating clusters.
Use generic functions
Use function declared in db_common to
create clusters.

@kartikeyaSh kartikeyaSh force-pushed the kartikeyaSh:artist_cluster branch from c8c64cb to a9ccca8 Jun 30, 2018

@paramsingh

🥇

Add logging while creating clusters
This makes it easy to understand how clusters are formed.
@mayhem

mayhem approved these changes Jul 2, 2018

The logging output option should also be added to the other clustering functions in manage.py, but we can do that in the other PRs. For now, let's merge!

@mayhem mayhem merged commit fc1da45 into metabrainz:master Jul 2, 2018

1 check passed

Jenkins Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment