Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AB-340: Import MusicBrainz data in a separate musicbrainz schema in AB database #272

Merged
merged 6 commits into from Jun 16, 2018

Conversation

Projects
None yet
2 participants
@rsh7
Copy link
Contributor

rsh7 commented Jun 5, 2018

  • To initialize the musicbrainz database:
    ./develop.sh run --rm webserver python2 manage.py init_mb_db

  • To start the import:
    ./develop.sh run --rm webserver python2 manage.py import_musicbrainz_db

@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch from b0b19bf to 1cf845a Jun 5, 2018

@rsh7 rsh7 changed the base branch from master to musicbrainz-integration-gsoc Jun 5, 2018

@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch 3 times, most recently from a475555 to 5f20009 Jun 7, 2018

@rsh7

This comment has been minimized.

Copy link
Contributor Author

rsh7 commented Jun 11, 2018

Remaining things to do:

  1. Add docstrings explaining work of all the functions.
  2. Add comments in the entire code and more clean up.
@paramsingh
Copy link
Member

paramsingh left a comment

Looks better now, let's add some docstrings to the new functions too.

MB_release_fk_artist_credit = list(set(MB_release_fk_artist_credit))

result = connection.execute(artist_credit_query, {'gids': tuple(gids_in_AB), 'data': tuple(MB_release_fk_artist_credit)})
global MB_artist_credit_data

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Please don't use global, return this data and then use it.

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Should fix this in every load function.

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018

Author Contributor

Done.

SELECT DISTINCT release_group_primary_type.id, release_group_primary_type.name,
release_group_primary_type.parent, release_group_primary_type.child_order,
release_group_primary_type.description, release_group_primary_type.gid
FROM release_group_primary_type INNER JOIN release_group

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

this query should be indented.

VALUES (:id, :gid, :name, :sort_name, :begin_date_year, :begin_date_month, :begin_date_day,
:end_date_year, :end_date_month, :end_date_day, :type, :area, :gender, :comment, :edits_pending,
:last_updated, :ended, :begin_area, :end_area)
ON conflict do nothing

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

ON CONFLICT here should name the column which conflicts, and CONFLICT should be in caps.

def write_recording_gid_redirect(transaction, connection):
recording_gid_redirect_query = text("""
INSERT INTO musicbrainz.recording_gid_redirect
VALUES (:gid, :new_id, :created)

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

indentation.

def insert_MB_data_AB():
with db.engine.connect() as connection:
if MB_artist_credit_data:
transaction = connection.begin()

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Do we need a transaction for each insert?

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018

Author Contributor

Removed the integrity check part as we are using 'on conflict action'.

manage.py Outdated
@@ -179,6 +181,15 @@ def remove_admin(username):
sys.exit(1)


@cli.command()
def init_mb_db():

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

this doesn't look needed ?

MB_track_data = result.fetchall()


print("--------------------------------------------------------------------------------------------------")

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

remove this.


def start_import():
with db.engine.begin() as connection:
lowlevel_query = text("""SELECT gid from lowlevel""")

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

Could we get these gids in batches too?

SELECT gid  FROM lowlevel ORDER BY id OFFSET 0 LIMIT 10000
SELECT gid  FROM lowlevel ORDER BY id OFFSET 10000 LIMIT 10000

and so on.

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018

Author Contributor

Added.

gids_in_AB = [value[0] for value in gids]
no_of_rows = len(gids_in_AB)
start = 0
rows_to_fetch = 10000

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018

Member

we should make this a configurable variable in config.py

@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch from eb6640c to 7871580 Jun 13, 2018

@paramsingh
Copy link
Member

paramsingh left a comment

Ok, this seems to be working. This PR is a bit too big for us to keep working on. I'll merge this and open a few tickets with changes that I would like to see.

@paramsingh paramsingh merged commit f71ca57 into metabrainz:musicbrainz-integration-gsoc Jun 16, 2018

1 check passed

Jenkins Build finished.
Details

@rsh7 rsh7 deleted the rsh7:import_musicbrainz_db branch Jun 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.