Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AB-340: Import MusicBrainz data in a separate musicbrainz schema in AB database #272

Merged

Conversation

@rsh7
Copy link
Contributor

@rsh7 rsh7 commented Jun 5, 2018

  • To initialize the musicbrainz database:
    ./develop.sh run --rm webserver python2 manage.py init_mb_db

  • To start the import:
    ./develop.sh run --rm webserver python2 manage.py import_musicbrainz_db

@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch from b0b19bf to 1cf845a Jun 5, 2018
@rsh7 rsh7 changed the base branch from master to musicbrainz-integration-gsoc Jun 5, 2018
@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch 3 times, most recently from a475555 to 5f20009 Jun 7, 2018
@rsh7
Copy link
Contributor Author

@rsh7 rsh7 commented Jun 11, 2018

Remaining things to do:

  1. Add docstrings explaining work of all the functions.
  2. Add comments in the entire code and more clean up.
Copy link
Member

@paramsingh paramsingh left a comment

Looks better now, let's add some docstrings to the new functions too.

MB_release_fk_artist_credit = list(set(MB_release_fk_artist_credit))

result = connection.execute(artist_credit_query, {'gids': tuple(gids_in_AB), 'data': tuple(MB_release_fk_artist_credit)})
global MB_artist_credit_data

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

Please don't use global, return this data and then use it.

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

Should fix this in every load function.

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018
Author Contributor

Done.

SELECT DISTINCT release_group_primary_type.id, release_group_primary_type.name,
release_group_primary_type.parent, release_group_primary_type.child_order,
release_group_primary_type.description, release_group_primary_type.gid
FROM release_group_primary_type INNER JOIN release_group

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

this query should be indented.

VALUES (:id, :gid, :name, :sort_name, :begin_date_year, :begin_date_month, :begin_date_day,
:end_date_year, :end_date_month, :end_date_day, :type, :area, :gender, :comment, :edits_pending,
:last_updated, :ended, :begin_area, :end_area)
ON conflict do nothing

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

ON CONFLICT here should name the column which conflicts, and CONFLICT should be in caps.

def write_recording_gid_redirect(transaction, connection):
recording_gid_redirect_query = text("""
INSERT INTO musicbrainz.recording_gid_redirect
VALUES (:gid, :new_id, :created)

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

indentation.

def insert_MB_data_AB():
with db.engine.connect() as connection:
if MB_artist_credit_data:
transaction = connection.begin()

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

Do we need a transaction for each insert?

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018
Author Contributor

Removed the integrity check part as we are using 'on conflict action'.

manage.py Outdated
@@ -179,6 +181,15 @@ def remove_admin(username):
sys.exit(1)


@cli.command()
def init_mb_db():

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

this doesn't look needed ?

MB_track_data = result.fetchall()


print("--------------------------------------------------------------------------------------------------")

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

remove this.


def start_import():
with db.engine.begin() as connection:
lowlevel_query = text("""SELECT gid from lowlevel""")

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

Could we get these gids in batches too?

SELECT gid  FROM lowlevel ORDER BY id OFFSET 0 LIMIT 10000
SELECT gid  FROM lowlevel ORDER BY id OFFSET 10000 LIMIT 10000

and so on.

This comment has been minimized.

@rsh7

rsh7 Jun 13, 2018
Author Contributor

Added.

gids_in_AB = [value[0] for value in gids]
no_of_rows = len(gids_in_AB)
start = 0
rows_to_fetch = 10000

This comment has been minimized.

@paramsingh

paramsingh Jun 12, 2018
Member

we should make this a configurable variable in config.py

@rsh7 rsh7 force-pushed the rsh7:import_musicbrainz_db branch from eb6640c to 7871580 Jun 13, 2018
Copy link
Member

@paramsingh paramsingh left a comment

Ok, this seems to be working. This PR is a bit too big for us to keep working on. I'll merge this and open a few tickets with changes that I would like to see.

@paramsingh paramsingh merged commit f71ca57 into metabrainz:musicbrainz-integration-gsoc Jun 16, 2018
1 check passed
1 check passed
@brainzbot
Jenkins Build finished.
Details
@rsh7 rsh7 deleted the rsh7:import_musicbrainz_db branch Jun 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants