Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AB-348: Make the musicbrainz import more efficient #281

Merged
merged 11 commits into from Jul 12, 2018

Conversation

@rsh7
Copy link
Contributor

@rsh7 rsh7 commented Jun 28, 2018

No description provided.

@rsh7 rsh7 changed the base branch from master to musicbrainz-integration-gsoc Jun 28, 2018
Copy link
Member

@paramsingh paramsingh left a comment

Let's add more documentation as to what parameters these functions take.

This looks okay to me, but we still need to spend some time finding inefficiencies and removing them.

@@ -48,12 +66,12 @@ def load_artist_credit(connection, gids_in_AB, MB_release_data):
return MB_artist_credit_data


def load_artist_type(connection, gids_in_AB):
def load_artist_type(connection, gids_in_AB, artist_credit_from_recording):

This comment has been minimized.

@paramsingh

paramsingh Jul 2, 2018
Member

Is the var gids_in_AB still needed here?

This comment has been minimized.

@rsh7

rsh7 Jul 5, 2018
Author Contributor

No, not needed and I've removed it.

@rsh7 rsh7 force-pushed the rsh7:optimize branch 2 times, most recently from 8ec3fe8 to 1c1dfa3 Jul 5, 2018
@rsh7
Copy link
Contributor Author

@rsh7 rsh7 commented Jul 6, 2018

Data import was pretty fast, but data insertion was taking much time as there were a lot of duplications in the data fetched from many tables. So, I have again included the DISTINCT keyword in the queries. And also improved some queries to get all the data from MusicBrainz db as the number of rows in some tables are less than 20.

@paramsingh
Copy link
Member

@paramsingh paramsingh commented Jul 11, 2018

Rebase this once, please.

@rsh7 rsh7 force-pushed the rsh7:optimize branch from f7ef7fa to 0bf7156 Jul 11, 2018
@rsh7
Copy link
Contributor Author

@rsh7 rsh7 commented Jul 11, 2018

Rebased, and add a commit for some changes in docstrings of all the functions.

@rsh7 rsh7 force-pushed the rsh7:optimize branch from 99fa5ec to b6159f0 Jul 11, 2018
Copy link
Member

@paramsingh paramsingh left a comment

docstrings should explicitly list the type of data the function is expecting also.

MB_release_status_data = result.fetchall()

return MB_release_status_data


def load_release_group_primary_type(connection, gids_in_AB, MB_release_group_data):
def load_release_group_primary_type(connection):
"""Fetch release_group_primary_type table data from MusicBrainz database for the

This comment has been minimized.

@paramsingh

paramsingh Jul 11, 2018
Member

now that these functions are fetching entire tables, we should change docs to reflect that.

)
result = connection.execute(area_type_query, filter_data)
area_type_query = text("""
SELECT * FROM area_type

This comment has been minimized.

@paramsingh

paramsingh Jul 11, 2018
Member

this should be

  SELECT column1,
         column2
    FROM area_type
ORDER BY id
@rsh7
Copy link
Contributor Author

@rsh7 rsh7 commented Jul 12, 2018

Changes made.

@paramsingh paramsingh merged commit 711866c into metabrainz:musicbrainz-integration-gsoc Jul 12, 2018
1 check passed
1 check passed
@brainzbot
Jenkins Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants