Efficient data loader #3448

Merged
merged 7 commits into from Feb 26, 2024

@Anish9901 Anish9901 commented Feb 19, 2024

Fixes #3423

This PR reduces the time required to load the "Movie Collection" dataset for local as well as remote DBs.

Technical details

Problem: We previously relied on a single SQL dump of the Movie Collection schema to load the entire dataset into the DB. This was fine when the DB ran alongside the Django service, but it was very slow for a remote DB and ultimately caused a server timeout with a 502 response.

Solution: This PR breaks the large SQL dump into parts and extracts all the data to be loaded into multiple .csv files.

  • movie_collection_tables.sql: Contains SQL queries for setting up tables for the dataset.
  • movies_csv/: Contains all the data to be loaded into the tables, as separate .csv files.
  • movie_collection_fks.sql: Contains SQL queries for setting up PKs and FKs on the tables.

We first execute movie_collection_tables.sql, then load all the data into the tables using SQL COPY instead of INSERT, and finally execute movie_collection_fks.sql to set up PKs and FKs.
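The three-step order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `copy_statement` and `load_dataset` are hypothetical names, the cursor is assumed to behave like a psycopg2 cursor (`execute(sql)` and `copy_expert(sql, file)`), and each .csv is assumed to be named after its table and to start with a header row.

```python
from pathlib import Path


def copy_statement(schema: str, table: str) -> str:
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    def quote(name: str) -> str:
        # Double embedded quotes so the quoted identifier stays valid.
        return '"' + name.replace('"', '""') + '"'

    return f"COPY {quote(schema)}.{quote(table)} FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_dataset(cursor, schema, tables_sql, csv_dir, fks_sql):
    """Create tables, bulk-load every CSV with COPY, then add PKs and FKs."""
    # Step 1: create the empty tables (no constraints yet).
    cursor.execute(Path(tables_sql).read_text())

    # Step 2: stream each CSV into its table; the file stem is assumed
    # to be the table name (hypothetical convention).
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, "r", encoding="utf-8") as csv_file:
            cursor.copy_expert(copy_statement(schema, csv_path.stem), csv_file)

    # Step 3: add primary and foreign keys once all rows are in place.
    cursor.execute(Path(fks_sql).read_text())
```

Adding the constraints only after the data is loaded means the FKs are checked once over the loaded tables rather than on every row, and COPY sends each table's rows in one stream rather than as many separate INSERT statements, which is likely why the remote case benefits most.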

Performance (GCP):

DB      Before           After  Improvement
Local   25s              9s     64%
Remote  186s (~3.1 min)  12s    93.5%
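For reference, the improvement figures appear to be the relative reduction in load time, (before - after) / before. A small check under that assumption (`improvement_pct` is a hypothetical helper, not part of this PR):

```python
def improvement_pct(before_s: float, after_s: float) -> float:
    """Relative reduction in load time, as a percentage rounded to one decimal."""
    return round((before_s - after_s) / before_s * 100, 1)


# Timings taken from the table above (seconds).
for db, before_s, after_s in [("Local", 25, 9), ("Remote", 186, 12)]:
    print(f"{db}: {improvement_pct(before_s, after_s)}%")
# prints "Local: 64.0%" and "Remote: 93.5%"
```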

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the develop branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@seancolsen seancolsen added this to the v0.1.5 milestone Feb 21, 2024
@Anish9901 Anish9901 marked this pull request as ready for review February 25, 2024 18:09
@Anish9901 Anish9901 added the pr-status: review A PR awaiting review label Feb 25, 2024

@mathemancer mathemancer left a comment

Great job; looks good to me. I tried loading everything in this branch and found no problems.

@mathemancer mathemancer added this pull request to the merge queue Feb 26, 2024
Merged via the queue into develop with commit e011123 Feb 26, 2024
24 checks passed
@mathemancer mathemancer deleted the eff_data_load branch February 26, 2024 16:21
@seancolsen seancolsen mentioned this pull request Feb 27, 2024
7 tasks
Development

Successfully merging this pull request may close these issues.

Attempting to add connection with both Library Management and Movie Collection schema results in a 502
3 participants