
Put Postgres import and sql commands into a single script #124

Open · ijabz opened this issue Sep 6, 2020 · 3 comments

ijabz commented Sep 6, 2020

Put the following commands into a single script, since all of them need to be run every time to properly import the Discogs data into the database. I run importcsv.py with nohup; it takes some time, so when I find it has completed I don't want to then have to create the keys/indexes etc. and wait some more.

The script would need to check that each step has completed successfully before moving on to the next (see the sketch after the commands below).

python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py /csvdir/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
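For illustration, a minimal sketch of such a wrapper, assuming a hypothetical file name (load_discogs.sh) and that /csvdir holds the exported csv files: with `set -euo pipefail` the script aborts as soon as any step exits with a non-zero status, which covers the "check the previous step worked" requirement for these commands.

```bash
#!/usr/bin/env bash
# load_discogs.sh (hypothetical name) -- runs the five steps above in order.
# Any step exiting non-zero aborts the whole run, so later steps never run
# against a half-imported database.
set -euo pipefail

CSV_DIR="${1:-/csvdir}"   # assumption: first argument (or /csvdir) holds the csv files

python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py "$CSV_DIR"/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
```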

philipmat (Owner) commented

I have some concerns as to how having the indexes and keys present in the database before the import affects the speed of the import.

In most databases it tends to be slower, even in bulk load, because:

  • every row has to be PK-checked and FK-checked before insert, and a failed FK check would cause the entire import to fail;
  • having the FKs in place means the files need to be loaded in a precise order, with the files providing the FK targets first and the files referencing them last;
  • after every row (or batch of rows) inserted, the indexes may have to be updated, slowing the whole process down (building them once at the end is a single table scan).

By contrast, the current solution leaves a DB full of data at the end of the csv import; optimization (PK, FK, indexes) is then left at the latitude of the users.

All that aside, could you try the following two scenarios and see if the outcome is acceptable?

In a script:

Scenario A

  1. Run CreateTables.sql
  2. Import csv
  3. Run the rest of the sql scripts

Scenario B

  1. Run all sql scripts
  2. Import csv

Run each of the scripts above with the time command; if the results from Scenario B are positive and faster than Scenario A, I'd be happy to start looking into it.
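One way to compare the two, sketched under the same assumptions as the wrapper above (script name and csv path are hypothetical): parameterize the wrapper by scenario and run each variant under `time`, each against a freshly created, empty database.

```bash
#!/usr/bin/env bash
# time_scenario.sh (hypothetical name) -- run once per scenario, e.g.:
#   time ./time_scenario.sh A /csvdir
#   time ./time_scenario.sh B /csvdir
# Each run should start from a freshly created, empty database.
set -euo pipefail

SCENARIO="$1"            # "A" = keys/indexes after the import, "B" = before
CSV_DIR="${2:-/csvdir}"

run_sql() { python3 postgresql/psql.py < "postgresql/sql/$1"; }

run_sql CreateTables.sql

if [ "$SCENARIO" = "B" ]; then
    # Scenario B: primary keys, FK constraints and indexes exist before the bulk load
    run_sql CreatePrimaryKeys.sql
    run_sql CreateFKConstraints.sql
    run_sql CreateIndexes.sql
fi

python3 postgresql/importcsv.py "$CSV_DIR"/*

if [ "$SCENARIO" = "A" ]; then
    # Scenario A: keys, constraints and indexes are added only after the data is in
    run_sql CreatePrimaryKeys.sql
    run_sql CreateFKConstraints.sql
    run_sql CreateIndexes.sql
fi
```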


ijabz commented Sep 6, 2020

Hi, if you look at my commands you'll see the indexes etc. are created AFTER the import. I'm just suggesting that the user should be able to run one command that creates the db tables, imports the data, and then creates the indexes, keys and constraints, rather than having to run 5 different commands.


ijabz commented Sep 6, 2020

> … is then left at the latitude of the users.

There may be cases where users want to change this, but for most I cannot see why they would not want to add keys and indexes after the db import; otherwise the db will be almost unusable.
