Replace loader with ddl and copyCSV #747

Merged
merged 1 commit into from
Aug 17, 2022

Conversation

acquamarin
Collaborator

@acquamarin acquamarin commented Aug 12, 2022

This PR adds the copyCSV backend (without transaction support) and replaces the loader with DDL and COPY_CSV.

  1. DDL grammar:
    CREATE NODE: CREATE NODE TABLE person (ID INT64, NAME STRING, PRIMARY KEY (ID))
    If the user does not specify a primary key, the system uses the first property as the primary key.
    CREATE REL: CREATE REL knows (FROM PERSON | ANIMAL TO PERSON, TIME DATE, MANY_MANY)

  2. COPY CSV grammar:
    COPY person FROM "dataset/person.csv" (ESCAPE="", DELIM=';', QUOTE='"',LIST_BEGIN='{',LIST_END='}')
    The parsing options in parentheses are optional.
    Supported parsing options:
    ESCAPE, DELIM, QUOTE, LIST_BEGIN, LIST_END
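To make the roles of the parsing options concrete, here is a minimal sketch of how a single CSV line might be split under them. It is not the PR's `csv_reader` implementation: `CSVOptions`, `splitCSVLine`, and the default option values are hypothetical names and assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical option bundle mirroring the five options named in the PR.
// The default values are assumptions, not the system's actual defaults.
struct CSVOptions {
    char escape = '\\';
    char delim = ',';
    char quote = '"';
    char listBegin = '[';
    char listEnd = ']';
};

// Splits one line into fields. A delimiter inside quotes or inside a
// LIST_BEGIN/LIST_END pair does not split the field, and the character
// following ESCAPE is copied verbatim.
std::vector<std::string> splitCSVLine(const std::string& line, const CSVOptions& opts) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    int listDepth = 0;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (c == opts.escape && i + 1 < line.size()) {
            current += line[++i]; // keep the escaped character as-is
        } else if (c == opts.quote) {
            inQuotes = !inQuotes; // quote characters are dropped from the field
        } else if (!inQuotes && c == opts.listBegin) {
            ++listDepth;
            current += c;
        } else if (!inQuotes && c == opts.listEnd && listDepth > 0) {
            --listDepth;
            current += c;
        } else if (!inQuotes && listDepth == 0 && c == opts.delim) {
            fields.push_back(current); // top-level delimiter ends the field
            current.clear();
        } else {
            current += c;
        }
    }
    assert(listDepth >= 0); // depth is only decremented when positive
    fields.push_back(current);
    return fields;
}
```

With `DELIM=';'`, `LIST_BEGIN='{'`, and `LIST_END='}'`, the line `1;"Alice;Bob";{a;b}` splits into three fields under this sketch, because the semicolons inside the quotes and inside the list are not treated as delimiters.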

Also refactors all tests to use DDL commands and COPY_CSV.
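The primary-key default described for CREATE NODE above could look roughly like the following sketch. `PropertyDef` and `resolvePrimaryKeyIdx` are hypothetical names, and representing "no primary key given" as an empty string is an assumption made to keep the example small; the binder in the PR may handle this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical property definition: a name and a type name, as in
// "ID INT64" or "NAME STRING".
struct PropertyDef {
    std::string name;
    std::string type;
};

// Returns the index of the primary-key property. An empty primaryKey means
// the user did not specify one, so the first property is used.
std::size_t resolvePrimaryKeyIdx(
    const std::vector<PropertyDef>& properties, const std::string& primaryKey = "") {
    assert(!properties.empty() && "a node table needs at least one property");
    if (primaryKey.empty()) {
        return 0; // default: first property is the primary key
    }
    for (std::size_t i = 0; i < properties.size(); ++i) {
        if (properties[i].name == primaryKey) {
            return i;
        }
    }
    throw std::invalid_argument(
        "primary key '" + primaryKey + "' is not a property of the table");
}
```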

src/catalog/catalog.cpp (resolved)
src/common/csv_reader/csv_reader.cpp (resolved)
src/antlr4/Cypher.g4 (outdated, resolved)
src/parser/transformer.cpp (resolved)
src/planner/enumerator.cpp (resolved)
src/storage/in_mem_builder/include/in_mem_node_builder.h (outdated, resolved)
test/copy_csv/copy_csv_fault_test.cpp (outdated, resolved)
test/copy_csv/copy_csv_fault_test.cpp (outdated, resolved)
test/runner/e2e_update_test.cpp (outdated, resolved)
test/test_utility/test_helper.cpp (resolved)
src/catalog/catalog.cpp (outdated, resolved)
src/catalog/catalog.cpp (resolved)
src/storage/in_mem_builder/include/in_mem_rel_builder.h (outdated, resolved)
src/catalog/include/catalog.h (resolved)
src/main/connection.cpp (outdated, resolved)
src/main/database.cpp (outdated, resolved)
src/storage/store/include/nodes_store.h (outdated, resolved)
src/storage/store/include/rels_store.h (outdated, resolved)
@acquamarin acquamarin force-pushed the ddl-copy-csv branch 3 times, most recently from d219e14 to 0e84644 Compare August 15, 2022 02:19
Contributor

@semihsalihoglu-uw semihsalihoglu-uw left a comment

I have a set of medium-level comments, but it would be good to go over them together on a Zoom call, as there are quite a few of them.

Besides what I left in the comments, I have some more high-level comments:

  • In the DDL grammar, I noticed that users need to specify the multiplicity of rel tables, e.g., MANY_TO_MANY. Let's default to MANY_TO_MANY and not require users to specify it. The fewer things users need to specify for the system to just work, the better the usability.

  • Make a note somewhere: when implementing transactions, change the database or connection to stop all transactions before running copyCSV.

  • Test what happens when read queries and the copyCSV query run concurrently.

  • Update the design doc with the new changes: https://docs.google.com/document/d/1Q3LDTnm1DJFVZfPI7Dt5Lr1zbkT5sTaUnh0mTaK7GbU/edit

  • Flexible headers: Double-check what DuckDB assumes about the structure of CSV files. Do they have headers? With headers, users do not have to provide the columns in the CSV file in the order that the DBMS expects. For example, if the table's columns were (ID, Name, Age) but someone gave a CSV file with the columns Name, ID, Age, and this was indicated in a header, the load would still succeed. It looks like DuckDB allows users to specify a header (https://duckdb.org/docs/sql/statements/copy), which probably gives them this flexibility. I want that feature, and it should be implemented in an immediate second PR (or after the transactions work, if you are already in the middle of it). So please open an issue about this.

This feature still requires a design, and I would have a special "Unstructured" column in the CSV, which stores the unstructured properties of nodes.

  • This is not related to this PR but it has lately started bothering me, so I'll put it here: Can you open an issue to rename Unstructured -> Semistructured?
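The flexible-headers idea from the list above can be sketched as follows. This is only an illustration of the requested behaviour, not a design: `mapHeaderToTableColumns` and `reorderRow` are hypothetical helpers, and exact, case-sensitive name matching is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// For each table column, in the order the DBMS expects, returns the position
// of that column in the CSV header. Throws if a column is missing or if the
// header names a column twice.
std::vector<std::size_t> mapHeaderToTableColumns(
    const std::vector<std::string>& tableColumns, const std::vector<std::string>& header) {
    std::unordered_map<std::string, std::size_t> headerPos;
    for (std::size_t i = 0; i < header.size(); ++i) {
        if (!headerPos.emplace(header[i], i).second) {
            throw std::invalid_argument("duplicate column in header: " + header[i]);
        }
    }
    std::vector<std::size_t> mapping;
    mapping.reserve(tableColumns.size());
    for (const std::string& column : tableColumns) {
        auto it = headerPos.find(column);
        if (it == headerPos.end()) {
            throw std::invalid_argument("column missing from header: " + column);
        }
        mapping.push_back(it->second);
    }
    return mapping;
}

// Rearranges one parsed CSV row into table-column order using the mapping.
std::vector<std::string> reorderRow(
    const std::vector<std::string>& row, const std::vector<std::size_t>& mapping) {
    std::vector<std::string> ordered;
    ordered.reserve(mapping.size());
    for (std::size_t idx : mapping) {
        assert(idx < row.size()); // the row must be as wide as the header
        ordered.push_back(row[idx]);
    }
    return ordered;
}
```

For a table with columns (ID, Name, Age) and a file whose header is Name, ID, Age, the mapping is computed once from the header and then applied to every row, so each row reaches the loader in the order the table expects.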

dataset/copy-csv-empty-lists-test/vPerson.csv (resolved)
src/antlr4/Cypher.g4 (resolved)
src/antlr4/Cypher.g4 (resolved)
src/binder/binder.cpp (outdated, resolved)
@@ -426,14 +387,30 @@ void CatalogContent::readFromFile(const string& directory, bool isForWALRecord)
FileUtils::closeFile(fileInfo->fd);
}

uint64_t CatalogContent::getTotalNumRels() const {
Contributor

If this is being used to get nextRelID, call it getNextRelID(). The "function/purpose" of a function, certainly a public function, should determine its name.

Collaborator Author

I think this function calculates the total number of rels (edges) across all relLabels.
copy_rel_csv just uses this function to get the next rel ID. In the future, we may have other callers that just want to know the total number of rels in the catalog, so I would prefer not to rename it.

Contributor

In general, we should plan for the present, not the future. There are two good reasons for this: 1) it ensures the current state of the code reflects where we are, and 2) we are generally very bad at predicting how the code will evolve, so we make mistakes. So let's do the renaming now; we can rename again later if that situation arises.
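As a rough illustration of the rename discussed in this thread, the sketch below names the function after what its only current caller needs. `CatalogSketch` and `RelLabelStats` are hypothetical stand-ins for the real catalog classes, and the sketch assumes rel IDs are dense and start at 0, so that the next ID equals the total number of rels.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-label bookkeeping: how many rels a rel label holds.
struct RelLabelStats {
    uint64_t numRels;
};

// Hypothetical stand-in for the catalog content class.
class CatalogSketch {
public:
    void addRelLabel(uint64_t numRels) { relLabels.push_back(RelLabelStats{numRels}); }

    // Named for its purpose at the call site: the next unused rel ID.
    // Under the dense-ID assumption this is the sum of rels over all labels.
    uint64_t getNextRelID() const {
        uint64_t nextRelID = 0;
        for (const RelLabelStats& label : relLabels) {
            assert(nextRelID + label.numRels >= nextRelID); // guard against wraparound
            nextRelID += label.numRels;
        }
        return nextRelID;
    }

private:
    std::vector<RelLabelStats> relLabels;
};
```

If a second caller later needs the total count for its own sake, a separately named accessor can be introduced at that point, which is the sequencing the reviewer argues for.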

test/copy_csv/copy_csv_test.cpp (resolved)
test/runner/e2e_ddl_test.cpp (outdated, resolved)
test/runner/e2e_ddl_test.cpp (outdated, resolved)
createDBAndConn();
catalog = conn->database->getCatalog();
}

void initWithoutLoadingGraph() {
createDBAndConn();
createDBAndConn(true /* testRecovery */);
Contributor

Why are you testing recovery here now whereas before you were not?

Collaborator Author

@acquamarin acquamarin Aug 16, 2022

If we are testing recovery, we shouldn't load the CSV.
Otherwise, we should load the CSV in createDBAndConnection.

test/test_utility/include/test_helper.h (outdated, resolved)
@acquamarin acquamarin force-pushed the ddl-copy-csv branch 2 times, most recently from c60f02a to e0ec838 Compare August 17, 2022 03:45
src/main/database.cpp (outdated, resolved)
src/storage/in_mem_copier/in_mem_node_copier.cpp (outdated, resolved)
src/storage/store/include/rel_table.h (resolved)
src/storage/store/nodes_metadata.cpp (outdated, resolved)
test/runner/e2e_ddl_test.cpp (outdated, resolved)
@acquamarin acquamarin merged commit 68559c4 into master Aug 17, 2022
@acquamarin acquamarin deleted the ddl-copy-csv branch August 17, 2022 16:22
4 participants