Replace loader with ddl and copyCSV #747

acquamarin · 2022-08-12T05:17:25Z

This PR adds the copyCSV backend (without transaction support) and replaces the loader with DDL statements and COPY_CSV.

DDL grammar:
CREATE NODE: CREATE NODE TABLE person (ID INT64, NAME STRING, PRIMARY KEY (ID))
If the user doesn't specify a primary key, the system uses the first property as the primary key.
CREATE REL: CREATE REL knows (FROM PERSON | ANIMAL TO PERSON, TIME DATE, MANY_MANY)
COPY CSV grammar:
COPY person FROM "dataset/person.csv" (ESCAPE="", DELIM=';', QUOTE='"',LIST_BEGIN='{',LIST_END='}')
The parsing options in parentheses are optional.
Supported parsing options:
ESCAPE, DELIM, QUOTE, LIST_BEGIN, LIST_END
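
To make the role of the parsing options concrete, here is a minimal C++ sketch of how a line could be split under them. It is not the code from this PR: `CSVParseOptions`, `splitLine`, and the default values are hypothetical names and assumptions chosen for illustration, and treating a delimiter inside a list literal as part of the field is likewise an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the options named in the COPY grammar.
// The default values below are assumptions made for this sketch.
struct CSVParseOptions {
    char escape = '\\';
    char delim = ',';
    char quote = '"';
    char listBegin = '[';
    char listEnd = ']';
};

// Splits one CSV line into fields. A delimiter inside a quoted section or
// inside a list literal does not end the field (an assumption of this sketch).
// The character that follows an escape character is copied verbatim.
std::vector<std::string> splitLine(const std::string& line, const CSVParseOptions& opts) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    int listDepth = 0;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == opts.escape && i + 1 < line.size()) {
            current += line[++i]; // keep the escaped character as-is
        } else if (c == opts.quote) {
            inQuotes = !inQuotes; // quote characters are dropped from the field
        } else if (!inQuotes && c == opts.listBegin) {
            ++listDepth;
            current += c;
        } else if (!inQuotes && c == opts.listEnd && listDepth > 0) {
            --listDepth;
            current += c;
        } else if (c == opts.delim && !inQuotes && listDepth == 0) {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current); // last field has no trailing delimiter
    return fields;
}
```

With `delim = ';'`, `listBegin = '{'`, and `listEnd = '}'`, as in the COPY example above, a line such as `1;"Alice;Bob";{a;b}` would be split into three fields under these assumptions.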

Refactors all tests to use the DDL commands and COPY_CSV.

src/catalog/catalog.cpp

src/common/csv_reader/csv_reader.cpp

src/antlr4/Cypher.g4

src/parser/transformer.cpp

src/planner/enumerator.cpp

src/storage/in_mem_builder/include/in_mem_node_builder.h

test/copy_csv/copy_csv_fault_test.cpp

test/runner/e2e_update_test.cpp

test/test_utility/test_helper.cpp

src/catalog/catalog.cpp

src/processor/physical_plan/operator/copy_csv/copy_node_csv.cpp

src/storage/in_mem_builder/include/in_mem_rel_builder.h

src/catalog/include/catalog.h

src/main/connection.cpp

src/main/database.cpp

src/storage/store/include/nodes_store.h

src/storage/store/include/rels_store.h

semihsalihoglu-uw

I have a set of medium-level comments, but it would be good to go over them together on a Zoom call, as there are quite a few of them.

Besides what I left in the comments, I have some more high-level comments:

In the DDL grammar, I noticed that users need to specify the multiplicity of the rel tables, e.g., MANY_TO_MANY. Let's default to MANY_TO_MANY and not require users to specify this. The fewer things users have to specify for the system to just work, the better the usability.
Make a note somewhere that, when transactions are implemented, the database or connection must stop all transactions before it starts running copyCSV.
Test what happens when read queries and the copyCSV query run concurrently.
Update the design doc with the new changes: https://docs.google.com/document/d/1Q3LDTnm1DJFVZfPI7Dt5Lr1zbkT5sTaUnh0mTaK7GbU/edit
Flexible headers: Double-check what DuckDB assumes about the structure of csv files. Do they have headers? A header gives users the flexibility to provide the columns of the table in the csv file in an order other than the one the DBMS expects. For example, if the table's columns were (ID, Name, Age) but someone gave a csv file with columns Name, ID, Age, and this was indicated in a header, the copy would still succeed. It looks like DuckDB (https://duckdb.org/docs/sql/statements/copy) allows users to specify a header, which probably gives them this flexibility. I want that feature, and it should be implemented in an immediate second PR (or after the transactions work if you are already in the middle of it). So open an issue about this.

This feature still requires a design. I would add a special "Unstructured" column to the csv that stores the unstructured properties of nodes.
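
The column reordering that a header would enable can be sketched in C++ as follows. This is only an illustration of the idea, not a proposed design: `mapHeaderToSchema` is a hypothetical helper, exact name matching is an assumption, and the special "Unstructured" column is not covered.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// For each table column, finds the index of the CSV field that holds it.
// Returns false if the header is missing a table column, repeats a name,
// or has a different number of columns than the table (assumed rules).
bool mapHeaderToSchema(const std::vector<std::string>& tableColumns,
    const std::vector<std::string>& csvHeader, std::vector<std::size_t>& mapping) {
    mapping.clear();
    if (tableColumns.size() != csvHeader.size()) {
        return false;
    }
    std::unordered_map<std::string, std::size_t> positionInCSV;
    for (std::size_t i = 0; i < csvHeader.size(); ++i) {
        // emplace reports failure when the name was already inserted
        if (!positionInCSV.emplace(csvHeader[i], i).second) {
            return false;
        }
    }
    for (const std::string& column : tableColumns) {
        const auto it = positionInCSV.find(column);
        if (it == positionInCSV.end()) {
            mapping.clear();
            return false;
        }
        mapping.push_back(it->second);
    }
    return true;
}
```

For a table with columns (ID, Name, Age) and a csv header of Name, ID, Age, the resulting mapping is {1, 0, 2}: the ID value of each row is read from csv field 1, Name from field 0, and Age from field 2.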

This is not related to this PR but it has lately started bothering me, so I'll put it here: Can you open an issue to rename Unstructured -> Semistructured?

dataset/copy-csv-empty-lists-test/vPerson.csv

src/antlr4/Cypher.g4

src/binder/binder.cpp

semihsalihoglu-uw · 2022-08-15T20:15:45Z

src/catalog/catalog.cpp

@@ -426,14 +387,30 @@ void CatalogContent::readFromFile(const string& directory, bool isForWALRecord)
    FileUtils::closeFile(fileInfo->fd);
 }

+uint64_t CatalogContent::getTotalNumRels() const {


If this is being used to get nextRelID, call it getNextRelID(). The purpose of a function, certainly of a public function, should determine its name.

acquamarin · 2022-08-16

I think what the function does is calculate the total number of rels (edges) across all relLabels.
copy_rel_csv just uses this function to get the next rel ID. In the future, we may have other callers that just want to know the total number of rels in the catalog, so I would prefer not to rename it.

semihsalihoglu-uw · 2022-08-17

In general we should plan for the present, not the future. There are two good reasons for this: 1) it ensures the current state of the code reflects where we are, and 2) we are generally very bad at predicting how the code will evolve, so we make mistakes. So let's do the renaming now; we can rename again later if that situation arises.
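
As a rough illustration of the naming point in this thread, the sketch below names the function after what its caller needs. It is not the code from the diff: `RelLabelInfo` is a hypothetical stand-in for the catalog's per-label data, and the sketch assumes rel IDs are dense and start at 0, so that the next unused ID equals the total number of rels.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-rel-label information kept in the catalog.
struct RelLabelInfo {
    uint64_t numRels;
};

// Named after its purpose for the caller. Under the assumption that rel IDs
// are dense and start at 0, the next unused ID is the sum of all rel counts.
uint64_t getNextRelID(const std::vector<RelLabelInfo>& relLabels) {
    uint64_t nextRelID = 0;
    for (const RelLabelInfo& label : relLabels) {
        nextRelID += label.numRels;
    }
    return nextRelID;
}
```

If a caller later needs the total count for its own sake, a separately named function can be added or this one renamed at that point.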

test/copy_csv/copy_csv_test.cpp

test/runner/e2e_ddl_test.cpp

semihsalihoglu-uw · 2022-08-16T04:44:15Z

test/runner/e2e_ddl_test.cpp

        createDBAndConn();
        catalog = conn->database->getCatalog();
    }

    void initWithoutLoadingGraph() {
-        createDBAndConn();
+        createDBAndConn(true /* testRecovery */);


Why are you testing recovery here now whereas before you were not?

acquamarin · 2022-08-16

If we are trying to test recovery, we shouldn't try to load the csv.
Otherwise, we should try to load the csv in createDBAndConn.

test/test_utility/include/test_helper.h

semihsalihoglu-uw · 2022-08-17T05:22:10Z

src/catalog/catalog.cpp

@@ -426,14 +387,30 @@ void CatalogContent::readFromFile(const string& directory, bool isForWALRecord)
    FileUtils::closeFile(fileInfo->fd);
 }

+uint64_t CatalogContent::getTotalNumRels() const {


In general we should plan for the present, not the future. There are two good reasons for this: 1) it ensures the current state of the code reflects where we are, and 2) we are generally very bad at predicting how the code will evolve, so we make mistakes. So let's do the renaming now; we can rename again later if that situation arises.

src/main/database.cpp

src/storage/in_mem_copier/in_mem_node_copier.cpp

src/processor/include/physical_plan/operator/copy_csv/copy_csv.h

src/storage/store/include/rel_table.h

src/storage/store/nodes_metadata.cpp

test/runner/e2e_ddl_test.cpp

acquamarin requested a review from semihsalihoglu-uw August 12, 2022 05:36

ray6080 self-requested a review August 12, 2022 06:31

andyfengHKU reviewed Aug 12, 2022

ray6080 reviewed Aug 12, 2022

acquamarin force-pushed the ddl-copy-csv branch 3 times, most recently from d219e14 to 0e84644 Compare August 15, 2022 02:19

semihsalihoglu-uw requested changes Aug 16, 2022

acquamarin force-pushed the ddl-copy-csv branch 2 times, most recently from c60f02a to e0ec838 Compare August 17, 2022 03:45

semihsalihoglu-uw approved these changes Aug 17, 2022

replace loader with ddl and copy csv

4b1a2ac

acquamarin force-pushed the ddl-copy-csv branch from e0ec838 to 4b1a2ac Compare August 17, 2022 16:03

acquamarin merged commit 68559c4 into master Aug 17, 2022

acquamarin deleted the ddl-copy-csv branch August 17, 2022 16:22

acquamarin mentioned this pull request Aug 17, 2022

Add transactionTestType #753

Closed
