-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace loader with ddl and copyCSV #747
Conversation
src/processor/physical_plan/operator/copy_csv/copy_node_csv.cpp
Outdated
Show resolved
Hide resolved
src/processor/physical_plan/operator/copy_csv/copy_node_csv.cpp
Outdated
Show resolved
Hide resolved
d219e14
to
0e84644
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a set of medium level comments but it would be good to go over these again together on a Zoom call as there is quite a bit of them.
Besides what I left in the comments, I have some more high-level comments:
-
In the DDL Grammar, I noticed that users need to specify the multiplicity of the rel tables, e.g., MANY_TO_MANY. Let's default to MANY_TO_MANY and not require users to specify this. The fewer things users need to specify and the system just works, the better usability experience we will provide.
-
Take a note somewhere that when implementing transactions, change the database or connection to stop all transactions before starting to run copyCSV.
-
Test what happens when read queries and the copyCSV query run concurrently.
-
Update the design doc with the new changes: https://docs.google.com/document/d/1Q3LDTnm1DJFVZfPI7Dt5Lr1zbkT5sTaUnh0mTaK7GbU/edit
-
Flexible Headers: Double check what DuckDB assume about the structure of the csv files. Do they have headers? Headers has the flexibility that the users do not have to provide the columns of the table in the csv file in the order that the DBMS expects. For example, if the csv file was (ID, Name, Age), but some one gave a csv file with Name, ID, Age, and this was indicated in a header, we would succeed. It looks like https://duckdb.org/docs/sql/statements/copy DuckDB allows users to specify a header, which probably gives them this flexibility. I would want that feature and this should be implemented in an immediate second PR (or after the transactions if you are already in the middle of it). So open an issue about this.
This feature still requires a design and I would have a special "Unstructured" column in the csv, which stores Unstructured properties of nodes.
- This is not related to this PR but it has lately started bothering me, so I'll put it here: Can you open an issue to rename Unstructured -> Semistructured?
src/catalog/catalog.cpp
Outdated
@@ -426,14 +387,30 @@ void CatalogContent::readFromFile(const string& directory, bool isForWALRecord) | |||
FileUtils::closeFile(fileInfo->fd); | |||
} | |||
|
|||
uint64_t CatalogContent::getTotalNumRels() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is being used to get nextRelID, call it getNextRelID(). The "function/purpose" of a function, certainly a public function, should determine its name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what the function is doing is to calculate the total number of rels(edges) in all relLabels.
The copy_rel_csv just utilize this function to getNextRelID. In the future, we may have other callers that just want to know the total number of rels in the catalog. So i would not prefer to rename it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we shouldn't plan for the future but present. This is for two good reasons: this ensures our current state of the code reflects where we are and 2) we are generally very bad at predicting how the code will evolve, so make mistakes. So let's do the renaming now and we can rename again later if that situation arises.
test/runner/e2e_ddl_test.cpp
Outdated
createDBAndConn(); | ||
catalog = conn->database->getCatalog(); | ||
} | ||
|
||
void initWithoutLoadingGraph() { | ||
createDBAndConn(); | ||
createDBAndConn(true /* testRecovery */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you testing recovery here now whereas before you were not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are trying to testRecovery, we shouldn't try to load the csv.
Otherwise, we should try to load csv in createDBAndConnection
c60f02a
to
e0ec838
Compare
src/catalog/catalog.cpp
Outdated
@@ -426,14 +387,30 @@ void CatalogContent::readFromFile(const string& directory, bool isForWALRecord) | |||
FileUtils::closeFile(fileInfo->fd); | |||
} | |||
|
|||
uint64_t CatalogContent::getTotalNumRels() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we shouldn't plan for the future but present. This is for two good reasons: this ensures our current state of the code reflects where we are and 2) we are generally very bad at predicting how the code will evolve, so make mistakes. So let's do the renaming now and we can rename again later if that situation arises.
src/processor/include/physical_plan/operator/copy_csv/copy_csv.h
Outdated
Show resolved
Hide resolved
e0ec838
to
4b1a2ac
Compare
This PR adds the copyCSV backend (without transaction support) and replaces the loader with ddl and COPY_CSV
DDL gramar:
CREATE NODE: CREATE NODE TABLE person (ID INT64, NAME STRING, PRIMARY KEY (ID))
If the user doesn't give primary key, the system will use the first property as the primary key.
CREATE REL: CREATE REL knows (FROM PERSON | ANIMAL TO PERSON, TIME DATE, MANY_MANY)
COPY CSV grammar:
COPY person FROM "dataset/person.csv" (ESCAPE="", DELIM=';', QUOTE='"',LIST_BEGIN='{',LIST_END='}')
Where parsing options are optional fields.
Supported parsing options:
ESCAPE, DELIM, QUOTE, LIST_BEGIN, LIST_END
Refactors all tests to use DDL command and COPY_CSV.