-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[master < T221] Periodic iterate module #221
Conversation
|
||
mg::Client::Init(); | ||
|
||
mg::Client::Params params{.host = "localhost", .port = 7687}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I can do something with environment variables, let me check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, I left few comments and one optimization point to possibly benchmark
|
||
mg::Client::Finalize(); | ||
|
||
record.Insert(kReturnSuccess, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense for every batch executed to return success status, instead of returning one for all batches? This way we can test that batches are executed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you decide to go in this direction, it makes sense to add one more test for batching, to check that batches are executed correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, one success flag is better because even if some batch failed, what will user do with it? He will see false and then what. Regarding what @antoniofilipovic said
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe returning number of batches would be good. The user would explicitly see in how many batches, the query was executed so he/she could maybe tune it next time + you can test that the code is working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will adjust this so it returns the number of batches
return new_columns; | ||
} | ||
|
||
std::string ConstructQueryPreffix(ParamNames names) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what is faster, but one optimization idea is to check istream
(input stream) and to fill input stream and then return that string that input stream created from function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can do quick benchmark and report results, instead of using boost what we talked on daily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is utils::Join in Memgraph also if the performance is problem with boost
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't use Join since that's not in the included headers for mage so I just written my own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can copy-paste it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job but I think there are things you can improve in your code. I would suggest adding test when batch is not 1000 or 1 because that would prove a bug that rows variable isn't incremented and test which works with relationships also not just vertices.
|
||
mg::Client::Finalize(); | ||
|
||
record.Insert(kReturnSuccess, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, one success flag is better because even if some batch failed, what will user do with it? He will see false and then what. Regarding what @antoniofilipovic said
e2e/periodic_iterate_test/test_periodic_iterate_invalid_batch_argument_negative/test.yml
Outdated
Show resolved
Hide resolved
|
||
mg::Client::Finalize(); | ||
|
||
record.Insert(kReturnSuccess, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe returning number of batches would be good. The user would explicitly see in how many batches, the query was executed so he/she could maybe tune it next time + you can test that the code is working as expected.
|
||
auto query_params = ConstructParams(columns, batch); | ||
|
||
mg::Client::Params session_params{.host = "localhost", .port = 7687}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but I was thinking more in sense that I want to make session open as short as possible?
|
||
mg::Client::Init(); | ||
|
||
mg::Client::Params params{.host = "localhost", .port = 7687}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I can do something with environment variables, let me check
return new_columns; | ||
} | ||
|
||
std::string ConstructQueryPreffix(ParamNames names) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't use Join since that's not in the included headers for mage so I just written my own
if (!client->Execute(input_query)) { | ||
record.Insert(kReturnSuccess, false); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CREATE INDEX ON :SuperNode;
CREATE INDEX ON :Node;
CREATE INDEX ON :Node(id);
CREATE (:SuperNode);
FOREACH (i IN range(1, 1000000) | CREATE (:Node {id: i}));
// UNWIND range(1, 1000000) as x
// MATCH (s:SuperNode), (n:Node {id: x})
// CREATE (s)-[:HAS_REL_TO]->(n);
CALL periodic.iterate(
"UNWIND range(1, 1000000) AS x RETURN x",
"MATCH (s:SuperNode), (n:Node {id: x}) CREATE (s)-[:HAS_REL_TO]->(n)",
{batch_size: 5000})
YIELD * RETURN *;
wanted to write this in the documentation, which is a clear example why periodic iterate helps (until we maybe do some changes in the frame, which solve it out of the box)
if (!client->Execute(input_query)) { | ||
record.Insert(kReturnSuccess, false); | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem does not occur if MATCH
is the first command since it happens only once. However, in a LOAD CSV
heavy environment where some of the values can be dynamic, and supernodes are mostly fetched in the same manner, it is beneficial.
|
||
mg::Client::Finalize(); | ||
|
||
record.Insert(kReturnSuccess, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will adjust this so it returns the number of batches
@@ -1,4 +1,4 @@ | |||
[submodule "cpp/memgraph"] | |||
path = cpp/memgraph | |||
url = https://github.com/memgraph/memgraph.git | |||
branch = release/2.8 | |||
branch = fix-creating-mgp-value-from-nullptr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be rewritten to release/v2.9 once it's up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can bump this up then.
std::vector<std::vector<mg::Value>> batch; | ||
batch.reserve(batch_size); | ||
int rows = 0; | ||
while (const auto maybe_result = client->FetchOne()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it should exist since const auto maybe_result = client->FetchOne()
will evaluate to bool
std::vector<std::vector<mg::Value>> batch; | ||
batch.reserve(batch_size); | ||
int rows = 0; | ||
while (const auto maybe_result = client->FetchOne()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's why it's in the while condition
|
||
auto query_params = ConstructParams(columns, batch); | ||
|
||
mg::Client::Params session_params{.host = "localhost", .port = 7687}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am quite sure that both ways are correct, what I wrote made more sense to me but do as you want 😄
std::vector<std::vector<mg::Value>> batch; | ||
batch.reserve(batch_size); | ||
int rows = 0; | ||
while (const auto maybe_result = client->FetchOne()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't then the condition be:
while (const auto maybe_result = client->FetchOne(); maybe_result)
return res; | ||
} | ||
|
||
auto Join(const std::vector<std::string> &strings, const std::string &delimiter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::string instead of auto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes sense to do that here as we are not just returning output from other function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do that in every method
|
||
auto query_params = ConstructParams(columns, batch); | ||
|
||
mg::Client::Params session_params{.host = "localhost", .port = 7687}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both ways are correct, do as you want 😄
std::vector<std::vector<mg::Value>> batch; | ||
batch.reserve(batch_size); | ||
int rows = 0; | ||
while (const auto maybe_result = client->FetchOne()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But which type is then maybe_result?
auto ConstructWithStatement(const ParamNames &names) { | ||
std::vector<std::string> with_entity_vector; | ||
for (const auto &node_name : names.node_names) { | ||
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, node_name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emplace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And without GetGraphFirstClassEntityAlias
as we want to not use ctor and move ctor then
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, node_name)); | ||
} | ||
for (const auto &rel_name : names.relationship_names) { | ||
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, rel_name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emplace
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, rel_name)); | ||
} | ||
for (const auto &prim_name : names.primitive_names) { | ||
with_entity_vector.push_back(GetPrimitiveEntityAlias(kBatchRowInternalName, prim_name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emplace
with_entity_vector.push_back(GetPrimitiveEntityAlias(kBatchRowInternalName, prim_name)); | ||
} | ||
|
||
auto with_variables = fmt::format("WITH {}", Join(with_entity_vector, ", ")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return immediately
return res; | ||
} | ||
|
||
auto Join(const std::vector<std::string> &strings, const std::string &delimiter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more efficient version of this would be reserve the total size of the string upfront but 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, check how is this implemented in memgraph. in src/utils/string.hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
std::vector<std::string> primitive_names; | ||
}; | ||
|
||
auto ExtractParamNames(const std::vector<std::string> &columns, const std::vector<mg::Value> &batch_row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also put ParamNames as the return type instead of auto as we are returning type from within the function not just returning the output of other functions.
@Josipmrden fix comments which @as51340 suggested, but otherwise looks good to me after that is fixed |
@vpavicic I still need docs for this, will do tomorrow |
@Josipmrden release note as well please thank you you're awesome |
@Josipmrden ok? |
Description
In the pull request, periodic.iterate module has been added to support periodic transactions over a query. It helps when having to allocate a lot of deltas that need to be read all over again during the each passing of the operator tree.
Pull request type
######################################
Reviewer checklist (the reviewer checks this part)
Module/Algorithm
######################################