Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[master < T221] Periodic iterate module #221

Merged
merged 12 commits into from
Jun 27, 2023

Conversation

Josipmrden
Copy link
Collaborator

@Josipmrden Josipmrden commented Jun 12, 2023

Description

In the pull request, periodic.iterate module has been added to support periodic transactions over a query. It helps when having to allocate a lot of deltas that need to be read all over again during the each passing of the operator tree.

Pull request type

  • Algorithm/Module

######################################

Reviewer checklist (the reviewer checks this part)

Module/Algorithm

  • Core algorithm/module implementation
  • Query module implementation
  • Unit tests
  • End-to-end tests
  • Code documentation
  • README short description
  • Documentation on memgraph/docs

######################################

@Josipmrden Josipmrden changed the title Periodic iterate module [master < #221] Periodic iterate module Jun 12, 2023
@Josipmrden Josipmrden changed the title [master < #221] Periodic iterate module [master < T221] Periodic iterate module Jun 12, 2023
@Josipmrden Josipmrden self-assigned this Jun 12, 2023
@Josipmrden Josipmrden added this to the 1.8.0 milestone Jun 12, 2023
@Josipmrden Josipmrden linked an issue Jun 12, 2023 that may be closed by this pull request
@Josipmrden Josipmrden marked this pull request as ready for review June 19, 2023 09:10

mg::Client::Init();

mg::Client::Params params{.host = "localhost", .port = 7687};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be configurable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can do something with environment variables, let me check

Copy link
Contributor

@antoniofilipovic antoniofilipovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, I left few comments and one optimization point to possibly benchmark

cpp/periodic_module/periodic_iterate.cpp Show resolved Hide resolved

mg::Client::Finalize();

record.Insert(kReturnSuccess, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for every batch executed to return success status, instead of returning one for all batches? This way we can test that batches are executed correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you decide to go in this direction, it makes sense to add one more test for batching, to check that batches are executed correctly

Copy link
Collaborator

@as51340 as51340 Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, one success flag is better because even if some batch failed, what will user do with it? He will see false and then what. Regarding what @antoniofilipovic said

Copy link
Collaborator

@as51340 as51340 Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe returning number of batches would be good. The user would explicitly see in how many batches, the query was executed so he/she could maybe tune it next time + you can test that the code is working as expected.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will adjust this so it returns the number of batches

cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
return new_columns;
}

std::string ConstructQueryPreffix(ParamNames names) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what is faster, but one optimization idea is to check istream (input stream) and to fill input stream and then return that string that input stream created from function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can do quick benchmark and report results, instead of using boost what we talked on daily

Copy link
Collaborator

@as51340 as51340 Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is utils::Join in Memgraph also if the performance is problem with boost

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't use Join since that's not in the included headers for mage so I just written my own

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can copy-paste it here?

cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
Copy link
Collaborator

@as51340 as51340 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job but I think there are things you can improve in your code. I would suggest adding test when batch is not 1000 or 1 because that would prove a bug that rows variable isn't incremented and test which works with relationships also not just vertices.

cpp/CMakeLists.txt Show resolved Hide resolved
cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved

mg::Client::Finalize();

record.Insert(kReturnSuccess, true);
Copy link
Collaborator

@as51340 as51340 Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, one success flag is better because even if some batch failed, what will user do with it? He will see false and then what. Regarding what @antoniofilipovic said

cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved
cpp/periodic_module/periodic_iterate.cpp Outdated Show resolved Hide resolved

mg::Client::Finalize();

record.Insert(kReturnSuccess, true);
Copy link
Collaborator

@as51340 as51340 Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe returning number of batches would be good. The user would explicitly see in how many batches, the query was executed so he/she could maybe tune it next time + you can test that the code is working as expected.


auto query_params = ConstructParams(columns, batch);

mg::Client::Params session_params{.host = "localhost", .port = 7687};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but I was thinking more in sense that I want to make session open as short as possible?

python/tests/requirements.txt Show resolved Hide resolved

mg::Client::Init();

mg::Client::Params params{.host = "localhost", .port = 7687};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can do something with environment variables, let me check

return new_columns;
}

std::string ConstructQueryPreffix(ParamNames names) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't use Join since that's not in the included headers for mage so I just written my own

Comment on lines +188 to +190
if (!client->Execute(input_query)) {
record.Insert(kReturnSuccess, false);
return;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CREATE INDEX ON :SuperNode;
CREATE INDEX ON :Node;
CREATE INDEX ON :Node(id);

CREATE (:SuperNode);
FOREACH (i IN range(1, 1000000) | CREATE (:Node {id: i}));

// UNWIND range(1, 1000000) as x
// MATCH (s:SuperNode), (n:Node {id: x}) 
// CREATE (s)-[:HAS_REL_TO]->(n);

CALL periodic.iterate(
  "UNWIND range(1, 1000000) AS x RETURN x", 
  "MATCH (s:SuperNode), (n:Node {id: x}) CREATE (s)-[:HAS_REL_TO]->(n)", 
  {batch_size: 5000}) 
YIELD * RETURN *;

wanted to write this in the documentation, which is a clear example why periodic iterate helps (until we maybe do some changes in the frame, which solve it out of the box)

Comment on lines +188 to +190
if (!client->Execute(input_query)) {
record.Insert(kReturnSuccess, false);
return;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem does not occur if MATCH is the first command since it happens only once. However, in a LOAD CSV heavy environment where some of the values can be dynamic, and supernodes are mostly fetched in the same manner, it is beneficial.


mg::Client::Finalize();

record.Insert(kReturnSuccess, true);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will adjust this so it returns the number of batches

@@ -1,4 +1,4 @@
[submodule "cpp/memgraph"]
path = cpp/memgraph
url = https://github.com/memgraph/memgraph.git
branch = release/2.8
branch = fix-creating-mgp-value-from-nullptr
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be rewritten to release/v2.9 once it's up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can bump this up then.

std::vector<std::vector<mg::Value>> batch;
batch.reserve(batch_size);
int rows = 0;
while (const auto maybe_result = client->FetchOne()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it should exist since const auto maybe_result = client->FetchOne() will evaluate to bool

std::vector<std::vector<mg::Value>> batch;
batch.reserve(batch_size);
int rows = 0;
while (const auto maybe_result = client->FetchOne()) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's why it's in the while condition


auto query_params = ConstructParams(columns, batch);

mg::Client::Params session_params{.host = "localhost", .port = 7687};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite sure that both ways are correct, what I wrote made more sense to me but do as you want 😄

std::vector<std::vector<mg::Value>> batch;
batch.reserve(batch_size);
int rows = 0;
while (const auto maybe_result = client->FetchOne()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't then the condition be:
while (const auto maybe_result = client->FetchOne(); maybe_result)

return res;
}

auto Join(const std::vector<std::string> &strings, const std::string &delimiter) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string instead of auto

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it makes sense to do that here as we are not just returning output from other function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do that in every method


auto query_params = ConstructParams(columns, batch);

mg::Client::Params session_params{.host = "localhost", .port = 7687};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both ways are correct, do as you want 😄

std::vector<std::vector<mg::Value>> batch;
batch.reserve(batch_size);
int rows = 0;
while (const auto maybe_result = client->FetchOne()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But which type is then maybe_result?

auto ConstructWithStatement(const ParamNames &names) {
std::vector<std::string> with_entity_vector;
for (const auto &node_name : names.node_names) {
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, node_name));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emplace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And without GetGraphFirstClassEntityAlias as we want to not use ctor and move ctor then

with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, node_name));
}
for (const auto &rel_name : names.relationship_names) {
with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, rel_name));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emplace

with_entity_vector.push_back(GetGraphFirstClassEntityAlias(kBatchRowInternalName, rel_name));
}
for (const auto &prim_name : names.primitive_names) {
with_entity_vector.push_back(GetPrimitiveEntityAlias(kBatchRowInternalName, prim_name));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emplace

with_entity_vector.push_back(GetPrimitiveEntityAlias(kBatchRowInternalName, prim_name));
}

auto with_variables = fmt::format("WITH {}", Join(with_entity_vector, ", "));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return immediately

return res;
}

auto Join(const std::vector<std::string> &strings, const std::string &delimiter) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more efficient version of this would be reserve the total size of the string upfront but 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, check how is this implemented in memgraph. in src/utils/string.hpp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

std::vector<std::string> primitive_names;
};

auto ExtractParamNames(const std::vector<std::string> &columns, const std::vector<mg::Value> &batch_row) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also put ParamNames as the return type instead of auto as we are returning type from within the function not just returning the output of other functions.

@antoniofilipovic
Copy link
Contributor

@Josipmrden fix comments which @as51340 suggested, but otherwise looks good to me after that is fixed

@Josipmrden Josipmrden merged commit 09e3354 into main Jun 27, 2023
4 checks passed
@Josipmrden Josipmrden deleted the add-periodic-iterate-cpp-module branch June 27, 2023 11:25
@Josipmrden
Copy link
Collaborator Author

@vpavicic I still need docs for this, will do tomorrow

@vpavicic
Copy link

@vpavicic I still need docs for this, will do tomorrow

@Josipmrden release note as well please thank you you're awesome

@vpavicic
Copy link

@Josipmrden ok?
When executing complex queries, the period module allows batching results
from one query into another to improve execution time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT] Add periodic iterate module
4 participants