
Fix md.concat error when there are same fetch chunk data #3285

Merged
merged 3 commits into mars-project:master on Nov 4, 2022

Conversation

zhongchun
Contributor

What do these changes do?

A GraphContainsCycleError is raised when concatenating two DataFrames that contain the same fetch chunks.

Related issue number

Fixes #3284

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@zhongchun zhongchun requested a review from a team as a code owner October 25, 2022 12:21
@qinxuye
Collaborator

qinxuye commented Oct 28, 2022

What's the root cause of this issue?

@zhongchun
Contributor Author

> What's the root cause of this issue?

Take this example:

```python
import mars
import mars.dataframe as md
import mars.tensor as mt

mars.new_session()

data = {"A": [i for i in range(10)]}
df0 = md.DataFrame(data)
# df1 and df2 both select column "A" from df0
df1 = df0[['A']]
df2 = df0[['A']]
df1 = df1.execute()
df2 = df2.execute()
# concatenating the two executed frames fetches the same chunk data twice
df3 = md.concat([df1, df2], axis=1)
df3.execute()
```

There will be one subtask with 4 nodes, as shown below, and the two DataFrameFetch chunk data have the same key and the same id. When serializing the subtask, the two chunk data are treated as one, because the buffered_base wrapper caches chunks identified by chunk data key and id, so the second chunk data is serialized as a Placeholder.

In an earlier version we introduced a new fusion algorithm that puts the two chunk data branches into one subtask, and this causes the problem.

[screenshot: the subtask graph with 4 nodes, including the two duplicate DataFrameFetch chunks]
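
To illustrate the mechanism, here is a minimal sketch of how caching by (key, id) during serialization collapses the second fetch chunk into a Placeholder. The names below (Chunk, Placeholder, serialize_graph) are hypothetical and do not correspond to Mars' actual serializer API:

```python
# Hypothetical sketch of the serialization behaviour described above.
from dataclasses import dataclass


@dataclass
class Chunk:
    key: str  # tokenized from op/params, so identical-looking fetches share it
    id: str   # in the buggy case the fetch copies the original chunk's id


@dataclass
class Placeholder:
    ref: tuple  # (key, id) of the chunk it stands for


def serialize_graph(chunks):
    """Serialize chunks, caching by (key, id) the way the buffered_base wrapper does."""
    cache = {}
    out = []
    for chunk in chunks:
        ident = (chunk.key, chunk.id)
        if ident in cache:
            # The second occurrence collapses into a Placeholder, so two distinct
            # graph nodes come back as a single node after deserialization.
            out.append(Placeholder(ident))
        else:
            cache[ident] = chunk
            out.append(chunk)
    return out


fetch_a = Chunk(key="k1", id="i1")
fetch_b = Chunk(key="k1", id="i1")  # same key AND same id -> treated as one
print(serialize_graph([fetch_a, fetch_b]))
# [Chunk(key='k1', id='i1'), Placeholder(ref=('k1', 'i1'))]
```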

@wjsi
Member

wjsi commented Oct 31, 2022

This does not actually resolve the issue. You need to fix the algorithm to avoid cycles instead of removing the optimization outright.

@fyrestone
Contributor

fyrestone commented Oct 31, 2022

> There will be one subtask with 4 nodes, as shown above, and the two DataFrameFetch chunk data have the same key and the same id. […]

Why do these two chunks have the same index? I think there may be some bugs in tiling.

@zhongchun
Contributor Author

> This does not actually resolve the issue. You need to fix the algorithm to avoid cycles instead of removing the optimization outright.

buffered_base in BaseSerializer is not necessary, because its parent class already buffers the serial method.
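
As a rough sketch of the structure being described (class names are illustrative, not the real Mars serializer hierarchy): if the parent class already buffers the serial result, an extra buffered_base-style cache keyed by (key, id) in the subclass is redundant, and it is that extra cache which produced the Placeholder above:

```python
# Hypothetical sketch: a parent serializer that already buffers results,
# making a second (key, id)-based cache in the subclass redundant.

class Serializer:
    def __init__(self):
        self._buffer = {}  # parent-level buffering of serialized objects

    def serial(self, obj):
        ident = id(obj)  # buffer by object identity
        if ident not in self._buffer:
            self._buffer[ident] = self._do_serial(obj)
        return self._buffer[ident]

    def _do_serial(self, obj):
        raise NotImplementedError


class BaseSerializer(Serializer):
    def _do_serial(self, obj):
        # With the parent already buffering, a buffered_base-style cache keyed
        # by (obj.key, obj.id) here adds nothing and can wrongly merge distinct
        # chunks that happen to share key and id.
        return {"key": obj.key, "id": obj.id}


class FakeChunk:
    def __init__(self, key, id_):
        self.key, self.id = key, id_


s = BaseSerializer()
c = FakeChunk("k1", "i1")
assert s.serial(c) is s.serial(c)  # second call hits the parent-level buffer
```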

@zhongchun
Contributor Author

> Why do these two chunks have the same index? I think there may be some bugs in tiling.

The key point is whether the two fetches should have the same id and index. Having the same id affects the calculation result, while having the same index does not.

@fyrestone
Contributor

> The key point is whether the two fetches should have the same id and index. Having the same id affects the calculation result, while having the same index does not.

@qinxuye @wjsi Can we just use a random id instead of the tokenized id? Currently, one stage may contain multiple chunks with the same key in different subtasks.

@qinxuye
Collaborator

qinxuye commented Nov 1, 2022

The id is generated by id(obj) IIRC, but when generating the fetch it may have copied the original id. If this causes unexpected consequences, we can remove that logic as long as everything still works.
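
A minimal sketch of the two id strategies under discussion, assuming a hypothetical build_fetch-style helper (these names are illustrative, not Mars APIs):

```python
# Hypothetical sketch of copying vs. regenerating the id when building a fetch chunk.
import itertools

_id_counter = itertools.count()


def _new_id():
    # Mimics an id(obj)-style unique identifier per object instance.
    return f"id-{next(_id_counter)}"


class ChunkData:
    def __init__(self, key, id_):
        self.key = key
        self.id = id_


def build_fetch_copying_id(src):
    # Behaviour described above: the fetch chunk reuses the source id, so two
    # fetches of the same chunk are indistinguishable by (key, id).
    return ChunkData(key=src.key, id_=src.id)


def build_fetch_fresh_id(src):
    # Alternative: keep the tokenized key (needed for data/meta lookup) but
    # regenerate the id so each fetch node stays a distinct graph node.
    return ChunkData(key=src.key, id_=_new_id())


src = ChunkData(key="tokenized-key", id_="original-id")
print(build_fetch_copying_id(src).id, build_fetch_fresh_id(src).id)
# original-id id-0  -> fresh ids keep the two fetch nodes distinguishable
```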

@fyrestone
Contributor

fyrestone commented Nov 1, 2022

> The id is generated by id(obj) IIRC …

The duplicate id and key may cause unexpected problems in a distributed system. How does Mars optimize the computation with the tokenized key?

@qinxuye
Collaborator

qinxuye commented Nov 1, 2022

I guess just regenerating the id is OK; the mechanism for the key should be kept.

@fyrestone
Contributor

> I guess just regenerating the id is OK; the mechanism for the key should be kept.

For this issue, maybe regenerating the id is OK. But tokenizing the key costs CPU and may introduce some bugs to Mars; I want to check whether the optimization actually works.

@zhongchun
Contributor Author

zhongchun commented Nov 1, 2022

The key could be the same if the tileable, chunk, or op has the same properties, while the id should be different, in my opinion.

@fyrestone
Contributor

> The key could be the same if the chunk or op has the same properties, while the id should be different, in my opinion.

Yes, but the meta and store management use key as the lookup key, so different subtasks may overwrite each other's meta and store data. Also, there are hundreds of reset_key() calls to reset the key. If Mars does not get much benefit from the tokenized key, we can use a random id or uuid for the key to make Mars faster and cleaner.

My suggestion:

  • Merge id into key. Use id(obj) instead of obj.id.
  • Use a random id or uuid for the key instead of a tokenized key. If we want to compute a logical id, we can tokenize it then (see the sketch below).

@qinxuye @wjsi
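
For reference, a rough sketch of the trade-off between a tokenized (deterministic) key and a random key; tokenize_key and random_key are hypothetical helpers, not Mars APIs:

```python
# Illustrative comparison of the two key strategies discussed above.
import hashlib
import uuid


def tokenize_key(op_name: str, params: dict) -> str:
    # Deterministic key: identical op + params always hash to the same key,
    # which enables reuse/dedup but costs CPU and can collide for logically
    # distinct chunks that merely look alike.
    payload = repr((op_name, sorted(params.items()))).encode()
    return hashlib.md5(payload).hexdigest()


def random_key() -> str:
    # Random key: cheap and always unique, but gives up recognizing
    # identical computations by key alone.
    return uuid.uuid4().hex


print(tokenize_key("DataFrameIndex", {"col": "A"}))
print(tokenize_key("DataFrameIndex", {"col": "A"}))  # same as the line above
print(random_key())                                   # different every run
```

The design question is whether the deduplication and meta/store reuse enabled by a deterministic key is worth its CPU cost and collision risk; a random key sidesteps both but loses that reuse.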

@qinxuye
Collaborator

qinxuye commented Nov 3, 2022

@fyrestone You can try generating the key randomly and see if everything still works well.

@zhongchun
Contributor Author

@qinxuye @wjsi Could you please take a look at this PR and let me know if there are any questions?

Collaborator

@qinxuye qinxuye left a comment


LGTM

Contributor

@fyrestone fyrestone left a comment


LGTM

@fyrestone fyrestone merged commit f4b1cc0 into mars-project:master Nov 4, 2022
qianduoduo0904 pushed a commit to qianduoduo0904/mars that referenced this pull request Dec 9, 2022