
Fix md.concat error when there are same fetch chunk data #3285

Merged
merged 3 commits into mars-project:master on Nov 4, 2022

Conversation

zhongchun
Contributor

What do these changes do?

A GraphContainsCycleError is raised when concatenating two DataFrames that contain the same fetch chunks.

Related issue number

Fixes #3284

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@zhongchun zhongchun requested a review from a team as a code owner October 25, 2022 12:21
@qinxuye
Collaborator

qinxuye commented Oct 28, 2022

What's the root cause of this issue?

@zhongchun
Contributor Author

> What's the root cause of this issue?

Take this example:

```python
import mars
import mars.dataframe as md
import mars.tensor as mt

mars.new_session()

data = {"A": [i for i in range(10)]}
df0 = md.DataFrame(data)
# df1 and df2 both select column "A" from df0
df1 = df0[['A']]
df2 = df0[['A']]
df1 = df1.execute()
df2 = df2.execute()
# concatenating the two executed frames fetches the same chunk data twice
df3 = md.concat([df1, df2], axis=1)
df3.execute()
```

There will be one subtask with 4 nodes, as shown below, and the two DataFrameFetch chunk data have the same key and the same id. When serializing the subtask, the two chunk data are treated as one, because the buffered_base wrapper caches chunks identified by chunk data key and id, so the second chunk data is serialized as a Placeholder.

In an earlier version we introduced a new fusion algorithm that puts the two chunk data branches into one subtask, and this causes the problem.

[screenshot: the subtask graph with 4 nodes, including the two duplicate DataFrameFetch chunks]
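
To illustrate the mechanism, here is a minimal sketch of how caching by (key, id) during serialization collapses the second fetch chunk into a Placeholder. The names below (Chunk, Placeholder, serialize_graph) are hypothetical and do not correspond to Mars' actual serializer API:

```python
# Hypothetical sketch of the serialization behaviour described above.
from dataclasses import dataclass


@dataclass
class Chunk:
    key: str  # tokenized from op/params, so identical-looking fetches share it
    id: str   # in the buggy case the fetch copies the original chunk's id


@dataclass
class Placeholder:
    ref: tuple  # (key, id) of the chunk it stands for


def serialize_graph(chunks):
    """Serialize chunks, caching by (key, id) the way the buffered_base wrapper does."""
    cache = {}
    out = []
    for chunk in chunks:
        ident = (chunk.key, chunk.id)
        if ident in cache:
            # The second occurrence collapses into a Placeholder, so two distinct
            # graph nodes come back as a single node after deserialization.
            out.append(Placeholder(ident))
        else:
            cache[ident] = chunk
            out.append(chunk)
    return out


fetch_a = Chunk(key="k1", id="i1")
fetch_b = Chunk(key="k1", id="i1")  # same key AND same id -> treated as one
print(serialize_graph([fetch_a, fetch_b]))
# [Chunk(key='k1', id='i1'), Placeholder(ref=('k1', 'i1'))]
```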

@wjsi
Member

wjsi commented Oct 31, 2022

This does not actually resolve the issue. You need to fix the algorithm to avoid cycles instead of removing the optimization outright.

@fyrestone
Contributor

fyrestone commented Oct 31, 2022

> There will be one subtask with 4 nodes, as shown above, and the two DataFrameFetch chunk data have the same key and the same id. […]

Why do these two chunks have the same index? I think there may be some bugs in tiling.

@zhongchun
Contributor Author

> This does not actually resolve the issue. You need to fix the algorithm to avoid cycles instead of removing the optimization outright.

buffered_base in BaseSerializer is not necessary, because its parent class already buffers the serial method.
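
As a rough sketch of the structure being described (class names are illustrative, not the real Mars serializer hierarchy): if the parent class already buffers the serial result, an extra buffered_base-style cache keyed by (key, id) in the subclass is redundant, and it is that extra cache which produced the Placeholder above:

```python
# Hypothetical sketch: a parent serializer that already buffers results,
# making a second (key, id)-based cache in the subclass redundant.

class Serializer:
    def __init__(self):
        self._buffer = {}  # parent-level buffering of serialized objects

    def serial(self, obj):
        ident = id(obj)  # buffer by object identity
        if ident not in self._buffer:
            self._buffer[ident] = self._do_serial(obj)
        return self._buffer[ident]

    def _do_serial(self, obj):
        raise NotImplementedError


class BaseSerializer(Serializer):
    def _do_serial(self, obj):
        # With the parent already buffering, a buffered_base-style cache keyed
        # by (obj.key, obj.id) here adds nothing and can wrongly merge distinct
        # chunks that happen to share key and id.
        return {"key": obj.key, "id": obj.id}


class FakeChunk:
    def __init__(self, key, id_):
        self.key, self.id = key, id_


s = BaseSerializer()
c = FakeChunk("k1", "i1")
assert s.serial(c) is s.serial(c)  # second call hits the parent-level buffer
```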

@zhongchun
Contributor Author

> Why do these two chunks have the same index? I think there may be some bugs in tiling.

The key point is whether the two fetches should have the same id and index. Having the same id affects the calculation result, while having the same index does not.

@fyrestone
Contributor

> The key point is whether the two fetches should have the same id and index. Having the same id affects the calculation result, while having the same index does not.

@qinxuye @wjsi Can we just use a random id instead of the tokenized id? Currently, one stage may contain multiple chunks with the same key in different subtasks.

@qinxuye
Collaborator

qinxuye commented Nov 1, 2022

The id is generated by id(obj) IIRC, but when generating the fetch it may have copied the original id. If this causes unexpected consequences, we can remove that logic as long as everything still works.
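
A minimal sketch of the two id strategies under discussion, assuming a hypothetical build_fetch-style helper (these names are illustrative, not Mars APIs):

```python
# Hypothetical sketch of copying vs. regenerating the id when building a fetch chunk.
import itertools

_id_counter = itertools.count()


def _new_id():
    # Mimics an id(obj)-style unique identifier per object instance.
    return f"id-{next(_id_counter)}"


class ChunkData:
    def __init__(self, key, id_):
        self.key = key
        self.id = id_


def build_fetch_copying_id(src):
    # Behaviour described above: the fetch chunk reuses the source id, so two
    # fetches of the same chunk are indistinguishable by (key, id).
    return ChunkData(key=src.key, id_=src.id)


def build_fetch_fresh_id(src):
    # Alternative: keep the tokenized key (needed for data/meta lookup) but
    # regenerate the id so each fetch node stays a distinct graph node.
    return ChunkData(key=src.key, id_=_new_id())


src = ChunkData(key="tokenized-key", id_="original-id")
print(build_fetch_copying_id(src).id, build_fetch_fresh_id(src).id)
# original-id id-0  -> fresh ids keep the two fetch nodes distinguishable
```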

@fyrestone
Contributor

fyrestone commented Nov 1, 2022

> The id is generated by id(obj) IIRC …

The duplicate id and key may cause unexpected problems in a distributed system. How does Mars optimize the computation with the tokenized key?

@qinxuye
Collaborator

qinxuye commented Nov 1, 2022

I guess just regenerating the id is OK; the mechanism for the key should be kept.

@fyrestone
Contributor

> I guess just regenerating the id is OK; the mechanism for the key should be kept.

For this issue, maybe regenerating the id is OK. But tokenizing the key costs CPU and may introduce some bugs to Mars; I want to check whether the optimization actually works.

@zhongchun
Contributor Author

zhongchun commented Nov 1, 2022

The key could be the same if the tileable, chunk, or op has the same properties, while the id should be different, in my opinion.

@fyrestone
Contributor

> The key could be the same if the chunk or op has the same properties, while the id should be different, in my opinion.

Yes, but the meta and store management use key as the lookup key, so different subtasks may overwrite each other's meta and store data. Also, there are hundreds of reset_key() calls to reset the key. If Mars does not get much benefit from the tokenized key, we can use a random id or uuid for the key to make Mars faster and cleaner.

My suggestion:

  • Merge id into key. Use id(obj) instead of obj.id.
  • Use a random id or uuid for the key instead of a tokenized key. If we want to compute a logical id, we can tokenize it then (see the sketch below).

@qinxuye @wjsi
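
For reference, a rough sketch of the trade-off between a tokenized (deterministic) key and a random key; tokenize_key and random_key are hypothetical helpers, not Mars APIs:

```python
# Illustrative comparison of the two key strategies discussed above.
import hashlib
import uuid


def tokenize_key(op_name: str, params: dict) -> str:
    # Deterministic key: identical op + params always hash to the same key,
    # which enables reuse/dedup but costs CPU and can collide for logically
    # distinct chunks that merely look alike.
    payload = repr((op_name, sorted(params.items()))).encode()
    return hashlib.md5(payload).hexdigest()


def random_key() -> str:
    # Random key: cheap and always unique, but gives up recognizing
    # identical computations by key alone.
    return uuid.uuid4().hex


print(tokenize_key("DataFrameIndex", {"col": "A"}))
print(tokenize_key("DataFrameIndex", {"col": "A"}))  # same as the line above
print(random_key())                                   # different every run
```

The design question is whether the deduplication and meta/store reuse enabled by a deterministic key is worth its CPU cost and collision risk; a random key sidesteps both but loses that reuse.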

@qinxuye
Collaborator

qinxuye commented Nov 3, 2022

@fyrestone You can try generating the key randomly and see if everything still works well.

@zhongchun
Contributor Author

@qinxuye @wjsi Could you please take a look at this PR and let me know if there are any questions?

Collaborator

@qinxuye qinxuye left a comment


LGTM

Contributor

@fyrestone fyrestone left a comment


LGTM

@fyrestone fyrestone merged commit f4b1cc0 into mars-project:master Nov 4, 2022
qianduoduo0904 pushed a commit to qianduoduo0904/mars that referenced this pull request Dec 9, 2022