Orca: Align the data analysis method of dataloader and dataframe #5763

leonardozcm · 2022-09-14T15:07:20Z

Description

When a model has only one input, and that input is a list or tuple of tensors, we should not unpack it into separate positional arguments.

Basic assumption:
There are only three possible types for features: torch.Tensor, list/tuple, and dict.

Types of features and labels for each kind of input:

| input | features | labels |
|---|---|---|
| dataframe/XShards | a list or tuple of tensors, or a dict (XShards only) | a single tensor, tuple, or list |
| raydataset | a single tensor or a list of tensors | a single tensor or a list of tensors |
| dataloader | a single tensor, or a list of the user's inputs (all elements except the last one, which is the label) | the last element, which is the label |

When will features be a single tensor?

The dataloader yields features consisting of only one tensor.
The user specifies only one feature column (df, XShards and raydataset).

When will features be a list or tuple?

The dataloader yields features consisting of more than one tensor, or of an object that is not a tensor.
The user specifies more than one feature column (df, XShards and raydataset).

When will features be a dict?
Only when the input is an XShards of dictionary.
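
As a rough illustration of the column-based rules above, here is a minimal sketch. The helper `collect_features` and the plain Python lists standing in for tensors are hypothetical and are not part of Orca; the sketch only shows how the number of feature columns could decide between a single value and a list.

```python
def collect_features(row, feature_cols):
    """Pick the feature columns out of one record (column name -> value).

    Hypothetical helper: one column yields the value itself,
    several columns yield a list with one entry per column.
    """
    values = [row[col] for col in feature_cols]
    return values[0] if len(values) == 1 else values


# Plain lists stand in for tensors here.
row = {"f1": [1.0, 2.0], "f2": [3.0], "label": 0}

single = collect_features(row, ["f1"])          # one feature column
multiple = collect_features(row, ["f1", "f2"])  # more than one feature column
```

With one feature column the model would receive a single value; with several it would receive one argument per column.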

1. Why the change?

#5762
In some cases, the model takes x, a list of two tensors, as a single input:

    def forward(self, x, bboxes=None):
        x = x[:]  # avoid pass by reference
        x = self.s1(x)

However, our torchrunner unpacks this list into two separate arguments:
https://github.com/intel-analytics/BigDL/blob/affe54803c320afd4fc0631dc3fa02f8be1cfcdc/python/orca/src/bigdl/orca/learn/pytorch/training_operator.py#L279
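
The mismatch can be reproduced without PyTorch. The sketch below is only an illustration: `ListInputModel` is a hypothetical stand-in for the model above, and the strings "t1" and "t2" stand in for the two tensors.

```python
class ListInputModel:
    """Stand-in for a model whose forward takes ONE argument: a list of two inputs."""

    def forward(self, x, bboxes=None):
        x = x[:]  # copy, as in the snippet above
        return x, bboxes

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


model = ListInputModel()
features = ["t1", "t2"]  # placeholders for two tensors

intended = model(features)   # x is the whole list, bboxes stays None
unpacked = model(*features)  # x is only "t1", and "t2" lands in bboxes
```

Unpacking silently binds the second tensor to `bboxes`, which is why no error is raised at the call site and the failure only shows up later inside the model.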

2. User API changes

none

3. Summary of the change

~~before: output = self.model(*features)~~
~~after: output = self.model(*features) if not isSingleListInput else self.model(features)~~

If data is a PyTorch dataloader or a dataloader creator, reload_dataloader_creator will combine all elements except the label into a list; if the features consist of only one tensor, that tensor is kept as is:


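def make_dataloader_list_wrapper(func):
    import torch

    def make_feature_list(batch):
        # Apply the wrapped function first, if there is one.
        if func is not None:
            batch = func(batch)
        # Everything except the last element is a feature; the last element is the target.
        *features, target = batch
        # A lone tensor is passed through as is instead of being wrapped in a list.
        if len(features) == 1 and torch.is_tensor(features[0]):
            features = features[0]
        return features, target

    return make_feature_list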

The features are then parsed here:

        features, target = batch
        # Compute output.
        with self.timers.record("fwd"):
            if torch.is_tensor(features):
                output = self.model(features)
            elif isinstance(features, (tuple, list)):
                output = self.model(*features)

This ensures that *features is consistent with the user's input.
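
To see how the wrapper and the forward dispatch interact, here is a minimal sketch that runs without PyTorch. `FakeTensor`, `is_tensor`, `call_model` and `record_args` are hypothetical stand-ins introduced only for this illustration; `make_feature_list` mirrors the logic shown above, with the optional `func` step left out.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, name):
        self.name = name


def is_tensor(obj):
    # Plays the role of torch.is_tensor in this sketch.
    return isinstance(obj, FakeTensor)


def make_feature_list(batch):
    *features, target = batch
    if len(features) == 1 and is_tensor(features[0]):
        features = features[0]
    return features, target


def call_model(model, features):
    # Mirrors the forward dispatch: a tensor is passed as is, a list is unpacked.
    if is_tensor(features):
        return model(features)
    return model(*features)


def record_args(*args):
    # Fake "model" that simply reports the positional arguments it received.
    return args


x1, x2, y = FakeTensor("x1"), FakeTensor("x2"), FakeTensor("y")

# (x, y): features stays a single tensor, so the model gets one argument.
f, _ = make_feature_list((x1, y))
single = call_model(record_args, f)

# (x1, x2, y): features becomes [x1, x2], so the model gets two arguments.
f, _ = make_feature_list((x1, x2, y))
multi = call_model(record_args, f)

# ((x1, x2), y): features becomes [(x1, x2)], so the tuple arrives as one argument.
f, _ = make_feature_list(((x1, x2), y))
single_list = call_model(record_args, f)
```

In all three cases the model is called with the same arguments the user placed before the label in the batch.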

The current df, XShards and raydataset logic is already correct, so we leave it unchanged.

4. How to test?

leonardozcm · 2022-09-19T07:09:30Z

Now we assume that features is always a list of tensors (one or more), to align the dataloader with df.

@hkvision would you mind taking a look at this?

leonardozcm · 2022-09-19T08:30:22Z

http://10.112.231.51:18888/job/BigDL-Orca-PR-Validation/1075/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-2.4-py37-ray/1249/

jason-dai · 2022-09-20T02:27:46Z

I don't think the logic here is correct.

If the user uses DataFrame (Spark or XShards of pandas) input, we assume that each input (either x or y) corresponds to a column in the dataframe, and therefore we should pass a list of inputs (x or y) to the model if there are multiple columns, just like model.fit in Keras does.
If the user uses TF Dataset or PT Dataloader, we assume that the results returned by these objects are already prepared by the user, and we should not change their format.
If the user uses XShards of dictionary, I think we also assume each item (x or y) corresponds to the argument in model.fit in Keras.

leonardozcm · 2022-09-20T03:21:24Z

> I don't think the logic here is correct.
>
> If the user uses DataFrame (Spark or XShards of pandas) input, we assume that each input (either x or y) corresponds to a column in the dataframe, and therefore we should pass a list of inputs (x or y) to the model if there are multiple columns, just like model.fit in Keras does.
>
> If the user uses TF Dataset or PT Dataloader, we assume that the results returned by these objects are already prepared by the user, and we should not change their format.
>
> If the user uses XShards of dictionary, I think we also assume each item (x or y) corresponds to the argument in model.fit in Keras.

Sorry for the outdated description; it is now updated.

leonardozcm · 2022-09-20T05:07:21Z

If the input is an XShards of dictionary, in which case features is a dict, I think we should pass **features:

        with self.timers.record("fwd"):
            output = self.model(**features)  # if the input is an XShards of dictionary
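
A minimal sketch of how the dict case could sit next to the list case is shown below. `forward_with_features` and `two_input_model` are hypothetical names used only for illustration, and treating every other type as a single tensor is an assumption of this sketch. Note that with `**features` the dictionary keys must match the parameter names of the model's forward method.

```python
def forward_with_features(model, features):
    """Hypothetical dispatch on the type of features."""
    if isinstance(features, dict):
        return model(**features)  # keys become keyword arguments
    if isinstance(features, (tuple, list)):
        return model(*features)   # items become positional arguments
    return model(features)        # assumed here to be a single tensor


def two_input_model(column1, column2):
    # Stand-in for a model whose forward takes two named inputs.
    return column1 + column2


out_dict = forward_with_features(two_input_model, {"column1": 1, "column2": 2})
out_list = forward_with_features(two_input_model, [1, 2])

# A key that does not match a parameter name fails with a TypeError.
mismatch_error = None
try:
    forward_with_features(two_input_model, {"x": 1, "column2": 2})
except TypeError as err:
    mismatch_error = err
```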

leonardozcm · 2022-09-20T10:21:30Z

http://10.112.231.51:18888/job/BigDL-Orca-PR-Validation/1085/

python/orca/src/bigdl/orca/learn/pytorch/training_operator.py

hkvision · 2022-09-21T11:52:57Z

The modification looks good to me.

Do we have a test for a dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

jason-dai · 2022-09-21T12:28:59Z

> The modification looks good to me.
>
> Do we have a test for a dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

We need more tests to cover different cases (e.g., a label containing multiple inputs).

hkvision · 2022-09-21T12:33:00Z

> The modification looks good to me.
> Do we have a test for a dataloader that returns more than 2 values (e.g. feature1, feature2, label)?
>
> We need more tests to cover different cases (e.g., a label containing multiple inputs).

I suppose that at this moment we may not be able to perfectly support multi-label outputs, especially when the dataset returns all the items as a tuple (x1, x2, x3, y1, y2)?

hkvision · 2022-09-21T13:26:09Z

python/orca/src/bigdl/orca/learn/pytorch/training_operator.py

+                invalidInputError(False,
+                                  "Features should either be tensor, list/tuple or dict, "
+                                  "but got {}".format(type(features)))
+
            if isinstance(output, tuple) or isinstance(output, list):
                # Then target is also assumed to be a tuple or list.
                loss = self.criterion(*output, *target)


This should support multi-label output if target is already a list, right?

> This should support multi-label output if target is already a list, right?

That's right.
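
The loss call in the diff above can be sketched in isolation as follows. `compute_loss` and `two_head_criterion` are hypothetical names, and the single-output branch is an assumption of this sketch, since the diff only shows the tuple/list branch.

```python
def compute_loss(criterion, output, target):
    if isinstance(output, (tuple, list)):
        # target is then assumed to be a tuple or list as well
        return criterion(*output, *target)
    # Assumed fallback for a single output and a single target.
    return criterion(output, target)


def two_head_criterion(out1, out2, y1, y2):
    # Stand-in for a user-defined criterion taking two outputs and two labels.
    return abs(out1 - y1) + abs(out2 - y2)


loss = compute_loss(two_head_criterion, (1.0, 2.0), [1.5, 2.0])
single_loss = compute_loss(lambda out, y: abs(out - y), 3.0, 1.0)
```

The criterion therefore has to accept outputs first and targets second, in the same order in which the model and the data source produce them.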

leonardozcm · 2022-09-21T16:20:25Z

> Do we have a test for a dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

No, I only found a multi-input test case for df:
https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

leonardozcm · 2022-09-21T16:28:31Z

> We need more tests to cover different cases (e.g., a label containing multiple inputs).
>
> I suppose that at this moment we may not be able to perfectly support multi-label outputs, especially when the dataset returns all the items as a tuple (x1, x2, x3, y1, y2)?

Yes, we already support multi-label in df/XShards and raydataset, but not in dataloader. Maybe we can enable it in another PR?

leonardozcm · 2022-09-21T16:39:21Z

Unit tests are still lacking for the following inputs:

- Multi-input dataloader x1, x2, y (the dataloader yields features consisting of more than one tensor)
- XShards of dict {"column1":x1, "column2":x2}, "column3":y (when the input is an XShards of dictionary)
- Single list input dataloader (x1, x2), y (the dataloader yields a feature that is an object other than a tensor)
- Single dict input dataloader {"x":x}, y (to make sure it does not trigger model(**features))
- A more complicated dataloader case: (x1,x2), {"x3": x3}, x4, y

hkvision · 2022-09-22T02:05:13Z

> We need more tests to cover different cases (e.g., a label containing multiple inputs).
>
> I suppose that at this moment we may not be able to perfectly support multi-label outputs, especially when the dataset returns all the items as a tuple (x1, x2, x3, y1, y2)?
>
> Yes, we already support multi-label in df/XShards and raydataset, but not in dataloader. Maybe we can enable it in another PR?

The question is: given return a, b, c, d, e, is it possible for us to detect which ones are labels and which ones are features?

leonardozcm · 2022-09-22T02:18:00Z

> The question is: given return a, b, c, d, e, is it possible for us to detect which ones are labels and which ones are features?

I think it may be difficult to detect automatically; only the user knows which ones are labels. But we could provide an extra argument for the user to specify the label indexes, like [3, 4]?
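
One possible shape for such an argument is sketched below. The function `split_batch` and its `label_indexes` parameter are hypothetical and only illustrate the proposal; they are not an existing Orca API.

```python
def split_batch(batch, label_indexes):
    """Split a flat batch into features and labels using user-given positions."""
    label_set = set(label_indexes)
    features = [item for i, item in enumerate(batch) if i not in label_set]
    labels = [batch[i] for i in label_indexes]
    return features, labels


# Strings stand in for the tensors of a batch (x1, x2, x3, y1, y2).
features, labels = split_batch(("x1", "x2", "x3", "y1", "y2"), [3, 4])
```

Keeping the labels in the order given by `label_indexes` lets the user control how they line up with the model outputs in the criterion.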

leonardozcm · 2022-09-22T09:05:42Z

http://10.112.231.51:18888/job/BigDL-Orca-PR-Validation/1092/
http://10.112.231.51:18888/job/BigDL-PRVN-Check-Log4j/2869/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-2.4-py37-ray/1273/
http://10.112.231.51:18888/job/BigDL-PRVN-ray-ctx/807/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Python-Spark-3.1-py37-Horovod/1602/
http://10.112.231.51:18888/job/ZOO-PR-BigDL-Pip-JEP-Examples-Spark2.4/1061/console

hkvision · 2022-09-22T10:36:12Z

Do we have a multi-label unit test, particularly for Spark DataFrame input?

leonardozcm · 2022-09-22T11:03:29Z

> Do we have a multi-label unit test, particularly for Spark DataFrame input?

Yes:
https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

hkvision · 2022-09-22T11:22:02Z

> Do we have a multi-label unit test, particularly for Spark DataFrame input?
>
> Yes: https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

I mean multiple labels, not multiple inputs...

leonardozcm · 2022-09-22T11:37:26Z

> I mean multiple labels, not multiple inputs...

Then no. Can we add it in another PR?

Fix single list input

533e4a7

leonardozcm requested a review from hkvision September 14, 2022 15:07

leonardozcm added 10 commits September 14, 2022 23:46

fix

e32c68b

df -> ndarray

d6914c2

wrapper

e74d981

Merge remote-tracking branch 'upstream/main' into single_list_input

53fd0cc

ray backend

00a2241

refine

3781668

pyspark

1dcf9c6

code style

4b39214

make_list

54d2f88

valdataloader None

254e172

leonardozcm added the orca label Sep 19, 2022

codestyle

ba96bd6

dict bs support

0ae92a6

leonardozcm added 3 commits September 20, 2022 14:45

if features is dict

5cdcef0

consistent assumption

ab897c5

code style

45b548c

leonardozcm changed the title ~~Orca: Allow single list input in ray and spark backend~~ Orca: Align the data analysis method of dataloader and dataframe Sep 21, 2022

uts

cc5c7d9

hkvision reviewed Sep 21, 2022

hkvision reviewed Sep 21, 2022

typo

3256692

more uts

8217352

leonardozcm mentioned this pull request Sep 22, 2022

Orca: Support multi-labels analysis in dataloader #5918

Open

leonardozcm merged commit 8d0f17e into intel-analytics:main Sep 23, 2022

leonardozcm mentioned this pull request Oct 24, 2022

[WIP]Orca: adjust the format of xshards data input in unit test of Orca torch #6175

Closed

1 task

This was referenced Mar 1, 2023

Orca: Reload DF or shard dataloader to keep consistence with pytorch dataloader #7728

Merged

Pytorch Estimator support more input type #7750

Open
