Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orca: Align the data analysis method of dataloader and dataframe #5763

Merged
merged 19 commits into from Sep 23, 2022

Conversation

leonardozcm
Copy link
Contributor

@leonardozcm leonardozcm commented Sep 14, 2022

Description

when model has only one input which is a list or tuple consists of tensors, we should not extract it in args.

Basic Assumption:
There are only three possible types in features: torch.Tensor, list\tuple and dict

features and lables type list:

features labels
dataframe/Xshards a (list or tuple) of tensor, or a dict(Xshards only) a single tensor, tuple, list
raydataset a single tensor or a list of tensors a single tensor or a list of tensors
dataloader a single tensor or a list of user's input(all elements beside last one which is lable ) last one which is label

When will features be a single tensor?

  1. dataloader yields feature consists of only one tensor.
  2. there is only one feature_column specified by user.(df, xshard and rayDataset)

When will features be a list or tuple?

  1. dataloader yields feature consists of more than one tensor or object is not a tensor.
  2. there is more than only one feature_column specified by user.(df, xshard and rayDataset)

When will features be a dict?
only when input is XShards of dictionary

1. Why the change?

#5762
In some case, the model does take x as a list of two tensors as input:

    def forward(self, x, bboxes=None):
        x = x[:]  # avoid pass by reference
        x = self.s1(x)

code
but our torchrunner will extract this as two separated ones:
https://github.com/intel-analytics/BigDL/blob/affe54803c320afd4fc0631dc3fa02f8be1cfcdc/python/orca/src/bigdl/orca/learn/pytorch/training_operator.py#L279

2. User API changes

none

3. Summary of the change

before: output = self.model(*features)
after: output = self.model(*features) if not isSingleListInput else self.model(features)

if data is a pt dataloader of creator, reload_dataloader_creator wil combine all elements besides lables into a list, and if feature consists of only one tensor it remains the same:


def make_dataloader_list_wrapper(func):
    import torch
    def make_feature_list(batch):
        if func is not None:
            batch = func(batch)
        *features, target = batch
        if len(features) == 1 and torch.is_tensor(features[0]):
            features = features[0]
        return features, target

    return make_feature_list

and will parse features here:

        features, target = batch
        # Compute output.
        with self.timers.record("fwd"):
            if torch.is_tensor(features):
                output = self.model(features)
            elif isinstance(features, (tuple, list)):
                output = self.model(*features)

This ensure the consistency of *features and user input.

And current df, xshard and raydataset logic is right, we keep it safe.

4. How to test?

  • N/A
  • Unit test
  • Application test
  • Document test
  • ...

@leonardozcm
Copy link
Contributor Author

Now we assume any features is a list of tensors(one or more) to align dataloader with df

@hkvision would you mind taking a look at this?

@jason-dai
Copy link
Contributor

jason-dai commented Sep 20, 2022

I don't think the logic here is correct.

  1. If the user uses Dataframe (Spark or Xshards of pandas) input, we assume that each input (either x or y) corresponds to a column in the dataframe, and therefore we should pass a list of inputs (x or y) to the model if there are multiple columns, just like model.fit in Keras does

  2. If the user uses TF Dataset and PT Dataloader, we assume that result returned by these objects are already prepared by the user and we should not change its format

  3. If the user uses XShards of dictionary, I think we also assumes each item (x or y) corresponds to the argument in model.fit in Keras.

@leonardozcm
Copy link
Contributor Author

I don't think the logic here is correct.

  1. If the user uses Dataframe (Spark or Xshards of pandas) input, we assume that each input (either x or y) corresponds to a column in the dataframe, and therefore we should pass a list of inputs (x or y) to the model if there are multiple columns, just like model.fit in Keras does
  2. If the user uses TF Dataset and PT Dataloader, we assume that result returned by these objects are already prepared by the user and we should not change its format
  3. If the user uses XShards of dictionary, I think we also assumes each item (x or y) corresponds to the argument in model.fit in Keras.

sorry for outdated description, now updated.

@leonardozcm
Copy link
Contributor Author

leonardozcm commented Sep 20, 2022

if inputs is XShards of dictionary, in which case features is a dict, should pass **features i think

        with self.timers.record("fwd"):
            output = self.model(**features) # if features is  XShards of dictionary

@leonardozcm
Copy link
Contributor Author

@leonardozcm leonardozcm changed the title Orca: Allow single list input in ray and spark backend Orca: Align the data analysis method of dataloader and dataframe Sep 21, 2022
@hkvision
Copy link
Contributor

The modification looks good to me.

Do we have test for dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

@jason-dai
Copy link
Contributor

The modification looks good to me.

Do we have test for dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

Need more tests to cover different cases (e.g., label containing multiple inputs)

@hkvision
Copy link
Contributor

The modification looks good to me.
Do we have test for dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

Need more tests to cover different cases (e.g., label containing multiple inputs)

I suppose at this moment we may not be able to perfect support multi-label outputs, especially when the dataset return all the items as a tuple (x1, x2, x3, y1, y2)?

invalidInputError(False,
"Features should either be tensor, list/tuple or dict, "
"but got {}".format(type(features)))

if isinstance(output, tuple) or isinstance(output, list):
# Then target is also assumed to be a tuple or list.
loss = self.criterion(*output, *target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should support multi-label output if target is already a list right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should support multi-label output if target is already a list right?

that's right

@leonardozcm
Copy link
Contributor Author

Do we have test for dataloader that returns more than 2 values (e.g. feature1, feature2, label)?

No, only find multi-input test case for df
https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

@leonardozcm
Copy link
Contributor Author

leonardozcm commented Sep 21, 2022

Need more tests to cover different cases (e.g., label containing multiple inputs)

I suppose at this moment we may not be able to perfect support multi-label outputs, especially when the dataset return all the items as a tuple (x1, x2, x3, y1, y2)?

Yes, we have supported multi-label in df/xshards and raydataset, but not in dataloader. Maybe we can enable it in another pr?

@leonardozcm
Copy link
Contributor Author

leonardozcm commented Sep 21, 2022

Lack of uts for input

  • Multi-input dataloader x1, x2, y (dataloader yields feature consists of more than one tensor)
  • Xshards of dict {"column1":x1, "column2":x2}, "column3":y (when input is XShards of dictionary)
  • Single list input dataloader (x1, x2), y (dataloader yields feature consists of object is not a tensor.)
  • Single dict input dataloader {"x":x}, y (just in case it will not trigger model(**features))
  • more complicated situation in dataloader (x1,x2), {"x3": x3}, x4, y

@hkvision
Copy link
Contributor

Need more tests to cover different cases (e.g., label containing multiple inputs)

I suppose at this moment we may not be able to perfect support multi-label outputs, especially when the dataset return all the items as a tuple (x1, x2, x3, y1, y2)?

Yes, we have supported multi-label in df/xshards and raydataset, but not in dataloader. Maybe we can enable it in another pr?

The question is that is it possible for us to detect return a, b, c, d, e which one are labels and which one are features?

@leonardozcm
Copy link
Contributor Author

The question is that is it possible for us to detect return a, b, c, d, e which one are labels and which one are features?

I think it may be difficult to detect automatically, only user knows which ones are labels. But we can provide an extra argument for user to specify label indexes like [3, 4]?

@hkvision
Copy link
Contributor

Do we have multi-label unit test, particularly for Spark DataFrame intput?

@leonardozcm
Copy link
Contributor Author

Do we have multi-label unit test, particularly for Spark DataFrame intput?

yes
https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

@hkvision
Copy link
Contributor

Do we have multi-label unit test, particularly for Spark DataFrame intput?

yes https://github.com/intel-analytics/BigDL/blob/main/python/orca/test/bigdl/orca/learn/ray/pytorch/test_estimator_pyspark_backend.py#L369

I mean multiple label not multiple inputs...

@leonardozcm
Copy link
Contributor Author

leonardozcm commented Sep 22, 2022

I mean multiple label not multiple inputs...

no then, may we add in another pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants