
Orca: Reload DF or shard dataloader to keep consistency with PyTorch dataloader #7728

Merged
13 commits merged into intel-analytics:main from pre-format-df on Mar 23, 2023

Conversation

@leonardozcm (Contributor) commented Mar 1, 2023

Description

Consider the case where the features from XShards are a list of tensors; our runner will unpack them by mistake:
https://github.com/intel-analytics/BigDL/blob/main/python/orca/src/bigdl/orca/learn/pytorch/torch_runner.py#L423

So this PR simply applies the modifications of #5763 to DF and XShards again.

In short, an ndarray dataset built from a DF or shards previously formatted a batch as [x1, x2], [y1, y2] when there were multiple inputs or outputs, while a PyTorch DataLoader formats such a batch as x1, x2, [y1, y2]. This PR keeps both cases consistent: x1, x2, [y1, y2].

We may also support a more flexible way to split and reorganize the feature and label columns based on the length of feature_cols.
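
For illustration, here is a minimal, self-contained sketch (not the Orca code; all names are illustrative) showing how PyTorch's default collation produces the x1, x2, [y1, y2] layout when __getitem__ returns a flat tuple:

    import torch
    from torch.utils.data import DataLoader, Dataset

    class MultiInputDataset(Dataset):
        """Two feature tensors and one label per sample."""
        def __init__(self, x1, x2, y):
            self.x1, self.x2, self.y = x1, x2, y

        def __len__(self):
            return len(self.y)

        def __getitem__(self, i):
            # A flat tuple collates to x1_batch, x2_batch, y_batch,
            # i.e. the x1, x2, [y1, y2] layout this PR targets.
            return self.x1[i], self.x2[i], self.y[i]

    x1, x2 = torch.randn(8, 4), torch.randn(8, 3)
    y = torch.randint(0, 2, (8,))
    loader = DataLoader(MultiInputDataset(x1, x2, y), batch_size=4)
    batch_x1, batch_x2, batch_y = next(iter(loader))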

@leonardozcm changed the title from "Orca: Reload dataloader when df or shard" to "Orca: Reload DF or shard dataloader to keep comsistence with pytorch dataloader" on Mar 7, 2023
@leonardozcm changed the title from "Orca: Reload DF or shard dataloader to keep comsistence with pytorch dataloader" to "Orca: Reload DF or shard dataloader to keep consistence with pytorch dataloader" on Mar 8, 2023
return result[0]
return result

features = convert_for_cols(row, feature_cols)
# For pytorch we format multi-input as `f1, f2, label` instead of `[f1, f2], label`
Contributor: We format multi-input as f1, f2, label instead of [f1, f2], label to align with PyTorch DataLoader.
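
A hedged sketch of the flattening rule this comment describes (the helper name format_sample and the shapes are illustrative, not the actual convert_for_cols contract):

    def format_sample(features, label):
        # Multi-input: unpack features to the top level -> (f1, f2, label)
        if isinstance(features, (list, tuple)) and len(features) > 1:
            return (*features, label)
        # Single input: keep the plain (f, label) pair
        return features, label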

@@ -92,7 +96,7 @@ def __getitem__(self, i):
         data_loader = DataLoader(dataset, **params)
         return data_loader

-    return data_creator
+    return reload_dataloader_creator(data_creator)
Contributor: Why do we need to reload here?

leonardozcm (Author): We will disable reload_dataloader_creator in the next PR; for now we just keep everything the same before we create the dataloader.
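
For context, a hedged sketch of the general shape such a wrapper could take (an assumption based on this thread, not the actual BigDL implementation):

    def reload_dataloader_creator(data_creator):
        # Assumption: wrap the user's creator so the DataLoader is rebuilt
        # on every call rather than cached, keeping its state consistent
        # with a freshly constructed PyTorch DataLoader.
        if data_creator is None:
            return None

        def reloaded(config, batch_size):
            return data_creator(config, batch_size)

        return reloaded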

Contributor: OK.

@@ -92,7 +96,7 @@ def __getitem__(self, i):
         data_loader = DataLoader(dataset, **params)
         return data_loader

-    return data_creator
+    return reload_dataloader_creator(data_creator)


 def parse_model_dir(model_dir):
Contributor: Shouldn't we remove the reload in the data_creator if branch (in fit)?

@@ -413,7 +421,8 @@ def _dataframe_to_xshards(data, feature_cols, label_cols=None,
                           schema,
                           feature_cols,
                           label_cols,
-                          accept_str_col))
+                          accept_str_col,
+                          True))
Contributor: Add the keyword argument here so that readers can understand it more easily.

@@ -442,7 +451,8 @@ def dataframe_to_xshards_of_feature_dict(data, feature_cols, label_cols=None,
                           schema,
                           feature_cols,
                           label_cols,
-                          accept_str_col))
+                          accept_str_col,
+                          True))
Contributor: Will always setting this to True impact the TF-related logic? On the other hand, is there anywhere this argument is False?

leonardozcm (Author): Actually, we only set this to True in the PyTorch estimator code paths to prevent any influence on TF.
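
A hedged illustration of the two points above; flatten_feature is a hypothetical keyword name standing in for the positional True, and the call shape is simplified:

    # PyTorch estimator path: request the flattened x1, x2, [y] layout.
    shards = _dataframe_to_xshards(df, feature_cols, label_cols,
                                   accept_str_col=accept_str_col,
                                   flatten_feature=True)  # hypothetical kwarg name

    # TF path: leave the flag at its default so TF logic is unaffected.
    shards = _dataframe_to_xshards(df, feature_cols, label_cols,
                                   accept_str_col=accept_str_col)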

@@ -76,7 +76,11 @@ def __len__(self):
         return get_size(self.y)

     def __getitem__(self, i):
-        return index_data(self.x, i), index_data(self.y, i)
+        index_data_x = index_data(self.x, i)
leonardozcm (Author): Since we can only allocate x and y as two separate parts here, we need to re-form a multi-input as [x1, x2] and treat it as the whole x.
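
The diff above is truncated; below is a hedged completion under the author's stated intent (keeping a multi-input grouped as the whole x is an assumption about the final code):

    def __getitem__(self, i):
        index_data_x = index_data(self.x, i)
        index_data_y = index_data(self.y, i)
        if isinstance(index_data_x, list):
            # Assumption: group multiple feature tensors so x stays one item;
            # downstream code can then split features from labels by
            # len(feature_cols) to emit the x1, x2, [y1, y2] layout.
            index_data_x = tuple(index_data_x)
        return index_data_x, index_data_y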

@@ -318,6 +318,7 @@ def add_row(data, results, current):
     feature_lists = None
     label_lists = None
     counter = 0
+    feature_tail = len(feature_cols) if feature_cols else None
Contributor: Remove this.

@leonardozcm merged commit 9584314 into intel-analytics:main on Mar 23, 2023
24 of 25 checks passed
@leonardozcm deleted the pre-format-df branch on March 23, 2023 02:11
hkvision pushed a commit to hkvision/ipex-llm that referenced this pull request Aug 18, 2023
…dataloader (intel-analytics#7728)

* Orca: reload dataloader when df or shard

* reformat df list

* Only in pytorch

* specify by feature len