
Add support for Ray Train and Ray Datasets in training #1391

Merged: 47 commits into master on Oct 22, 2021

Conversation

@tgaddair (Collaborator) commented on Oct 15, 2021:

Fixes #1354.
Fixes #1331.

@tgaddair changed the title from "Add support for RaySGD and RayDatasets in training" to "Add support for Ray Train and Ray Datasets in training" on Oct 20, 2021
@tgaddair marked this pull request as ready for review on October 20, 2021 16:31
@tgaddair requested a review from ShreyaR on October 20, 2021 16:31
@ShreyaR (Contributor) left a comment:

This looks awesome! Added mostly minor comments -- feel free to merge whenever updated!

    min_cpus = min(r['CPU'] for r in resources)
    num_workers = len(resources)
    resources_per_worker = {
        'CPU': min(min_cpus / 2 + 1, min_cpus)
    }
@ShreyaR (Contributor) commented:

resources_per_worker could potentially be fractional here -- you could use min_cpus // 2.

@tgaddair (Collaborator, Author) replied:

Good point. Ray does allow fractional CPUs, though, so I think it's fine to leave as is.
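
For reference, a minimal sketch (using the plain ray.remote task API, not the trainer code from this PR) showing that Ray accepts fractional CPU requests:

    import ray

    ray.init(num_cpus=2)

    # Fractional CPU requests are valid in Ray: each task reserves half a
    # CPU, so up to four of them can run concurrently on a 2-CPU node.
    @ray.remote(num_cpus=0.5)
    def probe(i):
        return i

    print(ray.get([probe.remote(i) for i in range(4)]))  # [0, 1, 2, 3]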

    @@ -110,14 +154,121 @@ def train_online(self, *args, **kwargs):
            return results


    class RayTrainer(BaseTrainer):
        def __init__(self, horovod_kwargs, trainer_kwargs):
            def train_fn(executable_kwargs=None, remote_model=None, train_shards=None, val_shards=None, test_shards=None, **kwargs):
@ShreyaR (Contributor) commented:

Could you add type hints here? It's not super clear from the code what train_shards, val_shards, etc. should be.

@tgaddair (Collaborator, Author) replied:

Yes! Good catch.
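
For illustration, a hypothetical type-hinted version of the signature (the concrete shard types are an assumption and may differ from what the PR ends up using):

    from typing import Any, Dict, List, Optional

    import ray
    from ray.data import Dataset

    # Hypothetical annotations: executable_kwargs is assumed to be a plain
    # dict of kwargs, remote_model a Ray object reference to the model, and
    # the *_shards arguments per-worker lists of Ray Dataset shards.
    def train_fn(
        executable_kwargs: Optional[Dict[str, Any]] = None,
        remote_model: Optional[ray.ObjectRef] = None,
        train_shards: Optional[List[Dataset]] = None,
        val_shards: Optional[List[Dataset]] = None,
        test_shards: Optional[List[Dataset]] = None,
        **kwargs,
    ):
        ...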

    training_set = split_dataset(dataset, split, 0)
    validation_set = split_dataset(dataset, split, 1)
    test_set = split_dataset(dataset, split, 2)
    distinct_values = dataset[split].drop_duplicates()
@ShreyaR (Contributor) commented:

General Q in case we want to keep duplicates: is it possible that duplicates in the dataset were an artifact of upsampling some points in the dataset?

@tgaddair (Collaborator, Author) replied:

Here the drop_duplicates is only on the split column. The idea is that this is a more efficient way to determine whether or not the dataset has, for example, a validation split, so we don't have to call len(val_df) (which is expensive).

In general, I don't think Ludwig does any duplicate removal across the entire dataset.
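
As an illustration, a minimal pandas sketch of the pattern (toy data, not the PR's code):

    import pandas as pd

    # Toy dataframe with a split column: 0 = train, 1 = validation, 2 = test.
    df = pd.DataFrame({"feature": range(6), "split": [0, 0, 0, 1, 2, 2]})

    # drop_duplicates on the split column alone reveals which splits exist,
    # without materializing each split just to call len() on it.
    distinct_values = df["split"].drop_duplicates()
    has_validation = 1 in distinct_values.values
    has_test = 2 in distinct_values.values
    print(has_validation, has_test)  # True True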

@tgaddair (Collaborator, Author) followed up:

Added comment.

        training_set_metadata.get(DATA_TRAIN_HDF5_FP)
    )

    def create_inference_dataset(
@ShreyaR (Contributor) commented:

Looks like create_inference_dataset and create have the same functionality. The tag argument looks like it isn't being used. Is it possible to consolidate both methods?

@tgaddair (Collaborator, Author) replied:

Yeah, it should be possible once we remove Petastorm. We need it for Petastorm (and, by extension, here to fit the interface) because we use different datasets for training vs. prediction. Will add a TODO.

@tgaddair (Collaborator, Author) followed up:

Changed create_inference_dataset to call create.
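
A hypothetical sketch of what that consolidation could look like (the parameter lists here are assumptions, not the PR's exact signatures):

    # Hypothetical: create_inference_dataset delegates to create, keeping the
    # unused tag argument only to satisfy the shared Petastorm-era interface.
    def create(self, dataset, config, training_set_metadata):
        ...  # build and return the dataset

    def create_inference_dataset(self, dataset, tag, config, training_set_metadata):
        # TODO: fold this into create once Petastorm support is removed.
        return self.create(dataset, config, training_set_metadata)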


    @contextlib.contextmanager
    def initialize_batcher(self, batch_size=128,
                           should_shuffle=True,
@ShreyaR (Contributor) commented:

nit: It looks like some of the arguments here aren't used.

@tgaddair (Collaborator, Author) replied:

True, but we need it for the interface. Some of these params will be removed when we drop Petastorm.
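
A hypothetical illustration of that pattern (the helper and body are invented for the sketch):

    import contextlib

    @contextlib.contextmanager
    def initialize_batcher(self, batch_size=128, should_shuffle=True, **kwargs):
        # should_shuffle and **kwargs are accepted only to match the batcher
        # interface shared with the Petastorm backend; this implementation
        # ignores them, and they can be dropped along with Petastorm.
        batcher = self._create_batcher(batch_size)  # hypothetical helper
        yield batcher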

@tgaddair merged commit 8e6420c into master on Oct 22, 2021
@tgaddair deleted the ray-train branch on October 22, 2021 18:47
Development

Successfully merging this pull request may close these issues:

[ray] Use RayDatasets for training
[ray] Migrate RayExecutor to Ray SGD