Add support for Ray Train and Ray Datasets in training #1391
Conversation
This looks awesome! Added mostly minor comments -- feel free to merge whenever updated!
min_cpus = min(r['CPU'] for r in resources)
num_workers = len(resources)
resources_per_worker = {
    'CPU': min(min_cpus / 2 + 1, min_cpus)
}
`resources_per_worker` could potentially be fractional here -- you could use `min_cpus // 2`.
Good point. Ray actually does allow you to specify fractional CPUs, though, so I think it's fine to leave as is.
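For context, a minimal standalone sketch of Ray's fractional CPU support (not Ludwig code; just illustrating why a fractional value like `min_cpus / 2` is a legal resource request):

import ray

ray.init(num_cpus=2)

# Ray accepts fractional resource requests, so four of these tasks
# can be scheduled concurrently on the two available CPUs.
@ray.remote(num_cpus=0.5)
def probe():
    return "ok"

print(ray.get([probe.remote() for _ in range(4)]))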
ludwig/backend/ray.py (Outdated)
@@ -110,14 +154,121 @@ def train_online(self, *args, **kwargs):
        return results


class RayTrainer(BaseTrainer):
    def __init__(self, horovod_kwargs, trainer_kwargs):
        ...

def train_fn(executable_kwargs=None, remote_model=None, train_shards=None, val_shards=None, test_shards=None, **kwargs):
Could you add type hints here? It's not super clear from the code what `train_shards`, `val_shards`, etc. should be.
Yes! Good catch.
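As a sketch only, one way the annotated signature could look -- the shard type is an assumption here (e.g. a list of `ray.data.Dataset` objects), and the `remote_model` type is left loose since it isn't visible in this diff:

from typing import Any, Dict, List, Optional

from ray.data import Dataset  # assumed shard type; the PR may use another

def train_fn(
    executable_kwargs: Optional[Dict[str, Any]] = None,
    remote_model: Any = None,  # handle to the serialized Ludwig model
    train_shards: Optional[List[Dataset]] = None,
    val_shards: Optional[List[Dataset]] = None,
    test_shards: Optional[List[Dataset]] = None,
    **kwargs: Any,
) -> None:
    ...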
training_set = split_dataset(dataset, split, 0)
validation_set = split_dataset(dataset, split, 1)
test_set = split_dataset(dataset, split, 2)
distinct_values = dataset[split].drop_duplicates()
General question, in case we want to keep duplicates: is it possible that duplicates in the dataset are an artifact of upsampling some points?
Here the `drop_duplicates` is only on the split column. The idea is that this is a more efficient way to determine whether or not the dataset has, for example, a validation split, so we don't have to call `len(val_df)` (which is expensive).
In general, I don't think Ludwig does any duplicate removal across the entire dataset.
Added comment.
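To make the trick concrete, a minimal sketch with a hypothetical helper, assuming a pandas-style dataframe whose split column encodes 0/1/2 for train/validation/test:

import pandas as pd

TRAIN, VALIDATION, TEST = 0, 1, 2

def has_split(dataset: pd.DataFrame, split_col: str, split_value: int) -> bool:
    # Scanning the distinct values of the single split column is much
    # cheaper than materializing the subset and calling len() on it.
    distinct_values = dataset[split_col].drop_duplicates()
    return split_value in distinct_values.values

df = pd.DataFrame({'x': [1, 2, 3], 'split': [TRAIN, TRAIN, TEST]})
print(has_split(df, 'split', VALIDATION))  # False: no validation rows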
    training_set_metadata.get(DATA_TRAIN_HDF5_FP)
)


def create_inference_dataset(
Looks like `create_inference_dataset` and `create` have the same functionality. The `tag` argument looks like it isn't being used. Is it possible to consolidate both methods?
Yeah, it should be possible once we remove Petastorm. We need it for Petastorm (and by extension here, to fit the interface) since we use different datasets for training vs. prediction. Will add a TODO.
Changed `create_inference_dataset` to call `create`.
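A rough sketch of that delegation, with hypothetical names and bodies (only the method names and the `tag` argument come from the diff above):

class RayDatasetManager:  # hypothetical owner of the two factory methods
    def create(self, dataset, config, training_set_metadata):
        # Single code path for building the dataset wrapper.
        ...

    def create_inference_dataset(self, dataset, tag, config, training_set_metadata):
        # TODO: fold into create() once Petastorm is dropped; `tag` exists
        # only to satisfy the shared backend interface.
        return self.create(dataset, config, training_set_metadata)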
@contextlib.contextmanager
def initialize_batcher(self, batch_size=128,
                       should_shuffle=True,
nit: It looks like some of the arguments here aren't used.
True, but we need them for the interface. Some of these params will be removed when we drop Petastorm.
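For illustration, a minimal sketch of the pattern being described -- unused parameters kept only for interface compatibility (hypothetical class, extra parameters, and body; only the decorator and the first two parameters come from the diff):

import contextlib

class RayDataset:  # hypothetical stand-in for the class owning this method
    @contextlib.contextmanager
    def initialize_batcher(self, batch_size=128, should_shuffle=True,
                           seed=0, ignore_last=False, horovod=None):
        # should_shuffle, seed, ignore_last, and horovod are accepted only
        # to match the shared batcher interface; this backend ignores them,
        # and they are slated for removal once Petastorm is dropped.
        batcher = iter([])  # placeholder for the real Ray batch iterator
        try:
            yield batcher
        finally:
            pass  # nothing to clean up in this sketch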
Fixes #1354.
Fixes #1331.