Added Ray remote backend and Dask distributed preprocessing #1090

Merged: 128 commits, Mar 19, 2021
Changes shown from 115 commits

Commits (128)
3b8713d
POC of Dask replacing Pandas for CSV
tgaddair Oct 18, 2020
197314c
WIP performance improvements for categorical
tgaddair Oct 22, 2020
6bf7083
Removed debug code
tgaddair Oct 22, 2020
dd52d5c
Auto parallelize across CPU cores
tgaddair Oct 22, 2020
1f228b7
Added DataProcessingEngine
tgaddair Oct 23, 2020
12bfea7
Fixed split
tgaddair Oct 23, 2020
b39d372
Fixed API
tgaddair Oct 23, 2020
5b5fc60
Fixed data processing
tgaddair Oct 24, 2020
8b2b594
Drop index
tgaddair Oct 24, 2020
b7f9546
Added Petastorm dataset
tgaddair Oct 24, 2020
c952130
Cleaned up dataset creation
tgaddair Oct 24, 2020
2c93b60
Added Dataset
tgaddair Oct 24, 2020
ea6b4a7
Train from dataset
tgaddair Oct 24, 2020
b203fa6
Fixed bugs
tgaddair Oct 24, 2020
6b3bb08
Fixed string_utils
tgaddair Oct 25, 2020
ef2a314
Fixed tests
tgaddair Oct 25, 2020
9a743fe
Fixed temp dataset
tgaddair Oct 25, 2020
a630f14
Added Backend
tgaddair Oct 25, 2020
945d56e
Plumb through backend
tgaddair Oct 25, 2020
2aab9c5
Plumb backend through get_feature_meta
tgaddair Oct 25, 2020
0a0a7c4
Plumb through backend to add_feature_data
tgaddair Oct 25, 2020
3419178
Plumb in preprocess_for_prediction
tgaddair Oct 25, 2020
9d13c71
Fixed Pandas processing
tgaddair Oct 25, 2020
95a7952
Added cache management
tgaddair Oct 25, 2020
22b7538
Fixed unit tests
tgaddair Oct 25, 2020
fd7cbab
Removed context, engine to processor
tgaddair Oct 25, 2020
b63b316
Added numerical test
tgaddair Oct 25, 2020
0941ecd
RayBackend -> DaskBackend
tgaddair Oct 25, 2020
77a59f9
Fixed read_xsv
tgaddair Oct 25, 2020
cab90a1
Fixed set feature
tgaddair Oct 26, 2020
755c204
Untracked Netflix example
tgaddair Oct 26, 2020
a105e41
Added Dask requirements
tgaddair Oct 26, 2020
0cbb582
Fixed bag feature
tgaddair Oct 26, 2020
87f6e4b
Fixed vector feature
tgaddair Oct 26, 2020
5e089d4
Fixed h3
tgaddair Oct 26, 2020
4f6c0ba
Fixed date
tgaddair Oct 26, 2020
93cbccd
Fixed timeseries
tgaddair Oct 26, 2020
0e93043
Converted audio features processing
tgaddair Oct 27, 2020
bb33fc0
Fixed reshaping
tgaddair Oct 27, 2020
7e8d3c3
Fixed tests
tgaddair Oct 27, 2020
8924ef6
Removed debug print
tgaddair Oct 27, 2020
8e95be2
Fixed image processing
tgaddair Oct 27, 2020
b19ea58
Added tests for exceptions
tgaddair Oct 27, 2020
0afade5
meta_kwargs -> map_objects
tgaddair Oct 27, 2020
aabb582
Removed unused methods
tgaddair Oct 27, 2020
ac29b9d
Removed prints
tgaddair Oct 27, 2020
e3e7a14
Reduced runtime
tgaddair Oct 27, 2020
318da2f
Removed Dask dependency on critical code paths
tgaddair Oct 28, 2020
1a98f33
Added dask extra
tgaddair Oct 28, 2020
f625402
Fixed concatenation
tgaddair Oct 28, 2020
418600a
Fixed split empty dataset
tgaddair Oct 28, 2020
ed9451b
Fixed subselect
tgaddair Oct 28, 2020
a3de815
Restored tests, removed subselect
tgaddair Oct 28, 2020
eacfd63
Moved meta.json
tgaddair Oct 30, 2020
4d8e690
Fixed cache key
tgaddair Oct 30, 2020
cd70992
Updated Petastorm
tgaddair Oct 30, 2020
6c94f22
Spawn Dask tests
tgaddair Oct 30, 2020
cb8bb91
Merge branch 'master' into dask
tgaddair Oct 30, 2020
985b5bd
Fixed test_sequence_features.py
tgaddair Oct 30, 2020
60ad4f4
Added tables
tgaddair Oct 30, 2020
dff8461
Fixed image features
tgaddair Oct 31, 2020
0469097
Fixed string_utils.py
tgaddair Oct 31, 2020
1922f35
Fixed kfold
tgaddair Oct 31, 2020
8952e23
Fixed test splits
tgaddair Oct 31, 2020
5057be6
Fixed test_visualization_api.py
tgaddair Oct 31, 2020
20351e0
Fixed test_visualization.py
tgaddair Oct 31, 2020
92d64c1
Fixed Dask
tgaddair Oct 31, 2020
25ab59b
Fixed test_experiment.py
tgaddair Oct 31, 2020
9f92c38
Changed backend to processor in string_utils
tgaddair Nov 6, 2020
f13fbe5
Added RayBackend to Dask preprocessing and Horovod training on a Ray …
tgaddair Nov 2, 2020
1963305
Replaced Trainer with Backend abstraction
tgaddair Nov 2, 2020
bbfe0de
Added Ray test
tgaddair Nov 3, 2020
498f788
Added Ray test implementation
tgaddair Nov 6, 2020
4d57e18
Refactored into mixins
tgaddair Nov 6, 2020
e0bebc2
Fixed TensorFlow initialization
tgaddair Nov 6, 2020
4ab92cf
Added remote utils
tgaddair Nov 6, 2020
61262ef
Fixed ECD serialization
tgaddair Nov 6, 2020
5fa2b17
Added sync_model to backend
tgaddair Nov 8, 2020
e15e93b
Created RayPredictor
tgaddair Nov 8, 2020
a9bb14b
Fixed kwargs
tgaddair Nov 8, 2020
e461c00
Fixed Horovod in Ray
tgaddair Nov 8, 2020
1294a21
Added return_on_master
tgaddair Nov 8, 2020
dc17f92
Renamed return_first for clarity
tgaddair Nov 8, 2020
1e59b3c
Refactored broadcast_return
tgaddair Nov 9, 2020
004902f
Added Backend plumbing in hyperopt
tgaddair Nov 10, 2020
cbcaaa1
Refactored horovod_utils
tgaddair Nov 10, 2020
21a3660
Removed occurrences of use_horovod
tgaddair Nov 10, 2020
f343a64
Replaced is_on_master
tgaddair Nov 10, 2020
83c9aad
Replaced is_on_master with is_coordinator for Trainer and Predictor
tgaddair Nov 10, 2020
3cb9185
Removed remaining occurrences of is_on_master
tgaddair Nov 10, 2020
594ca61
master -> coordinator
tgaddair Nov 10, 2020
902e8b7
Fixed implicit Horovod backend
tgaddair Nov 10, 2020
1e65b21
Fixed hyperopt
tgaddair Nov 10, 2020
b245055
Added requirements for Ray
tgaddair Nov 10, 2020
08bf843
Store model weights on Ray object store
tgaddair Nov 12, 2020
2febe92
Merged from master
tgaddair Jan 17, 2021
e8ed0f5
Merged master
tgaddair Feb 4, 2021
5b979a3
Removed processor module
tgaddair Feb 4, 2021
fe07d3f
Removed remote_utils
tgaddair Feb 4, 2021
273c85d
Fixed Dask tests
tgaddair Feb 4, 2021
b10393a
Temp fix tests
tgaddair Feb 7, 2021
73c5ad1
Spawn test
tgaddair Feb 7, 2021
9fab810
Merged master
tgaddair Feb 21, 2021
77e1b1f
Force local backend for prediction
tgaddair Feb 22, 2021
7c2d65b
Set Ray as Dask backend
tgaddair Mar 2, 2021
1822501
Fixed comments
tgaddair Mar 2, 2021
745c96c
Merge
tgaddair Mar 2, 2021
ff795a4
Removed link
tgaddair Mar 2, 2021
8f65656
Added test
tgaddair Mar 2, 2021
b87e7bc
TEST: skip experiment.py
tgaddair Mar 2, 2021
f6e3686
Split backend tests
tgaddair Mar 2, 2021
15920c1
Added pytest.ini
tgaddair Mar 2, 2021
1094f78
backend -> distributed
tgaddair Mar 2, 2021
7c324c1
Reordered tests
tgaddair Mar 2, 2021
ff598e3
Configure Dask parallelism
tgaddair Mar 2, 2021
7ea761f
TEST: disable fiber
tgaddair Mar 4, 2021
23b4be0
Test without Ray hyperopt
tgaddair Mar 6, 2021
214f718
Test ray only
tgaddair Mar 6, 2021
65f3293
Test all distributed
tgaddair Mar 6, 2021
f6a4227
Run distributed
tgaddair Mar 6, 2021
740ed9d
Only ray
tgaddair Mar 7, 2021
d10e14f
Serialize on load
tgaddair Mar 10, 2021
faf0052
Only return the weights
tgaddair Mar 10, 2021
0259e07
Revert test changes
tgaddair Mar 10, 2021
97139a6
Revert changes to visualization_utils
tgaddair Mar 10, 2021
54f1939
Merge branch 'master' into ray
tgaddair Mar 10, 2021
eb9fb2e
Resolved merge conflicts
tgaddair Mar 16, 2021
6dfbce6
Addressed comments
tgaddair Mar 19, 2021
14 changes: 10 additions & 4 deletions .travis.yml
@@ -5,11 +5,17 @@ language: python
 jobs:
   include:
     - python: "3.6"
-      env: TENSORFLOW=2.3.1
+      env: TENSORFLOW=2.3.1 TEST_FILTER="not distributed"
+    - python: "3.6"
+      env: TENSORFLOW=2.3.1 TEST_FILTER="distributed"
+    - python: "3.7"
+      env: TENSORFLOW=2.4.0 TEST_FILTER="not distributed"
     - python: "3.7"
-      env: TENSORFLOW=2.4.0
+      env: TENSORFLOW=2.4.0 TEST_FILTER="distributed"
+    - python: "3.8"
+      env: TENSORFLOW=nightly TEST_FILTER="not distributed"
     - python: "3.8"
-      env: TENSORFLOW=nightly
+      env: TENSORFLOW=nightly TEST_FILTER="distributed"
 before_install:
   - sudo apt-get update
   - sudo apt-get install -y cmake libsndfile1
@@ -28,4 +34,4 @@ install:
   - HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_MPI=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir '.[test]'
 script:
   - pip list
-  - pytest -v --timeout 300 tests
+  - pytest -v --timeout 300 --durations 10 -m "$TEST_FILTER" tests
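
The new TEST_FILTER variable feeds pytest's -m marker expression, so the distributed (Ray/Dask/Horovod) tests run in their own CI jobs. A hedged sketch of how a test can opt into that marker; the test body and name are illustrative, while the "distributed" marker itself comes from the pytest.ini added in this PR:

import pytest

@pytest.mark.distributed  # selected by TEST_FILTER="distributed", skipped by "not distributed"
def test_ray_backend_smoke():
    ray = pytest.importorskip('ray')  # skip on environments without Ray installed
    assert ray is not None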
16 changes: 15 additions & 1 deletion ludwig/backend/__init__.py
@@ -22,23 +22,37 @@
 LOCAL_BACKEND = LocalBackend()

 LOCAL = 'local'
+DASK = 'dask'
 HOROVOD = 'horovod'
+RAY = 'ray'

-ALL_BACKENDS = [LOCAL, HOROVOD]
+ALL_BACKENDS = [LOCAL, DASK, HOROVOD, RAY]


 def get_local_backend():
     return LOCAL_BACKEND


+def create_dask_backend():
+    from ludwig.backend.dask import DaskBackend
+    return DaskBackend()
+
+
 def create_horovod_backend():
     from ludwig.backend.horovod import HorovodBackend
     return HorovodBackend()


+def create_ray_backend():
+    from ludwig.backend.ray import RayBackend
+    return RayBackend()
+
+
 backend_registry = {
     LOCAL: get_local_backend,
+    DASK: create_dask_backend,
     HOROVOD: create_horovod_backend,
+    RAY: create_ray_backend,
     None: get_local_backend,
 }
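
A minimal sketch of how this registry resolves a backend by name. Constructing the backend directly here is only for illustration; in Ludwig the lookup is normally done internally from the backend name the user supplies:

from ludwig.backend import backend_registry, RAY

backend = backend_registry[RAY]()   # factory returns a RayBackend instance; None maps to the local backend
backend.initialize()                # connects to Ray and routes Dask through the Ray scheduler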

41 changes: 41 additions & 0 deletions ludwig/backend/dask.py
@@ -0,0 +1,41 @@
#! /usr/bin/env python
# coding=utf-8
# Copyright (c) 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

from ludwig.backend.base import Backend, LocalTrainingMixin
from ludwig.constants import NAME
from ludwig.data.dataframe.dask import DaskEngine


class DaskBackend(LocalTrainingMixin, Backend):
    def __init__(self):
        super().__init__()
        self._df_engine = DaskEngine()

    def initialize(self):
        pass

    @property
    def df_engine(self):
        return self._df_engine

    @property
    def supports_multiprocessing(self):
        return False

    def check_lazy_load_supported(self, feature):
        raise ValueError(f'DaskBackend does not support lazy loading of data files at train time. '
                         f'Set preprocessing config `in_memory: True` for feature {feature[NAME]}')
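
For reference, the error above points users at the `in_memory` preprocessing flag. An illustrative config fragment that satisfies the check for an image feature; the feature names are hypothetical and the fragment is a sketch, not taken from this PR:

config = {
    'input_features': [
        {
            'name': 'image_path',                    # hypothetical feature name
            'type': 'image',
            'preprocessing': {'in_memory': True},    # required when training with DaskBackend/RayBackend
        },
    ],
    'output_features': [{'name': 'label', 'type': 'category'}],
}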
187 changes: 187 additions & 0 deletions ludwig/backend/ray.py
@@ -0,0 +1,187 @@
#! /usr/bin/env python
# coding=utf-8
# Copyright (c) 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import logging
from collections import defaultdict

import dask
import ray
from horovod.ray import RayExecutor
from ray.util.dask import ray_dask_get

from ludwig.backend.base import Backend, RemoteTrainingMixin
from ludwig.constants import NAME
from ludwig.data.dataframe.dask import DaskEngine
from ludwig.models.predictor import BasePredictor, RemotePredictor
from ludwig.models.trainer import BaseTrainer, RemoteTrainer
from ludwig.utils.tf_utils import initialize_tensorflow


logger = logging.getLogger(__name__)


def get_dask_kwargs():
    # TODO ray: select this more intelligently,
    # must be greater than or equal to number of Horovod workers
    return dict(
        parallelism=int(ray.cluster_resources()['CPU'])
    )

Review thread on the parallelism default:

Author: @clarkzinzow does this make sense as the default repartition value? One partition per CPU? Not sure if there's a more reasonable heuristic for this. The one restriction we have is that for Petastorm, we must have at least one row group per Horovod worker, and the safest way to guarantee this at the moment is to repartition the dataframe.

Reviewer: That's the typical heuristic, yes, under the soft constraint of those chunks/partitions fitting nicely into each worker's memory.
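
A sketch of the stricter rule the TODO and the thread above describe, not part of this PR: one partition per CPU, but never fewer partitions than Horovod workers, so Petastorm can hand every worker at least one row group. The function name and argument are illustrative:

def get_dask_kwargs_with_worker_floor(num_horovod_workers):
    # default to one partition per CPU, floored at the number of training workers
    num_cpus = int(ray.cluster_resources()['CPU'])
    return dict(parallelism=max(num_cpus, num_horovod_workers))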


def get_horovod_kwargs():
    # TODO ray: https://github.com/horovod/horovod/issues/2702
    resources = [node['Resources'] for node in ray.state.nodes()]
    use_gpu = int(ray.cluster_resources().get('GPU', 0)) > 0

    # Our goal is to maximize the number of training resources we can
    # form into a homogenous configuration. The priority is GPUs, but
    # can fall back to CPUs if there are no GPUs available.
    key = 'GPU' if use_gpu else 'CPU'

    # Bucket the per node resources by the number of the target resource
    # available on that host (equivalent to number of slots).
    buckets = defaultdict(list)
    for node_resources in resources:
        buckets[int(node_resources.get(key, 0))].append(node_resources)

    # Maximize for the total number of the target resource = num_slots * num_workers
    def get_total_resources(bucket):
        slots, resources = bucket
        return slots * len(resources)

    best_slots, best_resources = max(buckets.items(), key=get_total_resources)
    return dict(
        num_slots=best_slots,
        num_hosts=len(best_resources),
        use_gpu=use_gpu
    )

Review thread on the homogeneity requirement:

Reviewer: is it possible to support non-homogenous configurations?

Author: From Horovod's perspective: definitely. I think we would just need to rework the RayExecutor interface a little. Namely, you could imagine the user just saying num_gpus=N, then we place however many processes per host we need to in order to meet this request (so no more num_hosts or num_slots params in this mode).

Reviewer: yeah, that sounds good. Let me make an issue on horovod then!

Review thread on lines +47 to +59 (the per-node resource accounting):

Reviewer: I think moving forward, we want to move away from the ray.nodes() API.

Author: What's the preferred alternative?

Reviewer: I think if we get HorovodRay to just support num_gpus=N, then we wouldn't need to do the accounting. Generally, we're trying to move towards a more 'serverless' abstraction where programmers think about 'resources' rather than 'nodes'.

Reviewer: I think it's safe to keep this here for now (at least until we support num_gpus in horovod)

Author: Sweet, I'll add a TODO referencing this issue.
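
A small worked example of the bucketing heuristic above, run on a hypothetical cluster of two 4-GPU nodes and one 2-GPU node:

from collections import defaultdict

resources = [{'GPU': 4, 'CPU': 16}, {'GPU': 4, 'CPU': 16}, {'GPU': 2, 'CPU': 8}]

buckets = defaultdict(list)
for node_resources in resources:
    buckets[int(node_resources.get('GPU', 0))].append(node_resources)

# buckets = {4: [two nodes], 2: [one node]}; 4 * 2 = 8 beats 2 * 1 = 2,
# so training runs with num_slots=4 on num_hosts=2 and leaves the 2-GPU node out.
best_slots, best_resources = max(buckets.items(), key=lambda b: b[0] * len(b[1]))
print(best_slots, len(best_resources))  # 4 2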


class RayRemoteModel:
    def __init__(self, model):
        self.cls, self.args, state = list(model.__reduce__())
        self.state = ray.put(state)

    def load(self):
        obj = self.cls(*self.args)
        obj.__setstate__(ray.get(self.state))
        return obj
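
A minimal sketch of the object-store round trip this class relies on; the weights dict here is a stand-in, not a Ludwig model. The state is put into the Ray object store once, and each remote task receives the stored value when the reference is passed as an argument:

import numpy as np
import ray

ray.init(ignore_reinit_error=True)

weights = {'dense/kernel': np.zeros((128, 64))}
weights_ref = ray.put(weights)  # stored once, shared by every worker

@ray.remote
def weight_norm(w):
    # Ray resolves the ObjectRef argument to the stored dict before the task runs
    return sum(float(np.linalg.norm(v)) for v in w.values())

print(ray.get(weight_norm.remote(weights_ref)))  # 0.0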


class RayTrainer(BaseTrainer):
    def __init__(self, horovod_kwargs, trainer_kwargs):
        # TODO ray: make this more configurable by allowing YAML overrides of timeout_s, etc.
        setting = RayExecutor.create_settings(timeout_s=30)
        self.executor = RayExecutor(setting, **{**get_horovod_kwargs(), **horovod_kwargs})
        self.executor.start(executable_cls=RemoteTrainer, executable_kwargs=trainer_kwargs)

    def train(self, model, *args, **kwargs):
        model = RayRemoteModel(model)
        results = self.executor.execute(
            lambda trainer: trainer.train(model.load(), *args, **kwargs)
        )
        return results[0]

    def train_online(self, model, *args, **kwargs):
        model = RayRemoteModel(model)
        results = self.executor.execute(
            lambda trainer: trainer.train_online(model.load(), *args, **kwargs)
        )
        return results[0]

    @property
    def validation_field(self):
        return self.executor.execute_single(lambda trainer: trainer.validation_field)

    @property
    def validation_metric(self):
        return self.executor.execute_single(lambda trainer: trainer.validation_metric)

    def shutdown(self):
        self.executor.shutdown()

Review thread on the RayExecutor settings:

Reviewer: should we expose more settings here?

Author: Good point. I definitely want to make this more configurable via the Ludwig YAML or similar. I think we can do this in a follow-up to allow specifying the backend in a YAML file, so I will add a TODO for now. Does that seem reasonable to you?

Reviewer: yeah sounds good!


class RayPredictor(BasePredictor):
    def __init__(self, horovod_kwargs, predictor_kwargs):
        # TODO ray: investigate using Dask for prediction instead of Horovod
        setting = RayExecutor.create_settings(timeout_s=30)
        self.executor = RayExecutor(setting, **{**get_horovod_kwargs(), **horovod_kwargs})
        self.executor.start(executable_cls=RemotePredictor, executable_kwargs=predictor_kwargs)

    def batch_predict(self, model, *args, **kwargs):
        model = RayRemoteModel(model)
        results = self.executor.execute(
            lambda predictor: predictor.batch_predict(model.load(), *args, **kwargs)
        )
        return results[0]

    def batch_evaluation(self, model, *args, **kwargs):
        model = RayRemoteModel(model)
        results = self.executor.execute(
            lambda predictor: predictor.batch_evaluation(model.load(), *args, **kwargs)
        )
        return results[0]

    def batch_collect_activations(self, model, *args, **kwargs):
        model = RayRemoteModel(model)
        return self.executor.execute_single(
            lambda predictor: predictor.batch_collect_activations(model.load(), *args, **kwargs)
        )

    def shutdown(self):
        self.executor.shutdown()


class RayBackend(RemoteTrainingMixin, Backend):
    def __init__(self, horovod_kwargs=None):
        super().__init__()
        self._df_engine = DaskEngine(**get_dask_kwargs())
        self._horovod_kwargs = horovod_kwargs or {}
        self._tensorflow_kwargs = {}

    def initialize(self):
        try:
            ray.init('auto', ignore_reinit_error=True)
        except ConnectionError:
            logger.info('Initializing new Ray cluster...')
            ray.init(ignore_reinit_error=True)
        dask.config.set(scheduler=ray_dask_get)

    def initialize_tensorflow(self, **kwargs):
        # Make sure we don't claim any GPU resources on the head node
        initialize_tensorflow(gpus=-1)
        self._tensorflow_kwargs = kwargs

    def create_trainer(self, **kwargs):
        executable_kwargs = {**kwargs, **self._tensorflow_kwargs}
        return RayTrainer(self._horovod_kwargs, executable_kwargs)

    def create_predictor(self, **kwargs):
        executable_kwargs = {**kwargs, **self._tensorflow_kwargs}
        return RayPredictor(self._horovod_kwargs, executable_kwargs)

    @property
    def df_engine(self):
        return self._df_engine

    @property
    def supports_multiprocessing(self):
        return False

    def check_lazy_load_supported(self, feature):
        raise ValueError(f'RayBackend does not support lazy loading of data files at train time. '
                         f'Set preprocessing config `in_memory: True` for feature {feature[NAME]}')
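
A usage sketch based only on the class above. In practice the backend is resolved through backend_registry and driven by LudwigModel rather than constructed by hand, and the trainer kwargs are placeholders:

backend = RayBackend(horovod_kwargs={'num_slots': 2})  # override the inferred slot count
backend.initialize()             # attach to a running Ray cluster, or start a local one
backend.initialize_tensorflow()  # recorded and forwarded to the remote trainer workers

trainer = backend.create_trainer()  # starts a Horovod-on-Ray job via RayExecutor
# ... trainer.train(model, ...) would run here ...
trainer.shutdown()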
55 changes: 55 additions & 0 deletions ludwig/data/batcher/iterable.py
@@ -0,0 +1,55 @@
#! /usr/bin/env python
# coding=utf-8
# Copyright (c) 2020 Uber Technologies, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
from ludwig.data.batcher.base import Batcher


class IterableBatcher(Batcher):
    def __init__(self,
                 dataset,
                 data,
                 steps_per_epoch,
                 ignore_last=False):
        self.dataset = dataset
        self.data = data
        self.data_it = iter(data)

        self.ignore_last = ignore_last
        self.steps_per_epoch = steps_per_epoch
        self.step = 0

    def next_batch(self):
        if self.last_batch():
            raise StopIteration()

        sub_batch = {}
        batch = next(self.data_it)
        for features_name in self.dataset.features:
            sub_batch[features_name] = self.dataset.get(
                features_name,
                batch
            )

        self.step += 1
        return sub_batch

    def last_batch(self):
        return self.step >= self.steps_per_epoch or (
            self.ignore_last and
            self.step + 1 >= self.steps_per_epoch)

    def set_epoch(self, epoch):
        self.step = 0
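
A minimal sketch of driving IterableBatcher by hand. The toy dataset is a stand-in for the Petastorm-backed dataset in this PR; all the batcher requires is the `features` attribute and `get(feature_name, batch)` method used above:

class ToyDataset:
    features = ['x', 'y']

    def get(self, feature_name, batch):
        return batch[feature_name]

data = [{'x': [1, 2], 'y': [0, 1]}, {'x': [3, 4], 'y': [1, 0]}]
batcher = IterableBatcher(ToyDataset(), data, steps_per_epoch=len(data))

while not batcher.last_batch():
    sub_batch = batcher.next_batch()  # {'x': [...], 'y': [...]} for the current step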