DaskLGBMRegressor early stopping causes socket error 104 #6197

Open
daviddwlee84 opened this issue Nov 16, 2023 · 1 comment

daviddwlee84 commented Nov 16, 2023

Description

Training works fine without early stopping.
But when the early stopping callback is enabled, it seems that one of the workers stops early while the others keep training, which causes the error.

So far, DaskLGBMRegressor triggers this issue every time.
I also tried changing make_regression to make_classification and lgb.DaskLGBMRegressor to lgb.DaskLGBMClassifier; the issue is still reproducible there, but it does not trigger on every run.

Reproducible example

# start the Dask cluster like this (run in a shell, not in Python):
#   dask-ssh 192.168.222.{235,236,237} --scheduler 192.168.222.236
import dask.array as da
import lightgbm as lgb
from sklearn.datasets import make_regression
from distributed import Client, wait

client = Client(address="tcp://192.168.222.236:8786")

# starting with clean workers
client.restart()

EARLY_STOP_ROUND = 20
NUM_ITERATION = 1000
LEARNING_RATE = 0.01

# adding callbacks
callbacks = []
eval_result = {}
record_evaluation_callback = lgb.record_evaluation(eval_result)
callbacks.append(record_evaluation_callback)
log_evaluation_callback = lgb.log_evaluation()
callbacks.append(log_evaluation_callback)
early_stopping_callback = lgb.early_stopping(EARLY_STOP_ROUND)
callbacks.append(early_stopping_callback)

# creating sample regression data
X_np, y_np = make_regression(n_samples=1000, n_features=10)
row_chunks = (100, 100, 100, 100, 100, 100, 100, 100, 100, 100)
X = da.from_array(X_np, chunks=(row_chunks, (10,)))
y = da.from_array(y_np, chunks=(row_chunks))
X_test_np, y_test_np = make_regression(n_samples=300, n_features=10)
test_row_chunks = (100, 100, 100)
X_test = da.from_array(X_test_np, chunks=(test_row_chunks, (10,)))
y_test = da.from_array(y_test_np, chunks=(test_row_chunks))

# persist() + wait() + rebalance() to get an even spread of the data across workers
X = X.persist()
y = y.persist()
X_test = client.persist(X_test)
y_test = client.persist(y_test)
_ = wait([X, y, X_test, y_test])
client.rebalance()

# training and get socket recv error code 104
model = lgb.DaskLGBMRegressor(num_iterations=NUM_ITERATION, learning_rate=LEARNING_RATE).fit(
    X, y, eval_set=[(X_test, y_test)], eval_names=['test'], callbacks=callbacks)
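
For comparison, the same training call without the early stopping callback completes normally, as described above. A minimal sketch, reusing the variables defined in the reproducer:

# same call, but without lgb.early_stopping(); per the description above this
# runs to completion without the socket error
eval_result_no_es = {}
callbacks_no_es = [lgb.record_evaluation(eval_result_no_es), lgb.log_evaluation()]
model_no_es = lgb.DaskLGBMRegressor(
    num_iterations=NUM_ITERATION, learning_rate=LEARNING_RATE
).fit(X, y, eval_set=[(X_test, y_test)], eval_names=['test'], callbacks=callbacks_no_es)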

Environment info

LightGBM version or commit hash:

4.1.0

Command(s) you used to install LightGBM:

pip install lightgbm

Dependencies (pip list)

dask                              2023.5.0
distributed                       2023.5.0
lightgbm                          4.1.0
scikit-learn                      1.3.2

Additional Comments

Client-side logs

Finding random open ports for workers
Traceback (most recent call last):
  File "lightgbm_reproduce_socket_error.py", line 49, in <module>
    model = lgb.DaskLGBMRegressor(num_iterations=NUM_ITERATION, learning_rate=LEARNING_RATE).fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 1406, in fit
    self._lgb_dask_fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 1082, in _lgb_dask_fit
    model = _train(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 818, in _train
    results = client.gather(futures_classifiers)
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/client.py", line 2361, in gather
    return self.sync(
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
    return sync(
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 418, in sync
    raise exc.with_traceback(tb)
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/utils.py", line 391, in f
    result = yield future
  File "/home/lidawei/.local/lib/python3.8/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/lidawei/.local/lib/python3.8/site-packages/distributed/client.py", line 2224, in _gather
    raise exception.with_traceback(traceback)
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/dask.py", line 313, in _train_part
    model.fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 1049, in fit
    super().fit(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 842, in fit
    self._Booster = train(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/engine.py", line 276, in train
    booster.update(fobj=fobj)
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 3658, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/home/lidawei/.local/lib/python3.8/site-packages/lightgbm/basic.py", line 242, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)

Server-side logs

[ worker 192.168.222.235 ] : [60]       test's l2: 14300.6
[ worker 192.168.222.235 ] : [61]       test's l2: 14316.5
[ worker 192.168.222.235 ] : Early stopping, best iteration is:
[ worker 192.168.222.235 ] : [41]       test's l2: 14060.2
[ worker 192.168.222.235 ] : [LightGBM] [Info] Finished linking network in 1.007508 seconds
[ worker 192.168.222.236 ] : [60]       test's l2: 15432.3
[ worker 192.168.222.236 ] : [61]       test's l2: 15396.1
[ worker 192.168.222.237 ] : [60]       test's l2: 19323.7
[ worker 192.168.222.237 ] : [61]       test's l2: 19312.2
[ worker 192.168.222.237 ] : [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
[ scheduler 192.168.222.236:8786 ] : 2023-11-16 09:47:30,012 - distributed.scheduler - INFO - Remove client Client-134a1478-8422-11ee-aee8-288023a82aca
[ scheduler 192.168.222.236:8786 ] : 2023-11-16 09:47:30,017 - distributed.core - INFO - Received 'close-stream' from tcp://192.168.222.235:60582; closing.
[ scheduler 192.168.222.236:8786 ] : 2023-11-16 09:47:30,018 - distributed.scheduler - INFO - Remove client Client-134a1478-8422-11ee-aee8-288023a82aca
[ scheduler 192.168.222.236:8786 ] : 2023-11-16 09:47:30,021 - distributed.scheduler - INFO - Close client connection: Client-134a1478-8422-11ee-aee8-288023a82aca
[ worker 192.168.222.237 ] : 2023-11-16 09:47:29,986 - distributed.worker - WARNING - Compute Failed
[ worker 192.168.222.237 ] : Key:       _train_part-a926586b-f13d-4492-b706-89ad0093bf37
[ worker 192.168.222.237 ] : Function:  _train_part
[ worker 192.168.222.237 ] : args:      ()
[ worker 192.168.222.237 ] : kwargs:    {'model_factory': <class 'lightgbm.sklearn.LGBMRegressor'>, 'params': {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.01, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'num_leaves': 31, 'objective': None, 'random_state': None, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'num_iterations': 1000, 'tree_learner': 'data', 'num_threads': 120, 'machines': '192.168.222.236:36747,192.168.222.235:38877,192.168.222.237:44473', 'local_listen_port': 44473, 'time_out': 120, 'num_machines': 3}, 'list_of_parts': [{'data': array([[ 4.39651081e-01,  4.20902410e-01, -1.89537866e-01,
[ worker 192.168.222.237 ] :         -1.94632152e-01,  1.46491621e+00, -4.68106652e-01,
[ worker 192.168.222.237 ] :         -1.82754281e-01,  9.03791325e-01, -1.57407489e+00,
[ worker 192.168.222.237 ] :         -9.74526425e-01],
[ worker 192.168.222.237 ] :        [-9.12978201e-01, -1.02316910e+00, -1.90681758e-01,
[ worker 192.168.222.237 ] :         -4.17785427e-01, -
[ worker 192.168.222.237 ] : Exception: "LightGBMError('Socket recv error, Connection reset by peer (code: 104)')"

Some issues and pull requests that might be related:

  1. DaskLGBMRegressor early_stopping_rounds error · Issue #5963 · microsoft/LightGBM
  2. dask early stopping and eval-set · Issue #4189 · microsoft/LightGBM
    1. [dask] Early stopping by ffineis · Pull Request #3952 · microsoft/LightGBM
    2. [dask] add support for eval sets and custom eval functions by ffineis · Pull Request #4101 · microsoft/LightGBM
  3. [LightGBM] [Fatal] Socket recv error, code: 104 when training with binary objective. · Issue #728 · microsoft/SynapseML
  4. [ci] Random failure with Dask test · Issue #6036 · microsoft/LightGBM
  5. Dask tests randomly fail with socket error code 104 · Issue #4074 · microsoft/LightGBM
  6. [dask] run one training task on each worker by jameslamb · Pull Request #4132 · microsoft/LightGBM
jameslamb added the dask label on Nov 16, 2023

jmoralez (Collaborator) commented:

Hey @daviddwlee84, thanks for using LightGBM and for the excellent report. The Dask interface doesn't support early stopping yet; that's being tracked in #3712.
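
Until that lands, a minimal workaround sketch (not an official recommendation), assuming the data fits on a single machine: either drop the early_stopping callback when fitting the Dask estimator, or materialize the Dask arrays on the client and run early stopping through the local sklearn estimator, reusing the variables from the reproducer above.

# Workaround sketch: pull the Dask arrays to the client and train locally with
# lgb.LGBMRegressor, whose fit() supports the early_stopping callback.
# NUM_ITERATION, LEARNING_RATE, EARLY_STOP_ROUND, X, y, X_test, y_test are the
# objects defined in the reproducer above; this assumes they fit in local memory.
local_model = lgb.LGBMRegressor(n_estimators=NUM_ITERATION, learning_rate=LEARNING_RATE)
local_model.fit(
    X.compute(), y.compute(),                         # Dask arrays -> NumPy arrays
    eval_set=[(X_test.compute(), y_test.compute())],
    eval_names=['test'],
    callbacks=[lgb.early_stopping(EARLY_STOP_ROUND)],
)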
