
too much memory consumption by xgboost #13

Closed
pplonski opened this issue Apr 13, 2019 · 13 comments
Assignees: pplonski
Labels: bug (Something isn't working)

Comments

@pplonski
Contributor

When running several xgboost algorithms in a row on a dataset larger than 100 MB, RAM consumption grows very fast. This looks like a bug.

@pplonski pplonski added the bug Something isn't working label Apr 13, 2019
@pplonski pplonski self-assigned this Apr 13, 2019
@tRosenflanz

I ran into this with the GPU implementation of XGBoost: consider deleting previous booster objects; they tend to keep data around.
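
A minimal sketch of that cleanup, assuming `booster` names a trained xgboost model that is no longer needed (the name is illustrative):

```python
import gc

# Drop the last Python reference to the finished booster so its buffers
# (including any cached training data) become garbage.
del booster

# Prompt CPython to release the freed memory right away rather than later.
gc.collect()
```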

@pplonski
Copy link
Contributor Author

@tRosenflanz thank you for the suggestion. I had a hard time today figuring out how to fix it. I will try your suggestion and let you know how it goes.

@tmontana

Having the same issue. Would dumping the booster model to disk while keeping just the run results work? With a 900 MB dataset, running 10 models for at most 300 seconds each consumes 400 GB of RAM.

Thanks for a great tool.

@pplonski
Contributor Author

@tmontana would you send me some more details so I can reproduce the issue:

  • how many rows and columns are in the dataset? Are all columns numeric? What is the class balance?
  • what arguments were used in AutoML? 10 models with 5-fold cross-validation each?
  • could you post a code snippet?

@tmontana

Rows: 391,032
Columns: 397 (all numeric; 40% class 0, 60% class 1)

arguments used:

```python
model_types = ['Xgboost']
automl = AutoML(
    total_time_limit=None,
    learner_time_limit=30,
    algorithms=model_types,
    train_ensemble=True,
    start_random_models=10,
    hill_climbing_steps=3,
    top_models_to_improve=2,
)
automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": False, "stratify": True}
```

```python
%%time
automl.fit(X, y)
```

Result on a machine with 384 GB of RAM:

```
~/anaconda3/envs/mlj_shap/lib/python3.6/site-packages/xgboost/core.py in _maybe_pandas_data(data, feature_names, feature_types)
    226     feature_types = [PANDAS_DTYPE_MAPPER[dtype.name] for dtype in data_dtypes]
    227
--> 228     data = data.values.astype('float')
    229
    230     return data, feature_names, feature_types

MemoryError:
```
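
The failing line shows xgboost making a full float64 copy of the whole DataFrame via `data.values.astype('float')`. One hedged way to sidestep that pandas conversion path, assuming `X` and `y` are the objects from the snippet above and the caller controls DMatrix construction (whether this lowers peak memory depends on the xgboost version), is to hand xgboost a float32 numpy array instead:

```python
import gc
import numpy as np
import xgboost as xgb

# Convert once, up front; a numpy input skips the DataFrame branch in
# _maybe_pandas_data, and float32 halves the size of the converted copy.
X_np = X.to_numpy(dtype=np.float32)
dtrain = xgb.DMatrix(X_np, label=y)

del X_np      # the DMatrix keeps its own internal copy of the data
gc.collect()
```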

@tmontana

And this fails after 22 models on a machine with 784 GB of RAM:

```python
automl = AutoML(
    total_time_limit=None,
    learner_time_limit=30,
    algorithms=model_types,
    train_ensemble=True,
    start_random_models=30,
    hill_climbing_steps=5,
    top_models_to_improve=3,
)
```

@tmontana

Forgot to mention: there is no preprocessing needed. All features are numeric and there are no missing values.

@pplonski
Contributor Author

It is a serious problem. I'm thinking about a major change in the code so that not all models are kept in RAM. It will take me some time to rewrite the package, though.

In the meantime, I can offer help through my web service. If you want to tune models at https://mljar.com, please set up an account and I will give you as many free credits as needed to tune models on your dataset. Apologies for the problems!

@tmontana

Hi Piotr: no worries - I'm already a client of the web platform. Thanks

@pplonski
Contributor Author

Please send your username or email to contact@mljar.com so I can find your account and add credits for computation.

@pplonski
Contributor Author

pplonski commented Apr 2, 2020

I've investigated the memory consumption of xgboost. You can find a notebook with an example here.

I've submitted a ticket to xgboost (dmlc/xgboost#5474) asking for ways to limit memory usage. For now, the workaround is to save the model to the hard drive and then load it back.
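
A minimal sketch of that save-and-reload workaround (the synthetic data and the `model.bin` filename are placeholders):

```python
import gc
import numpy as np
import xgboost as xgb

# Train a throwaway model on synthetic data to stand in for a real run.
X = np.random.rand(10_000, 20).astype(np.float32)
y = (np.random.rand(10_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# Persist the model, drop the in-memory objects that cache training data...
booster.save_model("model.bin")
del booster, dtrain
gc.collect()

# ...then load a lean copy back for later prediction.
booster = xgb.Booster()
booster.load_model("model.bin")
```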

@tmontana

tmontana commented Apr 4, 2020

It seems like this is an ongoing issue for xgboost. From what I can read, they have solved it for GPUs.
Is there a way to specify using a GPU for xgboost in mljar-supervised?
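
For reference, in plain xgboost of this era GPU training is selected via the `tree_method` parameter; whether mljar-supervised exposed this at the time is not confirmed in this thread, so the sketch below uses xgboost directly with synthetic data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1_000, 10).astype(np.float32)
y = (np.random.rand(1_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # GPU histogram algorithm; requires a CUDA build
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```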

@pplonski
Copy link
Contributor Author

pplonski commented Apr 7, 2020

@tmontana I've made a few improvements in the mljar-supervised package to better handle memory:

  • the X training data is stored on the hard drive (in parquet format), and for each cross-validation step it is read back and split into train/test subsets, so I don't need to keep a copy of the original data for preprocessing at each CV step (a sketch of this round-trip follows below),
  • for xgboost I simply save and load the model, which looks like it reduces memory consumption.

There are still a lot of things to be added to the package, but memory consumption should no longer be so huge.
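
A sketch of that parquet round-trip, with sklearn's KFold standing in for the package's own splitter (an assumption; the real code may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Persist X once, then keep only the on-disk copy between folds.
X.to_parquet("X_train.parquet")
n_rows = len(X)
del X

for train_idx, test_idx in KFold(n_splits=15).split(np.arange(n_rows)):
    # Re-read fresh data for this fold instead of holding a master copy.
    X_fold = pd.read_parquet("X_train.parquet")
    X_train, X_test = X_fold.iloc[train_idx], X_fold.iloc[test_idx]
    # ...preprocess and train on this fold; X_fold is freed on the next loop.
```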

@pplonski pplonski closed this as completed Apr 7, 2020