
too much memory consumption by xgboost #13

Closed
pplonski opened this issue Apr 13, 2019 · 13 comments
Assignees: pplonski
Labels: bug (Something isn't working)

Comments

@pplonski
Contributor

When running several xgboost algorithms in a row on a dataset larger than 100 MB, RAM consumption grows very fast. This looks like a bug.

@pplonski pplonski added the bug Something isn't working label Apr 13, 2019
@pplonski pplonski self-assigned this Apr 13, 2019
@tRosenflanz

I ran into this with the GPU implementation of XGBoost: consider deleting previous booster objects; they tend to keep data around.
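
A minimal sketch of that cleanup, assuming `booster` names a trained xgboost model that is no longer needed (the name is illustrative):

```python
import gc

# Drop the last Python reference to the finished booster so its buffers
# (including any cached training data) become garbage.
del booster

# Prompt CPython to release the freed memory right away rather than later.
gc.collect()
```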

@pplonski
Copy link
Contributor Author

@tRosenflanz thank you for the suggestion. I had a hard time today figuring out how to fix it. I will try your suggestion and let you know how it goes.

@tmontana

Having the same issue. Would dumping the booster model to disk while keeping just the run results work? With a 900 MB dataset, running 10 models for at most 300 seconds each consumes 400 GB of RAM.

Thanks for a great tool.

@pplonski
Contributor Author

@tmontana would you send me some more details so I can reproduce the issue:

  • how many rows and columns are in the dataset? Are all columns numeric? What is the class balance?
  • what arguments were used in AutoML? 10 models with 5-fold cross-validation each?
  • could you post a code snippet?

@tmontana

Rows: 391,032
Columns: 397 (all numeric; 40% class 0, 60% class 1)

arguments used:

```python
model_types = ['Xgboost']
automl = AutoML(
    total_time_limit=None,
    learner_time_limit=30,
    algorithms=model_types,
    train_ensemble=True,
    start_random_models=10,
    hill_climbing_steps=3,
    top_models_to_improve=2,
)
automl._validation = {"validation_type": "kfold", "k_folds": 15, "shuffle": False, "stratify": True}
```

```python
%%time
automl.fit(X, y)
```

Result on a machine with 384 GB of RAM:

```
~/anaconda3/envs/mlj_shap/lib/python3.6/site-packages/xgboost/core.py in _maybe_pandas_data(data, feature_names, feature_types)
    226     feature_types = [PANDAS_DTYPE_MAPPER[dtype.name] for dtype in data_dtypes]
    227
--> 228     data = data.values.astype('float')
    229
    230     return data, feature_names, feature_types

MemoryError:
```
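
The failing line shows xgboost making a full float64 copy of the whole DataFrame via `data.values.astype('float')`. One hedged way to sidestep that pandas conversion path, assuming `X` and `y` are the objects from the snippet above and the caller controls DMatrix construction (whether this lowers peak memory depends on the xgboost version), is to hand xgboost a float32 numpy array instead:

```python
import gc
import numpy as np
import xgboost as xgb

# Convert once, up front; a numpy input skips the DataFrame branch in
# _maybe_pandas_data, and float32 halves the size of the converted copy.
X_np = X.to_numpy(dtype=np.float32)
dtrain = xgb.DMatrix(X_np, label=y)

del X_np      # the DMatrix keeps its own internal copy of the data
gc.collect()
```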

@tmontana

And this fails after 22 models on a machine with 784 GB of RAM:

```python
automl = AutoML(
    total_time_limit=None,
    learner_time_limit=30,
    algorithms=model_types,
    train_ensemble=True,
    start_random_models=30,
    hill_climbing_steps=5,
    top_models_to_improve=3,
)
```

@tmontana

Forgot to mention: there is no preprocessing needed. All features are numeric and there are no missing values.

@pplonski
Contributor Author

It is a serious problem. I'm thinking about a major change in the code so that not all models are kept in RAM. It will take me some time to rewrite the package, though.

In the meantime, I can offer help through my web service. If you want to tune models at https://mljar.com, please set up an account and I will give you as many free credits as needed to tune models on your dataset. Apologies for the problems!

@tmontana

Hi Piotr: no worries - I'm already a client of the web platform. Thanks

@pplonski
Contributor Author

Please send your username or email to contact@mljar.com so I can find your account and add credits for computation.

@pplonski
Contributor Author

pplonski commented Apr 2, 2020

I've investigated the memory consumption of xgboost. You can find a notebook with an example here.

I've submitted a ticket to xgboost (dmlc/xgboost#5474) asking for ways to limit memory usage. For now, the workaround is to save the model to the hard drive and then load it back.
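
A minimal sketch of that save-and-reload workaround (the synthetic data and the `model.bin` filename are placeholders):

```python
import gc
import numpy as np
import xgboost as xgb

# Train a throwaway model on synthetic data to stand in for a real run.
X = np.random.rand(10_000, 20).astype(np.float32)
y = (np.random.rand(10_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# Persist the model, drop the in-memory objects that cache training data...
booster.save_model("model.bin")
del booster, dtrain
gc.collect()

# ...then load a lean copy back for later prediction.
booster = xgb.Booster()
booster.load_model("model.bin")
```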

@tmontana

tmontana commented Apr 4, 2020

It seems like this is an ongoing issue for xgboost. From what I can read, they have solved it for GPUs.
Is there a way to specify using a GPU for xgboost in mljar-supervised?
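
For reference, in plain xgboost of this era GPU training is selected via the `tree_method` parameter; whether mljar-supervised exposed this at the time is not confirmed in this thread, so the sketch below uses xgboost directly with synthetic data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1_000, 10).astype(np.float32)
y = (np.random.rand(1_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # GPU histogram algorithm; requires a CUDA build
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```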

@pplonski
Copy link
Contributor Author

pplonski commented Apr 7, 2020

@tmontana I've made a few improvements in the mljar-supervised package to better handle memory:

  • the X training data is stored on the hard drive (in parquet format), and for each cross-validation step it is read back and split into train/test subsets, so I don't need to keep a copy of the original data for preprocessing at each CV step (a sketch of this round-trip follows below),
  • for xgboost I simply save and load the model, which looks like it reduces memory consumption.

There are still a lot of things to be added to the package, but memory consumption should no longer be so huge.
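
A sketch of that parquet round-trip, with sklearn's KFold standing in for the package's own splitter (an assumption; the real code may differ):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Persist X once, then keep only the on-disk copy between folds.
X.to_parquet("X_train.parquet")
n_rows = len(X)
del X

for train_idx, test_idx in KFold(n_splits=15).split(np.arange(n_rows)):
    # Re-read fresh data for this fold instead of holding a master copy.
    X_fold = pd.read_parquet("X_train.parquet")
    X_train, X_test = X_fold.iloc[train_idx], X_fold.iloc[test_idx]
    # ...preprocess and train on this fold; X_fold is freed on the next loop.
```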

@pplonski pplonski closed this as completed Apr 7, 2020