Out of memory error with NVIDIA K80 GPU #76

Closed
waqarws opened this issue Aug 13, 2018 · 23 comments

@waqarws

waqarws commented Aug 13, 2018

I'm trying to create an image classifier with ~1000 training samples and 7 classes, but it throws a runtime error. Is there a way of reducing the batch size, or something else that can be done to circumvent this?

Following is the error.

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
  len(cache))

@Jangol

Jangol commented Aug 14, 2018

Hi,
You can try adjusting the "constant.MAX_BATCH_SIZE" parameter; the default value is 32.
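For example, something like this minimal sketch before calling fit() (assuming the constant is exposed as autokeras.constant.Constant, the path referenced later in this thread):

from autokeras.constant import Constant

Constant.MAX_BATCH_SIZE = 16  # smaller batches lower the peak GPU memory used per model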

@Zvezdin

Zvezdin commented Aug 14, 2018

I also keep getting these errors on a GTX 1060. I managed to fix it the following way: if training a network crashes, my updated train function returns 0.0 as the accuracy and the maximum possible loss as the loss. This way the Bayesian optimization algorithm understands that this path is unreliable (for the current hardware) and tries to find an alternative. If @jhfjhfj1 wants, I can make a pull request with the change.
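Roughly, the idea is the following sketch (train_fn stands in for the actual autokeras train function, so the names are hypothetical):

import sys
import torch

def safe_train(train_fn, graph, train_data, test_data):
    """Run train_fn; report a CUDA out-of-memory error as the worst result instead of crashing."""
    try:
        return train_fn(graph, train_data, test_data)   # expected to return (accuracy, loss, graph)
    except RuntimeError as e:
        if 'out of memory' in str(e):
            torch.cuda.empty_cache()                    # release cached allocations after the failure
            return 0.0, sys.float_info.max, graph       # worst accuracy and loss for the optimizer
        raise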

@haifeng-jin
Collaborator

@Jangol @Zvezdin
Thank you for your help.

I think the GPU memory is usually big enough to handle most datasets.
I am not sure how to clean up the GPU memory used by a model in PyTorch.
I think we should clean the GPU memory on the main process,
then see if it still crashes or not.

If it still crashes, we will try to actually feed the data in batches.
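As a rough sketch of the cleanup (assuming a plain PyTorch model object on the main process):

import gc
import torch

def release_gpu_memory(model):
    model.cpu()               # move the parameters back to host memory
    del model                 # drop the reference held by this process
    gc.collect()              # collect any lingering Python references
    torch.cuda.empty_cache()  # return cached blocks to the CUDA driver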

@Zvezdin

Zvezdin commented Aug 14, 2018

@jhfjhfj1 Thanks for the reply.

I start getting out-of-memory errors after many successfully trained models. I don't think it's a memory leak or a dataset error. I'm using a custom dataset, and one input sample (without the batch dimension) is an 80x90x24 matrix. In my case, I think AutoKeras decides to create models that are too large for my 6GB GPU after many successful iterations. In such a case, would you consider my proposed solution (giving negative or zero feedback if it fails) optimal, or how else would you tackle such an issue?

@sparkdoc

sparkdoc commented Aug 14, 2018

When I first ran this with about 550 128x128 grayscale images using a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and then it worked for about an hour until crashing again. The error message was:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I was watching the GPU memory usage before it crashed, and it fluctuated in cycles as expected for a "grid search" sort of activity. Unfortunately, it looks like the peak memory usage of the more memory-intensive models progressively increases until it overwhelms the GPU memory.

Maybe it would be good, upon initialization of the program, to quantify the available memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and may require more memory, it could output such a message along with hints on how to accomplish this (e.g., smaller batches, smaller images, larger GPU memory, etc.). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel. A rough sketch of the memory check is below.
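This is only a sketch of the idea, using PyTorch's CUDA API rather than anything in AutoKeras:

import torch

def gpu_memory_budget(device=0, fraction=0.9):
    """Total device memory scaled down to leave headroom for activations and workspace."""
    total = torch.cuda.get_device_properties(device).total_memory
    return int(total * fraction)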

Also, what is the LIMIT_MEMORY parameter?

@haifeng-jin
Collaborator

@Zvezdin Returning 0 is not a good solution. It would seriously impact the performance of the GaussianProcessRegressor.
It would be better if the train function returned some special value and we did not update the GaussianProcessRegressor with it, along the lines of the sketch below.
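All names here are hypothetical; it is only a sketch of the idea:

FAILED = object()  # sentinel returned by the train wrapper when training runs out of memory

def record_result(gpr, x_queue, y_queue, x, y):
    """Only feed successful trainings to the GaussianProcessRegressor."""
    if y is FAILED:
        return                 # skip the update so the failure does not distort the surrogate
    x_queue.append(x)
    y_queue.append(y)
    gpr.fit(x_queue, y_queue)  # normal Bayesian-optimization update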

@aa18514

aa18514 commented Aug 16, 2018

Yes, this is a bug. You could get around it by setting time_limit to a reasonably small value, such as 10 seconds, to ensure that the run_searcher_once method runs only once.

If you open classifier.py in your autokeras installation, for example:

C:\Users\user\AppData\Local\Programs\Python36\lib\site-packages\autokeras\classifier.py

you will find the following piece of code at line 223:


if time_limit is None:
    time_limit = 24 * 60 * 60

start_time = time.time()
while time.time() - start_time <= time_limit:
    run_searcher_once(train_data, test_data, self.path)
    if len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM:
        break

You can see the flaw in this piece of code: if the time_limit parameter is not specified, it defaults to 24 * 60 * 60 seconds (24 hours), and the default value of Constant.MAX_MODEL_NUM is 1000, so you keep looping in the while loop until len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM. Also, after each training process completes, self.load_searcher().history stores the newly trained model, which means its length only increases by one per iteration. You could get around this by replacing Constant.MAX_MODEL_NUM with a sane value like 1 (or choosing a low time limit, like 10 seconds). I hope this helps.
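As a sketch, instead of editing the source you can probably override the constant before fitting (the Constant class is importable, as shown later in this thread):

from autokeras.constant import Constant

Constant.MAX_MODEL_NUM = 1                  # stop after a single searcher iteration
# clf.fit(x_train, y_train, time_limit=10)  # or keep the time limit small as well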

I banged my head against this for a few hours; there are a number of other problematic things in the code, I think.

@jageshmaharjan

jageshmaharjan commented Aug 16, 2018

I came across the same issue:

Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 25, in <module>
    clf.fit(x_train, y_train, time_limit=5 *60 *60)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 190, in search
    accuracy, loss, graph = train_results.get()[0]
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Also, I tried to reduce the time limit with
clf.fit(x_train, y_train, time_limit=10)
but that doesn't solve the problem. I bet it's a bug.

@aa18514

aa18514 commented Aug 16, 2018

Yes, this is another bug.
Try replacing line 178 in /home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py:

"train_results = pool.map_async(train, [(graph, train_data, test_data, self.trainer_args, <br>
                                                os.path.join(self.path, str(model_id) + '.png'), self.verbose)])" <br>

with:

train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))

and replace line 190:

accuracy, loss, graph = train_results.get()[0]

with:

accuracy, loss, graph = train_results

If you are on Windows, torch with CUDA and multiprocessing do not seem to work well together.

Also, please wrap your code in trainalutokeras_raw.py in:

if __name__ == "__main__":

@jageshmaharjan

Yup, it's wrapped in an if __name__ == "__main__": block.

Yet there's another, similar error. The stack trace is shown below:

Using TensorFlow backend.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 26, in <module>
    clf.fit(x_train, y_train, time_limit=10)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 178, in search
    train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 326, in train
    verbose).train_model(**trainer_args)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 122, in train_model
    self._train(train_loader, epoch)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 143, in _train
    outputs = self.model(inputs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/graph.py", line 603, in forward
    temp_tensor = torch_layer(edge_input_tensor)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Thanks for the quick response.

@aa18514

aa18514 commented Aug 16, 2018

AutoKeras is poorly maintained at the minute; I had a similar issue.

In "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py" line 301 explicitly cast values in the tuple with the name 'self.padding' as int (before calling the function F.conv2d with appropiate parameters), one way you could do this by is adding the line:

self.padding = (int(self.padding[0]), int(self.padding[1]))

before line 301:

return F.conv2d(input, self.weight, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)

@jageshmaharjan

Still the same error, despite the changes.

@aa18514

aa18514 commented Aug 16, 2018

What dataset are you using? Have you tried MNIST?

@jageshmaharjan

MNIST works perfectly. The data I am trying it on is my own dataset.

Here's my code:

from autokeras.classifier import ImageClassifier
from autokeras.classifier import load_image_dataset
import argparse

if __name__ == '__main__':
  parser = argparse.ArgumentParser(description="parameters for the input program")
  parser.add_argument('--train_csv', type=str, help="training csv data directory")
  parser.add_argument('--train_images', type=str, help="training images directory")
  parser.add_argument('--test_csv', type=str, help="test csv directory")
  parser.add_argument('--test_images', type=str, help="test images directory")
  #parser.add_argument('--dev', type=str, help="dev directory")

  args = parser.parse_args()

  x_train, y_train = load_image_dataset(csv_file_path=args.train_csv, images_path=args.train_images)
  print(x_train.shape)
  print(y_train.shape)

  x_test, y_test = load_image_dataset(csv_file_path=args.test_csv, images_path=args.test_images)
  print(x_train.shape)
  print(y_train.shape)

  clf = ImageClassifier(verbose=True)
  clf.fit(x_train, y_train, time_limit=10)
  clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
  y = clf.evaluate(x_test, y_test)
  print(y)

and the command I run:
python trainalutokeras_raw.py --train_csv ./train.csv --train_images ./images/train --test_csv ./test.csv --test_images ./images/test

@aa18514

aa18514 commented Aug 16, 2018

Could you please report x_train.shape and x_test.shape?

@jageshmaharjan

Sorry for my late reply.
x_train.shape is
(1348, 480, 640, 4)
and x_test.shape is
(1348, 480, 640, 4)

@haifeng-jin
Collaborator

@tl-yang
Add some code at the second-to-last line of the train() function in search.py
to remove the model from GPU memory.

Be careful with the weight imports in layers.py, and note that the loss returned by the train_model() function in utils.py should be a float instead of a tensor; see the sketch below.
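Roughly, the tail of train() could look like this sketch (the local names are hypothetical):

import torch

def finish_training(model, accuracy, loss_tensor):
    loss = float(loss_tensor)     # return a plain Python float, not a CUDA tensor
    model.cpu()                   # move the weights off the GPU
    del model                     # drop the reference held by this process
    torch.cuda.empty_cache()      # release cached allocations back to the driver
    return accuracy, loss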

@haifeng-jin
Collaborator

@aa18514 Thank you so much for your help!
We are trying hard to fix all the issues.
Do you think replacing multiprocessing with torch.multiprocessing would solve the problem of it not working well on Windows?
I mean, setting aside the problem of training models that are too large.

Thanks.

@haifeng-jin
Collaborator

@tl-yang
Please see the file net_transformer.py. It is where we can add an if clause to limit the depth and width of the model.
For now, we will put two more constants, MAX_MODEL_WIDTH and MAX_MODEL_DEPTH, in the Constant class instead of passing them through the parameters.
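A minimal sketch of that filtering (all names are hypothetical; MAX_MODEL_WIDTH and MAX_MODEL_DEPTH are the proposed constants, not yet in the library):

MAX_MODEL_WIDTH = 1024
MAX_MODEL_DEPTH = 20

def within_limits(depth, width):
    """True if a candidate model still fits the configured search space."""
    return depth <= MAX_MODEL_DEPTH and width <= MAX_MODEL_WIDTH

def filter_candidates(candidates):
    """candidates: iterable of (graph, depth, width) tuples produced by the transformer."""
    return [graph for graph, depth, width in candidates if within_limits(depth, width)]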

We will change it later if necessary.
Let me know if you have any questions.

Remember to branch out from develop branch.

Thanks.

haifeng-jin assigned haifeng-jin and tl-yang and unassigned tl-yang Aug 22, 2018
@haifeng-jin
Collaborator

@tl-yang You can try torch.multiprocessing.
I will set up the search space.

Thanks.

@haifeng-jin
Collaborator

This issue is fixed in the new release.
Thank you all for your contributions.

@rafaelmarconiramos

Hello, I have the same problem.
Was this problem fixed in the repository?
I am trying with a small dataset, but I still get the same error: out of memory.

What is the recommendation? Update autokeras to a specific branch?

Thanks

@MartinThoma

MartinThoma commented Nov 4, 2018

How is one supposed to use autokeras.constant.Constant? Is it enough to set

import autokeras
autokeras.constant.Constant.MAX_BATCH_SIZE = 64
autokeras.constant.Constant.MAX_LAYERS = 5

before fitting?
