Out of memory error with NVIDIA K80 GPU #76

Closed
waqarws opened this issue Aug 13, 2018 · 23 comments

@waqarws

waqarws commented Aug 13, 2018

I'm trying to create an image classifier with ~1000 training samples and 7 classes, but it throws a runtime error. Is there a way of reducing the batch size, or something else that can be done to circumvent this?

Following is the error.

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
  len(cache))

@Jangol

Jangol commented Aug 14, 2018

Hi,
You can try adjusting the "constant.MAX_BATCH_SIZE" parameter; the default value is 32.
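For example, something like this minimal sketch before calling fit() (assuming the constant is exposed as autokeras.constant.Constant, the path referenced later in this thread):

from autokeras.constant import Constant

Constant.MAX_BATCH_SIZE = 16  # smaller batches lower the peak GPU memory used per model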

@Zvezdin

Zvezdin commented Aug 14, 2018

I also keep getting these errors on a GTX 1060. I managed to fix it the following way: if training a network crashes, my updated train function returns 0.0 as the accuracy and the maximum possible loss as the loss. This way the Bayesian optimization algorithm understands that this path is unreliable (for the current hardware) and tries to find an alternative. If @jhfjhfj1 wants, I can make a pull request with the change.
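Roughly, the idea is the following sketch (train_fn stands in for the actual autokeras train function, so the names are hypothetical):

import sys
import torch

def safe_train(train_fn, graph, train_data, test_data):
    """Run train_fn; report a CUDA out-of-memory error as the worst result instead of crashing."""
    try:
        return train_fn(graph, train_data, test_data)   # expected to return (accuracy, loss, graph)
    except RuntimeError as e:
        if 'out of memory' in str(e):
            torch.cuda.empty_cache()                    # release cached allocations after the failure
            return 0.0, sys.float_info.max, graph       # worst accuracy and loss for the optimizer
        raise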

@haifeng-jin
Collaborator

@Jangol @Zvezdin
Thank you for your help.

I think the GPU memory is usually big enough to handle most datasets.
I am not sure how to clean up the GPU memory used by a model in PyTorch.
I think we should clean the GPU memory on the main process,
then see if it still crashes or not.

If it still crashes, we will try to actually feed the data in batches.
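As a rough sketch of the cleanup (assuming a plain PyTorch model object on the main process):

import gc
import torch

def release_gpu_memory(model):
    model.cpu()               # move the parameters back to host memory
    del model                 # drop the reference held by this process
    gc.collect()              # collect any lingering Python references
    torch.cuda.empty_cache()  # return cached blocks to the CUDA driver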

@Zvezdin

Zvezdin commented Aug 14, 2018

@jhfjhfj1 Thanks for the reply.

I start getting out-of-memory errors after many successfully trained models. I don't think it's a memory leak or a dataset error. I'm using a custom dataset, and one input sample (without the batch dimension) is an 80x90x24 matrix. In my case, I think AutoKeras decides to create models that are too large for my 6GB GPU after many successful iterations. In such a case, would you consider my proposed solution (giving negative or zero feedback if it fails) optimal, or how else would you tackle such an issue?

@sparkdoc

sparkdoc commented Aug 14, 2018

When I first ran this with about 550 128x128 grayscale images using a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and then it worked for about an hour until crashing again. The error message was:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I was watching the GPU memory usage before it crashed, and it fluctuated in cycles as expected for a "grid search" sort of activity. Unfortunately, it looks like the peak memory usage of the more memory-intensive models progressively increases until it overwhelms the GPU memory.

Maybe it would be good, upon initialization of the program, to quantify the available memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and may require more memory, it could output such a message along with hints on how to accomplish this (e.g., smaller batches, smaller images, larger GPU memory, etc.). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel. A rough sketch of the memory check is below.
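This is only a sketch of the idea, using PyTorch's CUDA API rather than anything in AutoKeras:

import torch

def gpu_memory_budget(device=0, fraction=0.9):
    """Total device memory scaled down to leave headroom for activations and workspace."""
    total = torch.cuda.get_device_properties(device).total_memory
    return int(total * fraction)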

Also, what is the LIMIT_MEMORY parameter?

@haifeng-jin
Collaborator

@Zvezdin Returning 0 is not a good solution. It would seriously impact the performance of the GaussianProcessRegressor.
It would be better if the train function returned some special value and we did not update the GaussianProcessRegressor with it, along the lines of the sketch below.
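All names here are hypothetical; it is only a sketch of the idea:

FAILED = object()  # sentinel returned by the train wrapper when training runs out of memory

def record_result(gpr, x_queue, y_queue, x, y):
    """Only feed successful trainings to the GaussianProcessRegressor."""
    if y is FAILED:
        return                 # skip the update so the failure does not distort the surrogate
    x_queue.append(x)
    y_queue.append(y)
    gpr.fit(x_queue, y_queue)  # normal Bayesian-optimization update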

@aa18514

aa18514 commented Aug 16, 2018

Yes, this is a bug. You could get around it by setting time_limit to a reasonably small value, such as 10 seconds, to ensure that the run_searcher_once method runs only once.

If you open classifier.py in your autokeras installation, for example:

C:\Users\user\AppData\Local\Programs\Python36\lib\site-packages\autokeras\classifier.py

you will find the following piece of code at line 223:


if time_limit is None:
    time_limit = 24 * 60 * 60

start_time = time.time()
while time.time() - start_time <= time_limit:
    run_searcher_once(train_data, test_data, self.path)
    if len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM:
        break

You can see the flaw in this piece of code: if the time_limit parameter is not specified, it defaults to 24 * 60 * 60 seconds (24 hours), and the default value of Constant.MAX_MODEL_NUM is 1000, so you keep looping in the while loop until len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM. Also, after each training process completes, self.load_searcher().history stores the newly trained model, which means its length only increases by one per iteration. You could get around this by replacing Constant.MAX_MODEL_NUM with a sane value like 1 (or choosing a low time limit, like 10 seconds). I hope this helps.
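As a sketch, instead of editing the source you can probably override the constant before fitting (the Constant class is importable, as shown later in this thread):

from autokeras.constant import Constant

Constant.MAX_MODEL_NUM = 1                  # stop after a single searcher iteration
# clf.fit(x_train, y_train, time_limit=10)  # or keep the time limit small as well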

I banged my head against this for a few hours; there are a number of other problematic things in the code, I think.

@jageshmaharjan

jageshmaharjan commented Aug 16, 2018

I came across the same issue:

Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 25, in <module>
    clf.fit(x_train, y_train, time_limit=5 *60 *60)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 190, in search
    accuracy, loss, graph = train_results.get()[0]
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Also, I tried to reduce the time limit with
clf.fit(x_train, y_train, time_limit=10)
but that doesn't solve the problem. I bet it's a bug.

@aa18514

aa18514 commented Aug 16, 2018

Yes, this is another bug.
Try replacing line 178 in /home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py:

"train_results = pool.map_async(train, [(graph, train_data, test_data, self.trainer_args, <br>
                                                os.path.join(self.path, str(model_id) + '.png'), self.verbose)])" <br>

with:

train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))

and replace line 190:

accuracy, loss, graph = train_results.get()[0]

with:

accuracy, loss, graph = train_results

If you are on Windows, torch with CUDA and multiprocessing do not seem to work well together.

Also, please wrap your code in trainalutokeras_raw.py in:

if __name__ == "__main__":

@jageshmaharjan

Yup, it's wrapped in an if __name__ == "__main__": block.

Yet there's another, similar error. The stack trace is shown below:

Using TensorFlow backend.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "trainalutokeras_raw.py", line 26, in <module>
    clf.fit(x_train, y_train, time_limit=10)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 225, in fit
    run_searcher_once(train_data, test_data, self.path)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/classifier.py", line 40, in run_searcher_once
    searcher.search(train_data, test_data)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 178, in search
    train_results = train((graph, train_data, test_data, self.trainer_args, os.path.join(self.path, str(model_id) + '.png'), self.verbose))
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/search.py", line 326, in train
    verbose).train_model(**trainer_args)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 122, in train_model
    self._train(train_loader, epoch)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/utils.py", line 143, in _train
    outputs = self.model(inputs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/autokeras/graph.py", line 603, in forward
    temp_tensor = torch_layer(edge_input_tensor)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/home/maybe/anaconda3/envs/asr/lib/python3.6/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Thanks for the quick response.

@aa18514

aa18514 commented Aug 16, 2018

AutoKeras is poorly maintained at the minute; I had a similar issue.

In "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py" line 301 explicitly cast values in the tuple with the name 'self.padding' as int (before calling the function F.conv2d with appropiate parameters), one way you could do this by is adding the line:

self.padding = (int(self.padding[0]), int(self.padding[1]))

before line 301:

return F.conv2d(input, self.weight, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)

@jageshmaharjan

Still the same error, despite the changes.

@aa18514

aa18514 commented Aug 16, 2018

What dataset are you using? Have you tried MNIST?

@jageshmaharjan

MNIST works perfectly. The data I am trying it on is my own dataset.

Here's my code:

from autokeras.classifier import ImageClassifier
from autokeras.classifier import load_image_dataset
import argparse

if __name__ == '__main__':
  parser = argparse.ArgumentParser(description="parameters for the input program")
  parser.add_argument('--train_csv', type=str, help="training csv data directory")
  parser.add_argument('--train_images', type=str, help="training images directory")
  parser.add_argument('--test_csv', type=str, help="test csv directory")
  parser.add_argument('--test_images', type=str, help="test images directory")
  #parser.add_argument('--dev', type=str, help="dev directory")

  args = parser.parse_args()

  x_train, y_train = load_image_dataset(csv_file_path=args.train_csv, images_path=args.train_images)
  print(x_train.shape)
  print(y_train.shape)

  x_test, y_test = load_image_dataset(csv_file_path=args.test_csv, images_path=args.test_images)
  print(x_train.shape)
  print(y_train.shape)

  clf = ImageClassifier(verbose=True)
  clf.fit(x_train, y_train, time_limit=10)
  clf.final_fit(x_train, y_train, x_test, y_test, retrain=True)
  y = clf.evaluate(x_test, y_test)
  print(y)

and the command I run:
python trainalutokeras_raw.py --train_csv ./train.csv --train_images ./images/train --test_csv ./test.csv --test_images ./images/test

@aa18514

aa18514 commented Aug 16, 2018

Could you please report x_train.shape and x_test.shape?

@jageshmaharjan

Sorry for my late reply.
x_train.shape is
(1348, 480, 640, 4)
and x_test.shape is
(1348, 480, 640, 4)

@haifeng-jin
Collaborator

@tl-yang
Add some code at the second-to-last line of the train() function in search.py
to remove the model from GPU memory.

Be careful with the weight imports in layers.py, and note that the loss returned by the train_model() function in utils.py should be a float instead of a tensor; see the sketch below.
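Roughly, the tail of train() could look like this sketch (the local names are hypothetical):

import torch

def finish_training(model, accuracy, loss_tensor):
    loss = float(loss_tensor)     # return a plain Python float, not a CUDA tensor
    model.cpu()                   # move the weights off the GPU
    del model                     # drop the reference held by this process
    torch.cuda.empty_cache()      # release cached allocations back to the driver
    return accuracy, loss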

@haifeng-jin
Collaborator

@aa18514 Thank you so much for your help!
We are trying hard to fix all the issues.
Do you think replacing multiprocessing with torch.multiprocessing would solve the problem of it not working well on Windows?
I mean, setting aside the problem of training models that are too large.

Thanks.

@haifeng-jin
Collaborator

@tl-yang
Please see the file net_transformer.py. It is where we can add an if clause to limit the depth and width of the model.
For now, we will put two more constants, MAX_MODEL_WIDTH and MAX_MODEL_DEPTH, in the Constant class instead of passing them through the parameters.
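A minimal sketch of that filtering (all names are hypothetical; MAX_MODEL_WIDTH and MAX_MODEL_DEPTH are the proposed constants, not yet in the library):

MAX_MODEL_WIDTH = 1024
MAX_MODEL_DEPTH = 20

def within_limits(depth, width):
    """True if a candidate model still fits the configured search space."""
    return depth <= MAX_MODEL_DEPTH and width <= MAX_MODEL_WIDTH

def filter_candidates(candidates):
    """candidates: iterable of (graph, depth, width) tuples produced by the transformer."""
    return [graph for graph, depth, width in candidates if within_limits(depth, width)]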

We will change it later if necessary.
Let me know if you have any questions.

Remember to branch out from develop branch.

Thanks.

haifeng-jin assigned haifeng-jin and tl-yang and unassigned tl-yang Aug 22, 2018
@haifeng-jin
Collaborator

@tl-yang You can try torch.multiprocessing.
I will set up the search space.

Thanks.

@haifeng-jin
Collaborator

This issue is fixed in the new release.
Thank you all for your contributions.

@rafaelmarconiramos

Hello, I have the same problem.
Was this problem fixed in the repository?
I am trying with a small dataset, but I still get the same error: out of memory.

What is the recommendation? Update autokeras to a specific branch?

Thanks

@MartinThoma

MartinThoma commented Nov 4, 2018

How is one supposed to use autokeras.constant.Constant? Is it enough to set

import autokeras
autokeras.constant.Constant.MAX_BATCH_SIZE = 64
autokeras.constant.Constant.MAX_LAYERS = 5

before fitting?
