Out of memory error with NVIDIA K80 GPU #76
Hi,
I also keep getting these errors on a GTX 1060. I managed to work around it the following way: if training a network crashes, I updated the train function to return 0.0 as accuracy and the maximum possible loss as loss. This way the Bayesian optimization algorithm understands that this path is unreliable (for the current hardware) and tries to find an alternative. If @jhfjhfj1 wants, I can make a pull request with the change.
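A rough standalone sketch of that workaround (the `train_fn` wrapper and its `(accuracy, loss)` return convention are assumptions here, not AutoKeras's actual API):

```python
import torch

def train_with_oom_guard(train_fn, *args, **kwargs):
    """Run a training function; on CUDA out-of-memory, return a penalty
    score instead of crashing the whole architecture search."""
    try:
        return train_fn(*args, **kwargs)   # expected to return (accuracy, loss)
    except RuntimeError as e:
        if 'out of memory' in str(e):
            torch.cuda.empty_cache()       # release cached blocks before continuing
            return 0.0, float('inf')       # worst accuracy, "maximum possible" loss
        raise                              # unrelated errors are re-raised
```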
@Jangol @Zvezdin I think the memory is usually big enough to handle most of the datasets. If it still crashes, we will try to feed the data in batches.
@jhfjhfj1 Thanks for the reply. I start getting out of memory errors after many successfully trained models, so I don't think it's a memory leak or a dataset error. I'm using a custom dataset, and one input sample (without batching) is an 80x90x24 matrix. In my case, I think AutoKeras decides to create models that are too large for my 6GB GPU after many successful iterations. In such a case, would you consider my proposed solution (giving negative or zero feedback if a model fails) appropriate, or how else would you tackle such an issue?
When I first ran this with about 550 128x128 grayscale images on a Quadro P4000 with 8 GB of memory, it immediately crashed due to insufficient memory. I adjusted the constant.MAX_BATCH_SIZE parameter from the default of 128 down to 32, and then it worked for about an hour until crashing again. The error message was:
I was watching the GPU memory usage before it crashed, and it fluctuated in cycles as expected for a "grid search" sort of activity. Unfortunately, it looks like the peak memory usage of the more memory-intensive models progressively increases until it overwhelms the GPU memory.
Maybe it would be good, upon initialization of the program, to quantify the available memory and then cap the model search to models that fit within that limit. If the program determines that it cannot identify an optimal model within that constraint and may require more memory, it could output such a message along with hints on how to proceed (e.g., smaller batches, smaller images, larger GPU memory, etc.). It might also help to offer a grayscale option in the load_image_dataset method that reduces a color image from three color channels to one grayscale channel. Also, what is the LIMIT_MEMORY parameter?
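For reference, the batch-size cap mentioned above can be lowered before constructing the classifier; a minimal sketch, assuming the 0.2.x-era module layout (the import paths and MNIST data here are placeholders and may differ in your version):

```python
from keras.datasets import mnist
from autokeras import ImageClassifier       # import path may differ by version
from autokeras.constant import Constant

# Cap the batch size used during the architecture search (default is 128)
Constant.MAX_BATCH_SIZE = 32

(x_train, y_train), (x_test, y_test) = mnist.load_data()
clf = ImageClassifier(verbose=True)
clf.fit(x_train, y_train, time_limit=60 * 60)
```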
@Zvezdin Returning 0 is not a good solution. It will seriously impact the performance of the GaussianProcessRegressor.
Yes, this is a bug. You can get around it by setting time_limit to a reasonably small value, such as 10 seconds, to ensure the run_searcher_once method runs only once. If you check the following:
you will find the following piece of code on line 223.
You can see the flaw in this piece of code: if the time_limit parameter is not specified, it defaults to 24 * 60 minutes, and the default value of Constant.MAX_MODEL_NUM is 1000, so you keep looping in the while loop until len(self.load_searcher().history) >= Constant.MAX_MODEL_NUM. After each training run completes, self.load_searcher().history stores the newly trained model, which means its length only increases by one per iteration. You can get around this by replacing Constant.MAX_MODEL_NUM with a sane value like 1 (or choosing a low time limit, such as 10 seconds). I hope this helps... I banged my head over this for a few hours; there are a number of other problematic things in the code, I think.
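A stripped-down, self-contained illustration of the loop being described (names and values are paraphrased from the comment above, not the verbatim AutoKeras source):

```python
import time

MAX_MODEL_NUM = 1000            # Constant.MAX_MODEL_NUM default
time_limit = 24 * 60 * 60       # default when fit() gets no time_limit (24 hours)

def train_one_model():
    """Stand-in for one run_searcher_once() iteration."""
    return object()

history = []                    # stands in for self.load_searcher().history
start_time = time.time()
while time.time() - start_time <= time_limit:
    history.append(train_one_model())   # each iteration adds exactly one model
    if len(history) >= MAX_MODEL_NUM:
        break                           # with the defaults, effectively a 24-hour run
```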
I came across the same issue:
Also, I tried to reduce the time limit with |
Yes, this is another bug.
with:
and replace line 190
with:
If you are on Windows, torch with CUDA and multiprocessing do not seem to work well together. Also, please try to wrap your code in trainalutokeras_raw.py in:
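Presumably this refers to the standard multiprocessing entry-point guard; a minimal sketch assuming a typical training script (the MNIST data and import path below are placeholders, not the original trainalutokeras_raw.py):

```python
from keras.datasets import mnist
from autokeras import ImageClassifier   # import path may differ by version

def main():
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    clf = ImageClassifier(verbose=True)
    clf.fit(x_train, y_train, time_limit=10 * 60)

# On Windows, multiprocessing spawns fresh interpreters that re-import this
# file, so the training call must not execute at import time.
if __name__ == '__main__':
    main()
```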
Yup, it's already wrapped in it, yet there's another, similar error. The stack trace is shown below.
Thanks for the quick response.
AutoKeras is poorly maintained at the minute; I had a similar issue. In "/home/maybe/anaconda3/envs/asr/lib/python3.6/site-packages/torch/nn/modules/conv.py", at line 301, explicitly cast the values in the tuple named 'self.padding' to int before F.conv2d is called with the appropriate parameters. One way to do this is by adding the line self.padding = (int(self.padding[0]), int(self.padding[1])) before line 301, which reads return F.conv2d(input, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups).
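In context, the suggested edit looks roughly like this (a sketch of Conv2d.forward() in that torch version with the cast added; presumably the padding values coming in are not plain Python ints):

```python
# torch/nn/modules/conv.py -- Conv2d.forward(), around line 301
def forward(self, input):
    # Workaround: cast padding entries to plain ints before the conv call,
    # since this torch version rejects non-int (e.g. numpy integer) padding.
    self.padding = (int(self.padding[0]), int(self.padding[1]))
    return F.conv2d(input, self.weight, self.bias, self.stride,
                    self.padding, self.dilation, self.groups)
```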
Yet the same error, despite the changes. |
What dataset are you using? Have you tried MNIST?
MNIST works perfectly; the data I'm trying with is my own dataset. Here's my PR:
and my script:
Could you please report x_train.shape and x_test.shape?
Sorry for my late reply.
@tl-yang Be careful about the weight imports in layers.py, and note that the loss returned by the train_model() function in utils.py should be a float instead of a tensor.
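A tiny illustration of that conversion (the variable names are placeholders, not the actual utils.py code):

```python
import torch

loss = torch.tensor(0.4273)        # stand-in for the loss computed in train_model()
loss_value = loss.item()           # .item() extracts a plain Python float
assert isinstance(loss_value, float)
```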
@aa18514 Thank you so much for your help!
@tl-yang We will change it later if necessary. Remember to branch out from the develop branch. Thanks.
@tl-yang You can try torch.multiprocessing. Thanks.
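torch.multiprocessing is a drop-in wrapper around the standard library's multiprocessing that handles sharing CUDA tensors between processes; a minimal sketch of using it (the worker body is a placeholder, not AutoKeras code):

```python
import torch.multiprocessing as mp

def worker(queue):
    queue.put('done')                 # placeholder for a training job

if __name__ == '__main__':
    mp.set_start_method('spawn')      # 'spawn' is required when subprocesses touch CUDA
    queue = mp.Queue()
    p = mp.Process(target=worker, args=(queue,))
    p.start()
    print(queue.get())
    p.join()
```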
This issue is fixed in the new release. |
Hello, I have the same problem. What is the recommendation? Update autokeras to a specific branch? Thanks |
How is one supposed to use
before fitting? |
Trying to create an image classifier with ~1000 training samples and 7 classes, but it throws a runtime error. Is there a way to reduce the batch size, or is there something else that can be done to circumvent this?
The error follows:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
/usr/lib/python3.5/multiprocessing/semaphore_tracker.py:129: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
  len(cache))