
torch.cuda.OutOfMemoryError: CUDA out of memory #6

Closed
m-GDEV opened this issue Mar 7, 2023 · 9 comments

Comments

@m-GDEV

m-GDEV commented Mar 7, 2023

Thanks for making this repo! I was looking to run this on my own hardware and this is helping me do just that.

I first tried to run inference with Facebook's own instructions, but I was getting a memory error. I tried a few other modifications, but they did not work either.

Finally, I came to this repository to try and fix my problem. I'm still getting the same error, however.

Error:

Traceback (most recent call last):
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 67, in <module>
    run(
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 48, in run
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size, max_seq_len, max_batch_size)
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 32, in load
    model = Transformer(model_args)
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 196, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 170, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 152, in __init__
    self.w2 = nn.Linear(
  File "/home/musa/.local/share/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.75 GiB total capacity; 11.50 GiB already allocated; 11.12 MiB free; 11.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have consistently seen this error every time I've tried to run inference, and I'm not sure how to fix it.

I run inference with this command: python inference.py --ckpt_dir ./llama-dl/7B --tokenizer_path ./llama-dl/tokenizer.model
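
For what it's worth, the error message also suggests the max_split_size_mb allocator hint, which can be passed through the PYTORCH_CUDA_ALLOC_CONF environment variable, for example (128 is just an illustrative value):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python inference.py --ckpt_dir ./llama-dl/7B --tokenizer_path ./llama-dl/tokenizer.model

I don't know whether that helps when the model itself doesn't fit, though.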

My Specs:

CPU: Intel Core i5-11500
GPU: Nvidia RTX 3060 (12 GB)
RAM: 16 GB

With these specs it seems I should be able to run this version of inference but it still does not work.

Before running the program I ran the free command:

               total        used        free      shared  buff/cache   available
Mem:        15173760      668404    12962088         584     1543268    14165152
Swap:       15605752      550436    15055316

So I definitely have more than the 8 GB of RAM mentioned in the README.

I would really appreciate your help, thanks!

@Starlento

Starlento commented Mar 8, 2023

For me, inference.py uses 15.7 GB of VRAM with everything at its defaults. I am running it under Windows WSL2, so the script itself probably accounts for roughly 14 GB of that... which still seems to be too much for 12 GB of VRAM.

@m-GDEV
Author

m-GDEV commented Mar 8, 2023

Oh ok, yeah, that would make sense. Do you think there might be a way to run the script so that it uses less VRAM?

@brandonrobertz

You can try the 8-bit (lower precision) model: https://github.com/tloen/llama-int8
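
If you go the Hugging Face route instead, 8-bit loading looks roughly like this (an untested sketch, not the exact interface of the linked repo; it assumes the transformers LLaMA port plus bitsandbytes and accelerate are installed, and the checkpoint path is a placeholder):

# Rough sketch of int8 loading via transformers + bitsandbytes.
# "./llama-7b-hf" is a placeholder for an HF-converted 7B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # let accelerate place layers on GPU/CPU automatically
    load_in_8bit=True,   # quantize the linear-layer weights to int8 with bitsandbytes
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Int8 weights take roughly half the VRAM of fp16, which should bring 7B within reach of a 12 GB card.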

@Starlento

I found https://github.com/qwopqwop200/GPTQ-for-LLaMa, which quantizes the model to 4-bit, and I can run the benchmark in that repo. But the code there uses the Hugging Face transformers library, and it seems that at least the model loading is different.

@SWHL

SWHL commented Mar 8, 2023

I changed max_seq_len from 1024 to 512 and successfully ran inference with 16 GB of RAM.

max_seq_len=1024,
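
For anyone looking for where to make the change: going by the load() signature in the traceback above, it is just the value passed into load(), roughly like this (a sketch, not the exact line in the file):

generator = load(
    ckpt_dir,
    tokenizer_path,
    local_rank,
    world_size,
    max_seq_len=512,                # was 1024; a shorter context shrinks the preallocated caches
    max_batch_size=max_batch_size,
)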

@vo2021

vo2021 commented Mar 9, 2023

I changed max_seq_len from 1024 to 512 and successfully ran inference with 16 GB of RAM.

max_seq_len=1024,

Thanks for the tips!

@SWHL

SWHL commented Mar 9, 2023

  • I simply tidied up the code of this repo and kept only the two simplest use cases.
  • Hope it helps everyone.
  • LLaMADemo

@DigitalLawyerLCB

Hello, I am using an Nvidia GTX 1650 4 GB GPU. Is there any way to run the 7B model on it?

@juncongmoo
Owner

Hello, I am using an Nvidia GTX 1650 4 GB GPU. Is there any way to run the 7B model on it?

Sure. I am still working on it... so that we can run 13B on a 4GB GPU.
