
torch.cuda.OutOfMemoryError: CUDA out of memory #6

Closed
m-GDEV opened this issue Mar 7, 2023 · 9 comments

Comments

@m-GDEV

m-GDEV commented Mar 7, 2023

Thanks for making this repo! I was looking to run this on my own hardware and this is helping me do just that.

I first tried to run inference with Facebook's own instructions, but I was getting a memory error. I tried a few other modifications, but they did not work either.

Finally, I came to this repository to try and fix my problem. I'm still getting the same error, however.

Error:

Traceback (most recent call last):
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 67, in <module>
    run(
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 48, in run
    generator = load(ckpt_dir, tokenizer_path, local_rank, world_size, max_seq_len, max_batch_size)
  File "/mnt/FILEZ/Files/Downloads/Media/llama/inference.py", line 32, in load
    model = Transformer(model_args)
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 196, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 170, in __init__
    self.feed_forward = FeedForward(
  File "/mnt/FILEZ/Files/Downloads/Media/llama/llama/model_single.py", line 152, in __init__
    self.w2 = nn.Linear(
  File "/home/musa/.local/share/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 11.75 GiB total capacity; 11.50 GiB already allocated; 11.12 MiB free; 11.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have consistently seen this error every time I've tried to run inference, and I'm not sure how to fix it.

I run inference with this command: python inference.py --ckpt_dir ./llama-dl/7B --tokenizer_path ./llama-dl/tokenizer.model
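
For what it's worth, the error message also suggests the max_split_size_mb allocator hint, which can be passed through the PYTORCH_CUDA_ALLOC_CONF environment variable, for example (128 is just an illustrative value):

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python inference.py --ckpt_dir ./llama-dl/7B --tokenizer_path ./llama-dl/tokenizer.model

I don't know whether that helps when the model itself doesn't fit, though.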

My Specs:

CPU: Intel Core i5-11500
GPU: Nvidia RTX 3060 (12 GB)
RAM: 16 GB

With these specs it seems I should be able to run this version of inference but it still does not work.

Before running the program I ran the free command:

               total        used        free      shared  buff/cache   available
Mem:        15173760      668404    12962088         584     1543268    14165152
Swap:       15605752      550436    15055316

So I definitely have more than the 8 GB of RAM mentioned in the README.

I would really appreciate your help, thanks!

@Starlento

Starlento commented Mar 8, 2023

For me, inference.py uses 15.7 GB of VRAM with everything at its defaults. I am running it under Windows WSL2, so the script itself probably accounts for roughly 14 GB of that... which still seems to be too much for 12 GB of VRAM.

@m-GDEV
Author

m-GDEV commented Mar 8, 2023

Oh ok, yeah, that would make sense. Do you think there might be a way to run the script so that it uses less VRAM?

@brandonrobertz

You can try the 8-bit (lower precision) model: https://github.com/tloen/llama-int8
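
If you go the Hugging Face route instead, 8-bit loading looks roughly like this (an untested sketch, not the exact interface of the linked repo; it assumes the transformers LLaMA port plus bitsandbytes and accelerate are installed, and the checkpoint path is a placeholder):

# Rough sketch of int8 loading via transformers + bitsandbytes.
# "./llama-7b-hf" is a placeholder for an HF-converted 7B checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # let accelerate place layers on GPU/CPU automatically
    load_in_8bit=True,   # quantize the linear-layer weights to int8 with bitsandbytes
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Int8 weights take roughly half the VRAM of fp16, which should bring 7B within reach of a 12 GB card.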

@Starlento

I found https://github.com/qwopqwop200/GPTQ-for-LLaMa, which quantizes the model to 4-bit, and I can run the benchmark in that repo. But the code there uses the Hugging Face transformers library, and it seems that at least the model loading is different.

@SWHL

SWHL commented Mar 8, 2023

I changed max_seq_len from 1024 to 512 and successfully ran inference with 16 GB of RAM.

max_seq_len=1024,
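
For anyone looking for where to make the change: going by the load() signature in the traceback above, it is just the value passed into load(), roughly like this (a sketch, not the exact line in the file):

generator = load(
    ckpt_dir,
    tokenizer_path,
    local_rank,
    world_size,
    max_seq_len=512,                # was 1024; a shorter context shrinks the preallocated caches
    max_batch_size=max_batch_size,
)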

@vo2021

vo2021 commented Mar 9, 2023

I changed max_seq_len from 1024 to 512 and successfully ran inference with 16 GB of RAM.

max_seq_len=1024,

Thanks for the tips!

@SWHL

SWHL commented Mar 9, 2023

  • I simply tidied up the code of this repo and kept only the two simplest use cases.
  • Hope it helps everyone.
  • LLaMADemo

@DigitalLawyerLCB

Hello, I am using an Nvidia GTX 1650 4 GB GPU. Is there any way to run the 7B model on it?

@juncongmoo
Owner

Hello, I am using an Nvidia GTX 1650 4 GB GPU. Is there any way to run the 7B model on it?

Sure. I am still working on it... so that we can run 13B on a 4GB GPU.
