4-bit: "Allocator: not enough memory: you tried to allocate 35389440 bytes." #429
The final line indicates you are out of DRAM (regular RAM) to load the model. Have you tried using a pagefile?
No, I haven't used a pagefile, because I didn't need it. The 8-bit version works without a pagefile; does the 4-bit version need more DRAM?
Well, it only loads your model into RAM initially, then loads it into VRAM.
Please create a pagefile/swap space. I had the same problem until I created a swap space.
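For reference, on Linux a swap file can be set up with a sketch like the following (the path and the 16 GB size are examples, not recommendations; adjust for your system). On Windows, the pagefile is configured under System Properties → Advanced → Performance → Virtual memory.

```
# create and enable a 16 GB swap file (size is an example)
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```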
Has anyone tried converting the 4-bit weights to
That's weird. I checked again and I do have a swap file on my second hard disk. There are 100 GB of free space, and the swap file size is automatically managed by Windows.
> python server.py --share --model oasst-sft-1-pythia-12b --cpu --load-in-8bit
Loading oasst-sft-1-pythia-12b...
Loading checkpoint shards: 33%|████████████████████████████████████████████████▎ | 1/3 [07:52<15:44, 472.28s/it]
Traceback (most recent call last):
File "D:\Study\MCU\C Learning\text-generation-webui\server.py", line 236, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Study\MCU\C Learning\text-generation-webui\modules\models.py", line 157, in load_model
model = AutoModelForCausalLM.from_pretrained(checkpoint, **params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Python\Lib\site-packages\transformers\models\auto\auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Python\Lib\site-packages\transformers\modeling_utils.py", line 2646, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Python\Lib\site-packages\transformers\modeling_utils.py", line 2969, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Python\Lib\site-packages\transformers\modeling_utils.py", line 640, in _load_state_dict_into_meta_model
param = param.to(dtype)
^^^^^^^^^^^^^^^
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 419430400 bytes.
I have this error too. I have 16 GB of RAM and a 32 GB swap file. Any ideas?
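A rough back-of-the-envelope calculation (the numbers below are illustrative assumptions, not measurements) shows why 16 GB of RAM can be tight even with swap: `transformers` materializes each checkpoint shard in CPU RAM and converts its dtype via `param.to(dtype)`, so both the source and the converted copy briefly coexist.

```python
# Rough RAM estimate for loading a 12B-parameter model on CPU.
# Illustrative arithmetic only, not a measurement.
params = 12_000_000_000

bytes_fp16 = params * 2   # 2 bytes per parameter in fp16
bytes_int8 = params * 1   # 1 byte per parameter in int8

gib = 1024 ** 3
print(f"fp16: {bytes_fp16 / gib:.1f} GiB")  # ~22.4 GiB
print(f"int8: {bytes_int8 / gib:.1f} GiB")  # ~11.2 GiB

# During `param.to(dtype)` both copies of the current tensor exist,
# so peak usage spikes above the final model size while loading.
```

This would explain why the 8-bit path fit in 16 GB of RAM while paths that stage fp16 weights first do not.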
@patrickmros This is only an idea, but since you are running CPU generation, can you temporarily uninstall CUDA and try again?
Wait, there is something wrong here. I don't want to run CPU generation! This is the output I get from conda list -p "C:\Users\patri\miniconda3\envs\textgen":
# packages in environment at C:\Users\patri\miniconda3\envs\textgen:
# Name    Version    Build         Channel
7zip      19.00      h2d74725_2    conda-forge
Your env looks alright, as far as I can tell.
Sorry, I was mixing this up with @HCBlackFox; on your side, the CPU allocator was failing.
Thanks, I removed libbitsandbytescpu.so from the package directory and the error went away (I got the following error lul)
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
I followed the guide 4bit LLaMA Setup for Windows and it worked. One time.
The next time I tried to start it, I get:
Loading llama-13b...
Loading model ...
Traceback (most recent call last):
  File "J:\LLaMA\text-generation-webui\server.py", line 236, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "J:\LLaMA\text-generation-webui\modules\models.py", line 100, in load_model
    model = load_quantized(model_name)
  File "J:\LLaMA\text-generation-webui\modules\GPTQ_loader.py", line 55, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
  File "J:\LLaMA\text-generation-webui\repositories\GPTQ-for-LLaMa\llama.py", line 245, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "C:\Users\patri\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\patri\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1131, in _load
    result = unpickler.load()
  File "C:\Users\patri\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\patri\miniconda3\envs\textgen\lib\site-packages\torch\serialization.py", line 1079, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage).storage().untyped()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 35389440 bytes.
Well, I'm quite sure I have plenty more than 35 megabytes free in my VRAM, RAM and on my hard disk... Where exactly does it try to allocate 35 MB?
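For context: the traceback shows the failure inside `torch.load`, which deserializes every checkpoint tensor into CPU RAM first. On Windows the allocation is charged against the commit limit (RAM plus pagefile), so free disk space and free VRAM don't help; the request fails when the process's total committed memory hits that limit, even though the single tensor is small. A small sketch of the arithmetic (the layer-shape identification below is an assumption, not confirmed by the source):

```python
# The failing allocation is one tensor being deserialized by torch.load.
nbytes = 35_389_440
print(f"{nbytes / 2**20:.2f} MiB")  # 33.75 MiB

# Plausibly (assumption) a 4-bit-packed LLaMA-13B MLP weight:
# hidden=5120, intermediate=13824, eight 4-bit values per int32 word.
hidden, intermediate = 5120, 13824
packed_int32_words = hidden * intermediate // 8
print(packed_int32_words * 4 == nbytes)  # True
```

So the error means the process as a whole ran out of commit charge at that point, not that 35 MB of memory was unavailable in isolation.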