stable diffusion malloc error #21

Closed
davidjoffe opened this issue Dec 6, 2023 · 6 comments

@davidjoffe

davidjoffe commented Dec 6, 2023

MLX seemed to install fine, but when I try the stable diffusion example I keep getting this:

134217728 bytes seems small; there should not be a problem allocating only ~134 MB of memory, as the machine has 8 GB RAM and ~4 GB free. It's an Apple M2 Mac mini (running headless and connected to via VNC; could that cause issues?).
Or am I misunderstanding something basic? Does this need an M2 Pro or something? It does go through all the (default) 50 steps before failing.

I ran the MLX unit tests and they seemed to run fine; it said '154 tests ran, 4 skipped'.

(foo) david@Davids-Mac-mini stable_diffusion % python txt2image.py "A photo of an astronaut riding a horse on Mars." --n_images 1 --n_rows 1
/Users/david/mlx/foo/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
100%|
 [00:00<?, ?it/s]libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 134217728 bytes.
zsh: abort      python txt2image.py "A photo of an astronaut riding a horse on Mars."  1  1
(foo) david@Davids-Mac-mini stable_diffusion % /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

python --version reports 3.9.6. OS: Ventura 13.6.1. I tried Python 3.11.5 and it does the same thing.
The output cuts off at "There appear to be %d '".

I also briefly tried installing from source, as well as conda with Python 3.12; neither worked.

@angeloskath
Member

Hm, unfortunately that means it ran out of memory. We haven't tested it on a Mac mini with 8 GB of RAM. I am using it on my M2 Air, but that has 24 GB. Can you try removing CFG by setting --cfg 0?

@davidjoffe
Author

davidjoffe commented Dec 7, 2023

Hm, thanks ... adding --cfg 0 unfortunately makes no difference.

Even if it's running out of memory, it seems odd. Why wouldn't it just use the system pagefile/swap for such a relatively small amount like ~134 MB? In 20+ years of dev I've never seen malloc() fail from high memory load; it normally just allocates and uses swap. These 8 GB Macs regularly swap like crazy but continue, and I often use much, MUCH more memory on these Macs than this appears to be using (going by Activity Monitor).
I've closed everything else as well.

Or does this have something to do with the unified memory architecture? Or memory fragmentation?

It runs smoothly through the default 50 steps; it just fails at the end. I'm watching memory load as this runs and it's in the green most of the way, using at most 2 GB of swap until it completes the 50 steps. At only 2 GB of swap, that doesn't seem like such an excessive memory load that a malloc of just ~134 MB should fail; that's a very light memory load:

(foo) (base) david@Davids-Mac-mini stable_diffusion % python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1
/Users/david/mlx/foo/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [02:37<00:00,  3.14s/it]
  0%|                                                                                                           | 0/1 [00:00<?, ?it/s]libc++abi: terminating due to uncaught exception of type std::runtime_error: [malloc_or_wait] Unable to allocate 134217728 bytes.
zsh: abort      python3 txt2image.py "A cat on a sandwich" --cfg 0 --n_images 1 --n_rows 1

@davidjoffe
Author

davidjoffe commented Dec 7, 2023

I did some more tests and ran it again with --steps 2. The maximum memory load the system hits is about 7.5 GB, and the entire system is using ONLY ~500 MB of swap this time (that's nothing). I don't see why a malloc of just ~134 MB should fail under those conditions, or why it shouldn't just use swap.

I upgraded to Sonoma, same thing

EDIT2 FWIW the system report looks like this (see below):

Edit3 Looking at the source, it appears this internally calls/uses metal::allocator?

Edit4 Is it possible the line 'block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}' is where the issue stems from? If I have time later I may try rebuilding with that line changed and see if it helps; see the sketch after the crash report below.


Crashed Thread:        2

Application Specific Information:
abort() called


Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	       0x18fc680ac __psynch_cvwait + 8
1   libsystem_pthread.dylib       	       0x18fca55fc _pthread_cond_wait + 1228
2   libc++.1.dylib                	       0x18fbd04dc std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 28
3   libc++.1.dylib                	       0x18fbd0fec std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 56
4   libc++.1.dylib                	       0x18fbd1090 std::__1::__assoc_sub_state::wait() + 56
5   libmlx.dylib                  	       0x101d0e0e8 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 2756
...

Thread 2 Crashed:
0   libsystem_kernel.dylib        	       0x18fc6d11c __pthread_kill + 8
1   libsystem_pthread.dylib       	       0x18fca4cc0 pthread_kill + 288
2   libsystem_c.dylib             	       0x18fbb4a40 abort + 180
3   libc++abi.dylib               	       0x18fc5c6d8 abort_message + 132
4   libc++abi.dylib               	       0x18fc4c7ac demangling_terminate_handler() + 320
5   libobjc.A.dylib               	       0x18f8f78a4 _objc_terminate() + 160
6   libc++abi.dylib               	       0x18fc5ba9c std::__terminate(void (*)()) + 16
7   libc++abi.dylib               	       0x18fc5ea48 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 36
8   libc++abi.dylib               	       0x18fc5e9f4 __cxa_throw + 140
9   libmlx.dylib                  	       0x101cc0c30 mlx::core::allocator::malloc_or_wait(unsigned long) + 344
10  libmlx.dylib                  	       0x1023a9d8c mlx::core::(anonymous namespace)::binary_op(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 380
11  libmlx.dylib                  	       0x1023a9bc4 mlx::core::Add::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 56
12  libmlx.dylib                  	       0x1023a854c std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 148
13  libmlx.dylib                  	       0x101d0bf14 mlx::core::scheduler::StreamThread::thread_fn() + 500
14  libmlx.dylib                  	       0x101d0c0d0 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 72
15  libsystem_pthread.dylib       	       0x18fca5034 _pthread_start + 1
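
For what it's worth, here is my reading of the allocation path the trace suggests, as a rough C++ sketch rather than the actual MLX source. MetalAllocator, block_limit_ and malloc_or_wait come from the trace and from the allocator.cpp line quoted in Edit4; Buffer, active_memory_ and the constructor argument are stand-ins I made up for illustration:

// Rough sketch only -- not the actual MLX code.
#include <cstddef>
#include <new>
#include <stdexcept>
#include <string>

struct Buffer { void* ptr = nullptr; };  // stand-in for a real MTL::Buffer

class MetalAllocator {
 public:
  explicit MetalAllocator(size_t recommended_max_working_set)
      : block_limit_(1.5 * recommended_max_working_set) {}

  // Refuses the allocation (rather than letting the OS swap) once the
  // soft cap of 1.5 * recommendedMaxWorkingSetSize would be exceeded.
  Buffer malloc(size_t size) {
    if (active_memory_ + size > block_limit_) {
      return Buffer{};
    }
    active_memory_ += size;
    return Buffer{::operator new(size)};
  }

 private:
  size_t block_limit_;
  size_t active_memory_ = 0;
};

// Core-side wrapper seen in frame 9 of the crashed thread: if the allocator
// still returns nothing (the real code presumably waits for outstanding work
// to finish and retries first), it throws the exact error from the log.
Buffer malloc_or_wait(MetalAllocator& allocator, size_t size) {
  Buffer buf = allocator.malloc(size);
  if (buf.ptr == nullptr) {
    throw std::runtime_error("[malloc_or_wait] Unable to allocate " +
                             std::to_string(size) + " bytes.");
  }
  return buf;
}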

@davidjoffe
Author

davidjoffe commented Dec 7, 2023

New information, and good news: I found the cause. I forked mlx itself, changed one line of code, rebuilt from my source fork, and now it works 😊

I changed this line of code in MetalAllocator::MetalAllocator() in mlx/backend/metal/allocator.cpp to a much higher limit; the factor of 1.5 seems a bit conservative for low-RAM Macs:

block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}
https://github.com/davidjoffe/mlx/blob/main/mlx/backend/metal/allocator.cpp
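
Concretely, the change amounts to raising the factor on that one initializer, something along these lines (shown as a diff; the 3.0 below is only an illustrative value, not necessarily the exact factor in my fork, and the other member initializers are unchanged):

In MetalAllocator::MetalAllocator(), mlx/backend/metal/allocator.cpp:

-      block_limit_(1.5 * device_->recommendedMaxWorkingSetSize()) {}
+      block_limit_(3.0 * device_->recommendedMaxWorkingSetSize()) {}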

There is indeed a relatively big spike in memory usage as it finishes the steps, but it's not world-ending; I'd rather have something that works, even if it spikes my swap. I suppose it's debatable how best to handle this in the long run as a general solution for all users: raise the limit, just warn, or maybe give more options to control how much memory to use or how to handle low memory.

That means the issue, then, is not in mlx-examples but really in mlx itself.

Should I try submitting a PR for mlx? I'm not sure whether someone had a 'very good reason' for the particular choice of 1.5 as a hard factor here.

@angeloskath
Member

May I say, awesome work :-)

We need to start adding more 8 GB Macs to our testing pool. There really wasn't a particular reason for the 1.5; things just get significantly slower at that point and we wanted to avoid freezing the system.

I would encourage you to at least open an issue or maybe a PR at the main repo. I can't promise it will be the first one merged, but we'll certainly test the implications of increasing this limit and, if there are none for most use cases, merge it.

Thanks for investigating! Feel free to close the issue and link to it from the PR.

@davidjoffe
Author

Thank you!

I opened an issue for this on mlx (ml-explore/mlx#63) and created a pull request (ml-explore/mlx#64). I hope it's accepted, though of course that's up to the mlx maintainers.
