
Prevent malloc failure on 8GB Macs #64

Closed
davidjoffe wants to merge 5 commits

Conversation

davidjoffe

Hi,

This pull request addresses the issue below (the error "std::runtime_error: [malloc_or_wait] Unable to allocate", which I had been encountering while running the official Stable Diffusion example on an 8GB M2 Mac Mini). Please see the issue for a more detailed explanation:

#63

Effectively, this should help stop the metal::allocator malloc from failing on low-RAM Macs (e.g. 8GB Macs), though this behavior probably needs further testing and refinement.
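For context, a minimal sketch of the kind of change involved (illustrative only, not the actual mlx allocator source; the class and member names here are hypothetical):

```cpp
// Illustrative only: shows the kind of change this PR makes, i.e. raising the
// multiplier applied to the Metal device's recommendedMaxWorkingSetSize before
// malloc_or_wait gives up and throws. Names below are hypothetical.
#include <cstddef>

class MetalAllocatorSketch {
 public:
  explicit MetalAllocatorSketch(size_t recommended_working_set) {
    // Before: fail once outstanding allocations exceed 1.5x the recommended
    // working set.
    //   block_limit_ = static_cast<size_t>(1.5 * recommended_working_set);
    // This PR (experimentally): allow up to 4x, so low-RAM (e.g. 8GB) Macs
    // can swap instead of failing outright.
    block_limit_ = static_cast<size_t>(4.0 * recommended_working_set);
  }

  size_t block_limit() const { return block_limit_; }

 private:
  size_t block_limit_ = 0;
};
```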

Quick experiment to see if this helps fix malloc error
Experimenting with values here to work on 8GB RAM Mac
@awni
Member

awni commented Dec 8, 2023

@jagrit06 what was the rationale for setting this to 1.5x in the first place? I have no idea what the performance implications of changing it to 4.0 are, but we should test it and understand that better. Then we can either merge this or come up with a more sophisticated method for setting it (or maybe make it a configurable environment variable).


@jagrit06
Member

jagrit06 commented Dec 8, 2023

So, the rationale for setting it to 1.5 was that the recommended working set is usually 0.75 * the total RAM available on the system. We wanted that to be the point where we basically block until one operation is done, so its memory can be reused, before scheduling a new one.

We will probably need something a little more sophisticated than just upping the memory pressure limit, though - probably changing how and when we throw that error.
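For a rough sense of the numbers (assuming recommendedMaxWorkingSetSize really is about 0.75 * total RAM, as described above): on an 8GB Mac that is roughly 6GB, so the current 1.5x limit corresponds to roughly 9GB of outstanding allocations before the allocator blocks or errors, while the 4.0x value in this PR would push that to roughly 24GB and lean on swap instead.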

@awni
Member

awni commented Dec 8, 2023

I agree that a hard no-swap policy seems a little draconian, though maybe 4 * recommendedMaxWorkingSetSize is too high.

Part of the problem is that we are using this limit as a way to throttle the GPU from running too fast and gobbling memory that it won't need for a while (hence paying a swap penalty). This is a good thing for models where the working set is not too big at any point, but the overall amount of memory allocated in a single eval is.

But we should also allow allocated memory that isn't being used in the current eval to swap, i.e. if I'm just holding a bunch of arrays that aren't relevant to the computation and won't use them for a while.

So maybe a better design would be something like keeping track of the amount of memory used during the specific eval and limiting that to 1.5 * recommendedMaxWorkingSetSize?

In that case the API also has the flexibility to allow swapping by breaking a graph into multiple evals.
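To make the per-eval idea concrete, here is a minimal sketch (not mlx code; all names are hypothetical) of an allocator-side counter that only charges memory allocated during the current eval against a 1.5 * recommendedMaxWorkingSetSize cap:

```cpp
// Hypothetical sketch of per-eval memory accounting: memory held by arrays
// that are not part of the current eval is not counted, so it remains free to
// be swapped out by the OS.
#include <algorithm>
#include <cstddef>
#include <stdexcept>

class EvalMemoryTracker {
 public:
  explicit EvalMemoryTracker(size_t recommended_working_set)
      : limit_(static_cast<size_t>(1.5 * recommended_working_set)) {}

  void begin_eval() { eval_bytes_ = 0; }

  // Called for buffers allocated while an eval is running.
  void on_alloc(size_t bytes) {
    eval_bytes_ += bytes;
    if (eval_bytes_ > limit_) {
      // Real code would wait for in-flight ops to finish and release memory
      // (or encourage splitting the graph into smaller evals) before throwing.
      throw std::runtime_error("[eval] working set exceeded limit");
    }
  }

  // Called when a buffer allocated during this eval is released.
  void on_free(size_t bytes) { eval_bytes_ -= std::min(eval_bytes_, bytes); }

 private:
  size_t limit_;
  size_t eval_bytes_ = 0;
};
```

Under a scheme like this, splitting a big graph into multiple evals (as mentioned above) naturally keeps each eval's working set under the cap while still letting idle arrays swap.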

@davidjoffe
Author

davidjoffe commented Dec 8, 2023

Hi, yes, to be clear, the idea of this PR was more of a short-term fix to get it to at least work on 8GB Macs (which I think is preferable for users to failing with an error), but I agree a better solution is desirable in the longer term.

Some thoughts: I like the idea of having ways to configure the behavior (e.g. via one or more environment variables, as suggested, or perhaps other ways) and having ways to (ideally fully) control it, so users can decide what suits their situation.

I guess ideally the defaults should be something that at least works on a broad range of platforms and use cases out of the box, though that can be tricky. It could maybe warn on high memory usage (with perhaps a setting to disable warnings), and if it fails it could give a message with some guidance on how to try to solve it (e.g. what settings to try changing, unless of course they're trying to run something really huge).

Unfortunately all my Macs are 8GB and I'd love to be able to run MLX stuff even if it's swappy, but as a programmer I know well enough what I can reasonably get away with.
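As a rough sketch of the kind of actionable failure message suggested above (purely illustrative; the advice text is hypothetical and not part of mlx):

```cpp
// Hypothetical example of a friendlier allocation-failure message that points
// the user at possible workarounds instead of only reporting the failure.
#include <sstream>
#include <stdexcept>

[[noreturn]] inline void throw_allocation_error(size_t requested, size_t limit) {
  std::ostringstream msg;
  msg << "[malloc_or_wait] Unable to allocate " << requested << " bytes "
      << "(current limit: " << limit << " bytes). If this machine has limited "
      << "RAM, consider raising the memory limit (e.g. via a configurable "
      << "setting) or splitting the computation into smaller evals.";
  throw std::runtime_error(msg.str());
}
```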

@awni
Member

awni commented Dec 9, 2023

Hi @davidjoffe, I hear you; I'm just worried this change is going to slow down other use cases, which would be bad. I think if you made it an environment variable that can be set, we could merge that for now with the intention of fixing the overall strategy in the future.
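A minimal sketch of that environment-variable approach (not the actual implementation; the variable name MLX_MEM_LIMIT_FACTOR is made up for illustration):

```cpp
// Hypothetical: read an override factor from the environment, defaulting to
// the current 1.5, so users on low-RAM machines can opt in to a higher value
// (e.g. export MLX_MEM_LIMIT_FACTOR=4.0) without changing the default.
#include <cstdlib>
#include <string>

inline double mem_limit_factor() {
  if (const char* env = std::getenv("MLX_MEM_LIMIT_FACTOR")) {
    try {
      return std::stod(env);
    } catch (...) {
      // Ignore malformed values and keep the default.
    }
  }
  return 1.5;
}
```

The allocator would then compute its limit as mem_limit_factor() * recommendedMaxWorkingSetSize instead of hard-coding the multiplier.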

Merging getting sync'd with latest mlx
@awni
Member

awni commented Dec 27, 2023

I think this is going to be fixed in #292 so I am closing it.

@awni awni closed this Dec 27, 2023
@davidjoffe
Author

Hi, ok, thank you! Unfortunately I haven't had time to work on it so far, but I will definitely give it a try again once PR #292 is in.

@beverm2391

Still getting the exact same error on an M3 Pro with 18GB, except it's being thrown during the quantization in convert.py. I'm using mlx==0.0.7 via PyPI, which I think should include the changes merged in #292.

@awni
Member

awni commented Jan 4, 2024

What are you converting? I'll try to repro.

@beverm2391

beverm2391 commented Jan 4, 2024

Sorry for not including that - I'm converting Mixtral-8x7B-Instruct-v0.1 following this readme. Weirdly, every time I run it I get the same behavior: the first two tensors are converted and quantized as expected, but the third throws the same RuntimeError: [malloc]... as outlined so far.
