
Make _fast_init fast again (by surely skipping model weights init)! #26258

Closed · poedator opened this issue Sep 19, 2023 · 12 comments · Fixed by #27709

@poedator (Contributor) commented Sep 19, 2023

I observed that loading a pre-trained model takes rather long, even when loading cached models from a fast SSD. It is especially noticeable when dealing with LLMs with billions of weights.
Apparently, the majority of the time is lost in this section of the code:

# Instantiate model.
init_contexts = [no_init_weights(_enable=_fast_init)]
# (...)
with ContextManagers(init_contexts):
    model = cls(config, *model_args, **model_kwargs)

The time spent on weight initialization (by torch.nn.init.kaiming_uniform_() and similar) is wasted, because the newly initialized weights are then replaced by the loaded ones. The no_init_weights context manager sets the _init_weights global variable, but it gets ignored by the model's code (tested on Llama_2_7B).
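As far as I understand (a minimal sketch of my own, not transformers code), no_init_weights only affects the model-level _init_weights pass, while the torch.nn module constructors still call reset_parameters(), so the init kernels run anyway when each layer is built:

import torch

# Constructing the layer alone already pays the init cost:
# Linear.__init__ -> reset_parameters() -> torch.nn.init.kaiming_uniform_
layer = torch.nn.Linear(4096, 4096)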

I recently discussed a similar issue with the PEFT team, but there it was easier to solve because the PEFT init code was dealing with a specific torch.nn layer; see huggingface/peft#871 and the linked PRs by @BenjaminBossan. Here we need a model-scale solution.

One option (not a perfectly elegant one) is to temporarily override functions like torch.nn.init.kaiming_uniform_(). It is used in our SpQR repo:

from contextlib import contextmanager

import torch


@contextmanager
def suspend_nn_inits():
    skip = lambda *args, **kwargs: None
    saved_inits = torch.nn.init.kaiming_uniform_, torch.nn.init.uniform_, torch.nn.init.normal_  # saving
    torch.nn.init.kaiming_uniform_ = torch.nn.init.uniform_ = torch.nn.init.normal_ = skip  # replacing
    try:
        yield
    finally:
        torch.nn.init.kaiming_uniform_, torch.nn.init.uniform_, torch.nn.init.normal_ = saved_inits  # restoring
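For illustration, a minimal usage sketch (the model name is just an example, and it assumes every weight is present in the checkpoint so nothing is left uninitialized):

from transformers import AutoModel

# The default init kernels are skipped; the checkpoint weights are loaded on top.
with suspend_nn_inits():
    model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")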

but there may be better approaches using some native torch tools.

I'd be glad to contribute a PR with the maintainers' blessing. Summoning @younesbelkada.

System Info

A100-80G + SSD + mucho RAM and kernels.

Who can help?

@younesbelkada

Reproduction

Load a model and measure the timing for this line.

Expected behavior

faster loading

@BenjaminBossan (Member)

The no_init_weights context manager sets _init_weights global variable, but it gets ignored by model's code (tested on Llama_2_7B).

Interesting, could you please describe how you tested this? This sounds like a bug.

@poedator (Contributor, Author)

Interesting, could you please describe how you tested this? This sounds like a bug.

Hi @BenjaminBossan,

This is how to test this slow-loading issue:
Select a model large enough for the effect to be noticeable. I tested with meta-llama/Llama-2-7b-hf: load it with AutoModel.from_pretrained(), then delete it; this fills the model cache.

Then try some or all of these:

  • Load it again and notice the time that passes before the Loading checkpoint shards: progress bar appears. Normally it should be a few seconds or less.
  • Compare the overall command run time with the Loading checkpoint shards: time. In my case it is 41s vs 2s. What takes the other 39s, if the model is already cached on the SSD?
  • Run AutoModel.from_pretrained() with a profiler and see that the uniform_ (i.e. weight-init) calls take most of the time, even though they are not needed for from_pretrained().
  • Try loading the model with weight inits disabled (using the context manager; see the notebook). In my case it reduced Llama2-7B loading time 10x (from 41s to 4s).

See my testing notebook as a gist here: https://gist.github.com/poedator/792d6c7528a1bc5a84acb550268777ed
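For the profiler step, a rough sketch of what I mean (the profiler setup here is my own illustration, not taken from the gist):

from torch.profiler import profile, ProfilerActivity
from transformers import AutoModel

with profile(activities=[ProfilerActivity.CPU]) as prof:
    model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")

# Init ops such as aten::uniform_ and aten::normal_ should dominate this table.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))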

@BenjaminBossan (Member)

Thanks for providing the context and notebook. I could replicate your results and also confirmed that the model produces the same output in both cases. This smells like a bug to me, maybe @ArthurZucker can take a look.

@ArthurZucker (Collaborator)

definitely interesting, I'll have a look!

@younesbelkada (Contributor)

@poedator thanks a lot for the deep investigation. Do you observe the same behaviour with low_cpu_mem_usage=True? Looking at the gist, it seems you are calling from_pretrained without any additional arguments; we should maybe start thinking of using that argument as the default.
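For reference, the variant I mean is simply the following (a sketch; it requires accelerate to be installed):

from transformers import AutoModel

# Parameters are created on the meta device and the checkpoint weights are loaded
# directly, so no throwaway init kernels run.
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    low_cpu_mem_usage=True,
)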
I also went through the SpQR repository you shared. I have seen some community interest in supporting it natively in the HF ecosystem. I have not had a deep look into the repository yet, but I wanted to ask whether you think it is possible, design-wise, to integrate it into transformers? cc @SunMarc FYI

@poedator (Contributor, Author) commented Sep 25, 2023

@younesbelkada,
Whatever is behind low_cpu_mem_usage=True may be a good basis for the solution. I knew about it but hesitated to use it, because it does more magic than that (at least that was my impression from reading the docs). Please see how much of the low_cpu_mem_usage=True functionality can be included in the default options. Hopefully it is a small fix.

Thank you for your interest in supporting SpQR in the HF ecosystem. Let me discuss with my teammates the best way to do this, and then I will get back to you.

@jph00 commented Oct 10, 2023

One possible solution is mentioned here: #18505

@yuanenming

I ran into the same issue. I also have another specific scenario, where I want to randomly initialize a large model for debugging, so I just want a very fast initialization.

I tried:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_config(config)

I found that it is even slower than just loading the weights:

model = AutoModelForCausalLM.from_pretrained(model_name, _fast_init=True, low_cpu_mem_usage=True)

So I wonder whether there is a way to quickly initialize a very large model (without running any initialization algorithm) using from_config?

Thank you very much!
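For pure shape or throughput debugging, I suspect something like this meta-device sketch might be enough (an assumption on my side, requiring PyTorch >= 2.0; note that the materialized weights are uninitialized memory rather than properly random):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_name)

# No allocation and no init kernels: parameters are created on the meta device.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

# Materialize real (but uninitialized) storage on the target device.
model = model.to_empty(device="cuda")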

@ArthurZucker (Collaborator)

Ouch sorry about that! Was off for a bit, and it's planned! Will try to open a draft PR asap

@ArthurZucker (Collaborator)

Update 🤗
I'll tackle this, as I can indeed reproduce it, and although we have the low_cpu_mem_usage flag that requires accelerate, this seems like somewhat low-hanging fruit. We have to make sure the weights that are missing from the state dict (non-persistent buffers, etc.) still get initialized.

@pacman100 (Contributor)

On the main branch of Transformers, I observe the following:

  1. low_cpu_mem_usage should resolve the issue, coupled with _fast_init, which is True by default.
  2. low_cpu_mem_usage internally calls accelerate's init_empty_weights, which puts the weights on the meta device, so reset_parameters() becomes a no-op. If include_buffers=True, it directly uses the with torch.device("meta") context manager, as suggested by Horace in the other linked issue.
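A minimal sketch of point 2 (assuming accelerate is installed; the model name is just an example):

from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")

# Parameters are registered on the meta device, so reset_parameters() has nothing
# real to initialize.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta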


@ArthurZucker (Collaborator)

The goal is to still have fast init without accelerate.
