Update Inference Engine checkpoint loading + meta tensor assertions #2940
Conversation
@@ -151,6 +151,13 @@ def __init__(self, model, config):
            assert pkg_version.parse(torch.__version__) >= pkg_version.parse("1.10"), \
                "If you want to use cuda graph, please upgrade torch to at least v1.10"

        # Check if model passed to engine is loaded w/ meta tensors, in which case
        # kernel injection must be enabled.
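For context, the check described by the new comment can be illustrated roughly as follows. This is a minimal sketch, not the actual DeepSpeed code: the helper name and the standalone `replace_with_kernel_inject` flag are illustrative, and the device lookup assumes all of the model's parameters live on one device.

```python
import torch.nn as nn


def model_is_meta(model: nn.Module) -> bool:
    """Return True if the model's parameters were created on the 'meta' device."""
    first_param = next(model.parameters(), None)
    return first_param is not None and first_param.device.type == "meta"


# A layer created with meta tensors (no real storage is allocated).
meta_model = nn.Linear(16, 16, device="meta")

# Illustrative stand-in for the engine's kernel-injection config flag.
replace_with_kernel_inject = True

if model_is_meta(meta_model):
    assert replace_with_kernel_inject, (
        "Models loaded with meta tensors are only supported "
        "when kernel injection is enabled"
    )
```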
Please update the comment to add why this only works for HF models.
Added a note saying that the device type is sourced assuming a Hugging Face hierarchy.
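As a brief aside (not code from this PR) on why that caveat matters: Hugging Face's `PreTrainedModel` exposes a `device` property, while a plain `torch.nn.Module` does not, so reading the device type directly off the model object only works for HF-style models.

```python
import torch.nn as nn
from transformers import PreTrainedModel  # assumes transformers is installed

plain_module = nn.Linear(4, 4)
print(hasattr(plain_module, "device"))     # False: plain nn.Module has no .device
print(hasattr(PreTrainedModel, "device"))  # True: HF models expose a device property
```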
This PR adds a `model_device_meta` attribute to the `InferenceEngine` that's used to:

- Assert `replace_with_kernel_inject == True` if meta tensors are used, since we only support meta tensors when kernel injection is enabled.
- Allow the `InferenceEngine` to load checkpoints via `_load_checkpoint()` when a checkpoint is passed to the `init_inference` API, but only when meta tensors are not used.

This PR also adds an assertion in the `initialize_tensors` function of the base container to check that, if the model is using meta tensors, the corresponding model container uses the meta tensor feature.
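A rough sketch of what such an assertion could look like. The class and attribute names here (`BaseContainerSketch`, `is_meta_mode`, `client_module`) are illustrative stand-ins under the assumptions above, not the actual DeepSpeed container API.

```python
import torch.nn as nn


class BaseContainerSketch:
    # Containers that implement the meta tensor path would flip this to True.
    is_meta_mode = False

    def __init__(self, client_module: nn.Module):
        self.client_module = client_module

    def initialize_tensors(self):
        uses_meta = any(
            p.device.type == "meta" for p in self.client_module.parameters()
        )
        assert not uses_meta or self.is_meta_mode, (
            "Model was loaded with meta tensors, but this container does not "
            "support the meta tensor feature"
        )
        # ... actual tensor initialization would follow here ...


container = BaseContainerSketch(nn.Linear(8, 8, device="meta"))
# container.initialize_tensors()  # would raise: meta path not enabled for this container
```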