
why cpu_checkpointing can't work? #522

Closed
ghosthamlet opened this issue Nov 11, 2020 · 18 comments

@ghosthamlet
Contributor

ghosthamlet commented Nov 11, 2020

I have partition_activations and cpu_checkpointing enabled, but the activations still seem to stay on the GPU. I have only one GPU, so I can't do model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (which should be the same as 1-GPU model parallelism) offload all of its checkpoints to the CPU?
My CPU memory is sufficient. Config:

{
       'zero_optimization': {
          'stage': 2,
          'cpu_offload': True,
          'contiguous_gradients': True,
          },
       'train_batch_size': 2,
       'fp16': {
          'enabled': True,
          "loss_scale": 0,
          "loss_scale_window": 1000,
          "hysteresis": 2,
          "min_loss_scale": 1,
          },
        "activation_checkpointing": {
          "partition_activations": True,
          "contiguous_memory_optimization": True,
          "cpu_checkpointing": True
        },
       "wall_clock_breakdown": False,
}

Environment:
python 3.6
torch 1.6.0
deepspeed 0.3.7
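
For completeness: the `activation_checkpointing` section of the config only takes effect when the model's forward pass actually routes layers through DeepSpeed's checkpointing API, which may be why the setting appears to do nothing. A minimal sketch, assuming the `deepspeed.checkpointing.configure`/`checkpoint` API; the `make_ds_config` helper is hypothetical, and only the plain-Python part runs here:

```python
def make_ds_config(cpu_checkpointing: bool) -> dict:
    """Hypothetical helper: build a minimal DeepSpeed config dict
    mirroring the config posted above."""
    return {
        "train_batch_size": 2,
        "zero_optimization": {"stage": 2, "cpu_offload": True},
        "activation_checkpointing": {
            "partition_activations": True,
            "contiguous_memory_optimization": True,
            "cpu_checkpointing": cpu_checkpointing,
        },
    }

# In the model code (requires deepspeed; shown as an assumption, not run here):
#   import deepspeed
#   deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")
#   # replace torch.utils.checkpoint.checkpoint with:
#   hidden = deepspeed.checkpointing.checkpoint(layer, hidden)

config = make_ds_config(cpu_checkpointing=True)
print(config["activation_checkpointing"]["cpu_checkpointing"])  # True
```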

@ghosthamlet
Contributor Author

I think this is related to #541

@hpourmodheji

@ghosthamlet
I have the same issue. Did you kindly get an answer to this issue?

@tjruwase
Contributor

tjruwase commented Feb 4, 2022

@hpourmodheji, thanks for the question. This should have been fixed by #1254. Are you still having this problem?

@hpourmodheji

@tjruwase, thanks for your comment. I tried ZeRO-Infinity with the DS version 0.5.7+, but it seems that activations are not offloaded to the CPU.

Here is the config part:


"zero_optimization": {
"stage": 3,
"contiguous_gradients": true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 1e7,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "cpu",
"_device": "nvme",
},
"offload_param": {
"device": "cpu",
"_device": "nvme",
}
},
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"_number_checkpoints": null,
"_synchronize_checkpoint_boundary": false,
"_profile": false,
"overlap_comm": true,
"reduce_bucket_size": 500000000
},

Here are the results. There is no memory reduction during the forward pass.

| Training stage | Baseline: GPU memory (single GPU) | ZeRO-Infinity: GPU memory (single GPU) | Memory reduction |
| --- | --- | --- | --- |
| before forward | MA 4.38 GB | MA 0.12 GB | ~37x |
| before backward | MA 5.74 GB | MA 1.55 GB | ~4x |
| before optimizer | MA 5.01 GB | MA 0.12 GB | ~42x |

@hpourmodheji

@tjruwase, do you have any comment on this and on offloading activations?

@ghosthamlet
Contributor Author

> @ghosthamlet I have the same issue. Did you kindly get an answer to this issue?

@hpourmodheji Sorry, I did not try activation_checkpointing on a single GPU again; I just added another GPU, and then it worked.

@hpourmodheji

@ghosthamlet, Thanks for your comment. I tried with 2 GPUs, and it still does not work. I hope to hear from the DS team.

@tjruwase
Contributor

tjruwase commented Feb 16, 2022

@ghosthamlet and @hpourmodheji, it seems there are a few issues here. I will take a closer look.

@tjruwase
Contributor

> @ghosthamlet I have the same issue. Did you kindly get an answer to this issue?
>
> @hpourmodheji Sorry, i did not try single GPU for activation_checkpointing again, only added another GPU, then it works.

@ghosthamlet, can you please confirm that activation_checkpointing and cpu_offloading work as expected with 2 GPUs, but not with 1 GPU?

@tjruwase
Contributor

> @ghosthamlet, Thanks for your comment. I tried with 2 GPUs, and it still does not work. I hope to hear from the DS team.

@hpourmodheji, thanks for the table you shared earlier, but I think we need something different for this investigation. Activation checkpointing (including cpu offload) can be enabled without zero stage 3. And so, it would be good to disable zero by removing the zero key in the json config. Can you please share similar information for single GPU for these 3 scenarios?

  1. No activation checkpointing - by removing the activation_checkpointing key in the json config.
  2. Activation checkpointing without cpu offloading - by setting cpu_checkpointing to false
  3. Activation checkpointing with cpu offloading - by setting cpu_checkpointing to true

The collected information will help with the next steps of investigation. Thanks!
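
The three scenarios can be derived mechanically from one base config; a minimal pure-Python sketch (the `scenario_config` helper and the `base` dict are hypothetical, only the key names come from this thread):

```python
import copy
import json

def scenario_config(base: dict, scenario: int) -> dict:
    """Derive one of the three requested test configs.
    1: no activation checkpointing,
    2: activation checkpointing without cpu offloading,
    3: activation checkpointing with cpu offloading."""
    cfg = copy.deepcopy(base)
    cfg.pop("zero_optimization", None)  # ZeRO disabled for the experiment
    if scenario == 1:
        cfg.pop("activation_checkpointing", None)
    else:
        cfg["activation_checkpointing"]["cpu_checkpointing"] = (scenario == 3)
    return cfg

base = {
    "train_batch_size": 2,
    "zero_optimization": {"stage": 3},
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}

for n in (1, 2, 3):
    print(n, json.dumps(scenario_config(base, n)))
```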

@ghosthamlet
Contributor Author

ghosthamlet commented Feb 21, 2022

@tjruwase Sorry for the late reply.
When I was using an old version of DeepSpeed months ago, activation_checkpointing and cpu_offloading did not work with 1 GPU, so I added more GPUs and used stage 3 with offload_param (no model parallelism). The activation_checkpointing and cpu_offloading of the old DeepSpeed version probably still did not work at that time, but GPU memory was sufficient.
With the current version of DeepSpeed, it works on both 1 GPU and 2 GPUs, as you can see in the logs:

1 GPU:

activation_checkpointing cpu_offload

[2022-02-21 22:18:11,198] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:18:11,199] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:11,199] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%
[2022-02-21 22:18:12,359] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:18:12,360] [INFO] [utils.py:827:see_memory_usage] MA 0.27 GB         Max_MA 2.02 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:12,360] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%
[2022-02-21 22:18:16,976] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:18:16,977] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 2.41 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:16,977] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%


activation_checkpointing but no cpu_offload

[2022-02-21 22:21:26,836] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:21:26,837] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:26,837] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%
[2022-02-21 22:21:27,809] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:21:27,810] [INFO] [utils.py:827:see_memory_usage] MA 0.5 GB         Max_MA 2.02 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:27,810] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%
[2022-02-21 22:21:32,630] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:21:32,631] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 2.64 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:32,631] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%


2 GPUs:

activation_checkpointing cpu_offload

[2022-02-21 22:05:01,225] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:05:01,226] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:05:01,226] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:02,447] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:05:02,448] [INFO] [utils.py:827:see_memory_usage] MA 0.22 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:05:02,448] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:06,865] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:05:06,865] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.36 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:05:06,866] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%


activation_checkpointing but no cpu_offload

[2022-02-21 22:08:47,487] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:08:47,487] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:08:47,488] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:48,631] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:08:48,631] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:08:48,632] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:53,495] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:08:53,495] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.6 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:08:53,496] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%

Configs:

 {
        "zero_optimization": {
            "stage": 3,
            "contiguous_gradients": True,
            "overlap_comm": False,
            "reduce_bucket_size": 50000000,
            "stage3_max_live_parameters": 1e9, 
            "stage3_max_reuse_distance": 1e9, 
            "stage3_prefetch_bucket_size": 5e8, 
            "stage3_param_persistence_threshold": 1e6, 
            "sub_group_size": 1e12, 
            "offload_optimizer": {
                "device": "cpu",
                },
            "offload_param": {
                "device": "cpu",
                },
            },

        "activation_checkpointing": {
            "partition_activations": True,
            "num_checkpoints": 1,
            "cpu_checkpointing": True,
            "contiguous_memory_optimization": False
        },

        "train_batch_size": 2 * args.world_size,
        "gradient_accumulation_steps": 1,
        "fp16": {
            "enabled": True,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
            "initial_scale_power": 16,
        },
    } 

ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 1.7.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.2
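
Logs like these can be condensed for side-by-side comparison with a small parser; a sketch (the `parse_memory_line` helper is hypothetical, the line format is taken from the see_memory_usage output above):

```python
import re

# Match the memory fields of a DeepSpeed see_memory_usage line, e.g.
# "MA 0.14 GB   Max_MA 0.14 GB   CA 6.11 GB   Max_CA 6 GB"
LINE_RE = re.compile(
    r"MA (?P<ma>[\d.]+) GB\s+Max_MA (?P<max_ma>[\d.]+) GB\s+"
    r"CA (?P<ca>[\d.]+) GB\s+Max_CA (?P<max_ca>[\d.]+) GB")

def parse_memory_line(line: str) -> dict:
    """Extract MA/Max_MA/CA/Max_CA (in GB) from one log line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a see_memory_usage line")
    return {k: float(v) for k, v in m.groupdict().items()}

sample = ("[2022-02-21 22:18:11,199] [INFO] [utils.py:827:see_memory_usage] "
          "MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB")
print(parse_memory_line(sample))
# {'ma': 0.14, 'max_ma': 0.14, 'ca': 6.11, 'max_ca': 6.0}
```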

@tjruwase
Contributor

@ghosthamlet, awesome! Thanks so much for sharing this detailed feedback. If possible, can you please check whether these activation checkpointing features also work with zero stages < 3? Thanks!

@ghosthamlet
Contributor Author

ghosthamlet commented Feb 22, 2022

@tjruwase For stage0/stage1 the 2080 Ti GPU ran out of memory, so I had to test a smaller model.
stage1 cpu_offload has a problem, so I disabled it.

I think stage0/stage1/stage2 are working too:

1 GPU:

stage0, gradient_checkpointing cpu_offload

[2022-02-22 12:31:29,822] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:31:29,823] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:29,823] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:31:29,950] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:31:29,951] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.0 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:29,951] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:31:30,128] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:31:30,129] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:30,129] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%


stage0, gradient_checkpointing no cpu_offload

[2022-02-22 12:30:11,060] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:30:11,060] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,061] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:30:11,186] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:30:11,186] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.01 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,187] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:30:11,376] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:30:11,377] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,377] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%

------------------------------

stage1 no cpu_offload, gradient_checkpointing cpu_offload

[2022-02-22 12:22:28,406] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:22:28,407] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.5 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,407] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:22:28,535] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:22:28,536] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.0 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,536] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:22:28,707] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:22:28,707] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,708] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%


stage1 no cpu_offload, gradient_checkpointing but no cpu_offload 

[2022-02-22 12:23:20,951] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:23:20,951] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.5 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:20,952] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:23:21,077] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:23:21,077] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.01 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:21,078] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:23:21,245] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:23:21,246] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:21,246] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%

------------------------------

stage2 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 23:44:59,393] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:44:59,394] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:44:59,394] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%
[2022-02-21 23:44:59,835] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:44:59,836] [INFO] [utils.py:827:see_memory_usage] MA 4.96 GB         Max_MA 5.1 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:44:59,836] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%
[2022-02-21 23:45:02,321] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:45:02,322] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.72 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:45:02,322] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%


stage2 cpu_offload, gradient_checkpointing but no cpu_offload 

[2022-02-21 23:40:55,849] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:40:55,850] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:55,850] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%
[2022-02-21 23:40:56,250] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:40:56,250] [INFO] [utils.py:827:see_memory_usage] MA 5.2 GB         Max_MA 5.33 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:56,251] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%
[2022-02-21 23:40:58,562] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:40:58,563] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.94 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:58,563] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%



2 GPUs:


stage0, gradient_checkpointing cpu_offload

[2022-02-22 12:27:51,083] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:27:51,084] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 1.75 GB         Max_CA 2 GB
[2022-02-22 12:27:51,084] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%
[2022-02-22 12:27:51,212] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:27:51,212] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.02 GB         CA 1.83 GB         Max_CA 2 GB
[2022-02-22 12:27:51,212] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%
[2022-02-22 12:27:51,377] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:27:51,378] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.3 GB         CA 2.05 GB         Max_CA 2 GB
[2022-02-22 12:27:51,378] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%


stage0, gradient_checkpointing no cpu_offload

[2022-02-22 12:29:05,231] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:29:05,231] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 1.75 GB         Max_CA 2 GB
[2022-02-22 12:29:05,232] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%
[2022-02-22 12:29:05,353] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:29:05,353] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.03 GB         CA 1.83 GB         Max_CA 2 GB
[2022-02-22 12:29:05,354] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%
[2022-02-22 12:29:05,505] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:29:05,506] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.31 GB         CA 2.05 GB         Max_CA 2 GB
[2022-02-22 12:29:05,506] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%

------------------------------

stage1 no cpu_offload, gradient_checkpointing cpu_offload

[2022-02-22 12:25:50,491] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:25:50,492] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 0.81 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:25:50,492] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%
[2022-02-22 12:25:50,617] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:25:50,618] [INFO] [utils.py:827:see_memory_usage] MA 0.54 GB         Max_MA 0.67 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:25:50,618] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%
[2022-02-22 12:25:50,779] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:25:50,780] [INFO] [utils.py:827:see_memory_usage] MA 0.53 GB         Max_MA 0.96 GB         CA 1.28 GB         Max_CA 1 GB
[2022-02-22 12:25:50,780] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%


stage1 no cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-22 12:24:42,382] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:24:42,383] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 0.81 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:24:42,383] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%
[2022-02-22 12:24:42,504] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:24:42,504] [INFO] [utils.py:827:see_memory_usage] MA 0.55 GB         Max_MA 0.68 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:24:42,505] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%
[2022-02-22 12:24:42,660] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:24:42,661] [INFO] [utils.py:827:see_memory_usage] MA 0.53 GB         Max_MA 0.96 GB         CA 1.28 GB         Max_CA 1 GB
[2022-02-22 12:24:42,661] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%

------------------------------

stage2 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 23:35:26,890] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:35:26,891] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:26,891] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%
[2022-02-21 23:35:27,337] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:35:27,338] [INFO] [utils.py:827:see_memory_usage] MA 4.96 GB         Max_MA 5.09 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:27,338] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%
[2022-02-21 23:35:29,950] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:35:29,951] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.55 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:29,951] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%


stage2 cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-21 23:37:48,850] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:37:48,850] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:48,851] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.45 GB, percent = 40.2%
[2022-02-21 23:37:49,246] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:37:49,246] [INFO] [utils.py:827:see_memory_usage] MA 5.19 GB         Max_MA 5.31 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:49,247] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.45 GB, percent = 40.2%
[2022-02-21 23:37:51,801] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:37:51,802] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.76 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:51,802] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.46 GB, percent = 40.2%

I want to ask another question:
There is one thing I am very confused about. In training with four 2080 Ti GPUs, when GPU 0 memory is:
MA 0.79 GB Max_MA 2.77 GB CA 6.55 GB Max_CA 7 GB
the GPU runs out of memory. But the max allocated (Max_MA) is just 2.77 GB; the out-of-memory is caused by Max_CA 7 GB, which is the number nvidia-smi shows.
If Max_CA is the number that matters most, then the MA and Max_MA reductions from DeepSpeed are not enough. Are there any methods to reduce CA and Max_CA too? CA and Max_CA are cached memory; why can't it be released?
As you can see in the 2-GPU logs, Max_CA is always very large:

2 GPUs:

stage3 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 22:05:01,225] [INFO] [utils.py:822:see_memory_usage] before forward 6   
[2022-02-21 22:05:01,226] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:05:01,226] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:02,447] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:05:02,448] [INFO] [utils.py:827:see_memory_usage] MA 0.22 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:05:02,448] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:06,865] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:05:06,865] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.36 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:05:06,866] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%


stage3 cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-21 22:08:47,487] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:08:47,487] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:08:47,488] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:48,631] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:08:48,631] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:08:48,632] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:53,495] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:08:53,495] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.6 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:08:53,496] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%


@hpourmodheji

> @hpourmodheji, thanks for the table you shared earlier, but I think we need something different for this investigation. Activation checkpointing (including cpu offload) can be enabled without zero stage 3. And so, it would be good to disable zero by removing the zero key in the json config. Can you please share similar information for single GPU for these 3 scenarios?
>
> 1. No activation checkpointing - by removing the activation_checkpointing key in the json config.
> 2. Activation checkpointing without cpu offloading - by setting cpu_checkpointing to false
> 3. Activation checkpointing with cpu offloading - by setting cpu_checkpointing to true
>
> The collected information will help with the next steps of investigation. Thanks!

@tjruwase, sorry for my late reply. Please see the table below. Unfortunately, I see no offloading. Can you please advise?

| Training stage | No activation checkpointing (activation_checkpointing removed) | Activation checkpointing without cpu offloading (cpu_checkpointing = false) | Activation checkpointing with cpu offloading (cpu_checkpointing = true) |
| --- | --- | --- | --- |
| before forward | MA 4.37 GB | MA 4.38 GB | MA 4.38 GB |
| before backward | MA 5.73 GB | MA 5.73 GB | MA 5.73 GB |
| before optimizer | MA 5.01 GB | MA 5.01 GB | MA 5.01 GB |

No activation checkpointing (activation_checkpointing removed):

{
    "_train_batch_size": 30,
    "train_micro_batch_size_per_gpu": 10,
    "steps_per_print": 10,
    "gradient_accumulation_steps": 1,
    "_prescale_gradients": false,
    "flops_profiler": {
        "enabled": true,
        "profile_step": 5,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    },
    "_zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu",
            "_device": "nvme"
        },
        "offload_param": {
            "device": "cpu",
            "_device": "nvme"
        }
    },
    "zero_allow_untested_optimizer": true,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "weight_decay": 0.01,
            "bias_correction": false
        }
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "_activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "_number_checkpoints": null,
        "_synchronize_checkpoint_boundary": false,
        "_profile": false,
        "overlap_comm": true,
        "reduce_bucket_size": 500000000
    },
    "sparse_attention": {
        "mode": "fixed",
        "block": 16,
        "different_layout_per_head": true,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "bidirectional",
        "horizontal_global_attention": false,
        "num_different_global_patterns": 4
    }
}


Activation checkpointing without cpu offloading (cpu_checkpointing = false):

{
    "_train_batch_size": 30,
    "train_micro_batch_size_per_gpu": 10,
    "steps_per_print": 10,
    "gradient_accumulation_steps": 1,
    "_prescale_gradients": false,
    "flops_profiler": {
        "enabled": true,
        "profile_step": 5,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    },
    "_zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu",
            "_device": "nvme"
        },
        "offload_param": {
            "device": "cpu",
            "_device": "nvme"
        }
    },
    "zero_allow_untested_optimizer": true,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "weight_decay": 0.01,
            "bias_correction": false
        }
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": true,
        "_number_checkpoints": null,
        "_synchronize_checkpoint_boundary": false,
        "_profile": false,
        "overlap_comm": true,
        "reduce_bucket_size": 500000000
    },
    "sparse_attention": {
        "mode": "fixed",
        "block": 16,
        "different_layout_per_head": true,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "bidirectional",
        "horizontal_global_attention": false,
        "num_different_global_patterns": 4
    }
}


Activation checkpointing with CPU offloading (partition_activations = true, cpu_checkpointing = true):

{
  "_train_batch_size": 30,
  "train_micro_batch_size_per_gpu": 10,
  "steps_per_print": 10,
  "gradient_accumulation_steps": 1,
  "_prescale_gradients": false,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 5,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  },
  "_zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "_device": "nvme"
    },
    "offload_param": {
      "device": "cpu",
      "_device": "nvme"
    }
  },
  "zero_allow_untested_optimizer": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 2e-4,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "_number_checkpoints": null,
    "_synchronize_checkpoint_boundary": false,
    "_profile": false,
    "overlap_comm": true,
    "reduce_bucket_size": 500000000
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}


$ ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meets the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['anaconda3/envs/deepspeed-env/lib/python3.7/site-packages/torch']
torch version .................... 1.10.0
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['anaconda3/envs/deepspeed-env/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.7+fa9d3e8, fa9d3e8, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1

@tjruwase
Contributor

tjruwase commented Mar 2, 2022

@hpourmodheji, thanks for your response. By the way, are you testing with existing DeepSpeed examples code, or your own model? If you are using your own, did you properly wrap your forward pass as done below:

  1. https://github.com/microsoft/DeepSpeedExamples/blob/36212dd59cb3eb342c39bc8965aaba04d5491933/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L227-L230
  2. https://github.com/microsoft/DeepSpeedExamples/blob/36212dd59cb3eb342c39bc8965aaba04d5491933/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L948-L968

I suspect you may be doing this already but wanted to be sure. If you are already doing this, then is it possible to share your model code for me to repro? Thanks!
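The wrapping pattern in the two links above looks roughly like the sketch below. This is a simplified, self-contained illustration, not DeepSpeed code: the stand-in `checkpoint` function, the `layers` list, and `custom` are all invented here; in real code you would use `deepspeed.checkpointing.checkpoint` (or the import named later in this thread) with actual `nn.Module` layers.

```python
# Simplified sketch of the Megatron-style forward wrapping that makes
# DeepSpeed's activation-checkpointing settings take effect.
# NOTE: illustrative stand-ins only; in real code, `checkpoint` is
# deepspeed.checkpointing.checkpoint.

def checkpoint(run_fn, *inputs):
    # Stand-in: DeepSpeed's checkpoint receives a function that runs a
    # chunk of layers, so it can decide what to store and where
    # (GPU, partitioned across GPUs, or offloaded to CPU).
    return run_fn(*inputs)

# Pretend "layers" are simple callables instead of nn.Modules.
layers = [lambda x, i=i: x + i for i in range(4)]

def custom(start, end):
    # Returns a function that runs layers[start:end]; this chunk is the
    # unit of recomputation between two activation checkpoints.
    def custom_forward(x):
        for layer in layers[start:end]:
            x = layer(x)
        return x
    return custom_forward

hidden = 0
chunk = 2  # checkpoint every `chunk` layers
for start in range(0, len(layers), chunk):
    hidden = checkpoint(custom(start, start + chunk), hidden)
# hidden is now 0 + 0 + 1 + 2 + 3 = 6
```

The key design point is that the forward pass must go through the checkpoint API per chunk of layers; config flags like cpu_checkpointing have no effect on activations created outside such wrapped regions.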

@hpourmodheji

@tjruwase, I use your bing_bert example. It seems there is no checkpointing in this model. Since there is no Megatron module in this example, how can I use checkpointing here?

@tjruwase
Contributor

tjruwase commented Mar 2, 2022

@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:

  1. Switch the flag here to True
  2. Replace this import with from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
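As background for what cpu_checkpointing buys you once checkpointing is wired in, here is a toy, pure-Python sketch of the mechanism (not DeepSpeed code; `FakeTensor` and the helper names are invented for illustration): at each checkpoint boundary, the saved tensor is moved to host memory, freeing GPU memory, and is moved back to the device only when the backward pass needs it for recomputation.

```python
# Toy illustration (NOT DeepSpeed internals) of the cpu_checkpointing idea.

class FakeTensor:
    """Minimal stand-in for a tensor that tracks which device holds it."""
    def __init__(self, data, device):
        self.data, self.device = data, device
    def to(self, device):
        # Returns a copy "moved" to the target device.
        return FakeTensor(self.data, device)

saved = []

def save_for_backward(t, cpu_checkpointing):
    # With cpu_checkpointing enabled, only a CPU copy is retained,
    # so the GPU copy can be freed between forward and backward.
    saved.append(t.to("cpu") if cpu_checkpointing else t)

def retrieve_for_recompute(i):
    # Backward pass: bring the checkpoint back to the device on demand.
    return saved[i].to("cuda")

save_for_backward(FakeTensor([1, 2], "cuda"), cpu_checkpointing=True)
assert saved[0].device == "cpu"
assert retrieve_for_recompute(0).device == "cuda"
```

This is why the setting only helps when activations actually flow through the checkpointing API: tensors saved by ordinary autograd never pass through `save_for_backward`-style offloading logic.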

@hpourmodheji

@tjruwase, Thank you so much for your help. I have also changed the following line: checkpoint.checkpoint(...) => checkpoint(...). It is working now. Thanks for your patience and help.

tjruwase pushed a commit that referenced this issue Aug 23, 2023
… kernel (#522)

* Add experimental int4 dequantize kernel

* move quantiation into post_init_method

* fix
github-merge-queue bot pushed a commit that referenced this issue Sep 11, 2023
* INT4 weight only quantization (#479)

* INT4 weight only quantization

* pre commit

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* add zero3 test

* quantize small weight first to prevent oom

* fold quantization config into ds_config

* Fix license & refactor ds_config & rebase master

* fix UT

* Moving quantization into post_init_method and add int4 dequantization kernel (#522)

* Add experimental int4 dequantize kernel

* move quantiation into post_init_method

* fix

* Refactor: move int4 code to deepspeed/inference (#528)

* Move int 4 code to deepspeed/inference

* fix

* fix

* fix

* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix dequant bug

* Address PR feedback

* Use super() __exit__

* Fix unit tests

---------

Co-authored-by: Donglin Zhuang <donglinzhuang@outlook.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
@jomayeri jomayeri closed this as completed Sep 9, 2024