
why cpu_checkpointing can't work? #522

Closed
ghosthamlet opened this issue Nov 11, 2020 · 18 comments

@ghosthamlet
Contributor

ghosthamlet commented Nov 11, 2020

I have partition_activations and cpu_checkpointing enabled, but the activations still seem to stay on the GPU. I have only one GPU, so I can't do model parallelism. Does cpu_checkpointing only work with model parallelism? Why can't a single GPU (which should be the same as 1-GPU model parallelism) offload all of its checkpoints to the CPU?
My CPU memory is sufficient. Config:

{
       'zero_optimization': {
          'stage': 2,
          'cpu_offload': True,
          'contiguous_gradients': True,
          },
       'train_batch_size': 2,
       'fp16': {
          'enabled': True,
          "loss_scale": 0,
          "loss_scale_window": 1000,
          "hysteresis": 2,
          "min_loss_scale": 1,
          },
        "activation_checkpointing": {
          "partition_activations": True,
          "contiguous_memory_optimization": True,
          "cpu_checkpointing": True
        },
       "wall_clock_breakdown": False,
}

Environment:
python 3.6
torch 1.6.0
deepspeed 0.3.7
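
For completeness: the `activation_checkpointing` section of the config only takes effect when the model's forward pass actually routes layers through DeepSpeed's checkpointing API, which may be why the setting appears to do nothing. A minimal sketch, assuming the `deepspeed.checkpointing.configure`/`checkpoint` API; the `make_ds_config` helper is hypothetical, and only the plain-Python part runs here:

```python
def make_ds_config(cpu_checkpointing: bool) -> dict:
    """Hypothetical helper: build a minimal DeepSpeed config dict
    mirroring the config posted above."""
    return {
        "train_batch_size": 2,
        "zero_optimization": {"stage": 2, "cpu_offload": True},
        "activation_checkpointing": {
            "partition_activations": True,
            "contiguous_memory_optimization": True,
            "cpu_checkpointing": cpu_checkpointing,
        },
    }

# In the model code (requires deepspeed; shown as an assumption, not run here):
#   import deepspeed
#   deepspeed.checkpointing.configure(None, deepspeed_config="ds_config.json")
#   # replace torch.utils.checkpoint.checkpoint with:
#   hidden = deepspeed.checkpointing.checkpoint(layer, hidden)

config = make_ds_config(cpu_checkpointing=True)
print(config["activation_checkpointing"]["cpu_checkpointing"])  # True
```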

@ghosthamlet
Contributor Author

I think this is related to #541

@hpourmodheji

@ghosthamlet
I have the same issue. Did you kindly get an answer to this issue?

@tjruwase
Contributor

tjruwase commented Feb 4, 2022

@hpourmodheji, thanks for the question. This should have been fixed by #1254. Are you still having this problem?

@hpourmodheji

@tjruwase, thanks for your comment. I tried ZeRO-Infinity with the DS version 0.5.7+, but it seems that activations are not offloaded to the CPU.

Here is the config part:


"zero_optimization": {
"stage": 3,
"contiguous_gradients": true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 1e7,
"sub_group_size": 1e9,
"offload_optimizer": {
"device": "cpu",
"_device": "nvme",
},
"offload_param": {
"device": "cpu",
"_device": "nvme",
}
},
"activation_checkpointing": {
"partition_activations": true,
"cpu_checkpointing": true,
"contiguous_memory_optimization": true,
"_number_checkpoints": null,
"_synchronize_checkpoint_boundary": false,
"_profile": false,
"overlap_comm": true,
"reduce_bucket_size": 500000000
},

Here are the results. There is no memory reduction during the forward pass.

| Training stage | Baseline: GPU memory (single GPU) | ZeRO-Infinity: GPU memory (single GPU) | Memory reduction |
| --- | --- | --- | --- |
| before forward | MA 4.38 GB | MA 0.12 GB | ~37x |
| before backward | MA 5.74 GB | MA 1.55 GB | ~4x |
| before optimizer | MA 5.01 GB | MA 0.12 GB | ~42x |

@hpourmodheji

@tjruwase, do you have any comment on this and on offloading activations?

@ghosthamlet
Contributor Author

> @ghosthamlet I have the same issue. Did you kindly get an answer to this issue?

@hpourmodheji Sorry, I did not try activation_checkpointing on a single GPU again; I just added another GPU, and then it worked.

@hpourmodheji

@ghosthamlet, Thanks for your comment. I tried with 2 GPUs, and it still does not work. I hope to hear from the DS team.

@tjruwase
Contributor

tjruwase commented Feb 16, 2022

@ghosthamlet and @hpourmodheji, it seems there are a few issues here. I will take a closer look.

@tjruwase
Contributor

> @ghosthamlet I have the same issue. Did you kindly get an answer to this issue?
>
> @hpourmodheji Sorry, i did not try single GPU for activation_checkpointing again, only added another GPU, then it works.

@ghosthamlet, can you please confirm that activation_checkpointing and cpu_offloading work as expected with 2 GPUs, but not with 1 GPU?

@tjruwase
Contributor

> @ghosthamlet, Thanks for your comment. I tried with 2 GPUs, and it still does not work. I hope to hear from the DS team.

@hpourmodheji, thanks for the table you shared earlier, but I think we need something different for this investigation. Activation checkpointing (including cpu offload) can be enabled without zero stage 3. And so, it would be good to disable zero by removing the zero key in the json config. Can you please share similar information for single GPU for these 3 scenarios?

  1. No activation checkpointing - by removing the activation_checkpointing key in the json config.
  2. Activation checkpointing without cpu offloading - by setting cpu_checkpointing to false
  3. Activation checkpointing with cpu offloading - by setting cpu_checkpointing to true

The collected information will help with the next steps of investigation. Thanks!
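
The three scenarios can be derived mechanically from one base config; a minimal pure-Python sketch (the `scenario_config` helper and the `base` dict are hypothetical, only the key names come from this thread):

```python
import copy
import json

def scenario_config(base: dict, scenario: int) -> dict:
    """Derive one of the three requested test configs.
    1: no activation checkpointing,
    2: activation checkpointing without cpu offloading,
    3: activation checkpointing with cpu offloading."""
    cfg = copy.deepcopy(base)
    cfg.pop("zero_optimization", None)  # ZeRO disabled for the experiment
    if scenario == 1:
        cfg.pop("activation_checkpointing", None)
    else:
        cfg["activation_checkpointing"]["cpu_checkpointing"] = (scenario == 3)
    return cfg

base = {
    "train_batch_size": 2,
    "zero_optimization": {"stage": 3},
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
    },
}

for n in (1, 2, 3):
    print(n, json.dumps(scenario_config(base, n)))
```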

@ghosthamlet
Contributor Author

ghosthamlet commented Feb 21, 2022

@tjruwase Sorry for the late reply.
When I was using an old version of DeepSpeed months ago, activation_checkpointing and cpu_offloading did not work with 1 GPU, so I added more GPUs and used stage 3 with offload_param (no model parallelism). The activation_checkpointing and cpu_offloading of the old DeepSpeed version probably still did not work at that time, but GPU memory was sufficient.
With the current version of DeepSpeed, it works on both 1 GPU and 2 GPUs, as you can see in the logs:

1 GPU:

activation_checkpointing cpu_offload

[2022-02-21 22:18:11,198] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:18:11,199] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:11,199] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%
[2022-02-21 22:18:12,359] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:18:12,360] [INFO] [utils.py:827:see_memory_usage] MA 0.27 GB         Max_MA 2.02 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:12,360] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%
[2022-02-21 22:18:16,976] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:18:16,977] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 2.41 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:18:16,977] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.11 GB, percent = 39.9%


activation_checkpointing but no cpu_offload

[2022-02-21 22:21:26,836] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:21:26,837] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:26,837] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%
[2022-02-21 22:21:27,809] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:21:27,810] [INFO] [utils.py:827:see_memory_usage] MA 0.5 GB         Max_MA 2.02 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:27,810] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%
[2022-02-21 22:21:32,630] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:21:32,631] [INFO] [utils.py:827:see_memory_usage] MA 0.14 GB         Max_MA 2.64 GB         CA 6.11 GB         Max_CA 6 GB
[2022-02-21 22:21:32,631] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.09 GB, percent = 39.9%


2 GPUs:

activation_checkpointing cpu_offload

[2022-02-21 22:05:01,225] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:05:01,226] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:05:01,226] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:02,447] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:05:02,448] [INFO] [utils.py:827:see_memory_usage] MA 0.22 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:05:02,448] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:06,865] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:05:06,865] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.36 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:05:06,866] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%


activation_checkpointing but no cpu_offload

[2022-02-21 22:08:47,487] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:08:47,487] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:08:47,488] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:48,631] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:08:48,631] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:08:48,632] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:53,495] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:08:53,495] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.6 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:08:53,496] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%

Configs:

 {
        "zero_optimization": {
            "stage": 3,
            "contiguous_gradients": True,
            "overlap_comm": False,
            "reduce_bucket_size": 50000000,
            "stage3_max_live_parameters": 1e9, 
            "stage3_max_reuse_distance": 1e9, 
            "stage3_prefetch_bucket_size": 5e8, 
            "stage3_param_persistence_threshold": 1e6, 
            "sub_group_size": 1e12, 
            "offload_optimizer": {
                "device": "cpu",
                },
            "offload_param": {
                "device": "cpu",
                },
            },

        "activation_checkpointing": {
            "partition_activations": True,
            "num_checkpoints": 1,
            "cpu_checkpointing": True,
            "contiguous_memory_optimization": False
        },

        "train_batch_size": 2 * args.world_size,
        "gradient_accumulation_steps": 1,
        "fp16": {
            "enabled": True,
            "loss_scale": 0,
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
            "initial_scale_power": 16,
        },
    } 

ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch version .................... 1.7.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.2
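
Logs like these can be condensed for side-by-side comparison with a small parser; a sketch (the `parse_memory_line` helper is hypothetical, the line format is taken from the see_memory_usage output above):

```python
import re

# Match the memory fields of a DeepSpeed see_memory_usage line, e.g.
# "MA 0.14 GB   Max_MA 0.14 GB   CA 6.11 GB   Max_CA 6 GB"
LINE_RE = re.compile(
    r"MA (?P<ma>[\d.]+) GB\s+Max_MA (?P<max_ma>[\d.]+) GB\s+"
    r"CA (?P<ca>[\d.]+) GB\s+Max_CA (?P<max_ca>[\d.]+) GB")

def parse_memory_line(line: str) -> dict:
    """Extract MA/Max_MA/CA/Max_CA (in GB) from one log line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a see_memory_usage line")
    return {k: float(v) for k, v in m.groupdict().items()}

sample = ("[2022-02-21 22:18:11,199] [INFO] [utils.py:827:see_memory_usage] "
          "MA 0.14 GB         Max_MA 0.14 GB         CA 6.11 GB         Max_CA 6 GB")
print(parse_memory_line(sample))
# {'ma': 0.14, 'max_ma': 0.14, 'ca': 6.11, 'max_ca': 6.0}
```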

@tjruwase
Contributor

@ghosthamlet, awesome! Thanks so much for sharing this detailed feedback. If possible, can you please check whether these activation checkpointing features also work with zero stages < 3? Thanks!

@ghosthamlet
Contributor Author

ghosthamlet commented Feb 22, 2022

@tjruwase For stage0/stage1 the 2080 Ti GPU ran out of memory, so I had to test a smaller model.
stage1 cpu_offload has a problem, so I disabled it.

I think stage0/stage1/stage2 are working too:

1 GPU:

stage0, gradient_checkpointing cpu_offload

[2022-02-22 12:31:29,822] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:31:29,823] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:29,823] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:31:29,950] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:31:29,951] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.0 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:29,951] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:31:30,128] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:31:30,129] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:31:30,129] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%


stage0, gradient_checkpointing no cpu_offload

[2022-02-22 12:30:11,060] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:30:11,060] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,061] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:30:11,186] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:30:11,186] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.01 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,187] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%
[2022-02-22 12:30:11,376] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:30:11,377] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 2.06 GB         Max_CA 2 GB
[2022-02-22 12:30:11,377] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.09 GB, percent = 4.1%

------------------------------

stage1 no cpu_offload, gradient_checkpointing cpu_offload

[2022-02-22 12:22:28,406] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:22:28,407] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.5 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,407] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:22:28,535] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:22:28,536] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.0 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,536] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:22:28,707] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:22:28,707] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:22:28,708] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%


stage1 no cpu_offload, gradient_checkpointing but no cpu_offload 

[2022-02-22 12:23:20,951] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:23:20,951] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.5 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:20,952] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:23:21,077] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:23:21,077] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.01 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:21,078] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%
[2022-02-22 12:23:21,245] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:23:21,246] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.27 GB         CA 1.74 GB         Max_CA 2 GB
[2022-02-22 12:23:21,246] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 5.08 GB, percent = 4.0%

------------------------------

stage2 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 23:44:59,393] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:44:59,394] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:44:59,394] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%
[2022-02-21 23:44:59,835] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:44:59,836] [INFO] [utils.py:827:see_memory_usage] MA 4.96 GB         Max_MA 5.1 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:44:59,836] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%
[2022-02-21 23:45:02,321] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:45:02,322] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.72 GB         CA 5.86 GB         Max_CA 6 GB
[2022-02-21 23:45:02,322] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.42 GB, percent = 36.2%


stage2 cpu_offload, gradient_checkpointing but no cpu_offload 

[2022-02-21 23:40:55,849] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:40:55,850] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:55,850] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%
[2022-02-21 23:40:56,250] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:40:56,250] [INFO] [utils.py:827:see_memory_usage] MA 5.2 GB         Max_MA 5.33 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:56,251] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%
[2022-02-21 23:40:58,562] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:40:58,563] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.94 GB         CA 6.14 GB         Max_CA 6 GB
[2022-02-21 23:40:58,563] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 45.41 GB, percent = 36.2%



2 GPUs:


stage0, gradient_checkpointing cpu_offload

[2022-02-22 12:27:51,083] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:27:51,084] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 1.75 GB         Max_CA 2 GB
[2022-02-22 12:27:51,084] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%
[2022-02-22 12:27:51,212] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:27:51,212] [INFO] [utils.py:827:see_memory_usage] MA 0.88 GB         Max_MA 1.02 GB         CA 1.83 GB         Max_CA 2 GB
[2022-02-22 12:27:51,212] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%
[2022-02-22 12:27:51,377] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:27:51,378] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.3 GB         CA 2.05 GB         Max_CA 2 GB
[2022-02-22 12:27:51,378] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.78 GB, percent = 6.2%


stage0, gradient_checkpointing no cpu_offload

[2022-02-22 12:29:05,231] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:29:05,231] [INFO] [utils.py:827:see_memory_usage] MA 0.81 GB         Max_MA 1.24 GB         CA 1.75 GB         Max_CA 2 GB
[2022-02-22 12:29:05,232] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%
[2022-02-22 12:29:05,353] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:29:05,353] [INFO] [utils.py:827:see_memory_usage] MA 0.89 GB         Max_MA 1.03 GB         CA 1.83 GB         Max_CA 2 GB
[2022-02-22 12:29:05,354] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%
[2022-02-22 12:29:05,505] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:29:05,506] [INFO] [utils.py:827:see_memory_usage] MA 0.93 GB         Max_MA 1.31 GB         CA 2.05 GB         Max_CA 2 GB
[2022-02-22 12:29:05,506] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.72 GB, percent = 6.1%

------------------------------

stage1 no cpu_offload, gradient_checkpointing cpu_offload

[2022-02-22 12:25:50,491] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:25:50,492] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 0.81 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:25:50,492] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%
[2022-02-22 12:25:50,617] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:25:50,618] [INFO] [utils.py:827:see_memory_usage] MA 0.54 GB         Max_MA 0.67 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:25:50,618] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%
[2022-02-22 12:25:50,779] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:25:50,780] [INFO] [utils.py:827:see_memory_usage] MA 0.53 GB         Max_MA 0.96 GB         CA 1.28 GB         Max_CA 1 GB
[2022-02-22 12:25:50,780] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.81 GB, percent = 6.2%


stage1 no cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-22 12:24:42,382] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-22 12:24:42,383] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 0.81 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:24:42,383] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%
[2022-02-22 12:24:42,504] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-22 12:24:42,504] [INFO] [utils.py:827:see_memory_usage] MA 0.55 GB         Max_MA 0.68 GB         CA 1.13 GB         Max_CA 1 GB
[2022-02-22 12:24:42,505] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%
[2022-02-22 12:24:42,660] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-22 12:24:42,661] [INFO] [utils.py:827:see_memory_usage] MA 0.53 GB         Max_MA 0.96 GB         CA 1.28 GB         Max_CA 1 GB
[2022-02-22 12:24:42,661] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 7.82 GB, percent = 6.2%

------------------------------

stage2 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 23:35:26,890] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:35:26,891] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:26,891] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%
[2022-02-21 23:35:27,337] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:35:27,338] [INFO] [utils.py:827:see_memory_usage] MA 4.96 GB         Max_MA 5.09 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:27,338] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%
[2022-02-21 23:35:29,950] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:35:29,951] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.55 GB         CA 6.39 GB         Max_CA 6 GB
[2022-02-21 23:35:29,951] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.5 GB, percent = 40.2%


stage2 cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-21 23:37:48,850] [INFO] [utils.py:822:see_memory_usage] before forward 1
[2022-02-21 23:37:48,850] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 4.89 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:48,851] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.45 GB, percent = 40.2%
[2022-02-21 23:37:49,246] [INFO] [utils.py:822:see_memory_usage] before backward 1
[2022-02-21 23:37:49,246] [INFO] [utils.py:827:see_memory_usage] MA 5.19 GB         Max_MA 5.31 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:49,247] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.45 GB, percent = 40.2%
[2022-02-21 23:37:51,801] [INFO] [utils.py:822:see_memory_usage] before optimizer 1
[2022-02-21 23:37:51,802] [INFO] [utils.py:827:see_memory_usage] MA 4.89 GB         Max_MA 5.76 GB         CA 6.68 GB         Max_CA 7 GB
[2022-02-21 23:37:51,802] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 50.46 GB, percent = 40.2%

I want to ask another question:
There is one thing I am very confused about. In training with four 2080 Ti GPUs, when GPU 0 memory is:
MA 0.79 GB Max_MA 2.77 GB CA 6.55 GB Max_CA 7 GB
the GPU runs out of memory. But the max allocated (Max_MA) is just 2.77 GB; the out-of-memory is caused by Max_CA 7 GB, which is the number nvidia-smi shows.
If Max_CA is the number that matters most, then the MA and Max_MA reductions from DeepSpeed are not enough. Are there any methods to reduce CA and Max_CA too? CA and Max_CA are cached memory; why can't it be released?
As you can see in the 2-GPU logs, Max_CA is always very large:

2 GPUs:

stage3 cpu_offload, gradient_checkpointing cpu_offload

[2022-02-21 22:05:01,225] [INFO] [utils.py:822:see_memory_usage] before forward 6   
[2022-02-21 22:05:01,226] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:05:01,226] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:02,447] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:05:02,448] [INFO] [utils.py:827:see_memory_usage] MA 0.22 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:05:02,448] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%
[2022-02-21 22:05:06,865] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:05:06,865] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.36 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:05:06,866] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.65 GB, percent = 46.7%


stage3 cpu_offload, gradient_checkpointing but no cpu_offload

[2022-02-21 22:08:47,487] [INFO] [utils.py:822:see_memory_usage] before forward 6
[2022-02-21 22:08:47,487] [INFO] [utils.py:827:see_memory_usage] MA 0.09 GB         Max_MA 0.09 GB         CA 5.46 GB         Max_CA 5 GB
[2022-02-21 22:08:47,488] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:48,631] [INFO] [utils.py:822:see_memory_usage] before backward 6
[2022-02-21 22:08:48,631] [INFO] [utils.py:827:see_memory_usage] MA 0.46 GB         Max_MA 1.97 GB         CA 5.53 GB         Max_CA 6 GB
[2022-02-21 22:08:48,632] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%
[2022-02-21 22:08:53,495] [INFO] [utils.py:822:see_memory_usage] before optimizer 6
[2022-02-21 22:08:53,495] [INFO] [utils.py:827:see_memory_usage] MA 0.08 GB         Max_MA 2.6 GB         CA 5.83 GB         Max_CA 6 GB
[2022-02-21 22:08:53,496] [INFO] [utils.py:832:see_memory_usage] CPU Virtual Memory:  used = 58.62 GB, percent = 46.7%


@hpourmodheji

> @hpourmodheji, thanks for the table you shared earlier, but I think we need something different for this investigation. Activation checkpointing (including cpu offload) can be enabled without zero stage 3. And so, it would be good to disable zero by removing the zero key in the json config. Can you please share similar information for single GPU for these 3 scenarios?
>
> 1. No activation checkpointing - by removing the activation_checkpointing key in the json config.
> 2. Activation checkpointing without cpu offloading - by setting cpu_checkpointing to false
> 3. Activation checkpointing with cpu offloading - by setting cpu_checkpointing to true
>
> The collected information will help with the next steps of investigation. Thanks!

@tjruwase, sorry for my late reply. Please see the table below. Unfortunately, I see no offloading. Can you please advise?

| Training stage | No activation checkpointing (activation_checkpointing removed) | Activation checkpointing without cpu offloading (cpu_checkpointing = false) | Activation checkpointing with cpu offloading (cpu_checkpointing = true) |
| --- | --- | --- | --- |
| before forward | MA 4.37 GB | MA 4.38 GB | MA 4.38 GB |
| before backward | MA 5.73 GB | MA 5.73 GB | MA 5.73 GB |
| before optimizer | MA 5.01 GB | MA 5.01 GB | MA 5.01 GB |

No activation checkpointing (activation_checkpointing removed):

{
    "_train_batch_size": 30,
    "train_micro_batch_size_per_gpu": 10,
    "steps_per_print": 10,
    "gradient_accumulation_steps": 1,
    "_prescale_gradients": false,
    "flops_profiler": {
        "enabled": true,
        "profile_step": 5,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    },
    "_zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu",
            "_device": "nvme"
        },
        "offload_param": {
            "device": "cpu",
            "_device": "nvme"
        }
    },
    "zero_allow_untested_optimizer": true,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "weight_decay": 0.01,
            "bias_correction": false
        }
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "_activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": true,
        "_number_checkpoints": null,
        "_synchronize_checkpoint_boundary": false,
        "_profile": false,
        "overlap_comm": true,
        "reduce_bucket_size": 500000000
    },
    "sparse_attention": {
        "mode": "fixed",
        "block": 16,
        "different_layout_per_head": true,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "bidirectional",
        "horizontal_global_attention": false,
        "num_different_global_patterns": 4
    }
}


Activation checkpointing without cpu offloading (cpu_checkpointing = false):

{
    "_train_batch_size": 30,
    "train_micro_batch_size_per_gpu": 10,
    "steps_per_print": 10,
    "gradient_accumulation_steps": 1,
    "_prescale_gradients": false,
    "flops_profiler": {
        "enabled": true,
        "profile_step": 5,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true,
        "output_file": null
    },
    "_zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu",
            "_device": "nvme"
        },
        "offload_param": {
            "device": "cpu",
            "_device": "nvme"
        }
    },
    "zero_allow_untested_optimizer": true,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "weight_decay": 0.01,
            "bias_correction": false
        }
    },
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": false,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 16
    },
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": true,
        "_number_checkpoints": null,
        "_synchronize_checkpoint_boundary": false,
        "_profile": false,
        "overlap_comm": true,
        "reduce_bucket_size": 500000000
    },
    "sparse_attention": {
        "mode": "fixed",
        "block": 16,
        "different_layout_per_head": true,
        "num_local_blocks": 4,
        "num_global_blocks": 1,
        "attention": "bidirectional",
        "horizontal_global_attention": false,
        "num_different_global_patterns": 4
    }
}


Activation checkpointing with CPU offloading (partition_activations = true, cpu_checkpointing = true):

{
  "_train_batch_size": 30,
  "train_micro_batch_size_per_gpu": 10,
  "steps_per_print": 10,
  "gradient_accumulation_steps": 1,
  "_prescale_gradients": false,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 5,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  },
  "_zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu",
      "_device": "nvme"
    },
    "offload_param": {
      "device": "cpu",
      "_device": "nvme"
    }
  },
  "zero_allow_untested_optimizer": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 2e-4,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": true,
    "_number_checkpoints": null,
    "_synchronize_checkpoint_boundary": false,
    "_profile": false,
    "overlap_comm": true,
    "reduce_bucket_size": 500000000
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}


$ ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meets the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['anaconda3/envs/deepspeed-env/lib/python3.7/site-packages/torch']
torch version .................... 1.10.0
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['anaconda3/envs/deepspeed-env/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.7+fa9d3e8, fa9d3e8, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1

@tjruwase
Contributor

tjruwase commented Mar 2, 2022

@hpourmodheji, thanks for your response. By the way, are you testing with existing DeepSpeed examples code, or your own model? If you are using your own, did you properly wrap your forward pass as done below:

  1. https://github.com/microsoft/DeepSpeedExamples/blob/36212dd59cb3eb342c39bc8965aaba04d5491933/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L227-L230
  2. https://github.com/microsoft/DeepSpeedExamples/blob/36212dd59cb3eb342c39bc8965aaba04d5491933/Megatron-LM-v1.1.5-ZeRO3/megatron/model/transformer.py#L948-L968

I suspect you may be doing this already but wanted to be sure. If you are already doing this, then is it possible to share your model code for me to repro? Thanks!
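The wrapping pattern in the two links above looks roughly like the sketch below. This is a simplified, self-contained illustration, not DeepSpeed code: the stand-in `checkpoint` function, the `layers` list, and `custom` are all invented here; in real code you would use `deepspeed.checkpointing.checkpoint` (or the import named later in this thread) with actual `nn.Module` layers.

```python
# Simplified sketch of the Megatron-style forward wrapping that makes
# DeepSpeed's activation-checkpointing settings take effect.
# NOTE: illustrative stand-ins only; in real code, `checkpoint` is
# deepspeed.checkpointing.checkpoint.

def checkpoint(run_fn, *inputs):
    # Stand-in: DeepSpeed's checkpoint receives a function that runs a
    # chunk of layers, so it can decide what to store and where
    # (GPU, partitioned across GPUs, or offloaded to CPU).
    return run_fn(*inputs)

# Pretend "layers" are simple callables instead of nn.Modules.
layers = [lambda x, i=i: x + i for i in range(4)]

def custom(start, end):
    # Returns a function that runs layers[start:end]; this chunk is the
    # unit of recomputation between two activation checkpoints.
    def custom_forward(x):
        for layer in layers[start:end]:
            x = layer(x)
        return x
    return custom_forward

hidden = 0
chunk = 2  # checkpoint every `chunk` layers
for start in range(0, len(layers), chunk):
    hidden = checkpoint(custom(start, start + chunk), hidden)
# hidden is now 0 + 0 + 1 + 2 + 3 = 6
```

The key design point is that the forward pass must go through the checkpoint API per chunk of layers; config flags like cpu_checkpointing have no effect on activations created outside such wrapped regions.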

@hpourmodheji

@tjruwase, I use your bing_bert example. It seems there is no checkpointing in this model. Since there is no Megatron module in this example, how can I use checkpointing here?

@tjruwase
Contributor

tjruwase commented Mar 2, 2022

@hpourmodheji, that is helpful context. We did not enable activation checkpointing for BERT because models smaller than ~1B parameters may not benefit much, given the re-computation overhead that activation checkpointing introduces. However, if you want to enable it, do the following:

  1. Switch the flag here to True
  2. Replace this import with from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
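As background for what cpu_checkpointing buys you once checkpointing is wired in, here is a toy, pure-Python sketch of the mechanism (not DeepSpeed code; `FakeTensor` and the helper names are invented for illustration): at each checkpoint boundary, the saved tensor is moved to host memory, freeing GPU memory, and is moved back to the device only when the backward pass needs it for recomputation.

```python
# Toy illustration (NOT DeepSpeed internals) of the cpu_checkpointing idea.

class FakeTensor:
    """Minimal stand-in for a tensor that tracks which device holds it."""
    def __init__(self, data, device):
        self.data, self.device = data, device
    def to(self, device):
        # Returns a copy "moved" to the target device.
        return FakeTensor(self.data, device)

saved = []

def save_for_backward(t, cpu_checkpointing):
    # With cpu_checkpointing enabled, only a CPU copy is retained,
    # so the GPU copy can be freed between forward and backward.
    saved.append(t.to("cpu") if cpu_checkpointing else t)

def retrieve_for_recompute(i):
    # Backward pass: bring the checkpoint back to the device on demand.
    return saved[i].to("cuda")

save_for_backward(FakeTensor([1, 2], "cuda"), cpu_checkpointing=True)
assert saved[0].device == "cpu"
assert retrieve_for_recompute(0).device == "cuda"
```

This is why the setting only helps when activations actually flow through the checkpointing API: tensors saved by ordinary autograd never pass through `save_for_backward`-style offloading logic.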

@hpourmodheji

@tjruwase, Thank you so much for your help. I have also changed the following line: checkpoint.checkpoint(...) => checkpoint(...). It is working now. Thanks for your patience and help.

tjruwase pushed a commit that referenced this issue Aug 23, 2023
… kernel (#522)

* Add experimental int4 dequantize kernel

* move quantiation into post_init_method

* fix
github-merge-queue bot pushed a commit that referenced this issue Sep 11, 2023
* INT4 weight only quantization (#479)

* INT4 weight only quantization

* pre commit

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* add zero3 test

* quantize small weight first to prevent oom

* fold quantization config into ds_config

* Fix license & refactor ds_config & rebase master

* fix UT

* Moving quantization into post_init_method and add int4 dequantization kernel (#522)

* Add experimental int4 dequantize kernel

* move quantiation into post_init_method

* fix

* Refactor: move int4 code to deepspeed/inference (#528)

* Move int 4 code to deepspeed/inference

* fix

* fix

* fix

* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix dequant bug

* Address PR feedback

* Use super() __exit__

* Fix unit tests

---------

Co-authored-by: Donglin Zhuang <donglinzhuang@outlook.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
@jomayeri jomayeri closed this as completed Sep 9, 2024