Add RFC dir for submission tracking #560

Closed · wants to merge 9 commits into from
117 changes: 117 additions & 0 deletions rfcs/colossalai.md
## **Summary**

This RFC proposes a unified access interface to the different devices (CPU and GPU) used by the components of ColossalAI. The proposal will also make it easier to add Intel XPU support to ColossalAI in the future.

## **Motivation**

As we know, the major features of ColossalAI are built on Nvidia GPUs and the CUDA stack. This limits ColossalAI's ability to leverage other device types for enabling LLMs.

For example, `utils/cuda.py`, `context/parallel_context.py` and a few other modules already expose separate interfaces for other components to access the `cpu` or `gpu` device. Besides that, many internal components invoke `torch.cuda` explicitly.

We would like to propose a unified device access interface that supports not only Nvidia GPUs but also other device types, such as Intel x86 CPUs and XPUs.

## **Proposals**

*NOTE 1: Currently the proposal mainly focuses on the ColossalAI training part. ColossalAI inference support is out of scope here.*

*NOTE 2: This RFC focuses on the Python API level only. Replacing CUDA kernels with CPU counterparts, and the corresponding upper-layer features such as nn.optimizer, nn.layer, Gemini and so on, are out of scope here; we plan to cover them in separate RFCs/PRs.*

As ColossalAI training is designed to speed up NLP & LLM training through data and model parallelism, it already has a central place to store the `execution context` in `core.global_context`. The first proposal for unifying device access is to extend this `core.global_context` structure to get and set device-related information.

The `engine` and `trainer` user-facing APIs will rely on it to copy tensors to the device the application is running on.

The details are sketched below:

```python
class ParallelContext(metaclass=SingletonMeta):
    ### existing methods
    ...

    ### new methods
    def set_device(self, device_ordinal=None):
        # set the current device
        ...

    def get_device_name(self):
        # get the name of the device
        ...

    def to(self):
        # move or cast the parameter to the specified device
        ...

    def device_count(self):
        # get the number of devices
        ...

    def synchronize(self, device_index=None):
        # wait for / communicate with the specified device
        ...

    def random(self):
        # random number generator
        ...

    def stream(self):
        # get the device stream
        ...

    # other device-related methods; details will be included in PRs.
    ...
```
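
For illustration only, below is a minimal standalone sketch of how such methods could dispatch between a CUDA and a CPU backend. The class and helper names here are assumptions for the sketch, not the final ColossalAI implementation.

```python
import torch

class _DeviceContextSketch:
    """Standalone illustration of the proposed device dispatch (not the real class)."""

    def __init__(self):
        # pick the backend once, when the context is created
        self._device_type = "cuda" if torch.cuda.is_available() else "cpu"

    def set_device(self, device_ordinal=None):
        # bind the process to a specific accelerator; a no-op on CPU
        if self._device_type == "cuda" and device_ordinal is not None:
            torch.cuda.set_device(device_ordinal)

    def get_device_name(self):
        # e.g. "cuda:0" on an Nvidia GPU machine, "cpu" otherwise
        if self._device_type == "cuda":
            return f"cuda:{torch.cuda.current_device()}"
        return "cpu"

    def device_count(self):
        return torch.cuda.device_count() if self._device_type == "cuda" else 1

    def to(self, tensor):
        # move/cast a tensor to the active device
        return tensor.to(self.get_device_name())

    def synchronize(self, device_index=None):
        # wait for outstanding work on the device; CPU ops are already synchronous
        if self._device_type == "cuda":
            torch.cuda.synchronize(device_index)
```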

From the user's view, the training code becomes simpler than before because the user no longer needs to specify a device explicitly. The new logic automatically moves/casts tensors to the device being used. For example, if the underlying hardware is an Nvidia GPU, tensors automatically go to the CUDA device; if the underlying hardware is an Intel x86 CPU, tensors automatically stay on the CPU side.

Below is some sample code to demonstrate this idea.

```python
# The commented-out lines are no longer needed, as engine and trainer will
# automatically copy tensors from host to device.

for epoch in range(gpc.config.NUM_EPOCHS):
    # execute a training iteration
    engine.train()
    for img, label in train_dataloader:
        # img = img.cuda()
        # label = label.cuda()

        # set gradients to zero
        engine.zero_grad()

        # run forward pass
        output = engine(img)

        # compute loss value and run backward pass
        train_loss = engine.criterion(output, label)
        engine.backward(train_loss)

        # update parameters
        engine.step()

    # update learning rate
    lr_scheduler.step()

    # execute a testing iteration
    engine.eval()
    correct = 0
    total = 0
    for img, label in test_dataloader:
        # img = img.cuda()
        # label = label.cuda()

        # run prediction without back-propagation
        with torch.no_grad():
            output = engine(img)
            test_loss = engine.criterion(output, label)

        # compute the number of correct predictions
        pred = torch.argmax(output, dim=-1)
```
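
As a purely illustrative companion to the loop above, here is one way an engine wrapper could hide device placement behind the extended context; the class and attribute names are assumptions, not the proposed ColossalAI API.

```python
class _EngineSketch:
    """Illustration only: how an engine wrapper could hide device placement."""

    def __init__(self, model, criterion, ctx):
        self._model = model          # the wrapped nn.Module
        self._criterion = criterion  # the loss function
        self._ctx = ctx              # extended parallel context exposing `to()`

    def __call__(self, *inputs):
        # copy inputs host-to-device (or leave them on CPU) before the forward pass
        return self._model(*(self._ctx.to(x) for x in inputs))

    def criterion(self, output, target):
        # targets get the same automatic placement as inputs
        return self._criterion(output, self._ctx.to(target))
```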

## **Future Works**

This RFC focuses on discussing the unified device access interface. We will gradually add CPU support to the internal components to make the whole functionality work, following the *TODO* list below.

- [ ] Gemini
- [ ] Communication
- [ ] nn/kernel
- [ ] autochunk/autoparallel
- [ ] zero
- [ ] tensor
193 changes: 193 additions & 0 deletions rfcs/deepspeed_rfc.md
## **Summary**

This is a design discussion RFC for contributing some device-agnostic compression algorithms supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), such as post-training quantization (QDQ quant format) and structural sparsity, into DeepSpeed.

## **Motivation**

As we know, DeepSpeed Compression already supports many useful compression methods, such as layer reduction via knowledge distillation, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning.

However, it still lacks some mainstream compression algorithms, such as post-training static quantization and structural sparsity, which industry has demonstrated to be efficient and popular compression methods.

[Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) has implemented such device-agnostic compression algorithms, and we would like to contribute them to DeepSpeed.

## **Proposal Details on Pruning**

In this proposal, we would like to introduce structural pruning functionality by enabling the "N in M" and "N x M" block sparsity patterns with the snip_momentum criterion and progressive pruning.

We propose a two-phase approach to supporting the structural pruning method.

**Phase 1: structural pruning with global sparse ratio**

This phase leverages the existing DeepSpeed sparsity design, which uses a global sparse ratio control. If the accuracy doesn't meet expectations, the user has to tune the training process as they do on DeepSpeed today, by manually specifying and exploring a proper sparse ratio per layer.

We extend the JSON config file format and implement the structural sparsity algorithm in the `compression` dir, as shown below.

~~~
{
  "sparse_pruning": {
    "shared_parameters": {
      "enabled": true,
      "method": "snip_momentum",          # new value
      "pattern": "4x1",                   # new field
      "dense_ratio": 0.1,
      "gradient_accumulation_steps": 1,   # new field
      "sparsity_decay_type": "exp",       # new field
      "start_step": 0,                    # new field
      "end_step": 10000                   # new field
    },
    "different_groups": {
      "sp1": {
        "params": {
          "dense_ratio": 0.5
        },
        "modules": [
          "attention.self"
        ]
      }
    }
  }
}
~~~
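
To make the new schedule-related fields concrete, below is an illustrative standalone sketch of how `start_step`, `end_step` and `sparsity_decay_type: "exp"` could be turned into a per-step target sparse ratio. The ramp formula itself is an assumption for illustration, not the exact algorithm being contributed.

```python
import math

def target_sparsity(step, start_step, end_step, final_sparsity, decay_type="exp"):
    """Map a training step to a target sparse ratio, ramping from 0 to final_sparsity."""
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / float(end_step - start_step)
    if decay_type == "exp":
        # exponential ramp: prune slowly at first, faster towards end_step
        return final_sparsity * (1.0 - math.exp(-5.0 * progress)) / (1.0 - math.exp(-5.0))
    # fall back to a linear ramp for unrecognized decay types
    return final_sparsity * progress

# with the shared parameters above (dense_ratio 0.1 -> final sparsity 0.9):
print(round(target_sparsity(5000, start_step=0, end_step=10000, final_sparsity=0.9), 3))
```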

As for the structural sparsity implementation in the `compression` dir, take the `LinearLayer_Compress` class in `deepspeed/compression/basic_layer.py` as an example; this class is enhanced as shown below to support the structural sparsity algorithm.

<a target="_blank" href="./imgs/linear_example.png">
<img src="./imgs/linear_example.png" alt="Extension" width="100%" height="100%">
</a>
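
For readers without the image, here is a rough standalone sketch (not the actual `LinearLayer_Compress` code) of how a 4x1 block mask driven by a snip_momentum-style saliency score could be computed and applied to a linear weight; all names and the exact scoring formula are assumptions.

```python
import torch

def block_mask_4x1(weight, grad, sparsity):
    """Keep/prune 4x1 blocks (4 consecutive output channels) by a |w * g| saliency score."""
    out_f, in_f = weight.shape                     # out_f is assumed divisible by 4
    score = (weight * grad).abs()
    block_score = score.reshape(out_f // 4, 4, in_f).sum(dim=1)   # one score per 4x1 block
    k = int(block_score.numel() * sparsity)        # number of blocks to prune
    if k == 0:
        return torch.ones_like(weight)
    threshold = block_score.flatten().kthvalue(k).values
    keep = (block_score > threshold).float()       # 1 = keep block, 0 = prune block
    return keep.repeat_interleave(4, dim=0)        # expand block decisions to weight shape

# inside a forward pass the masked weight would be used as:
# torch.nn.functional.linear(x, weight * mask, bias)
```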

**NOTE**: In phase 1, the DeepSpeed user-facing API stays unchanged. The only change the user needs to be aware of is the extended JSON file format.

**Phase 2: Advanced structural pruning with fine-grained sparse ratio control per layer**

This advanced mode adaptively adjusts the sparse ratio per layer to reach higher accuracy.

This phase extends the `initialize()` API to return one more object, `callbacks`, in addition to ``engine``, ``optimizer``, ``training_dataloader`` and ``lr_scheduler``. The JSON config file needs to be adjusted accordingly.

~~~
def initialize(args=None,
               model: torch.nn.Module = None,
               optimizer: Optional[Union[Optimizer,
                                         DeepSpeedOptimizerCallable]] = None,
               model_parameters: Optional[torch.nn.Module] = None,
               training_data: Optional[torch.utils.data.Dataset] = None,
               lr_scheduler: Optional[Union[_LRScheduler,
                                            DeepSpeedSchedulerCallable]] = None,
               mpu=None,
               dist_init_required: Optional[bool] = None,
               collate_fn=None,
               config=None,
               config_params=None):
    # returns a tuple of ``engine``, ``optimizer``, ``training_dataloader``, ``lr_scheduler``, ``callbacks``
~~~

The `callbacks` object returned by the `initialize` function is used to register the user's hooks into the normal training process.

~~~
class callbacks():
    def on_epoch_begin(self, epoch):
        ...

    def on_epoch_end(self):
        ...

    def on_step_begin(self, step):
        ...

    def on_step_end(self):
        ...

    ... # other hooks during training
~~~

The user needs to manually insert these hooks into their training code for fine-grained per-layer sparsity control.

~~~
model, optimizer, _, lr_scheduler, callbacks = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    dist_init_required=True)

for epoch in range(args.num_train_epochs):
    model.train()
    start_time = time.time()
    callbacks.on_epoch_begin(epoch)                # new code
    for step, batch in enumerate(train_dataloader):
        callbacks.on_step_begin(step)              # new code
        batch = to_device(batch, device)
        all_loss = forward_fun(batch, model, teacher_model=teacher_model)
        model.backward(all_loss[0])
        model.step()
        callbacks.on_step_end()                    # new code

    callbacks.on_epoch_end()                       # new code
    ...
~~~

## **Structural Sparsity Results**

<a target="_blank" href="./imgs/sparse_result.png">
<img src="./imgs/sparse_result.png" alt="Extension" width="80%" height="80%">
</a>

## **Recommendation**

We recommend splitting this contribution into two phases:

1. The first phase focuses on adding the full set of structural sparsity methods supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) into DeepSpeed with minor changes.

This provides the complete structural sparsity capability with a global sparse ratio setting. It is easy for customers to use when piloting the structural sparsity feature.

2. The second phase focuses on productivity improvement by supporting adaptive sparse ratio adjustment to cover a broader set of pruning algorithms.

This adds the capability of automatically adjusting the sparse ratio per layer for better accuracy. It can greatly improve productivity for customers who want high sparsity but must meet a strict accuracy goal.

## **Proposal Details on Quantization**

In this proposal, we would like to enhance the quantization functionality by integrating the device-agnostic post-training static & dynamic quantization (QDQ quant format) supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) into DeepSpeed.

As the current DeepSpeed implementation focuses on simulating quantization behavior during training, we propose adding post-training quantization through the changes below.

```json
"compression_inference": {          # new field
  "static_quantization": {          # new field; could be "dynamic_quantization" instead
    "enabled": true                 # new field
  }
}
```

Besides the changes in the compression config file, we also need to introduce a new function, `quantize`, as shown below, to support post-training quantization.

```python
def quantize(model, deepspeed_config, mpu=None, calib_dataloader=None):
    ### The main entry point for post-training quantization
    ###
    ### Args:
    ###   ...
    ###   calib_dataloader: the dataloader for calibration; None for post-training dynamic quantization
    ###
    ### Returns q_model, the QDQ model for deployment
    ...
```

The usage would look like the following:

```python
model.load_state_dict(torch.load(args.path_to_model))
q_model = quantize(model, args.deepspeed_config, calib_dataloader=test_loader)

### deploy this returned QDQ model through TRT or Intel Extension for PyTorch

```

## **Quantization Results**
As for the post-training quantization results, please refer to [this link](https://github.com/intel/neural-compressor/blob/master/docs/source/validated_model_list.md).

## **Future Works**

We have enabled new quantization algorithms such as SmoothQuant in Intel(R) Neural Compressor and applied them to popular large language models such as BLOOM-176B. We plan to bring these new features into the DeepSpeed compression library as part of our future work.
Binary file added rfcs/imgs/inc_vs_ort_quantizer.png
Binary file added rfcs/imgs/linear_example.png
Binary file added rfcs/imgs/sparse_result.png
55 changes: 55 additions & 0 deletions rfcs/olive_rfc.md
## **Summary**

This RFC is a high-level design discussion thread for contributing advanced features in [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), such as post-training optimization and during-training optimization, to Olive.

## **Motivation**

As we know, the Olive repo already supports model conversion from FP32 framework models to FP32 ONNX models and auto-tuning of ONNX Runtime performance-related configurations. To extend the conversion and tuning capability of Olive, we would like to contribute some advanced features from
[Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), validated on a broad set of models, to Olive.

The model optimizations supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) can be classified into two sets:

1. post-training optimization

This includes the `post-training dynamic quantizer` and the `post-training static quantizer`. Compared with the ORT quantizer, Intel Neural Compressor can provide higher accuracy and better performance gains (an illustrative invocation sketch follows this list).

<a target="_blank" href="./imgs/inc_vs_ort_quantizer.png">
<img src="./imgs/inc_vs_ort_quantizer.png" alt="INC vs ORT on dynamic & static quant" width="100%" height="100%">
</a>


2. during-training optimization

This includes `quantization-aware training`, `pruning` and `distillation`, which have been demonstrated on a broad set of models and can deliver roughly 1.x to 25x performance gains, depending on which optimizations are used and on the model structure.
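
As a purely illustrative example of item 1 above (and not part of the proposed Olive integration), the snippet below shows what a post-training quantization call could look like, assuming the Intel Neural Compressor 2.x `quantization.fit` API and a placeholder model path:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# dynamic quantization needs no calibration data; for the static quantizer,
# pass approach="static" and calib_dataloader=<user-provided dataloader>
conf = PostTrainingQuantConfig(approach="dynamic")
q_model = quantization.fit(model="model_fp32.onnx", conf=conf)   # placeholder ONNX model path
q_model.save("model_int8_qdq.onnx")                              # QDQ-format output
```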

## **Proposal**

We plan to contribute features into Olive in two phases.

**Phase 1:** focuses on extending Olive conversion and optimization features.

1. Add INC ONNX dynamic quantizer into Olive auto performance tuning scope

This leverages the existing Olive design and extends the optimization config to support the INC dynamic quantizer.

2. Add INC ONNX FP32 to FP16 converter into Olive model conversion

This provides a device-agnostic model conversion feature (FP32 -> FP16) for Olive.

3. Add INC ONNX static quantizer into Olive auto performance tuning scope

The optimization config needs to be extended to support the INC static quantizer and a calibration dataset.

**Phase 2:** focuses on contributing during-training optimizations into Olive.

1. Add INC quantization aware training into Olive

2. Add INC pruning into Olive

3. Add INC distillation into Olive

4. Add INC accuracy diagnostic and debugging into Olive

5. Add INC visualization into Olive

Please note this proposal is for high-level direction alignment. Any comments or feedback are welcome.