Add RFC dir for submission tracking #560

Closed · wants to merge 9 commits into from
117 changes: 117 additions & 0 deletions rfcs/colossalai.md
## **Summary**

This RFC proposes a unified access interface to the different devices (CPU and GPU) used by the components of ColossalAI. The proposal will also make it easier to add Intel XPU support to ColossalAI in the future.

## **Motivation**

As we know, the major features of ColossalAI are built on Nvidia GPUs and the CUDA stack. This limits ColossalAI's ability to leverage other device types for enabling LLMs.

For example, `utils/cuda.py`, `context/parallel_context.py` and a few other modules already expose separate interfaces for other components to access the `cpu` or `gpu` device. Besides that, many internal components invoke `torch.cuda` explicitly.

We would like to propose a unified device access interface that supports not only Nvidia GPUs but also other device types, such as Intel x86 CPUs and XPUs.

## **Proposals**

*NOTE 1: Currently the proposal mainly focuses on the ColossalAI training part. ColossalAI inference support is out of scope here.*

*NOTE 2: This RFC focuses on the Python API level only. Replacing CUDA kernels with CPU counterparts, and the corresponding upper-layer features such as nn.optimizer, nn.layer, Gemini and so on, are out of scope here; we plan to cover them in separate RFCs/PRs.*

As ColossalAI training is designed to speed up NLP & LLM training through data and model parallelism, it already has a central place to store the `execution context` in `core.global_context`. The first proposal for unifying device access is to extend this `core.global_context` structure to get and set device-related information.

The `engine` and `trainer` user-facing APIs will rely on it to copy tensors to the device the application is running on.

The details are sketched below:

```python
class ParallelContext(metaclass=SingletonMeta):
    ### existing methods
    ...

    ### new methods
    def set_device(self, device_ordinal=None):
        # set the current device
        ...

    def get_device_name(self):
        # get the name of the device
        ...

    def to(self):
        # move or cast the parameter to the specified device
        ...

    def device_count(self):
        # get the number of devices
        ...

    def synchronize(self, device_index=None):
        # wait for / communicate with the specified device
        ...

    def random(self):
        # random number generator
        ...

    def stream(self):
        # get the device stream
        ...

    # other device-related methods; details will be included in PRs.
    ...
```
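
For illustration only, below is a minimal standalone sketch of how such methods could dispatch between a CUDA and a CPU backend. The class and helper names here are assumptions for the sketch, not the final ColossalAI implementation.

```python
import torch

class _DeviceContextSketch:
    """Standalone illustration of the proposed device dispatch (not the real class)."""

    def __init__(self):
        # pick the backend once, when the context is created
        self._device_type = "cuda" if torch.cuda.is_available() else "cpu"

    def set_device(self, device_ordinal=None):
        # bind the process to a specific accelerator; a no-op on CPU
        if self._device_type == "cuda" and device_ordinal is not None:
            torch.cuda.set_device(device_ordinal)

    def get_device_name(self):
        # e.g. "cuda:0" on an Nvidia GPU machine, "cpu" otherwise
        if self._device_type == "cuda":
            return f"cuda:{torch.cuda.current_device()}"
        return "cpu"

    def device_count(self):
        return torch.cuda.device_count() if self._device_type == "cuda" else 1

    def to(self, tensor):
        # move/cast a tensor to the active device
        return tensor.to(self.get_device_name())

    def synchronize(self, device_index=None):
        # wait for outstanding work on the device; CPU ops are already synchronous
        if self._device_type == "cuda":
            torch.cuda.synchronize(device_index)
```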

From the user's view, the training code becomes simpler than before because the user no longer needs to specify a device explicitly. The new logic automatically moves/casts tensors to the device being used. For example, if the underlying hardware is an Nvidia GPU, tensors automatically go to the CUDA device; if the underlying hardware is an Intel x86 CPU, tensors automatically stay on the CPU side.

Below is some sample code to demonstrate this idea.

```python
# The commented-out lines are no longer needed, as engine and trainer will
# automatically copy tensors from host to device.

for epoch in range(gpc.config.NUM_EPOCHS):
    # execute a training iteration
    engine.train()
    for img, label in train_dataloader:
        # img = img.cuda()
        # label = label.cuda()

        # set gradients to zero
        engine.zero_grad()

        # run forward pass
        output = engine(img)

        # compute loss value and run backward pass
        train_loss = engine.criterion(output, label)
        engine.backward(train_loss)

        # update parameters
        engine.step()

    # update learning rate
    lr_scheduler.step()

    # execute a testing iteration
    engine.eval()
    correct = 0
    total = 0
    for img, label in test_dataloader:
        # img = img.cuda()
        # label = label.cuda()

        # run prediction without back-propagation
        with torch.no_grad():
            output = engine(img)
            test_loss = engine.criterion(output, label)

        # compute the number of correct predictions
        pred = torch.argmax(output, dim=-1)
```
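
As a purely illustrative companion to the loop above, here is one way an engine wrapper could hide device placement behind the extended context; the class and attribute names are assumptions, not the proposed ColossalAI API.

```python
class _EngineSketch:
    """Illustration only: how an engine wrapper could hide device placement."""

    def __init__(self, model, criterion, ctx):
        self._model = model          # the wrapped nn.Module
        self._criterion = criterion  # the loss function
        self._ctx = ctx              # extended parallel context exposing `to()`

    def __call__(self, *inputs):
        # copy inputs host-to-device (or leave them on CPU) before the forward pass
        return self._model(*(self._ctx.to(x) for x in inputs))

    def criterion(self, output, target):
        # targets get the same automatic placement as inputs
        return self._criterion(output, self._ctx.to(target))
```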

## **Future Works**

This RFC focuses on discussing the unified device access interface. We will gradually add CPU support to the internal components to make the whole functionality work, following the *TODO* list below.

- [ ] Gemini
- [ ] Communication
- [ ] nn/kernel
- [ ] autochunk/autoparallel
- [ ] zero
- [ ] tensor
193 changes: 193 additions & 0 deletions rfcs/deepspeed_rfc.md
## **Summary**

This is a design discussion RFC for contributing some device-agnostic compression algorithms supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), such as post-training quantization (QDQ quant format) and structural sparsity, into DeepSpeed.

## **Motivation**

As we know, DeepSpeed Compression already supports many useful compression methods, such as layer reduction via knowledge distillation, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning.

However, it still lacks some mainstream compression algorithms, such as post-training static quantization and structural sparsity, which industry has demonstrated to be efficient and popular compression methods.

[Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) has implemented such device-agnostic compression algorithms, and we would like to contribute them to DeepSpeed.

## **Proposal Details on Pruning**

In this proposal, we would like to introduce structural pruning functionality by enabling the "N in M" and "N x M" block sparsity patterns with the snip_momentum criterion and progressive pruning.

We propose a two-phase approach to supporting the structural pruning method.

**Phase 1: structural pruning with global sparse ratio**

This phase leverages the existing DeepSpeed sparsity design, which uses a global sparse ratio control. If the accuracy doesn't meet expectations, the user has to tune the training process as they do on DeepSpeed today, by manually specifying and exploring a proper sparse ratio per layer.

We extend the JSON config file format and implement the structural sparsity algorithm in the `compression` dir, as shown below.

~~~
{
  "sparse_pruning": {
    "shared_parameters": {
      "enabled": true,
      "method": "snip_momentum",          # new value
      "pattern": "4x1",                   # new field
      "dense_ratio": 0.1,
      "gradient_accumulation_steps": 1,   # new field
      "sparsity_decay_type": "exp",       # new field
      "start_step": 0,                    # new field
      "end_step": 10000                   # new field
    },
    "different_groups": {
      "sp1": {
        "params": {
          "dense_ratio": 0.5
        },
        "modules": [
          "attention.self"
        ]
      }
    }
  }
}
~~~
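
To make the new schedule-related fields concrete, below is an illustrative standalone sketch of how `start_step`, `end_step` and `sparsity_decay_type: "exp"` could be turned into a per-step target sparse ratio. The ramp formula itself is an assumption for illustration, not the exact algorithm being contributed.

```python
import math

def target_sparsity(step, start_step, end_step, final_sparsity, decay_type="exp"):
    """Map a training step to a target sparse ratio, ramping from 0 to final_sparsity."""
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / float(end_step - start_step)
    if decay_type == "exp":
        # exponential ramp: prune slowly at first, faster towards end_step
        return final_sparsity * (1.0 - math.exp(-5.0 * progress)) / (1.0 - math.exp(-5.0))
    # fall back to a linear ramp for unrecognized decay types
    return final_sparsity * progress

# with the shared parameters above (dense_ratio 0.1 -> final sparsity 0.9):
print(round(target_sparsity(5000, start_step=0, end_step=10000, final_sparsity=0.9), 3))
```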

As for the structural sparsity implementation in the `compression` dir, take the `LinearLayer_Compress` class in `deepspeed/compression/basic_layer.py` as an example; this class is enhanced as shown below to support the structural sparsity algorithm.

<a target="_blank" href="./imgs/linear_example.png">
<img src="./imgs/linear_example.png" alt="Extension" width="100%" height="100%">
</a>
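
For readers without the image, here is a rough standalone sketch (not the actual `LinearLayer_Compress` code) of how a 4x1 block mask driven by a snip_momentum-style saliency score could be computed and applied to a linear weight; all names and the exact scoring formula are assumptions.

```python
import torch

def block_mask_4x1(weight, grad, sparsity):
    """Keep/prune 4x1 blocks (4 consecutive output channels) by a |w * g| saliency score."""
    out_f, in_f = weight.shape                     # out_f is assumed divisible by 4
    score = (weight * grad).abs()
    block_score = score.reshape(out_f // 4, 4, in_f).sum(dim=1)   # one score per 4x1 block
    k = int(block_score.numel() * sparsity)        # number of blocks to prune
    if k == 0:
        return torch.ones_like(weight)
    threshold = block_score.flatten().kthvalue(k).values
    keep = (block_score > threshold).float()       # 1 = keep block, 0 = prune block
    return keep.repeat_interleave(4, dim=0)        # expand block decisions to weight shape

# inside a forward pass the masked weight would be used as:
# torch.nn.functional.linear(x, weight * mask, bias)
```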

**NOTE**: In phase 1, the DeepSpeed user-facing API stays unchanged. The only change the user needs to be aware of is the extended JSON file format.

**Phase 2: Advanced structural pruning with fine-grained sparse ratio control per layer**

This advanced mode adaptively adjusts the sparse ratio per layer to reach higher accuracy.

This phase extends the `initialize()` API to return one more object, `callbacks`, in addition to ``engine``, ``optimizer``, ``training_dataloader`` and ``lr_scheduler``. The JSON config file needs to be adjusted accordingly.

~~~
def initialize(args=None,
               model: torch.nn.Module = None,
               optimizer: Optional[Union[Optimizer,
                                         DeepSpeedOptimizerCallable]] = None,
               model_parameters: Optional[torch.nn.Module] = None,
               training_data: Optional[torch.utils.data.Dataset] = None,
               lr_scheduler: Optional[Union[_LRScheduler,
                                            DeepSpeedSchedulerCallable]] = None,
               mpu=None,
               dist_init_required: Optional[bool] = None,
               collate_fn=None,
               config=None,
               config_params=None):
    # returns a tuple of ``engine``, ``optimizer``, ``training_dataloader``, ``lr_scheduler``, ``callbacks``
~~~

The `callbacks` object returned by the `initialize` function is used to register the user's hooks into the normal training process.

~~~
class callbacks():
    def on_epoch_begin(self, epoch):
        ...

    def on_epoch_end(self):
        ...

    def on_step_begin(self, step):
        ...

    def on_step_end(self):
        ...

    ... # other hooks during training
~~~

The user needs to manually insert these hooks into their training code for fine-grained per-layer sparsity control.

~~~
model, optimizer, _, lr_scheduler, callbacks = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    dist_init_required=True)

for epoch in range(args.num_train_epochs):
    model.train()
    start_time = time.time()
    callbacks.on_epoch_begin(epoch)                # new code
    for step, batch in enumerate(train_dataloader):
        callbacks.on_step_begin(step)              # new code
        batch = to_device(batch, device)
        all_loss = forward_fun(batch, model, teacher_model=teacher_model)
        model.backward(all_loss[0])
        model.step()
        callbacks.on_step_end()                    # new code

    callbacks.on_epoch_end()                       # new code
    ...
~~~

## **Structural Sparsity Results**

<a target="_blank" href="./imgs/sparse_result.png">
<img src="./imgs/sparse_result.png" alt="Extension" width="80%" height="80%">
</a>

## **Recommendation**

We recommend splitting this contribution into two phases:

1. The first phase focuses on adding the full set of structural sparsity methods supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) into DeepSpeed with minor changes.

This provides the complete structural sparsity capability with a global sparse ratio setting. It is easy for customers to use when piloting the structural sparsity feature.

2. The second phase focuses on productivity improvement by supporting adaptive sparse ratio adjustment to cover a broader set of pruning algorithms.

This adds the capability of automatically adjusting the sparse ratio per layer for better accuracy. It can greatly improve productivity for customers who want high sparsity but must meet a strict accuracy goal.

## **Proposal Details on Quantization**

In this proposal, we would like to enhance the quantization functionality by integrating the device-agnostic post-training static & dynamic quantization (QDQ quant format) supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) into DeepSpeed.

As the current DeepSpeed implementation focuses on simulating quantization behavior during training, we propose adding post-training quantization through the changes below.

```json
"compression_inference": {          # new field
  "static_quantization": {          # new field; could be "dynamic_quantization" instead
    "enabled": true                 # new field
  }
}
```

Besides the changes in the compression config file, we also need to introduce a new function, `quantize`, as shown below, to support post-training quantization.

```python
def quantize(model, deepspeed_config, mpu=None, calib_dataloader=None):
    ### The main entry point for post-training quantization
    ###
    ### Args:
    ###   ...
    ###   calib_dataloader: the dataloader for calibration; None for post-training dynamic quantization
    ###
    ### Returns q_model, the QDQ model for deployment
    ...
```

The usage would look like the following:

```python
model.load_state_dict(torch.load(args.path_to_model))
q_model = quantize(model, args.deepspeed_config, calib_dataloader=test_loader)

### deploy this returned QDQ model through TRT or Intel Extension for PyTorch

```

## **Quantization Results**
As for the post-training quantization results, please refer to [this link](https://github.com/intel/neural-compressor/blob/master/docs/source/validated_model_list.md).

## **Future Works**

We have enabled new quantization algorithms such as SmoothQuant in Intel(R) Neural Compressor and applied them to popular large language models such as BLOOM-176B. We plan to bring these new features into the DeepSpeed compression library as part of our future work.
Binary file added rfcs/imgs/inc_vs_ort_quantizer.png
Binary file added rfcs/imgs/linear_example.png
Binary file added rfcs/imgs/sparse_result.png
55 changes: 55 additions & 0 deletions rfcs/olive_rfc.md
## **Summary**

This RFC is a high-level design discussion thread for contributing advanced features in [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), such as post-training optimization and during-training optimization, to Olive.

## **Motivation**

As we know, the Olive repo already supports model conversion from FP32 framework models to FP32 ONNX models and auto-tuning of ONNX Runtime performance-related configurations. To extend the conversion and tuning capability of Olive, we would like to contribute some advanced features from
[Intel(R) Neural Compressor](https://github.com/intel/neural-compressor), validated on a broad set of models, to Olive.

The model optimizations supported by [Intel(R) Neural Compressor](https://github.com/intel/neural-compressor) can be classified into two sets:

1. post-training optimization

This includes the `post-training dynamic quantizer` and the `post-training static quantizer`. Compared with the ORT quantizer, Intel Neural Compressor can provide higher accuracy and better performance gains (an illustrative invocation sketch follows this list).

<a target="_blank" href="./imgs/inc_vs_ort_quantizer.png">
<img src="./imgs/inc_vs_ort_quantizer.png" alt="INC vs ORT on dynamic & static quant" width="100%" height="100%">
</a>


2. during-training optimization

This includes `quantization-aware training`, `pruning` and `distillation`, which have been demonstrated on a broad set of models and can deliver roughly 1.x to 25x performance gains, depending on which optimizations are used and on the model structure.
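
As a purely illustrative example of item 1 above (and not part of the proposed Olive integration), the snippet below shows what a post-training quantization call could look like, assuming the Intel Neural Compressor 2.x `quantization.fit` API and a placeholder model path:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# dynamic quantization needs no calibration data; for the static quantizer,
# pass approach="static" and calib_dataloader=<user-provided dataloader>
conf = PostTrainingQuantConfig(approach="dynamic")
q_model = quantization.fit(model="model_fp32.onnx", conf=conf)   # placeholder ONNX model path
q_model.save("model_int8_qdq.onnx")                              # QDQ-format output
```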

## **Proposal**

We plan to contribute features into Olive in two phases.

**Phase 1:** focuses on extending Olive conversion and optimization features.

1. Add INC ONNX dynamic quantizer into Olive auto performance tuning scope

This leverages the existing Olive design and extends the optimization config to support the INC dynamic quantizer.

2. Add INC ONNX FP32 to FP16 converter into Olive model conversion

This provides a device-agnostic model conversion feature (FP32 -> FP16) for Olive.

3. Add INC ONNX static quantizer into Olive auto performance tuning scope

The optimization config needs to be extended to support the INC static quantizer and a calibration dataset.

**Phase 2:** focuses on contributing during-training optimizations into Olive.

1. Add INC quantization aware training into Olive

2. Add INC pruning into Olive

3. Add INC distillation into Olive

4. Add INC accuracy diagnostic and debugging into Olive

5. Add INC visualization into Olive

Please note this proposal is for high-level direction alignment. Any comments or feedback are welcome.