minor fix for benchmark.md (#433)
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>
violetch24 and VincyZhang committed Dec 2, 2022
1 parent b6185df commit c2ae3f0
Showing 2 changed files with 68 additions and 46 deletions.
2 changes: 1 addition & 1 deletion docs/benchmark.md
@@ -13,7 +13,7 @@ The benchmark classes `PyTorchBenchmark` and `ExecutorBenchmark` expect an object

## PyTorchBenchmark

`PyTorchBenchmark` is only used for inference when the input model is an INT8 model.

```py
from intel_extension_for_transformers.optimization.benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments
112 changes: 67 additions & 45 deletions docs/data_augmentation.md
Data Augmentation
============

1. [Introduction](#introduction)

2. [Getting Started](#getting-started)

2.1. [Install Dependency](#install-dependency)

2.2. [Install Intel_Extension_for_Transformers](#install-intel_extension_for_transformers)

3. [Data Augmentation](#data-augmentation)

3.1. [Script](#script)

3.2. [Parameters of Data Augmentation](#parameters-of-data-augmentation)

3.3. [Supported Augmenter](#supported-augmenter)

3.4. [Text Generation Augmenter](#text-generation-augmenter)

3.5. [Augmenter Arguments](#augmenter-arguments)

## Introduction
Data Augmentation is a tool to help with augmenting NLP datasets for machine learning projects. This tool integrates [nlpaug](https://github.com/makcedward/nlpaug) and other methods from Intel Lab.

## Getting Started
### Install Dependency
```bash
pip install nlpaug
pip install "transformers>=4.12.0"
```

### Install Intel_Extension_for_Transformers
```bash
git clone https://github.com/intel/intel-extension-for-transformers.git intel_extension_for_transformers
cd intel_extension_for_transformers
git submodule update --init --recursive
python setup.py install

```

## Data Augmentation
### Script
Please refer to [example](tests/test_data_augmentation.py).
```python
import os
from datasets import load_dataset  # Hugging Face datasets
from intel_extension_for_transformers.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")  # result_path: your output directory
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
assert len(raw_datasets) == 10
```

### Parameters of Data Augmentation
|Parameter |Type |Description |Default value |
|:---------|:----|:------------------------------------------------------------------|:-------------|
|augmenter_type|String|Augmentation type |NA |
|input_dataset|String|Dataset name or a csv or a json file |None |
|output_path|String|Saved path and name of augmented data file |"save_path/augmented_dataset.csv"|
|data_config_or_task_name|String|Task name of glue dataset or data configure name |None |
|augmenter_arguments|Dict|Parameters for the augmenter; each augmenter takes different parameters |None|
|column_names|String|The column(s) to augment, used for Python-package datasets|"sentence"|
|split|String|The dataset split to augment, e.g. 'validation' or 'train' |"validation" |
|num_samples|Integer|The number of augmented samples generated per input sentence |1 |
|device|String|Deployment device, "cuda" or "cpu" |1 |
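As a quick sanity check on `num_samples`: each input sentence yields that many augmented samples, so a 10-row split with `num_samples=1` should produce 10 augmented rows, matching the assertion in the script above. A pure-Python sketch with a hypothetical helper name:

```python
def expected_rows(input_rows: int, num_samples: int) -> int:
    # Each input sentence generates num_samples augmented samples.
    return input_rows * num_samples

assert expected_rows(10, 1) == 10
assert expected_rows(5, 2) == 10
```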

### Supported Augmenter
|augmenter_type |augmenter_arguments |default value |
|:--------------|:-------------------------------------------------------------------|:-------------|
|"TextGenerationAug"|refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

#### Text Generation Augmenter
The text generation augment contains the recipe to run data augmentation algorithm based on the conditional text generation using auto-regressive transformer model (like GPT, GPT-2, Transformer-XL, XLNet, CTRL) in order to automatically generate labeled data.
|"TextGenerationAug"|Refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|Refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|Refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|Refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|Refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

### Text Generation Augmenter
The text generation augmenter contains the recipe to run a data augmentation algorithm based on conditional text generation, using auto-regressive transformer models (e.g. GPT, GPT-2, Transformer-XL, XLNet, CTRL) to automatically generate labeled data.
Our approach follows algorithms described by [Not Enough Data? Deep Learning to the Rescue!](https://arxiv.org/abs/1911.03118) and [Natural Language Generation for Effective Knowledge Distillation](https://www.aclweb.org/anthology/D19-6122.pdf).

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both a label and a sentence.
- Prepare datasets:

example:
```python
from datasets import load_dataset
from intel_extension_for_transformers.preprocessing.utils import EOS
# ... (lines collapsed in the diff view)
--overwrite_output_dir
```
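The fine-tuning step above can be pictured as joining each label with its sentence, so the model learns to generate a sentence conditioned on a label. A hedged pure-Python sketch of that formatting — the `EOS` value, separator, and helper name are assumptions for illustration:

```python
# Sketch of preparing fine-tuning samples; the toolkit defines its own EOS
# constant, this value is only an assumption.
EOS = "<|endoftext|>"

def format_for_finetuning(samples):
    """Concatenate each (label, sentence) pair into one training line so the
    LM learns p(sentence | label)."""
    return [f"{label}\t{sentence}{EOS}" for label, sentence in samples]

lines = format_for_finetuning([("positive", "great movie"),
                               ("negative", "waste of time")])
print(lines[0])
```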


- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with below script:
- Secondly, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with below script:
```python
from intel_extension_for_transformers.preprocessing.data_augmentation import DataAugmentation
aug = DataAugmentation(augmenter_type="TextGenerationAug")
# ... (lines collapsed in the diff view)
```

This data augmentation algorithm can be used in several scenarios, like model distillation.
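The generation step amounts to prompting the fine-tuned model with class labels sampled from the training set. A pure-Python sketch of the prompt construction — the tab separator and function name are illustrative assumptions, not the toolkit's API:

```python
import random

def build_prompts(labels, n, seed=0):
    """Sample n class labels (with replacement) and turn each into an LM
    prompt; the fine-tuned model then completes each prompt into a sentence
    belonging to that label."""
    rng = random.Random(seed)
    return [f"{rng.choice(labels)}\t" for _ in range(n)]

prompts = build_prompts(["positive", "negative"], 4)
assert len(prompts) == 4
```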


### Augmenter Arguments
|Parameter |Type|Description |Default value |
|:---------|:---|:---------------------------------------------------|:-------------|
|"model_name_or_path"|String|Language modeling model to generate data, refer to [line](intel_extension_for_transformers/preprocessing/data_augmentation.py#L181)|NA|
|"stop_token"|String|Stop token used in input data file |[EOS](intel_extension_for_transformers/preprocessing/utils.py#L7)|
|"k"|float|Top-k sampling value |0.0|
|"p"|float|Top-p (nucleus) sampling value |0.9|
|"repetition_penalty"|float|Repetition penalty |1.0|
