minor fix for benchmark.md (#433)
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>
violetch24 and VincyZhang committed Dec 2, 2022
1 parent b6185df commit c2ae3f0
Showing 2 changed files with 68 additions and 46 deletions.
2 changes: 1 addition & 1 deletion docs/benchmark.md
@@ -13,7 +13,7 @@ The benchmark classes `PyTorchBenchmark` and `ExecutorBenchmark` expect an object

## PyTorchBenchmark

`PyTorchBenchmark` is only used for inference when the input model is an INT8 model.

```py
from intel_extension_for_transformers.optimization.benchmark import PyTorchBenchmark, PyTorchBenchmarkArguments
112 changes: 67 additions & 45 deletions docs/data_augmentation.md
Data Augmentation
============

1. [Introduction](#introduction)

2. [Getting Started](#getting-started)

2.1. [Install Dependency](#install-dependency)

2.2. [Install Intel_Extension_for_Transformers](#install-intel_extension_for_transformers)

3. [Data Augmentation](#data-augmentation)

3.1. [Script](#script)

3.2. [Parameters of Data Augmentation](#parameters-of-data-augmentation)

3.3. [Supported Augmenter](#supported-augmenter)

3.4. [Text Generation Augmenter](#text-generation-augmenter)

3.5. [Augmenter Arguments](#augmenter-arguments)

## Introduction
Data Augmentation is a tool to help with augmenting NLP datasets for machine learning projects. This tool integrates [nlpaug](https://github.com/makcedward/nlpaug) and other methods from Intel Lab.

## Getting Started
### Install Dependency
```bash
pip install nlpaug
pip install "transformers>=4.12.0"
```

### Install Intel_Extension_for_Transformers
```bash
git clone https://github.com/intel/intel-extension-for-transformers.git intel_extension_for_transformers
cd intel_extension_for_transformers
git submodule update --init --recursive
python setup.py install

```

## Data Augmentation
### Script
Please refer to [example](tests/test_data_augmentation.py).
```python
import os
from datasets import load_dataset  # Hugging Face datasets
from intel_extension_for_transformers.preprocessing.data_augmentation import DataAugmentation

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")  # result_path: your output directory
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
assert len(raw_datasets) == 10
```

### Parameters of Data Augmentation
|Parameter |Type |Description |Default value |
|:---------|:----|:------------------------------------------------------------------|:-------------|
|augmenter_type|String|Augmentation type |NA |
|input_dataset|String|Dataset name or a csv or a json file |None |
|output_path|String|Saved path and name of augmented data file |"save_path/augmented_dataset.csv"|
|data_config_or_task_name|String|Task name of glue dataset or data configure name |None |
|augmenter_arguments|Dict|Parameters for the augmenter; each augmenter takes different parameters |None|
|column_names|String|The column(s) to augment, used for Python-package datasets|"sentence"|
|split|String|The dataset split to augment, e.g. 'validation' or 'train' |"validation" |
|num_samples|Integer|The number of augmented samples generated per input sentence |1 |
|device|String|Deployment device, "cuda" or "cpu" |1 |
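As a quick sanity check on `num_samples`: each input sentence yields that many augmented samples, so a 10-row split with `num_samples=1` should produce 10 augmented rows, matching the assertion in the script above. A pure-Python sketch with a hypothetical helper name:

```python
def expected_rows(input_rows: int, num_samples: int) -> int:
    # Each input sentence generates num_samples augmented samples.
    return input_rows * num_samples

assert expected_rows(10, 1) == 10
assert expected_rows(5, 2) == 10
```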

### Supported Augmenter
|augmenter_type |augmenter_arguments |default value |
|:--------------|:-------------------------------------------------------------------|:-------------|
|"TextGenerationAug"|refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

#### Text Generation Augmenter
The text generation augment contains the recipe to run data augmentation algorithm based on the conditional text generation using auto-regressive transformer model (like GPT, GPT-2, Transformer-XL, XLNet, CTRL) in order to automatically generate labeled data.
|"TextGenerationAug"|Refer to "Text Generation Augmenter" field in this document |NA |
|"KeyboardAug"|Refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46) |NA |
|"OcrAug"|Refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38) |NA |
|"SpellingAug"|Refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49) |NA |
|"ContextualWordEmbsForSentenceAug"|Refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77) | |

### Text Generation Augmenter
The text generation augmenter contains the recipe to run a data augmentation algorithm based on conditional text generation, using auto-regressive transformer models (e.g. GPT, GPT-2, Transformer-XL, XLNet, CTRL) to automatically generate labeled data.
Our approach follows algorithms described by [Not Enough Data? Deep Learning to the Rescue!](https://arxiv.org/abs/1911.03118) and [Natural Language Generation for Effective Knowledge Distillation](https://www.aclweb.org/anthology/D19-6122.pdf).

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both a label and a sentence.
- Prepare datasets:

example:
```python
from datasets import load_dataset
from intel_extension_for_transformers.preprocessing.utils import EOS
# ... (lines collapsed in the diff view)
--overwrite_output_dir
```
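The fine-tuning step above can be pictured as joining each label with its sentence, so the model learns to generate a sentence conditioned on a label. A hedged pure-Python sketch of that formatting — the `EOS` value, separator, and helper name are assumptions for illustration:

```python
# Sketch of preparing fine-tuning samples; the toolkit defines its own EOS
# constant, this value is only an assumption.
EOS = "<|endoftext|>"

def format_for_finetuning(samples):
    """Concatenate each (label, sentence) pair into one training line so the
    LM learns p(sentence | label)."""
    return [f"{label}\t{sentence}{EOS}" for label, sentence in samples]

lines = format_for_finetuning([("positive", "great movie"),
                               ("negative", "waste of time")])
print(lines[0])
```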


- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with below script:
- Secondly, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to predict sentences with below script:
```python
from intel_extension_for_transformers.preprocessing.data_augmentation import DataAugmentation
aug = DataAugmentation(augmenter_type="TextGenerationAug")
# ... (lines collapsed in the diff view)
```

This data augmentation algorithm can be used in several scenarios, like model distillation.
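The generation step amounts to prompting the fine-tuned model with class labels sampled from the training set. A pure-Python sketch of the prompt construction — the tab separator and function name are illustrative assumptions, not the toolkit's API:

```python
import random

def build_prompts(labels, n, seed=0):
    """Sample n class labels (with replacement) and turn each into an LM
    prompt; the fine-tuned model then completes each prompt into a sentence
    belonging to that label."""
    rng = random.Random(seed)
    return [f"{rng.choice(labels)}\t" for _ in range(n)]

prompts = build_prompts(["positive", "negative"], 4)
assert len(prompts) == 4
```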


### Augmenter Arguments
|Parameter |Type|Description |Default value |
|:---------|:---|:---------------------------------------------------|:-------------|
|"model_name_or_path"|String|Language modeling model to generate data, refer to [line](intel_extension_for_transformers/preprocessing/data_augmentation.py#L181)|NA|
|"stop_token"|String|Stop token used in input data file |[EOS](intel_extension_for_transformers/preprocessing/utils.py#L7)|
|"k"|float|Top-k sampling value |0.0|
|"p"|float|Top-p (nucleus) sampling value |0.9|
|"repetition_penalty"|float|Repetition penalty |1.0|
