Adding the huggingface Finetuning and inference NLP reference workflow (#829)
Showing 21 changed files with 1,232 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Workflow purpose | ||
The Hugging Face Fine-tuning (transfer learning) and Inference workflow demonstrates NLP (natural language processing) pipelines that use the Hugging Face Transformers API together with Intel-optimized software (toolkits, domain kits, packages, frameworks, and other libraries) to make effective use of Intel hardware and Intel's AI instructions for faster processing. The workflows can be used directly by applications or by reference kits that showcase their usage. | ||
|
||
The workflow currently supports: | ||
``` | ||
Huggingface NLP Finetuning / Transfer Learning | ||
Huggingface NLP Inference | ||
``` | ||
The HF Fine-tuning and Inference workflow supports the following APIs: | ||
``` | ||
Hugging Face Transformers (Trainer API) | ||
Intel Extension for Transformers API (ITREX API), also named Intel's Transformer/NLP Toolkit | ||
``` | ||
|
||
### Architecture | ||
![Reference_Workflow](assets/HFFinetuningAndInference.png) | ||
|
||
|
||
# Get Started | ||
### Clone this Repository | ||
``` | ||
git clone <URL of this repository> | ||
cd <cloned repository directory> | ||
``` | ||
|
||
### Create a new Python (conda or venv) environment named "hfftinf_wf" | ||
```shell | ||
conda create -n hfftinf_wf python=3.9 | ||
conda activate hfftinf_wf | ||
``` | ||
or | ||
```shell | ||
python -m venv hfftinf_wf | ||
source hfftinf_wf/bin/activate | ||
``` | ||
|
||
### Install the packages required to run hf-finetuning-inference-nlp-workflows | ||
```shell | ||
pip install -r requirements.txt | ||
``` | ||
|
||
## Running | ||
See config/README.md for options. | ||
```shell | ||
python src/run.py --config_file config/finetune.yaml | ||
python src/run.py --config_file config/inference.yaml | ||
``` | ||
|
||
|
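The two run commands above differ only in the config file they pass, which suggests that the entry point picks a pipeline from the `pipeline` key. This commit view does not show `src/run.py`, so the following is only a minimal sketch of such a dispatch under that assumption. The names `run_finetune`, `run_inference`, and `dispatch` are hypothetical, and the settings are passed as a plain dict rather than loaded from YAML.

```python
from typing import Callable, Dict


def run_finetune(args: dict) -> str:
    # Hypothetical stand-in for the fine-tuning pipeline.
    return f"finetune with {args['finetune_impl']}"


def run_inference(args: dict) -> str:
    # Hypothetical stand-in for the inference pipeline.
    return f"inference with {args['infer_impl']}"


# Maps the value of the `pipeline` key to the function that handles it.
PIPELINES: Dict[str, Callable[[dict], str]] = {
    "finetune": run_finetune,
    "inference": run_inference,
}


def dispatch(args: dict) -> str:
    """Select and run a pipeline based on the `pipeline` key."""
    name = args.get("pipeline")
    if name not in PIPELINES:
        raise ValueError(
            f"unknown pipeline {name!r}; expected one of {sorted(PIPELINES)}"
        )
    return PIPELINES[name](args)


if __name__ == "__main__":
    print(dispatch({"pipeline": "finetune", "finetune_impl": "itrex"}))
```

A lookup table keeps the set of valid `pipeline` values in one place, so an unsupported value fails early with a readable message.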
Binary file added
BIN
+180 KB
workflows/hf_finetuning_and_inference_nlp/assets/HFFinetuningAndInference.png
74 changes: 74 additions & 0 deletions
workflows/hf_finetuning_and_inference_nlp/config/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
# YAML Config | ||
|
||
Fine Tuning | ||
``` | ||
model_name_or_path : Path to pretrained model or model identifier from huggingface.co/models. | ||
tokenizer_name: Pretrained tokenizer name or path if not the same as model_name. | ||
dataset: Local or Huggingface datasets name. | ||
""" Required only when dataset: 'local' """ | ||
local_dataset: | ||
finetune_input : Input filename in case of a local dataset. | ||
delimiter: File delimiter. | ||
features: | ||
class_label: Label column name. | ||
data_column: Data column name. | ||
id: Id column name. | ||
label_list: List of class labels. | ||
pipeline: The pipeline to use. 'finetune' in this case. | ||
finetune_impl: The implementation of the fine-tuning pipeline. Currently supports the trainer and itrex implementations. | ||
dtype_ft: Data type for the fine-tuning pipeline. Supports fp32 and bf16 on CPU; fp32, tf32, and fp16 on GPU. | ||
max_seq_len: The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. | ||
smoke_test: Whether to execute in sanity check mode. | ||
max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value if set. | ||
max_test_samples: For debugging purposes or quicker testing, truncate the number of testing examples to this value if set. | ||
preprocessing_num_workers: The number of processes to use for the preprocessing. | ||
overwrite_cache: Overwrite the cached training and evaluation sets. | ||
finetune_output: Path of file to write output results. | ||
training_args: | ||
num_train_epochs: Number of epochs to run. | ||
do_train: Whether to run training. | ||
do_predict: Whether to run predictions. | ||
per_device_train_batch_size: Batch size per device during training. | ||
per_device_eval_batch_size: Batch size per device during evaluation. | ||
output:dir: Output directory. | ||
``` | ||
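The descriptions of `max_seq_len`, `max_train_samples`, and `max_test_samples` above can be illustrated with a small sketch. It is not the workflow's code: real Hugging Face tokenizers also handle special tokens and attention masks, and the helper names and the padding id of 0 are assumptions made for this example.

```python
from typing import List, Optional, Sequence


def pad_or_truncate(token_ids: List[int], max_seq_len: int, pad_id: int = 0) -> List[int]:
    """Return exactly `max_seq_len` ids: cut long inputs, right-pad short ones.

    `pad_id=0` is an assumption for illustration; the real value depends on
    the tokenizer in use.
    """
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    clipped = token_ids[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))


def limit_samples(samples: Sequence, max_samples: Optional[int]) -> Sequence:
    """Keep only the first `max_samples` items; `None` (YAML null) keeps all."""
    if max_samples is None:
        return samples
    return samples[:max_samples]


if __name__ == "__main__":
    print(pad_or_truncate([101, 2023, 102], max_seq_len=5))
    print(limit_samples(list(range(10)), max_samples=3))
```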
|
||
Inference | ||
``` | ||
model_name_or_path : Path to pretrained model or model identifier from huggingface.co/models. | ||
tokenizer_name: Pretrained tokenizer name or path if not the same as model_name. | ||
dataset: Local or Huggingface datasets name. | ||
""" Required only when dataset: 'local' """ | ||
local_dataset: | ||
inference_input : Input filename in case of a local dataset. | ||
delimiter: File delimiter. | ||
features: | ||
class_label: Label column name. | ||
data_column: Data column name. | ||
id: Id column name. | ||
label_list: List of class labels. | ||
pipeline: The pipeline to use. 'inference' in this case. | ||
infer_impl: The implementation of the inference pipeline. Currently supports the trainer and itrex implementations. | ||
dtype_inf: Data type for the inference pipeline. Supports fp32 and bf16 on CPU; fp32, tf32, and fp16 on GPU. | ||
max_seq_len: The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. | ||
smoke_test: Whether to execute in sanity check mode. | ||
max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value if set. | ||
max_test_samples: For debugging purposes or quicker testing, truncate the number of testing examples to this value if set. | ||
preprocessing_num_workers: The number of processes to use for the preprocessing. | ||
overwrite_cache: Overwrite the cached training and evaluation sets. | ||
inference_output: Path of file to write output results. | ||
multi_instance: Whether to use multi-instance mode. | ||
training_args: | ||
num_train_epochs: Number of epochs to run. | ||
do_train: Whether to run training. | ||
do_predict: Whether to run predictions. | ||
per_device_train_batch_size: Batch size per device during training. | ||
per_device_eval_batch_size: Batch size per device during evaluation. | ||
output:dir: Output directory. | ||
``` |
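The `dtype_ft` and `dtype_inf` entries list different data types for CPU and GPU. A minimal sketch of checking a requested value against that list could look as follows. The `device` argument and the function name are hypothetical, since the config shown here has no device key and the commit view does not show how the workflow selects a device.

```python
# Supported data types per device, as stated in the option descriptions.
SUPPORTED_DTYPES = {
    "cpu": {"fp32", "bf16"},
    "gpu": {"fp32", "tf32", "fp16"},
}


def check_dtype(dtype: str, device: str) -> str:
    """Return `dtype` unchanged if it is listed for `device`, else raise."""
    if device not in SUPPORTED_DTYPES:
        raise ValueError(f"unknown device {device!r}")
    if dtype not in SUPPORTED_DTYPES[device]:
        allowed = ", ".join(sorted(SUPPORTED_DTYPES[device]))
        raise ValueError(f"{dtype!r} is not supported on {device}; use one of: {allowed}")
    return dtype


if __name__ == "__main__":
    print(check_dtype("bf16", "cpu"))
```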
24 changes: 24 additions & 0 deletions
workflows/hf_finetuning_and_inference_nlp/config/finetune.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
args: | ||
model_name_or_path: "bert-base-uncased" # input the fine-tuned model path | ||
tokenizer_name: "bert-base-uncased" # input the fine-tuned model path | ||
dataset: "imdb" # local or huggingface datasets name | ||
|
||
# Add the fine tuning configurations below | ||
pipeline: "finetune" | ||
finetune_impl: "itrex" | ||
dtype_ft: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
finetune_output: finetune_predictions_report.yaml | ||
|
||
training_args: | ||
num_train_epochs: 1 | ||
do_train: true | ||
do_predict: true | ||
per_device_train_batch_size: 100 | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
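As a rough guide to what `per_device_train_batch_size: 100` and `num_train_epochs: 1` mean for this config, the number of optimizer steps can be estimated. The sketch below assumes a single device, no gradient accumulation, a kept final partial batch, and the 25,000-example training split of the `imdb` dataset on the Hugging Face Hub; none of these assumptions is stated in the commit itself.

```python
import math


def steps_per_epoch(num_samples: int, per_device_batch_size: int, num_devices: int = 1) -> int:
    """Number of batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / (per_device_batch_size * num_devices))


IMDB_TRAIN_SAMPLES = 25_000  # assumption: size of the public imdb train split

if __name__ == "__main__":
    # max_train_samples is null in the config, so all samples are used.
    print(steps_per_epoch(IMDB_TRAIN_SAMPLES, per_device_batch_size=100))  # 250
```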
24 changes: 24 additions & 0 deletions
...etuning_and_inference_nlp/config/finetune_Model-bertbase_Task-sentiment_Dataset-imdb.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
args: | ||
model_name_or_path: "bert-base-uncased" # input the fine-tuned model path | ||
tokenizer_name: "bert-base-uncased" # input the fine-tuned model path | ||
dataset: "imdb" # local or huggingface datasets name | ||
|
||
# Add the fine tuning configurations below | ||
pipeline: "finetune" | ||
finetune_impl: "itrex" | ||
dtype_ft: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
finetune_output: finetune_predictions_report.yaml | ||
|
||
training_args: | ||
num_train_epochs: 1 | ||
do_train: true | ||
do_predict: true | ||
per_device_train_batch_size: 100 | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
36 changes: 36 additions & 0 deletions
...ce_nlp/config/finetune_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
args: | ||
model_name_or_path: "emilyalsentzer/Bio_ClinicalBERT" | ||
tokenizer_name: "emilyalsentzer/Bio_ClinicalBERT" | ||
dataset: "local" # local or huggingface datasets name | ||
|
||
# Add local dataset configurations below. Skip for HF datasets. | ||
# Make sure to specify your local dataset. The code will fail otherwise. | ||
local_dataset: | ||
finetune_input : '/workspace/dataset/annotation.csv' | ||
inference_input : '/workspace/dataset/annotation.csv' | ||
delimiter: "," | ||
features: | ||
class_label: "label" | ||
data_column: "symptoms" | ||
id: "Patient_ID" | ||
label_list: ["Malignant", "Normal", "Benign"] | ||
|
||
# Add the fine tuning configurations below | ||
pipeline: "finetune" | ||
finetune_impl: "itrex" | ||
dtype_ft: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
finetune_output: finetune_predictions_report.yaml | ||
|
||
training_args: | ||
num_train_epochs: 1 | ||
do_train: true | ||
do_predict: true | ||
per_device_train_batch_size: 100 | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
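The `local_dataset` settings above (delimiter, column names, and label list) describe how a local CSV file is meant to be interpreted. The following sketch shows one way such settings could be applied with the standard `csv` module. It is an illustration rather than the workflow's loader: `load_local_dataset` is a hypothetical helper, the mapping of labels to integer ids by list position is an assumption, and the sample rows are made up.

```python
import csv
import io
from typing import Dict, List, TextIO


def load_local_dataset(
    handle: TextIO,
    delimiter: str,
    features: Dict[str, str],
    label_list: List[str],
) -> List[dict]:
    """Read delimited rows and map each class label to its index in `label_list`."""
    label_to_id = {label: index for index, label in enumerate(label_list)}
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        rows.append(
            {
                "id": record[features["id"]],
                "text": record[features["data_column"]],
                "label": label_to_id[record[features["class_label"]]],
            }
        )
    return rows


# Made-up sample data that follows the column names from the config.
SAMPLE_CSV = (
    "Patient_ID,symptoms,label\n"
    "P1,example text A,Malignant\n"
    "P2,example text B,Normal\n"
)

FEATURES = {"class_label": "label", "data_column": "symptoms", "id": "Patient_ID"}
LABELS = ["Malignant", "Normal", "Benign"]

if __name__ == "__main__":
    for row in load_local_dataset(io.StringIO(SAMPLE_CSV), ",", FEATURES, LABELS):
        print(row)
```

A label that is missing from `label_list` raises a `KeyError` here, which surfaces config and data mismatches before training starts.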
32 changes: 32 additions & 0 deletions
workflows/hf_finetuning_and_inference_nlp/config/inference.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
args: | ||
model_name_or_path: "bert-base-uncased" # input the fine-tuned model path | ||
tokenizer_name: "bert-base-uncased" # input the fine-tuned model path | ||
dataset: "imdb" # local or huggingface datasets name | ||
|
||
# Add local dataset configurations below. Skip for HF datasets. | ||
local_dataset: | ||
inference_input : '/workspace/dataset/annotation.csv' | ||
delimiter: "," | ||
features: | ||
class_label: "label" | ||
data_column: "symptoms" | ||
id: "Patient_ID" | ||
label_list: ["Malignant", "Normal", "Benign"] | ||
|
||
# Add the Inference configurations below | ||
pipeline: "inference" | ||
infer_impl: "itrex" | ||
dtype_inf: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
inference_output: inference_predictions_report.yaml | ||
multi_instance: false | ||
|
||
training_args: | ||
do_predict: true | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
32 changes: 32 additions & 0 deletions
...tuning_and_inference_nlp/config/inference_Model-bertbase_Task-sentiment_Dataset-imdb.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
args: | ||
model_name_or_path: "bert-base-uncased" # input the fine-tuned model path | ||
tokenizer_name: "bert-base-uncased" # input the fine-tuned model path | ||
dataset: "imdb" # local or huggingface datasets name | ||
|
||
# Add local dataset configurations below. Skip for HF datasets. | ||
local_dataset: | ||
inference_input : '/workspace/dataset/annotation.csv' | ||
delimiter: "," | ||
features: | ||
class_label: "label" | ||
data_column: "symptoms" | ||
id: "Patient_ID" | ||
label_list: ["Malignant", "Normal", "Benign"] | ||
|
||
# Add the Inference configurations below | ||
pipeline: "inference" | ||
infer_impl: "itrex" | ||
dtype_inf: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
inference_output: inference_predictions_report.yaml | ||
multi_instance: false | ||
|
||
training_args: | ||
do_predict: true | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
33 changes: 33 additions & 0 deletions
...e_nlp/config/inference_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
args: | ||
model_name_or_path: "./models/hls/" # input the fine-tuned model path | ||
tokenizer_name: "./models/hls/" # input the fine-tuned model path | ||
dataset: "local" # local or huggingface datasets name | ||
|
||
# Add local dataset configurations below. Skip for HF datasets. | ||
# Make sure to specify your local dataset. The code will fail otherwise. | ||
local_dataset: | ||
inference_input : '/workspace/dataset/annotation.csv' | ||
delimiter: "," | ||
features: | ||
class_label: "label" | ||
data_column: "symptoms" | ||
id: "Patient_ID" | ||
label_list: ["Malignant", "Normal", "Benign"] | ||
|
||
# Add the Inference configurations below | ||
pipeline: "inference" | ||
infer_impl: "itrex" | ||
dtype_inf: "fp32" | ||
max_seq_len: 64 | ||
smoke_test: false | ||
max_train_samples: null | ||
max_test_samples: null | ||
preprocessing_num_workers: 8 | ||
overwrite_cache: true | ||
inference_output: inference_predictions_report.yaml | ||
multi_instance: false | ||
|
||
training_args: | ||
do_predict: true | ||
per_device_eval_batch_size: 100 | ||
output:dir: "/.output" |
15 changes: 15 additions & 0 deletions
workflows/hf_finetuning_and_inference_nlp/docker/Dockerfile
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
FROM intel/intel-optimized-pytorch:pip-ipex-1.13.100-ubuntu-22.04 | ||
|
||
RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \ | ||
build-essential \ | ||
libgl1-mesa-glx \ | ||
libglib2.0-0 \ | ||
python3-dev | ||
|
||
RUN mkdir -p /workspace/output | ||
|
||
COPY . /workspace | ||
|
||
WORKDIR /workspace | ||
|
||
RUN python -m pip install --no-cache-dir -r /workspace/requirements.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
#!/bin/bash | ||
|
||
# Copyright (C) 2022 Intel Corporation | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions | ||
# and limitations under the License. | ||
# | ||
|
||
pip install -r requirements.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
transformers==4.26.0 | ||
datasets==2.11.0 | ||
neural-compressor==2.1 | ||
--extra-index-url https://download.pytorch.org/whl/cpu | ||
torch==1.13.1 | ||
intel_extension_for_pytorch==1.13.100 | ||
intel-extension-for-transformers==1.0.0 |
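Because the requirements above pin exact versions, it can help to confirm that an environment actually matches them. The sketch below parses the pins and compares them with the installed distributions through the standard `importlib.metadata` module. The helper names are hypothetical, and stripping a local version suffix such as `+cpu` before comparing is an assumption made for this example.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

REQUIREMENTS = """\
transformers==4.26.0
datasets==2.11.0
neural-compressor==2.1
--extra-index-url https://download.pytorch.org/whl/cpu
torch==1.13.1
intel_extension_for_pytorch==1.13.100
intel-extension-for-transformers==1.0.0
"""


def parse_pins(text: str) -> Dict[str, str]:
    """Collect `name==version` pins, skipping blanks, comments, and pip options."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            # Drop a local version suffix such as "+cpu" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, installed) in find_mismatches(parse_pins(REQUIREMENTS)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```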
Empty file.