Adding the Hugging Face fine-tuning and inference NLP reference workflow (
nammbash committed Apr 21, 2023
1 parent 4cb9e4f commit bf666c0
Showing 21 changed files with 1,232 additions and 0 deletions.
49 changes: 49 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/README.md
@@ -0,0 +1,49 @@
# Workflow purpose
The Hugging Face fine-tuning (transfer learning) and inference workflow demonstrates NLP (natural language processing) pipelines built on the Hugging Face Transformers API and run with Intel-optimized software (toolkits, domain kits, packages, frameworks, and other libraries) to make effective use of Intel hardware, leveraging Intel's AI instructions for fast processing and increased performance. The workflows can easily be reused by applications or reference kits.

The workflow currently supports:
```
Hugging Face NLP fine-tuning / transfer learning
Hugging Face NLP inference
```
The HF fine-tuning and inference workflow supports the following APIs:
```
Hugging Face Transformers Trainer API
Intel Extension for Transformers (ITREX) API, also known as Intel's Transformer/NLP Toolkit
```
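
For orientation, the snippet below is a minimal, illustrative sketch of Trainer-based fine-tuning with the Hugging Face API, using the same model and dataset as the sample config. The workflow itself drives this through `src/run.py` and the YAML files described in `config/README.md`, so treat this only as background on the underlying API:

```python
# Illustrative sketch of Hugging Face Trainer fine-tuning (not the workflow's own code).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # mirrors config/finetune.yaml
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")  # same dataset as the sample config

def tokenize(batch):
    # Truncate/pad to the configured max_seq_len (64 in the sample config).
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./output", num_train_epochs=1,
                           per_device_train_batch_size=100),
    train_dataset=tokenized["train"].select(range(1000)),  # small slice so the sketch runs quickly
    eval_dataset=tokenized["test"].select(range(1000)),
)
trainer.train()
print(trainer.predict(tokenized["test"].select(range(100))).metrics)
```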

### Architecture
![Reference_Workflow](assets/HFFinetuningAndInference.png)


# Get Started
### Clone this Repository
```
git clone <repository-url>
cd <repository>/workflows/hf_finetuning_and_inference_nlp
```

### Create a new Python environment (Conda or venv) named "hfftinf_wf"
```shell
conda create -n hfftinf_wf python=3.9
conda activate hfftinf_wf
```
or
```shell
python -m venv hfftinf_wf
source hfftinf_wf/bin/activate
```

### Install the packages required to run the hf-finetuning-inference-nlp workflow
```shell
pip install -r requirements.txt
```
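
As an optional sanity check (not part of the repository), you can confirm that the CPU build of PyTorch and the Intel extensions installed correctly:

```shell
python -c "import torch, intel_extension_for_pytorch as ipex, transformers; print(torch.__version__, ipex.__version__, transformers.__version__)"
```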

## Running
See `config/README.md` for the available configuration options.
```shell
python src/run.py --config_file config/finetune.yaml
python src/run.py --config_file config/inference.yaml
```
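
As a rough illustration of the inference side (again, not the workflow's own code), a fine-tuned sequence-classification checkpoint can be exercised directly with the Transformers `pipeline` API. The model name below is a public checkpoint used purely as a stand-in; in practice you would point it at the directory your fine-tuning run wrote to:

```python
# Illustrative sketch of plain Transformers inference on a classification checkpoint.
from transformers import pipeline

# Stand-in public checkpoint; replace with your own fine-tuned model directory.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier(["This movie was surprisingly good.", "Not worth the ticket."]))
```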


(Binary file added: assets/HFFinetuningAndInference.png, the architecture diagram referenced above; not rendered in the diff view.)
74 changes: 74 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/config/README.md
@@ -0,0 +1,74 @@
# YAML Config

Fine-tuning
```
model_name_or_path: Path to a pretrained model or a model identifier from huggingface.co/models.
tokenizer_name: Pretrained tokenizer name or path, if not the same as model_name_or_path.
dataset: "local" or the name of a Hugging Face dataset.
# The local_dataset block is required only when dataset: "local"
local_dataset:
  finetune_input: Input filename for the local dataset.
  delimiter: File delimiter.
  features:
    class_label: Label column name.
    data_column: Data column name.
    id: ID column name.
  label_list: List of class labels.
pipeline: The pipeline to use; "finetune" in this case.
finetune_impl: Implementation of the fine-tuning pipeline. "trainer" and "itrex" are currently supported.
dtype_ft: Data type for the fine-tuning pipeline. Supports fp32 and bf16 on CPU, and fp32, tf32, and fp16 on GPU.
max_seq_len: Maximum total input sequence length after tokenization. Longer sequences are truncated; shorter ones are padded.
smoke_test: Whether to execute in sanity-check mode.
max_train_samples: If set, truncate the number of training examples to this value (for debugging or quicker training).
max_test_samples: If set, truncate the number of testing examples to this value (for debugging or quicker testing).
preprocessing_num_workers: Number of processes to use for preprocessing.
overwrite_cache: Whether to overwrite the cached training and evaluation sets.
finetune_output: Path of the file where output results are written.
training_args:
  num_train_epochs: Number of epochs to run.
  do_train: Whether to run training.
  do_predict: Whether to run predictions.
  per_device_train_batch_size: Batch size per device during training.
  per_device_eval_batch_size: Batch size per device during evaluation.
  output_dir: Output directory.
```
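
To show how these keys line up with the Hugging Face `TrainingArguments`, here is a hedged sketch of loading such a file with PyYAML; the actual loader in `src/run.py` may work differently, so this is only meant to make the mapping concrete:

```python
# Sketch only: reading a config like config/finetune.yaml and building TrainingArguments.
import yaml
from transformers import TrainingArguments

with open("config/finetune.yaml") as f:
    args = yaml.safe_load(f)["args"]

training_args = TrainingArguments(**args["training_args"])
print(args["pipeline"], args["finetune_impl"], training_args.num_train_epochs)
```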

Inference
```
model_name_or_path: Path to a pretrained model or a model identifier from huggingface.co/models.
tokenizer_name: Pretrained tokenizer name or path, if not the same as model_name_or_path.
dataset: "local" or the name of a Hugging Face dataset.
# The local_dataset block is required only when dataset: "local"
local_dataset:
  inference_input: Input filename for the local dataset.
  delimiter: File delimiter.
  features:
    class_label: Label column name.
    data_column: Data column name.
    id: ID column name.
  label_list: List of class labels.
pipeline: The pipeline to use; "inference" in this case.
infer_impl: Implementation of the inference pipeline. "trainer" and "itrex" are currently supported.
dtype_inf: Data type for the inference pipeline. Supports fp32 and bf16 on CPU, and fp32, tf32, and fp16 on GPU.
max_seq_len: Maximum total input sequence length after tokenization. Longer sequences are truncated; shorter ones are padded.
smoke_test: Whether to execute in sanity-check mode.
max_train_samples: If set, truncate the number of training examples to this value (for debugging or quicker training).
max_test_samples: If set, truncate the number of testing examples to this value (for debugging or quicker testing).
preprocessing_num_workers: Number of processes to use for preprocessing.
overwrite_cache: Whether to overwrite the cached training and evaluation sets.
inference_output: Path of the file where output results are written.
multi_instance: Whether to use multi-instance mode.
training_args:
  num_train_epochs: Number of epochs to run.
  do_train: Whether to run training.
  do_predict: Whether to run predictions.
  per_device_train_batch_size: Batch size per device during training.
  per_device_eval_batch_size: Batch size per device during evaluation.
  output_dir: Output directory.
```
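
To make the `local_dataset` block concrete, here is a hedged sketch of loading a matching CSV with the `datasets` library; the column names mirror the sample configs and are assumptions for your own data, and the workflow's own loading code may differ:

```python
# Sketch only: loading a local CSV shaped like the local_dataset block above.
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"train": "/workspace/dataset/annotation.csv"},  # finetune_input / inference_input
    delimiter=",",
)
# Convert the string labels (e.g. "Malignant", "Normal", "Benign") into a ClassLabel feature.
dataset = dataset.class_encode_column("label")  # "label" is the configured class_label column
print(dataset["train"].features)
```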
24 changes: 24 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/config/finetune.yaml
@@ -0,0 +1,24 @@
args:
  model_name_or_path: "bert-base-uncased"  # pretrained model to fine-tune (path or huggingface.co/models identifier)
  tokenizer_name: "bert-base-uncased"      # pretrained tokenizer (path or identifier)
  dataset: "imdb"                          # "local" or a Hugging Face datasets name

  # Add the fine-tuning configurations below
  pipeline: "finetune"
  finetune_impl: "itrex"
  dtype_ft: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  finetune_output: finetune_predictions_report.yaml

  training_args:
    num_train_epochs: 1
    do_train: true
    do_predict: true
    per_device_train_batch_size: 100
    per_device_eval_batch_size: 100
    output_dir: "/.output"
@@ -0,0 +1,24 @@
args:
  model_name_or_path: "bert-base-uncased"  # pretrained model to fine-tune (path or huggingface.co/models identifier)
  tokenizer_name: "bert-base-uncased"      # pretrained tokenizer (path or identifier)
  dataset: "imdb"                          # "local" or a Hugging Face datasets name

  # Add the fine-tuning configurations below
  pipeline: "finetune"
  finetune_impl: "itrex"
  dtype_ft: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  finetune_output: finetune_predictions_report.yaml

  training_args:
    num_train_epochs: 1
    do_train: true
    do_predict: true
    per_device_train_batch_size: 100
    per_device_eval_batch_size: 100
    output_dir: "/.output"
@@ -0,0 +1,36 @@
args:
  model_name_or_path: "emilyalsentzer/Bio_ClinicalBERT"
  tokenizer_name: "emilyalsentzer/Bio_ClinicalBERT"
  dataset: "local"                         # "local" or a Hugging Face datasets name

  # Add local dataset configurations below. Skip for HF datasets.
  # Make sure to specify your local dataset; the code will fail otherwise.
  local_dataset:
    finetune_input: '/workspace/dataset/annotation.csv'
    inference_input: '/workspace/dataset/annotation.csv'
    delimiter: ","
    features:
      class_label: "label"
      data_column: "symptoms"
      id: "Patient_ID"
    label_list: ["Malignant", "Normal", "Benign"]

  # Add the fine-tuning configurations below
  pipeline: "finetune"
  finetune_impl: "itrex"
  dtype_ft: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  finetune_output: finetune_predictions_report.yaml

  training_args:
    num_train_epochs: 1
    do_train: true
    do_predict: true
    per_device_train_batch_size: 100
    per_device_eval_batch_size: 100
    output_dir: "/.output"
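
For the local dataset used above, the CSV needs the three configured columns (Patient_ID, symptoms, label) separated by the configured delimiter. The rows below are entirely made-up placeholders, shown only to illustrate the expected shape:

```
Patient_ID,symptoms,label
1001,"persistent cough and fatigue",Malignant
1002,"mild seasonal sneezing",Normal
1003,"small painless lump, slow growing",Benign
```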
32 changes: 32 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/config/inference.yaml
@@ -0,0 +1,32 @@
args:
  model_name_or_path: "bert-base-uncased"  # input the fine-tuned model path
  tokenizer_name: "bert-base-uncased"      # input the fine-tuned model path
  dataset: "imdb"                          # "local" or a Hugging Face datasets name

  # Add local dataset configurations below. Skip for HF datasets.
  local_dataset:
    inference_input: '/workspace/dataset/annotation.csv'
    delimiter: ","
    features:
      class_label: "label"
      data_column: "symptoms"
      id: "Patient_ID"
    label_list: ["Malignant", "Normal", "Benign"]

  # Add the inference configurations below
  pipeline: "inference"
  infer_impl: "itrex"
  dtype_inf: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  inference_output: inference_predictions_report.yaml
  multi_instance: false

  training_args:
    do_predict: true
    per_device_eval_batch_size: 100
    output_dir: "/.output"
@@ -0,0 +1,32 @@
args:
  model_name_or_path: "bert-base-uncased"  # input the fine-tuned model path
  tokenizer_name: "bert-base-uncased"      # input the fine-tuned model path
  dataset: "imdb"                          # "local" or a Hugging Face datasets name

  # Add local dataset configurations below. Skip for HF datasets.
  local_dataset:
    inference_input: '/workspace/dataset/annotation.csv'
    delimiter: ","
    features:
      class_label: "label"
      data_column: "symptoms"
      id: "Patient_ID"
    label_list: ["Malignant", "Normal", "Benign"]

  # Add the inference configurations below
  pipeline: "inference"
  infer_impl: "itrex"
  dtype_inf: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  inference_output: inference_predictions_report.yaml
  multi_instance: false

  training_args:
    do_predict: true
    per_device_eval_batch_size: 100
    output_dir: "/.output"
@@ -0,0 +1,33 @@
args:
  model_name_or_path: "./models/hls/"      # input the fine-tuned model path
  tokenizer_name: "./models/hls/"          # input the fine-tuned model path
  dataset: "local"                         # "local" or a Hugging Face datasets name

  # Add local dataset configurations below. Skip for HF datasets.
  # Make sure to specify your local dataset; the code will fail otherwise.
  local_dataset:
    inference_input: '/workspace/dataset/annotation.csv'
    delimiter: ","
    features:
      class_label: "label"
      data_column: "symptoms"
      id: "Patient_ID"
    label_list: ["Malignant", "Normal", "Benign"]

  # Add the inference configurations below
  pipeline: "inference"
  infer_impl: "itrex"
  dtype_inf: "fp32"
  max_seq_len: 64
  smoke_test: false
  max_train_samples: null
  max_test_samples: null
  preprocessing_num_workers: 8
  overwrite_cache: true
  inference_output: inference_predictions_report.yaml
  multi_instance: false

  training_args:
    do_predict: true
    per_device_eval_batch_size: 100
    output_dir: "/.output"
15 changes: 15 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/docker/Dockerfile
@@ -0,0 +1,15 @@
FROM intel/intel-optimized-pytorch:pip-ipex-1.13.100-ubuntu-22.04

RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libgl1-mesa-glx \
libglib2.0-0 \
python3-dev

RUN mkdir -p /workspace/output

COPY . /workspace

WORKDIR /workspace

RUN python -m pip install --no-cache-dir -r /workspace/requirements.txt
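
The Dockerfile copies the workflow into /workspace and installs the pinned requirements on top of the Intel-optimized PyTorch base image. A typical way to build and run it might look like the following; the image tag and the host dataset path are assumptions, and the dataset mount simply matches the /workspace/dataset/annotation.csv path used in the sample configs:

```shell
# Assumed usage; image tag and mount paths are illustrative.
docker build -t hf-finetuning-inference-nlp -f docker/Dockerfile .
docker run --rm \
  -v "$(pwd)/dataset:/workspace/dataset" \
  hf-finetuning-inference-nlp \
  python src/run.py --config_file config/finetune.yaml
```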
18 changes: 18 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/install.sh
@@ -0,0 +1,18 @@
#!/bin/bash

# Copyright (C) 2022 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions
# and limitations under the License.
#

pip install -r requirements.txt
7 changes: 7 additions & 0 deletions workflows/hf_finetuning_and_inference_nlp/requirements.txt
@@ -0,0 +1,7 @@
transformers==4.26.0
datasets==2.11.0
neural-compressor==2.1
--extra-index-url https://download.pytorch.org/whl/cpu
torch==1.13.1
intel_extension_for_pytorch==1.13.100
intel-extension-for-transformers==1.0.0
Empty file.