Adding the huggingface Finetuning and inference NLP reference workflow (#829)
intel · Apr 21, 2023 · bf666c0
1 parent 4cb9e4f
commit bf666c0
Showing 21 changed files with 1,232 additions and 0 deletions.
diff --git a/workflows/hf_finetuning_and_inference_nlp/README.md b/workflows/hf_finetuning_and_inference_nlp/README.md
@@ -0,0 +1,49 @@
+# Workflow purpose
+The Hugging Face Finetuning (transfer learning) and Inference workflow demonstrates NLP (natural language processing) workflows/pipelines that use the Hugging Face Transformers API together with Intel-optimized software (toolkits, domain kits, packages, frameworks, and other libraries) to make effective use of Intel hardware, leveraging Intel's AI instructions for faster processing and increased performance. The workflows can be easily used by applications or reference kits that showcase their usage.
+
+The workflow currently supports
+```
+Hugging Face NLP Finetuning / Transfer Learning
+Hugging Face NLP Inference
+```
+The HF Finetuning and Inference workflow supports the following APIs
+```
+Hugging Face Transformers (Trainer API)
+Intel Extension for Transformers API (ITREX API), also known as Intel's Transformer/NLP Toolkit
+```
+
+### Architecture
+![Reference_Workflow](assets/HFFinetuningAndInference.png)
+
+
+# Get Started
+### Clone this Repository
+```
+git clone <repository-url>
+cd <repository-directory>
+```
+
+### Create a new Python (conda or venv) environment named "hfftinf_wf"
+```shell
+conda create -n hfftinf_wf python=3.9
+conda activate hfftinf_wf
+```
+or
+```shell
+python -m venv hfftinf_wf
+source hfftinf_wf/bin/activate
+```
+
+### Install the packages required to run hf-finetuning-inference-nlp-workflows
+```shell
+pip install -r requirements.txt
+```
+
+## Running 
+See config/README.md for options.
+```shell
+python src/run.py --config_file config/finetune.yaml 
+python src/run.py --config_file config/inference.yaml 
+```
+
+
diff --git a/workflows/hf_finetuning_and_inference_nlp/assets/HFFinetuningAndInference.png b/workflows/hf_finetuning_and_inference_nlp/assets/HFFinetuningAndInference.png
diff --git a/workflows/hf_finetuning_and_inference_nlp/config/README.md b/workflows/hf_finetuning_and_inference_nlp/config/README.md
@@ -0,0 +1,74 @@
+# YAML Config
+
+Fine Tuning
+```
+model_name_or_path :             Path to pretrained model or model identifier from huggingface.co/models.
+tokenizer_name:                  Pretrained tokenizer name or path if not the same as model_name.
+dataset:                         Local or Huggingface datasets name.
+
+""" Required only when dataset: 'local' """
+local_dataset:
+    finetune_input :             Input filename in case of a local dataset.
+    delimiter:                   File delimiter.
+    features:
+        class_label:             Label column name.
+        data_column:             Data column name.
+        id:                      Id column name.
+    label_list:                  List of class labels.
+
+pipeline:                        The pipeline to use. 'finetune' in this case.
+finetune_impl:                   The implementation of the fine-tuning pipeline. Supported values: trainer and itrex.
+dtype_ft:                        Data type for the fine-tuning pipeline. Supports fp32 and bf16 on CPU; fp32, tf32, and fp16 on GPU.
+max_seq_len:                     The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
+smoke_test:                      Whether to execute in sanity check mode.
+max_train_samples:               For debugging purposes or quicker training, truncate the number of training examples to this value if set.
+max_test_samples:                For debugging purposes or quicker testing, truncate the number of testing examples to this value if set.
+preprocessing_num_workers:       The number of processes to use for the preprocessing.
+overwrite_cache:                 Overwrite the cached training and evaluation sets.
+finetune_output:                 Path of file to write output results.
+
+training_args:
+    num_train_epochs:            Number of epochs to run.
+    do_train:                    Whether to run training.
+    do_predict:                  Whether to run predictions.
+    per_device_train_batch_size: Batch size per device during training.
+    per_device_eval_batch_size:  Batch size per device during evaluation.
+    output_dir:                  Output directory.
+```
+
+Inference
+```
+model_name_or_path :             Path to pretrained model or model identifier from huggingface.co/models.
+tokenizer_name:                  Pretrained tokenizer name or path if not the same as model_name.
+dataset:                         Local or Huggingface datasets name.
+
+""" Required only when dataset: 'local' """
+local_dataset:
+    inference_input :            Input filename in case of a local dataset.
+    delimiter:                   File delimiter.
+    features:
+        class_label:             Label column name.
+        data_column:             Data column name.
+        id:                      Id column name.
+    label_list:                  List of class labels.
+
+pipeline:                        The pipeline to use. 'inference' in this case.
+infer_impl:                      The implementation of the inference pipeline. Supported values: trainer and itrex.
+dtype_inf:                       Data type for the inference pipeline. Supports fp32 and bf16 on CPU; fp32, tf32, and fp16 on GPU.
+max_seq_len:                     The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
+smoke_test:                      Whether to execute in sanity check mode.
+max_train_samples:               For debugging purposes or quicker training, truncate the number of training examples to this value if set.
+max_test_samples:                For debugging purposes or quicker testing, truncate the number of testing examples to this value if set.
+preprocessing_num_workers:       The number of processes to use for the preprocessing.
+overwrite_cache:                 Overwrite the cached training and evaluation sets.
+inference_output:                Path of file to write output results.
+multi_instance:                  Whether to use multi-instance mode.
+
+training_args:
+    num_train_epochs:            Number of epochs to run.
+    do_train:                    Whether to run training.
+    do_predict:                  Whether to run predictions.
+    per_device_train_batch_size: Batch size per device during training.
+    per_device_eval_batch_size:  Batch size per device during evaluation.
+    output_dir:                  Output directory.
+```
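
The option descriptions in this config README imply a few constraints: `pipeline` selects either fine-tuning or inference, each pipeline has its own implementation and dtype keys, the allowed dtypes depend on the device, and `local_dataset` is needed only when `dataset` is `'local'`. A minimal Python sketch of checking those constraints on an already-parsed config follows. The function name `validate_args`, the `device` argument, and the message wording are assumptions made for illustration; nothing here is taken from `src/run.py`.

```python
# Hypothetical validator for the `args` section of a parsed config.
# Allowed values mirror the option descriptions in config/README.md.
ALLOWED_PIPELINES = {"finetune", "inference"}
ALLOWED_IMPLS = {"trainer", "itrex"}
ALLOWED_DTYPES = {
    "cpu": {"fp32", "bf16"},
    "gpu": {"fp32", "tf32", "fp16"},
}


def validate_args(args, device="cpu"):
    """Return a list of problems found in the `args` section of a config."""
    problems = []

    pipeline = args.get("pipeline")
    if pipeline not in ALLOWED_PIPELINES:
        problems.append(f"pipeline must be one of {sorted(ALLOWED_PIPELINES)}")
        return problems

    # Each pipeline uses its own implementation and dtype keys.
    if pipeline == "finetune":
        impl_key, dtype_key = "finetune_impl", "dtype_ft"
    else:
        impl_key, dtype_key = "infer_impl", "dtype_inf"

    if args.get(impl_key) not in ALLOWED_IMPLS:
        problems.append(f"{impl_key} must be one of {sorted(ALLOWED_IMPLS)}")

    allowed_dtypes = sorted(ALLOWED_DTYPES[device])
    if args.get(dtype_key) not in allowed_dtypes:
        problems.append(f"{dtype_key} must be one of {allowed_dtypes} on {device}")

    # local_dataset is required only when dataset is 'local'.
    if args.get("dataset") == "local" and "local_dataset" not in args:
        problems.append("local_dataset is required when dataset is 'local'")

    return problems
```

Calling `validate_args(config["args"])` before launching a run returns an empty list when the documented constraints hold, and otherwise one message per violated constraint.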
diff --git a/workflows/hf_finetuning_and_inference_nlp/config/finetune.yaml b/workflows/hf_finetuning_and_inference_nlp/config/finetune.yaml
@@ -0,0 +1,24 @@
+args:
+  model_name_or_path: "bert-base-uncased" # pretrained model name or path
+  tokenizer_name: "bert-base-uncased" # pretrained tokenizer name or path
+  dataset: "imdb" # local or huggingface datasets name
+
+  # Add the fine tuning configurations below
+  pipeline: "finetune"
+  finetune_impl: "itrex"
+  dtype_ft: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  finetune_output: finetune_predictions_report.yaml
+
+training_args:
+  num_train_epochs: 1
+  do_train: true
+  do_predict: true
+  per_device_train_batch_size: 100
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
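
The `max_seq_len: 64` setting in this config caps every tokenized input at 64 tokens: longer sequences are truncated and shorter ones are padded, as the config README describes. A minimal Python sketch of that rule on a plain list of token ids follows. `fit_to_length` and the default `pad_id=0` are hypothetical choices for illustration; a real tokenizer also accounts for special tokens and attention masks, which this sketch ignores.

```python
def fit_to_length(token_ids, max_seq_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_seq_len."""
    # Keep at most max_seq_len ids (truncation).
    truncated = token_ids[:max_seq_len]
    # Fill the remaining positions with the padding id (padding).
    return truncated + [pad_id] * (max_seq_len - len(truncated))


short = fit_to_length([101, 2023, 102], max_seq_len=5)
long = fit_to_length(list(range(10)), max_seq_len=4)
```

Here `short` is padded up to five ids and `long` is cut down to its first four, so every example in a batch ends up with the same length.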
diff --git a/...etuning_and_inference_nlp/config/finetune_Model-bertbase_Task-sentiment_Dataset-imdb.yaml b/...etuning_and_inference_nlp/config/finetune_Model-bertbase_Task-sentiment_Dataset-imdb.yaml
@@ -0,0 +1,24 @@
+args:
+  model_name_or_path: "bert-base-uncased" # pretrained model name or path
+  tokenizer_name: "bert-base-uncased" # pretrained tokenizer name or path
+  dataset: "imdb" # local or huggingface datasets name
+
+  # Add the fine tuning configurations below
+  pipeline: "finetune"
+  finetune_impl: "itrex"
+  dtype_ft: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  finetune_output: finetune_predictions_report.yaml
+
+training_args:
+  num_train_epochs: 1
+  do_train: true
+  do_predict: true
+  per_device_train_batch_size: 100
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
diff --git a/...ce_nlp/config/finetune_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml b/...ce_nlp/config/finetune_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml
@@ -0,0 +1,36 @@
+args:
+  model_name_or_path: "emilyalsentzer/Bio_ClinicalBERT"
+  tokenizer_name: "emilyalsentzer/Bio_ClinicalBERT"
+  dataset: "local" # local or huggingface datasets name
+
+  # Add local dataset configurations below. Skip for HF datasets.
+  # Make sure to specify your local dataset. The code will fail otherwise.
+  local_dataset:
+    finetune_input : '/workspace/dataset/annotation.csv'
+    inference_input : '/workspace/dataset/annotation.csv'
+    delimiter: ","
+    features:
+      class_label: "label"
+      data_column: "symptoms"
+      id: "Patient_ID"
+    label_list: ["Malignant", "Normal", "Benign"]
+
+  # Add the fine tuning configurations below
+  pipeline: "finetune"
+  finetune_impl: "itrex"
+  dtype_ft: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  finetune_output: finetune_predictions_report.yaml
+
+training_args:
+  num_train_epochs: 1
+  do_train: true
+  do_predict: true
+  per_device_train_batch_size: 100
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
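
The `local_dataset` block in this config names a delimited file, the three columns to read from it (`class_label`, `data_column`, `id`), and the list of class labels. A minimal Python sketch of how such a block could be applied to a CSV file follows. The function `load_local_rows`, the output keys, and the sample rows are made up for illustration, and mapping each label to its position in `label_list` is an assumption about the encoding, not something this commit confirms.

```python
import csv
import io


def load_local_rows(handle, delimiter, features, label_list):
    """Read delimited rows and map each class label to its index in label_list."""
    label_to_id = {label: index for index, label in enumerate(label_list)}
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for record in reader:
        rows.append(
            {
                "id": record[features["id"]],
                "text": record[features["data_column"]],
                "label": label_to_id[record[features["class_label"]]],
            }
        )
    return rows


# Made-up sample data standing in for a real annotation file.
sample = io.StringIO(
    "Patient_ID,symptoms,label\n"
    "P1,example symptom text,Malignant\n"
    "P2,another example,Benign\n"
)
features = {"class_label": "label", "data_column": "symptoms", "id": "Patient_ID"}
rows = load_local_rows(sample, ",", features, ["Malignant", "Normal", "Benign"])
```

A label that is missing from `label_list` raises a `KeyError` in this sketch, which surfaces annotation mistakes early instead of silently dropping rows.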
diff --git a/workflows/hf_finetuning_and_inference_nlp/config/inference.yaml b/workflows/hf_finetuning_and_inference_nlp/config/inference.yaml
@@ -0,0 +1,32 @@
+args:
+  model_name_or_path: "bert-base-uncased" # input the fine-tuned model path
+  tokenizer_name: "bert-base-uncased" # input the fine-tuned model path
+  dataset: "imdb" # local or huggingface datasets name
+
+  # Add local dataset configurations below. Skip for HF datasets.
+  local_dataset:
+    inference_input : '/workspace/dataset/annotation.csv'
+    delimiter: ","
+    features:
+      class_label: "label"
+      data_column: "symptoms"
+      id: "Patient_ID"
+    label_list: ["Malignant", "Normal", "Benign"]
+
+  # Add the Inference configurations below
+  pipeline: "inference"   
+  infer_impl: "itrex"
+  dtype_inf: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  inference_output: inference_predictions_report.yaml
+  multi_instance: false
+
+training_args:
+  do_predict: true
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
diff --git a/...tuning_and_inference_nlp/config/inference_Model-bertbase_Task-sentiment_Dataset-imdb.yaml b/...tuning_and_inference_nlp/config/inference_Model-bertbase_Task-sentiment_Dataset-imdb.yaml
@@ -0,0 +1,32 @@
+args:
+  model_name_or_path: "bert-base-uncased" # input the fine-tuned model path
+  tokenizer_name: "bert-base-uncased" # input the fine-tuned model path
+  dataset: "imdb" # local or huggingface datasets name
+
+  # Add local dataset configurations below. Skip for HF datasets.
+  local_dataset:
+    inference_input : '/workspace/dataset/annotation.csv'
+    delimiter: ","
+    features:
+      class_label: "label"
+      data_column: "symptoms"
+      id: "Patient_ID"
+    label_list: ["Malignant", "Normal", "Benign"]
+
+  # Add the Inference configurations below
+  pipeline: "inference"   
+  infer_impl: "itrex"
+  dtype_inf: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  inference_output: inference_predictions_report.yaml
+  multi_instance: false
+
+training_args:
+  do_predict: true
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
diff --git a/...e_nlp/config/inference_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml b/...e_nlp/config/inference_Model-bioclinicalbert_Task-HLSDiseasePrediction_Dataset-local.yaml
@@ -0,0 +1,33 @@
+args:
+  model_name_or_path: "./models/hls/" # input the fine-tuned model path
+  tokenizer_name: "./models/hls/" # input the fine-tuned model path
+  dataset: "local" # local or huggingface datasets name
+
+  # Add local dataset configurations below. Skip for HF datasets.
+  # Make sure to specify your local dataset. The code will fail otherwise.
+  local_dataset:
+    inference_input : '/workspace/dataset/annotation.csv'
+    delimiter: ","
+    features:
+      class_label: "label"
+      data_column: "symptoms"
+      id: "Patient_ID"
+    label_list: ["Malignant", "Normal", "Benign"]
+
+  # Add the Inference configurations below
+  pipeline: "inference"   
+  infer_impl: "itrex"
+  dtype_inf: "fp32"
+  max_seq_len: 64
+  smoke_test: false
+  max_train_samples: null
+  max_test_samples: null
+  preprocessing_num_workers: 8
+  overwrite_cache: true
+  inference_output: inference_predictions_report.yaml
+  multi_instance: false
+
+training_args:
+  do_predict: true
+  per_device_eval_batch_size: 100
+  output_dir: "/.output"
diff --git a/workflows/hf_finetuning_and_inference_nlp/docker/Dockerfile b/workflows/hf_finetuning_and_inference_nlp/docker/Dockerfile
@@ -0,0 +1,15 @@
+FROM intel/intel-optimized-pytorch:pip-ipex-1.13.100-ubuntu-22.04
+
+RUN apt-get update && apt-get install -y --no-install-recommends --fix-missing \
+    build-essential \
+    libgl1-mesa-glx \
+    libglib2.0-0 \
+    python3-dev 
+
+RUN mkdir -p /workspace/output
+
+COPY . /workspace
+
+WORKDIR /workspace
+
+RUN python -m pip install --no-cache-dir -r /workspace/requirements.txt
diff --git a/workflows/hf_finetuning_and_inference_nlp/install.sh b/workflows/hf_finetuning_and_inference_nlp/install.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+
+# Copyright (C) 2022 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions
+# and limitations under the License.
+#
+
+pip install -r requirements.txt
diff --git a/workflows/hf_finetuning_and_inference_nlp/requirements.txt b/workflows/hf_finetuning_and_inference_nlp/requirements.txt
@@ -0,0 +1,7 @@
+transformers==4.26.0
+datasets==2.11.0
+neural-compressor==2.1
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch==1.13.1
+intel_extension_for_pytorch==1.13.100
+intel-extension-for-transformers==1.0.0
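
The requirements file pins every package to an exact version and adds one pip option line (`--extra-index-url`). A minimal Python sketch for comparing those pins with the active environment follows. `parse_pins` and `find_mismatches` are hypothetical helpers written for illustration, not part of this commit. Dropping the local version suffix (such as `+cpu`) before comparing is an assumption about how CPU wheels report their version.

```python
from importlib import metadata

REQUIREMENTS = """\
transformers==4.26.0
datasets==2.11.0
neural-compressor==2.1
--extra-index-url https://download.pytorch.org/whl/cpu
torch==1.13.1
intel_extension_for_pytorch==1.13.100
intel-extension-for-transformers==1.0.0
"""


def parse_pins(requirements_text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and pip option lines such as --extra-index-url.
        if not line or line.startswith("#") or line.startswith("--"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} where the two versions differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Drop a local version suffix like "+cpu" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


pins = parse_pins(REQUIREMENTS)
```

Running `find_mismatches(pins)` inside the `hfftinf_wf` environment returns an empty dictionary when every pinned package is installed at the pinned version.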
diff --git a/workflows/hf_finetuning_and_inference_nlp/src/__init__.py b/workflows/hf_finetuning_and_inference_nlp/src/__init__.py