
Releases: intel/intel-extension-for-transformers

Intel® Extension for Transformers v1.4.1 Release

21 Apr 08:38
0fc6e01


Highlights

  • Support Weight-only Quantization on MTL iGPU
  • Upgrade lm-eval to 0.4.2
  • Support Llama3
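A minimal sketch of the weight-only low-bit path highlighted above, using the Transformers-style wrapper from ITREX (the model id, prompt, and `load_in_4bit` usage are illustrative assumptions, not the release's exact recipe):

```python
# Hedged sketch: weight-only 4-bit loading via the ITREX wrapper.
# Exact argument names may differ across versions and devices.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit requests weight-only INT4 quantization on supported devices
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("The weather today is", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```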

Improvements

  • Support TPP for Xeon Tensor Parallel (5f0430f)
  • Refine model from_pretrained when use_neural_speed is enabled (39ecf38e)

Examples

  • Add vision front-end demo (1c6550)
  • Add example for table extraction and enable a multi-page table handling pipeline (db9e6fb)
  • Adapt the textual-inversion distillation-for-quantization example to the latest transformers and diffusers packages (0ec83b1)
  • Update NeuralChat notebooks (83bb65a, 629b9d4)

Bug Fixing

  • Fix QBits actshuf buffer overflow under large batch (a6f3ab3)
  • Fix TPP support for single socket (a690072)
  • Fix retrieval dependency (281b0a3)
  • Fix loading issue of WOQ models with parameters (37f9db25)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • PyTorch 2.2.0+cpu
  • Intel® Extension for PyTorch 2.2.0+cpu

Intel® Extension for Transformers v1.4 Release

03 Apr 16:33
346211c


Highlights

  • AutoRound is a SOTA weight-only quantization (WOQ) algorithm for low-bit LLM inference that performs well across typical LLMs. This release adds support for AutoRound quantization and for inference with INT4 models quantized by AutoRound.
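As a minimal sketch of what AutoRound-based WOQ could look like through the Transformers-style API (the `AutoRoundConfig` class name and its fields are assumptions inferred from this release's description, and the model id is illustrative):

```python
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    AutoRoundConfig,  # assumed config class for AutoRound WOQ
)

# bits/group_size are typical WOQ knobs; treat these values as illustrative.
woq_config = AutoRoundConfig(bits=4, group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",            # small model for a quick trial
    quantization_config=woq_config,
)
model.save_pretrained("./opt-125m-autoround-int4")
```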

Features

Productivity

  • Add bm25 algorithm into retrievers (a19467d0)
  • Add perplexity evaluation during training (2858ed1)
  • Enhance embedding to support JIT models (588c60)
  • Update the character-checking function to support Chinese characters (0da63fe1)
  • Enlarge the context window for HPU graph recompile (dcaf17ac)
  • Support IPEX BF16 & FP32 optimization for embedding models (b51552)
  • Enable lm_eval during training (2de883)
  • Refine setup.py and requirements.txt (436847)
  • Improve WOQ model saving and loading (30d9d10, 1065d81c)
  • Add layerwise quantization for WOQ RTN & GPTQ (15a848f3)
  • Update SparseGPT example (3ae0cd0)
  • Change regular expression to support Unicode characters (fd2516b)
  • Check and convert to contiguous tensors when saving models (d21bb3e)
  • Support loading models from ModelScope using Neural Speed (20ae00)

Examples

Bug Fixing

  • Fix CLM tasks when transformers >= 4.38.1 (98bfcf8)
  • Fix distilgpt2 TF signature issue (a7c15a9f)
  • Add error response when user input plus requested max tokens exceeds the model context window (ae91bf8)
  • Fix audio plugin sample code issue and provide a way to set the TTS/ASR model path (db7da09)
  • Fix modeling_auto trust_remote_code issue (3a0987)
  • Fix lm-eval model loading with Neural Speed (cd6e488)
  • Fix weight-only config save issue (5c92fe31)
  • Fix index error in child-parent retriever (8797cfe)
  • Fix WOQ INT8 weight unpacking (edede4)
  • Fix GPTQ desc_act and static_group (528d7de)
  • Fix request.client=None issue (494a571)
  • Fix WOQ Hugging Face model loading (01b1a44)
  • Fix SQ model restore loading (1e00f29)
  • Remove redundant parameters for WOQ saving config and fix GPTQ issue (ef0882f6)
  • Fix example error for Intel GPU WOQ (8fdde06)
  • Fix WOQ AutoRound last-layer quantization issue (d21bb3e)
  • Fix code-generation params (ab2fd05)

Validated Configurations

  • Python 3.8, 3.9, 3.10, 3.11
  • Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.13.0, 2.14.0
  • PyTorch 2.2.0+cpu, 2.1.0+cpu
  • Intel® ...

Intel® Extension for Transformers v1.3.2 Release

24 Feb 05:47
9e9e4c7

Highlights

  • Support NeuralChat-TGI serving with Docker (8ebff39)
  • Support NeuralChat-vLLM serving with Docker (1988dd)
  • Support SQL generation in NeuralChat (098aca7)
  • Enable LLaVA MMMU evaluation on Gaudi2 (c30353f)
  • Improve LLM INT4 inference on Intel GPUs

Improvements

  • Minimize dependencies for running a chatbot (a0c9dfe)
  • Remove redundant knowledge id in audio plugin API (9a7353)
  • Update parameters for NeuralSpeed (19fec91)
  • Integrate backend code of AskDoc (c5d4cd)
  • Refine fine-tuning data preprocessing with static shapes for Gaudi2 (3f62ceb)
  • Sync RESTful API with the latest OpenAI protocol (2e1c79)
  • Support WOQ model save and load (1c8078f); see the sketch after this list
  • Extend API for GGUF (7733d4)
  • Enable OpenAI compatible audio API (d62ff9e)
  • Add interface to acquire pack_weight info (18d36ef)
  • Add customized system prompts (04b2f8)
  • Support WOQ scheme asym (c7f0b70)
  • Update code_lm_eval to bigcode_eval (44f914e)
  • Enable retrieval of PDF figures as text (d6a66b3)
  • Enable retrieve-then-rerank pipeline (15feadf)
  • Enable grammar check and query polishing to enhance RAG performance (a63ec0)
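For the WOQ save-and-load item above, a minimal round-trip sketch (the model id and paths are illustrative; the assumption is that the wrapper restores the packed low-bit weights on reload without re-quantizing):

```python
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Quantize once, then persist the packed low-bit weights.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", load_in_4bit=True)
model.save_pretrained("./opt-125m-woq")

# Later: reload the already-quantized checkpoint directly.
restored = AutoModelForCausalLM.from_pretrained("./opt-125m-woq")
```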

Examples

  • Add Rank-One Model Editing (ROME) implementation and example (8dcf0ea7)
  • Support GPTQ and AWQ models in NeuralChat (5b08de)
  • Add Neural Speed example scripts (6a97d15, 3385c42)
  • Add langchain extension example and update notebook (d40e2f1)
  • Support deepseek-coder models in NeuralChat (e7f5b1d)
  • Add AutoRound examples (71f5e84)
  • Support BGE embedding model fine-tuning (67bef24)
  • Support DeciLM-7B and DeciLM-7B-instruct in NeuralChat (e6f87ab)
  • Support GGUF model in NeuralChat (a53a33c)

Bug Fixing

  • Add trust_remote_code argument for lm_eval in the WOQ example (9022eb)
  • Fix CPU WOQ accuracy issue (e530f7)
  • Change the default value for XPU weight-only quantization (4a78ba)
  • Fix whisper forced_decoder_ids error (09ddad)
  • Fix off by one error on masking (525076d)
  • Fix backprop error for text only examples (9cff14a)
  • Use unk token instead of eos token (6387a0)
  • Fix errors in trainer save (ff501d0)
  • Fix Qdrant bug caused by langchain_core upgrade (eb763e6)
  • Set trainer.save_model state_dict format to safetensors (2eca8c)
  • Fix text-generation example accuracy scripts (a2cfb80)
  • Resolve WOQ quantization error when running neuralchat (6c0bd77)
  • Fix response issue of model.predict (3068496)
  • Fix pydub library import issues (c37dab)
  • Fix chat history issue (7bb3314)
  • Update gradio APP to sync with backend change (362b7af)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

Intel® Extension for Transformers v1.3.1 Release

19 Jan 16:38
81d4c56


Highlights

  • Support experimental INT4 inference on Intel GPUs (ARC and PVC) with Intel® Extension for PyTorch as the backend; see the sketch after this list
  • Enhance LangChain to support new vectorstores (e.g., Qdrant)
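A hedged sketch of the experimental Intel GPU path (this assumes an XPU-enabled Intel® Extension for PyTorch install; the `device_map="xpu"` placement and 4-bit flag mirror the CPU API, but the exact arguments of this experimental release may differ):

```python
import intel_extension_for_pytorch as ipex  # registers the XPU backend  # noqa: F401
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="xpu",   # place the INT4 model on an ARC/PVC GPU
    load_in_4bit=True,
)

ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to("xpu")
print(tokenizer.decode(model.generate(ids, max_new_tokens=16)[0]))
```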

Improvements

  • Improve error code handling coverage (dd6dcb4)
  • Refine NeuralChat documentation (aabb2fc)
  • Improve text-generation API (a4aba8)
  • Refactor transformers-like API to adapt to the latest transformers version (4e6834a)
  • Integrate GGML INT4 into NeuralChat (29bbd8)
  • Enable Qdrant vectorstore (f6b9e32)
  • Support Llama-series models for LLaVA fine-tuning (d753cb)

Examples

  • Support GGUF Q4_0, Q5_0 and Q8_0 models from Hugging Face (1383c7)
  • Support GPTQ model inference on CPU (f4c58d0)
  • Support SOLAR-10.7B-Instruct-v1.0 model (77fb81)
  • Support Magicoder model and refine model loading (f29c1e)
  • Support Mixtral-8x7B model (9729b6)
  • Support Phi-2 model (04f5ef6c)
  • Evaluate perplexity with Neural Speed (b0b381)

Bug Fixing

  • Fix GPTQ model loading issue (226e08)
  • Fix TTS crash on messy retrieval input and enhance the normalizer (4d8d9a)
  • Support compatible stats format (c0a89c5a)
  • Fix RAG example for retrieval plugin parameter change (c35d2b)
  • Fix Magicoder tokenizer issue and redundant end format in streaming (2758d4)

Validated Configurations

  • Python 3.10
  • CentOS 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

Intel® Extension for Transformers v1.3 Release

22 Dec 07:42
6e3a514


Highlights

  • LLM Workflow/Neural Chat
    • Achieved Top-1 among 7B LLMs on the Hugging Face Open LLM Leaderboard in Nov’23
    • Released DPO dataset to Hugging Face Space for fine-tuning
    • Published the blog and fine-tuning code on Gaudi2
    • Supported fine-tuning and inference on Gaudi2 and Xeon
    • Updated notebooks for chatbot development and deployment
    • Provided customizable RAG-based chatbot applications
    • Published INT4 chatbot on Hugging Face Space
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Supported INT4/NF4/FP4/FP8 LLM inference
    • Improved StreamingLLM for efficient endless text generation
    • Demonstrated up to 40x better performance than llama.cpp on Intel Xeon Scalable Processors
    • Supported QLoRA fine-tuning on CPU
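For the QLoRA-on-CPU highlight above, a minimal sketch that pairs a 4-bit ITREX model with the standard PEFT LoRA wrapper (how the pieces fit together here is an assumption, not the release's exact recipe; `use_llm_runtime=False` is an assumed flag to stay on a trainable PyTorch path):

```python
from peft import LoraConfig, get_peft_model
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# 4-bit base model on CPU; QLoRA trains small LoRA adapters on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",      # small model for a quick trial
    load_in_4bit=True,
    use_llm_runtime=False,    # assumed flag to keep the PyTorch model
)

lora = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```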

Publications

Features

  • LLM Workflow/Neural Chat
    • Support Gaudi model parallelism serving (7f0090)
    • Add PEFT model support in deepspeed sharded mode (370ca3)
    • Support return error code (ea173a)
    • Enhance NeuralChat security (ab43c7, 43e8b9, 6e0386)
    • Support assisted generation for NeuralChat (5ba797)
    • Add codegen RESTful API in NeuralChat (0c77b1)
    • Support multi-card streaming inference on Gaudi (9ad75c)
    • Support multi-CPU RESTful API serving (fec4bb4)
    • Support IPEX int8 model (e13363)
    • Enable retrieval with URL as inputs (9d90e1d)
    • Add NER plugin to NeuralChat (aa5d8a)
    • Integrate PhotoAI backend into NeuralChat (da138c, d7a1d8)
    • Support image-to-image plugin as a service (12ad4c)
    • Support optimized SadTalker for the video plugin in NeuralChat (7f24c79)
    • Add AskDoc retrieval API & example (89cf76)
    • Add side-by-side UI (dbbcc2b)
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Support load_in_nbit in LLM Runtime (4423f7)
    • Extend langchain embedding API (80a779)
    • Support QLoRA on CPU device (adb109)
    • Support PPO rl_training (936c2d2, 8543e2f)
    • Support multi-model training (ecb448)
    • Transformers Extension for Low-bit Inference Runtime supports GPTQ models (8145e6)
    • Enable beam-search post-processing (958d04, ae95a2, 224656, 6ea825)
    • Add MX-Format (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4) (f49f2d, 9f96ae)
    • Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas (43e30b)
    • Support attention-block tensor parallelism and add a Jblas split-weight interface (2c31dc, 22ceda4)
    • Enable streaming LLM for Runtime (ffc73bb5)
    • Support StarCoder MHA fusion (841b29a)
    • SmoothQuantConfig support recipes (1e0d7e)
    • Add SetFit API in ITREX (ffb7fd8)
    • Support full-parameter fine-tuning (2b541)
    • Support SmoothQuant auto tune (2fde68c)
    • Use Python logging instead of print (60942e)
    • Support Falcon and unify INT8 API (0fb2da8)
    • Support ipex.optimize_transformers feature (d2bd4d, ee855, 3f9ee42)
    • Optimize dropout operator (7d276c)
    • Add script for PPL evaluation (df40d5)
    • Refine Python API (91511d, 6e32ca6)
    • Allow CompileBF16 on GCC11 (d9e95d) …

Intel® Extension for Transformers v1.2.2 Release

06 Dec 09:54
8644604

Bug Fixing & Improvements

  • Replace test dataset with validation dataset when do_eval (e764bb5)
  • Fix save issue of DeepSpeed ZeRO-3 (cf5ff82)
  • Fix UT issues on Nvidia GPU (464962e)
  • Fix NER nightly UT bug (9e5a6b3)
  • Escape SQL strings for SDL (43e8b9a)
  • Fix added_tokens error (fd74a9a)

Validated Configurations

  • Python 3.9, 3.10
  • CentOS 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

Intel® Extension for Transformers v1.2.1 Release

08 Nov 09:14

Examples

  • Add Docker for code generation (dd3829)
  • Enable Qwen-7B-Chat for NeuralChat (698e58)
  • Enable Baichuan & Baichuan2 C++ inference (98e5f9)
  • Add side-by-side UI for NeuralChat (dbbcc2)
  • Support Falcon-180B C++ inference (900ebf)
  • Add StarCoder fine-tuning example (073bdd)
  • Enable text generation using Qwen (8f41d4)
  • Add Docker for NeuralChat (a17d952)

Bug Fixing & Improvements

  • Fix WOQ-with-AWQ bug where calib_iters was not set when calib_dataloader is not None (565ab4)
  • Fix init issue of LangChain Chroma (fdefe2)
  • Fix NeuralChat StarCoder MHA fusion issue (ce3d24)
  • Fix setuptools version limitation for build (2cae32)
  • Fix top-k/top-p post-processing in the Python API (7b4730)
  • Fix MSVC compile issues (87b00d)
  • Refine notebook and fix RESTful API issues (d8cc11)
  • Upgrade QBits backend (45e03b)
  • Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
  • Fix ChatGLM2 model loading issue (4f2169)
  • Remove oneDNN graph env setting for BF16 inference (59ab03)
  • Improve database security by escaping SQL strings (be6790)
  • Fix QBits backend getting wrong workspace malloc size (6dbd0b)

Validated Configurations

  • Python 3.9, 3.10
  • CentOS 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

Intel® Extension for Transformers v1.2 Release

26 Sep 18:53
8fbcceb


Highlights

  • NeuralChat was showcased in the Intel Innovation’23 keynote and at Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM-as-a-service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on Intel's latest Xeon Scalable Processors and Gaudi2 in Intel Developer Cloud.
  • NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrails. These features let you create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
  • LLM Runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. We improved LLM Runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM Runtime delivers 25 ms/token with INT4 Llama and 22 ms/token with INT4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures.
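A minimal sketch of the extended Transformers API described above (`WeightOnlyQuantConfig` and its fields follow our reading of this release line's API and should be treated as assumptions; the model id is illustrative):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,  # assumed WOQ config for the LLM Runtime path
)

model_name = "EleutherAI/gpt-j-6b"  # illustrative model id
config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=32)[0]))
```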

Features

  • Neural Chat
    • Support ASR/TTS on CPU and HPU (fb619e5, 56685a)
    • Add Docker for chatbot on Xeon SPR and Habana Gaudi (59fc92e, ad2ee1)
    • Refine chatbot workflow and use NeuralChat API (53bed4, e95fc32)
    • Implement Python SDK API, weight-only quantization, and AMP for NeuralChat (08ba5d85)
  • Model Optimization
    • Add GPTQ/TEQ/WOQ quantization with plenty of examples (b4b2fcc, 1bcab14)
    • Enhance the ITREX quantization API as well as LLM Runtime; users can now obtain a quantized model using AutoModelForCausalLM.from_pretrained (be651b, f4dc78)
    • Support GPT-J pruning (802ec0d2)
  • LLM Runtime
    • Enable FFN fusion for LLMs (277108)
    • Enable tensor parallelism for 4-bit GPT-J on 2 sockets (fe0d65c)
    • Implement AMX INT8/BF16 MHA (c314d6c)
    • Support asymmetric models in LLM Runtime (93ca55)
    • Jblas and QBits support NF4 and S4-fullrange weight compression (ff7af86)
    • Enhance beam-search early-stopping mechanisms (cd4c33d)

Productivity

  • ITREX has moved to fully public development; contributions are welcome (90ca31)
  • Support streaming mode for NeuralChat (f5892ec)
  • Support Direct Preference Optimization to improve accuracy (50b5b9)
  • Support query cache for chatbot (1b4463)
  • Weight-only support for the PyTorch framework (3a064fa)
  • Provide mixed INT8 & BF16 inference mode for Stable Diffusion (bd2973)
  • Support Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/Windows (02cc59)
  • Update oneDNN to v3.3 (e6d8a4)
  • Weight-only kernels support INT8 quantization (6ce8b13)
  • Enable flash-attention-like kernel in weight-only (0ef3942)
  • Weight-only kernel ISA-based dispatcher (ff7af86)
  • Support 4-bit per-channel quantization (4e164a8)

Examples

Bug Fixing


Intel® Extension for Transformers v1.1.1 Release

06 Sep 16:37
49336d3

Highlights
In this release, we improved NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers. NeuralChat is now available for you to create your own chatbot within minutes on multiple architectures.
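A minimal sketch of creating a chatbot with NeuralChat (`build_chatbot` and `PipelineConfig` follow the NeuralChat API as documented around this release; the prompt is illustrative, and the default model is whatever the pipeline config selects):

```python
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

# Default pipeline; PipelineConfig also accepts overrides such as
# model_name_or_path and a list of plugins.
chatbot = build_chatbot(PipelineConfig())
print(chatbot.predict("Tell me about Intel Xeon Scalable Processors."))
```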

Bug Fixing & Improvements

  • Fix the code structure and the plugin in NeuralChat (commit 486e9e)
  • Fix bug in retrieval chat (commit d2cee0)
  • NeuralChat inference returns the correct input length to the user, without padding (commit 18be4c)
  • Fix MPT left-padding support issue (commit 24ae58)
  • Fix double removal of dataset columns during concatenation (commit 67ce6e)
  • Fix DeepSpeed and use_cache issue (commit 4675d4)
  • Fix bugs in predict_stream (commit e1da7e)
  • Fix docker CPU issues (commit 8fa0dc)
  • Fix read HuggingFaceH4/oasst1_en dataset issue (commit 76ee68)
  • Modify Dockerfile for finetuning (commit 797aa2)
  • Fix LLaMA2 performance via static_shape in Optimum Habana (commit 481f38)
  • Remove redundant code and hard-coded values from NeuralChat (commit 0e1e4d, 037ce8, 10af3c)
  • Refine NeuralChat fine-tuning config (commit e372cf)

Tests & Tutorials

  • Add inference test for LLaMA2 and MPT with HPU (commit 5c4f5e)
  • Add inference test for LLaMA2 and MPT with Intel CPUs (commit ad4bec, 2f6188)
  • Add finetuning test for MPT (commit 72d81e, 423242)
  • Add GHA Unit Tests (commit 49336d)
  • NeuralChat finetuning tutorial for LLaMA2 and MPT (commit d156e9)
  • NeuralChat deployment tutorial for Intel CPU / Habana HPU / Nvidia GPU (commit b36711)

Validated Configurations

  • CentOS 8.4 & Ubuntu 22.04
  • Python 3.9
  • PyTorch 2.0.0
  • TensorFlow 2.12.0

Acknowledgements
Thanks for the contributions from sywangyi, jiafuzha, and itayariel, and thanks to everyone who participates in Intel Extension for Transformers.

Intel® Extension for Transformers v1.1 Release

14 Jul 10:26
4269f96

Highlights

  • Created NeuralChat, the first 7B commercially friendly chat model, ranked at the top of the LLM leaderboard
  • Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
  • Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
  • Supported quantization for a broad range of LLMs with the improved lm-evaluation-harness for multiple frameworks and data types

Features

  • Model Optimization
    • Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
    • Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
    • Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
    • Enable QAT for Stable Diffusion (commit 2e2efd)
    • Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
  • Transformers-accelerated Neural Engine
  • Transformers-accelerated Libraries
    • MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
    • Support dynamic quantization matmul and post-op (commit 4cb9e4, cf0400, 9acfe1)
    • INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87)
    • Support dynamic quantization op (commit 6fcc15)
    • Add AVX2 kernels for Windows (commit bc313c)

Productivity

  • Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot
  • Enable Docker for Chatbot (commit 6b9522, 37b455)
  • Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f)
  • Update Torch and TensorFlow (commit f54817)
  • Add Harness evaluation for PyTorch text-generation/language-modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
  • Add summarization evaluation for PyTorch (commit 062e62)

Examples

  • Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
  • Electra fp32 & bf16 inference (commit e09c96)
  • GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
  • Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
  • ONNX whisper-large quantization (commit 038be0)
  • 8-layer MiniLM inference (commit 0dd104)
  • Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)

Bug Fixing

  • Fix Neural Engine error with gcc13 (commit 37a4a3) and GPU compilation error (commit 0f38eb)
  • Fix quantization for transfor...