Intel® Extension for Transformers provides a seamless user experience for model compression on Transformer-based models by extending the Hugging Face transformers APIs.

# Prepare Environment

Install Intel® Extension for Transformers:

In [5]:
!pip install intel-extension-for-transformers

Install the LLM runtime requirements:

In [None]:
!git clone https://github.com/intel/intel-extension-for-transformers.git
%cd ./intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/
!pip install -r requirements.txt
%cd ../../../
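
Before moving on, it can help to confirm that the install step actually registered the package. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Intel® Extension for Transformers, and the distribution name is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name as used in the pip install command above.
version = installed_version("intel-extension-for-transformers")
print(version if version else "intel-extension-for-transformers is not installed")
```

If the helper reports that the package is missing, rerun the install cell and restart the kernel before importing the library.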

# Run an LLM with the Transformers-extension API

You can use the Transformers-extension Python API to run a Hugging Face model in a few lines of code. Here is the sample code:

In [None]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

# Tokenize the prompt into input IDs
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
# Print tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)

# Load the model with 4-bit quantization enabled, then generate
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
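
To give a feel for what a 4-bit weight representation involves, here is a minimal pure-Python sketch of symmetric 4-bit quantization for a single group of weights. It is an illustration only: the notebook does not state which scheme `load_in_4bit=True` uses internally, so the group layout, the symmetric range, and the helper names `quantize_group_int4` and `dequantize_group` are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric 4-bit quantization of one weight group.

    Returns integer codes in [-8, 7] and the float scale shared by the group.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Toy weight group (made-up values, not taken from a real model).
group = [0.12, -0.70, 0.33, 0.05, -0.41, 0.68, -0.02, 0.25]
codes, scale = quantize_group_int4(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, scale, max_error)
```

Each weight is stored as a 4-bit code plus one shared scale per group, so the reconstruction error per weight is bounded by half the scale. That is the basic trade-off between memory footprint and accuracy that weight-only quantization makes.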

To directly load a GPTQ model, here is the sample code:

In [None]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# Download Hugging Face GPTQ model to local path
model_name = "PATH_TO_MODEL"  # local path to model
woq_config = WeightOnlyQuantConfig(use_gptq=True)
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
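
Because `model_name` above is a placeholder for a local directory, a quick sanity check before loading can save a confusing error later. The sketch below assumes only that a locally downloaded Hugging Face model directory contains a `config.json` file; the helper name `looks_like_local_model` is hypothetical, and the check says nothing about whether the weights are actually in GPTQ format.

```python
import tempfile
from pathlib import Path


def looks_like_local_model(path: str) -> bool:
    """Return True if `path` is a directory that contains a config.json file."""
    model_dir = Path(path)
    return model_dir.is_dir() and (model_dir / "config.json").is_file()


# Demonstrate the check on a temporary directory, before and after
# a config.json file is placed in it.
with tempfile.TemporaryDirectory() as tmp:
    empty_ok = looks_like_local_model(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    with_config_ok = looks_like_local_model(tmp)

print(empty_ok, with_config_ok)
```

In the notebook, calling `looks_like_local_model(model_name)` right after setting `model_name` would flag an unreplaced placeholder or a wrong path early.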

To use smooth quantization, here is the sample code:

In [None]:
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from intel_extension_for_transformers.transformers import SmoothQuantConfig
from transformers import AutoTokenizer
# SmoothQuant: alpha controls how much quantization difficulty is
# shifted from activations to weights
tokenizer = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-1")
recipes = {
    "smooth_quant": True,
    "smooth_quant_args": {"alpha": 0.5},
}
sq_config = SmoothQuantConfig(
    tokenizer=tokenizer,  # provide one of the two: tokenizer or calib_func
    calib_iters=5,
    recipes=recipes,
)
q_model = AutoModelForCausalLM.from_pretrained(
    "Intel/neural-chat-7b-v3-1",
    quantization_config=sq_config,
    use_neural_speed=False,
)
q_model.save("./saved_results")
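
The `alpha` value in the recipe is easier to interpret with a small worked example. The sketch below follows the per-channel smoothing formula from the SmoothQuant method, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), on toy matrices in pure Python. It illustrates the idea only and is not the library's implementation; the helper names and the toy values are assumptions made for this example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smoothing_scales(x: Matrix, w: Matrix, alpha: float) -> List[float]:
    """Per-input-channel scale: max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    act_max = [max(abs(v) for v in col) for col in zip(*x)]
    weight_max = [max(abs(v) for v in row) for row in w]
    return [a ** alpha / m ** (1.0 - alpha) for a, m in zip(act_max, weight_max)]


# Toy activations (2 tokens x 3 channels); channel 1 holds an outlier.
x = [[0.5, 40.0, -1.0],
     [-0.2, -36.0, 0.8]]
# Toy weights (3 input channels x 2 output channels).
w = [[0.3, -0.1],
     [0.02, 0.04],
     [-0.5, 0.2]]

s = smoothing_scales(x, w, alpha=0.5)
# Divide activations by s and multiply weights by s, channel by channel.
x_smooth = [[v / sj for v, sj in zip(row, s)] for row in x]
w_smooth = [[v * sj for v in row] for row, sj in zip(w, s)]

reference = matmul(x, w)
smoothed = matmul(x_smooth, w_smooth)
print(reference)
print(smoothed)
```

The two products agree up to floating-point error, while the activation outlier shrinks considerably after smoothing. A larger `alpha` moves more of the dynamic range from activations into the weights, and `alpha = 0.5` splits it evenly between the two.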