# ITREX - Leveraging Intel Optimizations for Enhanced Inference with Hugging Face

<img src="https://miro.medium.com/v2/resize:fit:1400/1*AfoKgyTN6l7xNg30GmbsLg.png" alt="Alt Text" style="width: 800px;"/>

Welcome to this developer-focused workshop, where we explore the integration of Intel extensions with Hugging Face models for optimized inference. The goal of this notebook is to demonstrate how developers can use Intel's extensions to achieve efficient and performant inference in production applications.

## Why Intel Optimizations Matter

In machine learning, and in NLP in particular, fast and memory-efficient inference is often what decides whether a model is usable in production. With the Intel Extension for Transformers (ITREX), we can load models directly from the Hugging Face Hub, such as "Intel/neural-chat-7b-v1-1", and optimize them for high-performance inference on Intel hardware.

### Key Learning Points

- **Model Optimization**: Learn how to load and optimize Hugging Face models using the Intel Extension for Transformers APIs.
- **Streaming Output**: We'll use the `TextStreamer` class from Hugging Face Transformers to print tokens as they are generated, so the user sees text appear incrementally instead of waiting for one large dump at the end.
- **Intel's Neural Chat Model**: Explore the "Intel/neural-chat-7b-v1-1" model, fine-tuned on Gaudi 2 processors, to understand its capabilities in generating text based on input prompts.
- **Practical Application**: Understand how these optimizations can be applied in real-world scenarios to deliver performant inference with minimal code.

By the end of this notebook, you'll have a practical understanding of how to apply Intel's optimizations to Hugging Face models for efficient inference.

Let's dive in and explore the power of optimized model inference!


In [1]:
!source /opt/intel/oneapi/setvars.sh #comment out if not running on Intel Developer Cloud Jupyter
!pip install transformers==4.35.2
!pip install intel_extension_for_transformers==1.2.2
!pip install intel_extension_for_pytorch==2.1.100
!pip install tqdm
!pip install einops
!pip install neural_speed==0.2
!pip install torch==2.1.1

 
   To force a re-execution of setvars.sh, use the '--force' option.
   Using '--force' can result in excessive use of your environment variables.
  
usage: source setvars.sh [--force] [--config=file] [--help] [...]
  --force        Force setvars.sh to re-run, doing so may overload environment.
  --config=file  Customize env vars using a setvars.sh configuration file.
  --help         Display this help message and exit.
  ...            Additional args are passed to individual env/vars.sh scripts
                 and should follow this script's arguments.
  
  Some POSIX shells do not accept command-line options. In that case, you can pass
  command-line options via the SETVARS_ARGS environment variable. For example:
  
  $ SETVARS_ARGS="ia32 --config=config.txt" ; export SETVARS_ARGS
  $ . path/to/setvars.sh
  
  The SETVARS_ARGS environment variable is cleared on exiting setvars.sh.
  
Defaulting to user installation because normal site-packages is not writeable

#### Importing Required Libraries

This cell sets the foundation for our model optimization and text generation tasks. We import:
- `AutoTokenizer` and `TextStreamer` from Hugging Face's `transformers` library, used to tokenize the input text and to stream the model's output as it is generated.
- `AutoModelForCausalLM` from `intel_extension_for_transformers.transformers`, which mirrors the Hugging Face class of the same name but adds Intel-specific optimizations such as low-bit quantization.

In [2]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


#### Model and Prompt Setup

Here, we specify the model and the initial text prompt for our text generation task.
- `model_name`: We set this to "Intel/neural-chat-7b-v1-1", a model fine-tuned on Intel's hardware, available on the Hugging Face model hub.
- `prompt`: This is our starting text for the model to generate from, setting the context for the text generation.


In [3]:
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a fisherman at sea,"

#### Tokenizer Initialization and Input Preparation

In this cell, we initialize the tokenizer with our chosen model and prepare our input text for the model.
- `tokenizer`: Loaded with the `AutoTokenizer.from_pretrained` method, tailored for our specific model.
- `inputs`: The prompt is tokenized to be fed into the model.
- `streamer`: An instance of `TextStreamer` is created with our tokenizer. During generation it decodes tokens and prints them to standard output as they arrive.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
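
To make the streaming idea concrete, here is a minimal sketch of the interface a streamer exposes to `generate`: a `put` method that receives token IDs and an `end` method called when generation finishes. Everything in it (`ToyStreamer`, `toy_generate`, and the four-entry `VOCAB`) is hypothetical and stands in for the real tokenizer and model; it only illustrates the call pattern, not how `TextStreamer` is implemented internally.

```python
import sys


class ToyStreamer:
    """Hypothetical stand-in that mimics a streamer's put()/end() interface."""

    def __init__(self, vocab, out=sys.stdout):
        self.vocab = vocab      # toy mapping: token id -> text piece
        self.out = out
        self.pieces = []        # everything written so far, kept for inspection

    def put(self, token_ids):
        # Decode the incoming ids and write them immediately.
        text = "".join(self.vocab[t] for t in token_ids)
        self.pieces.append(text)
        self.out.write(text)
        self.out.flush()

    def end(self):
        # Called once when generation is finished.
        self.out.write("\n")
        self.out.flush()


def toy_generate(prompt_ids, new_ids, streamer):
    """Hypothetical generate loop: push the prompt, then one token at a time."""
    streamer.put(prompt_ids)
    for token_id in new_ids:
        streamer.put([token_id])
    streamer.end()
    # Like a real generate call, return the full sequence of token ids.
    return prompt_ids + new_ids


VOCAB = {0: "Once", 1: " upon", 2: " a", 3: " time"}
streamer = ToyStreamer(VOCAB)
result = toy_generate([0, 1], [2, 3], streamer)
```

The loop also returns the complete list of token IDs, which is why a notebook cell that calls `generate` with a streamer shows both the streamed text and a list of integers.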

#### Model Loading and Text Generation

This is where the action happens:
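- We load our model using `AutoModelForCausalLM.from_pretrained` with `load_in_4bit=True`, which applies 4-bit weight-only quantization (visible in the log output as "Applying Weight Only Quantization") to reduce the memory footprint and speed up inference on CPU.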
- The model's `generate` function is called with the `streamer` parameter, which enables streaming output of the text. We set `max_new_tokens` to 300 to control the length of the generated text.

In [5]:
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

2024-02-02 08:05:37 [INFO] CPU device is used.
2024-02-02 08:05:37 [INFO] Applying Weight Only Quantization.
2024-02-02 08:05:37 [INFO] Using LLM runtime.


runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin existed, will use cache file. Otherwise please remove the file
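
As a rough illustration of what 4-bit weight-only quantization means, the sketch below quantizes a list of float weights in groups, storing one scale per group and one small integer per weight. The group size of 32 is an assumption read off the `g32` suffix in the cache file name, and the symmetric rounding scheme is also an assumption chosen for simplicity. This is not the kernel the Intel runtime actually uses.

```python
import random


def quantize_group(weights):
    """Symmetric int4 (assumed scheme): integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def quantize(weights, group_size=32):
    """Split the weights into groups and quantize each group independently."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Recover approximate float weights from (integers, scale) pairs."""
    restored = []
    for q, scale in groups:
        restored.extend(v * scale for v in q)
    return restored


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(64)]  # toy weights, not model data
groups = quantize(weights)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.4f}")
```

Each weight is reconstructed to within half a quantization step of its group, which is the accuracy traded for storing 4 bits per weight instead of 16 or 32.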


In [6]:
model.generate(inputs, streamer=streamer, max_new_tokens=300)

Once upon a time, there existed a fisherman at sea, who had 

model.cpp: loading model from runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin
init: n_vocab    = 50279
init: n_embd     = 4096
init: n_mult     = 4096
init: n_head     = 32
init: n_layer    = 32
init: n_rot      = 32
init: n_ff       = 16384
init: n_parts    = 1
load: ne ctx size = 4737.55 MB
load: mem required  = 12929.55 MB (+ memory per state)
..................................................................................................
model_init_from_file: support_jblas_kv = 1
model_init_from_file: kv self size =  276.00 MB


been fishing for a long time. He had fished in many different places, but he had never fished in this particular place. He had heard stories about the fish that were in this place, but he had never seen them himself.

One day, he decided to go fishing in this place. He had heard that there were a lot of fish in this place, so he was excited. He had never fished in this place before, so he was a little nervous. He had heard that the fish were very big, so he was excited.

He went out to sea, and he started fishing. He caught a lot of fish, and he was very happy. He had never fished in this place before, but he had heard about it, and he was glad that he had decided to go.

The fisherman was very happy with his catch, and he decided to go back to the same place the next day. He was very excited to go back, because he had heard that there were even more fish in this place. He was very happy with his catch, and he was glad that he had decided to go.

The fisherman went back to the same pla

[[10758, 2220, 247, 673, 13, 627, 13164, 247, 27633, 1342, 387, 6150, 13, 665, 574, 644,
  15133, 323, 247, 1048, 673, 15, 754, 574, 269, 1428, 275, 1142, 1027, 5053, 13, 533,
  344, 574, 1620, 269, 1428, 275, 436, 1798, 1659, 15, 754, 574, 3735, 6281, 670, 253,
  6773, 326, 497, 275, 436, 1659, 13, 533, 344, 574, 1620, 2326, 731, 2994, 15, 187,
  187, 4041, 1388, 13, 344, 4425, 281, 564, 15133, 275, 436, 1659, 15, 754, 574, 3735,
  326, 627, 497, 247, 2257, 273, 6773, 275, 436, 1659, 13, 594, 344, 369, 9049, 15,
  754, 574, 1620, 269, 1428, 275, 436, 1659, 1078, 13, 594, 344, 369, 247, 1652, 11219,
  15, 754, 574, 3735, 326, 253, 6773, 497, 1077, 1943, 13, 594, 344, 369, 9049, 15,
  187, 187, 1328, 2427, 562, 281, 6150, 13, 285, 34
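
The `init:` lines logged during generation (`n_vocab`, `n_embd`, `n_layer`, `n_ff`) are enough for a back-of-the-envelope estimate of the model's parameter count and of how much weight storage shrinks at 4 bits. The sketch below assumes a standard decoder layout (four `n_embd × n_embd` attention projections and a two-matrix feed-forward block per layer, plus one embedding table) and ignores normalization parameters and quantization scales, so treat the result as an estimate rather than the runtime's exact figure.

```python
# Values copied from the "init:" lines of the generation log.
N_VOCAB, N_EMBD, N_LAYER, N_FF = 50279, 4096, 32, 16384


def estimate_params(n_vocab, n_embd, n_layer, n_ff):
    """Rough parameter count under the assumed decoder layout."""
    attention = 4 * n_embd * n_embd    # Q, K, V and output projections
    feed_forward = 2 * n_embd * n_ff   # up and down projections
    embedding = n_vocab * n_embd       # token embedding table, counted once
    return n_layer * (attention + feed_forward) + embedding


def weight_bytes(params, bits_per_weight):
    """Storage for the weights alone at a given precision."""
    return params * bits_per_weight // 8


params = estimate_params(N_VOCAB, N_EMBD, N_LAYER, N_FF)
for bits in (32, 16, 4):
    gib = weight_bytes(params, bits) / 2**30
    print(f"{bits:>2}-bit weights: about {gib:.1f} GiB")
```

Under these assumptions the model has roughly 6.6 billion parameters, and 4-bit weights need a quarter of the storage of 16-bit weights. The sizes the runtime reports will differ somewhat, since they also cover per-group scales and other buffers.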

# Conclusion and Discussion

### Conclusion

This workshop demonstrated how Intel optimizations can be applied on top of Hugging Face's Transformers library. We loaded a model from the Hugging Face Hub through the Intel Extension for Transformers, tokenized a prompt, applied 4-bit weight-only quantization at load time, and streamed the generated text token by token.

### Discussion

The skills and knowledge gained here are essential for developers looking to implement optimized NLP models in production environments. The ability to generate text in a streamed manner and leverage Intel's hardware optimizations showcases the potential for building responsive and efficient AI-powered applications.

As we continue to advance in the field of AI, understanding and applying such optimizations will be crucial for developers to stay ahead in creating high-performance, scalable, and user-friendly applications.
