# ITREX - Leveraging Intel Optimizations for Enhanced Inference with Hugging Face

<img src="https://miro.medium.com/v2/resize:fit:1400/1*AfoKgyTN6l7xNg30GmbsLg.png" alt="Alt Text" style="width: 800px;"/>

Welcome to this developer-focused workshop, where we explore the integration of Intel extensions with Hugging Face models for optimized inference. The goal of this notebook is to demonstrate how developers can use Intel's extensions to achieve efficient and performant inference in production applications.

## Why Intel Optimizations Matter

Fast, efficient inference is critical for NLP applications. With the Intel Extension for Transformers, we can load models such as "Intel/neural-chat-7b-v1-1" directly from the Hugging Face Hub and optimize them for high-performance inference.

### Key Learning Points

- **Model Optimization**: Learn how to load and optimize Hugging Face models using the Intel Neural Compressor and Intel Extension for Transformers APIs.
- **Streaming Output**: We'll use `TextStreamer` from Hugging Face Transformers to print tokens as they are generated, so the user sees text appear continuously instead of waiting for one large block at the end.
- **Intel's Neural Chat Model**: Explore the "Intel/neural-chat-7b-v1-1" model, fine-tuned on Gaudi 2 processors, to understand its capabilities in generating text based on input prompts.
- **Practical Application**: Understand how these optimizations can be applied in real-world scenarios to deliver performant inference with minimal code.

By the end of this notebook, you'll have a practical understanding of how to apply Intel's optimizations to Hugging Face models for efficient inference.

Let's dive in and explore the power of optimized model inference!


In [None]:
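# Comment out the next line if you are not running on Intel Developer Cloud Jupyter.
# Note: each `!` command runs in its own subshell, so variables set by this script do not persist to later commands.
!source /opt/intel/oneapi/setvars.sh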
!pip install transformers==4.34.1
!pip install intel_extension_for_transformers==1.2.2
!pip install intel_extension_for_pytorch==2.1.100
!pip install tqdm
!pip install einops
!pip install neural_speed==0.2
!pip install torch==2.1.1

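Because the cell above pins exact versions, it can help to confirm what actually got installed before moving on. The following is a minimal sketch using only the standard library; `PINNED` and `check_pins` are illustrative names introduced here, not part of any Intel or Hugging Face API. It assumes that a local build suffix such as `+cpu` should be ignored when comparing versions.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {
    "transformers": "4.34.1",
    "intel_extension_for_transformers": "1.2.2",
    "intel_extension_for_pytorch": "2.1.100",
    "neural_speed": "0.2",
    "torch": "2.1.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cpu" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package matches the version requested in the install cell.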
#### Importing Required Libraries

This cell sets the foundation for our model optimization and text generation tasks. We import:
- `AutoTokenizer` and `TextStreamer` from Hugging Face's `transformers` library, used to tokenize the input text and to stream the model's output.
- `AutoModelForCausalLM` from `intel_extension_for_transformers`, an extended version of the Transformers class of the same name that adds optimizations for Intel hardware, such as the low-bit loading used later in this notebook.

In [None]:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

#### Model and Prompt Setup

Here, we specify the model and the initial text prompt for our text generation task.
- `model_name`: We set this to "Intel/neural-chat-7b-v1-1", a model fine-tuned on Intel's hardware, available on the Hugging Face model hub.
- `prompt`: The starting text that the model will continue.


In [None]:
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a fisherman at sea,"

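Before loading a model of this size, a rough estimate of its weight storage at different precisions can be useful. The sketch below is simple arithmetic, assuming that the "7b" in the model name means about 7 billion parameters; `weight_memory_gb` and `N_PARAMS` are names made up for this illustration. It ignores quantization metadata, activations, and the key/value cache, so real memory use will be higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "7b" in the model name means roughly 7 billion parameters.
N_PARAMS = 7_000_000_000

for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights take about 28 GB in 32-bit floats and about 3.5 GB at 4 bits per weight.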
#### Tokenizer Initialization and Input Preparation

In this cell, we initialize the tokenizer with our chosen model and prepare our input text for the model.
- `tokenizer`: Loaded with the `AutoTokenizer.from_pretrained` method, tailored for our specific model.
- `inputs`: The prompt, tokenized into a PyTorch tensor of input IDs (`return_tensors="pt"`) that can be fed into the model.
- `streamer`: An instance of `TextStreamer` built from our tokenizer. It decodes tokens and prints them to standard output as they are generated.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

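To make the streaming idea concrete, here is a toy sketch of the pattern a streamer follows: the generation loop hands it token IDs through `put`, and calls `end` when it is done. `ToyStreamer`, `toy_generate`, and the tiny vocabulary are hypothetical stand-ins written for this illustration; they are not the Transformers implementation, which also buffers partial words before printing.

```python
class ToyStreamer:
    """Hypothetical stand-in for a streamer: prints text as token ids arrive."""

    def __init__(self, vocab, skip_prompt=False):
        self.vocab = vocab              # maps token id -> text piece
        self.skip_prompt = skip_prompt  # if True, do not echo the prompt
        self.next_is_prompt = True      # the first put() call carries the prompt
        self.pieces = []                # everything emitted so far

    def put(self, token_ids):
        """Receive new token ids and emit their text immediately."""
        if self.skip_prompt and self.next_is_prompt:
            self.next_is_prompt = False
            return
        self.next_is_prompt = False
        text = "".join(self.vocab[i] for i in token_ids)
        self.pieces.append(text)
        print(text, end="", flush=True)

    def end(self):
        """Called once generation finishes; terminate the output line."""
        print()
        self.next_is_prompt = True


def toy_generate(prompt_ids, new_ids, streamer):
    """Mimic a generate loop: push the prompt, then one token per step."""
    streamer.put(prompt_ids)
    for token_id in new_ids:
        streamer.put([token_id])
    streamer.end()
    return prompt_ids + list(new_ids)


toy_vocab = {0: "Once", 1: " upon", 2: " a", 3: " time", 4: ",", 5: " there", 6: " was"}
toy_streamer = ToyStreamer(toy_vocab, skip_prompt=True)
toy_output = toy_generate([0, 1, 2, 3, 4], [5, 6], toy_streamer)
```

The names are prefixed with `toy_` so that running this cell does not overwrite the `streamer` created above.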
#### Model Loading and Text Generation

This is where the action happens:
- We load our model using `AutoModelForCausalLM.from_pretrained` with `load_in_4bit=True`, which loads the weights in 4-bit precision to reduce memory use and speed up inference.
- The model's `generate` function is called with the `streamer` parameter, which streams the text as it is produced. We set `max_new_tokens` to 300 to cap the length of the generated text.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

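To give a feel for what storing weights in 4 bits involves, here is a generic sketch of symmetric, group-wise quantization in plain Python. This is an illustration of the general idea only: the notebook does not state which data type, group size, or algorithm `load_in_4bit=True` uses, and `quantize_4bit`, `dequantize_4bit`, and the sample weights are made up for this example.

```python
def quantize_4bit(weights, group_size=4):
    """Symmetric group-wise quantization to signed 4-bit integers in [-8, 7]."""
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # one float scale per group
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize_4bit(q_values, scales, group_size=4):
    """Recover approximate floats: each integer times its group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


toy_weights = [0.12, -0.50, 0.33, 0.07, 1.40, -0.90, 0.05, 0.70]
codes, group_scales = quantize_4bit(toy_weights)
restored = dequantize_4bit(codes, group_scales)
max_error = max(abs(a - b) for a, b in zip(toy_weights, restored))
print("4-bit codes:", codes)
print("max abs error:", round(max_error, 4))
```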
In [None]:
output = model.generate(inputs, streamer=streamer, max_new_tokens=300)

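Since the point of these optimizations is faster inference, it is worth measuring generation speed rather than assuming it. The helper below is a minimal sketch that times one call and reports new tokens per second; `time_generation` and `fake_generate` are hypothetical names, and the fake function stands in for a real model call so the cell runs on its own.

```python
import time


def time_generation(generate_fn, prompt_len):
    """Run generate_fn() once; return (new_token_count, seconds, tokens_per_second)."""
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_len  # the output includes the prompt
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, elapsed, rate


def fake_generate():
    """Hypothetical stand-in for a model call: 5 prompt ids plus 20 new ids."""
    time.sleep(0.01)
    return list(range(25))


n_new, seconds, rate = time_generation(fake_generate, prompt_len=5)
print(f"{n_new} new tokens in {seconds:.3f} s ({rate:.1f} tokens/s)")
```

With the objects from this notebook, one could try passing `lambda: model.generate(inputs, max_new_tokens=300)[0]` and `prompt_len=inputs.shape[1]`, assuming `generate` returns one sequence of token IDs per prompt with the prompt included.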
# Conclusion and Discussion

### Conclusion

This workshop demonstrated how Intel optimizations can be applied alongside Hugging Face's Transformers library. We walked through tokenization, 4-bit model loading, and streamed text generation using the Intel Extension for Transformers.

### Discussion

These techniques are useful for developers deploying optimized NLP models in production environments. Streaming generated text while taking advantage of Intel's hardware optimizations makes it possible to build responsive and efficient AI-powered applications.

As the field of AI advances, understanding and applying such optimizations will help developers build high-performance, scalable, and user-friendly applications.
