# Introduction to Optimizing and Using Pretrained Language Models with Intel Extension for PyTorch (IPEX) and Smooth Quantization

In this notebook, we explore the process of loading a pre-trained language model, optimizing it using Intel Extension for PyTorch (IPEX), and generating text responses in a conversational AI context. This notebook demonstrates the practical use of model quantization and optimization techniques to improve performance on Intel hardware.

![smoothquant](https://miro.medium.com/v2/resize:fit:4800/format:webp/0*RH6ou7jL5Fw9KwGG.png)

### Learning Objectives:
1. Understand how to load and use pretrained language models with Hugging Face's `transformers` library.
2. Learn how to apply model optimization and quantization techniques using Intel Extension for PyTorch (IPEX).
3. Explore how to tokenize and stream inputs for text generation tasks.
4. Execute inference efficiently using a quantized, optimized model.

### Technology Summary:
- **Hugging Face Transformers:** Provides access to state-of-the-art pretrained models for various NLP tasks, such as text generation, classification, and more.
- **Intel Extension for PyTorch (IPEX):** A library that extends PyTorch with optimizations for Intel hardware, including optimized operators and quantization support for faster inference.
- **Quantization:** A technique to reduce model size and improve inference performance by using lower-precision data types for weights.
- **Text Generation:** Using a causal language model for generating text responses based on prompts provided by the user.

Through this notebook, you'll see how these technologies work together to build an efficient and powerful AI system.
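
To make the quantization idea concrete, here is a minimal pure-Python sketch that maps float weights to 4-bit integer codes and back. It is an illustration only: the helper names `quantize_uint4` and `dequantize_uint4` are hypothetical, the sketch uses one scale and offset for the whole list, and it does not reproduce how IPEX quantizes weights internally.

```python
def quantize_uint4(weights):
    """Map floats to integer codes in [0, 15] using one scale and offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15
    if scale == 0:
        scale = 1.0  # constant input: any scale works, avoid dividing by zero
    codes = [min(15, round((w - lo) / scale)) for w in weights]
    return codes, scale, lo


def dequantize_uint4(codes, scale, offset):
    """Recover approximate floats from the 4-bit codes."""
    return [offset + c * scale for c in codes]


weights = [-0.82, -0.11, 0.0, 0.37, 0.95]
codes, scale, offset = quantize_uint4(weights)
restored = dequantize_uint4(codes, scale, offset)

for w, c, r in zip(weights, codes, restored):
    print(f"weight={w:+.3f}  code={c:2d}  restored={r:+.3f}  error={abs(w - r):.3f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision traded away for the smaller storage size.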


## Environment Setup for the Notebook

In [None]:
import sys
!{sys.executable} -m pip install intel-extension-for-pytorch==2.2 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install transformers==4.35.2 --no-warn-script-location > /dev/null
!{sys.executable} -m pip install torch==2.2.0 --no-warn-script-location > /dev/null

In [None]:
# force restart kernel to pull latest environment
exit()

## Import Necessary Libraries

In this cell, we import essential libraries for model loading and optimization:

- **torch:** The PyTorch framework used for deep learning.
- **intel_extension_for_pytorch (ipex):** Intel's extension to PyTorch for optimizing deep learning models on Intel hardware.
- **transformers:** A library from Hugging Face that provides access to pretrained models and tokenizers.
    - **AutoTokenizer:** Automatically loads the appropriate tokenizer for the model.
    - **AutoModelForCausalLM:** Loads a pre-trained causal language model for tasks like text generation.
    - **TextStreamer:** A utility for streaming and handling text generation from the model.
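
Before importing, it can help to confirm that the restarted kernel sees the pinned package versions. The following is a small sketch using only the standard library; the `check_versions` helper and the `PINNED` dictionary are hypothetical names introduced here, and the prefix comparison is an assumption that tolerates build suffixes such as `+cpu`.

```python
from importlib import metadata

# Versions pinned in the setup cell of this notebook
PINNED = {
    "torch": "2.2.0",
    "intel-extension-for-pytorch": "2.2",
    "transformers": "4.35.2",
}


def check_versions(pinned):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for package, wanted in pinned.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        matches = installed is not None and installed.startswith(wanted)
        report[package] = (installed, matches)
    return report


for package, (installed, matches) in check_versions(PINNED).items():
    status = "ok" if matches else "check"
    print(f"{package}: installed={installed} [{status}]")
```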


In [None]:
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

## Load Pretrained Model and Tokenizer

In this cell, we load the pretrained `Intel/neural-chat-7b-v3-3` language model and its tokenizer from the Hugging Face Hub.

### Neural-Chat-v3-3 Model Overview

This model is a 7B-parameter LLM fine-tuned on the Intel Gaudi 2 processor from `Intel/neural-chat-7b-v3-1` on the `meta-math/MetaMathQA` dataset. It was aligned using the Direct Preference Optimization (DPO) method with `Intel/orca_dpo_pairs`. `Intel/neural-chat-7b-v3-1` was itself fine-tuned from `mistralai/Mistral-7B-v0.1`. For more information, refer to the blog post "The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2".

- **Base Model:** `mistralai/Mistral-7B-v0.1`
- **Context Length:** 8192 tokens
- **License:** Apache 2.0

You can use this model for various language-related tasks like math problem solving and generating coherent text. Check the [LLM Leaderboard](https://huggingface.co/datasets/open-llm-leaderboard/details_Intel__neural-chat-7b-v3-3) for evaluation results.
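
Because the model has roughly 7 billion parameters, it is worth estimating how much memory the weights alone need at different precisions before loading it. The sketch below is simple arithmetic under stated assumptions: a nominal 7 billion parameters (the exact count differs slightly), and no allowance for activations, the KV cache, or quantization metadata such as scales. The `weight_memory_gib` helper is a hypothetical name.

```python
# Bytes needed to store one weight at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

NUM_PARAMS = 7_000_000_000  # nominal "7B"; an assumption, not the exact count


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

On these assumptions, full-precision weights come to about 26 GiB, while 4-bit weights need a little over 3 GiB, which is the main motivation for the quantization step later in the notebook.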

In [None]:
# Hugging Face Hub identifier of the model to load
model_name = 'Intel/neural-chat-7b-v3-3'

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Model Optimization and Quantization with Intel Extension for PyTorch (IPEX)

This cell optimizes the model with Intel's IPEX library and applies weight-only quantization to speed up inference and reduce memory usage.

- **qconfig:** Configuration for weight-only quantization.
  - **weight_dtype:** Specifies the data type for the quantized weights. This notebook uses `torch.quint4x2` (4-bit); `torch.qint8` (8-bit) is the alternative.
  - **lowp_mode:** Specifies the lowest-precision data type allowed for computation. Options are `NONE`, `FP16`, `BF16`, and `INT8`.
- **checkpoint:** Optional parameter to load a pre-quantized checkpoint (e.g., INT4 or INT8).

- **model_ipex:** Optimizes the loaded model using IPEX, applying the specified quantization configuration.
  - **ipex.llm.optimize:** Optimizes the model for better performance on Intel hardware.

After optimization, the original `model` object is deleted to free memory.
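
`torch.quint4x2` refers to a packed layout in which two unsigned 4-bit values share one byte. The pure-Python sketch below illustrates that idea. The helper names `pack_uint4` and `unpack_uint4` are hypothetical, and the nibble order (first value in the high bits) is an arbitrary choice for illustration, not a statement about the layout PyTorch or IPEX uses internally.

```python
def pack_uint4(values):
    """Pack 4-bit integers (0..15) two per byte; pad with 0 if the count is odd."""
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("all values must fit in 4 bits (0..15)")
    padded = list(values) + [0] * (len(values) % 2)
    packed = bytearray()
    for first, second in zip(padded[0::2], padded[1::2]):
        packed.append((first << 4) | second)  # first value goes in the high nibble
    return bytes(packed)


def unpack_uint4(packed, count):
    """Recover the first `count` 4-bit integers from packed bytes."""
    values = []
    for byte in packed:
        values.append(byte >> 4)    # high nibble
        values.append(byte & 0x0F)  # low nibble
    return values[:count]


codes = [0, 6, 7, 10, 15]
packed = pack_uint4(codes)
print(f"{len(codes)} codes stored in {len(packed)} bytes")
assert unpack_uint4(packed, len(codes)) == codes
```

Packing halves the storage compared with one byte per weight, at the cost of a shift-and-mask step whenever a weight is read.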


In [None]:
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
  weight_dtype=torch.quint4x2, # or torch.qint8
  lowp_mode=ipex.quantization.WoqLowpMode.NONE, # or FP16, BF16, INT8
)
checkpoint = None # optionally load int4 or int8 checkpoint

# Optimize the model and apply weight-only quantization
model_ipex = ipex.llm.optimize(model, quantization_config=qconfig, low_precision_checkpoint=checkpoint)

del model 

## Preparing System Message and User Prompt for Model Input

In this cell, we define the system message and user prompt that guide the model's response, then tokenize the assembled prompt for the model.
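
The cell below assembles the prompt inline. As a sketch of the same idea, the template can be wrapped in a small helper so that different system messages and user prompts can be swapped in. The `build_prompt` function is a hypothetical name; it reuses the `### System:`, `### User:`, and `### Assistant:` markers from this notebook and, as an assumption, strips surrounding whitespace from each part.

```python
def build_prompt(system_message, user_prompt):
    """Assemble a chat prompt using the section markers from this notebook."""
    return (
        "### System:\n"
        f"{system_message.strip()}\n\n"
        "### User:\n"
        f"{user_prompt.strip()}\n\n"
        "### Assistant:\n"
    )


example = build_prompt(
    "You are a helpful, respectful and honest assistant.",
    "Can you tell me 5 fun facts about the universe?",
)
print(example)
```

Ending the prompt with the `### Assistant:` marker signals that the model should continue with the assistant's reply.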

In [None]:
system_message = """\n\n You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. If you don't know the answer to a question, please don't share false information."""
prompt = "\n\n You are an expert in astronomy. Can you tell me 5 fun facts about the universe?"
model_answer_1 = 'None'

prompt_template = f"""
### System:
{system_message}

### User:
{prompt}

### Assistant:
"""

# Tokenize the assembled prompt into a tensor of input IDs
inputs = tokenizer(prompt_template, return_tensors="pt").input_ids

## Generating Model Response with Streamer

In this cell, we generate a response from the optimized model using the provided input tokens and stream the output.

- **streamer:** Utilizes `TextStreamer` to handle the streaming of generated tokens.
  - **skip_prompt:** Ensures that the initial prompt is not repeated in the streamed output.

- **torch.inference_mode():** A context manager that disables gradient tracking, which reduces overhead when the model is used only for inference.

- **model_ipex.generate:** Generates a sequence of tokens based on the input provided.
  - **inputs:** The tokenized input prompt for the model.
  - **streamer:** Streams the output using the `TextStreamer`.
  - **max_new_tokens:** Limits the number of new tokens generated to 300.
  - **repetition_penalty:** Penalizes tokens that have already appeared in the sequence (here with a factor of 1.5), which discourages repetition and encourages more diverse output.
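
To show what the repetition penalty does to the model's scores, here is a minimal pure-Python sketch of the usual rule: for every token already in the sequence, a positive score is divided by the penalty and a negative score is multiplied by it, so the token becomes less likely either way. The `apply_repetition_penalty` helper and the toy scores are hypothetical; the real computation happens on tensors inside `generate`.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that already appear in the sequence."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]  # toy scores for a 4-token vocabulary
seen = [0, 1, 0]                # tokens 0 and 1 were already produced
print(apply_repetition_penalty(logits, seen, penalty=1.5))
```

A penalty of 1.0 leaves the scores unchanged; larger values suppress repeats more aggressively.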


In [None]:
# Print tokens to stdout as they are generated, without echoing the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)

with torch.inference_mode():
    tokens = model_ipex.generate(
        inputs,
        streamer=streamer,
        max_new_tokens=300,
        repetition_penalty=1.5,
    )

## Conclusion

In this notebook, we demonstrated how to load a pretrained language model, optimize it with Intel's IPEX library, and generate streamed text responses. Weight-only quantization reduces the memory footprint of the model's weights and can improve inference speed on Intel hardware. This workflow can be extended to a variety of use cases, from conversational AI to content generation.

### Key Takeaways:
- Pretrained models from Hugging Face can be optimized for performance using Intel's IPEX.
- Quantization is an effective technique for improving the speed and memory efficiency of models.
- Streaming text generation enables real-time responses in applications like chatbots or virtual assistants.
