# QuickStart: Intel® Extension For Transformers*: NeuralChat on 4th Generation Intel® Xeon® Scalable Processors

## Prepare Environment

Follow the README to install the necessary requirements to run this tutorial. In summary, you will need to install the following:
-  Intel® Extension for Transformers* from source (to get the latest updates)
-  NeuralChat requirements
-  Retrieval Plugin Requirements
-  Audio Plugin (TTS and ASR) Requirements
  

Check hardware

In [None]:
!lscpu
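
The `lscpu` output can be long, so a small helper can make the relevant part easier to spot. The sketch below is an illustration rather than part of the original notebook: it assumes a Linux system and looks for the `amx_tile`, `amx_bf16`, and `amx_int8` flags in `/proc/cpuinfo`, which indicate Advanced Matrix Extensions (AMX) support on 4th Gen Intel® Xeon® Scalable Processors. The helper name `amx_support` is hypothetical.

```python
from pathlib import Path

# Flag names as they appear in the "flags" line of /proc/cpuinfo on Linux.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_support(cpuinfo_text: str) -> dict:
    """Map each AMX flag to True/False using the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            return {name: name in flags for name in AMX_FLAGS}
    # No flags line at all (for example, a non-x86 system).
    return {name: False for name in AMX_FLAGS}


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print(amx_support(cpuinfo.read_text()))
else:
    print("/proc/cpuinfo not found; check the lscpu output instead.")
```

If any flag is reported as `False`, the BF16 and INT8 examples may still run, but probably without AMX acceleration.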

Library imports

## Building a Simple Chatbot


Building a chatbot only requires the 3 lines of code below! By default, the model is Intel's Neural-Chat-7B-V3-1 model. Without any optimizations, the model runs in FP32.

In [None]:
# Build chatbot
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, GenerationConfig
config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1')
chatbot = build_chatbot(config)

# Perform inference/generate a response

gen_config = GenerationConfig(return_stats=True, format_version="v2")
results, _ = chatbot.predict_stream("Tell me about Intel Xeon Scalable Processors.", config=gen_config)
stream_text = ""
for text in results:
   stream_text += text
print(stream_text)
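
The loop above collects every chunk and prints the full text once generation has finished. To show text as soon as it arrives, each chunk can be printed inside the loop instead. The sketch below illustrates the pattern with a hypothetical stand-in generator, `fake_stream`, so that it runs without loading a model; the exact content of each chunk returned by `predict_stream` is not assumed here.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for the iterator returned by predict_stream."""
    yield from ["Intel ", "Xeon ", "Scalable ", "Processors"]


def print_as_it_arrives(chunks: Iterable[str]) -> str:
    """Print each chunk immediately and return the concatenated text."""
    parts = []
    for chunk in chunks:
        # flush=True pushes the chunk to the output without waiting for a newline.
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = print_as_it_arrives(fake_stream())
```

With a real chatbot, the `results` iterator from the cell above would take the place of `fake_stream()`.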



## Optimizing your Chatbot
Enable mixed precision with bfloat16 (BF16). A lower-precision data type reduces memory usage and speeds up inference, typically with little loss of accuracy: BF16 keeps the same 8-bit exponent as FP32, so it covers the same numeric range, but it has fewer mantissa bits (7 instead of 23) and therefore less precision. Starting with 4th Gen Intel® Xeon® Scalable Processors, the Advanced Matrix Extensions (AMX) instruction set accelerates matrix operations in BF16 and 8-bit integer (INT8) formats.
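
To make the range-versus-precision trade-off concrete, the following sketch rounds a Python float to the nearest BF16 value by keeping only the upper 16 bits of its FP32 encoding (with round-to-nearest-even). It is a standalone illustration using the standard library only, not how the library performs the conversion; the helper name `to_bf16` is hypothetical and NaN handling is omitted.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that will be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


# Small differences are lost: BF16 values near 1.0 are spaced 2**-7 apart.
print(to_bf16(1.001))
# Large magnitudes survive: the exponent range matches FP32.
print(to_bf16(3.0e38))
```

Here `1.001` rounds to `1.0`, while `3.0e38` stays finite, which is far beyond the largest FP16 value of 65504.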

In [None]:
# Build chatbot in BF16
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, GenerationConfig
from intel_extension_for_transformers.transformers import MixedPrecisionConfig
mix_config = MixedPrecisionConfig(dtype="bfloat16")
config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1',
                        optimization_config=mix_config)
chatbot = build_chatbot(config)

# Perform inference/generate a response
gen_config = GenerationConfig(return_stats=True, format_version="v2")
results, _ = chatbot.predict_stream("Tell me about Intel Xeon Scalable Processors.", config=gen_config)
stream_text = ""
for text in results:
   stream_text += text
print(stream_text)


INT4 weight-only quantization can further reduce memory usage and speed up inference without much loss of accuracy. Note that _compute_dtype_ is "int8" because INT8 is the lowest precision that AMX supports: the weights are stored in INT4, while the computation runs in INT8.
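
As a rough back-of-the-envelope check of the memory savings, the sketch below estimates how much memory the weights alone occupy at each precision. It assumes a round figure of 7 billion parameters for a "7B" model and ignores activations, the key-value cache, and the extra scales that quantized formats store, so real numbers will differ.

```python
# Storage cost of one weight in each format, in bytes.
BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 2**30


NUM_PARAMS = 7_000_000_000  # assumption: round figure for a "7B" model

for dtype in ("fp32", "bf16", "int4"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions BF16 halves the FP32 footprint and INT4 weights take one eighth of it.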

In [None]:
# Build chatbot with INT4 weight-only quantization, computations in AMX INT8
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, GenerationConfig
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
from intel_extension_for_transformers.neural_chat.config import LoadingModelConfig
config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1',
                        optimization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4_fullrange"), 
                        loading_config=LoadingModelConfig(use_neural_speed=False))
chatbot = build_chatbot(config)

# Perform inference/generate a response
gen_config = GenerationConfig(return_stats=True, format_version="v2")
results, _ = chatbot.predict_stream("Tell me about Intel Xeon Scalable Processors.", config=gen_config)
stream_text = ""
for text in results:
   stream_text += text
print(stream_text)
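
For intuition about what weight-only quantization does to the weights, here is a minimal sketch of a generic symmetric 4-bit scheme with one scale per group of weights. It is an illustration only and is not claimed to match the `int4_fullrange` format or the kernels used by the library; the function names and the group size are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float], group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize weights to signed 4-bit integers with one scale per group."""
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to the integer 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Clamp to the signed 4-bit range [-8, 7].
            q_values.append(max(-8, min(7, round(w / scale))))
    return q_values, scales


def dequantize_int4(q_values: List[int], scales: List[float], group_size: int = 4) -> List[float]:
    """Recover approximate weights from the integers and their group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


weights = [0.10, -0.25, 0.03, 0.70, -1.20, 0.40, 0.05, -0.60]
q_values, scales = quantize_int4(weights)
restored = dequantize_int4(q_values, scales)
print(q_values)
print([round(w, 3) for w in restored])
```

Each restored weight differs from the original by at most half of its group's scale, which is why smaller groups generally preserve accuracy better at the cost of storing more scales.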

## Swapping out Models: Llama2 Example
You can swap out the Neural-Chat-7B model for another transformer model from [HuggingFace](https://huggingface.co/models), including the most popular LLMs. Pass the model ID shown on its model card as the _model_name_or_path_ argument. For example, the cell below builds a chatbot using Llama2 in BF16 (omit _optimization_config_ to run in FP32). *NOTE:* You may need to log in to HuggingFace to get access to this model. To do so, use the command _huggingface-cli login_.

In [None]:
# OPTIONAL: log in to HuggingFace to access Llama2
#!huggingface-cli login --token <@TODO: enter in HF token here> --add-to-git-credential

In [None]:
# Build chatbot in BF16 using Llama2
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import MixedPrecisionConfig
config = PipelineConfig(model_name_or_path='meta-llama/Llama-2-7b-chat-hf',
                        optimization_config=MixedPrecisionConfig(dtype='bfloat16'))
chatbot = build_chatbot(config)

# Perform inference/generate a response
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)

## Customizing your Chatbot
### Plugin: Retrieval
Without the retrieval plugin, the chatbot answers the question below incorrectly, most likely because the event is more recent than the model's training data.

In [None]:
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1')
chatbot = build_chatbot(config)
response = chatbot.predict(query="Who won Super Bowl 58 and what was the score?")
print(response)

The retrieval plugin allows you to specify a file or a folder of files with information you want your chatbot to look up before outputting the final response. Here, _sample_workshop.txt_ contains the correct answer. For more information about the retrieval plugin and the file types supported, go to the [Retrieval README](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).

The input path can point to a single file or to a folder. In this example, the file is placed inside a folder _docs_.

In [None]:
!mkdir -p docs
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample_workshop.txt
!mv sample_workshop.txt ./docs

In [None]:
!cat ./docs/sample_workshop.txt

In [None]:
# Build chatbot with retrieval
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import plugins
plugins.retrieval.enable=True
plugins.retrieval.args["input_path"]="./docs/sample_workshop.txt"
config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1',
                        plugins=plugins)
chatbot = build_chatbot(config)
response = chatbot.predict(query="Who won Super Bowl 58 and what was the score?")
print(response)

plugins.retrieval.enable=False # disable retrieval
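
Conceptually, retrieval means finding the passages most relevant to the query and placing them in the prompt ahead of the question. The toy sketch below ranks passages by simple word overlap to illustrate the idea. It is not the plugin's implementation, and the document chunks, function names, and prompt layout are all hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k chunks that share the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: List[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical document chunks, not the contents of sample_workshop.txt.
chunks = [
    "The workshop starts at 9 am in room B.",
    "Lunch is served at noon in the cafeteria.",
    "The final session covers model quantization.",
]
print(build_prompt("When does the workshop start?", chunks))
```

The model then answers from the supplied context instead of relying only on what it memorized during training.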

### Plugin: ASR & TTS
The ASR (automatic speech recognition) and TTS (text-to-speech) plugins enable voice chat for a more interactive experience. Instead of passing in text and getting text responses, you can pass in audio files and get audio files in response.

In [None]:
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav

In [None]:
# Build chatbot with ASR and TTS plugins
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.neural_chat import plugins
plugins.tts.enable = True
plugins.tts.args["output_audio_path"] = "./response.wav"
plugins.asr.enable = True

config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1',
                        plugins=plugins)
chatbot = build_chatbot(config)
result = chatbot.predict(query="./sample.wav")
print(result)

plugins.tts.enable = False # disable tts
plugins.asr.enable = False # disable asr

Play the response inline with the cell below, or open the audio files in your own audio player to hear the query and response.

In [None]:
# Play the generated response inline in the notebook
from IPython.display import Audio
Audio(filename="./response.wav")
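
If the audio does not play, it can help to inspect the file itself. The sketch below reads basic properties of a WAV file with Python's standard `wave` module. It assumes uncompressed PCM audio, which is the only kind that module reads, and to stay self-contained it writes and inspects a short test tone in a temporary directory; the helper names are hypothetical.

```python
import math
import os
import struct
import tempfile
import wave


def wav_info(path: str) -> dict:
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": wav.getnframes() / rate,
        }


def write_test_tone(path: str, seconds: float = 0.5, rate: int = 16000, freq: float = 440.0) -> None:
    """Write a mono 16-bit sine tone, used here only as demo input."""
    n_samples = int(seconds * rate)
    samples = (int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n_samples))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


demo_path = os.path.join(tempfile.gettempdir(), "neuralchat_demo_tone.wav")
write_test_tone(demo_path)
info = wav_info(demo_path)
print(info)
```

Calling `wav_info("./sample.wav")` or `wav_info("./response.wav")` would report the same properties for the tutorial's files, provided they are PCM WAV.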

### [Optional]: Fine-tuning

We use the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca) from Stanford University as the general domain dataset to fine-tune the model. This dataset is provided in the form of a JSON file, [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). In Alpaca, researchers manually crafted 175 seed tasks to guide `text-davinci-003` in generating 52K instruction-following examples for diverse tasks.

In [None]:
!curl -OL https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json
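
Each record in `alpaca_data.json` has three fields: `instruction`, `input` (which may be empty), and `output`. The sketch below shows how such a record can be turned into a prompt using the templates published in the Stanford Alpaca repository. The example record is hypothetical rather than taken from the dataset, and the preprocessing that NeuralChat applies during fine-tuning may differ from this.

```python
# Prompt templates as published in the Stanford Alpaca repository.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_example(record: dict) -> str:
    """Build the training prompt for one Alpaca-format record."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=record["input"])
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Hypothetical record in Alpaca format (not copied from the dataset).
record = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
prompt = format_example(record)
print(prompt)
print(record["output"])  # the target text the model learns to generate
```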

Fine-tune the model on the Alpaca-format dataset for text generation.

We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to fine-tune the LLM efficiently.
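
LoRA keeps the pretrained weight matrix W frozen and learns a low-rank update, so the adapted weight is W + (alpha / r) · B · A, where B has shape (d_out, r) and A has shape (r, d_in). The sketch below illustrates the idea in plain Python with tiny matrices and compares parameter counts. The layer size of 4096, the rank of 8, and the function names are illustrative assumptions, not values read from `FinetuningArguments`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable parameters in the A and B matrices."""
    return rank * (d_out + d_in)


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen weight, shape (2, 3)
a = [[0.5, -0.5, 1.0]]                  # trainable A, shape (r=1, 3)
b_init = [[0.0], [0.0]]                 # B starts at zero, shape (2, r=1)

# With B at zero, the adapted weight equals the pretrained weight.
print(lora_weight(w, a, b_init, alpha=2.0, rank=1))

# Trainable parameters for one illustrative 4096 x 4096 layer at rank 8.
print(4096 * 4096, lora_param_count(4096, 4096, 8))
```

Because B is initialized to zero, training starts from the pretrained model's behavior, and only the small A and B matrices need gradients and optimizer state.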

In [None]:
from transformers import TrainingArguments
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model
model_args = ModelArguments(model_name_or_path='Intel/neural-chat-7b-v3-1')
data_args = DataArguments(train_file="alpaca_data.json")
training_args = TrainingArguments(
    output_dir='./finetuned_model_path',
    do_train=True,
    do_eval=False,
    num_train_epochs=3,
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    save_strategy="no",
    log_level="info",
    save_total_limit=2,
    bf16=True
)
finetune_args = FinetuningArguments()
finetune_cfg = TextGenerationFinetuningConfig(
            model_args=model_args,
            data_args=data_args,
            training_args=training_args,
            finetune_args=finetune_args,
        )
finetune_model(finetune_cfg)

Load the fine-tuned model.

In [None]:
from intel_extension_for_transformers.neural_chat import build_chatbot
from intel_extension_for_transformers.neural_chat import PipelineConfig
from intel_extension_for_transformers.neural_chat.config import LoadingModelConfig

config = PipelineConfig(model_name_or_path='Intel/neural-chat-7b-v3-1',
                      loading_config=LoadingModelConfig(peft_path="./finetuned_model_path"))
chatbot = build_chatbot(config)
response = chatbot.predict(query="Tell me about Intel Xeon Scalable Processors.")
print(response)

### Congratulations! You have completed the NeuralChat quickstart. Now go build your own custom chatbots!
Visit the [notebooks directory](https://github.com/intel/intel-extension-for-transformers/blob/c30353fcb0e5ceab440a7508b5980ccebcac8750/intel_extension_for_transformers/neural_chat/docs/full_notebooks.md) to see more examples.