# Chapter 6: Chat with Any Large Language Model! 💬

In this final lesson, we will build a chatbot application around an open source LLM. We will use one of the best open source models, Falcon 40B.

You may have chatted with ChatGPT, but it is expensive to run and its interaction patterns are rigid. A custom LLM can be run locally, fine-tuned on your own data, or run in the cloud at a lower cost. In this lesson, we will use an inference endpoint running `falcon-40b-instruct`.

The model can also be run locally with the Text Generation Inference library. Gradio can build interfaces for API-based LLMs as well, so it is not limited to open source models. In this course, however, we focus on the open source LLM Falcon 40B.

Loading the HF API key and related Python libraries

In [1]:
# Import necessary libraries
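import os                # Operating-system helpers, e.g. reading environment variables
import io                # Working with in-memory streams (e.g. file-like objects)
import IPython.display   # Displaying data such as images in an IPython environment
from PIL import Image    # Working with image data
import base64            # Base64 encoding, commonly used for image data
import requests          # Making HTTP requests such as GET and POST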

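# Intended to set a default request timeout of 60 seconds
# Note: requests itself does not appear to read this attribute, so pass timeout= explicitly where it matters
requests.adapters.DEFAULT_TIMEOUT = 60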

# Import the function of dotenv library
# dotenv allows you to read environment variables from a .env file
# This is particularly useful when developing to avoid hard-coding sensitive information (such as API keys) into your code
from dotenv import load_dotenv, find_dotenv

# Find the .env file and load its contents
# This allows you to use os.environ to read environment variables set in a .env file
_ = load_dotenv(find_dotenv())

# Read 'HF_API_KEY' from the environment variable and store it in the hf_api_key variable
hf_api_key = os.environ['HF_API_KEY']

We set up our token and helper functions here. Note that we are using a different library this time: `text_generation`, a lightweight client library for working with open source LLMs. It lets you call a hosted API (as we do here) and also work with an LLM that you run locally.

In [2]:
# Helper functions

# Import necessary libraries
import requests        # Making HTTP requests
import json            # Working with JSON data

# Import the Client class from the text_generation package
# The client communicates with the falcon-40b-instruct endpoint
from text_generation import Client

# Read the endpoint's base URL from the 'HF_API_FALCOM_BASE' environment variable
# Pass hf_api_key in the Authorization header to authenticate
# Set the request timeout to 120 seconds
client = Client(os.environ['HF_API_FALCOM_BASE'], 
                headers={"Authorization": f"Basic {hf_api_key}"}, 
                timeout=120)

## Build an app to chat with any LLM!

Here we will use an [inference endpoint](https://huggingface.co/inference-endpoints) to call `falcon-40b-instruct`, which is ranked high on the [🤗 open source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

If you want to run the model locally, you can use the [Transformers library](https://huggingface.co/docs/transformers/index) or the [text generation inference library](https://github.com/huggingface/text-generation-inference).

In [3]:
# Set a Chinese prompt, which will be sent to the model for generating text
prompt_chinese = "数学是被发明还是被发现的？"

# Call the client's generate method with the Chinese prompt set above
# The max_new_tokens parameter limits the length of the generated text
# Then read the generated text from the returned result
client.generate(prompt_chinese, max_new_tokens=256).generated_text

'\n数学是被发现的，因为它是自然界的基本规律和模式。人类只是发现了这些规律和模式，并将它们用符号和公式表达出来。'

In [33]:
prompt = "Has math been invented or discovered?"
client.generate(prompt, max_new_tokens=256).generated_text

'\nMath has been both invented and discovered. It is a human invention in the sense that it is a system of rules and concepts that we have created to help us understand the world around us. However, it is also a discovery in the sense that it is a fundamental aspect of the universe that we have uncovered through our observations and experiments.'

In Lesson 2, we used a very simple Gradio interface with a textbox input and an output. We can chat with the LLM in a similar way here: we copy in our prompt again and choose how many tokens we want. That is all it takes to ask the LLM a question. But we still can't really chat, because the model will not understand or retain the context if we ask a follow-up question.

In [None]:
# Back to Lesson 2, time flies!
# Import required libraries
import gradio as gr  # Used to build the web interface
import os  # Used to interact with the operating system, e.g. reading environment variables

# Define a function that generates text from the input prompt
def generate(input, slider):
    # Use the generate method of the client object defined earlier
    # The slider value limits the number of tokens generated (cast to int in case the slider returns a float)
    output = client.generate(input, max_new_tokens=int(slider)).generated_text
    return output  # Return the generated text

# Create a web interface
# Input: a text box and a slider
# Output: a text box that displays the generated text
demo = gr.Interface(
    fn=generate, 
    inputs=[
        gr.Textbox(label="Prompt"),  # Text input box
        gr.Slider(label="Max new tokens", value=20,  maximum=1024, minimum=1)  # Slider for the maximum number of generated tokens
    ], 
    outputs=[gr.Textbox(label="Completion")]  # Text box that shows the generated text
)

# Close any previous gradio instances that may have been started
gr.close_all()

# Start the web interface
# Use the environment variable PORT1 as the server port number
demo.launch(share=True, server_port=int(os.environ['PORT1']))

To hold a conversation, we have to send the model our previous question, its own answer, and the follow-up question. Assembling all of this by hand is cumbersome.

To solve this problem, we introduce a new Gradio component, `gr.Chatbot`, which simplifies sending the conversation history to the model.

![math](images/ch06_math.png)

## Use `gr.Chatbot()` to help!

Let's get started with the Gradio Chatbot component. Here we instantiate a `gr.Chatbot` component together with a text box for the prompt and a submit button, which makes for a very simple user interface. But we are not chatting with the LLM yet.

For now, the bot just picks one of three canned responses at random, and then the user's message and the bot's message are added to the chat. So whatever you type, the reply is simply one of those three responses.

In [None]:
# Import the random library to pick a canned reply at random
import random
import gradio as gr  # Used to build the web interface
import os  # Used to interact with the operating system, e.g. reading environment variables

# Define the bot's response function
def respond(message, chat_history):
    # Pick one of three canned replies at random (no LLM is involved yet)
    bot_message = random.choice(["Tell me more about it", 
                                 "Cool, but I'm not interested", 
                                 "Hmmmm, ok then"]) 

    # Add the user's message and the bot's reply to the chat history
    chat_history.append((message, bot_message))

    # Return an empty string (which clears the text box) and the updated chat history
    return "", chat_history

# Use the gr.Blocks() context manager to define the user interface
with gr.Blocks() as demo:
    # Define a chatbot window with a height of 240
    chatbot = gr.Chatbot(height=240) 
    # Define a text box for user input messages
    msg = gr.Textbox(label="Prompt")
    # Define a submit button
    btn = gr.Button("Submit")
    # Define a button that clears the text box and the chat window
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

    # When the user clicks the submit button, the respond function is called
    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    # When the user presses the Enter key in the text box, the respond function is also called
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])

# Close any previous gradio instances that may have been started
gr.close_all()

# Start the Web interface
# Use the environment variable PORT2 as the server port number
demo.launch(share=True, server_port=int(os.environ['PORT2']))

![math_with_template](images/ch06_math_with_template.png)

Next, we have to format the chat prompt, which is what the `format_chat_prompt` function defined below does.
The goal is to include the chat history so that the LLM knows the context.
But that alone is not enough: we also need to tell the model which messages come from the user and which come from the LLM itself, which we call the assistant.
So `format_chat_prompt` adds a `User` message and an `Assistant` message for each turn of the chat history, so that the model can answer follow-up questions accurately.
Finally, we pass the formatted prompt to our API.

In [None]:
# Define a function to format the chat prompt.
def format_chat_prompt(message, chat_history):
    # Initialize an empty string to hold the formatted chat prompt.
    prompt = ""
    # Walk through the chat history.
    for turn in chat_history:
        # Each turn holds a user message and a bot message.
        user_message, bot_message = turn
        # Append both messages to the prompt.
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    # Add the current user message and leave room for the assistant's reply.
    prompt = f"{prompt}\nUser: {message}\nAssistant:"
    # Return the formatted prompt.
    return prompt

# Define a function to generate the bot's reply.
def respond(message, chat_history):
    # Format the user's message and the chat history into a single prompt.
    formatted_prompt = format_chat_prompt(message, chat_history)
    # Generate the reply with the client object created in the helper cell earlier.
    # Stop at the next user turn or at Falcon's end-of-text token.
    bot_message = client.generate(formatted_prompt,
                                  max_new_tokens=1024,
                                  stop_sequences=["\nUser:", "<|endoftext|>"]).generated_text
    # Add the user's message and the bot's reply to the chat history.
    chat_history.append((message, bot_message))
    # Return an empty string (which clears the text box) and the updated chat history.
    return "", chat_history

# The following code is the part that sets up the Gradio interface.

# Define the interface using Gradio's Blocks feature.
with gr.Blocks() as demo:
    # Create a Gradio chatbot component and set its height to 240.
    chatbot = gr.Chatbot(height=240) 
    # Create a text box component for input prompts.
    msg = gr.Textbox(label="Prompt")
    # Create a submit button.
    btn = gr.Button("Submit")
    # Create a clear button that clears the contents of the text box and chatbot components.
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

    # On click, call the respond function defined above with the user's message and chat history, then update the text box and chatbot components.
    btn.click(respond, inputs=[msg, chatbot], outputs=[msg, chatbot])
    # Pressing Enter in the text box does the same as clicking the button.
    msg.submit(respond, inputs=[msg, chatbot], outputs=[msg, chatbot]) 

# Close all existing Gradio instances.
gr.close_all()
# Start a new Gradio application, set the sharing function to True, and use the environment variable PORT3 to specify the server port.
demo.launch(share=True, server_port=int(os.environ['PORT3']))

![animal](images/ch06_animal.png)

![animal_in_context](images/ch06_animal_in_context.png)

Now our chatbot should be able to answer follow-up questions.
We can see that we sent it context: we sent it the conversation so far and asked it to complete the next reply. On the next turn, we send it the entire context again and ask it to complete once more. This works well, but if we keep iterating like this, the model will eventually hit the limit on how much text it can take in for a single conversation, because each turn sends more and more of the previous conversation.
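One common way to deal with a conversation that keeps growing is to drop the oldest turns before formatting the prompt. The following is only a minimal sketch of that idea, not part of the course code: `trim_history` and its `max_chars` budget are hypothetical, and a character count is used as a rough stand-in for the model's real token limit.

```python
def format_chat_prompt(message, chat_history):
    """Same User/Assistant layout as the chapter's helper."""
    prompt = ""
    for user_message, bot_message in chat_history:
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    return f"{prompt}\nUser: {message}\nAssistant:"


def trim_history(message, chat_history, max_chars=2000):
    """Hypothetical helper: drop the oldest turns until the prompt fits.

    max_chars is an assumed character budget, not a real model limit.
    """
    trimmed = list(chat_history)  # work on a copy, leave the caller's list alone
    while trimmed and len(format_chat_prompt(message, trimmed)) > max_chars:
        trimmed.pop(0)  # forget the oldest (user, assistant) turn first
    return trimmed


# Hand-written example turns, for illustration only.
history = [
    ("Which animals live in the savannah?", "Lions, zebras, elephants."),
    ("Which of these is the strongest?", "Probably the elephant."),
]
short = trim_history("How much does it weigh?", history, max_chars=150)
print(len(history), "turns before,", len(short), "turns after trimming")
```

The trade-off is that the model forgets whatever was said in the dropped turns, so a real application might prefer to count tokens with the model's tokenizer or to summarize old turns instead.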

To let the model produce answers that are as long as possible, we set the maximum number of new tokens, `max_new_tokens`, to 1024. This is the largest value the model accepts on the hardware behind our API.

Try the following prompts:
1. Which animals live in the savannah?
2. Which of these animals is the strongest?
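
To see why the second prompt can be answered, it helps to look at the text the model actually receives. The sketch below repeats the chapter's `format_chat_prompt` so that it runs on its own. The first answer in `history` is written by hand for illustration and is not real model output.

```python
def format_chat_prompt(message, chat_history):
    """Same User/Assistant layout as the chapter's helper."""
    prompt = ""
    for user_message, bot_message in chat_history:
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    return f"{prompt}\nUser: {message}\nAssistant:"


# Hypothetical first turn (the assistant text is made up for this example).
history = [
    ("Which animals live in the savannah?",
     "Lions, elephants, giraffes and zebras."),
]

follow_up_prompt = format_chat_prompt(
    "Which of these animals is the strongest?", history
)
print(follow_up_prompt)
```

Because the first question and answer are part of the prompt, "these animals" in the follow-up has something to refer to.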

Here, we have created a simple but powerful user interface for chatting with an LLM. To go further and make the most of what Gradio offers, we can build a user interface with more features.

### Adding other advanced features

In [None]:
# Define a function to format the chat prompt.
def format_chat_prompt(message, chat_history, instruction):
    # Start the prompt with the system instruction.
    prompt = f"System:{instruction}"
    # Walk through the chat history.
    for turn in chat_history:
        # Each turn holds a user message and a bot message.
        user_message, bot_message = turn
        # Append both messages to the prompt.
        prompt = f"{prompt}\nUser: {user_message}\nAssistant: {bot_message}"
    # Add the current user message and leave room for the assistant's reply.
    prompt = f"{prompt}\nUser: {message}\nAssistant:"
    # Return the formatted prompt.
    return prompt

# Define a function that streams the bot's reply.
def respond(message, chat_history, instruction, temperature=0.7):
    # Format the user's message, chat history, and system instruction into a prompt.
    prompt = format_chat_prompt(message, chat_history, instruction)
    # Add the user's message to the chat history with an empty reply to fill in.
    chat_history = chat_history + [[message, ""]]
    # Stream the reply with the generate_stream method of the client created in the helper cell earlier.
    # Stop at the next user turn or at Falcon's end-of-text token.
    stream = client.generate_stream(prompt,
                                    max_new_tokens=1024,
                                    stop_sequences=["\nUser:", "<|endoftext|>"], 
                                    temperature=temperature)  # Temperature controls how random the reply is.
    acc_text = ""
    # Consume the streamed response token by token.
    for idx, response in enumerate(stream):
        text_token = response.token.text

        # Details are only attached to the final streamed response, so stop here.
        if response.details:
            return

        # Strip the leading space from the very first token.
        if idx == 0 and text_token.startswith(" "):
            text_token = text_token[1:]

        # Accumulate the generated text.
        acc_text += text_token
        # Append it to the last turn of the chat history.
        last_turn = list(chat_history.pop(-1))
        last_turn[-1] += acc_text
        chat_history = chat_history + [last_turn]
        yield "", chat_history
        acc_text = ""

# Set up the Gradio interface.
with gr.Blocks() as demo:
    # Create a Gradio chatbot component and set its height.
    chatbot = gr.Chatbot(height=240)
    # Create a text box component for input prompts.
    msg = gr.Textbox(label="Prompt")
    # Create an accordion component that holds the advanced options.
    with gr.Accordion(label="Advanced options", open=False):
        # Create a text box inside the accordion for entering the system message.
        system = gr.Textbox(label="System message", lines=2, value="A conversation between a user and an LLM-based legal assistant. The assistant gives truthful and helpful answers.")
        # Create a slider to adjust the temperature of the response.
        temperature = gr.Slider(label="temperature", minimum=0.1, maximum=1, value=0.7, step=0.1)
    # Create a submit button.
    btn = gr.Button("Submit")
    # Create a clear button that clears the contents of the text box and chatbot components.
    clear = gr.ClearButton(components=[msg, chatbot], value="Clear console")

    # On click, call the respond function defined above with the user's message, chat history, system message, and temperature, then update the text box and chatbot components.
    btn.click(respond, inputs=[msg, chatbot, system, temperature], outputs=[msg, chatbot])
    # Pressing Enter in the text box does the same as clicking the button.
    msg.submit(respond, inputs=[msg, chatbot, system, temperature], outputs=[msg, chatbot])

![law_1](images/ch06_law_1.png)

![law_2](images/ch06_law_2.png)

![law_3](images/ch06_law_3.png)

Here we have advanced options, including a system message, which sets the mode in which the LLM chats with you.
In the system message you can say, for example, "You are a helpful assistant", or you can ask for a specific tone,
such as a little more playful or a little more serious. Experiment with the system message and see how it changes the model's replies.

Some people might even want to give the LLM a persona, such as a lawyer giving legal advice or a doctor giving medical advice.
Be aware, though, that LLMs are known to give false information in a way that sounds convincing.
So while it can be fun to experiment and explore with Falcon 40B, real-world use cases like these need further safeguards.

There are other advanced parameters as well, such as the temperature.
The temperature controls how much the model's output varies. If you set the temperature to zero, the model tends to respond the same way to the same input:
same question, same answer. The higher the temperature, the more the answers vary. But if the temperature is too high, the model starts to give nonsense answers.
A value of 0.7 is a good default, but we encourage you to experiment.
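
The effect of temperature can be illustrated with a small calculation. The sketch below shows the usual way temperature is applied, dividing the model's scores (logits) by the temperature before a softmax. This is an assumption about the general technique, not a description of what the Falcon endpoint does internally, and the logits are made-up numbers for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all of the probability goes to the top-scoring token, which is why answers repeat. At a higher temperature the probabilities even out, so less likely tokens get sampled more often.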

Apart from that, this UI also lets us stream responses.
The reply is sent token by token, and we can watch it being written in real time, so we don't need to wait until the entire answer is ready. The code above shows how this is done. Don't worry if you don't understand every detail: our intention is to end the course with a complete UI that brings together all the features for working with an LLM.

In `format_chat_prompt`, the function we used before, we added a new element: the system instruction. Before the User/Assistant conversation starts, we place the instruction at the top of the prompt, so every prompt sent to the model begins with the system message we set. We then call the `generate_stream` function of the text generation library, which generates the response tokens one by one. In the loop, each new token is appended to the last turn of the chat history, and the updated history is yielded back to Gradio so that the chat window refreshes.
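
The streaming loop can be tried without an endpoint. This is a minimal sketch under stated assumptions: `fake_stream` is a hypothetical stand-in for `client.generate_stream` that yields a fixed sentence piece by piece, and `respond` is simplified to show only the accumulate-and-yield pattern.

```python
def fake_stream(text):
    """Hypothetical stand-in for a token stream: yields one word at a time."""
    for i, word in enumerate(text.split(" ")):
        # Like many tokenizers, later pieces carry their leading space.
        yield word if i == 0 else " " + word


def respond(message, chat_history):
    """Simplified streaming handler: grow the last reply and yield each update."""
    # Start a new turn with an empty reply that will be filled in.
    chat_history = chat_history + [[message, ""]]
    for token in fake_stream("Lions and elephants live there."):
        chat_history[-1][1] += token
        # Each yield hands Gradio a cleared text box and the partial history.
        yield "", chat_history


# Record the partial reply after every yield, as the chat window would show it.
snapshots = [history[-1][1]
             for _, history in respond("Which animals live in the savannah?", [])]
for partial in snapshots:
    print(partial)
```

Because `respond` is a generator, Gradio can redraw the chatbot after every `yield` instead of waiting for the function to return.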

In [None]:
# Close any previous gradio instances that may have been started
gr.close_all()