In [None]:
!pip install --upgrade gradio


#**Introduction**#
##Chatbots are a popular application of large language models. Using Gradio, you can easily build a demo of your chatbot model and share it with your users, or try it yourself using an intuitive chatbot UI.

##This tutorial uses `gr.ChatInterface()`, which is a high-level abstraction that allows you to create your chatbot UI fast, often with a single line of code.

#**Defining a chat function**#
##When working with `gr.ChatInterface()`, the first thing you should do is define your chat function. Your chat function should take two arguments: `message` and then `history` (the arguments can be named anything, but must be in this order).

##*message: a `str` representing the user's input.*
##*history: a `list` of `list` representing the conversations up until that point. Each inner list consists of two `str` representing a pair: `[user input, bot response]`.*
##Your function should return a single string response, which is the bot's response to the particular user input message. Your function can take into account the history of messages, as well as the current message.
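
##As a minimal sketch of this contract (the function name `turn_counter` is made up for illustration), the chat function below reads both arguments: it uses `len(history)` to number the current turn and echoes `message` back. It assumes `history` is the list of `[user input, bot response]` pairs described above.

```python
def turn_counter(message, history):
    # history holds one [user input, bot response] pair per completed turn,
    # so the current turn number is one more than its length
    turn = len(history) + 1
    return f"Turn {turn}: you said '{message}'"


# Call the function directly, without any UI, to check the contract
print(turn_counter("Hello", []))
print(turn_counter("How are you?", [["Hello", "Turn 1: you said 'Hello'"]]))
```

##Any function with this shape can be passed to `gr.ChatInterface()`.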

#**Example: a chatbot that responds yes or no**#
##Let's write a chat function that responds `Yes` or `No` randomly.

##Here's our chat function:

In [None]:
import random

def random_response(message, history):
    return random.choice(["Yes", "No"])

##Now, we can plug this into `gr.ChatInterface()` and call the `.launch()` method to create the web interface:

In [None]:
import gradio as gr

gr.ChatInterface(random_response).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://e9926b25e7257828fd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




#**Another example using the user's input and history**#
##Of course, the previous example was very simplistic: it didn't even take the user's input or the previous history into account! Here's another simple example showing how to incorporate a user's input as well as the history.

# Comments:
# This code sets up a simple Gradio chat interface that alternately agrees or disagrees with the user's messages.
# The `random` library is imported, but it is not used in this code.
# The `alternatingly_agree` function takes two arguments: the user's message and the conversation history.
# The function checks the length of the conversation history using the `len` function and the modulo operator `%`.
# If the length is even, the function agrees with the user's message by returning a string that includes the message.
# If the length is odd, the function disagrees with the user's message by returning a simple "I don't think so" string.
# The `gr.ChatInterface` function from the Gradio library is used to create a chat interface.
# The `alternatingly_agree` function is passed as an argument to `gr.ChatInterface`, which sets it as the callback function for handling user messages.
# The `launch` method is called to start the Gradio chat interface and make it accessible to users.
# When a user sends a message, the `alternatingly_agree` function is called with the message and the conversation history as arguments.
# The function returns a response based on the length of the conversation history, and the response is displayed in the chat interface.
# The chat interface will continue to alternate between agreeing and disagreeing with the user's messages until it is closed.

In [None]:
import random
import gradio as gr

# Define a function to agree or disagree with the user's message alternately
def alternatingly_agree(message, history):
    # Check if the length of the conversation history is even or odd
    if len(history) % 2 == 0:
        # If even, agree with the user's message
        return f"Yes, I do think that '{message}'"
    else:
        # If odd, disagree with the user's message
        return "I don't think so"

# Create a Gradio chat interface and connect it to the alternatingly_agree function
gr.ChatInterface(alternatingly_agree).launch()
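
##Because a chat function is an ordinary Python function, its behaviour can be checked without launching the UI. The sketch below redefines the same function and calls it with hand-written histories (the sample messages are invented for illustration) to show that the reply depends on how many exchanges came before.

```python
def alternatingly_agree(message, history):
    # Even number of previous exchanges: agree; odd: disagree
    if len(history) % 2 == 0:
        return f"Yes, I do think that '{message}'"
    else:
        return "I don't think so"


# No previous exchanges, so the bot agrees
first_reply = alternatingly_agree("The sky is blue", [])
print(first_reply)

# One previous exchange, so the bot disagrees
second_reply = alternatingly_agree("Grass is red", [["The sky is blue", first_reply]])
print(second_reply)
```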



#**Streaming chatbots**#
##In your chat function, you can use `yield` to generate a sequence of partial responses, each replacing the previous one. This way, you'll end up with a streaming chatbot. It's that simple!

# Comments:
# This code sets up a Gradio chat interface that slowly echoes the user's messages, one character at a time.
# The `time` library is imported to introduce delays between each character.
# The `slow_echo` function takes two arguments: the user's message and the conversation history.
# The function iterates over the characters in the message using a `for` loop and the `range` function.
# For each character, the function introduces a delay of 0.3 seconds using the `time.sleep` function.
# After the delay, the function yields a string that includes the message up to the current character, prefixed with "You typed: ".
# The `yield` statement is used to generate a sequence of strings, one for each character in the message.
# The `gr.ChatInterface` function from the Gradio library is used to create a chat interface.
# The `slow_echo` function is passed as an argument to `gr.ChatInterface`, which sets it as the callback function for handling user messages.
# The `launch` method is called to start the Gradio chat interface and make it accessible to users.
# When a user sends a message, the `slow_echo` function is called with the message and the conversation history as arguments.
# The function generates a sequence of strings, one for each character in the message, with a delay of 0.3 seconds between each string.
# The generated strings are displayed in the chat interface, creating the effect of slowly echoing the user's message, one character at a time.
# The chat interface will continue to slowly echo the user's messages until it is closed.

In [None]:
import time
import gradio as gr

# Define a function to slowly echo the user's message
def slow_echo(message, history):
    # Iterate over the characters in the message
    for i in range(len(message)):
        # Delay for 0.3 seconds
        time.sleep(0.3)
        # Yield the message up to the current character
        yield "You typed: " + message[: i + 1]

# Create a Gradio chat interface and connect it to the slow_echo function
gr.ChatInterface(slow_echo).launch()
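
##To see what a streaming chat function actually produces, the sketch below collects every value the generator yields. It uses a delay-free variant of the function above (the name `stream_prefixes` is hypothetical) so that it runs instantly. Each yielded string is a complete partial response that replaces the previous one, and the last one is the final reply.

```python
def stream_prefixes(message, history):
    # Same logic as slow_echo, minus the delay: each yield is a longer prefix
    for i in range(len(message)):
        yield "You typed: " + message[: i + 1]


# Consume the generator the way a UI would, one partial response at a time
updates = list(stream_prefixes("Hey", []))
for update in updates:
    print(update)
```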


#**Customizing your chatbot**#
##If you're familiar with Gradio's `Interface` class, `gr.ChatInterface` includes many of the same arguments that you can use to customize the look and feel of your chatbot. For example, you can:

##*add a title and description above your chatbot using the `title` and `description` arguments.*
##*add a theme or custom CSS using the `theme` and `css` arguments respectively.*
##*add `examples` and even enable `cache_examples`, which make it easier for users to try it out.*
##You can change the text or disable each of the buttons that appear in the chatbot interface: `submit_btn`, `retry_btn`, `undo_btn`, `clear_btn`.
##If you want to customize the `gr.Chatbot` or `gr.Textbox` that compose the `ChatInterface`, then you can pass in your own chatbot or textbox as well. Here's an example of how we can use these parameters:

# Comments:
# This code sets up a Gradio chat interface called "Yes Man" that responds to user messages with either "Yes" or a default message.
# The `yes_man` function is defined to handle user messages. It checks if the message ends with a question mark using the `endswith` method.
# If the message ends with a question mark, the function returns "Yes". Otherwise, it returns a default message "Ask me anything!".
# The `gr.ChatInterface` function is used to create the chat interface. It takes several arguments to configure the appearance and behavior of the interface.
# The `yes_man` function is passed as the first argument to handle user messages.
# The `chatbot` argument configures the appearance of the chatbot, setting the height to 300 pixels.
# The `textbox` argument configures the text input box, setting a placeholder text and scaling the size.
# The `title` and `description` arguments set the title and description of the interface, respectively.
# The `theme` argument sets the visual theme of the interface to "soft".
# The `examples` argument provides a list of example messages that users can try.
# The `cache_examples` argument is set to `True` to cache the example messages for better performance.
# The `retry_btn` argument is set to `None` to remove the retry button.
# The `undo_btn` and `clear_btn` arguments set the labels for the undo and clear buttons, respectively.
# Finally, the `launch` method is called to start the Gradio interface and make it accessible to users.
# When a user sends a message, the `yes_man` function is called with the message and the conversation history as arguments.
# The function checks if the message ends with a question mark and returns either "Yes" or the default message, which is displayed in the chat interface.

In [None]:
import gradio as gr

# Define a function to respond with "Yes" or a default message
def yes_man(message, history):
    # Check if the message ends with a question mark
    if message.endswith("?"):
        # If it does, return "Yes"
        return "Yes"
    else:
        # If not, return a default message
        return "Ask me anything!"

gr.ChatInterface(
    yes_man,  # The function to handle user messages
    # Configure the chatbot appearance
    chatbot=gr.Chatbot(
        height=300,
        placeholder="<strong>Your Personal Yes-Man</strong><br>Ask Me Anything",
    ),
    # Configure the text input box
    textbox=gr.Textbox(
        placeholder="Ask me a yes or no question",
        container=False,
        scale=7,
    ),
    title="Yes Man",  # Set the title of the interface
    description="Ask Yes Man any question",  # Set the description of the interface
    theme="soft",  # Set the theme of the interface
    examples=["Hello", "Am I cool?", "Are tomatoes vegetables?"],  # Set example messages
    cache_examples=True,  # Cache the example messages
    retry_btn=None,  # Remove the retry button
    undo_btn="Delete Previous",  # Set the label for the undo button
    clear_btn="Clear",  # Set the label for the clear button
).launch()  # Launch the Gradio interface



Caching examples at: '/content/gradio_cached_examples/51'
Caching example 1/3
Caching example 2/3
Caching example 3/3

##In particular, if you'd like to add a "placeholder" for your chat interface, which appears before the user has started chatting, you can do so using the `placeholder` argument of `gr.Chatbot`, which accepts Markdown or HTML.

In [None]:
gr.ChatInterface(
    yes_man,
    chatbot=gr.Chatbot(placeholder="<strong>Your Personal Yes-Man</strong><br>Ask Me Anything"),
    ...
)

#**Add Multimodal Capability to your chatbot**#
##You may want to add multimodal capability to your chatbot. For example, you may want users to be able to easily upload images or files to your chatbot and ask questions about them. You can make your chatbot "multimodal" by passing in a single parameter (`multimodal=True`) to the `gr.ChatInterface` class.

# Comments:
# This code sets up a Gradio chat interface that counts the number of files uploaded by the user.
# The `gradio` and `time` libraries are imported.
# The `count_files` function is defined to handle user messages.
# The function takes two arguments: the user's message and the conversation history.
# The `len` function is used to get the number of files from the `message["files"]` list.
# The function returns a string indicating the number of files uploaded.
# The `gr.ChatInterface` function is used to create the chat interface.
# The `count_files` function is passed as the `fn` argument to handle user messages.
# The `examples` argument provides an example input with no files.
# The `title` argument sets the title of the interface to "Echo Bot".
# The `multimodal` argument is set to `True` to enable multimodal input (text and files).
# The `launch` method is called to start the Gradio interface and make it accessible to users.
# When a user sends a message or uploads files, the `count_files` function is called with the message and the conversation history as arguments.
# The function counts the number of files in the `message["files"]` list and returns a string indicating the count.
# The returned string is displayed in the chat interface.
# The chat interface will continue to count the number of files uploaded until it is closed.

In [None]:
import gradio as gr
import time

# Define a function to count the number of files uploaded
def count_files(message, history):
    # Get the number of files from the message object
    num_files = len(message["files"])
    # Return a string indicating the number of files uploaded
    return f"You uploaded {num_files} files"

# Create a Gradio chat interface
demo = gr.ChatInterface(
    fn=count_files,  # The function to handle user messages
    examples=[{"text": "Hello", "files": []}],  # Example input with no files
    title="Echo Bot",  # Set the title of the interface
    multimodal=True,  # Enable multimodal input (text and files)
)

# Launch the Gradio interface
demo.launch()


##When `multimodal=True`, the signature of `fn` changes slightly. The first parameter of your function should accept a dictionary consisting of the submitted text and uploaded files that looks like this: `{"text": "user input", "files": ["file_path1", "file_path2", ...]}`. Similarly, any examples you provide should be in a dictionary of this form. Your function should still return a single `str` message.

##✍️ Tip: If you'd like to customize the UI/UX of the textbox for your multimodal chatbot, you should pass in an instance of `gr.MultimodalTextbox` to the `textbox` argument of `ChatInterface` instead of an instance of `gr.Textbox`.
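
##As a small sketch of the dictionary-shaped message, the function below reads both the text and the file list. It assumes the `text` and `files` keys used by `count_files` above; the function name `describe_submission` and the path `cat.png` are made up for illustration.

```python
def describe_submission(message, history):
    # With multimodal=True, message is a dict rather than a plain string
    text = message.get("text", "")
    files = message.get("files", [])
    noun = "file" if len(files) == 1 else "files"
    return f"You wrote '{text}' and attached {len(files)} {noun}"


# A hand-built message in the same shape the multimodal textbox submits
sample = {"text": "What is in this picture?", "files": ["cat.png"]}
print(describe_submission(sample, []))
```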

#**Additional Inputs**#
##You may want to add additional parameters to your chatbot and expose them to your users through the chatbot UI. For example, suppose you want to add a textbox for a system prompt, or a slider that sets the number of tokens in the chatbot's response. The `ChatInterface` class supports an `additional_inputs` parameter which can be used to add additional input components.

##The `additional_inputs` parameter accepts a component or a list of components. You can pass the component instances directly, or use their string shortcuts (e.g. `"textbox"` instead of `gr.Textbox()`). If you pass in component instances, and they have not already been rendered, then the components will appear underneath the chatbot (and any examples) within a `gr.Accordion()`. You can set the label of this accordion using the `additional_inputs_accordion_name` parameter.

##Here's a complete example:

# Comments:
# This code sets up a Gradio chat interface that echoes the user's message with a delay, based on a system prompt and a token limit.
# The `gradio` and `time` libraries are imported.
# The `echo` function is defined to handle user messages.
# The function takes four arguments: the user's message, the conversation history, the system prompt, and the token limit.
# The `response` string is constructed by concatenating the system prompt and the user's message.
# The function iterates over the characters in the `response` string using a `for` loop and the `range` function.
# The `min` function is used to limit the iteration to the length of the `response` string or the token limit, whichever is smaller.
# For each character, the function introduces a delay of 0.05 seconds using the `time.sleep` function.
# After the delay, the function yields the `response` string up to the current character.
# The `yield` statement is used to generate a sequence of strings, one for each character in the `response` string.
# The `gr.ChatInterface` function is used to create the chat interface.
# The `echo` function is passed as an argument to `gr.ChatInterface`, which sets it as the callback function for handling user messages.
# The `additional_inputs` argument is used to add additional input components to the interface.
# The first additional input is a `gr.Textbox` for entering the system prompt, with a default value of "You are helpful AI." and a label.
# The second additional input is a `gr.Slider` for setting the token limit, with a range from 10 to 100.
# The `if __name__ == "__main__"` block ensures that the code inside it is executed only when the script is run directly, not when it's imported as a module.
# Inside the block, the `demo.queue().launch()` line starts the Gradio interface and makes it accessible to users.
# When a user sends a message, the `echo` function is called with the message, the conversation history, the system prompt, and the token limit as arguments.
# The function generates a sequence of strings, one for each character in the `response` string, with a delay of 0.05 seconds between each string.
# The generated strings are displayed in the chat interface, creating the effect of slowly echoing the user's message, one character at a time.
# The chat interface will continue to echo the user's messages until it is closed.

In [None]:
import gradio as gr
import time

# Define a function to echo the user's message with a delay
def echo(message, history, system_prompt, tokens):
    # Construct the response string
    response = f"System prompt: {system_prompt}\n Message: {message}."

    # Iterate over the characters in the response string
    for i in range(min(len(response), int(tokens))):
        # Delay for 0.05 seconds
        time.sleep(0.05)
        # Yield the response string up to the current character
        yield response[: i + 1]

# Create a Gradio chat interface
demo = gr.ChatInterface(
    echo,  # The function to handle user messages
    additional_inputs=[
        # Input for system prompt
        gr.Textbox("You are helpful AI.", label="System Prompt"),
        gr.Slider(10, 100),  # Input for token limit

    ]
)

# Check if the script is being run directly
if __name__ == "__main__":
    # Launch the Gradio interface
    demo.queue().launch()
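
##The values of the additional inputs reach the chat function as extra positional arguments after `message` and `history`, in the order the components are listed. The sketch below imitates that call by hand with a delay-free copy of `echo` (the name `echo_no_delay` and the sample values are invented for illustration).

```python
def echo_no_delay(message, history, system_prompt, tokens):
    # Same logic as echo above, without the sleep between characters
    response = f"System prompt: {system_prompt}\n Message: {message}."
    for i in range(min(len(response), int(tokens))):
        yield response[: i + 1]


# Textbox value -> system_prompt, Slider value -> tokens
partials = list(echo_no_delay("Hi", [], "Be brief", 12))
final = partials[-1]
print(repr(final))
```

##The slider value caps how many characters are streamed, which is why `tokens` is converted with `int()` before use.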



In [None]:
import gradio as gr
import time

# Define a function to echo the user's message with a delay
def echo(message, history, system_prompt, tokens):
    # Construct the response string
    response = f"System prompt: {system_prompt}\n Message: {message}."

    # Iterate over the characters in the response string
    for i in range(min(len(response), int(tokens))):
        # Delay for 0.05 seconds
        time.sleep(0.05)
        # Yield the response string up to the current character
        yield response[: i + 1]

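with gr.Blocks() as demo:
    # Rendered right away inside the Blocks, so it appears above the chat UI
    system_prompt = gr.Textbox("You are helpful AI.", label="System Prompt")
    # Not rendered yet (render=False), so ChatInterface places it underneath the chatbot
    slider = gr.Slider(10, 100, render=False)

    gr.ChatInterface(echo, additional_inputs=[system_prompt, slider])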


# Check if the script is being run directly
if __name__ == "__main__":
    # Launch the Gradio interface
    demo.queue().launch()



#**A langchain example**#
##Now, let's actually use `gr.ChatInterface` with some real large language models. We'll start by using LangChain on top of OpenAI to build a general-purpose chatbot application in under 20 lines of code. You'll need to have an OpenAI API key for this example (keep reading for the free, open-source equivalent!).

In [None]:
!pip install langchain
!pip install openai


In [None]:
!pip install langchain_community


# Comments:
# This code sets up a Gradio chat interface that uses the OpenAI GPT-3.5 language model to generate responses to user messages.
# The necessary libraries are imported, including `langchain.chat_models`, `langchain.schema`, `openai`, and `gradio`.
# The OpenAI API key is set using the `os.environ["OPENAI_API_KEY"]` variable. Replace `"sk-..."` with your actual API key.
# The `ChatOpenAI` language model is initialized with a temperature of 1.0 and the `gpt-3.5-turbo-0613` model.
# The `predict` function is defined to handle user messages and generate responses.
# Inside the `predict` function, the conversation history is converted to the LangChain format using `HumanMessage` and `AIMessage` objects.
# The current user message is added to the history using `HumanMessage`.
# The language model is called with the formatted conversation history using `llm(history_langchain_format)`.
# The generated response from the language model is returned by the `predict` function.
# A Gradio chat interface is created using `gr.ChatInterface(predict)`.
# The `launch()` method is called to start the Gradio interface and make it accessible to users.
# When a user sends a message, the `predict` function is called with the message and the conversation history as arguments.
# The function formats the conversation history, adds the current message, and generates a response using the language model.
# The generated response is displayed in the chat interface.
# The chat interface will continue to generate responses to user messages until it is closed.

In [None]:
import os
from langchain.chat_models import ChatOpenAI
from langchain.schema import AIMessage, HumanMessage
import openai
import gradio as gr

# Read the API key from the OPENAI_API_KEY environment variable instead of hard-coding it
llm = ChatOpenAI(temperature=1.0, model='gpt-3.5-turbo-0613',
                 openai_api_key=os.environ["OPENAI_API_KEY"])

# Define the function to handle user messages and generate responses
def predict(message, history):
    # Convert the conversation history to the LangChain format
    history_langchain_format = []

    for human, ai in history:
        history_langchain_format.append(HumanMessage(content=human))
        history_langchain_format.append(AIMessage(content=ai))

    # Add the current user message to the history
    history_langchain_format.append(HumanMessage(content=message))

    # Generate a response from the language model
    gpt_response = llm(history_langchain_format)

    # Return the generated response
    return gpt_response.content

# Create a Gradio chat interface
gr.ChatInterface(predict).launch()


#**A streaming example using openai**#
##Of course, we could also use the `openai` library directly. Here's a similar example, but this time with streaming results as well:

# Comments:
# This code sets up a Gradio chat interface that uses the OpenAI API to generate responses to user messages.
# The `gradio` and `openai` libraries are imported.
# The OpenAI API key is set using the `api_key` variable. Replace `"sk-..."` with your actual API key.
# The `OpenAI` client is initialized with the provided API key.
# The `predict` function is defined to handle user messages and generate responses.
# Inside the `predict` function, the conversation history is converted to the OpenAI format using dictionaries with "role" and "content" keys.
# The current user message is added to the history with the "user" role.
# The OpenAI API is called using `client.chat.completions.create` with the formatted conversation history, the `gpt-3.5-turbo` model, a temperature of 1.0, and `stream=True` to enable streaming responses.
# An empty string `partial_message` is initialized to store the partial response.
# The function iterates over the response stream using a `for` loop.
# For each chunk in the response stream, if the `delta.content` field is not `None`, the content is appended to the `partial_message` string.
# The updated `partial_message` is yielded using the `yield` statement, allowing the Gradio interface to display the response as it is generated.
# A Gradio chat interface is created using `gr.ChatInterface(predict)`.
# The `launch()` method is called to start the Gradio interface and make it accessible to users.
# When a user sends a message, the `predict` function is called with the message and the conversation history as arguments.
# The function formats the conversation history, adds the current message, and generates a response using the OpenAI API.
# The generated response is streamed and displayed in the chat interface as it is generated.
# The chat interface will continue to generate responses to user messages until it is closed.

In [1]:
!pip install gradio
!pip install openai


In [2]:
import gradio as gr
from openai import OpenAI

# Set your OpenAI API key
api_key = "sk-..."  # Replace with your actual API key; never commit a real key

# Initialize the OpenAI client
client = OpenAI(api_key=api_key)

# Define the function to handle user messages and generate responses
def predict(message, history):
    # Convert the conversation history to the OpenAI format
    history_openai_format = []
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human})
        history_openai_format.append({"role": "assistant", "content": assistant})

    # Add the current user message to the history
    history_openai_format.append({"role": "user", "content": message})

    # Generate a response from the OpenAI API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=history_openai_format,
        temperature=1.0,
        stream=True,
    )
    # Initialize an empty string to store the partial response
    partial_message = ""

    # Iterate over the response stream and yield the partial response
    for chunk in response:
        if chunk.choices[0].delta.content is not None:
            partial_message += chunk.choices[0].delta.content
            yield partial_message

# Create a Gradio chat interface
gr.ChatInterface(predict).launch()



Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://f03e7b1e7b8a0cdbd3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




#**Example using a local, open-source LLM with Hugging Face**#
##Of course, in many cases you want to run a chatbot locally. Here's the equivalent example using Together's RedPajama model from Hugging Face (this requires a GPU with CUDA).

In [None]:
!pip install gradio

# Comments:
# This code sets up a Gradio chat interface that uses a pre-trained language model (RedPajama-INCITE-Chat-3B-v1) to generate responses to user messages.
# The necessary libraries are imported, including `gradio`, `torch`, and `transformers`.
# The pre-trained tokenizer and model are loaded using `AutoTokenizer` and `AutoModelForCausalLM`.
# The model is moved to the GPU using `model.to('cuda:0')`.
# A custom stopping criteria class `StopOnTokens` is defined to stop generation when specific token IDs are encountered.
# The `predict` function is defined to handle user messages and generate responses.
# Inside the `predict` function, the conversation history is formatted for the model.
# The input for the model is prepared by concatenating the conversation history with special tokens.
# A `TextIteratorStreamer` is set up to stream the generated tokens.
# The generation parameters are defined, including sampling settings, temperature, and the custom stopping criteria.
# The generation process is started in a separate thread using `Thread` and `model.generate`.
# An empty string `partial_message` is initialized to store the partial response.
# The function iterates over the generated tokens from the streamer.
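# Each streamed piece of text is appended to the `partial_message` string, unless it is a bare `<` character, which the code skips (special tokens are already removed by the streamer via `skip_special_tokens=True`).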
# The updated `partial_message` is yielded using the `yield` statement, allowing the Gradio interface to display the response as it is generated.
# A Gradio chat interface is created using `gr.ChatInterface(predict)`.
# The `launch()` method is called to start the Gradio interface and make it accessible to users.
# When a user sends a message, the `predict` function is called with the message and the conversation history as arguments.
# The function formats the conversation history, prepares the input, and generates a response using the language model.
# The generated response is streamed and displayed in the chat interface as it is generated.
# The chat interface will continue to generate responses to user messages until it is closed.
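
##To make the prompt-formatting step concrete, the sketch below builds the same `\n<human>:...\n<bot>:...` string that the `predict` function passes to the tokenizer, using plain Python only. `build_prompt` is a hypothetical helper name introduced for illustration.

```python
def build_prompt(message, history):
    """Join past turns and the new message into a single prompt string."""
    # Append the new message with an empty bot reply for the model to complete
    turns = history + [[message, ""]]
    return "".join(
        "\n<human>:" + user + "\n<bot>:" + bot
        for user, bot in turns
    )


prompt = build_prompt("What is Gradio?", [["Hi", "Hello!"]])
print(repr(prompt))
```

##The prompt ends with an open `<bot>:` marker, which is what cues the model to generate the next bot reply.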

In [None]:
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread

# Load the pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1")

model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1", torch_dtype=torch.float16)

model = model.to('cuda:0')  # Move the model to GPU

# Define a custom stopping criteria class
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [29, 0]  # Token IDs to stop generation
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

# Define the function to handle user messages and generate responses
def predict(message, history):
    # Format the conversation history for the model
    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()  # Instantiate the custom stopping criteria

    # Prepare the input for the model
    messages = "".join(["".join(["\n<human>:"+item[0],
                                 "\n<bot>:"+item[1]]) for item in history_transformer_format])
    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")

    # Set up the text streamer for generating responses
    streamer = TextIteratorStreamer(
        tokenizer,
        timeout=10.,
        skip_prompt=True,
        skip_special_tokens=True
    )

    # Define the generation parameters
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        # Use the custom stopping criteria
        stopping_criteria=StoppingCriteriaList([stop])
    )

    # Start the generation process in a separate thread
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    # Initialize an empty string to store the partial response
    partial_message = ""

    # Iterate over the generated tokens and yield the partial response
    for new_token in streamer:
        if new_token != '<':
            partial_message += new_token
            yield partial_message

# Create a Gradio chat interface
gr.ChatInterface(predict).launch()



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://c216561ce1ab01c787.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
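
##The thread-plus-streamer pattern used above can be explored without a GPU or a model. The sketch below is only an analogy built on the standard library: `fake_generate` is a hypothetical stand-in for `model.generate`, and a `queue.Queue` plays the role of the streamer. It is not a description of how `TextIteratorStreamer` is implemented.

```python
from queue import Queue
from threading import Thread

_DONE = object()  # Sentinel that marks the end of generation


def fake_generate(out_queue):
    """Hypothetical stand-in for model.generate: emit text pieces, then stop."""
    for piece in ["Grad", "io ", "<", "rocks"]:
        out_queue.put(piece)
    out_queue.put(_DONE)


def stream_reply():
    """Run generation in a worker thread and yield the reply as it grows."""
    out_queue = Queue()
    worker = Thread(target=fake_generate, args=(out_queue,))
    worker.start()

    partial_message = ""
    while True:
        # Block until the worker produces something, mirroring the streamer timeout
        piece = out_queue.get(timeout=10.0)
        if piece is _DONE:
            break
        if piece != "<":  # Same filter as in the predict function above
            partial_message += piece
            yield partial_message
    worker.join()


replies = list(stream_reply())
print(replies)
```

##Running generation in a separate thread is what lets the main function keep yielding partial text to the UI while the model is still producing tokens.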


