# Not-so-simple phi2 chatbot

This notebook uses all the pieces explained in

+ `experiments.ipynb` and
+ `experiments_langchain.ipynb`

to implement a chatbot that can

+ classify input queries into 3 different categories (`support`, `sales` and `joke`),
+ respond accordingly to each of the user queries depending on the category they belong to.

## Putting it all together

To put it all together (prompting, conditional branching and conversation memory), we first need to adjust the prompts from the _routing_ section. They were meant to be used standalone, and they rely on the `Instruct/Output` template from `phi-2`. Once we add the conversation history, that template no longer fits.

We're going to follow these steps:

1. Load the model
2. Create the classification template and chain
3. Add the branch template
4. Add the full chat template and chain, which carries the conversation history

### Step 1: load the model

First, we load the phi model in a format that can be used with LangChain. We've created some convenience functions for that in `load_phi_model.py`.

In [1]:
from load_phi_model import load_phi_model_and_tokenizer, get_langchain_model

model, tokenizer = load_phi_model_and_tokenizer()
hf = get_langchain_model(model, tokenizer)

Your device is cuda


Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:00<00:00,  2.67it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Step 2: create the initial classification template

We rename the `text` variable to `human_input`, since this is the name we'll use for the chat at the end.

In [2]:
from langchain.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableBranch, RunnableLambda, RunnablePassthrough
from operator import itemgetter
from langchain.memory import ConversationBufferMemory

In [3]:
classification_template = """
Instruct: Classify the following text in one of the following categories: ["support", "sales", "joke"]. Output only the name of the category.
+ "support" for customer support texts
+ "sales" for sales and commercial texts
+ "joke" for jokes, funny or comedy like texts
Text: {human_input}
Output:
""".strip()

In [4]:
classification_prompt = ChatPromptTemplate.from_template(classification_template)
classification_chain = (
    classification_prompt
    | hf
    | StrOutputParser()
)

In [5]:
print(classification_chain.invoke({"human_input": "Can I track my order? I'm eager to know its status"}))

 support
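
The raw output above comes back with surrounding whitespace (` support` plus trailing newlines), which is why the branching step later checks for a substring instead of an exact match. The same idea can be sketched as a small helper; `normalize_label` and the `"general"` fallback name are hypothetical and are not part of the notebook.

```python
CATEGORIES = ("support", "sales", "joke")


def normalize_label(raw_output: str, default: str = "general") -> str:
    """Map raw classifier text to a known category, or to `default`.

    Hypothetical helper: it mirrors the substring checks used in the branch
    step, but it is not code from the notebook.
    """
    cleaned = raw_output.strip().lower()
    for category in CATEGORIES:
        if category in cleaned:
            return category
    return default


print(normalize_label(" support\n\n"))
print(normalize_label("I am not sure"))
```

Checking categories in a fixed order means that an output mentioning two categories resolves to the first one listed, which matches how an ordered list of branch conditions behaves.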



### Step 3: add the branch template

The branch template gives the right instructions for the chatbot. Depending on the result of the classification, we provide different instructions to the language model.

In [6]:
support_instructions = """\
You are a customer support agent. It seems that the user may have some issues. Answer to their query politely and sincerely. \
Be kind, understanding and say you're sorry for the inconvenience or the situation whenever necessary. Be brief and to the point.\
"""

sales_instructions = """\
You are an aggressive salesperson. The user is looking for some information on products. \
Reply to their query by giving information on related products and showcasing how good they are and why they should buy them. \
Be brief and to the point.\
"""

joke_instructions = """\
You are a comedian. The user wants to have some fun. Reply to their query in a funny way.\
"""

general_instructions = """\
Instruction: Respond to the following query.\
"""

support_chain = PromptTemplate.from_template(support_instructions)
sales_chain = PromptTemplate.from_template(sales_instructions)
joke_chain = PromptTemplate.from_template(joke_instructions)
general_chain = PromptTemplate.from_template(general_instructions)

In [7]:
branch = RunnableBranch(
    (lambda x: "support" in x["topic"].lower(), support_chain),
    (lambda x: "sales" in x["topic"].lower(), sales_chain),
    (lambda x: "joke" in x["topic"].lower(), joke_chain),
    general_chain,
) | RunnableLambda(lambda x: x.text)

branch_chain = {"topic": classification_chain, "human_input": lambda x: x["human_input"]} | branch

In [8]:
response = branch_chain.invoke({"human_input": "Can I track my order? I'm eager to know its status"})

In [9]:
response

"You are a customer support agent. It seems that the user may have some issues. Answer to their query politely and sincerely. Be kind, understanding and say you're sorry for the inconvenience or the situation whenever necessary. Be brief and to the point."
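
To make the routing logic explicit, here is a plain-Python sketch of the "first matching condition wins, otherwise use the default" behaviour that the branch relies on. It does not use LangChain; `pick_instructions` and the shortened instruction strings are illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

Condition = Callable[[dict], bool]


def pick_instructions(
    inputs: dict,
    branches: Sequence[Tuple[Condition, str]],
    default: str,
) -> str:
    """Return the instructions of the first branch whose condition holds."""
    for condition, instructions in branches:
        if condition(inputs):
            return instructions
    return default


# Shortened stand-ins for the instruction strings defined in the notebook.
branches = [
    (lambda x: "support" in x["topic"].lower(), "You are a customer support agent."),
    (lambda x: "sales" in x["topic"].lower(), "You are an aggressive salesperson."),
    (lambda x: "joke" in x["topic"].lower(), "You are a comedian."),
]
default = "Instruction: Respond to the following query."

print(pick_instructions({"topic": " support\n"}, branches, default))
print(pick_instructions({"topic": "unknown"}, branches, default))
```

The default branch matters here because a small model may answer with something other than the three expected labels.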

### Step 4: add full chat template and chain

Finally, we define the full chat template from the pieces built above. It consists of three main sections:

+ The starting line with some broad instructions on how to behave when responding in the chat.
+ An `Instructions:` section where we'll place specific instructions for our three categories (`support`, `sales` and `joke`).
+ The chat section, where we place our `chat_history` and the new `human_input`.

In [10]:
template = """\
You are a chatbot having a conversation with a human. Follow the given instructions to reply to the Human message below.

Instructions:{instructions}

{chat_history}
Human: {human_input}
Chatbot:"""

prompt = PromptTemplate(
    input_variables=["instructions", "chat_history", "human_input"], template=template
)
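
To see what the model will actually receive, the template can be rendered by hand. The sketch below uses plain `str.format` as a stand-in for the prompt object, and the instruction, history and input values are made-up samples.

```python
template = """\
You are a chatbot having a conversation with a human. Follow the given instructions to reply to the Human message below.

Instructions:{instructions}

{chat_history}
Human: {human_input}
Chatbot:"""

# Sample values, invented for illustration only.
rendered = template.format(
    instructions="You are a comedian. Reply to their query in a funny way.",
    chat_history="Human: Hi!\nChatbot: Hello, how can I help?",
    human_input="Tell me a joke about penguins",
)
print(rendered)
```

The rendered prompt ends with `Chatbot:`, so the model's continuation is the bot's next turn.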

Our `chat_chain` below only produces the `prompt`. In other sections of the project, we've used chains that ended with a call to the language model (`chain = ... | hf`). Here we won't do that, though. The reason is explained in the next section.

In [13]:
chat_chain = (
    {
        "human_input": lambda x: x["human_input"],
        # branch_chain receives the same input dict and returns the instruction text
        "instructions": branch_chain,
        "chat_history": lambda x: x["chat_history"],
    } | prompt
)

## Add user interface

We use `gradio`'s `ChatInterface` to create a quick UI to use and test the chatbot. 

> You can set `DEBUG = True` below to see the prompts sent to the LLM.

#### Explanation

`ChatInterface` calls our `predict` function with both the new user `message` and the `history` of previous messages. That's why we don't need any kind of `langchain` memory here.
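
As a rough sketch of how that history becomes the `chat_history` text, assume it arrives as a list of `[user_message, bot_message]` pairs, which is the shape the `predict` cell below indexes into. `format_history` is a hypothetical helper, and its spacing is slightly simplified compared with the notebook's own join.

```python
def format_history(history, human_name="Human", bot_name="Chatbot"):
    """Flatten [user, bot] message pairs into a plain-text transcript."""
    lines = []
    for user_message, bot_message in history:
        lines.append(f"{human_name}: {user_message.strip()}")
        lines.append(f"{bot_name}: {bot_message.strip()}")
    return "\n".join(lines)


history = [
    ["Hi", "Hello!"],
    ["How are you?", " Fine."],
]
print(format_history(history))
```

An empty history yields an empty string, so the first turn of a conversation renders a prompt without a transcript.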

In addition, we use a chain (`chat_chain`) that ends in a prompt and does not call the Hugging Face model. We could not find a way to define a full LangChain chain that also `stream`s its output so that the text appears word by word. [Some sources](https://github.com/langchain-ai/langchain/issues/2918#issuecomment-1516441771) recommend using `HuggingFacePipeline`, but [it's not clear that it even supports streaming](https://python.langchain.com/docs/integrations/llms/#features-natively-supported). Thus, we break the chain and divide it into two parts:

+ part one, create the right prompt for the `human_input`
+ part two, call the model using a _streamer_
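
The two-part split can be illustrated without a model. In the sketch below, a background thread plays the role of the generation call and pushes text chunks into a queue-backed iterator, while the caller consumes them and yields the growing reply. `ToyStreamer` and `fake_generate` are hypothetical stand-ins, not the Hugging Face classes used in the notebook.

```python
import queue
import threading


class ToyStreamer:
    """Hypothetical stand-in for a token streamer: a thread-safe chunk iterator."""

    _END = object()  # sentinel that marks the end of generation

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, chunk):
        self._queue.put(chunk)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer sends something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(prompt, streamer):
    """Pretend model: emits a canned reply chunk by chunk."""
    for chunk in ["Sure", ",", " here", " you", " go", "."]:
        streamer.put(chunk)
    streamer.end()


def stream_reply(prompt):
    """Part two of the split: run generation in a thread, yield partial text."""
    streamer = ToyStreamer()
    worker = threading.Thread(target=fake_generate, args=(prompt, streamer))
    worker.start()
    partial = ""
    for chunk in streamer:
        partial += chunk
        yield partial
    worker.join()


snapshots = list(stream_reply("Human: hello\nChatbot:"))
print(snapshots[-1])
```

Yielding the accumulated text instead of individual chunks fits a UI that redraws the whole message on every update.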

In [47]:
import gradio as gr
import torch
from load_phi_model import StopOnTokens, StopOnNames
from transformers import StoppingCriteriaList, TextIteratorStreamer, pipeline
from threading import Thread
import re

In [48]:
HUMAN_NAME = "Human"
BOT_NAME = "Chatbot"

In [49]:
DEBUG = False # set to True to see the prompt sent to the model

In [50]:
device = "cuda" if torch.cuda.is_available() else "cpu"
chat_name_pattern_end = r'\n.+:$' # matches substrings like `\nUser:` at the end

def predict(message, history):
    stop_on_tokens = StopOnTokens()
    stop_on_names = StopOnNames(
        [tokenizer.encode(HUMAN_NAME), tokenizer.encode(BOT_NAME)])

    # Flatten gradio's [user, bot] history pairs into a "Human: ... / Chatbot: ..." transcript.
    messages = "".join(["".join(
        [f"\n{HUMAN_NAME}: "+item[0], f"\n{BOT_NAME}:"+item[1]]
    ) for item in history]).strip()

    input_dict = {
        "human_input": message,
        "chat_history": messages,
    }

    input_prompt = chat_chain.invoke(input_dict).text
    if DEBUG: print(input_prompt)

    model_inputs = tokenizer([input_prompt], return_tensors="pt").to(device)
    streamer = TextIteratorStreamer(tokenizer, timeout=10., 
                                    skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=256,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        num_beams=1,
        stopping_criteria=StoppingCriteriaList([stop_on_tokens, stop_on_names])
        )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

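    # Accumulate streamed chunks and yield the growing reply so the UI updates live.
    partial_message = ""
    for new_token in streamer:
        partial_message += new_token
        # Drop a trailing speaker label (e.g. "\nHuman:") if the model starts a new turn.
        match = re.search(chat_name_pattern_end, partial_message)
        if match:
            partial_message = partial_message[:-len(match.group())]
        yield partial_message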

gr.ChatInterface(predict).queue().launch()

Running on local URL:  http://127.0.0.1:7880

To create a public link, set `share=True` in `launch()`.
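
The trailing-name cleanup inside `predict` can be checked in isolation. This sketch reuses the same regular expression; `strip_trailing_speaker` is a hypothetical wrapper added for illustration.

```python
import re

CHAT_NAME_PATTERN_END = r"\n.+:$"  # same pattern as chat_name_pattern_end


def strip_trailing_speaker(text: str) -> str:
    """Remove a final line that looks like a speaker label, e.g. '\\nHuman:'."""
    match = re.search(CHAT_NAME_PATTERN_END, text)
    if match:
        return text[: -len(match.group())]
    return text


print(repr(strip_trailing_speaker("Your order ships tomorrow.\nHuman:")))
print(repr(strip_trailing_speaker("Note: nothing to strip here")))
```

The pattern only fires when the text ends with a colon on its last line, so colons in the middle of a reply are left alone.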


