In [9]:
!pip install -q transformers accelerate bitsandbytes
!pip install -q protobuf optimum auto-gptq gradio

## Load the quantized Zephyr model from Hugging Face

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_name_or_path = "TheBloke/zephyr-7B-alpha-GPTQ" # "TheBloke/Mistral-7B-OpenOrca-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                            #  torch_dtype=torch.bfloat16,
                                             revision="main")

## Loading the tokenizer from the parent repo
  -- The quantized model repo doesn't have a chat template

In [2]:
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Definition of the chat template in Jinja format

In [3]:
tokenizer.chat_template

"{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"

## Example of Model inference using the chat template

In [4]:
messages = [
    # Information about SiftMed:
    # We Want To Transform Medical Reviews. Our Mission: It all started with a
    # banker's box of files, a kitchen table and one overwhelmed medical file
    # reviewer. Our Co-Founders had a vision - develop a way to easily sift
    # through medical records to find quintessential facts. No one should
    # struggle to find facts in medical data. Ever.
    # Our Mission - To revolutionize the way the world reviews medical records.
    # No matter your role in a personal injury case, reviewing medical files for
    # the facts can be a demanding, yet critical, process. Our goal is to help
    # law firms and legal departments automate and streamline their medical
    # record review processes, saving them time, increasing efficiency, and
    # improving report accuracy, leading to better outcomes for clients and
    # greater success for the firm.
    # SiftMed is built with proprietary technology. SiftMed is dedicated to
    # developing and using the most advanced technology available to help
    # improve your case preparation process. Simply upload the medical records
    # you need to review and let our technology guide you.
    # Natural Language Processing & ML: Automates the organization and
    # categorization of records. We take large medical files, allow for them to
    # be chronologically sorted, create categories, and resolve duplicates.
    # Optical Character Recognition: Instantly makes your medical records
    # searchable and allows you to easily copy and paste key pieces of
    # information from PDF image files, including handwritten text.
    # Artificial Intelligence: Our team is developing the groundwork for a
    # predictive analytics model that will highlight pertinent diagnoses and
    # critical information in a file.
    # Our Team: Founded in 2020, SiftMed has grown significantly! The team is
    # now a diverse group of innovative leaders, data scientists, software
    # engineers, and medical experts.
    {"role": "user", "content": "You are an helpful Assistant, called RIO. Interact cheerfully, humorously, and friendly."},
    {"role": "assistant", "content": "Hello, I will try my best to assist you today"},
    {"role": "user", "content": "Hi RIO"}
]
device = "cuda"
encodeds = tokenizer.apply_chat_template(messages, return_tensors='pt')

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.


<|user|>
You are an helpful Assistant, called RIO. Interact cheerfully, humorously, and friendly.</s> 
<|assistant|>
Hello, I will try my best to assist you today</s> 
<|user|>
Hi RIO</s> 
<|assistant|>
Hi there! How can I assist you today? I'm just a friendly RIO waiting to help you out!</s>
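
The decoded output above shows the layout the chat template produces: a role tag, a newline, the message content, and the end-of-sequence token for every turn. The sketch below reproduces that layout in plain Python to make the format explicit. The helper name `render_zephyr_prompt` is hypothetical, and the exact whitespace assumes the Jinja template is rendered with block newlines trimmed; treat it as an approximation of `tokenizer.apply_chat_template`, not a replacement for it.

```python
def render_zephyr_prompt(messages, eos_token="</s>", add_generation_prompt=False):
    """Plain-Python approximation of the Zephyr chat template (hypothetical helper)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role in ("user", "system", "assistant"):
            # Each turn: role tag, newline, content, EOS token, newline.
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt and messages:
        # An open assistant tag asks the model to write the next reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Hi RIO"},
]
print(render_zephyr_prompt(example, add_generation_prompt=True))
```

Passing `add_generation_prompt=True` to `apply_chat_template` has the same purpose: the prompt then ends with an open `<|assistant|>` tag instead of relying on the model to emit it.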


## Gradio Demo

This demo builds a chat interface for talking with the LLM and seeing how it responds to input text. Some of the generation hyperparameters (top-p, temperature, max new tokens) are exposed on the screen, so you can change them and see how the model reacts.
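
To build intuition for the temperature and top-p controls, here is a small conceptual sketch on a toy four-token distribution. The function names and the toy logits are made up for illustration, and this is not how Transformers implements sampling internally; it only shows the idea that a lower temperature sharpens the distribution and that top-p keeps the smallest set of likely tokens whose probability mass reaches the threshold.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
for temperature in (0.2, 0.8):
    probs = apply_temperature(toy_logits, temperature)
    survivors = sorted(top_p_filter(probs, 0.9))
    print(temperature, [round(p, 3) for p in probs], survivors)
```

With these toy numbers, the low temperature leaves a single candidate token after top-p filtering, while the higher temperature leaves three, which is why higher settings produce more varied replies.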

1. Chat-Bot:
  - A regular chatbot that responds to the user's input.
  - The prompt format follows the Hugging Face chat template.
2. Instruction Chat-Bot:
  - Added to check LLM performance.
  - Responds according to a custom instruction statement.
  - The prompt is very basic, but quite powerful.
  - The model's response depends heavily on the custom instruction statement.


  Note: This is the best small (7B) model as of Oct 15, 2023.
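
The instruction prompt used by the second bot is a fixed layout with two slots, one for the instruction and one for the user's statement. A minimal sketch of filling that layout is shown here; it copies the layout of the notebook's `custom_chat_prompt` but uses `str.format` placeholders instead of the notebook's `##key` replacement, and the names `INSTRUCTION_TEMPLATE` and `build_instruction_prompt` are hypothetical.

```python
# Same layout as the notebook's instruction prompt, with str.format slots.
INSTRUCTION_TEMPLATE = (
    "###Instruction:\n{instruction}</s>"
    "###Statement:\n{statement}</s>\n"
    "###Response:\n"
)


def build_instruction_prompt(instruction, statement):
    """Fill the instruction and statement slots of the prompt layout."""
    return INSTRUCTION_TEMPLATE.format(instruction=instruction, statement=statement)


prompt = build_instruction_prompt(
    "Reply in exactly one sentence.",
    "What is a quantized model?",
)
print(prompt)
```

Because the prompt ends with `###Response:`, everything the model generates after that marker can be treated as its answer.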


In [7]:
import gradio as gr
def variable_args_function(text, **kwargs):
    """Replace each ##key placeholder in text with the matching keyword value."""
    for key, value in kwargs.items():
        text = text.replace(f"##{key}", str(value))

    return text

chat_prompt = """<|system|>\nYou are a helpful Assistant. Interact cheerfully, humorously, and friendly. Keep your response to the point and under 200 tokens</s>"""

custom_chat_prompt = """###Instruction:\n##instruction</s>###Statement:\n##statement</s>\n###Response:\n"""

model_prompts = ['Chat BOT', 'Custom instructions - Chat BOT']

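def gen_user_assistant_chat(user_list, bot_list):
    """Build the Zephyr-style turn history from paired user and bot messages."""
    prompt = ""
    for i, user_msg in enumerate(user_list):
        prompt = prompt + "<|user|>\n" + user_msg + "</s>" + "<|assistant|>\n"
        # The newest user message has no bot reply yet
        if i < len(bot_list):
            prompt = prompt + bot_list[i] + "</s>"

    # Crude length cap: drop the first 500 characters of a long history
    # (a character cut, so it can slice through a role tag)
    if len(prompt) > 1000:
        prompt = prompt[500:]

    return prompt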

def format_prompt_gen(chat_prompt, user_list, bot_list):
    prompt = gen_user_assistant_chat(user_list, bot_list)
    prompt = chat_prompt + prompt
    return prompt

def format_custom_prompt(custom_prompt, instruction, statement):
    # Fill the ##instruction and ##statement placeholders of the given template
    return variable_args_function(custom_prompt, instruction=instruction, statement=statement)

def reset(z):
    return [], []

def create_gradio_app(
    model,concurrency_count=4, share=True
):
  prompt_list = []
  bot_list = []
  def generate(
          prompt,
          history,
          selected_task,
          custom_instruct,
          top_p_slider, temperature_slider, max_new_tokens_textbox
      ):
          prompt_list.append(prompt)
          if selected_task == 'Custom instructions - Chat BOT':
              # if custom_instruct == '':
              #     custom_prompt = chat_prompt
              formatted_prompt = format_custom_prompt(custom_chat_prompt, custom_instruct, prompt)
          else:
              #General
              formatted_prompt = format_prompt_gen(chat_prompt, prompt_list, bot_list)


          input_ids = tokenizer(formatted_prompt, return_tensors='pt').input_ids.cuda()
          output = model.generate(inputs=input_ids, temperature=temperature_slider, do_sample=True, top_p=top_p_slider, top_k=40, max_new_tokens=int(max_new_tokens_textbox))

          output_text = tokenizer.decode(output[0])
          if selected_task == 'Chat BOT':
              # Keep only the text after the last assistant tag
              final_output = output_text.split("<|assistant|>")[-1]
              # Strip special tokens so "</s>" is not stored twice in the history
              final_output = final_output.replace("</s>", "").replace("<s>", "").replace(":", "")
              bot_list.append(final_output)
          else:
              # Keep only the text after the response marker
              final_output = output_text.split("###Response")[1]
              final_output = final_output.replace("</s>", "").replace("<s>", "").replace(":", "")
          return final_output

  with gr.Blocks() as demo:
      gr.Markdown("## Chat with your Buddy")
      task_dropdown = gr.Dropdown(model_prompts, value=model_prompts[0], label="Select Task", allow_custom_value=False)

      # Define additional inputs
      custom_instruct_textbox = gr.Textbox(value="", label="Custom instructions", visible=task_dropdown.value == "Custom instructions - Chat BOT")
      # https://discuss.huggingface.co/t/clear-chat-interface/49866/6
      bot = gr.Chatbot(render=False)
      # Create a chat interface with live updates
      # Additional fields
      top_p_slider = gr.Slider(0.1, 1.0, value=0.5, label="Top P")
      temperature_slider = gr.Slider(0.1, 1.0, value=0.8, label="Temperature")
      max_new_tokens_textbox = gr.Textbox("50", label="Max New Tokens", type="text")


      chat_interface = gr.ChatInterface(
          generate,
          chatbot=bot,
          additional_inputs=[task_dropdown, custom_instruct_textbox, top_p_slider, temperature_slider, max_new_tokens_textbox],
          autofocus=True  # Focus the input textbox when the page loads
      )

      # Toggle the instruction box and clear the stored history when the task changes
      def update_visibility(selected_task):
          # Clear the shared lists in place; rebinding them here would only create locals
          prompt_list.clear()
          bot_list.clear()
          if selected_task == "Custom instructions - Chat BOT":
              return gr.Textbox(label="Custom instructions", visible=True, value="")
          else:
              return gr.Textbox(visible=False)

      # update_visibility()  # Call the function to set the initial visibility
      task_dropdown.change(update_visibility, task_dropdown, [custom_instruct_textbox])
      task_dropdown.change(reset, task_dropdown, outputs=[bot, chat_interface.chatbot_state])

      demo.queue(concurrency_count=concurrency_count).launch(share=share)
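
The history cap in `gen_user_assistant_chat` cuts by character count, so a long conversation can start in the middle of a message or a role tag. One possible alternative, sketched below under the same 1000-character budget, is to drop whole turns from the front instead. The function name `build_history_prompt` and the `max_chars` parameter are hypothetical, and the notebook itself does not use this variant.

```python
def build_history_prompt(user_turns, bot_turns, max_chars=1000):
    """Format chat turns Zephyr-style, dropping the oldest turns to fit max_chars."""
    turns = []
    for i, user_text in enumerate(user_turns):
        turn = "<|user|>\n" + user_text + "</s>" + "<|assistant|>\n"
        # The newest user message has no bot reply yet.
        if i < len(bot_turns):
            turn += bot_turns[i] + "</s>"
        turns.append(turn)
    # Remove whole turns from the front, always keeping the newest one.
    while len(turns) > 1 and sum(len(t) for t in turns) > max_chars:
        turns.pop(0)
    return "".join(turns)


users = ["a" * 600, "b" * 600, "hi"]
bots = ["first reply", "second reply"]
history = build_history_prompt(users, bots)
print(len(history))
```

A token-based budget computed with the tokenizer would track the model's context window more closely than characters; the character count is kept here only to mirror the notebook's existing limit.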

In [8]:
create_gradio_app(model)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://c8a2fba43aba09412a.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
