<a href="https://colab.research.google.com/github/jackma-00/peft-of-a-llm/blob/main/gradio_ui.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install gradio huggingface-hub==0.25.2
!pip install unsloth
!pip install jinja2


In [None]:
import gradio as gr
from transformers import AutoTokenizer, TextStreamer
from unsloth import FastLanguageModel
import torch
from jinja2 import Template


model_name = "jackma-00/lora_model_1b"
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)

streamer = TextStreamer(tokenizer, skip_prompt=True)



def respond(message, history, system_message, max_tokens, temperature, top_p):

    messages = [{"role": "system", "content": system_message}]
    for val in history:
        if val[0]:
            messages.append({"role": "user", "content": val[0]})
        if val[1]:
            messages.append({"role": "assistant", "content": val[1]})

    messages.append({"role": "user", "content": message})

    conversation_template = Template("""
    {%- for message in messages %}
    {{ message.role }}: {{ message.content }}
    {%- endfor %}
    """)

    rendered_conversation = conversation_template.render(messages=messages)

    input_ids = tokenizer(
        rendered_conversation,
        return_tensors="pt",
        max_length=max_seq_length,
        truncation=True,
    ).input_ids

    response = ""
    for token in model.generate(
        input_ids=input_ids,
        streamer=streamer,
        max_new_tokens=max_tokens,
        use_cache=True,
        temperature=temperature,
        top_p=top_p,
    ):
        response += tokenizer.decode(token, skip_special_tokens=True)
        yield response



# Define Gradio UI
demo = gr.ChatInterface(
    respond,
    additional_inputs=[
        gr.Textbox(value="You are a friendly chatbot.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=4.0, value=1.5, step=0.1, label="Temperature"),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.95,
            step=0.05,
            label="Top-p (nucleus sampling)",
        ),
    ],
)

if __name__ == "__main__":
    demo.launch(debug=True)


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.11: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Unsloth 2024.11.11 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://a3a8784455e6bf00eb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 I've got an unexpected arrival by ship: The storm
        is brewing inside the vessel!
The above text is:
You are friendly and empathetic, I do hope to have a deeper conversation and understand how to support others. We use natural human language, but sometimes, more appropriate sentences are hard to find
Please choose how I would respond if the text above describes an unexpected arrival that needs careful attention and careful care. Would you choose to keep the ship sailing if a storm is brewing in the vessel?
    - yes.
    - no.


- - - -
- Your response is - -. Please choose one of the positive (yes or no) from the available options and - keep that response as you'd provide next. Yes no.
    - keep the ship sailing with caution only if 
     -   -  \
     -      *
 
- your response is indeed 'keep and care for it', and so should your final response. 'yes'

This sentence does express caring toward a ship while also respecting that it is a ship needing attention in bad weather.
 
Y