# Building a Web UI for PLLuM-8x7B-chat GGUF with Gradio

This notebook demonstrates how to build a simple web interface for the PLLuM-8x7B-chat model using Gradio. The interface lets users chat with the model, adjust generation parameters, and switch between quantized variants of the model.

## Prerequisites

- Download at least one quantized model from [Hugging Face](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF)
- Install required packages: `pip install llama-cpp-python gradio`

In [None]:
# Install required packages if not already installed
%pip install llama-cpp-python gradio

## 1. Import Libraries and Set Up the Model

In [None]:
import os
import time
import gradio as gr
from llama_cpp import Llama

# Define available models - update paths to match your setup
models = {
    "PLLuM-8x7B-chat-q4_k_m": "../models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf",
    "PLLuM-8x7B-chat-q3_k_m": "../models/PLLuM-8x7B-chat-gguf-q3_k_m.gguf",
    "PLLuM-8x7B-chat-iq3_s": "../models/PLLuM-8x7B-chat-gguf-iq3_s.gguf"
}

# Check which models are available
available_models = {}
for name, path in models.items():
    if os.path.exists(path):
        available_models[name] = path
        print(f"✅ {name} found at {path}")
    else:
        print(f"❌ {name} not found at {path}")

if not available_models:
    raise FileNotFoundError("No models found. Please download at least one model.")
else:
    print(f"\nFound {len(available_models)} model(s) for the web UI.")

## 2. Create Model Loading Function

Let's create a function to load a model based on user selection.
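
The idea behind the loader is to keep at most one model in memory and to reuse it when the same name is requested again. The following is a minimal sketch of that pattern only, not the notebook's implementation: `ModelCache` and `fake_loader` are hypothetical names, and `fake_loader` stands in for the real `Llama(...)` call so the example runs without a model file.

```python
from typing import Callable, Dict, Optional, Tuple


class ModelCache:
    """Keep at most one loaded model in memory, keyed by its display name."""

    def __init__(self, loader: Callable[[str], object], paths: Dict[str, str]):
        self._loader = loader  # hypothetical stand-in for Llama(model_path=...)
        self._paths = paths
        self._name: Optional[str] = None
        self._model: Optional[object] = None

    def get(self, name: str) -> Tuple[Optional[object], str]:
        if name not in self._paths:
            return None, f"Model '{name}' not found."
        if self._model is not None and self._name == name:
            return self._model, f"Model '{name}' is already loaded."
        # Drop the old reference first so two large models never coexist
        self._model, self._name = None, None
        self._model = self._loader(self._paths[name])
        self._name = name
        return self._model, f"Model '{name}' loaded."


calls = []


def fake_loader(path):
    """Record each load request instead of reading a GGUF file."""
    calls.append(path)
    return {"path": path}


cache = ModelCache(fake_loader, {"a": "a.gguf", "b": "b.gguf"})
first, _ = cache.get("a")
again, status = cache.get("a")   # served from the cache, no second load
switched, _ = cache.get("b")     # replaces the cached model
```

Resetting the cached reference to `None` (rather than deleting the variable) keeps the cache in a valid state even if the next load raises an exception.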

In [None]:
# Global variable to keep track of the loaded model
loaded_model = None
loaded_model_name = None

def load_model(model_name, n_threads=8, n_ctx=2048):
    """Load a model based on the name and return it."""
    global loaded_model, loaded_model_name
    
    # If the requested model is already loaded, return it
    if loaded_model is not None and loaded_model_name == model_name:
        return loaded_model, f"Model '{model_name}' is already loaded."
    
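    # If a different model is loaded, unload it first
    if loaded_model is not None:
        # Rebind to None instead of using `del`: deleting the global would
        # raise NameError on the next call if loading fails below
        loaded_model = None
        loaded_model_name = None
        import gc
        gc.collect()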
    
    # Check if the model exists
    if model_name not in available_models:
        return None, f"Model '{model_name}' not found. Available models: {', '.join(available_models.keys())}"
    
    # Load the model
    try:
        start_time = time.time()
        model_path = available_models[model_name]
        
        llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_threads=n_threads,
            n_batch=512,
            verbose=False
        )
        
        load_time = time.time() - start_time
        loaded_model = llm
        loaded_model_name = model_name
        
        return llm, f"Model '{model_name}' loaded successfully in {load_time:.2f} seconds."
    except Exception as e:
        return None, f"Error loading model '{model_name}': {str(e)}"

## 3. Implement Chat Interface Logic

Let's implement the chat functionality to handle conversation history.
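
To make the prompt layout concrete, here is a small sketch of the plain-text format used in this notebook, applied to a sample conversation. `build_prompt` is a hypothetical helper written for illustration; it takes the current user message as an explicit argument and assumes the same `Instrukcja` / `Użytkownik` / `Asystent` labels as the code in this section.

```python
def build_prompt(system_prompt, history, message):
    """Render earlier turns plus the new user message as one text prompt."""
    lines = [f"Instrukcja: {system_prompt}", ""]
    for user_msg, bot_msg in history:
        lines.append(f"Użytkownik: {user_msg}")
        if bot_msg:
            lines.append(f"Asystent: {bot_msg}")
    # The current message goes last, followed by the assistant prefix
    # that the model is expected to continue
    lines.append(f"Użytkownik: {message}")
    lines.append("Asystent: ")
    return "\n".join(lines)


example = build_prompt(
    "Odpowiadaj krótko.",
    [("Cześć", "Dzień dobry!")],
    "Jak się masz?",
)
print(example)
```

The stop sequence `"\nUżytkownik:"` used during generation relies on this layout: it cuts the output off if the model starts writing the next user turn itself.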

In [None]:
def format_chat_prompt(history, system_prompt):
    """Format the chat history into a prompt for the model."""
    prompt = f"Instrukcja: {system_prompt}\n\n"
    
    for user_msg, bot_msg in history:
        prompt += f"Użytkownik: {user_msg}\n"
        if bot_msg:
            prompt += f"Asystent: {bot_msg}\n"
    
    # End with the assistant prefix so the model continues with its reply
    prompt += "Asystent: "
    return prompt

def chat_with_model(message, history, system_prompt, model_name, temperature, max_tokens, top_p, top_k, repeat_penalty):
    """Process a chat message and return the model's response."""
    # Load the selected model if not already loaded
    llm, status = load_model(model_name)
    if llm is None:
        return status
    
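    # Format the prompt from the earlier turns plus the current user message
    # (the caller passes history without the pending turn, so add it here)
    prompt = format_chat_prompt(list(history) + [(message, None)], system_prompt)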
    
    # Generate response
    try:
        start_time = time.time()
        output = llm(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repeat_penalty=repeat_penalty,
            stop=["\nUżytkownik:", "\nUser:"]  # Stop generating at new user message
        )
        
        generation_time = time.time() - start_time
        assistant_response = output["choices"][0]["text"]
        
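        # Report generation stats in the cell output. They are kept out of the
        # returned text so they do not end up in the chat history and prompt.
        # Use the token count reported by llama.cpp, not a whitespace word count.
        tokens_generated = output["usage"]["completion_tokens"]
        tokens_per_second = tokens_generated / generation_time if generation_time > 0 else 0
        print(f"Generated {tokens_generated} tokens in {generation_time:.2f} seconds ({tokens_per_second:.2f} tokens/sec)")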
        
        return assistant_response
    except Exception as e:
        return f"Error generating response: {str(e)}"

## 4. Create the Gradio Web Interface

Now, let's build the Gradio interface with chat functionality and parameter controls.
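
The chat handler in this section is a generator: it yields the history once to show the user's message immediately, then yields again when the reply is ready. The sketch below isolates that two-step pattern without Gradio or a model. `reply_fn` and `echo_reply` are hypothetical stand-ins for the model call.

```python
def respond(message, chat_history, reply_fn):
    """Yield the history twice: with the pending turn, then with the reply."""
    if not message.strip():
        # Nothing to send: emit the unchanged history and stop
        yield chat_history
        return
    chat_history = chat_history + [(message, None)]
    yield chat_history  # first update: user message, no reply yet

    reply = reply_fn(message, chat_history[:-1])
    chat_history[-1] = (message, reply)
    yield chat_history  # second update: reply filled in


def echo_reply(message, history):
    """Hypothetical stand-in for the model call."""
    return f"echo: {message}"


# Copy each yielded list, because the same list object is mutated in place
updates = [list(h) for h in respond("Cześć", [], echo_reply)]
print(updates)
```

When a generator like this is wired to a Gradio event, each `yield` becomes a separate update of the output component.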

In [None]:
def create_web_ui():
    """Build the Gradio web interface and return it (the caller launches it)."""
    # Default system prompt
    default_system_prompt = (
        "Jesteś pomocnym, uprzejmym i dokładnym asystentem AI o imieniu PLLuM, "
        "stworzonym przez polskie Ministerstwo Cyfryzacji. "
        "Odpowiadasz na pytania użytkownika w języku polskim, chyba że zostaniesz poproszony o inny język. "
        "Twoje odpowiedzi są zwięzłe, merytoryczne i pomocne."
    )
    
    # Define interface
    with gr.Blocks(title="PLLuM Chat") as demo:
        gr.Markdown("# PLLuM Chat Interface")
        gr.Markdown("Talk with the PLLuM-8x7B-chat model in Polish or English.")
        
        with gr.Row():
            with gr.Column(scale=3):
                # Chat interface
                chatbot = gr.Chatbot(height=500, label="Conversation")
                msg = gr.Textbox(placeholder="Type your message here...", label="Your message")
                with gr.Row():
                    submit_btn = gr.Button("Send", variant="primary")
                    clear_btn = gr.Button("Clear Conversation")
                
                # System prompt
                system_prompt = gr.Textbox(
                    value=default_system_prompt,
                    label="System Prompt",
                    lines=3
                )
                
            with gr.Column(scale=1):
                # Model selection and parameters
                model_dropdown = gr.Dropdown(
                    choices=list(available_models.keys()),
                    value=list(available_models.keys())[0] if available_models else None,
                    label="Select Model"
                )
                
                with gr.Accordion("Generation Parameters", open=True):
                    temperature = gr.Slider(
                        minimum=0.0, maximum=2.0, value=0.7, step=0.05,
                        label="Temperature",
                        info="Higher values = more creative, lower values = more deterministic"
                    )
                    
                    max_tokens = gr.Slider(
                        minimum=64, maximum=2048, value=512, step=64,
                        label="Max Tokens",
                        info="Maximum length of generated response"
                    )
                    
                    top_p = gr.Slider(
                        minimum=0.0, maximum=1.0, value=0.95, step=0.05,
                        label="Top-p",
                        info="Nucleus sampling: consider tokens with top_p cumulative probability"
                    )
                    
                    top_k = gr.Slider(
                        minimum=1, maximum=100, value=50, step=1,
                        label="Top-k",
                        info="Consider top k most likely tokens at each step"
                    )
                    
                    repeat_penalty = gr.Slider(
                        minimum=1.0, maximum=2.0, value=1.1, step=0.05,
                        label="Repeat Penalty",
                        info="Higher values discourage repetition"
                    )
                
                # Model information
                model_info = gr.Textbox(label="Model Status", interactive=False)
        
        # Show an initial hint in the status box when the page loads
        demo.load(lambda: "Please select a model and click 'Load Model'", outputs=model_info)
        
        # Event handlers
        load_btn = gr.Button("Load Model")
        load_btn.click(
            lambda model: load_model(model)[1],
            inputs=model_dropdown,
            outputs=model_info
        )
        
        # Chat submission events
        def respond(message, chat_history, system_prompt, model_name, temperature, max_tokens, top_p, top_k, repeat_penalty):
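            if not message.strip():
                # This function is a generator, so emit the unchanged history
                # with `yield`; a bare `return value` would send no update
                yield chat_history
                return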
            
            # Add user message to history
            chat_history.append((message, None))
            yield chat_history
            
            # Get model response
            response = chat_with_model(
                message, chat_history[:-1], system_prompt, model_name,
                temperature, max_tokens, top_p, top_k, repeat_penalty
            )
            
            # Update history with bot response
            chat_history[-1] = (message, response)
            yield chat_history
        
        submit_btn.click(
            respond,
            inputs=[msg, chatbot, system_prompt, model_dropdown, temperature, max_tokens, top_p, top_k, repeat_penalty],
            outputs=chatbot,
            show_progress=True
        ).then(
            lambda: "",  # Clear the message box after sending, as msg.submit does
            outputs=msg
        )
        
        msg.submit(
            respond,
            inputs=[msg, chatbot, system_prompt, model_dropdown, temperature, max_tokens, top_p, top_k, repeat_penalty],
            outputs=chatbot,
            show_progress=True
        ).then(
            lambda: "",  # Clear the message box after sending
            outputs=msg
        )
        
        # Clear conversation button
        clear_btn.click(
            lambda: [],
            outputs=chatbot
        )
    
    return demo

## 5. Launch the Web Interface

Let's launch the web interface.

In [None]:
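# Create and launch the interface
# share=True creates a temporary public URL; debug=True keeps this cell
# running and prints errors, so interrupt it before running later cells
demo = create_web_ui()
demo.launch(share=True, debug=True)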

## 6. Create a Basic Translation Interface

Let's also create a simpler interface specifically for translation tasks.
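
The translation prompt in this section is a Polish sentence, so the language names must be inflected: the source language appears in the genitive ("z języka polskiego") and the target language in the accusative ("na język angielski"). That is why the two dropdowns list different word forms. A minimal sketch of the same idea, using a hypothetical `LANGUAGE_FORMS` table and `build_translation_prompt` helper with English keys:

```python
# (genitive, accusative) forms of each language name; hypothetical lookup table
LANGUAGE_FORMS = {
    "Polish": ("polskiego", "polski"),
    "English": ("angielskiego", "angielski"),
    "German": ("niemieckiego", "niemiecki"),
}


def build_translation_prompt(text, source, target):
    """Build the Polish instruction with correctly inflected language names."""
    source_genitive = LANGUAGE_FORMS[source][0]
    target_accusative = LANGUAGE_FORMS[target][1]
    return (
        f"Przetłumacz poniższy tekst z języka {source_genitive} "
        f"na język {target_accusative}:\n\n'{text}'\n\nTłumaczenie:"
    )


prompt = build_translation_prompt("Dzień dobry", "Polish", "English")
print(prompt)
```

A lookup table like this would let the interface show one list of language names while still producing a grammatical prompt.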

In [None]:
def translate_text(text, from_lang, to_lang, model_name, temperature=0.7):
    """Translate text using the PLLuM model."""
    # Load model
    llm, status = load_model(model_name)
    if llm is None:
        return status
    
    # Create prompt
    prompt = f"Przetłumacz poniższy tekst z języka {from_lang} na język {to_lang}:\n\n'{text}'\n\nTłumaczenie:"
    
    # Generate translation
    try:
        output = llm(
            prompt,
            max_tokens=1024,
            temperature=temperature,
            top_p=0.95,
            top_k=50,
            repeat_penalty=1.1
        )
        
        return output["choices"][0]["text"]
    except Exception as e:
        return f"Error during translation: {str(e)}"

# Create translation interface
with gr.Blocks(title="PLLuM Translator") as translator_demo:
    gr.Markdown("# PLLuM Translation Interface")
    gr.Markdown("Translate text between Polish and other languages using the PLLuM-8x7B-chat model.")
    
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Input Text", lines=5, placeholder="Enter text to translate...")
            
            with gr.Row():
                from_lang = gr.Dropdown(
                    choices=["polskiego", "angielskiego", "niemieckiego", "francuskiego", "hiszpańskiego", "włoskiego", "rosyjskiego"],
                    value="polskiego",
                    label="From Language"
                )
                
                to_lang = gr.Dropdown(
                    choices=["polski", "angielski", "niemiecki", "francuski", "hiszpański", "włoski", "rosyjski"],
                    value="angielski",
                    label="To Language"
                )
            
            model_dropdown = gr.Dropdown(
                choices=list(available_models.keys()),
                value=list(available_models.keys())[0] if available_models else None,
                label="Select Model"
            )
            
            temperature = gr.Slider(
                minimum=0.0, maximum=1.0, value=0.3, step=0.1,
                label="Temperature",
                info="Lower values recommended for translation"
            )
            
            translate_btn = gr.Button("Translate", variant="primary")
        
        with gr.Column():
            output_text = gr.Textbox(label="Translation Result", lines=5)
            model_status = gr.Textbox(label="Model Status", interactive=False)
    
    # Show an initial hint in the status box when the page loads
    translator_demo.load(lambda: "Please click 'Load Model' to begin", outputs=model_status)
    
    # Load model button
    load_model_btn = gr.Button("Load Model")
    load_model_btn.click(
        lambda model: load_model(model)[1],
        inputs=model_dropdown,
        outputs=model_status
    )
    
    # Translate button
    translate_btn.click(
        translate_text,
        inputs=[input_text, from_lang, to_lang, model_dropdown, temperature],
        outputs=output_text
    )

# Launch the translation interface
translator_demo.launch(share=True)

## 7. Creating a Standalone Version

Here's how you can create a standalone Python script that you can run from the command line.

In [None]:
%%writefile ../web_ui.py

import os
import time
import argparse
import gradio as gr
from llama_cpp import Llama

# Global variable to keep track of the loaded model
loaded_model = None
loaded_model_name = None

def load_model(model_path, n_threads=8, n_ctx=2048):
    """Load a model from the given path."""
    global loaded_model, loaded_model_name
    
    model_name = os.path.basename(model_path)
    
    # If the requested model is already loaded, return it
    if loaded_model is not None and loaded_model_name == model_name:
        return loaded_model, f"Model '{model_name}' is already loaded."
    
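    # If a different model is loaded, unload it first
    if loaded_model is not None:
        # Rebind to None instead of using `del`: deleting the global would
        # raise NameError on the next call if loading fails below
        loaded_model = None
        loaded_model_name = None
        import gc
        gc.collect()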
    
    # Check if the model exists
    if not os.path.exists(model_path):
        return None, f"Model file not found at '{model_path}'"
    
    # Load the model
    try:
        start_time = time.time()
        
        llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_threads=n_threads,
            n_batch=512,
            verbose=False
        )
        
        load_time = time.time() - start_time
        loaded_model = llm
        loaded_model_name = model_name
        
        return llm, f"Model '{model_name}' loaded successfully in {load_time:.2f} seconds."
    except Exception as e:
        return None, f"Error loading model: {str(e)}"

def format_chat_prompt(history, system_prompt):
    """Format the chat history into a prompt for the model."""
    prompt = f"Instrukcja: {system_prompt}\n\n"
    
    for user_msg, bot_msg in history:
        prompt += f"Użytkownik: {user_msg}\n"
        if bot_msg:
            prompt += f"Asystent: {bot_msg}\n"
    
    # End with the assistant prefix so the model continues with its reply
    prompt += "Asystent: "
    return prompt

def chat_with_model(message, history, system_prompt, temperature, max_tokens, top_p, top_k, repeat_penalty):
    """Process a chat message and return the model's response."""
    global loaded_model
    
    if loaded_model is None:
        return "No model loaded. Please load a model first."
    
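    # Format the prompt from the earlier turns plus the current user message
    # (the caller passes history without the pending turn, so add it here)
    prompt = format_chat_prompt(list(history) + [(message, None)], system_prompt)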
    
    # Generate response
    try:
        start_time = time.time()
        output = loaded_model(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            repeat_penalty=repeat_penalty,
            stop=["\nUżytkownik:", "\nUser:"]  # Stop generating at new user message
        )
        
        generation_time = time.time() - start_time
        assistant_response = output["choices"][0]["text"]
        
        return assistant_response
    except Exception as e:
        return f"Error generating response: {str(e)}"

def create_web_ui():
    """Build the Gradio web interface and return it (the caller launches it)."""
    # Default system prompt
    default_system_prompt = (
        "Jesteś pomocnym, uprzejmym i dokładnym asystentem AI o imieniu PLLuM, "
        "stworzonym przez polskie Ministerstwo Cyfryzacji. "
        "Odpowiadasz na pytania użytkownika w języku polskim, chyba że zostaniesz poproszony o inny język. "
        "Twoje odpowiedzi są zwięzłe, merytoryczne i pomocne."
    )
    
    # Define interface
    with gr.Blocks(title="PLLuM Chat") as demo:
        gr.Markdown("# PLLuM Chat Interface")
        gr.Markdown("Talk with the PLLuM-8x7B-chat model in Polish or English.")
        
        with gr.Row():
            with gr.Column(scale=3):
                # Chat interface
                chatbot = gr.Chatbot(height=500, label="Conversation")
                msg = gr.Textbox(placeholder="Type your message here...", label="Your message")
                with gr.Row():
                    submit_btn = gr.Button("Send", variant="primary")
                    clear_btn = gr.Button("Clear Conversation")
                
                # System prompt
                system_prompt = gr.Textbox(
                    value=default_system_prompt,
                    label="System Prompt",
                    lines=3
                )
                
            with gr.Column(scale=1):
                # Model path
                model_path = gr.Textbox(
                    label="Model Path",
                    placeholder="Enter the full path to the model file"
                )
                
                with gr.Accordion("Generation Parameters", open=True):
                    temperature = gr.Slider(
                        minimum=0.0, maximum=2.0, value=0.7, step=0.05,
                        label="Temperature",
                        info="Higher values = more creative, lower values = more deterministic"
                    )
                    
                    max_tokens = gr.Slider(
                        minimum=64, maximum=2048, value=512, step=64,
                        label="Max Tokens",
                        info="Maximum length of generated response"
                    )
                    
                    top_p = gr.Slider(
                        minimum=0.0, maximum=1.0, value=0.95, step=0.05,
                        label="Top-p",
                        info="Nucleus sampling: consider tokens with top_p cumulative probability"
                    )
                    
                    top_k = gr.Slider(
                        minimum=1, maximum=100, value=50, step=1,
                        label="Top-k",
                        info="Consider top k most likely tokens at each step"
                    )
                    
                    repeat_penalty = gr.Slider(
                        minimum=1.0, maximum=2.0, value=1.1, step=0.05,
                        label="Repeat Penalty",
                        info="Higher values discourage repetition"
                    )
                
                # Threading
                n_threads = gr.Slider(
                    minimum=1, maximum=32, value=8, step=1,
                    label="Number of Threads",
                    info="Number of CPU threads to use"
                )
                
                # Context size
                n_ctx = gr.Slider(
                    minimum=512, maximum=4096, value=2048, step=512,
                    label="Context Size",
                    info="Maximum context window size"
                )
                
                # Model information
                model_info = gr.Textbox(label="Model Status", interactive=False)
        
        # Show an initial hint in the status box when the page loads
        demo.load(lambda: "Enter model path and click 'Load Model'", outputs=model_info)
        
        # Event handlers
        load_btn = gr.Button("Load Model")
        load_btn.click(
            lambda path, threads, ctx: load_model(path, threads, ctx)[1],
            inputs=[model_path, n_threads, n_ctx],
            outputs=model_info
        )
        
        # Chat submission events
        def respond(message, chat_history, system_prompt, temperature, max_tokens, top_p, top_k, repeat_penalty):
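            if not message.strip():
                # This function is a generator, so emit the unchanged history
                # with `yield`; a bare `return value` would send no update
                yield chat_history
                return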
            
            # Add user message to history
            chat_history.append((message, None))
            yield chat_history
            
            # Get model response
            response = chat_with_model(
                message, chat_history[:-1], system_prompt,
                temperature, max_tokens, top_p, top_k, repeat_penalty
            )
            
            # Update history with bot response
            chat_history[-1] = (message, response)
            yield chat_history
        
        submit_btn.click(
            respond,
            inputs=[msg, chatbot, system_prompt, temperature, max_tokens, top_p, top_k, repeat_penalty],
            outputs=chatbot,
            show_progress=True
        ).then(
            lambda: "",  # Clear the message box after sending, as msg.submit does
            outputs=msg
        )
        
        msg.submit(
            respond,
            inputs=[msg, chatbot, system_prompt, temperature, max_tokens, top_p, top_k, repeat_penalty],
            outputs=chatbot,
            show_progress=True
        ).then(
            lambda: "",  # Clear the message box after sending
            outputs=msg
        )
        
        # Clear conversation button
        clear_btn.click(
            lambda: [],
            outputs=chatbot
        )
    
    return demo

def main():
    parser = argparse.ArgumentParser(description="PLLuM-8x7B-chat Web UI")
    parser.add_argument("--model", "-m", type=str, help="Path to the model file")
    parser.add_argument("--threads", "-t", type=int, default=8, help="Number of threads to use")
    parser.add_argument("--ctx", "-c", type=int, default=2048, help="Context size")
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", "-p", type=int, default=7860, help="Port to bind to")
    parser.add_argument("--share", "-s", action="store_true", help="Create a public link")
    
    args = parser.parse_args()
    
    # Preload model if specified
    if args.model:
        print(f"Loading model from {args.model}...")
        _, status = load_model(args.model, args.threads, args.ctx)
        print(status)
    
    # Create and launch the web UI
    demo = create_web_ui()
    demo.launch(server_name=args.host, server_port=args.port, share=args.share)

if __name__ == "__main__":
    main()

## 8. Cleanup

Let's clean up any resources we may have used.

In [None]:
# Clean up resources
if loaded_model is not None:
    del loaded_model
    loaded_model = None
    loaded_model_name = None
    
import gc
gc.collect()

print("Resources released.")

In this notebook, we've demonstrated how to create a web interface for the PLLuM-8x7B-chat model using Gradio. We've implemented:

1. A chat interface with conversation history
2. A model selection dropdown to switch between different quantization levels
3. Adjustable generation parameters
4. A translation-specific interface
5. A standalone Python script for deployment

### Usage Instructions

To run the standalone script from the command line:

```bash
python web_ui.py --model path/to/model.gguf --threads 8 --ctx 2048 --share
```

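This launches a web server on port 7860 by default. Because the default host is `0.0.0.0`, the server is also reachable from other machines on your network; pass `--host 127.0.0.1` to restrict it to your own machine. The `--share` flag additionally creates a public Gradio link.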
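
To see which values apply when a flag is omitted, the argument parser can be exercised on its own. This is a sketch that mirrors the options defined in `main()`; `build_parser` is a hypothetical helper name, and no model is loaded or server started.

```python
import argparse


def build_parser():
    """Recreate the command-line options of the standalone script."""
    parser = argparse.ArgumentParser(description="PLLuM-8x7B-chat Web UI")
    parser.add_argument("--model", "-m", type=str, help="Path to the model file")
    parser.add_argument("--threads", "-t", type=int, default=8, help="Number of threads to use")
    parser.add_argument("--ctx", "-c", type=int, default=2048, help="Context size")
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind to")
    parser.add_argument("--port", "-p", type=int, default=7860, help="Port to bind to")
    parser.add_argument("--share", "-s", action="store_true", help="Create a public link")
    return parser


# Parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(["--model", "path/to/model.gguf", "--host", "127.0.0.1"])
print(args.model, args.threads, args.ctx, args.host, args.port, args.share)
```

Here `--host 127.0.0.1` overrides the default so the server would only accept local connections, and `--share` stays off because the flag was not passed.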