# TL;DR
In this notebook we'll see how to recreate an application like ChatGPT in a local environment, without relying on external APIs or hosted models. The main objectives are:
- run an LLM locally
- create a user interface

# Setup
Before starting with the project, it is important to ALWAYS create a virtual environment. Follow the instructions in the README, then proceed with the notebook.

# 🦙 llamacpp (python)
The Python package provides simple bindings for the [llama.cpp](https://github.com/ggerganov/llama.cpp) library, giving access to its C API through a Python interface. These bindings offer a high-level interface to the library, so we don’t have to worry about the low-level details of C/C++. The repo is [here](https://github.com/abetlen/llama-cpp-python).

# 🤗 HuggingFace

Hugging Face is an established platform in the deep learning world and has defined many of its standard frameworks.
Among the various services it offers, it is also a **repository of open source models** trained and released by contributors from all over the world.

For this use case we will use the latest model of the Llama family, released by Meta on April 18, 2024: 🦙 [Llama 3](https://ai.meta.com/blog/meta-llama-3/), which with only 8B parameters shows notable capabilities even compared with models 10x bigger (see [here](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)).

We could quantize the original model on our own, but others already did the hard work, and we can just download the checkpoints of the pre-quantized model. More specifically, we'll take it from this repo [bartowski/Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF).



In [2]:
from huggingface_hub import hf_hub_download

# download the model from the hub
model_name_or_path = "bartowski/Meta-Llama-3-8B-Instruct-GGUF"
model_basename = "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf" # the model is in GGUF format
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

<p align="left">
  <img src="repo.png" alt="Descriptive Alt Text" width="400">
</p>

The `Q4` in the file name indicates the quantization scheme of the checkpoint (roughly 4 bits per weight). For more details see this [source](https://huggingface.co/docs/hub/en/gguf); we won't go deeper here. Just keep in mind that there is a trade-off between quantization precision and memory requirements: lower-precision files need less memory but lose some quality. In many cases, a Q4 scheme is a good compromise.

Other quantization methods are available; you can read about them in the model card.
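
To make the trade-off concrete, here is a back-of-the-envelope sketch of how the size of the weights scales with the number of bits per weight. The helper `estimate_weights_gb` and the flat bit widths are illustrative assumptions: real GGUF schemes mix several precisions, so actual file sizes differ, and the estimate ignores the KV cache and other runtime overhead.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # "8B" model, rounded (assumption for illustration)

# Compare a few flat bit widths; these are not the exact GGUF schemes.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{estimate_weights_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16 to 4 bits per weight cuts the weights to a quarter of their size, which is what makes an 8B model practical on a laptop.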

# First Generation

In [3]:
from llama_cpp import Llama

# init the model
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=10, # CPU cores
)

llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/navya/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader:

In [4]:
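# Italian prompt, roughly: "Write a Python program that counts to ten"
prompt = "Scivi un programma in python per contare fino a dieci"
# System line, roughly: "You are a virtual assistant expert in Python."
prompt_template=f'''SYSTEM: Se un assistente virtuale esperto in python.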

USER: {prompt}

ASSISTANT:
'''

In [5]:
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=100,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response["choices"][0]["text"])


llama_print_timings:        load time =   10547.47 ms
llama_print_timings:      sample time =      37.55 ms /    20 runs   (    1.88 ms per token,   532.67 tokens per second)
llama_print_timings: prompt eval time =   10547.37 ms /    34 tokens (  310.22 ms per token,     3.22 tokens per second)
llama_print_timings:        eval time =    8433.35 ms /    19 runs   (  443.86 ms per token,     2.25 tokens per second)
llama_print_timings:       total time =   19291.24 ms /    53 tokens


SYSTEM: Se un assistente virtuale esperto in python.

USER: Scivi un programma in python per contare fino a dieci

ASSISTANT:
```
for i in range(1, 11):
    print(i)
```
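
Because the call uses `echo=True`, the returned text starts with the prompt itself, which is why the SYSTEM and USER lines appear in the output. A minimal sketch of recovering only the completion follows. `completion_only` is a hypothetical helper, and `fake_response` is a hand-written stand-in that mimics only the `choices[0]["text"]` field read above, not a real model output.

```python
def completion_only(response: dict, prompt: str) -> str:
    """Return the generated text without the echoed prompt."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        text = text[len(prompt):]  # drop the echoed prompt
    return text.strip()


# Hand-written stand-in for a response produced with echo=True
prompt = "USER: count to three\n\nASSISTANT:\n"
fake_response = {"choices": [{"text": prompt + "1, 2, 3"}]}

print(completion_only(fake_response, prompt))
```

Leaving out `echo=True` in the call should make this step unnecessary; the helper is only useful when the echoed prompt is wanted for debugging.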




In [6]:
for i in range(1, 11):
    print(i)

1
2
3
4
5
6
7
8
9
10


# Streaming
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif" width="800" height="300" />

**Always think about user experience!**

Token streaming is the mode in which the server returns tokens one by one as the model generates them. This lets us show the answer progressively instead of making the user wait for the whole generation. Streaming is essential to the end-user experience because it reduces perceived latency, one of the most critical aspects of a smooth experience.
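
The effect on latency can be sketched without a model. In the snippet below, `fake_token_stream` is a hypothetical stand-in for a model call made with `stream=True`: it yields chunks shaped like the ones read elsewhere in this notebook (`chunk["choices"][0]["text"]`) and sleeps briefly to imitate generation time. The token list and the delay are made up for illustration.

```python
import time


def fake_token_stream(tokens, delay=0.01):
    """Stand-in for a streaming model call: yields one chunk per token."""
    for tok in tokens:
        time.sleep(delay)  # pretend the model needs time for each token
        yield {"choices": [{"text": tok}]}


def consume(stream):
    """Collect the stream, recording when the first token and the full text arrive."""
    start = time.perf_counter()
    first_token_at = None
    text = ""
    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        text += chunk["choices"][0]["text"]
    total = time.perf_counter() - start
    return text, first_token_at, total


text, first, total = consume(fake_token_stream(["for", " i", " in", " range", "(10)"]))
print(text)
print(f"first token after {first:.3f}s, full answer after {total:.3f}s")
```

With streaming, the user starts reading after the first interval rather than after the last one, and the gap grows with the length of the answer.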


# User Interface + Streaming

**Gradio** is an open-source Python package that allows you to quickly build a demo or web application for your machine learning model, API, or any arbitrary Python function. You can then share a link to your demo or web application in just a few seconds using Gradio’s built-in sharing features. Check out the library on [GitHub](https://github.com/gradio-app/gradio-UI) and see the [getting started](https://www.gradio.app/guides/quickstart) page for more demos.

Thanks to the **ChatInterface** class we can create a web-based demo around a chatbot model in a few lines of code. Only one parameter is required: `fn`, a function that produces the chatbot's response from the user input and the chat history. The class also takes care of displaying the tokens as they are generated in the stream.
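
As a rough sketch of what such an `fn` looks like, the generator below takes the same two arguments and yields the reply accumulated so far, one word at a time. `slow_echo` is a hypothetical toy function with no model behind it; it only illustrates the shape of the function.

```python
def slow_echo(message, history):
    """Generator fn: yields the reply so far, growing by one word each time."""
    partial = ""
    for word in f"You said: {message}".split():
        partial = f"{partial} {word}".strip()
        yield partial  # the full text so far, not just the new word


# Each yielded value is what the chat window would display at that moment.
for shown in slow_echo("hello world", history=[]):
    print(shown)
```

Wiring it into the UI would then be a matter of `gr.ChatInterface(slow_echo).launch()`. Yielding the whole partial reply, rather than only the new piece, is what lets the interface simply redraw the message at each step.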

## Personal Assistant - Motivator for Programmers

In [None]:
import os
import gradio as gr
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# load the downloaded model (the REPO_ID and MODEL_FILE env vars override the defaults)
llm = Llama(
    model_path=hf_hub_download(
        repo_id=os.environ.get("REPO_ID", "bartowski/Meta-Llama-3-8B-Instruct-GGUF"),
        filename=os.environ.get("MODEL_FILE", "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"),
    ),
)

# setting up prompt
# Italian system message, roughly: "You are a virtual assistant for students. Your job is to
# motivate them to get better at writing code and to contribute to open source projects.
# You must be very persuasive. Answer with short but effective sentences."
system_message = """
Sei un assistente virtuale per studenti. Il tuo compito è motivarli a migliorare nello scrivere codice e contribuire al codice di progetti open source. Devi essere molto persuasivo. Rispondi con frasi brevi ma efficaci. 
"""

# fn to pass to the gradio class
def generate_text(message, history):
    temp = ""
    # Llama 3 chat format; more on this at https://huggingface.co/docs/transformers/main/en/chat_templating
    input_prompt = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message.strip()}<|eot_id|>"
    # replay the previous turns; each interaction is a [user, assistant] pair
    for interaction in history:
        input_prompt = input_prompt + '<|start_header_id|>user<|end_header_id|>\n\n' + str(interaction[0]) + "<|eot_id|>" + '<|start_header_id|>assistant<|end_header_id|>\n\n' + str(interaction[1]) + "<|eot_id|>"

    # add the new user message, then open the assistant turn so the model answers it
    input_prompt = input_prompt + '<|start_header_id|>user<|end_header_id|>\n\n' + str(message) + "<|eot_id|>" + '<|start_header_id|>assistant<|end_header_id|>\n\n'
    output = llm(
        input_prompt,
        temperature=0.15,
        top_p=0.1,
        top_k=40, 
        repeat_penalty=1.1,
        max_tokens=1024,
        stop=[
            '<|eot_id|>'
        ],
        stream=True,
    )
    # yield the text accumulated so far, so the UI updates as tokens arrive
    for out in output:
        temp += out["choices"][0]["text"]
        yield temp

# init UI (retry_btn, undo_btn and clear_btn are Gradio 4.x arguments)
demo = gr.ChatInterface(
    generate_text,
    title="Demo Roma 3 with Llama 3",
    cache_examples=False,
    retry_btn=None,
    undo_btn="Undo",
    clear_btn="Clear",
)

demo.launch()

llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from /home/navya/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader:

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.





llama_print_timings:        load time =   17252.75 ms
llama_print_timings:      sample time =     176.35 ms /    91 runs   (    1.94 ms per token,   516.03 tokens per second)
llama_print_timings: prompt eval time =   17252.62 ms /    99 tokens (  174.27 ms per token,     5.74 tokens per second)
llama_print_timings:        eval time =   25316.92 ms /    90 runs   (  281.30 ms per token,     3.55 tokens per second)
llama_print_timings:       total time =   44018.44 ms /   189 tokens
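
The prompt assembly inside `generate_text` can also be factored into a small pure function, which makes it easy to inspect and test without loading the model. This is a sketch under assumptions: `build_llama3_prompt` and its inner `turn` helper are hypothetical names, the history is taken to be a list of (user, assistant) pairs as in the loop above, and the blank line after each header follows the Llama 3 chat format as we understand it from Meta's model card.

```python
def build_llama3_prompt(system_message, history, message):
    """Assemble a Llama 3 style chat prompt from a system message, past turns and a new message."""

    def turn(role, content):
        # One complete turn: role header, blank line, content, end-of-turn token
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    prompt = "<|begin_of_text|>" + turn("system", system_message.strip())
    for user_msg, assistant_msg in history:
        prompt += turn("user", user_msg) + turn("assistant", assistant_msg)
    prompt += turn("user", message)
    # Open the assistant turn so the model continues with its reply
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = build_llama3_prompt("Be brief.", [("hi", "hello")], "bye")
print(example)
```

Ending the prompt with an open assistant header is the important detail: it tells the model that the next tokens belong to its own reply, and the `<|eot_id|>` stop token then ends the generation at the close of that turn.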


# Next steps
There are many more possibilities and applications for LLMs that you can start experimenting with immediately. Here are some examples:
- multimodality
- personality change
- multi agent
- function calling
- document analysis
