## Testing a quantized version of Poro (LumiOpen/Poro-34B) with llama-cpp-python

### Installation of llama-cpp-python

**WARNING:** Poro works well with llama-cpp-python==0.2.36 but not with 0.2.39, the latest version at the time of writing.

In [1]:
# macOS (Apple silicon, Metal backend)
# !CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python==0.2.36 --no-cache-dir

# Linux (NVIDIA GPU, cuBLAS backend)
# !CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python==0.2.36 --no-cache-dir

# Optional: extras for the OpenAI-compatible server
# !pip install 'llama-cpp-python[server]'

In [None]:
# If the cuBLAS build fails, llama-cpp-python can still be installed without GPU support (inference will be very slow)
# ! pip install llama-cpp-python==0.2.36
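
Because the warning above pins a specific release, it can help to confirm which version actually ended up in the environment before loading a model. The following is a minimal sketch using only the standard library; the names `EXPECTED_VERSION` and `installed_version` are illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

EXPECTED_VERSION = "0.2.36"  # the release this notebook reports as working


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("llama-cpp-python")
if found is None:
    print("llama-cpp-python is not installed")
elif found != EXPECTED_VERSION:
    print(f"Found llama-cpp-python {found}; this notebook was tested with {EXPECTED_VERSION}")
else:
    print(f"llama-cpp-python {found} matches the tested version")
```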

In [2]:
import os
from pathlib import Path
from llama_cpp import Llama



## Download gguf file

In [3]:
quantized_model_id = "hbacard/Poro-34B-GGUF" # You can use another model_id like one of TheBloke's
gguf_file = "Poro-34B.Q4_0.gguf"

# Uncomment the line below to download the gguf file
# ! wget -P ../models https://huggingface.co/{quantized_model_id}/resolve/main/{gguf_file}
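
If `wget` is not available, the same URL can be fetched from Python. The sketch below assumes the `resolve/main` URL layout used in the `wget` command above; `gguf_url` and `download_gguf` are hypothetical helper names. The download itself is left commented out because the file is very large.

```python
import os
import urllib.request


def gguf_url(repo_id: str, filename: str) -> str:
    """Build the direct-download URL used by the wget command above."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def download_gguf(repo_id: str, filename: str, target_dir: str = "../models") -> str:
    """Download the file into target_dir unless it is already there; return its path."""
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, filename)
    if not os.path.exists(target_path):
        urllib.request.urlretrieve(gguf_url(repo_id, filename), target_path)
    return target_path


print(gguf_url("hbacard/Poro-34B-GGUF", "Poro-34B.Q4_0.gguf"))
# To actually download (not run here):
# download_gguf("hbacard/Poro-34B-GGUF", "Poro-34B.Q4_0.gguf")
```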

In [15]:
GGUF_FILE_NAME = gguf_file
GGUFS_DIR = "../models"
MODEL_PATH = os.path.join(GGUFS_DIR, GGUF_FILE_NAME)
print(f"Chosen model: {MODEL_PATH}")
print(os.path.exists(MODEL_PATH))

Chosen model: ../models/Poro-34B.Q4_0.gguf
True
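
When several quantizations are kept side by side, listing what is in the models directory is a quick sanity check before picking one. This is only a sketch: `list_gguf_files` is an illustrative helper, and the `../models` path is taken from the cell above.

```python
from pathlib import Path
from typing import List, Tuple


def list_gguf_files(directory: str) -> List[Tuple[str, float]]:
    """Return (file name, size in GiB) for every .gguf file in `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size / 1024**3)
        for path in root.glob("*.gguf")
    )


for name, size_gib in list_gguf_files("../models"):
    print(f"{name}: {size_gib:.1f} GiB")
```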


In [None]:
LLM_INSTANCE = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,  # offload all layers to the GPU; with other values the GPU was not activated on Apple silicon
    seed=43,
    verbose=True,
    n_batch=512,
    n_ctx=800,
)


In [8]:
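def generate_response(prompt: str, llm: Llama = LLM_INSTANCE) -> str:
    """Run a single completion and return the generated text without surrounding whitespace."""
    response = llm(
        prompt=prompt,
        temperature=0.0,  # greedy decoding: always pick the most likely token
        top_p=0.95,
        top_k=2,
        repeat_penalty=1.1,
        max_tokens=200,
        seed=34,
        stop=["##"],  # stop before the model starts a new "###" section
    )
    text_response = response["choices"][0]["text"]
    return text_response.strip()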
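
The model was loaded with `n_ctx=800` and completions are capped at `max_tokens=200`, so a prompt only leaves room for a full-length answer if it stays under roughly 600 tokens. The sketch below spells out that arithmetic; `fits_in_context` is a hypothetical helper, and with a loaded model the prompt length could be measured with `len(llm.tokenize(prompt.encode("utf-8")))`.

```python
N_CTX = 800       # context window passed to Llama(...) above
MAX_TOKENS = 200  # completion limit used in generate_response


def fits_in_context(prompt_tokens: int, max_tokens: int = MAX_TOKENS, n_ctx: int = N_CTX) -> bool:
    """True if the prompt plus the longest allowed completion fits in the context window."""
    return prompt_tokens + max_tokens <= n_ctx


print(fits_in_context(176))  # True: 176 + 200 = 376 <= 800
print(fits_in_context(650))  # False: 650 + 200 = 850 > 800
```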


## Prompt template

In [9]:


DEFAULT_SYSTEM_PROMPT = "Perform the following instruction to the best of your ability."

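# Template with a system prompt (kept for reference; overridden by the next assignment)
prompt_template = """{system_prompt}\n ### Instruction: {instruction}\n ### Response:\n"""
# Template actually used below: it has no system prompt, so `system_prompt` is ignored by .format()
prompt_template = """### Instruction: {instruction}\n ### Response:\n"""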
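
To see exactly what the model receives, it helps to print the formatted prompt. The sketch below reuses the second template from the cell above under the local name `template`; it also shows that `str.format` silently ignores keyword arguments the template does not reference, which is why passing `system_prompt` has no effect.

```python
template = """### Instruction: {instruction}\n ### Response:\n"""

# str.format ignores keyword arguments that the template does not use,
# so the system prompt passed here never reaches the model.
prompt = template.format(
    system_prompt="Perform the following instruction to the best of your ability.",
    instruction="Quelle est la capitale de la France?",
)
print(repr(prompt))
```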

## Question-Answering

In [10]:
def question_answering(instruction: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    prompt = prompt_template.format(system_prompt=system_prompt, instruction=instruction)
    return generate_response(prompt=prompt)

In [11]:
question_answering(instruction="Quelle est la capitale de la France?")


llama_print_timings:        load time =    4493.26 ms
llama_print_timings:      sample time =       2.14 ms /     6 runs   (    0.36 ms per token,  2808.99 tokens per second)
llama_print_timings: prompt eval time =    4491.81 ms /    18 tokens (  249.55 ms per token,     4.01 tokens per second)
llama_print_timings:        eval time =     612.41 ms /     5 runs   (  122.48 ms per token,     8.16 tokens per second)
llama_print_timings:       total time =    5139.97 ms /    23 tokens


'- Paris\n\n---'
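
The raw answer carries a leading list marker and a trailing `---` separator. One option is a small post-processing step; the helper below (`clean_answer`, a hypothetical name) is a sketch of that idea and only handles the two patterns seen in this output. Adding `"---"` to the `stop` list would be another way to drop the separator.

```python
def clean_answer(text: str) -> str:
    """Drop a leading list marker and a trailing '---' separator from a completion."""
    cleaned = text.strip()
    if cleaned.endswith("---"):
        cleaned = cleaned[:-3].rstrip()
    if cleaned.startswith("- "):
        cleaned = cleaned[2:]
    return cleaned.strip()


print(clean_answer("- Paris\n\n---"))  # Paris
```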

## Summarization

In [12]:
def summarize(input_text: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    summarization_instruction = f"""Give me a summary of the following text: ```{input_text}```"""
    prompt = prompt_template.format(system_prompt=system_prompt, instruction=summarization_instruction)
    return generate_response(prompt=prompt)

In [13]:
summarize(input_text=""" 
Mr. Jones, of the Manor Farm, had locked the hen-houses for the night, but was too drunk to remember to shut the popholes. With the ring of light from his lantern dancing from side to side, he lurched across the yard, kicked off his boots at the back door, drew himself a last glass of beer from the barrel in the scullery, and made his way up to bed, where Mrs. Jones was already snoring.
As soon as the light in the bedroom went out there was a stirring and a fluttering all through the farm buildings. Word had gone round during the day that old Major, the prize Middle White boar, had had a strange dream on the previous night and wished to communicate it to the other animals.
""")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4493.26 ms
llama_print_timings:      sample time =      40.98 ms /   108 runs   (    0.38 ms per token,  2635.62 tokens per second)
llama_print_timings: prompt eval time =    3751.09 ms /   176 tokens (   21.31 ms per token,    46.92 tokens per second)
llama_print_timings:        eval time =   13322.15 ms /   107 runs   (  124.51 ms per token,     8.03 tokens per second)
llama_print_timings:       total time =   17689.70 ms /   283 tokens


'``` \nThe text is about Mr Jones who was drunk at home when he locked his hen houses for the night but forgot to close popholes which are holes in doors or windows that allow air, light, sound etc., into a building.\nHe then went upstairs and fell asleep while Mrs.Jones snored loudly. \nAt midnight Major (a boar) had strange dream about an animal who was going to be killed by Mr Jones the next day so he wanted other animals on his farm to know this before it happened.\n\n```'
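
The throughput figures in the `llama_print_timings` lines follow directly from the reported time and count: 107 runs in 13322.15 ms is about 8.03 tokens per second. The sketch below recomputes that from a log line; the regular expression is an assumption based on the lines shown in this notebook, not a documented log format.

```python
import re
from typing import Optional

TIMING_LINE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def tokens_per_second(line: str) -> Optional[float]:
    """Recompute throughput from one timing line; None if the line has no count."""
    match = TIMING_LINE.search(line)
    if match is None or match.group("count") is None:
        return None
    seconds = float(match.group("ms")) / 1000.0
    return int(match.group("count")) / seconds


line = (
    "llama_print_timings:        eval time =   13322.15 ms /   107 runs"
    "   (  124.51 ms per token,     8.03 tokens per second)"
)
print(round(tokens_per_second(line), 2))  # 8.03
```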

## Coding

In [14]:
instruction = 'Implement a python function that gives the list of all prime numbers less than a given integer.'
print(question_answering(instruction=instruction))  # Not a good answer: the generated code is incorrect

Llama.generate: prefix-match hit


```
  def primes(n):
    if n < 2:
      return []

    sieve = [True] * (n+1)
    for i in range(2, int((n-1).__ceil__())):
        if not sieve[i]:
            continue

        j = 2**i - 1 # the square root of i; this is a prime number itself.
        while True:
          k = 2*j + 1

          if k > n:
              break

          else:
             sieve[k] = False

    return [k for (k, v) in enumerate(sieve) if v]
```



llama_print_timings:        load time =    4493.26 ms
llama_print_timings:      sample time =      43.53 ms /   121 runs   (    0.36 ms per token,  2779.44 tokens per second)
llama_print_timings: prompt eval time =     589.40 ms /    22 tokens (   26.79 ms per token,    37.33 tokens per second)
llama_print_timings:        eval time =   14896.59 ms /   120 runs   (  124.14 ms per token,     8.06 tokens per second)
llama_print_timings:       total time =   16127.05 ms /   142 tokens
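
For comparison, here is a hand-written reference, not model output. The generated function above does not work: `2**i - 1` is not a square root, and the inner `while True` loop never changes `j`, so for `n >= 7` it keeps writing to the same index and never terminates. A minimal sieve of Eratosthenes, with `primes_below` as an illustrative name, could look like this:

```python
from typing import List


def primes_below(n: int) -> List[int]:
    """Return all primes strictly less than n using the sieve of Eratosthenes."""
    if n < 3:
        return []
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    i = 2
    while i * i < n:
        if is_prime[i]:
            # Multiples below i*i were already crossed out by smaller primes.
            for multiple in range(i * i, n, i):
                is_prime[multiple] = False
        i += 1
    return [k for k, flag in enumerate(is_prime) if flag]


print(primes_below(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```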


## Chat

In [23]:
prompt = """The following is a conversation between a helpful AI assistant and a human. The AI assistant answers politely and truthfully to the best of his ability. The answers of the AI assistant are always clear and concise.
###Human:
What is the capital city of Canada
###AI:
The capital city of Canada is Ottawa.
###Human:
What can i visit there ?
###AI:
You could go see Parliament Hill, Rideau Canal or Byward Market.
###Human: 
Anything else ?
###AI: 
There are many other things to do in the area such as visiting museums and galleries.
###Human: 
How is the food like there ?
###AI: 

"""

In [24]:
response = LLM_INSTANCE(
    prompt=prompt,
    temperature=0.0,
    top_p=0.95,
    top_k=1,
    repeat_penalty=1.1,
    max_tokens=100,
    seed=34,
    stop=["##"],
)
text_response = response["choices"][0]["text"]
print(text_response)

Llama.generate: prefix-match hit


The cuisine of Ottawa includes French, Italian, Chinese, Indian, Greek, Middle Eastern, Japanese, Korean, Vietnamese, Thai, Mexican, American, Canadian, Filipino dishes.
There are many restaurants in Ottawa that serve a variety of cuisines. Some popular places to eat include The Keg Steakhouse + Bar and Bier Markt.





llama_print_timings:        load time =     930.50 ms
llama_print_timings:      sample time =      27.21 ms /    71 runs   (    0.38 ms per token,  2609.33 tokens per second)
llama_print_timings: prompt eval time =    9886.56 ms /   149 tokens (   66.35 ms per token,    15.07 tokens per second)
llama_print_timings:        eval time =    8642.33 ms /    70 runs   (  123.46 ms per token,     8.10 tokens per second)
llama_print_timings:       total time =   18914.82 ms /   219 tokens
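
The chat prompt above was written by hand. For longer conversations it may be easier to assemble it from a list of turns. The sketch below follows the same `###Human:` / `###AI:` markers, which is also why `stop=["##"]` ends generation before the model invents the next human turn. `build_chat_prompt` is a hypothetical helper, and its output differs slightly from the hand-written prompt in whitespace.

```python
from typing import List, Tuple

SYSTEM_LINE = (
    "The following is a conversation between a helpful AI assistant and a human. "
    "The AI assistant answers politely and truthfully to the best of his ability. "
    "The answers of the AI assistant are always clear and concise."
)


def build_chat_prompt(system: str, turns: List[Tuple[str, str]], next_question: str) -> str:
    """Join past (question, answer) turns and end with an open AI turn for the model to fill."""
    lines = [system]
    for question, answer in turns:
        lines += ["###Human:", question, "###AI:", answer]
    lines += ["###Human:", next_question, "###AI:", ""]
    return "\n".join(lines)


history = [("What is the capital city of Canada", "The capital city of Canada is Ottawa.")]
chat_prompt = build_chat_prompt(SYSTEM_LINE, history, "What can i visit there ?")
print(chat_prompt)
```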


In [25]:
response

{'id': 'cmpl-0d65f280-cc4c-4800-8c17-f7eeeba6b837',
 'object': 'text_completion',
 'created': 1707817555,
 'model': '../models/Poro-34b.Q4_0.gguf',
 'choices': [{'text': 'The cuisine of Ottawa includes French, Italian, Chinese, Indian, Greek, Middle Eastern, Japanese, Korean, Vietnamese, Thai, Mexican, American, Canadian, Filipino dishes.\nThere are many restaurants in Ottawa that serve a variety of cuisines. Some popular places to eat include The Keg Steakhouse + Bar and Bier Markt.\n\n',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 150, 'completion_tokens': 71, 'total_tokens': 221}}
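
The response dictionary carries more than the text. As a sketch, the snippet below reads the fields shown above from a copy of that output (with the text abbreviated for brevity). A `finish_reason` of `"stop"` means generation ended at a stop sequence or end of text, while `"length"` would indicate that `max_tokens` was reached.

```python
# Abbreviated copy of the response printed above; the text field is shortened here.
response = {
    "choices": [
        {
            "text": "The cuisine of Ottawa includes French, Italian, Chinese ...",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 150, "completion_tokens": 71, "total_tokens": 221},
}

choice = response["choices"][0]
usage = response["usage"]
hit_token_limit = choice["finish_reason"] == "length"

print(f"finish_reason: {choice['finish_reason']} (truncated: {hit_token_limit})")
print(f"tokens: {usage['prompt_tokens']} prompt + {usage['completion_tokens']} completion")
```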