# llama-cpp-python



In [1]:
# Reference: https://python.langchain.com/docs/integrations/llms/llamacpp
# Download the GGUF model from https://huggingface.co/TheBloke/OpenBuddy-Llama2-13B-v11.1-GGUF


The library works the same way on a CPU, but inference can take about three times longer than on a GPU.

To run on the CPU only, install the library with the following lines.
```
# CPU llama-cpp-python
!pip install llama-cpp-python
```

In [3]:
from llama_cpp import Llama

# Path to the model file on your local machine
model_path = r'/Users/kpathak/Downloads/openbuddy-llama2-13b-v11.1.Q4_0.gguf'
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,  # number of CPU threads to use
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /Users/kpathak/Downloads/openbuddy-llama2-13b-v11.1.Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  5120, 37632,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q4_0 
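
The thread count above is set by hand. As a minimal sketch, assuming you would rather derive it from the machine, the value could be computed as follows. `pick_n_threads` is a hypothetical helper, not part of llama-cpp-python, and leaving one CPU free is only a rule of thumb.

```python
import os

def pick_n_threads(reserve=1):
    """Return a thread count that leaves `reserve` logical CPUs free."""
    total = os.cpu_count() or 1   # os.cpu_count() can return None
    return max(1, total - reserve)

n_threads = pick_n_threads()
print(n_threads)
```

The result can then be passed as `n_threads=n_threads` when creating the `Llama` object.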

We will use this prompt.

In [7]:
prompt = "Write a linear regression in python"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
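
Since the same SYSTEM / USER / ASSISTANT layout is reused for every question in this notebook, it could be wrapped in a small function. This is only a sketch: `build_prompt` and `SYSTEM_MESSAGE` are names introduced here for illustration, and the layout is copied from the cell above rather than from any official prompt format.

```python
SYSTEM_MESSAGE = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully."
)

def build_prompt(question, system=SYSTEM_MESSAGE):
    """Fill the SYSTEM / USER / ASSISTANT layout used in this notebook."""
    return f"SYSTEM: {system}\n\nUSER: {question}\n\nASSISTANT:\n"

prompt_template = build_prompt("Write a linear regression in python")
print(prompt_template)
```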

Generating response

If you use only the CPU, generating the response can take a long time. Lowering max_tokens shortens the response and therefore the wait.

In [8]:
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    echo=True
    )

print(response["choices"][0]["text"])

Llama.generate: prefix-match hit


SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a linear regression in python

ASSISTANT:
```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the Boston Housing dataset
boston = datasets.load_boston()
X, y = boston['data'], boston['target']

# Create a linear regression model and fit it to the data
lr = LinearRegression().fit(X, y)
```
This code loads the Boston Housing dataset from scikit-learn's built-in datasets. It then creates a `LinearRegression` object and fits it to the training data using the `fit()` method. The resulting model can be used for making predictions on new data.




llama_print_timings:        load time = 28156.55 ms
llama_print_timings:      sample time =   150.74 ms /   151 runs   (    1.00 ms per token,  1001.73 tokens per second)
llama_print_timings: prompt eval time = 14453.07 ms /    13 tokens ( 1111.77 ms per token,     0.90 tokens per second)
llama_print_timings:        eval time = 165528.64 ms /   150 runs   ( 1103.52 ms per token,     0.91 tokens per second)
llama_print_timings:       total time = 180524.76 ms
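
Because the call above uses `echo=True`, the returned text starts with the prompt itself, as the output shows. A minimal sketch of how the echoed prompt could be removed afterwards is given below. `completion_only` is a hypothetical helper, and the dictionary is a stand-in that only mimics the `response["choices"][0]["text"]` shape used in the cell above.

```python
def completion_only(response, prompt):
    """Return the generated text without the echoed prompt, if present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        return text[len(prompt):]
    return text

# Stand-in for the dictionary returned by the model call.
fake_prompt = "USER: hi\n\nASSISTANT:\n"
fake_response = {"choices": [{"text": fake_prompt + "Hello!"}]}
print(completion_only(fake_response, fake_prompt))
```

Setting `echo=False` in the call avoids the echo in the first place.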


In [17]:
prompt = "Explain linear regression in simple words."
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    echo=True
    )

print(response["choices"][0]["text"])

Llama.generate: prefix-match hit


SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Explain linear regression in simple words.

ASSISTANT:
Linear regression is a statistical method used to analyze the relationship between two variables - one independent variable (also known as predictor or explanatory) and one dependent variable (also known as response or outcome). The goal of linear regression is to find an equation that best fits this relationship, so we can use it to make predictions about future values of the dependent variable based on changes in the independent variable.

In simple words, imagine you have a set of data where each point represents one observation (or measurement) of both variables - let's say X and Y. Linear regression helps us find an equation that describes how well these two variables are related to each other. This equation is called a line of best fit because it fits the data as closely as possible, like drawing a straight line through all the poin


llama_print_timings:        load time = 28156.55 ms
llama_print_timings:      sample time =   170.25 ms /   168 runs   (    1.01 ms per token,   986.77 tokens per second)
llama_print_timings: prompt eval time = 16273.91 ms /    15 tokens ( 1084.93 ms per token,     0.92 tokens per second)
llama_print_timings:        eval time = 185780.11 ms /   167 runs   ( 1112.46 ms per token,     0.90 tokens per second)
llama_print_timings:       total time = 202669.30 ms
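
The `llama_print_timings` lines can be turned into a rough time estimate for a given max_tokens budget. The sketch below parses one eval line copied from the log above. The regular expression is an assumption based on the log format shown here and may not match other llama.cpp versions.

```python
import re

# Pattern written against the log lines shown in this notebook.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s*(?P<ms>[\d.]+) ms"
    r"(?: /\s*(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timing(line):
    """Extract the timing name, milliseconds and token count from one log line."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    count = match.group("count")
    return {
        "name": match.group("name").strip(),
        "ms": float(match.group("ms")),
        "count": int(count) if count else None,
    }

line = ("llama_print_timings:        eval time = 185780.11 ms /   167 runs   "
        "( 1112.46 ms per token,     0.90 tokens per second)")
timing = parse_timing(line)
ms_per_token = timing["ms"] / timing["count"]
estimated_seconds = 256 * ms_per_token / 1000.0  # upper bound for max_tokens=256

print(f"{ms_per_token:.2f} ms per token")
print(f"about {estimated_seconds:.0f} s if all 256 tokens are generated")
```

At roughly 1.1 seconds per token on this machine, halving max_tokens roughly halves the worst-case generation time.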


# Inference with langchain




In [18]:
#!pip install --upgrade langchain

We could reuse the model loaded earlier. For illustration, however, we will load it again through the langchain library. Because of VRAM limitations, you need to delete the previous model from memory before running these cells.
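
One way to delete the previous model is to drop the variable that refers to it and then run the garbage collector. The sketch below is an assumption about how this could be done: `release_model` is a hypothetical helper, and memory is only released if no other reference to the model object remains.

```python
import gc

def release_model(namespace, name="lcpp_llm"):
    """Remove `name` from `namespace` and ask the garbage collector to run.

    Returns True if a reference was removed, False if the name was not defined.
    """
    found = name in namespace
    namespace.pop(name, None)   # drop the reference held under that name
    gc.collect()                # collect anything that is now unreachable
    return found

# Demonstration with a stand-in object; in the notebook the call would be
# release_model(globals()) before creating the LlamaCpp instance.
demo_namespace = {"lcpp_llm": object()}
print(release_model(demo_namespace))
```

In a notebook cell, `del lcpp_llm` followed by `gc.collect()` has the same effect.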


Prompt for the model.

In [37]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# from huggingface_hub import hf_hub_download

template = """USER: {question}
ASSISTANT: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
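
To see what text the model will actually receive, the template can be filled in by hand. The sketch below uses plain `str.format`, on the assumption that `PromptTemplate` substitutes `{question}` in the same way for a simple template like this one. It does not import langchain.

```python
template = """USER: {question}
ASSISTANT: Let's work this out in a step by step way to be sure we have the right answer."""

# Substitute the placeholder the same way the chain would for one input variable.
rendered = template.format(question="What year Tesla and SpaceX founder was born?")
print(rendered)
```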

Stream tokens

In [14]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# verbose=True must be set on the model for output to be passed to the callback manager
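
Token-wise streaming means the handler is called once per generated token instead of once per finished response. The class below is an illustrative stand-in written in plain Python, not the langchain implementation. It assumes the handler exposes an `on_llm_new_token` method, and the token list is made up for the demonstration.

```python
import sys

class PrintTokenHandler:
    """Illustrative stand-in: writes each token out as soon as it arrives."""

    def __init__(self, stream=sys.stdout):
        self.stream = stream
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        self.tokens.append(token)   # keep a copy for later inspection
        self.stream.write(token)    # show the token immediately
        self.stream.flush()

handler = PrintTokenHandler()
for token in ["Linear", " regression", " fits", " a", " line", "."]:
    handler.on_llm_new_token(token)
```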

Load the model

In [15]:
# n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=model_path,
    temperature=0.75,
    max_tokens=2000,
    top_p=1,
    callback_manager=callback_manager,
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /Users/kpathak/Downloads/openbuddy-llama2-13b-v11.1.Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  5120, 37632,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_0     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_0     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q4_0 

In [29]:
prompt = """
Question: Which country won the highest number of gold medals in the last olympics.
"""
llm(prompt)

Llama.generate: prefix-match hit


Answer: The United States won the most gold medals (109) at the Tokyo Olympics held in 2021.



llama_print_timings:        load time =  4124.97 ms
llama_print_timings:      sample time =    30.02 ms /    30 runs   (    1.00 ms per token,   999.27 tokens per second)
llama_print_timings: prompt eval time =  7128.42 ms /    16 tokens (  445.53 ms per token,     2.24 tokens per second)
llama_print_timings:        eval time = 13670.71 ms /    30 runs   (  455.69 ms per token,     2.19 tokens per second)
llama_print_timings:       total time = 20927.39 ms


'Answer: The United States won the most gold medals (109) at the Tokyo Olympics held in 2021.\n'

In [33]:
question = "What year Tesla and SpaceX founder was born?"
llm(question)


Llama.generate: prefix-match hit



Elon Musk, the billionaire entrepreneur behind both Tesla and SpaceX, was born on June 28, 1971.



llama_print_timings:        load time =  4124.97 ms
llama_print_timings:      sample time =    37.45 ms /    37 runs   (    1.01 ms per token,   988.04 tokens per second)
llama_print_timings: prompt eval time =  5351.48 ms /    12 tokens (  445.96 ms per token,     2.24 tokens per second)
llama_print_timings:        eval time = 16325.03 ms /    36 runs   (  453.47 ms per token,     2.21 tokens per second)
llama_print_timings:       total time = 21827.01 ms


'\nElon Musk, the billionaire entrepreneur behind both Tesla and SpaceX, was born on June 28, 1971.\n'

In [38]:
template = """USER: {question}
ASSISTANT: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What year Tesla and SpaceX founder was born?"
llm_chain.run(question)


Llama.generate: prefix-match hit


 

Step 1: We know that Elon Musk is the co-founder of both Tesla and SpaceX, so let's start with his birth year. 

Step 2: According to public records, Elon Musk was born on June 28, 1971. 

Step 3: So, the answer is that Elon Musk, co-founder of Tesla and SpaceX, was born in the year 1971.



llama_print_timings:        load time =  4124.97 ms
llama_print_timings:      sample time =   112.72 ms /   112 runs   (    1.01 ms per token,   993.59 tokens per second)
llama_print_timings: prompt eval time = 17772.90 ms /    40 tokens (  444.32 ms per token,     2.25 tokens per second)
llama_print_timings:        eval time = 51432.64 ms /   112 runs   (  459.22 ms per token,     2.18 tokens per second)
llama_print_timings:       total time = 69670.52 ms


" \n\nStep 1: We know that Elon Musk is the co-founder of both Tesla and SpaceX, so let's start with his birth year. \n\nStep 2: According to public records, Elon Musk was born on June 28, 1971. \n\nStep 3: So, the answer is that Elon Musk, co-founder of Tesla and SpaceX, was born in the year 1971.\n"

In [40]:
prompt = """
Question: Write a short and inspiring speech for college graduation.
"""
llm(prompt)



Llama.generate: prefix-match hit



Dear Graduates,

As you stand here today in your caps and gowns, I want to remind you of something that is truly remarkable about your achievement. You're not just graduating from college, but from the process of becoming who you are. The challenges you faced, the sacrifices you made, and the knowledge you gained will define and shape you as individuals for the rest of your lives.

Congratulations on this remarkable milestone in your lives!

With great joy, I congratulate each and every one of you on your graduation. It's a moment to celebrate your hard work, dedication, and achievement. You have worked tirelessly for years to reach this point, and I am proud to see you standing here today as accomplished individuals.

This is not the end but rather the beginning of a new chapter in your lives. The knowledge and skills you have acquired will serve you well in your future endeavors. Use them wisely and make something out of your lives that is worthwhile and meaningful.

I urge each one


llama_print_timings:        load time =  4124.97 ms
llama_print_timings:      sample time =   369.02 ms /   366 runs   (    1.01 ms per token,   991.82 tokens per second)
llama_print_timings: prompt eval time =  6721.58 ms /    15 tokens (  448.11 ms per token,     2.23 tokens per second)
llama_print_timings:        eval time = 168961.84 ms /   365 runs   (  462.91 ms per token,     2.16 tokens per second)
llama_print_timings:       total time = 177257.11 ms


'\nDear Graduates,\n\nAs you stand here today in your caps and gowns, I want to remind you of something that is truly remarkable about your achievement. You\'re not just graduating from college, but from the process of becoming who you are. The challenges you faced, the sacrifices you made, and the knowledge you gained will define and shape you as individuals for the rest of your lives.\n\nCongratulations on this remarkable milestone in your lives!\n\nWith great joy, I congratulate each and every one of you on your graduation. It\'s a moment to celebrate your hard work, dedication, and achievement. You have worked tirelessly for years to reach this point, and I am proud to see you standing here today as accomplished individuals.\n\nThis is not the end but rather the beginning of a new chapter in your lives. The knowledge and skills you have acquired will serve you well in your future endeavors. Use them wisely and make something out of your lives that is worthwhile and meaningful.\n\nI

In [43]:
Question = """
I have three houses, each with three cats.
each cat owns 4 mittens, and a hat. Each mitten was
knit from 7m of yarn, each hat from 4m.
How much yarn was needed to make all the items?

Think about it step by step, and show your work.
"""
template = """USER: {question}
ASSISTANT: Let's work this out in a step by step way to be sure we have the right answer."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain.run(Question)


Llama.generate: prefix-match hit




First, let's calculate how many mittens were made for each house:
3 houses x 3 cats/house = 9 cats
9 cats x 4 mittens/cat = 36 mittens

Next, let's find out how much yarn was used to make the mittens:
36 mittens x 7m of yarn/mitten = 252m of yarn

Now, we need to calculate the amount of yarn needed for each hat:
Each cat owns a hat which is 3 cats per house so 9 cats in total.
This means there are 3*9 = 27 hats
Each hat was made from 4m of yarn, so we use this to find out that they used 27 \* 4 m = 108 m of yarn for the hats

Finally, let's add both amounts of yarn:
108 + 252 = 360m of yarn needed.

So, to make all the mittens and hats needed for each cat in each house, we would need a total of 360m of yarn.



llama_print_timings:        load time =  4124.97 ms
llama_print_timings:      sample time =   271.90 ms /   268 runs   (    1.01 ms per token,   985.66 tokens per second)
llama_print_timings: prompt eval time = 46967.92 ms /   104 tokens (  451.61 ms per token,     2.21 tokens per second)
llama_print_timings:        eval time = 124570.80 ms /   267 runs   (  466.56 ms per token,     2.14 tokens per second)
llama_print_timings:       total time = 172684.43 ms


"\n\nFirst, let's calculate how many mittens were made for each house:\n3 houses x 3 cats/house = 9 cats\n9 cats x 4 mittens/cat = 36 mittens\n\nNext, let's find out how much yarn was used to make the mittens:\n36 mittens x 7m of yarn/mitten = 252m of yarn\n\nNow, we need to calculate the amount of yarn needed for each hat:\nEach cat owns a hat which is 3 cats per house so 9 cats in total.\nThis means there are 3*9 = 27 hats\nEach hat was made from 4m of yarn, so we use this to find out that they used 27 \\* 4 m = 108 m of yarn for the hats\n\nFinally, let's add both amounts of yarn:\n108 + 252 = 360m of yarn needed.\n\nSo, to make all the mittens and hats needed for each cat in each house, we would need a total of 360m of yarn.\n"

In [None]:
# Llama performs well; increasing the model size may improve the results further.
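
The yarn question is a useful check on the model's reasoning because the arithmetic can be verified directly. The sketch below assumes the reading implied by the prompt, namely one hat per cat. Under that reading the model's mitten figure of 252 m agrees, but its count of 27 hats does not: there are 9 cats and therefore 9 hats.

```python
houses = 3
cats_per_house = 3
mittens_per_cat = 4
hats_per_cat = 1          # "each cat owns 4 mittens, and a hat"
yarn_per_mitten_m = 7
yarn_per_hat_m = 4

cats = houses * cats_per_house                             # 9 cats
mitten_yarn = cats * mittens_per_cat * yarn_per_mitten_m   # 36 mittens * 7 m = 252 m
hat_yarn = cats * hats_per_cat * yarn_per_hat_m            # 9 hats * 4 m = 36 m
total = mitten_yarn + hat_yarn

print(total)  # 288
```

The total is 288 m, compared with the 360 m reported in the model output above.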