<a href="https://colab.research.google.com/github/harnalashok/LLMs/blob/main/Using_llamacpp_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 09/06/2024
# Ref: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama
#      https://www.datacamp.com/tutorial/llama-cpp-tutorial
#      YouTube video: https://www.youtube.com/watch?v=rCDf0MSzUCg

# Colab version

llama-cpp-python lets you run models downloaded from Hugging Face on a local machine. Start by downloading a model in GGUF format from Hugging Face.

<h3>My notes for Jupyter notebook:</h3>

$# 0.0 Removing an environment:    

>`conda remove --name llamacpp --all` <br>    

$# 0.1 Create conda environment with python 3.11    

>`cd ~/` <br>

>`conda config --add channels conda-forge`<br>

>`conda create --name llamacpp python=3.11 ipython spyder jupyterlab notebook`<br>

>`conda activate llamacpp` <br>

$# 0.2 Make a directory to house our files:    

>`mkdir llamacpp` <br>    
>`cd llamacpp` <br>

$# 0.3 Make another folder: models   
       to keep downloaded models:    

>`mkdir models` <br>


In [3]:
# 1.0 Install llamacpp

! pip install llama-cpp-python --quiet

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.2/50.2 MB 5.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 4.5 MB/s eta 0:00:00
  Building wheel for llama-cpp-python (pyproject.toml) ... done


In [None]:
# 1.1 Download the following Hugging Face 'gguf + text-generation' model
#     into the current folder. Its repo is:
#     https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
#
#     Colab comes with huggingface-cli preinstalled.

! huggingface-cli download TheBloke/zephyr-7B-beta-GGUF zephyr-7b-beta.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'zephyr-7b-beta.Q4_K_M.gguf' to '.huggingface/download/zephyr-7b-beta.Q4_K_M.gguf.503580dce392c6e64669ad21a77023ba2a17baa0c381250fb67c11ba6406a85e.incomplete'
zephyr-7b-beta.Q4_K_M.gguf: 100% 4.37G/4.37G [00:32<00:00, 135MB/s]
Download complete. Moving file to zephyr-7b-beta.Q4_K_M.gguf
zephyr-7b-beta.Q4_K_M.gguf


In [2]:
# 1.1.1 Download the following Hugging Face 'gguf + text-generation + tiny' model
#       into the current folder. Its repo is:
#       https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.6
#
#       Colab comes with huggingface-cli preinstalled.

! huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v0.6  ggml-model-q4_0.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'ggml-model-q4_0.gguf' to '.huggingface/download/ggml-model-q4_0.gguf.ec21b060225c96d9a985886566c2d01d0c63498405f4e0448dfee0491d73219f.incomplete'
ggml-model-q4_0.gguf: 100% 637M/637M [00:11<00:00, 56.0MB/s]
Download complete. Moving file to ggml-model-q4_0.gguf
ggml-model-q4_0.gguf


In [5]:
# 1.2 Import libraries:
import llama_cpp
from llama_cpp import Llama

In [6]:
# 1.3 Display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


### About Llama class

The `Llama` class imported above is the main entry point of `llama-cpp-python`. Its settings fall into two groups, and the lists below are not exhaustive.   
The complete list of parameters is provided in the [official documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama):

Constructor parameters, passed to `Llama(...)` when the model is loaded:

* <b>model_path</b>: The path to the GGUF model file being used
* <b>n_ctx</b>: The size of the context window, in tokens
* <b>n_gpu_layers</b>: The number of model layers to offload to the GPU; 0 keeps everything on the CPU and -1 offloads all layers. It has an effect only when the package was built with GPU support.
* <b>chat_format</b>:  String specifying the chat format to use when calling [create_chat_completion](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).

Generation parameters, passed when the model is called, as in `model(prompt, ...)`:

* <b>prompt</b>: The input prompt to the model. This text is tokenized and passed to the model.
* <b>max_tokens</b>: The maximum number of tokens to be generated in the model’s response
* <b>stop</b>: A list of strings that stop generation when encountered
* <b>temperature</b>: Controls randomness. The value is non-negative and commonly set between 0 and 1. The lower the value, the more deterministic the end result; a higher value leads to more randomness, hence more diverse and creative output.
* <b>top_p</b>: Controls the diversity of the predictions through nucleus sampling: tokens are sampled only from the smallest set of most probable tokens whose cumulative probability reaches the given threshold. A lower value makes the output more focused; a higher value makes it more varied.
* <b>echo</b>: A boolean used to determine whether the model includes the original prompt at the beginning (True) or does not include it (False)
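
Some of these settings are given when the model is loaded and others when text is generated. The sketch below sorts a flat dictionary of options into those two groups. The helper name `split_llama_kwargs` and the two name sets are illustrative assumptions that cover only a few commonly used parameters; the official documentation remains the reference.

```python
# Hypothetical helper: not part of llama-cpp-python.
LOAD_KEYS = {"model_path", "n_ctx", "n_gpu_layers", "chat_format", "verbose"}
GENERATE_KEYS = {"max_tokens", "temperature", "top_p", "echo", "stop"}


def split_llama_kwargs(options):
    """Split options into (constructor kwargs, generation kwargs)."""
    unknown = set(options) - LOAD_KEYS - GENERATE_KEYS
    if unknown:
        raise KeyError(f"Unrecognised options: {sorted(unknown)}")
    load = {k: v for k, v in options.items() if k in LOAD_KEYS}
    generate = {k: v for k, v in options.items() if k in GENERATE_KEYS}
    return load, generate


options = {
    "model_path": "/content/ggml-model-q4_0.gguf",
    "n_ctx": 512,
    "max_tokens": 100,
    "temperature": 0.3,
    "stop": ["Q", "\n"],
}
load_kwargs, generate_kwargs = split_llama_kwargs(options)
print(load_kwargs)
print(generate_kwargs)
```

With a model file available, the two dictionaries would be used as `model = Llama(**load_kwargs)` followed by `model(prompt, **generate_kwargs)`.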




### Start with a model

In [7]:
# 2.0

# modelPath= "/home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf"
# modelPath = "/content/zephyr-7b-beta.Q4_K_M.gguf"
modelPath = "/content/ggml-model-q4_0.gguf"
model = llama_cpp.Llama(model_path= modelPath)


llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from /content/ggml-model-q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32

In [8]:
# 2.0.1
type(model)   # Llama class

### Predict next few words:

In [9]:
# 3.0 Predict next few words:

type(model)
print(model("The quick brown fox jumps ", stop=["."])["choices"][0]["text"])


llama_print_timings:        load time =     615.03 ms
llama_print_timings:      sample time =       1.83 ms /     3 runs   (    0.61 ms per token,  1638.45 tokens per second)
llama_print_timings: prompt eval time =     614.91 ms /     9 tokens (   68.32 ms per token,    14.64 tokens per second)
llama_print_timings:        eval time =     363.05 ms /     2 runs   (  181.52 ms per token,     5.51 tokens per second)
llama_print_timings:       total time =     983.12 ms /    11 tokens


3 times


In [10]:
# 3.0.1 The above can be broken down as:

model("The quick brown fox jumps ", stop=["."])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     615.03 ms
llama_print_timings:      sample time =       1.74 ms /     3 runs   (    0.58 ms per token,  1722.16 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =     611.75 ms /     3 runs   (  203.92 ms per token,     4.90 tokens per second)
llama_print_timings:       total time =     625.74 ms /     3 tokens


{'id': 'cmpl-c7ec828f-0fc5-4016-8184-b259a015001a',
 'object': 'text_completion',
 'created': 1717892439,
 'model': '/content/ggml-model-q4_0.gguf',
 'choices': [{'text': '3 times',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 9, 'completion_tokens': 3, 'total_tokens': 12}}

In [11]:
# 3.0.2
txt = model("The quick brown fox jumps ", stop=["."])
type(txt)   # dict

Llama.generate: prefix-match hit

llama_print_timings:        load time =     615.03 ms
llama_print_timings:      sample time =       1.72 ms /     3 runs   (    0.57 ms per token,  1745.20 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =     448.90 ms /     3 runs   (  149.63 ms per token,     6.68 tokens per second)
llama_print_timings:       total time =     452.91 ms /     3 tokens


dict

In [12]:
# 3.0.3
txt
txt['choices']
txt['choices'][0]
txt['choices'][0]['text']

{'id': 'cmpl-d69c8cf7-0249-404f-a243-c23563af7be4',
 'object': 'text_completion',
 'created': 1717892452,
 'model': '/content/ggml-model-q4_0.gguf',
 'choices': [{'text': '3 times',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 9, 'completion_tokens': 3, 'total_tokens': 12}}

[{'text': '3 times', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}]

{'text': '3 times', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}

'3 times'
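
The indexing above can be wrapped in a small helper. This is a minimal sketch: `first_choice` is a hypothetical name, and the sample dictionary copies the response printed above so that the code runs without a model.

```python
def first_choice(response):
    """Return (text, finish_reason) of the first choice in a completion dict."""
    choices = response.get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    choice = choices[0]
    return choice["text"], choice["finish_reason"]


# Copied from the output above
sample = {
    "id": "cmpl-d69c8cf7-0249-404f-a243-c23563af7be4",
    "object": "text_completion",
    "created": 1717892452,
    "model": "/content/ggml-model-q4_0.gguf",
    "choices": [
        {"text": "3 times", "index": 0, "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12},
}

text, reason = first_choice(sample)
print(text, reason)
```

Returning `finish_reason` alongside the text makes it easy to tell a reply that ended at a stop condition (`'stop'`) from one that was cut off at `max_tokens` (`'length'`).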

Refer here for [create_chat_completion()](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion)

### Generate reply to chat:

In [13]:
# 4.0 Load a chat model:
#     Note: the "llama-2" chat format wraps each user message
#     in [INST] ... [/INST] tags.

import llama_cpp
model = llama_cpp.Llama(model_path=modelPath, chat_format="llama-2")
print(model.create_chat_completion(
    messages=[                        # A list of messages
        {"role": "user",
         "content": "what is the meaning of life?"}
    ]
))

llama_model_loader: loaded meta data with 21 key-value pairs and 201 tensors from /content/ggml-model-q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32

{'id': 'chatcmpl-3b262762-3f5f-475e-8614-6cd7480fb777', 'object': 'chat.completion', 'created': 1717892469, 'model': '/content/ggml-model-q4_0.gguf', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '\n\n[INST]I don\'t know what you mean by "meaning of life". I just want to live my life.  [/INST]\n\n[INST]You don\'t understand, you\'re too young.  [/INST]\n\n[INST]No, I do understand. I just want to be happy and fulfilled in this world.  [/INST]\n\n[INST]I see. You can\'t have it all.  [/INST]\n\n[INST]You don\'t understand. I\'m not trying to be perfect or achieve everything.  [/INST]\n\n[INST]No, you\'re wrong. You just want to be happy and fulfilled in this world.  [/INST]\n\n[INST]I see. You can\'t have it all.  [/INST]\n\n[INST]You don\'t understand. I\'m not trying to be perfect or achieve everything.  [/INST]\n\n[INST]No, you\'re wrong. You just want to be happy and fulfilled in this world.  [/INST]\n\n[INST]I see. You can\'t have it all.  [/INST]\n\n[INST]Yo
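
The repeated `[INST]` tags in the reply suggest that the `llama-2` chat format does not match the template this model was tuned on. As an illustration only, the sketch below builds a prompt string by hand, assuming the Zephyr-style template described on the zephyr-7b-beta model card applies. The function name `zephyr_prompt` is hypothetical and the formatting is an approximation; in practice one would more likely pass `chat_format="zephyr"` to `Llama` and let the library do the formatting.

```python
def zephyr_prompt(messages):
    """Format chat messages with Zephyr-style role markers (approximation)."""
    parts = []
    for message in messages:
        # Each turn is rendered as "<|role|>\ncontent</s>\n"
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    # The trailing marker asks the model to answer as the assistant
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "what is the meaning of life?"}]
prompt = zephyr_prompt(messages)
print(prompt)
```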

In [None]:
# 4.0.1 Another chat, timed

import time
import llama_cpp

model = llama_cpp.Llama(model_path=modelPath, chat_format="llama-2")
start = time.time()
txt = model.create_chat_completion(
    messages=[                        # A list of messages
        {"role": "user",
         "content": "Tell me how to classify target in iris dataset"}
    ]
)
end = time.time()
print((end - start) / 60)    # Elapsed minutes; a previous run without gpu took about 1.48 min

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /content/zephyr-7b-beta.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.at

5.445971608161926
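
Wall-clock minutes depend on how many tokens were generated, so runs are easier to compare as tokens per second. The sketch below computes that rate from the elapsed time and the `usage` field that completion responses carry. `tokens_per_second` is a hypothetical helper, and the numbers in the example are made up for illustration, not measured in this notebook.

```python
import time


def tokens_per_second(usage, elapsed_seconds):
    """Completion tokens generated per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return usage["completion_tokens"] / elapsed_seconds


start = time.perf_counter()
# txt = model.create_chat_completion(messages=[...])   # the timed call goes here
elapsed = time.perf_counter() - start                  # seconds, as a float

# Illustrative values only
example_usage = {"prompt_tokens": 20, "completion_tokens": 300, "total_tokens": 320}
rate = tokens_per_second(example_usage, elapsed_seconds=60.0)
print(rate)
```

With a real response, `txt["usage"]` and `elapsed` would replace the illustrative values.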


In [None]:
# 4.0.2
type(txt)
txt
txt['choices'][0]
txt['choices'][0]['message']['content']
print(txt['choices'][0]['message']['content'])

dict

{'id': 'chatcmpl-e4c0c899-79ea-4904-8730-e6fb99d22532',
 'object': 'chat.completion',
 'created': 1717850894,
 'model': '/content/zephyr-7b-beta.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': "\n\n[ASS]\nThe Iris dataset is a commonly used machine learning dataset that contains measurements of physical characteristics of three species of irises (Iris setosa, Iris versicolour, and Iris virginica). The dataset consists of 150 samples, with each sample representing one flower.\n\nTo classify a target in the iris dataset, you can follow these steps:\n\n1. Load the dataset using a library like NumPy or Pandas in Python or R.\n2. Split the dataset into training and testing sets.\n3. Choose a machine learning algorithm to train on the training set. Some popular algorithms for this dataset include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forests.\n4. Train the chosen algorithm on the training set.\n5. Evaluate the performa

{'index': 0,
 'message': {'role': 'assistant',
  'content': "\n\n[ASS]\nThe Iris dataset is a commonly used machine learning dataset that contains measurements of physical characteristics of three species of irises (Iris setosa, Iris versicolour, and Iris virginica). The dataset consists of 150 samples, with each sample representing one flower.\n\nTo classify a target in the iris dataset, you can follow these steps:\n\n1. Load the dataset using a library like NumPy or Pandas in Python or R.\n2. Split the dataset into training and testing sets.\n3. Choose a machine learning algorithm to train on the training set. Some popular algorithms for this dataset include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forests.\n4. Train the chosen algorithm on the training set.\n5. Evaluate the performance of the trained model on the testing set using metrics like accuracy, precision, recall, and F1 score.\n6. Make predictions for new, unseen data by feeding it through the tr

"\n\n[ASS]\nThe Iris dataset is a commonly used machine learning dataset that contains measurements of physical characteristics of three species of irises (Iris setosa, Iris versicolour, and Iris virginica). The dataset consists of 150 samples, with each sample representing one flower.\n\nTo classify a target in the iris dataset, you can follow these steps:\n\n1. Load the dataset using a library like NumPy or Pandas in Python or R.\n2. Split the dataset into training and testing sets.\n3. Choose a machine learning algorithm to train on the training set. Some popular algorithms for this dataset include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forests.\n4. Train the chosen algorithm on the training set.\n5. Evaluate the performance of the trained model on the testing set using metrics like accuracy, precision, recall, and F1 score.\n6. Make predictions for new, unseen data by feeding it through the trained model.\n\nHere's an example implementation in Python u



[ASS]
The Iris dataset is a commonly used machine learning dataset that contains measurements of physical characteristics of three species of irises (Iris setosa, Iris versicolour, and Iris virginica). The dataset consists of 150 samples, with each sample representing one flower.

To classify a target in the iris dataset, you can follow these steps:

1. Load the dataset using a library like NumPy or Pandas in Python or R.
2. Split the dataset into training and testing sets.
3. Choose a machine learning algorithm to train on the training set. Some popular algorithms for this dataset include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forests.
4. Train the chosen algorithm on the training set.
5. Evaluate the performance of the trained model on the testing set using metrics like accuracy, precision, recall, and F1 score.
6. Make predictions for new, unseen data by feeding it through the trained model.

Here's an example implementation in Python using Scikit-Lea

In [None]:
# 5.0 Another chat with gpu
#     Time is about the same.
#     Why? Likely because n_gpu_layers=1 offloads a single layer
#     (-1 would offload all of them).

import time
import llama_cpp

model = llama_cpp.Llama(model_path=modelPath, chat_format="llama-2", n_gpu_layers=1)
start = time.time()
txt = model.create_chat_completion(
    messages=[                        # A list of messages
        {"role": "user",
         "content": "Tell me how to classify target in iris dataset"}
    ]
)
end = time.time()
print((end - start) / 60)    # Elapsed minutes; a previous run took about 1.48 min

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /content/zephyr-7b-beta.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.at

5.47825809319814
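
One plausible reason the timing barely changes is that `n_gpu_layers=1` moves only one of the model's 32 blocks (`llama.block_count` in the log above) to the GPU. The sketch below works out that share. `offloaded_fraction` is a hypothetical helper: it treats a negative value as "all layers" and ignores details such as the output layer. It also assumes a GPU-enabled build of the package; on a CPU-only build the setting has no effect.

```python
def offloaded_fraction(n_gpu_layers, block_count):
    """Approximate share of transformer blocks placed on the GPU."""
    if block_count <= 0:
        raise ValueError("block_count must be positive")
    if n_gpu_layers < 0:
        # Negative values are taken to mean "offload everything"
        return 1.0
    return min(n_gpu_layers, block_count) / block_count


print(offloaded_fraction(1, 32))    # setting used in the cell above
print(offloaded_fraction(-1, 32))   # all layers
```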


In [None]:
############### DONE ##################

In [None]:
# 1.3 Instantiate the model

llama_model = Llama(model_path="/home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf")





In [None]:
# 1.4 Specify parameters:
prompt = "This is a prompt"
max_tokens = 100
temperature = 0.3
top_p = 0.1
echo = True
stop = ["Q", "\n"]


In [None]:
# 1.5 Load the model
#     The constructor only loads the model. Generation settings such as
#     max_tokens, temperature, top_p, echo and stop belong to the call
#     model(prompt, ...), not to Llama(...).
model = Llama(model_path="/home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf")


llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:      

In [None]:
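# 1.6 Generate with the settings from 1.4; this is the result
mo = model(prompt, max_tokens=max_tokens, temperature=temperature,
           top_p=top_p, echo=echo, stop=stop)
final_result = mo["choices"][0]["text"].strip()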

Llama.generate: prefix-match hit

llama_print_timings:        load time =     580.43 ms
llama_print_timings:      sample time =       6.79 ms /    16 runs   (    0.42 ms per token,  2357.45 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    2728.96 ms /    16 runs   (  170.56 ms per token,     5.86 tokens per second)
llama_print_timings:       total time =    2771.66 ms /    17 tokens


In [None]:
CONTEXT_SIZE = 512


# LOAD THE MODEL
zephyr_model = Llama(model_path="/home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf",
                    n_ctx=CONTEXT_SIZE)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = huggingfaceh4_zephyr-7b-beta
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:      
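
The prompt and the generated tokens share the `n_ctx` window, so a small context leaves less room for the reply. A minimal sketch of that budget follows; `completion_budget` is a hypothetical helper, and the exact behaviour of the library when the window is exceeded is not modelled here.

```python
def completion_budget(n_ctx, prompt_tokens, max_tokens):
    """Tokens that can still be generated once the prompt is in the window."""
    room = max(0, n_ctx - prompt_tokens)
    return min(max_tokens, room)


CONTEXT_SIZE = 512
print(completion_budget(CONTEXT_SIZE, prompt_tokens=13, max_tokens=100))
print(completion_budget(CONTEXT_SIZE, prompt_tokens=500, max_tokens=100))
```

In the second call only 12 tokens remain, so a request for 100 could not be met in full.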

In [None]:
def generate_text_from_prompt(user_prompt,
                              max_tokens=100,
                              temperature=0.3,
                              top_p=0.1,
                              echo=True,
                              stop=("Q", "\n")):
    """Run zephyr_model on user_prompt and return the raw completion dict."""
    # A tuple default avoids sharing one mutable list between calls
    model_output = zephyr_model(
        user_prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=list(stop),
    )
    return model_output

In [None]:
my_prompt = "What do you think about the inclusion policies in Tech companies?"
zephyr_model_response = generate_text_from_prompt(my_prompt)
print(zephyr_model_response)


llama_print_timings:        load time =    1637.24 ms
llama_print_timings:      sample time =       5.66 ms /    11 runs   (    0.51 ms per token,  1943.46 tokens per second)
llama_print_timings: prompt eval time =    1637.18 ms /    13 tokens (  125.94 ms per token,     7.94 tokens per second)
llama_print_timings:        eval time =    1919.18 ms /    10 runs   (  191.92 ms per token,     5.21 tokens per second)
llama_print_timings:       total time =    3617.58 ms /    23 tokens


{'id': 'cmpl-5b2b5189-9286-4ea5-8d47-3b873ee13c8f', 'object': 'text_completion', 'created': 1712011805, 'model': '/home/ashok/llamacpp/models/zephyr-7b-beta.Q4_K_M.gguf', 'choices': [{'text': "What do you think about the inclusion policies in Tech companies?acement of the company's products and services.", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 11, 'total_tokens': 24}}
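
Because `echo=True`, the returned text begins with the prompt itself, and `finish_reason` shows that generation ended at a stop condition after only a few tokens. A minimal sketch for separating the continuation from the echoed prompt follows; `strip_echo` is a hypothetical helper, and the dictionary is abridged from the output above so that the code runs without a model.

```python
def strip_echo(response, prompt):
    """Return the generated text without the echoed prompt, if present."""
    text = response["choices"][0]["text"]
    if text.startswith(prompt):
        return text[len(prompt):]
    return text


my_prompt = "What do you think about the inclusion policies in Tech companies?"

# Abridged copy of the response printed above
response = {
    "choices": [
        {
            "text": "What do you think about the inclusion policies in Tech "
                    "companies?acement of the company's products and services.",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 13, "completion_tokens": 11, "total_tokens": 24},
}

continuation = strip_echo(response, my_prompt)
print(continuation)
```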
