Credit where credit is due: this notebook is based on https://www.kaggle.com/code/gpreda/fast-test-of-llama-v2-pre-quantized-with-llama-cpp.

Loading the model data takes a while, so please be patient while the notebook starts.

First we need to install the llama-cpp-python package, built without CUDA because we are running on CPU. If the install fails, it can leave files behind (here, the llama.cpp directory) that we need to remove before we can try again.
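Before rebuilding, it can help to check whether an earlier attempt already left a working install. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example, and `llama-cpp-python` is the distribution name used in the pip command in the next cell.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# prints a version string if the package is installed, otherwise None
print(installed_version("llama-cpp-python"))
```

If this prints None, the build did not complete and the install cell needs to be run again.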

In [1]:
# clean up after failed efforts if necessary
!rm -rf llama.cpp
# build without CUDA for CPU-only use; the name of this flag has changed since the initial release
!CMAKE_ARGS="-DLLAMA_CUDA=off" pip install llama-cpp-python
!git clone https://github.com/ggerganov/llama.cpp.git

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.79.tar.gz (50.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.3/50.3 MB 21.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 3.7 MB/s eta 0:00:00
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created 

Next we load the quantized 7B model from the Kaggle input dataset attached to this notebook.

In [2]:
import arrow
from llama_cpp import Llama

time_load = arrow.now()
llm = Llama(model_path="/kaggle/input/test-llama-2-quantized-with-llama-cpp/llama-7b.gguf")
print("Model load time: {:5.4f} seconds".format((arrow.now() - time_load).total_seconds()))

llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from /kaggle/input/test-llama-2-quantized-with-llama-cpp/llama-7b.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = 7b-chat-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.

Model load time: 47.7547 seconds


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '2048', 'general.name': '7b-chat-hf', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '11008', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '32', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '32', 'llama.attention.head_count_kv': '32', 'tokenizer.ggml.model': 'llama', 'general.file_type': '7'}
Using fallback chat format: llama-2
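The load timing above uses arrow, but the same measurement can be sketched with the standard library alone. The context manager below is a hypothetical helper named `timed`, not part of the original notebook, and the `time.sleep` call is only a stand-in for the slow `Llama(...)` constructor.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, in seconds."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        # record the elapsed time even if the block raised an exception
        result["seconds"] = time.perf_counter() - start
        print("{}: {:5.4f} seconds".format(label, result["seconds"]))


# stand-in workload; in the notebook this would wrap the Llama(...) call
with timed("Model load time") as timing:
    time.sleep(0.05)
```

Keeping the elapsed time in the yielded dictionary makes it available after the block, which is handy if we want to compare load times across runs.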


Let's wrap the query and the result parsing in a function. The parsed results sometimes look odd, so for the moment the function also returns the underlying response object for debugging purposes.

In [3]:
def ask(query: str, max_tokens: int):
    """Send one Q/A-style prompt; return (answer texts, elapsed seconds, raw response)."""
    time_ask = arrow.now()
    # generation stops at the next "Q:" or at the first newline
    output = llm(prompt="Q: {} A: ".format(query), max_tokens=max_tokens, stop=["Q:", "\n"], echo=False)
    texts = [choice['text'] for choice in output['choices']]
    return texts, (arrow.now() - time_ask).total_seconds(), output

query_cities = 'What are the names of three European capital cities?'
result_cities, time_cities, debug_cities = ask(query=query_cities, max_tokens=40)
print(query_cities)
print(result_cities)
print('time: {:5.4f}'.format(time_cities))


llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      17.22 ms /    21 runs   (    0.82 ms per token,  1219.37 tokens per second)
llama_print_timings: prompt eval time =    2805.02 ms /    16 tokens (  175.31 ms per token,     5.70 tokens per second)
llama_print_timings:        eval time =   12706.37 ms /    20 runs   (  635.32 ms per token,     1.57 tokens per second)
llama_print_timings:       total time =   15563.92 ms /    36 tokens


What are the names of three European capital cities?
['1. - Paris, France 2. - London, England 3. - Rome, Italy']
time: 15.5759
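The object returned by llm(...) is a plain dictionary laid out like an OpenAI-style completion, which is why ask reads output['choices'][...]['text']. The sketch below uses a hand-written stand-in (every value is made up and is not real model output) and a hypothetical helper `choice_texts` to show the extraction without loading the model.

```python
from typing import Any, Dict, List

# Hand-written stand-in for a completion result; values are illustrative only.
fake_output: Dict[str, Any] = {
    "object": "text_completion",
    "choices": [
        {"text": " Paris, London, Rome", "index": 0, "finish_reason": "stop"},
    ],
    "usage": {"prompt_tokens": 16, "completion_tokens": 6, "total_tokens": 22},
}


def choice_texts(output: Dict[str, Any], strip: bool = True) -> List[str]:
    """Collect the generated text of every choice, optionally trimming whitespace."""
    texts = [choice["text"] for choice in output["choices"]]
    return [text.strip() for text in texts] if strip else texts


print(choice_texts(fake_output))
print(fake_output["choices"][0]["finish_reason"])
```

When an answer looks odd, the finish_reason field of the returned debug object is a reasonable first thing to inspect, since it should indicate whether generation ended at a stop sequence or ran into the max_tokens limit.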


Interestingly, if we ask this question ten times we get a range of answers. Most of them are correct and sensibly formatted.

In [4]:
for index in range(10):
    result_cities, time_cities, _ = ask(query=query_cities, max_tokens=40)
    print(result_cities)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      11.12 ms /    19 runs   (    0.59 ms per token,  1708.17 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7156.00 ms /    19 runs   (  376.63 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =    7182.01 ms /    19 tokens
Llama.generate: prefix-match hit


['1. Paris, France 2. London, United Kingdom 3. Rome, Italy']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      13.52 ms /    22 runs   (    0.61 ms per token,  1627.46 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    8609.69 ms /    22 runs   (  391.35 ms per token,     2.56 tokens per second)
llama_print_timings:       total time =    8637.98 ms /    22 tokens
Llama.generate: prefix-match hit


[' Three European capital cities are London (United Kingdom), Paris (France), and Rome (Italy).']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.53 ms /    18 runs   (    0.58 ms per token,  1709.73 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6707.63 ms /    18 runs   (  372.65 ms per token,     2.68 tokens per second)
llama_print_timings:       total time =    6729.01 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Berlin, Germany 2. Paris, France 3. London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      11.21 ms /    19 runs   (    0.59 ms per token,  1694.31 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7145.85 ms /    19 runs   (  376.10 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =    7173.83 ms /    19 tokens
Llama.generate: prefix-match hit


['1. Berlin, Germany 2. Paris, France 3. London, United Kingdom']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.44 ms /    18 runs   (    0.58 ms per token,  1724.30 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6770.74 ms /    18 runs   (  376.15 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =    6793.16 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Paris, France 2. Rome, Italy 3. Madrid, Spain']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.62 ms /    18 runs   (    0.59 ms per token,  1694.28 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7108.81 ms /    18 runs   (  394.93 ms per token,     2.53 tokens per second)
llama_print_timings:       total time =    7130.95 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Vienna, Austria 2. Paris, France 3. Rome, Italy']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.58 ms /    18 runs   (    0.59 ms per token,  1702.13 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6772.55 ms /    18 runs   (  376.25 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =    6793.87 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Berlin, Germany 2. Paris, France 3. London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.37 ms /    18 runs   (    0.58 ms per token,  1735.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6777.60 ms /    18 runs   (  376.53 ms per token,     2.66 tokens per second)
llama_print_timings:       total time =    6798.91 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Paris, France 2. London, England 3. Rome, Italy']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.61 ms /    18 runs   (    0.59 ms per token,  1696.83 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6841.05 ms /    18 runs   (  380.06 ms per token,     2.63 tokens per second)
llama_print_timings:       total time =    6862.46 ms /    18 tokens
Llama.generate: prefix-match hit


['1. Paris, France 2. London, UK 3. Rome, Italy']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =       8.63 ms /    15 runs   (    0.58 ms per token,  1738.32 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    5655.11 ms /    15 runs   (  377.01 ms per token,     2.65 tokens per second)
llama_print_timings:       total time =    5672.50 ms /    15 tokens


['3 European Capital Cities Berlin, Germany Paris, France London, UK']
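The variation comes from sampling: each next token is drawn at random from the model's predicted distribution rather than always being the most likely one. A toy sketch of temperature sampling follows. The vocabulary and scores are invented, and this is not llama.cpp's sampler, which applies further filters such as top-k and top-p.

```python
import math
import random
from typing import Dict


def sample_token(scores: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one token from softmax(scores / temperature); temperature 0 means greedy."""
    if temperature <= 0:
        return max(scores, key=scores.get)
    # subtract the max score before exponentiating for numerical stability
    top = max(scores.values())
    weights = [math.exp((s - top) / temperature) for s in scores.values()]
    return rng.choices(list(scores), weights=weights, k=1)[0]


toy_scores = {"Paris": 2.0, "Berlin": 1.8, "Vienna": 1.0}  # invented numbers
rng = random.Random(0)
print([sample_token(toy_scores, 0.8, rng) for _ in range(5)])  # depends on the random draws
print([sample_token(toy_scores, 0.0, rng) for _ in range(5)])  # always 'Paris'
```

llama-cpp-python exposes a temperature argument on the call and a seed argument on the Llama constructor. Lowering the former or fixing the latter should reduce the variation, though that is untested in this notebook.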


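If we phrase the question slightly differently, the output below shows the same answer ten times. That repetition is an artifact rather than model behavior: as originally run, the loop printed result_cities (the last answer from the previous cell) instead of result_capitals. The timings show between 13 and 27 generated tokens per run, so the actual answers varied. The loop below prints result_capitals; the output that follows was recorded before that fix.

In [5]:
query_capitals = 'What are three capitals of Europe?'
for index in range(10):
    result_capitals, time_capitals, debug_capitals = ask(query=query_capitals, max_tokens=40)
    print(result_capitals)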

Llama.generate: prefix-match hit

llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.96 ms /    19 runs   (    0.58 ms per token,  1733.26 tokens per second)
llama_print_timings: prompt eval time =    2141.36 ms /     9 tokens (  237.93 ms per token,     4.20 tokens per second)
llama_print_timings:        eval time =    6868.00 ms /    18 runs   (  381.56 ms per token,     2.62 tokens per second)
llama_print_timings:       total time =    9031.87 ms /    27 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      12.12 ms /    21 runs   (    0.58 ms per token,  1732.67 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7865.64 ms /    21 runs   (  374.55 ms per token,     2.67 tokens per second)
llama_print_timings:       total time =    7890.56 ms /    21 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.76 ms /    19 runs   (    0.57 ms per token,  1765.80 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7193.72 ms /    19 runs   (  378.62 ms per token,     2.64 tokens per second)
llama_print_timings:       total time =    7216.68 ms /    19 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      13.84 ms /    24 runs   (    0.58 ms per token,  1733.98 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    9065.70 ms /    24 runs   (  377.74 ms per token,     2.65 tokens per second)
llama_print_timings:       total time =    9095.38 ms /    24 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      11.08 ms /    19 runs   (    0.58 ms per token,  1715.27 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7427.57 ms /    19 runs   (  390.92 ms per token,     2.56 tokens per second)
llama_print_timings:       total time =    7451.49 ms /    19 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.17 ms /    18 runs   (    0.56 ms per token,  1770.26 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6810.76 ms /    18 runs   (  378.38 ms per token,     2.64 tokens per second)
llama_print_timings:       total time =    6831.81 ms /    18 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      10.54 ms /    18 runs   (    0.59 ms per token,  1706.97 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6815.90 ms /    18 runs   (  378.66 ms per token,     2.64 tokens per second)
llama_print_timings:       total time =    6837.58 ms /    18 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =       7.59 ms /    13 runs   (    0.58 ms per token,  1712.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    4818.99 ms /    13 runs   (  370.69 ms per token,     2.70 tokens per second)
llama_print_timings:       total time =    4834.83 ms /    13 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      11.39 ms /    19 runs   (    0.60 ms per token,  1668.28 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    7454.46 ms /    19 runs   (  392.34 ms per token,     2.55 tokens per second)
llama_print_timings:       total time =    7478.36 ms /    19 tokens
Llama.generate: prefix-match hit


['3 European Capital Cities Berlin, Germany Paris, France London, UK']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      15.67 ms /    27 runs   (    0.58 ms per token,  1723.26 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   10102.83 ms /    27 runs   (  374.18 ms per token,     2.67 tokens per second)
llama_print_timings:       total time =   10136.39 ms /    27 tokens


['3 European Capital Cities Berlin, Germany Paris, France London, UK']
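The timing lines report how many tokens each call generated, so they are a quick way to tell whether two runs produced text of the same length. Below is a small sketch that pulls those numbers out of the log text. The regular expression and the helper name `eval_stats` are our own, and the two sample lines are copied from the output above.

```python
import re
from typing import List, Tuple

# Two lines copied from the log output above.
log_text = """\
llama_print_timings:        eval time =    6868.00 ms /    18 runs   (  381.56 ms per token,     2.62 tokens per second)
llama_print_timings:        eval time =   10102.83 ms /    27 runs   (  374.18 ms per token,     2.67 tokens per second)
"""

# Matches only the generation line, not the "prompt eval time" line.
EVAL_LINE = re.compile(
    r"^llama_print_timings:\s+eval time =\s*([\d.]+) ms /\s*(\d+) runs",
    re.MULTILINE,
)


def eval_stats(text: str) -> List[Tuple[int, float]]:
    """Return (generated_tokens, tokens_per_second) for each eval-time line."""
    stats = []
    for millis, runs in EVAL_LINE.findall(text):
        seconds = float(millis) / 1000.0
        # a zero eval time is reported as an infinite rate, as in the log itself
        rate = round(int(runs) / seconds, 2) if seconds > 0 else float("inf")
        stats.append((int(runs), rate))
    return stats


print(eval_stats(log_text))  # [(18, 2.62), (27, 2.67)]
```

Different run counts between two calls mean the generated text differed in length, even when the throughput stays near 2.6 tokens per second.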


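For a more open-ended question we may need several tries to get a useful answer. The first attempt below returns an empty string after a single generated token, and a later one ends at "Here is a brief summary of the plot:". Both are consistent with the "\n" stop sequence in ask ending generation at the first newline.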

In [6]:
query_catcher = 'What happens in the J. D. Salinger novel The Catcher in the Rye?'
for index in range(10):
    result_catcher, time_catcher, debug_catcher = ask(query=query_catcher, max_tokens=1000)
    print(result_catcher)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =       0.59 ms /     1 runs   (    0.59 ms per token,  1703.58 tokens per second)
llama_print_timings: prompt eval time =    3267.26 ms /    22 tokens (  148.51 ms per token,     6.73 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    3269.36 ms /    23 tokens
Llama.generate: prefix-match hit


['']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      94.66 ms /   145 runs   (    0.65 ms per token,  1531.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   55857.20 ms /   145 runs   (  385.22 ms per token,     2.60 tokens per second)
llama_print_timings:       total time =   56078.11 ms /   145 tokens
Llama.generate: prefix-match hit


[' The protagonist, Holden Caulfield, runs away from his boarding school and spends three days wandering around New York City, grappling with his disillusionment with society and his sense of alienation from those around him. Along the way, he encounters a variety of characters who challenge or reinforce his worldview, including his former roommate Stradlater, his sister Phoebe, and several women he meets in bars and clubs. Throughout the novel, Holden struggles with feelings of loneliness, anger, and disillusionment, and his experiences raise questions about the nature of identity, morality, and the human condition.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      80.22 ms /   123 runs   (    0.65 ms per token,  1533.36 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   47048.45 ms /   123 runs   (  382.51 ms per token,     2.61 tokens per second)
llama_print_timings:       total time =   47226.48 ms /   123 tokens
Llama.generate: prefix-match hit


[' In J.D. Salinger\'s novel, "The Catcher in the Rye," the protagonist, Holden Caulfield, struggles to navigate the challenges of adolescence and adulthood. He is a disillusioned teenager who has just been expelled from a prestigious boarding school and is struggling to find his place in the world. The novel follows Holden\'s experiences over the course of three days, during which he grapples with feelings of alienation, loneliness, and disillusionment.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =     125.33 ms /   191 runs   (    0.66 ms per token,  1523.95 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   73861.36 ms /   191 runs   (  386.71 ms per token,     2.59 tokens per second)
llama_print_timings:       total time =   74157.46 ms /   191 tokens
Llama.generate: prefix-match hit


[" Holden Caulfield, a disillusioned teenager, runs away from his boarding school and spends three days wandering around New York City, alienating himself from those he meets and reflecting on his feelings of alienation and disgust towards phoniness and superficiality in society. He eventually ends up at a nightclub where he confronts his former roommate and rival, Stradlater, who is on a date with a woman named Judith. Holden becomes drunk and depressed, and eventually returns to his school for Christmas vacation. Along the way, Holden narrates his experiences in a sarcastic and cynical tone, frequently breaking the fourth wall and addressing the reader directly. The novel is known for its themes of teenage angst, rebellion, and the struggle to find one's place in society."]



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      76.79 ms /   120 runs   (    0.64 ms per token,  1562.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   46357.14 ms /   120 runs   (  386.31 ms per token,     2.59 tokens per second)
llama_print_timings:       total time =   46528.74 ms /   120 tokens
Llama.generate: prefix-match hit


[' The Catcher in the Rye is a novel by J.D. Salinger that follows the story of Holden Caulfield, a disillusioned teenager who runs away from his boarding school and spends three days exploring New York City. During this time, he grapples with feelings of alienation and disconnection from society, and struggles to find his place in the world. The novel is known for its candid portrayal of adolescence and its themes of identity, belonging, and the challenges of growing up.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      71.72 ms /   110 runs   (    0.65 ms per token,  1533.72 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   42383.60 ms /   110 runs   (  385.31 ms per token,     2.60 tokens per second)
llama_print_timings:       total time =   42541.40 ms /   110 tokens
Llama.generate: prefix-match hit


[' Holden Caulfield, a disillusioned teenager, runs away from his boarding school and spends three days in New York City, where he grapples with feelings of alienation and disconnection from society. Along the way, he encounters various characters who challenge his perceptions of the world around him. At the end of the novel, Holden is hospitalized after attempting to jump off a bridge, leaving readers to wonder about his fate and whether or not he will ever find happiness.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      70.35 ms /   110 runs   (    0.64 ms per token,  1563.57 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   42871.90 ms /   110 runs   (  389.74 ms per token,     2.57 tokens per second)
llama_print_timings:       total time =   43027.77 ms /   110 tokens
Llama.generate: prefix-match hit


[" The Catcher in the Rye is a classic coming-of-age novel by J.D. Salinger, first published in 1951. The story follows the protagonist, Holden Caulfield, a disillusioned teenager who has been expelled from several prestigious boarding schools. The novel explores themes of alienation, rebellion, and the struggle to find one's identity in a phony adult world. Here is a brief summary of the plot:"]



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      57.99 ms /    90 runs   (    0.64 ms per token,  1551.99 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   34546.23 ms /    90 runs   (  383.85 ms per token,     2.61 tokens per second)
llama_print_timings:       total time =   34671.96 ms /    90 tokens
Llama.generate: prefix-match hit


[' Holden Caulfield, a disillusioned teenager, narrates his experiences after being expelled from boarding school. He spends three days in New York City, wandering the streets and encountering various characters, including his former friends and a potential love interest. Along the way, he grapples with feelings of alienation and disillusionment, questioning the values and conformity of society.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      62.67 ms /    96 runs   (    0.65 ms per token,  1531.78 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   36971.51 ms /    96 runs   (  385.12 ms per token,     2.60 tokens per second)
llama_print_timings:       total time =   37107.95 ms /    96 tokens
Llama.generate: prefix-match hit


[' Holden Caulfield, a disillusioned teenager, runs away from his boarding school and spends three days wandering around New York City, alienating himself from those he encounters due to his cynical views on society. He eventually ends up in a diner where he has a conversation with a former teacher that reveals the deep-seated feelings of loneliness and disconnection he feels from the world around him.']



llama_print_timings:        load time =    2805.36 ms
llama_print_timings:      sample time =      43.75 ms /    68 runs   (    0.64 ms per token,  1554.29 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   26202.63 ms /    68 runs   (  385.33 ms per token,     2.60 tokens per second)
llama_print_timings:       total time =   26295.42 ms /    68 tokens


[' Holden Caulfield, the protagonist of the book, is a disillusioned teenager who becomes increasingly alienated from society after being expelled from prep school. He spends three days wandering around New York City, reflecting on his past and interactions with various people he meets along the way.']
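The empty and cut-off answers above can be understood from how stop sequences behave: the returned text ends just before the first stop string the model produces. The function below is a pure-Python illustration of that rule, not llama.cpp's implementation. The name `cut_at_stop` and the raw strings are invented for the example.

```python
from typing import Sequence


def cut_at_stop(text: str, stops: Sequence[str]) -> str:
    """Return text up to (not including) the earliest occurrence of any stop string."""
    positions = [text.find(stop) for stop in stops if stop and stop in text]
    return text[: min(positions)] if positions else text


stops = ["Q:", "\n"]
# invented raw generations, for illustration only
print(repr(cut_at_stop("\nHolden Caulfield leaves school.", stops)))        # ''
print(repr(cut_at_stop("Here is a brief summary:\n1. Holden ...", stops)))  # 'Here is a brief summary:'
```

For multi-paragraph answers, dropping "\n" from the stop list in ask and relying on "Q:" and max_tokens would be one option to try, at the cost of longer generation times.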
