- In the terminal, create the virtual environment ('python3 -m venv .venv'), activate it ('source .venv/bin/activate'), then install the dependencies ('pip3 install -r requirements.txt').
- Make sure the notebook kernel uses the Python interpreter from this .venv.

In [7]:
import os
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    set_global_tokenizer,
)
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
from transformers import AutoTokenizer
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

llm = LlamaCPP(
    # Load a local, pre-downloaded GGUF model via model_path.
    # Alternatively, pass model_url to have the model downloaded automatically.
    # Note: the URL below points to an older GGML (.bin) file; recent llama.cpp
    # builds expect GGUF, so a GGUF URL would be needed instead.
    # model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin",
    model_path="/Users/u1155516/Dropbox/Technical/llms/models/llama/llama-mac-metal/llama.cpp/models/llama-2-13b-chat.Q8_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 20},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /Users/u1155516/Dropbox/Technical/llms/models/llama/llama-mac-metal/llama.cpp/models/llama-2-13b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_mo
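
The settings in the cell above imply a simple token budget. The loader log reports `llama.context_length = 4096`, while the cell passes `context_window=3900` and `max_new_tokens=256`. The following is a minimal sketch of that arithmetic, assuming the prompt and the generated reply share the same context window; `prompt_token_budget` is a hypothetical helper, not part of LlamaIndex or llama.cpp.

```python
def prompt_token_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the reply is reserved."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


MODEL_CONTEXT_LENGTH = 4096  # llama.context_length from the loader log
CONTEXT_WINDOW = 3900        # value passed to LlamaCPP above
MAX_NEW_TOKENS = 256         # value passed to LlamaCPP above

# Safety margin between the configured window and the model's real limit
headroom = MODEL_CONTEXT_LENGTH - CONTEXT_WINDOW
# Space available for the prompt (question plus any retrieved context)
budget = prompt_token_budget(CONTEXT_WINDOW, MAX_NEW_TOKENS)
print(f"headroom: {headroom} tokens, prompt budget: {budget} tokens")
```

Under these assumptions the "wiggle room" is 196 tokens and the prompt can use up to 3644 tokens.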

In [8]:
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)


llama_print_timings:        load time =   12256.89 ms
llama_print_timings:      sample time =      16.20 ms /   140 runs   (    0.12 ms per token,  8641.98 tokens per second)
llama_print_timings: prompt eval time =   12256.66 ms /    79 tokens (  155.15 ms per token,     6.45 tokens per second)
llama_print_timings:        eval time =   31308.99 ms /   139 runs   (  225.24 ms per token,     4.44 tokens per second)
llama_print_timings:       total time =   43799.71 ms /   218 tokens


  Of course! Here's a short poem about cats and dogs:

Cats and dogs, so different yet so dear,
Both bring joy and love, without a fear.

Cats purr and curl up at our feet,
Dogs wag their tails and give sweet treats.

Cats hunt mice with stealthy grace,
Dogs chase after balls with eager face.

Both are loyal friends through and through,
Bringing happiness to me and you.

So here's to our feline and canine friends,
Both equally loved until the very end.
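
The `llama_print_timings` lines printed with this response can be turned into numbers for comparing runs. The sketch below parses the timing block shown above with a regular expression and recomputes the generation speed; `parse_timings` and `tokens_per_second` are hypothetical helper names, and the pattern is only an assumption based on the log lines in this notebook.

```python
import re

# Matches lines such as:
# llama_print_timings:        eval time =   31308.99 ms /   139 runs   (...)
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<label>[a-z ]+?) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Return {label: (milliseconds, count or None)} for each timing line."""
    timings = {}
    for match in TIMING_RE.finditer(log):
        count = match.group("count")
        timings[match.group("label")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return timings


def tokens_per_second(ms: float, count: int) -> float:
    """Convert a duration in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


log = """\
llama_print_timings:        load time =   12256.89 ms
llama_print_timings:      sample time =      16.20 ms /   140 runs   (    0.12 ms per token,  8641.98 tokens per second)
llama_print_timings: prompt eval time =   12256.66 ms /    79 tokens (  155.15 ms per token,     6.45 tokens per second)
llama_print_timings:        eval time =   31308.99 ms /   139 runs   (  225.24 ms per token,     4.44 tokens per second)
llama_print_timings:       total time =   43799.71 ms /   218 tokens
"""

timings = parse_timings(log)
ms, runs = timings["eval"]
print(f"generation speed: {tokens_per_second(ms, runs):.2f} tokens/s")
```

The "eval" entry covers token generation and "prompt eval" covers reading the prompt, so the two speeds are worth tracking separately.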


In [10]:
response_iter = llm.stream_complete("Can you write me a poem about aliens?")
for response in response_iter:
    print(response.delta, end="", flush=True)

Llama.generate: prefix-match hit


  Sure! Here's a poem about aliens:

In the vastness of space and time,
Alien worlds and beings entwine,
Their forms and features so divine,
A mystery that's all mine.

With eyes that glow like stars so bright,
And skin that shimmers with delight,
They roam the cosmos with grace and might,
Their presence fills the space with light.

Their ships, like silver wings so sleek,
Through galaxies they do seek,
Exploring realms unknown to man,
Their journey is but just begun.

With technologies so advanced and grand,
They harness powers of the land,
Their wisdom and knowledge so profound,
Their secrets yet to be found.

In the vastness of space and time,
Alien worlds and beings entwine,
Their presence fills the cosmos with wonder,
A mystery that's all mine.


llama_print_timings:        load time =   12256.89 ms
llama_print_timings:      sample time =      26.18 ms /   223 runs   (    0.12 ms per token,  8516.98 tokens per second)
llama_print_timings: prompt eval time =    1166.07 ms /     7 tokens (  166.58 ms per token,     6.00 tokens per second)
llama_print_timings:        eval time =   27709.43 ms /   222 runs   (  124.82 ms per token,     8.01 tokens per second)
llama_print_timings:       total time =   29356.95 ms /   229 tokens
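
The streaming loop above prints `response.delta`, the newly generated piece carried by each streamed chunk. The sketch below imitates that behaviour without a model so the pattern can be tried in isolation. `FakeChunk` and `fake_stream` are hypothetical stand-ins, not LlamaIndex classes, and the assumption that each chunk carries both the accumulated text and the latest delta should be checked against the LlamaIndex documentation.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed response chunk."""
    text: str   # everything generated so far
    delta: str  # only the newly generated piece


def fake_stream(pieces: Iterable[str]) -> Iterator[FakeChunk]:
    """Yield chunks piece by piece, the way a streaming completion would."""
    so_far = ""
    for piece in pieces:
        so_far += piece
        yield FakeChunk(text=so_far, delta=piece)


def print_and_collect(stream: Iterable[FakeChunk]) -> str:
    """Print each delta as it arrives and return the full text."""
    parts = []
    for chunk in stream:
        print(chunk.delta, end="", flush=True)
        parts.append(chunk.delta)
    print()
    return "".join(parts)


full_text = print_and_collect(fake_stream(["In the ", "vastness ", "of space"]))
```

Collecting the deltas while printing them keeps the complete response available for later cells, which the print-only loop does not.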


In [12]:
# Load the RAG vector index

set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-Llama2-13b").encode
)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Register the embedding model globally so that both branches below use it;
# without this, an index reloaded from storage falls back to LlamaIndex's
# default embedding model instead of the one used to build the index
Settings.embed_model = embed_model

# create the vector store index, reusing persisted storage if it already exists
if not os.path.exists("./storage_au"):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data_au").load_data()
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    # store it for later
    index.storage_context.persist("./storage_au")
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir="./storage_au")
    index = load_index_from_storage(storage_context)


In [15]:
# Request a basic summary of the documents in the corpus
query_engine = index.as_query_engine(llm=llm)
response = query_engine.query(
    "Summarize all of the Hansard documents for 1901 please. "
    "It is a big document set so you might want to summarize by themes, "
    "noting which sessions emphasised which themes and which speakers spoke the most"
)
print(response)

Llama.generate: prefix-match hit

llama_print_timings:        load time =   12256.89 ms
llama_print_timings:      sample time =      26.48 ms /   256 runs   (    0.10 ms per token,  9669.13 tokens per second)
llama_print_timings: prompt eval time =   46044.10 ms /  1976 tokens (   23.30 ms per token,    42.92 tokens per second)
llama_print_timings:        eval time =   48128.24 ms /   255 runs   (  188.74 ms per token,     5.30 tokens per second)
llama_print_timings:       total time =   94712.72 ms /  2231 tokens


  Based on the provided context information, I will summarize the Hansard documents for 1901 by themes and highlight the most prominent speakers. Please note that the summaries are based on the provided documents only and do not include any external knowledge or context.

1. Themes:
	* Opening of Parliament and proclamations (1901-05-09)
	* Debate on the first speech of the Governor-General (1901-05-09)
	* Discussion on the estimates (1901-05-09)
	* Addresses by the Governor-General and the Prime Minister (1901-05-09)
	* Debate on the Writs of Election (1901-05-09)
2. Speakers:
	* Sir GEORGE TURNER (BalaclavaTreasurer)
	* Mr McMILLAN (William)
	* Sir John LANGDON BONYTHON (Kt.)
	* Right Hon. EDMUND BARTON (P.O., K
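
The summary above mentions only one sitting date (1901-05-09). One plausible reason, offered here as an assumption rather than something the output confirms, is that a vector query engine answers from only the few chunks most similar to the question and never sees the rest of the corpus. The toy sketch below shows top-k retrieval by cosine similarity; the chunk names and 3-dimensional vectors are invented for illustration and have nothing to do with the real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunks, key=lambda name: cosine(query_vec, chunks[name]), reverse=True
    )
    return ranked[:k]


# Invented embeddings for four hypothetical chunks
chunks = {
    "sitting_a_part1": [0.9, 0.1, 0.0],
    "sitting_a_part2": [0.8, 0.2, 0.1],
    "sitting_b_part1": [0.1, 0.9, 0.2],
    "sitting_c_part1": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]

# Only the two best-matching chunks would reach the LLM
selected = top_k(query, chunks, k=2)
print(selected)
```

If this is the cause, options worth trying include passing a larger `similarity_top_k` or a summarization-oriented `response_mode` to `as_query_engine`; the exact parameters should be checked against the LlamaIndex documentation. Separately, the reply stops mid-line, which is consistent with the timing log reporting 256 sampled tokens, the same value as `max_new_tokens`.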
