## Create Model

In [1]:
# References:
# https://gpt-index.readthedocs.io/en/latest/api_reference/llms/llama_cpp.html#
# https://gpt-index.readthedocs.io/en/latest/examples/llm/llama_2_llama_cpp.html

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    model_path="../models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_url=None,
    temperature=0.1,
    max_new_tokens=256,
    # the example this cell is adapted from assumes Llama 2's 4096-token
    # context window; we set it slightly lower to leave some headroom
    context_window=3900,
    generate_kwargs={},
    # offload all layers to the GPU
    model_kwargs={"n_gpu_layers": -1},
    # format inputs with the Llama 2 prompt helpers from llama_utils
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ../models/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 14336,  4096,     

In [4]:
# complete(): basic, non-streaming text completion
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)

 Sure, here's a short poem about cats and dogs:

Cats and dogs, they both have fur,
But cats are more independent, that's for sure.
Dogs love to play and run around,
While cats prefer to lounge and be found.

Cats are known for their graceful leaps,
And dogs are great at fetching sticks and balls.
Both animals have unique personalities,
But they both make wonderful companions, that's a guarantee.

So if you're looking for a furry friend,
Consider adopting a cat or a dog, they'll never end.



llama_print_timings:        load time =    4102.45 ms
llama_print_timings:      sample time =      54.85 ms /   140 runs   (    0.39 ms per token,  2552.46 tokens per second)
llama_print_timings: prompt eval time =    4102.36 ms /    76 tokens (   53.98 ms per token,    18.53 tokens per second)
llama_print_timings:        eval time =   25789.85 ms /   139 runs   (  185.54 ms per token,     5.39 tokens per second)
llama_print_timings:       total time =   30183.71 ms
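
The `llama_print_timings` lines above can be turned into numbers, which makes it easier to compare runs. The following is a minimal sketch in plain Python, assuming the log format shown in this notebook; `parse_timings` and `tokens_per_second` are hypothetical helpers, not part of llama.cpp or LlamaIndex.

```python
import re

# Matches lines such as:
# llama_print_timings:        eval time =   25789.85 ms /   139 runs   (...)
_TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Return {stage: (milliseconds, count or None)} for each timing line."""
    timings = {}
    for line in log.splitlines():
        match = _TIMING_RE.search(line)
        if match:
            count = match.group("count")
            timings[match.group("name")] = (
                float(match.group("ms")),
                int(count) if count else None,
            )
    return timings


def tokens_per_second(ms: float, count: int) -> float:
    """Convert an elapsed time in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


# Timing lines copied from the output above.
log = """
llama_print_timings:        load time =    4102.45 ms
llama_print_timings: prompt eval time =    4102.36 ms /    76 tokens (   53.98 ms per token,    18.53 tokens per second)
llama_print_timings:        eval time =   25789.85 ms /   139 runs   (  185.54 ms per token,     5.39 tokens per second)
"""
timings = parse_timings(log)
eval_ms, eval_runs = timings["eval"]
print(f"generation speed: {tokens_per_second(eval_ms, eval_runs):.2f} tokens/s")
```

The `eval` entry covers token generation, while `prompt eval` covers processing of the input prompt, so the two rates are best compared separately.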


In [5]:
# stream_complete(): streaming completion, returns a generator

response_iter = llm.stream_complete("Can you write me a poem about fast cars?")
for response in response_iter:
    print(response.delta, end="", flush=True)

Llama.generate: prefix-match hit


 Sure! Here's a short poem about fast cars:

Speeding down the road,
A sleek and shiny show,
Fast cars roar and race,
Leaving all in their wake.

With engines revving high,
And tires screeching by,
They leave the world behind,
In a blur of speed and sound.

But with power comes risk,
And danger lurks at every twist,
So drive with care and caution,
And always stay in control.

Fast cars may be fun,
But they're not for everyone,
So choose wisely and safely,
And enjoy the thrill of speed.


llama_print_timings:        load time =    4102.45 ms
llama_print_timings:      sample time =      56.09 ms /   143 runs   (    0.39 ms per token,  2549.47 tokens per second)
llama_print_timings: prompt eval time =     846.47 ms /    14 tokens (   60.46 ms per token,    16.54 tokens per second)
llama_print_timings:        eval time =   27043.42 ms /   142 runs   (  190.45 ms per token,     5.25 tokens per second)
llama_print_timings:       total time =   28242.01 ms


In [3]:
from llama_index import Prompt

text_qa_template = Prompt(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

refine_template = Prompt(
    "We have the opportunity to refine the original answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question: {query_str}. "
    "If the context isn't useful, output the original answer again.\n"
    "Original Answer: {existing_answer}"
)
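
To see what a filled-in question-answering prompt looks like, the sketch below substitutes values into the same template text using Python's built-in `str.format`. It is a stand-in for illustration only: it assumes `Prompt.format` performs plain placeholder substitution, and the sample context and question are made up.

```python
# Same template text as text_qa_template above, as a plain string.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Hypothetical inputs, used only for this illustration.
sample_context = "LlamaIndex examples live in the `examples` folder."
sample_question = "Where are the examples?"

# Each {placeholder} is replaced by the keyword argument of the same name.
filled = QA_TEMPLATE.format(context_str=sample_context, query_str=sample_question)
print(filled)
```

The resulting string is what gets passed to the model as a single completion prompt, with the context placed between the two separator lines.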

In [4]:

with open("./starter_example.md", "r", encoding="utf-8") as f:
    text = f.read()

In [5]:
text

'````\n# Starter Tutorial\n\n```{tip}\nMake sure you\'ve followed the [installation](installation.md) steps first.\n````\n\nHere is a starter example for using LlamaIndex.\n\n### Download\n\nLlamaIndex examples can be found in the `examples` folder of the LlamaIndex repository.\nWe first want to download this `examples` folder. An easy way to do this is to just clone the repo:\n\n```bash\n$ git clone https://github.com/jerryjliu/llama_index.git\n```\n\nNext, navigate to your newly-cloned repository, and verify the contents:\n\n```bash\n$ cd llama_index\n$ ls\nLICENSE                data_requirements.txt  tests/\nMANIFEST.in            examples/              pyproject.toml\nMakefile               experimental/          requirements.txt\nREADME.md              llama_index/             setup.py\n```\n\nWe now want to navigate to the following folder:\n\n```bash\n$ cd examples/paul_graham_essay\n```\n\nThis contains LlamaIndex examples around Paul Graham\'s essay, [&#34;What I Worked On&#3

In [None]:
question = "How can I install llama-index?"
prompt = text_qa_template.format(context_str=text, query_str=question)
print('PROMPT: ', prompt)
response = llm.complete(prompt)
print('RESPONSE: ', response.text)

In [15]:
question = "How do I create an index? Write your answer using only code."
prompt = text_qa_template.format(context_str=text, query_str=question)
response_gen = llm.stream_complete(prompt)
for response in response_gen:
    print(response.delta, end="")

Llama.generate: prefix-match hit


 To create an index in LlamaIndex, you can follow these steps:

1. First, clone the LlamaIndex repository using `git clone` command.
2. Navigate to the newly-cloned repository and verify the contents of the folder.
3. Next, navigate to the folder where you want to create your index.
4. Create a new Python file and import the necessary modules: `VectorStoreIndex`, `SimpleDirectoryReader`.
5. Load the data from the directory using `SimpleDirectoryReader` and pass it to `VectorStoreIndex.from_documents()` to build an index over the documents in the directory.
6. You can then use the `as_query_engine()` method to query the index and get responses.
7. To view queries and events using logging, you can import `logging` and set up a logger with `logging.basicConfig()`.
8. To save and load your index, you can use the `persist()` method to persist the index to disk and the `load_index_from_storage()` method to reload it from disk.

Here is an example code snippet that demonstrates how to create 


llama_print_timings:        load time =    4102.45 ms
llama_print_timings:      sample time =     105.92 ms /   256 runs   (    0.41 ms per token,  2416.80 tokens per second)
llama_print_timings: prompt eval time =    1152.90 ms /    17 tokens (   67.82 ms per token,    14.75 tokens per second)
llama_print_timings:        eval time =   54334.17 ms /   255 runs   (  213.08 ms per token,     4.69 tokens per second)
llama_print_timings:       total time =   56176.60 ms


In [6]:
question = "How do I create an index? Write your answer using only code."
existing_answer = """To create an index using LlamaIndex, you need to follow these steps:

1. Download the LlamaIndex repository by cloning it from GitHub.
2. Navigate to the `examples/paul_graham_essay` folder in the cloned repository.
3. Create a new Python file and import the necessary modules: `VectorStoreIndex` and `SimpleDirectoryReader`.
4. Load the documents from the `data` folder using `SimpleDirectoryReader('data').load_data()`.
5. Build the index using `VectorStoreIndex.from_documents(documents)`.
6. To persist the index to disk, use `index.storage_context.persist()`.
7. To reload the index from disk, use the `StorageContext` and `load_index_from_storage` functions.

Note: This answer assumes that you have already installed LlamaIndex and have the necessary dependencies."""
prompt = refine_template.format(context_msg=text, query_str=question, existing_answer=existing_answer)
response_gen = llm.stream_complete(prompt)
for response in response_gen:
    print(response.delta, end="")

AttributeError: 'generator' object has no attribute 'text'
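
The `AttributeError` above is what Python raises when `.text` is read from a generator object rather than from the items it yields. Judging by the cells in this notebook, `stream_complete` returns a generator whose items expose `.delta`. The sketch below reproduces the situation with a stand-in generator (`fake_stream` and `Chunk` are hypothetical, not LlamaIndex classes) and shows one way to collect the full text.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Chunk:
    """Stand-in for a streamed response item; only `delta` is modelled."""
    delta: str


def fake_stream(words: List[str]) -> Iterator[Chunk]:
    """Hypothetical replacement for llm.stream_complete(prompt)."""
    for word in words:
        yield Chunk(delta=word)


response_gen = fake_stream(["Build ", "the ", "index."])

# Reading .text from the generator itself fails, as in the error above.
try:
    response_gen.text
except AttributeError as err:
    print(f"AttributeError: {err}")

# Iterating yields the chunks; joining their deltas gives the complete answer.
full_text = "".join(chunk.delta for chunk in response_gen)
print(full_text)
```

Accumulating the deltas this way gives a string that can be reused later, for example as `existing_answer` in the refine prompt.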