# Llama-v2 7b for Retrieval Augmented Generation

## Dependencies

In [1]:
%pip install langchain chromadb sentence_transformers --user

Note: you may need to restart the kernel to use updated packages.


In [2]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python



## Model Setup

### Model Download

In [15]:
# Download a llama.cpp-optimized model
# A list of models can be found at: https://huggingface.co/TheBloke
# In this case I will use Llama-2-7B-Chat-GGML: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML
!mkdir -p /tmp/models/
!wget -P /tmp/models/ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin

--2023-07-22 05:38:44--  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin
Resolving huggingface.co (huggingface.co)... 65.8.178.118, 65.8.178.27, 65.8.178.93, ...
Connecting to huggingface.co (huggingface.co)|65.8.178.118|:443... connected.
HTTP request sent, awaiting response... 302 Found

### Model Load and Test

In [6]:
from pathlib import Path

from langchain import LLMChain, PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

model_dir = "/tmp/models/"
model_path = Path(model_dir)
model_file = list(model_path.glob("*.bin"))[0]

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
# verbose=True is required for output to be passed to the callback manager.
llm = LlamaCpp(
    model_path=str(model_file.resolve()),
    input={"temperature": 0.0, "max_length": 2000, "top_p": 1},
    callback_manager=callback_manager,
    verbose=True,
)

llama.cpp: loading model from /tmp/models/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5185.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_
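
The model lookup above takes the first `*.bin` file in the directory, which raises a bare `IndexError` if the download did not complete. Below is a minimal sketch of a more defensive lookup. The helper name `find_model_file` and the choice to sort candidates by name are my own assumptions, not part of the original notebook.

```python
from pathlib import Path


def find_model_file(model_dir: str, pattern: str = "*.bin") -> Path:
    """Return the first file in model_dir matching pattern (sorted by name).

    Hypothetical helper: raises a descriptive error instead of IndexError
    when no model file is present.
    """
    candidates = sorted(Path(model_dir).glob(pattern))
    if not candidates:
        raise FileNotFoundError(
            f"No files matching {pattern!r} found in {model_dir!r}"
        )
    return candidates[0].resolve()
```

With this in place, `model_path=str(find_model_file("/tmp/models/"))` could stand in for the `glob(...)[0]` expression.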

In [7]:
prompt = """
Question: Name all the planets in the solar system?
"""
llm(prompt)


Answer: Here are the names of all the planets in our solar system, listed in order from closest to farthest from the Sun:
Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune.


llama_print_timings:        load time =  2605.59 ms
llama_print_timings:      sample time =    24.51 ms /    57 runs   (    0.43 ms per token,  2325.30 tokens per second)
llama_print_timings: prompt eval time =  3979.97 ms /    16 tokens (  248.75 ms per token,     4.02 tokens per second)
llama_print_timings:        eval time =  9830.86 ms /    56 runs   (  175.55 ms per token,     5.70 tokens per second)
llama_print_timings:       total time = 13935.57 ms


'\nAnswer: Here are the names of all the planets in our solar system, listed in order from closest to farthest from the Sun:\nMercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune.'
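
The `llama_print_timings` lines report elapsed milliseconds and a token (or run) count for each phase, so throughput is simply count divided by seconds. As a rough sketch, the log text can be parsed and the rates recomputed. The function name `parse_timings` and the regular expression are assumptions based on the log format shown above, and other llama.cpp versions may print it differently.

```python
import re

# Matches lines such as:
#   llama_print_timings:  eval time =  9830.86 ms /  56 runs  ( ... )
# The "/ N runs|tokens" part is optional (load and total time omit it).
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>.+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Map each phase name to its duration, token count, and tokens/second."""
    stats = {}
    for match in TIMING_RE.finditer(log):
        ms = float(match["ms"])
        count = int(match["count"]) if match["count"] else None
        rate = count / (ms / 1000.0) if count else None
        stats[match["name"]] = {
            "ms": ms,
            "count": count,
            "tokens_per_second": rate,
        }
    return stats
```

Applied to the output above, the `eval` phase works out to 56 / 9.83086 s, or about 5.70 tokens per second, which agrees with the figure llama.cpp prints.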

## Accessing Embeddings Database

In [3]:
import chromadb
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

client = chromadb.PersistentClient(path="./db")
collection = client.get_collection(
    name="airflow_docs_stable", embedding_function=sentence_transformer_ef
)

In [19]:
question = "Python Code to create a Dag Class"
results = collection.query(
    query_texts=[question],
    n_results=1,
)
formatted_result = "\n\n".join(results["documents"][0])
print(formatted_result)

current: def __init__(     dag_folder=None,     include_examples=conf.getboolean("core", "LOAD_EXAMPLES"),     safe_mode=conf.getboolean("core", "DAG_DISCOVERY_SAFE_MODE"),     read_dags_from_db=False, ):     ...   If you were using positional arguments, it requires no change but if you were using keyword arguments, please change store_serialized_dags to read_dags_from_db. Similarly, if you were using DagBag().store_serialized_dags property, change it to DagBag().read_dags_from_db.
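
`collection.query` returns one list of documents per query text, which is why the cell above indexes `results["documents"][0]`. If `n_results` is raised above 1, it may help to number the passages and cap their total length, since the model was loaded with `n_ctx = 512`. The sketch below assumes a hypothetical helper `format_context` and a character budget as a crude stand-in for a token budget; neither comes from the original notebook.

```python
def format_context(results: dict, max_chars: int = 1200) -> str:
    """Join the passages retrieved for the first query, numbered by rank.

    Passages are added in rank order until max_chars would be exceeded.
    The top-ranked passage is always kept, even if it alone is over budget.
    """
    documents = results["documents"][0]  # passages for the first query text
    parts = []
    used = 0
    for rank, doc in enumerate(documents, start=1):
        block = f"[{rank}]\n{doc.strip()}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

With `n_results=1` this reduces to the single passage with a `[1]` label, so it can replace the `"\n\n".join(...)` line without changing what is retrieved.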


## Setting up Retrieval Augmented Generation (RAG)

In [8]:
prompt = (
    "You are a helpful question and answer bot, your task is to provide the best answer to a given user's question.\n"
    "Only use the context below to answer the user's question, if you don't have the necessary information to answer say: 'I don't know!'\n"
    "Context and Question are denoted by ```\n"
    f"Context: ```{formatted_result}```\n\n"
    f"Question: ```{question}?```\n\n"
    "Response:"
)
llm(prompt)
# response = text_generation(prompt)
# print(response[0]["generated_text"].lstrip())

Llama.generate: prefix-match hit


 ```Create a Dag class by inheriting from `airflow.operators.Dagle` and defining the `__init__` method with the required parameters, like so:
class MyDag(Dagle):
    def __init__(self, dag_folder=None, include_examples=conf.getboolean("core", "LOAD_EXAMPLES"), safe_mode=conf.getboolean("core", "DAG_DISCOVERY_SAFE_MODE"), read_dags_from_db=False):
        super().__init__()
If you were using positional arguments, it requires no change but if you were using keyword arguments, please change store_serialized_dags to read_dags_from_db. Similarly, if you were using DagBag().store_serialized_dags property, change it to DagBag().read_dags_from_db.
You can then use the `MyDag` class as a template for creating your own custom DAGs, and pass in any additional parameters or metadata as needed.```


llama_print_timings:        load time =  2605.59 ms
llama_print_timings:      sample time =   103.33 ms /   234 runs   (    0.44 ms per token,  2264.50 tokens per second)
llama_print_timings: prompt eval time = 41304.04 ms /   240 tokens (  172.10 ms per token,     5.81 tokens per second)
llama_print_timings:        eval time = 42683.13 ms /   234 runs   (  182.41 ms per token,     5.48 tokens per second)
llama_print_timings:       total time = 84561.48 ms


' ```Create a Dag class by inheriting from `airflow.operators.Dagle` and defining the `__init__` method with the required parameters, like so:\nclass MyDag(Dagle):\n    def __init__(self, dag_folder=None, include_examples=conf.getboolean("core", "LOAD_EXAMPLES"), safe_mode=conf.getboolean("core", "DAG_DISCOVERY_SAFE_MODE"), read_dags_from_db=False):\n        super().__init__()\nIf you were using positional arguments, it requires no change but if you were using keyword arguments, please change store_serialized_dags to read_dags_from_db. Similarly, if you were using DagBag().store_serialized_dags property, change it to DagBag().read_dags_from_db.\nYou can then use the `MyDag` class as a template for creating your own custom DAGs, and pass in any additional parameters or metadata as needed.```'
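
To reuse the same prompt for other questions, the string assembly above can be wrapped in a small function. This is only a sketch: `build_rag_prompt` is a hypothetical name, the wording is copied from the cell above, and the only behavioural change is that a question mark is appended only when the question does not already end with one.

```python
FENCE = "`" * 3  # the triple-backtick delimiter used in the prompt above


def build_rag_prompt(context: str, question: str) -> str:
    """Assemble the RAG prompt from retrieved context and a user question."""
    question = question.strip()
    if not question.endswith("?"):
        question += "?"  # avoid a doubled "??" when one is already present
    return (
        "You are a helpful question and answer bot, your task is to provide the best answer to a given user's question.\n"
        "Only use the context below to answer the user's question, if you don't have the necessary information to answer say: 'I don't know!'\n"
        f"Context and Question are denoted by {FENCE}\n"
        f"Context: {FENCE}{context}{FENCE}\n\n"
        f"Question: {FENCE}{question}{FENCE}\n\n"
        "Response:"
    )
```

The call then becomes `llm(build_rag_prompt(formatted_result, question))`.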