# RAG with Ollama and Llama 3

To begin, we install the prerequisite libraries.

In [None]:
!pip install -qU \
    datasets==2.14.5 \
    ollama==0.1.9 \
    openai==1.35.3 \
    "semantic-router[local]==0.0.45" \
    qdrant-client==1.10.1

## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv2-semantic-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-semantic-chunks) contains scraped data from many popular ArXiv papers centred around LLMs and GenAI.

In [1]:
from datasets import load_dataset

data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]"
)
data

Dataset({
    features: ['id', 'title', 'content', 'prechunk_id', 'postchunk_id', 'arxiv_id', 'references'],
    num_rows: 10000
})

We loaded the first 10,000 chunks of the dataset, where each chunk is roughly 1-2 paragraphs long. Here is an example of a single record:

In [2]:
data[0]

{'id': '2401.04088#0',
 'title': 'Mixtral of Experts',
 'content': '4 2 0 2 n a J 8 ] G L . s c [ 1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a # Mixtral of Experts Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed Abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts

Next we reshape each record into the format we need: an `id` plus a `metadata` dict holding the `title` and `content` that we will embed.

In [3]:
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "content", "prechunk_id",
    "postchunk_id", "arxiv_id", "references"
])
data

Dataset({
    features: ['id', 'metadata'],
    num_rows: 10000
})
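To make the reshaping concrete, here is a minimal sketch of the same transformation applied to one plain Python dict rather than a `Dataset`. The `sample` record and the `to_record` helper are hypothetical and exist only for illustration; the final line builds the `title: content` string that is later passed to the encoder.

```python
# Hypothetical record with the same columns as the dataset (values are made up).
sample = {
    "id": "0000.00000#0",
    "title": "Example Paper",
    "content": "First chunk of text.",
    "prechunk_id": "",
    "postchunk_id": "0000.00000#1",
    "arxiv_id": "0000.00000",
    "references": [],
}


def to_record(row: dict) -> dict:
    """Keep only `id` and a `metadata` dict holding `title` and `content`."""
    return {
        "id": row["id"],
        "metadata": {"title": row["title"], "content": row["content"]},
    }


record = to_record(sample)
# The text that gets embedded is built from the two metadata fields.
text = f'{record["metadata"]["title"]}: {record["metadata"]["content"]}'
print(record)
print(text)
```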

We need an embedding model to create the vectors used for retrieval. We will use a variation of the `e5-base` model with a longer context length of `4k` tokens. Ideally, run this on a GPU for faster encoding.

In [None]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder(name="dwzhu/e5-base-4k")

We can check whether our `encoder` will use `cpu` or a `cuda` GPU (where available).

In [5]:
encoder.device

'cuda'

We can create embeddings now like so:

In [6]:
embeds = encoder(["this is a test"])

We can view the dimensionality of our returned embeddings, which we'll need soon when initializing our vector index:

In [37]:
dims = len(embeds[0])
dims

768

#### Qdrant
Now we create our vector DB to store our vectors. Here we run Qdrant locally using Docker:

```sudo docker run -d -p 6333:6333 qdrant/qdrant```


When creating the collection, we set the vector `size` equal to the dimensionality of our encoder (`768`) and choose a `distance` metric compatible with the model (here, `Distance.COSINE`).

In [38]:
import time
from qdrant_client import QdrantClient
from qdrant_client.http import models
from tqdm.auto import tqdm

# Define index name and dimensions
index_name = "qdrant-llama-3-rag-1"

# Initialize Qdrant client
client = QdrantClient("http://localhost:6333")

# Check if the collection already exists
existing_collections = client.get_collections().collections
existing_collection_names = [collection.name for collection in existing_collections]

# Check if collection already exists (it shouldn't if this is the first time)
if index_name not in existing_collection_names:
    # If it does not exist, create collection
    client.create_collection(
        collection_name=index_name,
        vectors_config=models.VectorParams(size=dims, distance=models.Distance.COSINE),
    )
    # Wait for collection to be initialized (Qdrant usually initializes quickly)
    time.sleep(1)

# Connect to collection (In Qdrant, you interact directly with the client)
# View collection stats
collection_info = client.get_collection(index_name)
print(collection_info)

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=8 config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=None), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None) 

We can see the collection is currently empty, with a `points_count` of `0`. We can begin populating it with our embeddings.

In [40]:
import uuid

batch_size = 128  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    # find end of batch
    i_end = min(len(data), i + batch_size)
    # create batch
    batch = data[i:i_end]
    # create embeddings
    chunks = [f'{x["title"]}: {x["content"]}' for x in batch["metadata"]]
    embeds = encoder(chunks)
    assert len(embeds) == (i_end - i)

    # prepare points for Qdrant
    points = [
        models.PointStruct(
            # Qdrant IDs must be unsigned ints or UUIDs; hash the full chunk ID
            # so that chunks of the same paper do not overwrite each other
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id)),
            vector=embed,
            payload=metadata
        )
        for chunk_id, embed, metadata in zip(batch["id"], embeds, batch["metadata"])
    ]

    # upsert to Qdrant
    client.upsert(
        collection_name=index_name,
        points=points
    )

  0%|          | 0/8 [00:00<?, ?it/s]
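Qdrant point IDs must be unsigned integers or UUIDs, so string IDs such as `2401.04088#0` need converting. The conversion has to stay unique per chunk: an ID built only from the part before `#` is shared by every chunk of a paper, so each upsert would overwrite the previous one. The sketch below shows one possible option, a deterministic UUID from `uuid.uuid5`. The helper name `chunk_id_to_point_id` and the choice of `uuid.NAMESPACE_URL` are assumptions made for illustration.

```python
import uuid


def chunk_id_to_point_id(chunk_id: str) -> str:
    """Map a chunk ID such as '2401.04088#0' to a deterministic UUID string."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))


ids = ["2401.04088#0", "2401.04088#1", "2401.04088#2"]

# Truncating at '#' collapses all three chunks onto one integer ID.
truncated = {int(i.split("#")[0].replace(".", "")) for i in ids}
# Hashing the full string keeps one distinct ID per chunk.
hashed = {chunk_id_to_point_id(i) for i in ids}

print(len(truncated), len(hashed))  # 1 3
```

Because `uuid5` is deterministic, re-running the ingestion produces the same point IDs, so repeated upserts update existing points instead of duplicating them.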

In [41]:
batch["metadata"]

[{'content': 'We observe that RG-2L’s ranking scores are mostly positively correlated with RG-S(0, 4)’s (Figure 3(a)). However, RG-2L struggles to dis- tinguish query-document pairs with higher (> 3.0) ranking scores from RG-S(0, 4) and scores them al- most equally with scores close to 1.0. This suggests that providing more fine-grained relevance labels helps the LLM differentiate better among some query-document pairs, particularly with the top- ranked documents. When we compare the ranking scores from RG-3L where more than 2 relevance levels are used (Figure 3(b)), there is almost no such “',
  'title': 'Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels'},
 {'content': 'plateau”. The performance of RG-3L and RG-S(0, 4) are also very close. # 6 Conclusion In this work, we explore the use of more fine- grained relevance labels in the prompt for point- wise zero-shot LLM rankers instead of the binary labels used in existing works. We propose
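The `batch` inspected here is simply the last slice produced by the ingestion loop. As a rough illustration of how `range(0, len(data), batch_size)` and `min` carve the rows into slices, the sketch below computes the slice boundaries for a hypothetical 1,000-row dataset; `batch_bounds` is an illustrative helper, not part of any library.

```python
def batch_bounds(n_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs that cover range(n_rows) in order."""
    return [
        (start, min(n_rows, start + batch_size))
        for start in range(0, n_rows, batch_size)
    ]


bounds = batch_bounds(1_000, 128)
print(len(bounds))  # 8 batches
print(bounds[-1])   # (896, 1000): the final batch holds only 104 rows
```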

Now let's test retrieval!

In [44]:
def get_docs(query: str, top_k: int) -> list[str]:
    # Encode query
    xq = encoder([query])

    # Perform the search on Qdrant
    search_result = client.search(
        collection_name=index_name,
        limit=top_k,
        query_vector=xq[0],  # Qdrant expects a single vector here
        with_payload=True  # Include payload in the results

    )
    # Extract document text from search results
    docs = [point.payload['content'] for point in search_result]
    return docs

In [46]:
query = "can you tell me about the Llama LLMs?"
docs = get_docs(query, 5)
print("\n---\n".join(docs))

• LLaMA 2 (Touvron et al., 2023b): Appendix A.6 • FLAN (Wei et al., 2022): Appendix C • (Dodge et al., 2021): Section 4.2 • GLaM (Du et al., 2021): Appendix D An updated version can be found in the LM Con- tamination Index.
---
Agieval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364. Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for infor- mation retrieval: A survey. CoRR, abs/2308.07107.
---
37
---
Query: {query} Document: {document} Output: # 4-Level Relevance Generation (RG-4L) For the following query and document, judge whether they are “Perfectly Relevant”, “Highly Relevant”, “Somewhat Relevant”, or “Not Relevant”. Query: {query} Document: {document} Output: # F.6 Rating Scale Relevance Generation (RG-S(0, k)) From a scale of 0 to {k}, judge the relevance between the query and the document. Query: {query} Document: {document} Output:
---
21
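Conceptually, the search ranks stored vectors by cosine similarity to the query vector and returns the `top_k` best matches. The following brute-force sketch shows that idea on toy 2-dimensional vectors. It is only an illustration: Qdrant typically answers such queries with an approximate HNSW index rather than scanning every vector, and the `cosine` and `top_k` helpers and the toy data are made up here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int):
    """Return the k (doc_id, score) pairs with the highest cosine score."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; the IDs and values are made up.
doc_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
results = top_k([1.0, 0.1], doc_vecs, k=2)
print([doc_id for doc_id, _ in results])  # ['a', 'b']
```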

Our retrieval component works. Now let's try feeding these results into a Llama 3 model served locally by Ollama to produce an answer. This assumes Ollama is running and the `llama3` model has already been pulled.

In [47]:
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API; the key is required by the client
# but unused by Ollama. We use a separate name so we do not overwrite the
# Qdrant `client` that `get_docs` relies on.
llm_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)
MODEL = "llama3"

Now we can generate responses using Llama 3. We'll wrap this logic in a helper function called `generate`:

In [48]:
def generate(query: str, docs: list[str]) -> str:
    # join the retrieved docs first, then place them under the CONTEXT header
    context = "\n---\n".join(docs)
    system_message = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        f"CONTEXT:\n{context}"
    )
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": query}
    ]
    # generate response
    chat_response = llm_client.chat.completions.create(
        model=MODEL,
        messages=messages
    )
    return chat_response.choices[0].message.content

In [None]:
out = generate(query=query, docs=docs)
print(out)
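One detail worth keeping in mind when assembling a system message like the one in `generate`: Python merges adjacent string literals before it applies a method call, so a header literal placed directly before `"\n---\n".join(docs)` becomes part of the separator instead of a prefix. The short sketch below, using made-up documents, contrasts the two forms.

```python
docs = ["doc one", "doc two"]

# Adjacent literals are merged first, so the merged string is the separator.
implicit = ("CONTEXT:\n" "\n---\n".join(docs))
# An explicit `+` joins the docs first and then prepends the header.
explicit = "CONTEXT:\n" + "\n---\n".join(docs)

print(repr(implicit))  # 'doc oneCONTEXT:\n\n---\ndoc two'
print(repr(explicit))  # 'CONTEXT:\ndoc one\n---\ndoc two'
```

With a single retrieved document the implicit form drops the header entirely, since a separator is never emitted for a one-element list.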

---