## Building simple RAG example with LlamaIndex and Jina AI
In this article, you'll learn how to create a RAG (Retrieval-Augmented Generation) system using: 
LlamaIndex, Jina Embeddings, and the Mixtral-8x7B-Instruct-v0.1 language model, which is available on [HuggingFace](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
If you need more information on the Mixtral language model, please visit the [Mistral AI website](https://mistral.ai/news/mixtral-of-experts/) or look at the model card on HuggingFace.

### What is RAG ?
Retrieval Augmented Generation is a technique that combines search with language generation. 
Here's how it works: an external information retrieval system is used to identify documents likely to provide information relevant to the user's query. 
These documents, along with the user's request, are then passed on to a text-generating language model, producing a natural response.

This method enables a language model to respond to questions and access information from a much larger set of documents than it could see otherwise. 
The language model only looks at a few relevant sections of the documents when generating responses, which also helps to reduce inexplicable errors.

#### Getting started
First, install all dependencies:
```shell
!pip install -U  \
    llama-index  \
    python-dotenv \
    llama-index-embeddings-jinaai  \
    llama-index-llms-huggingface  \
    llama-index-vector-stores-qdrant  \
    "huggingface_hub[inference]"  \
    datasets
```

Set up secret key values on `.env` file: 
```bash
JINAAI_API_KEY
HF_INFERENCE_API_KEY
QDRANT_HOST
QDRANT_API_KEY
```

Load all environment variables:

In [1]:
import os
from dotenv import load_dotenv
load_dotenv('./.env')

True

#### 1 - Connect Jina Embeddings and Mixtral LLM
LlamaIndex provides built-in support for the [Jina Embeddings API](https://jina.ai/embeddings/).
To use it, you need to initialize the `JinaEmbedding`object with your API Key and model name.

For the LLM, you need wrap it in a subclass of `llama_index.llms.CustomLLM` to make it compatible with LlamaIndex.

In [3]:
# connect embeddings
from llama_index.embeddings.jinaai import JinaEmbedding

jina_embedding_model = JinaEmbedding(
    model="jina-embeddings-v2-base-en",
    api_key=os.getenv("JINAAI_API_KEY"),
)

# connect LLM
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=os.getenv("HF_INFERENCE_API_KEY"),
)

#### 2 - Prepare data for RAG
Let's download a document that is ready to use.
This small dataset contains the paper on Mistral 7B, already separated into smaller parts.

So, let's load it straight from HuggingFace.
Then, each chunk is transformed into a LlamaIndex document to be stored in Qdrant.

In [4]:
# prepare data for RAG
from datasets import load_dataset

dataset = load_dataset("infoslack/mistral-7b-arxiv-paper-chunked", split="train")
data = dataset.to_pandas()
df = data[['chunk', 'source']]

In [5]:
# transform each chunk into llamaindex document
from llama_index.core import Document

docs = []

for i, row in df.iterrows():
    docs.append(Document(
        text=row['chunk'],
        source=row['source']
    ))
docs[0]

Document(id_='77eac010-7cb6-4dec-8902-590ff185d0da', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Mistral 7B\nAlbert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford,\nDevendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel,\nGuillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux,\nPierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix,\nWilliam El Sayed\nAbstract\nWe introduce Mistral 7B, a 7–billion-parameter language model engineered for\nsuperior performance and efficiency. Mistral 7B outperforms the best open 13B\nmodel (Llama 2) across all evaluated benchmarks, and the best released 34B\nmodel (Llama 1) in reasoning, mathematics, and code generation. Our model\nleverages grouped-query attention (GQA) for faster inference, coupled with sliding\nwindow attention (SWA) to effectively handle sequences of arbitrary length with a\nreduced in

#### 3 - Store data into Qdrant
The code below does the following:
- create a vector store with Qdrant client;
- get an embedding for each chunk using Jina Embeddings API;
- stores all data into Qdrant

> An explanation of hybrid cloud can be inserted here!

In [6]:
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

from llama_index.core import Settings
Settings.embed_model = jina_embedding_model

client = qdrant_client.QdrantClient(
    url = os.getenv("QDRANT_HOST"),
    api_key = os.getenv("QDRANT_API_KEY")
)

vector_store = QdrantVectorStore(client=client, collection_name="demo")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents=docs, storage_context=storage_context)

#### 4 - Prepare a prompt
Here we will create a custom prompt template.
This prompt asks the LLM to use only the context information retrieved from the vector database (Qdrant).

Then, we assemble the query engine using the prompt.

In [7]:
from llama_index.core import PromptTemplate

qa_prompt_tmpl = (
    "Context information is below.\n"
    "-------------------------------"
    "{context_str}\n"
    "-------------------------------"
    "Given the context information and not prior knowledge,"
    "answer the query. Please be concise, and complete.\n"
    "If the context does not contain an answer to the query,"
    "respond with \"I don't know!\"."
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
from llama_index.core import Settings
Settings.embed_model = jina_embedding_model
Settings.llm = mixtral_llm

# retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# response synthesizer
response_synthesizer = get_response_synthesizer(
    llm=mixtral_llm,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

#### Final - Asking questions

Now you can ask questions and receive answers based on the data.

In [8]:
result = query_engine.query("What is so special about Mistral 7B?")
print(result.response)

 Mistral 7B is a large language model that takes a significant step in balancing the goals of high performance and efficiency. It uses a sliding window attention mechanism to reduce the number of operations and memory usage, making it more affordable and efficient for real-world applications.
