## Building a simple RAG example with LlamaIndex and Jina AI
In this article, you'll learn how to create a RAG (Retrieval-Augmented Generation) system using LlamaIndex, Jina Embeddings, and the Mixtral-8x7B-Instruct-v0.1 language model, which is available on [HuggingFace](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
If you need more information on the Mixtral language model, please visit the [Mistral AI website](https://mistral.ai/news/mixtral-of-experts/) or look at the model card on HuggingFace.

### What is RAG?
Retrieval-Augmented Generation is a technique that combines search with language generation. 
Here's how it works: an external information retrieval system is used to identify documents likely to provide information relevant to the user's query. 
These documents, along with the user's request, are then passed on to a text-generating language model, producing a natural response.

This method enables a language model to answer questions using information from a much larger set of documents than it could otherwise see. 
Because the language model only looks at a few relevant sections of the documents when generating a response, this also helps to reduce unsupported or made-up answers.
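
To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex or Qdrant work internally: the word-overlap scoring, the function names, and the three sample sentences are all made up for illustration, and the final call to a language model is left out.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share; keep the top_k."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Combine the retrieved passages and the user's query into one prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge base (illustrative sentences only)
documents = [
    "Clean the lint filter after every drying cycle.",
    "The Eco Drum Clean cycle heats the water to 70 degrees.",
    "Do not run the microwave oven while it is empty.",
]

query = "How hot is the water in the drum clean cycle?"
top_docs = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, top_docs)
print(prompt)  # this prompt is what would be sent to the language model
```

In the rest of this article, the word-overlap retriever is replaced by embeddings stored in Qdrant, and the prompt is sent to Mixtral.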

#### Getting started
First, install all dependencies:
```shell
!pip install -U  \
    llama-index  \
    llama-parse \
    python-dotenv \
    llama-index-embeddings-jinaai  \
    llama-index-llms-huggingface  \
    llama-index-vector-stores-qdrant  \
    "huggingface_hub[inference]"  \
    datasets
```

Set up the secret keys in a `.env` file:
```bash
JINAAI_API_KEY
HF_INFERENCE_API_KEY
LLAMA_CLOUD_API_KEY
QDRANT_HOST
QDRANT_API_KEY
```

Load all environment variables:

```python
import os
from dotenv import load_dotenv

# Read the key/value pairs from .env into the process environment
load_dotenv('./.env')
```

Output:
```
True
```
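
Before going further, it can be useful to confirm that every key listed above was actually loaded. The helper below is a small sketch using only the standard library; `REQUIRED_KEYS` and `missing_keys` are names introduced here for illustration.

```python
import os

# The variable names from the .env file above
REQUIRED_KEYS = [
    "JINAAI_API_KEY",
    "HF_INFERENCE_API_KEY",
    "LLAMA_CLOUD_API_KEY",
    "QDRANT_HOST",
    "QDRANT_API_KEY",
]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]

missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```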

#### 1 - Connect Jina Embeddings and Mixtral LLM
LlamaIndex provides built-in support for the [Jina Embeddings API](https://jina.ai/embeddings/).
To use it, you need to initialize the `JinaEmbedding` object with your API key and model name.

For the LLM, this example uses the `HuggingFaceInferenceAPI` class from LlamaIndex, which calls the Mixtral model through the Hugging Face Inference API using your token.

```python
from llama_index.embeddings.jinaai import JinaEmbedding
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

# connect embeddings
jina_embedding_model = JinaEmbedding(
    model="jina-embeddings-v2-base-en",
    api_key=os.getenv("JINAAI_API_KEY"),
)

# connect LLM
mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
    token=os.getenv("HF_INFERENCE_API_KEY"),
)
```

#### 2 - Prepare data for RAG
This example will use household appliance manuals, which are generally available as [PDF documents](https://www.manua.ls/samsung/wf80f5ebw4w/manual).

In the `data` folder, we have three documents. We will use LlamaParse to extract the text from these PDFs and use it as the knowledge base for a simple RAG.

The [free LlamaIndex Cloud plan](https://cloud.llamaindex.ai/parse) is sufficient for our example.

```python
import nest_asyncio

# LlamaParse runs async code; this lets it run inside the notebook's event loop
nest_asyncio.apply()

from llama_parse import LlamaParse

llamaparse_api_key = os.getenv("LLAMA_CLOUD_API_KEY")

# Parse the three PDF manuals into markdown documents
llama_parse_documents = LlamaParse(api_key=llamaparse_api_key, result_type="markdown").load_data([
    "data/DJ68-00682F_0.0.pdf",
    "data/F500E_WF80F5E_03445F_EN.pdf",
    "data/O_ME4000R_ME19R7041FS_AA_EN.pdf",
])
```

Output:
```
Started parsing the file under job_id 80f488cb-4538-4aa6-8453-7f3c65b9e06e
Started parsing the file under job_id 39e6f47b-ea60-4bc4-8bcc-44c29ed2a272
Started parsing the file under job_id 6801ca57-438b-4146-aada-0f7b1379d2a3
```


#### 3 - Store data into Qdrant
The code below does the following:
- creates a vector store with the Qdrant client;
- gets an embedding for each chunk using the Jina Embeddings API;
- combines `sparse` and `dense` vectors for hybrid search;
- stores all data in Qdrant.

Hybrid search with Qdrant must be enabled when the vector store is first created; to do so, we simply set `enable_hybrid=True`.
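
To illustrate the difference between the two kinds of vectors, here is a small sketch in plain Python. A dense vector stores a value for every dimension, while a sparse vector stores only the non-zero entries. The numbers and token ids are made up, and real embeddings have hundreds or thousands of dimensions.

```python
def dense_dot(a, b):
    """Dot product of two dense vectors (equal-length lists of floats)."""
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: weight} dicts."""
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)

# Dense: every dimension has a value
query_dense = [0.5, 0.0, 0.5]
doc_dense = [1.0, 2.0, 3.0]

# Sparse: only non-zero dimensions are stored; keys are hypothetical token ids
query_sparse = {7: 1.0, 42: 2.0}
doc_sparse = {42: 0.5, 99: 3.0}

print(dense_dot(query_dense, doc_dense))     # 0.5*1.0 + 0.0*2.0 + 0.5*3.0 = 2.0
print(sparse_dot(query_sparse, doc_sparse))  # only id 42 overlaps: 2.0*0.5 = 1.0
```

Dense scores tend to capture semantic similarity, while sparse scores reward exact term matches; hybrid search uses both signals.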


```python
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# LlamaIndex uses OpenAI models by default,
# so set the embedding model to Jina and the LLM to Mixtral
Settings.embed_model = jina_embedding_model
Settings.llm = mixtral_llm
Settings.chunk_size = 512

client = qdrant_client.QdrantClient(
    url=os.getenv("QDRANT_HOST"),
    api_key=os.getenv("QDRANT_API_KEY"),
)

# enable_hybrid=True stores both dense and sparse vectors in the collection
vector_store = QdrantVectorStore(
    client=client, collection_name="demo", enable_hybrid=True, batch_size=20
)

# Chunk the documents, embed each chunk, and store everything in Qdrant
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents=llama_parse_documents,
    storage_context=storage_context,
)
```
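
The `Settings.chunk_size = 512` line above limits the size of each chunk that gets embedded. As a rough illustration of what chunking does, here is a simplified word-based splitter. This is only a sketch: LlamaIndex's own splitter counts tokens rather than words and tries to respect sentence boundaries, and `chunk_words` is a name made up for this example.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is less likely to lose its context.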

#### 4 - Prepare a prompt
Here we will create a custom prompt template.
This prompt asks the LLM to use only the context information retrieved from the vector database (Qdrant).
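
To see what the LLM eventually receives, the sketch below fills a shortened version of such a template with plain `str.format`. The `{context_str}` and `{query_str}` placeholders match the ones used in this article; the two sample chunks are made up, and the way LlamaIndex joins retrieved chunks internally may differ.

```python
# Shortened, illustrative template with the same two placeholders
template = (
    "Context information is below.\n"
    "-------------------------------\n"
    "{context_str}\n"
    "-------------------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Hypothetical retrieved chunks
retrieved_chunks = [
    "The Eco Drum Clean cycle uses hot water.",
    "Clean the debris filter regularly.",
]

context_str = "\n\n".join(retrieved_chunks)
prompt = template.format(
    context_str=context_str,
    query_str="How do I clean the drum?",
)
print(prompt)
```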

**When querying with hybrid mode, we can set `similarity_top_k` and `sparse_top_k` separately:**
- `sparse_top_k` is the number of nodes retrieved from each of the dense and sparse queries.
- `similarity_top_k` controls the final number of returned nodes. With the settings used in this example (`similarity_top_k=2`, `sparse_top_k=12`), we end up with 2 nodes.
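
The sketch below shows how these two parameters can interact: each query returns its own candidate list, the scores are fused, and only the best `top_k` nodes are kept. It uses one possible fusion strategy (min-max normalization followed by a weighted sum) purely as an illustration; the fusion actually applied by the Qdrant integration may differ, and the node ids, scores, and the `alpha` weight are made up.

```python
def normalize(scores):
    """Min-max scale a {node_id: score} dict into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {node_id: 1.0 for node_id in scores}
    return {node_id: (s - low) / (high - low) for node_id, s in scores.items()}

def fuse(dense, sparse, alpha=0.5, top_k=2):
    """Blend normalized dense and sparse scores, then keep the top_k nodes."""
    dense_n, sparse_n = normalize(dense), normalize(sparse)
    fused = {
        node_id: alpha * dense_n.get(node_id, 0.0)
        + (1 - alpha) * sparse_n.get(node_id, 0.0)
        for node_id in set(dense_n) | set(sparse_n)
    }
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical candidates returned by each query (3 per query here)
dense_hits = {"a": 1.0, "b": 0.75, "c": 0.5}
sparse_hits = {"b": 12.0, "c": 8.0, "d": 4.0}

top_nodes = fuse(dense_hits, sparse_hits, alpha=0.5, top_k=2)
print(top_nodes)
```

Node `b` ranks first because it scores well in both lists, which is the main benefit of combining the two searches.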

Then, we assemble the query engine using the prompt.

```python
from llama_index.core import PromptTemplate, Settings, get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Each line ends with "\n" or a space so the pieces do not run together
qa_prompt_tmpl = (
    "Context information is below.\n"
    "-------------------------------\n"
    "{context_str}\n"
    "-------------------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be concise and complete.\n"
    "If the context does not contain an answer to the query, "
    "respond with \"I don't know!\".\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)

Settings.embed_model = jina_embedding_model
Settings.llm = mixtral_llm

# retriever: 12 candidates per dense/sparse query, 2 nodes returned after fusion
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
    sparse_top_k=12,
    vector_store_query_mode="hybrid",
)

# response synthesizer
response_synthesizer = get_response_synthesizer(
    llm=mixtral_llm,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

# query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
```

#### Final - Asking questions

Now you can ask questions and receive answers based on the data.

```python
result = query_engine.query("What temperature should I use for my laundry?")
print(result.response)
```

Output:
```
The water temperature is set to 70 ˚C during the Eco Drum Clean cycle. You cannot change the water temperature. However, the temperature for other cycles is not specified in the context.
```
