<a href="https://colab.research.google.com/github/projecte-aina/rag_notebook/blob/main/RAGDemo_LlamaIndex_Flor6.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation demo using Flor6.3b

This notebook shows how to build a basic Retrieval Augmented Generation (RAG) system for Catalan using [FlorQARAG](https://huggingface.co/projecte-aina/FlorQARAG), a QA-optimized model based on the Flor6.3b foundation model created for the AINA project.
This demo should run on the Tesla GPU instance that Google Colab provides free of charge. You can create a copy of this notebook and query a different PDF document if you wish.
For a more in-depth description of RAG, see this IBM Research blog post: ["What is retrieval-augmented generation?"](https://research.ibm.com/blog/retrieval-augmented-generation-RAG)
The first step is to install all the libraries you are going to need.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙. (This notebook is based on an [original tutorial](https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.ipynb) by the LlamaIndex team.)

In [None]:
%pip install huggingface_hub llama-index-embeddings-huggingface llama-index-llms-huggingface

In [None]:
!pip install llama-index langchain

We now need to set up the embedding model that converts the documents into searchable vectors, and to select the vector store that will store and retrieve them. We will use a small multilingual model from the [intfloat](https://huggingface.co/intfloat/multilingual-e5-small) repo on Hugging Face.
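To make the idea of "searchable vectors" concrete, here is a minimal sketch of what a vector store does: it keeps (vector, text) pairs and ranks them by cosine similarity to a query vector. The `ToyVectorStore` class and the 3-dimensional vectors are hypothetical and for illustration only; they are not the store or the embeddings that LlamaIndex uses in this notebook.

```python
import math


class ToyVectorStore:
    """In-memory store that ranks stored texts by cosine similarity to a query vector."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    def query(self, vector, top_k=2):
        # Score every stored vector, then keep the top_k most similar ones
        scored = [(self._cosine(vector, v), text) for v, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Hypothetical 3-dimensional vectors; a real embedding model produces many more dimensions
store = ToyVectorStore()
store.add([0.9, 0.1, 0.8], "chunk about AI strategy")
store.add([0.0, 1.0, 0.0], "chunk about something else")
store.add([0.7, 0.0, 0.9], "chunk about AI research")

results = store.query([1.0, 0.0, 1.0], top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two chunks whose vectors point in roughly the same direction as the query are returned first, which is the property that retrieval relies on.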

In [None]:
import logging
import sys
from llama_index.core import Settings

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="intfloat/multilingual-e5-small",
    embed_batch_size=2
)



#### Download Data

The document we'll use for the demo is the white paper describing the Generalitat de Catalunya's AI strategy, a 77-page document in Catalan that can be downloaded from [this site](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai). You can use another PDF, as long as it is not a scanned version and can be converted into plain text.

In [None]:
!mkdir -p 'data/estrategia/'
!wget 'https://politiquesdigitals.gencat.cat/web/.content/00-arbre/economia/catalonia-ai/Estrategia_IA_Catalunya_VFinal_CAT.pdf' -O 'data/estrategia/Estrategia_IA_Catalunya.pdf'

#### Load documents, build the VectorStoreIndex

The next step is to convert the document into text and store it in a vector store index. The important parameters are the size of the searchable chunks the document is divided into and the overlap between consecutive chunks, which ensures that the sentences used to answer your questions always come with some relevant surrounding context.
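The interaction between chunk size and overlap can be sketched with a simple sliding window. This is a simplified illustration with a hypothetical helper, `chunk_with_overlap`, working on a plain list of tokens; LlamaIndex's own splitter also tries to respect sentence boundaries, so its real chunks will differ.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous one,
    so consecutive windows share chunk_overlap tokens.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(12))  # stand-in for 12 tokens of text
chunks = chunk_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 5 and an overlap of 2, every chunk begins with the last two tokens of the previous one. The notebook's settings of 512 and 100 follow the same principle at a larger scale.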

In [None]:
# load documents, creates one document per page from the PDF
documents = SimpleDirectoryReader("./data/estrategia/").load_data()

#index documents segmented into chunks of max. 512 tokens
Settings.chunk_size = 512
Settings.chunk_overlap = 100
index = VectorStoreIndex.from_documents(documents)

In [None]:
# inspect one of the loaded documents (index 1, i.e. the second entry in the list)
documents[1]

#### Set up LLM model and prompt

Now you need to set up the Large Language Model that will read the retrieved information and generate an answer, in our case a RAG-optimized version of Flor6.3b from the projecte-aina repo on Hugging Face. This model has been fine-tuned on an instruction set and requires a prompt template that provides a question (the "instruction") and a searchable context (supplied by the query engine based on the similarity between the query embedding and the stored chunks). It then generates an answer based on these elements.
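To see what the model actually receives, the sketch below fills the same layout as the template defined in the next cell, using plain Python string formatting. The `build_prompt` helper and the placeholder texts are hypothetical, and joining chunks with blank lines is an assumption about how the context is assembled; in the notebook, the query engine does this step internally.

```python
# Same layout as the notebook's template, written as a plain Python string
TEMPLATE = "### Instruction:\n{query_str}\n\n### Context\n{context_str}### Answer:\n"


def build_prompt(question, retrieved_chunks):
    """Join the retrieved chunks into one context string and fill the template."""
    # Assumption: chunks are separated by a blank line
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(query_str=question, context_str=context)


prompt = build_prompt(
    "Placeholder question?",
    ["Placeholder chunk A.", "Placeholder chunk B."],
)
print(prompt)
```

The prompt ends with the answer marker, so the text the model generates next is read as the answer.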

In [None]:
from llama_index.core import PromptTemplate

# dolly format prompt template
query_prompt = PromptTemplate(
    "### Instruction:\n{query_str}\n\n### Context\n{context_str}### Answer:\n"
)

In [None]:
import torch

llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": True},
    tokenizer_name="projecte-aina/FlorQARAG",
    model_name="projecte-aina/FlorQARAG",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # half-precision weights reduce memory usage on CUDA; remove this line if you are not using a GPU
    model_kwargs={"torch_dtype": torch.float16}
)

Settings.llm = llm

#### Query Index

Once this setup is in place, we can pass our query to the query engine using the correct prompt format. The engine retrieves the chunks of the document whose embeddings are most similar to the query, and the model generates a response based on those chunks.
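After running a query, it can help to check which chunks the answer was grounded in. The sketch below assumes the response object exposes a `source_nodes` list whose items have a `score` attribute and a `get_content()` method, as LlamaIndex response objects do. The `summarize_sources` helper and the stub response are hypothetical stand-ins so that the snippet runs without loading the model.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars=80):
    """Return a (score, text preview) pair for each chunk attached to a response."""
    rows = []
    for source in response.source_nodes:
        # Flatten line breaks so each preview fits on one line
        text = source.get_content().replace("\n", " ")
        rows.append((source.score, text[:preview_chars]))
    return rows


# Hypothetical stand-in for a real response, for illustration only
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(score=0.87, get_content=lambda: "First retrieved chunk\nwith a line break."),
        SimpleNamespace(score=0.81, get_content=lambda: "Second retrieved chunk."),
    ]
)

for score, preview in summarize_sources(fake_response):
    print(f"{score:.2f}  {preview}")
```

In the notebook you would call `summarize_sources(response)` on the object returned by `query_engine.query(...)`. If an answer looks wrong, low scores or off-topic previews suggest the problem lies in retrieval rather than in generation.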

In [None]:
# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()

#set prompt template to use Dolly format
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": query_prompt}
)

response = query_engine.query("Qué és l’Estratègia d’intel·ligència artificial de Catalunya?")
print(response)


#### Query Index - Streaming

In [None]:
query_engine = index.as_query_engine(streaming=True)
query_engine.update_prompts({"response_synthesizer:text_qa_template": query_prompt})

In [None]:
response_stream = query_engine.query("En Europa, quin país és capdavanter en intel·ligència artificial?")
response_stream.print_response_stream()
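
If you would rather capture the streamed text than only print it, one option is to iterate over the token generator yourself. The sketch below assumes the streaming response exposes a `response_gen` iterator of text pieces, as LlamaIndex streaming responses do. The `collect_stream` helper and the `fake_tokens` generator are hypothetical and let the snippet run without the model.

```python
def collect_stream(token_gen, on_token=None):
    """Consume an iterable of text pieces and return the full text.

    If on_token is given, it is called with each piece as it arrives.
    """
    pieces = []
    for piece in token_gen:
        if on_token is not None:
            on_token(piece)
        pieces.append(piece)
    return "".join(pieces)


# Hypothetical stand-in generator, for illustration only
def fake_tokens():
    yield from ["Streaming ", "keeps ", "the ", "text."]


full_text = collect_stream(
    fake_tokens(),
    on_token=lambda piece: print(piece, end="", flush=True),
)
print()
```

In the notebook this would be `collect_stream(response_stream.response_gen)`. A generator can be consumed only once, so use this instead of `print_response_stream()` for a given query, not in addition to it.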