<a href="https://colab.research.google.com/github/pierfrancescomartinello/NLP-Project/blob/main/rag_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG pipelines applied to Unipa's website


A HuggingFace API key is necessary:
1. generate it at https://huggingface.co/settings/tokens
2. in the notebook, click on the key icon in the left sidebar
3. click "Add new secret" and name it HF_TOKEN, then insert the generated key
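
Once the secret exists, it can be read back in code. The sketch below shows one way to do that with a fallback for runs outside Colab; `read_hf_token` is a hypothetical helper that is not part of the notebook, and the environment-variable fallback is an assumption about how a local run might be configured.

```python
import os


def read_hf_token(name: str = "HF_TOKEN"):
    """Return the token from Colab secrets, else from the environment, else None."""
    try:
        # This module is only importable inside Google Colab.
        from google.colab import userdata

        return userdata.get(name)
    except Exception:
        # Not on Colab, or the secret is missing or notebook access was not granted.
        return os.environ.get(name)


token = read_hf_token()
print("Token found" if token else "Token missing: add HF_TOKEN before running the Llama cells")
```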


<font color="red">**WARNING:**</font> On Google Colab, a session restart (e.g. Runtime -> Restart session and run all) is **necessary** after installing the dependencies, so that the newly installed packages are loaded correctly. It is also recommended to run the entire notebook once from top to bottom. The final section, `Interactive Testing`, is dedicated to interacting with the models.


## Dependencies
- `haystack-ai` is the preview of Haystack 2.0
- `sentence_transformers` is needed for embeddings
- `transformers` is needed to use open-source LLMs
- `accelerate` and `bitsandbytes` are required to use quantized versions of these models (with smaller memory footprint)
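
After the session restart, it can be useful to confirm that the packages listed above are actually visible to the interpreter. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the distribution names are assumed to match the ones passed to `pip` (with `sentence-transformers` written with a hyphen).

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "haystack-ai",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "sentence-transformers",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```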

In [1]:
%%capture
! pip install haystack-ai transformers accelerate bitsandbytes sentence_transformers

## Setup

In [2]:
from IPython.display import Image
from pprint import pprint
import torch
import rich
import random

In [3]:
!wget https://raw.githubusercontent.com/pierfrancescomartinello/NLP-Project/main/output/unipa_dataset.json

--2024-06-19 13:18:51--  https://raw.githubusercontent.com/pierfrancescomartinello/NLP-Project/main/output/unipa_dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7606379 (7.3M) [text/plain]
Saving to: ‘unipa_dataset.json.2’


2024-06-19 13:18:51 (125 MB/s) - ‘unipa_dataset.json.2’ saved [7606379/7606379]



In [4]:
import pandas as pd
import numpy as np

# Load the dataset from the Colab working directory (where wget saved it)
filepath = "/content/unipa_dataset.json"
df = pd.read_json(filepath)
df.columns = ["title", "addr", "text"]

df

Unnamed: 0,title,addr,text
0,Università degli Studi di Palermo,https://www.unipa.it/,
1,Organi di Governo | Università degli Studi di ...,https://www.unipa.it/ateneo/OrganiDiGovernoECo...,
2,Fatturazione elettronica | Università degli St...,https://www.unipa.it/target/imprese/informazio...,Il D.M 55 del 3 aprile 2013 prevede l'obbligo ...
3,Presentazione | Università degli Studi di Palermo,https://www.unipa.it/ateneo/presentazione/,
4,Credits | Università degli Studi di Palermo,https://www.unipa.it/credits.html,I contenuti della home page e delle relative s...
...,...,...,...
6333,Calendari didattici DARCH | Centro per l’innov...,https://www.unipa.it/strutture/cimdu/Calendari...,Calendario didattico DARCH A.A. 2024/2025 Ca...
6334,| Università degli Studi di Palermo,https://www.unipa.it/amministrazione/rettorato...,
6335,Settore Comunicazione e URP | Settore Comunica...,https://www.unipa.it/amministrazione/rettorato...,
6336,Regolamenti per aree tematiche di interesse | ...,https://www.unipa.it/servizi/prevenzionedellac...,REGOLAMENTI PERSONALE DOCENTE E RICERCATORE


## Preprocessing

In [5]:
import re

# It has been noticed that there are some entries that consist only of a date in a particular format
def remove_dates(s: str) -> str:
    regex = r"\d{1,2}-(gen|feb|mar|apr|mag|giu|lug|ago|set|ott|nov|dic)-\d{4}"

    return re.sub(regex, "", s)

# Empty strings are replaced with NaN so that dropna can be used
def remove_empty_docs(df: pd.DataFrame) -> pd.DataFrame:
    df["text"] = df["text"].replace("", np.nan)
    df.dropna(subset=["text"], inplace=True)

    return df


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df["text"] = df["text"].apply(remove_dates)
    df["text"] = df["text"].apply(lambda x: x.strip())

    df = remove_empty_docs(df)


    # Specify a minimum document length:
    # most documents below the threshold tend to be garbage
    df = df[df["text"].str.len() >= 200]

    # Reassign instead of using inplace=True, since df is now a filtered slice
    df = df.reset_index(drop=True)

    return df
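
To make the date-removal step concrete, the sketch below applies the same regular expression to a few strings. The sample strings are invented for illustration and do not come from the dataset.

```python
import re

# Same pattern as in remove_dates: day, Italian month abbreviation, year
DATE_RE = r"\d{1,2}-(gen|feb|mar|apr|mag|giu|lug|ago|set|ott|nov|dic)-\d{4}"


def remove_dates(s: str) -> str:
    return re.sub(DATE_RE, "", s)


samples = [
    "19-giu-2024",                   # date-only entry: becomes empty
    "Avviso 3-mar-2023 pubblicato",  # date embedded in text: only the date is dropped
    "Nessuna data qui",              # no date: left unchanged
]
cleaned = [remove_dates(s).strip() for s in samples]
print(cleaned)
```

A date-only entry collapses to an empty string, which is why the empty-document filter runs after the date removal.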

In [6]:
df = preprocess(df)
df

Unnamed: 0,title,addr,text
0,Fatturazione elettronica | Università degli St...,https://www.unipa.it/target/imprese/informazio...,Il D.M 55 del 3 aprile 2013 prevede l'obbligo ...
1,Credits | Università degli Studi di Palermo,https://www.unipa.it/credits.html,I contenuti della home page e delle relative s...
2,Sostegno allo studio | Centro per l’innovazion...,https://www.unipa.it/strutture/cimdu/Sostegno-...,"Da 15 anni ItaStra, Scuola di Lingua Italiana ..."
3,PNRR | PNRR | Università degli Studi di Palermo,https://www.unipa.it/progetti/pnrr/,Il Piano Nazionale di Ripresa e Resilienza - P...
4,Corsi di preparazione alle prove di accesso A....,https://www.unipa.it/Corsi-di-preparazione-all...,Sono aperte le iscrizioni all’edizione inverna...
...,...,...,...
2771,U.O. Didattica e Internazionalizzazione - Desc...,https://www.unipa.it/strutture/cimdu/U.O.-Dida...,Il Responsabile dell’U.O. Didattica e Internaz...
2772,Ricercatori neoassunti | Università degli Stud...,https://www.unipa.it/strutture/cimdu/ricercato...,Per restare sempre aggiornati è possiblie iscr...
2773,Declaratorie della U.O. | Centro per l’innovaz...,https://www.unipa.it/strutture/cimdu/Declarato...,La U.O. Didattica e Internazionalizzazione si ...
2774,Calendari didattici DARCH | Centro per l’innov...,https://www.unipa.it/strutture/cimdu/Calendari...,Calendario didattico DARCH A.A. 2024/2025 Cal...


In [7]:
from haystack.dataclasses import Document

# Build the Haystack Documents; titles and URLs are stored as metadata
titles = list(df["title"].values)
texts = list(df["text"].values)
urls = list(df["addr"].values)

raw_docs = []
for title, text, url in zip(titles, texts, urls):
    raw_docs.append(Document(content=text, meta={"name": title or "", "url": url}))

## Indexing Pipeline

In [8]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import (
    SentenceTransformersTextEmbedder,
    SentenceTransformersDocumentEmbedder,
)
from haystack.components.generators import HuggingFaceLocalGenerator, HuggingFaceAPIGenerator
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.builders import PromptBuilder
from haystack.utils import ComponentDevice, Secret

In [9]:
# An in-memory document store is used here; for larger collections a different backend (e.g. FAISSDocumentStore) would be more appropriate
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
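
As a reminder of what the `"cosine"` setting measures, here is a minimal standard-library sketch of cosine similarity. It illustrates the formula only and is not the document store's internal implementation; the toy vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer vector, same direction as the query
orthogonal = [0.0, 2.0]      # shares no direction with the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because the score depends only on direction, the magnitude of an embedding does not affect the ranking.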

In [10]:
# Creation of the indexing pipeline

indexing = Pipeline() # Empty Pipeline
# Clean the documents (by default, DocumentCleaner removes empty lines and extra whitespace)
indexing.add_component("cleaner", DocumentCleaner())

# Split each document into chunks of two sentences
indexing.add_component(
    "splitter", DocumentSplitter(split_by="sentence", split_length=2)
)

# Compute an embedding for each chunk
indexing.add_component(
    "doc_embedder",
    SentenceTransformersDocumentEmbedder(
        model="thenlper/gte-large", # Embedding model
        device=ComponentDevice.from_str("cuda:0"), # Run on the GPU
        meta_fields_to_embed=["name"], # Embed the page title too; it is stored under the "name" meta key
    ),
)

# Write the embedded chunks to the document store
indexing.add_component(
    "writer",
    DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE),
)

# Cleaner -> Splitter -> Document Embedder -> Writer
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "doc_embedder")
indexing.connect("doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f9bbd6e7a30>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - doc_embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> doc_embedder.documents (List[Document])
  - doc_embedder.documents -> writer.documents (List[Document])
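
To illustrate what splitting by sentence with `split_length=2` produces, here is a simplified sketch in plain Python. It assumes a naive sentence boundary (punctuation followed by whitespace) and is not how `DocumentSplitter` detects sentences; `chunk_sentences` and the sample text are hypothetical.

```python
import re


def chunk_sentences(text: str, split_length: int = 2):
    """Group consecutive sentences into chunks of at most split_length sentences."""
    # Naive boundary: split after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


text = "Prima frase. Seconda frase. Terza frase. Quarta frase. Quinta frase."
chunks = chunk_sentences(text)
for chunk in chunks:
    print(chunk)
```

Five sentences yield three chunks, the last one holding the single leftover sentence. This is why the number of documents in the store is much larger than the number of pages.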

In [11]:
# Run the indexing pipeline: the raw documents are cleaned, split, embedded and stored
indexing.run({"cleaner": {"documents": raw_docs}})



Batches:   0%|          | 0/948 [00:00<?, ?it/s]

{'writer': {'documents_written': 30314}}

### Information about the embeddings

Let's inspect the total number of chunked Documents and examine a Document

In [12]:
print(f"We have a total of {len(document_store.filter_documents())} documents in the document store")
print(f"The size of the embeddings is: {len(document_store.filter_documents()[0].embedding)}")

We have a total of 27842 documents in the document store
The size of the embeddings is: 1024


In [13]:
pprint(document_store.filter_documents()[0])

Document(id=c81160b1c9bd873b1ae604c85641fd6e8290df7a920af385697caf26e8afec58, content: 'Il D.M 55 del 3 aprile 2013 prevede l'obbligo della fatturazione elettronica in tutti i rapporti con...', meta: {'name': 'Fatturazione elettronica | Università degli Studi di Palermo', 'url': 'https://www.unipa.it/target/imprese/informazioni/fatturazione-elettronica/', 'source_id': '78eb0a1349eff5f5fda46f5d6efeeffc87096deee81660c984d1f79334123146', 'page_number': 1}, embedding: vector of size 1024)


## RAG Pipeline

Run the cells in this section only once. Each component (generator, prompt builder, etc.) can be bound to only one pipeline, so if these cells are executed more than once, the components have to be re-created (for example under new variable names) before being added to a pipeline again.

In [15]:
# Because of limited resources, the model is loaded with 4-bit quantization
zephyr_generator = HuggingFaceLocalGenerator(
    "HuggingFaceH4/zephyr-7b-beta", # Name of the model
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
    },
    generation_kwargs={"max_new_tokens": 500},
)

In [16]:
# Warm up the generator: this loads the model into memory before the first query
zephyr_generator.warm_up()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
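
The deprecation warning above asks for a `BitsAndBytesConfig` object passed through `quantization_config`. A possible equivalent of the four quantization settings used for Zephyr is sketched below. It assumes `transformers` and `bitsandbytes` are installed as in the Dependencies cell. It has not been run against `HuggingFaceLocalGenerator` here, so treat the `model_kwargs` wiring as an assumption.

```python
import torch
from transformers import BitsAndBytesConfig

# Same settings as the load_in_4bit / bnb_4bit_* entries used for Zephyr
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumption: this dict would replace the "model_kwargs" entry
# inside huggingface_pipeline_kwargs
model_kwargs = {"quantization_config": quantization_config}
```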

### `Llama-3-8B`

In [17]:
# The model is served remotely through the HuggingFace serverless Inference API,
# so no local quantization or warm-up is needed
from google.colab import userdata

llama_generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                                    api_params={"model": "meta-llama/Meta-Llama-3-8B-Instruct"},
                                    token = Secret.from_token(userdata.get("HF_TOKEN")))

In [18]:
zephyr_prompt_template = """<|system|>Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  </s>
<|assistant|>
"""
# A PromptBuilder instance can belong to only one Pipeline, so each pipeline gets its own builder
zephyr_prompt_builder = PromptBuilder(template=zephyr_prompt_template)


llama_prompt_template = """
    Question : {{query}}
    Context:
    {% for doc in documents %}
    {{ doc.content }} URL:{{ doc.meta['url'] }}
    {% endfor %};

"""
llama_prompt_builder = PromptBuilder(template=llama_prompt_template)

### Pipeline Creation

In [19]:
zephyr_rag = Pipeline() # Empty Pipeline
# Text embedder
zephyr_rag.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="thenlper/gte-large", # This is the model for the sentence transformer
        device=ComponentDevice.from_str("cuda:0") # We use the GPU
    ),
)
# Embedding Retriever
zephyr_rag.add_component(
    "retriever", InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=5 #Find the 5 most similar documents to the query
    )
)
zephyr_rag.add_component("prompt_builder", zephyr_prompt_builder)
zephyr_rag.add_component("llm", zephyr_generator)

# Text Embedder -> Retriever -> Prompt Builder -> LLM
zephyr_rag.connect("text_embedder", "retriever")
zephyr_rag.connect("retriever.documents", "prompt_builder.documents")
zephyr_rag.connect("prompt_builder.prompt", "llm.prompt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f9ab7bcdf00>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: HuggingFaceLocalGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)

In [20]:
llama_rag = Pipeline() # Empty Pipeline
# Text embedder
llama_rag.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="thenlper/gte-large",# This is the model for the sentence transformer
        device=ComponentDevice.from_str("cuda:0")# We use the GPU
    ),
)
# Embedding Retriever
llama_rag.add_component(
    "retriever", InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=5 #Find the 5 most similar documents to the query
    )
)
llama_rag.add_component("prompt_builder", llama_prompt_builder)
llama_rag.add_component("llm", llama_generator)

# Text Embedder -> Retriever -> Prompt Builder -> LLM
llama_rag.connect("text_embedder", "retriever")
llama_rag.connect("retriever.documents", "prompt_builder.documents")
llama_rag.connect("prompt_builder.prompt", "llm.prompt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f9ac3ab55d0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: HuggingFaceAPIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)

## Questions

In [21]:
def get_generative_answer(query, model_object):
    # Run the given RAG pipeline: the query goes both to the embedder and to the prompt builder
    results = model_object.run(
        {"text_embedder": {"text": query}, "prompt_builder": {"query": query}}
    )
    # Print the first reply produced by the LLM
    answer = results["llm"]["replies"][0]
    rich.print(answer)

In [22]:
def query_both_models(query: str):
    print("ZEPHYR:")
    get_generative_answer(query, zephyr_rag)
    print("\n LLAMA:")
    get_generative_answer(query, llama_rag)
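
When the answers need to be stored or compared rather than only printed, the reply can be pulled out of the pipeline result as a plain value. The sketch below assumes the result layout used above (`results["llm"]["replies"]`); `first_reply` is a hypothetical helper and the sample dictionaries are placeholders, not real model output.

```python
def first_reply(results: dict, component: str = "llm"):
    """Return the first reply of the given component, or None if there is none."""
    replies = results.get(component, {}).get("replies") or []
    return replies[0] if replies else None


# Placeholder results mimicking the structure returned by Pipeline.run
with_answer = {"llm": {"replies": ["example answer"]}}
without_answer = {"llm": {"replies": []}}

print(first_reply(with_answer))
print(first_reply(without_answer))
```

Returning `None` for an empty reply list avoids the `IndexError` that direct indexing with `[0]` would raise.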

## Test questions

In [23]:
query_both_models("Tell me about Data, Algorithms and Machine Intelligence")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
query_both_models("Dimmi gli eventi dell'Università di Palermo del 2024")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [25]:
query_both_models("Parlami di Giorgio Ausiello")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [26]:
query_both_models("What were the deans of University of Palermo?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [27]:
query_both_models("Quali materie sono insegnate alla Laurea Triennale in Informatica dell'Università di Palermo?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [28]:
query_both_models("Dimmi gli orari della Biblioteca del Dipartimento di Matematica e Informatica dell'Università di Palermo")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [29]:
query_both_models("Dimmi gli eventi e i seminari del Dipartimento di Matematica e Informatica dell'Università di Palermo del 2023")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [30]:
query_both_models("Quali professori dell'Università di Palermo che insegnano materie che contengono la parola 'Programmazione'?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [31]:
query_both_models("Explain the history of the University of Palermo")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [32]:
query_both_models("Che cos'è il Palazzo Steri?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [33]:
query_both_models("Who is Massimo Midiri?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



 LLAMA:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

## Interactive Testing

### Zephyr

In [None]:
raise Exception("Execution has successfully terminated! Feel free to inspect the answers above or manually query the system below.")

In [None]:
try:
    while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
        print(query)
        get_generative_answer(query, zephyr_rag)
        print("\n\n")
except KeyboardInterrupt:
    pass

### Llama

In [None]:
try:
    while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
        print(query)
        get_generative_answer(query, llama_rag)
        print("\n\n")
except KeyboardInterrupt:
    pass

### Both

In [None]:
try:
    while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
        print(query)
        query_both_models(query)
        print("\n\n")
except KeyboardInterrupt:
    pass

______