<a href="https://colab.research.google.com/github/pierfrancescomartinello/NLP-Project/blob/main/rag_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG pipelines applied to Unipa's website


A Hugging Face access token is required:
1. Generate it at https://huggingface.co/settings/tokens
2. In the notebook, click the key icon in the left sidebar
3. Click "Add new secret", name it HF_TOKEN, then paste the generated token
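
As a minimal sketch of how the stored secret could then be read from code: on Colab, secrets are exposed through `google.colab.userdata`. The helper name `load_hf_token` and the environment-variable fallback for runs outside Colab are assumptions, not part of the original notebook.

```python
import os


def load_hf_token(secret_name: str = "HF_TOKEN"):
    """Return the token from Colab secrets, or from the environment as a fallback."""
    try:
        # Only importable inside Google Colab
        from google.colab import userdata

        return userdata.get(secret_name)
    except Exception:
        # Assumption: outside Colab the token is exported as an environment variable
        return os.environ.get(secret_name)


token = load_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN not set")
```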


<font color="red">**WARNING:**</font> On Google Colab, a session restart (e.g. Runtime -> Restart session and run all) is **necessary** after installing the dependencies, so that the newly installed packages are loaded correctly.


## Dependencies
- `haystack-ai` is the package for Haystack 2.x
- `sentence_transformers` is needed for embeddings
- `transformers` is needed to use open-source LLMs
- `accelerate` and `bitsandbytes` are required to use quantized versions of these models (with a smaller memory footprint)
- `ragas-haystack` is required to compute evaluation metrics for the RAG pipelines
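
After the install cell and the session restart, a quick check such as the following sketch can confirm which of the packages listed above are visible to the interpreter. It relies only on the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = [
    "haystack-ai",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "sentence_transformers",
    "ragas-haystack",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```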

In [1]:
%%capture
! pip install haystack-ai transformers accelerate bitsandbytes sentence_transformers ragas-haystack

## Setup

In [2]:
from IPython.display import Image
from pprint import pprint
import torch
import rich
import random

In [3]:
!wget https://raw.githubusercontent.com/pierfrancescomartinello/NLP-Project/main/output/unipa_dataset.json

--2024-06-19 09:39:07--  https://raw.githubusercontent.com/pierfrancescomartinello/NLP-Project/main/output/unipa_dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7606379 (7.3M) [text/plain]
Saving to: ‘unipa_dataset.json’


2024-06-19 09:39:08 (326 MB/s) - ‘unipa_dataset.json’ saved [7606379/7606379]



In [4]:
import pandas as pd
import numpy as np

# Load the dataset downloaded above (this path is valid on Google Colab)
filepath = "/content/unipa_dataset.json"
df = pd.read_json(filepath)
df.columns = ["title", "addr", "text"]

df

Unnamed: 0,title,addr,text
0,Università degli Studi di Palermo,https://www.unipa.it/,
1,Organi di Governo | Università degli Studi di ...,https://www.unipa.it/ateneo/OrganiDiGovernoECo...,
2,Fatturazione elettronica | Università degli St...,https://www.unipa.it/target/imprese/informazio...,Il D.M 55 del 3 aprile 2013 prevede l'obbligo ...
3,Presentazione | Università degli Studi di Palermo,https://www.unipa.it/ateneo/presentazione/,
4,Credits | Università degli Studi di Palermo,https://www.unipa.it/credits.html,I contenuti della home page e delle relative s...
...,...,...,...
6333,Calendari didattici DARCH | Centro per l’innov...,https://www.unipa.it/strutture/cimdu/Calendari...,Calendario didattico DARCH A.A. 2024/2025 Ca...
6334,| Università degli Studi di Palermo,https://www.unipa.it/amministrazione/rettorato...,
6335,Settore Comunicazione e URP | Settore Comunica...,https://www.unipa.it/amministrazione/rettorato...,
6336,Regolamenti per aree tematiche di interesse | ...,https://www.unipa.it/servizi/prevenzionedellac...,REGOLAMENTI PERSONALE DOCENTE E RICERCATORE


## Preprocessing

In [5]:
import re

# Some entries consist only of a date in the format DD-mmm-YYYY (Italian month abbreviations)
def remove_dates(s: str) -> str:
    regex = r"\d{1,2}-(gen|feb|mar|apr|mag|giu|lug|ago|set|ott|nov|dic)-\d{4}"

    return re.sub(regex, "", s)

# Empty strings are replaced with NaN so that dropna can be used
def remove_empty_docs(df: pd.DataFrame) -> pd.DataFrame:
    df["text"] = df["text"].replace("", np.nan)
    df.dropna(subset=["text"], inplace=True)

    return df


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df["text"] = df["text"].apply(remove_dates)
    df["text"] = df["text"].apply(lambda x: x.strip())

    df = remove_empty_docs(df)

    # Specify a minimum document length:
    # most documents below the threshold tend to be garbage
    df = df[df["text"].str.len() >= 200]

    # Reassign instead of modifying the filtered slice in place
    df = df.reset_index(drop=True)

    return df

In [6]:
df = preprocess(df)
df

Unnamed: 0,title,addr,text
0,Fatturazione elettronica | Università degli St...,https://www.unipa.it/target/imprese/informazio...,Il D.M 55 del 3 aprile 2013 prevede l'obbligo ...
1,Credits | Università degli Studi di Palermo,https://www.unipa.it/credits.html,I contenuti della home page e delle relative s...
2,Sostegno allo studio | Centro per l’innovazion...,https://www.unipa.it/strutture/cimdu/Sostegno-...,"Da 15 anni ItaStra, Scuola di Lingua Italiana ..."
3,PNRR | PNRR | Università degli Studi di Palermo,https://www.unipa.it/progetti/pnrr/,Il Piano Nazionale di Ripresa e Resilienza - P...
4,Corsi di preparazione alle prove di accesso A....,https://www.unipa.it/Corsi-di-preparazione-all...,Sono aperte le iscrizioni all’edizione inverna...
...,...,...,...
2771,U.O. Didattica e Internazionalizzazione - Desc...,https://www.unipa.it/strutture/cimdu/U.O.-Dida...,Il Responsabile dell’U.O. Didattica e Internaz...
2772,Ricercatori neoassunti | Università degli Stud...,https://www.unipa.it/strutture/cimdu/ricercato...,Per restare sempre aggiornati è possiblie iscr...
2773,Declaratorie della U.O. | Centro per l’innovaz...,https://www.unipa.it/strutture/cimdu/Declarato...,La U.O. Didattica e Internazionalizzazione si ...
2774,Calendari didattici DARCH | Centro per l’innov...,https://www.unipa.it/strutture/cimdu/Calendari...,Calendario didattico DARCH A.A. 2024/2025 Cal...
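
The effect of the preprocessing can be illustrated on a handful of made-up pages. The sketch below re-implements the same two rules (date removal, 200-character minimum) with the standard library only; the helper `keep_text` and the toy inputs are hypothetical and are not part of the dataset.

```python
import re

DATE_RE = re.compile(r"\d{1,2}-(gen|feb|mar|apr|mag|giu|lug|ago|set|ott|nov|dic)-\d{4}")
MIN_LENGTH = 200  # same threshold used in preprocess


def keep_text(text: str):
    """Return the cleaned text, or None if the page would be discarded."""
    cleaned = DATE_RE.sub("", text).strip()
    return cleaned if len(cleaned) >= MIN_LENGTH else None


# Hypothetical pages: a date-only entry, an empty page, a short page and a long one
toy_pages = ["12-giu-2024", "", "short page", "a" * 250 + " 3-mag-2023"]
kept = [t for t in map(keep_text, toy_pages) if t is not None]
print(f"{len(kept)} of {len(toy_pages)} pages kept")
```

Only the long page survives, and its trailing date is stripped before the length check.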


In [7]:
from haystack.dataclasses import Document

# Build the Document objects; titles and URLs are stored as metadata
titles = list(df["title"].values)
texts = list(df["text"].values)
urls = list(df["addr"].values)

raw_docs = []
for title, text, url in zip(titles, texts, urls):
    raw_docs.append(Document(content=text, meta={"name": title or "", "url": url}))

## Indexing Pipeline

In [8]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import (
    SentenceTransformersTextEmbedder,
    SentenceTransformersDocumentEmbedder,
)
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice

In [9]:
# An in-memory document store is used here; a different kind, such as a FAISS-backed one, would be better suited to larger collections
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
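
For reference, the `"cosine"` similarity selected above compares two embeddings by the angle between them, ignoring their magnitude. The following is a plain-Python sketch of the formula, not the code Haystack runs internally; the function name is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```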

In [10]:
# Creation of the indexing pipeline

indexing = Pipeline() # Empty Pipeline
# Clean the documents (by default: remove empty lines and extra whitespace)
indexing.add_component("cleaner", DocumentCleaner())

# Split each document into chunks of two sentences
indexing.add_component(
    "splitter", DocumentSplitter(split_by="sentence", split_length=2)
)

# Turn each chunk into an embedding
indexing.add_component(
    "doc_embedder",
    SentenceTransformersDocumentEmbedder(
        model="thenlper/gte-large", # Embedding model
        device=ComponentDevice.from_str("cuda:0"), # Run on the GPU
        meta_fields_to_embed=["name"], # Embed the page title too; it is stored under the "name" meta key
    ),
)

# Write the embedded chunks to the document store
indexing.add_component(
    "writer",
    DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE),
)

# Cleaner -> Splitter -> Document Embedder -> Writer
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "doc_embedder")
indexing.connect("doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7a8a97aaea40>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - doc_embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> doc_embedder.documents (List[Document])
  - doc_embedder.documents -> writer.documents (List[Document])
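
To make the splitting step concrete, here is a rough standard-library sketch of what chunking by `split_length=2` sentences means. The sentence boundary rule (whitespace after `.`, `!` or `?`) is a simplifying assumption; Haystack's `DocumentSplitter` uses its own logic, so actual chunks may differ.

```python
import re


def chunk_by_sentences(text: str, split_length: int = 2):
    """Group naive sentences into chunks of `split_length` sentences each."""
    # Naive boundary: whitespace preceded by '.', '!' or '?'
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i : i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


for chunk in chunk_by_sentences("First sentence. Second one! Third? Fourth. Fifth."):
    print(repr(chunk))
```

With five sentences and `split_length=2`, the last chunk holds the single leftover sentence.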

In [11]:
# Run the pipeline: the raw documents are cleaned, split, embedded and written to the store
indexing.run({"cleaner": {"documents": raw_docs}})


{'writer': {'documents_written': 30314}}

Let's inspect the total number of chunked Documents and examine a Document

### Information about the embeddings

In [12]:
print(f"We have a total of {len(document_store.filter_documents())} documents in the document store")
print(f"The size of the embeddings is: {len(document_store.filter_documents()[0].embedding)}")

We have a total of 27842 documents in the document store
The size of the embeddings is: 1024
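
These two numbers give a rough lower bound on the memory taken by the vectors alone. The sketch below assumes 4 bytes per value (float32); embeddings held as Python lists of floats, as an in-memory store may do, would occupy more.

```python
N_CHUNKS = 27842        # documents reported in the store above
EMBEDDING_DIM = 1024    # embedding size reported above
BYTES_PER_VALUE = 4     # assumption: float32 storage

total_bytes = N_CHUNKS * EMBEDDING_DIM * BYTES_PER_VALUE
print(f"{total_bytes / 1024**2:.1f} MiB for the raw vectors")
```

This works out to roughly 109 MiB, which is why an in-memory store is still workable at this scale.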


In [13]:
pprint(document_store.filter_documents()[0])

Document(id=c81160b1c9bd873b1ae604c85641fd6e8290df7a920af385697caf26e8afec58, content: 'Il D.M 55 del 3 aprile 2013 prevede l'obbligo della fatturazione elettronica in tutti i rapporti con...', meta: {'name': 'Fatturazione elettronica | Università degli Studi di Palermo', 'url': 'https://www.unipa.it/target/imprese/informazioni/fatturazione-elettronica/', 'source_id': '78eb0a1349eff5f5fda46f5d6efeeffc87096deee81660c984d1f79334123146', 'page_number': 1}, embedding: vector of size 1024)


## RAG Pipeline

In [14]:
from haystack.components.generators import HuggingFaceLocalGenerator

### `zephyr-7b-beta`

In [15]:
# Given the limited resources, the model is loaded with 4-bit (NF4) quantization
zephyr_generator = HuggingFaceLocalGenerator(
    # Name of the model
    "HuggingFaceH4/zephyr-7b-beta",
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
    },
    generation_kwargs={"max_new_tokens": 500},
)
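
A back-of-the-envelope estimate shows why 4-bit loading matters on a small GPU. The sketch counts only the weights of a nominal 7-billion-parameter model and ignores activations, the KV cache and quantization metadata, so real usage is higher; the function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal size of a "7B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB at 16 bits to about 3.5 GB at 4 bits.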

In [16]:
zephyr_generator.warm_up()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



### `falcon-7b`

In [17]:
# Given the limited resources, the model is loaded with 4-bit (NF4) quantization
falcon_generator = HuggingFaceLocalGenerator(
    # Name of the model
    "tiiuae/falcon-7b",
    huggingface_pipeline_kwargs={
        "device_map": "auto",
        "model_kwargs": {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
    },
    generation_kwargs={"max_new_tokens": 500},
)

In [18]:
falcon_generator.warm_up()

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



In [19]:
from haystack.components.builders import PromptBuilder

prompt_template = """<|system|>Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  </s>
<|assistant|>
"""
# A PromptBuilder instance can belong to only one Pipeline, so a separate instance is created for each pipeline
zephyr_prompt_builder = PromptBuilder(template=prompt_template)
falcon_prompt_builder = PromptBuilder(template=prompt_template)
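
To see roughly what text reaches the LLM, the template can be imitated with plain string formatting. This is a standard-library approximation, not the Jinja rendering that `PromptBuilder` performs, so whitespace may differ; the function `render_prompt`, the sample passage and the sample question are hypothetical.

```python
def render_prompt(documents, query: str) -> str:
    """Approximate the rendered prompt for a list of {'content', 'url'} dicts."""
    context = "\n".join(f"  {d['content']} URL:{d['url']}" for d in documents)
    return (
        "<|system|>Using the information contained in the context, "
        "give a comprehensive answer to the question.\n"
        "If the answer is contained in the context, also report the source URL.\n"
        "If the answer cannot be deduced from the context, do not give an answer.</s>\n"
        "<|user|>\n"
        "Context:\n"
        f"{context}\n"
        f"  Question: {query}\n"
        "  </s>\n"
        "<|assistant|>\n"
    )


# Hypothetical retrieved chunk, for illustration only
docs = [{"content": "Sample passage about the university.", "url": "https://www.unipa.it/"}]
print(render_prompt(docs, "Who is the rector?"))
```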

### Pipeline Creation

In [20]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

In [21]:
zephyr_rag = Pipeline() # Empty Pipeline
# Text embedder
zephyr_rag.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="thenlper/gte-large", # This is the model for the sentence transformer
        device=ComponentDevice.from_str("cuda:0") # We use the GPU
    ),
)
# Embedding Retriever
zephyr_rag.add_component(
    "retriever", InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=5 #Find the 5 most similar documents to the query
    )
)
zephyr_rag.add_component("prompt_builder", zephyr_prompt_builder)
zephyr_rag.add_component("llm", zephyr_generator)

# Text Embedder -> Retriever -> Prompt Builder -> LLM
zephyr_rag.connect("text_embedder", "retriever")
zephyr_rag.connect("retriever.documents", "prompt_builder.documents")
zephyr_rag.connect("prompt_builder.prompt", "llm.prompt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7a898cb46aa0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: HuggingFaceLocalGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
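
The retriever step with `top_k=5` amounts to scoring every stored chunk against the query embedding and keeping the best few. A toy sketch with two-dimensional vectors follows; it assumes unit-length vectors, so the dot product stands in for the similarity score, and the names and data are hypothetical rather than Haystack's implementation.

```python
import heapq


def top_k(query_vec, doc_vecs: dict, k: int = 5):
    """Return the k (score, doc_id) pairs with the highest dot product."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)


# Hypothetical unit-length "embeddings"
toy_store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
for score, doc_id in top_k([1.0, 0.0], toy_store, k=2):
    print(doc_id, round(score, 2))
```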

In [22]:
falcon_rag = Pipeline() # Empty Pipeline
# Text embedder
falcon_rag.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="thenlper/gte-large",# This is the model for the sentence transformer
        device=ComponentDevice.from_str("cuda:0")# We use the GPU
    ),
)
# Embedding Retriever
falcon_rag.add_component(
    "retriever", InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=5 #Find the 5 most similar documents to the query
    )
)
falcon_rag.add_component("prompt_builder", falcon_prompt_builder)
falcon_rag.add_component("llm", falcon_generator)

# Text Embedder -> Retriever -> Prompt Builder -> LLM
falcon_rag.connect("text_embedder", "retriever")
falcon_rag.connect("retriever.documents", "prompt_builder.documents")
falcon_rag.connect("prompt_builder.prompt", "llm.prompt")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7a898cb47460>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - llm: HuggingFaceLocalGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)

## Questions

In [23]:
def get_generative_answer(query, model_object):
    # Run the given RAG pipeline: the query is embedded for retrieval
    # and also inserted into the prompt
    results = model_object.run(
        {"text_embedder": {"text": query}, "prompt_builder": {"query": query}}
    )
    # Print the first generated reply
    answer = results["llm"]["replies"][0]
    rich.print(answer)
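
The pipeline output is a nested dictionary keyed by component name, which is why the reply is read from `results["llm"]["replies"][0]`. A defensive variant, sketched here on a hand-made dictionary, avoids an `IndexError` when no reply comes back; `first_reply` and `fake_results` are hypothetical.

```python
def first_reply(results: dict, component: str = "llm"):
    """Return the first reply of the given component, or None if there is none."""
    replies = results.get(component, {}).get("replies", [])
    return replies[0] if replies else None


# Hand-made stand-in for the dictionary returned by a pipeline run
fake_results = {"llm": {"replies": ["first answer", "second answer"]}}
print(first_reply(fake_results))
print(first_reply({"llm": {"replies": []}}))
```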

In [32]:
def query_both_models(query: str):
    print("ZEPHYR:")
    get_generative_answer(query, zephyr_rag)
    print("\nFALCON:")
    get_generative_answer(query, falcon_rag)

## Test questions

In [34]:
query_both_models("Tell me about Data, Algorithms and Machine Intelligence")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [36]:
query_both_models("Dimmi gli eventi dell'università di palermo del 2024")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [37]:
query_both_models("Parlami di Ausiello")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [38]:
query_both_models("Who were the deans of the University of Palermo?")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [39]:
query_both_models("Tell me about Raffaele Giancarlo")

ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


### Zephyr

In [27]:
while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
    print(query)
    get_generative_answer(query, zephyr_rag)
    print("\n\n")

Input your query (type EXIT to finish): cosa ha fatto Raffaele Giancarlo?
cosa ha fatto Raffaele Giancarlo?


Batches:   0%|          | 0/1 [00:00<?, ?it/s]






KeyboardInterrupt: Interrupted by user

### Falcon

In [28]:
while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
    print(query)
    get_generative_answer(query, falcon_rag)
    print("\n\n")

Input your query (type EXIT to finish): Cosa ha fatto Raffaele Giancarlo?
Cosa ha fatto Raffaele Giancarlo?


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.





Input your query (type EXIT to finish): EXIT


### Both

In [29]:
while (query := input("Input your query (type EXIT to finish): ")) != "EXIT":
    print(query)
    query_both_models(query)
    print("\n\n")

Orari di apertura della biblioteca del Dipartimento di matematica e informatica
ZEPHYR:


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


FALCON


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.







KeyboardInterrupt: Interrupted by user

______

In [None]:
from haystack import Pipeline

from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

QUESTIONS = [
    "Which is the most popular global sport?",
    "Who created the Python language?",
]
CONTEXTS = [
    [
        "The popularity of sports can be measured in various ways, including TV viewership, social media presence, number of participants, and economic impact. Football is undoubtedly the world's most popular sport with major events like the FIFA World Cup and sports personalities like Ronaldo and Messi, drawing a followership of more than 4 billion people."
    ],
    [
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming language. Its design philosophy emphasizes code readability, and its language constructs aim to help programmers write clear, logical code for both small and large-scale software projects."
    ],
]
RESPONSES = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

GROUND_TRUTHS = [
    "Football is the most popular sport",
    "Python language was created by Guido van Rossum.",
]

ragas_pipe = Pipeline()
evaluator_context = RagasEvaluator(
    metric=RagasMetric.CONTEXT_PRECISION,
)
evaluator_aspect = RagasEvaluator(
    metric=RagasMetric.ASPECT_CRITIQUE,
    metric_params={"name": "custom", "definition": "Is this answer problematic for children?", "strictness": 3},
)
ragas_pipe.add_component("evaluator_context", evaluator_context)
ragas_pipe.add_component("evaluator_aspect", evaluator_aspect)

# Each metric expects a specific set of parameters as input. Refer to the
# Ragas class' documentation for more details.
results = ragas_pipe.run(
    {
        "evaluator_context": {"questions": QUESTIONS, "contexts": CONTEXTS, "ground_truths": GROUND_TRUTHS},
        "evaluator_aspect": {"questions": QUESTIONS, "contexts": CONTEXTS, "responses": RESPONSES},
    }
)

for component in ["evaluator_context", "evaluator_aspect"]:
    for output in results[component]["results"]:
        print(output)
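
For intuition about the `CONTEXT_PRECISION` score, the sketch below computes the rank-weighted precision that the metric is built on, given binary relevance labels for the retrieved contexts in rank order. In Ragas those labels come from an LLM judge; here they are supplied by hand, and the function is an illustration under that assumption rather than the library's code.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant context."""
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at this relevant rank
    return weighted_sum / hits if hits else 0.0


# Hypothetical labels: contexts at ranks 1 and 3 are relevant, rank 2 is not
print(round(context_precision([1, 0, 1]), 4))
```

Relevant contexts ranked earlier raise the score, so `[1, 1, 0]` scores higher than `[0, 1, 1]`.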
