# Data Ingestion

This notebook shows how to preprocess and ingest documents. All Azure services required by the RAG application must be running before we use it.

Three sections are essential:

1. [List Files to Ingest](#list-files-to-ingest): we define the file paths here.
2. [Ingest Documents Using `rag.ingest`](#ingest-documents-using-ragingest): the listed documents are ingested, i.e., parsed, chunked and uploaded to the document store.
3. [Retrieve Documents Using `rag.retrieve`](#retrieve-documents-using-ragretrieve): the top `k` documents are retrieved from the document store for a given query. This section is only for testing purposes, since the `rag.chat` function carries out retrieval internally.

**In practice, we only need to run sections 1 and 2 to set up the index for RAG usage.**

Two further sections exist for testing purposes only:

- [Test Preprocessing](#test-preprocessing)
- [Test Retrieval of Documents](#test-retrieval-of-documents)


In [1]:
import sys
from pathlib import Path
from pprint import pprint

# Add the backend directory to the path
sys.path.append(str(Path(".").resolve().parent / "backend"))

In [2]:
%load_ext autoreload
%autoreload 2

## List Files to Ingest

In [8]:
# Supported sources: CSV, PDF and Markdown files, as well as URLs
FILENAMES = [
    "../data/wine-ratings.csv",
    "../data/Lewis_RAG_2021_one_page.pdf",
    "../README.md",
    "https://medium.com/@iamleonie/building-retrieval-augmented-generation-systems-be587f42aedb",
]

# Convert local paths to Path objects; URLs stay as plain strings
FILENAMES = [Path(f) if not f.startswith("http") else f for f in FILENAMES]
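
To make the mix of local paths and URLs in `FILENAMES` more tangible, here is a minimal sketch of how a source could be classified by type. The function `detect_source_type` and the mapping `SUFFIX_TO_TYPE` are hypothetical names introduced for illustration; they are not part of the `preprocessing` module, whose internal dispatch may work differently.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical mapping from file suffix to a source type label.
SUFFIX_TO_TYPE = {".csv": "csv", ".pdf": "pdf", ".md": "markdown"}


def detect_source_type(source):
    """Return 'url', 'csv', 'pdf' or 'markdown' for a path or URL."""
    text = str(source)
    # Anything with an http(s) scheme is treated as a URL.
    if urlparse(text).scheme in ("http", "https"):
        return "url"
    suffix = Path(text).suffix.lower()
    if suffix not in SUFFIX_TO_TYPE:
        raise ValueError(f"Unsupported source: {text}")
    return SUFFIX_TO_TYPE[suffix]
```

Calling `detect_source_type` on each entry of `FILENAMES` before ingestion would surface unsupported formats early, instead of failing halfway through a long ingestion run.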

## Test Preprocessing

In [39]:
from preprocessing import get_preprocessor

In [9]:
docs = []
for file in FILENAMES:
    preprocessor = get_preprocessor(str(file), chunk_size=1000, chunk_overlap=200)
    d = preprocessor.load_split(file)
    docs.extend(d)
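
The parameters `chunk_size=1000` and `chunk_overlap=200` control how each document is split. As a rough illustration, the following sketch splits a string into fixed-size character windows that share an overlap. It is an assumption about the general idea only: `split_with_overlap` is a hypothetical helper, and the actual preprocessors may split on separators, tokens or rows instead.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the values used above, consecutive chunks start 800 characters apart, so the last 200 characters of one chunk are repeated at the beginning of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.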

In [10]:
len(docs)

33300

In [11]:
pprint(docs[-10].page_content)

('Basic usage of the Azure services via the LangChain library is showcased in '
 'the notebook '
 '[`notebooks/azure_search_rag.ipynb`](./notebooks/azure_search_rag.ipynb).\n'
 '\n'
 'Once the infrastructure is setup and all the secrets/keys are known, we '
 'can:\n'
 '\n'
 '- launch the docker container of the backend locally\n'
 '- or deploy the docker container of the backend on Azure as a Container '
 'App.\n'
 '\n'
 'Then, we can use the RAG backend\n'
 '\n'
 '- via its API\n'
 '- or via the GUI or frontend (which uses the backend API).\n'
 '\n'
 'The following subsections give details for all those operation modi.\n'
 '\n'
 '### Running the Backend Locally\n'
 '\n'
 'Assuming the Azure resources have been launched successfully and the '
 'documents have been ingested, we can build the backend image and run it '
 'locally:\n'
 '\n'
 '```bash\n'
 '# Start backend, e.g., locally\n'
 'cd .../azure-rag-app\n'
 'docker run --rm --env-file .env -p 8080:8000 azure-rag-backend    \n'
 '# 

## Ingest Documents Using `rag.ingest`

Here, the documents in `FILENAMES` are ingested, i.e., parsed, chunked and uploaded to the document store.

In [40]:
from rag import ingest

In [18]:
ids = ingest(
    filenames=FILENAMES,
    chunk_size=1000,
    chunk_overlap=200,
    percentage=0.001,  # 0.1% of the data/chunks are ingested, for testing purposes
)
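
The `percentage` argument limits how many chunks are uploaded. The sketch below shows one way such a subset could be drawn, assuming random sampling with a fixed seed. This is an assumption: `sample_fraction` is a hypothetical helper, and `rag.ingest` may select chunks differently (for example, by taking the first ones).

```python
import math
import random


def sample_fraction(items, percentage, seed=0):
    """Return a random subset holding roughly `percentage` of the items."""
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in (0, 1]")
    # Round up so that a non-empty input always yields at least one item.
    count = math.ceil(len(items) * percentage) if items else 0
    # A dedicated generator keeps the sampling reproducible for a given seed.
    rng = random.Random(seed)
    return rng.sample(items, count)
```

Sampling a small fraction keeps test runs cheap, since every uploaded chunk also requires an embedding call.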

In [19]:
ids

['NzRjYjAzNTktZmUyOC00NWJkLWJhODQtZDg5ZmFiYmFiODZk',
 'ZDMyODNmNTctNTVmMi00Yzk0LWE0YWItZDI1N2M4ZWFkNzY0',
 'M2JkZTgzNjUtMjc0OC00YmJjLTg2ZWUtYjllODhkZjY5YjJm',
 'MzMyMzhiYzEtOTVmZS00ZDZkLWJmZGMtOTBkMTg2ZTQxYTYw',
 'MDM3ZWZjMzItZTc2ZS00Y2UzLWEyOTMtYTBjMTFmODdhMThk',
 'NGE4ODBlNTYtYjc5NC00NWRmLTkyMWMtMWQ0MDE5NjZlNDFh',
 'ZTY3MzcyZTUtZDE0Yi00ZDExLWJmYzMtYjgxYzVkMWE2MTdk',
 'ODIzZjExODYtOGU1ZS00NWE2LWFhMDMtNzZjZThkOTAzZjIw',
 'MDYwMzc0ZDMtYWEzYS00YmZlLWJhOGYtNzAwYjBmMDI2ZDBk',
 'ZDJjN2ZmMjUtM2NkNi00ZTUzLWIyMWUtMDY1NzRjMmZhYjEz',
 'NTM4NmNjNTAtYTJlMC00M2MzLWJhYzQtYWRmYmNkM2Q4NjZi',
 'MTFiYWEwMjEtYmI2Yy00NzZlLThlMmItMzA3MzViNjNiYTRi',
 'OTM0ZjkxNjctZDA1MS00ZGU5LWFlMzctMTdkZjFkM2Q2YmQz',
 'MTAzNDllYWMtYzRhYi00Nzc4LWIxNGUtOTlkM2YyMGUxYWUx',
 'MGQwYjM4NWUtNDg2Ni00ZDZkLWFmZDUtY2M3OTZlZmRiMmJi',
 'MTZkYWE0NWQtOTJlMy00ZGIyLWExZjAtOTY2M2MwNzU1Y2Yx',
 'MmQ4MTBiZjEtNTc2OS00YzNiLWIwNzItZDgwYmZmZTlhMTNm',
 'MDJmZDk5ZjUtZmUwMC00YjkxLTg5YmQtZmY0MWUxYTU2Y2Zi',
 'NTAxYjllMGMtMDAwNC00YjU4LWJiMWUtYzlhMDkzNWMw
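
The ids in this output look like URL-safe base64 strings. Decoding the first one yields a UUID-formatted string, as the following sketch shows. That every id follows this pattern, and that it reflects how the document store builds its keys, is an assumption; `decode_chunk_id` is a hypothetical helper.

```python
import base64
import uuid


def decode_chunk_id(chunk_id):
    """Decode a URL-safe base64 id and parse the result as a UUID."""
    raw = base64.urlsafe_b64decode(chunk_id).decode("ascii")
    # uuid.UUID raises ValueError if the decoded text is not a valid UUID.
    return uuid.UUID(raw)


print(decode_chunk_id("NzRjYjAzNTktZmUyOC00NWJkLWJhODQtZDg5ZmFiYmFiODZk"))
# 74cb0359-fe28-45bd-ba84-d89fabbab86d
```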

## Test Retrieval of Documents

In [20]:
import os
from os.path import dirname
from pprint import pprint
from dotenv import load_dotenv

# Load environment variables
current_dir = os.path.abspath(".")
root_dir = dirname(current_dir)
env_file = os.path.join(root_dir, '.env')
load_dotenv(env_file, override=True)

True

In [21]:
# Retrieve Azure credentials and variables
load_dotenv(env_file, override=True)
azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_endpoint_uri = os.getenv("AZURE_OPENAI_ENDPOINT_URI")
embedding_deployment_name = os.getenv("EMBEDDING_DEPLOYMENT_NAME")
chat_deployment_name = os.getenv("CHAT_DEPLOYMENT_NAME")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")
azure_search_api_key = os.getenv("AZURE_SEARCH_API_KEY")
azure_search_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
azure_search_index_name = os.getenv("AZURE_SEARCH_INDEX_NAME")
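
Since `os.getenv` silently returns `None` for unset variables, a quick check can catch a missing entry in `.env` before it causes a confusing error further down. The sketch below uses the variable names read in the cell above; treating all three search variables as required is an assumption, and `find_missing_vars` is a hypothetical helper.

```python
import os

# Variables the retrieval test is assumed to need (names taken from the cell above).
REQUIRED_VARS = [
    "AZURE_SEARCH_API_KEY",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_INDEX_NAME",
]


def find_missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```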

In [23]:
from langchain_community.retrievers import AzureAISearchRetriever

retriever = AzureAISearchRetriever(
    content_key="content",
    api_key=azure_search_api_key,
    index_name=azure_search_index_name,
    service_name=azure_search_endpoint,
    top_k=5,
)

In [30]:
# The AzureAISearchRetriever returns a list of documents (Pydantic models).
# Each document has a `page_content` string and a `metadata` dict.
# The metadata contains, among others:
# - @search.score
# - id (the index key of the chunk)
# - content_vector
docs = retriever.invoke("What is the best Bourbon Barrel wine?")
print(f"Number of documents: {len(docs)}")
print("Top document:")
pprint(docs[0].model_dump())

Number of documents: 5
Top document:
{'id': None,
 'metadata': {'@search.score': 2.4468806,
              'content_vector': [0.014189559,
                                 -0.035788182,
                                 -0.0092747025,
                                 -0.018094696,
                                 0.01838892,
                                 0.017091665,
                                 -0.020889813,
                                 -0.0066233543,
                                 -0.014523903,
                                 -0.022026582,
                                 -0.012999294,
                                 0.021184035,
                                 -0.013547619,
                                 -0.023992525,
                                 -0.0060783736,
                                 0.028833825,
                                 0.015165843,
                                 0.0042060474,
                                 0.0154600665,
                   

In [37]:
for doc in docs:
    print(f"chunk/doc id: {doc.metadata['id']}")
    print(f"score: {doc.metadata['@search.score']}")


chunk/doc id: YjdmMzVjOTQtMDY3Ny00MmI1LTlhMDEtYmRjYjJkYWFmY2Qw
score: 2.4468806
chunk/doc id: ZWQ0YmEzNjMtNjNkNy00OTIwLTk3NzAtNjFmZjQ2MGQwMjlj
score: 2.4468806
chunk/doc id: MDg4MWI0NTYtY2U4OS00NTllLWJiYjgtNjdmNGMzNDU0NzRm
score: 2.4468806
chunk/doc id: ZmU5YzgyZTMtZTM5ZC00ZmIyLWI2ODgtNWVkYTllZjNhMDcx
score: 2.4468806
chunk/doc id: YjM0ODQ1OWMtNjk4ZS00ZDE5LTlkZWUtZjljOWZiY2ZmYWQx
score: 2.4468806


## Retrieve Documents Using `rag.retrieve`

Here, the top `k` documents are retrieved from the document store for a given query.
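
Conceptually, top-`k` retrieval keeps only the `k` best-scoring candidates. The sketch below illustrates that selection step on plain `(doc_id, score)` pairs. It is not how `rag.retrieve` is implemented: in this notebook the ranking happens inside the Azure AI Search service, and `top_k_by_score` is a hypothetical helper.

```python
import heapq


def top_k_by_score(scored_docs, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])


candidates = [("doc-a", 0.2), ("doc-b", 0.9), ("doc-c", 0.5)]
print(top_k_by_score(candidates, k=2))
# [('doc-b', 0.9), ('doc-c', 0.5)]
```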

In [41]:
from rag import retrieve

In [44]:
docs = retrieve(
    query="What is the best Bourbon Barrel wine?",
    top_k=2
)

In [45]:
for doc in docs:
    print(f"chunk/doc id: {doc.metadata['id']}")
    print(f"score: {doc.metadata['@search.score']}")

chunk/doc id: YjdmMzVjOTQtMDY3Ny00MmI1LTlhMDEtYmRjYjJkYWFmY2Qw
score: 2.4468806
chunk/doc id: ZWQ0YmEzNjMtNjNkNy00OTIwLTk3NzAtNjFmZjQ2MGQwMjlj
score: 2.4468806
