# Data Ingestion

This notebook shows how to preprocess documents and ingest them into the document store. All Azure services required by the RAG application must be running before the notebook is used.

The notebook has three main sections:

1. [List Files to Ingest](#list-files-to-ingest): the file paths are defined here.
2. [Ingest Documents Using `rag.ingest`](#ingest-documents-using-ragingest): the listed documents are ingested, i.e., parsed, chunked, and uploaded to the document store.
3. [Retrieve Documents Using `rag.retrieve`](#retrieve-documents-using-ragretrieve): the top `k` documents are retrieved from the document store for a given query. This section is for testing only, since the `rag.chat` function carries out retrieval internally.

**In practice, only Sections 1 and 2 need to be run to set up the index for RAG usage.**

Two further sections exist for testing purposes only:

- [Test Preprocessing](#test-preprocessing)
- [Test Retrieval of Documents](#test-retrieval-of-documents)


In [47]:
import sys
from pathlib import Path
from pprint import pprint

# Add the backend directory to the path
sys.path.append(str(Path(".").resolve().parent / "backend"))

In [48]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## List Files to Ingest

In [49]:
# Sources can be given in these formats: CSV, PDF, Markdown, URL
FILENAMES = [
    "../data/wine-ratings.csv",
    "../data/Lewis_RAG_2021_one_page.pdf",
    "../README.md",
    "https://medium.com/@iamleonie/building-retrieval-augmented-generation-systems-be587f42aedb",
]

# Convert local paths to Path objects; URLs are kept as plain strings
FILENAMES = [f if f.startswith("http") else Path(f) for f in FILENAMES]
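The list above mixes local files and a URL, and the URL is deliberately kept as a string because `Path` normalizes separators and would alter the `//` in it. As a minimal sketch of how such sources could be told apart, the hypothetical helper `detect_source_type` below (not part of the backend; the suffix-to-type mapping is an assumption) classifies each entry by prefix or file suffix:

```python
from pathlib import Path
from typing import Union


def detect_source_type(source: Union[str, Path]) -> str:
    """Return a coarse source type: 'url', 'csv', 'pdf' or 'markdown'.

    Hypothetical helper for illustration only; not part of the backend.
    """
    text = str(source)
    # URLs are recognized by their scheme prefix
    if text.startswith(("http://", "https://")):
        return "url"
    # Local files are recognized by their (case-insensitive) suffix
    suffix = Path(text).suffix.lower()
    types = {".csv": "csv", ".pdf": "pdf", ".md": "markdown"}
    if suffix not in types:
        raise ValueError(f"Unsupported source: {text}")
    return types[suffix]


sources = [
    "../data/wine-ratings.csv",
    "../data/Lewis_RAG_2021_one_page.pdf",
    "../README.md",
    "https://medium.com/@iamleonie/building-retrieval-augmented-generation-systems-be587f42aedb",
]
source_types = [detect_source_type(s) for s in sources]
print(source_types)
```
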

## Test Preprocessing

In [50]:
from preprocessing import get_preprocessor

In [51]:
# Load every source and split it into chunks with the preprocessor selected for it
docs = []
for file in FILENAMES:
    preprocessor = get_preprocessor(str(file), chunk_size=1000, chunk_overlap=200)
    chunks = preprocessor.load_split(file)
    docs.extend(chunks)

In [52]:
len(docs)

33300

In [53]:
pprint(docs[-10].page_content)

('Basic usage of the Azure services via the LangChain library is showcased in '
 'the notebook '
 '[`notebooks/azure_search_rag.ipynb`](./notebooks/azure_search_rag.ipynb).\n'
 '\n'
 'Once the infrastructure is setup and all the secrets/keys are known, we '
 'can:\n'
 '\n'
 '- launch the docker container of the backend locally\n'
 '- or deploy the docker container of the backend on Azure as a Container '
 'App.\n'
 '\n'
 'Then, we can use the RAG backend\n'
 '\n'
 '- via its API\n'
 '- or via the GUI or frontend (which uses the backend API).\n'
 '\n'
 'The following subsections give details for all those operation modi.\n'
 '\n'
 '### Running the Backend Locally\n'
 '\n'
 'Assuming the Azure resources have been launched successfully and the '
 'documents have been ingested, we can build the backend image and run it '
 'locally:\n'
 '\n'
 '```bash\n'
 '# Start backend, e.g., locally\n'
 'cd .../azure-rag-app\n'
 'docker run --rm --env-file .env -p 8080:8000 azure-rag-backend    \n'
 '# 

## Ingest Documents Using `rag.ingest`

Here the documents in `FILENAMES` are ingested, i.e., parsed, chunked and uploaded to the document store.
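To make the chunking step more concrete, here is a minimal sketch of character-based splitting with overlap. It assumes that `chunk_size` and `chunk_overlap` are measured in characters and that consecutive chunks share `chunk_overlap` characters; the actual preprocessors may split differently (for instance on separators), so `split_with_overlap` is a hypothetical illustration, not the backend's implementation.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy text of 2500 characters (digits, so that overlaps can be checked)
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
```

With these values, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.
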

In [54]:
from rag import ingest

In [55]:
ids = ingest(
    filenames=FILENAMES,
    chunk_size=1000,
    chunk_overlap=200,
    percentage=0.001,  # 0.1% of the data/chunks are ingested, for testing purposes
)
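The `percentage` argument restricts ingestion to a fraction of the chunks. How `ingest` selects that fraction is not shown in this notebook; the sketch below assumes a seeded random sample and uses a hypothetical helper `sample_fraction`. With the 33300 chunks counted earlier, a fraction of 0.001 corresponds to about 33 chunks.

```python
import random


def sample_fraction(items: list, percentage: float, seed: int = 0) -> list:
    """Keep roughly `percentage` (a fraction in (0, 1]) of `items`.

    Hypothetical helper: the selection strategy is an assumption.
    """
    if not 0 < percentage <= 1:
        raise ValueError("percentage must be in (0, 1]")
    if not items:
        return []
    # Keep at least one item so that tiny fractions still yield data
    n_keep = max(1, round(len(items) * percentage))
    rng = random.Random(seed)  # seeded for reproducible test runs
    return rng.sample(items, n_keep)


all_chunks = [f"chunk-{i}" for i in range(33300)]
subset = sample_fraction(all_chunks, percentage=0.001)
print(len(subset))
```
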

In [56]:
ids

['YThkNzQ3NGYtNDliNS00ODQ2LWE1MGUtZTJjOTU2OTBlZTAx',
 'MDczZjgyZTktNjc4Yi00N2QwLWFiN2ItYWFhMDRkZjdkMGFk',
 'MTk2NmJmNDctM2NkNy00Mzg3LWFlYTktZjNlYjMxYTI3ODUx',
 'ZTE2MDQzMGQtNTE0Yy00MTk0LThkMjItMmQzNDY5M2MwN2U5',
 'MWIwMTlmYTgtMWJjNy00ZGQ5LThlMjItNWFjNjJjMTExNmI1',
 'NDk5YzkyNDctNTRiNS00MjRmLWJlOTMtNDFkNjRhZjhmZTM5',
 'ZTU4YmUxOGUtNTM0Ni00YTM1LWIxOTQtNTBmZjk5OWE0MDYz',
 'Mzc2ZTllODMtN2IzYi00NTVkLThiM2ItODk2YTVlNDZhNjBh',
 'MzYzY2MyMjgtNzJjOS00MmUwLWEzZTMtYzkxNTM1NjZlMDc5',
 'NzZhNDEwZmItZmFiZi00YzA5LWFhMDgtMDBlYjhiYjg0MzVj',
 'MGI1MTkwZDktMjU1My00ZTIwLWFhMjQtZDFlYzc1YTk5OGIy',
 'YTQ4YTRhOWItYzg3Yy00MzRlLWI0ZTYtMmExYjQ0NjlmOWQ5',
 'YTkzZjZmYjQtYzgyZi00ZGVhLWIwOTAtMWEzZTZlZmIxYzVk',
 'MTgxZjQyMDMtYjRkZC00YTQ3LTgxOWMtZTE1NDNiYjQ4MzQz',
 'NWQ1YjM2NmYtNmRlOC00OWRlLTgxNWItYTc1ZmI2MmM1MTgw',
 'MmVhZDk0NjMtMDdlYy00MTQ5LTgxOTYtYTliMzgzZWJhNGEw',
 'MWY5Nzc3ZGYtN2Q4YS00ZTk5LTg1NjctZmU4MjUxNGVlZmQw',
 'ZTUxZTY3MGUtODdkNS00ZDM1LThkMjQtYmM1NjM3MmU2YWYw',
 'NGZiZTJmODUtNThjNi00MWM0LTgxN2EtMGUwNzE3NDYw
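The returned ids appear to be Base64-encoded strings. The notebook does not state how the keys are generated, so the following is only an inspection of the first id from the output above: it decodes the value and checks whether the result parses as a UUID.

```python
import base64
import uuid

# First id from the output above
encoded_id = "YThkNzQ3NGYtNDliNS00ODQ2LWE1MGUtZTJjOTU2OTBlZTAx"

# Decode the Base64 text back into the underlying string
decoded_id = base64.urlsafe_b64decode(encoded_id).decode("utf-8")
print(decoded_id)

# uuid.UUID raises ValueError if the string is not a well-formed UUID
parsed = uuid.UUID(decoded_id)
print(parsed.version)
```
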

## Test Retrieval of Documents

In [61]:
import os
from os.path import dirname
from pprint import pprint
from dotenv import load_dotenv

# Load environment variables
current_dir = os.path.abspath(".")
root_dir = dirname(current_dir)
env_file = os.path.join(root_dir, '.env')
load_dotenv(env_file, override=True)

True

In [62]:
# Retrieve Azure credentials and variables
load_dotenv(env_file, override=True)
azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_endpoint_uri = os.getenv("AZURE_OPENAI_ENDPOINT_URI")
embedding_deployment_name = os.getenv("EMBEDDING_DEPLOYMENT_NAME")
chat_deployment_name = os.getenv("CHAT_DEPLOYMENT_NAME")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")
azure_search_api_key = os.getenv("AZURE_SEARCH_API_KEY")
azure_search_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
azure_search_index_name = os.getenv("AZURE_SEARCH_INDEX_NAME")

In [63]:
from langchain_community.retrievers import AzureAISearchRetriever

retriever = AzureAISearchRetriever(
    content_key="content",
    api_key=azure_search_api_key,
    index_name=azure_search_index_name,
    service_name=azure_search_endpoint,
    top_k=5,
)

In [64]:
# The AzureAISearchRetriever returns a list of documents (Pydantic models).
# Each document has `page_content` and `metadata`.
# The metadata contains, among others:
# - @search.score
# - id (index id of the chunk)
# - content_vector
docs = retriever.invoke("What is the best Bourbon Barrel wine?")
print(f"Number of documents: {len(docs)}")
print("Top document:")
pprint(docs[0].model_dump())

Number of documents: 5
Top document:
{'id': None,
 'metadata': {'@search.score': 3.8138385,
              'content_vector': [0.014251245,
                                 -0.035915814,
                                 -0.009172984,
                                 -0.018145246,
                                 0.018305825,
                                 0.017074732,
                                 -0.020808157,
                                 -0.006623818,
                                 -0.01466607,
                                 -0.021999106,
                                 -0.013060296,
                                 0.021182837,
                                 -0.013575482,
                                 -0.024033086,
                                 -0.006158813,
                                 0.028770119,
                                 0.01530838,
                                 0.004235229,
                                 0.015549246,
                         

In [65]:
for doc in docs:
    print(f"chunk/doc id: {doc.metadata['id']}")
    print(f"score: {doc.metadata['@search.score']}")


chunk/doc id: NTcxYmMwMWItZDdlMS00OWY2LTg2ZWUtOTJjZTFmNjhjYWMy
score: 3.8138385
chunk/doc id: Mzc2ZTllODMtN2IzYi00NTVkLThiM2ItODk2YTVlNDZhNjBh
score: 3.8138385
chunk/doc id: M2NlODYwNmQtYTQ0OC00ZDExLWI3ODYtZjQ4ZTA2N2IwODdk
score: 3.4852285
chunk/doc id: MTk2NmJmNDctM2NkNy00Mzg3LWFlYTktZjNlYjMxYTI3ODUx
score: 3.4852285
chunk/doc id: YmI3ZTM0ODYtYzU0NS00NzBmLTg0YjEtMDAyMDE5YmI0OTAw
score: 3.3086472
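Dumping a whole document prints the full embedding vector, which hides the other fields. As a small sketch, the hypothetical helper `strip_vectors` below removes the `content_vector` entry from a dumped document before printing. The dictionary is a stand-in for `docs[0].model_dump()`; its `id` and `page_content` values are made up for illustration.

```python
from pprint import pprint


def strip_vectors(doc_dump: dict, vector_key: str = "content_vector") -> dict:
    """Return a copy of a dumped document without the embedding vector."""
    metadata = {
        key: value
        for key, value in doc_dump.get("metadata", {}).items()
        if key != vector_key
    }
    # The original dictionary is left untouched
    return {**doc_dump, "metadata": metadata}


# Stand-in for `docs[0].model_dump()`; id and page_content are hypothetical
example_dump = {
    "id": None,
    "metadata": {
        "@search.score": 3.8138385,
        "content_vector": [0.014251245, -0.035915814, -0.009172984],
        "id": "hypothetical-id",
    },
    "page_content": "hypothetical chunk text",
}
compact = strip_vectors(example_dump)
pprint(compact)
```
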


## Retrieve Documents Using `rag.retrieve`

Here, the top `k` documents are retrieved from the document store for a given query.
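Conceptually, top-`k` retrieval scores every stored chunk against the query and keeps the `k` best. The sketch below illustrates only that selection step, using cosine similarity over toy two-dimensional vectors in an in-memory dictionary. Azure AI Search computes its own relevance score, so the function names, vectors, and scores here are assumptions for illustration and do not reproduce the `@search.score` values.

```python
import heapq
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, store: dict, k: int) -> list:
    """Return the k (score, doc_id) pairs with the highest similarity."""
    scored = (
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in store.items()
    )
    return heapq.nlargest(k, scored)


# Toy document store: id -> embedding (hypothetical values)
store = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], store, k=2)
for score, doc_id in results:
    print(doc_id, round(score, 3))
```
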

In [66]:
from rag import retrieve

In [67]:
docs = retrieve(
    query="What is the best Bourbon Barrel wine?",
    top_k=2
)

In [68]:
for doc in docs:
    print(f"chunk/doc id: {doc.metadata['id']}")
    print(f"score: {doc.metadata['@search.score']}")

chunk/doc id: NTcxYmMwMWItZDdlMS00OWY2LTg2ZWUtOTJjZTFmNjhjYWMy
score: 3.8138385
chunk/doc id: Mzc2ZTllODMtN2IzYi00NTVkLThiM2ItODk2YTVlNDZhNjBh
score: 3.8138385
