# Data Ingestion 
For simplicity, this example uses extracts of the Databricks docs for the three clouds (AWS, Azure, and GCP). They are available to us pre-chunked as JSONL files, which we will ingest into a Delta table. We also have a few example questions and answers about Databricks features, which we'll use to further optimize our prompts.

Your own data might be in the form of PDF, JSON, or CSV files, in which case you will need to take additional steps to extract the text and chunk it before it can be used in a RAG agent. For more details about chunking and preparing data, see [the Databricks AI cookbook](https://docs.databricks.com/aws/en/generative-ai/tutorials/ai-cookbook/quality-data-pipeline-rag).
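
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. It is not the method used to produce the pre-chunked docs in this example; the function name `chunk_text` and the `chunk_size` and `overlap` parameters (both counted in whitespace-separated words) are assumptions made for illustration. Production pipelines usually chunk by tokens or by document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that share `overlap` words.

    Hypothetical helper for illustration only: sizes are counted in
    whitespace-separated words, not model tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text,
        # so we do not emit a trailing chunk made only of overlap words.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a little shared context between neighbouring chunks, which can help retrieval when an answer spans a chunk boundary.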

In [0]:
%pip install -U -qqqq "mlflow>=2.18.0" tokenizers torch transformers openpyxl databricks-sdk langchain==0.1.13
dbutils.library.restartPython()

In [0]:
from mlflow.models import ModelConfig

In [0]:
config_file = "../config.yaml"
model_config = ModelConfig(development_config=config_file)

In [0]:
CATALOG = model_config.get("catalog")
SCHEMA = model_config.get("schema")
VOLUME = model_config.get("volume")
path = f"{CATALOG}/{SCHEMA}/{VOLUME}"

### Ingest the docs

In [0]:
azure_docs = spark.read.json(f"/Volumes/{path}/databricks_docs_azure_chunked.jsonl").drop("metadata")
aws_docs = spark.read.json(f"/Volumes/{path}/databricks_docs_aws_chunked.jsonl").drop("metadata")
gcp_docs = spark.read.json(f"/Volumes/{path}/databricks_docs_gcp_chunked.jsonl").drop("metadata")
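
For readers less familiar with the JSONL format, the sketch below shows roughly what the read-and-drop step above does, using only the Python standard library on in-memory lines. The helper `read_jsonl_records` and the `content` field are hypothetical; `metadata` and `chunk_index` are the only field names taken from this notebook, and the actual files may contain other fields.

```python
import json


def read_jsonl_records(lines, drop_keys=("metadata",)):
    """Parse JSONL (one JSON object per line) and drop unwanted keys.

    Hypothetical helper for illustration; the notebook itself uses
    spark.read.json(...).drop("metadata") for this step.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON on line {line_number}") from exc
        records.append({k: v for k, v in record.items() if k not in drop_keys})
    return records


# The "content" field and the values below are made-up sample data.
sample_lines = [
    '{"chunk_index": 0, "content": "first chunk", "metadata": {"source": "a"}}',
    "",
    '{"chunk_index": 1, "content": "second chunk", "metadata": {"source": "b"}}',
]
for record in read_jsonl_records(sample_lines):
    print(record)
```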

In [0]:
# Put columns in the same order for readability (unionByName matches columns by name, so this is not strictly required)
aws_docs = aws_docs.select(sorted(aws_docs.columns))
gcp_docs = gcp_docs.select(sorted(gcp_docs.columns))
azure_docs = azure_docs.select(sorted(azure_docs.columns))

# Merge DataFrames using unionByName
all_docs = aws_docs.unionByName(gcp_docs).unionByName(azure_docs)

display(all_docs)

In [0]:
# Unity Catalog (UC) table that stores the chunked documents
DBDOCS_CHUNKS_DELTA_TABLE = f"{CATALOG}.{SCHEMA}.db_docs_bronze"
print(DBDOCS_CHUNKS_DELTA_TABLE)

# Rename chunk_index to unique_chunk_index (withColumnRenamed is a no-op if the column does not exist)
all_docs = all_docs.withColumnRenamed("chunk_index", "unique_chunk_index")

all_docs.write.format("delta").mode("overwrite").saveAsTable(DBDOCS_CHUNKS_DELTA_TABLE)
spark.sql(
    f"ALTER TABLE {DBDOCS_CHUNKS_DELTA_TABLE} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)