# 01. Data Ingestion

## Introduction
Welcome to the **Databricks RAG (Retrieval-Augmented Generation) Demo**!

In this series of notebooks, we will build a complete AI system that can answer questions based on your own documents.

### What is this notebook for?
This first notebook is all about **getting data ready**. We will:
1.  Create a database (Schema) to organize our tables.
2.  Create some dummy text files (simulating uploaded PDFs or docs).
3.  Read these files and save them into a **Delta Table**.

### What is Delta Lake?
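Delta Lake is the storage layer we use. It stores data as Parquet files plus a transaction log, and that log is what adds:
- **ACID Transactions**: Each write either commits fully or has no effect, so readers never see partially written data.
- **Versioning**: Every write creates a new table version, so you can see the history of changes and query older versions.
- **Speed**: File-level statistics kept in the log let queries skip data files they don't need.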

## Step 1: Setup Database
We need a place to store our tables. In Databricks, we use a **Schema** (also called a Database).

In [None]:
# Create a new schema named 'rag_demo' if it doesn't exist yet
spark.sql("CREATE SCHEMA IF NOT EXISTS rag_demo")

# Switch to this schema so all future tables are created here
spark.sql("USE rag_demo")

# NOTE: We are using the default Hive Metastore (or local DB), which works on Community Edition.
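
To confirm that the switch took effect, the active schema can be queried. The sketch below wraps the create, switch, and check steps in one function that takes the notebook's `spark` session as an argument. The helper name `use_schema` and the identifier check are assumptions added for illustration; they are not part of the original notebook.

```python
import re


def use_schema(spark, schema_name):
    """Create the schema if needed, switch to it, and return the active schema name.

    `use_schema` is a hypothetical helper; `spark` is the SparkSession that
    Databricks notebooks provide automatically.
    """
    # Reject names that are not plain identifiers before building SQL strings.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema_name):
        raise ValueError(f"Invalid schema name: {schema_name!r}")

    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
    spark.sql(f"USE {schema_name}")

    # current_database() reports the schema that unqualified table names resolve to.
    return spark.sql("SELECT current_database() AS db").first()["db"]
```

In the notebook, `use_schema(spark, "rag_demo")` would be expected to return `rag_demo`.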

## Step 2: Create Sample Data
Since we don't have external files yet, we will use Python to create some simple text files directly in the Databricks File System (DBFS).

In [None]:
import os

# Define the folder where we will save our raw text files
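# /dbfs/ is the local mount of DBFS, so regular Python file APIs (os, open) can write to it.
# If the /dbfs mount is not available on your cluster, dbutils.fs.put() can write the files instead.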
raw_data_path = "/dbfs/FileStore/rag_data/"

# Create the directory if it doesn't exist
os.makedirs(raw_data_path, exist_ok=True)

# These are our sample documents. In a real project, these would be your PDFs or Docs.
sample_docs = {
    "doc1.txt": "Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.",
    "doc2.txt": "Databricks is a unified data analytics platform for massive scale data engineering and data science.",
    "doc3.txt": "Retrieval-Augmented Generation (RAG) combines an LLM with a retrieval system to provide accurate, up-to-date answers.",
    "doc4.txt": "Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters."
}

# Loop through the dictionary and write each document to its own file
for filename, content in sample_docs.items():
    file_path = os.path.join(raw_data_path, filename)
    # An explicit encoding keeps the files identical across environments
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)

print(f"Success! Created sample files in {raw_data_path}")
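
Step 2 writes through the local path `/dbfs/FileStore/rag_data/`, while Spark addresses the same location as `dbfs:/FileStore/rag_data/`. A minimal sketch of the mapping between the two forms follows. `to_spark_uri` is a hypothetical helper name, and the code assumes the standard `/dbfs` mount point.

```python
def to_spark_uri(local_path):
    """Convert a local mount path such as /dbfs/FileStore/x to dbfs:/FileStore/x."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError(f"Not a /dbfs path: {local_path!r}")
    # Drop the mount prefix and put the dbfs: scheme in front of the rest.
    return "dbfs:/" + local_path[len(prefix):]


raw_data_path = "/dbfs/FileStore/rag_data/"
print(to_spark_uri(raw_data_path))  # dbfs:/FileStore/rag_data/
```

Deriving the Spark path from `raw_data_path` this way keeps the write location and the read location from drifting apart if the folder is renamed later.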

## Step 3: Ingest Data into Delta
Now we use **PySpark** to read those text files and save them into a structured table.

In [None]:
from pyspark.sql.functions import input_file_name, current_timestamp

# 1. Read the text files
# 'wholetext' reads each file's entire content into a single row
# (the default would produce one row per line of text)
df_raw = spark.read.format("text") \
    .option("wholetext", True) \
    .load("dbfs:/FileStore/rag_data/*.txt")

# 2. Add some useful metadata columns
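# input_file_name() tells us which file the row came from
# (on compute where input_file_name() is unavailable, the _metadata.file_path column may serve the same purpose)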
df_raw = df_raw.withColumnRenamed("value", "raw_content") \
    .withColumn("source_file", input_file_name()) \
    .withColumn("ingestion_time", current_timestamp())

# 3. Display the dataframe to check our work
display(df_raw)
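
Before saving, a quick sanity check can confirm that each sample file produced exactly one non-empty row. The sketch below is plain Python that works on `(source_file, raw_content)` pairs, which in the notebook could come from `df_raw.select("source_file", "raw_content").collect()`. The function name `check_ingestion` and the shortened example rows are assumptions for illustration.

```python
import os


def check_ingestion(rows, expected_files):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    seen = {}
    for source_file, raw_content in rows:
        # Keep only the file name so full dbfs:/ paths compare cleanly.
        name = os.path.basename(source_file)
        seen[name] = seen.get(name, 0) + 1
        if not raw_content.strip():
            problems.append(f"{name}: empty content")
    for name in expected_files:
        count = seen.get(name, 0)
        if count != 1:
            problems.append(f"{name}: expected 1 row, found {count}")
    return problems


# Illustrative rows with shortened content; doc3.txt is deliberately missing.
example_rows = [
    ("dbfs:/FileStore/rag_data/doc1.txt", "Delta Lake sample text"),
    ("dbfs:/FileStore/rag_data/doc2.txt", "Databricks sample text"),
]
print(check_ingestion(example_rows, ["doc1.txt", "doc2.txt", "doc3.txt"]))
# ['doc3.txt: expected 1 row, found 0']
```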

## Step 4: Save as Table
Finally, we save this dataframe as a **Delta Table** named `raw_documents`. This is our "Bronze" layer (raw data).

In [None]:
# Write the data to a table
# mode("overwrite") means if the table exists, replace it entirely
df_raw.write.format("delta").mode("overwrite") \
    .saveAsTable("raw_documents")

print("Table 'raw_documents' created successfully!")

In [None]:
# Let's verify using SQL
display(spark.sql("SELECT * FROM raw_documents"))
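
Because `raw_documents` is a Delta table, each write is recorded as a new table version. The sketch below shows one way to inspect that history and read an older version, assuming the notebook's `spark` session is passed in. `table_history` and `read_version` are hypothetical helper names; `DESCRIBE HISTORY` and `VERSION AS OF` are Delta Lake SQL.

```python
def table_history(spark, table_name):
    """Return a DataFrame with one row per commit made to the table."""
    return spark.sql(f"DESCRIBE HISTORY {table_name}")


def read_version(spark, table_name, version):
    """Return the table contents as of the given version number (time travel)."""
    # int() ensures only a plain number is interpolated into the query.
    return spark.sql(f"SELECT * FROM {table_name} VERSION AS OF {int(version)}")
```

In the notebook, `display(table_history(spark, "raw_documents"))` lists the commits. If Step 4 has been run more than once, `read_version(spark, "raw_documents", 0)` would be expected to return the rows from the first write.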