# 01. Data Ingestion

## Introduction
Welcome to the **Databricks RAG (Retrieval-Augmented Generation) Demo**!

In this series of notebooks, we will build a complete AI system that can answer questions based on your own documents.

### What is this notebook for?
This first notebook is all about **getting data ready**. We will:
1.  Create a database (Schema) to organize our tables.
2.  Create some dummy text data (simulating uploaded PDFs or docs).
3.  Save this data into a **Delta Table**.

### What is Delta Lake?
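Delta Lake is the storage layer we use. It stores data as Parquet files plus a transaction log, which adds:
- **ACID Transactions**: Each write either fully succeeds or has no effect, so readers never see half-written data.
- **Versioning**: Every write creates a new table version, so you can inspect the history of changes and query older versions.
- **Speed**: File-level statistics in the log let queries skip files that cannot match a filter.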
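
To make the transaction and versioning ideas concrete, here is a small plain-Python toy model. It is only a sketch: `ToyVersionedTable` and its methods are hypothetical names invented for illustration, not part of Delta Lake, and it keeps a full snapshot per version for simplicity, whereas a real Delta table tracks file-level changes in its log.

```python
import copy


class ToyVersionedTable:
    """Toy model of a versioned table (hypothetical; not Delta Lake's API)."""

    def __init__(self):
        # One full snapshot per committed version (a simplification).
        self._versions = []

    def commit(self, rows):
        """Validate all rows first, then publish the snapshot in one step."""
        staged = [dict(row) for row in rows]
        for row in staged:
            if "source_file" not in row:
                # Nothing has been published yet, so a failure leaves no trace.
                raise ValueError("every row needs a 'source_file' key")
        self._versions.append(staged)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Return a copy of the requested version (latest by default)."""
        if not self._versions:
            return []
        if version is None:
            version = len(self._versions) - 1
        return copy.deepcopy(self._versions[version])

    def history(self):
        """List the version numbers committed so far."""
        return list(range(len(self._versions)))


table = ToyVersionedTable()
v0 = table.commit([{"source_file": "doc1.txt"}])
v1 = table.commit([{"source_file": "doc1.txt"}, {"source_file": "doc2.txt"}])

try:
    # The second row is invalid, so the whole commit is rejected.
    table.commit([{"source_file": "doc3.txt"}, {"raw_content": "no file name"}])
except ValueError:
    pass

print(table.history())             # versions that were actually committed
print(len(table.read(version=v0)))  # row count of the first version
print(len(table.read()))            # row count of the latest version
```

The rejected commit leaves the history unchanged, which is the all-or-nothing behavior the ACID bullet describes, and the older snapshot stays readable after later writes.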

## Step 1: Setup Database
We need a place to store our tables. In Databricks, we use a **Schema** (also called a Database).

In [None]:
from pyspark.sql import SparkSession

# On Databricks a `spark` session already exists; getOrCreate() reuses it
print("Initializing SparkSession...")
spark = SparkSession.builder.getOrCreate()

# Create a new schema named 'rag_demo' if it doesn't exist yet
spark.sql("CREATE SCHEMA IF NOT EXISTS rag_demo")

# Switch to this schema so all future tables are created here
spark.sql("USE rag_demo")

# NOTE: This uses the default Hive Metastore (or a local database), which works on Community Edition.
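
Because `USE rag_demo` changes session state, later cells depend on it having run. One possible alternative is to refer to tables by their qualified `schema.table` name. The helper below is a minimal sketch in plain Python: `qualified_table_name` is a hypothetical function written for this example, and the identifier rule it enforces (letters, digits, underscores) is a deliberately conservative assumption rather than Spark's full naming grammar.

```python
import re

# Conservative identifier pattern (an assumption for this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_table_name(schema_name, table_name):
    """Return 'schema.table' after checking both parts are plain identifiers."""
    for part in (schema_name, table_name):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{schema_name}.{table_name}"


full_name = qualified_table_name("rag_demo", "raw_documents")
print(full_name)
```

The resulting string can be passed wherever a table name is expected, for example to `saveAsTable` or `spark.table`.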

## Step 2: Create Sample Data
Since we are in a restricted environment where file system access might be limited, we will create the data directly in memory using Spark.

In [None]:
from pyspark.sql.functions import current_timestamp
from pyspark.sql.types import StringType, StructType, StructField

# Sample data simulating documents
sample_docs = [
    ("doc1.txt", "Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads."),
    ("doc2.txt", "Databricks is a unified data analytics platform for massive scale data engineering and data science."),
    ("doc3.txt", "Retrieval-Augmented Generation (RAG) combines an LLM with a retrieval system to provide accurate, up-to-date answers."),
    ("doc4.txt", "Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.")
]

# Define schema
schema = StructType([
    StructField("source_file", StringType(), True),
    StructField("raw_content", StringType(), True)
])

# Create DataFrame directly
df_raw = spark.createDataFrame(sample_docs, schema)

# Add ingestion time
df_raw = df_raw.withColumn("ingestion_time", current_timestamp())

display(df_raw)
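
Before loading real documents instead of the hard-coded samples, it can help to sanity-check the `(source_file, raw_content)` pairs. The following is a hedged sketch in plain Python: `check_docs` and `example_docs` are hypothetical names introduced here, and the two rules (unique file names, non-empty content) are assumptions about what "clean" input means for this demo.

```python
from collections import Counter


def check_docs(docs):
    """Return readable problems found in (source_file, raw_content) pairs."""
    problems = []

    # Rule 1 (assumed): each source_file should appear only once.
    name_counts = Counter(name for name, _ in docs)
    for name, count in name_counts.items():
        if count > 1:
            problems.append(f"duplicate source_file: {name}")

    # Rule 2 (assumed): raw_content must contain non-whitespace text.
    for name, content in docs:
        if not content or not content.strip():
            problems.append(f"empty raw_content: {name}")

    return problems


# Illustrative input with one duplicate name and one blank document.
example_docs = [
    ("doc1.txt", "Delta Lake is an open-source storage layer."),
    ("doc2.txt", "   "),
    ("doc1.txt", "A second row reusing the first file name."),
]

problems = check_docs(example_docs)
for problem in problems:
    print(problem)
```

An empty result means the list passes both checks and can be handed to `spark.createDataFrame` as in the cell above.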

## Step 3: Save as Table
Finally, we save this DataFrame as a **Delta Table** named `raw_documents`. This is our "Bronze" layer (raw data).

In [None]:
# Write the data to a table
# mode("overwrite") means if the table exists, replace it entirely
df_raw.write.format("delta").mode("overwrite") \
    .saveAsTable("raw_documents")

print("Table 'raw_documents' created successfully!")

In [None]:
# Let's verify using SQL
display(spark.sql("SELECT * FROM raw_documents"))