# Unstructured Data Processing for Knowledge Base

This notebook loads unstructured text documents (FAQs and guides) from S3 and saves them as Delta tables, one row per line of text with a `source` column, for use in AI/RAG applications.

## Overview
- Load multiple text files from S3
- Store each line of text as a row tagged with its source document
- Save as external Delta tables
- Prepare knowledge base for vector search

## 1. Setup and Installation

Install the `databricks-vectorsearch` package, then restart the Python process so the newly installed package can be imported.

In [0]:
%pip install databricks-vectorsearch

Collecting databricks-vectorsearch
  Downloading databricks_vectorsearch-0.57-py3-none-any.whl.metadata (2.8 kB)
Collecting deprecation>=2 (from databricks-vectorsearch)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Downloading databricks_vectorsearch-0.57-py3-none-any.whl (16 kB)
Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: deprecation, databricks-vectorsearch
Successfully installed databricks-vectorsearch-0.57 deprecation-2.1.0
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
dbutils.library.restartPython()
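
After the restart, it can be useful to confirm that the package is visible to the new Python process. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the notebook.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package name taken from the %pip cell above.
version = installed_version("databricks-vectorsearch")
if version is None:
    print("databricks-vectorsearch is not installed in this environment")
else:
    print(f"databricks-vectorsearch {version} is available")
```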

## 2. Process Unstructured Knowledge Base Files

Load and structure multiple FAQ and guide documents from S3 storage.
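
As a rough illustration of the row shape this step produces, the sketch below mimics it in plain Python: one row per line of text, with the source name attached to every row. The function `to_rows` is a hypothetical stand-in for illustration only; the notebook itself does this with Spark.

```python
from typing import Dict, List


def to_rows(text: str, source: str) -> List[Dict[str, str]]:
    """Split text into one row per line and tag each row with its source."""
    return [{"content": line, "source": source} for line in text.splitlines()]


# Sample text is an assumption modeled on the first lines of the billing FAQ.
sample_text = "# Billing FAQ - TechCorp Customer Support\n\n## Payment Methods\n"

for row in to_rows(sample_text, "billing_faq"):
    print(row)
```

Blank lines in the file become rows with empty `content`, which is why empty rows appear in the cell output further down.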

In [0]:
# Databricks Notebook: 02_Load_Unstructured_to_Delta (Fixed for External S3)

from pyspark.sql.functions import lit

# Step 1: Set catalog and schema
spark.sql("USE CATALOG workspace")
spark.sql("USE SCHEMA smart_support")

# Step 2: Define source paths (one text file per knowledge base document)
s3_base_path = "s3://awsdbjuly"
files = {
    "billing_faq": f"{s3_base_path}/billing_faq.txt",
    "product_guide": f"{s3_base_path}/product_guide.txt",
    "technical_faq": f"{s3_base_path}/technical_faq.txt"
}

# Step 3: Read files and save as external Delta tables

for name, path in files.items():
    print(f"📂 Processing: {name} from {path}")

    # Read the text file as one row per line, then tag each row with its source name
    df = spark.read.text(path).withColumnRenamed("value", "content")
    df = df.withColumn("source", lit(name))
    display(df)

    # Define S3 target Delta path
    delta_path = f"{s3_base_path}/bronze/{name}"

    # Save as Delta to S3
    df.write.format("delta").mode("overwrite").save(delta_path)

    # Register external Delta table
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS bronze_{name}
        USING DELTA
        LOCATION '{delta_path}'
    """)

    print(f"✅ Saved as external Delta table: bronze_{name}")


📂 Processing: billing_faq from s3://awsdbjuly/billing_faq.txt


| content | source |
| --- | --- |
| # Billing FAQ - TechCorp Customer Support | billing_faq |
|  | billing_faq |
| ## Payment Methods | billing_faq |
| Q: What payment methods do you accept? | billing_faq |
| A: We accept all major credit cards (Visa, MasterCard, American Express), PayPal, and wire transfers for enterprise customers. | billing_faq |
|  | billing_faq |
| ## Billing Cycles | billing_faq |
| Q: When will I be charged? | billing_faq |
| A: Billing occurs monthly on the anniversary of your subscription start date. Enterprise customers can request quarterly or annual billing. | billing_faq |
|  | billing_faq |


✅ Saved as external Delta table: bronze_billing_faq
📂 Processing: product_guide from s3://awsdbjuly/product_guide.txt


| content | source |
| --- | --- |
| # Product User Guide - TechCorp Platform | product_guide |
|  | product_guide |
| ## Getting Started | product_guide |
| Welcome to TechCorp Platform! This guide will help you get up and running with our data analytics and machine learning tools. | product_guide |
|  | product_guide |
| ## DataLake Pro | product_guide |
| Our flagship data storage solution provides scalable, secure data lake capabilities: | product_guide |
| - Supports structured and unstructured data | product_guide |
| - Built-in data governance and lineage tracking | product_guide |
| - Automated backup and disaster recovery | product_guide |


✅ Saved as external Delta table: bronze_product_guide
📂 Processing: technical_faq from s3://awsdbjuly/technical_faq.txt


| content | source |
| --- | --- |
| # Technical Support FAQ - TechCorp | technical_faq |
|  | technical_faq |
| ## Account Access | technical_faq |
| Q: I can't log into my account, what should I do? | technical_faq |
| A: First, try resetting your password using the "Forgot Password" link. If that doesn't work, check if your account has been locked due to multiple failed attempts. | technical_faq |
|  | technical_faq |
| ## Data Pipeline Issues | technical_faq |
| Q: My data pipeline is failing, how do I troubleshoot? | technical_faq |
| A: Check the pipeline logs in your dashboard. Common issues include: insufficient permissions, data format mismatches, or network connectivity problems. | technical_faq |
|  | technical_faq |


✅ Saved as external Delta table: bronze_technical_faq
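
The output above shows that each table holds one row per line, so a question and its answer sit in separate rows. For retrieval, it may help to group them into one record per question/answer pair. The sketch below shows one possible way to do that in plain Python, assuming the `##`, `Q:` and `A:` prefixes seen in the FAQ output; the function `group_faq_lines` and its record fields are hypothetical and not part of the notebook.

```python
from typing import Dict, List, Optional


def group_faq_lines(lines: List[str], source: str) -> List[Dict[str, str]]:
    """Group line-level FAQ rows into one record per question/answer pair."""
    records: List[Dict[str, str]] = []
    section = ""
    question: Optional[str] = None
    for line in lines:
        text = line.strip()
        if text.startswith("## "):
            # A section heading resets any pending question.
            section = text[3:]
            question = None
        elif text.startswith("Q:"):
            question = text[2:].strip()
        elif text.startswith("A:") and question is not None:
            records.append({
                "source": source,
                "section": section,
                "question": question,
                "answer": text[2:].strip(),
            })
            question = None
    return records


# Lines copied from the billing_faq output above.
sample = [
    "# Billing FAQ - TechCorp Customer Support",
    "",
    "## Payment Methods",
    "Q: What payment methods do you accept?",
    "A: We accept all major credit cards (Visa, MasterCard, American Express), "
    "PayPal, and wire transfers for enterprise customers.",
    "",
    "## Billing Cycles",
    "Q: When will I be charged?",
    "A: Billing occurs monthly on the anniversary of your subscription start date. "
    "Enterprise customers can request quarterly or annual billing.",
]

for record in group_faq_lines(sample, "billing_faq"):
    print(record["section"], "|", record["question"])
```

This grouping depends on the lines arriving in file order. Rows read back from a Delta table are not guaranteed to keep that order, so a real pipeline would likely need to record a line number at load time and sort by it first.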
