# 02. Generate Chunks

## Why do we need chunking?
Large Language Models (LLMs) can only read a limited amount of text at once (this limit is called the **Context Window**).
If we try to feed an entire book into an LLM, the input will typically be rejected or truncated, so part of the text never reaches the model.

**Chunking** breaks our large documents into smaller, manageable pieces (e.g., 500 characters or tokens). This allows us to find the specific paragraph that answers a user's question.
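
As a rough illustration of the idea (not the method used later in this notebook), fixed-size chunking with overlap can be sketched in plain Python. The function name `chunk_by_chars` and its default values are assumptions chosen to mirror the 500-character example above.

```python
from typing import List


def chunk_by_chars(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that context
    near a boundary appears in both chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# A 1200-character sample yields three chunks of 500, 500 and 300 characters
sample = "".join(chr(97 + i % 26) for i in range(1200))
chunks = chunk_by_chars(sample)
print([len(c) for c in chunks])
```

This naive version cuts wherever the character count runs out, including in the middle of a word, which is why the rest of the notebook uses a separator-aware splitter.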

## Step 1: Install Libraries
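We need `langchain`, a popular library for building LLM applications. Its text splitters also ship as the separate `langchain-text-splitters` package, so we install that explicitly as well.

In [None]:
%pip install langchain langchain-text-splitters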

## Step 2: Load Raw Data
We load the `raw_documents` table we created in the previous notebook.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Load the table into a DataFrame
# Make sure we are using the correct schema
spark.sql("USE rag_demo")

df_raw = spark.table("raw_documents")
display(df_raw)

## Step 3: Clean Text
Before chunking, it's good practice to clean the text. We will remove HTML tags, collapse extra whitespace, and convert the text to lowercase.

In [None]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import re

# Define a plain Python function to clean text
def clean_text(text):
    if text is None:
        return ""
    # Remove HTML tags like <div> or <br>
    text = re.sub(r'<[^>]+>', '', text)
    # Collapse any run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Convert to lowercase for consistency
    return text.lower()

# Convert the Python function to a Spark UDF (User Defined Function)
clean_text_udf = udf(clean_text, StringType())

# Apply the cleaning function to our dataframe
df_cleaned = df_raw.withColumn("cleaned_content", clean_text_udf(col("raw_content")))

display(df_cleaned)
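
Because `\s+` in `clean_text` also matches newlines, paragraph breaks are removed before the text reaches the splitter in Step 4. If you would rather keep them, one possible variant is sketched below. The name `clean_text_keep_paragraphs` is hypothetical and not part of the original notebook; it keeps the lowercasing step for consistency with `clean_text`.

```python
import re


def clean_text_keep_paragraphs(text):
    """Hypothetical variant of clean_text that preserves paragraph breaks."""
    if text is None:
        return ""
    # Remove HTML tags like <div> or <br>
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces and tabs, but leave newlines alone
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a stray space directly before or after a newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip().lower()


sample_html = "<div>Hello   World</div>\n\n\n<p>Second   paragraph.</p>  "
print(repr(clean_text_keep_paragraphs(sample_html)))
# → 'hello world\n\nsecond paragraph.'
```

It could be wrapped with `udf(..., StringType())` in the same way as `clean_text`.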

## Step 4: Split Text into Chunks
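We will use `RecursiveCharacterTextSplitter` from LangChain. By default it tries to split at paragraph breaks first, then at single newlines, then at spaces, and only cuts in the middle of a word as a last resort. This keeps related text together where possible, although a chunk boundary can still fall in the middle of a sentence.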

In [None]:
from pyspark.sql.functions import pandas_udf, explode
from pyspark.sql.types import ArrayType, StringType
import pandas as pd
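# The splitter lives in the langchain-text-splitters package; fall back to the
# older import path for LangChain versions that still expose it there
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter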

# Configuration
CHUNK_SIZE = 500    # Maximum number of characters per chunk
CHUNK_OVERLAP = 50  # Characters shared by neighbouring chunks to maintain context

# A Pandas UDF receives a whole batch of rows as a pandas Series, which cuts
# serialization overhead compared with a row-at-a-time UDF
@pandas_udf(ArrayType(StringType()))
def chunk_text_udf(content_series: pd.Series) -> pd.Series:
    # Initialize the splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
    )
    
    # Helper function to apply to each row
    def split_content(content):
        if not content:
            return []
        return splitter.split_text(content)
    
    # Apply to the whole series (column)
    return content_series.apply(split_content)
    

# Apply the chunking UDF
# This creates a list of chunks for each file
df_chunked = df_cleaned.withColumn("chunks", chunk_text_udf(col("cleaned_content")))

# 'Explode' the list so that each chunk gets its own row
df_exploded = df_chunked.select(
    col("source_file"),
    explode(col("chunks")).alias("chunk_text")
)

display(df_exploded)
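
To make the idea of trying separators from coarse to fine more concrete, here is a simplified pure-Python sketch. It is an illustration only, not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores chunk overlap, and it drops empty pieces.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of separator-based recursive splitting."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty separator): cut at fixed positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Too big even on its own: flush what we have, then retry
            # this part with the next, finer separator
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        # Greedily merge neighbouring parts while they still fit
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


demo = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
demo_chunks = recursive_split(demo, chunk_size=20)
print(demo_chunks)
# → ['Alpha beta gamma.', 'Delta epsilon zeta', 'eta theta.', 'Iota.']
```

The second paragraph is longer than 20 characters, so it is split at a space rather than at a paragraph break, which shows that a sentence can still end up spread over two chunks.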

## Step 5: Save to Silver Table
We'll add a unique ID to each chunk and save it as `silver_chunks`. This is our processed data.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

# Add a unique ID column (IDs are unique and increasing, but not consecutive)
df_final = df_exploded.withColumn("chunk_id", monotonically_increasing_id())

# Save to Delta table
df_final.write.format("delta").mode("overwrite").saveAsTable("silver_chunks")

print("Success! Saved chunks to 'silver_chunks' table.")
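
IDs from `monotonically_increasing_id()` depend on how the data is partitioned, so they can change when the notebook is re-run. If stable IDs matter for your use case, one possible approach is to derive the ID from the chunk itself. The sketch below is plain Python; `make_chunk_id` and the `chunk_index` argument are hypothetical and do not exist in the notebook above (in Spark, a position column could be obtained with `posexplode` instead of `explode`).

```python
import hashlib


def make_chunk_id(source_file: str, chunk_index: int, chunk_text: str) -> str:
    """Hypothetical helper: build a repeatable ID from the chunk's origin and content."""
    key = f"{source_file}|{chunk_index}|{chunk_text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same inputs always give the same ID; a different position gives a different one
id_a = make_chunk_id("guide.txt", 0, "hello world")
id_b = make_chunk_id("guide.txt", 0, "hello world")
id_c = make_chunk_id("guide.txt", 1, "hello world")
print(id_a == id_b, id_a == id_c)
# → True False
```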