# Snowflake RAG Pipeline Tutorial

## Introduction

In this tutorial, we'll build a basic Retrieval-Augmented Generation (RAG) pipeline with Snowflake, LlamaIndex, and Snowflake Cortex inside a Snowflake native notebook. We'll take unstructured documents staged in Snowflake, preprocess them with the Cortex SQL functions `PARSE_DOCUMENT` and `SPLIT_TEXT_RECURSIVE_CHARACTER`, build a vector index with LlamaIndex, and answer questions using the retrieved text as context.

---

## Prerequisites

- Access to a Snowflake account with permissions to create databases, schemas, and execute Python code.
- Familiarity with Snowflake's native notebook in Snowsight.
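- Access to Snowflake Cortex functions (typically granted through the `SNOWFLAKE.CORTEX_USER` database role) in a region where the models you plan to use are available.
- The LlamaIndex library (`llama-index-core`) available in your notebook environment.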

---

## Steps Overview

1. **Data Preparation**: Load and preprocess documents using Snowflake Cortex SQL functions.
2. **Index Building**: Use LlamaIndex to build a vector index over the preprocessed chunks.
3. **Setting Up Cortex LLM**: Wrap a Snowflake Cortex language model so LlamaIndex can call it.
4. **Querying the Index**: Compare a raw LLM answer with a RAG answer to the same question.
5. **Cortex Use Cases**: Demonstrate simple use cases with Cortex.

---

## 1. Data Preparation

### 1.1 Load Documents into Snowflake

First, we'll create a table to store document metadata. We'll assume the files themselves sit in an external stage (e.g., one backed by AWS S3) that Snowflake can read.

In [None]:
-- Create a table to store document metadata
CREATE OR REPLACE TABLE document_files (
    id INT AUTOINCREMENT,
    file_name STRING,
    file_url STRING
);

Insert the document metadata into the table.

In [None]:
-- Insert document metadata (replace with your actual file URLs)
INSERT INTO document_files (file_name, file_url) VALUES
('document1.pdf', 's3://your-bucket/document1.pdf'),
('document2.pdf', 's3://your-bucket/document2.pdf'),
('document3.pdf', 's3://your-bucket/document3.pdf');

### 1.2 Extract Text Using PARSE_DOCUMENT

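Use the `SNOWFLAKE.CORTEX.PARSE_DOCUMENT` function to extract text from the documents. It takes a stage, a file path relative to that stage, and an options object. The stage name `@doc_stage` below is a placeholder for the external stage that holds your files, and `file_name` is assumed to match each file's path within that stage.

In [None]:
-- Create a table to store extracted text
-- @doc_stage is a placeholder: replace it with your own stage
CREATE OR REPLACE TABLE document_texts AS
SELECT
    id,
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
        @doc_stage,
        file_name,
        {'mode': 'LAYOUT'}
    ):content::STRING AS content
FROM document_files;

`PARSE_DOCUMENT` returns a JSON object, and the `:content` path pulls the extracted text into the `content` column. `LAYOUT` mode preserves structure such as headings and tables, while `OCR` mode extracts the text only.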
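
If you keep the full `PARSE_DOCUMENT` result and read it from Python instead of unpacking it in SQL, it typically arrives as a JSON string. The sketch below shows one way to pull the text out of such a string. The sample payload and the `extract_content` helper are made up for illustration; the fields in a real result may differ in your account.

```python
import json

# Illustrative payload only: a real value would come from PARSE_DOCUMENT.
sample_result = '{"content": "# Quarterly Report\\n\\nRevenue grew 12%.", "metadata": {"pageCount": 1}}'


def extract_content(raw_result: str) -> str:
    """Return the extracted text from a PARSE_DOCUMENT-style JSON string."""
    parsed = json.loads(raw_result)
    # Fall back to an empty string so later chunking code never receives None
    return parsed.get("content") or ""


text = extract_content(sample_result)
print(text.splitlines()[0])  # prints "# Quarterly Report"
```

Returning an empty string for a missing `content` key keeps the downstream steps simple, at the cost of hiding parse failures; logging such rows would be a reasonable addition.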

### 1.3 Split Text Using SPLIT_TEXT_RECURSIVE_CHARACTER

Now, we'll split the extracted text into smaller chunks suitable for embedding and indexing.

In [None]:
-- Create a table to store text chunks
CREATE OR REPLACE TABLE document_chunks AS
SELECT
    t.id,
    c.index AS chunk_id,          -- position of the chunk within its document
    c.value::STRING AS chunk_text
FROM document_texts AS t,
    LATERAL FLATTEN(
        input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
            t.content,
            'markdown',  -- use 'none' for plain text
            500,         -- maximum chunk size in characters
            50           -- overlap between consecutive chunks
        )
    ) AS c;

This splits each document into chunks of at most 500 characters, with up to 50 characters of overlap between consecutive chunks. `FLATTEN` exposes each chunk's position as `index`, which we store as `chunk_id`.
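
To build intuition for how chunk size and overlap interact, here is a simplified sliding-window splitter in plain Python. It is not the algorithm Cortex uses (the recursive splitter also looks for separators such as paragraph breaks); `split_with_overlap` is a hypothetical helper that shows only the size and overlap arithmetic.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Fixed-size sliding-window splitter (simplified; ignores separators)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of dummy text
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # prints [500, 500, 300]
```

With a chunk size of 500 and an overlap of 50, each window starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.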

---

## 2. Index Building with LlamaIndex

### 2.1 Import Libraries

In the first Python cell of your notebook, import the necessary libraries. The import paths below follow the `llama_index.core` package layout used by LlamaIndex 0.10 and later.

In [None]:
import json
from typing import Any, List

from snowflake.snowpark.context import get_active_session
from llama_index.core import VectorStoreIndex
from llama_index.core.embeddings import BaseEmbedding
from llama_index.core.schema import TextNode

### 2.2 Use the Active Snowflake Session

In the Snowflake native notebook environment, you can use the active session without creating a new connection.

In [None]:
# Get the active Snowflake session
session = get_active_session()

### 2.3 Define a Custom Embedding Class Using Cortex

We'll define a custom embedding class that calls the Cortex function `SNOWFLAKE.CORTEX.EMBED_TEXT_768`. LlamaIndex's `BaseEmbedding` requires the three underscore-prefixed methods shown below.

In [None]:
class CortexEmbedding(BaseEmbedding):
    session: Any = None  # active Snowpark session
    model: str = "snowflake-arctic-embed-m"  # example model for EMBED_TEXT_768

    def _embed(self, text: str) -> List[float]:
        # Bind parameters so quotes in the text cannot break the SQL.
        # Casting the VECTOR result to ARRAY makes Snowpark return a JSON string.
        rows = self.session.sql(
            "SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768(?, ?)::ARRAY AS EMBEDDING",
            params=[self.model, text],
        ).collect()
        return [float(x) for x in json.loads(rows[0]["EMBEDDING"])]

    def _get_text_embedding(self, text: str) -> List[float]:
        return self._embed(text)

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._embed(query)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._embed(query)

**Note**: `EMBED_TEXT_768` and the chosen embedding model must be available in your Snowflake region, and your role needs access to Cortex functions.

### 2.4 Build the Index

Now, we'll read the data from the `document_chunks` table and build the index. This issues one Cortex call per chunk, which is fine for a small demo but slow for large corpora.

In [None]:
# Read data from the document_chunks table
df_chunks = session.table("DOCUMENT_CHUNKS").to_pandas()

# Convert each chunk into a node, keeping its source identifiers as metadata
nodes = [
    TextNode(
        text=row["CHUNK_TEXT"],
        metadata={"doc_id": str(row["ID"]), "chunk_id": int(row["CHUNK_ID"])},
    )
    for _, row in df_chunks.iterrows()
]

# Initialize the embedding model
embed_model = CortexEmbedding(session=session)

# Build the vector index; by default it uses LlamaIndex's in-memory vector store
index = VectorStoreIndex(nodes, embed_model=embed_model)
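
At query time, a vector index embeds the question and ranks the stored chunks by how close their embeddings are to it. The following pure-Python sketch illustrates that ranking step with cosine similarity. The three-dimensional vectors are made up and stand in for real Cortex embeddings, and `top_k` is a hypothetical helper; LlamaIndex's own retrieval code differs in detail.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (illustrative values only)
toy_vectors = {
    "chunk-0": [1.0, 0.0, 0.0],
    "chunk-1": [0.0, 1.0, 0.0],
    "chunk-2": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
print([chunk_id for chunk_id, _ in results])  # prints ['chunk-0', 'chunk-2']
```

The chunks returned by this kind of ranking are what the language model later receives as context.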

---

## 3. Setting Up Cortex LLM

### 3.1 Define a Custom LLM Class for Snowflake Cortex

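We'll define a custom LLM class that calls `SNOWFLAKE.CORTEX.COMPLETE`. It subclasses LlamaIndex's `CustomLLM`, which requires a `metadata` property plus `complete` and `stream_complete` methods. The small `generate` helper returns the plain response string for direct use.

In [None]:
from typing import Any

from llama_index.core.llms import CompletionResponse, CustomLLM, LLMMetadata
from llama_index.core.llms.callbacks import llm_completion_callback

class CortexLLM(CustomLLM):
    session: Any = None  # active Snowpark session
    model: str = "mistral-large2"  # example; pick a model enabled in your region

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(model_name=self.model)

    def generate(self, prompt: str, **kwargs: Any) -> str:
        # Bind parameters so quotes in the prompt cannot break the SQL
        sql = "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS RESPONSE"
        return self.session.sql(sql, params=[self.model, prompt]).collect()[0]["RESPONSE"]

    @llm_completion_callback()
    def complete(self, prompt: str, formatted: bool = False, **kwargs: Any) -> CompletionResponse:
        return CompletionResponse(text=self.generate(prompt))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, formatted: bool = False, **kwargs: Any):
        # COMPLETE returns the whole answer at once, so yield it as a single chunk
        text = self.generate(prompt)
        yield CompletionResponse(text=text, delta=text)

**Note**: `COMPLETE` only works with models enabled in your Snowflake region. `mistral-large2` is an example, so substitute a model your account can use.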

### 3.2 Register the LLM and Create a Query Engine

Current LlamaIndex releases configure models through the global `Settings` object and run queries through a query engine, replacing the older `LLMPredictor` and `ServiceContext` classes.

In [None]:
from llama_index.core import Settings

# Instantiate the custom Cortex LLM
cortex_llm = CortexLLM(session=session)

# Make the Cortex models the defaults for LlamaIndex
Settings.llm = cortex_llm
Settings.embed_model = embed_model

# Create a query engine that retrieves the 3 closest chunks per question
query_engine = index.as_query_engine(llm=cortex_llm, similarity_top_k=3)

---

## 4. Querying the Index

### 4.1 Raw Input vs. RAG Output

#### Raw Input Example

In [None]:
# Raw input without RAG
raw_question = "What is Snowflake?"
print("Raw Question:", raw_question)

# Ask the model directly, with no retrieved context
try:
    raw_response = cortex_llm.generate(raw_question)
    print("Raw Response:", raw_response)
except Exception as e:
    print("Error generating response:", e)

#### RAG Output Example

In [None]:
# Query using the RAG pipeline
response = query_engine.query(raw_question)
print("RAG Response:", response)

### 4.2 Comparison

The raw response relies only on what the model learned during training, while the RAG response is grounded in the chunks retrieved from your documents. For questions your documents cover, the RAG response should be more specific and easier to verify.

---

## 5. Cortex Use Cases

### Use Case 1: Knowledge Base Question Answering

In [None]:
question = "How does Snowflake Cortex help with AI applications?"
response = query_engine.query(question)
print("Use Case 1 Response:", response)

### Use Case 2: Contextual Summarization

In [None]:
summary_prompt = "Summarize the key features of LlamaIndex."
response = query_engine.query(summary_prompt)
print("Use Case 2 Response:", response)

### Use Case 3: Personalized Recommendations

In [None]:
recommendation_prompt = "What are the next steps for integrating AI capabilities in our data platform?"
response = query_engine.query(recommendation_prompt)
print("Use Case 3 Response:", response)
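
Each query above follows the same pattern: the closest chunks are retrieved and placed in the prompt ahead of the question. The sketch below shows one way that assembly step can look. The template wording and the `build_rag_prompt` helper are assumptions made for illustration; LlamaIndex uses its own default prompt templates.

```python
from typing import List


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt."""
    # Number the chunks so an answer can refer back to its sources
    context = "\n\n".join(
        f"[{i + 1}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks standing in for real retrieval results
example_chunks = [
    "Cortex exposes language models through SQL functions.  ",
    "Chunks are embedded and stored in a vector index.",
]
prompt = build_rag_prompt("How does Cortex expose language models?", example_chunks)
print(prompt)
```

Instructing the model to rely only on the supplied context is what separates the RAG answer from the raw one: the question is the same, but the model now has your documents to draw on.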

---

## Conclusion

In this notebook, we've demonstrated how to:

- Preprocess documents using Snowflake Cortex's `PARSE_DOCUMENT` and `SPLIT_TEXT_RECURSIVE_CHARACTER` functions.
- Use the active Snowflake session within a native notebook.
- Build an index over the data using LlamaIndex with custom embedding and LLM classes that leverage Snowflake Cortex.
- Set up a RAG pipeline to handle queries using Snowflake Cortex.
- Compare raw input responses with RAG-augmented responses.
- Explore different use cases for Cortex within the RAG pipeline.

By leveraging these tools, you can build powerful AI applications that interact with your data stored in Snowflake.

---

## Next Steps

- **Expand Your Data**: Incorporate more extensive and diverse datasets.
- **Enhance the Model**: Experiment with different Cortex models or parameters.
- **Deploy Applications**: Integrate this pipeline into applications or dashboards for end-users.

---

**Notes**:

- Ensure that you have the necessary permissions to access Cortex functions in your Snowflake account.
- Replace placeholder values like `'s3://your-bucket/document1.pdf'` with your actual file URLs.
- The implementation details may vary depending on your Snowflake Cortex setup.

---

# Appendix: Additional Information

## Snowflake Cortex Functions

### PARSE_DOCUMENT

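Extracts text from a staged document (e.g., a PDF) and returns a JSON object whose `content` field holds the text. `LAYOUT` mode preserves structure such as headings and tables; `OCR` mode extracts the text only.

**Syntax**:

```sql
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
    '@<stage>',
    '<relative_path>',
    { 'mode': 'LAYOUT' | 'OCR' }
)
```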

### SPLIT_TEXT_RECURSIVE_CHARACTER

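Splits text into chunks by recursively applying a list of separators until each chunk fits within the maximum size, with optional overlap between consecutive chunks. Returns an array of strings.

**Syntax**:

```sql
SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
    '<text>',
    '<format>',        -- 'none' or 'markdown'
    <chunk_size>,
    [ <overlap> ],
    [ <separators> ]
)
```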

---

## LlamaIndex Overview

LlamaIndex is a library that facilitates the creation of indices over your data to be used with large language models. It allows for efficient querying and retrieval of relevant information, which is essential for building RAG pipelines.

---

# Final Remarks

By completing this tutorial, you have learned how to build a Retrieval-Augmented Generation pipeline within Snowflake's native notebook environment using LlamaIndex and Snowflake Cortex's new SQL functions. This powerful combination enables you to create intelligent applications that can interact with your data securely and efficiently.

Feel free to explore further and customize the pipeline to suit your specific needs!