# RAGStack and the Astra DB vector database

This notebook demonstrates a retrieval-augmented generation (RAG) pattern using RAGStack and the Astra DB vector database.

The pattern is:

1. Construct information base
2. Basic retrieval
3. Generation with augmented context
4. Advanced retrieval and generation
5. Evaluate quality


## Setup
RAGStack includes all the libraries you need for the RAG pattern, including the vector database, embeddings pipeline, and retrieval. The `datasets` package is installed alongside it to load the sample data used below.

In [1]:
!pip3 install ragstack-ai datasets

Collecting ragstack-ai
  Using cached ragstack_ai-0.1.0-py3-none-any.whl.metadata (1.4 kB)
Collecting datasets
  Using cached datasets-2.14.7-py3-none-any.whl.metadata (19 kB)
Collecting PyYAML>=5.3 (from ragstack-ai)
  Using cached PyYAML-6.0.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting SQLAlchemy<3,>=1.4 (from ragstack-ai)
  Using cached SQLAlchemy-2.0.23-cp311-cp311-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from ragstack-ai)
  Using cached aiohttp-3.8.6-cp311-cp311-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting anyio<4.0 (from ragstack-ai)
  Using cached anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting astrapy==0.5.8 (from ragstack-ai)
  Using cached astrapy-0.5.8-py3-none-any.whl.metadata (11 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from ragstack-ai)
  Using cached dataclasses_json-0.6.2-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from ragstack-ai)
  Using cached jsonpatch-1.33-py2.py3-none-

Import the necessary dependencies:

In [2]:
import getpass
from datasets import load_dataset
from langchain.vectorstores.astradb import AstraDB 
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document



Enter your Astra DB and OpenAI credentials and the name of the collection to use:

In [3]:
astra_token = getpass.getpass("Astra token:")
astra_endpoint = getpass.getpass("Astra db endpoint:")
openai_key = getpass.getpass("OpenAI Key:")
collection = input("Collection name:")  # not a secret, so the input can be echoed

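If you run the notebook non-interactively, prompting for every value can get in the way. The following is a minimal sketch, assuming you have exported the values as environment variables beforehand. The helper name `read_setting` and any variable names you pass to it are illustrative assumptions, not something RAGStack requires.

```python
import getpass
import os


def read_setting(env_name, prompt):
    """Return the value of env_name if it is set and non-empty, otherwise prompt for it."""
    value = os.environ.get(env_name)
    if value:
        return value
    # Fall back to an interactive prompt that does not echo the input.
    return getpass.getpass(prompt)
```

For example, `astra_token = read_setting("ASTRA_DB_APPLICATION_TOKEN", "Astra token:")` would only prompt when that (hypothetically named) variable is missing.
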
## RAG workflow

With your environment set up, you're ready to create a RAG workflow.

### Construct information base

Declare the embeddings model you'll use, then create the Astra DB vector store that uses it to embed documents and queries.

In [4]:
embedding = OpenAIEmbeddings(openai_api_key=openai_key)
vstore = AstraDB(
        collection_name=collection,
        embedding=embedding,
        token=astra_token,
        api_endpoint=astra_endpoint
    )
print("Astra configured")

Astra configured


Load a small dataset of philosopher quotes with the Hugging Face `datasets` library.

In [5]:
philo_dataset = load_dataset("datastax/philosopher-quotes")["train"]
print("An example entry:")
print(philo_dataset[16])

An example entry:
{'author': 'aristotle', 'quote': 'Love well, be loved and do something of value.', 'tags': 'love;ethics'}


Process the metadata and convert each dataset entry into a LangChain `Document`:

In [6]:
docs = []
for entry in philo_dataset:
    metadata = {"author": entry["author"]}
    if entry["tags"]:
        # Add metadata tags to the metadata dictionary
        for tag in entry["tags"].split(";"):
            metadata[tag] = "y"
    # Add a LangChain document with the quote and metadata tags
    doc = Document(page_content=entry["quote"], metadata=metadata)
    docs.append(doc)
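
To see what the loop above produces for a single entry, here is a small self-contained sketch that applies the same tag-splitting logic to the example entry printed earlier. The function name `build_metadata` is introduced only for illustration.

```python
def build_metadata(entry):
    """Mirror the metadata logic of the loop above for one dataset entry."""
    metadata = {"author": entry["author"]}
    if entry["tags"]:
        # Each semicolon-separated tag becomes its own key with the flag value "y".
        for tag in entry["tags"].split(";"):
            metadata[tag] = "y"
    return metadata


sample = {
    "author": "aristotle",
    "quote": "Love well, be loved and do something of value.",
    "tags": "love;ethics",
}
print(build_metadata(sample))
# {'author': 'aristotle', 'love': 'y', 'ethics': 'y'}
```

Storing one flag per tag, rather than a single delimited string, means each tag is an ordinary metadata key that can be matched on its own.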

Compute the embeddings and insert the documents into the vector store:

In [7]:
inserted_ids = vstore.add_documents(docs)
print(f"\nInserted {len(inserted_ids)} documents.")


Inserted 450 documents.
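
Conceptually, each inserted quote is stored together with a vector produced by the embedding model, and retrieval later ranks stored vectors by their similarity to the query vector. The sketch below illustrates that idea with a toy word-count "embedding" and cosine similarity. It is an assumption-laden stand-in: `toy_embed`, the vocabulary, and the sample strings are made up for illustration, and it is neither the OpenAI embedding model nor how Astra DB implements search.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a word-count vector over a fixed vocabulary (toy stand-in for a real model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary and sample texts, chosen only to keep the example small.
vocabulary = ["love", "value", "knowledge", "power"]
texts = [
    "love well be loved and do something of value",
    "knowledge is power",
]
vectors = [toy_embed(t, vocabulary) for t in texts]
query_vector = toy_embed("what is love", vocabulary)

# Rank the stored vectors by similarity to the query and keep the best match.
scores = [cosine_similarity(query_vector, v) for v in vectors]
best = max(range(len(texts)), key=lambda i: scores[i])
print(texts[best])
```

A real embedding model captures meaning rather than exact word overlap, but the ranking step follows the same pattern.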


### Basic retrieval



In [8]:
retriever = vstore.as_retriever()