<a href="https://colab.research.google.com/github/krishnannarayanaswamy/astra-langchain-chatbot/blob/main/RAG_files_astra_langchain_dev_jam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prerequisites

You will need a vector-enabled Astra database and an OpenAI account.

* Create an [Astra vector database](https://docs.datastax.com/en/astra-serverless/docs/getting-started/create-db-choices.html).
* Create an [OpenAI account](https://openai.com/).
* Within your database, create an [Astra DB Access Token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html) with Database Administrator permissions.
* Get your Astra DB Endpoint:
  * `https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com`


See the [Prerequisites](https://docs.datastax.com/en/ragstack/docs/prerequisites.html) page for more details.
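
As a small illustration of the endpoint format listed above, the following sketch assembles the URL from a database ID and a region. The helper name `astra_api_endpoint` and the sample values are hypothetical and only serve to show the pattern; copy the real endpoint from your Astra dashboard.

```python
def astra_api_endpoint(db_id: str, region: str) -> str:
    """Build an Astra DB API endpoint from a database ID and a region."""
    if not db_id or not region:
        raise ValueError("Both the database ID and the region are required.")
    return f"https://{db_id}-{region}.apps.astra.datastax.com"


# Placeholder values for illustration only; substitute your own.
print(astra_api_endpoint("01234567-89ab-cdef-0123-456789abcdef", "us-east1"))
```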

## Setup
`ragstack-ai` includes all the packages you need to build a RAG pipeline.

In [None]:
! pip install -q ragstack-ai pypdf

In [None]:
import os
from getpass import getpass

# Enter your settings for Astra DB and OpenAI:
os.environ["ASTRA_DB_API_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")
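
Before moving on, it can help to confirm that none of the three settings was left blank. The check below is a minimal sketch and not part of `ragstack-ai`; the helper name `missing_settings` is an assumption made for this example, and it only tests for empty values, not for valid credentials.

```python
import os

# The three environment variables this notebook sets in the cell above.
REQUIRED_SETTINGS = (
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
)


def missing_settings(env=None):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Demonstration with a stand-in mapping instead of the real environment.
example_env = {"ASTRA_DB_API_ENDPOINT": "https://example.invalid", "OPENAI_API_KEY": " "}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real environment, so an empty list means all three values were entered.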

## Create RAG Pipeline

### Embedding Model and Vector Store

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore
import os

# Configure your embedding model and vector store
embedding = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
vstore = AstraDBVectorStore(
    collection_name="thailand_dev_jam",
    embedding=embedding,
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)
print("Astra vector store configured")

In [None]:
# Retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]

In [None]:
# Alternatively, provide your own file. If you do, update your queries to match the content of your file.

# Upload a sample file (note: this cell assumes you are running on Google Colab).
# In a local Jupyter notebook, skip the upload and set the file path directly by uncommenting and running only the next line.
# SAMPLEDATA = ["<path_to_file>"]

from google.colab import files

print("Please upload your own sample file:")
uploaded = files.upload()
if uploaded:
    SAMPLEDATA = uploaded
else:
    raise ValueError("Cannot proceed without Sample Data. Please re-run the cell.")

print("Please make sure to change your queries to match the contents of your file!")

In [None]:
import os
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Loop through each file and collect its chunks for insertion into the vector store
documents = []
# SAMPLEDATA = ["CCD_0046.pdf"]
for filename in SAMPLEDATA:
    path = os.path.join(os.getcwd(), filename)

    # Supported file types are pdf and txt
    if filename.endswith(".pdf"):
        pdf_loader = PyPDFLoader(path)
        # Split the PDF into chunks; embeddings are created later, on insertion
        splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=64)
        chunks = pdf_loader.load_and_split(text_splitter=splitter)
        # Extend rather than reassign, so chunks from earlier files are kept
        documents.extend(chunks)
        print(f"Documents from PDF: {len(chunks)}.")
    elif filename.endswith(".txt"):
        loader = TextLoader(path)
        # With no arguments, load_and_split() falls back to a default text splitter
        documents.extend(loader.load_and_split())
        print(f"Processed txt file: {filename}")
    else:
        print(f"Unsupported file type: {filename}")

# Empty the list of file names in case this cell is run multiple times
SAMPLEDATA = []

print("\nProcessing done.")
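
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and sentences); the function `split_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Each chunk repeats the last character of the previous one.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With the notebook's settings (`chunk_size=2048`, `chunk_overlap=64`), the overlap is what lets a sentence that straddles a chunk boundary still appear in one piece in at least one chunk.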

In [None]:
# Create embeddings by inserting your documents into the vector store.
inserted_ids = vstore.add_documents(documents)
print(f"\nInserted {len(inserted_ids)} documents.")

In [None]:
# Check your collection to verify that the documents were embedded.
print(vstore.astra_db.collection("thailand_dev_jam").find())

### Basic Retrieval

Retrieve context from your vector database, and pass it to the model with a prompt.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vstore.as_retriever(search_kwargs={"k": 5})

prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer. Respond to the question in the same language as the user query.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI(model_name="gpt-4-0125-preview")

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke(
    "Tell me more"
)
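
Conceptually, the chain above does two things: it finds the stored chunks closest to the question, then places them in the prompt. The sketch below imitates both steps in plain Python with hand-written two-dimensional vectors. Everything in it (`cosine_similarity`, `top_k`, `build_prompt`, and the toy index) is a hypothetical stand-in; the real pipeline uses OpenAI embeddings and Astra DB's vector search, and the similarity metric is an assumption here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, indexed_chunks, k):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


def build_prompt(context_chunks, question):
    """Fill a prompt shaped like the notebook's template with retrieved context."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\nQuestion: {question}\nYour answer:"


# Toy index of (vector, text) pairs; real embeddings have many more dimensions.
toy_index = [
    ([1.0, 0.0], "Chunk about wine cellars."),
    ([0.0, 1.0], "Chunk about carnival season."),
    ([0.9, 0.1], "Chunk about a cask of Amontillado."),
]

selected = top_k([1.0, 0.0], toy_index, k=2)
print(build_prompt(selected, "What is stored in the cellar?"))
```

In the notebook, `search_kwargs={"k": 5}` plays the role of `k` here: it sets how many chunks the retriever passes to the prompt as context.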