## LangChain Testing with new Astra developer API
### Including condensed content embeddings using LangChain's ParentDocumentRetriever

This notebook tests various chains and modules from LangChain, using AstraDB both as the vector store and for vector search. The demo has been modified to connect to Astra through the new Astra Vector API for Python.

The demo includes modules that are important to Retrieval Augmented Generation (RAG) and to improving RAG results. For example, when splitting the raw text for RAG, what is the ideal length of each chunk? Where is the sweet spot?
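To make the chunk-length question concrete, here is a minimal sketch of fixed-size character chunking with an optional overlap. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers natural separators such as paragraphs and sentences); the function name `chunk_with_overlap` and the sample text are assumptions made for illustration only.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# hypothetical 1000-character sample text
sample_text = "".join(str(i % 10) for i in range(1000))

for size, overlap in [(500, 0), (100, 0), (500, 80)]:
    chunks = chunk_with_overlap(sample_text, size, overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: {len(chunks)} chunks")
```

Smaller chunks produce more, narrower embeddings that match a query precisely but carry little context; larger chunks carry more context per hit but blur the embedding.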

LangChain's ParentDocumentRetriever strikes a balance between small and large chunks. It condenses the content embeddings by performing top-K retrieval on small embedded chunks or sentences, but returns the expanded window (the larger parent chunk) or the full document.
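The idea can be sketched in plain Python: search over small child chunks, then hand back the larger parent each hit came from. This is a toy illustration, not LangChain's implementation. The helper names are hypothetical, and `word_overlap` is a stand-in for real embedding similarity.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents, parent_size, child_size):
    """Return (parents, children); every child remembers the id of its parent."""
    parents = {}    # parent_id -> parent text (plays the role of the docstore)
    children = []   # (child text, parent_id) pairs (plays the role of the vector store)
    for doc in documents:
        for parent in split_fixed(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent
            for child in split_fixed(parent, child_size):
                children.append((child, parent_id))
    return parents, children

def word_overlap(query, text):
    """Hypothetical stand-in for embedding similarity: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, parents, children, k=2):
    """Rank the child chunks, then return the distinct parents of the top k."""
    ranked = sorted(children, key=lambda c: word_overlap(query, c[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

# made-up documents, for illustration only
documents = ["alpha beta gamma delta " * 5, "faith hope love grace " * 5]
parents, children = build_index(documents, parent_size=60, child_size=20)
hits = retrieve_parents("hope and love", parents, children, k=1)
print(len(parents), "parents,", len(children), "children")
print("returned parent length:", len(hits[0]))
```

The match is made on a 20-character child, but the caller receives the 60-character parent, which is the trade the retriever makes between precise matching and useful context.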

In [2]:
# install required dependencies
! pip install -q --progress-bar off \
    "cassio>=0.1.0" \
    "jupyter>=1.0.0" \
    "openai==0.28.1" \
    "cohere" \
    "tiktoken" \
    "langchain" \
    "pypdf"
exit()  # restart the runtime so the newly installed packages are picked up

In [1]:
!pip install --quiet --upgrade astrapy

In [1]:
import os, json

from getpass import getpass
apiSecret = getpass('Your OpenAI Key: ')
os.environ['OPENAI_API_KEY'] = apiSecret

Your OpenAI Key: ··········


In [12]:
# necessary imports
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chains import SimpleSequentialChain
from langchain.chains import SequentialChain
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Cassandra
# from langchain.vectorstores import AstraDB
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from pprint import pprint
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from langchain.vectorstores import AstraDB as LCAstraDB
from astrapy.db import AstraDB as LibAstraDB

llm = OpenAI(temperature=0.4)

In [4]:
# this code makes long generated text wrap in the cell output for readability
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [5]:
## testing different chains/modules from LangChain

#summary = llm("I want a one sentence summary of chapters from the Bible. Please provide a summary of Hebrews 11.")
#print(summary)

prompt_theme = PromptTemplate(
      input_variables = ["book", "chapter"],
      template = "I want a one word theme for chapters from {book}. Please provide a theme for chapter {chapter}."
)
#prompt_theme.format(book="Bible", chapter="Hebrews 11")

prompt_summary = PromptTemplate(
      input_variables = ["book", "chapter"],
      template = "I want a bulleted summary of chapters from {book} with no more than 5 bullets. Please provide a summary of chapter {chapter}."
)

In [6]:
theme_chain = LLMChain(llm=llm, prompt=prompt_theme, output_key="theme")
summary_chain = LLMChain(llm=llm, prompt=prompt_summary, output_key="summary")

#theme_chain.run({'book':"Bible", 'chapter':"Hebrews 11"})
summary_chain.run({'book':"Bible", 'chapter':"Hebrews 11"})



'\n\n• Chapter 11 of Hebrews is known as the "Hall of Faith" and is a tribute to the faith of Old Testament characters \n• It begins by defining faith as being sure of what we hope for and certain of what we do not see\n• It then goes on to list examples of those who had faith in God, such as Abel, Noah, Abraham, and Sarah\n• It also mentions those who were delivered by faith, such as Moses and the Israelites \n• The chapter ends with a reminder that those who have faith will receive what God has promised them.'

In [None]:
full_chain = SequentialChain(
    chains = [theme_chain, summary_chain],
    input_variables = ["book", "chapter"],
    output_variables = ["theme", "summary"]
    )

full_chain({'book':"Bible", 'chapter':"Hebrews 11"})


{'book': 'Bible',
 'chapter': 'Hebrews 11',
 'theme': '\n\nFaith',
 'summary': '\n\n• Hebrews 11 is a chapter about faith and the examples of faith from Old Testament figures. \n• It speaks of the faith of Abel, Enoch, Noah, Abraham, Sarah, Isaac, Jacob, Joseph, Moses, and Rahab. \n• It also speaks of the faith of those who were persecuted and tortured and how they endured it. \n• It speaks of the faith of those who were faithful to God even when it seemed impossible. \n• It concludes by saying that faith is the assurance of things hoped for, the conviction of things not seen.'}
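The data flow of the sequential chain above can be sketched without LangChain: each step formats a prompt from the running dictionary, calls a model, and merges its output back under its own key. This mirrors the shape of the result shown above and is not LangChain's internal implementation. `make_step`, `run_sequence`, `echo_model`, and the shortened templates are all hypothetical names and values used only for illustration.

```python
def make_step(template, output_key, model):
    """Build a step that fills the template, calls the model, and names the result."""
    def step(state):
        prompt = template.format(**state)  # extra keys in state are ignored by format()
        return {output_key: model(prompt)}
    return step

def run_sequence(steps, inputs):
    """Run the steps in order, accumulating every output in one dictionary."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state

def echo_model(prompt):
    """Hypothetical stand-in for the LLM: returns the prompt it was given."""
    return prompt

theme_step = make_step("One word theme for chapter {chapter} of {book}.", "theme", echo_model)
summary_step = make_step("Bulleted summary of chapter {chapter} of {book}.", "summary", echo_model)

result = run_sequence([theme_step, summary_step], {"book": "Bible", "chapter": "Hebrews 11"})
print(sorted(result))
```

The returned dictionary holds the original inputs plus one entry per step, the same layout as the `full_chain` output.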

In [7]:
# access google drive for PDFs
from google.colab import drive
drive.mount('/content/drive')

# gdrive_dir:
#    - "Path/on/google/drive/" to a directory on google drive that has all the PDFs
#       you wish to load
gdrive_dir = "Astra/Demo/PDFData/"

Mounted at /content/drive


In [8]:
## Astra Connectivity - now modified to use the new Astra Vector API endpoint
# Input your Astra DB endpoint and token string, the one starting with "AstraCS:..."
ASTRA_DB_API_ENDPOINT = input("ASTRA_DB_API_ENDPOINT = ")
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass('Your Astra DB Token ("AstraCS:..."): ')

ASTRA_DB_API_ENDPOINT = https://35a9be06-aeee-4be9-9d64-dd54abc2c738-us-east-2.apps.astra.datastax.com
Your Astra DB Token ("AstraCS:..."): ··········


In [8]:
# Create the client
#astra_db = LibAstraDB(
#    api_endpoint=ASTRA_DB_API_ENDPOINT,
#    token=ASTRA_DB_TOKEN_BASED_PASSWORD,
#)

In [None]:
## Embeddings

# optionally drop the table to regenerate the embeddings
#astraSession.execute(f"DROP TABLE IF EXISTS {astraKeyspace}.pdf_embedding_demo;")

<cassandra.cluster.ResultSet at 0x78d9479f3490>

In [None]:
# Create the collection
#collection = astra_db.create_collection("pdf_embedding_collection", dimension=1536)

In [9]:
FILE_SUFFIX = ".pdf"

embeddings = OpenAIEmbeddings()

list_of_pdfs = []

src_dir = "/content/drive/MyDrive/" + gdrive_dir
# generate the list of PDF files
for f in os.listdir(src_dir):
  filename = os.path.join(src_dir, f)
  if os.path.isfile(filename) and f.endswith(FILE_SUFFIX):
    list_of_pdfs.append(filename)

# tell us what files are being processed
print("Files found:")
pprint(list_of_pdfs)

pdf_loaders = [
    PyPDFLoader(pdf_name)
    for pdf_name in list_of_pdfs
]

Files found:
['/content/drive/MyDrive/Astra/Demo/PDFData/tbu-intermediate.pdf',
 '/content/drive/MyDrive/Astra/Demo/PDFData/tbu-foundations.pdf']


In [10]:
docs = []
for loader in pdf_loaders:
    docs.extend(loader.load())

In [14]:
## Using Parent Document retriever
# Sometimes the full documents are too big to retrieve as is.
# In that case, we first split the raw documents into larger (parent) chunks,
# and then split those into smaller (child) chunks. We index the smaller
# chunks, but on retrieval we return the larger chunks (still not the full documents).

# This process helps improve RAG by condensing the content embedding

# split and load the docs
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

#text_splitter = RecursiveCharacterTextSplitter(
#    chunk_size=500,
#    chunk_overlap=80,
#)

# set up the vector store for the child chunks - this uses the new Astra vector API with LangChain
vectorstore = LCAstraDB(
    embedding=embeddings,
    collection_name="pdf_embedding_collection",
    token=ASTRA_DB_TOKEN_BASED_PASSWORD,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
)

# The storage layer for the parent documents
store = InMemoryStore()

In [15]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)

#documents = [
#    doc
#    for loader in pdf_loaders
#    for doc in loader.load_and_split(text_splitter=text_splitter)
#]
#
#texts, metadatas = zip(*((doc.page_content, doc.metadata) for doc in documents))
#vectorstore.add_texts(texts=texts, metadatas=metadatas)

In [16]:
# The docstore now holds far more entries than there are source PDFs:
# these are the larger (parent) chunks

len(list(store.yield_keys()))

560

In [18]:
# Let's make sure the underlying vector store retrieves the small chunks

sub_docs = vectorstore.similarity_search("What distinguishes Christianity from other religions?")

print(sub_docs[0].page_content)

8 Unit A.  God and Spiritual Powers  
3. Jesus Christ  
What distinguishes Christianity from other religions is largely its 
teachings about Jesus Christ. This chapter looks at this central figure, 
including the amazing  claim that Jesus Christ is the Son of God.  
Jesus Christ ’s Eternity  
Jesus Christ existed in the beginning  
[JOHN, TO BELIEVERS :]  I’m writing to you, fathers, because you know 
Christ who has existed from the beginning .   1 JOHN 2:13 A GW
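Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by closeness to it. The sketch below assumes cosine similarity as the metric and uses made-up three-dimensional vectors; real OpenAI embeddings are much longer, and the actual search runs inside Astra rather than in Python. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed_chunks, k=4):
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text)
              for text, vector in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# made-up vectors, for illustration only
indexed_chunks = [
    ("chunk about Jesus Christ", [1.0, 0.0, 0.0]),
    ("chunk about judgment", [0.0, 1.0, 0.0]),
    ("chunk about the Son of God", [0.9, 0.1, 0.0]),
]
best = top_k([1.0, 0.0, 0.0], indexed_chunks, k=2)
print(best)
```

Because only the small child chunks are embedded, this ranking step works on narrow, focused text, which is why `sub_docs[0]` above is a short passage rather than a full page.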


In [19]:
retrieved_docs = retriever.get_relevant_documents("justice breyer")
len(retrieved_docs[0].page_content)

1775

In [20]:
print(retrieved_docs[0].page_content)

10. God’s Judgment  81 
 God’s judgment is to discipline God ’s people  
[PAUL, TO BELIEVERS :] But when we are judged by the Lord, we are 
disciplined  so that we may not be condemned with the world.   
1 CORINTHIANS 11:32  NET 
God’s judgment is to punish the wicked  
But by the same word the present heav ens and earth have been 
reserved for fire, by being kept for the day of judgment and 
destruction of the ungodly .   2 PETER 3:7 NET 
God’s judgment is also to re ward God ’s people  
[ELDERS IN HEAVEN , TO GOD:] The nations were enraged, but your wrath 
has come, and the time has come for the dead to be judged, and the 
time has come to give to your servants, the prophets, their reward, as 
well as to the saints and to th ose who revere your name, both small 
and great , and the time has come to destroy those who destroy the 
earth.   REVELATION 11:18  NET 
As well as punishmen t for wrongdoing, God ’s judgment includes 
reward for godliness.  
God’s Judgment Is Just  
God judges 