<a href = "https://www.pieriantraining.com"><img src="../PT Centered Purple.png"> </a>

<em style="text-align:center">Copyrighted by Pierian Training</em>

# Data Connections Exercise

## Ask a Legal Research Assistant Bot about the US Constitution

Let's revisit our first exercise and add offline capability using ChromaDB. Your function should do the following:

* Read the US_Constitution.txt file inside the some_data folder
* Split this into chunks (you choose the size)
* Write this to a ChromaDB Vector Store
* Use contextual compression to return the portion of the document most relevant to the question
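
Before turning to the libraries, the four steps above can be sketched in plain Python to show how the data flows. This is only a toy illustration: `split_into_chunks`, `embed`, `retrieve`, `compress`, `STOP_WORDS`, and the abridged `sample_text` are hypothetical stand-ins for the text splitter, the OpenAI embeddings, the ChromaDB store, and the LLM-based compressor, and they do not reflect how those libraries work internally.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list so that common words do not drive the matching.
STOP_WORDS = {"the", "of", "to", "and", "in", "a", "be", "shall", "no", "not",
              "any", "what", "does", "it", "say", "about", "is"}

def split_sentences(text):
    """Split on whitespace that follows a period."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def split_into_chunks(text, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text):
    """Toy 'embedding': word counts, ignoring stop words (not a real embedding model)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, store, k=1):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def compress(question, chunk):
    """Crude 'compression': keep only sentences that share a keyword with the question."""
    question_words = set(embed(question))
    kept = [s for s in split_sentences(chunk) if question_words & set(embed(s))]
    return " ".join(kept)

# Abridged sample text, used only for this illustration.
sample_text = (
    "Congress shall make no law abridging the freedom of speech. "
    "The right of the people to keep and bear arms shall not be infringed. "
    "No soldier shall be quartered in any house without consent of the owner. "
    "Excessive bail shall not be required."
)

chunks = split_into_chunks(sample_text, chunk_size=140)     # step 2: split
store = [(chunk, embed(chunk)) for chunk in chunks]         # step 3: "vector store"
question = "What does it say about freedom of speech?"
best_chunk = retrieve(question, store, k=1)[0]              # similarity search
answer = compress(question, best_chunk)                     # step 4: compression
print(answer)
```

The retrieved chunk contains two sentences, and the compression step drops the one that is unrelated to the question. That is the same idea the LLM-based compressor applies, with far better judgment, in the solution below.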

In [1]:
from langchain.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor 

import os
from dotenv import load_dotenv
load_dotenv()

openai_token = os.getenv("OPENAI_API_KEY")

In [20]:
def us_constitution_helper(question):
    '''
    Takes in a question about the US Constitution and returns the most relevant
    part of the constitution. Notice it may not directly answer the actual question!
    
    Follow the steps below to fill out this function:
    '''
    # PART ONE:
    # Load "some_data/US_Constitution.txt" into a Document object
    loader = TextLoader("some_data/US_Constitution.txt")
    documents = loader.load()
    
    # PART TWO
    # Split the document into chunks (you choose how and what size)
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
    docs = text_splitter.split_documents(documents)
    
    # PART THREE
    # Embed the documents (now in chunks) into a persisted ChromaDB
    embedding_function = OpenAIEmbeddings()
    db = Chroma.from_documents(docs, embedding_function, persist_directory='./US_Constitution')
    # Explicit persist() is only needed on older Chroma releases; 0.4+ saves automatically.
    db.persist()

    # PART FOUR
    # Use ChatOpenAI and ContextualCompressionRetriever to return the most
    # relevant part of the documents.

    # results = db.similarity_search("What is the 13th Amendment?")
    # print(results[0].page_content) # NEED TO COMPRESS THESE RESULTS!
    llm = ChatOpenAI(temperature=0)
    compressor = LLMChainExtractor.from_llm(llm)

    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, 
                                                           base_retriever=db.as_retriever())

    compressed_docs = compression_retriever.invoke(question)

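    # Guard against an empty result so an unmatched question does not raise IndexError
    return compressed_docs[0].page_content if compressed_docs else "No relevant content found."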

In [21]:
print(us_constitution_helper("What is the 1st Amendment?"))

First Amendment
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.
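
One thing to watch for: as written, `us_constitution_helper` reloads, re-splits, and re-embeds the file on every call, and repeated calls will most likely add the same chunks to the persisted collection again. A minimal sketch of reusing the store is shown here, assuming the same `./US_Constitution` directory. `store_exists` and `load_or_build_store` are hypothetical helper names, and the "directory is non-empty" check is only a simple heuristic for "a store was already written there".

```python
import os
import tempfile

def store_exists(persist_directory):
    """Heuristic: treat a non-empty directory as an already persisted store."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))

def load_or_build_store(file_path, persist_directory, chunk_size=500):
    """Reopen the persisted Chroma store if present, otherwise build it once."""
    # Imported lazily so store_exists() can be used without LangChain installed.
    from langchain.vectorstores import Chroma
    from langchain.document_loaders import TextLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings

    embedding_function = OpenAIEmbeddings()
    if store_exists(persist_directory):
        # Reopen the existing collection; the documents are not embedded again.
        return Chroma(persist_directory=persist_directory,
                      embedding_function=embedding_function)

    documents = TextLoader(file_path).load()
    splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size)
    docs = splitter.split_documents(documents)
    return Chroma.from_documents(docs, embedding_function,
                                 persist_directory=persist_directory)

# Quick check of the heuristic on a throwaway directory (no API calls involved).
with tempfile.TemporaryDirectory() as tmp:
    empty_before = store_exists(tmp)
    open(os.path.join(tmp, "marker"), "w").close()
    nonempty_after = store_exists(tmp)
```

A call such as `db = load_or_build_store("some_data/US_Constitution.txt", "./US_Constitution")` could then replace parts one to three of the function. Only the stored document vectors are reused: embedding the question and running the compressor still call the OpenAI API.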


In [5]:
from typing import List
from langchain.schema import Document


class QABot:
    def __init__(self, persist_directory: str):
        self.persist_directory = persist_directory
        self.docs: List[Document] = []
        self.db = None
        self.llm = None
        self.retriever = None

    def load_text_file(self, file_path: str, chunk_size: int = 500) -> None:
        # Load the text file into a Document object
        loader = TextLoader(file_path)
        documents = loader.load()

        # Split the document into chunks
        text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size)
        self.docs = text_splitter.split_documents(documents)
        
        # Embed the documents and create a persisted ChromaDB
        embedding_function = OpenAIEmbeddings()
        self.db = Chroma.from_documents(self.docs,
                                        embedding_function,
                                        persist_directory=self.persist_directory)
        # No explicit self.db.persist() call: Chroma 0.4+ writes to persist_directory automatically
    
    def _connect_llm(self) -> ChatOpenAI:
        # Initialize and return the LLM chat model
        self.llm = ChatOpenAI(temperature=0.0)
        return self.llm

    def _get_retriever(self) -> ContextualCompressionRetriever:
        # Initialize the LLM compressor
        if self.llm is None:
            self._connect_llm()
        compressor = LLMChainExtractor.from_llm(self.llm)

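        # Fail early with a clear message if no documents have been loaded yet
        if self.db is None:
            raise ValueError("Call load_text_file() before asking questions.")

        # Create and return the contextual compression retriever
        self.retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                        base_retriever=self.db.as_retriever())
        return self.retriever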

    def ask(self, question: str) -> str:
        # Ensure the LLM and retriever are loaded
        if self.retriever is None:
            self._get_retriever()

        # Use the retriever to get the most relevant part of the documents
        compressed_docs = self.retriever.invoke(question)
        return compressed_docs[0].page_content if compressed_docs else "No relevant content found."

In [6]:
# First, load the documents and create a specific chatbot
qa_bot = QABot(persist_directory='./US_Constitution')
qa_bot.load_text_file("some_data/US_Constitution.txt")

In [7]:
# Then, ask questions to the chatbot
answer = qa_bot.ask("What is the 1st Amendment?")
print(answer) # First Amendment: Congress shall make no law respecting an establishment of religion...

First Amendment
Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.
