# LangChain - Data Connections Exercise
### Ask a Legal Research Assistant Bot about the US Constitution
Write a function that will:
- Read the `US_Constitution.txt` file
- Split it into chunks
- Write the chunks to a ChromaDB vector store
- Use contextual compression to return the relevant portion of the document
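
The chunking step can be illustrated without any library. The sketch below is a simplified assumption, not what the notebook runs: it slides a fixed-size window over characters, whereas the splitter used later measures length in tokens and splits on separators. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: a repeated phrase, 1440 characters long.
sample = "We the People of the United States, " * 40
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=100)
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.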

In [49]:
# %pip install langchain
# %pip install chromadb


In [87]:
from huggingface_hub import login
login()

In [65]:
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from langchain_community.llms import HuggingFaceHub

from transformers import AutoTokenizer

In [91]:
# Read the Hugging Face API token; strip the trailing newline most editors add.
with open('api_key.txt', 'r') as file:
    api_key = file.read().strip()

In [62]:
checkpoint = "HuggingFaceH4/zephyr-7b-beta"

In [92]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [93]:
llm = HuggingFaceHub(
    repo_id=checkpoint,
    task="text-generation",
    huggingfacehub_api_token=api_key,
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

In [99]:
# Sanity check: similarity_search returns the chunks whose embeddings are
# closest to the question, most relevant first.
# It needs a `db` vector store like the one built in us_constitution_helper.

# docs = db.similarity_search('What is it about?')
# print(docs[1].page_content)

To exercise exclusive Legislation in all Cases whatsoever, over such District (not exceeding ten Miles square) as may, by Cession of particular States, and the Acceptance of Congress, become the Seat of the Government of the United States, and to exercise like Authority over all Places purchased by the Consent of the Legislature of the State in which the Same shall be, for the Erection of Forts, Magazines, Arsenals, dock-Yards and other needful Buildings;-And
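
What "most similar" means can be shown with toy vectors. This is a minimal sketch under assumptions: it ranks by cosine similarity, while the metric Chroma actually applies depends on how the collection is configured, and the three-dimensional vectors and the helper `top_k` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Hypothetical embeddings for three chunks and one question.
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.05, 0.0]
ranked = top_k(toy_query, toy_chunks, k=2)
```

Here `ranked[0]` plays the role of `docs[0]` in the cell above, and `ranked[1]` the role of `docs[1]`.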


In [120]:
def us_constitution_helper(question):
    # 1. Load File
    document = TextLoader('US_Constitution.txt').load()

    # 2. Split it into chunks
    splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer,
                                                                chunk_size=500,
                                                                chunk_overlap=100)
    chunks = splitter.split_documents(document)

    # 3. Embed the document to a persisted ChromaDB
    embedding_function = HuggingFaceHubEmbeddings()
    db = Chroma.from_documents(chunks,
                               embedding=embedding_function,
                               persist_directory='constitution_db')
    db.persist()  # call the method; a bare `db.persist` does nothing

    # LLM -> LLMChainExtractor
    compressor = LLMChainExtractor.from_llm(llm)

    # Contextual Compression to return most relevant parts of document
    compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                           base_retriever=db.as_retriever())
    compressed_docs = compression_retriever.get_relevant_documents(question)

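    if not compressed_docs:
        return ''
    output = compressed_docs[0].page_content
    # str.find returns -1 when the marker is absent; fall back to the full text.
    marker_index = max(output.find("Extracted relevant parts:"), 0)

    return output[marker_index:].strip()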
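
The idea behind contextual compression is to retrieve a chunk first and then keep only the parts of it that bear on the question. The toy sketch below illustrates that second step only. It rests on a loud assumption: plain keyword overlap stands in for the LLM that `LLMChainExtractor` uses, and the names `extract_relevant` and `STOPWORDS` are hypothetical.

```python
import re

# Tiny stopword list so that words like "the" do not count as a match.
STOPWORDS = {"what", "does", "the", "say", "about", "is", "a", "of", "to", "and"}


def extract_relevant(doc_text, question):
    """Keep only the sentences that share a content word with the question."""
    q_words = set(re.findall(r"[a-z]+", question.lower())) - STOPWORDS
    sentences = re.split(r"(?<=[.;])\s+", doc_text)
    kept = [
        s for s in sentences
        if q_words & set(re.findall(r"[a-z]+", s.lower()))
    ]
    return " ".join(kept)


doc = (
    "Neither slavery nor involuntary servitude shall exist within the United States. "
    "Congress shall have power to lay and collect taxes."
)
compressed = extract_relevant(doc, "What does the Constitution say about slavery?")
print(compressed)
```

Only the first sentence survives, because the second shares no content word with the question. An LLM-based extractor makes the same kind of keep-or-drop decision, but from meaning rather than word overlap.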

In [122]:
print(us_constitution_helper('What is the 13th Amendment?'))

Extracted relevant parts:
- "Neither slavery nor involuntary servitude, except as a punishment for crime whereof the party shall have been duly convicted, shall exist within the United States, or any place subject to their jurisdiction." (13th Amendment, Section 1)
- "Congress shall have power to enforce this article by appropriate legislation." (13th Amendment, Section 2)
- "All persons born or naturalized in the United States, and subject to the jurisdiction thereof, are citizens of the United States and of the State wherein they reside." (14th Amendment, Section 1)
- "No State shall make or enforce any law which shall abridge the privileges or immunities of citizens of the United States; nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person within its jurisdiction the equal protection of the laws." (14th Amendment, Section 1)
- "Representatives shall be apportioned among the several States according to their respecti

Created a chunk of size 501, which is longer than the specified 500
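
One plausible reading of this warning is that a separator-based splitter never cuts inside a piece: when a single piece between two separators is already longer than the limit, it is emitted as an oversized chunk. The sketch below is a simplified stand-in under that assumption, not LangChain's implementation. It counts characters rather than tokens, ignores overlap, and `merge_paragraphs` is a hypothetical name.

```python
def merge_paragraphs(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size.

    Returns the chunks plus the sizes of any pieces that exceed chunk_size
    on their own and therefore cannot be made to fit.
    """
    chunks, oversized, current = [], [], ""
    for piece in filter(None, text.split(separator)):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
        current = piece
        if len(piece) > chunk_size:
            oversized.append(len(piece))  # would trigger the warning
    if current:
        chunks.append(current)
    return chunks, oversized


text = "short one\n\n" + "x" * 30 + "\n\nshort two"
chunks, oversized = merge_paragraphs(text, chunk_size=20)
```

In this toy run the 30-character piece is reported as oversized and still becomes a chunk of its own, which mirrors a 501-token chunk appearing despite a limit of 500.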
