## Installation and Imports


In [2]:
!pip install langchain langchain_community chromadb requests sentence-transformers pypdf

This run resolved to langchain 0.2.14, langchain_community 0.2.12, langchain-core 0.2.35, chromadb 0.5.5, sentence-transformers 3.0.1, and pypdf 4.3.1.

Now, we can import the required modules:

In [3]:
# Only the modules used in this notebook; imports for later cells appear next to the code that needs them
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

## Set up Environment Variables
Before we can use the Hugging Face Hub API, we need to expose the API token through the HUGGINGFACEHUB_API_TOKEN environment variable. We read the token with getpass so that it is not echoed into the notebook, then store it with os.environ.

In [6]:
import os
from getpass import getpass
HUGGINGFACEHUB_API_TOKEN = getpass("API:")

# Set the API token in the environment variable
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

API:··········
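Re-running the cell above prompts for the token every time. A minimal sketch of an alternative, assuming you would rather reuse a token that is already present in the environment: the helper name ensure_env_secret and its prompt_fn parameter are hypothetical, not part of any library.

```python
import os
from getpass import getpass


def ensure_env_secret(name, prompt_fn=getpass):
    """Return the secret stored in os.environ[name], prompting only if it is missing.

    `ensure_env_secret` and `prompt_fn` are illustrative names, not a library API.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed characters, so the token never shows in the output
        value = prompt_fn(f"{name}: ")
        os.environ[name] = value
    return value


# Usage in the notebook (prompts only on the first run of the session):
# ensure_env_secret("HUGGINGFACEHUB_API_TOKEN")
```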


## Load and Split Documents
Here we load the PDF documents from the Data directory and split them into smaller chunks using the RecursiveCharacterTextSplitter. The chunk size is set to 500 characters with a 50-character overlap, so consecutive chunks share some text and a sentence cut at a boundary still appears whole in one of them.

In [7]:

# Load your documents (assuming they are PDFs in a directory)
loader = PyPDFDirectoryLoader('Data')
documents = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
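To make the chunk_size and chunk_overlap settings concrete, here is a simplified sliding-window sketch. It is not the RecursiveCharacterTextSplitter algorithm, which also tries to break on separators such as paragraphs and sentences; the function name sliding_chunks is hypothetical and only illustrates how a 50-character overlap relates neighbouring 500-character chunks.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-50:] == pieces[1][:50])
```

With a 1200-character input the windows start at offsets 0, 450, and 900, which is why the end of one chunk reappears at the start of the next.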


## Create the Prompt Template

In [49]:
from langchain.prompts.chat import ChatPromptTemplate

# Define the prompt template as a string
prompt_template = """
You are an AI Assistant that strictly follows instructions.
Your answers must only come from the CONTEXT provided. If the user's query is not related to the CONTEXT, you must respond with "I don't know."

CONTEXT: {context}
</s>

{query}
</s>

Your answer:
"""

# Create a ChatPromptTemplate object directly from the string template
prompt = ChatPromptTemplate.from_template(prompt_template)

## Initialize Embeddings and Vector Store
We initialize the Hugging Face embeddings model and use it to create a Chroma vector store from the document chunks.

In [50]:
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-large")

In [51]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [52]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x7c00a41523e0>
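Vector retrieval works by embedding the query and returning the chunks whose vectors lie closest to it. The sketch below illustrates the idea with hand-written 3-dimensional toy vectors and cosine similarity. The vectors, chunk names, and the top_k helper are made up for illustration, and the distance metric a Chroma collection actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy embeddings; a real model such as the one above produces much longer vectors
toy_chunks = {
    "chunk_electrolysis": [0.9, 0.1, 0.0],
    "chunk_publisher": [0.0, 0.2, 0.9],
    "chunk_electrodes": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
nearest = top_k(query_vec, toy_chunks, k=2)
print(nearest)
```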

## Create BM25 and Vector Retrievers
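Here, we create the BM25 and vector retrievers. The BM25 retriever is built directly from the document chunks and ranks them by keyword overlap with the query, while the vector retriever is created from the Chroma vector store and ranks them by embedding similarity. BM25Retriever depends on the rank_bm25 package, so we install it first.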

In [53]:
!pip install rank_bm25



In [54]:
bm25_retriever = BM25Retriever.from_documents(chunks)
vector_retriever = vectorstore.as_retriever()
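For intuition about what the BM25 retriever scores, the following is a self-contained sketch of a common BM25 variant. It is an assumption-laden illustration rather than the rank_bm25 implementation: the tokenizer is a plain whitespace split, the IDF uses the non-negative log(1 + ...) form, and k1 = 1.5 and b = 0.75 are conventional defaults chosen here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against the query tokens."""
    if not corpus_tokens:
        return []
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches for the same score
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "electrolysis uses electricity to drive a chemical reaction".split(),
    "the publisher printed the book in two volumes".split(),
    "electricity flows through the electrolyte during electrolysis".split(),
]
scores = bm25_scores("what is electrolysis".split(), corpus)
print(scores)
```

Documents that never mention a query term score zero, which is why a purely keyword-based retriever benefits from being paired with the vector retriever.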

## Set Up the EnsembleRetriever
The EnsembleRetriever combines the BM25 and vector retrievers by fusing their ranked results into a single list. The weights parameter is set to 0.5 for each retriever, so both contribute equally to the fused ranking.

In [55]:
from langchain.retrievers.ensemble import EnsembleRetriever

retrievers = [bm25_retriever, vector_retriever]
ensemble_retriever = EnsembleRetriever(retrievers=retrievers, weights=[0.5, 0.5])
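The LangChain documentation describes the EnsembleRetriever as reranking with Reciprocal Rank Fusion. The sketch below shows one weighted form of that idea on plain lists of chunk IDs. The function name weighted_rrf, the chunk IDs, and the constant c = 60 are assumptions made for illustration, not a copy of the library code.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (rank + c) to every item it contains."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_b", "chunk_d", "chunk_a"]
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.5, 0.5])
print(fused)
```

Items that appear in both rankings accumulate two contributions, so they rise above items that only one retriever found.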


## Initialize the Large Language Model

In [56]:
from langchain_community.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
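    model_kwargs={
        "max_new_tokens": 512,  # upper bound on the length of the generated answer
        "top_k": 30,
        "temperature": 0.1,  # low temperature keeps answers close to the most likely tokens
        "repetition_penalty": 1.1,
        "return_full_text": False,  # return only the completion, not the prompt
    },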
)
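The top_k and temperature settings shape how the next token is sampled. As a rough illustration of the usual technique (not the code the inference backend actually runs), the hypothetical helper below keeps the top_k highest logits and applies a temperature-scaled softmax to them; the logits are invented values.

```python
import math


def top_k_temperature_probs(logits, top_k=30, temperature=0.1):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices of the top_k largest logits; everything else gets probability 0
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [0.0] * len(logits)
    for i, e in zip(kept, exps):
        probs[i] = e / total
    return probs


toy_logits = [2.0, 1.5, 0.5, -1.0]
sharp = top_k_temperature_probs(toy_logits, top_k=3, temperature=0.1)
flat = top_k_temperature_probs(toy_logits, top_k=3, temperature=1.0)
print(sharp)
print(flat)
```

A temperature of 0.1 pushes almost all probability onto the top token, which suits a question-answering setup where near-deterministic output is preferred.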

## Create the RAG Pipeline

In [57]:
from langchain_core.output_parsers import StrOutputParser

In [58]:
from langchain_core.runnables import RunnablePassthrough

In [59]:
output_parser = StrOutputParser()

In [60]:
retriever = ensemble_retriever
chain = (
    {"context": retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
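In the chain above, the retriever returns a list of Document objects, and that list is converted to a string when it fills the {context} slot, so the prompt also carries metadata and object formatting. A common pattern is to join only the page text first. The sketch below uses stand-in objects with a page_content attribute so that it runs without LangChain; format_docs is a conventional helper name, not a library function.

```python
from types import SimpleNamespace


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-ins for retrieved Document objects (only page_content matters here)
retrieved = [
    SimpleNamespace(page_content="Electrolysis uses electricity."),
    SimpleNamespace(page_content="It decomposes compounds."),
]
context = format_docs(retrieved)
print(context)
```

Assuming this helper, the context entry could be written as retriever | format_docs, since piping a runnable into a plain function wraps the function as a chain step. Doing so changes the prompt the model sees, so the answers may differ from the ones recorded in this notebook.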

## Run a Query


In [61]:
query = "what is this story about"


In [62]:
response = chain.invoke(query)

In [63]:
print(response)

Based on the context provided, it is unclear which specific story is being referred to. Please provide more information or clarify which story is being discussed. Without further context, the answer would be "I don't know."


In [66]:
print(chain.invoke("what is electrolysis?"))

Electrolysis is a chemical process that uses electricity to drive nonspontaneous chemical reactions. It involves passing an electric current through an ionic compound (electrolyte) dissolved in a solvent (such as water). This causes the compound to decompose into simpler products, such as metals and gases. Electrolysis has many practical applications, including the production of aluminum, chlorine, and sodium hydroxide. It also plays a crucial role in various industrial processes, such as the refining of metals and the purification of chemicals. In addition, electrolysis is used in electroplating, battery manufacturing, and water treatment. Overall, electrolysis is a versatile and essential technology that has revolutionized numerous industries and continues to advance scientific research today.


In [65]:
print(chain.invoke("what is the publisher's name?"))

I don't know. The context provided does not include information about the publisher's name. Please provide more context or specify which document you are referring to.
