# PDF-based Q&A with Mistral7b, LangChain, and ChromaDB

**Objective:**

In this project, we develop an intelligent document analysis system that integrates LLM-based question-answering and PDF file processing capabilities. By leveraging advanced language models such as Mistral7b, we enable users to query and retrieve information directly from PDF documents. The system extracts and analyzes the content of uploaded PDF files, providing accurate and detailed responses to user queries. This project aims to enhance information accessibility and usability by delivering a seamless, efficient tool for interacting with documents.

The project follows this **process**:

1. **Document Loading**: Load PDF documents using appropriate functions such as `PDFPlumberLoader`, ensuring efficient handling and processing of the documents.

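2. **Text Splitting and Embeddings**: Split the loaded text into manageable chunks and generate an embedding for each chunk with a sentence-embedding model (`sentence-transformers/all-MiniLM-L6-v2` in this notebook). Mistral7b is used later to generate answers, not to produce embeddings. This allows more precise retrieval of relevant passages from the documents.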

3. **Vector Database Integration**: Utilize ChromaDB to create a vector database that stores document embeddings and facilitates efficient retrieval of information. This database serves as the backend for querying and answering user questions.

4. **Create Retrieval Chain**: Develop a retrieval chain that utilizes the ChromaDB vector database to fetch relevant document chunks based on user queries. The retrieval chain integrates the LLM model for answering questions accurately.

5. **Prompt Design and LLM Integration**: Design prompts that incorporate both user queries and document context. Integrate the LLM model to generate refined responses based on the prompts.

## 1. Install and Import Libraries

*Make sure to refer to the latest library versions and check their compatibility with each other.*

In [None]:
import warnings
warnings.filterwarnings('ignore')

!pip install -q -U transformers accelerate bitsandbytes langchain tiktoken sentence-transformers chromadb pdfplumber

In [None]:
import pandas as pd
import torch
from torch import bfloat16
# from tempfile import NamedTemporaryFile
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
# from transformers import AutoModelForCausalLM
# from transformers import BitsAndBytesConfig
import langchain
from langchain.chains import create_retrieval_chain
from langchain.chains import RetrievalQA, LLMChain
# from langchain.chains.combine_documents import create_stuff_documents_chain
# from langchain_core.prompts import ChatPromptTemplate
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PDFPlumberLoader, DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# from langchain.vectorstores.base import VectorStore
from langchain.schema.runnable import RunnablePassthrough
import chromadb
from chromadb.config import Settings

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

* <b>torch.backends.cuda.enable_mem_efficient_sdp(False)</b>: This line disables the memory-efficient implementation of scaled dot-product attention (SDP) for CUDA tensors. SDP here refers to the attention computation inside the transformer layers, not to a multi-GPU parallelism scheme. Passing False tells PyTorch not to use the memory-efficient attention kernel.

* <b>torch.backends.cuda.enable_flash_sdp(False)</b>: Similarly, this line disables the FlashAttention-based implementation of scaled dot-product attention for CUDA tensors. With both fused kernels disabled, PyTorch falls back to another available implementation, such as the default math implementation.

I disabled these features because, in my case, this was necessary for the Mistral7b model to work in the later steps. Try experimenting with these flags set to both False and True.
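
For reference, the "SDP" in these `torch.backends.cuda` flags stands for scaled dot-product attention. The flash and memory-efficient kernels are alternative ways of computing the same quantity, which for query, key, and value matrices $Q$, $K$, $V$ with key dimension $d_k$ is the standard attention formula:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Disabling a kernel should therefore change how the result is computed (speed, memory use, hardware support), not what is computed, apart from small numerical differences.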

In [None]:
#configure torch
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

In [None]:
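# Execute this if an out-of-memory error occurs.
# Note: TF_FORCE_GPU_ALLOW_GROWTH is a TensorFlow setting; it does not change how PyTorch allocates GPU memory.
# import os
# os.environ['TF_FORCE_GPU_ALLOW_GROWTH']='true'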

## 2. Data loading

I have included a PDF file to test the feature.

In [None]:
#import pdf files
file_path = '/kaggle/input/aitrends/Data_AI_Trends_Predictions_2024_2.pdf'

In [None]:
# Define the file type (pdf or csv)
file_type = "pdf" 

# Load documents based on file type
try:
    if file_type == "pdf":
        loader = PDFPlumberLoader(file_path)
    elif file_type == "csv":
        df = pd.read_csv(file_path)
        loader = DataFrameLoader(df, page_content_column="Title")
    else:
        raise ValueError("Unsupported file type")
    
    # Load the documents
    documents = loader.load()
    print(f"Loaded {len(documents)} documents from the file.")
    
except Exception as e:
    print(f"Error loading documents: {e}")
    # Fall back to an empty list (not None) so that the next cell can still iterate over it
    documents = []


In [None]:
#Verify the loaded content by iterating over the list of documents
for i, doc in enumerate(documents):
    # Print the first 50 characters of each document's content
    print(f"Document {i + 1} - Sample: {doc.page_content[:50]}")

## 3. Model Initializing

In this part, we initialize the Mistral 7B language model and its tokenizer. Both are required to process user queries and generate responses based on the provided PDF documents.

Model Initialization:
* The AutoModelForCausalLM class from the transformers library loads the pre-trained Mistral 7B model from model_path, which points to the model directory on disk.
* torch_dtype=torch.bfloat16 loads the model weights in bfloat16, which roughly halves memory use compared with float32.
* device_map="auto" automatically places the model on the available devices (e.g., GPU or CPU).
* trust_remote_code=True allows the loader to execute custom Python code shipped in the model repository rather than in the transformers library itself. Normally, loading a pre-trained model only loads the weights into a standard architecture defined by the library. Some models, however, include custom code (e.g., for tokenization, special layers, or other model-specific processing). Setting this parameter states that you trust the repository and permit that code to run, so enable it only for sources you trust.

Tokenizer Initialization:
* The AutoTokenizer class loads the tokenizer from the same model_path as the model. The tokenizer converts input text (user queries and document content) into a format the model can process: it breaks the text into smaller units called tokens and maps them to numerical representations (input IDs).
* The pad_token of the tokenizer is set to the eos_token (end-of-sequence token), a common choice when a tokenizer does not define a dedicated padding token. Tokenizers add padding so that all sequences in a batch have the same length, and they can truncate sequences that exceed the maximum length the model can handle. They also handle special tokens such as start-of-sequence (SOS), end-of-sequence (EOS), and padding tokens, which help the model understand the structure of the input.
* The padding_side is set to "right", so shorter sequences are padded at the end to reach a consistent input length.
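
To make the padding behaviour concrete, here is a minimal pure-Python sketch of right-padding a batch of token IDs and building the matching attention mask. The function `pad_right`, the token IDs, and the value `EOS_ID = 2` are illustrative assumptions, not values read from the Mistral tokenizer.

```python
def pad_right(sequences, pad_id):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # padding goes on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 marks padded positions
    return input_ids, attention_mask

# Hypothetical token IDs; EOS_ID stands in for the EOS token reused as the pad token.
EOS_ID = 2
batch = [[5, 17, 9], [8, 3]]
ids, mask = pad_right(batch, pad_id=EOS_ID)
print(ids)   # [[5, 17, 9], [8, 3, 2]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask is what lets the model ignore the padded positions, which is why reusing the EOS token as padding is workable in practice.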

In [None]:
model_path="/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"

#initialize the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    trust_remote_code = True
)

#initialize the tokenizer
tokenizer=AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

## 4. Chunking

In this context, chunking means splitting the text of the loaded documents into smaller segments ("chunks") that can be embedded and retrieved individually. It should not be confused with syntactic chunking in NLP, which groups words into noun, verb, or prepositional phrases. The split_documents method is called on the text_splitter instance; it processes the input documents and returns the smaller chunks as a list of Document objects.

Chunking helps with text preprocessing, content selection, and context segmentation. In this example, we proceed with token-based chunking. For experimentation purposes, semantic chunking was also tried in order to compare the two methods. In these experiments, token-based chunking yielded better results for our objective of preserving meaning and context in the retrieved passages.

Token-based Chunking:
* This approach splits text based on tokens (words or subwords), aiming to keep segments within a specified token count. It ensures that each chunk fits within the language model's token limit, which is essential for model compatibility.

Semantic Chunking:
* This type of chunking uses semantic information to divide the text into coherent and contextually meaningful chunks. For example, the text might be split around paragraphs or sections, depending on the natural breaks in the content.
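
The idea of splitting at natural breaks can be sketched in plain Python. The example below greedily packs whole paragraphs into chunks; it is a simplified illustration with a hypothetical helper `split_on_paragraphs`, not the algorithm used by `RecursiveCharacterTextSplitter`, which also falls back to finer separators when a piece is still too long.

```python
def split_on_paragraphs(text, max_chars):
    """Greedily pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # The paragraph still fits, or it starts a new chunk
            # (an oversized single paragraph is kept whole in this sketch).
            current = candidate
        else:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "Intro paragraph.\n\nSecond paragraph with details.\n\nShort end."
paragraph_chunks = split_on_paragraphs(sample, max_chars=50)
for chunk in paragraph_chunks:
    print(repr(chunk))
```

With these inputs the first two paragraphs share a chunk and the third starts a new one, so no paragraph is cut in the middle.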

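<b>chunk_size:</b> This parameter specifies the desired size of each chunk. For TokenTextSplitter it is measured in tokens (words or subwords); for RecursiveCharacterTextSplitter it is measured in characters by default.

<b>chunk_overlap:</b> This parameter specifies the overlap between consecutive chunks. A value of 0 means no overlap, so each chunk starts exactly where the previous one ends. The code below uses an overlap of 105 tokens, so text near a chunk boundary appears in both neighbouring chunks.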

The split_documents method takes your list of documents (e.g., text files or PDFs) and splits them into smaller chunks based on the tokenization and parameters you've specified. The output is a list of chunks that can be used for further processing or retrieval.
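
How `chunk_size` and `chunk_overlap` interact can be shown with a small sliding-window sketch. It uses whitespace splitting as a stand-in for a real tokenizer, and the helper `chunk_tokens` is a hypothetical name introduced for illustration; it is not how `TokenTextSplitter` is implemented internally.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace splitting stands in for a real tokenizer here.
words = "a b c d e f g h i j".split()
token_chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
print(token_chunks)
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.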

In [None]:
# Configure text splitter with token based chunking
text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=105)

# Configure text splitter with semantic chunking
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=500,  
#     chunk_overlap=100, 
#     separators=["\n\n", "\n", ".", "?", "!", " ", ""],  # Specify sentence and paragraph separators
#     keep_separator=True,  # Keep the separator (e.g., sentence end markers) in the chunks
# )

# Split documents into chunks
docs = text_splitter.split_documents(documents)

## 5. Storing the data

Now we need to put the chunks into an index so that we can retrieve them easily when we want to find something in the document or answer questions. We use an embedding model and a vector database for this purpose.

**Create Embeddings:**
`encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")`:
* This line creates an embedding function using the HuggingFaceEmbeddings class from LangChain. The model_name parameter specifies the model to use for generating embeddings. In this case, it uses the "sentence-transformers/all-MiniLM-L6-v2" model, which is a pre-trained sentence transformer model known for its efficiency and accuracy in creating sentence embeddings.
* The encoder converts text documents (or text chunks) into numerical vector representations (embeddings) that capture the semantic meaning of the text. This is necessary for ChromaDB to enable semantic search, allowing the system to match user queries with relevant text chunks based on their meanings rather than exact keyword matches. By using embeddings, ChromaDB can provide more accurate and context-aware search results, leading to better question-answering and information retrieval.

**Create Search Engine/ Database from Provided Documents - ChromaDB:**
* This code initializes a Chroma vector database using the LangChain library.
* The docs variable should contain a list of documents (or text chunks) that you want to index in the vector database.
* The embedding function (encoder) is used to convert the text documents into vector representations.
* persist_directory: Specifies the directory where the vector database should be persisted (saved). In this case, it uses the .chromadb directory.

By indexing the documents using the embedding function, you create a searchable database where you can retrieve relevant information based on similarity searches.
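
A similarity search over embeddings can be sketched in a few lines of plain Python. The vectors below are made-up three-dimensional stand-ins for real embeddings, and cosine similarity is used for illustration only; the distance metric a vector store actually applies depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
toy_index = [
    ("chunk about AI trends", [0.9, 0.1, 0.0]),
    ("chunk about data governance", [0.1, 0.9, 0.0]),
    ("chunk about cloud platforms", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]
top_chunks = top_k(query_vec, toy_index, k=2)
print(top_chunks)
```

The query vector points mostly in the same direction as the first chunk's vector, so that chunk ranks first even though no keywords are compared.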

**Create Retriever:**
`retriever = search_engine.as_retriever()`:
* This line creates a retriever from the search engine (Chroma vector database).
* The retriever can be used to query the vector database and retrieve relevant documents or information based on the similarity between the query and the indexed documents.
* Using the retriever, you can integrate the vector database with other components of your application, such as an LLM-based question-answering system.

In [None]:
# Create embeddings
encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create search engine/ database from provided documents - chromadb
search_engine = Chroma.from_documents(
    documents=docs,
    embedding=encoder,
    persist_directory=".chromadb"
)

# Initialize the retriever to fetch data from ChromaDB based on the query
retriever = search_engine.as_retriever()

## 6. Question-answering (QA) system

The QA system uses a prompt template for generating prompts, a language model chain for generating answers, and a retrieval-augmented generation chain that combines retrieval and generation. The code snippet below initializes a text generation pipeline using the Mistral 7B model. This pipeline generates responses from the input text it is given. The code sets up a text generation task with the `transformers` library, specifying the model, tokenizer, and generation parameters. The `HuggingFacePipeline` instance wraps the pipeline so that it can be used by the rest of the application.

- **Text Generation Pipeline Initialization**:
    - **model**: The pre-trained Mistral 7B model instance is provided to the text generation pipeline. This model generates the output text from the provided input.
    - **tokenizer**: The tokenizer converts input text into tokens that the model can understand. It also converts the model's output tokens back into human-readable text.
    - **task**: The code specifies the task as "text-generation", which tells the pipeline to generate text that continues the input provided.
    - **temperature**: This controls the randomness of the text generation. Lower values make the output more focused and deterministic, while higher values introduce more randomness. In `transformers`, temperature only takes effect when sampling is enabled (`do_sample=True`); with greedy decoding it is ignored.
    - **repetition_penalty**: This parameter discourages repetition in the output text. Values above 1.0 penalize tokens that have already appeared.
    - **return_full_text**: This specifies whether the returned text includes the input prompt in addition to the newly generated text. It is set to True here, which is why the answer functions below split the output on `[/INST]`.
    - **max_new_tokens**: This parameter sets the maximum number of new tokens (words or subword pieces) that the model can generate in response to an input prompt.

- **HuggingFacePipeline**:
    - After setting up the text generation pipeline, the code creates an instance of `HuggingFacePipeline` using the initialized pipeline. This serves as the large language model (LLM) that you can use for further operations, such as responding to user queries.
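
The effect of temperature can be illustrated with a small softmax sketch in plain Python. The logits are hypothetical scores for three candidate tokens, and the function `softmax_with_temperature` is introduced here for illustration; it shows the principle only and is not the sampling code used by `transformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=1.5)
print([round(p, 3) for p in sharp])  # almost all probability on the top token
print([round(p, 3) for p in flat])   # probability spread more evenly
```

At a low temperature such as 0.2, nearly all probability mass sits on the highest-scoring token, so sampled output is close to deterministic.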

In [None]:
# create a text generation pipeline using Mistral7b
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

## 7. Implement the Pipelines

To experiment with different approaches and chains, three variations are used to query the database and get responses.

* <b>llm_chain:</b> sets up a generation-only question-answering chain by piping the prompt template into the language model (llm). It uses the same prompt template as the other chains but does not use the retriever.

* <b>rag_chain:</b> sets up a retrieval-augmented generation (RAG) chain. It combines retrieval (using the retriever) with generation (using the language model).

* <b>RetrievalQA:</b> is a specific type of chain optimized for question-answering tasks by integrating a language model with a retriever to provide accurate and relevant answers to user queries. It is a specific implementation of the RAG approach, focusing on question-answering.

In [None]:
# Define common prompt template function
def create_prompt_template(context_var="context", question_var="question"):
    prompt_template = f"""
    ### [INST]
    Instruction: Answer the question based on your data science knowledge. Here is context to help:

    {{{context_var}}}

    ### QUESTION:
    {{{question_var}}}

    [/INST]"""
    return PromptTemplate(input_variables=[context_var, question_var], template=prompt_template)
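
The triple braces in the f-string above are easy to misread. As a small sketch of what they produce: inside an f-string, `{{` and `}}` are literal braces, so `{{{context_var}}}` renders as the placeholder `{context}`. The shortened template text here is illustrative, and plain `str.format` stands in for the template's own formatting step.

```python
context_var, question_var = "context", "question"

# "{{" and "}}" are literal braces in an f-string, so "{{{name}}}" becomes "{<value of name>}".
template = f"Context:\n{{{context_var}}}\n\nQuestion:\n{{{question_var}}}"
print(template)

# The resulting placeholders can then be filled in; str.format is used here for illustration.
filled = template.format(context="Chunk text goes here.", question="What is data science?")
print(filled)
```

This is why the function can accept custom variable names: the f-string builds the placeholder names first, and the values are substituted later.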

### 1. LLM Chain:

LLM Chain is a straightforward approach that directly uses a language model (LLM) to generate responses to user queries. It should be used when you want to rely purely on the language model's capabilities without incorporating external knowledge.

Pros:
* Simplicity and ease of use.
* Fast response times as it doesn't require retrieving context from a knowledge base.

Cons:
* May lack accuracy and relevance if the model's training data doesn't include specific or recent information.
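
The `prompt | llm` composition used for this chain can be pictured with a toy example. The `Step` class and the two lambda stand-ins below are hypothetical and only mimic the idea of chaining steps with the `|` operator; they are not LangChain's implementation.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""
    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (self | other) runs self first, then feeds its output to other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical stand-ins for the prompt template and the language model.
fill_prompt = Step(lambda inputs: f"[INST] {inputs['question']} [/INST]")
fake_llm = Step(lambda prompt: prompt + " (model output)")

toy_chain = fill_prompt | fake_llm
toy_output = toy_chain.invoke({"question": "What is data science?"})
print(toy_output)  # [INST] What is data science? [/INST] (model output)

# When the prompt is echoed back, the answer is the part after the closing tag.
toy_answer = toy_output.split("[/INST]")[1].strip()
print(toy_answer)  # (model output)
```

The last two lines mirror the post-processing in the notebook: when the full text is returned, the prompt has to be stripped off to leave only the generated answer.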

In [None]:
# Define function to initialize RunnableSequence chain
def initialize_llm_chain(llm, context_var="context", question_var="question"):
    prompt = create_prompt_template(context_var, question_var)
    # Create a RunnableSequence by chaining the prompt and the LLM using the | operator
    llm_chain = prompt | llm
    
    return llm_chain

# Initialize the chain
llm_chain = initialize_llm_chain(llm)

# Function to generate answers using LLM chain
def generate_answer_with_llm_chain(query):
    response = llm_chain.invoke({"context": "", "question": query})
    if '[/INST]' in response:
        # Split the response at '[/INST]' and return the part after it
        answer = response.split('[/INST]')[1].strip()
    else:
        # If the closing tag is not found, return the whole response as the answer
        answer = response.strip()
    
    return answer
#     answer = response['text'].split('[/INST]')[1].strip()
#     return answer

In [None]:
query = "Give me a summary of data science"
print("Answer using LLM Chain:", generate_answer_with_llm_chain(query))

### 2. Retrieval-Augmented Generation (RAG):

RAG involves augmenting the language model with external context retrieved from a knowledge base or document database. It could be used when you want to enhance the quality and relevance of responses by providing the model with contextually relevant information.

Pros:
* Produces more accurate and context-aware responses.
* Can handle a wide variety of tasks beyond question-answering.

Cons:
* May introduce latency due to retrieval of context.
* Requires careful selection and indexing of the knowledge base.
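
The retrieve-then-generate flow can be sketched end to end with plain Python stand-ins. Everything here is hypothetical: `retrieve` ranks chunks by word overlap instead of embeddings, and `fake_llm` simply echoes the prompt with a placeholder answer. The sketch also joins the retrieved chunks into a single context string before building the prompt, which is one common way to pass retrieved text to the model.

```python
def retrieve(query, chunks, k=2):
    """Toy retriever: rank chunks by how many query words they contain."""
    query_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(query_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(context_chunks, question):
    """Join the retrieved chunks into one context string and place it in the prompt."""
    context = "\n\n".join(context_chunks)
    return f"[INST] Use the context to answer.\n\n{context}\n\nQUESTION: {question} [/INST]"

def fake_llm(prompt):
    """Placeholder for the language model call; echoes the prompt like return_full_text=True."""
    return prompt + " <generated answer>"

chunks = [
    "Data science combines statistics and programming.",
    "The report also covers cloud spending.",
    "Generative AI is a major data trend.",
]
question = "what is data science"
retrieved = retrieve(question, chunks)
rag_prompt = build_prompt(retrieved, question)
rag_answer = fake_llm(rag_prompt).split("[/INST]")[1].strip()
print(retrieved)
print(rag_answer)  # <generated answer>
```

The structure is the same as in the chain below: the question goes to the retriever, the retrieved text and the question fill the prompt, and the model output is post-processed into the final answer.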

In [None]:
# Define function to initialize RAG chain
def initialize_rag_chain(llm, retriever, context_var="context", question_var="question"):
    prompt = create_prompt_template(context_var, question_var)
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain
    return rag_chain

# Initialize the RAG chain
rag_chain = initialize_rag_chain(llm, retriever)

# Function to generate answers using RAG chain
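def generate_answer_with_rag_chain(query):
    response = rag_chain.invoke(query)
    # Keep only the text after the closing [/INST] tag; fall back to the full text if the tag is missing
    answer = response['text'].split('[/INST]')[-1].strip()
    return answer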


In [None]:
query = "What is data science?"
print("Answer using RAG Chain:", generate_answer_with_rag_chain(query))

### 3. RetrievalQA:

RetrievalQA is a specific type of chain optimized for question-answering tasks that retrieves relevant context from a document database. It is used when your primary focus is on answering questions based on context from a document database.

Pros:
* Streamlined for question-answering, providing accurate and informative answers.
* Optimized integration of language model and retriever.

Cons:
* Less flexible than RAG in terms of tasks it can handle.

In [None]:
# # Initialize the chain
# chain = RetrievalQA.from_chain_type(
#     llm=llm,  # Mistral 7b model instance
#     chain_type="stuff",
#     retriever=search_engine.as_retriever()
# )

In [None]:
# Function to initialize the RetrievalQA chain
def initialize_retrievalqa_chain(llm, retriever):
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever
    )
    return chain

# Function to generate answers using the RetrievalQA chain
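def generate_answer_with_retrievalqa_chain(query, chain):
    response = chain.invoke(query)
    # RetrievalQA returns a dict; the generated text is stored under the 'result' key
    if 'result' in response:
        answer = response['result'].strip()
    elif 'answer' in response:
        answer = response['answer'].strip()
    elif 'text' in response:
        answer = response['text'].strip()
    else:
        # Unknown output format: fall back to the string form of the whole response
        answer = str(response)
    
    return answer
#     answer = response['text'].strip()
#     return answer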


# Initialize the RetrievalQA chain
retrievalqa_chain = initialize_retrievalqa_chain(llm, search_engine.as_retriever())

In [None]:
# Generate an answer using the RetrievalQA chain
query = "Give me a summary of the document 5"
answer = generate_answer_with_retrievalqa_chain(query, retrievalqa_chain)
print("Answer using RetrievalQA Chain:", answer)