## Objective
The goal is to develop an intelligent Q&A system that can:

- **Ingest large text data** from multiple file types.
- **Index** this data into a vector store for efficient similarity search.
- **Retrieve** relevant documents that may contain the answer to a query.
- **Generate** a response by combining relevant retrieved text and an LLM.


## What is RAG?
**Retrieval-Augmented Generation (RAG)** is a technique that combines information retrieval with text generation to provide more accurate and contextually relevant responses. RAG uses a two-step process:

- **Retrieval**: Given a query, the system retrieves the most relevant documents based on embeddings.
- **Generation**: The retrieved documents are used as context to generate a well-informed answer.

RAG is especially useful for large document collections, where the answer to a question sits in a few passages and the whole corpus cannot be handed to the model as context.


## Architecture of RAG

1. **Indexing and Embeddings**: Large texts are broken into smaller chunks and embedded into a high-dimensional space where semantically similar chunks lie closer together. Here, the `sentence-transformers/all-MiniLM-L6-v2` model, loaded through HuggingFace embeddings, is used.

2. **Vector Store (FAISS)**: Embeddings are stored and indexed in **FAISS** (Facebook AI Similarity Search), an efficient tool for large-scale similarity search.

3. **Retriever**: A retriever uses similarity search to find documents in FAISS that are most relevant to a query. The retriever is set to find the top 6 similar chunks.

4. **Prompt Creation**: Using the retrieved documents, a prompt template structures the information to aid the language model in answering the query.

5. **Answer Generation (LLM)**: The relevant documents are passed to an LLM, in this case, **Cohere**, which generates an answer by leveraging the provided context.
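
The retrieval step above can be illustrated without any vector library. The following is a minimal sketch, assuming an exhaustive scan by Euclidean distance as a stand-in for what a flat vector index does; the helper names (`l2_distance`, `top_k`) and the three-dimensional toy vectors are hypothetical and are not produced by the embedding model used in this notebook.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, index, k=6):
    """Return the k (distance, text) pairs closest to query_vec.

    `index` is a list of (vector, text) pairs. Scanning every entry is a
    simplified stand-in for a real vector index.
    """
    scored = [(l2_distance(query_vec, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
toy_index = [
    ([0.9, 0.1, 0.0], "natural selection in bacteria"),
    ([0.8, 0.2, 0.1], "antibiotic resistance"),
    ([0.0, 0.1, 0.9], "structure of a flower"),
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([text for _, text in results])
```

With real embeddings the vectors have hundreds of dimensions, but the idea is the same: the query is embedded, compared against every stored chunk, and the closest `k` chunks are returned.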


## Steps in the Implementation

1. **Document Loading and Text Extraction**:
   - Documents in several formats (PDF, PPTX, DOCX) are loaded and converted to plain text using libraries like PyMuPDF, python-pptx, and python-docx.

2. **Text Splitting**:
   - To handle large text inputs, the combined text is split into chunks of up to 1,000 characters with a 200-character overlap, so that context straddling a chunk boundary is not lost.

3. **Embedding Creation**:
   - Each chunk is converted into an embedding vector with **HuggingFaceEmbeddings**. These vectors are then stored in **FAISS**.

4. **Indexing with FAISS**:
   - **FAISS** indexes the embeddings, enabling fast similarity searches. Each chunk's embedding is stored, making it possible to search and retrieve related chunks quickly.

5. **Retrieval**:
   - For a given query, **FAISS** searches the indexed embeddings and retrieves the most similar chunks. These chunks are then formatted to form a context for the language model.

6. **Prompt Creation and LLM Invocation**:
   - The retrieved chunks are fed into a prompt template. The template instructs the LLM to answer the question based on the context of the retrieved documents.
   - **Cohere’s LLM** is used here for generating a final answer based on the question and relevant context provided.

7. **Answer Generation**:
   - The final answer is drawn from the retrieved chunks rather than from the model's memory alone. The prompt also instructs the model to reply "answer not available in context" when the chunks do not contain the answer.
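
The text-splitting step can be sketched in plain Python to show what overlap means. This is a simplified illustration, assuming fixed-size character windows; the function name `split_with_overlap` is hypothetical, and the splitter used later in the notebook additionally prefers to break on separators such as newlines and spaces.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# The last 4 characters of one chunk are repeated at the start of the next.
print(pieces[0][-4:], pieces[1][:4])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters.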


## Key Components of the Code

1. **FAISS Vector Store**:
   - Provides efficient storage and search for embeddings, helping find the top-k chunks that match a query.

2. **HuggingFace Embeddings**:
   - Embeddings allow us to encode documents into dense vectors, making similarity-based retrieval feasible.

3. **LangChain Components**:
   - LangChain provides utilities for prompt formatting and handling multi-step processes like RAG, making it easier to chain retrieval and generation tasks.

4. **Cohere LLM**:
   - Processes the prompt with context to generate a coherent answer.
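
How these components fit together can be sketched with plain functions. This is a minimal illustration under stated assumptions: `fake_retrieve` and `echo_llm` are hypothetical stand-ins for the FAISS retriever and the Cohere model, and `build_prompt` and `answer` are illustrative names rather than LangChain APIs.

```python
PROMPT_TEMPLATE = (
    "Answer the question using only the provided context. If the answer is "
    'not contained in the context, say "answer not available in context".\n\n'
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)

def format_docs(docs):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(docs)

def build_prompt(question, retrieve):
    """Retrieve chunks for the question and fill in the prompt template."""
    context = format_docs(retrieve(question))
    return PROMPT_TEMPLATE.format(context=context, question=question)

def answer(question, retrieve, llm):
    """Chain the steps: retrieve, format, prompt, generate."""
    return llm(build_prompt(question, retrieve))

# Hypothetical stand-ins so the sketch runs without any external service.
def fake_retrieve(question):
    return ["Chunk one about evolution.", "Chunk two about bacteria."]

def echo_llm(prompt):
    return f"(stub) received a prompt of {len(prompt)} characters"

prompt_text = build_prompt("What is natural selection?", fake_retrieve)
print(prompt_text)
print(answer("What is natural selection?", fake_retrieve, echo_llm))
```

Because the retriever and the model are passed in as callables, either one can be swapped without touching the rest of the chain.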

## Advantages of Using RAG

- **Enhanced Accuracy**: By retrieving relevant context, RAG helps ground generated answers in the content of the source documents rather than in the model's memory alone.
- **Scalability**: RAG allows handling large datasets by separating retrieval and generation. Retrieval ensures that only the most relevant chunks are used for generation.
- **Flexibility**: Different LLMs or embedding models can be plugged into this framework as needed.

## Example Use Case

- Given a biology-related question, the RAG system can search and retrieve relevant sections from a biology PDF, summarize a PowerPoint, or answer questions based on a Word FAQ document, enhancing student learning or research capabilities.

This RAG approach provides a powerful, modular, and flexible method for constructing Q&A systems with robust context-based answer generation.


In [None]:
%pip install python-docx
%pip install python-pptx
%pip install PyPDF2
%pip install langchain
%pip install langchain_community
%pip install langchain_google_genai
%pip install langchain_text_splitters
%pip install sentence-transformers
%pip install faiss-cpu
%pip install cohere


In [None]:
from docx import Document
from PyPDF2 import PdfReader
from pptx import Presentation
from langchain_community.llms import Cohere
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts  import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd ..
%cd /content/drive/MyDrive

Mounted at /content/drive
/
/content/drive/MyDrive


In [None]:
# The PDF is read with PyMuPDF (fitz) in a later cell, so no file handle is opened here.
ppt_file = Presentation("/content/drive/MyDrive/Biology_NCERT_Class_12th_Summary_PPT.pptx")
doc_file = Document('/content/drive/MyDrive/Biology_NCERT_Class_12th_FAQs.docx')

In [None]:
!pip install pymupdf



In [None]:
import fitz  # PyMuPDF library

pdf_text = ""
with fitz.open("/content/drive/MyDrive/lebo107_merged_merged.pdf") as pdf_reader:
    for page_num in range(pdf_reader.page_count):
        page = pdf_reader[page_num]
        pdf_text += page.get_text()

# extracting ppt data
ppt_text = ""
for slide in ppt_file.slides:
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            ppt_text += shape.text + '\n'

# extracting doc data
doc_text = ""
for paragraph in doc_file.paragraphs:
    doc_text += paragraph.text + '\n'

In [None]:
all_text = pdf_text + '\n' + ppt_text + '\n' + doc_text
len(all_text)

287335

In [None]:

# splitting the text into chunks for embeddings creation

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200, # Overlap helps preserve context that would otherwise be cut at chunk boundaries.
        length_function = len,
        separators=['\n', '\n\n', ' ', '']
    )

chunks = text_splitter.split_text(text = all_text)

In [None]:
len(chunks)

355

In [None]:
# Initializing embeddings model

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


In [None]:
# Indexing the data using FAISS
vectorstore = FAISS.from_texts(chunks, embedding = embeddings)

In [None]:
# creating retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
retrieved_docs = retriever.invoke("How does the concept of natural selection explain the adaptation of species to rapidly changing environments, such as antibiotic resistance in bacteria, and what does this reveal about the rate and direction of evolutionary changes?")

In [None]:
len(retrieved_docs)

6

In [None]:
print(retrieved_docs[0].page_content)

which appears to be ‘similar’ to a corresponding
marsupial (e.g., Placental wolf and Tasmanian
wolf-marsupial). (Figure 6.7).
6.5 BIOLOGICAL EVOLUTION
Evolution by natural selection, in a true sense
would have started when cellular forms of life
with differences in metabolic capability
originated on earth.
The essence of Darwinian theory about
evolution is natural selection. The rate of
appearance of new forms is linked to the life cycle
or the life span. Microbes that divide fast have
the ability to multiply and become millions of
individuals within hours. A colony of bacteria
(say A) growing on a given medium has built-in
variation in terms of ability to utilise a feed
component. A change in the medium
composition would bring out only that part of
the population (say B) that can survive under
the new conditions. In due course of time this
variant population outgrows the others and
appears as new species. This would happen
within days. For the same thing to happen in a


In [None]:
prompt_template = """Answer the question as precisely as possible using the provided context. If the answer is
                not contained in the context, say "answer not available in context". \n\n
                Context: \n {context}\n
                Question: \n {question} \n
                Answer:"""

prompt = PromptTemplate.from_template(template=prompt_template)

In [None]:
# Join the page contents of the documents retrieved from FAISS into a single context string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
%pip install --upgrade langchain

In [None]:
import os

def generate_answer(question):
    # Read the API key from the environment instead of hard-coding it in the notebook.
    cohere_api_key = os.environ["COHERE_API_KEY"]
    cohere_llm = Cohere(model="command", temperature=0.1, cohere_api_key=cohere_api_key)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | cohere_llm
        | StrOutputParser()
    )

    return rag_chain.invoke(question)

In [None]:
ans = generate_answer("How does the concept of natural selection explain the adaptation of species to rapidly changing environments, such as antibiotic resistance in bacteria?")
print(ans)

 Natural selection explains adaptation of species to changing environments through several mechanisms:
- Genetic Variation: Within populations, there is inherent genetic variation due to mutations and recombination of genes from parents. This variation is the raw material for natural selection to act upon. 
- Selective Pressure: Changing environments, such as the introduction of antibiotics, create new challenges and selective pressures for organisms. Certain traits or behaviors become more advantageous for survival and reproduction. 
- Differential Survival and Reproduction: Organisms with traits that provide an advantage in the new environment are more likely to survive and reproduce, passing on their beneficial traits to the next generation. This is the core principle of natural selection. 
- Accumulation of Adaptive Traits: Over time, as organisms with advantageous traits reproduce more successfully, the frequency of these traits increases in the population. This can lead to the em

In [None]:
import pickle

In [None]:
components_to_save = {
    "vectorstore": vectorstore,
    "text_splitter": text_splitter,
    "prompt_template": prompt_template,
    "answer": ans  # Adding the answer to be saved
}

# Save the components and answer as a pickle file
with open("saved_components_with_answer.pkl", "wb") as file:
    pickle.dump(components_to_save, file)

print("Components and answer saved to 'saved_components_with_answer.pkl'")

Components and answer saved to 'saved_components_with_answer.pkl'


In [None]:
from google.colab.output import eval_js
print(eval_js("google.colab.kernel.proxyPort(5000)"))

https://5d9xj20dym-496ff2e9c6d22116-5000-colab.googleusercontent.com/


In [None]:
from flask import Flask, request, jsonify, render_template
import pickle

app = Flask(__name__, template_folder='/content/drive/MyDrive')

# Load your saved components
with open("saved_components_with_answer.pkl", "rb") as file:
    components = pickle.load(file)

vectorstore = components["vectorstore"]
text_splitter = components["text_splitter"]
prompt_template = components["prompt_template"]
saved_answer = components["answer"]

def preprocess_text(question):
    """
    Retrieve the chunks most similar to the question from the vector store.
    """
    # similarity_search returns a list of Document objects.
    processed_input = vectorstore.similarity_search(question)
    return processed_input

@app.route('/', methods=['GET'])
def home():
    return render_template('new_qarag.html')

@app.route('/chat', methods=['POST'])
def chat():
    try:
        user_question = request.form.get('question', '').strip()
        if not user_question:
            return jsonify({'error': 'No question provided.'}), 400

        # Retrieve the chunks most similar to the question
        processed_question = preprocess_text(user_question)

        # Build a placeholder response from the retrieved chunks and the answer saved earlier (no LLM is called here)
        response = f"Processed Question: {processed_question}, Saved Answer: {saved_answer}"

        return jsonify({'response': response})
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run()


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [27/Dec/2024 16:41:03] "GET /?authuser=0 HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [27/Dec/2024 16:41:03] "GET /favicon.ico?authuser=0 HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [27/Dec/2024 16:41:08] "POST /chat?authuser=0 HTTP/1.1" 200 -
