# **"Generative AI ChatBot"**

### This notebook builds chatbots using pretrained models and retrieval-augmented generation (RAG).

#### **Libraries Used:**
- gradio
- PyPDF2
- faiss
- numpy
- sentence_transformers
- google.generativeai
- openai
- gTTS
- tempfile
- speech_recognition

#### **Work Flow:**
- Load model using API key
- Read PDF
- Split the text extracted from each PDF into chunks
- Encode passages using Sentence Transformers
- Create FAISS index
- Retrieve passages based on input query
- Generate answers based on retrieved passages
- Cite the source of each response (which PDF and which page it came from)
- Host it on Gradio


# +_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+_+

## **Using RAG**

### The workflow for applying RAG is as follows:

1. Read the PDFs
2. Split the extracted text into chunks (this keeps each model input small, so large PDFs can be handled)
3. Encode the passages with Sentence Transformers (this converts text into vectors)
4. Create a FAISS index (this makes the passage vectors searchable by similarity)
5. Retrieve passages based on the input query
6. Generate answers based on the retrieved passages
7. Cite the source of each response (which PDF and which page it came from)

### Import Libraries

In [5]:
import os
import PyPDF2
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
import gradio as gr

### Configure API

In [6]:
os.environ['API_KEY'] = "Add Your Key"  # Replace with your actual API key
genai.configure(api_key=os.environ['API_KEY'])

# Choose a model
gen_model = genai.GenerativeModel('gemini-1.5-flash')
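
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead, assuming the variable is still named `API_KEY` and has been set outside the notebook. The helper name `load_api_key` is hypothetical, not part of any library:

```python
import os


def load_api_key(var_name: str = "API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    `load_api_key` is an illustrative helper, not a library function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this cell."
        )
    return key


# Usage (commented out because it needs a real key and the genai import):
# genai.configure(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first model call.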

### Initialize variables to store data

In [7]:
indexes = []
pdf_data = []  # To store passages and their metadata

### 1. Read Multiple PDFs

In [8]:
def read_pdfs(pdf_files):
    """Return a list of (full_text, pdf_name) tuples, one per uploaded PDF."""
    all_texts = []
    for pdf_file in pdf_files:
        text = ""
        with open(pdf_file.name, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                # Guard against pages that yield no text (e.g. scanned images)
                text += (page.extract_text() or "") + "\n"
        pdf_name = os.path.basename(pdf_file.name)
        all_texts.append((text, pdf_name))
    return all_texts

### 2. Make Chunks

In [9]:
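def make_chunks(text, pdf_name, chunk_size=500, chunk_overlap=50):
    """Split text into overlapping character chunks tagged with (pdf_name, page)."""
    if chunk_overlap >= chunk_size:
        # Otherwise `start` would never advance and the loop would not end
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    page_numbers = []
    start = 0
    page = 1  # Estimated page; the joined text carries no real page boundaries
    while start < len(text):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
            page_numbers.append((pdf_name, page))
        start += chunk_size - chunk_overlap
        # Rough heuristic only: the counter advances whenever `start` is a multiple
        # of chunk_size (every 10 chunks with the defaults), not at real page breaks
        if start % chunk_size == 0:
            page += 1
    return chunks, page_numbers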
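
The `page` counter in `make_chunks` is only an estimate, because `read_pdfs` joins all pages into one string before chunking. One way to report real page numbers is to chunk each page separately. The sketch below assumes the text is available as a list with one string per page (for example, collected from `page.extract_text()` inside the reading loop). `chunk_pages` is a hypothetical helper, and in this version chunks never span a page boundary:

```python
def chunk_pages(page_texts, pdf_name, chunk_size=500, chunk_overlap=50):
    """Split each page into overlapping chunks and record its real page number."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks, sources = [], []
    for page_number, page_text in enumerate(page_texts, start=1):
        for start in range(0, len(page_text), step):
            chunk = page_text[start:start + chunk_size]
            if chunk.strip():  # skip whitespace-only chunks
                chunks.append(chunk)
                sources.append((pdf_name, page_number))
    return chunks, sources


# Tiny demonstration with two fake pages and small sizes.
demo_chunks, demo_sources = chunk_pages(
    ["a" * 12, "b" * 5], "demo.pdf", chunk_size=10, chunk_overlap=2
)
print(demo_sources)
```

The trade-off is that a sentence split across two pages ends up in two different chunks, which may slightly hurt retrieval for content near page breaks.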

### 3. Encode passages

In [10]:
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def encode_passages(passages):
    embeddings = sentence_model.encode(passages, convert_to_tensor=True)
    return embeddings



### 4. Create FAISS index

In [11]:
def create_index(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings.cpu().numpy())
    return index
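
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. The plain-Python sketch below mimics that behaviour on toy vectors to show what the index returns. `flat_l2_search` is a hypothetical stand-in for illustration, not the FAISS API:

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first, ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)
```

Because every stored vector is compared with the query, search time grows linearly with the number of chunks, which is usually acceptable for a handful of PDFs.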

### 5. Retrieve passages based on query

In [12]:
def retrieve_passages(index, passages, page_numbers, query, k=5):
    # Encode straight to a float32 NumPy array of shape (1, dim), which is what
    # index.search expects (a torch tensor on GPU cannot be converted implicitly)
    query_embedding = sentence_model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_embedding, k)

    relevant_passages = []
    relevant_page_numbers = []

    for distance, idx in zip(distances[0], indices[0]):
        # FAISS pads the result with -1 when the index holds fewer than k vectors
        if 0 <= idx < len(passages) and distance < 1.3:  # Threshold for relevance
            relevant_passages.append(passages[idx])
            relevant_page_numbers.append(page_numbers[idx])

    return list(zip(relevant_passages, relevant_page_numbers)), distances[0]
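
The `1.3` cutoff is easier to reason about as a cosine similarity. For unit-length vectors, squared L2 distance and cosine similarity are related by `d^2 = 2 - 2 * cos`. The sketch below applies that identity, assuming the embeddings are unit-normalized (worth verifying for the model in use). Both function names are hypothetical:

```python
def l2_squared_to_cosine(distance_squared):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance_squared / 2.0


def cosine_to_l2_squared(cosine):
    """Convert a cosine similarity between unit vectors to squared L2 distance."""
    return 2.0 - 2.0 * cosine


# Under the normalization assumption, the notebook's threshold maps to this similarity.
threshold_as_cosine = l2_squared_to_cosine(1.3)
print(round(threshold_as_cosine, 2))
```

Under that assumption, a threshold of `1.3` accepts passages whose cosine similarity with the query is above roughly 0.35, which is a fairly permissive cutoff.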

### 6. Generate answer based on retrieved passages

In [13]:
def generate_answer(gen_model, prompt, retrieved_passages):
    response = ""
    sources_info = {}

    # Collect passages by source
    for passage, (pdf, page) in retrieved_passages:
        if (pdf, page) not in sources_info:
            sources_info[(pdf, page)] = []
        sources_info[(pdf, page)].append(passage)

    # Construct the response
    for (pdf, page), passages in sources_info.items():
        # Join all passages from the same source and page
        response += "\n".join(passages) + f" [Source: {pdf}, Page: {page}]\n\n"

    # Generate the final response text
    response_text = gen_model.generate_content(prompt + "\n\n" + response)

    # Prepare sources list (dict keys are already unique and keep retrieval order)
    sources_list = "\n".join([f"[Source: {pdf}, Page: {page}]" for pdf, page in sources_info])

    return response_text.text + "\n\nSources:\n" + sources_list
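
`generate_answer` appends the retrieved text to the user's prompt without telling the model how to use it. A common refinement is an explicit instruction to answer only from the supplied context. The sketch below builds such a prompt as a plain string. The instruction wording and the helper name `build_prompt` are assumptions, and the result would be passed to `gen_model.generate_content` as before:

```python
def build_prompt(question, retrieved_passages):
    """Assemble a grounded prompt from (passage, (pdf, page)) pairs."""
    context_blocks = []
    for passage, (pdf, page) in retrieved_passages:
        context_blocks.append(f"[Source: {pdf}, Page: {page}]\n{passage}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


demo_prompt = build_prompt(
    "What is RAG?",
    [("RAG combines retrieval with generation.", ("notes.pdf", 2))],
)
print(demo_prompt)
```

Placing the source tag before each passage keeps the attribution next to the text it describes, which may make it easier for the model to refer to a specific source.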

### 7. Define chatbot function

In [14]:
def chatbot(prompt, state, pdf_files):
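    global indexes, pdf_data  # Declare global variables
    pdf_data = []  # Reset for new input
    indexes = []

    if not pdf_files:  # Nothing uploaded yet, so there is nothing to search
        return "Please upload at least one PDF.", state

    # Read and process the PDFs (they are re-indexed on every call)
    all_texts = read_pdfs(pdf_files)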
    for text, pdf_name in all_texts:
        passages, page_numbers = make_chunks(text, pdf_name)
        embeddings = encode_passages(passages)
        index = create_index(embeddings)
        indexes.append((index, passages, page_numbers))

    # Retrieve relevant passages from all PDFs
    retrieved_passages = []
    for index, passages, page_numbers in indexes:
        passages_batch, distances = retrieve_passages(index, passages, page_numbers, prompt)

        # Debugging output
        print(f"Distances: {distances}")
        
        # Check if any retrieved passages have a low distance (indicating relevance)
        if len(distances) > 0 and np.any(distances < 1.3):  # Adjust threshold as needed
            retrieved_passages.extend(passages_batch)

    # Generate response based on the retrieved passages
    if not retrieved_passages:
        response = "I don't have this information. For more information, contact +123456789." # If info asked is out of PDF
    else:
        response = generate_answer(gen_model, prompt, retrieved_passages)

    return response, state
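
`chatbot` re-reads and re-embeds every PDF on each question, which is likely the slowest part of the pipeline. One option is to rebuild the indexes only when the set of uploaded files changes. The sketch below shows the caching pattern in isolation. `get_or_build`, `_cache`, and the `build` callback are hypothetical names, and the cache key assumes that file paths identify an upload:

```python
_cache = {}


def get_or_build(file_paths, build):
    """Return cached artefacts for this set of files, building them only once."""
    key = tuple(sorted(file_paths))
    if key not in _cache:
        _cache[key] = build(file_paths)
    return _cache[key]


build_calls = []


def fake_build(paths):
    # Stand-in for read_pdfs + make_chunks + encode_passages + create_index.
    build_calls.append(list(paths))
    return {"n_files": len(paths)}


first = get_or_build(["a.pdf", "b.pdf"], fake_build)
second = get_or_build(["b.pdf", "a.pdf"], fake_build)  # same set, different order
print(len(build_calls))
```

If the upload widget stores each upload under a fresh temporary path, a content hash would be a more reliable cache key than the path.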

### Create Gradio Interface

In [16]:
# Create Gradio interface
demo = gr.Interface(
    fn=chatbot,
    inputs=["text", "state", gr.File(label="Upload PDFs", file_count="multiple")],
    outputs=["text", "state"],
    title="PDF Chatbot with RAG",
    description="Ask me anything based on the uploaded PDFs!",
)

# Launch the Gradio app
demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7862
Running on public URL: https://9efe4097e9f256d3f1.gradio.live

Distances: [1.2891536 1.4580312 1.6631095 1.6664965 1.7959213]


### Importing Libraries