## Part 1: Retrieval-Augmented Generation (RAG) Model for QA Bot

Problem Statement:
Develop a Retrieval-Augmented Generation (RAG) model for a Question Answering (QA) bot for a business. Use a vector database like Pinecone DB and a generative model like Cohere API (or any other available alternative). The QA bot should be able to retrieve relevant information from a dataset and generate coherent answers.

Task Requirements:
1. Implement a RAG-based model that can handle questions related to a provided document or dataset.
2. Use a vector database (such as Pinecone) to store and retrieve document embeddings efficiently.
3. Test the model with several queries and show how well it retrieves and generates accurate answers from the document.

Deliverables:
- A Colab notebook demonstrating the entire pipeline, from data loading to question answering.
- Documentation explaining the model architecture, approach to retrieval, and how generative responses are created.
- Provide several example queries and the corresponding outputs.


In [1]:
!pip install pinecone-client cohere transformers PyPDF2 fastapi uvicorn

import os
import pinecone
import cohere
from transformers import AutoTokenizer, AutoModel, pipeline
import torch
from PyPDF2 import PdfReader
from fastapi import FastAPI, File, UploadFile, Form



In [2]:
import os
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone with your API key.
# Read the key from the PINECONE_API_KEY environment variable instead of hard-coding it in the notebook.
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

# Define the index name
index_name = 'myindex'

# Check if the index exists, if not, create it
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Must match the 384-dimensional output of 'multi-qa-MiniLM-L6-cos-v1'
        metric='cosine',  # Use 'cosine' for similarity metric
        spec=ServerlessSpec(
            cloud='aws',  # Cloud provider is AWS
            region='us-east-1'  # Region is 'us-east-1'
        )
    )

# Connect to the index
index = pc.Index(index_name)
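
The index above is created with `metric='cosine'`, which ranks stored vectors by the angle between them and the query vector rather than by their magnitude. As a rough illustration in plain Python (not a description of how Pinecone computes scores internally), the measure can be sketched as follows. `cosine_similarity` is a hypothetical helper name introduced only for this example.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors: similarity is 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because the score depends only on direction, a query and a stored vector must have the same dimension (384 here) for the comparison to be defined.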

In [3]:
# Initialize Pinecone using Pinecone class
'''from pinecone import Pinecone, ServerlessSpec # Import ServerlessSpec
pc = Pinecone(api_key='YOUR_PINECONE_API_KEY', environment='us-east-1')

# Create an index in Pinecone (use a suitable dimensionality, e.g., 768 for BERT)
index_name = 'document-embeddings'
if index_name not in pc.list_indexes(): # Use pc.list_indexes()
    pc.create_index(index_name, dimension=768, spec=ServerlessSpec(cloud='aws', region='us-east-1')) # Use pc.create_index() and add cloud and region arguments
else:
    print(f"Index '{index_name}' already exists.")
# Initialize Cohere
co = cohere.Client('YOUR_COHERE_API_KEY')'''

PineconeApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-07', 'X-Cloud-Trace-Context': 'eb0c9973d9568e1253c13b587133fd7e', 'Date': 'Sat, 21 Sep 2024 16:40:00 GMT', 'Server': 'Google Frontend', 'Content-Length': '85', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"ALREADY_EXISTS","message":"Resource  already exists"},"status":409}


In [None]:
'''# function to load the sample document
def load_document(document_text):
  embedding =  get_embedding(document_text)
  index.upsert(
    vectors=[
        {
            "id": "vec1",
            "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
            "metadata": {"genre": "drama"}
        }, {
            "id": "vec2",
            "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
            "metadata": {"genre": "action"}
        }, {
            "id": "vec3",
            "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
            "metadata": {"genre": "drama"}
        }, {
            "id": "vec4",
            "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
            "metadata": {"genre": "action"}
        }
    ],
    namespace= "ns1"
)'''

In [None]:
'''#query testing
index.query(
    namespace="ns1",
    vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
    top_k=2,
    include_values=True,
    include_metadata=True,
    filter={"genre": {"$eq": "action"}}
)'''

In [5]:
'''tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')'''

# Use a different pre-trained model optimized for QA tasks
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
model = AutoModel.from_pretrained('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
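# Generate an embedding by mean-pooling the token outputs of the MiniLM model loaded above
def get_embedding(text):
    # truncation=True drops any tokens beyond the model's maximum input length
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # Average over the token dimension, then drop the batch dimension
        embeddings = model(**inputs).last_hidden_state.mean(dim=1).squeeze()
    return embeddings.numpy().tolist()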
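
`get_embedding` averages `last_hidden_state` over every token position. For a single string this is fine because nothing is padded, but if a batch of texts were passed with `padding=True`, the padded positions would also enter the mean. The dependency-free sketch below shows the idea of masked mean pooling on plain lists. `masked_mean_pool` is a hypothetical helper, and a real version would operate on the model's tensors and attention mask instead.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (mask value 0)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # the padding row does not affect the result
```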

In [7]:
# Function to load and process a document (PDF) into embeddings
def load_document(pdf_path):
    # Extract text from the PDF
    reader = PdfReader(pdf_path)
    document_text = ""
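    for page in reader.pages:
        document_text += page.extract_text() or ""  # guard against pages with no extractable text

    # Embed the whole document as a single vector; get_embedding truncates
    # any input beyond the model's 512-token limit
    embedding = get_embedding(document_text)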

    # Upsert embeddings into Pinecone
    index.upsert(
        vectors=[{
            "id": "doc1",  # Unique ID for the document
            "values": embedding,
            "metadata": {"source": "Business Document"}
        }]
    )
    return document_text

In [8]:
!pip install pycryptodome



## File loading and testing

In [9]:
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

# Function to load and process a document (PDF) into embeddings
def load_document(pdf_path, password=None):
    try:
        reader = PdfReader(pdf_path)

        # Check if the PDF is encrypted
        if reader.is_encrypted:
            print("PDF is encrypted. Attempting to decrypt...")
            # Try to decrypt using the provided password
            if password:
                # decrypt() returns 0 on failure, 1 for the user password, 2 for the owner password
                success = reader.decrypt(password)
                if success == 0:
                    raise PdfReadError("Failed to decrypt PDF with provided password.")
            else:
                raise PdfReadError("PDF is encrypted, but no password was provided.")

        # Extract text from the PDF, skipping pages with no extractable text
        document_text = ""
        for page in reader.pages:
            document_text += page.extract_text() or ""

        # Get embeddings for the document
        embedding = get_embedding(document_text)

        # Upsert embeddings into Pinecone
        index.upsert(
            vectors=[{
                "id": "doc1",  # Unique ID for the document
                "values": embedding,
                "metadata": {"source": "Business Document"}
            }]
        )
        return document_text

    except PdfReadError as e:
        print(f"Error reading PDF: {str(e)}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

# Example usage with a PDF (with or without a password)
pdf_path = '/content/NachiketDattatrayaDixitResume (1).pdf'
document_text = load_document(pdf_path, password=None)  # If encrypted, pass the password here
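
`load_document` stores the whole PDF as a single vector (`doc1`), so every query can only ever match that one entry, and any text beyond the embedding model's input limit is not represented. A common alternative is to split the text into overlapping chunks and upsert one vector per chunk. The sketch below assumes word-based chunks. `chunk_words`, `build_vectors`, the chunk sizes, and the `text` metadata field are all illustrative choices, not part of the original notebook.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def build_vectors(chunks, embed, doc_id="doc1"):
    """Build upsert records, one per chunk, using the supplied embed function."""
    return [
        {
            "id": f"{doc_id}-chunk-{i}",
            "values": embed(chunk),
            "metadata": {"source": "Business Document", "text": chunk},
        }
        for i, chunk in enumerate(chunks)
    ]

# Demonstration with a stand-in embed function (assumption: real code would pass get_embedding)
demo_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2)
demo_vectors = build_vectors(demo_chunks, embed=lambda chunk: [float(len(chunk))])
print(demo_chunks)
print([v["id"] for v in demo_vectors])
```

In the notebook, the records could then be written with something like `index.upsert(vectors=build_vectors(chunk_words(document_text), get_embedding))`, so that retrieval returns the most relevant passages instead of a single document ID.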


In [10]:
def retrieve_relevant_info(question):
    question_embedding = get_embedding(question)
    response = index.query(
        vector=question_embedding,
        top_k=2,  # Fetch the two nearest vectors; only the best match is used below
        include_metadata=True
    )

    if response['matches']:
        document_id = response['matches'][0]['id']
        score = response['matches'][0]['score']
        return document_id, score
    else:
        return None, None
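
`retrieve_relevant_info` returns the top match regardless of how low its score is. One possible refinement is to keep only matches above a minimum similarity, so that weak matches are treated as "nothing found". The sketch below works on any response that supports the same `response['matches']` access used above. `select_matches` and the `min_score` default are assumptions for illustration, and a suitable threshold would need to be tuned on real queries.

```python
def select_matches(response, min_score=0.25):
    """Return (id, score) pairs scoring at least min_score, best first."""
    pairs = [
        (match["id"], match["score"])
        for match in response["matches"]
        if match["score"] >= min_score
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Stand-in response shaped like the dictionary access used in the notebook
fake_response = {
    "matches": [
        {"id": "doc1-chunk-3", "score": 0.12},
        {"id": "doc1-chunk-0", "score": 0.41},
        {"id": "doc1-chunk-7", "score": 0.30},
    ]
}
print(select_matches(fake_response))
```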

In [11]:
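def truncate_text(text, max_length=1024):
    # Note: `tokenizer` is the MiniLM embedding tokenizer, not GPT-2's, so the count only
    # approximates GPT-2 tokens, and decoding returns lowercased, re-spaced text.
    tokens = tokenizer.encode(text)
    if len(tokens) > max_length:
        # Keep the first max_length tokens and convert them back to text
        truncated_tokens = tokens[:max_length]
        return tokenizer.decode(truncated_tokens, skip_special_tokens=True)
    return text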
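
`truncate_text` counts tokens with the embedding model's tokenizer, while the text is later consumed by a different model. A variant that accepts the tokenizer as a parameter makes it possible to measure length with the generator's own tokenizer. The sketch below is written against any object exposing `encode` and `decode`. `truncate_for_generator` and `WhitespaceTokenizer` are hypothetical names, and the whitespace tokenizer is only a stand-in so the example runs without downloading a model.

```python
def truncate_for_generator(text, gen_tokenizer, max_tokens=512):
    """Truncate text to at most max_tokens tokens of the given tokenizer."""
    token_ids = gen_tokenizer.encode(text)
    if len(token_ids) <= max_tokens:
        return text  # short enough: return the original text unchanged
    return gen_tokenizer.decode(token_ids[:max_tokens])

class WhitespaceTokenizer:
    """Stand-in tokenizer for demonstration: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)

demo = truncate_for_generator("one two three four five", WhitespaceTokenizer(), max_tokens=3)
print(demo)
```

In the notebook, the second argument could be a tokenizer loaded with `AutoTokenizer.from_pretrained("gpt2")`, so that the limit is expressed in the same tokens the generator counts.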

In [12]:
def generate_answer(question, document_text):
    # Initialize the text-generation pipeline (`pipeline` is imported in cell In [1]).
    # Note: this reloads GPT-2 on every call.
    generator = pipeline("text-generation", model="gpt2")

    # Truncate the document to about 512 tokens so that the prompt plus 50 new tokens
    # stays within GPT-2's 1024-token context window
    truncated_document_text = truncate_text(document_text, max_length=512)

    # Build the prompt from the truncated document text and the question
    prompt = f"Document: {truncated_document_text}\nQuestion: {question}\nAnswer:"

    # Generate a continuation; 'generated_text' contains the prompt followed by the new tokens
    answer = generator(prompt, max_new_tokens=50, num_return_sequences=1)[0]['generated_text']

    return answer
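
Because `generated_text` starts with the prompt itself, the value returned by `generate_answer` repeats the document and the question before the model's continuation. A small post-processing step can keep only the continuation. `extract_answer` below is a hypothetical helper sketched for that purpose.

```python
def extract_answer(generated_text, prompt):
    """Return only the text generated after the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()

demo_prompt = "Document: sample text\nQuestion: What is this?\nAnswer:"
demo_output = demo_prompt + " A short sample."
print(extract_answer(demo_output, demo_prompt))
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect without the extra helper.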

In [13]:
question = "What is the work experience?"
doc_id, score = retrieve_relevant_info(question)

In [14]:
if doc_id:
    print(f"Relevant Document ID: {doc_id}, Score: {score}")
    answer = generate_answer(question, document_text)
    print(f"Answer: {answer}")
else:
    print("No relevant information found.")

Relevant Document ID: doc1, Score: 0.304644


Token indices sequence length is longer than the specified maximum sequence length for this model (717 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer: Document: nachiket dattatraya dixit software developer / data scientist / analyti cs pune, maharashtra, india | + 91 - 8830942045 | erdnachiket @ gmail. com | linkedin | github educat ion bachelor of engineering in mechanical engineering savitribai phule pune university 08 / 2016 - 05 / 2022 cgpa : - 8. 13 / 10 cgpa experience data science intern oasis infobyte 04 / 2024 - 05 / 2024 python sql sql server machine learning time series data analysis data visualization git github microsoft azure linux tableau business intelligence • developed ml models achieving 95 % + accuracy in text classification tasks. • conducted eda, preprocessing, and feature engineering for 98 % spam detection accuracy. • improved sales forecasting accuracy by 15 % using regression and ensemble techniques. data science intern mentorness duration 01 / 2023 - 04 / 2024 • published 5 + articles on data science trends, showcasing communication skills. • devised an innovative ai - driven churn prediction model 