<a href="https://colab.research.google.com/github/kdhenderson/msds_colab_notebooks/blob/main/RAG_workshop_part3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

The code below pins package versions that were checked for compatibility, so it should run as written in Colab. It reads the PDF files in a folder and answers questions by retrieving relevant text from those files and reporting the source file and chunk.

As an example, the code reads an academic paper on the impacts of greenspace on health, and answers the question: "Is cardiovascular mortality affected by greenspace exposure?"

__Improving Chunking Strategies:__ Using sentences and paragraphs for chunking ensures better context preservation and more meaningful chunks.

__Using More Sophisticated Language Models:__ Utilizing larger, more powerful models can generate better embeddings, improving the retrieval accuracy.

__Refining the Answer Generation Process:__ Combining multiple top chunks and generating a refined answer ensures a more comprehensive and accurate response.
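
As a rough illustration of the sentence-based chunking idea above, the sketch below groups whole sentences into chunks that stay under a word budget. It is a simplified stand-in, not the notebook's own code: `split_sentences` is a hypothetical regex splitter used so the example runs without NLTK (the notebook uses `sent_tokenize`), and the sample text and `max_words` value are made up.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (an assumption for this sketch):
    break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def chunk_by_sentences(text, max_words=100):
    """Group whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own chunk."""
    chunks, current, length = [], [], 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        # Flush the running chunk only if it already holds something,
        # so an oversized first sentence never produces an empty chunk.
        if current and length + n_words > max_words:
            chunks.append(' '.join(current))
            current, length = [], 0
        current.append(sentence)
        length += n_words
    if current:
        chunks.append(' '.join(current))
    return chunks

# Made-up sample text: three sentences of 5, 6 and 6 words.
sample = ("The first sentence is short. The second one is short too. "
          "The third starts a new chunk.")
for chunk in chunk_by_sentences(sample, max_words=11):
    print(repr(chunk))
```

Because sentences are never split, each chunk stays readable on its own, at the cost of chunk sizes that vary with sentence length.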

In [None]:
pip install PyMuPDF transformers faiss-cpu

In [None]:
pip install openai==0.28

In [None]:
pip install --upgrade gradio

In [None]:
pip install urllib3==1.26.12

In [None]:
pip install requests==2.28.2

## OpenAI Text Model

In [None]:
import os
import fitz  # PyMuPDF
import numpy as np
import faiss
from transformers import AutoTokenizer, AutoModel
import openai
import torch
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 1: Read PDF Files
def read_pdfs(folder_path):
    pdf_texts = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(folder_path, file_name)
            try:
                doc = fitz.open(file_path)
                text = ""
                for page in doc:
                    text += page.get_text()
                pdf_texts.append((file_name, text))
            except Exception as e:
                print(f"Error reading {file_name}: {e}")
    return pdf_texts

# Step 2: Chunk Text
def chunk_text(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        words = sentence.split()
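        # Only flush a non-empty chunk; otherwise a first sentence longer than
        # chunk_size would append an empty string.
        if current_chunk and current_length + len(words) > chunk_size: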
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.extend(words)
        current_length += len(words)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Step 3: Create Embeddings
def create_embeddings(text_chunks, tokenizer, model):
    embeddings = []
    for chunk in text_chunks:
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    return np.array(embeddings)

# Step 4: Index Embeddings
def index_embeddings(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

# Step 5: Answer Questions using OpenAI API
def answer_question_openai(question, pdf_texts, index, embeddings, tokenizer, model, openai_api_key, temperature=1.0, max_tokens=150, top_k=3):
    openai.api_key = openai_api_key

    # Create embedding for the question
    inputs = tokenizer(question, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        question_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for the nearest text chunks
    _, indices = index.search(np.array([question_embedding]), k=top_k)
    indices = indices[0]

    # Collect top-k chunks
    retrieved_chunks = []
    sources = []
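    for idx in indices:
        if idx < 0:  # FAISS pads results with -1 when the index holds fewer than top_k vectors
            continue
        chunk_offset = int(idx)
        pdf_idx = 0

        # pdf_texts holds (pdf_name, chunks) pairs, so len(...) counts that PDF's chunks
        while chunk_offset >= len(pdf_texts[pdf_idx][1]):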
            chunk_offset -= len(pdf_texts[pdf_idx][1])
            pdf_idx += 1

        pdf_name, chunks = pdf_texts[pdf_idx]
        retrieved_chunks.append(chunks[chunk_offset])
        sources.append(f"{pdf_name}, Chunk {chunk_offset}")

    # Combine retrieved chunks
    combined_text = ' '.join(retrieved_chunks)

    # Prepare the prompt for OpenAI
    prompt = f"Question: {question}\n\nContext: {combined_text}\n\nAnswer:"

    # Call OpenAI API
    response = openai.Completion.create(
        engine="davinci-002",  # or any other GPT-3 model
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        n=1,
        stop=None
    )

    refined_answer = response.choices[0].text.strip()

    return f"Answer: {refined_answer}\nSources: {sources}"

# Main function to tie everything together
def main(folder_path, question, model_name, openai_api_key, temperature=1.0, max_tokens=150):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    chunk_mapping = []

    for pdf_name, text in pdf_texts:
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        chunk_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    embeddings = create_embeddings(all_chunks, tokenizer, model)
    index = index_embeddings(embeddings)

    # Answer question using OpenAI
    answer = answer_question_openai(question, chunk_mapping, index, embeddings, tokenizer, model, openai_api_key, temperature, max_tokens)
    print(answer)

# Example usage
if __name__ == "__main__":
    folder_path = '/content/drive/My Drive/PDFs/'  # Adjust this to your folder path
    question = "What is the homework worth in DS 6371?"
    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # or any other model you prefer
    openai_api_key = "XYZ123"  # Replace with your actual OpenAI API key
    temperature = .9
    max_tokens = 30
    main(folder_path, question, model_name, openai_api_key, temperature, max_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Answer: Each case will have a point value between 10 and 50 points. Generally you can expect a total of 150-200 points depending on the
Sources: ['ds_6371_syllabus Ver 7.pdf, Chunk 13', 'ds_6371_syllabus Ver 7.pdf, Chunk 12', 'ds_6371_syllabus Ver 7.pdf, Chunk 20']
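
The `Sources` list above pairs each retrieved chunk with its PDF name and its chunk number within that PDF. The index search returns positions in the flat list of all chunks, so each position has to be translated back to a per-PDF offset. The sketch below isolates that step; `locate_chunk` is a hypothetical helper name and the two-file mapping is invented for illustration.

```python
def locate_chunk(flat_idx, chunk_mapping):
    """Translate a position in the flat list of all chunks into
    (pdf_name, offset within that PDF's chunk list)."""
    if flat_idx < 0:
        # FAISS reports -1 for neighbours it could not find.
        raise IndexError("negative index: no neighbour was returned")
    for pdf_name, chunks in chunk_mapping:
        if flat_idx < len(chunks):
            return pdf_name, flat_idx
        flat_idx -= len(chunks)  # skip past this PDF's chunks
    raise IndexError("index is past the last chunk")

# Hypothetical mapping: two PDFs holding 3 and 2 chunks.
chunk_mapping = [
    ("a.pdf", ["a0", "a1", "a2"]),
    ("b.pdf", ["b0", "b1"]),
]
print(locate_chunk(1, chunk_mapping))
print(locate_chunk(3, chunk_mapping))
```

This only works if the embeddings were added to the index in the same order as the chunks appear in the mapping, which is how `main` builds them.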


## Chat Model from OpenAI

In [None]:
import os
import fitz  # PyMuPDF
import numpy as np
import faiss
from transformers import AutoTokenizer, AutoModel
import openai
import torch
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this for sent_tokenize
import gradio as gr

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


# Step 1: Read PDF Files
def read_pdfs(folder_path):
    pdf_texts = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.pdf'):
            file_path = os.path.join(folder_path, file_name)
            try:
                doc = fitz.open(file_path)
                text = ""
                for page in doc:
                    text += page.get_text()
                pdf_texts.append((file_name, text))
            except Exception as e:
                print(f"Error reading {file_name}: {e}")
    return pdf_texts

# Step 2: Chunk Text
def chunk_text(text, chunk_size=100):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        words = sentence.split()
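        # Only flush a non-empty chunk; otherwise a first sentence longer than
        # chunk_size would append an empty string.
        if current_chunk and current_length + len(words) > chunk_size: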
            chunks.append(' '.join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.extend(words)
        current_length += len(words)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Step 3: Create Embeddings
def create_embeddings(text_chunks, tokenizer, model):
    embeddings = []
    for chunk in text_chunks:
        inputs = tokenizer(chunk, return_tensors='pt', truncation=True, padding=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    return np.array(embeddings)

# Step 4: Index Embeddings
def index_embeddings(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index

# Step 5: Answer Questions using OpenAI Chat API
def answer_question_openai_chat(question, pdf_texts, index, embeddings, tokenizer, model, openai_api_key, temperature=1.0, max_tokens=150, top_k=6):
    openai.api_key = openai_api_key

    # Create embedding for the question
    inputs = tokenizer(question, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        question_embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Search for the nearest text chunks
    _, indices = index.search(np.array([question_embedding]), k=top_k)
    indices = indices[0]

    # Collect top-k chunks
    retrieved_chunks = []
    sources = []
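    for idx in indices:
        if idx < 0:  # FAISS pads results with -1 when the index holds fewer than top_k vectors
            continue
        chunk_offset = int(idx)
        pdf_idx = 0

        # pdf_texts holds (pdf_name, chunks) pairs, so len(...) counts that PDF's chunks
        while chunk_offset >= len(pdf_texts[pdf_idx][1]):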
            chunk_offset -= len(pdf_texts[pdf_idx][1])
            pdf_idx += 1

        pdf_name, chunks = pdf_texts[pdf_idx]
        retrieved_chunks.append(chunks[chunk_offset])
        sources.append(f"{pdf_name}, Chunk {chunk_offset}")

    # Combine retrieved chunks
    combined_text = ' '.join(retrieved_chunks)

    # Prepare the messages for the Chat API
    messages = [
        {"role": "system", "content": "You are a helpful assistant that is reading a syllabus for a student. Be brief when possible. You don't need to use all the tokens in your response. Always start the response with a kind greeting."},
        {"role": "user", "content": f"Context: {combined_text}\n\nQuestion: {question}"}
    ]

    # Call OpenAI Chat API
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # or any other available chat model
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    )

    refined_answer = response['choices'][0]['message']['content'].strip()

    # Notes for a follow-up pipeline:
    # 1. Call1 reason RAG.
    # 2. Look up the text from the text resolution for that Call1 reason.
    # 3. Send that text and the original X (2000 words of text) to a final
    #    seq2seq model to produce the text for the customer.

    return f"Answer: {refined_answer}\nSources: {sources}"

# Main function to tie everything together
def main(folder_path, question, model_name, openai_api_key, temperature=1.0, max_tokens=150):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    chunk_mapping = []

    for pdf_name, text in pdf_texts:
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        chunk_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    embeddings = create_embeddings(all_chunks, tokenizer, model)
    index = index_embeddings(embeddings)

    # Answer question using OpenAI Chat API
    answer = answer_question_openai_chat(question, chunk_mapping, index, embeddings, tokenizer, model, openai_api_key, temperature, max_tokens)
    print(answer)



# Gradio Interface Functions
def process_pdfs_and_answer_question(folder_path, question, model_name, openai_api_key, temperature=0.1, max_tokens=150):

    #os.makedirs(folder_path, exist_ok=True)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Read and chunk PDFs
    pdf_texts = read_pdfs(folder_path)
    all_chunks = []
    chunk_mapping = []

    for pdf_name, text in pdf_texts:
        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        chunk_mapping.append((pdf_name, chunks))

    # Create and index embeddings
    embeddings = create_embeddings(all_chunks, tokenizer, model)
    index = index_embeddings(embeddings)

    # Answer question using OpenAI (a slider may pass a float, but max_tokens must be an integer)
    answer = answer_question_openai_chat(question, chunk_mapping, index, embeddings, tokenizer, model, openai_api_key, temperature, int(max_tokens))
    return answer

# Create Gradio Interface
iface = gr.Interface(
    fn=process_pdfs_and_answer_question,
    inputs=[
        gr.Textbox(lines=2, placeholder="RAG Folder Path", label="Path"),
        gr.Textbox(lines=2, placeholder="Enter your question here...", label="Question"),
        gr.Textbox(label="Model Name"),
        gr.Textbox(type="password", label="OpenAI API Key"),
        gr.Slider(minimum=0.0, maximum=1.0, label="Temperature"),
        gr.Slider(minimum=1, maximum=256, step=1, label="Max Tokens")
    ],
    outputs="text",
    title="PDF Question Answering System",
    description="Enter the path to a folder of PDF files, ask a question, and get an answer based on the content of the PDFs."
)

iface.launch(share=True)



## Temperature
"Temperature" is a parameter used in the generation process of large language models (LLMs) to control the randomness of the output. It essentially adjusts the probability distribution of the predicted tokens, influencing the diversity and creativity of the generated text. Here's how it works:

Low Temperature (e.g., close to 0): When the temperature is low, the model becomes more deterministic and conservative. It tends to choose the most probable next word in the sequence, leading to more predictable and repetitive outputs. This setting is useful when you want the model to generate precise and factual text.

High Temperature (e.g., 1.0 or higher): When the temperature is high, the model's predictions become more random and diverse. The logits are divided by the temperature before the softmax, so a temperature of 1.0 leaves the model's distribution unchanged and values above 1.0 flatten it, giving less likely words a higher chance of being selected. This can lead to more creative and varied responses, but it can also introduce more mistakes or less coherent text.

In essence, the temperature parameter allows you to balance between coherence and creativity:

__Low temperature:__ More focused and deterministic output.

__High temperature:__ More diverse and creative output.

The choice of temperature depends on the specific use case and the desired characteristics of the generated text.

Here's an example illustrating the effect of temperature on a simple prompt:

Prompt: "The quick brown fox"

__Low temperature (e.g., 0.2):__ "The quick brown fox jumps over the lazy dog."

__High temperature (e.g., 1.0):__ "The quick brown fox danced under a glowing moon, chasing shadows."
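
To make the effect on the probability distribution concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The three logits are made-up scores for three candidate tokens, and `softmax_with_temperature` is an illustrative helper, not part of any library used in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature nearly all of the probability lands on the highest-scoring token, and as the temperature rises the three probabilities move closer together.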

## Reference Inputs for the Gradio Interface Above

In [None]:
#folder_path = '/content/drive/My Drive/PDFs/'  # Adjust this to your folder path
#    question = "What is the SMU Honor Code?"
#    question = "What is the FLS assignment?"
#    question = "What is homework worth in DS 6371?"
#    model_name = "sentence-transformers/all-MiniLM-L6-v2"  # or any other model you prefer
#    openai_api_key = "XYZ123"  # Replace with your actual OpenAI API key
#    temperature = .1
#    max_tokens = 20
#    main(folder_path, question, model_name, openai_api_key, temperature, max_tokens)