<div style="text-align: center;">
    <h1 style="color: #FF6347;">Retrieval-Augmented Generation (RAG)</h1>
</div>

<div style="text-align: center;">
    <img src="https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExZ3FsdzRveTBrenMxM3VnbDMwaTJxN2NnZm50aGFibXk1NzNnY2Q0MCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LR5ZBwZHv02lmpVoEU/giphy.gif" alt="NLP Gif" style="width: 300px; height: 150px; object-fit: cover; object-position: center;">
</div>


**RAG** (Retrieval-Augmented Generation) is a Natural Language Processing (NLP) technique that combines document retrieval with a generative model: relevant passages are retrieved first and then passed to the model as context, so that responses are grounded in source material. The technique is particularly useful for question-answering systems, chatbots, and technical document analysis.

- **What is RAG?**
  - Combines information retrieval with generative models.
  - Retrieves relevant context from a document corpus or database and integrates it into generated responses.
  - Designed for tasks requiring high accuracy and context sensitivity.

- **Key Use Cases:**
  - Question-answering systems.
  - Chatbots that provide real-time, context-aware responses.
  - Technical document analysis and summarization.
  - Customer support with tailored, informed replies.

- **Benefits of RAG:**
  - Incorporates new information by updating the document store, without retraining the model.
  - Reduces hallucination in generative models.
  - Enhances user interaction by grounding responses in verifiable data.
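
The retrieve-then-generate idea described above can be sketched with nothing but the standard library. This is a toy illustration, not the pipeline built later in this lesson: the corpus, the word-overlap scoring rule, and the function names are all assumptions chosen to keep the example self-contained.

```python
# Toy retrieve-then-augment loop. The corpus, scoring rule, and helper names
# are illustrative assumptions, not part of LangChain or this lesson's pipeline.

def score(query: str, passage: str) -> int:
    """Count how many distinct query words also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k passages that share the most words with the query."""
    ranked = sorted(corpus, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Augment the question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Chroma stores embeddings for similarity search.",
    "PyPDFLoader reads PDF files page by page.",
    "Cosine similarity compares the direction of two vectors.",
]
question = "How does cosine similarity compare vectors?"
top_passages = retrieve(question, corpus, k=1)
toy_prompt = build_prompt(question, top_passages)
print(toy_prompt)  # In a real system this prompt would be sent to a generative model.
```

A real system replaces the word-overlap score with embedding similarity and sends the prompt to a language model, which is what the rest of this lesson does.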


<h3 style="color: #FF8C00;">By the End of This Lesson, You'll:</h3>

- Understand the fundamentals of Retrieval-Augmented Generation (RAG).
- Learn key text preprocessing techniques for RAG.
- Use word embeddings to create numerical representations of text.
- Apply document retrieval techniques to find relevant context.
- Employ generative models to create context-aware responses.
- Analyze and interpret the generated responses for insights.

In [10]:
# conda create -n genai python=3.10
# pip install "pydantic<2.0"
# pip install langchain==0.0.230
# pip install python-dotenv
# pip install openai
# pip install --upgrade langchain pydantic
# pip install -U langchain-community
# pip install -U langchain-openai
# pip install sentence-transformers
# pip install pypdf
# pip install chromadb

# conda env export > environment.yml
# conda env create -f environment.yml

<h1 style="color: #FF6347;">Data Storage & Retrieval</h1>


<h2 style="color: #FF8C00;">PyPDFLoader</h2>

`PyPDFLoader` is a LangChain document loader, built on the `pypdf` package, that loads a PDF file and extracts its text for downstream processing. It is particularly useful in Retrieval-Augmented Generation workflows where text extraction from PDFs is required.

- **What Does PyPDFLoader Do?**
  - Extracts text from a PDF page by page; by default it returns plain text and does not preserve the visual layout.
  - Simplifies the preprocessing of document-based datasets.
  - Loads one PDF per loader instance; larger collections can be handled by looping over files.

- **Key Features:**
  - Returns LangChain `Document` objects that plug directly into text splitters and vector stores.
  - Handles multi-page PDFs; text inside embedded images requires a separate OCR setup, which this lesson does not use.
  - Attaches `source` and `page` metadata to each extracted page.

- **Use Cases:**
  - Preparing PDF documents for retrieval-based systems in RAG.
  - Automating the text extraction pipeline for document analysis.
  - Creating datasets from academic papers, technical manuals, and reports.


In [11]:
# pip install langchain
# pip install langchain_community
# pip install pypdf

In [12]:
import os
from langchain_community.document_loaders import PyPDFLoader  # current import path; langchain.document_loaders is deprecated
from langchain.text_splitter import CharacterTextSplitter

<h3 style="color: #FF8C00;">Loading the Documents</h3>

In [13]:
# File path for the document
document_dir = "./"
filename = "The International System for Serous Fluid Cytopathology.pdf"
file_path = os.path.join(document_dir, filename)

<h3 style="color: #FF8C00;">Documents into pages</h3>

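`PyPDFLoader` loads a PDF and can split it into smaller, manageable parts for NLP tasks. Its `load_and_split()` method loads the pages and then runs them through a text splitter (LangChain's default recursive character splitter when none is supplied), so long pages may be divided further.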

This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).


In [14]:
# Load and split the document
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()
len(pages)

327

<h3 style="color: #FF8C00;">Pages into Chunks</h3>

The `CharacterTextSplitter` utility helps divide text into smaller chunks, making it more manageable for downstream NLP tasks. This is particularly useful in workflows like Retrieval-Augmented Generation (RAG), where documents need to be processed as discrete sections.

- **Code Explanation:**
  - `CharacterTextSplitter(chunk_size=10000, chunk_overlap=0)`:
    - Initializes a text splitter with a specified chunk size and overlap.
    - **`chunk_size=10000`**: Each chunk targets at most 10,000 characters. The splitter cuts on a separator (a blank line by default) and merges the pieces up to this limit.
    - **`chunk_overlap=0`**: No overlap between consecutive chunks.
  - `split_documents(pages)`:
    - Splits the input `pages` (e.g., from `PyPDFLoader`) into smaller text chunks.
  - `chunks`: The resulting list of chunks, each containing a portion of the original document. In the run below the count stays at 327, the same as `pages`, which indicates that no entry was split further at this chunk size.
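
To see how chunk size and overlap interact, here is a simplified fixed-window splitter. It is a sketch for intuition only: `split_by_characters` is a hypothetical helper, and unlike `CharacterTextSplitter` it slices at exact character offsets instead of cutting on a separator.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij"
no_overlap = split_by_characters(sample, chunk_size=4)
with_overlap = split_by_characters(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With overlap, the end of one chunk is repeated at the start of the next, which helps when a sentence that answers a question straddles a chunk boundary, at the cost of storing more text.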


In [15]:
# Split pages into chunks
text_splitter = CharacterTextSplitter(chunk_size=10000, chunk_overlap=0)
chunks = text_splitter.split_documents(pages)
len(chunks)

327

<h2 style="color: #FF8C00;">Embeddings</h2>

Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.

- **What are OpenAI Embeddings?**
  - Pre-trained embedding models such as `text-embedding-3-large` generate high-quality vector representations for text.
  - Encapsulate semantic relationships in the text, enabling robust NLP applications.

- **Key Features of `text-embedding-3-large`:**
  - The larger of OpenAI's `text-embedding-3` models; it returns 3,072-dimensional vectors by default.
  - Handles diverse NLP tasks, including retrieval, classification, and clustering.
  - Runs as a hosted API, so it requires an OpenAI API key.

- **Benefits:**
  - Reduces the need for extensive custom training.
  - Gives strong out-of-the-box retrieval quality for retrieval-augmented systems.
  - Works with RAG pipelines to supply relevant context to the generative model.


In [16]:
from langchain_openai import OpenAIEmbeddings  # current import path; langchain.embeddings is deprecated
from dotenv import load_dotenv
load_dotenv()

True

In [None]:
# terminal: 
# echo OPENAI_API_KEY="" > .env

In [18]:
api_key = os.getenv("OPENAI_API_KEY")  # loaded from .env; OpenAIEmbeddings also reads this variable from the environment
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")



<h2 style="color: #FF8C00;">ChromaDB</h2>

ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.

### Workflow Overview:
- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).
- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.
- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.

### Key Features of ChromaDB:
- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.
- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.
- **Integration:** Supports integration with popular frameworks and libraries for embedding generation.
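
The three workflow steps can be illustrated with a tiny in-memory store. This is a conceptual sketch only: `TinyVectorStore` is a hypothetical class and the three-dimensional vectors are made up, whereas ChromaDB adds persistence and indexing on top of the same store-then-search idea.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        """Step 2: store a text together with its embedding."""
        self._items.append((text, vector))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        """Step 3: return the k stored texts most similar to the query vector."""
        ranked = sorted(
            self._items,
            key=lambda item: self._cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

# Step 1 is faked here: these vectors are invented, not produced by a model.
store = TinyVectorStore()
store.add("pleural effusion", [0.9, 0.1, 0.0])
store.add("vector database", [0.0, 0.2, 0.9])
store.add("mesothelioma markers", [0.8, 0.3, 0.1])
results = store.search([1.0, 0.2, 0.0], k=2)
print(results)
```

This version compares the query against every stored vector, which is fine for a handful of items; dedicated vector databases use approximate indexes so that search stays fast as the collection grows.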

In [20]:
# pip install langchain_chroma
# pip install --upgrade pip setuptools wheel
# pip install duckdb --only-binary=:all:
# pip install chroma-migrate

In [21]:
from langchain_community.vectorstores import Chroma  # current import path; langchain.vectorstores is deprecated

In [22]:
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")
print("ChromaDB created with document embeddings.")

ChromaDB created with document embeddings.


<h1 style="color: #FF6347;">Retrieving Documents</h1>


In [23]:
user_question = "How can I diagnose a malignant mesothelioma?" # User question
retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve

In [24]:
# Display top results
for i, doc in enumerate(retrieved_docs[:1]):  # Display only the top result
    # Skip the first 36 characters (page number and a leading sentence fragment) and cap the output at 1,000
    print(f"Document {i+1}:\n{doc.page_content[36:1000]}")

Document 1:
The definitive diagnosis of malignant mesotheli-
oma can still be made with the use of ancillary studies (IC, analysis of soluble bio-
markers, and p16 FISH). Some patients have thoracoscopic biopsies of parietal 
pleura which are diagnosed as MIS. However, treatment of patients with MIS is not 
yet established. According to the NCCN guidelines, malignant mesothelioma can 
be treated with chemotherapy only if inoperable. Operable mesothelioma patients 
may be treated with pleurectomy/decortication or extrapleural pneumonectomy 
(EPP). Patients with operable mesothelioma can receive chemotherapy either before 
or after surgery. In patients who do not receive induction chemotherapy before EPP, 
postoperative sequential chemotherapy with hemithoracic radiation therapy is rec-
ommended [86]. The detection of malignant mesothelioma early enough to improve 
chances for curative treatment would probably require screening blood samples and 
the employment of


<h2 style="color: #FF8C00;">Preparing Content for GenAI</h2>

In [25]:
def _get_document_prompt(docs):
    prompt = "\n"
    for doc in docs:
        prompt += "\nContent:\n"
        prompt += doc.page_content + "\n\n"
    return prompt

In [26]:
# Generate a formatted context from the retrieved documents
formatted_context = _get_document_prompt(retrieved_docs)
print("Context formatted for GPT model.")

Context formatted for GPT model.
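
The prompt built later asks the model to cite page numbers, but the context above contains only `page_content`. One possible variant carries the source and page along with each passage. This is a sketch under assumptions: `FakeDocument` and `format_context_with_sources` are hypothetical names, and it assumes each document's `metadata` holds `source` and a 0-based `page`, as `PyPDFLoader` typically provides.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used only to keep this sketch self-contained."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context_with_sources(docs) -> str:
    """Join retrieved documents, labelling each with its source file and page."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # Assumption: the page index is 0-based, so add 1 for display.
        page_label = page + 1 if isinstance(page, int) else "unknown"
        parts.append(f"Source: {source}, page {page_label}\nContent:\n{doc.page_content}")
    return "\n\n".join(parts)

docs = [
    FakeDocument("First passage.", {"source": "book.pdf", "page": 0}),
    FakeDocument("Second passage."),
]
context = format_context_with_sources(docs)
print(context)
```

The PDF page index may not match the page number printed in the book, so citations produced this way are best treated as pointers into the file.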


<h2 style="color: #FF8C00;">ChatBot Architecture</h2>

In [27]:
prompt = f"""
## SYSTEM ROLE
You are a knowledgeable and factual chatbot designed to assist with technical questions about **Cytology**, specifically focusing on **Lung Cancer**. 
Your answers must be based exclusively on the content provided from the technical books.

## USER QUESTION
The user has asked: 
"{user_question}"

## CONTEXT
Here is the relevant content from the technical books:  
'''
{formatted_context}
'''

## GUIDELINES
1. **Accuracy**:  
   - Only use the content in the `CONTEXT` section to answer.  
   - If the answer cannot be found, explicitly state: "The provided context does not contain this information."
   - Start by explaining cell morphology, dividing it into bullet points (nuclei, cytoplasm, background, and other aspects to consider).
   - Follow with the differential diagnosis.
   - Lastly, explain ancillary studies for malignant mesothelioma.

2. **Transparency**:  
   - Reference the book's name and page numbers when providing information.  
   - Do not speculate or provide opinions.  

3. **Clarity**:  
   - Use simple, professional, and concise language.  
   - Format your response in Markdown for readability.  

## TASK
1. Answer the user's question **directly** if possible.  
2. Point the user to relevant parts of the documentation.  
3. Provide the response in the following format:

## RESPONSE FORMAT
'''
# [Brief Title of the Answer]
[Answer in simple, clear text.]

**Source**:  
• [Book Title], Page(s): [...]
'''
"""
print("Prompt constructed.")

Prompt constructed.


In [28]:
import openai

In [None]:
# Set up GPT client and parameters
client = openai.OpenAI()
model_params = {
    'model': 'gpt-4o',
    'temperature': 0.7,  # Increase creativity
    'max_tokens': 4000,  # Allow for longer responses
    'top_p': 0.9,        # Use nucleus sampling
    'frequency_penalty': 0.5,  # Reduce repetition
    'presence_penalty': 0.6    # Encourage new topics
}

<h1 style="color: #FF6347;">Response</h1>


In [30]:
messages = [{'role': 'user', 'content': prompt}]
completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)

In [31]:
answer = completion.choices[0].message.content
print(answer)

'''
# Diagnosing Malignant Mesothelioma

To diagnose malignant mesothelioma, a combination of cytological examination and ancillary studies is required. Here is a detailed breakdown:

### Cell Morphology
The definitive diagnosis relies heavily on the morphological characteristics of cells in body cavity fluids.

- **Nuclei**: Malignant cells typically present with prominent nucleoli and an increased nuclear-to-cytoplasmic ratio.
- **Cytoplasm**: The cytoplasm may appear vacuolated or dense, indicative of malignancy.
- **Background**: A bloody effusion might be present, which is common in cases of malignant mesothelioma.
- **Other Aspects to Consider**: Cells are positive for mesothelial markers and negative for epithelial markers in immunochemistry tests.

### Differential Diagnosis
The differential diagnosis includes distinguishing malignant mesothelioma from other mesothelial tumors or conditions such as:

- Mesothelioma in situ (MIS)
- Other epithelial malignancies

### Ancillary St

<div style="background-color: #e6f7ff; padding: 10px; border-radius: 5px; color: black;">
<h2>Diagnosing Malignant Mesothelioma</h2>

To diagnose malignant mesothelioma, a combination of cytological examination, differential diagnosis, and ancillary studies is utilized.

### Cell Morphology
- **Nuclei**: The cells may present with pyknotic nuclei, which are condensed and darkly stained.
- **Cytoplasm**: Cells often appear eosinophilic, meaning they have a pinkish hue when stained with Papanicolaou stain.
- **Cell Clusters**: Mesothelioma cells may form three-dimensional clusters such as papillary or acinar formations.

### Differential Diagnosis
- **Malignant Mesothelioma (MAL-P)**: Diagnosed when overtly malignant features are identified, often confirmed by ancillary tests.
- **Suspicious for Malignancy (SFM)**: Used when features suggest malignancy but are not confirmed by ancillary tests.
- **Atypia of Undetermined Significance (AUS)**: Used sparingly when the clinical picture does not support a reactive etiology and ancillary tests are inconclusive.

### Ancillary Studies
- **Immunocytochemistry (IC)**: Used to identify mesothelial markers and exclude epithelial markers.
- **Fluorescence In Situ Hybridization (FISH)**: Detects deletion of p16/CDKN2A, which supports the diagnosis.
- **Biomarker Analysis**: Soluble biomarkers like hyaluronan, mesothelin, and osteopontin can aid in diagnosis.
- **Loss of BAP1 Expression**: This is another supportive test for diagnosing malignant mesothelioma.

**Source**:  
• C. Michael et al., Pages: 64, 74, 90, 92, 98
</div>

<h2 style="color: #FF6347;">OpenAI Vs SentenceTransformer Embeddings</h2>

Embeddings are critical for transforming text into dense vector representations. Comparing different embedding models helps us:

1. Understand their performance in tasks like similarity search and context retrieval.
2. Determine which model is better suited for specific applications:
   - **OpenAI Embeddings**: High-quality, general-purpose embeddings for robust tasks.
   - **SentenceTransformers**: Lightweight, open-source models such as `all-MiniLM-L6-v2` that run locally and are optimized for speed and efficiency.

In this section, we'll compare:
- Vector dimensions.
- Example embedding outputs for the same query.

In [32]:
# pip install langchain-huggingface
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings  # matches the langchain-huggingface package installed above

In [33]:
openai_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
sentence_transformer_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



In [34]:
from sklearn.decomposition import PCA

In [35]:
openai_vector = openai_embeddings.embed_query(user_question)
sentence_vector = sentence_transformer_embeddings.embed_query(user_question)


In [36]:
len(openai_vector)

3072

In [37]:
len(sentence_vector)

384

In [38]:
# Truncate the longer embedding so both vectors have the same length.
# Caution: this only makes the arithmetic possible. The two models use unrelated
# vector spaces, so position i in one vector has no link to position i in the other.
min_len = min(len(openai_vector), len(sentence_vector))
openai_vector = openai_vector[:min_len]
sentence_vector = sentence_vector[:min_len]

In [39]:
len(openai_vector)

384

In [40]:
len(sentence_vector)

384

In [41]:
# Compare the first five values of each vector as sets (exact float matches are not expected)
common_elements = set(openai_vector[:5]).intersection(set(sentence_vector[:5]))
unique_to_openai = set(openai_vector[:5]) - set(sentence_vector[:5])
unique_to_sentence_transformer = set(sentence_vector[:5]) - set(openai_vector[:5])

print(f"Common elements: {common_elements}\n")
print(f"1. OpenAI Embeddings: \n   {unique_to_openai}\n")
print(f"2. SentenceTransformer Embeddings: \n    {unique_to_sentence_transformer}")

Common elements: set()

1. OpenAI Embeddings: 
   {0.018350934609770775, 0.006769313011318445, 0.0035643160808831453, 0.018132776021957397, -0.01743980497121811}

2. SentenceTransformer Embeddings: 
    {0.008178613148629665, 0.08663614839315414, -0.021265855059027672, 0.0016131369629874825, -0.007358015514910221}


In [42]:
from scipy.spatial.distance import cosine

In [43]:
similarity = 1 - cosine(openai_vector, sentence_vector)
print(f"Cosine Similarity between OpenAI and SentenceTransformer embeddings: {similarity:.4f}")

Cosine Similarity between OpenAI and SentenceTransformer embeddings: 0.0342


<img src="https://miro.medium.com/v2/resize:fit:824/1*GK56xmDIWtNQAD_jnBIt2g.png" alt="NLP Gif" style="width: 500px">

<h2 style="color: #FF6347;">Cosine Similarity</h2>

**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It provides a scale from -1 to 1:

- **-1**: Vectors point in exactly opposite directions.
- **0**: Vectors are orthogonal (no directional similarity).
- **1**: Vectors point in the same direction; their lengths may differ.
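
For two vectors $a$ and $b$, the score is $\cos\theta = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. A minimal pure-Python version makes the three reference cases concrete; `cosine_similarity` here is an illustrative helper, not the SciPy function used elsewhere in this lesson.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # same direction, different length
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(orthogonal, same_direction, opposite)
```

Because the formula divides by both lengths, scaling a vector does not change the score; only its direction matters.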

<h3 style="color: #FF8C00;">OpenAI Vs SentenceTransformer</h3>

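A **cosine similarity score of 0.0342** means the two vectors are almost orthogonal. This is expected and says little about how well either model represents the text: each model defines its own vector space, so a coordinate in one has no relationship to the same coordinate in the other, and truncating the 3,072-dimensional OpenAI vector to 384 values discards most of it. The two spaces differ because of:

- **Model Architecture Differences**: Each model is trained using distinct methodologies and objectives.
- **Diverse Training Data**: The models may have been exposed to varying datasets, leading to differences in how they represent semantic relationships.
- **Embedding Techniques**: Differences in tokenization, vector dimensionality, and training objective give each model its own coordinate system.

A meaningful comparison uses each model within its own space, for example by checking which documents each one ranks highest for the same query.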
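
Since vectors from different models are not directly comparable, one hedged alternative is to compare what each model retrieves. The sketch below measures the overlap between two top-k result lists; the chunk identifiers are hypothetical placeholders, not results from this lesson's data, and `jaccard_overlap` is an illustrative helper.

```python
def jaccard_overlap(ids_a: list[str], ids_b: list[str]) -> float:
    """Size of the intersection divided by the size of the union of two result lists."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Hypothetical top-3 results for one query from two separately indexed models.
top_k_model_a = ["chunk-a", "chunk-b", "chunk-c"]
top_k_model_b = ["chunk-a", "chunk-d", "chunk-b"]
overlap = jaccard_overlap(top_k_model_a, top_k_model_b)
print(f"Top-k overlap: {overlap:.2f}")
```

In practice each model would need its own vector store built from the same chunks, and the overlap would be averaged over many queries before drawing conclusions.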

<img src="https://storage.googleapis.com/lds-media/images/cosine-similarity-vectors.original.jpg" alt="Cosine similarity between vectors" style="width: 700px">

<h2 style="color: #FF6347;">Keyword Highlighting</h2>

Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query.

Using the query keywords `["malignant", "mesothelioma", "diagnosis"]`, this code snippet identifies and highlights their occurrences in the retrieved text. Here's how it works:

- **Process**:
  - Take the top retrieved document.
  - Extract its first 200 characters.
  - Highlight the keywords using the `highlight_keywords` function.

In [44]:
# pip install termcolor

In [45]:
from termcolor import colored

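The `highlight_keywords` function highlights specific keywords within a given text by replacing each keyword with a colored version produced by the `colored` function from the `termcolor` library. The replacement is a case-sensitive exact match, so capitalized variants and words hyphenated across a line break (such as `mesotheli-` / `oma`) are not highlighted.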


In [46]:
def highlight_keywords(text, keywords):
    for keyword in keywords:
        text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))
    return text

In [47]:
query_keywords = ["malignant", "mesothelioma", "diagnosis"]
for i, doc in enumerate(retrieved_docs[:1]):
    snippet = doc.page_content[:200]
    highlighted = highlight_keywords(snippet, query_keywords)
    print(f"Snippet {i+1}:\n{highlighted}\n{'-'*80}")

Snippet 1:
92
of typical pleural **mesothelioma**. The definitive **diagnosis** of **malignant** mesotheli-
oma can still be made with the use of ancillary studies (IC, analysis of soluble bio-
markers, and p16 FISH). Some 
--------------------------------------------------------------------------------


1. `query_keywords` is a list of keywords to be highlighted.
2. The loop iterates over the first document in `retrieved_docs` (the slice `[:1]`).
3. For each document, a snippet of the first 200 characters is extracted.
4. The highlight_keywords function is called to highlight the keywords in the snippet.
5. The highlighted snippet is printed along with a separator line.
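
A possible refinement is to match keywords regardless of case in a single pass. The sketch below is one way to do that with the standard `re` module; `highlight_keywords_ci` and its `start`/`end` parameters are hypothetical names, and the default markers are the ANSI codes for bold green text.

```python
import re

def highlight_keywords_ci(
    text: str,
    keywords: list[str],
    start: str = "\033[1;32m",  # ANSI: bold green
    end: str = "\033[0m",       # ANSI: reset
) -> str:
    """Wrap every case-insensitive keyword match in start/end markers."""
    if not keywords:
        return text
    # Longest keywords first so a longer match wins over a shorter one it contains.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    pattern = re.compile("|".join(alternatives), flags=re.IGNORECASE)
    # A single substitution pass never re-scans the markers it has inserted.
    return pattern.sub(lambda m: f"{start}{m.group(0)}{end}", text)

sample = "Malignant cells suggest a DIAGNOSIS of mesothelioma."
marked = highlight_keywords_ci(
    sample, ["malignant", "mesothelioma", "diagnosis"], start="[", end="]"
)
print(marked)
```

The example passes plain brackets as markers so the result is readable anywhere; omitting `start` and `end` gives colored terminal output instead. Words split by a hyphen and a line break would still need to be rejoined before matching.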

<h1 style="color: #FF6347;">Summary</h1>

1. **Document Loading and Preprocessing**:
   - Used `PyPDFLoader` to load and split PDFs into manageable chunks.
   - Preprocessed text into embeddings for efficient similarity search.

2. **Embedding Creation**:
   - Generated embeddings using OpenAI and SentenceTransformer models.

3. **Data Storage**:
   - Stored embeddings in ChromaDB for fast and scalable retrieval.

4. **Document Retrieval**:
   - Queried ChromaDB to retrieve relevant snippets based on user input.

5. **Answer Generation**:
   - Formatted retrieved content into a structured prompt for generative AI.
   - Produced context-aware responses using GPT models.