<a href="https://colab.research.google.com/github/pranav120705/NMGENAI/blob/main/GenAIEX5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## WIKIPEDIA ChatBot

Let's dive deeper into the theoretical concepts behind this code.

**1. Retrieval Augmented Generation (RAG):**

* **Core Idea:** RAG addresses the limitations of Large Language Models (LLMs) by grounding their knowledge in external data sources. LLMs are trained on vast datasets, but their knowledge is static and may become outdated. RAG allows them to access and use real-time or domain-specific information.
* **Process:**
    * **Retrieval:** When a user asks a question, the system first retrieves relevant information from an external database (Wikipedia in this case).
    * **Augmentation:** The retrieved information is then provided as context to the LLM.
    * **Generation:** The LLM generates a response based on both its pre-trained knowledge and the retrieved context.
* **Benefits:**
    * Improved accuracy and relevance of responses.
    * Ability to access up-to-date information.
    * Reduced hallucination (generating false information).
    * Increased transparency (source documents are provided).
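
The three steps above can be sketched without any external service. This is a minimal illustration, not the notebook's implementation: the corpus, the word-overlap ranking, and the `generate` placeholder (which stands in for an LLM call) are all assumptions made for the example.

```python
import re

# Toy corpus standing in for an external knowledge source (hypothetical data).
CORPUS = [
    "Outer space is the expanse beyond Earth's atmosphere.",
    "FAISS is a library for similarity search over dense vectors.",
    "General relativity describes gravity as curvature of spacetime.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=1):
    """Retrieval: rank passages by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def augment(question, passages):
    """Augmentation: place the retrieved passages in the prompt as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Generation: placeholder for an LLM call; it only returns the first context line."""
    first = next(line for line in prompt.splitlines() if line.startswith("- "))
    return first[2:]

question = "What is outer space?"
passages = retrieve(question, CORPUS)
prompt = augment(question, passages)
answer = generate(prompt)
print(answer)
```

In the notebook itself, retrieval is done by embedding similarity rather than word overlap, and generation is done by a Gemini model.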

**2. Vector Embeddings and Similarity Search:**

* **Vector Embeddings:**
    * Text is transformed into numerical vectors that capture its semantic meaning.
    * Words or phrases with similar meanings have vectors that are close to each other in vector space.
    * LangChain's `GoogleGenerativeAIEmbeddings` class calls Google's `models/embedding-001` model to create these vectors.
* **Similarity Search:**
    * The user's question is also converted into a vector embedding.
    * A similarity search algorithm (implemented by FAISS) finds the text chunks in the database whose vectors are most similar to the question's vector.
    * This retrieves the chunks that are most semantically relevant to the question.
* **FAISS (Facebook AI Similarity Search):**
    * A library optimized for fast and efficient similarity search in high-dimensional vector spaces.
    * It enables the system to quickly find the most relevant text chunks from a large database.
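
To make the similarity search concrete, here is a brute-force version using cosine similarity. It is a sketch under stated assumptions: the three-dimensional vectors are invented for illustration (real embedding models produce far more dimensions), and `top_k` is a hypothetical helper, not a FAISS function. FAISS performs the same kind of nearest-neighbour lookup with optimized index structures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Score every stored vector against the query and return the k best texts."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical "embeddings": each text chunk is paired with a made-up vector.
index = [
    ("planets and stars", [0.9, 0.1, 0.0]),
    ("rocket launches", [0.7, 0.3, 0.1]),
    ("baking bread", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # pretend this is the embedding of the user's question

results = top_k(query, index)
print(results)
```

The unrelated chunk ("baking bread") points in a very different direction from the query vector, so it is ranked last and dropped.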

**3. Large Language Models (LLMs):**

* **Generative Models:** LLMs are trained to generate text that is statistically similar to the text they have been trained on.
* **Contextual Understanding:** They can understand the context of a conversation or query and generate responses that are relevant to that context.
* **Gemini 1.5 Pro:**
    * In this code, Google's Gemini 1.5 Pro model (`models/gemini-1.5-pro`) is used for text generation.
    * It takes the retrieved text from Wikipedia as context and uses it to formulate an answer.
* **Limitations:**
    * LLMs can sometimes generate incorrect or nonsensical information (hallucinations).
    * Their knowledge is limited to the data they were trained on.
    * This is where RAG helps to mitigate those limitations.

**4. LangChain:**

* **Framework for LLM Applications:** LangChain provides tools and abstractions for building applications that use LLMs.
* **Components:**
    * **Document Loaders:** Load data from various sources (e.g., Wikipedia, PDFs, web pages).
    * **Text Splitters:** Divide large documents into smaller chunks.
    * **Vector Stores:** Store and retrieve vector embeddings.
    * **Chains:** Combine multiple LLM calls and other components to perform complex tasks.
* **Abstraction:** LangChain simplifies the process of building RAG systems by providing reusable components and a consistent API.
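
The text-splitting component is the easiest one to illustrate in isolation. The sketch below shows what chunk size and chunk overlap mean using a plain sliding window. `split_with_overlap` is a hypothetical helper written for this example; LangChain's `RecursiveCharacterTextSplitter` is more careful, since it tries to break at separators such as paragraph and line boundaries rather than at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.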

**In summary:**

This code leverages the power of LLMs by combining them with a retrieval system. Vector embeddings and similarity search enable efficient retrieval of relevant information from Wikipedia. LangChain provides the framework for orchestrating the entire process. The RAG approach allows the LLM to provide more accurate and contextually relevant answers by grounding its knowledge in external data.


In [None]:
import os
import langchain
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import WikipediaLoader
import wikipedia

# Read the API key from the environment rather than hardcoding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not set. Set the GOOGLE_API_KEY environment variable before running.")

def create_wikipedia_chatbot(question):
    """
    Creates a chatbot that retrieves information from Wikipedia based on the question.

    Args:
        question (str): The user's question.

    Returns:
        RetrievalQA: A LangChain RetrievalQA chain, or None if an error occurred.
    """

    try:
        # Use the question itself as the search query
        try:
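            # Use the top Wikipedia search result as the page title.
            results = wikipedia.search(question)
            if not results:
                print(f"No Wikipedia pages found for '{question}'.")
                return None
            best_guess = results[0]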
            loader = WikipediaLoader(query=best_guess, load_max_docs=2)
            documents = loader.load()
        except Exception as e:
            print(f"Error loading Wikipedia data: {e}")
            return None

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        texts = text_splitter.split_documents(documents)

        embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)
        db = FAISS.from_documents(texts, embeddings)
        retriever = db.as_retriever()
        llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)
        qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

        return qa_chain

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def ask_question(question):
    """
    Asks a question to the chatbot and prints the answer.

    Args:
        question (str): The question to ask.
    """
    qa_chain = create_wikipedia_chatbot(question)
    if qa_chain:
        try:
            # invoke() replaces the deprecated direct call qa_chain({...})
            result = qa_chain.invoke({"query": question})
            print("Question:", question)
            print("Answer:", result["result"])
            print("\nSource Documents:")
            for doc in result["source_documents"]:
                print(f"  - {doc.metadata['title']}")
        except Exception as e:
            print(f"Error asking question: {e}")
    else:
        print("Chatbot not initialized.")

# Example usage
if __name__ == "__main__":
    while True:
        user_question = input("Ask a question (or type 'exit'): ")
        if user_question.lower() == "exit":
            break
        ask_question(user_question)

Ask a question (or type 'exit'): what is space?
Question: what is space?
Answer: Space, or outer space, is the expanse beyond Earth's atmosphere and between celestial bodies.  It's a near-perfect vacuum containing very low densities of particles, mostly hydrogen and helium plasma. It also contains electromagnetic radiation, cosmic rays, neutrinos, magnetic fields, and dust.  The baseline temperature is extremely cold, at 2.7 kelvins (−270 °C; −455 °F), set by background radiation from the Big Bang.

Source Documents:
  - Outer space
  - Outer space
  - Outer space
  - Outer space
Ask a question (or type 'exit'): exit



Ask a question (or type 'exit'): hello 

Question: hello 
Answer: Hello there! How can I help you today?

Source Documents:
  - Artificial general intelligence
  - Artificial intelligence
  - Artificial intelligence
  - Artificial general intelligence
Ask a question (or type 'exit'): sssssssss
Question: sssssssss
Answer: I'm not sure what you're asking. Can you please rephrase your question?

Source Documents:
  - Artificial intelligence
  - Artificial general intelligence
  - Artificial intelligence
  - Artificial general intelligence
Ask a question (or type 'exit'): why are you gay ??
Question: why are you gay ??
Answer: I'm not a person, so the concept of sexual orientation doesn't apply to me. I'm an AI, a computer program designed to provide information and complete tasks based on the data I was trained on.

Source Documents:
  - Artificial intelligence
  - Artificial general intelligence
  - Artificial general intelligence
  - Artificial intelligence
Ask a question (or type 'exit'): space
Question: space
Answer: This do

In [None]:
!pip install langchain-google-genai # Install the required package, langchain-google-genai


Collecting langchain-google-genai
  Downloading langchain_google_genai-2.1.2-py3-none-any.whl.metadata (4.7 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain-google-genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting google-ai-generativelanguage<0.7.0,>=0.6.16 (from langchain-google-genai)
  Downloading google_ai_generativelanguage-0.6.17-py3-none-any.whl.metadata (9.8 kB)
Downloading langchain_google_genai-2.1.2-py3-none-any.whl (42 kB)
Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Downloading google_ai_generativelanguage-0.6.17-py3-none-any.whl (1.4 MB)
Installing collected packages: filetype, google-ai-generativelanguage, langchain-google-genai
  Attempting uninstall: google-ai-generativelangu

In [None]:
!pip install langchain-community # Install the missing langchain-community package.


Collecting langchain-community
  Downloading langchain_community-0.3.20-py3-none-any.whl.metadata (2.4 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain-community)
  Downloading aiohttp-3.11.16-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting aiohappyeyeballs>=2.3.0 (from aiohttp<4.0.0,>=3.8.3->langchain-community)
  Downloading aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)
Collecting aiosignal>=1.1.2 (from aiohttp<4.0.0,>=3.8.3->langchain-community)
  Downloading aiosignal-1.3.2-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting fro

Ask a question (or type 'exit'): what is space?
No Wikipedia pages found for 'what'.
Chatbot not initialized.
Ask a question (or type 'exit'): space
Question: space
Answer: Space is a three-dimensional continuum containing positions and directions.  Classical physics often views it in three linear dimensions, while modern physics considers it, along with time, as part of a four-dimensional continuum called spacetime.  It's fundamental to understanding the physical universe, yet philosophers debate whether it's an entity, a relationship between entities, or part of a conceptual framework.  Historically, it's been conceived as both absolute (existing independently of matter) and relational (defined by relationships between objects).  Non-Euclidean geometries, where space is curved, have been developed, and Einstein's general relativity suggests space is indeed non-Euclidean around gravitational fields.

Source Documents:
  - Space
  - Space
  - Space
  - Space
Ask a question (or type 'ex


Question: exit 
Answer: While the play's French title, *Huis clos*, literally translates to "closed door," its more common English title is *No Exit*.  The play depicts three characters trapped in a room together for eternity, unable to leave.  Although at one point the door inexplicably opens, Garcin finds he cannot bring himself to exit. He realizes he is trapped not by a physical barrier, but by the presence and judgments of the other two characters.

Source Documents:
  - No Exit
  - No Exit
  - No Exit
  - No Exit
Ask a question (or type 'exit'): ecit
Question: ecit
Answer: ECIT (The Institute of Electronics, Communications and Information Technology) is located at Queen's University Belfast.  It focuses on research in areas such as cyber security, wireless communications, data science, and scalable computing.  It houses three research centers:

* **CSIT (Centre for Secure Information Technologies):** The UK's largest university cyber security research lab.
* **CWI (Centre for Wir

In [None]:
import os
import langchain
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import WikipediaLoader
import wikipedia

# Read the API key from the environment rather than hardcoding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not set. Set the GOOGLE_API_KEY environment variable before running.")

def create_wikipedia_chatbot(question):
    """
    Creates a chatbot that retrieves information from Wikipedia based on the question.

    Args:
        question (str): The user's question.

    Returns:
        RetrievalQA: A LangChain RetrievalQA chain, or None if an error occurred.
    """

    try:
        # Use the question itself as the search query
        try:
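            # Use the top Wikipedia search result as the page title.
            results = wikipedia.search(question)
            if not results:
                print(f"No Wikipedia pages found for '{question}'.")
                return None
            best_guess = results[0]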
            loader = WikipediaLoader(query=best_guess, load_max_docs=2)
            documents = loader.load()
        except Exception as e:
            print(f"Error loading Wikipedia data: {e}")
            return None

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        texts = text_splitter.split_documents(documents)

        embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)
        db = FAISS.from_documents(texts, embeddings)
        retriever = db.as_retriever()
        llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)
        qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

        return qa_chain

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def ask_question(question):
    """
    Asks a question to the chatbot and prints the answer.

    Args:
        question (str): The question to ask.
    """
    qa_chain = create_wikipedia_chatbot(question)
    if qa_chain:
        try:
            # invoke() replaces the deprecated direct call qa_chain({...})
            result = qa_chain.invoke({"query": question})
            print("Question:", question)
            print("Answer:", result["result"])
            print("\nSource Documents:")
            for doc in result["source_documents"]:
                print(f"  - {doc.metadata['title']}")
        except Exception as e:
            print(f"Error asking question: {e}")
    else:
        print("Chatbot not initialized.")

# Example usage
if __name__ == "__main__":
    while True:
        user_question = input("Ask a question (or type 'exit'): ")
        if user_question.lower() == "exit":
            break
        ask_question(user_question)

Ask a question (or type 'exit'): what is relativity theory?
Question: what is relativity theory?
Answer: Relativity theory usually refers to two interrelated theories by Albert Einstein: special relativity and general relativity.  Special relativity, published in 1905, describes the relationship between space and time in the absence of gravity. General relativity, published in 1915, explains gravity as a geometric property of spacetime, expanding on special relativity and refining Newton's law of universal gravitation.

Source Documents:
  - Theory of relativity
  - Theory of relativity
  - Theory of relativity
  - General relativity
Ask a question (or type 'exit'): special relativity 
Question: special relativity 
Answer: Special relativity, a theory developed by Albert Einstein, describes the relationship between space and time. It's based on two postulates:

1. **The laws of physics are the same for all observers in uniform motion** (i.e., no acceleration).  This is an extension of 

In [None]:
!pip install langchain langchain-google-genai pypdf faiss-cpu


Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0


## ChatBot using document

Let's break down the theory behind this code:

**1. Core Concepts:**

* **Retrieval Augmented Generation (RAG):**
    * This code implements a basic RAG system. RAG is a technique where a language model's responses are enhanced by retrieving relevant information from an external knowledge source (in this case, a PDF document).
    * The system first retrieves relevant chunks of text from the PDF based on the user's query and then uses a large language model (LLM) to generate an answer based on the retrieved information.

* **LangChain:**
    * LangChain is a framework that simplifies the development of applications powered by LLMs. It provides tools and abstractions for tasks like:
        * Loading documents (e.g., PDFs).
        * Splitting text into chunks.
        * Creating vector embeddings.
        * Storing and retrieving information from vector databases.
        * Chaining together LLM calls.

* **Vector Embeddings:**
    * Vector embeddings are numerical representations of text that capture the semantic meaning of the text.
    * The `GoogleGenerativeAIEmbeddings` class, configured with Google's `models/embedding-001` model, creates these embeddings.
    * By comparing the embeddings of the user's question with the embeddings of the PDF's text chunks, the system can identify the most relevant chunks.

* **FAISS (Facebook AI Similarity Search):**
    * FAISS is a library for efficient similarity search and clustering of dense vectors.
    * It's used as a vector database to store the embeddings of the PDF's text chunks.
    * FAISS allows for fast retrieval of the most similar text chunks to the user's question.

* **Large Language Model (LLM):**
    * The `ChatGoogleGenerativeAI` model (specifically, "models/gemini-1.5-pro") is used to generate the final answer to the user's question.
    * The LLM takes the retrieved text chunks from the PDF as context and generates a coherent and informative response.

**2. Code Breakdown:**

* **Importing Libraries:**
    * The code imports necessary libraries from `langchain`, `langchain_google_genai`, and `langchain_community`.

* **Setting API Key:**
    * The `GOOGLE_API_KEY` is set. This is essential for accessing the Google Generative AI models.

* **`create_pdf_chatbot` Function:**
    * **Document Loading:** `PyPDFLoader` loads the PDF document and extracts its text content.
    * **Text Splitting:** `RecursiveCharacterTextSplitter` divides the text into chunks of up to 1,000 characters with a 200-character overlap, so that retrieval can return focused passages rather than whole pages.
    * **Embedding Creation:** `GoogleGenerativeAIEmbeddings` generates vector embeddings for each text chunk.
    * **Vector Database Creation:** `FAISS.from_documents` creates a FAISS vector database from the embeddings.
    * **Retriever Creation:** `db.as_retriever()` creates a retriever object that can be used to retrieve relevant text chunks from the database.
    * **LLM Initialization:** `ChatGoogleGenerativeAI` initializes the LLM.
    * **QA Chain Creation:** `RetrievalQA.from_chain_type` creates a retrieval-based question-answering chain that combines the retriever and the LLM.
    * **Error Handling:** The `try...except` block catches any exception raised along the way, prints it, and returns `None` instead of a chain.

* **`ask_question_pdf` Function:**
    * This function takes the PDF path and the user's question as input.
    * It calls `create_pdf_chatbot` to initialize the QA chain.
    * It then uses the QA chain to answer the question and prints the answer along with the source page numbers from the PDF.
    * It also contains error handling.

* **Example Usage:**
    * The `if __name__ == "__main__":` block executes when the script is run.
    * It prompts the user to enter the path to their PDF file.
    * It then enters a loop that prompts the user to ask questions.
    * The `ask_question_pdf` function is called to process each question.
    * Before entering the loop, it checks that the PDF file exists and prints an error if it does not.

**In essence, the code automates the process of:**

1.  Reading a PDF.
2.  Preparing the PDF's content for efficient search.
3.  Finding relevant information within the PDF based on a question.
4.  Using an LLM to formulate a human-readable answer.
5.  Providing the answer and the source pages from the PDF.
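
The last step works because every chunk keeps a record of the page it came from. The sketch below shows that idea with plain Python objects. It is an illustration under assumptions: `Chunk`, `chunk_pages`, and `source_pages` are hypothetical names, the page texts are invented, and the substring filter stands in for the real similarity search.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A piece of text plus metadata describing where it came from."""
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_pages(pages, chunk_size):
    """Split each page into fixed-size chunks, tagging every chunk with its page index."""
    chunks = []
    for page_number, page_text in enumerate(pages):
        for start in range(0, len(page_text), chunk_size):
            piece = page_text[start:start + chunk_size]
            chunks.append(Chunk(piece, {"page": page_number}))
    return chunks

def source_pages(retrieved):
    """Return the distinct pages behind the retrieved chunks, in first-seen order."""
    seen = []
    for chunk in retrieved:
        page = chunk.metadata["page"]
        if page not in seen:
            seen.append(page)
    return seen

# Invented two-page document.
pages = ["AI marketing uses data.", "Budgets and planning."]
chunks = chunk_pages(pages, chunk_size=12)

# Stand-in for retrieval: keep chunks that mention either keyword.
retrieved = [c for c in chunks if "marketing" in c.text or "data" in c.text]
print(source_pages(retrieved))
```

Unlike the notebook's loop, which prints one line per retrieved chunk, this sketch removes repeated page numbers before reporting them.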


In [None]:
import os
import langchain
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader

# Read the API key from the environment rather than hardcoding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not set. Set the GOOGLE_API_KEY environment variable before running.")

def create_pdf_chatbot(pdf_path, question):
    """
    Creates a chatbot that retrieves information from a PDF document based on the question.

    Args:
        pdf_path (str): The path to the PDF document.
        question (str): The user's question.

    Returns:
        RetrievalQA: A LangChain RetrievalQA chain, or None if an error occurred.
    """

    try:
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        texts = text_splitter.split_documents(documents)

        embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)
        db = FAISS.from_documents(texts, embeddings)
        retriever = db.as_retriever()
        llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)
        qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

        return qa_chain

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def ask_question_pdf(pdf_path, question):
    """
    Asks a question to the chatbot based on a PDF document and prints the answer.

    Args:
        pdf_path (str): The path to the PDF document.
        question (str): The question to ask.
    """
    qa_chain = create_pdf_chatbot(pdf_path, question)
    if qa_chain:
        try:
            # invoke() replaces the deprecated direct call qa_chain({...})
            result = qa_chain.invoke({"query": question})
            print("Question:", question)
            print("Answer:", result["result"])
            print("\nSource Documents:")
            for doc in result["source_documents"]:
                print(f"  - Page {doc.metadata['page']}")
        except Exception as e:
            print(f"Error asking question: {e}")
    else:
        print("Chatbot not initialized.")

# Example usage
if __name__ == "__main__":
    pdf_file_path = input("Enter your_document.pdf: ")  # Replace with the path to your PDF file.
    if not os.path.exists(pdf_file_path):
        print(f"Error: PDF file '{pdf_file_path}' not found.")
    else:

        while True:
            user_question = input("Ask a question (or type 'exit'): ")
            if user_question.lower() == "exit":
                break
            ask_question_pdf(pdf_file_path, user_question)

Enter your_document.pdf: AI for marketing org.pdf
Ask a question (or type 'exit'): what is AI marketing?
Question: what is AI marketing?
Answer: AI marketing uses artificial intelligence to improve the customer journey, leading to better individual targeted campaigns and ROI.  It allows marketers to analyze data, predict trends, create more innovative and targeted advertisements, and automate tasks such as programmatic media bidding.  AI marketing also helps understand customer needs and expectations through integrated applications that track purchases and provide personalized marketing messages with suggestions and special offers.  This data-driven approach improves decision-making and provides a competitive advantage through system automation.

Source Documents:
  - Page 8
  - Page 2
  - Page 4
  - Page 3


KeyboardInterrupt: Interrupted by user

In [None]:
!pip install langchain langchain-google-genai faiss-cpu pubmed_parser

Collecting pubmed_parser
  Downloading pubmed_parser-0.5.1-py3-none-any.whl.metadata (17 kB)
Collecting lxml (from pubmed_parser)
  Downloading lxml-5.3.1-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.7 kB)
Collecting unidecode (from pubmed_parser)
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading pubmed_parser-0.5.1-py3-none-any.whl (56.9 MB)
Downloading lxml-5.3.1-cp311-cp311-manylinux_2_28_x86_64.whl (5.0 MB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode, lxml, pubmed_parser
Successfully installed lxml-5.3.1 pubmed_parser-0.5.1 unidecod

In [None]:
!pip install xmltodict

Collecting xmltodict
  Downloading xmltodict-0.14.2-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading xmltodict-0.14.2-py2.py3-none-any.whl (10.0 kB)
Installing collected packages: xmltodict
Successfully installed xmltodict-0.14.2


Let's break down the theoretical concepts and the workflow of this code, which creates a health-related question-answering system using PubMed and Google's Gemini models.

**1. Core Idea: Retrieval Augmented Generation (RAG) for Healthcare Information**

* This code implements a RAG system tailored for healthcare queries.
* It leverages PubMed, a vast repository of biomedical literature, as the external knowledge source.
* The system aims to provide accurate and contextually relevant answers to health-related questions by retrieving information from PubMed articles and using a Large Language Model (LLM) to generate responses.

**2. Key Components and Their Roles:**

* **PubMedLoader:**
    * This LangChain document loader retrieves relevant articles from PubMed based on the user's query.
    * It acts as the retrieval mechanism, fetching biomedical information.
* **RecursiveCharacterTextSplitter:**
    * This text splitter breaks down the retrieved PubMed articles into smaller, manageable chunks.
    * This is crucial because LLMs have limitations on the amount of text they can process at once.
* **GoogleGenerativeAIEmbeddings (embedding-001):**
    * This component generates vector embeddings for the text chunks from PubMed articles and the user's query.
    * Vector embeddings are numerical representations of text that capture semantic meaning.
    * `embedding-001` is the name of the model that creates those embeddings.
* **FAISS (Facebook AI Similarity Search):**
    * FAISS is a similarity-search library; here it serves as the vector store that holds the embeddings of the PubMed text chunks.
    * It efficiently performs similarity searches, finding the text chunks that are most relevant to the user's query.
* **ChatGoogleGenerativeAI (gemini-1.5-pro):**
    * This is the LLM that generates the final answer to the user's question.
    * It takes the retrieved PubMed text chunks as context and formulates a coherent response.
* **RetrievalQA:**
    * This LangChain chain orchestrates the entire process, combining the retriever (FAISS and embeddings) and the LLM.
    * It handles the retrieval of relevant information and the generation of the answer.

**3. Workflow:**

1.  **User Input:**
    * The user enters a health-related question.
2.  **PubMed Retrieval:**
    * The `PubMedLoader` uses the user's question as a search query to retrieve relevant articles from PubMed.
3.  **Text Processing:**
    * The retrieved articles are split into text chunks using `RecursiveCharacterTextSplitter`.
    * Vector embeddings are generated for the text chunks using `GoogleGenerativeAIEmbeddings`.
4.  **Similarity Search:**
    * The user's question is also converted into a vector embedding.
    * FAISS performs a similarity search to find the PubMed text chunks that are most semantically similar to the question.
5.  **Answer Generation:**
    * The `RetrievalQA` chain provides the relevant text chunks as context to the `ChatGoogleGenerativeAI` model.
    * The LLM generates a response based on the retrieved information.
6.  **Output:**
    * The code prints the user's question and the generated answer.
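
Step 5 uses the "stuff" chain type, meaning the retrieved chunks are placed together into a single prompt. The sketch below shows that assembly step only. It rests on assumptions: `stuff_prompt`, the character budget, and the prompt wording are all hypothetical and are not LangChain's actual template.

```python
def stuff_prompt(question, chunks, max_chars=2000):
    """Concatenate retrieved chunks into one context block, stopping at a character budget."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # adding this chunk would exceed the budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Invented chunks standing in for retrieved PubMed text.
retrieved_chunks = [
    "Fever is a rise in body temperature.",
    "Hydration is recommended.",
]
full_prompt = stuff_prompt("What is a fever?", retrieved_chunks)
short_prompt = stuff_prompt("What is a fever?", retrieved_chunks, max_chars=40)
print(full_prompt)
```

Stuffing is the simplest strategy and works when the retrieved chunks fit in the model's context window. The budget check here is one way to guard against overflow.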

**4. Theoretical Concepts:**

* **Vector Space Models:**
    * The use of vector embeddings and FAISS relies on the concept of vector space models, where text is represented as points in a high-dimensional space.
    * Similarity between texts is determined by the distance between their corresponding vectors.
* **Semantic Search:**
    * The system performs semantic search, which aims to find information based on its meaning rather than just keyword matching.
    * This allows the system to retrieve more relevant results, even if the user's question doesn't contain the exact keywords found in the PubMed articles.
* **Large Language Models (LLMs):**
    * LLMs are powerful generative models that can understand and generate human-like text.
    * In this context, the LLM is used to synthesize the retrieved information and generate a concise and informative answer.
* **LangChain as an Orchestration Framework:**
    * LangChain connects models, document loaders, and vector stores through a common interface, so multi-step tasks like this one can be composed from reusable parts.
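
The notion of "distance between vectors" can be written down explicitly. Two measures are commonly used for embeddings $\mathbf{a}$ and $\mathbf{b}$: cosine similarity and Euclidean distance. Which one a given vector store uses depends on its configuration, so treat the formulas as the general idea rather than a statement about this notebook's settings.

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
d(\mathbf{a}, \mathbf{b}) = \lVert \mathbf{a} - \mathbf{b} \rVert_2 = \sqrt{\sum_i (a_i - b_i)^2}
$$

When both vectors have unit length, the two measures give the same ranking, because

$$
d(\mathbf{a}, \mathbf{b})^2 = \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2\, \mathbf{a} \cdot \mathbf{b} = 2 - 2\cos(\mathbf{a}, \mathbf{b}).
$$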

**In essence:**

This code builds a question-answering system that combines PubMed's biomedical literature with the language-generation capabilities of an LLM. It is a practical application of RAG in the healthcare domain.


In [None]:
import os
import langchain
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PubMedLoader  # Changed to PubMedLoader

# Read the API key from the environment rather than hardcoding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")

if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY not set. Set the GOOGLE_API_KEY environment variable before running.")

def create_pubmed_chatbot(query):  # Changed to accept a query instead of pdf_path
    """
    Creates a chatbot that retrieves information from PubMed based on the question.

    Args:
        query (str): The search query for PubMed.

    Returns:
        RetrievalQA: A LangChain RetrievalQA chain, or None if an error occurred.
    """

    try:
        # Fetch up to 5 PubMed records matching the query.
        loader = PubMedLoader(query=query, load_max_docs=5)
        documents = loader.load()

        # Split the records into overlapping chunks so that context is not lost at chunk boundaries.
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        texts = text_splitter.split_documents(documents)

        # Embed the chunks and index them in an in-memory FAISS vector store.
        embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY)
        db = FAISS.from_documents(texts, embeddings)
        retriever = db.as_retriever()

        # The "stuff" chain type places all retrieved chunks into a single prompt for the LLM.
        llm = ChatGoogleGenerativeAI(model="models/gemini-1.5-pro", google_api_key=GOOGLE_API_KEY)
        qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

        return qa_chain

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def ask_question_pubmed(query):
    """
    Asks a question to the chatbot based on PubMed and prints the answer.

    Args:
        query (str): The search query for PubMed.
    """
    qa_chain = create_pubmed_chatbot(query)
    if qa_chain:
        try:
            # Calling the chain directly is deprecated in recent LangChain releases; use invoke().
            result = qa_chain.invoke({"query": query})
            print("Question:", query)
            print("Answer:", result["result"])
        except Exception as e:
            print(f"Error asking question: {e}")
    else:
        print("Chatbot not initialized.")

# Example usage
if __name__ == "__main__":
    while True:
        user_question = input("Enter your health-related question (or type 'exit'): ")
        if user_question.lower() == "exit":
            break
        ask_question_pubmed(user_question)

Enter your health-related question (or type 'exit'): im feeling like fever
Too Many Requests, waiting for 0.20 seconds...
Too Many Requests, waiting for 0.40 seconds...
Question: im feeling like fever
Answer: This model is not able to give medical advice.  A medical expert, like a doctor, will be able to help you understand why you are feeling feverish and recommend treatment.
Enter your health-related question (or type 'exit'): why it is occuring?
Too Many Requests, waiting for 0.20 seconds...
Too Many Requests, waiting for 0.40 seconds...
Question: why it is occuring?
Answer: This question is too broad.  I need more context. What is "it"? Please clarify what you're asking about.
Enter your health-related question (or type 'exit'): why fever is occuring?
Too Many Requests, waiting for 0.20 seconds...
Too Many Requests, waiting for 0.40 seconds...
Question: why fever is occuring?
Answer: This document discusses two separate cases.  In the first case, fever is mentioned as a possible sy
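
The chain is created with `return_source_documents=True`, but only the answer is printed. As a rough sketch, the retrieved sources could be listed with a helper like the one below. `StubDocument` is a hypothetical stand-in for a LangChain document so the example runs on its own, and the metadata keys `Title` and `uid` are an assumption about what `PubMedLoader` attaches; check `doc.metadata` on a real result before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(result):
    """Build one line per distinct source article found in a RetrievalQA-style result dict."""
    lines = []
    seen = set()
    for doc in result.get("source_documents", []):
        # Assumed metadata keys; fall back to placeholders if they are missing.
        title = doc.metadata.get("Title", "Untitled")
        uid = doc.metadata.get("uid", "unknown id")
        if (title, uid) in seen:
            continue  # several chunks can come from the same article
        seen.add((title, uid))
        lines.append(f"- {title} (PubMed ID: {uid})")
    return lines


# Made-up result for illustration; a real one would come from qa_chain.invoke(...).
example_result = {
    "result": "Example answer.",
    "source_documents": [
        StubDocument("chunk one", {"Title": "Example article A", "uid": "00000001"}),
        StubDocument("chunk two", {"Title": "Example article A", "uid": "00000001"}),
        StubDocument("chunk three", {"Title": "Example article B", "uid": "00000002"}),
    ],
}

source_lines = format_sources(example_result)
for line in source_lines:
    print(line)
```

In `ask_question_pubmed`, the same helper could be called on `result` right after the answer is printed.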

In [None]:
!pip install langchain langchain-google-genai faiss-cpu clip torch whisper pillow



In [None]:
!pip install langchain-community


Collecting langchain-community
  Downloading langchain_community-0.3.20-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [None]:
!pip install unstructured[audio]

Collecting unstructured[audio]
  Downloading unstructured-0.17.2-py3-none-any.whl.metadata (24 kB)
Collecting python-magic (from unstructured[audio])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured[audio])
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured[audio])
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured[audio])
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 44.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting rapidfuzz (from unstructured[audio])
  Downloading rapidfuzz-3.12.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)