# Medical Chatbot Research Notebook

This notebook demonstrates the RAG (Retrieval-Augmented Generation) pipeline for the Medical Chatbot. 
It covers data loading, text splitting, embedding generation, vector store creation (Pinecone), and the QA chain setup.

## 1. Environment Setup & Imports

Load the project's environment variables and import the required libraries.

In [None]:
import os
from dotenv import load_dotenv
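# Loaders and embeddings come from langchain-community; the chat model from langchain-openai.
# The older `langchain.*` import paths for these classes are deprecated.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage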
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Load environment variables
load_dotenv()

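PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

# Fail early with a clear message: assigning None to os.environ raises a TypeError.
if not PINECONE_API_KEY or not OPENROUTER_API_KEY:
    raise RuntimeError("Set PINECONE_API_KEY and OPENROUTER_API_KEY in your .env file.")

os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY
os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY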

## 2. Data Loading

We load PDF documents from the `data` directory. The path is relative to the current working directory, so run the notebook from the project root or adjust the path.

In [None]:
# Ensure we are pointing to the correct data directory.
# If running from 'research' folder, data might be in '../data'
# os.chdir("../") # Uncomment if needed to change to root directory

def load_pdf_file(data_path):
    """Load every PDF in `data_path`; PyPDFLoader yields one document per page."""
    loader = DirectoryLoader(
        data_path,
        glob="*.pdf",
        loader_cls=PyPDFLoader
    )
    documents = loader.load()
    return documents

# "data" is resolved relative to the current working directory
extracted_documents = load_pdf_file("data")
print(f"Loaded {len(extracted_documents)} pages/documents.")
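
If the notebook is launched from a subfolder such as `research`, the relative path `"data"` will not resolve. Below is a minimal sketch of one way to handle this, assuming the PDFs live in a `data` folder in either the working directory or its parent. `find_data_dir` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_data_dir(name="data", start=None):
    """Return the first directory called `name` found in `start` or its parent.

    Hypothetical helper: it checks only the starting directory and its parent.
    """
    start = Path(start) if start is not None else Path.cwd()
    for base in (start, start.parent):
        candidate = base / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No '{name}' directory found in {start} or its parent")
```

It could then be used as `load_pdf_file(str(find_data_dir()))` instead of hard-coding the path.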

## 3. Text Splitting

Split the loaded documents into smaller chunks for efficient embedding and retrieval.

In [None]:
def text_split(docs):
    """Split documents into chunks of at most 500 characters with a 20-character overlap."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=20,
    )
    texts_chunk = text_splitter.split_documents(docs)
    return texts_chunk

texts_chunk = text_split(extracted_documents)
print(f"Created {len(texts_chunk)} chunks.")
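
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks). `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.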

## 4. Embeddings

Initialize HuggingFace embeddings (`all-MiniLM-L6-v2`) to convert text into 384-dimensional vectors.

In [None]:
def download_embeddings():
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    return embeddings

embedding = download_embeddings()

## 5. Vector Store (Pinecone)

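Set up the Pinecone index and store the document embeddings.
**Note:** The index is created only if it does not already exist. To reuse an already populated index instead of uploading again, comment out the upload and uncomment the `from_existing_index` call.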

In [None]:
index_name = "medical-chatbot"
pc = Pinecone(api_key=PINECONE_API_KEY)

# Create the index if it does not exist yet; dimension=384 matches the all-MiniLM-L6-v2 vector size
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Upload NEW documents:
docsearch = PineconeVectorStore.from_documents(
    documents=texts_chunk,
    embedding=embedding,
    index_name=index_name
)

# Connect to the existing index
# docsearch = PineconeVectorStore.from_existing_index(
#     index_name=index_name,
#     embedding=embedding,
# )
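
The index `dimension` must equal the length of the vectors the embedding model produces, which is why 384 is used above for `all-MiniLM-L6-v2`. Below is a minimal sketch of a guard, assuming the embedding object exposes an `embed_query` method (as LangChain embedding classes do). `check_embedding_dimension` is a hypothetical helper.

```python
def check_embedding_dimension(embedder, expected_dim):
    """Embed a short probe string and compare the vector length to the index dimension."""
    vector = embedder.embed_query("dimension probe")
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding dimension {len(vector)} does not match index dimension {expected_dim}"
        )
    return len(vector)
```

Calling `check_embedding_dimension(embedding, 384)` before creating the index would surface a model/index mismatch before any upload is attempted.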



## 6. Retrieval & LLM Setup

Set up the retriever and the chat model, accessed through OpenRouter's OpenAI-compatible API.

In [None]:
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Test Retriever
retrieved_doc = retriever.invoke("What is Acne?")
# print(retrieved_doc)
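
Conceptually, a similarity retriever with `k=3` scores stored vectors against the query vector and returns the three best matches. The sketch below illustrates that idea with cosine similarity on toy 2-dimensional vectors. It is a brute-force illustration, not how Pinecone searches internally, and `top_k` and the sample data are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the ids of the k vectors most similar to `query`, best first."""
    ranked = sorted(
        vectors,
        key=lambda item: cosine_similarity(query, vectors[item]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real 384-dimensional embeddings
toy_vectors = {
    "acne": [1.0, 0.1],
    "eczema": [0.9, 0.3],
    "fracture": [0.0, 1.0],
    "rash": [0.7, 0.7],
}
print(top_k([1.0, 0.0], toy_vectors, k=3))  # ['acne', 'eczema', 'rash']
```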

In [None]:
chat = ChatOpenAI(
    model_name="openai/gpt-4o-mini", 
    openai_api_base="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY
)

# Simple test: invoke() replaces the deprecated direct call chat(msg)
msg = [HumanMessage(content="Say hi")]
resp = chat.invoke(msg)
print(resp.content)

## 7. RAG Chain

Combine the retriever and the LLM into a question-answering chain.

In [None]:
system_prompt = (
    "You are a medical assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(chat, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
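
The "stuff" strategy places all retrieved chunks into the `{context}` slot of the prompt. Below is a rough sketch of that step with plain strings, assuming chunks are joined by blank lines. The real chain operates on `Document` objects and message templates, and `build_prompt` is a hypothetical helper used only for illustration.

```python
def build_prompt(system_template, chunks, question):
    """Fill the {context} slot with the retrieved chunks and pair it with the user question."""
    context = "\n\n".join(chunks)  # assumption: chunks are separated by a blank line
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


messages = build_prompt(
    "Answer using only this context:\n\n{context}",
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is Acne?",
)
for role, content in messages:
    print(f"{role}: {content}")
```

Because every chunk is placed in a single prompt, the number of retrieved chunks and their size together determine how much of the model's context window the prompt consumes.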

## 8. Testing the Pipeline

In [None]:
response = rag_chain.invoke({"input": "What are acromegaly and gigantism?"})
print("Answer:", response["answer"])

In [None]:
response = rag_chain.invoke({"input": "What is Acne?"})
print("Answer:", response["answer"])
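
Besides `answer`, the dictionary returned by a chain built with `create_retrieval_chain` also carries the retrieved documents under `context`. Below is a small sketch for listing where an answer came from, assuming each document has a `metadata` dict with `source` and `page` keys (as `PyPDFLoader` typically provides). `format_sources` is a hypothetical helper.

```python
def format_sources(response):
    """Return one 'source (page N)' line per retrieved document, without duplicates."""
    lines = []
    for doc in response.get("context", []):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        line = f"{source} (page {page})"
        if line not in lines:
            lines.append(line)
    return lines
```

For example, `print("\n".join(format_sources(response)))` would print the source lines for the last query, which makes it easier to check an answer against the underlying PDF.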