# RAG OpenAI Ollama - Medical Q&A System

## Hardware Setup

This project runs on my development machine:

- **CPU**: Intel i7-13620H (13th Gen, 10 cores)
- **GPU**: RTX 4070 (8GB VRAM)
- **RAM**: 32GB DDR5
- **Storage**: SSD

## Why I Chose Llama3.2:1B

I selected the **Llama3.2:1B** model based on my hardware:

### Memory Optimization
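- **My RTX 4070 has 8GB VRAM** - comfortable headroom for a 1B-parameter model
- Llama3.2:1B needs ~2-3GB VRAM, which leaves room for the embedding model
- Larger models (7B+) need about 14GB for 16-bit weights alone, so unless they are quantized they would exceed VRAM and fall back to slow CPU inference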
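
The sizing argument above is back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below assumes nominal parameter counts (1B and 7B) and counts weights only, so KV cache, activations, and runtime overhead come on top and real usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB (no KV cache or overhead)."""
    return n_params * bits_per_param / 8 / 2**30

# Nominal sizes; actual checkpoints differ slightly in parameter count.
for name, n_params in [("1B", 1e9), ("7B", 7e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions a 1B model stays below 2 GiB even at 16 bits, while a 7B model only fits in 8GB of VRAM once it is quantized.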

### Performance
- **Fast inference** (~1-2 seconds per response)
- **Low latency** for interactive use
- **Works well with RAG** - retrieval pre-filters the context, so the model only has to read a few relevant chunks

### Local Benefits
- **No API costs** - runs entirely locally
- **Privacy** - all data stays on my machine
- **Always available** - no internet dependency

## System Architecture

```
PDF → Text Chunks → Embeddings → FAISS Vector DB
                                       ↓
Question → Embedding → Search → Context → LLM → Answer
```

## Required Libraries

In [None]:
import os
import faiss
import requests
import streamlit as st
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
import numpy as np
import openai
from typing import List, Dict, Any
import json

## PDF Processing

Extract text from medical documents:

In [None]:
def extract_pdf_text(pdf_path):
    """Extract text from PDF file"""
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PdfReader(file)
        for page in pdf_reader.pages:
            # Guard against pages that yield no extractable text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
    return text

# Load my medical document
pdf_text = extract_pdf_text("Good-Medical-Practice-2024---English-102607294.pdf")
print(f"Extracted {len(pdf_text)} characters from PDF")

## Text Chunking

Split the document into overlapping, fixed-size character chunks for retrieval:

In [None]:
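def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks (sizes are in characters)"""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop never terminates
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step forward, repeating `overlap` characters
    return chunks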

# Create chunks from my PDF
text_chunks = chunk_text(pdf_text)
print(f"Created {len(text_chunks)} text chunks")
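
To see where the chunk boundaries fall, it helps to work through a small case. The sketch below reproduces the same fixed-stride windowing as `chunk_text`, but returns character offsets instead of text. The helper name `chunk_spans` and the toy sizes are illustrative, not part of the project.

```python
def chunk_spans(text_len: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) character offsets of each chunk."""
    step = chunk_size - overlap  # each chunk starts `step` characters after the previous one
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [(start, min(start + chunk_size, text_len)) for start in range(0, text_len, step)]

print(chunk_spans(25, chunk_size=10, overlap=3))
# [(0, 10), (7, 17), (14, 24), (21, 25)]
```

Each chunk repeats the last 3 characters of the previous one, and the final chunk is shorter. With the defaults (1000 and 200) the stride is 800 characters, so a document of N characters yields about N / 800 chunks.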

## Generate Embeddings

Create vector representations using SentenceTransformer:

In [None]:
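# Load embedding model (published on the Hugging Face Hub under the intfloat namespace)
embedding_model = SentenceTransformer('intfloat/multilingual-e5-large')

# Generate embeddings for all chunks
print("Generating embeddings...")
# Unit-length vectors make the inner product equal to cosine similarity
chunk_embeddings = embedding_model.encode(text_chunks, normalize_embeddings=True)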
print(f"Generated {len(chunk_embeddings)} embeddings of dimension {chunk_embeddings[0].shape[0]}")
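
One detail worth knowing about the E5 model family: its model card documents that inputs should carry a `"passage: "` prefix when indexing and a `"query: "` prefix when searching. The cells in this notebook encode raw text, which still works but may retrieve less accurately. A minimal sketch of the prefixing, with hypothetical helper names:

```python
def as_passages(chunks: list[str]) -> list[str]:
    """Prefix document chunks the way E5 models expect at indexing time."""
    return [f"passage: {chunk}" for chunk in chunks]

def as_query(question: str) -> str:
    """Prefix a user question the way E5 models expect at search time."""
    return f"query: {question}"

print(as_passages(["Doctors must keep records."]))
print(as_query("What records must doctors keep?"))
```

If adopted, both sides have to change together: encode `as_passages(text_chunks)` when building the index and `as_query(question)` at query time, then rebuild the index.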

## FAISS Vector Database

Build an exact (brute-force) inner-product index for similarity search:

In [None]:
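# Create FAISS index
dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product; equals cosine similarity only for unit-length vectors

# Add embeddings to index (FAISS expects contiguous float32; normalize so scores are cosines)
vectors = np.ascontiguousarray(chunk_embeddings, dtype='float32')
faiss.normalize_L2(vectors)  # in-place L2 normalization
index.add(vectors)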
print(f"Added {index.ntotal} vectors to FAISS index")

# Save index
faiss.write_index(index, "medical_docs.index")
print("Saved FAISS index to file")
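
Why normalization matters for an inner-product index can be shown without FAISS at all. The pure-Python sketch below uses made-up 3-dimensional vectors: the raw inner product favors the longer vector, while the inner product of unit-length vectors is the cosine similarity and favors the one pointing in the same direction as the query.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))  # assumes v is not the zero vector
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 2.0, 0.0]
docs = {"long": [10.0, 0.0, 0.0], "aligned": [0.5, 1.0, 0.0]}

# Raw inner product rewards vector length, not direction.
raw = {name: dot(query, v) for name, v in docs.items()}
# After normalization the inner product is the cosine similarity.
normed = {name: dot(l2_normalize(query), l2_normalize(v)) for name, v in docs.items()}

print(max(raw, key=raw.get))        # long
print(max(normed, key=normed.get))  # aligned
```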

## OpenAI Integration

Set up GPT-4o-mini for high-quality responses:

In [None]:
def query_openai(context, question):
    """Query OpenAI with context"""
    try:
        client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a medical expert. Answer based on the provided context."},
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ],
            max_tokens=500,
            temperature=0.7
        )
        
        return response.choices[0].message.content
    except Exception as e:
        return f"OpenAI Error: {str(e)}"
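
One possible refinement, sketched here as an assumption rather than what the app does: number the retrieved chunks in the prompt so the model can point back to them, and tell it to admit when the context lacks the answer. The helper name and the prompt wording are hypothetical. The returned list has the same shape as the `messages` argument used above.

```python
def build_messages(chunks: list[str], question: str) -> list[dict[str, str]]:
    """Build chat messages with numbered context passages."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    system = (
        "Answer using only the numbered context passages. "
        "Cite passage numbers, and say so if the context does not contain the answer."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

for message in build_messages(["First passage. ", "Second passage."], "What does it say?"):
    print(message["role"], "->", message["content"])
```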

## Ollama Local LLM

Set up Llama3.2:1B for offline inference:

In [None]:
def query_ollama(context, question):
    """Query Ollama with context"""
    try:
        prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer based on the context provided:"
        
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3.2:1b",
                "prompt": prompt,
                "stream": False
            },
            timeout=120  # requests has no default timeout; 120s is an arbitrary upper bound
        )
        
        if response.status_code == 200:
            return response.json().get("response", "No response from Ollama")
        else:
            return f"Ollama Error: {response.status_code}"
    except Exception as e:
        return f"Ollama Connection Error: {str(e)}"
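
With `"stream": False` the call blocks until the whole answer is ready. When streaming is enabled, Ollama's `/api/generate` endpoint instead returns newline-delimited JSON objects, each carrying a text fragment in `response` and a `done` flag on the last one. A minimal sketch of assembling such a stream follows. The parser name is hypothetical and the sample lines are hand-written for illustration, not captured output.

```python
import json
from typing import Iterable, Iterator

def iter_ollama_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("response", "")
        if event.get("done"):
            break

# Hand-written sample in the same shape as a streamed reply
sample = [
    b'{"response": "Good", "done": false}',
    b'{"response": " practice", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_ollama_tokens(sample)))  # Good practice
```

With `requests`, the same generator could consume `response.iter_lines()` from a `requests.post(..., stream=True)` call whose JSON body sets `"stream": True`, which lets the UI show tokens as they arrive.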

## RAG Query Pipeline

Complete question-answering workflow:

In [None]:
def rag_query(question, use_openai=True, top_k=3):
    """Complete RAG pipeline"""
    
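    # 1. Encode question (unit length, so inner-product scores are cosine similarities)
    question_embedding = embedding_model.encode([question], normalize_embeddings=True)
    
    # 2. Search similar chunks
    scores, indices = index.search(question_embedding.astype('float32'), top_k)
    
    # 3. Get relevant context
    # FAISS pads results with index -1 when it holds fewer than top_k vectors;
    # without this filter, text_chunks[-1] would silently return the last chunk.
    hits = [(int(i), float(s)) for i, s in zip(indices[0], scores[0]) if i >= 0]
    relevant_chunks = [text_chunks[i] for i, _ in hits]
    context = "\n\n".join(relevant_chunks)
    
    # 4. Generate answer
    if use_openai:
        answer = query_openai(context, question)
    else:
        answer = query_ollama(context, question)
    
    return {
        "question": question,
        "answer": answer,
        "context": relevant_chunks,
        "scores": [s for _, s in hits]
    }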
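
Top-k search always returns the k nearest chunks, even when none of them is actually relevant. One optional guard, shown here as a sketch with a hypothetical helper, is to drop hits below a minimum similarity before building the context. The threshold of 0.5 and the sample data are made up; a sensible cutoff depends on the embedding model and has to be tuned on real queries.

```python
def select_context(chunks, indices, scores, min_score=0.0):
    """Keep (chunk, score) pairs for valid hits at or above min_score."""
    selected = []
    for idx, score in zip(indices, scores):
        if idx < 0:  # FAISS pads with -1 when fewer than top_k results exist
            continue
        if score >= min_score:
            selected.append((chunks[idx], float(score)))
    return selected

chunks = ["duties of a doctor", "parking information", "consent and confidentiality"]
print(select_context(chunks, indices=[0, 2, -1], scores=[0.83, 0.41, float("-inf")], min_score=0.5))
# [('duties of a doctor', 0.83)]
```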

## Test the System

Try a medical query:

In [None]:
# Test with sample question
test_question = "What are the key principles of good medical practice?"

print("Testing with Ollama (Local):")
result_ollama = rag_query(test_question, use_openai=False)
print(f"Answer: {result_ollama['answer']}")
print(f"Relevance scores: {result_ollama['scores']}")

## Web Interface

Deploy the complete system via Streamlit:

```bash
streamlit run streamlit_app.py --server.port 8878
```

**Features:**
- Choose between OpenAI or Ollama
- Upload medical documents
- Ask questions and get answers
- View source references

**My Setup Performance:**
- **RTX 4070**: Perfect for Llama3.2:1B inference
- **32GB DDR5 RAM**: Fast vector operations
- **Response Time**: 1-3 seconds
- **Memory Usage**: ~4GB RAM, ~3GB VRAM
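
Response times like the ones above can be checked with a small timing wrapper. This is a minimal sketch: `timed` is a hypothetical helper, and `fake_backend` is a stand-in so the example runs without a model.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def fake_backend(question):
    """Stand-in for a real model call; sleeps briefly to simulate latency."""
    time.sleep(0.05)
    return f"echo: {question}"

answer, seconds = timed(fake_backend, "What is good medical practice?")
print(f"{seconds:.2f}s -> {answer}")
```

In the notebook, the equivalent call would be `timed(rag_query, test_question, use_openai=False)`. Averaging over several runs gives a steadier figure, since the first request also pays for loading the model.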