# Task 3: AI Diagnostic Assistant

## Setup Instructions

Before running this notebook, run the setup script to install the dependencies and create the required directories:

```bash
bash scripts/setup.sh
```

Alternatively, manually install dependencies with `pip install -r requirements.txt` and create the `outputs/models/` and `outputs/vectorstore/` directories. For detailed setup instructions, refer to the **Setup** section in `docs/task3_implementation_plan.md`.
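
For the manual route, the directory step can be done from Python. This is a minimal sketch, assuming it is run with the project root as `root`; the helper name `ensure_output_dirs` is hypothetical and not part of the project's scripts.

```python
from pathlib import Path

# Directories named in the setup instructions above, relative to the project root.
REQUIRED_DIRS = ("outputs/models", "outputs/vectorstore")

def ensure_output_dirs(root="."):
    """Create the required output directories under `root` if missing and return them."""
    created = []
    for rel in REQUIRED_DIRS:
        path = Path(root) / rel
        # parents=True builds `outputs/` as needed; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_output_dirs()` from the project root is safe to repeat, since existing directories are left untouched.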

## Objective
The goal of this task is to build an AI-powered diagnostic assistant that helps users understand their symptoms and explore patterns in breast cancer data. The assistant integrates two tools:
- **Tool 1**: Symptom Checker. Uses RAG (Retrieval-Augmented Generation) with ChromaDB to return disease information and precautions for the symptoms a user describes.
- **Tool 2**: Cancer Analysis. Analyzes breast cancer data patterns using insights from sequential pattern mining.

## Overview
This notebook implements:
1. ML Model Training & Vocabulary Export (Phase 1)
2. Knowledge Base Setup with ChromaDB (Phase 2)
3. Demonstration of the ProjectAssistant (Phase 4)


## Phase 1: ML Model Training & Vocabulary Export


### 1.1 Setup and Environment


In [3]:
import sys
from pathlib import Path
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import joblib
import json

# Add project root to Python path
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

print("Libraries imported successfully.")


Libraries imported successfully.


### 1.2 Data Loading and Preprocessing


In [None]:
# Load and preprocess the dataset
DATA_PATH = project_root / 'data' / 'dataset.csv'

# Load the dataset
df = pd.read_csv(DATA_PATH)

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nColumn names:")
print(df.columns.tolist())

# Combine symptom columns into a single text field
symptom_cols = [f'Symptom_{i}' for i in range(1, 18)]

# Define function to clean string values only (treat non-strings as None)
import re

def clean_symptom_value(value):
    """Trim a symptom string and remove whitespace around underscores; return None for non-strings or empty results."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip()
    # Remove whitespace on either side of underscores (e.g. 'dischromic _patches' -> 'dischromic_patches')
    cleaned = re.sub(r'\s*_\s*', '_', cleaned)
    cleaned = cleaned.strip('_')  # Remove leading/trailing underscores
    return cleaned if cleaned else None

# Apply cleaning function to each symptom column (NaN and other non-strings become None)
for col in symptom_cols:
    df[col] = df[col].apply(clean_symptom_value)

# Build symptoms_text from symptom_cols by selecting only non-null entries (without prior astype(str))
def build_symptoms_text(row):
    """Join the non-empty string symptoms of a row with spaces, skipping None/NaN."""
    symptoms = [val for val in row[symptom_cols] if isinstance(val, str) and val.strip()]
    return ' '.join(symptoms)

df['symptoms_text'] = df.apply(build_symptoms_text, axis=1)

# Clean up multiple spaces in symptoms_text
df['symptoms_text'] = df['symptoms_text'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Drop rows with empty symptoms_text
df = df[df['symptoms_text'].notna() & (df['symptoms_text'].str.strip() != '')].copy()

# Normalize Disease column: strip whitespace
df['Disease'] = df['Disease'].str.strip()

# Prepare features and target
X = df['symptoms_text']
y = df['Disease']

# Display preprocessing results
print("\n" + "="*50)
print("Preprocessing Results")
print("="*50)
print(f"\nSample symptoms_text:")
print(X.iloc[0])
print(f"\nNumber of unique diseases: {y.nunique()}")
print(f"\nExample - Disease: {y.iloc[0]}, Symptoms: {X.iloc[0]}")


Dataset Shape: (4920, 18)

First few rows:
            Disease   Symptom_1              Symptom_2              Symptom_3  \
0  Fungal infection     itching              skin_rash   nodal_skin_eruptions   
1  Fungal infection   skin_rash   nodal_skin_eruptions    dischromic _patches   
2  Fungal infection     itching   nodal_skin_eruptions    dischromic _patches   
3  Fungal infection     itching              skin_rash    dischromic _patches   
4  Fungal infection     itching              skin_rash   nodal_skin_eruptions   

              Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  \
0   dischromic _patches       NaN       NaN       NaN       NaN       NaN   
1                   NaN       NaN       NaN       NaN       NaN       NaN   
2                   NaN       NaN       NaN       NaN       NaN       NaN   
3                   NaN       NaN       NaN       NaN       NaN       NaN   
4                   NaN       NaN       NaN       NaN       NaN       NaN   

  Sympt

### 1.3 Model Training


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the ML pipeline
disease_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the model
print("Training the model...")
disease_pipeline.fit(X_train, y_train)
print("Model training complete.")

# Evaluate the model
y_pred = disease_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Export the vocabulary
vocabulary = disease_pipeline.named_steps['vectorizer'].get_feature_names_out()
vocabulary_list = vocabulary.tolist()

vocab_path = project_root / 'outputs' / 'models' / 'symptom_vocabulary.json'
vocab_path.parent.mkdir(parents=True, exist_ok=True)
with open(vocab_path, 'w') as f:
    json.dump(vocabulary_list, f, indent=2)

print(f"\nVocabulary exported to: {vocab_path}")
print(f"Vocabulary size: {len(vocabulary_list)} unique symptom tokens")

# Save the trained model
model_path = project_root / 'outputs' / 'models' / 'disease_model.pkl'
model_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(disease_pipeline, model_path)

print(f"Trained model saved to: {model_path}")


Training the model...
Model training complete.

Model Accuracy: 1.0000

Classification Report:
                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00        18
                                   AIDS       1.00      1.00      1.00        30
                                   Acne       1.00      1.00      1.00        24
                    Alcoholic hepatitis       1.00      1.00      1.00        25
                                Allergy       1.00      1.00      1.00        24
                              Arthritis       1.00      1.00      1.00        23
                       Bronchial Asthma       1.00      1.00      1.00        33
                   Cervical spondylosis       1.00      1.00      1.00        23
                            Chicken pox       1.00      1.00      1.00        21
                    Chronic cholestasis       1.00      1.00      1.00        15
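
A perfect score can be genuine, but it is also what one would expect if identical symptom strings land in both the training and test splits. The sketch below is one way to check for that; the helper name `seen_in_training_fraction` is hypothetical, and the toy lists stand in for `X_train` and `X_test`.

```python
def seen_in_training_fraction(train_texts, test_texts):
    """Fraction of test texts that occur verbatim in the training texts."""
    test_texts = list(test_texts)
    if not test_texts:
        return 0.0
    train_set = set(train_texts)
    hits = sum(1 for text in test_texts if text in train_set)
    return hits / len(test_texts)

# Toy example; in the notebook one would pass X_train and X_test instead.
train = ["itching skin_rash", "cough high_fever", "itching skin_rash"]
test = ["itching skin_rash", "headache nausea"]
print(seen_in_training_fraction(train, test))  # 0.5
```

A value close to 1.0 on the real splits would suggest the reported accuracy mostly measures memorization of repeated rows rather than generalization.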
             

## Phase 2: Knowledge Base Setup


### 2.1 Setup and Environment


In [10]:
import chromadb
from sentence_transformers import SentenceTransformer
import pandas as pd
from pathlib import Path
import sys

# Define project_root for Phase 2 (allows Phase 2 to run independently)
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

# Initialize SentenceTransformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("SentenceTransformer model loaded successfully.")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

print("Libraries imported successfully.")


AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.
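
The error above is most likely raised while importing `chromadb`: older releases of that package still reference `np.float_`, which NumPy 2.0 removed. The usual remedies are to upgrade `chromadb` or to pin `numpy<2` in `requirements.txt`; which one suits this project is an assumption to verify against the package release notes. A small sketch for checking the installed NumPy major version before running the cell (the helper names are hypothetical):

```python
from importlib import metadata

def major_version(version_string):
    """Return the leading integer of a version string such as '2.0.1'."""
    return int(version_string.split(".")[0])

def numpy_is_2_or_newer():
    """True if the installed NumPy is 2.x or later, None if NumPy is not installed."""
    try:
        return major_version(metadata.version("numpy")) >= 2
    except metadata.PackageNotFoundError:
        return None

if numpy_is_2_or_newer():
    print("NumPy 2.x detected: upgrade chromadb or pin numpy<2 before importing it.")
```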

### 2.2 Load Disease Information


In [None]:
# Load disease descriptions and precautions
DESCRIPTION_PATH = project_root / 'data' / 'symptom_Description.csv'
PRECAUTION_PATH = project_root / 'data' / 'symptom_precaution.csv'

# Load disease descriptions
descriptions_df = pd.read_csv(DESCRIPTION_PATH)
print(f"Descriptions shape: {descriptions_df.shape}")
print(descriptions_df.head())

# Load disease precautions
precautions_df = pd.read_csv(PRECAUTION_PATH)
print(f"\nPrecautions shape: {precautions_df.shape}")
print(precautions_df.head())

# Data cleaning: strip whitespace from Disease column
descriptions_df['Disease'] = descriptions_df['Disease'].str.strip()
precautions_df['Disease'] = precautions_df['Disease'].str.strip()

# Normalize disease names using correction map to fix typos and mismatches
def normalize_disease_name(name):
    """Normalize disease names to fix typos and inconsistencies."""
    # Correction map for known typos/mismatches
    correction_map = {
        'hemmorhoids': 'hemorrhoids',  # Fix typo: hemmorhoids -> hemorrhoids
        'Paroymsal': 'Paroxysmal',  # Fix typo: Paroymsal -> Paroxysmal
    }
    
    normalized = name
    for typo, correct in correction_map.items():
        if typo in normalized:
            normalized = normalized.replace(typo, correct)
    return normalized

# Apply normalization to both dataframes
descriptions_df['Disease'] = descriptions_df['Disease'].apply(normalize_disease_name)
precautions_df['Disease'] = precautions_df['Disease'].apply(normalize_disease_name)

# Verify data alignment after normalization
print(f"\nAfter normalization:")
print(f"Unique diseases in descriptions: {descriptions_df['Disease'].nunique()}")
print(f"Unique diseases in precautions: {precautions_df['Disease'].nunique()}")

desc_diseases = set(descriptions_df['Disease'])
prec_diseases = set(precautions_df['Disease'])
mismatches_desc = desc_diseases - prec_diseases
mismatches_prec = prec_diseases - desc_diseases

if mismatches_desc:
    print(f"Diseases only in descriptions: {mismatches_desc}")
if mismatches_prec:
    print(f"Diseases only in precautions: {mismatches_prec}")
if not mismatches_desc and not mismatches_prec:
    print("✓ All diseases match between descriptions and precautions files.")


### 2.3 Initialize ChromaDB


In [None]:
# Initialize ChromaDB persistent client
VECTORSTORE_PATH = project_root / 'outputs' / 'vectorstore' / 'chroma_db'

# Create the vectorstore directory
VECTORSTORE_PATH.mkdir(parents=True, exist_ok=True)
print(f"Vectorstore directory: {VECTORSTORE_PATH}")

# Initialize ChromaDB persistent client
chroma_client = chromadb.PersistentClient(path=str(VECTORSTORE_PATH))
print("ChromaDB persistent client initialized.")

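# Define embedding function for the collection.
# Recent ChromaDB releases check that an embedding function is callable as
# __call__(self, input), so a bare function taking `texts` may be rejected.
class SentenceTransformerEmbedding(chromadb.EmbeddingFunction):
    """Embedding function that wraps SentenceTransformer encode."""
    def __call__(self, input):
        return embedding_model.encode(list(input)).tolist()

embed_function = SentenceTransformerEmbedding()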

# Create or get the collection with embedding function
collection = chroma_client.get_or_create_collection(
    name="disease_info",
    embedding_function=embed_function
)
print(f"Collection 'disease_info' ready. Current document count: {collection.count()}")


### 2.4 Populate Vector Database


In [None]:
# Merge descriptions and precautions
merged_df = pd.merge(descriptions_df, precautions_df, on='Disease', how='inner')
print(f"Merged dataframe shape: {merged_df.shape}")
print(merged_df.head())

# Assert that merged row count equals the number of unique diseases
unique_disease_count = descriptions_df['Disease'].nunique()
merged_row_count = len(merged_df)
assert merged_row_count == unique_disease_count, f"Merge failed: expected {unique_disease_count} rows, got {merged_row_count}. Check for remaining disease name mismatches."
print(f"\n✓ Merge successful: {merged_row_count} rows match {unique_disease_count} unique diseases.")

# Create combined documents
def create_document(row):
    description = row['Description']
    precautions = [row[f'Precaution_{i}'] for i in range(1, 5) if pd.notna(row[f'Precaution_{i}'])]
    precautions_text = ', '.join(precautions) if precautions else 'No specific precautions listed'
    return f"Disease: {row['Disease']}\n\nDescription: {description}\n\nPrecautions: {precautions_text}"

merged_df['document'] = merged_df.apply(create_document, axis=1)
print("\nSample document:")
print(merged_df['document'].iloc[0])

# Create stable IDs from disease names (slugify)
import re
def slugify_disease_name(name):
    """Create a stable ID from disease name."""
    # Convert to lowercase, replace spaces and special chars with underscores
    slug = name.lower()
    slug = re.sub(r'[^\w\s-]', '', slug)  # Remove special chars except spaces and hyphens
    slug = re.sub(r'[-\s]+', '_', slug)  # Replace spaces and hyphens with underscores
    slug = slug.strip('_')  # Remove leading/trailing underscores
    return f"disease_{slug}"

# Prepare data for ChromaDB
documents = merged_df['document'].tolist()
ids = [slugify_disease_name(disease) for disease in merged_df['Disease']]
metadatas = [{'disease': disease} for disease in merged_df['Disease']]
print(f"\nPrepared {len(documents)} documents for embedding.")

# Clear existing collection to avoid duplicates (idempotent operation)
if collection.count() > 0:
    print(f"Clearing existing collection ({collection.count()} documents)...")
    chroma_client.delete_collection("disease_info")
    collection = chroma_client.create_collection(
        name="disease_info",
        embedding_function=embed_function
    )
    print("Collection cleared and recreated.")

# Add documents to ChromaDB (embeddings will be generated automatically by the registered embedding_function)
collection.add(documents=documents, ids=ids, metadatas=metadatas)
print(f"Successfully added {len(documents)} documents to ChromaDB collection.")
print(f"Final collection count: {collection.count()}")

# Test the vectorstore using query_texts (works with embedding function)
test_results = collection.query(query_texts=['fever and cough'], n_results=3)
print("\nTest query results:")
for rank, (doc, metadata) in enumerate(zip(test_results['documents'][0], test_results['metadatas'][0]), start=1):
    print(f"{rank}. {metadata['disease']}: {doc[:100]}...")
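
Because the IDs are derived from disease names, two names that differ only in case, spacing, or punctuation would produce the same ID. The sketch below repeats the slug logic from the cell above so that it runs on its own, and adds a hypothetical helper, `find_id_collisions`, that reports such clashes before documents are added. The input names are a toy example, not taken from the dataset files.

```python
import re
from collections import defaultdict

def slugify_disease_name(name):
    """Same slug rule as in the notebook cell above."""
    slug = name.lower()
    slug = re.sub(r'[^\w\s-]', '', slug)   # drop punctuation except hyphens
    slug = re.sub(r'[-\s]+', '_', slug)    # spaces and hyphens become underscores
    return f"disease_{slug.strip('_')}"

def find_id_collisions(names):
    """Map each slug produced by more than one distinct name to those names."""
    by_slug = defaultdict(set)
    for name in names:
        by_slug[slugify_disease_name(name)].add(name)
    return {slug: sorted(group) for slug, group in by_slug.items() if len(group) > 1}

names = ["Hepatitis B", "hepatitis-b", "Acne"]
print(find_id_collisions(names))
# {'disease_hepatitis_b': ['Hepatitis B', 'hepatitis-b']}
```

In the notebook, passing `merged_df['Disease']` and asserting that the result is empty would catch a clash before `collection.add` is called.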


## Phase 4: Demonstration


### 4.1 Setup and Import


In [None]:
import sys
from pathlib import Path

# Add project root to Python path
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

# Import the ProjectAssistant class
from scripts.task3_app import ProjectAssistant

print("ProjectAssistant imported successfully.")


### 4.2 Initialize the Assistant


In [None]:
# TODO: Initialize ProjectAssistant
# assistant = ProjectAssistant()


### 4.3 Test Symptom Checker (Tool 1)


In [None]:
# TODO: Test symptom checker queries
# Example queries:
# - "I have fever and cough"
# - "What diseases cause fatigue and headache?"


### 4.4 Test Cancer Analysis (Tool 2)


In [None]:
# TODO: Test cancer analysis queries
# Example queries:
# - "Tell me about breast cancer patterns"
# - "What patterns indicate malignant cases?"


### 4.5 Test Router Behavior


In [None]:
# TODO: Test router behavior with out-of-scope queries
# Example queries:
# - "What's the weather today?"
# - "Tell me a joke"
