[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/01_Drug_Discovery_Pipeline.ipynb)

# Drug Discovery Pipeline - Vector Similarity Search

## Overview

This notebook demonstrates a **complete drug discovery pipeline** using Semantica that focuses on **vector similarity search** and **interaction prediction**. The pipeline ingests drug and protein data, extracts compound and target entities, builds a drug-target knowledge graph, and performs similarity search to predict drug-target interactions.

### Key Features

- **Vector-Focused Approach**: Emphasizes embeddings and vector similarity search for drug-target interaction prediction
- **Compound-Target Extraction**: Extracts drug compounds, proteins, and targets from biomedical literature
- **Similarity Search**: Uses vector embeddings to find similar compounds and predict interactions
- **Knowledge Graph Construction**: Builds structured drug-target relationship graphs
- **Interaction Prediction**: Predicts potential drug-target interactions using similarity metrics

### What You'll Learn

- How to ingest biomedical data (drug databases, protein data, literature)
- How to extract compound and target entities from unstructured text
- How to generate embeddings for drugs and proteins
- How to perform similarity search to find similar compounds
- How to build drug-target knowledge graphs
- How to predict drug-target interactions using vector similarity

### Pipeline Architecture

1. **Phase 0**: Setup & Configuration
2. **Phase 1**: Real Data Ingestion (PubMed RSS Feed)
3. **Phase 2**: Text Normalization & Cleaning
4. **Phase 3**: Advanced Chunking (Entity-Aware)
5. **Phase 4**: Entity Extraction & Knowledge Graph Construction
6. **Phase 5**: Vector Store Population & Embedding Generation
7. **Phase 6**: GraphRAG (Hybrid Vector + Graph Retrieval)
8. **Phase 7**: Similarity Search & Interaction Prediction
9. **Phase 8**: Knowledge Graph Visualization

---

## Installation

Install Semantica and required dependencies:


In [None]:
# Install Semantica and required dependencies
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


---

## Phase 0: Setup & Configuration

Configure Semantica for drug discovery with focus on vector similarity search.
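
An easy misconfiguration in a setup like this is a mismatch between the embedding model's output size and the vector store's `dimension`. The sketch below checks the two against each other on a plain dict shaped like this notebook's configuration. The `KNOWN_EMBEDDING_DIMS` table and the `check_dimensions` helper are hypothetical names introduced for illustration; they are not part of Semantica.

```python
# Hypothetical lookup table; only the all-MiniLM-L6-v2 entry (384) comes from this notebook.
KNOWN_EMBEDDING_DIMS = {"all-MiniLM-L6-v2": 384}


def check_dimensions(config: dict) -> int:
    """Return the vector store dimension if it matches the embedding model's output size."""
    model = config["embedding"]["model"]
    configured = config["vector_store"]["dimension"]
    expected = KNOWN_EMBEDDING_DIMS.get(model)
    if expected is None:
        raise KeyError(f"No known dimension for embedding model {model!r}")
    if expected != configured:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the vector store is configured for {configured}"
        )
    return configured


# Minimal config with the same keys used in this notebook
example_config = {
    "embedding": {"model": "all-MiniLM-L6-v2"},
    "vector_store": {"provider": "faiss", "dimension": 384},
}
print(check_dimensions(example_config))
```

Running a check like this before building the index fails fast, instead of surfacing later as a shape error when vectors are stored.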


In [None]:
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any

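# Read the Groq API key from the environment. The fallback is only a placeholder:
# replace it with a real key (or export GROQ_API_KEY) before running LLM-based extraction.
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-groq-api-key-here")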

print("Environment configured.")


In [None]:
from semantica.core import Semantica, ConfigManager
from semantica.vector_store import VectorStore
from semantica.embeddings import EmbeddingGenerator
from semantica.semantic_extract import NamedEntityRecognizer, RelationExtractor

# Configure for drug discovery with vector similarity focus
config_dict = {
    "project_name": "Drug_Discovery_Pipeline",
    "embedding": {
        "provider": "sentence_transformers",
        "model": "all-MiniLM-L6-v2"  # 384-dimensional embeddings
    },
    "extraction": {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "temperature": 0.0
    },
    "vector_store": {
        "provider": "faiss",
        "dimension": 384
    },
    "knowledge_graph": {
        "backend": "networkx",
        "merge_entities": True
    }
}

config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)
vector_store = VectorStore(backend="faiss", dimension=384)

print("Semantica configured for drug discovery pipeline.")


---

## Phase 1: Real Data Ingestion (PubMed RSS Feed)

Ingest biomedical data from PubMed RSS feeds using FeedIngestor.
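
To make the ingestion step less opaque, here is a minimal sketch of what turning an RSS feed into documents involves, using only the standard library. It is not how `FeedIngestor` is implemented. The `SAMPLE_RSS` string is hand-written for illustration (not real PubMed output), and `parse_rss_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS 2.0 snippet for illustration only (not real PubMed output).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example article A</title>
      <link>https://example.org/a</link>
      <description>Aspirin targets COX-1 and COX-2.</description>
    </item>
    <item>
      <title>Example article B</title>
      <link>https://example.org/b</link>
      <description>Metformin targets AMPK.</description>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text: str) -> list:
    """Return one dict per <item>; missing fields become empty strings."""
    root = ET.fromstring(xml_text)
    documents = []
    for item in root.iter("item"):
        documents.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "content": (item.findtext("description") or "").strip(),
        })
    return documents


for doc in parse_rss_items(SAMPLE_RSS):
    print(doc["title"], "->", doc["content"])
```

Each resulting dict plays the role of one ingested document, with the item description standing in for the abstract text.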


In [None]:
from semantica.ingest import FeedIngestor, FileIngestor
import os

# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

# Option 1: Ingest from a PubMed RSS feed (real data source) for drug discovery research
pubmed_rss_url = "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+discovery&limit=10&sort=pub_date&fc=article_type"

documents = []
try:
    feed_ingestor = FeedIngestor()
    documents = feed_ingestor.ingest(pubmed_rss_url, method="rss")
    print(f"Ingested {len(documents)} documents from PubMed RSS feed")
except Exception as e:
    print(f"RSS feed ingestion failed: {e}")

# Option 2 (fallback): use sample drug and protein data if the feed failed or returned nothing
if not documents:
    sample_drug_data = """
    Aspirin (acetylsalicylic acid) is a medication used to reduce pain, fever, or inflammation.
    It targets cyclooxygenase enzymes COX-1 and COX-2. Aspirin is commonly used for cardiovascular protection.
    Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID) that targets COX-1 and COX-2 enzymes.
    Metformin is an antidiabetic medication that targets AMP-activated protein kinase (AMPK).
    Insulin targets the insulin receptor (INSR) to regulate glucose metabolism.
    Warfarin is an anticoagulant that targets vitamin K epoxide reductase complex subunit 1 (VKORC1).
    Atorvastatin is a statin medication that targets HMG-CoA reductase.
    """

    # Save sample data (explicit encoding keeps the file portable across platforms)
    with open("data/sample_drugs.txt", "w", encoding="utf-8") as f:
        f.write(sample_drug_data)

    # Ingest from file
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/sample_drugs.txt")
    print(f"Ingested {len(documents)} documents from sample data")


---

## Phase 2: Text Normalization & Cleaning

Normalize and clean ingested text data using the normalize module.
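
As a rough picture of what options such as `clean_html`, `normalize_entities`, and `remove_extra_whitespace` do, the following standard-library sketch strips tags, decodes HTML entities, and collapses whitespace while leaving case untouched. It is an approximation under those assumptions, not `TextNormalizer`'s actual implementation, and `normalize_text` is a hypothetical helper.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace; case is preserved."""
    # Remove tags first so that decoded entities such as &lt; are not mistaken for markup
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "<p>Aspirin   targets <b>COX-1</b> &amp; COX-2.</p>\n"
print(normalize_text(sample))
```

Case is deliberately left alone: lowercasing would turn identifiers such as `COX-1` or `VKORC1` into forms that are harder to match against curated vocabularies.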


In [None]:
from semantica.normalize import TextNormalizer

# Normalize and clean text data
normalizer = TextNormalizer()

# Normalize all documents
normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False  # Preserve drug names (case-sensitive)
    )
    normalized_documents.append(normalized_text)

print(f"Normalized {len(normalized_documents)} documents")
print(f"Sample normalized text (first 200 chars): {normalized_documents[0][:200] if normalized_documents else 'N/A'}")


---

## Phase 3: Advanced Chunking (Entity-Aware)

Use entity-aware chunking to preserve drug/protein entity boundaries for GraphRAG.
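
The idea behind entity-aware chunking can be shown in a few lines: when a fixed-size cut would land inside an entity mention, move the cut back to the start of that mention. The sketch below assumes the entity strings are already known (a stand-in for the NER step) and ignores chunk overlap. `find_entity_spans` and `entity_aware_chunks` are hypothetical helpers, not Semantica functions.

```python
def find_entity_spans(text: str, entities: list) -> list:
    """Return sorted (start, end) character spans for every occurrence of each entity."""
    spans = []
    for entity in entities:
        start = text.find(entity)
        while start != -1:
            spans.append((start, start + len(entity)))
            start = text.find(entity, start + 1)
    return sorted(spans)


def entity_aware_chunks(text: str, entities: list, chunk_size: int) -> list:
    """Split text into chunks of at most chunk_size characters without cutting entities.

    An entity longer than chunk_size that begins at a chunk start is still split.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    spans = find_entity_spans(text, entities)
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        moved = True
        while moved:  # repeat in case overlapping entities straddle the new cut
            moved = False
            for span_start, span_end in spans:
                if start < span_start < end < span_end:
                    end = span_start  # pull the cut back to the entity's start
                    moved = True
                    break
        chunks.append(text[start:end])
        start = end
    return chunks


text = "Warfarin targets vitamin K epoxide reductase in the liver."
entity = "vitamin K epoxide reductase"
for chunk in entity_aware_chunks(text, [entity], chunk_size=30):
    print(repr(chunk))
```

A naive cut at character 30 would split the enzyme name in two; here the first chunk ends just before it, so the full name stays inside a single chunk.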


In [None]:
from semantica.split import TextSplitter, EntityAwareChunker

# Use entity-aware chunking to preserve drug/protein entity boundaries
# This is crucial for GraphRAG workflows to maintain entity context
splitter = TextSplitter(
    method="entity_aware",
    ner_method="llm",  # Use LLM for better entity recognition
    chunk_size=1000,
    chunk_overlap=200
)

# Alternative: use EntityAwareChunker directly (shown for reference; the loop below uses `splitter`)
entity_chunker = EntityAwareChunker(
    chunk_size=1000,
    chunk_overlap=200,
    ner_method="llm",
    preserve_entities=True
)

# Chunk normalized documents
chunked_documents = []
for doc_text in normalized_documents:
    chunks = splitter.split(doc_text)
    chunked_documents.extend(chunks)

print(f"Created {len(chunked_documents)} chunks using entity-aware chunking")
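# Chunks may be objects with a `content` attribute or plain strings, so handle both
if chunked_documents:
    first_chunk = chunked_documents[0]
    sample_chunk = first_chunk.content if hasattr(first_chunk, "content") else str(first_chunk)
else:
    sample_chunk = "N/A"
print(f"Sample chunk (first 300 chars): {sample_chunk[:300]}")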


---

## Phase 4: Entity Extraction & Knowledge Graph Construction

Extract drug and protein entities, then build knowledge graph.


In [None]:
# Convert chunks to document format for entity extraction
# Chunks from TextSplitter have content attribute
chunk_docs = []
for chunk in chunked_documents:
    # Create a simple document-like object
    doc_content = chunk.content if hasattr(chunk, 'content') else str(chunk)
    chunk_docs.append(doc_content)

# Build knowledge base with entity extraction
result = core.build_knowledge_base(
    sources=chunk_docs,
    custom_entity_types=["Drug", "Protein", "Target", "Compound", "Enzyme", "Receptor"],
    embeddings=True,
    graph=True
)

# Extract entities and group them by type (case-insensitive; tolerates a missing "type")
entities = result["entities"]
drugs = [e for e in entities if "drug" in (e.get("type") or "").lower()]
proteins = [e for e in entities if "protein" in (e.get("type") or "").lower()]

print(f"Extracted {len(drugs)} drugs and {len(proteins)} proteins")
print(f"Sample drugs: {[d.get('text', '')[:30] for d in drugs[:3]]}")
print(f"Sample proteins: {[p.get('text', '')[:30] for p in proteins[:3]]}")

# Get knowledge graph
kg = result["knowledge_graph"]
print(f"Knowledge graph contains {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")


---

## Phase 5: Vector Store Population & Embedding Generation

Generate embeddings and populate vector store for similarity search.
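
The notebook's own embedding and storage cell appears under Phase 7. To show the mechanics without downloading a model, the sketch below uses a toy hashed bag-of-words embedding and a brute-force cosine index. These are stand-ins for `sentence-transformers` and FAISS, not equivalents. `toy_embed` and `TinyVectorStore` are hypothetical names, and the 64-dimensional size is arbitrary (the notebook uses 384).

```python
import hashlib
import math

DIM = 64  # toy dimension; the notebook itself uses 384


def toy_embed(text: str, dim: int = DIM) -> list:
    """Hash each lowercase token into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class TinyVectorStore:
    """Brute-force store: keeps vectors in a list and scores them by dot product."""

    def __init__(self):
        self._vectors = []
        self._metadata = []

    def add(self, vector: list, metadata: dict) -> int:
        self._vectors.append(vector)
        self._metadata.append(metadata)
        return len(self._vectors) - 1  # position doubles as the vector id

    def search(self, query: list, k: int = 5) -> list:
        # Vectors are unit length, so the dot product equals cosine similarity
        scored = [
            (sum(q * v for q, v in zip(query, vec)), meta)
            for vec, meta in zip(self._vectors, self._metadata)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
for name, description in [
    ("Aspirin", "targets COX-1 and COX-2"),
    ("Ibuprofen", "NSAID that targets COX-1 and COX-2"),
    ("Metformin", "targets AMPK"),
]:
    store.add(toy_embed(f"{name} {description}"), {"type": "drug", "name": name})

for score, meta in store.search(toy_embed("Aspirin targets COX-1 and COX-2"), k=2):
    print(f"{meta['name']}: {score:.3f}")
```

A real model replaces `toy_embed` with learned vectors, and FAISS replaces the linear scan with an index, but the store-then-search contract is the same.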


---

## Phase 6: GraphRAG - Hybrid Vector + Graph Retrieval

Use AgentContext for GraphRAG to combine vector similarity search with knowledge graph traversal.
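
Conceptually, hybrid retrieval takes the top vector hits and then walks the graph one hop outward to attach related entities and relationships. The sketch below illustrates that with plain Python data. The triples come from the sample drug data in Phase 1, the similarity scores are made up, and `neighbours` and `hybrid_retrieve` are hypothetical helpers rather than `AgentContext` internals.

```python
# Toy knowledge graph as (subject, relation, object) triples from the sample drug data
TRIPLES = [
    ("Aspirin", "targets", "COX-1"),
    ("Aspirin", "targets", "COX-2"),
    ("Ibuprofen", "targets", "COX-1"),
    ("Ibuprofen", "targets", "COX-2"),
    ("Metformin", "targets", "AMPK"),
]


def neighbours(entity: str, triples: list) -> list:
    """Return every triple that mentions the entity as subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]


def hybrid_retrieve(vector_hits: list, triples: list, max_results: int = 10) -> list:
    """Rank (entity, similarity) hits by score, then expand each one hop in the graph."""
    ranked = sorted(vector_hits, key=lambda hit: hit[1], reverse=True)[:max_results]
    results = []
    for name, score in ranked:
        related = neighbours(name, triples)
        related_entities = {x for s, _, o in related for x in (s, o)} - {name}
        results.append({
            "entity": name,
            "score": score,
            "relationships": related,
            "related_entities": sorted(related_entities),
        })
    return results


# Made-up similarity scores for the query "What drugs target COX enzymes?"
hits = [("COX-2", 0.88), ("COX-1", 0.91)]
for item in hybrid_retrieve(hits, TRIPLES):
    print(item["entity"], item["score"], item["related_entities"])
```

The vector step finds the enzyme mentions, and the graph step supplies the answer to the question: the drugs connected to them.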


In [None]:
from semantica.context import AgentContext

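# Initialize GraphRAG context with vector store and knowledge graph.
# Note: `vector_store` is only populated in the Phase 7 embedding cell, so run that cell
# first if you want vector hits to contribute to the results.
context = AgentContext(vector_store=vector_store, knowledge_graph=kg)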

# Example GraphRAG query: Find drugs and their targets
query = "What drugs target COX enzymes?"
print(f"Query: {query}\n")

# Retrieve using GraphRAG (hybrid vector + graph retrieval)
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,  # Enable graph traversal
    expand_graph=True,  # Expand graph relationships
    include_entities=True,  # Include related entities
    include_relationships=True  # Include relationships
)

print(f"GraphRAG retrieved {len(results)} results:\n")
# Use `hit` as the loop variable so the Phase 4 `result` dict is not overwritten
for i, hit in enumerate(results[:5], 1):
    print(f"{i}. Score: {hit.get('score', 0):.3f}")
    print(f"   Content: {hit.get('content', '')[:200]}...")
    if hit.get('related_entities'):
        print(f"   Related entities: {len(hit['related_entities'])}")
    print()


---

## Phase 7: Similarity Search & Interaction Prediction

Use vector similarity to find similar drugs and predict interactions.
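
One simple way to turn similarity scores into interaction candidates is neighbour voting: each similar drug votes for its known targets, weighted by its similarity to the query compound. The sketch below assumes a small table of known targets taken from the sample data and uses made-up similarity scores. `predict_targets` is a hypothetical helper, and its output is a ranking of hypotheses rather than a validated prediction.

```python
from collections import defaultdict

# Known drug-target pairs from the sample data in Phase 1
KNOWN_TARGETS = {
    "Aspirin": {"COX-1", "COX-2"},
    "Ibuprofen": {"COX-1", "COX-2"},
    "Metformin": {"AMPK"},
}


def predict_targets(similar_drugs: list, known_targets: dict, min_score: float = 0.0) -> list:
    """Rank candidate targets by the summed similarity of the drugs known to hit them."""
    votes = defaultdict(float)
    for drug, similarity in similar_drugs:
        if similarity < min_score:
            continue  # ignore weak neighbours
        for target in known_targets.get(drug, ()):
            votes[target] += similarity
    # Highest score first; ties are broken alphabetically for a stable order
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Made-up similarity scores between a query compound and its nearest neighbours
neighbour_scores = [("Aspirin", 0.82), ("Ibuprofen", 0.79), ("Metformin", 0.12)]
for target, score in predict_targets(neighbour_scores, KNOWN_TARGETS):
    print(f"{target}: {score:.2f}")
```

Raising `min_score` filters out weak neighbours, so a target supported only by a distant drug drops out of the ranking.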


In [None]:
from semantica.embeddings import EmbeddingGenerator

# Generate embeddings for drugs and proteins
embedding_gen = EmbeddingGenerator(provider="sentence_transformers", model="all-MiniLM-L6-v2")

# Create drug embeddings
drug_texts = [f"{d.get('text', '')} {d.get('description', '')}" for d in drugs]
drug_embeddings = embedding_gen.generate_embeddings(drug_texts)

# Store in vector store
drug_ids = vector_store.store_vectors(
    vectors=drug_embeddings,
    metadata=[{"type": "drug", "name": d.get("text", "")} for d in drugs]
)

print(f"Stored {len(drug_ids)} drug embeddings in vector store")


In [None]:
# Example: Find drugs similar to Aspirin
query_drug = "Aspirin"
query_embedding = embedding_gen.generate_embeddings([query_drug])[0]

# Search for similar drugs
similar_drugs = vector_store.search_vectors(query_embedding, k=5)

print(f"Drugs similar to '{query_drug}':")
# Use `hit` as the loop variable so the Phase 4 `result` dict is not overwritten
for i, hit in enumerate(similar_drugs, 1):
    print(f"{i}. {hit['metadata'].get('name', 'Unknown')} (similarity: {hit['score']:.3f})")


---

## Phase 8: Knowledge Graph Visualization

Visualize the drug-target knowledge graph built in Phase 4.


In [None]:
from semantica.visualization import KGVisualizer

# Reuse `kg`, the knowledge graph returned by build_knowledge_base in Phase 4

# Visualize drug-target relationships
visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="drug_target_kg.html",
    layout="spring",
    node_size=20
)

print("Knowledge graph visualization saved to drug_target_kg.html")
print(f"Graph contains {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")
