[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/01_Drug_Discovery_Pipeline.ipynb)

# Drug Discovery Pipeline - Vector Similarity Search

## Overview

This notebook demonstrates a **complete drug discovery pipeline** using Semantica that focuses on **vector similarity search** and **interaction prediction**. The pipeline ingests drug and protein data, extracts compound and target entities, builds a drug-target knowledge graph, and performs similarity search to predict drug-target interactions.

### Key Features

- **Vector-Focused Approach**: Emphasizes embeddings and vector similarity search for drug-target interaction prediction
- **Compound-Target Extraction**: Extracts drug compounds, proteins, and targets from biomedical literature
- **Similarity Search**: Uses vector embeddings to find similar compounds and predict interactions
- **Knowledge Graph Construction**: Builds structured drug-target relationship graphs
- **Interaction Prediction**: Predicts potential drug-target interactions using similarity metrics

### What You'll Learn

- How to ingest biomedical data (drug databases, protein data, literature)
- How to extract compound and target entities from unstructured text
- How to generate embeddings for drugs and proteins
- How to perform similarity search to find similar compounds
- How to build drug-target knowledge graphs
- How to predict drug-target interactions using vector similarity
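
The idea behind vector similarity search can be sketched in a few lines of plain Python. This is an illustration only, not Semantica's implementation: the three-dimensional vectors and the `cosine_similarity` and `top_k` helpers are made up for the example, whereas real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (name, score) pairs most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy, hand-written vectors (hypothetical; not real embeddings)
toy_index = {
    "Aspirin": [0.9, 0.1, 0.0],
    "Ibuprofen": [0.8, 0.2, 0.1],
    "Metformin": [0.0, 0.1, 0.9],
}

nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for name, score in nearest:
    print(f"{name}: {score:.3f}")
```

A vector index such as FAISS performs the same ranking, but with data structures that avoid comparing the query against every stored vector one by one in Python.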

### Pipeline Architecture

1. **Phase 0**: Setup & Configuration
2. **Phase 1**: Biomedical Data Ingestion
3. **Phase 2**: Document Parsing & Processing
4. **Phase 3**: Entity Extraction (Drugs, Proteins, Targets)
5. **Phase 4**: Embedding Generation
6. **Phase 5**: Vector Store Population
7. **Phase 6**: Similarity Search & Interaction Prediction
8. **Phase 7**: Knowledge Graph Construction
9. **Phase 8**: Visualization & Export

---

## Installation

Install Semantica and required dependencies:


In [1]:
# Install Semantica and required dependencies
%pip install -qU semantica networkx matplotlib plotly pandas faiss-cpu beautifulsoup4 groq sentence-transformers scikit-learn


Note: you may need to restart the kernel to use updated packages.




---

## Setup & Configuration

- Configure Semantica for drug discovery
- Focus on vector similarity search


In [2]:
import os

# Set API keys (read from the environment; never hardcode secrets in a notebook)
if "GROQ_API_KEY" not in os.environ:
    print("Warning: GROQ_API_KEY is not set; the Groq extraction provider will not work.")

print("Environment configured.")


Environment configured.


- Import and configure Semantica core components


In [3]:
from semantica.core import Semantica, ConfigManager

# Configure for drug discovery with vector similarity focus
config_dict = {
    "project_name": "Drug_Discovery_Pipeline",
    "embedding": {
        "provider": "sentence_transformers",
        "model": "all-MiniLM-L6-v2"  # 384-dimensional embeddings
    },
    "extraction": {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "temperature": 0.0
    },
    "vector_store": {
        "provider": "faiss",
        "dimension": 384
    },
    "knowledge_graph": {
        "backend": "networkx",
        "merge_entities": True
    }
}

config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)

print("Semantica core configured.")


Semantica core configured.
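
The configuration above sets the vector store dimension to 384 so that it matches the output size of `all-MiniLM-L6-v2`. If the two drift apart, the mismatch typically only surfaces when vectors are stored. A small guard can catch it earlier. The sketch below is a hypothetical helper, not part of Semantica: `KNOWN_EMBEDDING_DIMS` and `check_dimensions` are names invented here, and the lookup table covers only two models.

```python
# Output sizes for a few sentence-transformers models.
# This lookup table is an assumption for illustration; extend it as needed.
KNOWN_EMBEDDING_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}


def check_dimensions(config):
    """Return the expected dimension, or raise ValueError on a mismatch.

    Returns None when the model is not in the lookup table.
    """
    model = config["embedding"]["model"]
    store_dim = config["vector_store"]["dimension"]
    expected = KNOWN_EMBEDDING_DIMS.get(model)
    if expected is None:
        return None  # Unknown model: nothing to verify
    if expected != store_dim:
        raise ValueError(
            f"Model {model!r} produces {expected}-dimensional vectors, "
            f"but the vector store is configured for {store_dim}."
        )
    return expected


example_config = {
    "embedding": {"provider": "sentence_transformers", "model": "all-MiniLM-L6-v2"},
    "vector_store": {"provider": "faiss", "dimension": 384},
}
print(check_dimensions(example_config))
```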


In [4]:
from semantica.vector_store import VectorStore

# Initialize vector store with FAISS backend
vector_store = VectorStore(backend="faiss", dimension=384)

print("Vector store initialized.")


fastembed not available. Install with: pip install fastembed. Using fallback embedding method.


Vector store initialized.


---

## Biomedical Data Ingestion

- Ingest biomedical data from RSS feeds (PubMed, bioRxiv, Nature)
- Use FeedIngestor for data collection, falling back to a small sample text if every feed fails


In [5]:
import os
# Create data directory if it doesn't exist
os.makedirs("data", exist_ok=True)

print("Data directory ready.")


Data directory ready.


In [6]:
from semantica.ingest import FeedIngestor
from contextlib import redirect_stderr
from io import StringIO

# Multiple RSS feed options for biomedical data
feed_urls = [
    ("PubMed - Drug Discovery", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+discovery&limit=10&sort=pub_date&fc=article_type"),
    ("PubMed - Drug Target Interaction", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=drug+target+interaction&limit=10&sort=pub_date"),
    ("PubMed - Pharmacokinetics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=pharmacokinetics&limit=10&sort=pub_date"),
    ("PubMed - Pharmacodynamics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=pharmacodynamics&limit=10&sort=pub_date"),
    ("PubMed - Clinical Trial", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=clinical+trial&limit=10&sort=pub_date"),
    ("PubMed - Protein Target", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=protein+target&limit=10&sort=pub_date"),
    ("BioRxiv - Pharmacology", "https://connect.biorxiv.org/biorxiv_xml.php?subject=pharmacology_and_toxicology"),
    ("Nature - Drug Discovery", "https://www.nature.com/subjects/drug-discovery.rss")
]

print(f"Configured {len(feed_urls)} RSS feed sources")


Configured 8 RSS feed sources


In [7]:
# Initialize FeedIngestor
feed_ingestor = FeedIngestor()
print("FeedIngestor initialized")


FeedIngestor initialized


In [8]:
# Try each feed URL until one succeeds (suppress error messages)
documents = None

for feed_name, feed_url in feed_urls:
    try:
        # Suppress stderr to avoid cluttering output with parsing errors
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        # Convert FeedItem objects to documents
        documents = []
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                documents.append(item)
        
        if documents:
            print(f"✓ Successfully ingested {len(documents)} documents from {feed_name}")
            break
    except Exception:
        continue


Status,Action,Module,Submodule,File,Time
✅,Semantica is normalizing,🔧 normalize,TextNormalizer,-,0.00s
✅,Semantica is extracting,🎯 semantic_extract,NERExtractor,-,0.65s
✅,Semantica is processing,⏳ core,LifecycleManager,-,1.19s
🔄,Semantica is processing,⏳ core,Semantica,-,0.06s
✅,Semantica is executing,⚙️ pipeline,PipelineBuilder,-,0.03s
✅,Semantica is executing,⚙️ pipeline,PipelineValidator,-,0.01s
🔄,Semantica is processing,⏳ core,build_knowledge_base,chunk_0.txt,0.01s
🔄,Semantica is executing,⚙️ pipeline,ExecutionEngine,chunk_0.txt,0.01s
🔄,Semantica is executing,⚙️ pipeline,ExecutionEngine,-,0.01s
🔄,Semantica is executing,⚙️ pipeline,ResourceScheduler,-,0.00s


✓ Successfully ingested 30 documents from Nature - Drug Discovery


In [9]:
# Fallback to sample data if all RSS feeds failed
from semantica.ingest import FileIngestor

if not documents:
    sample_drug_data = """
    Aspirin (acetylsalicylic acid) is a medication used to reduce pain, fever, or inflammation. 
    It targets cyclooxygenase enzymes COX-1 and COX-2. Aspirin is commonly used for cardiovascular protection.
    Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID) that targets COX-1 and COX-2 enzymes.
    Metformin is an antidiabetic medication that targets AMP-activated protein kinase (AMPK).
    Insulin targets the insulin receptor (INSR) to regulate glucose metabolism.
    Warfarin is an anticoagulant that targets vitamin K epoxide reductase complex subunit 1 (VKORC1).
    Atorvastatin is a statin medication that targets HMG-CoA reductase.
    """
    
    import os
    os.makedirs("data", exist_ok=True)
    with open("data/sample_drugs.txt", "w") as f:
        f.write(sample_drug_data)
    
    file_ingestor = FileIngestor()
    documents = file_ingestor.ingest("data/sample_drugs.txt")
    print(f"Ingested {len(documents)} documents from sample data")
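
The ingestion cells above follow a "try each source in order, then fall back" pattern. The sketch below isolates that pattern with the standard library only. It is a generic illustration, not Semantica code: `first_successful` and `fake_fetch` are hypothetical names, and unlike the notebook loop it records why each source failed instead of discarding the exception.

```python
def first_successful(sources, fetch):
    """Try each (name, location) pair in order.

    Returns (name, items, errors) for the first source that yields a
    non-empty result, or (None, [], errors) if none does.
    """
    errors = {}
    for name, location in sources:
        try:
            items = fetch(location)
        except Exception as exc:
            # Keep the reason so a total failure can be diagnosed later
            errors[name] = repr(exc)
            continue
        if items:
            return name, items, errors
    return None, [], errors


def fake_fetch(location):
    """Stand-in for a real feed reader (hypothetical)."""
    if location == "bad":
        raise ConnectionError("feed unreachable")
    if location == "empty":
        return []
    return [f"item from {location}"]


name, items, errors = first_successful(
    [("Feed A", "bad"), ("Feed B", "empty"), ("Feed C", "good")],
    fake_fetch,
)
print(name, items, errors)
```

When the returned list is empty, the caller can switch to local sample data, which is what the fallback cell does.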


---

## Text Normalization & Cleaning

- Clean HTML tags and special characters
- Normalize entity names
- Remove extra whitespace
- Preserve drug names (case-sensitive)


In [10]:
from semantica.normalize import TextNormalizer

# Initialize text normalizer
normalizer = TextNormalizer()
print("TextNormalizer initialized")


TextNormalizer initialized


In [11]:
# Normalize all documents
normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False  # Preserve drug names (case-sensitive)
    )
    normalized_documents.append(normalized_text)

print(f"Normalized {len(normalized_documents)} documents")


Normalized 30 documents


In [None]:
# Preview normalized text
if normalized_documents:
    print(f"Sample normalized text (first 200 chars): {normalized_documents[0][:200]}")
else:
    print("No normalized documents available")

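To make the cleaning steps concrete, here is a rough standard-library approximation of what they do to a snippet of feed HTML. It is a sketch under stated assumptions, not `TextNormalizer`'s actual logic: `simple_normalize` is a hypothetical helper, and the regex-based tag stripping is only adequate for simple markup.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


def simple_normalize(text):
    """Strip tags, decode HTML entities, collapse whitespace; keep case."""
    text = _TAG_RE.sub(" ", text)        # Replace HTML tags with a space
    text = html.unescape(text)           # &amp; becomes &, &nbsp; becomes U+00A0
    text = _WHITESPACE_RE.sub(" ", text) # Collapse runs of whitespace, including U+00A0
    return text.strip()


raw = "<p>Aspirin&nbsp;inhibits   <b>COX-1</b> &amp; COX-2.</p>"
print(simple_normalize(raw))
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` is not mistaken for markup. Case is left untouched, which keeps names like `COX-1` intact.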
---

## Entity-Aware Chunking

- Use entity-aware chunking so that drug and protein mentions are not split across chunk boundaries
- Intact entity mentions matter for the GraphRAG retrieval later in the notebook


In [12]:
from semantica.split import TextSplitter
from contextlib import redirect_stderr
from io import StringIO

# Initialize TextSplitter with entity-aware chunking
# Using 'spacy' instead of 'llm' to avoid API key requirements
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",  # Use spaCy for entity recognition (no API key needed)
    chunk_size=1000,
    chunk_overlap=200
)

print("TextSplitter initialized with entity-aware chunking")


TextSplitter initialized with entity-aware chunking


In [13]:
# Chunk normalized documents (suppress error messages)
# Build the fallback splitter once instead of on every failure
simple_splitter = TextSplitter(method="recursive", chunk_size=1000, chunk_overlap=200)

chunked_documents = []
for doc_text in normalized_documents:
    try:
        # Suppress stderr to avoid cluttering output with NER errors
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
    except Exception:
        # Fall back to simple recursive splitting if entity-aware chunking fails
        chunks = simple_splitter.split(doc_text)
    chunked_documents.extend(chunks)

print(f"Created {len(chunked_documents)} chunks")


Created 30 chunks
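
The `chunk_size` and `chunk_overlap` parameters are easiest to understand from a plain sliding-window splitter. The sketch below is a simplified stand-in, not Semantica's `TextSplitter`: `chunk_text` is a hypothetical helper that works on characters and ignores entity boundaries, which an entity-aware splitter would additionally respect.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into character windows that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # The last window reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

A text shorter than `chunk_size` comes back as a single chunk, which would explain why 30 short feed items produce exactly 30 chunks above.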


In [14]:
# Preview sample chunk
if chunked_documents:
    sample_chunk = chunked_documents[0]
    chunk_text = sample_chunk.text if hasattr(sample_chunk, 'text') else str(sample_chunk)
    print(f"Sample chunk (first 200 chars): {chunk_text[:200]}")
else:
    print("No chunks available")


Sample chunk (first 200 chars): Apoptotic signatures allow early and rapid screening of drug-induced liver injury to accelerate drug discovery


---

## Entity Extraction & Knowledge Graph Construction

- Extract drug and protein entities
- Build knowledge graph from extracted entities


In [15]:
# Convert chunks to text files
# build_knowledge_base requires file paths or URLs, not raw text
import os

os.makedirs("data/chunks", exist_ok=True)
chunk_files = []

for i, chunk in enumerate(chunked_documents):
    # Extract text from chunk object
    if hasattr(chunk, 'text'):
        chunk_text = chunk.text
    elif hasattr(chunk, 'content'):
        chunk_text = chunk.content
    else:
        chunk_text = str(chunk)
    
    # Save chunk to temporary file
    chunk_file = f"data/chunks/chunk_{i}.txt"
    with open(chunk_file, "w", encoding="utf-8") as f:
        f.write(chunk_text)
    chunk_files.append(chunk_file)

print(f"Prepared {len(chunk_files)} chunk files for entity extraction")


Prepared 30 chunk files for entity extraction


In [None]:
# Build knowledge base with entity extraction
result = core.build_knowledge_base(
    sources=chunk_files,
    custom_entity_types=["Drug", "Protein", "Target", "Compound", "Enzyme", "Receptor"],
    embeddings=True,
    graph=True
)

print("Knowledge base built successfully")


In [None]:
# Extract entities from result
entities = result["entities"]
print(f"Total entities extracted: {len(entities)}")


In [None]:
# Filter drugs and proteins (case-insensitive match on the entity type)
drugs = [e for e in entities if "drug" in (e.get("type") or "").lower()]
proteins = [e for e in entities if "protein" in (e.get("type") or "").lower()]

print(f"Extracted {len(drugs)} drugs and {len(proteins)} proteins")


In [None]:
# Preview sample drugs
print("Sample drugs:")
for d in drugs[:3]:
    print(f"  - {d.get('text', 'Unknown')[:50]}")


In [None]:
# Preview sample proteins
print("Sample proteins:")
for p in proteins[:3]:
    print(f"  - {p.get('text', 'Unknown')[:50]}")


In [None]:
# Get knowledge graph statistics
kg = result["knowledge_graph"]
print(f"Knowledge graph contains:")
print(f"  - {len(kg.get('entities', []))} entities")
print(f"  - {len(kg.get('relationships', []))} relationships")
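
Beyond counting entities and relationships, it is often useful to group relationships by their source node, for example to list the targets recorded for each drug. The sketch below assumes each relationship is a dict with `source`, `target`, and `type` keys. That schema is an assumption for illustration and may differ from what `build_knowledge_base` returns; `targets_by_source` is a hypothetical helper, and the toy graph reuses facts from the sample text earlier in the notebook.

```python
from collections import defaultdict


def targets_by_source(kg, relation_type=None):
    """Map each source node to a sorted list of its target nodes."""
    grouped = defaultdict(set)
    for rel in kg.get("relationships", []):
        if relation_type and rel.get("type") != relation_type:
            continue  # Skip relationships of other types
        source, target = rel.get("source"), rel.get("target")
        if source and target:
            grouped[source].add(target)
    return {source: sorted(targets) for source, targets in grouped.items()}


# Toy graph in the assumed schema (hypothetical structure)
toy_kg = {
    "entities": [
        {"text": "Aspirin", "type": "Drug"},
        {"text": "Metformin", "type": "Drug"},
    ],
    "relationships": [
        {"source": "Aspirin", "target": "COX-1", "type": "targets"},
        {"source": "Aspirin", "target": "COX-2", "type": "targets"},
        {"source": "Metformin", "target": "AMPK", "type": "targets"},
    ],
}

print(targets_by_source(toy_kg, relation_type="targets"))
```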




---

## Embedding Generation & Vector Store Population

- Generate embeddings for the extracted drugs
- Populate vector store for similarity search


In [None]:
from semantica.embeddings import EmbeddingGenerator

# Initialize the embedding generator (same model as in the core configuration)
embedding_gen = EmbeddingGenerator(provider="sentence_transformers", model="all-MiniLM-L6-v2")

# Create drug embeddings
drug_texts = [f"{d.get('text', '')} {d.get('description', '')}" for d in drugs]
drug_embeddings = embedding_gen.generate_embeddings(drug_texts)

# Store in vector store
drug_ids = vector_store.store_vectors(
    vectors=drug_embeddings,
    metadata=[{"type": "drug", "name": d.get("text", "")} for d in drugs]
)

print(f"Stored {len(drug_ids)} drug embeddings in vector store")


---

## Similarity Search & Interaction Prediction

- Use vector similarity to find similar drugs
- Predict drug-target interactions


In [None]:
# Example: Find drugs similar to Aspirin
query_drug = "Aspirin"
query_embedding = embedding_gen.generate_embeddings([query_drug])[0]

# Search for similar drugs
similar_drugs = vector_store.search_vectors(query_embedding, k=5)

print(f"Drugs similar to '{query_drug}':")
for i, match in enumerate(similar_drugs, 1):
    print(f"{i}. {match['metadata'].get('name', 'Unknown')} (similarity: {match['score']:.3f})")


---

## GraphRAG Retrieval

- Combine vector search with knowledge graph traversal (hybrid retrieval)
- Return related entities and relationships alongside the matching content


In [None]:
from semantica.context import AgentContext

# Initialize GraphRAG context with vector store and knowledge graph
context = AgentContext(vector_store=vector_store, knowledge_graph=kg)

# Example GraphRAG query: Find drugs and their targets
query = "What drugs target COX enzymes?"
print(f"Query: {query}\n")

# Retrieve using GraphRAG (hybrid vector + graph retrieval)
results = context.retrieve(
    query,
    max_results=10,
    use_graph=True,  # Enable graph traversal
    expand_graph=True,  # Expand graph relationships
    include_entities=True,  # Include related entities
    include_relationships=True  # Include relationships
)

print(f"GraphRAG retrieved {len(results)} results:\n")
# Use `hit` as the loop variable so the earlier `result` dict is not overwritten
for i, hit in enumerate(results[:5], 1):
    print(f"{i}. Score: {hit.get('score', 0):.3f}")
    print(f"   Content: {hit.get('content', '')[:200]}...")
    if hit.get('related_entities'):
        print(f"   Related entities: {len(hit['related_entities'])}")
    print()


---

## Knowledge Graph Visualization

- Visualize drug-target knowledge graph
- Display relationships between entities


In [None]:
from semantica.visualization import KGVisualizer

# Initialize KG visualizer
visualizer = KGVisualizer()
print("KGVisualizer initialized")


In [None]:
# Generate knowledge graph visualization
visualizer.visualize(
    kg,
    output_path="drug_target_kg.html",
    layout="spring",
    node_size=20
)

print("Knowledge graph visualization saved to drug_target_kg.html")


In [None]:
# Display graph statistics
print(f"Graph contains {len(kg.get('entities', []))} entities and {len(kg.get('relationships', []))} relationships")

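The similarity scores from the drug search can be turned into a rough interaction prediction with nearest-neighbour voting: a candidate target scores higher when more of the query drug's close neighbours are already known to act on it. The sketch below is a simple heuristic for illustration, not a Semantica API and not a validated pharmacological model. `predict_targets` is a hypothetical helper, the similarity scores are made up, and the known targets are taken from the sample text earlier in the notebook.

```python
from collections import defaultdict


def predict_targets(similar_drugs, known_targets, exclude=None):
    """Rank candidate targets by the summed similarity of neighbours that hit them.

    similar_drugs: iterable of (drug_name, similarity) pairs
    known_targets: dict mapping drug_name to a list of known targets
    exclude: drug name to skip (usually the query drug itself)
    """
    scores = defaultdict(float)
    for name, similarity in similar_drugs:
        if name == exclude:
            continue
        for target in known_targets.get(name, ()):
            scores[target] += similarity
    # Highest score first; ties are broken alphabetically for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Known drug-target pairs from the sample text
known_targets = {
    "Ibuprofen": ["COX-1", "COX-2"],
    "Atorvastatin": ["HMG-CoA reductase"],
}

# Hypothetical search output for the query "Aspirin" (scores are made up)
neighbours = [("Aspirin", 1.0), ("Ibuprofen", 0.8), ("Atorvastatin", 0.3)]

ranking = predict_targets(neighbours, known_targets, exclude="Aspirin")
for target, score in ranking:
    print(f"{target}: {score:.2f}")
```

In the notebook, the `(name, score)` pairs could be built from `similar_drugs`, and the known targets from the knowledge graph relationships, assuming both expose the fields used here.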

