Skip to content

this is a repository which can leverage LLM for knowledge graph creation and search along with semantic search.

License

Notifications You must be signed in to change notification settings

pavanjava/deep_document_knowledge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Medical Knowledge Extraction and Graph Construction System

A sophisticated system for extracting medical entities, relationships, and knowledge from biomedical literature, creating knowledge graphs, and enabling semantic search through vector databases.

🎯 Overview

This project processes biomedical texts (particularly from PubMedQA dataset) to build comprehensive knowledge graphs that capture medical entities, their relationships, and contextual information. The system combines entity-level graph extraction with vector database storage for both structured and semantic querying capabilities.

πŸ—οΈ System Architecture

Medical Text Input
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Text to Entity Extraction                  β”‚
β”‚    (LangExtract + Gemini)                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Entity-Relationship-Edge      β”‚
β”‚      JSON Generation            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Neo4j Graph    β”‚  Qdrant Vector β”‚
β”‚   Database      β”‚   Database     β”‚
β”‚  (Cypher)       β”‚ (Embeddings)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Query & Visualization         β”‚
β”‚     Interface                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”§ Core Components

1. Text to Entity Level Graph Converter (text_to_entity_level_graph_converter.py)

  • Purpose: Main orchestrator for the extraction pipeline
  • Key Functions:
    • content_extract(): Extracts medical entities, relations, and edges from text using Gemini 2.5 Flash
    • save_and_visualize_content(): Processes extracted data and populates both graph and vector databases
    • format_rows(): Processes PubMedQA CSV data in batches

Medical Entity Categories:

  • Entities: Diseases, drugs, procedures, biomarkers, cell types, anatomical structures
  • Relations: Causative (causes, leads_to), therapeutic (treats, prevents), quantitative (increased_in, decreased_in)
  • Edges: Specific entity-relationship-entity triples with confidence scores

2. Qdrant Vector Database Integration (ingest_to_qdrant.py)

  • Purpose: Vector storage and semantic search capabilities
  • Features:
    • Medical research metadata extraction using LangExtract
    • Automatic embedding generation with SentenceTransformer
    • Research paper categorization (study type, medical domain, sample size)
    • Async operations for scalability

Extracted Metadata Fields:

  • Research question
  • Study type (prospective, retrospective, randomized, etc.)
  • Medical domain (cardiology, neurology, surgery, etc.)
  • Sample size and population studied
  • Intervention type

3. Neo4j Knowledge Graph (neo4j_populator.py, entity_relation_json_to_cypher_query.py)

  • Purpose: Structured graph database for complex relationship queries
  • Features:
    • Automatic Cypher query generation from extracted JSON
    • Support for both CREATE and SEARCH operations
    • Medical terminology preservation
    • Relationship type classification

4. Search Playground (search_playground.py)

  • Purpose: Interactive querying interface
  • Functionality:
    • Real-time entity extraction from user queries
    • Dynamic Cypher query generation
    • Graph search capabilities

πŸ“‹ Installation

Prerequisites

  • Python 3.8+
  • Neo4j Database
  • Qdrant Vector Database
  • Google Gemini API Key

Setup

# Clone the repository
git clone <repository-url>
cd medical-knowledge-extraction

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Add your GEMINI_API_KEY to .env file

# Start Neo4j (default: bolt://localhost:7687)
# Start Qdrant (default: http://localhost:6333)

Environment Configuration

# .env file
GEMINI_API_KEY=your_gemini_api_key_here
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j@123
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=th3s3cr3tk3y

πŸš€ Usage

Processing Medical Literature

import asyncio
from text_to_entity_level_graph_converter import main

# Process PubMedQA dataset
asyncio.run(main())

Interactive Search

from search_playground import main

# Query the knowledge graph
query = "What is the spectrum of antimicrobial activity of amoxicillin?"
asyncio.run(main(query))

Manual Text Processing

from text_to_entity_level_graph_converter import content_extract, save_and_visualize_content

# Extract from custom text
medical_text = "ILC2s were increased in patients with CRSwNP..."
result = await content_extract(medical_text)
await save_and_visualize_content(result, context=medical_text, need_vis=True)

πŸ“Š Data Flow

  1. Input Processing: Medical texts from PubMedQA or custom sources
  2. Entity Extraction: Uses Gemini 2.5 Flash with specialized medical prompts
  3. Knowledge Graph Construction: Converts extracted entities to Cypher queries
  4. Vector Database Storage: Creates embeddings for semantic search
  5. Visualization: Generates interactive HTML visualizations

πŸ” Query Capabilities

Graph Queries (Neo4j)

  • Find all entities related to a specific disease
  • Trace treatment pathways and contraindications
  • Identify research gaps in medical literature
  • Map drug-disease relationships

Semantic Search (Qdrant)

  • Find similar research studies
  • Identify papers by methodology or population
  • Discover related medical domains
  • Search by research outcomes

πŸ“ˆ Example Output

Extracted Entities

{
  "extraction_class": "entity",
  "extraction_text": "ILC2s",
  "attributes": {"type": "cell_type", "scale": "entity"}
}

Relationships

{
  "extraction_class": "edge",
  "extraction_text": "(ILC2s, increased_in, CRSwNP)",
  "attributes": {"confidence": "high", "evidence_based": "true"}
}

Generated Cypher

CREATE
  (a:Entity {name: "ILC2s", type: "cell_type"})
  (b:Entity {name: "CRSwNP", type: "disease"})
  (a)-[:INCREASED_IN {confidence: "high"}]->(b)

πŸŽ›οΈ Configuration

Model Settings

  • Primary Model: Gemini 2.5 Flash for speed and accuracy
  • Embedding Model: all-MiniLM-L6-v2 for vector generation
  • Vector Dimension: 384 (configurable)

Processing Parameters

  • Batch Size: 2 papers per batch (configurable)
  • Confidence Threshold: High, Medium, Low classifications
  • Medical Domain Focus: Optimized for clinical and research literature

πŸ”§ Customization

Adding New Medical Domains

  1. Extend entity types in extraction prompt
  2. Add domain-specific relationship types
  3. Update metadata extraction categories
  4. Modify visualization templates

Custom Entity Types

# Add new entity categories
entity_types = [
    "biomarker", "genetic_variant", "pathway", 
    "clinical_trial_phase", "medical_device"
]

🚧 Troubleshooting

Common Issues

  • Neo4j Connection: Verify database is running and credentials are correct
  • Qdrant Setup: Ensure vector database is accessible at specified URL
  • API Limits: Check Gemini API quotas and rate limits
  • Memory Usage: Large datasets may require batch processing

Performance Optimization

  • Use async operations for concurrent processing
  • Implement connection pooling for database operations
  • Cache frequently accessed embeddings
  • Optimize Cypher queries with proper indexing

πŸ“š Dependencies

Core Libraries

  • langextract==1.0.8: Advanced text extraction with LLM integration
  • google-genai==1.30.0: Google Gemini API client
  • neo4j: Graph database driver
  • qdrant-client: Vector database client
  • sentence-transformers: Embedding generation

Supporting Libraries

  • pandas: Data manipulation
  • pydantic: Data validation
  • python-dotenv: Environment configuration
  • protobuf: Serialization support

🎯 Future Enhancements

  • Multi-language medical text support
  • Real-time knowledge graph updates
  • Advanced visualization dashboards
  • Integration with medical ontologies (UMLS, SNOMED CT)
  • Automated literature review generation
  • Clinical decision support features

πŸ“„ License

This project is designed for research and educational purposes in medical informatics and knowledge extraction under MIT license.

🀝 Contributing

Contributions welcome! Please focus on:

  • Medical domain expertise
  • Performance optimizations
  • New visualization features
  • Extended database integrations

πŸ“ž Support

For technical issues or medical domain questions, please create detailed issues with:

  • Input text samples
  • Expected vs actual outputs
  • System configuration details
  • Error logs and stack traces

About

this is a repository which can leverage LLM for knowledge graph creation and search along with semantic search.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages