[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/biomedical/02_Genomic_Variant_Analysis.ipynb)

# Genomic Variant Analysis - Graph Analytics & Pathway Analysis

## Overview

This notebook demonstrates **genomic variant analysis** using Semantica's modular architecture, with a focus on **graph analytics**, **pathway analysis**, and **temporal knowledge graphs**. The pipeline ingests genomic literature, extracts variant, gene, and disease entities, builds a temporal genomic knowledge graph, and analyzes disease associations through graph reasoning.

### Key Features

- **Graph Analytics Focus**: Emphasizes graph reasoning, centrality measures, and pathway analysis
- **Temporal Analysis**: Builds temporal genomic knowledge graphs to track variant evolution
- **Disease Association**: Analyzes relationships between variants, genes, and diseases
- **Pathway Analysis**: Uses graph traversal to identify biological pathways
- **Impact Prediction**: Predicts variant impact using graph-based reasoning

### What You'll Learn

- How to use Semantica modules directly for genomic analysis
- How to ingest genomic data from multiple sources
- How to extract variant, gene, and disease entities
- How to build temporal knowledge graphs
- How to perform graph analytics (centrality, communities)
- How to use temporal queries for variant evolution
- How to analyze pathways using reasoning
- How to visualize and export genomic knowledge graphs

### Pipeline Flow

```mermaid
graph LR
    A[Data Ingestion] --> B[Text Processing]
    B --> C[Entity Extraction]
    C --> D[Relationship Extraction]
    D --> E[Deduplication]
    E --> F[Temporal KG]
    F --> G[Graph Analytics]
    F --> H[Temporal Queries]
    G --> I[Pathway Analysis]
    H --> I
    I --> J[Disease Associations]
    J --> K[Visualization]
```

### Data Sources

**PubMed RSS Feeds:**
- Genetics, Genomics, Variant Analysis, GWAS
- Genomic Medicine, Precision Medicine, Pharmacogenomics
- Population Genetics, Evolutionary Genomics

**Preprint Servers:**
- BioRxiv (Genetics, Genomics)
- MedRxiv (Genomics, Clinical Genomics)

**Journal RSS Feeds:**
- Nature Genetics, Genome Research
- American Journal of Human Genetics
- Human Molecular Genetics

**Database Links (for reference):**
- [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) - Single Nucleotide Polymorphism database
- [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) - Clinical significance of variants
- [gnomAD](https://gnomad.broadinstitute.org/) - Genome Aggregation Database
- [1000 Genomes](https://www.internationalgenome.org/) - Human genetic variation
- [Ensembl](https://www.ensembl.org/) - Genome browser and annotation
- [UCSC Genome Browser](https://genome.ucsc.edu/) - Genome visualization
- [OMIM](https://www.omim.org/) - Online Mendelian Inheritance in Man

---


## Installation

Install Semantica and required dependencies:


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas groq sentence-transformers


## Configuration & Setup

Set up environment variables and configuration constants.


In [None]:
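import os
from getpass import getpass

# Never hardcode API keys in a notebook. Read the key from the environment,
# and prompt for it only if it is not already set.
if not os.getenv("GROQ_API_KEY"):
    os.environ["GROQ_API_KEY"] = getpass("Enter your Groq API key: ")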


In [None]:
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TEMPORAL_GRANULARITY = "day"


## Ingesting Genomic Data from Multiple Sources

Ingest data from comprehensive genomic sources including PubMed RSS feeds, preprint servers, and journal feeds.
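
The fallback applied to each feed item in the cell below (prefer the item's body text, otherwise fall back to its title) can be illustrated without Semantica. The following is a minimal sketch using only the standard library. `SAMPLE_RSS` and `items_from_rss` are hypothetical names invented for this illustration, the feed text is made up rather than fetched from any of the sources listed above, and the sketch makes no claim about how `FeedIngestor` works internally.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 snippet used only for illustration (not a real feed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Variant rs699 and hypertension</title>
      <description>rs699 in AGT is associated with hypertension.</description>
    </item>
    <item>
      <title>Title-only entry about APOE</title>
    </item>
    <item>
      <description></description>
    </item>
  </channel>
</rss>"""


def items_from_rss(xml_text, source_name):
    """Return one document dict per usable <item>, tagged with its source."""
    root = ET.fromstring(xml_text)
    documents = []
    for item in root.iter("item"):
        description = (item.findtext("description") or "").strip()
        title = (item.findtext("title") or "").strip()
        # Prefer the description; fall back to the title; skip empty items.
        content = description or title
        if content:
            documents.append({"content": content, "metadata": {"source": source_name}})
    return documents


docs = items_from_rss(SAMPLE_RSS, "Example feed")
for doc in docs:
    print(doc["metadata"]["source"], "|", doc["content"])
```

The third item has neither a description nor a title, so it is dropped, which mirrors the `if item.content:` guard in the notebook cell.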


In [None]:
from semantica.ingest import FeedIngestor, FileIngestor
import os
from contextlib import redirect_stderr
from io import StringIO

os.makedirs("data", exist_ok=True)

feed_sources = [
    # PubMed RSS Feeds
    ("PubMed - Genetics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=genetics&limit=15&sort=pub_date"),
    ("PubMed - Genomics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=genomics&limit=15&sort=pub_date"),
    ("PubMed - Variant Analysis", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=variant+analysis&limit=15&sort=pub_date"),
    ("PubMed - GWAS", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=GWAS&limit=15&sort=pub_date"),
    ("PubMed - Genomic Medicine", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=genomic+medicine&limit=15&sort=pub_date"),
    ("PubMed - Precision Medicine", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=precision+medicine&limit=15&sort=pub_date"),
    ("PubMed - Pharmacogenomics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=pharmacogenomics&limit=15&sort=pub_date"),
    ("PubMed - Population Genetics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=population+genetics&limit=15&sort=pub_date"),
    ("PubMed - Evolutionary Genomics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=evolutionary+genomics&limit=15&sort=pub_date"),
    ("PubMed - Clinical Genomics", "https://pubmed.ncbi.nlm.nih.gov/rss/search/1?term=clinical+genomics&limit=15&sort=pub_date"),
    
    # Preprint Servers
    ("BioRxiv - Genetics", "https://connect.biorxiv.org/biorxiv_xml.php?subject=genetics"),
    ("BioRxiv - Genomics", "https://connect.biorxiv.org/biorxiv_xml.php?subject=genomics"),
    ("MedRxiv - Genomics", "https://connect.medrxiv.org/medrxiv_xml.php?subject=genomics"),
    ("MedRxiv - Clinical Genomics", "https://connect.medrxiv.org/medrxiv_xml.php?subject=clinical_genomics"),
    
    # Journal RSS Feeds
    ("Nature Genetics", "https://www.nature.com/ng.rss"),
    ("Genome Research", "https://genome.cshlp.org/rss/current.xml"),
    ("American Journal of Human Genetics", "https://www.cell.com/ajhg.rss"),
    ("Human Molecular Genetics", "https://academic.oup.com/hmg/rss/current"),
]

feed_ingestor = FeedIngestor()
all_documents = []

for feed_name, feed_url in feed_sources:
    try:
        with redirect_stderr(StringIO()):
            feed_data = feed_ingestor.ingest_feed(feed_url, validate=False)
        
        for item in feed_data.items:
            if not item.content:
                item.content = item.description or item.title or ""
            if item.content:
                if not hasattr(item, 'metadata'):
                    item.metadata = {}
                item.metadata['source'] = feed_name
                all_documents.append(item)
    except Exception:
        # Skip feeds that are unreachable or malformed and keep going.
        continue

if not all_documents:
    variant_data = """
    Variant rs699 is located in the AGT gene and associated with hypertension.
    Variant rs7412 in APOE gene is linked to Alzheimer's disease risk.
    BRCA1 variant c.5266dupC increases breast cancer susceptibility.
    CFTR variant F508del causes cystic fibrosis.
    Variant rs1800566 in NAT2 gene affects drug metabolism.
    Variant rs1042713 in ADRB2 gene is associated with asthma response.
    TP53 variant R273H is linked to multiple cancer types.
    Variant rs1799853 in CYP2C9 gene affects warfarin metabolism.
    """
    
    with open("data/variants.txt", "w") as f:
        f.write(variant_data)
    
    file_ingestor = FileIngestor()
    all_documents = file_ingestor.ingest("data/variants.txt")

documents = all_documents
print(f"Ingested {len(documents)} documents")


## Normalizing and Chunking Genomic Documents

Clean and normalize text, then split into chunks using entity-aware chunking to preserve variant/gene entity boundaries.
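
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of plain character-based chunking with overlap. It is deliberately simpler than entity-aware chunking (it pays no attention to entity boundaries), and `chunk_text` is a hypothetical helper written for this illustration, not part of Semantica.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

With the notebook's settings (1000 characters per chunk, 200 characters of overlap), each window advances by 800 characters, so a 2500-character document yields three chunks under this scheme.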


In [None]:
from semantica.normalize import TextNormalizer
from semantica.split import TextSplitter

normalizer = TextNormalizer()
splitter = TextSplitter(
    method="entity_aware",
    ner_method="spacy",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True,
        lowercase=False
    )
    normalized_documents.append(normalized_text)

chunked_documents = []
for doc_text in normalized_documents:
    try:
        with redirect_stderr(StringIO()):
            chunks = splitter.split(doc_text)
        chunked_documents.extend(chunks)
    except Exception:
        # Fall back to recursive splitting if entity-aware chunking fails.
        simple_splitter = TextSplitter(method="recursive", chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
        chunks = simple_splitter.split(doc_text)
        chunked_documents.extend(chunks)


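## Extracting Genomic Entities

Extract variant, gene, disease, pathway, and protein entities from each chunk using an LLM-based NER extractor.


In [None]: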
from semantica.semantic_extract import NERExtractor

entity_extractor = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_entities = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        entities = entity_extractor.extract_entities(
            chunk_text,
            entity_types=["Variant", "Gene", "Disease", "Pathway", "Protein"]
        )
        all_entities.extend(entities)
    except Exception:
        continue

# Match labels case-insensitively, since LLM output may vary in casing.
variants = [e for e in all_entities if "variant" in e.label.lower()]
genes = [e for e in all_entities if "gene" in e.label.lower()]
diseases = [e for e in all_entities if "disease" in e.label.lower()]

print(f"Extracted {len(variants)} variants, {len(genes)} genes, {len(diseases)} diseases")


## Extracting Genomic Relationships

Extract relationships between variants, genes, and diseases to understand genomic associations.
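
LLM-based relation extraction can return relation types or entity names that were never requested. One way to guard against this, sketched below under stated assumptions, is to keep only triples whose relation type is in the allowed list and whose endpoints are known entities. `filter_relations` and the `(source, relation, target)` tuple layout are hypothetical choices for this illustration; the objects returned by `RelationExtractor` may be shaped differently.

```python
# Relation types taken from the notebook's extraction cell.
ALLOWED_RELATIONS = {
    "associated_with", "located_in", "causes",
    "increases_risk", "affects", "linked_to",
}


def filter_relations(relations, entity_names, allowed=ALLOWED_RELATIONS):
    """Keep (source, relation, target) triples with a known type and known endpoints."""
    known = {name.lower() for name in entity_names}
    kept = []
    for source, relation, target in relations:
        if relation not in allowed:
            continue  # relation type was not requested
        if source.lower() not in known or target.lower() not in known:
            continue  # an endpoint is not among the extracted entities
        kept.append((source, relation, target))
    return kept


entity_names = ["rs699", "AGT", "hypertension"]
candidate_relations = [
    ("rs699", "located_in", "AGT"),
    ("rs699", "associated_with", "hypertension"),
    ("rs699", "treats", "hypertension"),       # hypothetical noisy output: unknown type
    ("rs699", "associated_with", "diabetes"),  # hypothetical noisy output: unknown entity
]
clean_relations = filter_relations(candidate_relations, entity_names)
print(clean_relations)
```

Only the first two triples survive, which correspond to the sample sentence about rs699 in the fallback data.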


In [None]:
from semantica.semantic_extract import RelationExtractor

relation_extractor = RelationExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0
)

all_relationships = []
for chunk in chunked_documents:
    chunk_text = chunk.text if hasattr(chunk, 'text') else str(chunk)
    try:
        relationships = relation_extractor.extract_relations(
            chunk_text,
            entities=all_entities,
            relation_types=["associated_with", "located_in", "causes", "increases_risk", "affects", "linked_to"]
        )
        all_relationships.extend(relationships)
    except Exception:
        continue

print(f"Extracted {len(all_relationships)} relationships")


## Resolving Duplicate Variants

Detect and merge duplicate variants to ensure data quality and consistency.
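
The idea of threshold-based duplicate detection can be sketched with the standard library. Note the substitution: the cell below uses a semantic method, whereas this sketch uses purely lexical similarity from `difflib`, so it illustrates the role of the threshold rather than the detector itself. `cluster_by_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def cluster_by_similarity(names, threshold=0.85):
    """Greedily group names whose similarity to a cluster's first member meets the threshold."""
    clusters = []
    for name in names:
        for cluster in clusters:
            representative = cluster[0]
            score = SequenceMatcher(None, representative.lower(), name.lower()).ratio()
            if score >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([name])
    return clusters


names = ["BRCA1", "brca1", "BRCA-1", "TP53", "rs699"]
clusters = cluster_by_similarity(names)
print(clusters)
```

A caveat worth keeping in mind: identifiers that differ by a single character (for example two different rs IDs) are distinct variants, yet they can score above a lexical threshold. For identifier-like entities, exact matching after normalization is usually the safer choice.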


In [None]:
from semantica.deduplication import DuplicateDetector

duplicate_detector = DuplicateDetector(
    similarity_threshold=0.85,
    method="semantic"
)

# detect_duplicates identifies duplicate candidates; merge_duplicates collapses them.
duplicate_candidates = duplicate_detector.detect_duplicates(all_entities)
merged_entities = duplicate_detector.merge_duplicates(duplicate_candidates)

print(f"Deduplicated {len(all_entities)} entities to {len(merged_entities)} unique entities")


## Building Temporal Genomic Knowledge Graph

Construct a temporal knowledge graph from extracted entities and relationships to enable time-aware analysis and variant evolution tracking.
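
Temporal granularity determines how precisely timestamps are compared: at `"day"` granularity, two facts observed at different times on the same day fall into the same bucket. The sketch below illustrates that idea with the standard library. `truncate` and `bucket_edges` are hypothetical helpers, the entity names and timestamps are placeholders, and nothing here reflects how `GraphBuilder` stores temporal data.

```python
from collections import defaultdict
from datetime import datetime


def truncate(ts, granularity):
    """Round a timestamp down to the start of its day, month, or year."""
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if granularity == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"Unsupported granularity: {granularity!r}")


def bucket_edges(timestamped_edges, granularity="day"):
    """Group (source, relation, target, timestamp) edges by truncated timestamp."""
    buckets = defaultdict(list)
    for source, relation, target, ts in timestamped_edges:
        buckets[truncate(ts, granularity)].append((source, relation, target))
    return dict(buckets)


# Placeholder edges with made-up timestamps.
edges = [
    ("variant_1", "associated_with", "GENE_A", datetime(2024, 1, 1, 9, 30)),
    ("GENE_A", "causes", "disease_X", datetime(2024, 1, 1, 17, 5)),
    ("variant_2", "located_in", "GENE_B", datetime(2024, 1, 2, 8, 0)),
]
by_day = bucket_edges(edges, "day")
for bucket, bucket_edges_list in sorted(by_day.items()):
    print(bucket.date(), bucket_edges_list)
```

Coarser granularity merges more facts into the same bucket: with `"month"`, all three placeholder edges would share a single key.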


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    resolve_conflicts=True,
    entity_resolution_strategy="fuzzy",
    enable_temporal=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

kg_sources = [{
    "entities": [{"text": e.text, "type": e.label, "confidence": e.confidence} for e in merged_entities],
    "relationships": [{"source": r.source, "target": r.target, "type": r.label, "confidence": r.confidence} for r in all_relationships]
}]

kg = graph_builder.build(kg_sources)

entities_count = len(kg.get('entities', []))
relationships_count = len(kg.get('relationships', []))
print(f"Graph: {entities_count} entities, {relationships_count} relationships")


## Analyzing Graph Structure

Perform comprehensive graph analytics including centrality measures, community detection, and connectivity analysis.
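
For readers who want to see what these measures compute, the following sketch implements degree centrality, density, and connected components for a small undirected graph in plain Python. The function names are hypothetical and independent of Semantica's `GraphAnalyzer` and `CentralityCalculator`; the edges are taken from the first two sentences of the fallback sample data. The sketch assumes a simple graph with no self-loops.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def density(adjacency):
    """Fraction of possible undirected edges that are present: 2m / (n(n - 1))."""
    n = len(adjacency)
    if n <= 1:
        return 0.0
    m = sum(len(neighbors) for neighbors in adjacency.values()) / 2
    return 2 * m / (n * (n - 1))


def connected_components(adjacency):
    """Return the node sets of each connected component, found by breadth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


edges = [
    ("rs699", "AGT"), ("AGT", "hypertension"),
    ("rs7412", "APOE"), ("APOE", "Alzheimer's disease"),
]
graph = build_adjacency(edges)
centrality = degree_centrality(graph)
print(f"AGT degree centrality: {centrality['AGT']:.3f}")
print(f"Density: {density(graph):.3f}")
print(f"Components: {len(connected_components(graph))}")
```

In this toy graph the gene nodes have the highest degree centrality because each one bridges a variant and a disease, and the two gene neighborhoods form separate components.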


In [None]:
from semantica.kg import GraphAnalyzer, CentralityCalculator, CommunityDetector

graph_analyzer = GraphAnalyzer()
centrality_calc = CentralityCalculator()
community_detector = CommunityDetector()

analysis = graph_analyzer.analyze_graph(kg)

degree_centrality = centrality_calc.calculate_degree_centrality(kg)
betweenness_centrality = centrality_calc.calculate_betweenness_centrality(kg)
closeness_centrality = centrality_calc.calculate_closeness_centrality(kg)

communities = community_detector.detect_communities(kg, method="louvain")
connectivity = graph_analyzer.analyze_connectivity(kg)

print("Graph analytics:")
print(f"  - Communities: {len(communities)}")
print(f"  - Connected components: {len(connectivity.get('components', []))}")
print(f"  - Graph density: {analysis.get('density', 0):.3f}")


## Temporal Graph Queries

Query the temporal knowledge graph at specific time points, analyze temporal evolution, and detect temporal patterns.
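
A point-in-time query can be thought of as filtering edges by their validity interval. The sketch below assumes each edge carries a `valid_from` date and an optional `valid_to` date, treated as a half-open interval. This data layout and the `edges_at` helper are assumptions made for illustration, and the entity names and dates are placeholders; `TemporalGraphQuery` may represent time differently.

```python
from datetime import date


def edges_at(edges, at_time):
    """Return edges valid at the ISO date at_time (valid_from <= t < valid_to)."""
    when = date.fromisoformat(at_time)
    return [
        (source, relation, target)
        for source, relation, target, valid_from, valid_to in edges
        if valid_from <= when and (valid_to is None or when < valid_to)
    ]


# Placeholder edges: (source, relation, target, valid_from, valid_to).
edges = [
    ("variant_1", "associated_with", "GENE_A", date(2023, 6, 1), None),
    ("variant_2", "linked_to", "disease_X", date(2024, 3, 1), None),
    ("variant_3", "affects", "GENE_B", date(2022, 1, 1), date(2023, 12, 31)),
]
snapshot = edges_at(edges, "2024-01-01")
print(snapshot)
```

Only the first edge is returned: the second is not yet valid on the query date, and the third has already expired.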


In [None]:
from semantica.kg import TemporalGraphQuery

temporal_query = TemporalGraphQuery(
    enable_temporal_reasoning=True,
    temporal_granularity=TEMPORAL_GRANULARITY
)

query_results = temporal_query.query_at_time(
    kg,
    query={"type": "Variant"},
    at_time="2024-01-01"
)

evolution = temporal_query.analyze_evolution(kg)
temporal_patterns = temporal_query.detect_temporal_patterns(kg, pattern_type="sequence")

print(f"Temporal queries: {len(query_results)} variants at query time")
print(f"Temporal patterns detected: {len(temporal_patterns)}")


## Pathway Analysis & Reasoning

Use graph reasoning to find pathways between variants and diseases, and infer biological pathways through logical reasoning.
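
The inference rule used in this section (a variant associated with a gene that causes a disease is inferred to increase risk for that disease) amounts to a join over triples. Here is a minimal sketch of that single rule in plain Python. `infer_increased_risk` is a hypothetical function, the entity names are placeholders, and unlike a typed rule engine the sketch does not check entity types.

```python
def infer_increased_risk(triples):
    """Apply: X associated_with G and G causes D  =>  X increases_risk D."""
    associated = [(s, o) for s, p, o in triples if p == "associated_with"]
    causes = {}
    for s, p, o in triples:
        if p == "causes":
            causes.setdefault(s, set()).add(o)

    existing = set(triples)
    inferred = set()
    for variant, gene in associated:
        for disease in causes.get(gene, ()):
            fact = (variant, "increases_risk", disease)
            if fact not in existing:  # report only facts that are genuinely new
                inferred.add(fact)
    return sorted(inferred)


# Placeholder triples used only to exercise the rule.
triples = [
    ("variant_1", "associated_with", "GENE_A"),
    ("GENE_A", "causes", "disease_X"),
    ("variant_2", "associated_with", "GENE_B"),
    ("variant_3", "associated_with", "GENE_A"),
    ("variant_3", "increases_risk", "disease_X"),
]
new_facts = infer_increased_risk(triples)
print(new_facts)
```

`variant_2` yields nothing because `GENE_B` has no `causes` edge, and the fact for `variant_3` is already present, so only one new fact is inferred.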


In [None]:
from semantica.reasoning import Reasoner

reasoner = Reasoner()

pathways = reasoner.find_paths(
    kg,
    source_type="Variant",
    target_type="Disease",
    max_hops=3
)

reasoner.add_rule("IF Variant associated_with Gene AND Gene causes Disease THEN Variant increases_risk Disease")
inferred_facts = reasoner.infer_facts(kg)

print(f"Pathway analysis: {len(pathways)} variant-disease pathways found")
print(f"Inferred facts: {len(inferred_facts)}")


## Analyzing Disease Associations

Use graph traversal to find variant-disease associations within two hops, then rank them by the extraction confidence of each variant.
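
Hop-limited traversal can be sketched as a breadth-first search that stops extending a path once it reaches the hop limit or a node of the target type. The `paths_to_type` helper, the directed adjacency map, and the placeholder node names below are assumptions for illustration and do not describe Semantica's path-finding API.

```python
from collections import deque


def paths_to_type(adjacency, node_types, source, target_type, max_hops):
    """Return simple paths from source to any node of target_type within max_hops edges."""
    results = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if len(path) > 1 and node_types.get(node) == target_type:
            results.append(path)
            continue  # stop at the first target node on this path
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:  # avoid revisiting nodes (no cycles)
                queue.append(path + [neighbor])
    return results


# Placeholder graph: variant_1 -> GENE_A -> disease_X, and GENE_A -> GENE_B -> disease_Y.
adjacency = {
    "variant_1": ["GENE_A"],
    "GENE_A": ["disease_X", "GENE_B"],
    "GENE_B": ["disease_Y"],
}
node_types = {
    "variant_1": "Variant", "GENE_A": "Gene", "GENE_B": "Gene",
    "disease_X": "Disease", "disease_Y": "Disease",
}
two_hop_paths = paths_to_type(adjacency, node_types, "variant_1", "Disease", max_hops=2)
print(two_hop_paths)
```

With `max_hops=2` only `disease_X` is reachable; raising the limit to 3 would also surface the longer path to `disease_Y`, which is why path length is worth recording alongside each association.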


In [None]:
disease_associations = []
for variant in variants[:10]:
    variant_name = variant.text
    paths = graph_analyzer.find_paths(
        kg,
        source=variant_name,
        target_type="Disease",
        max_hops=2
    )
    for path in paths:
        if path.get('target_type') == 'Disease':
            disease_associations.append({
                'variant': variant_name,
                'disease': path.get('target'),
                'path_length': len(path.get('path', [])),
                'confidence': variant.confidence
            })

disease_associations.sort(key=lambda x: x['confidence'], reverse=True)

print("Top disease associations:")
for i, assoc in enumerate(disease_associations[:5], 1):
    print(f"{i}. {assoc['variant']} -> {assoc['disease']} (confidence: {assoc['confidence']:.3f})")


## Visualizing the Temporal Knowledge Graph

Generate an interactive visualization of the temporal genomic knowledge graph.


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(
    kg,
    output_path="genomic_variant_kg.html",
    layout="hierarchical",
    node_size=20
)

print("Visualization saved to genomic_variant_kg.html")


## Exporting Results

Export the temporal knowledge graph to various formats for further analysis or integration with other tools.


In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, output_path="genomic_variant_kg.json", format="json")
exporter.export(kg, output_path="genomic_variant_kg.graphml", format="graphml")

print("Exported knowledge graph to JSON and GraphML formats")
