# Real-World Multi-Source Integration: Python Ecosystem Analysis

This cookbook demonstrates how to build a **Knowledge Graph of the Python Ecosystem** using **real-world data** from live sources.

We will ingest data from:
1.  **Official Website (Web)**: `https://www.python.org/` (using `WebIngestor`)
2.  **Package Registry (API)**: PyPI JSON API for `pandas` (using `RESTIngestor` + `FileIngestor`)
3.  **Source Code (Repo)**: CPython GitHub Repository (using Raw Content)
4.  **Database (DB)**: Local SQLite metrics (using `DBIngestor`)
5.  **Live Search (MCP)**: Real-time search via Model Context Protocol (using `MCPIngestor`)

**Goal**: Construct a unified graph linking Python, key libraries, source code, and live context.

In [19]:
# Installation & Setup
!pip install -q semantica requests beautifulsoup4 fastmcp networkx fastembed


unclosed file <_io.BufferedWriter name=3>


unclosed file <_io.BufferedReader name=4>


unclosed file <_io.BufferedReader name=5>



In [20]:
import os
import json
import requests
import tempfile
import logging
from datetime import datetime

# Semantica Imports
from semantica.ingest import WebIngestor, FileIngestor, RESTIngestor, MCPIngestor, DBIngestor
from semantica.kg import GraphBuilder
from semantica.visualization import KGVisualizer

# Setup Workspace
WORKSPACE_DIR = tempfile.mkdtemp()
print(f"Workspace created at: {WORKSPACE_DIR}")

# Configure Logging to show ingestion progress
logging.basicConfig(level=logging.INFO, format='%(name)s - %(levelname)s - %(message)s')

ImportError: cannot import name 'RESTIngestor' from 'semantica.ingest' (C:\Users\Mohd Kaif\semantica\semantica\ingest\__init__.py)

## Source 1: Official Website (Web Ingestion)

We use `WebIngestor` to crawl the official Python homepage. This demonstrates handling unstructured HTML content.

In [None]:
print("--- 1. Ingesting Web Sources: Python.org + Docs + PEPs ---")

try:
    web_ingestor = WebIngestor(delay=0.5)

    web_targets = {
        "python_home": "https://www.python.org/",
        "psf": "https://www.python.org/psf/",
        "python_docs": "https://docs.python.org/3/",
        "whatsnew_313": "https://docs.python.org/3/whatsnew/3.13.html",
        "asyncio_docs": "https://docs.python.org/3/library/asyncio.html",
        "typing_docs": "https://docs.python.org/3/library/typing.html",
        "pep_703": "https://peps.python.org/pep-0703/",
        "pep_8": "https://peps.python.org/pep-0008/",
        "pep_484": "https://peps.python.org/pep-0484/",
        "packaging_guide": "https://packaging.python.org/en/latest/",
    }

    web_pages = {}
    for key, url in web_targets.items():
        page = web_ingestor.ingest_url(url)
        web_pages[key] = page
        title = getattr(page, "title", "")
        text = getattr(page, "text", "")
        print(f"Ingested {key}: {url}")
        print(f"  Title: {title}")
        print(f"  Content Length: {len(text)} characters")

    entities = [
        {
            "id": "Python",
            "name": "Python",
            "type": "ProgrammingLanguage",
            "properties": {
                "website": web_targets["python_home"],
                "docs": web_targets["python_docs"],
            },
            "source": "python_web",
        },
        {
            "id": "CPython",
            "name": "CPython",
            "type": "Interpreter",
            "properties": {"repo_url": "https://github.com/python/cpython"},
            "source": "python_web",
        },
        {
            "id": "Python Software Foundation",
            "name": "Python Software Foundation",
            "type": "Organization",
            "properties": {"url": web_targets["psf"]},
            "source": "python_web",
        },
        {
            "id": "Python Documentation",
            "name": "Python Documentation",
            "type": "Documentation",
            "properties": {"url": web_targets["python_docs"]},
            "source": "python_docs",
        },
        {
            "id": "Python Packaging User Guide",
            "name": "Python Packaging User Guide",
            "type": "Documentation",
            "properties": {"url": web_targets["packaging_guide"]},
            "source": "python_docs",
        },
        {
            "id": "Python 3.13",
            "name": "Python 3.13",
            "type": "SoftwareVersion",
            "properties": {"release_notes": web_targets["whatsnew_313"]},
            "source": "python_docs",
        },
        {
            "id": "asyncio",
            "name": "asyncio",
            "type": "StdlibModule",
            "properties": {"docs_url": web_targets["asyncio_docs"]},
            "source": "python_docs",
        },
        {
            "id": "typing",
            "name": "typing",
            "type": "StdlibModule",
            "properties": {"docs_url": web_targets["typing_docs"]},
            "source": "python_docs",
        },
        {
            "id": "PEP 703",
            "name": "PEP 703",
            "type": "PEP",
            "properties": {
                "url": web_targets["pep_703"],
                "title": getattr(web_pages.get("pep_703"), "title", ""),
            },
            "source": "pep_site",
        },
        {
            "id": "PEP 8",
            "name": "PEP 8",
            "type": "PEP",
            "properties": {"url": web_targets["pep_8"], "title": getattr(web_pages.get("pep_8"), "title", "")},
            "source": "pep_site",
        },
        {
            "id": "PEP 484",
            "name": "PEP 484",
            "type": "PEP",
            "properties": {"url": web_targets["pep_484"], "title": getattr(web_pages.get("pep_484"), "title", "")},
            "source": "pep_site",
        },
        {
            "id": "Global Interpreter Lock",
            "name": "Global Interpreter Lock",
            "type": "Concept",
            "properties": {"abbrev": "GIL"},
            "source": "pep_site",
        },
        {
            "id": "Type Hints",
            "name": "Type Hints",
            "type": "Concept",
            "properties": {},
            "source": "pep_site",
        },
        {
            "id": "No-GIL Build",
            "name": "No-GIL Build",
            "type": "Feature",
            "properties": {"description": "CPython build configuration without the GIL"},
            "source": "pep_site",
        },
        {
            "id": "Free-Threaded Python",
            "name": "Free-Threaded Python",
            "type": "Feature",
            "properties": {"description": "Python builds that allow threads without a global lock"},
            "source": "pep_site",
        },
        {
            "id": "Packaging",
            "name": "Packaging",
            "type": "Concept",
            "properties": {},
            "source": "python_docs",
        },
    ]

    relationships = [
        {"source": "Python Software Foundation", "target": "Python", "type": "governs"},
        {"source": "Python Software Foundation", "target": "CPython", "type": "supports"},
        {"source": "CPython", "target": "Python", "type": "implements"},
        {"source": "Python Documentation", "target": "Python", "type": "documents"},
        {"source": "Python Documentation", "target": "Python 3.13", "type": "documents"},
        {"source": "Python Documentation", "target": "asyncio", "type": "documents"},
        {"source": "Python Documentation", "target": "typing", "type": "documents"},
        {"source": "Python 3.13", "target": "Python", "type": "version_of"},
        {"source": "asyncio", "target": "Python", "type": "stdlib_of"},
        {"source": "typing", "target": "Python", "type": "stdlib_of"},
        {"source": "Python Packaging User Guide", "target": "Packaging", "type": "documents"},
        {"source": "Python Packaging User Guide", "target": "PyPI", "type": "mentions"},
        {"source": "PEP 703", "target": "No-GIL Build", "type": "proposes"},
        {"source": "PEP 703", "target": "Free-Threaded Python", "type": "proposes"},
        {"source": "PEP 703", "target": "Global Interpreter Lock", "type": "discusses"},
        {"source": "No-GIL Build", "target": "Python 3.13", "type": "planned_for"},
        {"source": "PEP 703", "target": "CPython", "type": "targets"},
        {"source": "PEP 8", "target": "Python", "type": "style_guide_for"},
        {"source": "PEP 484", "target": "Type Hints", "type": "introduces"},
        {"source": "Type Hints", "target": "typing", "type": "implemented_by"},
    ]

    source_web = {
        "name": "Python Web + Docs + PEPs + Packaging",
        "type": "unstructured_web",
        "entities": entities,
        "relationships": relationships,
    }

except Exception as e:
    print(f"Web Ingestion Failed: {e}")
    source_web = {
        "name": "Python Web + Docs + PEPs + Packaging (Offline)",
        "type": "unstructured_web",
        "entities": [
            {"id": "Python", "name": "Python", "type": "ProgrammingLanguage", "properties": {}, "source": "offline"},
            {"id": "Python Documentation", "name": "Python Documentation", "type": "Documentation", "properties": {}, "source": "offline"},
            {"id": "CPython", "name": "CPython", "type": "Interpreter", "properties": {}, "source": "offline"},
            {"id": "Python Software Foundation", "name": "Python Software Foundation", "type": "Organization", "properties": {}, "source": "offline"},
            {"id": "PEP 703", "name": "PEP 703", "type": "PEP", "properties": {}, "source": "offline"},
            {"id": "Global Interpreter Lock", "name": "Global Interpreter Lock", "type": "Concept", "properties": {}, "source": "offline"},
            {"id": "Type Hints", "name": "Type Hints", "type": "Concept", "properties": {}, "source": "offline"},
        ],
        "relationships": [
            {"source": "Python Software Foundation", "target": "Python", "type": "governs"},
            {"source": "CPython", "target": "Python", "type": "implements"},
            {"source": "Python Documentation", "target": "Python", "type": "documents"},
            {"source": "PEP 703", "target": "Global Interpreter Lock", "type": "discusses"},
            {"source": "Type Hints", "target": "Python", "type": "feature_of"}
        ],
    }


--- 1. Ingesting Web Source: Python.org ---


Status,Action,Module,Submodule,File,Time
âœ…,Semantica is building,ðŸ§ kg,CentralityCalculator,-,0.03s
âœ…,Semantica is building,ðŸ§ kg,CommunityDetector,-,0.03s
âœ…,Semantica is exporting,ðŸ’¾ export,GraphExporter,python_ecosystem_kg.json,0.01s
âœ…,Semantica is exporting,ðŸ’¾ export,GraphExporter,python_ecosystem.graphml,0.01s
âœ…,Semantica is processing,ðŸ”— context,ContextGraph,-,0.01s
âœ…,Semantica is processing,ðŸ”— context,ContextRetriever,-,0.05s
âœ…,Semantica is processing,ðŸ”— context,AgentMemory,-,0.03s
âœ…,Semantica is embedding,ðŸ’¾ embeddings,TextEmbedder,-,0.01s
âœ…,Semantica is indexing,ðŸ“Š vector_store,VectorStore,-,0.01s
âœ…,Semantica is visualizing,ðŸ“ˆ visualization,KGVisualizer,-,0.27s


semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: WebIngestor | File: www.python.org | Message: URL: https://www.python.org/
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: WebIngestor | File: www.python.org | Message: Ingested https://www.python.org/ (200)


Successfully ingested: https://www.python.org/
Title: Welcome to Python.org
Content Length: 6453 characters
Preview: Welcome to Python.org Notice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. Skip to content...


## Source 2: Package Registry (API -> File Ingestion)

We fetch live metadata for the `pandas` library from PyPI's JSON API. We save this as a JSON file and then ingest it using `FileIngestor` to demonstrate structured file handling.

In [None]:
print("\n--- 2. Ingesting API Sources: PyPI (pandas, numpy, scipy, matplotlib, scikit-learn, requests, fastapi) ---")

import re
from urllib.parse import urlparse

packages = ["pandas", "numpy", "scipy", "matplotlib", "scikit-learn", "requests", "fastapi"]
file_ingestor = FileIngestor()
api_ingestor = RESTIngestor(timeout=30)

entities = [
    {"id": "PyPI", "name": "PyPI", "type": "PackageRegistry", "properties": {"url": "https://pypi.org/"}, "source": "pypi_api"},
    {"id": "GitHub", "name": "GitHub", "type": "Platform", "properties": {"url": "https://github.com/"}, "source": "pypi_api"},
]
relationships = []

seen_entities = {e["id"] for e in entities}

def _dep_name(req: str) -> str:
    if not req:
        return ""
    req = req.split(";")[0].strip()
    m = re.match(r"^([A-Za-z0-9_.-]+)", req)
    return (m.group(1) if m else "").strip()

def _url_entity_id(url: str) -> str:
    return f"URL::{url.strip()}"

def _add_url_entity(url: str, label: str, source: str) -> str:
    url = (url or "").strip()
    if not url:
        return ""
    uid = _url_entity_id(url)
    if uid not in seen_entities:
        entities.append({"id": uid, "name": label or url, "type": "WebResource", "properties": {"url": url}, "source": source})
        seen_entities.add(uid)
    return uid

def _is_github(url: str) -> bool:
    try:
        return urlparse(url).netloc.lower().endswith("github.com")
    except Exception:
        return False

try:
    for pkg in packages:
        pypi_url = f"https://pypi.org/pypi/{pkg}/json"
        local_json_path = os.path.join(WORKSPACE_DIR, f"{pkg}_pypi.json")

        api_data = api_ingestor.ingest_endpoint(pypi_url)
        data = api_data.data if isinstance(api_data.data, dict) else {}

        with open(local_json_path, "w", encoding="utf-8") as f:
            json.dump(data, f)

        ingested_file = file_ingestor.ingest_file(local_json_path)
        print(f"Ingested File: {ingested_file.name} ({ingested_file.size} bytes)")

        info = data.get("info", {})
        project_urls = info.get("project_urls") or {}

        if pkg not in seen_entities:
            entities.append(
                {
                    "id": pkg,
                    "name": pkg,
                    "type": "Library",
                    "properties": {
                        "version": info.get("version"),
                        "summary": info.get("summary"),
                        "license": info.get("license"),
                        "requires_python": info.get("requires_python"),
                    },
                    "source": "pypi_api",
                }
            )
            seen_entities.add(pkg)

        relationships.append({"source": pkg, "target": "PyPI", "type": "published_on"})
        relationships.append({"source": pkg, "target": "Python", "type": "written_in"})

        home_page = (info.get("home_page") or "").strip()
        if home_page:
            hp_id = _add_url_entity(home_page, f"{pkg} homepage", "pypi_api")
            if hp_id:
                relationships.append({"source": pkg, "target": hp_id, "type": "has_homepage"})

        normalized_project_urls = {}
        for k, v in project_urls.items():
            if not k or not v:
                continue
            normalized_project_urls[str(k).strip().lower()] = str(v).strip()

        for key, url in normalized_project_urls.items():
            if not url:
                continue
            label = f"{pkg} {key}"
            url_id = _add_url_entity(url, label, "pypi_api")
            if not url_id:
                continue
            if "doc" in key:
                relationships.append({"source": pkg, "target": url_id, "type": "has_documentation"})
            elif "bug" in key or "issue" in key:
                relationships.append({"source": pkg, "target": url_id, "type": "issues_at"})
            elif "source" in key or "github" in key:
                relationships.append({"source": pkg, "target": url_id, "type": "has_source"})
            else:
                relationships.append({"source": pkg, "target": url_id, "type": "related_resource"})
            if _is_github(url):
                relationships.append({"source": url_id, "target": "GitHub", "type": "hosted_on"})

        requires_dist = info.get("requires_dist") or []
        dep_names = []
        for req in requires_dist:
            name = _dep_name(req)
            if name:
                dep_names.append(name)

        unique_deps = sorted(set(dep_names))[:12]
        for dep in unique_deps:
            if dep not in seen_entities:
                entities.append({"id": dep, "name": dep, "type": "Library", "properties": {}, "source": "pypi_requires_dist"})
                seen_entities.add(dep)
            relationships.append({"source": pkg, "target": dep, "type": "depends_on"})

        print(f"Extracted Entity: {pkg} (v{info.get('version')}) with {len(unique_deps)} dependencies (capped)")

    for rel in [
        {"source": "pandas", "target": "numpy", "type": "built_on"},
        {"source": "scipy", "target": "numpy", "type": "built_on"},
        {"source": "matplotlib", "target": "numpy", "type": "built_on"},
        {"source": "scikit-learn", "target": "numpy", "type": "built_on"},
        {"source": "scikit-learn", "target": "scipy", "type": "built_on"},
    ]:
        relationships.append(rel)

    source_api = {
        "name": "PyPI Registry",
        "type": "structured_api",
        "entities": entities,
        "relationships": relationships,
    }

except Exception as e:
    print(f"API Ingestion Failed: {e}")
    source_api = {
        "name": "PyPI Registry (Offline)",
        "type": "structured_api",
        "entities": [
            {"id": "PyPI", "name": "PyPI", "type": "PackageRegistry", "properties": {"url": "https://pypi.org/"}, "source": "offline"},
            {"id": "GitHub", "name": "GitHub", "type": "Platform", "properties": {"url": "https://github.com/"}, "source": "offline"},
            {"id": "pandas", "name": "pandas", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "numpy", "name": "numpy", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "scipy", "name": "scipy", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "matplotlib", "name": "matplotlib", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "scikit-learn", "name": "scikit-learn", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "requests", "name": "requests", "type": "Library", "properties": {}, "source": "offline"},
            {"id": "fastapi", "name": "fastapi", "type": "Library", "properties": {}, "source": "offline"},
        ],
        "relationships": [
            {"source": "pandas", "target": "PyPI", "type": "published_on"},
            {"source": "numpy", "target": "PyPI", "type": "published_on"},
            {"source": "scipy", "target": "PyPI", "type": "published_on"},
            {"source": "matplotlib", "target": "PyPI", "type": "published_on"},
            {"source": "scikit-learn", "target": "PyPI", "type": "published_on"},
            {"source": "requests", "target": "PyPI", "type": "published_on"},
            {"source": "fastapi", "target": "PyPI", "type": "published_on"},
            {"source": "pandas", "target": "Python", "type": "written_in"},
            {"source": "numpy", "target": "Python", "type": "written_in"},
            {"source": "scipy", "target": "Python", "type": "written_in"},
            {"source": "matplotlib", "target": "Python", "type": "written_in"},
            {"source": "scikit-learn", "target": "Python", "type": "written_in"},
            {"source": "requests", "target": "Python", "type": "written_in"},
            {"source": "fastapi", "target": "Python", "type": "written_in"},
            {"source": "pandas", "target": "numpy", "type": "built_on"},
            {"source": "scipy", "target": "numpy", "type": "built_on"},
            {"source": "scikit-learn", "target": "numpy", "type": "built_on"},
            {"source": "scikit-learn", "target": "scipy", "type": "built_on"}
        ],
    }



--- 2. Ingesting API Source: PyPI (Pandas) ---


semantica.file_ingestor - INFO - File ingestor initialized
semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: FileIngestor | File: pandas_pypi.json | Message: File: pandas_pypi.json
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: FileIngestor | File: pandas_pypi.json | Message: Ingested pandas_pypi.json (json)


Ingested File: pandas_pypi.json
Size: 1831378 bytes
Extracted Entity: pandas (v2.3.3)


## Source 3: Source Code Repository (Raw Content)

We fetch the raw `README.rst` from the official CPython GitHub repository. This represents unstructured technical documentation.

In [21]:
print("\n--- 3. Ingesting Repo Sources: CPython + pandas + NumPy + SciPy (GitHub) ---")

try:
    repo_ingestor = WebIngestor(delay=0.5)

    headers = {"Accept": "application/vnd.github+json", "User-Agent": "semantica-cookbook"}

    repo_targets = {
        "python/cpython": {"readme_raw": "https://raw.githubusercontent.com/python/cpython/main/README.rst", "library": "Python"},
        "pandas-dev/pandas": {"readme_raw": "https://raw.githubusercontent.com/pandas-dev/pandas/main/README.md", "library": "pandas"},
        "numpy/numpy": {"readme_raw": "https://raw.githubusercontent.com/numpy/numpy/main/README.md", "library": "numpy"},
        "scipy/scipy": {"readme_raw": "https://raw.githubusercontent.com/scipy/scipy/main/README.rst", "library": "scipy"},
    }

    entities = [
        {"id": "GitHub", "name": "GitHub", "type": "Platform", "properties": {"url": "https://github.com/"}, "source": "github_api"}
    ]
    relationships = []

    for repo_full, cfg in repo_targets.items():
        readme_url = cfg["readme_raw"]
        readme = repo_ingestor.ingest_url(readme_url)
        print(f"Fetched {repo_full} README: {len(readme.text)} chars")

        repo_url = f"https://github.com/{repo_full}"
        entities.append(
            {
                "id": repo_full,
                "name": repo_full,
                "type": "Repository",
                "properties": {"repo_url": repo_url, "readme_url": readme_url, "readme_title": getattr(readme, "title", "")},
                "source": "github_raw",
            }
        )
        relationships.append({"source": repo_full, "target": "GitHub", "type": "hosted_on"})

        owner = repo_full.split("/")[0]
        org_id = f"GitHubOrg::{owner}"
        entities.append({"id": org_id, "name": owner, "type": "Organization", "properties": {"url": f"https://github.com/{owner}"}, "source": "github_api"})
        relationships.append({"source": repo_full, "target": org_id, "type": "owned_by"})

        lib = cfg.get("library")
        if lib:
            relationships.append({"source": lib, "target": repo_full, "type": "source_code_in"})
            relationships.append({"source": repo_full, "target": lib, "type": "source_code_for"})

        repo_api_url = f"https://api.github.com/repos/{repo_full}"
        repo_resp = requests.get(repo_api_url, headers=headers, timeout=30)
        if repo_resp.status_code == 200:
            meta = repo_resp.json() or {}
            repo_props = {
                "stars": meta.get("stargazers_count"),
                "forks": meta.get("forks_count"),
                "open_issues": meta.get("open_issues_count"),
                "language": meta.get("language"),
                "updated_at": meta.get("updated_at"),
            }
            for ent in entities:
                if isinstance(ent, dict) and ent.get("id") == repo_full:
                    ent.setdefault("properties", {}).update({k: v for k, v in repo_props.items() if v is not None})
                    break
        else:
            print(f"Repo metadata unavailable for {repo_full} (status {repo_resp.status_code}).")

    relationships.append({"source": "python/cpython", "target": "CPython", "type": "repository_for"})
    relationships.append({"source": "python/cpython", "target": "Python", "type": "implements"})

    gh_releases_url = "https://api.github.com/repos/python/cpython/releases?per_page=8"
    release_resp = requests.get(gh_releases_url, headers=headers, timeout=30)
    if release_resp.status_code == 200:
        releases = release_resp.json() or []
        for r in releases:
            tag = r.get("tag_name")
            if not tag:
                continue
            release_id = f"Release::python/cpython::{tag}"
            entities.append(
                {
                    "id": release_id,
                    "name": f"CPython {tag}",
                    "type": "Release",
                    "properties": {"tag": tag, "published_at": r.get("published_at"), "url": r.get("html_url")},
                    "source": "github_api",
                }
            )
            relationships.append({"source": release_id, "target": "python/cpython", "type": "release_of"})
            if tag.startswith("v") and len(tag) >= 4:
                major_minor = ".".join(tag.lstrip("v").split(".")[:2])
                relationships.append({"source": release_id, "target": f"Python {major_minor}", "type": "implements"})
    else:
        print(f"GitHub releases unavailable (status {release_resp.status_code}).")

    source_repo = {
        "name": "GitHub Repos + Metadata + Releases",
        "type": "unstructured_repo",
        "entities": entities,
        "relationships": relationships,
    }

except Exception as e:
    print(f"Repo Ingestion Failed: {e}")
    source_repo = {
        "name": "GitHub Repos + Metadata + Releases (Offline)",
        "type": "unstructured_repo",
        "entities": [
            {"id": "GitHub", "name": "GitHub", "type": "Platform", "properties": {"url": "https://github.com/"}, "source": "offline"},
            {"id": "python/cpython", "name": "python/cpython", "type": "Repository", "properties": {"repo_url": "https://github.com/python/cpython"}, "source": "offline"},
            {"id": "pandas-dev/pandas", "name": "pandas-dev/pandas", "type": "Repository", "properties": {"repo_url": "https://github.com/pandas-dev/pandas"}, "source": "offline"},
            {"id": "numpy/numpy", "name": "numpy/numpy", "type": "Repository", "properties": {"repo_url": "https://github.com/numpy/numpy"}, "source": "offline"},
            {"id": "scipy/scipy", "name": "scipy/scipy", "type": "Repository", "properties": {"repo_url": "https://github.com/scipy/scipy"}, "source": "offline"},
        ],
        "relationships": [
            {"source": "python/cpython", "target": "GitHub", "type": "hosted_on"},
            {"source": "pandas-dev/pandas", "target": "GitHub", "type": "hosted_on"},
            {"source": "numpy/numpy", "target": "GitHub", "type": "hosted_on"},
            {"source": "scipy/scipy", "target": "GitHub", "type": "hosted_on"},
            {"source": "python/cpython", "target": "Python", "type": "implements"},
            {"source": "pandas", "target": "pandas-dev/pandas", "type": "source_code_in"},
            {"source": "numpy", "target": "numpy/numpy", "type": "source_code_in"},
            {"source": "scipy", "target": "scipy/scipy", "type": "source_code_in"}
        ],
    }



--- 3. Ingesting Repo Sources: CPython + pandas + NumPy + SciPy (GitHub) ---


semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: WebIngestor | File: README.rst | Message: URL: https://raw.githubusercontent.com/python/cpython/main/README.rst
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: WebIngestor | File: README.rst | Message: Ingested https://raw.githubusercontent.com/python/cpython/main/README.rst (200)


Fetched python/cpython README: 8040 chars


semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: WebIngestor | File: README.md | Message: URL: https://raw.githubusercontent.com/pandas-dev/pandas/main/README.md
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: WebIngestor | File: README.md | Message: Ingested https://raw.githubusercontent.com/pandas-dev/pandas/main/README.md (200)


Fetched pandas-dev/pandas README: 11467 chars


semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: WebIngestor | File: README.md | Message: URL: https://raw.githubusercontent.com/numpy/numpy/main/README.md
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: WebIngestor | File: README.md | Message: Ingested https://raw.githubusercontent.com/numpy/numpy/main/README.md (200)


Fetched numpy/numpy README: 4031 chars


semantica.progress - INFO - [RUNNING] | Module: ingest | Submodule: WebIngestor | File: README.rst | Message: URL: https://raw.githubusercontent.com/scipy/scipy/main/README.rst
semantica.progress - INFO - [COMPLETED] | Module: ingest | Submodule: WebIngestor | File: README.rst | Message: Ingested https://raw.githubusercontent.com/scipy/scipy/main/README.rst (200)


Fetched scipy/scipy README: 3465 chars


## Source 4: Database (SQLite via DBIngestor)

We ingest structured data from a local SQLite database using `DBIngestor`. This demonstrates database connectivity and SQL query extraction.

**Note:** This example creates a small SQLite database inside the temporary workspace so it works offline.


In [22]:
print("\n--- 4. Ingesting Database Source: SQLite (local) ---")

import sqlite3

db_path = os.path.join(WORKSPACE_DIR, "python_ecosystem_metrics.sqlite")

try:
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS library_metrics (
            library TEXT PRIMARY KEY,
            downloads INTEGER,
            stars INTEGER,
            last_updated TEXT
        )
        """
    )

    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS library_categories (
            library TEXT,
            category TEXT,
            PRIMARY KEY (library, category)
        )
        """
    )

    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS library_dependencies (
            library TEXT,
            dependency TEXT,
            relation TEXT,
            PRIMARY KEY (library, dependency, relation)
        )
        """
    )

    sample_rows = [
        ("numpy", 80000000, 28000, datetime.utcnow().isoformat()),
        ("pandas", 50000000, 45000, datetime.utcnow().isoformat()),
        ("scipy", 20000000, 12000, datetime.utcnow().isoformat()),
        ("matplotlib", 25000000, 21000, datetime.utcnow().isoformat()),
        ("scikit-learn", 18000000, 60000, datetime.utcnow().isoformat()),
        ("requests", 90000000, 52000, datetime.utcnow().isoformat()),
        ("fastapi", 22000000, 80000, datetime.utcnow().isoformat()),
    ]
    cur.executemany(
        "INSERT OR REPLACE INTO library_metrics (library, downloads, stars, last_updated) VALUES (?, ?, ?, ?)",
        sample_rows,
    )

    category_rows = [
        ("numpy", "Numerical"),
        ("pandas", "Data Analysis"),
        ("scipy", "Scientific Computing"),
        ("matplotlib", "Visualization"),
        ("scikit-learn", "Machine Learning"),
        ("requests", "Networking"),
        ("fastapi", "Web"),
    ]
    cur.executemany(
        "INSERT OR REPLACE INTO library_categories (library, category) VALUES (?, ?)",
        category_rows,
    )

    dependency_rows = [
        ("pandas", "numpy", "depends_on"),
        ("scipy", "numpy", "depends_on"),
        ("matplotlib", "numpy", "depends_on"),
        ("scikit-learn", "numpy", "depends_on"),
        ("scikit-learn", "scipy", "depends_on"),
        ("fastapi", "pydantic", "depends_on"),
        ("fastapi", "starlette", "built_on"),
        ("requests", "urllib3", "depends_on"),
        ("requests", "certifi", "depends_on"),
    ]
    cur.executemany(
        "INSERT OR REPLACE INTO library_dependencies (library, dependency, relation) VALUES (?, ?, ?)",
        dependency_rows,
    )
    conn.commit()
    conn.close()

    db_ingestor = DBIngestor()
    sqlite_conn_str = f"sqlite:///{db_path}"

    rows = db_ingestor.execute_query(
        sqlite_conn_str,
        "SELECT library, downloads, stars, last_updated FROM library_metrics WHERE downloads >= :min_downloads",
        min_downloads=1000000,
    )

    cat_rows = db_ingestor.execute_query(
        sqlite_conn_str,
        "SELECT library, category FROM library_categories",
    )

    dep_rows = db_ingestor.execute_query(
        sqlite_conn_str,
        "SELECT library, dependency, relation FROM library_dependencies",
    )

    if not rows and not cat_rows and not dep_rows:
        raise RuntimeError("No data found in SQLite database")

    entities = []
    relationships = []

    entities.append({"id": "SQLite", "name": "SQLite", "type": "Database", "properties": {"path": db_path}, "source": "sqlite_db"})
    for row in rows:
        lib = row.get("library")
        if not lib:
            continue
        entities.append({"id": lib, "name": lib, "type": "Library", "properties": {}, "source": "sqlite_db"})
        metric_id = f"{lib}::metrics"
        entities.append(
            {
                "id": metric_id,
                "name": f"{lib} metrics",
                "type": "LibraryMetrics",
                "properties": {
                    "downloads": row.get("downloads"),
                    "stars": row.get("stars"),
                    "last_updated": row.get("last_updated"),
                },
                "source": "sqlite_db",
            }
        )
        relationships.append({"source": metric_id, "target": lib, "type": "metrics_for"})
        relationships.append({"source": lib, "target": "Python", "type": "ecosystem_of"})
        relationships.append({"source": metric_id, "target": "SQLite", "type": "stored_in"})

    for row in cat_rows:
        lib = row.get("library")
        cat = row.get("category")
        if not lib or not cat:
            continue
        cat_id = f"Category::{cat}"
        entities.append({"id": cat_id, "name": cat, "type": "Category", "properties": {}, "source": "sqlite_db"})
        relationships.append({"source": lib, "target": cat_id, "type": "categorized_as"})

    for row in dep_rows:
        lib = row.get("library")
        dep = row.get("dependency")
        rel = row.get("relation") or "depends_on"
        if not lib or not dep:
            continue
        entities.append({"id": dep, "name": dep, "type": "Library", "properties": {}, "source": "sqlite_db"})
        relationships.append({"source": lib, "target": dep, "type": rel})

    print(f"Ingested {len(entities)} rows from SQLite metrics table")
    source_db = {
        "name": "SQLite Metrics",
        "type": "database",
        "entities": entities,
        "relationships": relationships,
    }

except Exception as e:
    print(f"Database ingestion via DBIngestor skipped: {e}")
    entities = [
        {"id": "SQLite", "name": "SQLite", "type": "Database", "properties": {"path": db_path}, "source": "sqlite_db_offline"},
        {"id": "pandas", "name": "pandas", "type": "Library", "properties": {}, "source": "sqlite_db_offline"},
        {"id": "pandas::metrics", "name": "pandas metrics", "type": "LibraryMetrics", "properties": {"downloads": 50000000, "stars": 45000}, "source": "sqlite_db_offline"},
        {"id": "numpy", "name": "numpy", "type": "Library", "properties": {}, "source": "sqlite_db_offline"},
        {"id": "numpy::metrics", "name": "numpy metrics", "type": "LibraryMetrics", "properties": {"downloads": 80000000, "stars": 28000}, "source": "sqlite_db_offline"},
        {"id": "scipy", "name": "scipy", "type": "Library", "properties": {}, "source": "sqlite_db_offline"},
        {"id": "scipy::metrics", "name": "scipy metrics", "type": "LibraryMetrics", "properties": {"downloads": 20000000, "stars": 12000}, "source": "sqlite_db_offline"},
        {"id": "fastapi", "name": "fastapi", "type": "Library", "properties": {}, "source": "sqlite_db_offline"},
        {"id": "fastapi::metrics", "name": "fastapi metrics", "type": "LibraryMetrics", "properties": {"downloads": 22000000, "stars": 80000}, "source": "sqlite_db_offline"},
    ]
    relationships = [
        {"source": "pandas::metrics", "target": "pandas", "type": "metrics_for"},
        {"source": "numpy::metrics", "target": "numpy", "type": "metrics_for"},
        {"source": "scipy::metrics", "target": "scipy", "type": "metrics_for"},
        {"source": "fastapi::metrics", "target": "fastapi", "type": "metrics_for"},
        {"source": "pandas", "target": "Python", "type": "ecosystem_of"},
        {"source": "numpy", "target": "Python", "type": "ecosystem_of"},
        {"source": "scipy", "target": "Python", "type": "ecosystem_of"},
        {"source": "fastapi", "target": "Python", "type": "ecosystem_of"},
        {"source": "pandas", "target": "numpy", "type": "depends_on"},
        {"source": "fastapi", "target": "pydantic", "type": "depends_on"}
    ]
    source_db = {
        "name": "SQLite Metrics (Offline)",
        "type": "database",
        "entities": entities,
        "relationships": relationships,
    }


semantica.database_connector - INFO - Connected to sqlite



--- 4. Ingesting Database Source: SQLite (local) ---


semantica.database_connector - INFO - Disconnected from database
semantica.database_connector - INFO - Connected to sqlite
semantica.database_connector - INFO - Disconnected from database
semantica.database_connector - INFO - Connected to sqlite
semantica.database_connector - INFO - Disconnected from database


Ingested 31 rows from SQLite metrics table


## Source 5: Model Context Protocol (MCP)

We attempt to connect to a local MCP server (e.g., a Web Search tool) to get live context. 

**Note:** If no MCP server is running at `localhost:8000`, this section will gracefully fallback to simulated data, but the code provided is production-ready for MCP integration.

Useful MCP server directories / references:
- https://github.com/modelcontextprotocol/servers
- https://glama.ai/mcp/servers
- https://github.com/punkpeye/awesome-mcp-servers
- https://github.com/wong2/awesome-mcp-servers
- https://mcp.so
- Brave Search MCP Server: https://github.com/brave/brave-search-mcp-server

In [23]:
print("\n--- 5. Ingesting via MCP (Model Context Protocol) ---")

mcp_server_urls = [
    "http://localhost:8000/mcp",
    "http://localhost:8000/sse",
    "http://127.0.0.1:8000/mcp",
    "http://127.0.0.1:8000/sse",
    "http://localhost:8080/mcp",
    "http://localhost:8080/sse",
    "http://127.0.0.1:8080/mcp",
    "http://127.0.0.1:8080/sse",
]

try:
    mcp_client_logger = logging.getLogger("semantica.mcp_client")
    mcp_ingestor_logger = logging.getLogger("semantica.mcp_ingestor")
    prev_client_level = mcp_client_logger.level
    prev_ingestor_level = mcp_ingestor_logger.level
    mcp_client_logger.setLevel(logging.CRITICAL)
    mcp_ingestor_logger.setLevel(logging.CRITICAL)

    mcp = MCPIngestor()
    connected_url = None
    for url in mcp_server_urls:
        try:
            mcp.connect("web_search", url=url)
            connected_url = url
            break
        except Exception:
            continue
    if not connected_url:
        raise RuntimeError("No MCP server reachable on known local SSE endpoints")

    print(f"Connected to MCP Server at {connected_url}")
    
    payload = {"query": "latest python 3.13 features"}
    mcp_result = None
    for tool_name in ["search", "brave_web_search", "brave_local_search"]:
        try:
            mcp_result = mcp.ingest_tool_output("web_search", tool_name, payload)
            break
        except Exception:
            continue
    if mcp_result is None:
        raise RuntimeError("No compatible MCP search tool found")
    search_results = getattr(mcp_result, "content", mcp_result)
    if not isinstance(search_results, dict):
        raise RuntimeError("MCP tool output content was not a dict")
    
    print("Received Live Data from MCP.")
    
    raw_entities = search_results.get("entities", []) or []
    raw_relationships = search_results.get("relationships", []) or []
    normalized_entities = []
    for ent in raw_entities:
        if not isinstance(ent, dict):
            continue
        name = ent.get("name") or ent.get("text") or ent.get("id")
        if not name:
            continue
        normalized_entities.append(
            {
                "id": ent.get("id") or name,
                "name": name,
                "type": ent.get("type") or ent.get("label") or "Entity",
                "properties": ent.get("properties") or ent.get("metadata") or {},
                "source": "mcp_live",
            }
        )

    normalized_relationships = []
    for rel in raw_relationships:
        if not isinstance(rel, dict):
            continue
        src = rel.get("source") or rel.get("subject")
        tgt = rel.get("target") or rel.get("object")
        rtype = rel.get("type") or rel.get("label") or rel.get("predicate")
        if not src or not tgt or not rtype:
            continue
        normalized_relationships.append({"source": src, "target": tgt, "type": rtype, "properties": rel.get("properties") or rel.get("metadata") or {}})

    endpoint_ids = {e.get("id") for e in normalized_entities if isinstance(e, dict) and e.get("id")}
    for rel in normalized_relationships:
        src = rel.get("source")
        tgt = rel.get("target")
        for node_id in [src, tgt]:
            if node_id and node_id not in endpoint_ids:
                normalized_entities.append({"id": node_id, "name": node_id, "type": "Entity", "properties": {}, "source": "mcp_live"})
                endpoint_ids.add(node_id)

    source_mcp = {
        "name": "MCP Search",
        "type": "agent_tool",
        "entities": normalized_entities,
        "relationships": normalized_relationships,
        "source": "mcp_live"
    }

except Exception as e:
    print("MCP Server not detected. Using simulated 'Live Search' data.")
    print("To enable: start a local MCP server (commonly `http://localhost:8000/mcp` or `http://localhost:8000/sse`)")
    
    source_mcp = {
        "name": "MCP Search (Simulated)",
        "type": "agent_tool",
        "entities": [
            {
                "id": "Python 3.13",
                "name": "Python 3.13",
                "type": "SoftwareVersion",
                "properties": {"status": "In Development", "feature": "No-GIL Build"},
                "source": "mcp_simulated"
            }
        ],
        "relationships": [
            {"source": "Python 3.13", "target": "Python", "type": "version_of"}
        ]
    }
finally:
    if 'mcp_client_logger' in locals() and 'prev_client_level' in locals():
        mcp_client_logger.setLevel(prev_client_level)
    if 'mcp_ingestor_logger' in locals() and 'prev_ingestor_level' in locals():
        mcp_ingestor_logger.setLevel(prev_ingestor_level)


--- 5. Ingesting via MCP (Model Context Protocol) ---
MCP Server not detected. Using simulated 'Live Search' data.
To enable: start a local MCP server (commonly `http://localhost:8000/mcp` or `http://localhost:8000/sse`)


## Phase 5: Knowledge Graph Construction

We merge all these real-world data points into a single Knowledge Graph.

In [25]:
print("\n--- Building Knowledge Graph ---")

all_sources = [source_web, source_api, source_repo, source_db, source_mcp]

builder = GraphBuilder(merge_entities=True, resolve_conflicts=True)

kg = builder.build(sources=all_sources) or {}
kg.setdefault("entities", [])
kg.setdefault("relationships", [])

entity_ids = set()
for node in kg.get("entities", []):
    if isinstance(node, dict):
        node_id = node.get("id") or node.get("entity_id") or node.get("name")
        if node_id:
            entity_ids.add(node_id)

missing_ids = set()
for rel in kg.get("relationships", []):
    if not isinstance(rel, dict):
        continue
    src = rel.get("source") or rel.get("subject")
    tgt = rel.get("target") or rel.get("object")
    if src and src not in entity_ids:
        missing_ids.add(src)
    if tgt and tgt not in entity_ids:
        missing_ids.add(tgt)

for mid in sorted(missing_ids):
    kg["entities"].append({"id": mid, "name": mid, "type": "Entity", "properties": {}, "source": "auto"})
    entity_ids.add(mid)

print(f"Graph Statistics:")
print(f"Nodes: {len(kg.get('entities', []))}")
print(f"Edges: {len(kg.get('relationships', []))}")

# List all nodes to verify integration
print("\nEntities in Graph:")
for node in kg.get('entities', []):
    name = node.get('name') or node.get('label') or node.get('id')
    ntype = node.get('type') or node.get('label') or 'Entity'
    print(f"- {name} ({ntype})")


--- Building Knowledge Graph ---


semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: GraphBuilder | Message: Knowledge graph from 5 source(s)
semantica.graph_builder - INFO - Building knowledge graph from 5 source(s)
semantica.graph_builder - INFO - Resolving 43 entities using fuzzy strategy
semantica.entity_resolver - INFO - Resolving 43 entities using fuzzy strategy
semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: EntityResolver | Message: Resolving entities
semantica.progress - INFO - [RUNNING] | Module: deduplication | Submodule: DuplicateDetector | Message: Detecting duplicate groups from 43 entities
semantica.duplicate_detector - INFO - Detecting duplicate groups from 43 entities
semantica.progress - INFO - [RUNNING] | Module: deduplication | Submodule: DuplicateDetector | Message: Detecting duplicates in 43 entities
semantica.duplicate_detector - INFO - Detecting duplicates in 43 entities (threshold: 0.7)
semantica.progress - INFO - [RUNNING] | Module: deduplication | Submodule: Dupli

Graph Statistics:
Nodes: 38
Edges: 57

Entities in Graph:
- numpy/numpy (Repository)
- python/cpython (Repository)
- matplotlib metrics (LibraryMetrics)
- requests metrics (LibraryMetrics)
- GitHub (Platform)
- SQLite (Database)
- Data Analysis (Category)
- Visualization (Category)
- Machine Learning (Category)
- Networking (Category)
- Web (Category)
- pydantic (Library)
- starlette (Library)
- urllib3 (Library)
- certifi (Library)
- CPython (Entity)
- Category::Numerical (Entity)
- Category::Scientific Computing (Entity)
- GitHubOrg::numpy (Entity)
- GitHubOrg::pandas-dev (Entity)
- GitHubOrg::python (Entity)
- GitHubOrg::scipy (Entity)
- Python (Entity)
- Python 3.13 (Entity)
- fastapi (Entity)
- fastapi::metrics (Entity)
- matplotlib (Entity)
- numpy (Entity)
- numpy::metrics (Entity)
- pandas (Entity)
- pandas-dev/pandas (Entity)
- pandas::metrics (Entity)
- requests (Entity)
- scikit-learn (Entity)
- scikit-learn::metrics (Entity)
- scipy (Entity)
- scipy/scipy (Entity)
- scipy::

## Phase 6: Advanced Graph Analytics

We can perform network analysis on the constructed graph to find key entities. Here, we calculate **Degree Centrality** to identify the most connected nodes.

In [26]:
from semantica.kg import CentralityCalculator, CommunityDetector

print("\n--- Running Graph Analytics (Semantica) ---")

# 1. Centrality Analysis
centrality_calc = CentralityCalculator()
degree_centrality = centrality_calc.calculate_degree_centrality(kg)

print("Top 5 Most Central Entities (Degree):")
for ranking in degree_centrality.get("rankings", [])[:5]:
    print(f"- {ranking['node']}: {ranking['score']:.4f}")

# 2. Community Detection
try:
    detector = CommunityDetector()
    result = detector.detect_communities(kg, algorithm="louvain") or {}

    communities_raw = result.get("communities")
    if communities_raw is None:
        communities_raw = result.get("node_assignments")

    communities = []
    if isinstance(communities_raw, list):
        for c in communities_raw:
            if isinstance(c, (list, tuple, set)):
                communities.append(list(c))
            elif isinstance(c, dict):
                communities.append(list(c.keys()))
            else:
                communities.append([str(c)])
    elif isinstance(communities_raw, dict):
        comm_map = {}
        for node_id, comm_id in communities_raw.items():
            comm_map.setdefault(comm_id, []).append(node_id)
        # Sort by community ID safely
        sorted_keys = sorted(comm_map.keys(), key=lambda x: str(x))
        communities = [comm_map[k] for k in sorted_keys]

    print(f"\nDetected {len(communities)} Communities:")
    for i, comm in enumerate(communities[:3]):
        # Ensure elements are strings
        sample = [str(x) for x in list(comm)[:5]]
        print(f"Community {i+1}: {', '.join(sample)}...")
except Exception as e:
    print(f"\nCommunity detection skipped: {e}")

semantica.centrality_calculator - INFO - Centrality calculator initialized



--- Running Graph Analytics (Semantica) ---


semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: CentralityCalculator | Message: Calculating degree centrality
semantica.centrality_calculator - INFO - Calculating degree centrality
semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: CentralityCalculator | Message: Processing graph structure...
semantica.progress - INFO - [COMPLETED] | Module: kg | Submodule: CentralityCalculator | Message: Calculated degree centrality for 38 nodes
semantica.community_detector - INFO - Detecting communities using louvain algorithm


Top 5 Most Central Entities (Degree):
- Python: 0.2432
- numpy: 0.2162
- SQLite: 0.1892
- scipy: 0.1622
- pandas: 0.1351


semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: CommunityDetector | Message: Detecting communities using Louvain algorithm
semantica.community_detector - INFO - Detecting communities using Louvain algorithm
semantica.progress - INFO - [RUNNING] | Module: kg | Submodule: CommunityDetector | Message: Detecting communities with NetworkX...
semantica.progress - INFO - [COMPLETED] | Module: kg | Submodule: CommunityDetector | Message: Detected 7 communities



Detected 7 Communities:
Community 1: frozenset({'GitHubOrg::numpy', 'pandas::metrics', 'GitHubOrg::pandas-dev', 'Category::Data Analysis', 'GitHub', 'pandas', 'numpy/numpy', 'pandas-dev/pandas'})...
Community 2: frozenset({'matplotlib::metrics', 'numpy', 'SQLite', 'Category::Visualization', 'fastapi::metrics', 'numpy::metrics', 'Category::Numerical', 'matplotlib'})...
Community 3: frozenset({'Python 3.13', 'GitHubOrg::python', 'Python', 'python/cpython', 'CPython'})...


## Phase 7: Semantic Querying

We can query the graph to find specific relationships, such as tracing the lineage of Python versions or finding libraries related to Python.

In [27]:
from semantica.kg import ConnectivityAnalyzer

print("\n--- Semantic Querying & Path Finding ---")

kg = globals().get("kg")
kg = kg if isinstance(kg, dict) else {}
kg.setdefault("entities", [])
kg.setdefault("relationships", [])

analyzer = ConnectivityAnalyzer()

# 1. Check Connectivity
connectivity = analyzer.analyze_connectivity(kg)
print(f"Graph Connected: {connectivity.get('is_connected')}")
print(f"Connected Components: {connectivity.get('num_components')}")

# 2. Find Path between Entities
source = "pandas"
target = "Python"

print(f"\nFinding path from '{source}' to '{target}':")
try:
    path_result = analyzer.calculate_shortest_paths(kg, source=source, target=target)
    
    if path_result.get("exists"):
        path = path_result["path"]
        print(f"Path Found: {' -> '.join(path)}")
        print(f"Distance: {path_result['distance']}")
    else:
        print("No path found.")
except Exception as e:
    print(f"Path finding error: {e}")
    # Fallback to simple neighbor check
    print("Falling back to direct neighbor check...")
    found = False
    for rel in kg.get('relationships', []):
        if rel.get('source') == source and rel.get('target') == target:
            print(f" - [{rel.get('type', 'related_to')}] -> {target}")
            found = True
    if not found:
        print("No direct edge found.")

semantica.connectivity_analyzer - INFO - Analyzing graph connectivity
semantica.connectivity_analyzer - INFO - Finding connected components
semantica.connectivity_analyzer - INFO - Calculating connectivity metrics
semantica.connectivity_analyzer - INFO - Calculating shortest paths from pandas to Python



--- Semantic Querying & Path Finding ---
Graph Connected: True
Connected Components: 1

Finding path from 'pandas' to 'Python':
Path Found: pandas -> Python
Distance: 1


## Phase 8: Export & Persistence

Finally, we save the constructed Knowledge Graph to a JSON file for external use or visualization in other tools.

In [28]:
from semantica.export import GraphExporter

print("\n--- Exporting Knowledge Graph ---")

# Use Semantica's GraphExporter for robust export
exporter = GraphExporter(format="json", include_attributes=True)
export_path = os.path.join(WORKSPACE_DIR, "python_ecosystem_kg.json")

try:
    exporter.export_knowledge_graph(kg, export_path)
    print(f"Graph saved to: {export_path}")
    
    # Optional: Export to GraphML for Gephi
    graphml_path = os.path.join(WORKSPACE_DIR, "python_ecosystem.graphml")
    exporter_ml = GraphExporter(format="graphml")
    exporter_ml.export_knowledge_graph(kg, graphml_path)
    print(f"GraphML saved to: {graphml_path} (Ready for Gephi/Cytoscape)")
    
except Exception as e:
    print(f"Export failed: {e}")
    # Fallback
    import json
    with open(export_path, "w") as f:
        json.dump(kg, f, default=str)
    print("Fallback export used.")


--- Exporting Knowledge Graph ---


semantica.progress - INFO - [RUNNING] | Module: export | Submodule: GraphExporter | File: python_ecosystem_kg.json | Message: Exporting graph to json: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem_kg.json
semantica.progress - INFO - [RUNNING] | Module: export | Submodule: GraphExporter | File: python_ecosystem_kg.json | Message: Exporting in json format...
semantica.graph_exporter - INFO - Exported graph (json) to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem_kg.json
semantica.progress - INFO - [COMPLETED] | Module: export | Submodule: GraphExporter | File: python_ecosystem_kg.json | Message: Exported graph (json) to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem_kg.json


Graph saved to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem_kg.json


semantica.progress - INFO - [RUNNING] | Module: export | Submodule: GraphExporter | File: python_ecosystem.graphml | Message: Exporting graph to graphml: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem.graphml
semantica.progress - INFO - [RUNNING] | Module: export | Submodule: GraphExporter | File: python_ecosystem.graphml | Message: Exporting in graphml format...
semantica.graph_exporter - INFO - Exported graph (graphml) to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem.graphml
semantica.progress - INFO - [COMPLETED] | Module: export | Submodule: GraphExporter | File: python_ecosystem.graphml | Message: Exported graph (graphml) to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem.graphml


GraphML saved to: C:\Users\MOHDKA~1\AppData\Local\Temp\tmpsn4xpie2\python_ecosystem.graphml (Ready for Gephi/Cytoscape)


## Phase 9: Context Engineering for LLM Agents

This is the **critical step** where we turn our Knowledge Graph into a queryable **Context** for AI Agents.
We use the `AgentContext` module to ingest our graph and enable **Retrieval Augmented Generation (RAG)** capabilities.

In [29]:
print("\n--- Context Engineering ---")

import json
from semantica.context import AgentContext, ContextGraph
from semantica.vector_store import VectorStore

kg = globals().get("kg")
kg = kg if isinstance(kg, dict) else {}
kg.setdefault("entities", [])
kg.setdefault("relationships", [])

# 1. Initialize Vector Store (with FastEmbed support)
# We try to use the high-performance 'fastembed' model if available
vs = VectorStore(backend="inmemory", dimension=384)
try:
    if hasattr(vs, "embedder") and vs.embedder:
        print("Initializing FastEmbed model (BAAI/bge-small-en-v1.5)...")
        vs.embedder.set_text_model(method="fastembed", model_name="BAAI/bge-small-en-v1.5")
except Exception as e:
    print(f"FastEmbed not available ({e}). Using fallback keyword/random embedding.")
    print("Tip: Run '!pip install fastembed' and restart kernel for better results.")

# 2. Initialize Context Graph
cg = ContextGraph()

# 3. Create the Agent Context
# This binds the Vector Store (Content) and Knowledge Graph (Structure) together
context = AgentContext(vector_store=vs, knowledge_graph=cg)

# 4. Ingest Graph Structure
# We map our generic KG data to the specific structure ContextGraph expects
print("Building Context Graph structure...")

kg_entities = kg.get("entities", []) if isinstance(kg, dict) else []
kg_relationships = kg.get("relationships", []) if isinstance(kg, dict) else []

context_entities = []
for node in kg_entities:
    if not isinstance(node, dict):
        continue
    name = node.get("name") or node.get("id")
    if not name:
        continue
    context_entities.append(
        {
            "id": node.get("id") or name,
            "text": name,
            "type": node.get("type") or "Entity",
            "metadata": node.get("properties") or {},
        }
    )

context_relationships = []
for rel in kg_relationships:
    if not isinstance(rel, dict):
        continue
    src = rel.get("source")
    tgt = rel.get("target")
    rtype = rel.get("type")
    if not src or not tgt or not rtype:
        continue
    context_relationships.append({"source_id": src, "target_id": tgt, "type": rtype})

cg.build_from_entities_and_relationships(context_entities, context_relationships)
print(f"Context Graph: {cg.stats()['node_count']} nodes, {cg.stats()['edge_count']} edges")

# 5. Index Entities for Vector Retrieval (Batch Store)
# We transform entities into "documents" so the Vector Store can index them.
# This allows the Agent to "find" the graph nodes using semantic search.
print("Indexing entities into Vector Store...")

entity_documents = []
for node in kg_entities:
    # Create a rich textual description for the embedding
    if not isinstance(node, dict):
        continue
    name = node.get("name") or node.get("id")
    if not name:
        continue
    description = f"{name} is a {node.get('type', 'Entity')}."
    props = node.get('properties', {})
    if props:
        # Flatten properties into string for better semantic context
        prop_str = ", ".join([f"{k}: {v}" for k,v in props.items() if isinstance(v, (str, int, float))])
        description += f" Properties: {prop_str}"
    
    # Create document object
    entity_documents.append({
        "content": description,
        "metadata": {
            "source": "knowledge_graph",
            "original_id": node.get("id") or name,
            "type": node.get('type', 'Entity')
        }
    })

# Batch store all entity descriptions
# extract_entities=False because we are storing the entities themselves
context.store(entity_documents, extract_entities=False)
print(f"Successfully indexed {len(entity_documents)} entities.")

# 6. Simulate an Agent Query (GraphRAG)
query = "pandas library"
print(f"\nAgent Query: '{query}'")

# Retrieve context using Hybrid Search (Vector + Graph)
results = context.retrieve(
    query,
    use_graph=True,
    expand_graph=True,  # Follow edges to get related context (e.g. pandas -> Python)
    max_results=3
)

print("\n--- Retrieved Context for LLM ---")
if results:
    for res in results:
        # Access dictionary keys instead of attributes
        print(f"Content: {res['content']}")
        print(f"Score: {res['score']:.4f}")
        
        # Check for related entities in the dictionary
        if 'related_entities' in res and res['related_entities']:
            # related_entities is a list of dicts, we want the 'text' or 'id'
            related = [e.get('text', e.get('id', 'Unknown')) for e in res['related_entities']]
            print(f"Graph Expansion: {', '.join(related)}")
        print("-" * 30)
else:
    print("No context retrieved.")

semantica.embedding_generator - INFO - Embedding generator initialized
semantica.embedding_generator - INFO - Switched text model to: fastembed/BAAI/bge-small-en-v1.5



--- Context Engineering ---
Initializing FastEmbed model (BAAI/bge-small-en-v1.5)...
Building Context Graph structure...


semantica.progress - INFO - [RUNNING] | Module: context | Submodule: ContextGraph | Message: Building graph from 38 entities and 57 relationships
semantica.progress - INFO - [COMPLETED] | Module: context | Submodule: ContextGraph


Context Graph: 38 nodes, 57 edges
Indexing entities into Vector Store...


semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Storing memory: numpy/numpy is a Repository. Properties: repo_url:...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Generating embedding...
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedder | Message: Generating text embedding: numpy/numpy is a Repository. Properties: repo_url:...
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedder | Message: Using fallback embedding method...
semantica.progress - INFO - [COMPLETED] | Module: embeddings | Submodule: TextEmbedder | Message: Generated embedding (dim: 16)
semantica.progress - INFO - [RUNNING] | Module: vector_store | Submodule: VectorStore | Message: Storing 1 vectors
semantica.progress - INFO - [RUNNING] | Module: vector_store | Submodule: VectorStore | Message: Storing vectors...
semantica.progress - INFO - [RUNNING] | Module: vector_

semantica.progress - INFO - [RUNNING] | Module: vector_store | Submodule: VectorStore | Message: Updating vector index...
semantica.progress - INFO - [COMPLETED] | Module: vector_store | Submodule: VectorStore | Message: Stored 1 vectors
semantica.progress - INFO - [COMPLETED] | Module: context | Submodule: AgentMemory | Message: Stored memory: mem_3888c9bc0f86
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Storing memory: Machine Learning is a Category....
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Generating embedding...
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedder | Message: Generating text embedding: Machine Learning is a Category....
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedder | Message: Using fallback embedding method...
semantica.progress - INFO - [COMPLETED] | Module: embeddings | Submodule: TextEmbedder |

Successfully indexed 38 entities.

Agent Query: 'pandas library'


semantica.progress - INFO - [RUNNING] | Module: context | Submodule: ContextRetriever | Message: Retrieving context for: pandas library...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: ContextRetriever | Message: Retrieving from vector store...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: ContextRetriever | Message: Retrieving from knowledge graph...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: ContextRetriever | Message: Retrieving from memory...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Retrieving memories for: pandas library...
semantica.progress - INFO - [RUNNING] | Module: context | Submodule: AgentMemory | Message: Searching vector store...
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedder | Message: Generating text embedding: pandas library...
semantica.progress - INFO - [RUNNING] | Module: embeddings | Submodule: TextEmbedde


--- Retrieved Context for LLM ---
Content: pandas::metrics is a Entity.
Score: 0.6000
------------------------------
Content: pandas-dev/pandas is a Entity.
Score: 0.6000
------------------------------
Content: pandas is a Entity.
Score: 0.6000
------------------------------


In [30]:
print("\n--- Visualizing Graph ---")
visualizer = KGVisualizer(layout="force", color_scheme="vibrant")
fig = visualizer.visualize_network(kg, output="interactive")
fig.show()


--- Visualizing Graph ---


semantica.progress - INFO - [RUNNING] | Module: visualization | Submodule: KGVisualizer | Message: Visualizing knowledge graph network
semantica.kg_visualizer - INFO - Visualizing knowledge graph network
semantica.progress - INFO - [RUNNING] | Module: visualization | Submodule: KGVisualizer | Message: Extracting entities and relationships...
semantica.kg_visualizer - INFO - Graph Structure Analysis: 38 nodes, 57 edges
semantica.kg_visualizer - INFO - Entity Types: Category, Database, Entity, Library, LibraryMetrics, Platform, Repository
semantica.progress - INFO - [RUNNING] | Module: visualization | Submodule: KGVisualizer | Message: Building node and edge lists...
semantica.progress - INFO - [RUNNING] | Module: visualization | Submodule: KGVisualizer | Message: Generating visualization...
semantica.progress - INFO - [COMPLETED] | Module: visualization | Submodule: KGVisualizer | Message: Visualization generated: 38 nodes, 57 edges
