[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/01_Welcome_to_Semantica.ipynb)

# Welcome to Semantica

**Open Source Framework for Semantic Layer & Knowledge Engineering**

Semantica is a Python framework for transforming raw, messy, multi-source data into **semantic layers** and **knowledge graphs** that are ready to power GraphRAG, AI agents, multi-agent systems, and analytical applications.

This notebook is an executable introduction. It combines:

- High-level explanation of what Semantica is and why it exists
- A structured tour of the architecture and key modules
- Small, runnable code snippets that show the end-to-end flow

**Use this notebook to understand the big picture, not to learn every API in depth.**

## What Is Semantica?

Semantica is a **semantic intelligence and knowledge engineering framework**. It helps you:

- Build **knowledge graphs** from unstructured and semi-structured data
- Create a unified **semantic layer** on top of diverse data sources
- Power **GraphRAG**, AI agents, and multi-agent systems with structured knowledge
- Incorporate **temporal and quality-aware reasoning** into your applications

### Core Capabilities

- **Universal ingestion**: Files, web, feeds, databases, repositories, streams
- **Rich parsing**: PDFs, Office documents, HTML, JSON, CSV, images, code
- **Normalization**: Cleaning, language detection, entity normalization, date/number standardization
- **Semantic extraction**: Named entities, relationships, events, semantic networks
- **Knowledge graph construction**: Property graphs from entities and relations
- **Embeddings and vector search**: Text and graph embeddings, hybrid retrieval
- **Reasoning and ontology**: Rule-based inference, ontology generation and validation
- **Visualization and analytics**: Graph visualizations and quality metrics

## Who Is Semantica For?

- **AI/ML engineers** building GraphRAG systems, agents, and tools that need long-term memory
- **Data engineers** orchestrating semantic enrichment pipelines over large, heterogeneous datasets
- **Knowledge engineers and ontologists** designing and maintaining formal knowledge structures
- **Researchers and analysts** creating domain knowledge graphs from documents and data feeds
- **Product and platform teams** embedding semantic intelligence into applications and services

## Architecture Overview

Semantica is organized as three conceptual layers and multiple concrete modules.

### Layers

- **Input Layer**
  - Connects to files, web pages, APIs, databases, email, feeds, repositories, and streams
  - Normalizes these different sources into a unified internal representation

- **Semantic Layer**
  - Performs parsing, cleaning, semantic extraction, graph construction, embeddings, and reasoning
  - This is where **unstructured data becomes structured knowledge**

- **Output Layer**
  - Exposes knowledge graphs, embeddings, ontologies, and analytics
  - Integrates with vector stores, graph databases, and downstream applications

## Key Modules at a Glance

Below is a conceptual overview of the core modules. Later cells show a small end-to-end example that stitches several of them together.

- **Ingest**
  - Components: `FileIngestor`, `WebIngestor`, `FeedIngestor`, `StreamIngestor`, `DBIngestor`, `EmailIngestor`, `RepoIngestor`, `MCPIngestor`
  - Responsibility: Bring data from many sources into the pipeline

- **Parse**
  - Components: `DocumentParser`, format-specific parsers like `PDFParser`, `HTMLParser`, `JSONParser`, `ExcelParser`
  - Responsibility: Turn raw content into structured document objects

- **Normalize**
  - Components: `TextNormalizer`, `TextCleaner`, `EntityNormalizer`, `DateNormalizer`, `NumberNormalizer`
  - Responsibility: Clean text, standardize entities and values, prepare for extraction

- **Semantic extract**
  - Components: `NERExtractor`, `RelationExtractor`, `SemanticAnalyzer`, `SemanticNetworkExtractor`
  - Responsibility: Identify entities and relationships that will become nodes and edges in the graph

- **Knowledge graph (KG)**
  - Components: `GraphBuilder`, `GraphAnalyzer`, `GraphValidator`, `EntityResolver`, `ConflictDetector`
  - Responsibility: Build, analyze, and validate the knowledge graph

- **Embeddings and vector store**
  - Components: `EmbeddingGenerator`, `VectorStore`, `HybridSearch`
  - Responsibility: Generate vector representations and enable semantic search

- **Graph store**
  - Components: `GraphStore`, adapters for backends like Neo4j or FalkorDB
  - Responsibility: Persist and query graphs in external databases

- **Reasoning and ontology**
  - Components: `InferenceEngine`, `RuleManager`, `OntologyGenerator`, `OntologyValidator`
  - Responsibility: Apply rules, infer new facts, and maintain formal ontologies

- **Visualization**
  - Components: `KGVisualizer`, `EmbeddingVisualizer`, `QualityVisualizer`, `AnalyticsVisualizer`
  - Responsibility: Inspect, debug, and present graphs, embeddings, and quality metrics

## Core Concepts (High-Level)

- **Knowledge graph**
  - Nodes represent entities such as people, organizations, locations, events, or concepts
  - Edges represent relationships such as `works_for`, `located_in`, `founded_by`
  - Properties capture attributes and metadata such as timestamps, sources, and confidence

- **Entities and relationships**
  - Entities are extracted from text and data using named entity recognition (NER)
  - Relationships connect entities and are extracted using pattern-based, model-based, or LLM-based methods

- **Embeddings**
  - Numerical vectors that encode semantic meaning of text or graph structures
  - Used for semantic search, clustering, and similarity-based retrieval

- **GraphRAG**
  - Combines vector search with graph traversal
  - Uses both embeddings and graph structure to retrieve rich, context-aware information

- **Ontology**
  - A formal model of classes, relationships, and constraints in a domain
  - Used to standardize meaning, enable reasoning, and integrate heterogeneous data

- **Quality and governance**
  - Quality metrics (completeness, consistency, accuracy, coverage)
  - Conflict detection and resolution at the knowledge graph level
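
To make the knowledge graph, embedding, and GraphRAG concepts concrete, here is a minimal standard-library sketch. It does not use Semantica's API: the node ids, the hand-written 3-dimensional vectors, and the `graph_rag_retrieve` helper are all hypothetical and only illustrate the idea of ranking nodes by vector similarity and then following edges from the best matches.

```python
import math

# Hypothetical toy graph: nodes carry text and a hand-written "embedding".
nodes = {
    "apple": {"text": "Apple Inc.", "vec": [0.9, 0.1, 0.0]},
    "jobs": {"text": "Steve Jobs", "vec": [0.7, 0.6, 0.1]},
    "cupertino": {"text": "Cupertino, California", "vec": [0.1, 0.2, 0.9]},
}
# Edges are (source, relation, target, properties) tuples.
edges = [
    ("apple", "founded_by", "jobs", {"confidence": 0.95}),
    ("apple", "located_in", "cupertino", {"confidence": 0.90}),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def graph_rag_retrieve(query_vec, top_k=1):
    """Pick the top_k most similar nodes, then collect the edges touching them."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, nodes[n]["vec"]), reverse=True)
    seeds = ranked[:top_k]
    facts = [(s, r, t) for s, r, t, _props in edges if s in seeds or t in seeds]
    return seeds, facts

seeds, facts = graph_rag_retrieve([1.0, 0.0, 0.0])
print(seeds)
print(facts)
```

The vector step finds an entry point (`apple`), and the traversal step adds the structured facts around it, which is the context a plain vector search would not return.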

## Installation

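Semantica is installable from PyPI. This notebook installs it from a `%pip` cell so that the same code runs in local Jupyter and in Colab.

From a shell, install either the base package or the package with all optional extras:

```bash
# Base package
pip install semantica
# Base package plus the "all" extras (quoted so shells such as zsh do not expand the brackets)
pip install "semantica[all]"
```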

In [None]:
%pip install -U "semantica[all]"
import semantica
semantica.__version__

## Basic Configuration

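Semantica uses configuration for API keys, embedding providers, and knowledge graph options. The example below mirrors a typical configuration while staying simple enough for a notebook. The key values are placeholders; replace them with your own credentials before running cells that call an external provider.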

In [None]:
import os
from pathlib import Path

os.environ["SEMANTICA_API_KEY"] = "your_openai_key"
os.environ["SEMANTICA_EMBEDDING_PROVIDER"] = "openai"
os.environ["SEMANTICA_MODEL_NAME"] = "gpt-4"

config_text = """api_keys:
  openai: your_key_here
  anthropic: your_key_here
embedding:
  provider: openai
  model: text-embedding-3-large
  dimensions: 3072
knowledge_graph:
  backend: networkx
  temporal: true
"""
Path("config.yaml").write_text(config_text, encoding="utf-8")
Path("config.yaml").read_text(encoding="utf-8")
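
Because the configuration example above uses placeholder keys, it can help to check for them before running cells that call a provider. The helper below is a hypothetical sketch and not part of Semantica; it only reuses the environment variable name from the cell above and the placeholder strings shown there.

```python
import os

# Placeholder strings used in the configuration example above.
PLACEHOLDERS = {"", "your_openai_key", "your_key_here"}

def missing_settings(env, required=("SEMANTICA_API_KEY",)):
    """Return the required variable names that are unset or still placeholders."""
    return [name for name in required if env.get(name, "") in PLACEHOLDERS]

# Check the real environment; an empty list means every required value is set.
unresolved = missing_settings(os.environ)
if unresolved:
    print("Set real values for:", ", ".join(unresolved))
```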

## Quick Start: Build a Tiny Knowledge Base

The simplest way to use Semantica is the high-level `build` helper. It ingests data, runs a default pipeline, and returns a dictionary that includes the knowledge graph, embeddings, metadata, and statistics.

In [None]:
from semantica import build
from pathlib import Path

docs_dir = Path("welcome_docs")
docs_dir.mkdir(exist_ok=True)
text_path = docs_dir / "apple.txt"
text_content = (
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak and Ronald Wayne in"
    " Cupertino, California."
)
text_path.write_text(text_content, encoding="utf-8")

result = build(str(docs_dir), embeddings=False, graph=True)
sorted(result.keys())

## Minimal End-to-End Pipeline

The next example shows how to explicitly use several modules in sequence. This mirrors the architecture discussed earlier:

1. Ingest a directory of documents
2. Parse them into structured documents
3. Normalize text
4. Extract entities and relationships
5. Build and analyze a knowledge graph
6. Create embeddings and store them in a vector store
7. Run a hybrid semantic search query

In [None]:
from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser
from semantica.normalize import TextNormalizer
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.kg import GraphBuilder, GraphAnalyzer
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore, HybridSearch

# 1. Ingest a directory of documents
ingestor = FileIngestor()
documents = ingestor.ingest(str(docs_dir))

# 2. Parse them into structured documents
parser = DocumentParser()
parsed_docs = parser.parse(documents)

# 3. Normalize text
normalizer = TextNormalizer()
normalized_docs = normalizer.normalize(parsed_docs)

# 4. Extract entities and relationships
ner = NERExtractor()
entities = ner.extract(normalized_docs)
rel_extractor = RelationExtractor()
relationships = rel_extractor.extract(normalized_docs, entities)

# 5. Build and analyze the knowledge graph
builder = GraphBuilder()
kg = builder.build(entities, relationships)
analyzer = GraphAnalyzer()
metrics = analyzer.analyze(kg)

# 6. Create embeddings and store them in a vector store
emb_generator = EmbeddingGenerator()
embeddings = emb_generator.generate_embeddings(documents, data_type="text")

vec_store = VectorStore()
vec_store.store(embeddings, documents, metadata={})

# 7. Run a hybrid semantic search query
hybrid = HybridSearch(vec_store)
search_results = hybrid.search("Apple founders", top_k=3)
len(search_results)
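
As a rough illustration of what "hybrid" search can mean, the sketch below blends a keyword-overlap score with a vector-similarity score. It is an assumption about the general technique, not a description of how `HybridSearch` is implemented; the `alpha` weight, the scoring functions, and the 2-dimensional vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5, top_k=3):
    """Rank documents by alpha * vector score + (1 - alpha) * keyword score."""
    scored = [
        (alpha * cosine(query_vec, d["vec"]) + (1 - alpha) * keyword_score(query, d["text"]), d["id"])
        for d in docs
    ]
    return [doc_id for _score, doc_id in sorted(scored, reverse=True)[:top_k]]

# Hypothetical documents with hand-written vectors.
docs = [
    {"id": "a", "text": "Apple founders include Steve Jobs", "vec": [1.0, 0.0]},
    {"id": "b", "text": "Cupertino is a city in California", "vec": [0.0, 1.0]},
    {"id": "c", "text": "Steve Wozniak co-founded Apple", "vec": [0.8, 0.6]},
]
ranking = hybrid_rank("Apple founders", [1.0, 0.0], docs)
print(ranking)
```

Document `c` matches only one of the two query words, but its vector is close to the query, so the blended score still places it ahead of the unrelated document `b`.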

## Visualization

Semantica's visualization module renders knowledge graphs for inspection and debugging. Here we create an interactive network graph from the knowledge graph (`kg`) built in the pipeline above.

In [None]:
from semantica.visualization import KGVisualizer

# Create a visualizer instance
viz = KGVisualizer(layout="force", color_scheme="vibrant")

# Generate an interactive network visualization
# This returns a Plotly figure object that renders in the notebook
fig = viz.visualize_network(kg, output="interactive")
fig.show()

## Ontology Generation

You can also automatically generate an ontology (a formal model of your domain) from the extracted entities and relationships.

In [None]:
from semantica.ontology import OntologyGenerator

generator = OntologyGenerator(base_uri="https://example.org/ontology/")

# Generate ontology from the extracted data
ontology = generator.generate_ontology({
    "entities": entities,
    "relationships": relationships
})

# View inferred classes
[cls["name"] for cls in ontology.get("classes", [])[:5]]
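
To show the idea behind ontology generation without relying on Semantica's internals, the sketch below derives classes from entity types and property signatures (domain and range) from relationships. The input dictionaries and the `derive_ontology` helper are hypothetical; the real `OntologyGenerator` may use different shapes and a more sophisticated method.

```python
from collections import defaultdict

def derive_ontology(entities, relationships):
    """Collect entity types as classes and relation types as typed properties."""
    type_of = {e["name"]: e["type"] for e in entities}
    classes = sorted(set(type_of.values()))

    # For each relation type, record every (domain, range) pair that was observed.
    signatures = defaultdict(set)
    for rel in relationships:
        signatures[rel["type"]].add((type_of[rel["source"]], type_of[rel["target"]]))

    properties = [
        {"name": name, "domain": domain, "range": range_}
        for name in sorted(signatures)
        for domain, range_ in sorted(signatures[name])
    ]
    return {"classes": [{"name": c} for c in classes], "properties": properties}

# Hypothetical extraction output for the sample sentence about Apple.
sample_entities = [
    {"name": "Apple Inc.", "type": "Organization"},
    {"name": "Steve Jobs", "type": "Person"},
    {"name": "Cupertino", "type": "Location"},
]
sample_relationships = [
    {"source": "Apple Inc.", "type": "founded_by", "target": "Steve Jobs"},
    {"source": "Apple Inc.", "type": "located_in", "target": "Cupertino"},
]
sketch = derive_ontology(sample_entities, sample_relationships)
print([cls["name"] for cls in sketch["classes"]])
```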

## Advanced: Data Splitting and Chunking

RAG applications usually retrieve short passages rather than whole documents, so documents are split into smaller, overlapping chunks before they are embedded. Semantica provides a `split` module for this purpose.

In [None]:
from semantica.split import Splitter

splitter = Splitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_documents(documents)

print(f"Original documents: {len(documents)}")
print(f"Generated chunks: {len(chunks)}")
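
The interaction between `chunk_size` and `chunk_overlap` is easier to see in a few lines of plain Python. The function below is a character-based sliding window written for illustration only; it is an assumption that the parameters behave this way, and Semantica's `Splitter` may count tokens or respect sentence boundaries instead.

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters
sample_chunks = split_text(sample, chunk_size=12, chunk_overlap=4)
print([len(chunk) for chunk in sample_chunks])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk are repeated at the start of the next. That repetition keeps a sentence that straddles a boundary retrievable from at least one chunk.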

## Advanced: Reasoning and Inference

The `reasoning` module allows you to derive new facts from existing knowledge using logic rules.

In [None]:
from semantica.reasoning import InferenceEngine

# Simple rule: If X founded Y, then X works_for Y
rule = """
IF (?x founded ?y) THEN (?x works_for ?y)
"""

engine = InferenceEngine()
engine.add_rule(rule)
inferred_facts = engine.infer(kg)

print(f"Inferred {len(inferred_facts)} new facts")
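
The rule above can be read as a forward-chaining step over subject-predicate-object triples. The sketch below implements that idea with the standard library only; the `(if_predicate, then_predicate)` rule format and the `forward_chain` helper are hypothetical simplifications and do not reflect the rule syntax or internals of `InferenceEngine`.

```python
def forward_chain(facts, rules):
    """Apply (if_predicate, then_predicate) rules until no new fact is produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for if_pred, then_pred in rules:
            # Iterate over a snapshot because the set grows inside the loop.
            for subj, pred, obj in list(known):
                new_fact = (subj, then_pred, obj)
                if pred == if_pred and new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known - set(facts)  # only the newly inferred facts

base_facts = {
    ("Steve Jobs", "founded", "Apple Inc."),
    ("Apple Inc.", "located_in", "Cupertino"),
}
sample_rules = [("founded", "works_for")]
new_facts = forward_chain(base_facts, sample_rules)
print(new_facts)
```

Looping until nothing changes means that rules can build on each other's output, for example a second rule that maps `works_for` to another predicate.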

## Advanced: Export and Persistence

You can save your knowledge graph to disk or export it to standard formats like CSV, JSON, or RDF.

In [None]:
from semantica.export import GraphExporter

exporter = GraphExporter()
exporter.export(kg, format="json", output_path="knowledge_graph.json")

print("Graph exported to knowledge_graph.json")
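
For a sense of what a JSON export of a graph might contain, here is a standard-library sketch that serializes nodes and edges into a node-link structure and reads it back. The field names (`nodes`, `edges`, `source`, `relation`, `target`) are assumptions chosen for illustration; the file written by `GraphExporter` may be laid out differently.

```python
import json

def to_node_link(nodes, edges):
    """Convert a node dictionary and an edge list into a JSON-friendly structure."""
    return {
        "nodes": [{"id": node_id, **props} for node_id, props in nodes.items()],
        "edges": [{"source": s, "relation": r, "target": t} for s, r, t in edges],
    }

# Hypothetical graph content based on the sample sentence about Apple.
sample_nodes = {
    "Apple Inc.": {"type": "Organization"},
    "Steve Jobs": {"type": "Person"},
}
sample_edges = [("Apple Inc.", "founded_by", "Steve Jobs")]

payload = json.dumps(to_node_link(sample_nodes, sample_edges), indent=2, sort_keys=True)
restored = json.loads(payload)  # round trip: the structure survives serialization
print(payload)
```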

## Other Specialized Modules

Semantica includes many other specialized modules for advanced use cases:

- **Context**: Manages agent memory, conversation history, and context graphs for AI agents.
- **Conflicts**: Detects and resolves conflicting information from different sources (e.g., different birth dates for the same person).
- **Deduplication**: Identifies and merges duplicate entities (e.g., "Steve Jobs" vs "Stephen Jobs").
- **Pipeline**: Orchestrates complex, multi-step workflows with retries and error handling.
- **Triplet Store**: Adapters for RDF triple stores such as Blazegraph, Apache Jena, or Virtuoso.
- **Graph Store**: Adapters for property graph databases such as Neo4j or FalkorDB (also listed under Key Modules above).
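
As an example of the kind of check the Conflicts module description refers to, the sketch below flags attributes for which different sources report different values. It is a hypothetical, standard-library illustration using a fictional person; it does not use Semantica's conflict detection API.

```python
from collections import defaultdict

def find_conflicts(claims):
    """Group claims by (entity, attribute) and keep those with more than one value."""
    values = defaultdict(set)
    for entity, attribute, value, _source in claims:
        values[(entity, attribute)].add(value)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}

# Hypothetical claims about a fictional person from two sources.
claims = [
    ("Ada Example", "birth_date", "1980-01-01", "source_a"),
    ("Ada Example", "birth_date", "1980-01-02", "source_b"),
    ("Ada Example", "birth_place", "Springfield", "source_a"),
]
conflicts = find_conflicts(claims)
print(conflicts)
```

Detection is only the first half: resolving a conflict requires a policy, such as preferring the more trusted source or the most recent claim.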

## Using the Core `Semantica` Class

For more complex systems, you can work directly with the `Semantica` core class and a configuration object. This gives you access to lifecycle management, plugin registration, and orchestration helpers.

In [None]:
from semantica.core import Semantica, ConfigManager

config_manager = ConfigManager()
config = config_manager.load_from_file("config.yaml")

framework = Semantica(config=config)
framework.initialize()

# Always shut the framework down, even if the build step raises
try:
    kb_result = framework.build_knowledge_base(
        sources=[str(docs_dir)],
        embeddings=True,
        graph=True,
    )
finally:
    framework.shutdown()

sorted(kb_result.keys())

## Where to Go Next

- Run the notebooks under `cookbook/introduction` for focused module overviews
- Explore `cookbook/use_cases` for domain-specific end-to-end workflows
- Read the **Core Concepts** documentation for deeper theory and best practices