[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/09_Your_First_Knowledge_Graph.ipynb)

# 🚀 Your First Knowledge Graph

## Overview

This notebook walks you through creating your first knowledge graph from a simple document. You'll learn the complete end-to-end workflow from ingesting a file to visualizing the resulting knowledge graph.

> [!TIP]
> This is the perfect starting point if you are new to Semantica. No prior knowledge of knowledge graphs is required!

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/kg/)

### 🎯 Learning Objectives

- **Understand the Workflow**: Learn the `File → Parse → Extract → Graph` pipeline
- **Ingest Data**: Load documents using `FileIngestor`
- **Parse Content**: Extract text using `DocumentParser`
- **Extract Knowledge**: Identify entities using `NERExtractor`
- **Build Graph**: Construct a graph using `GraphBuilder`
- **Visualize**: See your graph come to life with `KGVisualizer`

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

## 🔄 Simple End-to-End Workflow

The complete workflow consists of four main steps:

1. **📥 Ingest** - Load data from files or other sources
2. **📄 Parse** - Extract and structure content from documents
3. **⛏️ Extract** - Identify entities and relationships
4. **🕸️ Build Graph** - Construct the knowledge graph

Each step is demonstrated in the code cells below.
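
As a rough, library-free sketch of how the four stages hand data to one another, the snippet below chains four plain functions. It does not use Semantica at all: the function names (`ingest`, `parse`, `extract`, `build_graph`), the single hard-coded regex pattern, and the adjacency-dict graph are assumptions made purely for illustration.

```python
import re
import tempfile
from pathlib import Path


def ingest(path: Path) -> str:
    """Ingest: read the raw file contents."""
    return path.read_text(encoding="utf-8")


def parse(raw: str) -> list[str]:
    """Parse: split the raw text into non-empty, stripped lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def extract(sentences: list[str]) -> list[tuple[str, str, str]]:
    """Extract: match one toy pattern, '<person> is the [current] CEO of <org>'."""
    pattern = re.compile(r"^(?P<person>.+?) is the (?:current )?CEO of (?P<org>.+)$")
    triples = []
    for sentence in sentences:
        match = pattern.match(sentence)
        if match:
            triples.append((match["person"], "ceo_of", match["org"]))
    return triples


def build_graph(triples: list[tuple[str, str, str]]) -> dict[str, list[tuple[str, str]]]:
    """Build: store outgoing edges per node as (relation, target) pairs."""
    graph: dict[str, list[tuple[str, str]]] = {}
    for source, relation, target in triples:
        graph.setdefault(source, []).append((relation, target))
        graph.setdefault(target, [])  # make sure the target is also a node
    return graph


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "sample_document.txt"
    doc.write_text("Tim Cook is the current CEO of Apple Inc.\n", encoding="utf-8")
    graph = build_graph(extract(parse(ingest(doc))))

print(graph)
```

Each stage only depends on the output of the previous one, which is the property the step-by-step cells in this notebook rely on as well.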

> [!TIP]
> **Alternative: Using Semantica Framework**
> 
> For a simpler, high-level approach, you can use the `Semantica` framework class which orchestrates all these steps:
> 
> ```python
> from semantica.core import Semantica
> 
> framework = Semantica()
> framework.initialize()
> 
> result = framework.build_knowledge_base(
>     sources=["sample_document.txt"],
>     embeddings=True,
>     graph=True
> )
> 
> framework.shutdown()
> ```
> 
> This notebook shows the step-by-step approach for learning. See [Core Module Usage Guide](../../../semantica/core/core_usage.md) for more details.

---

## 📂 Step 1: Ingest a File

In this step, we'll use `FileIngestor` to load a document. The ingestor supports various file formats including PDF, DOCX, TXT, and more.


In [1]:
!pip install semantica





In [2]:
from semantica.ingest import FileIngestor
from pathlib import Path

# Initialize the ingestor
ingestor = FileIngestor()

# Create a sample document for demonstration
sample_text = """
Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California.
Tim Cook is the current CEO of Apple Inc.
Apple designs and manufactures consumer electronics, software, and online services.
"""

sample_file = Path("sample_document.txt")
sample_file.write_text(sample_text)

print(f"File: {sample_file}")
print(f"Content length: {len(sample_text)} characters")

# Ingest the file
file_object = ingestor.ingest_file(sample_file, read_content=True)
print(f"  File name: {file_object.name}")
print(f"  File type: {file_object.file_type}")
print(f"  Content available: {file_object.content is not None}")


File: sample_document.txt
Content length: 281 characters


| Status | Action | Module | Submodule | File | Time |
| --- | --- | --- | --- | --- | --- |
| ✅ | Semantica is ingesting | 📥 ingest | FileIngestor | sample_document.txt | 0.02s |
| ✅ | Semantica is parsing | 🔍 parse | DocumentParser | sample_document.txt | 0.02s |


  File name: sample_document.txt
  File type: txt
  Content available: True
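
The cell above writes the sample file into the current working directory and relies on the platform's default text encoding. A minimal sketch of a more defensive variant, using only the standard library (no Semantica calls), writes the same text into a temporary directory with an explicit UTF-8 encoding and checks that it round-trips. The variable names are illustrative.

```python
import tempfile
from pathlib import Path

sample_text = """
Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.
The company is headquartered in Cupertino, California.
Tim Cook is the current CEO of Apple Inc.
Apple designs and manufactures consumer electronics, software, and online services.
"""

# The temporary directory (and the file in it) is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_document.txt"
    sample_path.write_text(sample_text, encoding="utf-8")
    round_trip = sample_path.read_text(encoding="utf-8")
    file_type = sample_path.suffix.lstrip(".")

print(f"Round trip OK: {round_trip == sample_text}")
print(f"Content length: {len(round_trip)} characters")
print(f"File type: {file_type}")
```

With this approach the explicit cleanup step at the end of the notebook would no longer be needed for the sample file.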


## 📄 Step 2: Parse the Document

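After ingesting the file, we parse it to extract the text content. `DocumentParser` handles various file formats and returns structured content; the cell below reads the text from the result's `content` key. In the recorded run that key came back empty, so the length prints as 0 and the preview as `N/A`.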


In [3]:
from semantica.parse import DocumentParser

parser = DocumentParser()
# Parse the document to extract text
parsed_document = parser.parse_document(str(sample_file))
parsed_content = parsed_document.get("content", "")
print(f"  Parsed content length: {len(parsed_content) if parsed_content else 0} characters")
print(f"  Preview: {parsed_content[:200] if parsed_content else 'N/A'}...")

  Parsed content length: 0 characters
  Preview: N/A...
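
When a parse result carries no usable text, as in the output above, one option is to fall back to reading the source file directly. The helper below is a hypothetical sketch, not part of Semantica: `text_from_parse_result` and its `key` parameter are names invented here, and it only assumes that the parse result behaves like a mapping.

```python
import tempfile
from pathlib import Path
from typing import Any, Mapping


def text_from_parse_result(parsed: Mapping[str, Any], source: Path, key: str = "content") -> str:
    """Return parsed[key] if it is non-empty text, else read the file directly."""
    value = parsed.get(key)
    if isinstance(value, str) and value.strip():
        return value
    # Fallback (assumption): the source is a plain-text file readable as UTF-8.
    return source.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "sample_document.txt"
    source.write_text("Tim Cook is the current CEO of Apple Inc.\n", encoding="utf-8")

    from_fallback = text_from_parse_result({"content": ""}, source)
    from_parser = text_from_parse_result({"content": "already parsed"}, source)

print(f"Fallback length: {len(from_fallback)} characters")
print(f"Parser text: {from_parser}")
```

This fallback is only sensible for plain-text inputs; for PDF or DOCX sources the raw bytes would still need a real parser.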


## ⛏️ Step 3: Extract Entities

Now we'll extract entities from the parsed text using Named Entity Recognition (NER). This identifies people, organizations, locations, dates, and other entities in the text.

> [!NOTE]
> In a real scenario, you would use `NERExtractor` with an LLM or model backend. Here we simulate the output for demonstration purposes.


In [4]:
from semantica.semantic_extract import NamedEntityRecognizer, NERExtractor

ner = NamedEntityRecognizer()
extractor = NERExtractor()

print(f"\nText: {parsed_content[:100]}...")

# Simulated extraction results (start/end offsets are illustrative, not computed from the text)
expected_entities = [
    {"text": "Apple Inc.", "type": "Organization", "start": 0, "end": 10},
    {"text": "Steve Jobs", "type": "Person", "start": 50, "end": 60},
    {"text": "Steve Wozniak", "type": "Person", "start": 62, "end": 75},
    {"text": "Ronald Wayne", "type": "Person", "start": 81, "end": 93},
    {"text": "1976", "type": "Date", "start": 97, "end": 101},
    {"text": "Cupertino, California", "type": "Location", "start": 130, "end": 151},
    {"text": "Tim Cook", "type": "Person", "start": 153, "end": 161},
]

for entity in expected_entities:
    print(f"  - {entity['text']} ({entity['type']})")



Text: ...
  - Apple Inc. (Organization)
  - Steve Jobs (Person)
  - Steve Wozniak (Person)
  - Ronald Wayne (Person)
  - 1976 (Date)
  - Cupertino, California (Location)
  - Tim Cook (Person)
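
The simulated entities carry hand-written `start` and `end` offsets. If exact character spans matter, they can be derived from the text instead. The sketch below uses plain `str.find` and is not a Semantica API; `locate_mentions` is a hypothetical helper, and it returns only the first occurrence of each mention.

```python
text = (
    "Apple Inc. is a technology company founded by Steve Jobs, "
    "Steve Wozniak, and Ronald Wayne in 1976.\n"
    "The company is headquartered in Cupertino, California.\n"
    "Tim Cook is the current CEO of Apple Inc."
)

mentions = [
    "Apple Inc.",
    "Steve Jobs",
    "Steve Wozniak",
    "Ronald Wayne",
    "1976",
    "Cupertino, California",
    "Tim Cook",
]


def locate_mentions(text: str, mentions: list[str]) -> list[dict]:
    """Find the first occurrence of each mention and record its [start, end) span."""
    spans = []
    for mention in mentions:
        start = text.find(mention)
        if start == -1:
            raise ValueError(f"mention not found in text: {mention!r}")
        spans.append({"text": mention, "start": start, "end": start + len(mention)})
    return spans


spans = locate_mentions(text, mentions)
for span in spans:
    # Slicing with the computed span recovers the mention exactly.
    assert text[span["start"]:span["end"]] == span["text"]
    print(f"  - {span['text']} [{span['start']}:{span['end']}]")
```

Note that the offsets depend on the exact string being searched: a leading newline or stripped whitespace shifts every span.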


## 🕸️ Step 4: Build the Knowledge Graph

Using the extracted entities and relationships, we'll construct a knowledge graph. The graph represents entities as nodes and relationships as edges.


In [5]:
from semantica.kg import GraphBuilder
import networkx as nx

builder = GraphBuilder()

# Prepare data for graph construction
entities_data = [
    {"id": f"entity_{i}", "name": entity["text"], "type": entity["type"]}
    for i, entity in enumerate(expected_entities)
]

relationships_data = [
    {"source": "entity_0", "target": "entity_1", "type": "founded_by"},
    {"source": "entity_0", "target": "entity_2", "type": "founded_by"},
    {"source": "entity_0", "target": "entity_3", "type": "founded_by"},
    {"source": "entity_0", "target": "entity_4", "type": "founded_in"},
    {"source": "entity_0", "target": "entity_5", "type": "located_in"},
    {"source": "entity_6", "target": "entity_0", "type": "ceo_of"},
]

# Build the graph directly with NetworkX (the GraphBuilder instance above is not used in this demo)
kg = nx.DiGraph()

for entity in entities_data:
    kg.add_node(entity["id"], name=entity["name"], type=entity["type"])

for rel in relationships_data:
    kg.add_edge(rel["source"], rel["target"], type=rel["type"])

print(f"  Nodes (entities): {len(kg.nodes)}")
print(f"  Edges (relationships): {len(kg.edges)}")

for node_id in kg.nodes():
    node_data = kg.nodes[node_id]
    print(f"  Node: {node_data['name']} ({node_data['type']})")

for source, target, data in kg.edges(data=True):
    source_name = kg.nodes[source]['name']
    target_name = kg.nodes[target]['name']
    print(f"  {source_name} --[{data['type']}]--> {target_name}")


  Nodes (entities): 7
  Edges (relationships): 6
  Node: Apple Inc. (Organization)
  Node: Steve Jobs (Person)
  Node: Steve Wozniak (Person)
  Node: Ronald Wayne (Person)
  Node: 1976 (Date)
  Node: Cupertino, California (Location)
  Node: Tim Cook (Person)
  Apple Inc. --[founded_by]--> Steve Jobs
  Apple Inc. --[founded_by]--> Steve Wozniak
  Apple Inc. --[founded_by]--> Ronald Wayne
  Apple Inc. --[founded_in]--> 1976
  Apple Inc. --[located_in]--> Cupertino, California
  Tim Cook --[ceo_of]--> Apple Inc.
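
Relationships refer to entities by `id`. Because `nx.DiGraph.add_edge` creates any missing endpoint as a new node without attributes, a mistyped id would not fail until something later reads that node's `name`. A minimal sketch of checking the ids up front, using only dictionaries, is shown below; `resolve_triples` is a hypothetical helper and the data is a subset of the entities above.

```python
entities = [
    {"id": "entity_0", "name": "Apple Inc.", "type": "Organization"},
    {"id": "entity_1", "name": "Steve Jobs", "type": "Person"},
    {"id": "entity_6", "name": "Tim Cook", "type": "Person"},
]

relationships = [
    {"source": "entity_0", "target": "entity_1", "type": "founded_by"},
    {"source": "entity_6", "target": "entity_0", "type": "ceo_of"},
]


def resolve_triples(entities: list[dict], relationships: list[dict]) -> list[tuple[str, str, str]]:
    """Turn id-based relationships into (source name, relation, target name) triples."""
    names = {entity["id"]: entity["name"] for entity in entities}
    triples = []
    for rel in relationships:
        for endpoint in ("source", "target"):
            if rel[endpoint] not in names:
                raise KeyError(f"unknown entity id in {endpoint}: {rel[endpoint]!r}")
        triples.append((names[rel["source"]], rel["type"], names[rel["target"]]))
    return triples


triples = resolve_triples(entities, relationships)
for source_name, relation, target_name in triples:
    print(f"  {source_name} --[{relation}]--> {target_name}")
```

Looking names up in a dictionary keyed by `id` also avoids depending on the numeric suffix of the id matching the entity's position in the list.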


## 📊 Step 5: Visualize and Analyze

Finally, we'll summarize the graph's structure by counting entities and relationships by type. The cell below instantiates `KGVisualizer` but prints text statistics only; no figure is rendered here.


In [6]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()

print(f"  Total entities: {len(kg.nodes)}")
print(f"  Total relationships: {len(kg.edges)}")

entity_types = {}
for node_id in kg.nodes():
    entity_type = kg.nodes[node_id]['type']
    entity_types[entity_type] = entity_types.get(entity_type, 0) + 1

for etype, count in entity_types.items():
    print(f"  - {etype}: {count}")

rel_types = {}
for _, _, data in kg.edges(data=True):
    rel_type = data.get('type', 'unknown')
    rel_types[rel_type] = rel_types.get(rel_type, 0) + 1

for rtype, count in rel_types.items():
    print(f"  - {rtype}: {count}")

# Cleanup
if sample_file.exists():
    sample_file.unlink()


  Total entities: 7
  Total relationships: 6
  - Organization: 1
  - Person: 4
  - Date: 1
  - Location: 1
  - founded_by: 3
  - founded_in: 1
  - located_in: 1
  - ceo_of: 1
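
To get an actual picture of a small graph without relying on any particular visualizer API, one option is to emit Graphviz DOT text and render it with an external tool. The sketch below is a standard-library-only illustration, not how `KGVisualizer` works; `to_dot` is a hypothetical helper and the data is a subset of the graph built above.

```python
def to_dot(nodes: dict[str, str], edges: list[tuple[str, str, str]]) -> str:
    """Serialize nodes (id -> label) and edges (source, target, relation) as DOT text."""

    def quote(value: str) -> str:
        # DOT double-quoted strings only need embedded double quotes escaped.
        return '"' + value.replace('"', '\\"') + '"'

    lines = ["digraph KG {"]
    for node_id, label in nodes.items():
        lines.append(f"  {quote(node_id)} [label={quote(label)}];")
    for source, target, relation in edges:
        lines.append(f"  {quote(source)} -> {quote(target)} [label={quote(relation)}];")
    lines.append("}")
    return "\n".join(lines)


nodes = {
    "entity_0": "Apple Inc.",
    "entity_5": "Cupertino, California",
    "entity_6": "Tim Cook",
}
edges = [
    ("entity_0", "entity_5", "located_in"),
    ("entity_6", "entity_0", "ceo_of"),
]

dot_text = to_dot(nodes, edges)
print(dot_text)
```

If Graphviz is installed, saving this text as `kg.dot` and running `dot -Tpng kg.dot -o kg.png` should produce an image of the graph.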
