# Ontology-Aware Knowledge Extraction: From Text to Knowledge Graph

## Overview

This notebook demonstrates **ontology-driven entity extraction** using large language models (LLMs) to transform unstructured text into structured RDF knowledge graphs. Unlike traditional Named Entity Recognition (NER) systems, this approach leverages the semantic understanding of modern LLMs combined with formal ontologies to achieve **concept-aware extraction**.

## The Challenge: Ambiguous Terms in Unstructured Text

The corpus (`jaguar_corpus.txt`) contains information about multiple entities that share the word "Jaguar":
- 🐆 **Wildlife jaguars** (Panthera onca) - the big cat species
- 🚗 **Jaguar cars** (E-Type, XK-E) - luxury automobiles
- 🎸 **Fender Jaguar** - electric guitars

Traditional keyword-based extraction would indiscriminately extract all "Jaguar" mentions, creating noise and confusion in our knowledge graph.
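
To make the problem concrete, here is a minimal sketch of naive keyword matching. The three sentences are hypothetical stand-ins for lines of `jaguar_corpus.txt`, and `keyword_matches` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical sentences standing in for lines of jaguar_corpus.txt.
sample_sentences = [
    "El Jefe is a male jaguar photographed in the mountains of Arizona.",
    "The Jaguar E-Type was unveiled in 1961.",
    "The Fender Jaguar has a short-scale neck.",
]

def keyword_matches(sentences, keyword="jaguar"):
    """Return every sentence that mentions the keyword, regardless of sense."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

matches = keyword_matches(sample_sentences)
for sentence in matches:
    print(sentence)
```

All three sentences match, because the keyword alone carries no information about whether the text is describing an animal, a car, or a guitar.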

## The Solution: Ontology-Driven Knowledge Representation

By providing the LLM with our **jaguar conservation ontology** (`jaguar_ontology.ttl`), the model understands:
- The **domain context** (wildlife conservation)
- The **class hierarchy** (Jaguar → Animal → Species)
- The **properties and relationships** (hasGender, occursIn, monitoredByOrg)
- The **semantic structure** of our knowledge domain

The LLM uses this ontological understanding to:
1. **Filter relevant entities** - Only extract wildlife-related jaguars
2. **Recognize relationships** - Identify connections between jaguars, locations, organizations, threats
3. **Generate aligned RDF** - Produce Turtle syntax that uses the classes and properties defined in our ontology
4. **Infer implicit relationships** - Connect entities based on contextual understanding

## Knowledge Graphs vs. Traditional Databases

**Knowledge Graphs (RDF/SPARQL)** enable:
- **Formal ontologies** that define domain semantics
- **Schema flexibility** with open-world assumptions
- **Reasoning capabilities** through RDFS/OWL inference
- **Semantic interoperability** across datasets
- **Context-aware querying** with SPARQL
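
As a rough illustration of what RDFS inference adds, the sketch below applies two standard RDFS entailment rules (subclass transitivity and type inheritance) to a handful of triples in plain Python. The `ex:` names are illustrative and are not taken from `jaguar_ontology.ttl`; a triple store such as GraphDB would perform this kind of reasoning internally.

```python
# Triples as plain (subject, predicate, object) tuples.
# The ex: names are illustrative, not taken from jaguar_ontology.ttl.
triples = {
    ("ex:Jaguar", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Animal", "rdfs:subClassOf", "ex:LivingThing"),
    ("ex:ElJefe", "rdf:type", "ex:Jaguar"),
}

def infer_types(triples):
    """Apply two RDFS rules until nothing new is derived:
    subClassOf is transitive, and instances inherit the superclasses of their class."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        new = set()
        # Rule 1: A subClassOf B and B subClassOf C implies A subClassOf C.
        for a, b in sub:
            for c, d in sub:
                if b == c:
                    new.add((a, "rdfs:subClassOf", d))
        # Rule 2: x type A and A subClassOf B implies x type B.
        for s, p, o in list(inferred):
            if p == "rdf:type":
                for a, b in sub:
                    if o == a:
                        new.add((s, "rdf:type", b))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

closure = infer_types(triples)
print(("ex:ElJefe", "rdf:type", "ex:Animal") in closure)
```

Only one type assertion was stated for `ex:ElJefe`, yet the closure also contains its membership in every superclass, which is the kind of implicit knowledge a query can then rely on.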

This notebook showcases how ontologies transform LLMs from simple text processors into **semantic knowledge miners** that understand concepts, not just patterns.

---

Let's begin by setting up our environment and loading the necessary data.


## Step 1: Initialize OpenAI Client

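First, we initialize the OpenAI client with API credentials from our environment variables. The smoke test in this cell uses the model named in the `OPENAI_RESPONSES_MODEL_ID` environment variable (defaulting to **GPT-4o**), while the extraction in Step 4 uses **GPT-5** for its advanced reasoning capabilities and deep semantic understanding.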

The model needs to:
- Understand complex ontology structures
- Parse relationships and properties
- Generate valid RDF Turtle syntax
- Reason about entity disambiguation


In [None]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Initialize the OpenAI client with the key loaded from the environment
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)

print("OpenAI client initialized successfully!")

# Smoke test: send a trivial prompt to confirm the credentials work
messages = [
    {"role": "user", "content": "Tell me a coding joke"}
]

response = client.chat.completions.create(
    model=os.getenv("OPENAI_RESPONSES_MODEL_ID", "gpt-4o"),
    messages=messages
)

# Print the response content
print(response.choices[0].message.content)


## Step 2: Load the Jaguar Conservation Ontology

We load the **formal ontology** that defines our domain model. This ontology contains:

- **Classes**: `Jaguar`, `Habitat`, `ConservationEffort`, `Organization`, `Threat`, etc.
- **Properties**: `hasGender`, `occursIn`, `monitoredByOrg`, `facesThreat`, etc.
- **Relationships**: How entities connect (e.g., jaguars → locations → organizations)
- **Data types**: String labels, dates, boolean values, etc.

The ontology serves as the **semantic blueprint** that guides the LLM's extraction process. It defines what concepts are relevant to our domain and how they should be structured.


In [None]:
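file_path = 'data/jaguar_ontology.ttl'
# Turtle documents are UTF-8; read explicitly so the result does not depend on the platform default
with open(file_path, 'r', encoding='utf-8') as f:
    ontology = f.read()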

print(f"Successfully loaded ontology from {file_path}. Content length: {len(ontology)} characters.")
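
Before sending the ontology to the model, it can help to check what it declares. The sketch below is a line-based heuristic, not a Turtle parser: it assumes terms are declared in the simple form `:Term a owl:Class`, which may not match how `jaguar_ontology.ttl` is actually laid out. The `summarize_ontology` helper and the sample snippet are hypothetical; a dedicated RDF library would be the more robust choice.

```python
import re

# Matches simple declarations such as ":Jaguar a owl:Class ;".
# This is a line-based heuristic, not a Turtle parser.
DECLARATION = re.compile(
    r"^[ \t]*(\S+)\s+(?:a|rdf:type)\s+"
    r"(owl:Class|rdfs:Class|owl:ObjectProperty|owl:DatatypeProperty)\b",
    re.MULTILINE,
)

def summarize_ontology(turtle_text):
    """Group declared terms by their declared kind."""
    summary = {}
    for term, kind in DECLARATION.findall(turtle_text):
        summary.setdefault(kind, []).append(term)
    return summary

# Hypothetical snippet; the real file may use different prefixes and layout.
sample = """
:Jaguar a owl:Class ;
    rdfs:subClassOf :Animal .
:occursIn a owl:ObjectProperty .
:hasGender a owl:DatatypeProperty .
"""

summary = summarize_ontology(sample)
for kind, terms in summary.items():
    print(kind, terms)
```

Calling `summarize_ontology(ontology)` on the loaded file would give a quick view of the vocabulary the LLM is being asked to align with, under the layout assumption stated above.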

## Step 3: Load the Text Corpus

Now we load the unstructured text corpus containing mixed content about:
- **Wildlife jaguars**: Conservation efforts, individual animals (El Jefe, Macho B), habitats, threats
- **Jaguar cars**: E-Type models, specifications, history
- **Fender Jaguar guitars**: Musical instruments, production details

This mixed-content corpus simulates a real-world scenario where domain-specific information is embedded within noisy, multi-domain text sources.

**The key question**: How do we extract ONLY the wildlife-related information while ignoring cars and guitars?


In [None]:
file_path = 'data/jaguar_corpus.txt'
# Assumes the corpus is saved as UTF-8
with open(file_path, 'r', encoding='utf-8') as f:
    corpus = f.read()

print(f"Successfully loaded text from {file_path}. Content length: {len(corpus)} characters.")

## Step 4: Ontology-Aware Extraction with GPT-5

### The Magic: Concept Understanding Through Formal Semantics

This is where **ontology-driven extraction** happens. We provide GPT-5 with:
1. The **ontology** (semantic structure)
2. The **corpus** (unstructured text)
3. Instructions to extract entities and relationships that **align with the ontology**

### What Makes This Intelligent?

The LLM doesn't just pattern-match "jaguar" keywords. Instead, it:
- **Understands the domain** from the ontology classes and properties
- **Disambiguates entities** based on semantic context (wildlife vs. cars vs. guitars)
- **Extracts ONLY relevant information** about wildlife jaguars
- **Generates valid RDF Turtle** that conforms to our ontology structure
- **Infers relationships** between entities based on contextual clues

### Processing Time

⏱️ **Expected duration**: 2-4 minutes (GPT-5 uses high reasoning effort for deep semantic analysis)

### Output

The result will be **RDF Turtle code** that can be imported into GraphDB using "Import → Text snippet". Because the model is instructed to use the classes and properties from our ontology, the imported data should align with it and be ready for SPARQL queries.

In [None]:
# Build the prompt: ontology first, then instructions, then the corpus

prompt = f"""
Given this ontology:
{ontology}

Extract relevant named entities, their relations, and related information from this text. Think deeply and analyze all information in the relevant text thoroughly. Try to infer relevant relationships between entities if they are not directly mentioned in the text. Finally, generate RDF Turtle code that aligns with the ontology for the found entities and relationships. Make sure to give all entities a relevant rdfs:label.

{corpus}

"""

# Send the prompt to GPT-5 with high reasoning effort
messages = [
    {"role": "user", "content": prompt}
]

response = client.chat.completions.create(
    model="gpt-5",
    messages=messages,
    max_completion_tokens=20000,
    reasoning_effort="high"
)

# Print the response content
print(response.choices[0].message.content)
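
The model may wrap its Turtle output in a Markdown code fence or surround it with commentary, which would have to be removed before importing. The sketch below shows one way to isolate the Turtle text; `extract_turtle` is a hypothetical helper, and whether the reply is fenced at all is an assumption about the model's behavior.

```python
import re

# Three backticks, built this way so the marker does not appear literally in this example.
FENCE_MARK = "`" * 3
# One fenced block, with an optional language tag after the opening fence.
FENCE = re.compile(FENCE_MARK + r"[A-Za-z]*[ \t]*\n(.*?)" + FENCE_MARK, re.DOTALL)

def extract_turtle(response_text):
    """Return the contents of the first fenced block, or the whole text if unfenced."""
    match = FENCE.search(response_text)
    body = match.group(1) if match else response_text
    return body.strip() + "\n"

# Hypothetical reply, used only to exercise the helper.
sample_reply = (
    "Here is the graph:\n"
    + FENCE_MARK + "turtle\n"
    + "@prefix ex: <http://example.org/> .\n"
    + "ex:ElJefe a ex:Jaguar .\n"
    + FENCE_MARK + "\n"
    + "Let me know if you need more."
)

turtle_text = extract_turtle(sample_reply)
print(turtle_text)
```

In the notebook, the same helper could be applied to `response.choices[0].message.content` and the result written to a `.ttl` file or pasted into GraphDB.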

---

## Why This Requires RDF and Formal Ontologies

### The Limitations of Labeled Property Graphs (LPG)

This ontology-driven extraction approach **cannot be replicated** with LPG databases like Neo4j for fundamental architectural reasons:

### 1. **No Formal Ontologies**

**LPG databases lack formal ontology support:**
- Labels and relationship types are just **strings**, not semantically defined concepts
- No formal class hierarchies (RDFS/OWL)
- No property domain/range definitions
- No reasoning or inference capabilities

**Example:** In Neo4j, you might have:
```cypher
(:Jaguar {name: "El Jefe"})-[:OCCURS_IN]->(:Location {name: "Arizona"})
```

But the LLM has no formal structure to understand:
- What a `Jaguar` node represents (could be a car, guitar, or animal)
- What properties `Jaguar` should have
- What relationships are valid
- How concepts relate hierarchically

### 2. **No Semantic Standards**

**RDF provides W3C standards:**
- **RDFS** (Resource Description Framework Schema) - defines classes, properties, hierarchies
- **OWL** (Web Ontology Language) - enables reasoning and inference
- **SHACL** - validates data shapes
- **SKOS** - manages vocabularies and taxonomies

**LPG has:**
- Vendor-specific schemas (Neo4j, TigerGraph, etc.)
- No formal semantics
- No standardized reasoning
- No interoperability guarantees

### 3. **Knowledge Representation vs. Data Storage**

| Aspect | RDF/SPARQL (Knowledge Graphs) | LPG (Neo4j, etc.) |
|--------|-------------------------------|-------------------|
| **Schema** | Formal ontologies (RDFS/OWL) | Informal schemas |
| **Semantics** | Explicit, machine-readable | Implicit, application-level |
| **Reasoning** | Built-in (RDFS/OWL inference) | Application-specific |
| **Standards** | W3C standards (RDF, SPARQL) | Vendor-specific |
| **Interoperability** | Universal (URIs, namespaces) | Limited |
| **Open World** | Yes (absence ≠ false) | No (closed world) |
| **LLM Guidance** | Rich semantic structure | Just node/edge labels |

### 4. **Why LLMs Need Ontologies**

When extracting knowledge, the LLM requires:

✅ **With RDF Ontologies:**
- Formal class definitions (`ont:Jaguar rdfs:subClassOf ont:Animal`)
- Property domains and ranges (`ont:hasGender rdfs:domain ont:Jaguar ; rdfs:range xsd:string`)
- Relationship semantics (`ont:monitoredByOrg rdfs:range ont:Organization`)
- Hierarchical understanding (taxonomies)
- Validation rules (cardinality, data types)

❌ **With LPG:**
- Just strings: `"Jaguar"`, `"OCCURS_IN"`, `"Location"`
- No formal meaning
- No hierarchical structure
- No validation constraints
- No way to distinguish domain concepts from noise
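
To illustrate how declared domains and ranges can separate domain concepts from noise, here is a minimal sketch that checks extracted triples against a small schema. Note the simplification: RDFS itself uses `rdfs:domain` and `rdfs:range` to infer types rather than to reject data, so this sketch treats them as SHACL-style checks. The schema entries, the `ex:` instances, and `check_triples` are all illustrative assumptions, not contents of `jaguar_ontology.ttl`.

```python
# Illustrative schema: property -> (expected subject class, expected object class).
# The domain for ont:monitoredByOrg is assumed for this example.
SCHEMA = {
    "ont:monitoredByOrg": ("ont:Jaguar", "ont:Organization"),
}

def check_triples(triples, types, schema):
    """Return triples whose property is unknown or whose subject/object type conflicts with the schema."""
    problems = []
    for s, p, o in triples:
        if p not in schema:
            problems.append((s, p, o, "unknown property"))
            continue
        domain, range_ = schema[p]
        if types.get(s) != domain:
            problems.append((s, p, o, f"subject is not a {domain}"))
        elif types.get(o) != range_:
            problems.append((s, p, o, f"object is not a {range_}"))
    return problems

# Hypothetical instances and their asserted classes.
types = {
    "ex:ElJefe": "ont:Jaguar",
    "ex:EType": "ex:Car",
    "ex:SomeOrg": "ont:Organization",
}
extracted = [
    ("ex:ElJefe", "ont:monitoredByOrg", "ex:SomeOrg"),
    ("ex:EType", "ont:monitoredByOrg", "ex:SomeOrg"),
]

problems = check_triples(extracted, types, SCHEMA)
for problem in problems:
    print(problem)
```

The triple about the car is flagged because its subject is not typed as `ont:Jaguar`, while the triple about the animal passes.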

### 5. **The "Jaguar Problem" Revisited**

**Without formal ontologies**, an LLM extracting to LPG would:
- Struggle to disambiguate jaguars (cars vs. animals)
- Have no semantic guidance on valid properties
- Create inconsistent graph structures
- Require extensive post-processing and validation

**With RDF ontologies**, the LLM:
- Understands domain semantics from formal definitions
- Generates semantically valid, structured data
- Automatically filters irrelevant entities
- Produces interoperable, standards-compliant knowledge

---

## Conclusion: Knowledge Graphs Enable Semantic AI

This notebook demonstrates how **formal ontologies** (RDFS/OWL) transform LLMs from text processors into **semantic knowledge miners**. By grounding extraction in formal semantics, we achieve:

- 🎯 **Precision** - Extract only domain-relevant entities
- 🔗 **Structure** - Generate properly connected knowledge graphs
- 🧠 **Intelligence** - Enable reasoning and inference
- 🌐 **Interoperability** - Create standards-compliant data
- 🔍 **Queryability** - Support complex SPARQL queries

**RDF + Ontologies + LLMs** = The future of intelligent knowledge extraction

LPG databases are excellent for operational graph queries, but they fundamentally lack the **semantic infrastructure** required for ontology-driven knowledge representation. For true knowledge graphs that machines can understand and reason over, RDF and formal ontologies are essential.
