# Dark Souls 3: Knowledge Graph Construction

This notebook presents a step-by-step process for constructing a knowledge graph from in-game textual data extracted from *Dark Souls III*.

Dark Souls is known for its indirect, cryptic narrative style. Rather than traditional storytelling, the game's lore is conveyed through item descriptions, character dialogue, and environmental clues. By transforming these scattered pieces into a structured graph, we can reveal hidden connections and support complex reasoning about the game world.

The dataset used here was extracted from the game files and made available by the [DarkSouls3.TextViewer project](https://github.com/mrexodia/DarkSouls3.TextViewer), which aggregates in-game texts such as item descriptions and NPC dialogue.

This notebook includes:
- Data extraction and preparation from the JSON dump
- Entity and relationship extraction using language models
- Knowledge graph construction using the extracted triples
- Graph exploration and insight generation
- Graph-based question answering via queries


## 1. Setup and Imports

In [None]:
import json
from transformers import GPT2TokenizerFast
import random
import time
import openai
from tqdm import tqdm
import ast
import re
import pandas as pd
import os
import networkx as nx
from neo4j import GraphDatabase
from dotenv import load_dotenv

## 2. Loading the Original JSON Dataset

The full dataset is provided as a JSON file (`ds3.json`) containing raw textual content extracted directly from the game files of *Dark Souls III*, including item descriptions, spell texts, equipment lore, and NPC dialogues.

To focus the analysis, the following steps filter the dataset down to four categories: **accessories**, **magic**, **weapons**, and **conversations**. These typically contain the richest lore and the most entity interactions, so the subset is more concise and meaningful for relationship extraction.

The filtered texts are saved as `ds3_clean_texts.json`, which serves as the foundation for the knowledge extraction and graph construction steps that follow.

In [2]:
with open("ds3.json", "r", encoding="utf-8") as f:
    data = json.load(f)

eng_data = data["languages"]["engUS"]

## 3. Selecting Lore-Rich Categories

To ensure the graph captures meaningful narrative relationships, I restrict the scope to the following categories:

- **accessory**
- **magic**
- **weapon**
- **conversations**

These groups are known to contain dense lore elements and references to entities and events within the Dark Souls universe.


In [3]:
categories_to_use = ["accessory", "magic", "weapon", "conversations"]

## 4. Extracting and Structuring Text Entries

For each entry in the selected categories, I extract the `name`, `description`, and `knowledge` fields when available. These components contain the textual content that later serves as input for relationship extraction.

In [4]:
clean_data = []

for category in categories_to_use:
    for entry in eng_data[category].values():
        name = entry.get("name", "").strip()
        description = entry.get("description", "").strip()
        knowledge = entry.get("knowledge", "").strip()
        if name or description or knowledge:
            clean_data.append({
                "category": category,
                "name": name,
                "description": description,
                "knowledge": knowledge
            })
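
The loop above assumes that each category maps entry IDs to dictionaries with optional `name`, `description`, and `knowledge` fields. A minimal sketch of the same logic on a toy structure follows. The IDs, the texts, and the helper name `extract_entries` are hypothetical and not taken from the real dump.

```python
# Toy stand-in for data["languages"]["engUS"]; the IDs and texts are made up.
toy_eng_data = {
    "accessory": {
        "1": {"name": " Example Ring ", "description": "Boosts something", "knowledge": ""},
        "2": {"name": "", "description": "", "knowledge": ""},  # fully empty: skipped
    },
    "weapon": {
        "10": {"name": "Example Sword"},  # missing fields default to ""
    },
}

def extract_entries(eng_data, categories):
    """Collect non-empty entries from the given categories (hypothetical helper)."""
    entries = []
    for category in categories:
        # .get() tolerates a category that is absent from the dump
        for entry in eng_data.get(category, {}).values():
            fields = {key: (entry.get(key) or "").strip()
                      for key in ("name", "description", "knowledge")}
            if any(fields.values()):
                entries.append({"category": category, **fields})
    return entries

toy_entries = extract_entries(toy_eng_data, ["accessory", "weapon", "magic"])
print(len(toy_entries))  # 2
```

Unlike the cell above, this variant also tolerates fields whose value is `None` and categories that are missing entirely. Whether the real dump contains such cases is not known here.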

## 5. Saving the Cleaned Dataset

To streamline development and avoid reprocessing the full JSON file each time, I save the filtered and structured entries as `ds3_clean_texts.json`.


In [5]:
with open("ds3_clean_texts.json", "w", encoding="utf-8") as f:
    json.dump(clean_data, f, ensure_ascii=False, indent=2)

print(f"✅ Extracted and saved: {len(clean_data)} entries in ds3_clean_texts.json")


✅ Extracted and saved: 3473 entries in ds3_clean_texts.json


## 6. Inspecting the Cleaned Dataset

I load the cleaned dataset and display a few samples to verify encoding, formatting, and overall relevance before proceeding to language model extraction.


In [6]:
with open("ds3_clean_texts.json", "r", encoding="utf-8") as f:
    clean_data = json.load(f)

print(f"Total entries: {len(clean_data)}")
clean_data[:3]  # inspect first few items

Total entries: 3473


[{'category': 'accessory',
  'name': "Havel's Ring",
  'description': 'Boosts maximum equipment load',
  'knowledge': "This ring was named after Havel the Rock,\nLord Gwyn's old battlefield compatriot.\n\nHavel's men wore the ring to express faith in\ntheir leader and to carry a heavier load."},
 {'category': 'accessory',
  'name': 'Red Tearstone Ring',
  'description': 'Boosts ATK while HP is low',
  'knowledge': 'The rare gem called tearstone has the\nuncanny ability to sense imminent death.\n\nThis red tearstone from Carim boosts the\nattack of its wearer when in danger.'},
 {'category': 'accessory',
  'name': 'Darkmoon Blade Covenant Ring',
  'description': "Answer Dark Sun Gwyndolin's summoning",
  'knowledge': "Ring granted to those bound by the\nDarkmoon Blade covenant.\nAnswer Dark Sun Gwyndolin's summoning.\n\nGwyndolin, all too aware of his repulsive,\nfrail appearance, created the illusion of \na sister Gwynevere, who helps him guard\nover Anor Londo. An unmasking of these\n

## 7. Token Estimation and Sampling Strategy

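Before running large-scale extraction with GPT-4o, I estimate token usage to control cost and processing time. The full dataset contains approximately 268,000 tokens as counted with the GPT-2 tokenizer. GPT-4o uses a different tokenizer, so this figure is only a rough estimate. At the GPT-4o prices assumed here ($0.005 per 1,000 input tokens and $0.015 per 1,000 output tokens), the input alone would cost about $1.34, and the total including output would be around $2.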

To strike a balance between coverage, cost, and iteration speed, I select a representative sample of 100 entries for initial experimentation. This approach keeps the total cost under $1 while preserving the diversity of item types in the dataset.
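
The arithmetic behind the estimate can be sketched as follows. The prices are the ones quoted above. The output-token volume is not measured anywhere in this notebook, so the 50,000 figure is purely an assumption for illustration, and the fixed prompt text sent with every entry is ignored.

```python
# Prices quoted in the text above (USD per 1,000 tokens).
INPUT_PRICE_PER_1K = 0.005
OUTPUT_PRICE_PER_1K = 0.015

def estimate_cost(input_tokens, output_tokens):
    """Return the estimated API cost in USD."""
    input_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K
    output_cost = (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

# 268,000 input tokens comes from the text; the output volume is an assumption.
input_only = estimate_cost(268_000, 0)
with_output = estimate_cost(268_000, 50_000)  # assumed 50,000 output tokens

print(f"Input only: ${input_only:.2f}")            # Input only: $1.34
print(f"With assumed output: ${with_output:.2f}")  # With assumed output: $2.09
```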


In [None]:
# Load GPT2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

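def count_tokens(texts):
    """Sum the token counts of the combined name, description, and knowledge fields."""
    total_tokens = 0
    for entry in texts:  # iterate over the argument, not the global clean_data
        combined = f"{entry['name']}\n{entry['description']}\n{entry['knowledge']}"
        tokens = len(tokenizer.encode(combined))
        total_tokens += tokens
    return total_tokens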

total_tokens = count_tokens(clean_data)
print(f"Estimated total tokens (input only): {total_tokens:,}")


Estimated total tokens (input only): 268,105


## 8. Sampling Entries for Triple Extraction

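I randomly sample 100 entries from the cleaned dataset using a fixed seed. Plain random sampling does not guarantee that every category (accessories, spells, weapons, NPC dialogue) is represented, but with 100 draws the sample tends to mirror the dataset's overall category mix.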

These entries are saved as `sampled_lore_entries.json` and serve as input for the GPT-4o-based triple extraction in the next step.

In [16]:
# Set seed for reproducibility
random.seed(42)

# Sample 100 entries from the cleaned dataset
sampled_data = random.sample(clean_data, 100)

# Save to file for processing with the OpenAI API
with open("sampled_lore_entries.json", "w", encoding="utf-8") as f:
    json.dump(sampled_data, f, indent=2, ensure_ascii=False)
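
If category coverage needs to be guaranteed rather than left to chance, a stratified draw is one option. The sketch below is not the method used in this notebook. The helper name `stratified_sample` and the toy data are hypothetical, and because each category's share is rounded, the result contains roughly (not exactly) `n` entries.

```python
import random
from collections import defaultdict

def stratified_sample(entries, n, seed=42):
    """Sample about n entries, spread across categories in proportion to their size."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for entry in entries:
        by_category[entry["category"]].append(entry)

    sample = []
    for items in by_category.values():
        # At least one entry per category, otherwise a proportional share
        k = max(1, round(n * len(items) / len(entries)))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample

# Toy data: a heavily skewed category mix (made up for illustration)
toy = ([{"category": "weapon"}] * 90
       + [{"category": "magic"}] * 8
       + [{"category": "accessory"}] * 2)
picked = stratified_sample(toy, 10)
print(sorted({entry["category"] for entry in picked}))
```

With the notebook's data, the call would be `stratified_sample(clean_data, 100)`.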

## 9. Extracting Triples Using GPT-4o

To transform the item descriptions into structured knowledge, I use **GPT-4o** via the OpenAI API.

For each entry — containing a `name`, `description`, and optional `knowledge` field — a structured prompt is sent to the model, asking it to extract **1 to 3 subject–predicate–object triples** that represent meaningful relationships between entities.

Each extracted triple follows the format:

("Entity A", "relation", "Entity B")

This step automates the conversion of unstructured game lore into structured data, making it suitable for graph-based analysis and visualization.

The extracted triples are saved to `extracted_triples.json` and will be used in the next step to build the knowledge graph.

> ℹ️ **Note:** To re-run this step, you must provide a valid OpenAI API key. The notebook is configured to read the key from an environment variable (`OPENAI_API_KEY`) loaded via a `.env` file.

However, this step has already been executed, and the output file `extracted_triples.json` is included in the repository. There’s no need to re-run the extraction unless you wish to modify the prompt or input data.

In [None]:
# 1. Initialize OpenAI client
load_dotenv()
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 2. Choose the model
MODEL = "gpt-4o"

# 3. Verify that the model exists in the account
available_models = [m.id for m in client.models.list().data]
if MODEL not in available_models:
    raise ValueError(f"Model '{MODEL}' not available in your account. Available models are:\n{available_models}")

print(f"Model '{MODEL}' is available. Proceeding with extraction.")

# 4. Load sampled entries
with open("sampled_lore_entries.json", "r", encoding="utf-8") as f:
    entries = json.load(f)

print(f"Loaded {len(entries)} entries for triple extraction.")

# 5. Define the system prompt
system_prompt = (
    "You are a helpful assistant that extracts meaningful relationships from fantasy game item descriptions."
)

# 6. Function to generate user prompts
def make_user_prompt(entry):
    content = f"{entry['name']}\n{entry['description']}\n{entry['knowledge']}".strip()
    return (
        f"""Given the following game lore text, extract 1 to 3 subject–predicate–object triples.
Each triple should describe a meaningful relationship between entities mentioned in the text.

Return your answer strictly as a Python list of triples like this:
[
  ("Entity A", "relation", "Entity B"),
  ...
]

Text:
\"\"\"
{content}
\"\"\"
"""
    )

# 7. Initialize results list
results = []

# 8. Loop through entries and call OpenAI API
for entry in tqdm(entries, desc="Extracting triples"):
    try:
        user_prompt = make_user_prompt(entry)

        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.2
        )

        output = response.choices[0].message.content.strip()

        results.append({
            "input": entry,
            "triples": output,
            "error": None
        })

        time.sleep(1)  # Optional delay to avoid rate limits

    except Exception as e:
        print(f"Error on entry '{entry.get('name', 'N/A')}': {e}")
        results.append({
            "input": entry,
            "triples": None,
            "error": str(e)
        })

# 9. Save results to JSON
with open("extracted_triples.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print("Extraction completed. Results saved to 'extracted_triples.json'.")


Model 'gpt-4o' is available. Proceeding with extraction.
Loaded 100 entries for triple extraction.


Extracting triples: 100%|██████████| 100/100 [04:12<00:00,  2.52s/it]

Extraction completed. Results saved to 'extracted_triples.json'.
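
The loop above records a failed call and moves on. For transient errors such as rate limits, retrying with exponential backoff is a common alternative. The helper below is a generic sketch, not part of the OpenAI client. The name `call_with_retries` and its default values are assumptions.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

In the extraction loop, the request could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, with the same arguments as in the cell above. The `sleep` parameter exists only so that the waiting behavior can be replaced in tests.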





## 10. Inspecting Extracted Triples

Before proceeding to graph construction, I inspect a random sample of the extracted triples to verify their **quality and consistency**.

This step ensures that GPT-4o correctly followed the prompt structure, extracted **meaningful subject–predicate–object relationships**, and returned **syntactically valid outputs**.

By reviewing a few examples, I can check for common issues such as:

- Formatting errors  
- Hallucinated or irrelevant entities  
- Incomplete or malformed triples  

This verification step helps determine whether any cleaning or adjustments are needed before building the graph.
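
Manual review can be complemented by an automated check that flags formatting errors and malformed triples. The sketch below assumes the raw model output is a Python list literal, optionally wrapped in a markdown code fence, as requested by the prompt. The function name `find_issues` is hypothetical, and the check cannot detect hallucinated entities.

```python
import ast

FENCE = "`" * 3  # markdown code fence marker

def find_issues(raw_output):
    """Return a list of problems found in one raw model response."""
    text = (raw_output or "").strip()
    # Drop a surrounding markdown fence, including its language tag line
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = "\n".join(text.splitlines()[1:-1])
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        return ["not a valid Python literal"]
    if not isinstance(parsed, list):
        return ["top-level value is not a list"]

    issues = []
    for i, item in enumerate(parsed):
        if not (isinstance(item, tuple) and len(item) == 3):
            issues.append(f"item {i} is not a 3-tuple")
        elif not all(isinstance(part, str) and part.strip() for part in item):
            issues.append(f"item {i} has an empty or non-string field")
    return issues
```

Running `find_issues(entry["triples"])` over every entry in `extracted` and counting the non-empty results gives a quick measure of how often the model deviated from the requested format.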


In [19]:
# Load the extracted triples
with open("extracted_triples.json", "r", encoding="utf-8") as f:
    extracted = json.load(f)

# Sample 5 random entries
sample = random.sample(extracted, 5)

# Display the results
for i, entry in enumerate(sample, 1):
    print(f"Sample {i}")
    print("-" * 80)
    print(f"Item Name: {entry['input'].get('name', 'N/A')}")
    print(f"Description: {entry['input'].get('description', '').strip()}")
    print(f"Knowledge: {entry['input'].get('knowledge', '').strip()}")
    print("\nExtracted Triples:")
    print(entry.get('triples', 'No triples extracted'))
    print("\nError:" if entry.get('error') else "\nNo errors.")
    if entry.get('error'):
        print(entry.get('error'))
    print("=" * 80)
    print()


Sample 1
--------------------------------------------------------------------------------
Item Name: Blood Red and White Shield
Description: 
Knowledge: Standard round wooden shield. It features a striking red and white design.

Wooden shields are light, manageable, and offer relatively high magic absorption.

Skill: Parry
Repel an attack at the right time to follow up with a critical hit. Works while equipped in either hand.

Extracted Triples:
```python
[
  ("Blood Red and White Shield", "features", "red and white design"),
  ("Wooden shields", "offer", "high magic absorption"),
  ("Skill: Parry", "allows", "repelling an attack to follow up with a critical hit")
]
```

No errors.

Sample 2
--------------------------------------------------------------------------------
Item Name: Fire Crest Shield
Description: 
Knowledge: Shield of the Knights of Blue, engraved with a crest.

One of the enchanted blue shields. The Crest
Shield in particular greatly reduces magic damage.

Skill: Parry

## 11. Parsing and Cleaning Extracted Triples

Before constructing the knowledge graph, I parse the extracted triples and apply light cleaning to improve **entity consistency**.

The cleaning process includes:

- Removing generic prefixes like `"Skill:"` or `"Skill"` from entity names  
- Simplifying overly verbose subjects and objects when possible  
- Trimming whitespace and normalizing text formats  

These steps help reduce duplication and enhance the quality of nodes and edges in the final graph.


In [24]:
# Clean the entity names
def clean_entity(text):
    text = text.strip()
    # Drop a leading "Skill:" or standalone "Skill" (case-insensitive);
    # the word boundary keeps names such as "Skilled ..." intact
    text = re.sub(r"^skill\b:?\s*", "", text, flags=re.IGNORECASE)
    return text

# Clean individual triples
def clean_triple(triple):
    subject, predicate, obj = triple
    subject = clean_entity(subject)
    obj = clean_entity(obj)
    return (subject, predicate.strip(), obj)

# Remove markdown code block markers if present
def clean_text_block(text):
    if text is None:
        return ""
    text = text.strip()

    if text.startswith("```") and text.endswith("```"):
        lines = text.splitlines()
        # Remove first line (```python or ```) and last line (```)
        text = "\n".join(lines[1:-1]).strip()

    return text

# Parse the triples using ast and regex as fallback
def parse_triples(text):
    text = clean_text_block(text)

    try:
        triples = ast.literal_eval(text)
        if isinstance(triples, list):
            return [clean_triple(t) for t in triples if isinstance(t, tuple) and len(t) == 3]
    except Exception:
        pass

    # Fallback — regex parsing
    pattern = r'\("(.+?)",\s*"(.*?)",\s*"(.*?)"\)'
    matches = re.findall(pattern, text)

    if matches:
        return [clean_triple(t) for t in matches]

    return []

# Apply parsing to all entries
parsed_triples = []

for entry in extracted:
    raw_output = entry.get('triples')
    if raw_output and not entry.get('error'):
        triples = parse_triples(raw_output)
        if triples:
            parsed_triples.extend(triples)
        else:
            print(f"Failed to parse entry: {entry['input'].get('name', 'N/A')}")

print(f"Parsed and cleaned {len(parsed_triples)} triples.")


Parsed and cleaned 300 triples.
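
The cleaning above does not merge entities that differ only in case, spacing, or a leading article. One possible extension is to compare entities through a normalized key. This is a sketch of that idea rather than part of the notebook's pipeline, and the helper names `canonical_key` and `merge_variants` are hypothetical.

```python
import re

def canonical_key(entity):
    """Build a comparison key: lowercase, single spaces, no leading article."""
    key = re.sub(r"\s+", " ", entity).strip().casefold()
    return re.sub(r"^(the|a|an) ", "", key)

def merge_variants(entities):
    """Map each canonical key to the first surface form seen for it."""
    canonical = {}
    for entity in entities:
        canonical.setdefault(canonical_key(entity), entity)
    return canonical

variants = ["Wooden shields", "wooden  shields", "the Wooden Shields"]
print(merge_variants(variants))  # {'wooden shields': 'Wooden shields'}
```

The key is used only for comparison. The first surface form encountered is kept as the node name, so proper nouns retain their original capitalization.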


## 12. Building the Knowledge Graph in NetworkX

I start constructing the knowledge graph using **NetworkX**, a Python library for graph analysis.

In this step:

- Each unique entity becomes a **node**  
- Each subject–predicate–object triple forms a **directed edge** between nodes, with the predicate as the **edge label**

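This allows for inspection of the graph structure, basic property computation (such as the number of nodes and edges), and preliminary exploration before deploying to a graph database. Note that `nx.DiGraph` stores at most one edge per ordered pair of nodes, so triples that share a subject and an object overwrite one another's label, and the edge count can be lower than the number of parsed triples. `nx.MultiDiGraph` keeps such parallel edges.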


In [None]:
# 1. Initialize a directed graph
G = nx.DiGraph()

# 2. Add edges from triples
for subj, rel, obj in parsed_triples:
    G.add_node(subj)
    G.add_node(obj)
    G.add_edge(subj, obj, label=rel)

# 3. Basic graph info
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")

# 4. Inspect some edges
print("Sample edges with relationships:")
for u, v, data in list(G.edges(data=True))[:10]:
    print(f"{u} -[{data['label']}]-> {v}")


Number of nodes: 406
Number of edges: 273
Sample edges with relationships:
Deep Crimson Parma -[is]-> standard round wooden shield
red paint -[symbolizes]-> blood of warriors
Wooden shields -[offer]-> relatively high magic absorption
Wooden shields -[offer]-> high magic absorption
Dagger -[used by]-> citizens of the Undead Commune
inner blade -[lined with]-> fish hook-like barbs
weapon -[had]-> ritualistic use
Bellowing Dragoncrest Ring -[greatly boosts]-> sorceries
Bellowing Dragoncrest Ring -[given to]-> those who are deemed fit to undertake the journey of discovery in Vinheim
Bellowing dragon -[symbolizes]-> the true nature of the consummate sorcerer
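
The cells above report 300 parsed triples but only 273 edges. A likely explanation is that a simple directed graph holds one edge per ordered (subject, object) pair, so repeated pairs collapse into a single edge. The sketch below counts how many triples would be merged this way. The helper name `collapsed_triples` and the toy triples are hypothetical.

```python
from collections import Counter

def collapsed_triples(triples):
    """Count triples that a simple directed graph would merge into an existing edge."""
    pair_counts = Counter((subj, obj) for subj, _, obj in triples)
    # Every extra triple on an already-seen (subject, object) pair is merged
    return sum(count - 1 for count in pair_counts.values())

toy_triples = [
    ("Ring", "boosts", "sorceries"),
    ("Ring", "greatly boosts", "sorceries"),  # same pair, different predicate
    ("Ring", "given to", "sorcerers"),
]
print(collapsed_triples(toy_triples))  # 1
```

Calling `collapsed_triples(parsed_triples)` would show whether repeated pairs account for the full difference of 27.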


## 13. Exporting the Graph to CSV

To document the graph structure and maintain a clear, reproducible dataset, I export the graph to two CSV files:

- A **nodes** file (`nodes.csv`), where each unique entity becomes a node  
- An **edges** file (`edges.csv`), where each subject–predicate–object triple becomes a directed relationship

These CSV files serve both as a backup of the structured graph and as an intermediate step for transparency. While the primary pipeline uploads the graph directly into Neo4j using Python (see the next section), having the CSVs available facilitates inspection, reproducibility, and potential reuse in other graph environments.


In [None]:
# 1. Export nodes
nodes = pd.DataFrame({
    "id": list(G.nodes),
    "label": "Entity"  # Assigning a generic label to all nodes
})

nodes.to_csv("nodes.csv", index=False, encoding="utf-8")
print(f"Exported {len(nodes)} nodes to 'nodes.csv'.")

# 2. Export edges
edges = pd.DataFrame([
    {
        "source": u,
        "target": v,
        "type": data["label"]  # The relationship label
    }
    for u, v, data in G.edges(data=True)
])

edges.to_csv("edges.csv", index=False, encoding="utf-8")
print(f"Exported {len(edges)} edges to 'edges.csv'.")


Exported 406 nodes to 'nodes.csv'.
Exported 273 edges to 'edges.csv'.
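
Because the CSV files are meant for reuse in other environments, a quick consistency check can be useful: every edge endpoint should appear in the nodes table. The sketch below uses only the standard library and the column names written above (`id`, `source`, `target`, `type`). The function name `dangling_edges` and the in-memory sample data are hypothetical.

```python
import csv
import io

def dangling_edges(nodes_csv, edges_csv):
    """Return edges whose source or target is missing from the nodes table."""
    node_ids = {row["id"] for row in csv.DictReader(nodes_csv)}
    return [
        (row["source"], row["type"], row["target"])
        for row in csv.DictReader(edges_csv)
        if row["source"] not in node_ids or row["target"] not in node_ids
    ]

# In-memory stand-ins for nodes.csv and edges.csv (contents are made up)
toy_nodes = io.StringIO("id,label\nA,Entity\nB,Entity\n")
toy_edges = io.StringIO("source,target,type\nA,B,knows\nA,C,owns\n")
print(dangling_edges(toy_nodes, toy_edges))  # [('A', 'owns', 'C')]
```

For the real files, pass handles opened with `open("nodes.csv", newline="", encoding="utf-8")` and the equivalent for `edges.csv`. An empty result means the two files are consistent.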


## 14. Loading the Graph into Neo4j AuraDB with Python

In this step, I connect to Neo4j AuraDB and upload the graph directly from the `nodes.csv` and `edges.csv` files.

The process includes:

- Establishing a secure connection to the Neo4j instance using the official Neo4j Python driver.
- Creating nodes from the `nodes.csv` file, with label `Entity` and property `id`.
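- Creating relationships from the `edges.csv` file, using the `source` and `target` columns to define the start and end nodes. Every relationship has the generic Neo4j type `RELATION`; the normalized value of the `type` column (uppercased, with spaces replaced by underscores) is stored in the relationship's `type` and `label` properties.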

This automated approach directly populates the graph database with the structured knowledge extracted from the Dark Souls item descriptions.


In [None]:
# 1. Load nodes and edges CSV
nodes = pd.read_csv("nodes.csv", dtype=str).fillna("")
edges = pd.read_csv("edges.csv", dtype=str).fillna("")

print(f"Loaded {len(nodes)} nodes and {len(edges)} edges.")

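# 2. Configure Neo4j connection
# Credentials are read from the environment (for example the same .env file
# used for OPENAI_API_KEY) instead of being hard-coded in the notebook.
# The NEO4J_* variable names are an assumed convention.
load_dotenv()
uri = os.getenv("NEO4J_URI", "neo4j+s://708e44c6.databases.neo4j.io")
username = os.getenv("NEO4J_USERNAME", "neo4j")
password = os.getenv("NEO4J_PASSWORD")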

driver = GraphDatabase.driver(uri, auth=(username, password))


# 3. Function to create nodes
def create_nodes(tx, nodes_df):
    query = """
    UNWIND $rows AS row
    MERGE (n:Entity {id: row.id})
    """
    data = [{"id": row["id"]} for _, row in nodes_df.iterrows()]
    tx.run(query, rows=data)


# 4. Function to create relationships
def create_relationships(tx, edges_df):
    query = """
    UNWIND $rows AS row
    MATCH (a:Entity {id: row.source})
    MATCH (b:Entity {id: row.target})
    MERGE (a)-[r:RELATION {type: row.rel_type}]->(b)
    SET r.label = row.rel_type
    """
    data = [
        {
            "source": row["source"],
            "target": row["target"],
            "rel_type": row["type"].upper().replace(" ", "_")
        }
        for _, row in edges_df.iterrows()
    ]
    tx.run(query, rows=data)


# 5. Push data to Neo4j
with driver.session() as session:
    print("Creating nodes...")
    session.execute_write(create_nodes, nodes)
    print("✅ Nodes created successfully.")

    print("Creating relationships...")
    session.execute_write(create_relationships, edges)
    print("✅ Relationships created successfully.")

driver.close()
print("✅ Graph loaded into Neo4j AuraDB successfully.")


Loaded 406 nodes and 273 edges.
Creating nodes...
✅ Nodes created successfully.
Creating relationships...
✅ Relationships created successfully.
✅ Graph loaded into Neo4j AuraDB successfully.
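
The cell above sends all rows to each `UNWIND` query in a single call, which is fine for a few hundred rows. For a graph built from the full dataset, splitting the rows into batches keeps each transaction small. The helper below is a generic sketch. The name `chunked` is hypothetical, and any particular batch size would be an assumption rather than a Neo4j requirement.

```python
def chunked(rows, size):
    """Yield consecutive slices of rows with at most size items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

print(list(chunked([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

Inside `create_nodes` or `create_relationships`, the single `tx.run(query, rows=data)` call could then become a loop such as `for batch in chunked(data, 1000): tx.run(query, rows=batch)`.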


## 15. Building an Interactive Graph App with Streamlit

To explore the knowledge graph interactively, I created a web application using Streamlit.

The app connects directly to the Neo4j AuraDB instance, retrieves nodes and relationships using Cypher queries, and renders an interactive graph using PyVis. Users can visually navigate the graph, inspect connections, and explore the Dark Souls lore.

This solution improves user experience by offering an intuitive and dynamic interface for graph exploration and discovery.

🔗 **Try the app here:** [Dark Souls Knowledge Graph Explorer](https://your-streamlit-cloud-link)
