# Neo4j: A Practical Introduction for Knowledge Graphs

## What is Neo4j?

**Neo4j** is a **graph database system**, not just a plotting or visualization library.

It consists of three tightly integrated parts:

* **Graph database**
  Stores data as **nodes** and **relationships**, optimized for graph traversal and querying.
* **Query language (Cypher)**
  A declarative language designed specifically for graphs (e.g. pattern matching).
* **Browser-based visualization tool**
  An interactive UI for exploring, querying, and debugging graphs.

### How Neo4j differs from plotting tools (e.g. Plotly, NetworkX drawing functions)

| Plotting tools         | Neo4j                    |
| ---------------------- | ------------------------ |
| Visualization-only     | Persistent database      |
| Static or client-side  | Server-based             |
| No query language      | Cypher for graph queries |
| No schema awareness    | Schema + metadata        |
| No traversal semantics | Native graph traversal   |

Neo4j is best thought of as **infrastructure**, not a plotting library.

---

## Execution model and security

* Neo4j runs as a **server process**
* Even for local use, it:

  * listens on ports
  * manages persistent data
  * enforces authentication
* Therefore, a **password is required**

  * this password protects the *database*, **not the operating system**
  * it is not a one-time startup password

Typical usage is:

* local execution
* SSH access
* port forwarding for browser visualization

---

## Installation on Ubuntu (minimal setup)

### 1. Install Java (required)

```bash
sudo apt update
sudo apt install -y openjdk-17-jdk
```

### 2. Install Neo4j (Community Edition)

```bash
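# apt-key is deprecated on current Ubuntu releases; store the key in a keyring instead
sudo mkdir -p /etc/apt/keyrings
wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/neotechnology.gpg
echo 'deb [signed-by=/etc/apt/keyrings/neotechnology.gpg] https://debian.neo4j.com stable 5' | sudo tee /etc/apt/sources.list.d/neo4j.list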
sudo apt update
sudo apt install -y neo4j
```

### 3. Start Neo4j

```bash
sudo systemctl enable neo4j
sudo systemctl start neo4j
```

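### 4. Set initial password

Run this before Neo4j is started for the first time; the initial password is only applied when the database is first initialized. If the server has already been started (as in step 3), log in with the default credentials `neo4j` / `neo4j` instead and change the password when prompted.

```bash
sudo neo4j-admin dbms set-initial-password YOUR_PASSWORD
```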

---

## Using Neo4j with Python and LangChain

LangChain provides a Python interface (`Neo4jGraph`) that allows:

* sending graphs to Neo4j
* schema introspection
* querying and reasoning over graphs
* visualization via Neo4j Browser

### Python connection example

```python
from langchain_neo4j import Neo4jGraph

graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="YOUR_PASSWORD"
)
```

At this point, `graph` **is the database connection**.

---

## APOC: Why it is required

LangChain relies on **APOC** (Awesome Procedures on Cypher) to:

* introspect the graph schema
* discover node labels and relationships
* call `apoc.meta.data()`

Without APOC, LangChain cannot initialize the graph interface.

---

## Installing APOC (Neo4j 5.x)

### ⚠️ Critical rule: version matching

> **APOC version must exactly match the Neo4j version (major + minor).**

Check your Neo4j version:

```bash
neo4j --version
```

Example:

```
neo4j 5.13.0
```
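
The version rule can be checked mechanically before downloading anything. This is a minimal sketch, assuming plain version strings such as `5.13.0`; the helper name `apoc_matches_neo4j` is hypothetical and not part of any Neo4j or APOC tooling.

```python
def apoc_matches_neo4j(neo4j_version: str, apoc_version: str) -> bool:
    """Return True when both versions share the same major and minor number."""
    # Compare only the first two dot-separated components (major, minor).
    return neo4j_version.split(".")[:2] == apoc_version.split(".")[:2]


print(apoc_matches_neo4j("5.13.0", "5.13.0"))  # True
print(apoc_matches_neo4j("5.13.0", "5.12.0"))  # False
```

The patch component is ignored here because the rule above only requires major and minor to agree.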

### Download matching APOC Core JAR

```bash
wget https://github.com/neo4j/apoc/releases/download/5.13.0/apoc-5.13.0-core.jar
```

### Move to plugins directory

```bash
sudo mv apoc-5.13.0-core.jar /var/lib/neo4j/plugins/
sudo chown neo4j:neo4j /var/lib/neo4j/plugins/apoc-5.13.0-core.jar
```

---

## Neo4j configuration for APOC (Neo4j 5.x)

Edit the config file:

```bash
sudo nano /etc/neo4j/neo4j.conf
```

Add:

```ini
dbms.security.procedures.allowlist=apoc.*
dbms.security.procedures.unrestricted=apoc.*
```

Restart Neo4j:

```bash
sudo systemctl restart neo4j
```

Verify APOC:

```cypher
RETURN apoc.version();
CALL apoc.meta.data();
```

---

## Sending graphs from Python to Neo4j

### Example: adding a graph document

```python
from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship
from langchain_core.documents import Document

doc = Document(page_content="Marie Curie was married to Pierre Curie.")

# Relationships must reference Node objects, not bare id strings
marie = Node(id="Marie_Curie", type="Person")
pierre = Node(id="Pierre_Curie", type="Person")

graph_doc = GraphDocument(
    nodes=[marie, pierre],
    relationships=[
        Relationship(source=marie, target=pierre, type="MARRIED_TO")
    ],
    source=doc,  # the text the graph was derived from
)

graph.add_graph_documents([graph_doc])
```

---

## Accessing the Neo4j Browser

Neo4j exposes a web UI at:

```
http://localhost:7474
```

### If Neo4j runs on a remote server

You can use **SSH port forwarding**:

```bash
ssh -L 7474:localhost:7474 -L 7687:localhost:7687 user@server_ip
```

👉 Then open in your local browser:

```
http://localhost:7474
```
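
If the page does not load, it can help to confirm that the forwarded ports accept connections at all. The snippet below is a small sketch using only the standard library; `port_is_open` is a hypothetical helper, and the ports are the two forwarded in the SSH command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 7474 serves the Browser UI over HTTP; 7687 is the Bolt port used by drivers.
for port in (7474, 7687):
    status = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"localhost:{port} is {status}")
```

A successful TCP connection only shows that something is listening; it does not verify credentials or that the listener is Neo4j.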

💡 **VS Code** supports port forwarding automatically when connected via Remote SSH.

---

## Visualizing graphs in the browser

### Show all nodes and relationships

```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m;
```

### Show all nodes (including isolated ones)

```cypher
MATCH (n)
RETURN n;
```

### Clear the database (development use)

```cypher
MATCH (n)
DETACH DELETE n;
```

---

## Summary

* Neo4j is a **database + query language + visualization tool**
* It runs as a **server**, even for local use
* Authentication protects the database, not the OS
* LangChain integrates cleanly via `Neo4jGraph`
* APOC is **required** and must match Neo4j’s version
* Visualization happens via the Neo4j Browser
* SSH / VS Code port forwarding enables remote usage

In [1]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

In [2]:
text = '''
The solar system consists of the Sun and the objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.
The Sun is a star at the center of the Solar System.
Mercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no magnetic field.
Venus is a planet in the Solar System. Venus orbits the Sun. Venus has a thick atmosphere. The atmosphere of Venus is composed mainly of carbon dioxide. Venus has no magnetic field.
Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon. Earth has a thick atmosphere composed mainly of nitrogen and oxygen. Earth has a strong magnetic field.
Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and Deimos. Mars has a thin atmosphere composed mainly of carbon dioxide. Mars has a weak magnetic field.
Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io, Europa, Ganymede, and Callisto. Jupiter has a thick atmosphere composed mainly of hydrogen and helium. Jupiter has a strong magnetic field.
'''
print(text)


The solar system consists of the Sun and the objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.
The Sun is a star at the center of the Solar System.
Mercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no magnetic field.
Venus is a planet in the Solar System. Venus orbits the Sun. Venus has a thick atmosphere. The atmosphere of Venus is composed mainly of carbon dioxide. Venus has no magnetic field.
Earth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon. Earth has a thick atmosphere composed mainly of nitrogen and oxygen. Earth has a strong magnetic field.
Mars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and Deimos. Mars has a thin atmosphere composed mainly of carbon dioxide. Mars has a weak magnetic field.
Jupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has moons called Io, Europa, Ganymede, and Callis

In [3]:
# Initialize the ChatOllama model with the specified model name
# model_name = 'qwen3-vl:4b'
# model_name = 'llama3.2:3b'  # Or another text-focused model
model_name = 'tomasonjo/llama3-text2cypher-demo:8b_4bit'
# and initialize the ChatOllama instance
chat_model = ChatOllama(
    model=model_name,
    validate_model_on_init=True,
    temperature=0
)

In [4]:
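# Prompt handed to LLMGraphTransformer below.
# Note: its wording is a text-to-Cypher prompt (reused as cypher_prompt later), not an entity-extraction prompt.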
graph_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert Neo4j Cypher query generator.

TASK:
- Translate the user's natural language question into a Cypher query.

CONSTRAINTS:
- Use ONLY the schema provided below.
- Do NOT invent labels, relationship types, or properties.
- Do NOT explain the query.
- Output ONLY valid Cypher.
- If the question cannot be answered unambiguously using the schema, output:
  // CANNOT_ANSWER

GRAPH SCHEMA:
Node labels:
- Star {{name}}
- Planet {{name}}
- Moon {{name}}
- Atmosphere {{description}}
- Substance {{name}}
- PhysicalProperty {{name, value}}

Relationships:
- (Planet)-[:ORBITS]->(Star)
- (Moon)-[:ORBITS]->(Planet)
- (Planet)-[:HAS_ATMOSPHERE]->(Atmosphere)
- (Atmosphere)-[:COMPOSED_OF]->(Substance)
- (Planet)-[:HAS_PROPERTY]->(PhysicalProperty)

QUERY RULES:
1. Always specify node labels.
2. Always specify relationship directions.
3. Use meaningful variable names.
4. Return only properties, not full nodes.
5. Use DISTINCT unless duplicates are required.
6. Use OPTIONAL MATCH if information may be missing.
7. Do not use APOC or procedures.

FAILURE CONDITIONS:
- If required entities, labels, or relationships are missing from the schema,
  output:
  // CANNOT_ANSWER

EXAMPLES:
Question:
Which planet orbits the Sun?

Cypher:
MATCH (planet:Planet)-[:ORBITS]->(star:Star {{name: "Sun"}})
RETURN DISTINCT planet.name

Question:
What substances compose the atmosphere of Mars?

Cypher:
MATCH (planet:Planet {{name: "Mars"}})
      -[:HAS_ATMOSPHERE]->(atm:Atmosphere)
      -[:COMPOSED_OF]->(substance:Substance)
RETURN DISTINCT substance.name

Question:
Does Jupiter have a magnetic field?

Cypher:
MATCH (planet:Planet {{name: "Jupiter"}})
      -[:HAS_PROPERTY]->(prop:PhysicalProperty {{name: "magnetic_field"}})
RETURN DISTINCT prop.value
"""),
    ("human", "{input}")
])


In [5]:
no_schema = LLMGraphTransformer(
    llm=chat_model,
    prompt=graph_prompt,
)

In [6]:
documents = [Document(page_content=text)]

In [7]:
print(documents)

[Document(metadata={}, page_content='\nThe solar system consists of the Sun and the objects that orbit it, including planets, moons, asteroids, comets, and meteoroids.\nThe Sun is a star at the center of the Solar System.\nMercury is a planet in the Solar System. Mercury orbits the Sun. Mercury has no atmosphere and no magnetic field.\nVenus is a planet in the Solar System. Venus orbits the Sun. Venus has a thick atmosphere. The atmosphere of Venus is composed mainly of carbon dioxide. Venus has no magnetic field.\nEarth is a planet in the Solar System. Earth orbits the Sun. Earth has one moon called the Moon. Earth has a thick atmosphere composed mainly of nitrogen and oxygen. Earth has a strong magnetic field.\nMars is a planet in the Solar System. Mars orbits the Sun. Mars has two moons called Phobos and Deimos. Mars has a thin atmosphere composed mainly of carbon dioxide. Mars has a weak magnetic field.\nJupiter is a planet in the Solar System. Jupiter orbits the Sun. Jupiter has m

In [8]:
graph_no_schema = no_schema.convert_to_graph_documents(documents)

In [9]:
print(graph_no_schema)

[GraphDocument(nodes=[Node(id='Sun', type='Star', properties={}), Node(id='Mercury', type='Planet', properties={}), Node(id='Venus', type='Planet', properties={}), Node(id='Earth', type='Planet', properties={}), Node(id='Moon', type='Moon', properties={}), Node(id='Mars', type='Planet', properties={}), Node(id='Phobos', type='Moon', properties={}), Node(id='Deimos', type='Moon', properties={}), Node(id='Jupiter', type='Planet', properties={}), Node(id='Io', type='Moon', properties={}), Node(id='Europa', type='Moon', properties={}), Node(id='Ganymede', type='Moon', properties={}), Node(id='Callisto', type='Moon', properties={})], relationships=[Relationship(source=Node(id='Mercury', type='Planet', properties={}), target=Node(id='Sun', type='Star', properties={}), type='ORBITS', properties={}), Relationship(source=Node(id='Venus', type='Planet', properties={}), target=Node(id='Sun', type='Star', properties={}), type='ORBITS', properties={}), Relationship(source=Node(id='Earth', type='Pla

# Managing Secrets with Environment Variables (`.env` + `.gitignore`)

When working with Neo4j (or any service that requires credentials), **passwords should never be hard-coded** in Python files or committed to Git repositories.

Instead, credentials are stored in **environment variables**, which are loaded at runtime.

---

## Why environment variables?

Environment variables allow you to:

* keep **secrets out of source code**
* safely share repositories publicly
* use different credentials on different machines
* avoid accidental password leaks on GitHub

This is especially important when:

* working with databases
* publishing tutorials
* collaborating with others

---

## The `.env` file (local, private)

A `.env` file is a simple text file that contains environment variables:

```env
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_real_password_here
```

### Important properties of `.env`

* contains **real credentials**
* should exist **only on your local machine**
* must **never be committed to Git**

---

## Using `.gitignore` to protect secrets

To ensure `.env` is never committed, add it to `.gitignore`:

```gitignore
.env
```

This tells Git:

> “Ignore this file completely, even if it exists locally.”

As a result:

* your password stays private
* collaborators won’t see your credentials
* GitHub never stores your secrets

---

## The `.env.example` file (safe to share)

Since `.env` is ignored, collaborators need a **template** showing which variables are required.

This is the purpose of `.env.example`.

Example:

```env
# Neo4j Configuration
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password_here
```

### Why `.env.example` is useful

* contains **no real secrets**
* documents required environment variables
* can be safely committed to Git
* allows others to create their own `.env` file

Typical workflow:

1. Clone repository
2. Copy `.env.example` → `.env`
3. Fill in real credentials locally

---

## Loading environment variables in Python

The `python-dotenv` package loads variables from `.env` into the environment.

### Installation

```bash
pip install python-dotenv
```

### Example usage

```python
import os
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

# Load environment variables from .env
load_dotenv()

neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError(
        "NEO4J_PASSWORD environment variable is not set. "
        "Please create a .env file with your credentials."
    )

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)
```

---

## What happens at runtime?

1. `.env` is read **locally**
2. Variables are injected into the process environment
3. Python accesses them via `os.getenv(...)`
4. Neo4j credentials are never hard-coded or committed
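
The first three steps can be mimicked with the standard library alone. This is a simplified sketch of the idea, not how `python-dotenv` is implemented: the helper `load_env_text` is hypothetical and handles only plain `KEY=VALUE` lines and `#` comments (no quoting, no variable expansion). Like `load_dotenv()` with its default arguments, it leaves variables that are already set untouched.

```python
import os


def load_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines and copy them into os.environ without overriding."""
    loaded = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        loaded[key.strip()] = value.strip()
    for key, value in loaded.items():
        os.environ.setdefault(key, value)  # existing variables win
    return loaded


example = """
# Neo4j Configuration
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
"""
load_env_text(example)
print(os.getenv("NEO4J_USER"))
```

In real projects `python-dotenv` remains the better choice, since it also covers quoting and other edge cases this sketch ignores.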

---

## What Git sees vs. what Python sees

| File                  | Visible to Git | Visible to Python |
| --------------------- | -------------- | ----------------- |
| `.env`                | ❌             | ✅                |
| `.env.example`        | ✅             | ❌                |
| Python source         | ✅             | ✅                |
| Environment variables | ❌             | ✅                |

---

## Best practices (recommended)

* ✔ Never commit `.env`
* ✔ Always provide `.env.example`
* ✔ Validate required variables at startup
* ✔ Use defaults only for non-sensitive values
* ✔ Treat passwords as disposable (easy to rotate)

---

## Summary

* Environment variables protect secrets from version control
* `.env` stores **real credentials locally**
* `.gitignore` ensures secrets are never committed
* `.env.example` documents required configuration
* `python-dotenv` bridges `.env` and Python

This pattern is **standard practice** for secure, reproducible research code and production systems alike.

In [10]:
import os
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

# Load environment variables from .env file
load_dotenv()

# Get credentials from environment variables
neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError("NEO4J_PASSWORD environment variable is not set. Please create a .env file with your credentials.")

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)

In [11]:
print(type(graph))

<class 'langchain_neo4j.graphs.neo4j_graph.Neo4jGraph'>


In [12]:
graph.add_graph_documents(graph_no_schema)

In [13]:
# graph.query("MATCH (n) DETACH DELETE n;")
# graph.query("MATCH (n) RETURN n;")

In [14]:
cypher_prompt = ChatPromptTemplate.from_messages([
    ("system", """
You are an expert Neo4j Cypher query generator.

TASK:
- Translate the user's natural language question into a Cypher query.

CONSTRAINTS:
- Use ONLY the schema provided below.
- Do NOT invent labels, relationship types, or properties.
- Do NOT explain the query.
- Output ONLY valid Cypher.
- If the question cannot be answered unambiguously using the schema, output:
  // CANNOT_ANSWER

GRAPH SCHEMA:
Node labels:
- Star {{name}}
- Planet {{name}}
- Moon {{name}}
- Atmosphere {{description}}
- Substance {{name}}
- PhysicalProperty {{name, value}}

Relationships:
- (Planet)-[:ORBITS]->(Star)
- (Moon)-[:ORBITS]->(Planet)
- (Planet)-[:HAS_ATMOSPHERE]->(Atmosphere)
- (Atmosphere)-[:COMPOSED_OF]->(Substance)
- (Planet)-[:HAS_PROPERTY]->(PhysicalProperty)

QUERY RULES:
1. Always specify node labels.
2. Always specify relationship directions.
3. Use meaningful variable names.
4. Return only properties, not full nodes.
5. Use DISTINCT unless duplicates are required.
6. Use OPTIONAL MATCH if information may be missing.
7. Do not use APOC or procedures.

"""),
    ("human", "{question}")
])


In [15]:
from langchain_neo4j import GraphCypherQAChain

# The process occurs in two steps:
# 1) The LLM generates a Cypher query based on the user's question and the graph schema.
# 2) The results returned by that Cypher query are turned into a text answer.
# cypher_prompt=... customizes the first step
# qa_prompt=... customizes the second step (not set here, so the default is used)

graphchain = GraphCypherQAChain.from_llm(
    chat_model,
    graph=graph,
    cypher_prompt=cypher_prompt,
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_requests=True
)

# results = graphchain.invoke({"query": "Which planet orbits the Sun?"})
results = graphchain.invoke({"query": "Which planet has no atmosphere?"})
print(results)



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Planet)-[:HAS_ATMOSPHERE]->(a:Atmosphere)
WHERE NOT EXISTS {
  MATCH (p)-[:HAS_ATMOSPHERE]->()
}
RETURN p.name AS PlanetWithoutAtmosphere
Full Context:
[]

> Finished chain.
{'query': 'Which planet has no atmosphere?', 'result': 'Mercury has no atmosphere.', 'intermediate_steps': [{'query': 'MATCH (p:Planet)-[:HAS_ATMOSPHERE]->(a:Atmosphere)\nWHERE NOT EXISTS {\n  MATCH (p)-[:HAS_ATMOSPHERE]->()\n}\nRETURN p.name AS PlanetWithoutAtmosphere'}, {'context': []}]}


In [16]:
print(results['query'])
print(results['intermediate_steps'])
print('RESULT: \n', results['result'])

Which planet has no atmosphere?
[{'query': 'MATCH (p:Planet)-[:HAS_ATMOSPHERE]->(a:Atmosphere)\nWHERE NOT EXISTS {\n  MATCH (p)-[:HAS_ATMOSPHERE]->()\n}\nRETURN p.name AS PlanetWithoutAtmosphere'}, {'context': []}]
RESULT: 
 Mercury has no atmosphere.


## `GraphCypherQAChain.from_llm`: How It Works

`GraphCypherQAChain.from_llm` is LangChain’s canonical abstraction for **Graph-based Retrieval-Augmented Generation (GraphRAG)** using a property graph (e.g. Neo4j).
It enables an LLM to **translate natural language questions into Cypher queries**, execute them on a graph database, and then **verbalize the results**.

At a high level, it is a **two-step LLM pipeline**:

---

## 1️⃣ Step 1: Natural Language → Cypher (Query Generation)

In the first step, the LLM is prompted to **generate a Cypher query** based on:

* The user’s natural-language question
* A textual description of the graph schema
* Optional examples and constraints

**Conceptually:**

```text
User Question
   ↓
LLM (Cypher generation prompt)
   ↓
Cypher Query
```

### What the LLM “knows” here

The LLM does **not** inspect the database directly.
It only sees:

* Node labels
* Relationship types
* Property names
* Optional natural-language descriptions of the schema

This information is injected via the chain’s **Cypher prompt template**.

### How to affect Cypher generation

You can (and should) control this step carefully:

* **Schema clarity is critical**

  * Explicitly list labels, relationships, and properties
  * Avoid overloaded or ambiguous names
* **Add natural-language descriptions**

  * Especially useful for abstract domains (papers, music, biology, law)
* **Constrain the output**

  * “Use only provided labels and relationships”
  * “Do not hallucinate properties”
* **Provide examples**

  * Few-shot examples dramatically improve reliability
* **Use `cypher_prompt` explicitly**

  * Do *not* rely on defaults for production or teaching material
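
One way to keep the schema text explicit and consistent is to generate it from a single Python structure. The sketch below is an assumption about how this could be organized; `render_schema` and the two constants are hypothetical helpers, not LangChain API. Literal braces are doubled in the output because `ChatPromptTemplate` treats single braces as template variables, as in the prompts used in this notebook.

```python
# Hypothetical schema description, mirroring part of the prompt used in this notebook.
NODE_PROPERTIES = {
    "Star": ["name"],
    "Planet": ["name"],
    "Moon": ["name"],
}
RELATIONSHIPS = [
    ("Planet", "ORBITS", "Star"),
    ("Moon", "ORBITS", "Planet"),
]


def render_schema(nodes: dict, relationships: list) -> str:
    """Render labels and relationships as prompt text with escaped braces."""
    lines = ["Node labels:"]
    for label, props in nodes.items():
        # Plain concatenation keeps the doubled braces literal in the output.
        lines.append("- " + label + " {{" + ", ".join(props) + "}}")
    lines.append("")
    lines.append("Relationships:")
    for source, rel_type, target in relationships:
        lines.append(f"- ({source})-[:{rel_type}]->({target})")
    return "\n".join(lines)


print(render_schema(NODE_PROPERTIES, RELATIONSHIPS))
```

Generating the text this way makes it harder for the prompt and the actual graph model to drift apart, provided the structure is kept in sync with the database.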

⚠️ This step is the **single biggest failure point** in GraphRAG systems.

---

## 2️⃣ Step 2: Cypher Execution → Natural Language Answer

Once the Cypher query is generated:

1. The query is executed against the graph database
2. The raw results (records, paths, properties) are returned
3. A second LLM call turns these results into a human-readable answer

**Conceptually:**

```text
Cypher Query
   ↓
Graph Database
   ↓
Query Results
   ↓
LLM (answer synthesis prompt)
   ↓
Final Answer
```
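
Both steps can be sketched without LangChain or a running database. Everything below is a stand-in: `answer_question` and the three `fake_*` functions are hypothetical and only mimic the roles of the first LLM call, the database, and the second LLM call. They are not the chain's actual internals.

```python
from typing import Callable


def answer_question(
    question: str,
    generate_cypher: Callable[[str], str],
    run_query: Callable[[str], list],
    synthesize_answer: Callable[[str, list], str],
) -> dict:
    """Run the two-step flow and keep the intermediate artifacts for inspection."""
    cypher = generate_cypher(question)          # step 1: question -> Cypher
    rows = run_query(cypher)                    # execute against the graph
    answer = synthesize_answer(question, rows)  # step 2: rows -> prose
    return {"query": question, "cypher": cypher, "context": rows, "result": answer}


# Stand-ins with canned behaviour, for illustration only.
def fake_llm_cypher(question: str) -> str:
    return 'MATCH (p:Planet)-[:ORBITS]->(:Star {name: "Sun"}) RETURN DISTINCT p.name'


def fake_database(cypher: str) -> list:
    return [{"p.name": "Earth"}, {"p.name": "Mars"}]


def fake_llm_answer(question: str, rows: list) -> str:
    names = ", ".join(row["p.name"] for row in rows)
    return f"Planets orbiting the Sun: {names}."


print(answer_question("Which planets orbit the Sun?",
                      fake_llm_cypher, fake_database, fake_llm_answer))
```

Keeping the generated Cypher and the raw rows in the returned dictionary plays the same role as `return_intermediate_steps=True` in the notebook: each stage can be inspected separately when an answer looks wrong.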

### Key properties of this step

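* The LLM is now **grounded** in actual data
* Hallucination risk is lower, but only when the query returns relevant rows; with an empty result the model can still answer from its own knowledge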
* The quality of the answer depends on:

  * Result size and structure
  * How much raw data is exposed to the LLM
  * The answer prompt’s instructions
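
Grounding can be checked rather than assumed. In the run shown earlier, the context was empty (`[]`), yet the chain still answered "Mercury has no atmosphere.", so that answer came from the model rather than from the graph. A minimal guard, assuming the result shape printed in that run with `return_intermediate_steps=True`; `is_grounded` is a hypothetical helper:

```python
def is_grounded(results: dict) -> bool:
    """Return True if any intermediate step carries a non-empty context."""
    for step in results.get("intermediate_steps", []):
        if step.get("context"):
            return True
    return False


# Result dict abbreviated from the notebook output; the Cypher text is shortened.
results = {
    "query": "Which planet has no atmosphere?",
    "result": "Mercury has no atmosphere.",
    "intermediate_steps": [{"query": "MATCH ..."}, {"context": []}],
}
print(is_grounded(results))  # False
```

When the guard returns `False`, it is safer to report that the graph returned nothing than to show the generated answer.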

---

## Why This Is *Not* Knowledge Graph Generation

It’s important to distinguish **GraphCypherQAChain** from **KG-generation-first approaches**.

### GraphCypherQAChain (this approach)

* Assumes the **graph already exists**
* LLM is used only for:

  * Query generation
  * Answer verbalization
* Strong guarantees:

  * Deterministic retrieval (once the Cypher query is fixed)
  * Explainable intermediate steps
* Ideal for:

  * Structured databases
  * Curated knowledge graphs
  * Teaching and debugging

### KG generation + GraphRAG priming

* LLM first **creates or augments the knowledge graph**
* Then queries it
* Pros:

  * Faster bootstrapping
  * Less schema engineering upfront
* Cons:

  * Error compounding
  * Harder to audit
  * Weaker guarantees

📌 **Best practice** for serious applications and education:

> Use *KG generation* for exploration and prototyping,
> but *GraphCypherQAChain* for querying **trusted graphs**.
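
Before an LLM-generated graph is treated as trusted, a quick audit of what was extracted is worthwhile. The sketch below counts node and relationship types; `summarize_graph_documents` is a hypothetical helper that relies only on the `.nodes`, `.relationships`, and `.type` attributes visible in the `GraphDocument` output earlier in this notebook. The demo uses `SimpleNamespace` stand-ins so it runs without LangChain.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_graph_documents(graph_documents) -> dict:
    """Count node types and relationship types across graph documents."""
    node_types = Counter()
    rel_types = Counter()
    for doc in graph_documents:
        node_types.update(node.type for node in doc.nodes)
        rel_types.update(rel.type for rel in doc.relationships)
    return {"nodes": dict(node_types), "relationships": dict(rel_types)}


# Stand-in objects for illustration; in the notebook, pass graph_no_schema instead.
demo = [SimpleNamespace(
    nodes=[SimpleNamespace(type="Planet"), SimpleNamespace(type="Planet"),
           SimpleNamespace(type="Star")],
    relationships=[SimpleNamespace(type="ORBITS"), SimpleNamespace(type="ORBITS")],
)]
print(summarize_graph_documents(demo))
# {'nodes': {'Planet': 2, 'Star': 1}, 'relationships': {'ORBITS': 2}}
```

Unexpected or missing types in this summary (for example, no atmosphere nodes at all) point to a mismatch between the extracted graph and the schema the Cypher prompt describes.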

---

## Best Practices Summary

* ✔ Explicit schema descriptions
* ✔ Carefully designed Cypher prompt
* ✔ Few-shot Cypher examples
* ✔ Tight constraints on labels and properties
* ✔ Small, interpretable query results
* ✔ Separate “graph construction” from “graph querying”