# Neo4j: A Practical Introduction for Knowledge Graphs

## What is Neo4j?

**Neo4j** is a **graph database system**, not just a plotting or visualization library.

It consists of three tightly integrated parts:

* **Graph database**
  Stores data as **nodes** and **relationships**, optimized for graph traversal and querying.
* **Query language (Cypher)**
  A declarative language designed specifically for graphs (e.g. pattern matching).
* **Browser-based visualization tool**
  An interactive UI for exploring, querying, and debugging graphs.

### How Neo4j differs from plotting tools (e.g. Plotly, NetworkX's drawing functions)

| Plotting tools         | Neo4j                    |
| ---------------------- | ------------------------ |
| Visualization-only     | Persistent database      |
| Static or client-side  | Server-based             |
| No query language      | Cypher for graph queries |
| No schema awareness    | Schema + metadata        |
| No traversal semantics | Native graph traversal   |

Neo4j is best thought of as **infrastructure**, not a plotting library.
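
To make the data model concrete, here is a minimal in-memory sketch of what Neo4j stores: labelled nodes, typed and directed relationships, and a pattern match over them. The `Node`, `Relationship`, and `match` names are illustrative only and not part of any Neo4j API; a real database adds persistence, indexes, and Cypher on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    id: str
    label: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    type: str
    target: Node


def match(relationships, source_label, rel_type, target_label):
    """Rough analogue of MATCH (a:source_label)-[:rel_type]->(b:target_label) RETURN a.id, b.id."""
    return [
        (rel.source.id, rel.target.id)
        for rel in relationships
        if rel.source.label == source_label
        and rel.type == rel_type
        and rel.target.label == target_label
    ]


marie = Node("Marie_Curie", "Person")
pierre = Node("Pierre_Curie", "Person")
relationships = [Relationship(marie, "MARRIED_TO", pierre)]

print(match(relationships, "Person", "MARRIED_TO", "Person"))
```

The direction matters here just as it does in Cypher: the relationship is only found when `Marie_Curie` is the source.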

---

## Execution model and security

* Neo4j runs as a **server process**
* Even for local use, it:

  * listens on ports
  * manages persistent data
  * enforces authentication
* Therefore, a **password is required**

  * this password protects the *database*, **not the operating system**
  * it is not a one-time startup password

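A typical setup looks like this:

* Neo4j runs on your own machine or on a remote server
* remote servers are reached over SSH
* the Browser UI (port 7474) and the Bolt endpoint (port 7687) are reached through port forwarding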

---

## Installation on Ubuntu (minimal setup)

### 1. Install Java (required)

```bash
sudo apt update
sudo apt install -y openjdk-17-jdk
```

### 2. Install Neo4j (Community Edition)

```bash
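# apt-key is deprecated; keep the signing key in a dedicated keyring instead
sudo mkdir -p /etc/apt/keyrings
wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/neotechnology.gpg
echo 'deb [signed-by=/etc/apt/keyrings/neotechnology.gpg] https://debian.neo4j.com stable 5' | sudo tee /etc/apt/sources.list.d/neo4j.list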
sudo apt update
sudo apt install -y neo4j
```

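### 3. Set initial password

Run this **before** the first start. Once the database has been started, the command no longer takes effect and the default credentials (`neo4j` / `neo4j`) apply until you change them at first login.

```bash
sudo neo4j-admin dbms set-initial-password YOUR_PASSWORD
```

### 4. Start Neo4j

```bash
sudo systemctl enable neo4j
sudo systemctl start neo4j
```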

---

## Using Neo4j with Python and LangChain

LangChain provides a Python interface (`Neo4jGraph`) that allows:

* sending graphs to Neo4j
* schema introspection
* querying and reasoning over graphs
* visualization via Neo4j Browser

### Python connection example

```python
from langchain_neo4j import Neo4jGraph

graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="YOUR_PASSWORD"
)
```

At this point, `graph` **is the database connection**.

---

## APOC: Why it is required

LangChain relies on **APOC** (Awesome Procedures on Cypher) to:

* introspect the graph schema
* discover node labels and relationships
* call `apoc.meta.data()`

By default, `Neo4jGraph` refreshes the schema as soon as it is constructed, so without APOC the constructor fails and no graph interface is created.

---

## Installing APOC (Neo4j 5.x)

### ⚠️ Critical rule: version matching

> **APOC version must exactly match the Neo4j version (major + minor).**

Check your Neo4j version:

```bash
neo4j --version
```

Example:

```
neo4j 5.13.0
```

### Download matching APOC Core JAR

```bash
wget https://github.com/neo4j/apoc/releases/download/5.13.0/apoc-5.13.0-core.jar
```
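
The version rule and the download URL can be derived mechanically from the output of `neo4j --version`. The sketch below assumes the release URL pattern shown above holds for other versions too; the helper names (`bare_version`, `versions_compatible`, `apoc_core_url`) are hypothetical, and the code does not check whether a release for that version actually exists.

```python
def bare_version(text: str) -> str:
    """Strip a leading 'neo4j ' prefix: 'neo4j 5.13.0' -> '5.13.0'."""
    return text.strip().split()[-1]


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '5.13.0'."""
    major, minor, *_ = bare_version(version).split(".")
    return int(major), int(minor)


def versions_compatible(neo4j_version: str, apoc_version: str) -> bool:
    """True if APOC and Neo4j agree on both major and minor version."""
    return major_minor(neo4j_version) == major_minor(apoc_version)


def apoc_core_url(neo4j_version: str) -> str:
    """Build the APOC Core download URL, assuming the pattern used above."""
    v = bare_version(neo4j_version)
    return f"https://github.com/neo4j/apoc/releases/download/{v}/apoc-{v}-core.jar"


print(apoc_core_url("neo4j 5.13.0"))
print(versions_compatible("5.13.0", "5.12.0"))
```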

### Move to plugins directory

```bash
sudo mv apoc-5.13.0-core.jar /var/lib/neo4j/plugins/
sudo chown neo4j:neo4j /var/lib/neo4j/plugins/apoc-5.13.0-core.jar
```

---

## Neo4j configuration for APOC (Neo4j 5.x)

Edit the config file:

```bash
sudo nano /etc/neo4j/neo4j.conf
```

Add:

```ini
dbms.security.procedures.allowlist=apoc.*
dbms.security.procedures.unrestricted=apoc.*
```
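
A missing or commented-out line is an easy mistake to make here. As a rough sanity check, the following sketch parses the text of a config file and reports which of the two settings are absent. It is a simplified reader, not Neo4j's own parser, and `missing_apoc_settings` is a hypothetical helper; in practice you would pass it the contents of `/etc/neo4j/neo4j.conf`.

```python
REQUIRED = {
    "dbms.security.procedures.allowlist": "apoc.*",
    "dbms.security.procedures.unrestricted": "apoc.*",
}


def parse_conf(text: str) -> dict[str, str]:
    """Parse key=value lines, ignoring blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def missing_apoc_settings(text: str) -> list[str]:
    """Return the required keys that are absent or do not mention apoc.*."""
    settings = parse_conf(text)
    return [key for key, needed in REQUIRED.items() if needed not in settings.get(key, "")]


sample = (
    "dbms.security.procedures.allowlist=apoc.*\n"
    "#dbms.security.procedures.unrestricted=apoc.*\n"
)
print(missing_apoc_settings(sample))  # ['dbms.security.procedures.unrestricted']
```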

Restart Neo4j:

```bash
sudo systemctl restart neo4j
```

Verify APOC:

```cypher
RETURN apoc.version();
CALL apoc.meta.data();
```

---

## Sending graphs from Python to Neo4j

### Example: adding a graph document

```python
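from langchain_community.graphs.graph_document import GraphDocument, Node, Relationship
from langchain_core.documents import Document

doc = Document(page_content="Marie Curie was married to Pierre Curie.")

# Nodes and relationships must be Node / Relationship objects, not plain dicts
marie = Node(id="Marie_Curie", type="Person")
pierre = Node(id="Pierre_Curie", type="Person")

graph_doc = GraphDocument(
    nodes=[marie, pierre],
    relationships=[
        # source and target are the Node objects themselves, not their ids
        Relationship(source=marie, target=pierre, type="MARRIED_TO")
    ],
    source=doc,
)

# Writes the nodes and the relationship to the connected database
graph.add_graph_documents([graph_doc])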
```
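
Conceptually, every node and relationship in a graph document ends up being merged into the database. The sketch below prints roughly equivalent Cypher for the example above. It is an illustration only: `to_cypher` is a hypothetical helper, not LangChain's implementation, and real code should pass values as query parameters instead of formatting them into the query string.

```python
def to_cypher(nodes, relationships):
    """Render (id, label) nodes and (source, type, target) triples as MERGE statements."""
    statements = [
        f'MERGE (:{label} {{id: "{node_id}"}})' for node_id, label in nodes
    ]
    for source, rel_type, target in relationships:
        statements.append(
            f'MATCH (a {{id: "{source}"}}), (b {{id: "{target}"}}) '
            f"MERGE (a)-[:{rel_type}]->(b)"
        )
    return statements


nodes = [("Marie_Curie", "Person"), ("Pierre_Curie", "Person")]
relationships = [("Marie_Curie", "MARRIED_TO", "Pierre_Curie")]

for statement in to_cypher(nodes, relationships):
    print(statement)
```

`MERGE` (as opposed to `CREATE`) is what makes repeated imports idempotent: running the same statements twice does not duplicate nodes.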

---

## Accessing the Neo4j Browser

Neo4j exposes a web UI at:

```
http://localhost:7474
```

### If Neo4j runs on a remote server

You can use **SSH port forwarding**:

```bash
ssh -L 7474:localhost:7474 -L 7687:localhost:7687 user@server_ip
```

👉 Then open in your local browser:

```
http://localhost:7474
```
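
If the page does not load, it helps to check whether the forwarded ports are reachable at all. The following sketch uses only the standard library; `port_is_open` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that it is Neo4j.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 7474 serves the Browser UI, 7687 is the Bolt endpoint used by drivers
for port in (7474, 7687):
    state = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"localhost:{port} is {state}")
```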

💡 **VS Code** supports port forwarding automatically when connected via Remote SSH.

---

## Visualizing graphs in the browser

### Show all relationships and the nodes they connect

```cypher
MATCH (n)-[r]->(m)
RETURN n, r, m;
```

### Show all nodes (including isolated ones)

```cypher
MATCH (n)
RETURN n;
```

### Clear the database (development use)

```cypher
MATCH (n)
DETACH DELETE n;
```

---

## Summary

* Neo4j is a **database + query language + visualization tool**
* It runs as a **server**, even for local use
* Authentication protects the database, not the OS
* LangChain integrates cleanly via `Neo4jGraph`
* APOC is **required** and must match Neo4j’s version
* Visualization happens via the Neo4j Browser
* SSH / VS Code port forwarding enables remote usage

In [1]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_core.documents import Document
from langchain_ollama import ChatOllama
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.prompts import ChatPromptTemplate

In [2]:
retriever = WikipediaRetriever()
docs = retriever.invoke("Apollo 11")
text = docs[0].page_content
print(text)

Apollo 11 (July 16–24, 1969) was the fifth manned flight in the United States Apollo program and the first spaceflight to land humans on the Moon. Commander Neil Armstrong and Lunar Module Pilot Edwin "Buzz" Aldrin landed the Lunar Module Eagle on July 20 at 20:17 UTC, and Armstrong became the first person to step onto the surface about six hours later, at 02:56 UTC on July 21. Aldrin joined him 19 minutes afterward, and together they spent about two and a half hours exploring the site they had named Tranquility Base upon landing. They collected 47.5 pounds (21.5 kg) of lunar material before re-entering the Lunar Module. In total, they were on the Moon’s surface for 21 hours, 36 minutes before returning to the Command Module Columbia, which remained in lunar orbit, piloted by Michael Collins.
Apollo 11 was launched by a Saturn V rocket from Kennedy Space Center in Florida, on July 16 at 13:32 UTC (9:32 am EDT, local time). The Apollo spacecraft consisted of three parts: the command





In [3]:
# Choose which local Ollama model to use
# model_name = 'qwen3-vl:4b'
model_name = 'llama3.2:3b'  # Or another text-focused model

# Create the chat model; temperature=0 keeps the output as repeatable as possible
chat_model = ChatOllama(
    model=model_name,
    validate_model_on_init=True,
    temperature=0
)

In [30]:
# Prompt template for translating natural-language questions into Cypher.
# Literal braces are doubled ({{ }}) so that only {input} is treated as a variable.
graph_prompt = ChatPromptTemplate.from_messages([
    ("system", """
You are an expert Neo4j Cypher query generator.

TASK:
- Translate the user's natural language question into a Cypher query.

CONSTRAINTS:
- Use ONLY the schema provided below.
- Do NOT invent labels, relationship types, or properties.
- Do NOT explain the query.
- Output ONLY valid Cypher.
- If the question cannot be answered unambiguously using the schema, output:
  // CANNOT_ANSWER

GRAPH SCHEMA:
Node labels:
- Person {{name, citizenship}}
- Vehicle {{name, type}}
- Base {{name}}
- Location {{name}}
- Country {{name}}
- Planet {{name}}

Relationships:
- (Person)-[:OPERATED]->(Vehicle)
- (Person)-[:COLLABORATED_WITH]->(Person)
- (Person)-[:CITIZEN_OF]->(Country)
- (Vehicle)-[:DEPARTED_FROM_BASE]->(Base)
- (Base)-[:LOCATED_IN]->(Location)
- (Location)-[:IS_IN]->(Country)
- (Planet)-[:CONTAINS]->(Location)

QUERY RULES:
1. Always specify node labels.
2. Always specify relationship directions.
3. Use meaningful variable names.
4. Return only properties, not full nodes.
5. Use DISTINCT unless duplicates are required.
6. Use OPTIONAL MATCH if information may be missing.
7. Do not use APOC or procedures.

FAILURE CONDITIONS:
- If required entities, labels, or relationships are missing from the schema,
  output:
  // CANNOT_ANSWER

EXAMPLES:
Question:
Who operated Apollo 11?

Cypher:
MATCH (person:Person)-[:OPERATED]->(vehicle:Vehicle {{name: "Apollo 11"}})
RETURN DISTINCT person.name

Question:
Which country is the launch base of Apollo 11 located in?

Cypher:
MATCH (vehicle:Vehicle {{name: "Apollo 11"}})
      -[:DEPARTED_FROM_BASE]->(base:Base)
      -[:LOCATED_IN]->(location:Location)
      -[:IS_IN]->(country:Country)
RETURN DISTINCT country.name
"""),
    ("human", "{input}")
])
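
LangChain's default prompt format follows Python's `str.format` rules, so every literal `{` or `}` in a template (the property lists in the schema, the property maps in the Cypher examples) has to be doubled; single braces are parsed as input variables. The plain-Python sketch below shows the rule without involving LangChain. The two template strings are abbreviated stand-ins for the system prompt, and `template_variables` is a hypothetical helper.

```python
from string import Formatter

unescaped = 'MATCH (v:Vehicle {name: "Apollo 11"}) // {input}'
escaped = 'MATCH (v:Vehicle {{name: "Apollo 11"}}) // {input}'


def template_variables(template: str) -> set[str]:
    """Names that str.format would try to fill in for this template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name is not None}


print(template_variables(unescaped))  # contains 'name' as well as 'input'
print(template_variables(escaped))    # contains only 'input'
print(escaped.format(input="Who operated Apollo 11?"))
```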


In [31]:
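# No allowed_nodes / allowed_relationships are passed, so extraction is schema-free.
# `graph_extraction_prompt` has to come from an earlier cell (not shown here); it is
# not the Cypher-generation prompt `graph_prompt` defined above. Leaving out the
# `prompt` argument falls back to LLMGraphTransformer's built-in extraction prompt.
no_schema = LLMGraphTransformer(
    llm=chat_model,
    prompt=graph_extraction_prompt,
)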

In [32]:
documents = [Document(page_content=text)]

In [33]:
print(documents)

[Document(metadata={}, page_content='Apollo 11 (July 16–24, 1969) was the fifth manned flight in the United States Apollo program and the first spaceflight to land humans on the Moon. Commander Neil Armstrong and Lunar Module Pilot Edwin "Buzz" Aldrin landed the Lunar Module Eagle on July 20 at 20:17 UTC, and Armstrong became the first person to step onto the surface about six hours later, at 02:56 UTC on July 21. Aldrin joined him 19 minutes afterward, and together they spent about two and a half hours exploring the site they had named Tranquility Base upon landing. They collected 47.5 pounds (21.5 kg) of lunar material before re-entering the Lunar Module. In total, they were on the Moon’s surface for 21 hours, 36 minutes before returning to the Command Module Columbia, which remained in lunar orbit, piloted by Michael Collins.\nApollo 11 was launched by a Saturn V rocket from Kennedy Space Center in Florida, on July 16 at 13:32 UTC (9:32 am EDT, local time). The Apollo spacecraft

In [34]:
graph_no_schema = no_schema.convert_to_graph_documents(documents)

In [35]:
print(graph_no_schema)

[GraphDocument(nodes=[Node(id='Apollo 11', type='Mission', properties={}), Node(id='Neil Armstrong', type='Person', properties={}), Node(id="Edwin 'Buzz' Aldrin", type='Person', properties={}), Node(id='Lunar Module Eagle', type='Vehicle', properties={}), Node(id='Commander Neil Armstrong', type='Role', properties={}), Node(id="Lunar Module Pilot Edwin 'Buzz' Aldrin", type='Role', properties={}), Node(id='Tranquility Base', type='Location', properties={}), Node(id='Moon', type='Location', properties={}), Node(id='Kennedy Space Center', type='Location', properties={}), Node(id='Florida', type='Location', properties={}), Node(id='Saturn V Rocket', type='Vehicle', properties={}), Node(id='Command Module Columbia', type='Vehicle', properties={}), Node(id='Service Module Sm', type='Vehicle', properties={}), Node(id='Lunar Module Lm', type='Vehicle', properties={}), Node(id='Descent Stage', type='Component', properties={}), Node(id='Ascent Stage', type='Component', properties={}), Node(id='M

# Managing Secrets with Environment Variables (`.env` + `.gitignore`)

When working with Neo4j (or any service that requires credentials), **passwords should never be hard-coded** in Python files or committed to Git repositories.

Instead, credentials are stored in **environment variables**, which are loaded at runtime.

---

## Why environment variables?

Environment variables allow you to:

* keep **secrets out of source code**
* safely share repositories publicly
* use different credentials on different machines
* avoid accidental password leaks on GitHub

This is especially important when:

* working with databases
* publishing tutorials
* collaborating with others

---

## The `.env` file (local, private)

A `.env` file is a simple text file that contains environment variables:

```env
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_real_password_here
```

### Important properties of `.env`

* contains **real credentials**
* should exist **only on your local machine**
* must **never be committed to Git**

---

## Using `.gitignore` to protect secrets

To ensure `.env` is never committed, add it to `.gitignore`:

```gitignore
.env
```

This tells Git:

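> “Ignore this file completely, even if it exists locally.”

Note that `.gitignore` only applies to untracked files. If `.env` was committed earlier, remove it from the index with `git rm --cached .env` and rotate the password, since the old value remains in the repository history.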

As a result:

* your password stays private
* collaborators won’t see your credentials
* GitHub never stores your secrets

---

## The `.env.example` file (safe to share)

Since `.env` is ignored, collaborators need a **template** showing which variables are required.

This is the purpose of `.env.example`.

Example:

```env
# Neo4j Configuration
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password_here
```

### Why `.env.example` is useful

* contains **no real secrets**
* documents required environment variables
* can be safely committed to Git
* allows others to create their own `.env` file

Typical workflow:

1. Clone repository
2. Copy `.env.example` → `.env`
3. Fill in real credentials locally

---

## Loading environment variables in Python

The `python-dotenv` package loads variables from `.env` into the environment.

### Installation

```bash
pip install python-dotenv
```

### Example usage

```python
import os
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

# Load environment variables from .env
load_dotenv()

neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError(
        "NEO4J_PASSWORD environment variable is not set. "
        "Please create a .env file with your credentials."
    )

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)
```

---

## What happens at runtime?

1. `.env` is read **locally**
2. Variables are injected into the process environment
3. Python accesses them via `os.getenv(...)`
4. Neo4j credentials are never hard-coded or committed
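
Steps 1 to 3 can be sketched with the standard library alone. This is a simplified stand-in for what `python-dotenv` does, not its actual implementation: the real package also handles quoting, `export` prefixes, and variable interpolation. The `load_env_text` helper is hypothetical; like `load_dotenv()` by default, it leaves variables that are already set untouched.

```python
import os


def load_env_text(text: str, override: bool = False) -> dict[str, str]:
    """Parse KEY=VALUE lines and copy them into os.environ."""
    loaded = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        if override or key not in os.environ:
            os.environ[key] = value  # step 2: inject into the process environment
        loaded[key] = value
    return loaded


example = """
# Neo4j Configuration
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
"""
load_env_text(example)            # step 1: read the file contents
print(os.getenv("NEO4J_URL"))     # step 3: access via os.getenv
```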

---

## What Git sees vs. what Python sees

| File                  | Visible to Git | Visible to Python |
| --------------------- | -------------- | ----------------- |
| `.env`                | ❌              | ✅                 |
| `.env.example`        | ✅              | ❌                 |
| Python source         | ✅              | ✅                 |
| Environment variables | ❌              | ✅                 |

---

## Best practices (recommended)

* ✔ Never commit `.env`
* ✔ Always provide `.env.example`
* ✔ Validate required variables at startup
* ✔ Use defaults only for non-sensitive values
* ✔ Treat passwords as disposable (easy to rotate)
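
Startup validation is easiest to maintain when all required variables are checked in one place and reported together. A minimal sketch, assuming the variable names used in this notebook; `require_env` is a hypothetical helper, not part of `python-dotenv`:

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested variables, or raise one error naming every missing one."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


try:
    settings = require_env("NEO4J_URL", "NEO4J_USER", "NEO4J_PASSWORD")
except RuntimeError as error:
    print(error)
```

Reporting every missing name at once saves a round trip compared with failing on the first one.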

---

## Summary

* Environment variables protect secrets from version control
* `.env` stores **real credentials locally**
* `.gitignore` ensures secrets are never committed
* `.env.example` documents required configuration
* `python-dotenv` bridges `.env` and Python

This pattern is **standard practice** for secure, reproducible research code and production systems alike.

In [10]:
import os
from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

# Load environment variables from .env file
load_dotenv()

# Get credentials from environment variables
neo4j_url = os.getenv("NEO4J_URL", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    raise ValueError("NEO4J_PASSWORD environment variable is not set. Please create a .env file with your credentials.")

graph = Neo4jGraph(
    url=neo4j_url,
    username=neo4j_user,
    password=neo4j_password
)

In [11]:
print(type(graph))

<class 'langchain_neo4j.graphs.neo4j_graph.Neo4jGraph'>


In [36]:
graph.add_graph_documents(graph_no_schema)

In [None]:
# graph.query("MATCH (n) DETACH DELETE n;")  # wipes the whole database; uncomment only to start from scratch
graph.query("MATCH (n) RETURN n;")

[{'n': {'id': 'Apollo 11'}},
 {'n': {'id': 'Neil Armstrong'}},
 {'n': {'id': "Edwin 'Buzz' Aldrin"}},
 {'n': {'id': 'Lunar Module Eagle'}},
 {'n': {'id': 'Commander Neil Armstrong'}},
 {'n': {'id': "Lunar Module Pilot Edwin 'Buzz' Aldrin"}},
 {'n': {'id': 'Tranquility Base'}},
 {'n': {'id': 'Moon'}},
 {'n': {'id': 'Kennedy Space Center'}},
 {'n': {'id': 'Florida'}},
 {'n': {'id': 'Saturn V Rocket'}},
 {'n': {'id': 'Command Module Columbia'}},
 {'n': {'id': 'Service Module Sm'}},
 {'n': {'id': 'Lunar Module Lm'}},
 {'n': {'id': 'Descent Stage'}},
 {'n': {'id': 'Ascent Stage'}},
 {'n': {'id': 'Mare Tranquillitatis'}},
 {'n': {'id': 'Pacific Ocean'}},
 {'n': {'id': 'United States'}},
 {'n': {'id': 'Soviet Union'}},
 {'n': {'id': 'National Aeronautics And Space Administration (Nasa)'}},
 {'n': {'id': 'Project Mercury'}},
 {'n': {'id': 'Sputnik 1'}},
 {'n': {'id': 'Yuri Gagarin'}},
 {'n': {'id': 'Alan Shepard'}},
 {'n': {'id': 'Dwight D. Eisenhower'}},
 {'n': {'id': 'John F. Kennedy'}}]

In [37]:
from langchain_neo4j import GraphCypherQAChain

graphchain = GraphCypherQAChain.from_llm(
    chat_model,
    graph=graph,
    verbose=True,
    return_intermediate_steps=True,
    allow_dangerous_requests=True
)

results = graphchain.invoke({"query": "Who operated the lunar module?"})
print(results)

> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (a:Person)-[:OPERATED]->(b:LunarModule) RETURN a
Full Context:
[]

> Finished chain.
{'query': 'Who operated the lunar module?', 'result': "I don't know the answer.", 'intermediate_steps': [{'query': 'MATCH (a:Person)-[:OPERATED]->(b:LunarModule) RETURN a'}, {'context': []}]}


In [38]:
print(results['query'])
print(results['intermediate_steps'])
print(results['result'])

Who operated the lunar module?
[{'query': 'MATCH (a:Person)-[:OPERATED]->(b:LunarModule) RETURN a'}, {'context': []}]
I don't know the answer.
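
The empty context has a likely explanation: the generated query matches a `LunarModule` label, whereas the extraction output earlier lists the lunar module as a `Vehicle`. A quick sanity check is to compare the labels in a generated query against the labels that exist in the graph. The sketch below is a simplification: `labels_in_cypher` is a hypothetical regex-based helper that only sees the first label of each node pattern, and `known_labels` is copied by hand from the visible part of the extraction output.

```python
import re

# Matches the first label of a node pattern such as (a:Person) or (:Vehicle)
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")


def labels_in_cypher(query: str) -> set[str]:
    """Collect the first label of every node pattern in a Cypher query."""
    return set(NODE_LABEL.findall(query))


known_labels = {"Mission", "Person", "Vehicle", "Role", "Location", "Component"}
generated = "MATCH (a:Person)-[:OPERATED]->(b:LunarModule) RETURN a"

unknown = labels_in_cypher(generated) - known_labels
print(unknown)  # {'LunarModule'}
```

The same kind of check could be applied to relationship types, since the query also assumes an `OPERATED` relationship that may not have been extracted.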
