<a href="https://colab.research.google.com/github/okanbursa/GraphRAG/blob/main/graph_constructing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to construct knowledge graphs

In this guide we'll go over the basic ways of constructing a knowledge graph from unstructured text. The constructed graph can then be used as a knowledge base in a [RAG](/docs/concepts/rag/) application.

## ⚠️ Security note ⚠️

Constructing knowledge graphs requires write access to the database. There are inherent risks in doing this. Make sure that you verify and validate data before importing it. For more on general security best practices, [see here](/docs/security).
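As one way to act on this advice, the sketch below checks extracted labels and relationship types against an allow-list and a conservative identifier pattern before anything is written. The function name `validate_triples`, the allow-lists, and the pattern are illustrative assumptions, not part of LangChain or Neo4j.

```python
import re

# Assumption: labels and relationship types should look like plain identifiers.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Hypothetical allow-lists; adapt them to your own schema.
ALLOWED_LABELS = {"Person", "Country", "Organization"}
ALLOWED_REL_TYPES = {"NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"}


def validate_triples(triples):
    """Split (source_label, rel_type, target_label) triples into accepted and rejected."""
    accepted, rejected = [], []
    for triple in triples:
        source_label, rel_type, target_label = triple
        # Reject anything that is not a plain identifier (e.g. embedded query text).
        names_ok = all(SAFE_NAME.match(name) for name in triple)
        schema_ok = (
            source_label in ALLOWED_LABELS
            and target_label in ALLOWED_LABELS
            and rel_type in ALLOWED_REL_TYPES
        )
        (accepted if names_ok and schema_ok else rejected).append(triple)
    return accepted, rejected


candidates = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "WORKED_AT", "Organization"),
    ("Person", "BAD TYPE; MATCH (n) DETACH DELETE n", "Award"),
]
accepted, rejected = validate_triples(candidates)
print("accepted:", accepted)
print("rejected:", rejected)
```

Rejected triples can be logged and reviewed instead of being silently dropped, which also helps when tuning the allow-lists.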


## Architecture

At a high level, the steps of constructing a knowledge graph from text are:

1. **Extracting structured information from text**: A model is used to extract structured graph information from text.
2. **Storing into graph database**: Storing the extracted structured graph information in a graph database enables downstream RAG applications.
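
The two steps can be sketched without any external service. In the toy example below, a hand-written pattern stands in for the model and a small Python class stands in for the graph database; `extract_triples` and `InMemoryGraph` are hypothetical names used only for illustration.

```python
import re
from collections import defaultdict


def extract_triples(text):
    """Step 1 (stand-in for the LLM): find 'X works at Y.' statements."""
    pattern = re.compile(r"(\w[\w ]*?) works at (\w[\w ]*?)\.")
    return [(m.group(1), "WORKS_AT", m.group(2)) for m in pattern.finditer(text)]


class InMemoryGraph:
    """Step 2 (stand-in for the graph database): store nodes and edges."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(set)  # source -> {(rel_type, target)}

    def add_triples(self, triples):
        for source, rel_type, target in triples:
            # Sets deduplicate nodes and edges that are extracted more than once.
            self.nodes.update((source, target))
            self.edges[source].add((rel_type, target))


text = "Alice works at Acme. Bob works at Acme."
store = InMemoryGraph()
store.add_triples(extract_triples(text))
print(sorted(store.nodes))
print(dict(store.edges))
```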

## Setup

First, install the required packages and set environment variables.
In this example, we use the Neo4j graph database.

In [2]:
%pip install --upgrade --quiet  langchain langchain-neo4j langchain-openai langchain-experimental neo4j

We default to OpenAI models in this guide.

In [3]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Uncomment the below to use LangSmith. Not required.
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
# os.environ["LANGCHAIN_TRACING_V2"] = "true"


Next, we need to define Neo4j credentials and connection.
Follow [these installation steps](https://neo4j.com/docs/operations-manual/current/installation/) to set up a Neo4j database.

In [7]:
import getpass
import os

from langchain_neo4j import Neo4jGraph

os.environ["NEO4J_URI"] = "neo4j+s://41e7e72d.databases.neo4j.io"
os.environ["NEO4J_USERNAME"] = "neo4j"
# Prompt for the password instead of hardcoding it in the notebook
os.environ["NEO4J_PASSWORD"] = getpass.getpass("Neo4j password: ")


graph = Neo4jGraph(refresh_schema=False)

## LLM Graph Transformer

Extracting graph data from text turns unstructured information into a structured form that is easier to query and to traverse for relationships and patterns. The `LLMGraphTransformer` converts text documents into structured graph documents by using an LLM to parse and categorize entities and their relationships. The choice of LLM significantly influences the output, since it determines the accuracy and nuance of the extracted graph data.


In [8]:
import os

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4-turbo")

llm_transformer = LLMGraphTransformer(llm=llm)

Now we can pass in example text and examine the results.

In [9]:
from langchain_core.documents import Document

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""
documents = [Document(page_content=text)]
graph_documents = llm_transformer.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Institution', properties={}), Node(id='Nobel Prize', type='Award', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Nobel Prize', type='Award', properties={}), type='WINNER', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Institution', properties={}), type='PROFESSOR', properties={}), Relationship(source=Node(id='Pierre Curie', type='Person', properties={}), target=Node(id='Nobel Prize', type='Award', properties={}), type='WINNER', properties={})]


Examine the following image to better grasp the structure of the generated knowledge graph.

![graph_construction1.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/graph_construction1.png?raw=1)

Note that the graph construction process is non-deterministic because we are using an LLM. Therefore, you might get slightly different results on each execution.
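
One way to quantify how much two runs differ is to compare their relationship sets. The sketch below uses Jaccard similarity on plain `(source, type, target)` tuples. `run_1` mirrors the output shown above, while `run_2` is a made-up second run for illustration; with real output you would build the tuples from each `Relationship` object.

```python
def jaccard(run_a, run_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of relationship tuples."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0  # two empty runs are treated as identical
    return len(a & b) / len(a | b)


run_1 = [
    ("Marie Curie", "WINNER", "Nobel Prize"),
    ("Marie Curie", "PROFESSOR", "University Of Paris"),
    ("Pierre Curie", "WINNER", "Nobel Prize"),
]
# Hypothetical second run: one relationship differs from run_1.
run_2 = [
    ("Marie Curie", "WINNER", "Nobel Prize"),
    ("Marie Curie", "SPOUSE", "Pierre Curie"),
    ("Pierre Curie", "WINNER", "Nobel Prize"),
]

similarity = jaccard(run_1, run_2)
print(f"Jaccard similarity: {similarity:.2f}")
```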

Additionally, you have the flexibility to define specific types of nodes and relationships for extraction according to your requirements.

In [10]:
llm_transformer_filtered = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
)
graph_documents_filtered = llm_transformer_filtered.convert_to_graph_documents(
    documents
)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]


To define the graph schema more precisely, consider using a three-tuple approach for relationships. In this approach, each tuple consists of three elements: the source node, the relationship type, and the target node.

In [12]:
allowed_relationships = [
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
]

llm_transformer_tuple = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=allowed_relationships,
)
graph_documents_filtered = llm_transformer_tuple.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_filtered[0].nodes}")
print(f"Relationships:{graph_documents_filtered[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='Poland', type='Country', properties={}), Node(id='France', type='Country', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Poland', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='France', type='Country', properties={}), type='NATIONALITY', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]
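
Conceptually, a three-tuple schema acts as a filter on (source type, relationship type, target type) combinations. The sketch below reproduces that idea on plain tuples. It illustrates the concept only and is not the internal logic of `LLMGraphTransformer`; `conforms_to_schema` is a hypothetical helper.

```python
ALLOWED = {
    ("Person", "SPOUSE", "Person"),
    ("Person", "NATIONALITY", "Country"),
    ("Person", "WORKED_AT", "Organization"),
}


def conforms_to_schema(source_type, rel_type, target_type, allowed=ALLOWED):
    """True if the typed triple matches one of the allowed three-tuples."""
    return (source_type, rel_type, target_type) in allowed


# Each entry is ((source_id, source_type), rel_type, (target_id, target_type)).
extracted = [
    (("Marie Curie", "Person"), "NATIONALITY", ("Poland", "Country")),
    (("Marie Curie", "Person"), "WORKED_AT", ("University Of Paris", "Organization")),
    # Deliberately invalid: an organization cannot be the source of SPOUSE.
    (("University Of Paris", "Organization"), "SPOUSE", ("Pierre Curie", "Person")),
]

kept = [rel for rel in extracted if conforms_to_schema(rel[0][1], rel[1], rel[2][1])]
for source, rel_type, target in kept:
    print(f"{source[0]} -[{rel_type}]-> {target[0]}")
```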


For a better understanding of the generated graph, we can again visualize it.

![graph_construction2.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/graph_construction2.png?raw=1)

The `node_properties` parameter enables the extraction of node properties, allowing the creation of a more detailed graph.
When set to `True`, the LLM autonomously identifies and extracts relevant node properties.
Conversely, if `node_properties` is defined as a list of strings, the LLM selectively retrieves only the specified properties from the text.

In [13]:
llm_transformer_props = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Country", "Organization"],
    allowed_relationships=["NATIONALITY", "LOCATED_IN", "WORKED_AT", "SPOUSE"],
    node_properties=["born_year"],
)
graph_documents_props = llm_transformer_props.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_props[0].nodes}")
print(f"Relationships:{graph_documents_props[0].relationships}")

Nodes:[Node(id='Marie Curie', type='Person', properties={'born_year': '1867'}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='University Of Paris', type='Organization', properties={})]
Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]


## Storing to graph database

The generated graph documents can be stored in a graph database using the `add_graph_documents` method.

In [14]:
graph.add_graph_documents(graph_documents_props)

Most graph databases support indexes to optimize data import and retrieval. Since we might not know all the node labels in advance, we can handle this by adding a secondary base label to each node using the `baseEntityLabel` parameter.

In [None]:
graph.add_graph_documents(graph_documents, baseEntityLabel=True)

Results will look like:

![graph_construction3.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/graph_construction3.png?raw=1)

The final option is to also import the source documents for the extracted nodes and relationships. This approach lets us track which documents each entity appeared in.

In [15]:
graph.add_graph_documents(graph_documents, include_source=True)

The graph will have the following structure:

![graph_construction4.png](https://github.com/langchain-ai/langchain/blob/master/docs/static/img/graph_construction4.png?raw=1)

In this visualization, the source document is highlighted in blue, with all entities extracted from it connected by `MENTIONS` relationships.
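
The benefit of keeping source links can be illustrated with a plain inverted index. In this sketch the document ids and entity lists are invented for illustration; in the database, the same lookup would follow `MENTIONS` relationships from documents to entities.

```python
from collections import defaultdict

# Hypothetical data: which entities each source document mentions.
mentions = {
    "doc-1": ["Marie Curie", "Pierre Curie", "University Of Paris"],
    "doc-2": ["Marie Curie", "Nobel Prize"],
}


def documents_by_entity(mentions):
    """Invert document -> entities into entity -> documents."""
    index = defaultdict(list)
    for doc_id, entities in mentions.items():
        for entity in entities:
            index[entity].append(doc_id)
    return dict(index)


index = documents_by_entity(mentions)
print(index["Marie Curie"])
```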

In [16]:
print(f"Relationships:{graph_documents_props[0].relationships}")

Relationships:[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='WORKED_AT', properties={})]


# GRAPH RAG EVALUATION

Next, we load Wikipedia content for the query "Psychology" and construct a knowledge graph from it to test the graph RAG pipeline.

In [18]:
!pip install wikipedia



In [21]:
from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import TokenTextSplitter

# Read the Wikipedia article
raw_documents = WikipediaLoader(query="Psychology").load()
# Define chunking strategy
text_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)

# Only take the first three raw documents
documents = text_splitter.split_documents(raw_documents[:3])
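
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It splits on whitespace rather than on model tokens, so it only approximates what `TokenTextSplitter` does; `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "psychology is the scientific study of mind and behavior".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a statement that straddles a chunk boundary is less likely to be lost during extraction.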

In [27]:
llm_transformer_query = LLMGraphTransformer(
    llm=llm,
)

In [28]:
graph_documents_query = llm_transformer_query.convert_to_graph_documents(documents)
print(f"Nodes:{graph_documents_query[0].nodes}")
print(f"Relationships:{graph_documents_query[0].relationships}")

Nodes:[Node(id='Psychology', type='Discipline', properties={}), Node(id='Mind', type='Concept', properties={}), Node(id='Behavior', type='Concept', properties={}), Node(id='Human', type='Species', properties={}), Node(id='Nonhuman', type='Species', properties={}), Node(id='Conscious Phenomena', type='Concept', properties={}), Node(id='Unconscious Phenomena', type='Concept', properties={}), Node(id='Mental Processes', type='Concept', properties={}), Node(id='Thoughts', type='Concept', properties={}), Node(id='Feelings', type='Concept', properties={}), Node(id='Motives', type='Concept', properties={}), Node(id='Biological Psychologists', type='Group', properties={}), Node(id='Neuroscience', type='Discipline', properties={}), Node(id='Social Scientists', type='Group', properties={}), Node(id='Individual', type='Entity', properties={}), Node(id='Group', type='Entity', properties={}), Node(id='Psychologist', type='Profession', properties={}), Node(id='Behavioral Scientist', type='Profession

Now we add these nodes and relationships to the graph database.

In [29]:
graph.add_graph_documents(graph_documents_query)

# RAG APPLICATION

This is the first step toward automating this process: querying the generated graph from a RAG application.

In [31]:
!pip install --upgrade --quiet  langchain langchain-neo4j langchain-openai langchain-experimental neo4j

import os
from langchain_neo4j import GraphCypherQAChain, Neo4jGraph
from langchain_openai import ChatOpenAI

In [34]:
# Instantiate Neo4jGraph from langchain_neo4j
graph = Neo4jGraph(url=os.environ["NEO4J_URI"], username=os.environ["NEO4J_USERNAME"], password=os.environ["NEO4J_PASSWORD"], refresh_schema=False)

# Refresh schema before using the graph in the chain
graph.refresh_schema()

In [35]:
# Query the knowledge graph in a RAG application.
# Use the GraphCypherQAChain from langchain_neo4j, which accepts langchain_neo4j's Neo4jGraph.
from langchain_neo4j import GraphCypherQAChain

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    validate_cypher=True,  # Validate relationship directions
    verbose=True,
    allow_dangerous_requests=True,  # Required opt-in: the chain runs LLM-generated Cypher
)

In [None]:
cypher_chain.invoke({"query": "Who has a psychology degree?"})

# DBPEDIA Integration

Here we query DBpedia through its public SPARQL endpoint. We do not post-process the query results yet, but this could be a good example for validating LLM output against DBpedia.

In [39]:
!pip install rdflib



In [40]:
from langchain_community.graphs import RdfGraph
from langchain.chains import GraphSparqlQAChain

In [41]:
graph = RdfGraph(query_endpoint="https://dbpedia.org/sparql")

In [49]:
dbpedia_chainGPT3point5 = GraphSparqlQAChain.from_llm(
    ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0, verbose=True),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # Required opt-in: the chain runs LLM-generated SPARQL
)
dbpedia_chainGPT4 = GraphSparqlQAChain.from_llm(
    ChatOpenAI(model="gpt-4", temperature=0, verbose=True),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # Required opt-in: the chain runs LLM-generated SPARQL
)
dbpedia_chainGPT4Turbo = GraphSparqlQAChain.from_llm(
    ChatOpenAI(model="gpt-4-turbo", temperature=0, verbose=True),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,  # Required opt-in: the chain runs LLM-generated SPARQL
)

In [56]:
query = """
Relevant DBpedia Knowledge Graph relationship types (relations):
  ?movie rdf:type dbo:Film .
  ?movie dbo:director dbr:?name .
  FILTER regex(?name,<input director's name>)

Associated namespaces:
 dbr:  <http://dbpedia.org/resource/>
 dbo:  <http://dbpedia.org/ontology/>
 rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

List movies by Spike Lee
"""

res3point5 = dbpedia_chainGPT3point5.invoke({dbpedia_chainGPT3point5.input_key: query})[dbpedia_chainGPT3point5.output_key]
print(res3point5)

res4 = dbpedia_chainGPT4.invoke({dbpedia_chainGPT4.input_key: query})[dbpedia_chainGPT4.output_key]

print(res4)

res4plus = dbpedia_chainGPT4Turbo.invoke({dbpedia_chainGPT4Turbo.input_key: query})[dbpedia_chainGPT4Turbo.output_key]

print(res4plus)

> Entering new GraphSparqlQAChain chain...
Identified intent:
SELECT
Generated SPARQL:
```
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?movie
WHERE {
    ?movie rdf:type dbo:Film .
    ?movie dbo:director dbr:Spike_Lee .
}
```

ValueError: You did something wrong formulating either the URI or your SPARQL query
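
The generated query in the trace above is wrapped in Markdown code fences. One plausible cause of the error, which is an assumption rather than something the trace confirms, is that the fences were passed to the SPARQL parser along with the query. The hypothetical helper below strips such fences from a model response; how you would hook it into the chain depends on the LangChain version you use.

```python
import re

# Built programmatically so this example contains no literal fence.
TICKS = "`" * 3
FENCE = re.compile(
    r"^" + TICKS + r"[A-Za-z]*[ \t]*\n(.*?)\n" + TICKS + r"$",
    re.DOTALL,
)


def strip_code_fences(text):
    """Return the text inside a single surrounding Markdown fence, if present."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    return match.group(1).strip() if match else stripped


raw = TICKS + "\nSELECT ?movie\nWHERE { ?movie a dbo:Film . }\n" + TICKS
cleaned = strip_code_fences(raw)
print(cleaned)
```

Text without a surrounding fence is returned unchanged apart from leading and trailing whitespace, so the helper is safe to apply to every response.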