## Creating a Knowledge Graph from Text

Jay Urbain, PhD
3/8/2024

I have borrowed heavily from the following references:  

[OpenAI functions](https://openai-functions.readthedocs.io/en/latest/)

[LangChain](https://python.langchain.com/docs/get_started/introduction)

[Neo4j Graph Academy](https://graphacademy.neo4j.com/)

[LLM Knowledge Graph Blog](https://neo4j.com/blog/unifying-llm-knowledge-graph/)

[Building a Knowledge Base from Texts: a Full Practical Example](https://medium.com/nlplanet/building-a-knowledge-base-from-texts-a-full-practical-example-8dbbffb912fa)

[Implementing Advanced Retrieval RAG Strategies With Neo4j](https://medium.com/neo4j/implementing-advanced-retrieval-rag-strategies-with-neo4j-c968a002c513)


Many more ...

In [1]:
!pip install -U pydantic langchain neo4j openai wikipedia tiktoken langchain_openai


In [2]:
!pip install -U langchain_community


# Constructing a knowledge graph from text using an LLM.

## Implement information extraction with LangChain.

## Store entity-relations in Neo4j Graph Database

Extracting structured information from unstructured text has been studied for a long time but remains a difficult, unsolved problem.  

LLMs have significantly shifted the field of information extraction, improving the results and lowering the barrier to entry.

The information extraction pipeline generates subject-predicate-object triplets that we can use to create a graph representation of entity relations. The nodes represent entities, while the connecting lines denote the relationships between these entities.
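As a minimal sketch of that idea, the snippet below turns a few subject-predicate-object triplets into a set of nodes and a list of outgoing edges per subject. The triplets are written by hand for illustration, and the helper name `triples_to_graph` is an assumption, not part of LangChain or Neo4j.

```python
from collections import defaultdict

# Hand-written triplets for illustration; a real pipeline would extract these from text.
triples = [
    ("Albert Einstein", "BORN_IN", "Ulm"),
    ("Albert Einstein", "DEVELOPED", "Theory of Relativity"),
    ("Albert Einstein", "RECEIVED", "Nobel Prize in Physics"),
]


def triples_to_graph(triples):
    """Collect the entities (nodes) and the outgoing relationships (edges) per subject."""
    nodes = set()
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))
        edges[subject].append((predicate, obj))
    return nodes, dict(edges)


nodes, edges = triples_to_graph(triples)
print(sorted(nodes))
print(edges["Albert Einstein"])
```

A graph database such as Neo4j stores the same structure natively, with labels and properties attached to both nodes and relationships.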

The information extraction part of the pipeline still needs a lot of work.

We are going to use OpenAI functions in combination with LangChain to construct a knowledge graph from a sample Wikipedia page.

# Neo4j Environment setup and LangChain wrapper

You will need to set up a free instance on [Neo4j Aura](https://neo4j.com/), which offers cloud instances of the Neo4j database.


LangChain is a framework designed to simplify the creation of applications using large language models. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.

The following code instantiates a LangChain wrapper that connects to the Neo4j database.

Enter your own Neo4j database URL and password. You can get the URL after you start an instance under _Instances_ -> _Connections_.

In [4]:
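# Neo4jGraph lives in langchain_community in current LangChain releases
from langchain_community.graphs import Neo4jGraph

# Replace these with the values for your own Aura instance
url = "neo4j+s://9826edf8.databases.neo4j.io"
username = "neo4j"
password = "xxx"
graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)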
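
If you would rather not type credentials into the notebook, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are assumptions chosen for this example, not something Neo4j or LangChain requires.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read connection settings from a mapping of environment variables.

    The variable names are assumptions made for this sketch.
    """
    missing = [key for key in ("NEO4J_URI", "NEO4J_PASSWORD") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "url": env["NEO4J_URI"],
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
    }


# Demonstration with an in-memory mapping instead of the real environment
settings = load_neo4j_settings(
    {
        "NEO4J_URI": "neo4j+s://<your-instance>.databases.neo4j.io",
        "NEO4J_PASSWORD": "example-password",
    }
)
print(settings["username"])
```

With real environment variables set, the connection could then be created with `Neo4jGraph(**load_neo4j_settings())`, since the keys match the `url`, `username`, and `password` arguments used above.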

## Information extraction pipeline
A typical information extraction pipeline contains the following steps.

- Coreference resolution

- Named entity recognition

- Entity disambiguation

- Relationship extraction

Coreference resolution is the task of finding all expressions that refer to a specific entity. For example, it links a pronoun such as "he" to the person it refers to.

In the named entity recognition part of the pipeline, we try to extract all mentioned entities.

Entity disambiguation is the process of identifying and distinguishing between entities with similar names or references to ensure the correct entity is recognized in a given context. This step can be improved using graph machine learning (node classification).

In the final step, relationship extraction, the model tries to identify the relationships between entities.

## Extracting structured information with [OpenAI functions](https://openai-functions.readthedocs.io/en/latest/)


OpenAI functions are used to extract structured information from natural language. The idea behind OpenAI functions is to have an LLM output a predefined JSON object with populated values. The predefined JSON object can be used as input to other functions in RAG applications, or it can be used to extract predefined structured information from text.

In LangChain, you can pass a class description of the desired JSON object to the OpenAI functions feature.  

LangChain already has definitions of nodes and relationships as [Pydantic](https://docs.pydantic.dev/1.10/usage/models/) classes that we can use.

OpenAI functions don't currently support a dictionary object as a value. Therefore, we have to overwrite the properties definition to adhere to the limitations of the functions' endpoint.

In [6]:
from langchain_community.graphs.graph_document import (
    Node as BaseNode,
    Relationship as BaseRelationship,
    GraphDocument,
)
from langchain.schema import Document
from typing import List, Dict, Any, Optional
# from langchain.pydantic_v1 import Field, BaseModel
from pydantic import Field, BaseModel


class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description="key")
    value: str = Field(..., description="value")

class Node(BaseNode):
    properties: Optional[List[Property]] = Field(
        None, description="List of node properties")

class Relationship(BaseRelationship):
    properties: Optional[List[Property]] = Field(
        None, description="List of relationship properties"
    )

class KnowledgeGraph(BaseModel):
    """Generate a knowledge graph with entities and relationships."""
    nodes: List[Node] = Field(
        ..., description="List of nodes in the knowledge graph")
    rels: List[Relationship] = Field(
        ..., description="List of relationships in the knowledge graph"
    )
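
To make the shape of the expected output concrete, here is a hand-written JSON payload that matches the `KnowledgeGraph` class above. It is an illustration only, not actual model output, and the helper `has_flat_properties` is a hypothetical name introduced for this sketch. The point is that every property is a flat key-value pair rather than a nested dictionary.

```python
import json

# Hand-written example of the payload shape; not actual model output.
payload = json.loads("""
{
  "nodes": [
    {"id": "Albert Einstein", "type": "Person",
     "properties": [{"key": "birthDate", "value": "14 March 1879"}]},
    {"id": "Ulm", "type": "Location", "properties": []}
  ],
  "rels": [
    {"source": {"id": "Albert Einstein", "type": "Person"},
     "target": {"id": "Ulm", "type": "Location"},
     "type": "BORN_IN",
     "properties": []}
  ]
}
""")


def has_flat_properties(item):
    """Return True if every property is a {"key": ..., "value": str} pair."""
    return all(
        set(prop) == {"key", "value"} and isinstance(prop["value"], str)
        for prop in item.get("properties") or []
    )


print(all(has_flat_properties(node) for node in payload["nodes"]))
print(all(has_flat_properties(rel) for rel in payload["rels"]))
```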

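The `Property` class above represents properties as a list of key-value pairs instead of a dictionary.

Because you can only pass a single object to the API, we combine the nodes and relationships in a single class called `KnowledgeGraph`. The helper functions in the next cell map the extracted objects back to LangChain's base node and relationship classes.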

In [18]:
def format_property_key(s: str) -> str:
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)

def props_to_dict(props) -> dict:
    """Convert properties to a dictionary."""
    properties = {}
    if not props:
        return properties
    for p in props:
        properties[format_property_key(p.key)] = p.value
    return properties

def map_to_base_node(node: Node) -> BaseNode:
    """Map the KnowledgeGraph Node to the base Node."""
    properties = props_to_dict(node.properties) if node.properties else {}
    # Add name property for better Cypher statement generation
    properties["name"] = node.id.title()
    return BaseNode(
        id=node.id.title(), type=node.type.capitalize(), properties=properties
    )


def map_to_base_relationship(rel: Relationship) -> BaseRelationship:
    """Map the KnowledgeGraph Relationship to the base Relationship."""
    source = map_to_base_node(rel.source)
    target = map_to_base_node(rel.target)
    properties = props_to_dict(rel.properties) if rel.properties else {}
    return BaseRelationship(
        source=source, target=target, type=rel.type, properties=properties
    )
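
A quick way to see what `format_property_key` does is to run it on a few keys. The function body below is copied from the cell above so that the snippet runs on its own; the sample keys are made up for illustration.

```python
def format_property_key(s: str) -> str:
    # Same logic as the helper above, repeated so this snippet is self-contained
    words = s.split()
    if not words:
        return s
    first_word = words[0].lower()
    capitalized_words = [word.capitalize() for word in words[1:]]
    return "".join([first_word] + capitalized_words)


for key in ["birth date", "Date Of Birth", "birthDate"]:
    print(repr(key), "->", repr(format_property_key(key)))
# 'birth date' -> 'birthDate'
# 'Date Of Birth' -> 'dateOfBirth'
# 'birthDate' -> 'birthdate'
```

Note the last case: a key that the model already returns as a single camelCase word is lowercased entirely, because the first word is always passed through `lower()`. That is consistent with property names such as `birthdate` appearing in the query results later in this notebook.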

Finally, we have to do prompt engineering. I'm not good at this, so please improve this.

Here's the recommended approach:
* Iterate over the prompt and improve results using natural language
* If something doesn't work as intended, ask ChatGPT to make it clearer for an LLM to understand the task
* Finally, when the prompt has all the instructions needed, ask ChatGPT to summarize the instructions in markdown format, saving on tokens and perhaps making the instructions clearer

We have specified markdown format. People claim OpenAI models respond better to markdown syntax in prompts. I have not substantiated this.

Remember to enter your OpenAI API_KEY.

Here we go.


In [13]:
import os
from langchain.chains.openai_functions import (
    create_openai_fn_chain,
    create_structured_output_chain,
)
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

API_KEY="xxx"

os.environ["OPENAI_API_KEY"] = API_KEY
llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

def get_extraction_chain(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None
    ):
    prompt = ChatPromptTemplate.from_messages(
        [(
          "system",
          f"""# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
{'- **Allowed Node Labels:**' + ", ".join(allowed_nodes) if allowed_nodes else ""}
{'- **Allowed Relationship Types**:' + ", ".join(allowed_rels) if allowed_rels else ""}
## 3. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 4. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
## 5. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
          """),
            ("human", "Use the given format to extract information from the following input: {input}"),
            ("human", "Tip: Make sure to answer in the correct format"),
        ])
    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)

OpenAI function output is a structured JSON object, and structured JSON syntax adds a lot of token overhead to the result.

Besides the general instructions, we have also added the option to limit which node or relationship types should be extracted from text.
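
The optional lines in the prompt are produced by conditional expressions inside the f-string. The standalone sketch below mirrors that behavior so you can see what gets rendered; `render_allowed_lines` is a hypothetical helper written for this illustration and is not used by `get_extraction_chain`.

```python
from typing import List, Optional


def render_allowed_lines(
    allowed_nodes: Optional[List[str]] = None,
    allowed_rels: Optional[List[str]] = None,
) -> str:
    """Build the optional prompt lines; nothing is rendered for empty or missing lists."""
    lines = []
    if allowed_nodes:
        # A space is added after the colon here purely for readability
        lines.append("- **Allowed Node Labels:** " + ", ".join(allowed_nodes))
    if allowed_rels:
        lines.append("- **Allowed Relationship Types:** " + ", ".join(allowed_rels))
    return "\n".join(lines)


print(render_allowed_lines(["Person", "Location"], ["BORN_IN"]))
print(repr(render_allowed_lines()))
```

When neither list is given, the prompt contains no restriction at all and the model is free to invent labels and relationship types.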

With the Neo4j connection and LLM prompt ready, we can define the information extraction pipeline as a single function.

In [14]:
def extract_and_store_graph(
    document: Document,
    nodes: Optional[List[str]] = None,
    rels: Optional[List[str]] = None,
) -> None:
    # Extract graph data using OpenAI functions
    extract_chain = get_extraction_chain(nodes, rels)
    data = extract_chain.invoke(document.page_content)["function"]
    # Construct a graph document
    graph_document = GraphDocument(
        nodes=[map_to_base_node(node) for node in data.nodes],
        relationships=[map_to_base_relationship(rel) for rel in data.rels],
        source=document,
    )
    # Store information into a graph
    graph.add_graph_documents([graph_document])

The function takes a LangChain document as well as optional node and relationship parameters, which limit the types of objects we want the LLM to identify and extract. The `add_graph_documents` method of the Neo4j graph object writes the resulting graph document to the database.

## Evaluation

Extract information from a Wikipedia page and construct a knowledge graph to test the pipeline. We are using the Wikipedia loader and text chunking modules provided by LangChain.

In [15]:
# Document loaders live in langchain_community in current LangChain releases
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import TokenTextSplitter

# Read the Wikipedia article
raw_documents = WikipediaLoader(query="Albert Einstein").load()
# Define chunking strategy
text_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)

# Only take the first three raw documents
documents = text_splitter.split_documents(raw_documents[:3])





In [16]:
len(documents)

3

The chunk size is relatively large so that we can provide as much context as possible around a single sentence. This is important for coreference resolution: the coreference step only works if the entity and its reference appear in the same chunk.
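
To illustrate why chunk boundaries matter, here is a small sliding-window chunker. It is a simplified sketch: it splits on whitespace as a stand-in for real model tokens, and `chunk_tokens` is a hypothetical helper, not the implementation behind `TokenTextSplitter`.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace-separated words stand in for tokens in this sketch
tokens = "Albert Einstein was born in Ulm . He later moved to Munich .".split()
for chunk in chunk_tokens(tokens, chunk_size=8, chunk_overlap=2):
    print(" ".join(chunk))
```

With these deliberately tiny settings, the second chunk contains "He" but not "Albert Einstein", so a model that sees only that chunk cannot resolve the pronoun. Larger chunks make this situation less likely.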

Run documents through the information extraction pipeline.

In [19]:
from tqdm import tqdm

for i, d in tqdm(enumerate(documents), total=len(documents)):
    extract_and_store_graph(d)

100%|██████████| 3/3 [01:15<00:00, 25.01s/it]


The process takes a few minutes.

Since the graph schema is not provided, the LLM decides on the fly which node labels and relationship types to use.

If you log in to Neo4j and open your graph, you can browse and visualize the graph entities and relations.

Below are the types of nodes and relationships the LLM identified.

```
Node labels
*(104)
Award
Character
Company
Conference
Disease
Event
Location
Movie
Organization
Person
School
Service
University

Relationship types
*(105)
ALLEGATION
ALMA_MATER
ATTENDED
BIRTH_PLACE
CAUSE_OF_DEATH
CEO
CO-FOUNDER
CONTAINS
CONTRIBUTOR
CREATOR
CURRENTCEO
DESCRIPTION
DEVELOPED
DEVELOPER
FEATURED_IN
FORMERCEO
FORMERNAME
FOUNDEDBY
FOUNDER
FUNDING
HABIT
HEART
INCLUDED_IN
INVOLVED_IN
KEYNOTESPEAKER
NAMED_AS
OPERATED_BY
OWNED_BY
OWNER
PERSONALITY_TRAIT
PERSPECTIVE
PIONEER
PRESIDENT
PREVIOUSWORK
PRODUCER
RECIPIENT
RECORD_HOLDER
RESIDENCE
SPEAKER
```
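
The same listing can also be retrieved programmatically with the built-in Neo4j procedures `db.labels()` and `db.relationshipTypes()`. The sketch below wraps them in a hypothetical helper, `list_schema`, that accepts any callable returning rows as dictionaries. A stand-in function is used here so that the snippet runs without a database.

```python
LABELS_QUERY = "CALL db.labels() YIELD label RETURN label ORDER BY label"
REL_TYPES_QUERY = (
    "CALL db.relationshipTypes() YIELD relationshipType "
    "RETURN relationshipType ORDER BY relationshipType"
)


def list_schema(run_query):
    """run_query: callable that takes a Cypher string and returns a list of dict rows."""
    labels = [row["label"] for row in run_query(LABELS_QUERY)]
    rel_types = [row["relationshipType"] for row in run_query(REL_TYPES_QUERY)]
    return labels, rel_types


def fake_query(cypher):
    # Stand-in for a live connection; returns made-up rows for illustration
    if "db.labels" in cypher:
        return [{"label": "Location"}, {"label": "Person"}]
    return [{"relationshipType": "BORN_IN"}]


print(list_schema(fake_query))
```

With a live connection, passing `graph.query` as `run_query` should work, assuming it returns a list of dictionaries as it does elsewhere in this notebook.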




Skip the following two cells, then come back and attempt to filter the entities.

In [17]:
# Delete the graph
graph.query("MATCH (n) DETACH DELETE n")

[]

In [18]:
# Specify which node labels should be extracted by the LLM
allowed_nodes = ["Person", "Company", "Location", "Event", "Technology", "Service", "GPU", "Award", "University", "School"]

for i, d in tqdm(enumerate(documents), total=len(documents)):
    extract_and_store_graph(d, allowed_nodes)


That's a little cleaner.

```
Relationship types
*(49)
ALMAMATER
ATTENDED
CEO
CO-FOUNDER
DONATED
FOUNDER
KEYNOTE_SPEAKER
PRESIDENT
RECEIVED_FUNDING_FROM
SPEAKER
```

In this example we have only limited the node labels, but you can also limit the relationship types by passing a list as the third argument to the `extract_and_store_graph` function.

Please explore using different entities and identifying allowed relations.

If you're game, try to improve the OpenAI function prompts.

Note: We skipped entity disambiguation. Options for adding it include:

* Using [entity linking](https://wikifier.org/about.html) or [entity disambiguation NLP models](https://github.com/SapienzaNLP/extend)

* Doing a [second pass through an LLM and asking it to perform entity disambiguation](https://medium.com/neo4j/creating-a-knowledge-graph-from-video-transcripts-with-gpt-4-52d7c7b9f32c)

* [Graph-based approaches](https://neo4j.com/developer-blog/exploring-supervised-entity-resolution-in-neo4j/)
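
Before reaching for any of these, a much simpler first pass can flag likely duplicates for manual review. The sketch below is not one of the approaches linked above: it only compares node IDs by string similarity using the standard library, and both the helper name `find_duplicate_candidates` and the `0.8` threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_candidates(node_ids, threshold=0.8):
    """Return pairs of node IDs whose case-insensitive similarity meets the threshold."""
    candidates = []
    for a, b in combinations(sorted(set(node_ids)), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            candidates.append((a, b, round(score, 2)))
    return candidates


# Made-up node IDs, including a misspelled variant of the same entity
node_ids = ["Albert Einstein", "Albert Einstien", "Ulm"]
for a, b, score in find_duplicate_candidates(node_ids):
    print(f"{a!r} and {b!r} look similar (score {score})")
```

String similarity catches spelling variants but not aliases such as a nickname or an abbreviation, which is where the linked entity linking, LLM-based, and graph-based methods come in.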


## RAG Application

Finally, we can browse information in the knowledge graph by constructing Cypher statements. Cypher is a structured query language used to work with graph databases, similar to how SQL is used for relational databases.

LangChain has a [GraphCypherQAChain](https://medium.com/neo4j/langchain-cypher-search-tips-tricks-f7c9e9abca4d) that reads the schema of the graph and constructs appropriate Cypher statements based on the user input.

In [22]:
# Query the knowledge graph in a RAG application
from langchain.chains import GraphCypherQAChain

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    validate_cypher=True, # Validate relationship directions
    verbose=True,
    allow_dangerous_requests=True
)

In [23]:
cypher_chain.invoke({"query": "Albert Einstein"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person) WHERE p.name = 'Albert Einstein' RETURN p
Full Context:
[{'p': {'birthdate': '14 March 1879', 'occupation': 'theoretical physicist', 'nationality': 'German', 'name': 'Albert Einstein', 'id': 'Albert Einstein', 'deathdate': '18 April 1955'}}]

> Finished chain.


{'query': 'Albert Einstein',
 'result': 'Albert Einstein was a German theoretical physicist born on 14 March 1879 and passed away on 18 April 1955.'}

In [24]:
cypher_chain.invoke({"query": "Albert Einstein occupation"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person {name: "Albert Einstein"}) RETURN p.occupation
Full Context:
[{'p.occupation': 'theoretical physicist'}]

> Finished chain.


{'query': 'Albert Einstein occupation',
 'result': "Albert Einstein's occupation is a theoretical physicist."}

In [25]:
cypher_chain.invoke({"query": "When was Albert Einstein born"})



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (p:Person {name: "Albert Einstein"}) RETURN p.birthdate
Full Context:
[{'p.birthdate': '14 March 1879'}]

> Finished chain.


{'query': 'When was Albert Einstein born',
 'result': 'Albert Einstein was born on 14 March 1879.'}