## Knowledge graphs from unstructured data

Use the Neo4j GraphRAG package to convert unstructured text into a knowledge graph of nodes and relationships.

In [None]:
%pip install "neo4j_graphrag[openai]"

## Define your entities and relations

You can define simple lists of entity (node) types and relationship types, plus a list of patterns, where each pattern is a `(source, relationship, target)` tuple:


In [None]:
# List the entities and relations the LLM should look for in the text
node_types = ["Company", "Location", "Country"]
relationship_types = ["LOCATED_IN", "IN_COUNTRY", "FUNDED_BY"]
patterns = [
    ("Company", "LOCATED_IN", "Location"),
    ("Location", "IN_COUNTRY", "Country"),
    ("Company", "FUNDED_BY", "Company"),
]
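
A pattern that names an undeclared label is easy to introduce by typo, so a quick consistency check before building the pipeline may help. The sketch below is plain Python and not part of `neo4j_graphrag`; the helper name `find_undeclared` is hypothetical.

```python
# Hypothetical helper (not part of neo4j_graphrag): report labels that appear
# in a pattern but are missing from the declared node/relationship lists.
node_types = ["Company", "Location", "Country"]
relationship_types = ["LOCATED_IN", "IN_COUNTRY", "FUNDED_BY"]
patterns = [
    ("Company", "LOCATED_IN", "Location"),
    ("Location", "IN_COUNTRY", "Country"),
    ("Company", "FUNDED_BY", "Company"),
]


def find_undeclared(node_types, relationship_types, patterns):
    """Return a sorted list of labels used in patterns but never declared."""
    nodes, rels = set(node_types), set(relationship_types)
    missing = set()
    for source, rel, target in patterns:
        # Both endpoints must be declared node types
        missing.update({source, target} - nodes)
        # The middle element must be a declared relationship type
        if rel not in rels:
            missing.add(rel)
    return sorted(missing)


undeclared = find_undeclared(node_types, relationship_types, patterns)
print(undeclared)  # an empty list means the schema is self-consistent
```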

Or you can be more precise and define the entities and relationships in detail, with typed properties and optional descriptions:

In [None]:
entities = [
  {
    "label": "Company",
    "properties": [
      {"name": "id", "type": "STRING", "description": "The unique identifier for the company in lower-kebab-case.  For example, acme-industries-llc"},
      {"name": "name", "type": "STRING"}
    ]
  },
  {
    "label": "Location",
    "properties": [
      {"name": "name", "type": "STRING"}
    ]
  },
  {
    "label": "Country",
    "properties": [
      {"name": "name", "type": "STRING"}
    ]
  },
  {
    "label": "FundingRound",
    "properties": [
      {"name": "id", "type": "STRING", "description": "The unique identifier for the funding round consisting of the company name and series in lower-kebab-case.  For example, neo4j-series-a-2024"},
      {"name": "series", "type": "STRING"},
      {"name": "amount", "type": "FLOAT"},
    ]
  },
]

relations = [
    { "source": "Company", "target": "Location", "label": "LOCATED_IN" },
    { "source": "Location", "target": "Country", "label": "IN_COUNTRY" },
    { "source": "Company", "target": "FundingRound", "label": "HAS_FUNDING_ROUND" },
    { "source": "FundingRound", "target": "Company", "label": "FUNDED_BY", "properties": [
      {"name": "date", "type": "DATE"}, {"name": "amount", "type": "FLOAT"},
    ] },
]
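
The detailed relation dictionaries carry the same information as the `(source, label, target)` patterns in the simpler form. Here is a minimal sketch of that mapping, plus a check that every endpoint is a defined entity label. It assumes each dictionary uses the `source`, `target`, and `label` keys shown above; the helper names and the small sample data are illustrative and not part of the library.

```python
def relations_to_patterns(relations):
    """Turn relation dicts into (source, label, target) tuples."""
    return [(r["source"], r["label"], r["target"]) for r in relations]


def undefined_endpoints(entities, relations):
    """Return relation labels whose source or target is not a defined entity."""
    defined = {e["label"] for e in entities}
    return [
        r["label"]
        for r in relations
        if r["source"] not in defined or r["target"] not in defined
    ]


# Illustrative sample data (a subset of the definitions above)
sample_entities = [{"label": "Company"}, {"label": "Location"}, {"label": "Country"}]
sample_relations = [
    {"source": "Company", "target": "Location", "label": "LOCATED_IN"},
    {"source": "Location", "target": "Country", "label": "IN_COUNTRY"},
]

sample_patterns = relations_to_patterns(sample_relations)
dangling = undefined_endpoints(sample_entities, sample_relations)
print(sample_patterns)
print(dangling)
```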

## Specific prompt templates

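You can define a specific prompt template to constrain what the LLM extracts. The example below prepends an instruction to the default extraction template so that only companies from an approved list are extracted, using their exact names: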

In [24]:
from neo4j_graphrag.experimental.pipeline.kg_builder import ERExtractionTemplate

approved_companies = ["Neo4j", "Greenbridge Partners", "One Peak Partners"]

# --- Establish Prompt Template for KG Builder ---
joined_names = '\n'.join(f'- {name}' for name in approved_companies)

company_instruction = (
    "Extract only information about the following companies. "
    "If a company is mentioned but is not in this list, ignore it. "
    "When extracting, the company name must match exactly as shown below. "
    "Do not generate or include any company not on this list or an alternate name for any company on this list. "
    "ONLY USE THE COMPANY NAME EXACTLY AS SHOWN IN THE LIST. "
    "If the text refers to 'the Company', 'the Registrant', or uses a pronoun or generic phrase instead of a company name, "
    "you MUST look up and use the exact company name from the allowed list based on context (such as the file being processed). "
    "UNDER NO CIRCUMSTANCES should you output 'the Company', 'the Registrant', or any generic phrase as a company name. "
    "Only use the exact allowed company name.\n\n"
    f"Allowed Companies (match exactly):\n{joined_names}\n\n"
)

custom_template = company_instruction + ERExtractionTemplate.DEFAULT_TEMPLATE
prompt_template = ERExtractionTemplate(template=custom_template)
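
A prompt instruction narrows what the LLM returns but cannot guarantee it, so it may be worth checking extracted names against the approved list afterwards. The sketch below assumes the extracted company names are already available as plain strings (how you read them back from the graph is up to you); `split_by_approval` is a hypothetical helper and the input list is made up for illustration.

```python
approved_companies = ["Neo4j", "Greenbridge Partners", "One Peak Partners"]


def split_by_approval(names, approved):
    """Partition names into (approved, rejected) using exact string matching."""
    allowed = set(approved)
    ok = [name for name in names if name in allowed]
    rejected = [name for name in names if name not in allowed]
    return ok, rejected


# Illustrative input, not real pipeline output
extracted = ["Neo4j", "Greenbridge Partners Ltd.", "the Company", "One Peak Partners"]

ok, rejected = split_by_approval(extracted, approved_companies)
print(ok)        # names that match the approved list exactly
print(rejected)  # names to review, rename, or drop
```

Exact matching is deliberately strict here: a variant such as a name with a legal suffix lands in `rejected`, which mirrors the "match exactly" wording of the instruction.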

## Create the pipeline using `SimpleKGPipeline`

In [None]:
from neo4j import GraphDatabase
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.llm import OpenAILLM

NEO4J_URI = "neo4j://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "neoneoneo"

# Connect to the Neo4j database
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))


# Create an Embedder object
embedder = OpenAIEmbeddings(model="text-embedding-3-large")

# Instantiate the LLM
llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "max_tokens": 2000,
        "response_format": {"type": "json_object"},
        "temperature": 0,
    },
)

# Instantiate the SimpleKGPipeline
kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    embedder=embedder,
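    # Alternative to `entities`/`relations` below: pass the simple schema
    # defined in the first section instead.
    # schema={
    #     "node_types": node_types,
    #     "relationship_types": relationship_types,
    #     "patterns": patterns,
    # },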
    prompt_template=prompt_template,
    entities=entities,
    relations=relations,
    # on_error="IGNORE",
    from_pdf=False,
)
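
Hard-coded credentials are fine for a local demo, but you may prefer to read them from the environment. A minimal sketch using only the standard library follows; the environment variable names and the `load_neo4j_settings` helper are assumptions for illustration, not something the driver or `neo4j_graphrag` requires.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read connection settings from a mapping, falling back to local defaults."""
    return {
        "uri": env.get("NEO4J_URI", "neo4j://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        # No default password: an empty value makes a missing secret obvious
        "password": env.get("NEO4J_PASSWORD", ""),
    }


settings = load_neo4j_settings()
# The values can then be passed to the driver as in the cell above:
# GraphDatabase.driver(settings["uri"], auth=(settings["username"], settings["password"]))
```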


In [4]:
# Run the pipeline on a piece of text
text = """
Neo4j is developed by Neo4j, Inc., based in San Mateo, California, United States and Malmö, Sweden.

In November 2016, Neo4j successfully secured $36M in Series D Funding led by Greenbridge Partners Ltd.[14]

In November 2018, Neo4j successfully secured $80M in Series E Funding led by One Peak Partners and Morgan Stanley Expansion Capital, with participation from other investors including Creandum, Eight Roads and Greenbridge Partners.[15]

In June 2021, Neo4j announced another round of funding, $325M in Series F, led by Eurazeo and GV (Alphabet’s venture capital arm), which valued the company at over $2 billion making the company one of the best funded database companies.[16][17]

In late 2024, the company began preparations for an initial public offering (IPO) on the Nasdaq stock exchange, although the timing of the IPO will be dependent on broader market conditions.[18]
"""

# Top-level `await` works in a notebook cell
res = await kg_builder.run_async(text=text)
# driver.close()  # uncomment once you no longer need the connection
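
Top-level `await` only works inside a notebook (or another running event loop). In a plain Python script, the same call would typically be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that pattern with a stand-in object: `StubPipeline` is hypothetical and only mimics the `run_async(text=...)` call shape used above, so the example runs without a database or an API key.

```python
import asyncio


class StubPipeline:
    """Stand-in for `kg_builder`; NOT the real SimpleKGPipeline."""

    async def run_async(self, text: str) -> dict:
        await asyncio.sleep(0)  # yield control, as a real I/O call would
        return {"chars_processed": len(text)}


async def main(pipeline, text: str) -> dict:
    # Same call shape as in the notebook cell
    return await pipeline.run_async(text=text)


sample_text = "Neo4j is based in San Mateo."
result = asyncio.run(main(StubPipeline(), sample_text))
print(result)
```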

In [21]:
res.result

{'resolver': {'number_of_nodes_to_resolve': 42, 'number_of_created_nodes': 18}}

## Bonus: Automatic schema construction

The `SchemaFromTextExtractor` will use an LLM to examine a piece of text and return a list of node types, relationship types and patterns.

In [7]:
from neo4j_graphrag.experimental.components.schema import SchemaFromTextExtractor
from neo4j_graphrag.llm import OpenAILLM

# Instantiate the automatic schema extractor component
schema_extractor = SchemaFromTextExtractor(
    llm=OpenAILLM(
        model_name="gpt-4o",
        model_params={
            "max_tokens": 2000,
            "response_format": {"type": "json_object"},
        },
    )
)

# Extract the schema from the text
extracted_schema = await schema_extractor.run(text=text)

In [None]:
extracted_schema.node_types

(NodeType(label='Company', description='', properties=[PropertyType(name='name', type='STRING', description='', required=False), PropertyType(name='location', type='STRING', description='', required=False), PropertyType(name='valuation', type='FLOAT', description='', required=False)], additional_properties=False),
 NodeType(label='Person', description='', properties=[PropertyType(name='role', type='STRING', description='', required=False)], additional_properties=False),
 NodeType(label='FundingRound', description='', properties=[PropertyType(name='amount', type='FLOAT', description='', required=False), PropertyType(name='series', type='STRING', description='', required=False), PropertyType(name='date', type='DATE', description='', required=False)], additional_properties=False),
 NodeType(label='Event', description='', properties=[PropertyType(name='type', type='STRING', description='', required=False), PropertyType(name='date', type='DATE', description='', required=False)], additional_

In [None]:
extracted_schema.relationship_types

(RelationshipType(label='RECEIVED', description='', properties=[], additional_properties=True),
 RelationshipType(label='LED_BY', description='', properties=[], additional_properties=True),
 RelationshipType(label='PARTICIPATED_IN', description='', properties=[], additional_properties=True),
 RelationshipType(label='ENGAGED_IN', description='', properties=[], additional_properties=True))

In [None]:
extracted_schema.patterns

(('Company', 'RECEIVED', 'FundingRound'),
 ('FundingRound', 'LED_BY', 'Company'),
 ('FundingRound', 'PARTICIPATED_IN', 'Company'),
 ('Company', 'ENGAGED_IN', 'Event'))
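
An extracted schema is worth reviewing before you reuse it. For instance, a node type that appears in no pattern can never be connected to anything. The sketch below checks for that in plain Python, using only the four labels visible in the (truncated) node type output above and the patterns as printed; `unused_node_labels` is a hypothetical helper.

```python
# Labels copied from the visible part of the node type output above;
# the real list may contain more entries because the output is truncated.
node_labels = ["Company", "Person", "FundingRound", "Event"]
patterns = [
    ("Company", "RECEIVED", "FundingRound"),
    ("FundingRound", "LED_BY", "Company"),
    ("FundingRound", "PARTICIPATED_IN", "Company"),
    ("Company", "ENGAGED_IN", "Event"),
]


def unused_node_labels(labels, patterns):
    """Return labels that are neither source nor target of any pattern."""
    used = {end for source, _, target in patterns for end in (source, target)}
    return [label for label in labels if label not in used]


orphans = unused_node_labels(node_labels, patterns)
print(orphans)  # ['Person']
```

Here `Person` is declared but never used in a pattern, so you might either add a pattern for it or drop the node type before passing the schema to the pipeline.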