# Knowledge Graph Creation by Entity Extraction in Memgraph
In this example, we summarize the book *The Catcher in the Rye*, identify key
entities using spaCy LLM and GPT-4, then generate and execute Cypher queries
in Memgraph to create a knowledge graph around the book's themes and characters.

Before we get started, make sure you have a Memgraph instance running in the background. If you want to quickly try out Memgraph Platform (Memgraph database + [MAGE library](https://memgraph.com/docs/advanced-algorithms/available-algorithms) + [Memgraph Lab](https://memgraph.com/docs/data-visualization)) for the first time, run one of the following commands with [Docker](https://docs.docker.com/engine/install/) running in the background:

For Linux/macOS:
`curl https://install.memgraph.com | sh`

For Windows:
`iwr https://windows.memgraph.com | iex`

## Entity extraction
The first step in the process is to extract entities from the summary using
spaCy’s large language model.
[spaCy](https://spacy.io/usage/large-language-models) is an advanced NLP
(natural language processing) library in Python, designed for tasks like entity
recognition, part-of-speech tagging, and dependency parsing. It’s widely used
for its speed and accuracy in processing text.

To start, we need to install spaCy and the specific model we’ll be using.

In [None]:
%pip install spacy
%pip install spacy_llm
!python -m spacy download en_core_web_md

In [None]:
%pip install openai neo4j

Next, set up your OpenAI API key.

In [29]:
import os
from wasabi import msg

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    msg.fail("OPENAI_API_KEY environment variable not set. Please set it to proceed.", exits=1)

Here’s the summary of *The Catcher in the Rye* that we'll use to create the
knowledge graph.

In [30]:
# Sample text summary for processing
summary = (
    "'The Catcher in the Rye' by J.D. Salinger follows Holden Caulfield, a troubled teenager who narrates his experiences over a few days after being expelled from his elite boarding school, Pencey Prep. Set in post-World War II New York City, the story revolves around Holden’s encounters with various characters, reflecting his disillusionment with the adult world and his search for identity and meaning. The novel begins with Holden being expelled due to poor academic performance, which sets the stage for his wandering through New York City. His isolation becomes a central theme, symbolizing his struggle with mental health and alienation. Throughout the book, Holden interacts with multiple characters, including teachers, former classmates, strangers, and his younger sister, Phoebe. Each interaction reveals his distrust of adults and his disdain for what he calls phoniness. He idolizes Phoebe as a symbol of innocence and sincerity, which stands in contrast to his views on the rest of society. Holden’s fixation on preserving innocence is symbolized by his dream of being the catcher in the rye, a protector who saves children from losing their innocence. Key symbols also include his red hunting hat, which represents Holden's uniqueness and desire for protection, and the Museum of Natural History, a place he values for its permanence in contrast to life’s constant change and unpredictability. Holden’s narrative reveals symptoms of depression and lingering trauma from the death of his younger brother, Allie, which complicates his ability to cope with the challenges of adulthood. His internal struggles suggest unresolved grief and a fear of growing up. The climax of the story occurs when Holden, overwhelmed, plans to run away but has a meaningful encounter with Phoebe that changes his mind. Her innocence and love provide him with a sense of purpose, grounding him and encouraging him to continue facing his reality. "
    "By the novel’s end, Holden reluctantly begins to accept life’s imperfections and complexities. The main characters include Holden Caulfield, who is marked by cynicism, vulnerability, and compassion; Phoebe Caulfield, his younger sister who represents innocence and serves as an emotional anchor for Holden; Mr. Antolini, a former teacher who offers him guidance and represents an adult Holden partially trusts; and Allie Caulfield, Holden’s deceased younger brother, whose memory profoundly impacts him. The novel is set primarily in New York City, with scenes at Pencey Prep and various urban locations, emphasizing Holden's sense of disorientation and social critique. Themes of alienation, innocence, identity, and the challenges of adolescence permeate the novel, creating a poignant exploration of a young person grappling with mental health and the transition to adulthood."
)

Extract entities from the text using spaCy.

In [3]:

import json
from collections import Counter
from pathlib import Path

import spacy
from spacy_llm.util import assemble

# load the spaCy model
nlp = spacy.load("en_core_web_md")

# split document into sentences
def split_document_sent(text):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents]

# define custom relationship extraction and text processing
def process_text(text, verbose=False):
    doc = nlp(text)
    if verbose:
        msg.text(f"Text: {doc.text}")
        msg.text(f"Entities: {[(ent.text, ent.label_) for ent in doc.ents]}")
        # Relations extraction logic can be added here
    return doc

# Pipeline to run entity extraction
def extract_entities(text, verbose=False):
    processed_data = []
    entity_counts = Counter()

    sentences = split_document_sent(text)
    for sent in sentences:
        doc = process_text(sent, verbose)
        entities = [(ent.text, ent.label_) for ent in doc.ents]

        # Store processed data for each sentence
        processed_data.append({'text': doc.text, 'entities': entities})

        # Update counters
        entity_counts.update([ent[1] for ent in entities])

    # Export to JSON
    with open('processed_data.json', 'w') as f:
        json.dump(processed_data, f)

    # Display summary
    msg.text(f"Entity counts: {entity_counts}")

# Run the pipeline on the summary text
verbose = True
extract_entities(summary, verbose)


Text: 'The Catcher in the Rye' by J.D. Salinger follows Holden Caulfield, a
troubled teenager who narrates his experiences over a few days after being
expelled from his elite boarding school, Pencey Prep.
Entities: [('J.D. Salinger', 'PERSON'), ('Holden Caulfield', 'PERSON'), ('a few
days', 'DATE'), ('Pencey', 'GPE')]
Text: Set in post-World War II New York City, the story revolves around Holden’s
encounters with various characters, reflecting his disillusionment with the
adult world and his search for identity and meaning.
Entities: [('post-World War II', 'EVENT'), ('New York City', 'GPE'), ('Holden',
'PERSON')]
Text: The novel begins with Holden being expelled due to poor academic
performance, which sets the stage for his wandering through New York City.
Entities: [('Holden', 'PERSON'), ('New York City', 'GPE')]
Text: His isolation becomes a central theme, symbolizing his struggle with
mental health and alienation.
Entities: []
Text: Throughout the book, Holden interacts with multipl
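
The per-sentence output above repeats entities such as `Holden` and `New York City`. As a minimal sketch, assuming the `processed_data.json` layout written by `extract_entities` (a list of records with `text` and `entities` keys), the hypothetical helper `unique_entities` below collapses those records into a de-duplicated list of `(text, label)` pairs in first-seen order. The helper name and the sample records are illustrative and not part of the original notebook.

```python
def unique_entities(processed_data):
    """Return (text, label) pairs in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for record in processed_data:
        for text, label in record.get("entities", []):
            key = (text, label)
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


# Illustrative records shaped like the entries in processed_data.json
# (JSON stores each entity tuple as a two-element list).
sample = [
    {"text": "first sentence", "entities": [["Holden", "PERSON"], ["New York City", "GPE"]]},
    {"text": "second sentence", "entities": [["Holden", "PERSON"]]},
]
print(unique_entities(sample))
```

This only removes exact duplicates. Aliases such as `Holden` and `Holden Caulfield` remain separate entries and would need additional handling.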

## Create node and relationship parameters

In [8]:
import json
import openai
from pathlib import Path

# Load processed data from JSON
json_path = Path("processed_data.json")
with open(json_path, "r") as f:
    processed_data = json.load(f)

# Prepare nodes and relationships
nodes = []
relationships = []

# Formulate a prompt for GPT-4
prompt = (
    "Extract entities and relationships from the following JSON data. For each entry in data['entities'], "
    "create a 'node' dictionary with fields 'id' (unique identifier), 'name' (entity text), and 'type' (entity label). "
    "For entities that have meaningful connections, define 'relationships' as dictionaries with 'source' (source node id), "
    "'target' (target node id), and 'relationship' (type of connection). Create max 30 nodes, format relationships in the format of capital letters and _ in between words and format the entire response in the JSON output containing only variables nodes and relationships without any text in between. "
    "JSON data:\n"
    f"{json.dumps(processed_data)}"
)

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that structures data into nodes and relationships."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=1000
)
output = response.choices[0].message.content

print(output)
structured_data = json.loads(output)  # Assuming GPT-4 outputs structured JSON

# Populate nodes and relationships lists
nodes.extend(structured_data.get("nodes", []))
relationships.extend(structured_data.get("relationships", []))


{
  "nodes": [
    {"id": 1, "name": "J.D. Salinger", "type": "PERSON"},
    {"id": 2, "name": "Holden Caulfield", "type": "PERSON"},
    {"id": 3, "name": "a few days", "type": "DATE"},
    {"id": 4, "name": "Pencey", "type": "GPE"},
    {"id": 5, "name": "post-World War II", "type": "EVENT"},
    {"id": 6, "name": "New York City", "type": "GPE"},
    {"id": 7, "name": "Holden", "type": "PERSON"},
    {"id": 8, "name": "Phoebe", "type": "PERSON"},
    {"id": 9, "name": "the Museum of Natural History", "type": "ORG"},
    {"id": 10, "name": "Allie", "type": "PERSON"},
    {"id": 11, "name": "Phoebe Caulfield", "type": "PERSON"},
    {"id": 12, "name": "Antolini", "type": "PERSON"},
    {"id": 13, "name": "Allie Caulfield", "type": "PERSON"},
    {"id": 14, "name": "Pencey Prep", "type": "ORG"}
  ],
  "relationships": [
    {"source": 1, "target": 2, "relationship": "AUTHORED_BY"},
    {"source": 2, "target": 3, "relationship": "NARRATION_DURATION"},
    {"source": 2, "target": 4, "rela
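
The `json.loads(output)` call in the cell above only succeeds when the model returns bare JSON. Model replies sometimes wrap the JSON in a Markdown code fence or add surrounding text, and a reply cut off by `max_tokens` is not valid JSON at all. The following is a minimal sketch of a more defensive parse, assuming the reply contains a single top-level JSON object. `parse_structured_reply` is a hypothetical helper and not part of the OpenAI SDK, and the sample reply is made up for illustration.

```python
import json


def parse_structured_reply(reply):
    """Extract the outermost JSON object from a model reply and check its shape."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in the reply")
    # json.loads raises json.JSONDecodeError if the object is truncated or malformed.
    data = json.loads(reply[start:end + 1])
    nodes = data.get("nodes", [])
    relationships = data.get("relationships", [])
    # Drop relationships that point at node ids the model never defined.
    known_ids = {node["id"] for node in nodes}
    relationships = [
        rel for rel in relationships
        if rel["source"] in known_ids and rel["target"] in known_ids
    ]
    return nodes, relationships


# Illustrative reply wrapped in a Markdown code fence.
fence = "`" * 3
reply = (
    fence + "json\n"
    '{"nodes": [{"id": 1, "name": "Holden", "type": "PERSON"}], '
    '"relationships": [{"source": 1, "target": 9, "relationship": "VISITS"}]}\n'
    + fence
)
parsed_nodes, parsed_relationships = parse_structured_reply(reply)
print(parsed_nodes)
print(parsed_relationships)
```

Filtering out relationships with unknown ids is a design choice for this sketch. Raising an error instead may be preferable when a dangling reference should stop the pipeline.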

## Generate queries

In [31]:
def generate_cypher_queries(nodes, relationships):
    queries = []

    # Create nodes
    for node in nodes:
        query = f"CREATE (n:{node['type']} {{id: '{node['id']}', name: '{node['name']}'}})"
        queries.append(query)

    # Create relationships
    for rel in relationships:
        query = f"MATCH (a {{id: '{rel['source']}'}}), (b {{id: '{rel['target']}'}}) " \
                f"CREATE (a)-[:{rel['relationship']}]->(b)"
        queries.append(query)

    return queries

cypher_queries = generate_cypher_queries(nodes, relationships)
print(cypher_queries)

["CREATE (n:PERSON {id: '1', name: 'J.D. Salinger'})", "CREATE (n:PERSON {id: '2', name: 'Holden Caulfield'})", "CREATE (n:DATE {id: '3', name: 'a few days'})", "CREATE (n:GPE {id: '4', name: 'Pencey'})", "CREATE (n:EVENT {id: '5', name: 'post-World War II'})", "CREATE (n:GPE {id: '6', name: 'New York City'})", "CREATE (n:PERSON {id: '7', name: 'Holden'})", "CREATE (n:PERSON {id: '8', name: 'Phoebe'})", "CREATE (n:ORG {id: '9', name: 'the Museum of Natural History'})", "CREATE (n:PERSON {id: '10', name: 'Allie'})", "CREATE (n:PERSON {id: '11', name: 'Phoebe Caulfield'})", "CREATE (n:PERSON {id: '12', name: 'Antolini'})", "CREATE (n:PERSON {id: '13', name: 'Allie Caulfield'})", "CREATE (n:ORG {id: '14', name: 'Pencey Prep'})", "MATCH (a {id: '1'}), (b {id: '2'}) CREATE (a)-[:AUTHORED_BY]->(b)", "MATCH (a {id: '2'}), (b {id: '3'}) CREATE (a)-[:NARRATION_DURATION]->(b)", "MATCH (a {id: '2'}), (b {id: '4'}) CREATE (a)-[:STUDENT_OF]->(b)", "MATCH (a {id: '2'}), (b {id: '6'}) CREATE (a)-[:LO
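
The queries above embed names directly in the Cypher text, so a name containing an apostrophe (for example `Holden's hat`) would produce an invalid query. A minimal sketch of an alternative, assuming the same `nodes` and `relationships` dictionaries: property values are passed as query parameters (`$id`, `$name`), which the `neo4j` driver accepts as the second argument to `session.run`. Labels and relationship types are still interpolated into the query text here, so the sketch validates them against a conservative pattern first. `generate_parameterized_queries`, `SAFE_NAME`, and the sample data are hypothetical names introduced for illustration.

```python
import re

# Conservative pattern for labels and relationship types: letters, digits, underscores.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def generate_parameterized_queries(nodes, relationships):
    """Return (query, parameters) pairs instead of fully interpolated strings."""
    queries = []
    for node in nodes:
        if not SAFE_NAME.fullmatch(node["type"]):
            raise ValueError(f"Unsafe label: {node['type']!r}")
        queries.append((
            f"CREATE (n:{node['type']} {{id: $id, name: $name}})",
            {"id": node["id"], "name": node["name"]},
        ))
    for rel in relationships:
        if not SAFE_NAME.fullmatch(rel["relationship"]):
            raise ValueError(f"Unsafe relationship type: {rel['relationship']!r}")
        queries.append((
            "MATCH (a {id: $source}), (b {id: $target}) "
            f"CREATE (a)-[:{rel['relationship']}]->(b)",
            {"source": rel["source"], "target": rel["target"]},
        ))
    return queries


# Illustrative data; the apostrophe in the first name is passed safely as a parameter.
sample_nodes = [
    {"id": 1, "name": "Holden's hat", "type": "OBJECT"},
    {"id": 2, "name": "Holden Caulfield", "type": "PERSON"},
]
sample_rels = [{"source": 2, "target": 1, "relationship": "OWNS"}]
for query, params in generate_parameterized_queries(sample_nodes, sample_rels):
    print(query, params)
```

Each pair could then be executed with `session.run(query, params)`. Note that this sketch stores `id` with its original type rather than as a quoted string, so nodes and relationships must be created by the same helper for the `MATCH` clauses to find them.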

## Execute queries

In [32]:
from neo4j import GraphDatabase

# Initialize the Neo4j driver for Memgraph (modify the URI if necessary)
uri = "bolt://localhost:7687"
user = ""
password = ""
driver = GraphDatabase.driver(uri, auth=(user, password))

# Function to execute Cypher queries in Memgraph
def execute_cypher_queries(queries):
    with driver.session() as session:
        session.run("MATCH (n) DETACH DELETE n;")
        for query in queries:
            try:
                session.run(query)
                msg.good(f"Executed query: {query}")
            except Exception as e:
                msg.fail(f"Error executing query: {query}. Error: {e}")

# Execute the generated Cypher queries
execute_cypher_queries(cypher_queries)



✔ Executed query: CREATE (n:PERSON {id: '1', name: 'J.D.
Salinger'})
✔ Executed query: CREATE (n:PERSON {id: '2', name: 'Holden
Caulfield'})
✔ Executed query: CREATE (n:DATE {id: '3', name: 'a few days'})
✔ Executed query: CREATE (n:GPE {id: '4', name: 'Pencey'})
✔ Executed query: CREATE (n:EVENT {id: '5', name: 'post-World War
II'})
✔ Executed query: CREATE (n:GPE {id: '6', name: 'New York City'})
✔ Executed query: CREATE (n:PERSON {id: '7', name: 'Holden'})
✔ Executed query: CREATE (n:PERSON {id: '8', name: 'Phoebe'})
✔ Executed query: CREATE (n:ORG {id: '9', name: 'the Museum of Natural
History'})
✔ Executed query: CREATE (n:PERSON {id: '10', name: 'Allie'})
✔ Executed query: CREATE (n:PERSON {id: '11', name: 'Phoebe
Caulfield'})
✔ Executed query: CREATE (n:PERSON {id: '12', name: 'Antolini'})
✔ Executed query: CREATE (n
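
Once the queries have run, it can help to read the graph back and confirm that the relationships were actually created, because a `MATCH` that finds no nodes turns the following `CREATE` into a silent no-op. The following is a minimal sketch, assuming a session object with a `run(query)` method whose result can be iterated as records that support key lookup, as sessions from the `neo4j` driver do. `list_edges`, `EDGE_QUERY`, and the `FakeSession` stand-in are hypothetical and exist only so the example runs without a database.

```python
EDGE_QUERY = (
    "MATCH (a)-[r]->(b) "
    "RETURN a.name AS source, type(r) AS relationship, b.name AS target"
)


def list_edges(session):
    """Return (source, relationship, target) triples for every edge in the graph."""
    return [
        (record["source"], record["relationship"], record["target"])
        for record in session.run(EDGE_QUERY)
    ]


class FakeSession:
    """Stand-in for a driver session; returns canned rows instead of querying."""

    def run(self, query):
        return [
            {"source": "Holden Caulfield", "relationship": "STUDENT_OF", "target": "Pencey"},
        ]


print(list_edges(FakeSession()))
```

With the driver created in the cell above, the equivalent call would be `with driver.session() as session: print(list_edges(session))`.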