# Creating a knowledge graph from email

We need a knowledge graph to support RAG because querying email directly from context runs into context-length limits very quickly. The knowledge graph is also useful in its own right as a search source for understanding a user's interactions with their contacts.

The following assumes the system has run and pre-populated a Weaviate instance at 127.0.0.1:8080 with some corpus of emails, and that the same emails are available on the `emails` topic of a Kafka broker at 127.0.0.1:9092.

Now retrieve a list of all items in the emails topic:

In [1]:
import time
import kafka
import json
import uuid

# A fresh group_id combined with auto_offset_reset='earliest' re-reads the topic from the start on every run
consumer = kafka.KafkaConsumer(bootstrap_servers='127.0.0.1:9092',
                               auto_offset_reset='earliest',
                               group_id=str(uuid.uuid4()),
                               value_deserializer=lambda v: json.loads(v.decode('utf-8')))
consumer.subscribe(topics=['emails'])
ts = time.monotonic()
print("Reading from " + str(ts))
emails = []
while True:
    batches = consumer.poll(timeout_ms=1000)
    nts = time.monotonic() - ts
    # poll() returns a dict mapping each partition to its list of records
    for partition, records in batches.items():
        for record in records:
            email = record.value
            print(str(nts) + " " + str(email))
            emails.append(email)
    # Stop reading after roughly 10 seconds
    if nts > 10:
        break
consumer.commit()
consumer.close()

print("Emails received: " + str(len(emails)))

Reading from 462791.728905879
3.347494204994291 {'email_id': '18ecf76f3f70e3ab', 'history_id': '8035529', 'thread_id': '18ecf76f3f70e3ab', 'labels': ['UNREAD', 'CATEGORY_UPDATES', 'INBOX'], 'date': '2024-04-11 17:21:02', 'to': [{'name': None, 'email': 'keith@madsync.com'}], 'cc': [], 'bcc': [], 'subject': 'Latest News from My Portfolios', 'from': {'email': 'fool@motley.fool.com', 'name': 'The Motley Fool'}, 'body': "Daily Update        The Motley Fool\n( https://clicks.fool.com/f/a/7xAY6aiL4d6rsD2ZWf8-WA~~/AAQRxQA~/RgRn-vVeP0QkaHR0cDovL3d3dy5mb29sLmNvbT9saWQ9M2g5d214NWZzazMxVwNzcGNCCmYQXnAYZn0vJP1SEWtlaXRoQG1hZHN5bmMuY29tWAQAAAH5 )\n\nDaily Update     Stocks I'm Following\n\nRecommendations | 0\n\n\n\n    No recommendations were\nissued on the stocks in your portfolios   No recommendations were\nissued on the stocks in your portfolios\n         Recent Recs from your\nServices   Closing Price     \nTSLA (\nhttps://clicks.fool.com/f/a/vix8zWbc7yni4z3lK2fp0A~~/AAQRxQA~/RgRn-vVeP0RTaHR0cHM
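
The loop above always reads for about 10 seconds. As a minimal sketch of an alternative, the helper below also stops once several polls in a row come back empty. The name `drain` and its parameters are hypothetical, not part of the project; it takes any zero-argument `poll` callable that returns a dict of record lists, which is the shape `consumer.poll()` returns.

```python
import time
from typing import Any, Callable, Dict, List


def drain(poll: Callable[[], Dict[Any, List[Any]]],
          max_seconds: float = 10.0,
          max_idle_polls: int = 3) -> List[Any]:
    """Collect records until the deadline passes or several polls in a row are empty."""
    deadline = time.monotonic() + max_seconds
    idle = 0
    records: List[Any] = []
    while time.monotonic() < deadline and idle < max_idle_polls:
        batches = poll()
        if not batches:
            # Nothing arrived during this poll; count it towards the idle limit
            idle += 1
            continue
        idle = 0
        for batch in batches.values():
            records.extend(batch)
    return records
```

With the consumer from the cell above, this could be called as `emails = [r.value for r in drain(lambda: consumer.poll(timeout_ms=1000))]`.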

We will now use an LLM and a specially constructed prompt to extract nodes from emails. First create the LLM:

In [2]:
import library.llm_api as llm   
import library.weaviate as we

m = llm.LLM_API('localhost', '4891', we.Weaviate())




Creating new schema Email with vectorizer None
Connecting to 127.0.0.1:8080
Schema already exists
Creating new schema EmailText with vectorizer vectorizer=<Vectorizers.TEXT2VEC_TRANSFORMERS: 'text2vec-transformers'> poolingStrategy='masked_mean' vectorizeClassName=True inferenceUrl=None passageInferenceUrl=None queryInferenceUrl=None
Connecting to 127.0.0.1:8080
Schema already exists




We need to construct a prompt template based on the items we wish to extract. The reference for this prompt and the approach is here: https://bratanic-tomaz.medium.com/constructing-knowledge-graphs-from-text-using-openai-functions-096a6d010c17.

This prompt template relies on two lists that restrict the node types it can create and the relationships between them: `allowed_nodes` and `allowed_rels`. In the cell below the flat lists are commented out, and each entry in `allowed_nodes` carries its own `allowedRelationships` instead.

In [34]:
# allowed_nodes = ['Person','Location','Position', 'Team','Institution','Activity','Event','Request','Document','Initiative']
# allowed_rels = ['participatesIn','hasAppliedTo','hasLocation','hasInstitution','hasActivity','hasEvent','hasRequest','hasDocument','hasInitiative']


allowed_nodes = [
    {
        "label": "Person",
        "description": "A human or other entity that performs actions",
        "allowedRelationships": [
            {"relatesWith": "Location", "type": "isLocatedAt"},
            {"relatesWith": "Event", "type": "participatesIn"},
            {"relatesWith": "Communication", "type": "makesRequests"},
            {"relatesWith": "Communication", "type": "respondsToRequests"},
            {"relatesWith": "Document", "type": "writes"},
            {"relatesWith": "Document", "type": "edits"},
            {"relatesWith": "Document", "type": "commentsOn"},
            {"relatesWith": "Initiative", "type": "participatesIn"},
            {"relatesWith": "Institution", "type": "belongsTo"},
            {"relatesWith": "Position", "type": "holds"},
            {"relatesWith": "Position", "type": "appliedFor"}

        ]
    },
    {
        "label": "Position",
        "description": "A job or role",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "isFilledBy"},
            {"relatesWith": "Institution", "type": "isOfferedBy"}
        ]
    },
    {
        "label": "Location",
        "description": "A geographical locale",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "locates"},
            {"relatesWith": "Institution", "type": "locates"}
        ]
    },
    {
        "label": "Institution",
        "description": "An organization or company",
        "allowedRelationships": [
            {"relatesWith": "Location", "type": "isLocatedAt"},
            {"relatesWith": "Person", "type": "employs"},
            {"relatesWith": "Person", "type": "affiliatesWith"},
            {"relatesWith": "Person", "type": "hires"},
            {"relatesWith": "Event", "type": "holds"},
            {"relatesWith": "Document", "type": "houses"},
            {"relatesWith": "Initiative", "type": "sponsors"},
            {"relatesWith": "Position", "type": "hiresFor"}
        ]
    },
    {
        "label": "Event",
        "description": "An occurrence with a specific date and time",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "hasParticipant"},
            {"relatesWith": "Document", "type": "isAssociatedWith"},
            {"relatesWith": "Event", "type": "recursWith"},
            {"relatesWith": "Initiative", "type": "supports"}
        ]
    },
    # {
    #     "node": "Communication",
    #     "allowedRelationships": [
    #         {"relatesWith": "Person", "type": "makesRequests"},
    #         {"relatesWith": "Document", "type": "associatesWith"},
    #         {"relatesWith": "Event", "type": "recursesWith"}
    #     ]
    # },
    {
        "label": "Initiative",
        "description": "A project or program",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "hasParticipant"},
            {"relatesWith": "Person", "type": "isLedBy"},
            {"relatesWith": "Event", "type": "isDiscussedAt"},
            {"relatesWith": "Institution", "type": "isSponsoredBy"},
            {"relatesWith": "Document", "type": "isSupportedBy"}
        ]
    },
    {
        "label": "Document",
        "description": "A piece of writing or other content",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "isWrittenBy"},
            {"relatesWith": "Event", "type": "isAssociatedWith"},
            {"relatesWith": "Initiative", "type": "supports"}
        ]
    }
]
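
Because relationships are declared per node, it is easy for a `relatesWith` target to point at a label that has no entry of its own. The following is a small sketch of a consistency check; `undefined_targets` is a hypothetical helper and `sample_nodes` is an abridged copy of the schema above, used only for illustration.

```python
from typing import Dict, List, Set


def undefined_targets(nodes: List[dict]) -> Dict[str, Set[str]]:
    """Map each node label to the relatesWith targets that have no definition of their own."""
    defined = {n["label"] for n in nodes if "label" in n}
    missing: Dict[str, Set[str]] = {}
    for n in nodes:
        targets = {r["relatesWith"] for r in n.get("allowedRelationships", [])}
        unknown = targets - defined
        if unknown:
            missing[n.get("label", "<unlabelled>")] = unknown
    return missing


# Abridged copy of the schema, for illustration only
sample_nodes = [
    {
        "label": "Person",
        "description": "A human or other entity that performs actions",
        "allowedRelationships": [
            {"relatesWith": "Position", "type": "appliedFor"},
            {"relatesWith": "Communication", "type": "makesRequests"},
        ],
    },
    {
        "label": "Position",
        "description": "A job or role",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "isFilledBy"},
        ],
    },
]

print(undefined_targets(sample_nodes))  # {'Person': {'Communication'}}
```

Run against the full `allowed_nodes`, the same check should flag `Communication`, which `Person` refers to while the `Communication` entry itself is commented out.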

The prompt itself is written in Markdown, as that seems to improve LLM performance (ref), and it is very tightly constrained:

In [35]:
prompt = f"""# Knowledge Graph Instructions
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
{'- **Allowed Node Labels And the Allowed Relationships Between Them:**' + str(allowed_nodes) if allowed_nodes else ""}
## 3. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`. Use plain English for property values and node names. Every node must have a 'name' property.
- **Data Preferences**: If there exists an appropriate node type for a value, prefer creating a node for it to creating a property.
- **Explainability**: Each node must have a property named 'context' that includes the sentence or phrase from which the node was extracted.
## 4. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"), 
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.  
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. Names of all nodes
must be expressed in normal English with no modification from what is provided in context.
## 5. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination. The nodes and relationships extracted should be in the format of a Cypher query."""
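
The prompt above embeds the schema with `str(allowed_nodes)`, which puts a Python list literal in the middle of a Markdown prompt. As a sketch of one alternative, the hypothetical helper `schema_to_markdown` below renders the same structure as nested bullets. Whether this changes extraction quality has not been tested here.

```python
from typing import List


def schema_to_markdown(nodes: List[dict]) -> str:
    """Render node labels and their allowed relationships as nested Markdown bullets."""
    lines = []
    for node in nodes:
        label = node["label"]
        description = node.get("description")
        lines.append(f"- **{label}**: {description}" if description else f"- **{label}**")
        for rel in node.get("allowedRelationships", []):
            # One bullet per allowed relationship, written in Cypher-like arrow notation
            lines.append(f"  - ({label})-[:{rel['type']}]->({rel['relatesWith']})")
    return "\n".join(lines)


# Single-entry example taken from the schema above
example = [
    {
        "label": "Position",
        "description": "A job or role",
        "allowedRelationships": [
            {"relatesWith": "Person", "type": "isFilledBy"},
            {"relatesWith": "Institution", "type": "isOfferedBy"},
        ],
    }
]
print(schema_to_markdown(example))
```

The result could replace `str(allowed_nodes)` in the f-string if the bullet form turns out to work better for the model in use.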

Now we combine the prompt with a sample email and print the full prompt that will be sent to the model:

In [36]:
sample_email = """Hi Keith,Thank you for your interest in the Engineering Manager, LLM Scaling position at Anthropic. We appreciate you taking the time to submit an application and are delighted that you would consider joining our team!
Please note that with the high volume of strong candidates expressing interest in our roles, our team may need some additional time to thoroughly review your application. We may not reach out unless we think you are a strong fit for the role you applied to. In the meantime, we truly appreciate your patience throughout our hiring process, and we will do our best to provide updates where possible.
We are grateful for your interest, and for the time and effort that you have invested in our process so far.
Best regards,The Anthropic Team"""

full_prompt = f"""{prompt}
Use the given format to extract information from the following input: {sample_email}

The extracted information is:"""

print(full_prompt)
print()


# Knowledge Graph Instructions
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
- **Allowed Node Labels And the Allowed Relationships Between Them:**[{'label': 'Person', 'description': 'A human or other entity that performs actions', 'allowedRelationships': [{'relatesWith': 'Location', 'type': 'isLocatedAt'}, {'relatesWith': 'Event'
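
To apply the same template to every item in `emails`, the assembly step can be wrapped in a function. This is a minimal sketch: `build_prompt` and `max_body_chars` are hypothetical names, the 4000-character default is an arbitrary assumption, and character count is only a crude stand-in for the model's token limit.

```python
def build_prompt(instructions: str, email_body: str, max_body_chars: int = 4000) -> str:
    """Combine the instruction prompt with one email body, truncating very long bodies."""
    body = email_body.strip()
    if len(body) > max_body_chars:
        # Mark the cut so the model is not misled by an abruptly ending message
        body = body[:max_body_chars] + " [truncated]"
    return (
        f"{instructions}\n"
        f"Use the given format to extract information from the following input: {body}\n"
        "\n"
        "The extracted information is:"
    )


print(build_prompt("# Knowledge Graph Instructions", "Hi Keith, thank you for your interest."))
```

In the notebook this would be called as `build_prompt(prompt, e['body'])` for each `e` in `emails`.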

The expected outcome from the email sample is something like:

```cypher
CREATE (e1:Position {name:"Engineering Manager, LLM Scaling", context:"Thank you for your interest in the Engineering Manager, LLM Scaling position"})
CREATE (i1:Institution {name: "Anthropic", context: "position at Anthropic"})-[:hiresFor]->(e1)
CREATE (p1:Person {name:"Keith", context:"Hi Keith,"})-[:appliedFor]->(e1)
CREATE (e1)-[:isOfferedBy]->(i1)
```

In [37]:
response = m.model.generate(full_prompt, max_tokens=1000, temperature=0.0)
print(str(response))


```cypher
CREATE (p1:Person {name:"Keith", context:"Hi Keith,"})-[:hasInterestIn]->(e1:EngineeringManagerLLMScalingPosition {name:"Engineering Manager, LLM Scaling position at Anthropic"})
CREATE (p2:Person {name:"Anthropic", context:"We appreciate you taking the time to submit an application and are delighted that you would consider joining our team!"})-[:employs]->(e1)
```
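
The response above uses a node label and a relationship type that are not in the schema. A rough way to detect this automatically is sketched below. It relies on simple regular expressions rather than a real Cypher parser, the helper `schema_violations` is hypothetical, and the two sets are abridged from `allowed_nodes`. It only checks names globally, so it will not notice an allowed relationship attached to the wrong pair of node types (such as `employs` starting from a `Person` here).

```python
import re
from typing import List, Set, Tuple

# "(var:Label" and "[var:TYPE" patterns; the variable name is optional
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def schema_violations(cypher: str, labels: Set[str], rel_types: Set[str]) -> Tuple[List[str], List[str]]:
    """Return the node labels and relationship types in `cypher` that the schema does not allow."""
    bad_labels = sorted({l for l in NODE_LABEL.findall(cypher) if l not in labels})
    bad_rels = sorted({r for r in REL_TYPE.findall(cypher) if r not in rel_types})
    return bad_labels, bad_rels


# Abridged from allowed_nodes, for illustration only
labels = {"Person", "Position", "Location", "Institution", "Event", "Initiative", "Document"}
rel_types = {"appliedFor", "holds", "employs", "hiresFor", "isOfferedBy", "isFilledBy"}

# The model response shown above
response_text = '''CREATE (p1:Person {name:"Keith", context:"Hi Keith,"})-[:hasInterestIn]->(e1:EngineeringManagerLLMScalingPosition {name:"Engineering Manager, LLM Scaling position at Anthropic"})
CREATE (p2:Person {name:"Anthropic", context:"We appreciate you taking the time to submit an application and are delighted that you would consider joining our team!"})-[:employs]->(e1)'''

print(schema_violations(response_text, labels, rel_types))
# (['EngineeringManagerLLMScalingPosition'], ['hasInterestIn'])
```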


In [24]:
# count = 0
# for e in emails:
#     count += 1
#     print(str(count) + ": " + str(e['body']))
#     print('---------------------------------------------------------------------------------------')

1: Daily Update        The Motley Fool
( https://clicks.fool.com/f/a/7xAY6aiL4d6rsD2ZWf8-WA~~/AAQRxQA~/RgRn-vVeP0QkaHR0cDovL3d3dy5mb29sLmNvbT9saWQ9M2g5d214NWZzazMxVwNzcGNCCmYQXnAYZn0vJP1SEWtlaXRoQG1hZHN5bmMuY29tWAQAAAH5 )

Daily Update     Stocks I'm Following

Recommendations | 0



    No recommendations were
issued on the stocks in your portfolios   No recommendations were
issued on the stocks in your portfolios
         Recent Recs from your
Services   Closing Price     
TSLA (
https://clicks.fool.com/f/a/vix8zWbc7yni4z3lK2fp0A~~/AAQRxQA~/RgRn-vVeP0RTaHR0cHM6Ly93d3cuZm9vbC5jb20vMTgvY292ZXJhZ2UvdXBkYXRlcy8yMDI0LzA0LzExL25vLTUtdGVzbGEtYXByaWwtMjAyNC1yYW5raW5ncy9XA3NwY0IKZhBecBhmfS8k_VIRa2VpdGhAbWFkc3luYy5jb21YBAAAAfk~ )
4/11/2024 |

Stock Advisor (
https://clicks.fool.com/f/a/vix8zWbc7yni4z3lK2fp0A~~/AAQRxQA~/RgRn-vVeP0RTaHR0cHM6Ly93d3cuZm9vbC5jb20vMTgvY292ZXJhZ2UvdXBkYXRlcy8yMDI0LzA0LzExL25vLTUtdGVzbGEtYXByaWwtMjAyNC1yYW5raW5ncy9XA3NwY0IKZhBecBhmfS8k_VIRa2VpdGhAbWFkc3luYy5jb21YBAAAA