# Create a knowledge graph from a PDF with an LLM
![title](neo4jdogs.png)

This notebook creates a knowledge graph with the help of OpenAI GPT-4. The graph is stored in a pickle file, which you can load and use in the following notebooks:
- [Knowledge graph with Neo4J (Cypher)](./graph-neo4j.ipynb)
- [Knowledge graph with Azure CosmosDB (Gremlin)](./graph-cosmosdb.ipynb)

In [None]:
%pip install -r requirements.txt

# Load env variables and connect to Azure OpenAI
 

In [1]:
import os
from dotenv import load_dotenv
from langchain.schema import Document
from langchain_openai import AzureChatOpenAI
from langchain.schema import OutputParserException
load_dotenv()

llm = AzureChatOpenAI(
    model=os.getenv("OPENAI_DEPLOYMENT_NAME"), 
    temperature=0, 
    max_tokens=4000,
    verbose=True)
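
A quick check before constructing the client can make a misconfigured `.env` file easier to diagnose. The sketch below is only an illustration: the names `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` and `OPENAI_API_VERSION` are the ones `langchain_openai` conventionally reads, but which variables this notebook's `.env` actually defines is an assumption, and `missing_env_vars` is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names; adjust to whatever your .env file actually defines.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_VERSION",
    "OPENAI_DEPLOYMENT_NAME",
)


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment as a parameter keeps the helper easy to try out with a plain dictionary.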

In [2]:
# Simplified Pydantic model of the graph.
# The LangChain KnowledgeGraph model is too complicated to use as an OpenAI functions schema.


from typing import List, Dict, Optional, Union
from langchain.pydantic_v1 import Field, BaseModel

class Property(BaseModel):
    """A single property consisting of key and value"""
    key: str = Field(..., description="key")
    value: str = Field(..., description="value")

class Node(BaseModel):
    "Represents a node in a graph with associated properties"
    id: Union[str, int]
    type: Optional[str] = "Node"
    properties: Optional[List[Property]] = Field(
        None, description="List of node properties")

class Relationship(BaseModel):
    "Represents a directed relationship between two nodes in a graph."
    source: Union[str, int] = Field(..., description="Id of source node")
    target: Union[str, int] = Field(..., description="Id of target node")
    type: Optional[str] =  Field(..., description="Type of relationship")
    properties: Optional[List[Property]] = Field(
        None, description="List of relationship properties"
    )
    
class KnowledgeGraph(BaseModel):
    """Knowledge graph consisting of nodes and relationships"""
    nodes: List[Node] = Field(
        ..., description="List of nodes in the knowledge graph")
    rels: List[Relationship] = Field(
        ..., description="List of relationships in the knowledge graph"
    )
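
The models above represent properties as a list of key/value pairs rather than a free-form dictionary. The sketch below uses plain dictionaries to show what a payload in this shape might look like and how the property list can be flattened back into a dictionary. The sample breed data and the helper `props_to_dict` are illustrative assumptions, not part of the notebook.

```python
from typing import Any, Dict, List, Optional

# Illustrative payload in the shape of the KnowledgeGraph model (made-up sample data).
sample_graph: Dict[str, Any] = {
    "nodes": [
        {"id": "Border Collie", "type": "Breed",
         "properties": [{"key": "energyLevel", "value": "high"}]},
        {"id": "Herding", "type": "BreedingGroup", "properties": None},
    ],
    "rels": [
        {"source": "Border Collie", "target": "Herding",
         "type": "BELONGS_TO", "properties": None},
    ],
}


def props_to_dict(properties: Optional[List[Dict[str, str]]]) -> Dict[str, str]:
    """Flatten a list of {"key": ..., "value": ...} pairs into one dictionary."""
    return {p["key"]: p["value"] for p in (properties or [])}


for node in sample_graph["nodes"]:
    print(node["id"], props_to_dict(node["properties"]))
```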

# Magic
The prompt, the function call and the chain.

In [3]:

from langchain.prompts import ChatPromptTemplate
from langchain_core.prompts.chat import MessagesPlaceholder
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.tracers import ConsoleCallbackHandler
from langchain.tools import tool

# The prompt is from langchain examples

system_prompt = """
# Knowledge Graph Instructions for GPT-4
## 1. Overview
You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.
- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.
## 2. Labeling Nodes
- **Consistency**: Ensure you use basic or elementary types for node labels.
  - For example, when you identify an entity representing a person, always label it as **"person"**. Avoid using more specific terms like "mathematician" or "scientist".
- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
## 3. Handling Numerical Data and Dates
- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.
- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.
- **Property Format**: Properties must be in a key-value format.
- **Quotation Marks**: Never use escaped single or double quotes within property values.
- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.
## 4. Coreference Resolution
- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),
always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
## 5. Strict Compliance
Adhere to the rules strictly. Non-compliance will result in termination.
*Double check* that the JSON structure is correct.

"""

# Store function call results to this list
function_responses = []

# Define the function call as a LangChain tool (it is converted to an OpenAI function). The function schema is defined with Pydantic.
@tool
def knowledge_graph(object: KnowledgeGraph) -> bool:
    """A tool to convert text to a knowledge graph"""
    function_responses.append(object)
    return True  # Return only a small value to the LLM so that the context does not grow.


# Added some more precise instructions to have more control over the output. 
# Likely you would have existing schemas or terminology that you would like to reuse.

prompt = ChatPromptTemplate.from_messages([
                ("system",system_prompt),
                MessagesPlaceholder("chat_history", optional=True),
                ("human", 
                 """In this particular case we are interested in dogs. We want to extract information about dog breeds and their characteristics.
                    Characteristics should be nodes and relationships should be between dog breeds and their characteristics. 
                    Ignore other entities than dogs, like people and addresses.
                    - **Allowed Node Labels:** Breed, BreedingGroup, Characteristic
                    {input}"""),
                MessagesPlaceholder("agent_scratchpad"),
            ])

function_agent = create_openai_functions_agent(llm, [knowledge_graph], prompt)
chain = AgentExecutor(agent=function_agent, tools=[knowledge_graph], verbose=True, callbacks=[ConsoleCallbackHandler()])
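
The tool above records each function call in the module-level list `function_responses`. One possible alternative, sketched here without LangChain, is to keep that list inside a closure so that every run gets its own collector. `make_collector` is a hypothetical helper, and wiring the returned function into `@tool` is left out.

```python
from typing import Any, Callable, List, Tuple


def make_collector() -> Tuple[Callable[[Any], bool], List[Any]]:
    """Create a recorder function together with the list it appends to."""
    collected: List[Any] = []

    def record(obj: Any) -> bool:
        # Store the payload and return a small value, mirroring the tool above.
        collected.append(obj)
        return True

    return record, collected


record, collected = make_collector()
record({"nodes": [], "rels": []})
print(len(collected))  # 1
```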


In [4]:
from model import map_to_base_node, map_to_base_relationship
from langchain_community.graphs.graph_document import GraphDocument

# Convert our simplified graph to a LangChain graph document


def extract_and_store_graph(data: KnowledgeGraph, document: Document) -> Optional[GraphDocument]:
    """Map the simplified graph to a GraphDocument; return None if no nodes were mapped."""
    nodes = []
    rels = []
    try:
        nodes = list(map(map_to_base_node, data.nodes))
        rels = map_to_base_relationship(data.rels, nodes)
    except Exception as e:
        print("parsing exception")
        print(e)

    if len(nodes) == 0:
        return None

    return GraphDocument(
        nodes=nodes,
        relationships=rels,
        source=document
    )
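
The helpers `map_to_base_node` and `map_to_base_relationship` live in `model.py`, which is not shown here. Because `map_to_base_relationship` receives the mapped nodes, it presumably resolves relationship endpoints against them. The sketch below illustrates that general idea with plain dictionaries. It is an assumption about the approach, not the actual implementation, and `resolve_relationships` is a hypothetical name.

```python
from typing import Any, Dict, List


def resolve_relationships(rels: List[Dict[str, Any]],
                          nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach node dictionaries to each relationship, skipping dangling ones."""
    # Compare ids as strings, since the model allows both str and int ids.
    by_id = {str(node["id"]): node for node in nodes}
    resolved = []
    for rel in rels:
        source = by_id.get(str(rel["source"]))
        target = by_id.get(str(rel["target"]))
        if source is None or target is None:
            # The relationship references a node id that was never declared.
            continue
        resolved.append({"source": source, "target": target, "type": rel["type"]})
    return resolved
```

Dropping relationships with unknown endpoints is one possible policy; creating placeholder nodes would be another.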

        

# Download test documents
Get some info about different dog breeds

In [5]:
from tqdm import tqdm
import urllib.request

local_folder = "./data/"
os.makedirs(local_folder,exist_ok=True)

doc_names = []

documents = [
"https://www.marinhumane.org/wp-content/uploads/2017/06/Dog-Breed-Characteristics-Behavior.pdf" 
]
for doc in tqdm(documents):
    print("Downloading", doc)
    doc_names.append(doc.split("/")[-1])
    if os.path.isfile(local_folder + doc.split("/")[-1]):
        continue
    urllib.request.urlretrieve(doc, local_folder + doc.split("/")[-1])
    

100%|██████████| 1/1 [00:00<00:00, 1674.37it/s]

Downloading https://www.marinhumane.org/wp-content/uploads/2017/06/Dog-Breed-Characteristics-Behavior.pdf
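
The download cell derives the local file name by splitting the URL on `/`. A slightly more defensive variant, sketched below, uses `urllib.parse` and `pathlib` so that query strings or fragments do not end up in the file name. `local_path_for` is a hypothetical helper that the notebook does not use.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_path_for(url: str, folder: str = "./data/") -> Path:
    """Build the local file path for a URL from the last segment of its path."""
    # urlparse keeps the query string and fragment out of the path component.
    name = Path(unquote(urlparse(url).path)).name
    if not name:
        raise ValueError(f"Cannot derive a file name from {url!r}")
    return Path(folder) / name


url = "https://www.marinhumane.org/wp-content/uploads/2017/06/Dog-Breed-Characteristics-Behavior.pdf"
print(local_path_for(url))
```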





## PDF to Txt
Read and chunk the docs

In [6]:
import time

from pdf import parse_pdf

# Currently this utility only uses pypdf. For more serious stuff you should use Azure Document Intelligence or similar service.

docs_pages_map = dict()
for doc in doc_names:
    print("Processing ",doc)
    start_time = time.time()
    
    doc_map = parse_pdf(file=local_folder+doc)
    docs_pages_map[doc]= doc_map
    
    # Capture the end time and Calculate the elapsed time
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Processed {len(doc_map)} pages in {elapsed_time:.6f} seconds\n")
    
print(docs_pages_map)

Processing  Dog-Breed-Characteristics-Behavior.pdf
Processed 7 pages in 0.240982 seconds

{'Dog-Breed-Characteristics-Behavior.pdf': [(0, 0, ' \n  \n Behavior & Training  \n 415.506.6 280 \n Available B&T Services  \n \n \n171 Bel Marin Keys Blvd., Novato, CA  94949    Dog Breed Characteristics & Behavior  \nLike us at :   Page 1 of 7 \nDog Breed Characteristics & Behavior  \n \nWhy is it important to know about the characteristics and behavior of different breeds?  \nAll dogs are individuals and have their own personalities. At the same time, different breeds tend to also \nhave certain characteristics that help define that particular breed. This information can be helpful to you \nwhen you are choosing a  dog or trying to understand his  behavior.  \n \nThe AKC (American Kennel Club) places dog breeds within seven different groups. In order to ac count for \nthe different behaviors within a particular group, some groups can be further subdivided into families.  \n \nHerding group:  \
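
As the output above shows, the extracted page text contains runs of spaces and stray line breaks left over from the PDF layout. If that noise turns out to matter, a light clean-up before sending the text to the LLM could look like the following sketch. `normalize_whitespace` is a hypothetical helper and is not part of the `pdf` module used here.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs, and drop lines that are empty after trimming."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = " \n  \n Behavior & Training  \n 415.506.6 280 \n Available B&T Services  \n"
print(normalize_whitespace(raw))
```

This only tidies whitespace; it does not repair words or numbers that the PDF extraction split apart.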

## Load documents and execute LLM
Parse the docs with LLM to extract the graph.
This will take some time. Later we store the graph in a pickle file, so that you don't need to repeat this step every time.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

full_doc = ""
graph_docs = []


for doc_name,doc_map in docs_pages_map.items():    
    for page in tqdm(doc_map):
        try:
            text = page[2].strip()
            # This call fills the global list function_responses as a side effect (not optimal).
            data = chain.invoke(
                {
                    "input": text,
                }
            )      
            for graph_doc in function_responses:
                tmp = extract_and_store_graph(graph_doc, Document(page_content=text))
                if tmp:                                    
                    graph_docs.append(tmp)
                
            function_responses.clear()
        except OutputParserException as e:
            print("output exception")
            print(e)
     

  0%|          | 0/7 [00:00<?, ?it/s]

[32;1m[1;3m[chain/start][0m [1m[1:chain:AgentExecutor] Entering Chain run with input:
[0m{
  "input": "Behavior & Training  \n 415.506.6 280 \n Available B&T Services  \n \n \n171 Bel Marin Keys Blvd., Novato, CA  94949    Dog Breed Characteristics & Behavior  \nLike us at :   Page 1 of 7 \nDog Breed Characteristics & Behavior  \n \nWhy is it important to know about the characteristics and behavior of different breeds?  \nAll dogs are individuals and have their own personalities. At the same time, different breeds tend to also \nhave certain characteristics that help define that particular breed. This information can be helpful to you \nwhen you are choosing a  dog or trying to understand his  behavior.  \n \nThe AKC (American Kennel Club) places dog breeds within seven different groups. In order to ac count for \nthe different behaviors within a particular group, some groups can be further subdivided into families.  \n \nHerding group:  \nBreeds in this group were bred to herd sh

  0%|          | 0/7 [00:05<?, ?it/s]

[31;1m[1;3m[chain/error][0m [1m[1:chain:AgentExecutor] [5.65s] Chain run errored with error:
[0m"KeyboardInterrupt()Traceback (most recent call last):\n\n\n  File \"/home/pj/dev/cosmosdb-llm-knowledge-graph/.venv/lib/python3.11/site-packages/langchain/chains/base.py\", line 156, in invoke\n    self._call(inputs, run_manager=run_manager)\n\n\n  File \"/home/pj/dev/cosmosdb-llm-knowledge-graph/.venv/lib/python3.11/site-packages/langchain/agents/agent.py\", line 1391, in _call\n    next_step_output = self._take_next_step(\n                       ^^^^^^^^^^^^^^^^^^^^^\n\n\n  File \"/home/pj/dev/cosmosdb-llm-knowledge-graph/.venv/lib/python3.11/site-packages/langchain/agents/agent.py\", line 1097, in _take_next_step\n    [\n\n\n  File \"/home/pj/dev/cosmosdb-llm-knowledge-graph/.venv/lib/python3.11/site-packages/langchain/agents/agent.py\", line 1097, in <listcomp>\n    [\n\n\n  File \"/home/pj/dev/cosmosdb-llm-knowledge-graph/.venv/lib/python3.11/site-packages/langchain/agents/agent.p




KeyboardInterrupt: 
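
The loop above only catches `OutputParserException`, so any other error on a single page stops the whole run. A possible pattern, sketched here with a stand-in `extract` callable instead of the real chain, is to collect results and failures per page so that a long run can be inspected afterwards. `process_pages` and `extract` are hypothetical names, and the `(page_number, _, text)` tuple layout is inferred from the `parse_pdf` output shown earlier.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def process_pages(pages: Sequence[Tuple[int, int, str]],
                  extract: Callable[[str], Any]) -> Tuple[List[Any], Dict[int, str]]:
    """Run extract on each page's text; return results and errors keyed by page number."""
    results: List[Any] = []
    failures: Dict[int, str] = {}
    for page_number, _unused, text in pages:
        try:
            results.append(extract(text.strip()))
        except Exception as exc:  # KeyboardInterrupt is not an Exception and still stops the run
            failures[page_number] = repr(exc)
    return results, failures
```

Pages listed in `failures` can then be retried on their own instead of rerunning the whole document.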

In [None]:
# Let's pickle the graph so we don't have to redo this all the time
import pickle


with open('./data/graph_docs.pkl','wb') as f:
    pickle.dump(graph_docs, f)
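
The follow-up notebooks need to read this file back. The sketch below shows a save and load round trip with stand-in data in a temporary folder; `save_pickle` and `load_pickle` are hypothetical helpers. Loading the real `graph_docs.pkl` additionally requires the classes of the pickled objects to be importable, and pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_pickle(obj: Any, path: Path) -> None:
    """Serialize obj to path, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_pickle(path: Path) -> Any:
    """Load a pickled object; only use this on files you created yourself."""
    with path.open("rb") as f:
        return pickle.load(f)


# Round trip with stand-in data in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "graph_docs.pkl"
    save_pickle([{"nodes": ["Breed"]}], target)
    restored = load_pickle(target)
print(restored)
```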