In [None]:
%pip install langchain langchain-community neo4j openai wikipedia tiktoken langchain_openai pdfplumber python-dotenv

Here, we have overridden the properties field to be a list of Property objects instead of a dictionary to overcome the limitations of the API. Because you can only pass a single object to the API, we combine the nodes and relationships in a single class called KnowledgeGraph.
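As a rough illustration of that structure, here is a minimal sketch using standard-library dataclasses. This is an assumption made for the sake of a self-contained example: the actual pipeline presumably defines these classes with pydantic on top of LangChain's graph types, and the `props_to_dict` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Property:
    """A single key-value pair; a list of these replaces a free-form dict."""
    key: str
    value: str


@dataclass
class Node:
    id: str
    type: str
    properties: Optional[List[Property]] = None


@dataclass
class Relationship:
    source: Node
    target: Node
    type: str
    properties: Optional[List[Property]] = None


@dataclass
class KnowledgeGraph:
    """Single top-level object that carries both nodes and relationships."""
    nodes: List[Node] = field(default_factory=list)
    rels: List[Relationship] = field(default_factory=list)


def props_to_dict(props: Optional[List[Property]]) -> dict:
    """Hypothetical helper: turn the list-of-Property form back into a dict."""
    return {p.key: p.value for p in props} if props else {}
```

Keeping properties as a flat list of key-value objects gives the API a fixed schema to fill in, and the list can be converted back to a dictionary before the data is written to the graph.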

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
from langchain.schema import (
   AIMessage,
   HumanMessage,
   SystemMessage
)
from langchain_openai import ChatOpenAI
# The model name must be passed as a keyword argument
llm = ChatOpenAI(model="gpt-4o")

input = f"""This directive defines the roles and responsibilities for managing and overseeing NASA’s nuclear
flight safety activities. It provides the requirements to implement NASA’s policy to protect the
public, NASA workforce, high-value equipment and property, and the environment from potential
harm as a result of NASA activities and operations, by factoring safety as an integral feature of
programs, projects, technologies, operations, and facilities.
b. This directive also describes NASA’s implementation of Federal requirements under National
Security Presidential Memorandum (NSPM)-20, “Presidential Memorandum on Launch of
Spacecraft Containing Space Nuclear Systems,” dated August 20, 2019, radiological contingency
planning (RCP) as a part of broader NASA emergency management activities (see NPD 8710.1 and
NPR 8715.2) and other factors, as well as agency-specific activities relating to ensuring safety and
mission success for NASA-sponsored payloads containing space nuclear systems (SNS) or other
radioactive material (note that these terms are defined in Appendix A).
c. This directive establishes a framework where other requirements, guidance, and processes (e.g.,
Department of Energy (DOE) nuclear safety and security requirements, U.S. Air and Space Force
range safety requirements, NASA payload safety processes) relevant to nuclear flight safety can be
implemented in to the overall Safety and Mission Assurance (SMA) process."""

prompt = f"""# Knowledge Graph Instructions for GPT-4
Step 1:
Split each sentence from the text into a set of entailed clauses that are maximally shortened. Format the clauses into RDF triples that have only two commas and show them only. No explanation needed. 

For instance, the below sentence:
This directive defines the roles and responsibilities for managing and overseeing NASA’s nuclear flight safety activities. Lions, zebras, and whales are animals.

Should be split like so:
This directive, defines, the roles and responsibilities
The roles and responsibilities, are for, managing and overseeing NASA’s nuclear flight safety activities
Lions, are, animals
zebras, are, animals
whales, are, animals

Step 2: 
Treat the triples as an A-box ontology and generate a corresponding OWL2-DL T-box ontology in turtle format. Derive general names for classes of subjects and objects (avoid using 
individual names from the triples). However, use predicate names as property names without change. Make sure all classes are used and are related as either domains or ranges of object properties.

Step 3: 
Parse the triples from step 1 into a readable A-Box ontology in turtle format using the terms of the above T-box. Group the triples by subject. Use words from the text 
directly as individual names."""


messages = [
   SystemMessage(content=input),
   HumanMessage(content=prompt)
]
response = llm.invoke(messages)
# The chat model returns an AIMessage; the generated text is in .content
print("LLM Response: \n" + response.content)




Besides the general instructions, I have also added the option to limit which node or relationship types should be extracted from text. You'll see through examples why this might come in handy. We have the Neo4j connection and LLM prompt ready, which means we can define the information extraction pipeline as a single function.
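To make the optional restriction concrete, the sketch below shows one way such a prompt could be assembled. The function name `build_extraction_prompt` and the prompt wording are assumptions for illustration only; they are not the prompt used by `get_extraction_chain`.

```python
from typing import List, Optional


def build_extraction_prompt(
    nodes: Optional[List[str]] = None,
    rels: Optional[List[str]] = None,
) -> str:
    """Assemble a system prompt, adding type restrictions only when given."""
    lines = [
        "Extract entities and relationships from the text as a knowledge graph.",
    ]
    if nodes:
        # Restrict the node labels the LLM may emit
        lines.append("Allowed node labels: " + ", ".join(nodes))
    if rels:
        # Restrict the relationship types the LLM may emit
        lines.append("Allowed relationship types: " + ", ".join(rels))
    return "\n".join(lines)
```

When both arguments are left as `None`, the prompt carries no restrictions and the LLM is free to choose its own labels.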

In [None]:
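from typing import List, Optional
from langchain_core.documents import Document
from langchain_community.graphs.graph_document import GraphDocument

# get_extraction_chain, map_to_base_node, map_to_base_relationship, and graph
# are assumed to be defined in earlier cells.
def extract_and_store_graph(
    document: Document,
    nodes: Optional[List[str]] = None,
    rels: Optional[List[str]] = None) -> None:
    # Extract graph data using OpenAI functions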
    extract_chain = get_extraction_chain(nodes, rels)
    data = extract_chain.invoke(document.page_content)['function']
    print(data)
    # Construct a graph document
    graph_document = GraphDocument(
      nodes = [map_to_base_node(node) for node in data.nodes],
      relationships = [map_to_base_relationship(rel) for rel in data.rels],
      source = document
    )
    # Store information into a graph
    graph.add_graph_documents([graph_document])

The function takes a LangChain document as well as optional nodes and rels parameters, which limit the types of objects we want the LLM to identify and extract. The Neo4j graph object provides an add_graph_documents method, added a month or so ago, which we use here to import the graph directly.

# Evaluation
We will extract information from a PDF document and construct a knowledge graph to test the pipeline. Here, we convert the PDF to HTML with pdfplumber and use the text chunking module provided by LangChain.

You might have noticed that we use a relatively large chunk_size value. The reason is that we want to provide as much context as possible around each sentence so that coreference resolution works as well as possible. Remember, the coreference step only works if the entity and its reference appear in the same chunk; otherwise, the LLM doesn't have enough information to link the two.
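The effect of chunk size on coreference can be sketched with a simplified splitter. This is an illustration under stated assumptions, not the TokenTextSplitter implementation: it works on an already tokenized list, and `chunk_tokens` and `same_chunk` are hypothetical helper names.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


def same_chunk(chunks: List[List[str]], entity: str, reference: str) -> bool:
    """True if some chunk contains both the entity and its reference."""
    return any(entity in chunk and reference in chunk for chunk in chunks)
```

With a small window, an entity near the start of the text and a pronoun a few sentences later can land in different chunks, so `same_chunk` returns `False`. A larger window keeps them together, which is the motivation for the large chunk_size used here.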

Now we can go ahead and run the documents through the information extraction pipeline.

In [None]:
from langchain.text_splitter import TokenTextSplitter
from langchain_core.documents.base import Document
import pdfplumber
from html import escape

def pdf_to_html(pdf_path, html_path):
    with pdfplumber.open(pdf_path) as pdf:
        html = '<html><body>'
        
        for page in pdf.pages:
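            # Extract text; fall back to an empty string if the page has none
            page_text = page.extract_text() or ''
            # Escape HTML special characters so the text cannot break the markup
            html += f'<p>{escape(page_text)}</p>'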
            
            # Extract tables
            for table in page.extract_tables():
                html += '<table>'
                for row in table:
                    html += '<tr>'
                    for cell in row:
                        # Escape HTML special characters in the cell text; empty cells are None
                        html += f'<td>{escape(cell)}</td>' if cell is not None else '<td></td>'
                    html += '</tr>'
                html += '</table>'
        html += '</body></html>'
    
    # Write HTML to file
    with open(html_path, 'w', encoding='utf-8') as html_file:
        html_file.write(html)
    return html

text_content = pdf_to_html('big sample.pdf', 'output.html')

# Create a Document object
document = Document(page_content=text_content)

# Define chunking strategy
text_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)

# Split the document into chunks
documents = text_splitter.split_documents([document])

In [None]:
import os
from openai import OpenAI
from tqdm import tqdm


'''client = OpenAI(
      api_key = os.environ.get("")
   )

with open("output.html", "r") as file:
   context = file.read()
   
messages =[
         {"role": "system", "content": context},
         {"role": "user", "content": "Generate JSON scheme"},
      ]
   
# Make API request        
completion = client.chat.completions.create(
   model="gpt-4-turbo-preview",
   messages=messages,
)

output = completion.choices[0].message.content
print(output)'''

In [None]:
from tqdm import tqdm
graph.query("MATCH (n) DETACH DELETE n")
for i, d in tqdm(enumerate(documents), total=len(documents)):
    extract_and_store_graph(d)

The process takes around 5 minutes, which is relatively slow. In production, you would probably want to issue the API calls in parallel to improve throughput. Let's first look at the types of nodes and relationships the LLM identified.
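One way to run the calls in parallel is a thread pool, since each call mostly waits on the network. The sketch below is a generic helper under stated assumptions: `process_in_parallel` is a hypothetical name, and whether the graph object and the API rate limits tolerate concurrent calls would need to be checked before using it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def process_in_parallel(
    items: Iterable[T],
    worker: Callable[[T], None],
    max_workers: int = 4,
) -> List[Tuple[T, Exception]]:
    """Run worker on every item concurrently and collect failures."""
    failures: List[Tuple[T, Exception]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the item it is processing
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                # Record the failure and keep processing the other items
                failures.append((futures[future], exc))
    return failures


# Possible usage with the pipeline from this notebook (not executed here):
# failures = process_in_parallel(documents, extract_and_store_graph)
```

Collecting failures instead of raising on the first error lets the remaining chunks finish, and the failed ones can be retried afterwards.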

In [None]:
# Delete the graph
#graph.query("MATCH (n) DETACH DELETE n")

In [None]:
# Specify which node labels should be extracted by the LLM
#allowed_nodes = ["Person", "Company", "Location", "Event", "Movie", "Service", "Award"]

#for i, d in tqdm(enumerate(documents), total=len(documents)):
#    extract_and_store_graph(d, allowed_nodes)

# RAG Application
The last thing we will do is show you how you can browse information in a knowledge graph by constructing Cypher statements. Cypher is a structured query language used to work with graph databases, similar to how SQL is used for relational databases. LangChain has a GraphCypherQAChain that reads the schema of the graph and constructs appropriate Cypher statements based on the user input.

In [None]:
# Query the knowledge graph in a RAG application
from langchain.chains import GraphCypherQAChain

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    graph=graph,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-4"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    validate_cypher=True, # Validate relationship directions
    verbose=True
)

In [None]:
query = "what is the content of Chief, Safety And Mission Assurance?"
cypher_chain.invoke({"query": query})
