## Set Up Neo4j Environment

Start by creating Neo4jGraph instance

In [1]:
import os
from dotenv import load_dotenv
from langchain_community.graphs import Neo4jGraph


# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
#graph = Neo4jGraph(refresh_schema=False)

## Chunk Text

In the microsoft paper, it is recommended to use smaller chunk sizes (~600) to extract more entities overall. The steps for this section are:
- Import PDFs for this example (lets use 2 papers)
- Convert PDF to markdown or txt files (use matextract workflow as basis)
- Chunk text from articles (will have to explore best way to do this)

Covert pdf to md using marker:

In [3]:
! marker_single data/pdf/10.1038s41586-019-1798-7.pdf data/md --langs English


Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32


Detecting bboxes:   0%|          | 0/2 [00:00<?, ?it/s]



Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype


Clean up markdown file, removing sections, etc. For this example, I'll do a more simple approach and use just one .md file. We'll take the text and save it as a single chunk.

To scale this workflow, we'd use multiple chunks (from multiple papers), instead of just one. We could use a vector database to store a lot of chunks, and retrieve the most relevant ones.

ChemNLP offers a more thorough workflow for cleaning up .md files generated from scholarly articles.

In [2]:
import re

def clean_text(text):
    # Delete the pattern [MISSING_PAGE_FAIL:x]
    cleaned_text = re.sub(r"\[MISSING_PAGE_FAIL:\d+\]", "", text)

    # Delete the acknowledgements section
    cleaned_text = re.sub(
        r"## Acknowledgements.*?(?=##|$)", "", cleaned_text, flags=re.S
    )

    # delete the references section 
    cleaned_text = re.sub(r"## *Notes And References.*", "", cleaned_text, flags=re.S)

    return cleaned_text



input_file = "data/md/10.1039c0dt00999g/10.1039c0dt00999g.md"

with open(input_file, "r", encoding="utf-8") as f:
    content = f.read()

# clean the text
text = clean_text(content)
print(text)

This article is published as part of the *Dalton Transactions* **themed issue entitled:** 

# New Talent Asia

Highlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*

![0_image_0.png](0_image_0.png)

Image reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: 
Negative thermal expansion emerging upon structural phase transition in ZrV2O7 and HfV2O7 Yasuhisa Yamamura, Aruto Horikoshi, Syuma Yasuzuka, Hideki Saitoh and Kazuya Saito Dalton Trans., 2011, DOI: 10

Document specific text chunking strategies will be used - ie. Langchains "MarkdownTextSplitter"

Similar to recursive text splitting, but a step up.

In [3]:
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 2000, chunk_overlap=0)

In [4]:
chunks = splitter.create_documents([text])

In [5]:
chunks[0]

Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: \nNegative thermal expansion emerging upon structural phase transition in ZrV2O7 and HfV2O7 Yasuhisa Yamamura, Aruto Horikoshi, Syuma Yasuzuka, Hideki Saitoh and Kazuya Sa

### Extracting Nodes and Relationships

Could use two approaches:
- Loop through each chunk with the LLM to extract the KG info (nodes,relationships, properties)
- Employ a vector databse and do similarity search based off a query, then only look through those chunks

I will begin with the first approach. If its not too time consuming/expensive, I think it would be best.

LLMGraphTransformer from langchain will be used to extract entities from unstructured text. Providing a predefined schema is optional. However, you cannot provide a description/more context related to your schema. 

I think it would be useful to provide some MOF-related context to ensure the LLM knows what it is looking for. Maybe modifying the system prompt in the Langchain code base could help. We could pass an optional "schema description" variable where we provide additional context. We just have to be careful about exceeding the context window of our LLM. If possible, maybe we could pass an optional description for each part of a schema, ie. you can choose to further describe a Node type or relationship, so the LLM better understands what to look for.

LLMGraph transformer has an optional "Prompt" input, maybe this can be used to provide additional context.

In [6]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(
  llm=llm, 
  #node_properties=["description"], I dont think we need node properties. And "Description" is probably too vague
  #relationship_properties=["description"]
)

In [7]:
from typing import List
from langchain_community.graphs.graph_document import GraphDocument
from langchain_core.documents import Document

def process_document(doc: Document) -> List[GraphDocument]:
    return llm_transformer.convert_to_graph_documents([doc])

In [8]:
# Process each chunk using the process_text function
graph_documents = []
for chunk in chunks:
    graph_document = process_document(chunk)
    graph_documents.extend(graph_document)

# Print or handle the processed graph_documents as needed
print(f"Processed {len(graph_documents)} GraphDocuments.")

KeyboardInterrupt: 

In [None]:
graph_documents

[GraphDocument(nodes=[Node(id='Dalton Transactions', type='Publication'), Node(id='New Talent Asia', type='Themed issue'), Node(id='Masahiro Yamashita', type='Person'), Node(id='Tohoku University', type='Institution'), Node(id='Yuan Han', type='Person'), Node(id='Han Vinh Huynh', type='Person'), Node(id='Zheng-Ming Hao', type='Person'), Node(id='Xian-Ming Zhang', type='Person'), Node(id='Yasuhisa Yamamura', type='Person'), Node(id='Aruto Horikoshi', type='Person'), Node(id='Syuma Yasuzuka', type='Person'), Node(id='Hideki Saitoh', type='Person'), Node(id='Kazuya Saito', type='Person'), Node(id='Zhihuan Weng', type='Person'), Node(id='Satoshi Muratsugu', type='Person'), Node(id='Nozomu Ishiguro', type='Person'), Node(id='Shin-Ichi Ohkoshi', type='Person'), Node(id='Mizuki Tada', type='Person')], relationships=[Relationship(source=Node(id='New Talent Asia', type='Themed issue'), target=Node(id='Dalton Transactions', type='Publication'), type='PART_OF'), Relationship(source=Node(id='Masah

Now, lets try providing a prompt into the LLM Graph Transformer.

- This prompt will overwrite the default prompt from langchain, therefore, it is probably not a great idea to do this unless we completely mimick the langchain prompt
- Instead, maybe altering the langchain LLMGraphTransformer class, and add a section for providing additional context in the prompt, is a better idea

In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = ChatPromptTemplate([
    ("system", "You are creating a knowledge graph from various literature articles. These articles are about Metal-Organic Frameworks (MOFs). "
    "Identify the nodes and relationships found in the article. "
    "For example, if the text states: 'The chemical formula of MOF-5 is [Zn₄O(BDC)₃]', then you should identify "
    "MOF-5 as a 'Metal-Organic Framework' node, Zn as a 'Metal' node, and '(BDC)₃' as a 'Linker' node. "
    "You should also identify the relationships between these nodes, such as: MOF-5 'Has_Linker' with '(BDC)₃'.")
])


In [9]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(
  llm=llm, 
  #prompt = template,
  allowed_nodes=["MOF", "Bond", "Atom", "Metal", "Linker"],
  allowed_relationships=["Has_Bond", "Has_Atom", "Has_Linker"],
  #node_properties=["description"], I dont think we need node properties. And "Description" is probably too vague
  #relationship_properties=["description"]
)

In [10]:
# Process each chunk using the process_text function
graph_documents = []
for chunk in chunks:
    graph_document = process_document(chunk)
    graph_documents.extend(graph_document)

# Print or handle the processed graph_documents as needed
print(f"Processed {len(graph_documents)} GraphDocuments.")

Processed 20 GraphDocuments.


In [11]:
graph_documents

[GraphDocument(nodes=[Node(id='Dalton Transactions', type='Mof'), Node(id='New Talent Asia', type='Mof'), Node(id='Tohoku University', type='Mof')], relationships=[Relationship(source=Node(id='Dalton Transactions', type='Mof'), target=Node(id='New Talent Asia', type='Mof'), type='HAS_BOND')], source=Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-cryst

### Changing Langchain's source code to allow for optional "additional context" prompt

In [5]:
context = ''' 
### **Node Types:**
- **MOF (Metal-Organic Framework):** Refers to compounds consisting of metal ions or clusters coordinated to organic ligands.
- **Bond:** Represents a connection between two atoms within a MOF.
- **Atom:** The basic unit of a chemical element in a MOF.
- **Metal:** A chemical element forming positive ions and involved in the MOF's structure.
- **Linker:** An organic molecule connecting metal ions or clusters in a MOF.

### **Relationship Types:**
- **Has_Bond:** Links an "Atom" to another "Atom" via a bond within a MOF.
- **Has_Atom:** Indicates that a "MOF" contains a specific "Atom".
- **Has_Linker:** Indicates that a "MOF" contains a specific "Linker".

### **Important Guidelines:**
1. **Scientific Context:** Interpret terms within the context of chemistry. For example, "Atom" should refer to elements within a MOF, not other uses of the word.
2. **Disambiguation:** If a term could be ambiguous, prefer the scientific interpretation. Ignore non-scientific entities or terms unless directly relevant to MOFs.
3. **Entity Consistency:** Ensure consistent naming for entities. For example, always use the full name of a MOF or a chemical element even if it appears in a shortened form in the text.
4. **Domain-Specific Instructions:** Use technical jargon or abbreviations only within the context of chemistry, and classify them correctly.
5. **Filtering Non-Relevant Content:** Ignore or deprioritize non-scientific text or sections irrelevant to the chemistry-specific nodes and relationships.
6. **Known Misinterpretations:** Do not classify journal names, like "Dalton Transactions," as a scientific entity like "MOF."

### **Examples of Correct Classifications:**
- "The MOF Zn-BTC has a bond between Zinc (Metal) and Benzene Tricarboxylate (Linker)."
  - `{"head": "Zn-BTC", "head_type": "MOF", "relation": "Has_Bond", "tail": "Zinc", "tail_type": "Metal"}`
  - `{"head": "Zn-BTC", "head_type": "MOF", "relation": "Has_Linker", "tail": "Benzene Tricarboxylate", "tail_type": "Linker"}`

### **Incorrect Classifications to Avoid:**
- "Dalton Transactions is a journal related to MOFs."
  - **Incorrect:** `{"head": "Dalton Transactions", "head_type": "MOF", "relation": "Has_Linker", "tail": "MOFs", "tail_type": "Linker"}` 
  - **Correct Approach:** Do not classify "Dalton Transactions" as a "MOF" or any other node type.
'''

In [6]:
from custom_llm import LLMGraphTransformer as CustomLLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
llm_transformer = CustomLLMGraphTransformer(
  llm=llm, 
  context = context,
  allowed_nodes=["MOF", "Bond", "Atom", "Metal", "Linker"],
  allowed_relationships=["Has_Bond", "Has_Atom", "Has_Linker"],
)

In [7]:
from typing import List
from langchain_community.graphs.graph_document import GraphDocument
from langchain_core.documents import Document
def process_document(doc: Document) -> List[GraphDocument]:
    return llm_transformer.convert_to_graph_documents([doc])

In [11]:
# Process each chunk using the process_text function
custom_graph_documents = []
for chunk in chunks:
    custom_graph_document = process_document(chunk)
    custom_graph_documents.extend(custom_graph_document)

# Print or handle the processed graph_documents as needed
print(f"Processed {len(custom_graph_documents)} GraphDocuments.")

Processed 20 GraphDocuments.


In [12]:
custom_graph_documents

[GraphDocument(nodes=[Node(id='Dalton Transactions', type='Mof')], relationships=[], source=Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: \nNegative thermal expansion emerging upon structural phase transition in ZrV2O

In [14]:
allowed_nodes=["MOF", "Bond", "Atom", "Metal", "Linker"],
allowed_relationships=["Has_Bond", "Has_Atom", "Has_Linker"],

from custom_llm import create_unstructured_prompt
prompt = create_unstructured_prompt(
    node_labels=allowed_nodes,
    rel_types=allowed_relationships,
    additional_context=context
)
print(prompt)



input_variables=['"head"', 'input'] messages=[SystemMessage(content='You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph. Your task is to identify the entities and relations requested with the user prompt from a given text. You must generate the output in a JSON format containing a list with JSON objects. Each object should have the keys: "head", "head_type", "relation", "tail", and "tail_type". The "head" key must contain the text of the extracted entity with one of the types from the provided list in the user prompt.\nThe "head_type" key must contain the type of the extracted head entity, which must be one of the types from ([\'MOF\', \'Bond\', \'Atom\', \'Metal\', \'Linker\'],).\nThe "relation" key must contain the type of relation between the "head" and the "tail", which must be one of the relations from ([\'Has_Bond\', \'Has_Atom\', \'Has_Linker\'],).\nThe "tail" key must represent the text of an extracted entity which 